Apache Hadoop Training for Developers_Day#1
April 19, 2017 | Author: Manish Jain
Introduction
Chapter Topics
› About this course
› Course Logistics
Course Objective
Refer to the course objectives and content PDF.
Introductions
› About your instructor
› About you
› Experience with Hadoop?
› Experience as a developer?
› Expectations from the course?
Module 1
Introduction to BIG Data and Its Need
› Lesson 1: Introduction to BIG Data
› Lesson 2: Big Data Analytics and why it's a need now
› Lesson 3: Real Time Case Studies
› Lesson 4: Traditional Vs. Big Data Approach
› Lesson 5: Technologies within the Big Data Eco System
Module Objectives
At the end of this module, you will learn:
› Introduction to BIG Data
› A few examples of BIG Data
› Big Data real-time case studies
› Why Big Data is a buzz and why it's a need now
› Big Data Analytics
› Comparison between the traditional and Big Data approaches
› Technologies within the Big Data Eco System
Lesson 1
Introduction to BIG Data
At the end of this lesson, you will learn:
› What is Big Data?
› The 3 V's of BIG Data
› A few examples of Big Data
› Why Big Data is a buzz!
Introduction
What is BIG Data? When you hear the term BIG Data, what is your first instinctive thought?
Volume, right? Massive, huge, enormous quantities of digital stuff. But it's not just the Volume that makes BIG Data difficult to manage and analyze; it's also the Variety and the Velocity!
Big Data: Insight
› Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
› Big Data is equipped to handle the day-to-day data explosion.
› Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
› Big data is difficult to work with using most relational database management systems.
What Do We Mean by Big Data?
BIG Data has three defining attributes, the 3 Vs: Data Volume, Data Variety & Data Velocity. Together, the 3 Vs constitute a comprehensive definition of BIG Data.
3 Vs of BIG Data

Volume (terabytes, records, transactions):
› Using millions of transactions & events to analyze trends and perform forecasts
› Turning 12 terabytes of Tweets created each day into improved product sentiment analysis
› Converting 350 billion annual meter readings to better predict power consumption

Velocity (batch load, near-time data, real-time data):
› Using fast-paced real-time transactions for predictive analysis
› Scrutinizing 5 million trade events created each day to identify potential fraud
› Analyzing 500 million daily call detail records in real time to predict customer churn faster

Variety (structured, unstructured, semi-structured):
› Collectively analyzing all forms of data (text, sensor data, audio, video, click streams, log files) gives new insights
› Monitoring hundreds of live video feeds from surveillance cameras to target points of interest
› Exploiting 80% data growth in images, video and documents to improve customer satisfaction
Understanding BIG Data - Summary
› "Data that's an order of magnitude greater than you are accustomed to" - Gartner analyst Doug Laney
› "BIG Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools" - Wikipedia
› The 3 Vs: Volume, Velocity & Variety
Few Examples of BIG Data
› Facebook handles 40 billion photos from its user base and has more than 901 million active users generating social interaction data.
› RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems.
› Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
› More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones.
› 340 million tweets are sent per day; that's nearly 4,000 tweets per second.
› A Boeing jet engine produces terabytes of operational information every 30 minutes it runs. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic crossing; multiply that by the more than 25,000 flights flown each day.
› 10,000 payment card transactions are made every second around the world.
Why BIG Data is a BUZZ!
› A BIG Data platform can be used to analyze semi-structured & unstructured data along with raw structured data.
› BIG Data solutions are ideal for iterative & exploratory analysis, when business measures cannot be pre-determined using a structured data set.
› Big Data supports predictive analytics and provides predictive enterprise solutions using all forms of real transactions, in contrast to traditional DWBI.
So what does BIG Data mean to a business?
› Profile customers & gain customer trust
› Determine pricing strategies
› Identify competitive advantages
› Better target advertising
› Strengthen customer service
A few case studies for BIG Data would be:
› Performing IT log analytics
› Identifying fraud detection patterns
› Sentiment analytics using social media feeds
› Executing usage analytics in the energy sector
› Analyzing competitor market penetration
Summary
In this chapter you have learned:
› What is Big Data?
› The 3 V's of BIG Data
› A few examples of Big Data
› Why Big Data is a buzz!
Lesson 2
BIG Data Analytics & Why it's a Need Now?
At the end of this lesson, you will learn:
› What is Big Data Analytics? Its advantages and challenges
› Why it has become a need now
› Big Data as a complete solution
› Big Data Analytics implementation
What is BIG Data Analytics?
› Big data analytics is the process of examining large amounts of data of a variety of types (structured, semi-structured or unstructured) to uncover hidden patterns, unknown correlations and other useful information.
› The primary goal of big data analytics is to help companies make better business decisions by analyzing data sources that may be left untapped by conventional business intelligence (BI) programs.
› Underlying data may include web server logs, Internet clickstream data, social media activity reports, mobile-phone call detail records, information captured by sensors, IT logs, etc.
Advantages:
• Making sense out of unstructured data
• Optimized usage of organizational data
• Value added to existing BI solutions
• More accurate BI results
• Best bet to make better business decisions
Challenges:
• Lack of skill set
• High initial cost involvement
• Challenges in integrating BIG Data
• Little awareness of the technologies
• Unavailability of a matured BIG Data toolset
Why BIG Data Analytics is a 'need' now?
Information is at the center of a new wave of opportunity:
› 44x as much data & content in the coming decade
› 35 zettabytes of data by 2020
› 80% of the world's available data is unstructured or semi-structured
... and organizations need deeper insights:
› 1 in 3 business leaders frequently makes business decisions based on information they do not trust or do not have
› 1 in 2 business leaders say they do not have access to the relevant information they require to do their job
› 83% of CIOs cited "BI" as part of their visionary plans to enhance competitiveness
› 60% of CEOs need to do a better job of capturing & understanding information rapidly in order to make swift business decisions
The BIG Data platform helps you combine varied data forms for making decisions.
Why BIG Data Analytics is a 'need' now?
What do people think about your company or product? Who are the biggest influencers and what are they saying?
Channels: call centre, web, social network.
The BIG Data platform provides multi-channel customer sentiment analytics.
Why BIG Data Analytics is a 'need' now?
The future continues to bring new data sources with high data volume.
New information sources:
› 2 billion Internet users as of now
› 4.6 billion mobile phones worldwide
› Facebook produces 10 TB of data every day
› Twitter produces 7 TB of data every day
› Enormous satellite data growth
› New media channels emerging every day
Traditional sources: steady growth of traditional data.
Digitization drives exponential data growth. The BIG Data platform ensures consolidation of ever-growing, varied data sets.
Why BIG Data Analytics is a 'need' now?
Imagine if we could...
› ... predict infections in premature newborns 24 hours earlier? (Physician)
› ... apply social relationships of customers to prevent churn? (Call centre rep)
› ... adjust credit lines as transactions are occurring, to account for risk fluctuations? (Loan officer)
› ... determine whom to offer discounts to at the time of sale, instead of offering them to all? (Sales associate)
The BIG Data platform can be used across industries for making analytic decisions.
BIG Data: The Solution
The solution: bring together any data source, at any velocity, to generate insight.
• Analyze a variety of data at enormous volume
• Gain insight on streaming data
• Analyze large volumes of structured data
The BIG Data platform (Velocity, Variety, Volume) enables:
› Multi-channel customer sentiment analytics
› Predicting weather patterns to optimize capital expenditure
› Making risk decisions based on real-time transactional data
› Identifying criminals & threats from disparate audio/video
› Finding life-threatening conditions in time to intervene
Implementing BIG Data Analytics - Different Approaches

Interactive Exploration
› Use case: for data analysts & data scientists who want to discover real-time patterns as they emerge from their BIG Data content
› Latency: low
› Platform: HBase, NoSQL, Analytic DBMS
› Connectivity: native

Operational Reporting
› Use case: for executives & operational managers who want summarized, prebuilt, periodic reports on their BIG Data content
› Latency: medium
› Platform: Hive, NoSQL, Analytic DBMS
› Connectivity: native, SQL

Indirect Batch Analysis
› Use case: for data analysts & operational managers who want to analyze data trends based on predefined questions in their BIG Data content
› Latency: high
› Platform: Hadoop, NoSQL, Analytic DBMS
› Connectivity: ETL
BIG Data Platform Connectivity Architecture
› Multidimensional analysis through an in-memory engine, connecting natively from the BI platform to BIG Data
› Reports & dashboards through an OLAP engine, connecting natively from the BI platform to BIG Data
› Multidimensional analysis from the BI platform directly over SQL against BIG Data
› A data mart loaded from BIG Data via ETL
Summary
In this chapter you have learned:
› What is Big Data Analytics? Its advantages and challenges
› Why it has become a need now
› Big Data as a complete solution
› Big Data Analytics implementation
Lesson 3
Traditional Analytics Vs. Big Data Analytics
At the end of this lesson, you will learn:
› The traditional approach
› The BIG Data approach
› Traditional vs. Big Data approach
› How BIG Data complements the traditional enterprise data warehouse
› Traditional analytics vs. Big Data analytics
The Traditional Approach: Business Requirements Drive Solution Design
1. Business defines requirements: what questions should we ask?
2. IT designs a solution with a set structure & functionality
3. Business executes queries to answer those questions over and over
4. New requirements require redesign & rebuild
Well suited to:
• High-value, structured data
• Repeated operations & processes
• Relatively stable sources
• Well-understood requirements
Stretched by:
• Highly valuable data and content
• Exploratory analysis
• Volatile sources
• Changing requirements
The BIG Data Approach: Information Sources Drive Creative Discovery
1. Business & IT identify available information sources
2. IT delivers a platform that enables creative exploration of all available data & content
3. Business determines what questions to ask by exploring data & relationships
4. New insights drive integration into traditional technology
Can be implemented for:
• Structured or unstructured data
• Exploratory operations & processes
• Relatively unstable sources
• Unknown business requirements
Traditional and BIG Data Approaches
Traditional approach:
› Structured & repeatable analysis
› Business users determine what questions to ask; IT structures the data to answer those questions
› Examples: monthly sales reports, profitability analysis, customer surveys
BIG Data approach:
› Iterative & exploratory analysis
› IT delivers a platform to enable creative discovery; business explores what questions could be asked
› Examples: brand sentiment, product strategy, maximizing utilization
BIG Data Complements the Traditional Enterprise Data Warehouse
BIG Data shouldn't be a silo; it must be an integrated part of your enterprise information architecture:
› The data warehouse, fed by traditional sources
› The BIG Data platform, fed by new sources
› Enterprise integration connecting the two
Traditional DW Analytics Platform v/s BIG Data Analytics Platform
› Gigabytes to terabytes vs. petabytes to exabytes
› Centralized data structure vs. distributed data structure
› Structured vs. semi-structured & unstructured
› Relational data model vs. flat schemas
› Batch-oriented data load process vs. near-real-time analysis of the data
› Analytics based on historical trends vs. analytics based on real-time data
› Data generated using conventional methods (data entry) vs. data generated using unconventional methods such as RFID and sensor networks
Summary
In this chapter you have learned:
› The traditional approach
› The BIG Data approach
› Traditional vs. Big Data approach
› How Big Data complements the traditional enterprise data warehouse
› Traditional analytics vs. Big Data analytics
Lesson 4
Real Time Case Studies
At the end of this lesson, you will learn:
› Big Data Analytics: use cases
› Big Data to predict your customers' behaviors
› When to consider a Big Data solution
› Big Data real-time case studies
BIG Data Analytics - Use Cases
• Web/E-Commerce/Internet: integrated website analytics
• Retail: competitive pricing, customer segmentation, predictive buying behavior, market campaign management
• Government: defense intelligence analysis, threat analytics
• Telecommunications: customer segmentation, service response optimization, customer experience analytics
• Financial Services: fraud detection analytics, risk modeling & analysis
• Manufacturing: inventory optimization
• Energy & Utilities: customer experience analytics, service quality optimization
• Healthcare & Pharmaceutical: drug discovery
• Insurance
• Media & Content: customer satisfaction analytics, dispatch optimization
Big Data to Predict Your Customers' Behaviors
› Retailers like Wal-Mart and Kohl's are making use of sales, pricing, and economic data, combined with demographic and weather data, to fine-tune merchandising store by store and anticipate the appropriate timing of store sales.
› Online dating services like eHarmony and Match.com are constantly observing activity on their sites to optimize the matching algorithms that predict who will hit it off with whom.
› Google search queries on flu symptoms and treatments reveal weeks in advance what flu-related volumes hospital emergency departments can expect.
› BIG Data provides the capacity to predict the future before your rivals can, whether they're companies or criminals. The NYPD is currently using a Big Data platform to fight crime in Manhattan.
When to Consider a Big Data Solution
› Big Data solutions are ideal for analyzing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources.
› Big Data solutions are ideal when all, or most, of the data needs to be analyzed (versus a sample), or when sampling isn't nearly as effective as analyzing the larger data set.
› Big Data solutions are ideal for iterative and exploratory analysis when business measures on the data are not predetermined.
Big Data Real Time Case Study
› TXU Energy - Smart Electric Meters:
Because of smart meters, electricity providers can read the meter once every 15 minutes rather than once a month. This not only eliminates the need to send someone out for meter reading; because the meter is read every fifteen minutes, electricity can also be priced differently for peak and off-peak hours. Pricing can then be used to shape the demand curve during peak hours, eliminating the need to build additional generating capacity just to meet peak demand and saving electricity providers millions of dollars in generating capacity and plant maintenance costs.
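The arithmetic above can be sketched with a few lines of Python. The 15-minute reading interval comes from the case study; the tariff rates and the 5pm-9pm peak window are hypothetical numbers chosen only to illustrate peak/off-peak pricing:

```python
# Illustrative only: reading volume and peak/off-peak billing for smart meters.
READS_PER_HOUR = 4                      # one reading every 15 minutes
readings_per_month = READS_PER_HOUR * 24 * 30
print(readings_per_month)               # 2880 readings vs. 1 manual read per month

# Hypothetical tariff: discourage demand during a 5pm-9pm peak window.
def bill(readings, peak_rate=0.20, off_peak_rate=0.08):
    """readings: list of (hour_of_day, kwh) tuples; returns total cost."""
    total = 0.0
    for hour, kwh in readings:
        rate = peak_rate if 17 <= hour < 21 else off_peak_rate
        total += kwh * rate
    return round(total, 2)

# Same energy used, very different cost depending on the hour:
print(bill([(18, 2.0), (3, 2.0)]))      # 0.56
```

With monthly meter reads, none of this per-hour differentiation would be possible; the 2,880 interval readings per meter per month are exactly the kind of volume a Big Data platform is built to absorb.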
Big Data Real Time Case Study ….(Cont’d)
› T-Mobile USA:
T-Mobile USA has integrated Big Data across multiple IT systems to combine customer transaction and interactions data in order to better predict customer defections. By leveraging social media data (Big Data) along with transaction data from CRM and Billing systems, T-Mobile USA has been able to “cut customer defections in half in a single quarter”.
Big Data Real Time Case Study ….(Cont’d)
› US Xpress:
US Xpress, a provider of a wide variety of transportation solutions, collects about a thousand data elements, ranging from fuel usage to tire condition to truck engine operations to GPS information, and uses this data for optimal fleet management and to drive productivity, saving millions of dollars in operating costs.
Big Data Real Time Case Study ….(Cont’d)
› McLaren's Formula One racing team:
McLaren's Formula One racing team uses real-time car sensor data during races, identifies issues with its racing cars using predictive analytics and takes corrective action proactively, before it's too late!
Summary
In this chapter you have learned:
› Big Data Analytics: use cases
› Big Data to predict your customers' behaviors
› When to consider a Big Data solution
› Big Data real-time case studies such as TXU smart meters, T-Mobile, US Xpress and McLaren's Formula One racing team
Lesson 5
Technologies within the Big Data Eco System
At the end of this lesson, you will learn:
› The BIG Data landscape
› BIG Data key components
› Components of analytical Big Data processing
› Hadoop at a glance
› Conclusion
BIG Data Landscape
› Hardware (storage, servers, networking): vendors include DELL, HP, Arista, IBM, Cisco, EMC, NetApp
› BIG Data Distributions (open-source Hadoop distributions, enterprise Hadoop distributions, non-Hadoop BIG Data frameworks): vendors/providers include Apache, Cloudera, Hortonworks, IBM, EMC, MapR, LexisNexis
› Data Management Components (distributed file stores, NoSQL databases, Hadoop-optimized data warehouses, data integration, data quality & governance): vendors/providers include Apache, DataStax, Pervasive, Couchbase, IBM, Oracle, Informatica, Syncsort, Talend
› Analytics Layer (analytic application development platforms, advanced analytic applications): vendors/providers include Apache, Karmasphere, Hadapt, Attivio, 101Data, EMC, SAS Institute, Digital Reasoning, Revolution Analytics
› Application Layer (data visualization tools, BI applications): vendors include Datameer, ClickFox, Platfora, Tableau Software, Tresata, IBM, SAP, MicroStrategy, Pentaho, QlikTech, Jaspersoft
› Services (consulting, training, technical support, software/hardware maintenance, hosting / BIG Data as a Service in the cloud): vendors include Tresata, Tidemark, Think Big Analytics, Amazon Web Services, Accenture, Cloudera, Hortonworks
BIG Data Key Components: Abstract Layers
› Analytic applications, e.g. ClickFox, Merced
› Fast-loading analytic database, e.g. Greenplum, Netezza
› ETL (Extract, Transform & Load) & modeling tools (e.g. CR-X) over structured sources
› Management & security, e.g. Kerberos
› Higher-level languages: Cascading, Pig, Hive (DW)
› MapReduce engine: job & task trackers
› Processing & original data: Hadoop's location-aware file system (e.g. HDFS) and NoSQL databases (e.g. HBase, Cassandra)
Components of Analytical Big-data Processing
› Raw massive data: kept on cheap commodity machines/servers, organized into nodes and clusters.
› File systems such as the Hadoop Distributed File System (HDFS), which manage the storage and retrieval of data and the metadata required for computation. Other file systems or databases such as HBase (a NoSQL tabular store) or Cassandra (a NoSQL eventually-consistent key-value store) can also be used.
› Computation engine: instead of writing in Java, higher-level languages such as Pig (part of the Hadoop ecosystem) can be used, simplifying the writing of computations.
› Data warehouse layer: Hive is a data warehouse layer built on top of Hadoop, developed by Facebook programmers for the BIG Data platform.
› Cascading is a thin Java library that sits on top of Hadoop and allows suites of MapReduce jobs to be run and managed as a unit. It is widely used to develop special-purpose tools.
› Semi-automated modeling tools such as CR-X allow models to be developed interactively at great speed, and can help set up the database that will run the analytics.
› Analytic database: specialized scale-out analytic databases such as Greenplum or Netezza load & reload the data for the analytic models very quickly.
› ISV big data analytical packages such as ClickFox and Merced run against the database to help address the business issues.
Hadoop at a Glance
Hadoop: a big tool for BIG Data. It is not advisable to dig the hole for a pool using only an ice cream scoop; you need a big tool, and Hadoop is one of them!
› Apache Hadoop is an open-source project inspired by Google's BIG Data research.
› Hadoop is the best available tool for processing and storing herculean amounts of Big Data. Hadoop throws thousands of computers at a big data problem, rather than using a single computer.
› In Hadoop parlance, a group of coordinated computers is called a cluster, and the individual computers in the cluster are called nodes.
› Hadoop makes data mining, analytics, and processing of big data cheap and fast compared with other toolsets.
› Hadoop is cheap, fast and flexible, and scales to large amounts of big data storage & computation.
Conclusion
Looking at the data explosion, the real issue is not acquiring large amounts of data or storing it; it is what you do with your BIG Data!
With BIG Data and BIG Data Analytics, it's possible to:
› Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
› Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
› Mine customer data for insights that drive new strategies for customer acquisition, retention, campaigns, etc.
› Quickly identify the customers who matter the most.
› Generate retail coupons at the point of sale based on the customer's current and past purchases.
› Send tailored recommendations at just the right time, while customers are in the right location.
› Analyze data from social media to detect new market trends and changes in demand.
› Use clickstream analysis and data mining to detect fraudulent behavior.
› Determine root causes of failures, issues & defects by investigating user sessions, network logs & sensors.
The working principle behind all big data platforms is to move the query to the data to be processed, not the data to the query processor.
It's time to stop driving the car by looking only in the rear-view mirror (traditional BI), and instead also look a step ahead into predictive analytics, using the power of BIG Data to help the organization make the right decision at the right point in time.
Summary
In this chapter you have learned:
› The BIG Data landscape
› BIG Data key components
› Components of analytical Big Data processing
› Hadoop at a glance
› Conclusion
Module 2
Introduction to Apache Hadoop and its Ecosystem
› Lesson 1: The Motivation for Hadoop
› Lesson 2: Hadoop: Concepts and Architecture
› Lesson 3: Hadoop and the Data Warehouse: When and Where to Use Which
› Lesson 4: Introducing Hadoop Ecosystem Components
Module Objectives
At the end of this module, you will learn:
› Introduction to Apache Hadoop
› The motivation for Hadoop
› The basic concepts of Hadoop
› Hadoop architecture
› The Hadoop Distributed File System (HDFS) and MapReduce
› The right usage and scenarios for Hadoop
› Introduction to key Hadoop Ecosystem projects
Lesson 1
The Motivation for Hadoop
At the end of this lesson, you will learn:
› What problems exist with traditional large-scale computing systems
› What requirements an alternative approach should have
› How Hadoop addresses those requirements
Traditional Large-Scale Computation
Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Significant amount of complex processing performed on that data
For decades, the primary push was to increase the computing power of a single machine
– Faster processor, more RAM
Distributed systems evolved to allow developers to use multiple machines for a single job
– MPI (Message Passing Interface)
– PVM (Parallel Virtual Machine)
– Condor
Distributed Systems: Problems
Programming for traditional distributed systems is complex
– Data exchange requires synchronization
– Finite bandwidth is available
– Temporal dependencies are complicated
– It is difficult to deal with partial failures of the system
Ken Arnold, CORBA (Common Object Request Broker Architecture) designer:
– "Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure"
– Developers spend more time designing for failure than they do actually working on the problem itself
Distributed Systems: Data Storage
– Typically, data for a distributed system is stored on a SAN
– At compute time, data is copied to the compute nodes
– This is fine for relatively limited amounts of data
The Data-Driven World
Modern systems have to deal with far more data than was the case in the past
– Organizations are generating huge amounts of data
– That data has inherent value, and cannot be discarded
Examples:
– Facebook: over 70PB of data
– eBay: over 5PB of data
Many organizations are generating data at a rate of terabytes per day.
Data Becomes the Bottleneck
Moore's Law has held firm for over 40 years
– Processing power doubles every two years
– Processing speed is no longer the problem
Getting the data to the processors becomes the bottleneck.
Quick calculation:
– Typical disk data transfer rate: 75MB/sec
– Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
– Assuming sustained reads
– Actual time will be worse, since most servers have less than 100GB of RAM available
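The quick calculation above can be checked directly. Using 1 GB = 1024 MB, the result lands at just under 23 minutes; the slide's "approx. 22 minutes" differs only by rounding:

```python
# Back-of-the-envelope check: time for one disk to deliver 100 GB
# to a processor at a sustained read rate of 75 MB/sec.
transfer_rate_mb_s = 75            # typical sustained disk read, MB/sec
data_mb = 100 * 1024               # 100 GB expressed in MB

seconds = data_mb / transfer_rate_mb_s
minutes = seconds / 60
print(f"{minutes:.1f} minutes")    # 22.8 minutes
```

This is exactly why Hadoop moves the computation to the data rather than pulling 100 GB across a single disk or network link.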
A new approach is needed
Partial Failure Support
The system must support partial failure – Failure of a component should result in a graceful degradation of application performance – Not complete failure of the entire system
Data Recoverability
If a component of the system fails, its workload should be assumed by still functioning units in the system – Failure should not result in the loss of any data
Component Recovery
If a component of the system fails and then recovers, it should be able to rejoin the system – Without requiring a full restart of the entire system
Consistency
Component failures during execution of a job should not affect the outcome of the job
Scalability
Adding load to the system should result in a graceful decline in performance of individual jobs – Not failure of the system
Increasing resources should support a proportional increase in load capacity
Hadoop’s History
Hadoop is based on work done by Google in the late 1990s/early 2000s – Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004
This work takes a radical new approach to the problem of distributed computing – Meets all the requirements we have for reliability and scalability
Core concept: distribute the data as it is initially stored in the system – Individual nodes can work on data local to those nodes – No data transfer over the network is required for initial processing
Core Hadoop Concepts
Applications are written in high-level code
– Developers need not worry about network programming, temporal dependencies or low-level infrastructure
Nodes talk to each other as little as possible
– Developers should not write code which communicates between nodes
– 'Shared nothing' architecture
Data is spread among machines in advance
– Computation happens where the data is stored, wherever possible
– Data is replicated multiple times on the system for increased availability and reliability
Hadoop: Very High-Level Overview
When data is loaded into the system, it is split into ‘blocks’ – Typically 64MB or 128MB
Map tasks (the first part of the MapReduce system) work on relatively small portions of data – Typically a single block
A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible – Many nodes work in parallel, each on their own part of the overall dataset
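A toy sketch of this locality-preferring assignment follows. This is illustrative only, not Hadoop's actual scheduler, and every name in it is made up: the master simply prefers an idle node that already holds a replica of the task's block, falling back to a remote node only when no local one is free:

```python
# Toy illustration (not Hadoop's real scheduler): assign each Map task to a
# node that holds a replica of its input block whenever such a node is idle.
def assign_tasks(block_locations, free_nodes):
    """block_locations: {block_id: [nodes holding a replica]}
    free_nodes: idle nodes. Returns {block_id: assigned node}."""
    free = set(free_nodes)
    assignment = {}
    for block in sorted(block_locations):
        replicas = block_locations[block]
        local = sorted(n for n in replicas if n in free)
        if local:                          # data-local: no network copy needed
            node = local[0]
        elif free:                         # fall back: remote read required
            node = sorted(free)[0]
        else:
            break                          # no idle nodes left; task waits
        assignment[block] = node
        free.remove(node)
    return assignment

blocks = {"b1": ["nodeA", "nodeB"], "b2": ["nodeB", "nodeC"], "b3": ["nodeA", "nodeC"]}
print(assign_tasks(blocks, {"nodeA", "nodeB", "nodeC"}))
# {'b1': 'nodeA', 'b2': 'nodeB', 'b3': 'nodeC'} -- every task is data-local
```

In this small example every Map task ends up on a node that already stores its block, so no input data crosses the network, which is the whole point of the design.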
Fault Tolerance
If a node fails, the master will detect that failure and re-assign the work to a different node on the system
Restarting a task does not require communication with nodes working on other portions of the data
If a failed node restarts, it is automatically added back to the system and assigned new tasks
If a node appears to be running slowly, the master can redundantly execute another instance of the same task – Results from the first to finish will be used – Known as ‘speculative execution’
Summary
In this chapter you have learned:
› What problems exist with traditional large-scale computing systems
› What requirements an alternative approach should have
› How Hadoop addresses those requirements
Lesson 2
Hadoop: Concepts and Architecture
At the end of this lesson, you will learn:
› What Hadoop is all about
› Hadoop components
› What features the Hadoop Distributed File System (HDFS) provides
› HDFS architecture
› The concepts behind MapReduce
The Hadoop Project
– Hadoop is an open-source project overseen by the Apache Software Foundation
– Originally based on papers published by Google in 2003 and 2004
– Hadoop committers work at several different organizations, including Yahoo!, Facebook and LinkedIn
Hadoop Components
Hadoop consists of two core components – The Hadoop Distributed File System (HDFS) – MapReduce
There are many other projects based around core Hadoop – Often referred to as the ‘Hadoop Ecosystem’ – Pig, Hive, HBase, Flume, Oozie, Sqoop etc – Many are discussed later in the course
A set of machines running HDFS and MapReduce is known as a Hadoop cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, or as many as several thousand
– More nodes = better performance
Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
Data is split into blocks and distributed across multiple nodes in the cluster – Each block is typically 64MB or 128MB in size
Each block is replicated multiple times – Default is to replicate each block three times – Replicas are stored on different nodes – This ensures both reliability and availability
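A minimal sketch of the split-and-replicate idea follows. It is illustrative only: real HDFS placement is rack-aware and considerably more involved, and the round-robin policy here is an assumption made just to keep replicas on distinct nodes:

```python
# Toy sketch (not HDFS internals): split a file into fixed-size blocks and
# place three replicas of each block on distinct nodes, round-robin.
BLOCK_SIZE = 128 * 1024 * 1024     # 128 MB
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed; the last block may be smaller than block_size."""
    return (file_size + block_size - 1) // block_size

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Returns {block_index: [nodes]} with each replica on a distinct node."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

# A 300 MB file needs three 128 MB blocks (the last holds only 44 MB).
n = split_into_blocks(300 * 1024 * 1024)
print(n)                                           # 3
print(place_replicas(n, ["n1", "n2", "n3", "n4"]))
```

Because each block lives on three different nodes, losing any single machine leaves two intact copies of every block it held.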
HDFS
The data file is broken up into 64MB or 128MB blocks; the blocks are replicated three times and scattered among the worker nodes.
Hadoop Components: MapReduce
MapReduce is the system used to process data in the Hadoop cluster
Consists of two phases: Map, and then Reduce – Between the two is a stage known as the shuffle and sort
Each Map task operates on a discrete portion of the overall dataset – Typically one HDFS block of data
After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase – Much more on this later!
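The Map, shuffle-and-sort, and Reduce phases above can be sketched in plain Python, with no Hadoop involved, using the classic word-count example. The function names are ours, chosen only to mirror the phase names:

```python
# Pure-Python sketch of the MapReduce flow: map emits (key, value) pairs,
# the shuffle groups values by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(line):
    for word in line.lower().split():
        yield (word, 1)                  # one pair per word occurrence

def shuffle(pairs):
    groups = defaultdict(list)           # stands in for the shuffle-and-sort stage
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))            # total count per word

lines = ["the quick brown fox", "the lazy dog"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result["the"])                     # 2
```

In a real cluster each `map_phase` call runs on the node holding its block, and the shuffle moves intermediate pairs across the network so that all values for a given key land on the same reducer.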
HDFS Basic Concepts
HDFS is a filesystem written in Java – Based on Google’s GFS
Sits on top of a native filesystem – Such as ext3, ext4 or xfs
Provides redundant storage for massive amounts of data, using 'commodity' (relatively low-cost) computers
HDFS Basic Concepts (Cont’d)
HDFS performs best with a ‘modest’ number of large files – Millions, rather than billions, of files – Each file typically 100MB or more
Files in HDFS are ‘write once’ – No random writes to files are allowed
HDFS is optimized for large, streaming reads of files – Rather than random reads
How Files Are Stored
Files are split into blocks – Each block is usually 64MB or 128MB
Data is distributed across many machines at load time – Different blocks from the same file will be stored on different machines – This provides for efficient MapReduce processing (see later)
Blocks are replicated across multiple machines, known as DataNodes – Default replication is three-fold – Meaning that each block exists on three different machines
A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located
– Known as the metadata
How Files Are Stored: Example
NameNode holds metadata for the two files (Foo.txt and Bar.txt)
DataNodes hold the actual blocks – Each block will be 64MB or 128MB in size – Each block is replicated three times on the cluster
More On The HDFS NameNode
The NameNode daemon must be running at all times – If the NameNode stops, the cluster becomes inaccessible – Your system administrator will take care to ensure that the NameNode hardware is reliable!
The NameNode holds all of its metadata in RAM for fast access – It keeps a record of changes on disk for crash recovery
A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode – Be careful: The Secondary NameNode is not a backup NameNode!
CDH4 introduces NameNode High Availability – NameNode is not a single point of failure – Features an Active and a Standby NameNode
HDFS: Points To Note
Although files are split into 64MB or 128MB blocks, a file smaller than the block size occupies only as much space as it needs – the full 64MB/128MB is not consumed
Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop’s configuration files – This will be set by the system administrator
Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster
When a client application wants to read a file: – It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on – It then communicates directly with the DataNodes to read the data – The NameNode will not be a bottleneck
Accessing HDFS
Applications can read and write HDFS files directly via the Java API – Covered later in the course
Typically, files are created on a local filesystem and must be moved into HDFS
Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem
Access to HDFS from the command line is achieved with the hadoop fs command
hadoop fs Examples
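The command examples from these slides did not survive conversion; as a reference, typical hadoop fs invocations look like the following (the paths and file names are illustrative only, and the commands require a running cluster or the training VM):

```shell
hadoop fs -ls /user/training                # list a directory in HDFS
hadoop fs -mkdir /user/training/input       # create an HDFS directory
hadoop fs -put shakespeare.txt input/       # copy a local file into HDFS
hadoop fs -cat input/shakespeare.txt        # print an HDFS file to stdout
hadoop fs -get input/shakespeare.txt s.txt  # copy an HDFS file to local disk
hadoop fs -rm input/shakespeare.txt         # delete an HDFS file
```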
Hands-On Exercise: Using HDFS
Aside: The Training Virtual Machine
During this course, you will perform numerous Hands-On Exercises using the Training Virtual Machine (VM)
The VM has Hadoop installed in pseudo-distributed mode – This essentially means that it is a cluster comprised of a single node – Using a pseudo-distributed cluster is the typical way to test your code before you run it on a full cluster – It operates almost exactly like a ‘real’ cluster – A key difference is that the data replication factor is set to 1, not 3
Hands-On Exercise: Using HDFS
In this Hands-On Exercise you will gain familiarity with manipulating files in HDFS
Please refer to the Hands-On Exercise Manual
What is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node – Where possible
Consists of two phases: – Map – Reduce
Features of MapReduce
Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers – MapReduce programs are usually written in Java – Can be written in any language using Hadoop Streaming – All of Hadoop itself is written in Java
MapReduce abstracts all the ‘housekeeping’ away from the developer – Developer can concentrate simply on writing the Map and Reduce functions
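The flow described here – map, then shuffle and sort, then reduce – can be sketched without any Hadoop dependency at all. The following stand-alone Java illustration (class and method names are ours, not Hadoop's) counts words the same way a MapReduce word-count job does:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Stand-alone illustration of the MapReduce flow: map -> shuffle and sort -> reduce.
public class MiniMapReduce {

    // Map phase: emit one (word, 1) pair per word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+"))
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Shuffle and sort: group all values belonging to the same key, sorted by key.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: sum each key's value list into a final count.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"the cat sat", "the cat ran"})
            pairs.addAll(map(line));
        System.out.println(reduce(shuffle(pairs))); // {cat=2, ran=1, sat=1, the=2}
    }
}
```

In a real cluster the map calls run in parallel across nodes and the shuffle moves data over the network, but the key/value contract is exactly this.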
Giant Data: MapReduce and Hadoop In 2010, Facebook sat on top of a mountain of data; just one year later it had grown from 21 to 30 petabytes. If you were to store all of this data on 1TB hard disks and stack them on top of one another, you would have a tower twice as high as the Empire State building in New York.
Enterprises like Google and Facebook use the MapReduce approach to process petabyte-range volumes of data. For some analyses, it is an attractive alternative to SQL databases, and Apache Hadoop exists as an open-source implementation.
MapReduce: Automatically Distributed
Processing and analyzing such data need to take place in a distributed process on multiple machines. However, this kind of processing has always been very complex, and much time is spent solving recurring problems, like processing in parallel, distributing data to the compute nodes, and, in particular, handling errors during processing. To free developers from these repetitive tasks, Google introduced the MapReduce framework.
MapReduce Framework
The MapReduce framework breaks data processing down into map, shuffle, and reduce phases. Processing happens mainly in parallel on multiple compute nodes.
MapReduce: Map Phase
The Map Phase
The Shuffle Phase
The Reduce Phase
MapReduce Programming Example: Search Engine
A web search engine is a good example of the use of MapReduce. A set of MapReduce jobs can be used to implement the PageRank algorithm, which Google uses to evaluate the relevance of a page on the web.
Schematic process of a map–reduce computation
Map Method:
Reduce Method:
The Use of a Combiner
A combiner pre-aggregates map output locally before the shuffle; it makes sense for arithmetic operations (such as sums and counts) in particular.
MapReduce: The Big Picture
The Five Hadoop Daemons
Hadoop comprises five separate daemons
NameNode — Holds the metadata for HDFS
Secondary NameNode — Performs housekeeping functions for the NameNode — Is not a backup or hot standby for the NameNode
DataNode — Stores actual HDFS data blocks
JobTracker — Manages MapReduce jobs, distributing individual tasks to the machines running the TaskTracker daemon
TaskTracker — Instantiates and monitors individual Map and Reduce tasks
The Five Hadoop Daemons (cont’d)
Each daemon runs in its own Java Virtual Machine (JVM)
No node on a real cluster will run all five daemons – Although this is technically possible
We can consider nodes to be in two different categories: – Master Nodes – Run the NameNode, Secondary NameNode, JobTracker daemons – Only one of each of these daemons runs on the cluster – Slave Nodes – Run the DataNode and TaskTracker daemons • A slave node will run both of these daemons
Basic Cluster Configuration
Basic Cluster Configuration (Cont’d)
On very small clusters, the NameNode, JobTracker and Secondary NameNode can all reside on a single machine – It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes
Each dotted box on the previous diagram represents a separate Java Virtual Machine (JVM)
Submitting A Job
When a client submits a job, its configuration information is packaged into an XML file
This file, along with the .jar file containing the actual program code, is handed to the JobTracker – The JobTracker then parcels out individual tasks to TaskTracker nodes – When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task – TaskTracker nodes can be configured to run multiple tasks at the same time – If the node has enough processing power and memory
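On the client side, submitting a compiled job is typically a single command; the jar, driver class, and directory names below are illustrative:

```shell
hadoop jar wordcount.jar WordCount input_dir output_dir
```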
MapReduce: The JobTracker
MapReduce jobs are controlled by a software daemon known as the JobTracker The JobTracker resides on a ‘master node’ – Clients submit MapReduce jobs to the JobTracker – The JobTracker assigns Map and Reduce tasks to other nodes on the cluster – These nodes each run a software daemon known as the TaskTracker – The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
MapReduce: Terminology
A job is a ‘full program’ – A complete execution of Mappers and Reducers over a dataset
A task is the execution of a single Mapper or Reducer over a slice of data
A task attempt is a particular instance of an attempt to execute a task – There will be at least as many task attempts as there are tasks – If a task attempt fails, another will be started by the JobTracker – Speculative execution (see later) can also result in more task attempts than completed tasks
MapReduce: The Mapper
MapReduce: The Mapper (cont’d)
The Mapper may use or completely ignore the input key – For example, a standard pattern is to read a line of a file at a time – The key is the byte offset into the file at which the line starts – The value is the contents of the line itself – Typically the key is considered irrelevant
If the Mapper writes anything out, the output must be in the form of key/value pairs
Example Mapper: Upper Case Mapper
Example Mapper: Explode Mapper
Example Mapper: Filter Mapper
Example Mapper: Changing Keyspaces
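The mapper patterns named on the preceding slides can be sketched in plain Java. These are our own illustrative helpers – a real Hadoop Mapper extends org.apache.hadoop.mapreduce.Mapper and emits pairs through Context.write – but the key/value logic is the same:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative stand-ins for the mapper patterns (not the Hadoop API).
public class MapperPatterns {

    // Upper Case Mapper: (key, value) -> (key, VALUE).
    static Map.Entry<String, String> upperCaseMap(String key, String value) {
        return new AbstractMap.SimpleEntry<>(key, value.toUpperCase());
    }

    // Explode Mapper: one (key, character) pair per character of the value.
    static List<Map.Entry<String, String>> explodeMap(String key, String value) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (char c : value.toCharArray())
            out.add(new AbstractMap.SimpleEntry<>(key, String.valueOf(c)));
        return out;
    }

    // Filter Mapper: emit the pair only if the value passes a test.
    static Optional<Map.Entry<String, String>> filterMap(String key, String value) {
        return value.contains("hadoop")
                ? Optional.of(new AbstractMap.SimpleEntry<>(key, value))
                : Optional.empty();
    }

    // Changing Keyspaces: derive a new key (here, the first word) from the value.
    static Map.Entry<String, String> changeKeyspaceMap(String key, String value) {
        return new AbstractMap.SimpleEntry<>(value.split("\\s+")[0], value);
    }
}
```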
MapReduce: The Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
This list is given to a Reducer – There may be a single Reducer, or multiple Reducers – This is specified as part of the job configuration (see later) – All values associated with a particular intermediate key are guaranteed to go to the same Reducer – The intermediate keys, and their value lists, are passed to the Reducer in sorted key order – This step is known as the ‘shuffle and sort’
The Reducer outputs zero or more final key/value pairs – These are written to HDFS – In practice, the Reducer usually emits a single key/value pair for each input key
Example Reducer: Sum Reducer
Example Reducer: Identity Reducer
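The two reducer patterns can likewise be sketched in plain Java (again our own illustrative helpers – a real Hadoop Reducer extends org.apache.hadoop.mapreduce.Reducer):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins for the reducer patterns (not the Hadoop API).
public class ReducerPatterns {

    // Sum Reducer: emit (key, sum of all intermediate values).
    static Map.Entry<String, Integer> sumReduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return new AbstractMap.SimpleEntry<>(key, sum);
    }

    // Identity Reducer: pass every (key, value) pair through unchanged.
    static List<Map.Entry<String, Integer>> identityReduce(String key, List<Integer> values) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int v : values) out.add(new AbstractMap.SimpleEntry<>(key, v));
        return out;
    }
}
```

The Sum Reducer is the reduce half of word count: each word's list of 1s collapses to its total.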
MapReduce Example: Word Count
MapReduce Example: Word Count (Cont’d)
MapReduce Example: Word Count (Cont’d)
MapReduce: Data Locality
Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS
If this is not possible, the Map task will have to transfer the data across the network as it processes that data
Once the Map tasks have finished, data is then transferred across the network to the Reducers – Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers – All Mappers will, in general, have to communicate with all Reducers
MapReduce: Is Shuffle and Sort a Bottleneck?
It appears that the shuffle and sort phase is a bottleneck – The reduce method in the Reducers cannot start until all Mappers have finished
In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work – This mitigates against a huge amount of data transfer starting as soon as the last Mapper finishes – Note that this behavior is configurable – The developer can specify the percentage of Mappers which should finish before Reducers start retrieving data – The developer’s reduce method still does not start until all intermediate data has been transferred and sorted
MapReduce: Is a Slow Mapper a Bottleneck?
It is possible for one Map task to run more slowly than the others – Perhaps due to faulty hardware, or just a very slow machine
It would appear that this would create a bottleneck – The reduce method in the Reducer cannot start until every Mapper has finished
Hadoop uses speculative execution to mitigate against this – If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data – The results of the first Mapper to finish will be used – Hadoop will kill off the Mapper which is still running
Hands-On Exercise: Running A MapReduce Job
In this Hands-On Exercise, you will run a MapReduce job on your pseudo-distributed Hadoop cluster
Please refer to the Hands-On Exercise Manual
Summary
In this chapter you have learned
What Hadoop is all about
The components that make up Hadoop
The concepts and detailed architecture of HDFS
What features the Hadoop Distributed File System (HDFS) provides
The concepts behind MapReduce
A few illustrations of MapReduce and how it works in practice
Lesson 3
Hadoop and the Data Warehouse: When and Where to Use Which
At the end of this lesson, you will learn to:
Answer the question: when should I use Hadoop, and when should I put the data into a data warehouse?
Hadoop differentiators
Data warehouse differentiators
When and where to use which
Hadoop and the Data Warehouse
Figure 1. Before: Data flow of meter reading done manually
Figure 2. After: Meter reading every 5 or 60 minutes via smart meters
Hadoop Differentiators
Hadoop is the repository and refinery for raw data.
Hadoop is a powerful, economical and active archive.
Data Warehouse Differentiators
Data warehouse performance
Integrated data that provides business value
Interactive BI tools for end users
When and Where to Use Which
While there are certain use cases that are distinct to Hadoop or the data warehouse, there is also overlap where either technology could be effective. The following table is a good starting place for helping to decide which platform to use based on your requirements.
Summary
In this chapter you have learned
When to use Hadoop, and when to put the data into a data warehouse
The Hadoop and data warehouse differentiators
When and where to use which
Lesson 4
Introducing Hadoop Eco system components
At the end of this lesson, you will learn to:
A quick overview of a few key Hadoop Ecosystem projects: Hive, Pig, Flume, Sqoop, Oozie and HBase. The details of each, with demos and hands-on exercises, are covered in separate modules.
Other Ecosystem Projects: Introduction
The term ‘Hadoop core’ refers to HDFS and MapReduce
Many other projects exist which use Hadoop core – Either both HDFS and MapReduce, or just HDFS
Most are Apache projects or Apache Incubator projects – Some others are not hosted by the Apache Software Foundation – These are often hosted on GitHub or a similar repository
We will investigate many of these projects later in the course
Following is an introduction to some of the most significant projects
Hive
Hive is an abstraction on top of MapReduce
Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
Uses the HiveQL language – Very similar to SQL
The Hive Interpreter runs on a client machine – Turns HiveQL queries into MapReduce jobs – Submits those jobs to the cluster
Note: this does not turn the cluster into a relational database server – It is still simply running MapReduce jobs – Those jobs are created by the Hive Interpreter
Hive (cont’d)
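As a flavor of HiveQL (the table, column and path names below are hypothetical), a script the Hive Interpreter would compile into MapReduce jobs might look like this:

```sql
-- Define a table over tab-separated data already sitting in HDFS
CREATE TABLE movies (id INT, title STRING, year INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/training/movies.tsv' INTO TABLE movies;

-- This familiar-looking query runs as one or more MapReduce jobs
SELECT year, COUNT(*) FROM movies GROUP BY year;
```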
Pig
Pig is an alternative abstraction on top of MapReduce
Uses a dataflow scripting language – Called PigLatin
The Pig interpreter runs on the client machine – Takes the PigLatin script and turns it into a series of MapReduce jobs – Submits those jobs to the cluster
As with Hive, nothing ‘magical’ happens on the cluster – It is still simply running MapReduce jobs
Pig (Cont’d)
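For comparison, an equivalent dataflow in PigLatin (file and field names are hypothetical) – the Pig interpreter turns these steps into a series of MapReduce jobs:

```pig
-- Load, filter, group and count: each relation is a named step in the dataflow
movies  = LOAD '/user/training/movies.tsv'
          AS (id:int, title:chararray, year:int);
recent  = FILTER movies BY year >= 2000;
by_year = GROUP recent BY year;
counts  = FOREACH by_year GENERATE group, COUNT(recent);
DUMP counts;   -- triggers execution on the cluster
```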
Flume
Flume provides a method to import data into HDFS as it is generated – Rather than batch-processing the data later – For example, log files from a Web server
A high level diagram
Sqoop
Sqoop provides a method to import data from tables in a relational database into HDFS – Does this very efficiently via a Map-only MapReduce job – Can also ‘go the other way’ – Populate database tables from files in HDFS
We will investigate Sqoop later in the course.
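A typical Sqoop import invocation looks like the following (the JDBC URL, credentials and table name are illustrative):

```shell
sqoop import \
  --connect jdbc:mysql://dbserver.example.com/sales \
  --username training --password training \
  --table customers \
  --target-dir /user/training/customers \
  --num-mappers 4   # runs as a Map-only MapReduce job
```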
Oozie
Oozie allows developers to create a workflow of MapReduce jobs – Including dependencies between jobs
The Oozie server submits the jobs to the cluster in the correct sequence
We will investigate Oozie later in the course
HBase
HBase is ‘the Hadoop database’
A ‘NoSQL’ datastore
Can store massive amounts of data – Gigabytes, terabytes, and even petabytes of data in a table
Scales to provide very high write throughput – Hundreds of thousands of inserts per second
Copes well with sparse data – Tables can have many thousands of columns – Even if most columns are empty for any given row
Has a very constrained access model – Insert a row, retrieve a row, do a full or partial table scan – Only one column (the ‘row key’) is indexed
HBase vs Traditional RDBMSs
Summary
In this chapter you have learned
The different Hadoop Ecosystem projects, namely:
Hive
Pig
Sqoop
Flume
Oozie
HBase (a Hadoop datastore)