Apache Hadoop Training for Developers_Day#1

April 19, 2017 | Author: Manish Jain | Category: N/A



Chapter Topics

› About this course
› Course Logistics

Course Objective

Refer to the course objective and content PDF.


 About your instructor
 About you
– Experience with Hadoop?
– Experience as a developer?
– Expectations from the course?

Module 1

Introduction to BIG Data and its Need

› Lesson 1: Introduction to BIG Data
› Lesson 2: Big Data Analytics and why it’s a need now
› Lesson 3: Traditional Vs. Big Data Approach
› Lesson 4: Real Time Case Studies
› Lesson 5: Technologies within Big Data Eco System

Module Objectives

At the end of this module, you will learn to:

› Introduction to BIG Data
› Few Examples on BIG Data
› Big Data Real time case studies
› Why Big Data is a BUZZ and why it’s a need now
› Big Data Analytics
› Comparison between Traditional and Big Data approach
› Technologies within Big Data Eco System

Lesson 1

Introduction to BIG Data

At the end of this lesson, you will learn to:

› What is Big Data?
› The 3 V’s of BIG Data
› Few Examples of Big Data
› Why Big Data is a BUZZ!


What is BIG Data??? When you hear the term BIG Data… what is the first instant thought?

Volume!!!! right??? Massive, huge, enormous quantities of digital stuff. But it’s not just the volume that makes BIG Data difficult to manage and analyze – it’s also the Variety and Velocity!!!

Big Data : Insight

› Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
› Big Data technology is built to handle the day-to-day data explosion.
› Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
› Big data is difficult to work with using most relational database management systems.

What do we Mean by Big Data?

BIG Data has three defining attributes, the 3 Vs. They are:

 Data Volume
 Data Variety
 Data Velocity

Together, the 3 Vs constitute a comprehensive definition of BIG Data.

3Vs Of BIG Data

Volume (Terabytes, Records, Transactions):

 Using millions of transactions & events to analyze trends and perform forecasts!!
 Turning 12 terabytes of Tweets created each day into improved product sentiment analysis!!
 Converting 350 billion annual meter readings to better predict power consumption!!

Velocity (Batch Load, Near Time Data, Real Time Data):

 Using fast paced real time transactions for predictive analysis!!
 Scrutinizing 5 million trade events created each day to identify potential fraud!!
 Analyzing 500 million daily call detail records in real-time to predict customer churn faster!!

Variety (Structured, Unstructured, Semi Structured):

 Collectively analyzing all forms of data (text, sensor data, audio, video, click streams, log files) gives new insights!!
 Monitoring 100’s of live video feeds from surveillance cameras to target points of interest!!
 Exploiting 80% data growth in images, video and documents to improve customer satisfaction!!

Understanding BIG Data - Summary

Data that’s an order of magnitude greater than you are accustomed to - Gartner Analyst Doug Laney

BIG Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand Database Management Tools - Wikipedia

3 Vs: Volume, Velocity & Variety

Few Examples of BIG Data

Facebook handles 40 billion photos from its user base and has more than 901 million active users generating social interaction data.

RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems.

Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data

More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones

340 million tweets are sent per day. That's nearly 4,000 tweets per second.

Boeing jet engines produce terabytes of operational information for every 30 minutes they turn. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic crossing; multiply that by the more than 25,000 flights flown each day.

10,000 payment card transactions are made every second around the world.
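As a quick sanity check, the per-second rates behind two of the figures above can be worked out directly. This is only back-of-the-envelope arithmetic on the numbers quoted in the examples; the variable names are illustrative, and the Walmart figure is treated as a lower bound because the text says "more than 1 million".

```python
# Back-of-the-envelope rates derived from the figures quoted in the examples.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
SECONDS_PER_HOUR = 60 * 60       # 3,600

tweets_per_day = 340_000_000               # figure from the text
walmart_transactions_per_hour = 1_000_000  # "more than 1 million": lower bound

tweets_per_second = tweets_per_day / SECONDS_PER_DAY
walmart_per_second = walmart_transactions_per_hour / SECONDS_PER_HOUR

print(f"Tweets per second: {tweets_per_second:,.0f}")
print(f"Walmart transactions per second: {walmart_per_second:,.0f}")
```

The tweet rate comes out at roughly 3,935 per second, which is consistent with the "nearly 4,000" quoted.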

Why BIG Data is a BUZZ!

BIG Data Platform can be used to analyze semi structured & unstructured data along with raw structured data…

BIG data solutions are ideal for iterative & exploratory analysis when business measures can not be pre-determined using structured data set…

Big Data can be used to support predictive analytics and provide predictive enterprise solutions using all forms of real transactions, in contrast to traditional DW/BI.

So what does BIG Data mean to a business?

• Profile customers & gain customer trust
• Determine pricing strategies
• Identify competitive advantages
• Better target advertising
• Strengthen customer service

A few case studies for BIG Data would be:

• Performing IT log analytics
• Identifying fraud detection patterns
• Sentiment analytics using social media feeds
• Executing usage analytics in the energy sector
• Analyzing competitor market penetration

Summary

In this chapter you have learned:

› What is Big Data?
› The 3 V’s of BIG Data
› Few Examples of Big Data
› Why Big Data is a BUZZ!

Lesson 2

BIG Data Analytics & Why it’s a Need Now?

At the end of this lesson, you will learn to:

› What is Big Data Analytics?
› Its advantages and challenges.
› Why it has become a need now?
› Big data as a complete Solution.
› Big Data Analytics implementation.

What is BIG Data Analytics

Big data analytics is the process of examining large amounts of data of a variety of types (Structured, Semi-Structured or Unstructured), to uncover hidden patterns, unknown correlations and other useful information.

The primary goal of big data analytics is to help companies make better business decisions by analyzing data sources that may be left untapped by conventional business intelligence (BI) programs

Underlying data may include Web server logs, Internet clickstream data, social media activity reports, mobile-phone call detail records, information captured by sensors, IT logs etc.

Advantages:

• Making sense out of unstructured data
• Optimized usage of organizational data
• Value add to existing BI solutions
• More accurate BI results
• Best bet to make better business decisions

Challenges:

• Lack of skill set
• High initial cost involvement
• Challenges in integrating BIG Data
• Little awareness of technologies
• Unavailability of matured BIG Data toolset

Why BIG Data Analytics is a ‘need’ now?

Information is at the center of a new wave of opportunity…

• 80% of the world’s available data is unstructured or semi structured
• 35 zettabytes of data by 2020
• … as much data & content in the coming decade

… and organizations need deeper insights

• 1 in 3 business leaders frequently make business decisions based on information they do not trust or do not have!!!
• 1 in 2 business leaders say they do not have access to the relevant information they require to do their job!!!
• 83% of CIOs cited “BI” as part of their visionary plans to enhance their competitiveness
• 60% of CEOs need to do a better job capturing & understanding information rapidly in order to make swift business decisions

BIG Data Platform helps you combine varied data forms for making decisions

Why BIG Data Analytics is a ‘need’ now?

• What do people think about your company or product? (Call Centre)
• Who are the BIGGEST influencers and what are they saying? (Social Network)

BIG Data Platform provides you multi-channel Customer Sentiment Analytics

Why BIG Data Analytics is a ‘need’ now?

The future continues to bring new data sources with high data volume. Both new information sources and traditional sources keep growing:

• 2 billion Internet users as of now
• 4.6 billion mobile phones worldwide
• Facebook produces 10 TB of data every day
• Twitter produces 7 TB of data every day
• New media channels emerging every day
• Steady growth of traditional data
• Enormous satellite data growth
• Digitization makes exponential data growth

BIG Data Platform ensures consolidation of ever growing varied data sets

Why BIG Data Analytics is a ‘need’ now?

Imagine if we could…

• … predict infections in premature newborns 24 hours earlier?
• … apply social relationships of customers to prevent churn? (Call Centre Rep)
• … adjust credit lines as transactions are occurring to account for risk fluctuations? (Loan Officer)
• … determine whom to offer discounts at time of sale instead of offering to all? (Sales Associate)

BIG Data Platform can be used across industry for making Analytic Decisions

BIG Data: The Solution

The Solution: bring together any data source @ any velocity to generate insight

• Analyze variety of data @ enormous volume
• Insight on streaming data
• Large volume structured data analysis

BIG Data Platform (Velocity, Variety, Volume):

• Multi channel customer sentiment analytics
• Predict weather patterns to optimize capital expenditure
• Make risk decisions based on real time transactional data
• Identify criminals & threats from disparate audio/video
• Find life threatening conditions in time to intervene

Implementing BIG Data Analytics - Different Approaches

Approach: Interactive Exploration
– Use case: for data analysts & data scientists who want to discover real time patterns as they emerge from their BIG Data content
– BIG Data platform: HBase, NoSQL, Analytic DBMS

Approach: Operational Reporting
– Use case: for executives & operational managers who want summarized, prebuilt, periodic reports on BIG Data content
– BIG Data platform: Hive, NoSQL, Analytic DBMS

Approach: Indirect Batch Analysis
– Use case: for data analysts & operational managers who want to analyze data trends based on predefined questions in their BIG Data content
– BIG Data platform: Hadoop, NoSQL, Analytic DBMS

The approaches are also compared on connectivity (Native, SQL), latency and architecture. Architecture components shown: Multidimensional Analysis, Reports & Dashboards, In-Memory Engine, OLAP Engine, BI Platform (Native), Data Mart, BIG Data.

Summary

In this chapter you have learned:

› What is Big Data Analytics?
› Its advantages and challenges.
› Why it has become a need now?
› Big data as a complete Solution.
› Big Data Analytics implementation.

Lesson 3

Traditional Analytics Vs. Big Data Analytics

At the end of this lesson, you will learn to:

› The Traditional Approach
› The BIG Data Approach
› Traditional Vs. Big Data Approach
› BIG Data Complements Traditional Enterprise Data Warehouse
› Traditional Analytics Vs. Big Data Analytics

The Traditional Approach: Business Requirements Drive Solution Design

› Business defines requirements: what questions should we ask?
› IT designs a solution with a set structure & functionality
› New requirements require redesign & rebuild
› Business executes queries to answer questions over and over

Well suited to:

• High value, structured data
• Repeated operations & processes
• Relatively stable sources
• Well understood requirements

Stretched by:

• Highly valuable data and content
• Exploratory analysis
• Volatile sources
• Changing requirements

The BIG Data Approach: Information Sources Drive Creative Discovery

› Business & IT identify available information sources
› New insights drive integration to traditional technology
› IT delivers a platform that enables creative exploration of all available data & content
› Business determines what questions to ask by exploring data & relationships

Can be implemented for:

• Structured or unstructured data
• Exploratory operations & processes
• Relatively unstable sources
• Unknown business requirements

Traditional Approach vs BIG Data Approach

Traditional Approach: Structured & Repeatable Analysis
• Business users determine what questions to ask
• IT structures data to answer the questions
• Examples: Monthly Sales Report, Profitability Analysis, Customer Surveys

BIG Data Approach: Iterative & Exploratory Analysis
• IT delivers a platform to enable creative discovery
• Business explores what questions could be asked
• Examples: Brand Sentiment, Product Strategy, Maximizing Utilization

BIG Data Complements Traditional Enterprise Data Warehouse

BIG Data shouldn’t be a silo; it must be an integrated part of your Enterprise Information Architecture.

[Diagram labels: Data Warehouse, BIG Data Platform, Enterprise Integration, Traditional Sources, New Sources]

Traditional Analytics Platform v/s BIG Data Analytics Platform

Traditional DW Analytics Platform | BIG Data Analytics Platform
Gigabytes to Terabytes | Petabytes to Exabytes
Centralized data structure | Distributed data structure
Structured | Semi structured & non structured
Relational data model | Flat schemas
Batch oriented data load process | Aimed at near real time analysis of the data
Analytics based on historical trends | Analytics based on real time data
Data generated using conventional methods (data entry) | Data generated using unconventional methods like RFID, sensor networks etc.

Summary

In this chapter you have learned:

› The Traditional Approach
› The BIG Data Approach
› Traditional Vs. Big Data Approach
› How Big Data Complements Traditional Enterprise Data Warehouse
› Traditional Analytics Vs. Big Data Analytics

Lesson 4

Real Time Case Studies

At the end of this lesson, you will learn to:

› Big Data Analytics: Use Cases
› Big Data to Predict Your Customers' Behaviors
› When to Consider a Big Data Solution
› Big Data Real Time Case Studies

BIG Data Analytics - Use Cases

• Web/E-Commerce/Internet
– Integrated Website Analytics

• Retail
– Competitive Pricing
– Customer Segmentation
– Predictive Buying Behavior
– Market Campaign Management

• Government
– Defense Intelligence Analysis
– Threat Analytics

• Telecommunications
– Customer Experience Analytics
– Customer Segmentation
– Service Response Optimization

• Financial Services
– Fraud Detection Analytics
– Risk Modeling & Analysis

• Manufacturing
– Inventory Optimization

• Energy & Utilities
– Customer Experience Analytics
– Service Quality Optimization

• Healthcare & Pharmaceutical
– Drug Discovery

• Insurance

• Media & Content
– Customer Satisfaction Analytics
– Dispatch Optimization

Big Data to Predict Your Customers' Behaviors

Retailers like Wal-Mart and Kohl's are making use of sales, pricing, and economic data, combined with demographic and weather data, to fine-tune merchandising store by store and anticipate appropriate timing of store sales.

Online data services like eHarmony and Match.com are constantly observing activity on their sites to optimize their matching algorithms to predict who will hit it off with whom.

Google search queries on flu symptoms and treatments reveal weeks in advance what flu-related volumes hospital emergency departments can expect.

BIG Data provides the capacity to predict the future before your rivals can, whether they're companies or criminals. Currently the NYPD is using a Big Data platform to fight crime in Manhattan.

When to Consider a Big Data Solution

Big Data solutions are ideal for analyzing not only raw structured data, but also semi structured and unstructured data from a wide variety of sources.

Big Data solutions are ideal when all, or most, of the data needs to be analyzed versus a sample of the data; or a sampling of data isn't nearly as effective as a larger set of data from which to derive analysis.

Big Data solutions are ideal for iterative and exploratory analysis when business measures on data are not predetermined.

Big Data Real Time Case Study

TXU Energy – Smart Electric Meters:

Because of smart meters, electricity providers can read the meter once every 15 minutes rather than once a month. This not only eliminates the need to send someone out for meter reading; because the meter is read every fifteen minutes, electricity can also be priced differently for peak and off-peak hours. Pricing can be used to shape the demand curve during peak hours, eliminating the need to create additional generating capacity just to meet peak demand, and saving electricity providers millions of dollars' worth of investment in generating capacity and plant maintenance costs.

Big Data Real Time Case Study ….(Cont’d)

T-Mobile USA:

T-Mobile USA has integrated Big Data across multiple IT systems to combine customer transaction and interactions data in order to better predict customer defections. By leveraging social media data (Big Data) along with transaction data from CRM and Billing systems, T-Mobile USA has been able to “cut customer defections in half in a single quarter”.

Big Data Real Time Case Study ….(Cont’d)

US Xpress :

US Xpress, a provider of a wide variety of transportation solutions, collects about a thousand data elements ranging from fuel usage to tire condition to truck engine operations to GPS information. It uses this data for optimal fleet management and to drive productivity, saving millions of dollars in operating costs.

Big Data Real Time Case Study ….(Cont’d)

McLaren’s Formula One racing team :

McLaren’s Formula One racing team uses real-time car sensor data during car races, identifies issues with its racing cars using predictive analytics and takes corrective actions proactively before it’s too late!

Summary

In this chapter you have learned:

› Big Data Analytics: Use Cases
› Big Data to Predict Your Customers' Behaviors
› When to Consider a Big Data Solution
› Big Data Real Time Case Studies like TXU smart meters, T-Mobile, US Xpress, McLaren’s Formula One racing team

Lesson 5

Technologies within Big Data Eco System

At the end of this lesson, you will learn to:

› BIG Data Landscape
› BIG Data Key Components
› Components of Analytical Big-data Processing
› Hadoop at a glance
› Conclusion

BIG Data Landscape

Hardware
• Storage • Servers • Networking
Vendors include Dell, HP, Arista, IBM, Cisco, EMC, NetApp.

BIG Data Distributions
• Open source Hadoop distributions • Enterprise Hadoop distributions • Non-Hadoop BIG Data frameworks
Vendors/providers include Apache, Cloudera, Hortonworks, IBM, EMC, MapR, LexisNexis.

Data Management Components
• Distributed file sources • NoSQL databases • Hadoop optimized data warehouse • Data integration • Data quality & governance
Vendors/providers include Apache, DataStax, Pervasive, Couchbase, IBM, Oracle, Informatica, Syncsort, Talend.

Analytics Layer
• Analytic application development platforms • Advanced analytic applications
Vendors/providers include Apache, Karmasphere, Hadapt, Attivio, 1010data, EMC, SAS Institute, Digital Reasoning, Revolution Analytics.

Application Layer
• Data visualization tools • BI applications
Vendors include Datameer, ClickFox, Platfora, Tableau Software, Tresata, IBM, SAP, MicroStrategy, Pentaho, QlikTech, Jaspersoft.

Services
• Consulting • Training • Technical support • Software/hardware maintenance • Hosting / BIG Data as a Service (cloud)
Vendors include Tresata, Tidemark, Think Big Analytics, Amazon Web Services, Accenture, Cloudera, Hortonworks.

BIG Data Key Components

Abstract layers:

• Analytic Applications: ClickFox, Merced etc.
• Fast Loading Analytic Database: e.g. Greenplum, Netezza
• ETL & Modeling: ETL (Extract, Transform & Load) & modeling tools (CR-X), structured source
• Management & Security
• Higher Level Languages
• Hive (DW)
• MapReduce Engine: Job & Task Trackers
• Hadoop location-aware file system: processing & original data
• File System, e.g. HDFS
• NoSQL Database, e.g. HBase, Cassandra

Components of Analytical Big-data Processing

Raw massive data: Kept within cheap commodity machines/ servers. They are further categorized as Nodes and clusters.

File-Systems such as the Hadoop Distributed File System (HDFS), which manages the retrieval and storing of data and metadata required for computation. Other file systems or databases such as Hbase (a NoSQL tabular store) or Cassandra (a NoSQL Eventually‐consistent key‐value store) can also be used.

Computation Engine: Instead of writing in Java, higher level languages such as Pig (part of the Hadoop ecosystem) can be used, simplifying the writing of computations.

Data warehouse Layer: Hive is a Data Warehouse layer built on top of Hadoop, developed by Facebook programmers for BIG Data Platform.

Cascading is a thin Java library that sits on top of Hadoop that allows suites of MapReduce jobs to be run and managed as a unit. It is widely used to develop special tools.

Semi-automated modeling tools such as CR-X allow models to develop interactively at great speed, and can help set up the database that will run the analytics.

Analytic Database: Specialized scale-out analytic databases such as Greenplum or Netezza, with very fast loading, load & reload the data for the analytic models.

ISV big data analytical packages such as ClickFox and Merced run against the database to help address the business issues.

Hadoop at a Glance

Hadoop: A big tool for BIG Data

It is not advisable to dig out the hole for a pool using only an ice cream scooper; you need a big tool – Hadoop is one among them!!!

Apache Hadoop is an open-source project which was inspired by BIG Data research of Google.

Hadoop is one of the best available tools for processing and storing herculean amounts of big data. Hadoop throws thousands of computers at a big data problem, rather than using a single computer.

In Hadoop parlance, a group of coordinated computers is called a cluster & the individual computers in the cluster are called nodes.

Hadoop makes data mining, analytics, and processing of big data cheap and fast when compared with other toolsets.

Hadoop is cheap, fast, flexible & scales to large amounts of big data storage & computation.


Looking at the Data explosion, the real issue is not to acquire large amount of Data or storing those data, it is what you do with your BIG Data!!!

With BIG Data and BIG Data Analytics, it’s possible to:

• Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
• Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
• Mine customer data for insights that drive new strategies for customer acquisition, retention, campaigns etc.
• Quickly identify customers who matter the most.
• Generate retail coupons at the point of sale based on the customer's current and past purchases.
• Send tailored recommendations at just the right time, while customers are in the right location.
• Analyze data from social media to detect new market trends and changes in demand.
• Use clickstream analysis and data mining to detect fraudulent behavior.
• Determine root causes of failures, issues & defects by investigating user sessions, network logs & sensors.

The working principle behind all big data platforms is to move the query to the data to be processed, not the data to the query processor.

It’s time to stop driving the car by looking only in the rear view mirror (Traditional BI) and also to look a step ahead: move into predictive analytics using the power of BIG Data, and so help the organization take the right decision at the right point in time.

Summary

In this chapter you have learned:

› BIG Data Landscape
› BIG Data Key Components
› Components of Analytical Big-data Processing
› Hadoop at a glance


Module 2

Introduction to Apache Hadoop and its Ecosystem

› Lesson 1: The Motivation for Hadoop
› Lesson 2: Hadoop: Concepts and Architecture
› Lesson 3: Hadoop and the Data Warehouse: When and Where to use which
› Lesson 4: Introducing Hadoop Eco system components

Module Objectives

At the end of this module, you will learn to:

› Introduction to Apache Hadoop
› The motivation for Hadoop
› The basic concept of Hadoop
› Hadoop Architecture
› Hadoop Distributed File System (HDFS) and MapReduce
› Right usage and scenarios for Hadoop
› Introduction to key Hadoop Eco System Projects

Lesson 1

The Motivation for Hadoop

At the end of this lesson, you will learn to:

› What problems exist with traditional large scale computing systems
› What requirements an alternative approach should have
› How Hadoop addresses those requirements

Traditional Large Scale Computation

 Traditionally, computation has been processor bound
– Relatively small amounts of data
– Significant amount of complex processing performed on that data

 For decades, the primary push was to increase the computing power of a single machine
– Faster processor, more RAM

 Distributed systems evolved to allow developers to use multiple machines for a single job
– MPI (Message Passing Interface)
– PVM (Parallel Virtual Machine)
– Condor

Distributed Systems: Problems

 Programming for traditional distributed systems is complex
– Data exchange requires synchronization
– Finite bandwidth is available
– Temporal dependencies are complicated
– It is difficult to deal with partial failures of the system

 Ken Arnold, CORBA (Common Object Request Broker Architecture) designer:
– “Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure”
– Developers spend more time designing for failure than they do actually working on the problem itself

Distributed Systems: Data Storage

 Typically, data for a distributed system is stored on a SAN
 At compute time, data is copied to the compute nodes
 Fine for relatively limited amounts of data

The Data Driven World

 Modern systems have to deal with far more data than was the case in the past
– Organizations are generating huge amounts of data
– That data has inherent value, and cannot be discarded

 Examples:
– Facebook: over 70PB of data
– eBay: over 5PB of data

 Many organizations are generating data at a rate of terabytes per day

Data Becomes the Bottleneck

 Moore’s Law has held firm for over 40 years
– Processing power doubles every two years
– Processing speed is no longer the problem

 Getting the data to the processors becomes the bottleneck

 Quick calculation
– Typical disk data transfer rate: 75MB/sec
– Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
– Assuming sustained reads
– Actual time will be worse, since most servers have less than 100GB of RAM available

 A new approach is needed
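The quick calculation above can be reproduced in a few lines. The sketch below also shows, purely as an illustration, what happens when the same data is spread evenly over many disks that are read in parallel. The function name, the `disks` parameter and the choice of 1024MB per GB are assumptions made for this example, not figures from the course.

```python
MB_PER_GB = 1024  # assumption: binary units; with 1000 the result is about 22.2 minutes


def transfer_minutes(data_gb, rate_mb_per_sec=75, disks=1):
    """Minutes needed to read data_gb at a sustained rate,
    with the data spread evenly over `disks` drives read in parallel."""
    seconds = data_gb * MB_PER_GB / (rate_mb_per_sec * disks)
    return seconds / 60


single = transfer_minutes(100)               # one disk, as in the quick calculation
parallel = transfer_minutes(100, disks=100)  # hypothetical: 100 disks side by side

print(f"One disk: {single:.1f} minutes")
print(f"100 disks in parallel: {parallel * 60:.1f} seconds")
```

A single disk needs between 22 and 23 minutes, in line with the "approx. 22 minutes" quoted. The parallel case assumes perfectly even distribution and no coordination overhead, so it is a best case rather than a prediction.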

Partial Failure Support

 The system must support partial failure
– Failure of a component should result in a graceful degradation of application performance
– Not complete failure of the entire system

Data Recoverability

 If a component of the system fails, its workload should be assumed by still functioning units in the system
– Failure should not result in the loss of any data

Component Recovery

 If a component of the system fails and then recovers, it should be able to rejoin the system
– Without requiring a full restart of the entire system


Consistency

 Component failures during execution of a job should not affect the outcome of the job


Scalability

 Adding load to the system should result in a graceful decline in performance of individual jobs
– Not failure of the system

 Increasing resources should support a proportional increase in load capacity

Hadoop’s History

 Hadoop is based on work done by Google in the late 1990s/early 2000s
– Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004

 This work takes a radical new approach to the problem of distributed computing
– Meets all the requirements we have for reliability and scalability

 Core concept: distribute the data as it is initially stored in the system
– Individual nodes can work on data local to those nodes
– No data transfer over the network is required for initial processing

Core Hadoop Concepts

 Applications are written in high level code
– Developers need not worry about network programming, temporal dependencies or low-level infrastructure

 Nodes talk to each other as little as possible
– Developers should not write code which communicates between nodes
– ‘Shared nothing’ architecture

 Data is spread among machines in advance
– Computation happens where the data is stored, wherever possible
– Data is replicated multiple times on the system for increased availability and reliability

Hadoop: Very High-Level Overview

 When data is loaded into the system, it is split into ‘blocks’
– Typically 64MB or 128MB

 Map tasks (the first part of the MapReduce system) work on relatively small portions of data
– Typically a single block

 A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible
– Many nodes work in parallel, each on their own part of the overall dataset
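The idea of preferring a node that already stores the block can be sketched in plain Python. This is a toy model, not Hadoop's scheduler: the function names, node names, block IDs and the notion of "free slots" per node are all assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 64  # the text says blocks are typically 64MB or 128MB


def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def assign_map_tasks(block_locations, free_slots):
    """Assign one Map task per block, preferring a node that stores the block.

    block_locations: dict of block id -> list of nodes holding that block
    free_slots:      dict of node -> number of tasks it can still accept
    Returns a dict of block id -> (node, ran_locally).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block_id, holders in block_locations.items():
        # First choice: a node that already stores this block and has a free slot.
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        local = node is not None
        if node is None:
            # Fallback: any node with a free slot; the block must cross the network.
            node = next((n for n, free in slots.items() if free > 0), None)
        if node is None:
            raise RuntimeError("no free task slots left in the cluster")
        slots[node] -= 1
        assignment[block_id] = (node, local)
    return assignment


# Hypothetical cluster: four nodes with one task slot each.
block_locations = {
    "b0": ["node1", "node2"],
    "b1": ["node1", "node3"],
    "b2": ["node1", "node2"],
    "b3": ["node1", "node2"],
}
free_slots = {"node1": 1, "node2": 1, "node3": 1, "node4": 1}

print(num_blocks(200))  # blocks needed for a 200MB file
for block_id, (node, local) in assign_map_tasks(block_locations, free_slots).items():
    print(block_id, node, "local" if local else "remote")
```

In this run the first three blocks are processed on nodes that hold them, while `b3` falls back to `node4` because both of its holders are busy. That is the "whenever possible" caveat in the text.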

Fault Tolerance

If a node fails, the master will detect that failure and re-assign the work to a different node on the system

Restarting a task does not require communication with nodes working on other portions of the data

If a failed node restarts, it is automatically added back to the system and assigned new tasks

If a node appears to be running slowly, the master can redundantly execute another instance of the same task
– Results from the first to finish will be used
– Known as ‘speculative execution’
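Two of the behaviours above, re-assigning work from a failed node and keeping the first result under speculative execution, can be illustrated with a small sketch. How the master detects a failure is not described here, so the failed node is simply passed in. All names, and the round-robin choice of replacement node, are assumptions for illustration only.

```python
from itertools import cycle


def reassign_tasks(assignments, failed_node, live_nodes):
    """Move every task that was on failed_node to a live node (round-robin).

    assignments: dict of task id -> node currently responsible for it
    Tasks on healthy nodes are left where they are.
    """
    if not live_nodes:
        raise ValueError("no live nodes available")
    targets = cycle(sorted(live_nodes))
    return {
        task: (next(targets) if node == failed_node else node)
        for task, node in assignments.items()
    }


def speculative_winner(finish_times):
    """Given node -> time (seconds) at which its attempt finished,
    return the node whose result is used: the first to finish."""
    return min(finish_times, key=finish_times.get)


# Hypothetical job with three Map tasks; node1 fails.
assignments = {"map-0": "node1", "map-1": "node2", "map-2": "node1"}
after_failure = reassign_tasks(assignments, "node1", {"node2", "node3"})
print(after_failure)

# node2 is slow on a task, so a second attempt is started on node3.
print(speculative_winner({"node2": 95.0, "node3": 40.0}))
```

Only the tasks that were on the failed node move, which mirrors the point that restarting a task does not require communication with nodes working on other portions of the data.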

Summary In this chapter you have learned

What problems exist with traditional large-scale computing systems

What requirements an alternative approach should have

How Hadoop addresses those requirements

Lesson 2

Hadoop: Concepts and Architecture

At the end of this lesson, you will learn to:

› What Hadoop is all about
› Hadoop Components
› What features the Hadoop Distributed File System (HDFS) provides
› HDFS Architecture
› The concepts behind MapReduce

The Hadoop Project

 Hadoop is an open-source project overseen by the Apache Software Foundation
 Originally based on papers published by Google in 2003 and 2004
 Hadoop committers work at several different organizations
– Including Yahoo!, Facebook, LinkedIn

Hadoop Components

 Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce

 There are many other projects based around core Hadoop
– Often referred to as the ‘Hadoop Ecosystem’
– Pig, Hive, HBase, Flume, Oozie, Sqoop etc
– Many are discussed later in the course

 A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, as many as several thousands
– More nodes = better performance

Hadoop Components: HDFS

 HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster

 Data is split into blocks and distributed across multiple nodes in the cluster
– Each block is typically 64MB or 128MB in size

 Each block is replicated multiple times
– Default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability


The Data File is broken up into 64MB or 128 MB blocks

The Data Blocks are replicated 3 times and scattered amongst the workers
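The splitting and three-fold replication described above can be modelled with a short sketch. The round-robin placement used here is a simplification chosen for the example; the actual HDFS placement policy is more involved (for instance, it takes racks into account). The worker names and the 300MB file are hypothetical.

```python
import math


def place_replicas(num_blocks, nodes, replication=3):
    """Place each block on `replication` different nodes, round-robin.

    Returns a dict of block index -> list of nodes holding a replica.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


file_size_mb = 300   # hypothetical file
block_size_mb = 128  # one of the typical block sizes from the text
workers = ["w1", "w2", "w3", "w4", "w5"]

blocks = math.ceil(file_size_mb / block_size_mb)
placement = place_replicas(blocks, workers)
raw_storage_mb = file_size_mb * 3  # every byte is stored three times

print(blocks)
print(placement)
print(raw_storage_mb)
```

Because each block's replicas land on three distinct workers, losing any single worker still leaves two copies of every block it held.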

Hadoop Components: MapReduce

 MapReduce is the system used to process data in the Hadoop cluster

 Consists of two phases: Map, and then Reduce
– Between the two is a stage known as the shuffle and sort

 Each Map task operates on a discrete portion of the overall dataset
– Typically one HDFS block of data

 After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
– Much more on this later!
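The three stages named above (Map, shuffle and sort, Reduce) can be imitated on a single machine with plain Python. This is a conceptual sketch of the data flow using a word count, not Hadoop's Java API; the function names and the two sample "blocks" are made up for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    for word in block.split():
        yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, then order by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


# Each string stands in for one HDFS block handled by its own Map task.
blocks = ["the cat sat", "the dog sat down"]

intermediate = [pair for block in blocks for pair in map_phase(block)]
result = dict(reduce_phase(key, values) for key, values in shuffle_and_sort(intermediate))
print(result)
```

Each Map call sees only its own block, and each Reduce call sees only the values for one key. That independence is what lets the real system run the tasks on different nodes.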

HDFS Basic Concepts

 HDFS is a filesystem written in Java
– Based on Google’s GFS

 Sits on top of a native filesystem
– Such as ext3, ext4 or xfs

 Provides redundant storage for massive amounts of data
– Using ‘commodity’ (relatively low-cost) hardware


HDFS Basic Concepts (Cont’d)

HDFS performs best with a ‘modest’ number of large files
– Millions, rather than billions, of files
– Each file typically 100MB or more

Files in HDFS are ‘write once’
– No random writes to files are allowed

HDFS is optimized for large, streaming reads of files
– Rather than random reads

How Files Are Stored

Files are split into blocks
– Each block is usually 64MB or 128MB

Data is distributed across many machines at load time
– Different blocks from the same file will be stored on different machines
– This provides for efficient MapReduce processing (see later)

Blocks are replicated across multiple machines, known as DataNodes
– Default replication is three-fold
– Meaning that each block exists on three different machines

A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located
– Known as the metadata

How Files Are Stored. Example

NameNode holds metadata for the two files (Foo.txt and Bar.txt)

DataNodes hold the actual blocks
– Each block will be 64MB or 128MB in size
– Each block is replicated three times on the cluster

More On The HDFS NameNode

The NameNode daemon must be running at all times
– If the NameNode stops, the cluster becomes inaccessible
– Your system administrator will take care to ensure that the NameNode hardware is reliable!

The NameNode holds all of its metadata in RAM for fast access
– It keeps a record of changes on disk for crash recovery

A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode
– Be careful: The Secondary NameNode is not a backup NameNode!

CDH4 introduces NameNode High Availability
– With High Availability enabled, the NameNode is no longer a single point of failure
– Features an Active and a Standby NameNode

HDFS: Points To Note

Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full 64MB/128MB will not be used

Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop’s configuration files
– This will be set by the system administrator

Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster

When a client application wants to read a file:
– It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
– It then communicates directly with the DataNodes to read the data
– Because the file data itself never passes through the NameNode, the NameNode will not be a bottleneck
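The read path described above (ask the NameNode where the blocks are, then fetch the bytes from the DataNodes) can be sketched in a few lines. This is an in-memory illustration only, not the HDFS client API: the classes `NameNode` and `DataNode`, the function `read_file` and the block ids such as `blk_1` are hypothetical names chosen for this example.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where each block lives."""

    def __init__(self):
        self.file_to_blocks = {}    # file name -> ordered list of block ids
        self.block_to_nodes = {}    # block id -> names of DataNodes holding a replica

    def add_file(self, name, block_locations):
        self.file_to_blocks[name] = list(block_locations)
        self.block_to_nodes.update(block_locations)

    def locate(self, name):
        """Return [(block_id, [datanode names])] for a file, in block order."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[name]]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self):
        self.blocks = {}            # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(name, namenode, datanodes):
    """Client read: get locations from the NameNode, then read bytes from DataNodes."""
    data = b""
    for block_id, holders in namenode.locate(name):
        # Use the first listed replica that actually has the block.
        for node_name in holders:
            node = datanodes[node_name]
            if block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no replica available for {block_id}")
    return data


if __name__ == "__main__":
    datanodes = {n: DataNode() for n in ("dn1", "dn2", "dn3")}
    namenode = NameNode()
    layout = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]}
    contents = {"blk_1": b"Hello, ", "blk_2": b"HDFS"}
    for block_id, holders in layout.items():
        for holder in holders:
            datanodes[holder].blocks[block_id] = contents[block_id]
    namenode.add_file("Foo.txt", layout)
    print(read_file("Foo.txt", namenode, datanodes))
```

Note that `read_file` touches the NameNode once per file and never routes block contents through it, which is the reason given above for the NameNode not becoming a bottleneck.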

Accessing HDFS

Applications can read and write HDFS files directly via the Java API
– Covered later in the course

Typically, files are created on a local filesystem and must be moved into HDFS

Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem

Access to HDFS from the command line is achieved with the hadoop fs command
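As a hedged illustration of the hadoop fs command, the sketch below builds a handful of common invocations (list, make a directory, copy in, print, copy out, remove). The file and directory names are hypothetical placeholders, and the helper names `hadoop_fs` and `run_all` are invented for this example. By default it only prints the commands; it executes them only if asked to and if a `hadoop` binary is found on the PATH.

```python
import shutil
import subprocess


def hadoop_fs(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]


# Relative HDFS paths resolve against the user's HDFS home directory.
# All file and directory names here are hypothetical.
COMMANDS = [
    hadoop_fs("-ls", "/"),                             # list the HDFS root directory
    hadoop_fs("-mkdir", "input"),                      # create a directory in HDFS
    hadoop_fs("-put", "local.txt", "input/"),          # copy local file into HDFS
    hadoop_fs("-cat", "input/local.txt"),              # print an HDFS file
    hadoop_fs("-get", "input/local.txt", "copy.txt"),  # copy HDFS file to local disk
    hadoop_fs("-rm", "input/local.txt"),               # delete an HDFS file
]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False and hadoop is installed."""
    have_hadoop = shutil.which("hadoop") is not None
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run and have_hadoop:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(COMMANDS)
```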

hadoop fs Examples

hadoop fs Examples (cont’d)

hadoop fs Examples (cont’d)

Hands-On Exercise: Using HDFS

Aside: The Training Virtual Machine

During this course, you will perform numerous Hands-On Exercises using the Training Virtual Machine (VM)

The VM has Hadoop installed in pseudo-distributed mode
– This essentially means that it is a cluster comprised of a single node
– Using a pseudo-distributed cluster is the typical way to test your code before you run it on your full cluster
– It operates almost exactly like a ‘real’ cluster
– A key difference is that the data replication factor is set to 1, not 3

Hands-On Exercise: Using HDFS

In this Hands-On Exercise you will gain familiarity with manipulating files in HDFS

Please refer to the Hands-On Exercise Manual

What is MapReduce

MapReduce is a method for distributing a task across multiple nodes

Each node processes data stored on that node
– Where possible

Consists of two phases:
– Map
– Reduce

Features of MapReduce

Automatic parallelization and distribution

Fault-tolerance

Status and monitoring tools

A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in any language using Hadoop Streaming
– All of Hadoop is written in Java

MapReduce abstracts all the ‘housekeeping’ away from the developer
– Developer can concentrate simply on writing the Map and Reduce functions

Giant Data: MapReduce and Hadoop

In 2010, Facebook sat on top of a mountain of data; just one year later it had grown from 21 to 30 petabytes. If you were to store all of this data on 1TB hard disks and stack them on top of one another, you would have a tower twice as high as the Empire State Building in New York.

Enterprises like Google and Facebook use the map–reduce approach to process petabyte-range volumes of data. For some analyses, it is an attractive alternative to SQL databases, and Apache Hadoop exists as an open source implementation.

MapReduce: Automatically Distributed

Processing and analyzing such data need to take place in a distributed process on multiple machines. However, this kind of processing has always been very complex, and much time is spent solving recurring problems, like processing in parallel, distributing data to the compute nodes, and, in particular, handling errors during processing. To free developers from these repetitive tasks, Google introduced the MapReduce framework.

MapReduce Framework

The MapReduce framework breaks down data processing into map, shuffle, and reduce phases. Processing is mainly in parallel on multiple compute nodes.

MapReduce: Map Phase

The Map Phase

The Shuffle Phase

The Reduce Phase

MapReduce Programming Example: Search Engine

A web search engine is a good example of the use of MapReduce. A set of MapReduce programs can be used to implement the PageRank algorithm, which Google uses to evaluate the relevance of a page on the web.

Schematic process of a map–reduce computation

Map Method:

Reduce Method:

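The use of a combiner

The use of a combiner makes sense for arithmetic operations in particular, such as sums or counts, where partial results computed on each Map node can be merged later without changing the final answer.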
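To make the benefit of a combiner concrete, the sketch below counts words with and without a Map-side combining step and reports how many intermediate pairs would have to be shuffled in each case. It is a plain Python simulation under stated assumptions, not Hadoop code; the function names (`map_words`, `combine`, `reduce_counts`, `run`) are invented for this example.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per key on the Map side, before the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_counts(key, values):
    """Reducer: sum all partial counts for one key."""
    return key, sum(values)


def run(splits, use_combiner):
    """Return (final counts, number of intermediate pairs shuffled)."""
    shuffled = []
    for lines in splits:                      # one list of lines per Map task
        pairs = [p for line in lines for p in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    result = dict(reduce_counts(k, v) for k, v in sorted(grouped.items()))
    return result, len(shuffled)


if __name__ == "__main__":
    splits = [["a a b"], ["a b b b"]]
    print(run(splits, use_combiner=False))
    print(run(splits, use_combiner=True))
```

Both runs produce the same totals, because addition can be applied to partial sums in any grouping; only the volume of intermediate data differs.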

MapReduce: The Big Picture

The Five Hadoop Daemons

Hadoop is comprised of five separate daemons

NameNode — Holds the metadata for HDFS

Secondary NameNode — Performs housekeeping functions for the NameNode — Is not a backup or hot standby for the NameNode

DataNode — Stores actual HDFS data blocks

JobTracker — Manages MapReduce jobs, distributes individual tasks to machines running the…

TaskTracker — Instantiates and monitors individual Map and Reduce tasks

The Five Hadoop Daemons (cont’d)

Each daemon runs in its own Java Virtual Machine (JVM)

No node on a real cluster will run all five daemons
– Although this is technically possible

We can consider nodes to be in two different categories:
– Master Nodes
• Run the NameNode, Secondary NameNode, JobTracker daemons
• Only one of each of these daemons runs on the cluster
– Slave Nodes
• Run the DataNode and TaskTracker daemons
• A slave node will run both of these daemons

Basic Cluster Configuration

Basic Cluster Configuration (Cont’d)

On very small clusters, the NameNode, JobTracker and Secondary NameNode can all reside on a single machine
– It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes

Each dotted box on the previous diagram represents a separate Java Virtual Machine (JVM)

Submitting A Job

When a client submits a job, its configuration information is packaged into an XML file

This file, along with the .jar file containing the actual program code, is handed to the JobTracker
– The JobTracker then parcels out individual tasks to TaskTracker nodes
– When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task
– TaskTracker nodes can be configured to run multiple tasks at the same time, if the node has enough processing power and memory

MapReduce: The JobTracker

MapReduce jobs are controlled by a software daemon known as the JobTracker

The JobTracker resides on a ‘master node’
– Clients submit MapReduce jobs to the JobTracker
– The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
– These nodes each run a software daemon known as the TaskTracker
– The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker

MapReduce: Terminology

A job is a ‘full program’
– A complete execution of Mappers and Reducers over a dataset

A task is the execution of a single Mapper or Reducer over a slice of data

A task attempt is a particular instance of an attempt to execute a task
– There will be at least as many task attempts as there are tasks
– If a task attempt fails, another will be started by the JobTracker
– Speculative execution (see later) can also result in more task attempts than completed tasks

MapReduce: The Mapper

MapReduce: The Mapper (cont’d)

The Mapper may use or completely ignore the input key
– For example, a standard pattern is to read a line of a file at a time
– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant

If the Mapper writes anything out, the output must be in the form of key/value pairs

Example Mapper: Upper Case Mapper

Example Mapper: Explode Mapper

Example Mapper: Filter Mapper

Example Mapper: Changing Keyspaces
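The bodies of the four example Mapper slides named in the headings above did not survive in this text, so the following is only a plausible sketch of what each name suggests, written as plain Python generators rather than Hadoop Mapper classes. The specific behaviours (upper-casing both key and value, exploding a value into characters, keeping even numbers, re-keying by value length) are assumptions chosen for illustration.

```python
def upper_case_mapper(key, value):
    """Emit the pair with key and value converted to upper case."""
    yield key.upper(), value.upper()


def explode_mapper(key, value):
    """Emit one pair per character of the value."""
    for ch in value:
        yield key, ch


def filter_mapper(key, value):
    """Emit the pair only if the value passes a test (here: an even integer)."""
    if value % 2 == 0:
        yield key, value


def change_keyspace_mapper(key, value):
    """Emit a pair whose key comes from a different domain (here: the value's length)."""
    yield len(value), value


def run_mapper(mapper, records):
    """Apply a mapper to every (key, value) record and collect all emitted pairs."""
    return [pair for key, value in records for pair in mapper(key, value)]


if __name__ == "__main__":
    print(run_mapper(upper_case_mapper, [("foo", "bar")]))
    print(run_mapper(explode_mapper, [("foo", "bar")]))
    print(run_mapper(filter_mapper, [("a", 2), ("b", 3)]))
    print(run_mapper(change_keyspace_mapper, [("foo", "hello")]))
```

Each mapper emits zero, one or many key/value pairs per input record, which is the only contract the Mapper has to honour.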

MapReduce: The Reducer

After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list

This list is given to a Reducer
– There may be a single Reducer, or multiple Reducers
– This is specified as part of the job configuration (see later)
– All values associated with a particular intermediate key are guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
– This step is known as the ‘shuffle and sort’

The Reducer outputs zero or more final key/value pairs
– These are written to HDFS
– In practice, the Reducer usually emits a single key/value pair for each input key
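One way to see why all values for a key can be guaranteed to reach the same Reducer is to choose the Reducer from the key alone. The sketch below does this with a hash of the key modulo the number of Reducers. The names `partition` and `route` and the use of CRC32 are assumptions for illustration; Hadoop's own default partitioner is also hash-based, but this is not its code.

```python
import zlib


def partition(key, num_reducers):
    """Pick a Reducer index from the key alone, so equal keys always land together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per Reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    pairs = [("cat", 1), ("mat", 1), ("cat", 1), ("the", 1), ("the", 1)]
    for index, bucket in enumerate(route(pairs, 3)):
        print(index, bucket)
```

CRC32 is used here instead of Python's built-in `hash` because string hashes in Python vary between runs, which would make the routing non-repeatable.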

Example Reducer: Sum Reducer

Example Reducer: Identity Reducer
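The two example Reducers named in the headings above can be sketched as follows. The slide bodies are missing from this text, so the behaviour shown is inferred from the names: a sum reducer adds up the values for a key, and an identity reducer passes every value through unchanged. The helper `run_reducer` is a hypothetical driver that feeds keys in sorted order.

```python
def sum_reducer(key, values):
    """Emit a single pair: the key and the sum of its values."""
    yield key, sum(values)


def identity_reducer(key, values):
    """Emit every incoming value unchanged, paired with its key."""
    for value in values:
        yield key, value


def run_reducer(reducer, grouped):
    """Feed (key, [values]) groups to a reducer in sorted key order."""
    return [pair for key in sorted(grouped) for pair in reducer(key, grouped[key])]


if __name__ == "__main__":
    grouped = {"b": [1, 2], "a": [3]}
    print(run_reducer(sum_reducer, grouped))
    print(run_reducer(identity_reducer, grouped))
```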

MapReduce Example: Word Count

MapReduce Example: Word Count (Cont’d)

MapReduce Example: Word Count (Cont’d)
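Since the Word Count slides themselves are not reproduced in this text, here is a minimal end-to-end sketch of the idea in plain Python. It assumes the line-at-a-time input pattern described earlier (key is the byte offset, value is the line), and the function names are invented for this example. It simulates the Map, shuffle and sort, and Reduce steps in a single process and is not Hadoop code.

```python
from itertools import groupby
from operator import itemgetter


def records_from_text(text):
    """Yield (byte offset, line) records, mirroring the line-at-a-time input pattern."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        offset += len(line.encode("utf-8"))


def word_count_mapper(offset, line):
    """Ignore the offset key; emit (word, 1) for each word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and present the keys in sorted order."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def word_count_reducer(word, counts):
    """Emit one (word, total) pair per input key."""
    yield word, sum(counts)


def word_count(text):
    """Run Map, then shuffle and sort, then Reduce over the whole text."""
    intermediate = [p for k, v in records_from_text(text)
                    for p in word_count_mapper(k, v)]
    return [out for key, values in shuffle_and_sort(intermediate)
            for out in word_count_reducer(key, values)]


if __name__ == "__main__":
    print(word_count("the cat sat\nthe mat\n"))
```

The Mapper never looks at the offset key, and the Reducer emits exactly one pair per input key, matching the typical patterns described for Mappers and Reducers.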

MapReduce: Data Locality

Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS

If this is not possible, the Map task will have to transfer the data across the network as it processes that data

Once the Map tasks have finished, data is then transferred across the network to the Reducers
– Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers
– All Mappers will, in general, have to communicate with all Reducers
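The preference for local data in Map task placement can be illustrated with a small greedy assignment. This is a simplified sketch under stated assumptions, not the JobTracker's actual scheduling algorithm: the function `assign_map_tasks`, its inputs and its first-fit fallback are all hypothetical.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign each block's Map task to a node, preferring nodes that hold the block.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of Map tasks it can still accept
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has a free slot: the data must cross the network.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
    print(assign_map_tasks(blocks, {"n1": 1, "n2": 1, "n3": 1}))
```

In the example, only the first block gets a local task; the other two blocks live only on a node whose slot is already taken, so their tasks run elsewhere and read the data remotely.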

MapReduce: Is Shuffle and Sort a Bottleneck?

It appears that the shuffle and sort phase is a bottleneck
– The reduce method in the Reducers cannot start until all Mappers have finished

In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work
– This avoids a huge amount of data transfer starting only when the last Mapper finishes
– Note that this behavior is configurable
– The developer can specify the percentage of Mappers which should finish before Reducers start retrieving data
– The developer’s reduce method still does not start until all intermediate data has been transferred and sorted

MapReduce: Is a Slow Mapper a Bottleneck?

It is possible for one Map task to run more slowly than the others
– Perhaps due to faulty hardware, or just a very slow machine

It would appear that this would create a bottleneck
– The reduce method in the Reducer cannot start until every Mapper has finished

Hadoop uses speculative execution to mitigate this
– If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
– The results of the first Mapper to finish will be used
– Hadoop will kill off the Mapper which is still running
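The effect of speculative execution on the end of the Map phase can be estimated with a toy timing model. Everything here is an assumption made for illustration: a task counts as slow once it has run longer than `threshold` times the median duration, a backup attempt starts at that moment, and the backup is assumed to take the median duration. Hadoop's real heuristics differ.

```python
from statistics import median


def finish_times(durations, threshold=1.5):
    """Return {task: (finish_time, winner)} with speculative backups for stragglers.

    durations: task -> seconds the original attempt would need
    """
    typical = median(durations.values())
    launch_at = threshold * typical          # when a straggler's backup would start
    backup_finish = launch_at + typical      # assumed finish time of that backup
    result = {}
    for task, original in durations.items():
        if original > launch_at and backup_finish < original:
            result[task] = (backup_finish, "backup")   # original attempt is killed
        else:
            result[task] = (original, "original")      # no backup, or backup is killed
    return result


if __name__ == "__main__":
    times = finish_times({"m1": 10, "m2": 12, "m3": 100})
    print(times)
    print("Map phase ends at", max(t for t, _ in times.values()))
```

With these numbers the slow Mapper would have held the Map phase open for 100 seconds; the backup attempt finishes at 30 seconds instead, and the Reducers' reduce method can start that much sooner.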

Hands-On Exercise: Running A MapReduce Job

In this Hands-On Exercise, you will run a MapReduce job on your pseudo-distributed Hadoop cluster

Please refer to the Hands-On Exercise Manual

Summary

In this chapter you have learned:

What Hadoop is all about

What the components of Hadoop are

The concepts and detailed architecture of HDFS

What features the Hadoop Distributed File System (HDFS) provides

The concepts behind MapReduce

A few illustrations of MapReduce and how it works in practice

Lesson 3

Hadoop and the Data Warehouse: When and Where to use which

At the end of this lesson, you will be able to answer the question: when should I use Hadoop, and when should I put the data into a data warehouse?

Hadoop Differentiators

Data Warehouse Differentiators

When and where to use which?

Hadoop and the Data Warehouse

Figure 1. Before: Data flow of meter reading done manually

Figure 2. After: Meter reading every 5 or 60 minutes via smart meters

Hadoop Differentiators

Hadoop is the repository and refinery for raw data.

Hadoop is a powerful, economical and active archive.

Data Warehouse Differentiators

Data warehouse performance

Integrated data that provides business value

Interactive BI tools for end users

When and Where to Use Which

While there are certain use cases that are distinct to Hadoop or the data warehouse, there is also overlap where either technology could be effective. The following table is a good starting place for helping to decide which platform to use based on your requirements.

Summary

In this chapter you have:

Explored the question: when should I use Hadoop, and when should I put the data into a data warehouse?

Learned the Hadoop and data warehouse differentiators

Discussed when and where to use which

Lesson 4

Introducing Hadoop Eco system components

At the end of this lesson, you will have:

A quick overview of a few key Hadoop ecosystem projects, such as Hive, Pig, Flume, Sqoop, Oozie and HBase. Each one is covered in detail, with a demo and hands-on exercises, in a separate module.

Other Ecosystem Projects: Introduction

The term ‘Hadoop core’ refers to HDFS and MapReduce

Many other projects exist which use Hadoop core
– Either both HDFS and MapReduce, or just HDFS

Most are Apache projects or Apache Incubator projects
– Some others are not hosted by the Apache Software Foundation
– These are often hosted on GitHub or a similar repository

We will investigate many of these projects later in the course

Following is an introduction to some of the most significant projects


Hive

Hive is an abstraction on top of MapReduce

Allows users to query data in the Hadoop cluster without knowing Java or MapReduce

Uses the HiveQL language
– Very similar to SQL

The Hive Interpreter runs on a client machine
– Turns HiveQL queries into MapReduce jobs
– Submits those jobs to the cluster

Note: this does not turn the cluster into a relational database server
– It is still simply running MapReduce jobs
– Those jobs are created by the Hive Interpreter

Hive (cont’d)


Pig

Pig is an alternative abstraction on top of MapReduce

Uses a dataflow scripting language
– Called PigLatin

The Pig interpreter runs on the client machine
– Takes the PigLatin script and turns it into a series of MapReduce jobs
– Submits those jobs to the cluster

As with Hive, nothing ‘magical’ happens on the cluster
– It is still simply running MapReduce jobs

Pig (Cont’d)


Flume

Flume provides a method to import data into HDFS as it is generated
– Rather than batch-processing the data later
– For example, log files from a Web server

A high level diagram


Sqoop

Sqoop provides a method to import data from tables in a relational database into HDFS
– Does this very efficiently via a Map-only MapReduce job
– Can also ‘go the other way’: populate database tables from files in HDFS

We will investigate Sqoop later in the course


Oozie

Oozie allows developers to create a workflow of MapReduce jobs
– Including dependencies between jobs

The Oozie server submits the jobs to the cluster in the correct sequence

We will investigate Oozie later in the course
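The idea of submitting jobs in an order that respects their dependencies can be sketched with a topological sort. This is not Oozie's workflow format or API; the job names and the `submission_order` helper are hypothetical, and the sketch relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def submission_order(dependencies):
    """Return job names in an order that respects every dependency.

    dependencies: job -> set of jobs that must finish first
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    workflow = {
        "clean": {"import"},                 # clean runs after import
        "aggregate": {"clean"},              # aggregate runs after clean
        "report": {"aggregate", "clean"},    # report needs both
    }
    for job in submission_order(workflow):
        print("submit", job)
```

If the dependencies contain a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which corresponds to a workflow that could never be scheduled.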


HBase

HBase is ‘the Hadoop database’

A ‘NoSQL’ datastore

Can store massive amounts of data
– Gigabytes, terabytes, and even petabytes of data in a table

Scales to provide very high write throughput
– Hundreds of thousands of inserts per second

Copes well with sparse data
– Tables can have many thousands of columns
– Even if most columns are empty for any given row

Has a very constrained access model
– Insert a row, retrieve a row, do a full or partial table scan
– Only one column (the ‘row key’) is indexed
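The constrained access model described above (insert a row, retrieve a row, scan a range, with only the row key indexed) can be mimicked by a small in-memory structure. This is an illustration only and bears no relation to the HBase client API; the class `TinyTable` and its method names are hypothetical.

```python
import bisect


class TinyTable:
    """In-memory stand-in for a table indexed only by row key."""

    def __init__(self):
        self._keys = []     # row keys, kept sorted
        self._rows = {}     # row key -> {column name: value}; empty columns are simply absent

    def put(self, row_key, columns):
        """Insert a row, or add columns to an existing row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        """Retrieve one row by its key, or None if it does not exist."""
        return self._rows.get(row_key)

    def scan(self, start=None, stop=None):
        """Yield (row_key, columns) for start <= row_key < stop, in key order."""
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


if __name__ == "__main__":
    table = TinyTable()
    table.put("user2", {"name": "Bo"})
    table.put("user1", {"name": "Al", "email": "al@example.com"})
    table.put("user3", {})
    print(table.get("user1"))
    print(list(table.scan("user1", "user3")))
```

Because rows store only the columns they actually have, sparse data costs nothing extra, and because there is no index other than the sorted row keys, any lookup by another column would require a full scan.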

HBase vs Traditional RDBMSs

Summary

In this chapter you have learned about different Hadoop ecosystem projects, namely:
– Hive
– Pig
– Sqoop
– Flume
– Oozie
– HBase (a Hadoop datastore)
