
Big Data & Hadoop Architecture and Development

Raghavan Solium Big Data Consultant [email protected]

Day - 1

• Understanding Big Data
  • What is Big Data
  • Challenges with Big Data
  • Why not RDBMS / EDW?
  • Distributed Computing & MapReduce Model

• What is Apache Hadoop
  • Hadoop & its eco system
  • Components of Hadoop (Architecture)
  • Hadoop deployment modes
  • Install & Configure Hadoop
  • Hands on with Standalone mode

Day - 1

• HDFS – The Hadoop DFS
  • Building Blocks
  • Name Node & Data Node
  • Starting HDFS Services
  • HDFS Commands
  • Hands on
    • Configure HDFS
    • Start & Examine the daemons
    • Export & Import files into HDFS

• Map Reduce Anatomy
  • MapReduce Workflow
  • Job Tracker & Task Tracker
  • Starting MapReduce Services
  • Hands on
    • Configure MapReduce
    • Start & Examine the daemons

Day - 2

• MapReduce Programming
  • Java API
  • Data Types
  • Input & Output Formats
  • Hands on

• Advanced Topics
  • Combiner
  • Partitioner
  • Counters
  • Compression, Speculative Execution, Zero & One Reducer
  • Distributed Cache
  • Job Chaining
  • HDFS Federation
  • HDFS HA
  • Hadoop Cluster Administration

Day - 3

• Pig
  • What is Pig Latin?
  • Pig Architecture
  • Install & Configure Pig
  • Data Types & Common Query algorithms
  • Hands On

• Hive
  • What is Hive?
  • Hive Architecture
  • Install & Configure Hive
  • Hive Data Models
    • Hive Metastore
    • Partitioning and Bucketing
  • Hands On

Day - 4

• Sqoop
  • What is Sqoop
  • Install & Configure Sqoop
  • Import & Export
  • Hands On

• Introduction to Amazon Cloud
  • What is AWS
  • EC2, S3
  • How to leverage AWS for Hadoop

Day - 4

• Hadoop Administration
  • HDFS Persistent Data Structure
  • HDFS Safe Mode
  • HDFS File system Check
  • HDFS Block Scanner
  • HDFS Balancer
  • Logging
  • Hadoop Routine Maintenance
  • Commissioning & Decommissioning of nodes
  • Cluster Machine considerations
  • Network Topology
  • Security

What is the "BIG" hype about Big Data?
• Maybe there is some hype, but the problems are big, real and of big value. How?
• We are in the age of advanced analytics (that is where the problem lies: we want to analyze the data), where valuable business insight is mined out of historical data
• We also live in the age of crazy data, where individuals, enterprises, and machines leave so much data behind that it sums up to many Terabytes, often Petabytes, and it is only expected to grow
• Good news. Blessing in disguise. More data means better precision
  • More data usually beats better algorithms
• But how are we going to analyze it?
  • Traditional database or warehouse systems crawl or crack at these volumes
  • They are inflexible to handle most of these formats
  • This is the very characteristic of Big Data

• Nature of Big Data
  • Huge volumes of data that cannot be handled by traditional database or warehouse systems; it is mostly machine produced, most of it is unstructured and it grows at high velocity

Let’s Define

Variety
• Sensor data
• Machine logs
• Social media data
• Scientific data
• RFID readers
• Sensor networks
• Vehicle GPS traces
• Retail transactions

Volume
• The New York Stock Exchange has several petabytes of data for analysis
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage
• At the end of 2010 the Large Hadron Collider near Geneva, Switzerland has about 150 petabytes of data

Velocity
• The New York Stock Exchange generates about one terabyte of new trade data every day
• The Large Hadron Collider produces about 15 petabytes of data per year
• Weather sensors collect data every hour at many locations across the globe and gather a large volume of log data

Inflection Points
• Data Storage
  • Big Data ranges from several Terabytes to Petabytes
  • At these volumes the access speed of the storage devices dominates the overall analysis time
  • A Terabyte of data takes about 2.5 to 3 hours to be read from a drive with 100 MB/s throughput (1 TB / 100 MB/s = 10,000 seconds ≈ 2.8 hours)
  • Writing will be even slower

• Is "divide the data and rule" a solution here?
  • Have multiple disk drives, split your data file into small enough pieces across the drives and do parallel reads and processing
  • Hardware reliability (failure of any drive) is a challenge
  • Resolving data interdependency between drives is a notorious challenge
  • The number of disk drives that can be added to a server is limited

• Analysis
  • Much of Big Data is unstructured; traditional RDBMS / EDW cannot handle it
  • A lot of Big Data analysis is ad hoc in nature and involves whole data scans, self-referencing, joining, combining etc.
  • Traditional RDBMS / EDW cannot handle these with their limited scalability options and architectural limitations
  • You can buy better servers and processors and throw in more RAM, but there is a limit to it

Inflection Points
• We need a drastically different approach
  • A distributed file system with high capacity and high reliability
  • A processing engine that can handle structured / unstructured data
  • A computation model that can operate on distributed data and abstracts away data dispersion
  • PRAM and MapReduce are such models

Let us see what MapReduce is

10

What is MapReduce Model

[Diagram: an input file is divided into splits of key/value pairs (K1, V1). Each split is processed by a Map task on a cluster machine (Computer 1..3); the maps produce intermediate key/value pairs (K2, V2), which are sorted and sent to Reduce tasks running on the cluster. The reducers write the output files Part 1 and Part 2 as key/value pairs (K3, V3).]

What is MapReduce Model
• MapReduce is a computation model that supports parallel processing of distributed data using a cluster of computers.
• The MapReduce model expects the input data to be split and distributed to the machines of the cluster so that each split can be processed independently and in parallel.
• There are two stages of processing in the MapReduce model to achieve the final result: 'Map' and 'Reduce'. Every computer in the cluster can run independent map and reduce processes.
• Map processes the input splits. The output of map is distributed again to the reduce processes, which combine the map output to give the final expected result.
• The model treats data at every stage as Key/Value pairs, transforming one set of Key/Value pairs into a different set of Key/Value pairs to arrive at the end result.
• Map processes transform input key/value pairs into intermediate key/value pairs. The MapReduce framework passes this output to reduce processes, which transform it to produce the final result, again in the form of key/value pairs.

MapReduce Model
• A MapReduce implementation should have
  • The ability to initiate and monitor parallel processes and coordinate between them
  • A mechanism to pass map outputs with the same key to a single reduce process
  • The ability to recover from failures transparently

13

Big Data Universe Evolving and expanding………..

14

So what's going to happen to our good friend RDBMS?
• We don't know! As of now it looks like they are going to coexist
• Hadoop is a batch-oriented analysis system. It is not suitable for low-latency data operations
• MapReduce systems can output the analysis outcome to RDBMS / EDWs for online access and point queries

RDBMS / EDW compared to MapReduce

             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear

(Some of these things are debatable, as the Big Data and Hadoop ecosystems are fast evolving and moving to a higher degree of maturity and flexibility. For example, HBase brings in the ability to do point queries.)

Some Use Cases
• Web/Content Indexing
• Finance & Insurance
  • Fraud detection
  • Sentiment Analysis
• Retail
  • Trend analysis, Personalized promotions
• Scientific simulation & analysis
  • Aeronautics, Particle physics, DNA Analysis
• Machine Learning
• Log Analytics

16

What is Apache Hadoop and how can it help with Big Data?
• It is an open source Apache project for handling Big Data
• It addresses the data storage and analysis (processing) issues through its HDFS file system and its implementation of the MapReduce computation model
• It is designed for massive scalability and reliability
• The model enables leveraging cheap commodity servers, keeping the cost in check

Who Loves it?
• Yahoo! runs 20,000 servers running Hadoop
• The largest Hadoop cluster is 4,000 servers with 16 PB raw storage (Is it Yahoo?)
• Facebook runs 2,000 Hadoop servers
• 24 PB raw storage and 100 TB raw logs / day
• eBay and LinkedIn have production use of Hadoop
• Sears retail uses Hadoop

Hadoop & It’s ecosystem

[Diagram: the Hadoop ecosystem. HDFS (Hadoop Distributed File System) at the base, with the MapReduce framework on top of it. PIG, Hive and Mahout sit above MapReduce; HBase sits alongside them on HDFS. Apache Oozie (Workflow) and ZooKeeper (Coordination Service) span the stack. Sqoop brings in structured data; Flume brings in log files and unstructured data.]

Hadoop & It’s ecosystem Avro: A serialization system for efficient, cross-language RPC and persistent data storage.

MapReduce:

A distributed data processing model and execution environment that runs on large clusters of commodity machines.

HDFS: A distributed file system that runs on large clusters of commodity machines.

Pig:

A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

Hive:

A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data.

HBase:

A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

Sqoop:

A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.

Oozie:

A service for running and scheduling workflows of Hadoop jobs (including Map-Reduce, Pig, Hive, and Sqoop jobs). 19

Hadoop Requirements
• Supported Platforms
  • GNU/Linux is supported as a development and production platform
  • Win32 is supported as a development platform only
    • Cygwin is required for running on Windows

• Required Software
  • Java™ 1.6.x
  • ssh must be installed and sshd must be running (for launching the daemons on the cluster with passwordless entry)

• Development Environment
  • Eclipse 3.5 or above

20

Lab Requirements
• Windows 7 64-bit OS, min 4 GB RAM
• VMWare Player 5.0.0
• Linux VM - Ubuntu 12.04 LTS
  • User: hadoop, Password: hadoop123
• Java 6 installed on the Linux VM
• OpenSSH installed on the Linux VM
• Putty - for opening terminal sessions to the Linux VM
• WinSCP - for transferring files between Windows / VM
• Eclipse 3.5

Hands On
• Using the VM
• Install & Configure hadoop
  • Install & Configure ssh
  • Set up Putty & WinSCP
  • Set up lab directories
  • Install OpenJDK
  • Install & Verify hadoop

22

Starting VM

23

Starting VM

Enter user ID/ Password : hadoop / hadoop123

24

Install & Configure ssh

• Install ssh
>>sudo apt-get install ssh

• Check ssh installation
>>which ssh
>>which sshd
>>which ssh-keygen

• Generate ssh key
>>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

• Copy public key as an authorized key (equivalent to slaves)
>>cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
>>chmod 700 ~/.ssh
>>chmod 600 ~/.ssh/authorized_keys

Verify ssh • Verify SSH by logging into target (localhost here) >>ssh localhost • This command should log you into the machine localhost

26

Accessing the VM with Putty & WinSCP
• Get the IP address of the Linux VM
>>ifconfig

• Use Putty to open a terminal session to the VM
• Use WinSCP to transfer files to the VM

27

Lab – VM Directory Structure • User Home Directory for user “hadoop” (Created default by OS) /home/hadoop

• Working directory for the lab session /home/hadoop/lab

• Downloads directory (installables downloaded and stored under this) /home/hadoop/lab/downloads

• Data directory (sample data is stored under this) /home/hadoop/lab/data

• Create directory for installing the tools /home/hadoop/lab/install

28

Install & Configure Java

• Install OpenJDK
>>sudo apt-get install openjdk-6-jdk

• Check installation
>>java -version

• Configure Java Home in the environment
  • Add a line to .bash_profile to set Java Home
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
  • Hadoop will use this at runtime

Install Hadoop

• Download the Hadoop tarball
  • http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
  • FTP the file to the Linux VM into the ~/lab/downloads folder

• Untar (execute the following commands)
>>cd ~/lab/install
>>tar xvf ~/lab/downloads/hadoop-1.0.3.tar.gz
• Check the extracted directory "hadoop-1.0.3"
>>ls -l hadoop-1.0.3

• Configure environment in .bash_profile
  • Add the below two lines and execute the bash profile
>>export HADOOP_INSTALL=~/lab/install/hadoop-1.0.3
>>export PATH=$PATH:$HADOOP_INSTALL/bin
>>. .bash_profile

30

Run an Example
• Verify the Hadoop installation
>>hadoop version

• Try the following
>>hadoop
• Prints command usage

>>cd $HADOOP_INSTALL
>>hadoop jar hadoop-examples-1.0.3.jar
• Prints the list of example programs in the above jar file

>>hadoop jar hadoop-examples-1.0.3.jar wordcount <input dir> <output dir>

31

Components of Core Hadoop

[Diagram: a Client talks over the network to the Name Node, Job Tracker and Secondary Name Node. A set of Data Nodes hold the data, each co-located with a Task Tracker that runs Map and Reduce tasks.]

(Hadoop supports many file systems other than HDFS itself. However, to leverage Hadoop's abilities completely, HDFS is one of the most reliable file systems.)

Components of Core Hadoop
At a high level, Hadoop's architectural components can be classified into two categories
• Distributed file management system – HDFS
  This has central and distributed sub-components
  • NameNode – centrally monitors and controls the whole file system
  • DataNode – takes care of the local file segments and constantly communicates with the NameNode
  • Secondary NameNode – do not be confused: this is not a NameNode backup. It just backs up the file system status from the NameNode periodically
• Distributed computing system – MapReduce Framework
  This again has central and distributed sub-components
  • Job Tracker – centrally monitors the submitted job and controls all processes running on the nodes (computers) of the cluster. It communicates with the Name Node for file system access
  • Task Tracker – takes care of the local job execution on the local file segment. It talks to the DataNode for file information and constantly communicates with the Job Tracker daemon to report task progress

When the Hadoop system is running in distributed mode, each of these daemons runs on its respective computer

Hadoop Operational Modes

Hadoop can be run in one of three modes
• Standalone (Local) Mode
  • No daemons launched
  • Everything runs in a single JVM
  • Suitable for development

• Pseudo Distributed Mode
  • All daemons are launched on a single machine, thus simulating a cluster environment
  • Suitable for testing & debugging

• Fully Distributed Mode
  • The Hadoop daemons run in a cluster environment
  • Each daemon runs on the machine assigned to it
  • Suitable for Integration Testing / Production

A typical distributed setup runs the Name Node on a separate machine and the Job Tracker & Secondary Name Node on a separate machine. The rest of the machines in the cluster each run a Data Node and a Task Tracker daemon

Hadoop Configuration Files

The configuration files can be found under the "conf" directory

• hadoop-env.sh (Bash script): environment variables that are used in the scripts to run Hadoop
• core-site.xml (Hadoop configuration XML): configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
• hdfs-site.xml (Hadoop configuration XML): configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes
• mapred-site.xml (Hadoop configuration XML): configuration settings for MapReduce daemons: the jobtracker and the tasktrackers
• masters (plain text): list of machines (one per line) that run a secondary namenode
• slaves (plain text): list of machines (one per line) that each run a datanode and a tasktracker
• hadoop-metrics.properties (Java properties): properties for controlling how metrics are published in Hadoop
• log4j.properties (Java properties): properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process

Key Configuration Properties

Property Name        Conf File         Standalone           Pseudo Distributed    Fully Distributed
fs.default.name      core-site.xml     file:/// (default)   hdfs://localhost/     hdfs://namenode/
dfs.replication      hdfs-site.xml     N/A                  1                     3 (default)
mapred.job.tracker   mapred-site.xml   local (default)      localhost:8021        jobtracker:8021

36

HDFS 37

Design of HDFS
• HDFS is Hadoop's Distributed File System
• Designed for storing very large files (petabytes in size)
• A single file can be stored across several disks
• Designed for streaming data access patterns
• Not suitable for low-latency data access
• Designed to be highly fault tolerant, hence can run on commodity hardware

38

HDFS Concepts
• Like any file system, HDFS stores files by breaking them into units called blocks
• The default HDFS block size is 64 MB
• The large block size helps in maintaining high throughput (for example, a 1 GB file is stored as just 16 blocks of 64 MB)
• Each block is replicated across multiple machines in the cluster for redundancy

39

Design of HDFS - Daemons

[Diagram: a Client asks the Name Node for the block information of a file, then reads the data blocks directly from the Data Nodes of the Hadoop cluster. The Secondary Name Node sits alongside the Name Node.]

Design of HDFS - Daemons
The HDFS file system is managed by two daemons
• NameNode & DataNode
• NameNode & DataNode function in master/slave fashion
• The NameNode manages the file system namespace
  • Maintains the file system tree and the metadata of all the files and directories
  • Filesystem image
  • Edit log

• DataNodes store and retrieve the blocks of the files when told to by the NameNode
  • The NameNode maintains the information on which DataNodes the blocks of a given file are located
  • DataNodes report to the NameNode periodically with the list of blocks they are storing
  • With the NameNode down, HDFS is inaccessible

• Secondary NameNode
  • Not a backup for the NameNode
  • Just helps in merging the filesystem image with the edit log, to avoid the edit log becoming too large

Hands On • Configure HDFS file system for hadoop • Format HDFS • Start & Verify HDFS services • Verify HDFS • Stop HDFS services • Change replication

42

Configuring HDFS – core-site.xml (Pseudo Distributed Mode)
• Set JAVA_HOME in conf/hadoop-env.sh
  export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
  • This setting is used on the remote machines as well

• Set up core-site.xml

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost/</value>
    </property>
  </configuration>

  Add the "fs.default.name" property under the configuration tag to specify the NameNode location; "localhost" for Pseudo distributed mode. The Name Node listens on port 8020 by default if no port is specified

43

Starting HDFS
• Format the NameNode
>>hadoop namenode -format
  • Creates an empty file system with storage directories and persistent data structures
  • Data nodes are not involved

• Start dfs services & verify daemons
>>start-dfs.sh
>>jps

• List / Check HDFS
>>hadoop fs -ls
>>hadoop fsck / -files -blocks
>>hadoop fs -mkdir testdir

Verify HDFS
• List / Check HDFS again
>>hadoop fs -ls
>>hadoop fsck / -files -blocks

• Stop dfs services
>>stop-dfs.sh
>>jps
  • No java processes should be running

45

Configuring HDFS - hdfs-site.xml (Pseudo Distributed Mode)

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

Add the "dfs.replication" property under the configuration tag. The value is set to 1 so that no replication is done

Configuring HDFS - hdfs-site.xml (Pseudo Distributed Mode)

• dfs.name.dir
  Directories where the NameNode stores its persistent data (comma-separated directory names). A copy of the metadata is stored in each of the listed directories.
  Default: ${hadoop.tmp.dir}/dfs/name
• dfs.data.dir
  Directories where the DataNode stores blocks. Each block is stored in only one of these directories.
  Default: ${hadoop.tmp.dir}/dfs/data
• fs.checkpoint.dir
  Directories where the Secondary NameNode stores checkpoints. A copy of the checkpoint is stored in each of the listed directories.
  Default: ${hadoop.tmp.dir}/dfs/namesecondary

47

Basic HDFS Commands
• Creating a directory
  • hadoop fs -mkdir <HDFS directory path>

• Removing a directory
  • hadoop fs -rmr <HDFS directory path>

• Copying files to HDFS from the local filesystem
  • hadoop fs -copyFromLocal <local file path> <HDFS path>

• Copying files from HDFS to the local filesystem
  • hadoop fs -copyToLocal <HDFS file path> <local path>

• Listing files and directories
  • hadoop fs -ls <HDFS path>

• Listing the blocks that make up each file in HDFS
  • hadoop fsck / -files -blocks

48
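The same operations can also be done programmatically. A minimal sketch (not part of the original material) using the Hadoop FileSystem Java API; the paths and class name are only illustrative:

// Sketch: copying a local file into HDFS with org.apache.hadoop.fs.FileSystem.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml (fs.default.name)
        FileSystem fs = FileSystem.get(conf);          // handle to the configured file system (HDFS here)

        Path dir = new Path("retail");                 // relative to the user's HDFS home directory
        if (!fs.exists(dir)) {
            fs.mkdirs(dir);                            // equivalent of: hadoop fs -mkdir retail
        }
        // equivalent of: hadoop fs -copyFromLocal /home/hadoop/lab/data/retail/txn.csv retail
        fs.copyFromLocalFile(new Path("/home/hadoop/lab/data/retail/txn.csv"), dir);

        for (FileStatus status : fs.listStatus(dir)) { // equivalent of: hadoop fs -ls retail
            System.out.println(status.getPath() + " " + status.getLen());
        }
        fs.close();
    }
}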

Hands On • Create data directories for • NameNode • Secondary NameNode • DataNode

• Configure the nodes • Format HDFS • Start DFS service and verify daemons • Create directory “retail” in HDFS • Copy files from lab/data/retail directory to HDFS retail directory • Verify the blocks created • Do fsck on HDFS to check the health of HDFS file system 49

Create data directories for HDFS • Create directory for NameNode >>cd ~/lab >>mkdir hdfs >>cd hdfs >>mkdir namenode >>mkdir secondarynamenode >>mkdir datanode >>chmod 755 datanode

50

Configuring data directories for HDFS
• Configure HDFS directories
• Add the following properties in hdfs-site.xml

  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/lab/hdfs/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/lab/hdfs/datanode</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/home/hadoop/lab/hdfs/secondarynamenode</value>
    <final>true</final>
  </property>

HDFS Web UI

• Hadoop provides a web UI for viewing HDFS
• Available at http://<namenode host>:50070/
  • Browse the file system
  • View log files

52

MapReduce

53

MapReduce
• A distributed parallel processing engine of Hadoop
• Processes the data in sequential parallel steps called
  • Map
  • Reduce

• Best run with a DFS supported by Hadoop, to exploit its parallel processing abilities
• Has the ability to run on a cluster of computers
  • Each computer is called a node

• Input/output data at every stage is handled in terms of key/value pairs
  • Keys and values can be chosen by the programmer

• Mapper outputs with the same key are sent to the same reducer
• Input to a reducer is always sorted by key
• The number of mappers and reducers per node can be configured

MapReduce Workflow – Word count

[Diagram: the input file "If you go up and down / The weight go down and / the health go up" is stored as splits on the DFS across three computers. Each map task emits (word, 1) pairs (K2, V2) for its split, e.g. (if,1), (you,1), (go,1), (up,1), (and,1), (down,1). Pairs with the same key are routed to the same reducer; the two reducers sum the counts and produce the final output (K3, V3): and 2, down 2, go 3, if 1, up 2, you 1, the 2, health 1, weight 1.]

Design of MapReduce - Daemons The MapReduce system is managed by two daemons • JobTracker & TaskTracker • JobTracker & TaskTracker function in master/ slave fashion • JobTracker coordinates the entire job execution • TaskTracker runs the individual tasks of map and reduce • JobTracker does the bookkeeping of all the tasks run on the cluster • One map task is created for each input split • Number of reduce tasks is configurable • mapred.reduce.tasks 56

Design of MapReduce - Daemons

[Diagram: a Client submits a job to the Job Tracker over the network. The Job Tracker coordinates the Task Trackers running on the cluster nodes; each Task Tracker runs Map and Reduce task slots and works against HDFS.]

Hands On • Configure MapReduce • Start MapReduce daemons • Verify the daemons • Stop the daemons

58

mapred-site.xml - Pseudo Distributed Mode

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:8021</value>
    </property>
  </configuration>

Add the "mapred.job.tracker" property under the configuration tag to specify the JobTracker location; "localhost:8021" for Pseudo distributed mode.

Starting Hadoop – MapReduce Daemons

• Start MapReduce Services >>start-mapred.sh >>jps

• Stop MapReduce Services >>stop-mapred.sh >>jps

60

MapReduce Programming 61

MapReduce Programming

• Having seen the functioning of MapReduce, to perform a job in hadoop a programmer needs to create
  • A MAP function
  • A REDUCE function
  • A Driver to communicate with the framework, configure and launch the job

[Diagram: the Driver passes the execution parameters to the MapReduce framework in the execution environment; the framework runs the Map tasks and then the Reduce tasks and produces the Output.]

Retail Use case
• Set of transactions in txn.csv
  Columns: Txn ID, Txn Date, Cust ID, Amt, Category, Sub-Cat, Addr-1, Addr-2, Credit/Cash
  00999990,08-19-2011,4003754,147.66,Team Sports,Lacrosse,Bellevue,Washington,credit
  00999991,10-09-2011,4006641,126.19,Water Sports,Surfing,San Antonio,Texas,credit
  00999992,06-09-2011,4005497,097.78,Water Sports,Windsurfing,San Diego,California,credit

• Customer details in custs.csv
  Columns: Cust ID, First Name, Last Name, Age, Profession
  4009983,Jordan,Tate,35,Coach
  4009984,Justin,Melvin,43,Loan officer
  4009985,Rachel,Corbett,66,Human resources assistant

63

Map Function
• The Map function is represented by the Mapper class, which declares a map() method to be overridden
• The Mapper class is a generic type with four type parameters for the input and output key/value pairs
  • Mapper<K1, V1, K2, V2>
  • K1, V1 are the types of the input key/value pair
  • K2, V2 are the types of the output key/value pair

• Hadoop provides its own types that are optimized for network serialization
  • Text          corresponds to Java String
  • LongWritable  corresponds to Java Long
  • IntWritable   corresponds to Java Integer

• The map() method must be implemented to achieve the input key/value transformation
• map() is called by the MapReduce framework, passing the input key/values from the input split
• map() is provided with a context object in its call, to which the transformed key/values can be written

Mapper – Word Count

public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.;:?![]'");
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());
            context.write(word, one);
        }
    }
}

Reduce Function
• The Reduce function is represented by the Reducer class, which declares a reduce() method to be overridden
• The Reducer class is a generic type with four type parameters for the input and output key/value pairs
  • Reducer<K2, V2, K3, V3>
  • K2, V2 are the types of the input key/value pair; these must match the output types of the Mapper
  • K3, V3 are the types of the output key/value pair

• The reduce() method must be implemented to achieve the desired transformation of the input key/values
• reduce() is called by the MapReduce framework, passing the key/values coming out of the map phase
• The MapReduce framework guarantees that the records with the same key from all the map tasks will reach a single reduce task
• Similar to map, the reduce method is provided with a context object to which the transformed key/values can be written

Reducer – Word Count

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

67

Driver – MapReduce Job
• The Job object forms the specification of a job
  Configuration conf = new Configuration();
  Job job = new Job(conf, "Word Count");

• The Job object gives you control over how the job is run
• Set the jar file containing the mapper and reducer for distribution around the cluster
  job.setJarByClass(WordCount.class);

• Set the Mapper and Reducer classes
  job.setMapperClass(TokenizerMapper.class);
  job.setReducerClass(IntSumReducer.class);

• The input/output locations are specified by calling static methods on the FileInputFormat and FileOutputFormat classes, passing the job
  FileInputFormat.addInputPath(job, path);
  FileOutputFormat.setOutputPath(job, path);

• Set the Mapper and Reducer output types
• Set the Input and Output formats
  • The input key/value types are controlled by the Input Format

MapReduce Job – Word Count

public class WordCount {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

69
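The same pattern applies to the retail use case introduced earlier. A minimal sketch (not from the original slides; imports and the enclosing job class are omitted as on the surrounding slides) of a mapper that emits (Category, Amount) pairs from txn.csv; a reducer summing DoubleWritable values per category would complete the job. The field positions follow the txn.csv layout shown on the Retail Use case slide:

// Hypothetical mapper for the retail example: emits (category, amount) for each txn.csv line.
public static class CategoryAmountMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final Text category = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 5) {
            return;                                      // skip malformed lines
        }
        try {
            amount.set(Double.parseDouble(fields[3]));   // Amt column
        } catch (NumberFormatException e) {
            return;                                      // skip records with a bad amount
        }
        category.set(fields[4]);                         // Category column
        context.write(category, amount);
    }
}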

The MapReduce Web UI
• Hadoop provides a web UI for viewing job information
• Available at http://<jobtracker host>:50030/
  • Follow a job's progress while it is running
  • Find job statistics
  • View job logs
  • Task details

70

Combiner
• A Combiner function helps to aggregate the map output before passing it on to the reduce function
  • Reduces the intermediate data to be written to disk
  • Reduces the data to be transferred over the network

• A Combiner is represented by the same interface as a Reducer
• The Combiner for a job is specified as
  job.setCombinerClass(<CombinerClass>.class);

71

Word Count – With Combiner

public class WordCount {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In the case of commutative & associative functions the reducer can work as the combiner. Otherwise a separate combiner needs to be created.

72

Partitioning
• Map tasks partition their output keys by the number of reducers
  • There can be many keys in a partition
  • All records for a given key will be in a single partition

• A Partitioner class controls partitioning based on the key
  • Hadoop uses hash partitioning by default (HashPartitioner)

• The default behavior can be changed by implementing the getPartition() method of the (abstract) Partitioner class

  public abstract class Partitioner<KEY, VALUE> {
      public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  }

• A custom partitioner for a job can be set as
  job.setPartitionerClass(<PartitionerClass>.class);

Partitioner Example

public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String ch = key.toString().substring(0, 1);
        /*if (ch.matches("[abcdefghijklm]")) {
            return 0;
        } else if (ch.matches("[nopqrstuvwxyz]")) {
            return 1;
        }
        return 2;*/
        //return (ch.charAt(0) % numPartitions);  // round robin based on ASCII value
        return 0; // default behavior
    }
}

74

One or Zero Reducers
• The number of reducers is set by the developer
  job.setNumReduceTasks(10);    OR    mapred.reduce.tasks=10

• One Reducer
  • The map output data is not partitioned; all key/values reach the only reducer
  • Only one output file is created
  • The output file is sorted by key
  • A good way of combining files or producing a sorted output for small amounts of data

• Zero Reducers or Map-only
  • The job will have only map tasks
  • Each mapper's output is written into a separate file in HDFS (similar to the multiple-reducers case)
  • Useful in cases where each input split can be processed independently of the other parts
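A small illustrative driver fragment (not from the original slides; it belongs inside a main() like the WordCount driver shown earlier) configuring a map-only job:

// Sketch: a map-only job; with zero reduce tasks the map output is written straight to HDFS.
Configuration conf = new Configuration();
Job job = new Job(conf, "Map Only Example");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setNumReduceTasks(0);                  // zero reducers: no shuffle, no reduce phase
job.setOutputKeyClass(Text.class);         // with no reduce phase these describe the map output
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);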

Data Types

• Hadoop provides its own data types
• Data types implement the Writable interface

  public interface Writable {
      void write(DataOutput out) throws IOException;
      void readFields(DataInput in) throws IOException;
  }

• Optimized for network serialization

• Key data types implement the WritableComparable interface, which enables key comparison

  public interface WritableComparable<T> extends Writable, Comparable<T> {
  }

• Keys are compared with each other during the sorting phase
• The registered RawComparator for the key type is used for the comparison

  public interface RawComparator<T> extends Comparator<T> {
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
  }

76

Data Types
Writable wrapper classes for Java primitives

Java primitive    Writable implementation    Serialized size (bytes)
boolean           BooleanWritable            1
byte              ByteWritable               1
short             ShortWritable              2
int               IntWritable                4
                  VIntWritable               1–5
float             FloatWritable              4
long              LongWritable               8
                  VLongWritable              1–9
double            DoubleWritable             8

NullWritable
• A special writable class with zero-length serialization
• Used as a placeholder for a key/value when you do not need to use that position

Data Types (Custom)
• Custom Data types (Custom Writables)
  • Custom and complex data types can be implemented as needed and used as keys and values
  • Key data types must implement WritableComparable
  • Value data types need to implement at least Writable

• Custom types can implement raw comparators for speed

  public static class CustComparator extends WritableComparator {
      public CustComparator() {
          super(CustomKey.class);          // CustomKey stands for your custom key type
      }
      @Override
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
          ...
      }
  }
  static {
      WritableComparator.define(CustomKey.class, new CustComparator());
  }

• WritableComparator is a general-purpose implementation of RawComparator
• Custom comparators for a job can also be set as
  job.setSortComparatorClass(KeyComparator.class);
  job.setGroupingComparatorClass(GroupComparator.class);

78
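For completeness, a minimal sketch (not from the original slides) of a custom value type implementing Writable; the transaction record from the retail example is used purely as an illustration. A key type would additionally implement WritableComparable and compareTo():

// Illustrative custom Writable holding a category and an amount.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TxnWritable implements Writable {

    private Text category = new Text();
    private DoubleWritable amount = new DoubleWritable();

    public void set(String cat, double amt) {
        category.set(cat);
        amount.set(amt);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        category.write(out);      // serialize the fields in a fixed order
        amount.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        category.readFields(in);  // deserialize in the same order
        amount.readFields(in);
    }

    @Override
    public String toString() {
        return category.toString() + "\t" + amount.get();
    }
}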

Input Formats
• An Input Format determines how the input data is to be interpreted and passed on to the mapper
  • Based on the Input Format, the input data is divided into chunks called splits
  • Each split is processed by a separate map task
  • Each split in turn is divided into records based on the Input Format, and one record is passed with each map call
  • The key and the value of an input record (including their types) are determined by the Input Format

• All Input Formats implement the InputFormat interface
• The Input Format for a job is set as follows
  • job.setInputFormatClass(<InputFormat>.class);

• Two categories of Input Formats
  • File based
  • Non file based

Input Formats

[Diagram: InputFormat class hierarchy]

80

File Input Formats
• FileInputFormat is the base class for all file-based data sources
  • Implements the InputFormat interface
  • FileInputFormat offers static convenience methods for setting a job's input paths
    FileInputFormat.addInputPath(job, path)
  • Each split corresponds to either all or part of a single file, except for CombineFileInputFormat

• File Input Formats
  • Text based
    • TextInputFormat
    • KeyValueTextInputFormat
    • NLineInputFormat
    • CombineFileInputFormat (meant for lots of small files, to avoid too many splits)
  • Binary
    • SequenceFileInputFormat

File Input Formats - TextInputFormat
• Each line is treated as a record
• The key is the byte offset of the line from the beginning of the file (LongWritable)
• The value is the entire line (Text)

Input File
2001220,John ,peter,35,320,1st Main,lombard,NJ,manager
2001221,Karev,Yong,39,450,2nd Main,Hackensack,NJ,Sys Admin
2001223,Vinay,Kumar,27,325,Sunny Brook,2nd Block,Lonely Beach,NY,Web Designer

Input to Mapper
K1 = 0     V1 = "2001220,John ,peter,35,320,1st Main,lombard,NJ,manager"
K2 = 54    V2 = "2001221,Karev,Yong,39,450,2nd Main,Hackensack,NJ,Sys Admin"
K3 = 102   V3 = "2001223,Vinay,Kumar,27,325,Sunny Brook,2nd Block,Lonely Beach,NY,Web Designer"

• TextInputFormat is the default input format if none is specified

File Input Formats - Others
• KeyValueTextInputFormat
  • Splits each line into key/value based on a specified delimiter
  • The key is the part of the record before the first appearance of the delimiter, and the rest is the value
  • The default delimiter is the tab character
  • A different delimiter can be set through the property
    mapreduce.input.keyvaluelinerecordreader.key.value.separator

• NLineInputFormat
  • Each split contains a fixed number of lines
  • The default is one, which can be changed by setting the property
    mapreduce.input.lineinputformat.linespermap

• CombineFileInputFormat
  • A split can consist of multiple files (based on a maximum split size)
  • Typically used for lots of small files
  • This is an abstract class and one needs to extend it to use it

File Input Formats - SequenceFileInputFormat
• Sequence File
  • Provides a persistent data structure for binary key-value pairs
  • Provides sync points in the file at regular intervals, which makes a sequence file splittable
  • The keys/values can be stored compressed or uncompressed
  • Two types of compression
    • Record
    • Block

• SequenceFileInputFormat
  • Enables reading data from a Sequence File
  • Can read MapFiles as well
  • Variants of SequenceFileInputFormat
    • SequenceFileAsTextInputFormat
      • Converts keys and values into Text objects
    • SequenceFileAsBinaryInputFormat
      • Retrieves the keys and values as BytesWritable objects

Non File Input Formats - DBInputFormat • DBInputFormat is an input format to read data from RDBMS through JDBC

85

Output Formats

[Diagram: OutputFormat class hierarchy]

86

Output Formats - Types
• The Output Format for a job is set as
  job.setOutputFormatClass(TextOutputFormat.class);

• File based
  • FileOutputFormat is the base class
  • FileOutputFormat offers a static method for setting the output path
    FileOutputFormat.setOutputPath(job, path);
  • One file per reducer is created (default file name: part-r-nnnnn, where nnnnn is an integer designating the part number, starting from zero)
  • TextOutputFormat
  • SequenceFileOutputFormat
    • SequenceFileAsBinaryOutputFormat
  • MapFileOutputFormat

• NullOutputFormat
• DBOutputFormat
  • Output format to write output data to an RDBMS through JDBC

Lazy Output
• FileOutputFormat subclasses will create output files even if there is no record to write
• LazyOutputFormat can be used to delay output file creation until there is a record to write
  LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  instead of
  job.setOutputFormatClass(TextOutputFormat.class);

88

Unit Testing - MRUnit
• MRUnit is a unit testing library for MapReduce programs
• Mapper and Reducer can be tested independently by passing inputs
• MapDriver has methods to run a mapper, passing input key/values and the expected output key/values
• ReduceDriver has methods to run a reducer, passing input key/values and the expected output key/values

89
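A minimal sketch of an MRUnit test for the word-count mapper and reducer above, assuming the new-API drivers in org.apache.hadoop.mrunit.mapreduce and that TokenizerMapper / IntSumReducer are accessible from the test (class and test names here are illustrative):

// Illustrative MRUnit tests for TokenizerMapper and IntSumReducer.
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        MapDriver.newMapDriver(new TokenizerMapper())
                 .withInput(new LongWritable(0), new Text("go up"))
                 .withOutput(new Text("go"), new IntWritable(1))   // expected outputs, in order
                 .withOutput(new Text("up"), new IntWritable(1))
                 .runTest();
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new IntSumReducer())
                    .withInput(new Text("go"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                    .withOutput(new Text("go"), new IntWritable(2))
                    .runTest();
    }
}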

Counters
• A useful means of
  • Monitoring job progress
  • Gathering statistics
  • Problem diagnosis

• Built-in counters fall into the groups below
  • MapReduce task counters
  • Filesystem counters
  • FileInputFormat counters
  • FileOutputFormat counters
  • Job counters

• Each counter is either a task counter or a job counter
• Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job

User Defined Counters
• Counters are defined in a job by a Java enum
  enum Temperature { MISSING, MALFORMED }

• Counters are set and incremented as
  context.getCounter(Temperature.MISSING).increment(1);

• Dynamic counters
  • Counters can also be set without predefining them as enums
    context.getCounter("groupName", "counterName").increment(1);

• Counters are retrieved as
  Counters cntrs = job.getCounters();
  long total = cntrs.getCounter(Task.Counter.MAP_INPUT_RECORDS);
  long missing = cntrs.getCounter(MaxTemperatureWithCounters.Temperature.MISSING);
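As an illustration (not from the original slides; imports and the enclosing job class omitted as on the surrounding slides), the retail mapper sketched earlier could count malformed lines with a user-defined counter instead of silently skipping them:

// Sketch: counting bad txn.csv records with a user-defined counter.
public static class CategoryAmountMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    enum TxnRecord { MALFORMED }              // user-defined counter

    private final Text category = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        try {
            category.set(fields[4]);                              // Category column
            amount.set(Double.parseDouble(fields[3]));            // Amt column
            context.write(category, amount);
        } catch (Exception e) {                                   // bad field count or bad number
            context.getCounter(TxnRecord.MALFORMED).increment(1); // visible in the job counters
        }
    }
}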

Side Data Distribution • Side data: typically the read only data needed by the job for processing the main dataset • Two methods to make such data available to task trackers • Using Job Configuration • Using Distributed Cache

• Using Job Configuration • Small amount of metadata can be set as key value pairs in the job configuration Configuration conf = new Configuration(); conf.set(“Product”, “Binoculars”); conf.set(“Conversion”, “54.65”);

• The same can be retrieved in the map or reduce tasks Configuration conf = context.getConfiguration(); String product = conf.get(“Product”).trim();

• Effective only for small amounts of data (a few KB); otherwise it will put pressure on the memory of the daemons

Side Data Distribution – Distributed Cache
• A mechanism for copying read-only data files/archives to the task nodes just in time
• Can be used to provide 3rd party jar files
• Hadoop copies these files to the DFS, then the tasktracker copies them to the local disk relative to the task's working directory
• The distributed cache for a job can be set up by calling methods on Job
  job.addCacheFile(new URI("<path to file>"));
  job.addCacheArchive(new URI("<path to archive>"));
  job.addFileToClassPath(new Path("<path to jar>"));

• The files can be retrieved from the distributed cache through methods on JobContext
  Path[] localPaths = context.getLocalCacheFiles();
  Path[] localArchives = context.getLocalCacheArchives();
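A sketch using the API shown on this slide (illustrative; standard Java and Hadoop imports and the enclosing job class are omitted) of a mapper that loads a small lookup file, e.g. custs.csv from the retail example, from the distributed cache in setup(). This is effectively a map-side (replicated) join, which the next slides describe:

// Sketch: building an in-memory customer lookup from a distributed-cache file in setup().
public static class CustLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> custNames = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Path[] cached = context.getLocalCacheFiles();      // files added via addCacheFile()
        if (cached != null && cached.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split(",");               // cust_id, first name, last name, ...
                if (f.length >= 3) {
                    custNames.put(f[0], f[1] + " " + f[2]);
                }
            }
            reader.close();
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");            // a txn.csv record
        if (f.length >= 5) {
            String name = custNames.get(f[2]);                // join on cust_id
            context.write(new Text(name != null ? name : f[2]), new Text(f[4]));
        }
    }
}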

Multiple Inputs
• Often in real life you get related data from different sources in different formats
• Hadoop provides the MultipleInputs class to handle this situation
  • MultipleInputs.addInputPath(job, inputPath1, <InputFormat>.class);
  • MultipleInputs.addInputPath(job, inputPath2, <InputFormat>.class);

• There is no need to set the input path and InputFormat class separately
• You can even have a separate Mapper class for each input file
  • MultipleInputs.addInputPath(job, inputPath1, <InputFormat>.class, MapperClass1.class);
  • MultipleInputs.addInputPath(job, inputPath2, <InputFormat>.class, MapperClass2.class);
• Both Mappers must emit the same key/value types

Joins
• More than one record set to be joined based on a key
• Two techniques for joining data in MapReduce
• Map side join (Replicated Join)
  • Possible only when
    • one of the data sets is small enough to be distributed across the data nodes and fits into memory, so that the maps can join independently, OR
    • both data sets are partitioned in such a way that they have an equal number of partitions, are sorted by the same key, and all records for a given key reside in the same partition
  • The smaller data set is used for the lookup using the join key
  • Faster, as the data is loaded into memory

95

Joins
• Reduce side join
  • The mapper tags the records from the two data sets distinctly
  • The join key is used as the map's output key
  • The records for the same key are brought together in the reducer, and the reducer completes the joining process
  • Less efficient, as both data sets have to go through the MapReduce shuffle

96
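A minimal reduce-side join sketch for the retail data (illustrative only; imports, java.util collections and the driver are omitted as on the surrounding slides). The mapper tags each record by its source, here crudely by field count, and the reducer pairs customer names with their transaction categories:

// Sketch: reduce-side join of custs.csv (5 fields) with txn.csv (9 fields) on cust_id.
public static class JoinTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length == 5) {                                     // customer record
            context.write(new Text(f[0]), new Text("C," + f[1] + " " + f[2]));
        } else if (f.length == 9) {                              // transaction record
            context.write(new Text(f[2]), new Text("T," + f[4]));
        }
    }
}

public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text custId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = null;
        List<String> categories = new ArrayList<String>();
        for (Text v : values) {                                  // all records for one cust_id
            String s = v.toString();
            if (s.startsWith("C,")) {
                name = s.substring(2);
            } else {
                categories.add(s.substring(2));
            }
        }
        for (String cat : categories) {                          // one joined row per transaction
            context.write(custId, new Text((name == null ? "" : name) + "\t" + cat));
        }
    }
}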

Job Chaining
• Multiple jobs can be run in a linear or a complex dependent fashion
• The simple way is to call the job drivers one after the other with their respective configurations
  JobClient.runJob(conf1);
  JobClient.runJob(conf2);
  • Here the second job is not launched until the first job has completed

• For complex dependencies you can use the JobControl and ControlledJob classes
  ControlledJob cjob1 = new ControlledJob(conf1);
  ControlledJob cjob2 = new ControlledJob(conf2);

  cjob2.addDependingJob(cjob1);
  JobControl jc = new JobControl("Chained Job");
  jc.addJob(cjob1);
  jc.addJob(cjob2);

  jc.run();

• JobControl can run jobs in parallel if there is no dependency or the dependencies are met

97

Speculative Execution
• A MapReduce job's execution time is typically determined by the slowest running task
  • The job is not complete until all tasks are completed
  • One slow task can bring down the overall performance of the job

• Tasks can be slow due to various reasons
  • Hardware degradation
  • Software issues

• Hadoop Strategy – Speculative Execution
  • Determines when a task is running longer than expected
  • Launches another equivalent task as a backup
  • The output is taken from whichever task completes first
  • Any duplicate tasks still running are killed after that

Speculative Execution - Settings
• Is ON by default
• The behavior can be controlled independently for map and reduce tasks
• For Map Tasks
  • mapred.map.tasks.speculative.execution set to true/false
• For Reduce Tasks
  • mapred.reduce.tasks.speculative.execution set to true/false

99
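These properties can be set in mapred-site.xml or per job. A small illustrative fragment (not from the original slides) using the property names from this slide:

// Sketch: turning speculative execution off for one job.
Configuration conf = new Configuration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
Job job = new Job(conf, "No speculative execution");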

Skipping Bad Records
• While handling large datasets you may not anticipate every possible error scenario
  • This will result in unhandled exceptions leading to task failure
• Hadoop retries failed tasks (a task can fail for other reasons too) up to four times before marking the whole job as failed
• Hadoop provides a skipping mode for automatically skipping bad records
  • The mode is OFF by default
  • It can be enabled by setting mapred.skip.mode.enabled = true

Skipping Bad Records
• When skipping mode is enabled, if a task fails twice, the offending record is noted on the third attempt and skipped on the fourth attempt
• The total number of attempts for map and reduce tasks can be increased by setting
  mapred.map.max.attempts
  mapred.reduce.max.attempts

• Bad records are stored under the _logs/skip directory as sequence files

101
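An illustrative fragment (not from the original slides) enabling skipping mode with the property names given on these slides; exact property handling may differ between Hadoop releases:

// Sketch: enabling record skipping and allowing more attempts, using the properties named above.
Configuration conf = new Configuration();
conf.setBoolean("mapred.skip.mode.enabled", true);
conf.setInt("mapred.map.max.attempts", 8);        // give the framework room to isolate the bad record
conf.setInt("mapred.reduce.max.attempts", 8);
Job job = new Job(conf, "Job with record skipping");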

Hadoop Archive Files

• HDFS stores small files (size < block size) inefficiently

Install & Configure PIG

• Untar into a designated folder. This will be PIG's home directory
>>tar xvf pig-x.y.z.tar.gz

• Configure Environment Variables - add in .bash_profile
  • export PIG_INSTALL=<install dir>/pig-x.y.z
  • export PATH=$PATH:$PIG_INSTALL/bin
>>. .bash_profile

• Verify Installation
>>pig -help
  • Displays command usage

>>pig
  • Takes you into the Grunt shell
  grunt>

108

PIG Execution Modes
• Local Mode
  • Runs in a single JVM
  • Operates on the local file system
  • Suitable for small datasets and for development
  • To run PIG in local mode
    >>pig -x local

• MapReduce Mode
  • In this mode the queries are translated into MapReduce jobs and run on a hadoop cluster
  • The PIG version must be compatible with the hadoop version
  • Set the HADOOP_HOME environment variable to indicate to pig which hadoop client to use
    • export HADOOP_HOME=$HADOOP_INSTALL
  • If not set, it will use the bundled version of hadoop

109

Ways of Executing PIG programs

• Grunt • An interactive shell for running Pig commands • Grunt is started when the pig command is run without any options

• Script • Pig commands can be executed directly from a script file >>pig pigscript.pig

• It is also possible to run Pig scripts from Grunt shell using run and exec.

• Embedded • You can run Pig programs from Java using the PigServer class, much like you can use JDBC • For programmatic access to Grunt, use PigRunner

110

An Example A Sequence of transformation steps to get the end result

grunt> transactions = LOAD 'retail/txn.csv' USING PigStorage(',') AS (txn_id, txn_dt, cust_id, amt, cat, sub_cat, adr1, adr2, trans_type);

LOAD

grunt> txn_100plus = FILTER transactions BY amt > 100.00;

FILTER

grunt> txn_grpd = GROUP txn_100plus BY cat;

grunt> txn_cnt_bycat = FOREACH txn_grpd GENERATE group, COUNT(txn_100plus);

GROUP

AGGREGATE

grunt> DUMP txn_cnt_bycat;

A relation is created with every statement 111

Data Types

Simple Types
Category    Type         Description
Numeric     int          32-bit signed integer
            long         64-bit signed integer
            float        32-bit floating-point number
            double       64-bit floating-point number
Text        chararray    Character array in UTF-16 format
Binary      bytearray    Byte array

112

Data Types

Complex Types
Type     Description                                                                          Example
tuple    Sequence of fields of any type                                                       (1,'pomegranate')
bag      An unordered collection of tuples, possibly with duplicates                          {(1,'pomegranate'),(2)}
map      A set of key-value pairs; keys must be character arrays, but values may be any type  ['a'#'pomegranate']

113

LOAD Operator

<relation> = LOAD '<file>' [USING <load UDF>()]
             [AS (<field>:dataType, <field>:dataType, ..., <field>:dataType)]

• Loads data from a file into a relation
• Uses the PigStorage load function by default unless specified otherwise with the USING option
• The data can be given a schema using the AS option
• The default data type is bytearray if not specified

records = LOAD 'sales.txt';
records = LOAD 'sales.txt' AS (f1:chararray, f2:int, f3:float);
records = LOAD 'sales.txt' USING PigStorage('\t');
records = LOAD 'sales.txt' USING PigStorage('\t') AS (f1:chararray, f2:int, f3:float);

114

Diagnostic Operators

• DESCRIBE • Describes the schema of a relation

• EXPLAIN • Display the execution plan used to compute a relation

• ILLUSTRATE • Illustrate step-by-step how data is transformed • Uses sample of the input data to simulate the execution.

115

Data Write Operators

• LIMIT • Limits the number of tuples from a relation

• DUMP • Display the tuples from a relation

• STORE
  • Stores the data from a relation into a directory
  • The directory must not already exist

116

Relational Operators

• FILTER • Selects tuples based on Boolean expression teenagers = FILTER cust BY age < 20;

• ORDER • Sort a relation based on one or more fields • Further processing (FILTER, DISTINCT, etc.) may destroy the ordering ordered_list = ORDER cust BY name DESC;

• DISTINCT • Removes duplicate tuples unique_custlist = DISTINCT cust;

117

Relational Operators

• GROUP BY • Within a relation, group tuples with the same group key • GROUP ALL will group all tuples into one group groupByProfession=GROUP cust BY profession groupEverything=GROUP cust ALL

• FOREACH
  • Loop through each tuple in nested_alias and generate new tuple(s)
    countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
  • Built-in aggregate functions: AVG, COUNT, MAX, MIN, SUM

118

Relational Operators

• GROUP BY • Within a relation, group tuples with the same group key • GROUP ALL will group all tuples into one group groupByProfession=GROUP cust BY profession groupEverything=GROUP cust ALL

• FOREACH
  • Loop through each tuple in nested_alias and generate new tuple(s)
  • At least one of the fields of nested_alias should be a bag
  • DISTINCT, FILTER, LIMIT, ORDER, and SAMPLE are allowed operations in nested_op to operate on the inner bag(s)
    countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
  • Built-in aggregate functions: AVG, COUNT, MAX, MIN, SUM

Operating on Multiple datasets

• JOIN
  • Compute the inner join of two or more relations based on common field values

  DUMP A;            DUMP B;
  (1,2,3)            (2,4)
  (4,2,1)            (8,9)
  (8,3,4)            (1,3)
  (4,3,3)            (2,7)
  (7,2,5)            (7,9)

  X = JOIN A BY a1, B BY b1;
  DUMP X;
  (1,2,3,1,3)
  (8,3,4,8,9)
  (7,2,5,7,9)

120

Operating on Multiple datasets

• COGROUP
  • Group tuples from two or more relations, based on common group values

  >>DUMP A;          >>DUMP B;
  (1,2,3)            (2,4)
  (4,2,1)            (8,9)
  (8,3,4)            (1,3)
  (4,3,3)            (2,7)
  (7,2,5)            (7,9)

  >>X = COGROUP A BY a1, B BY b1;
  >>DUMP X;
  (1, {(1,2,3)}, {(1,3)})
  (8, {(8,3,4)}, {(8,9)})
  (7, {(7,2,5)}, {(7,9)})
  (2, {}, {(2,4),(2,7)})
  (4, {(4,2,1),(4,3,3)}, {})

Operating on Multiple datasets

• UNION
  • Creates the union of two or more relations

  >>DUMP A;          >>DUMP B;
  (1,2,3)            (2,4)
  (4,2,1)            (8,9)
  (8,3,4)

  >>X = UNION A, B;
  >>DUMP X;
  (1,2,3)
  (4,2,1)
  (8,3,4)
  (2,4)
  (8,9)

• SPLIT
  • Splits a relation into two or more relations, based on Boolean expressions

  >>SPLIT X INTO C IF a1 < 5, D IF a1 > 5;

  >>DUMP C;          >>DUMP D;
  (1,2,3)            (8,3,4)
  (4,2,1)            (8,9)
  (2,4)

Operating on Multiple datasets

• SAMPLE • Randomly samples a relation as per given sampling factor. • There is no guarantee that the same number of tuples are returned every time.

>>sample_data = SAMPLE large_data 0.01; • Above statement generates a 1% sample of data in relation large_data

123

UDFs
• PIG lets users define their own functions and use them in statements
• UDFs can be developed in Java, Python or JavaScript
• Filter UDF
  • Subclass FilterFunc, which is a subclass of EvalFunc

• Eval UDF
  • Subclass EvalFunc

  public abstract class EvalFunc<T> {
      public abstract T exec(Tuple input) throws IOException;
  }

• Load UDF
  • Subclass LoadFunc

• Define and use a UDF
  • REGISTER pig-examples.jar;
  • DEFINE isCustomerTeen com.training.myfunc.isCustomerTeen();
  • filtered = FILTER cust BY isCustomerTeen(age);
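A minimal sketch of such a filter UDF in Java (illustrative; the class name mirrors the isCustomerTeen example above, and the field handling is an assumption):

// Illustrative Pig filter UDF: returns true when the age field is between 13 and 19.
import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class isCustomerTeen extends FilterFunc {

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;                          // treat missing ages as non-teen
        }
        int age = Integer.parseInt(input.get(0).toString());
        return age >= 13 && age <= 19;
    }
}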

Macros

• Package reusable pieces of Pig Latin code • Define a Macro DEFINE max_by_group(X, group_key, max_field) RETURNS Y { A = GROUP $X by $group_key; $Y = FOREACH A GENERATE group, MAX($X.$max_field); }; max_temp = max_by_group(filtered_records, year, temperature);

• Macros can be defined in files separate from Pig scripts for reuse, in which case they need to be imported
  IMPORT '<path to macro file>';

125

HIVE 126

HIVE
• A data warehousing framework built on top of Hadoop
• Hides the MapReduce complexity behind it
• Target users are generally data analysts who are comfortable with SQL
• Offers an SQL-like language called HiveQL
• Hive is meant only for structured data
• You can interact with Hive using several methods
  • CLI (Command Line Interface)
  • A Web GUI
  • JDBC

127

HIVE Architecture

[Diagram: the CLI, Web UI and JDBC interfaces talk to the Hive parser / planner / optimizer, which uses the Hive Metastore and submits Map and Reduce jobs to the Hadoop cluster.]

Install & Configure HIVE • Download a version of HIVE compatible with your hadoop installation • http://hive.apache.org/releases.html

• Untar into a designated folder. This will be HIVE’s home directory >>tar xvf hive-x.y.z.tar.gz

• Configure
  • Environment Variables – add in .bash_profile
    • export HIVE_INSTALL=<install dir>/hive-x.y.z
    • export PATH=$PATH:$HIVE_INSTALL/bin

• Verify Installation >>hive -help • Displays command usage

>>hive • Takes you into hive shell hive>

129

Install & Configure HIVE
• Hadoop needs to be running
• Point Hive to Hadoop
  • Create hive-site.xml under the conf directory
  • Specify the filesystem and jobtracker using the hadoop properties
    • fs.default.name
    • mapred.job.tracker

• If not set, they default to the local file system and the local (in-process) job runner - just like they do in Hadoop

• Create following directories under HDFS • /tmp • /user/hive/warehouse • chmod g+w for both above directories

130

Install & Configure HIVE • Data store • Hive stores data under /user/hive/warehouse by default

• Metastore • Out-of-the-box hive comes with light weight SQL database Derby to store and manage meta data • This can be configured to other databases like MySQL

131

Hive Data Models • Databases • Tables • Partitions • Buckets

132

Hive Data Types • TINYINT – 1 byte integer • SMALLINT – 2 byte integer • INT – 4 byte integer • BIGINT – 8 byte integer • BOOLEAN – true / false • FLOAT – single precision • DOUBLE – double precision • STRING – sequence of characters • STRUCT • A column can be of type STRUCT with data {a INT, b STRING}

• MAPS
• ARRAYS
  • ['a', 'b', 'c']

133

Tables • A Hive table is logically made up of the data being stored and the associated metadata • Creating a Table CREATE TABLE emp_table (id INT, name String, address STRING) PARTITIONED BY (designation STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS SEQUENCEFILE;

• Loading Data LOAD DATA INPATH ‘/home/hadoop/employee.csv’ OVERWRITE INTO TABLE emp_table;

• View Table Schema SHOW TABLES; DESCRIBE emp_table;

• An external table is a table which is outside the warehouse directory 134

Hands On
• Create retail, customers tables
hive> CREATE DATABASE retail;
hive> USE retail;
hive> CREATE TABLE retail_trans (txn_id INT, txn_date STRING, Cust_id INT, Amount FLOAT, Category STRING, Sub_Category STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> CREATE TABLE customers (Cust_id INT, FirstName STRING, LastName STRING, Profession STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; hive> SHOW TABLES; hive> DESCRIBE retail_trans;

135

Hands On • Load data and run queries hive> LOAD DATA INPATH 'retail/txn.csv' INTO TABLE retail_trans; hive> LOAD DATA INPATH 'retail/custs.csv' INTO TABLE customers; hive> SELECT Category, count(*) FROM retail_trans GROUP BY Category; hive> SELECT Category, count(*) FROM retail_trans WHERE Amount > 100 GROUP BY Category; hive> SELECT Concat (cu.FirstName, ' ', cu.LastName), rt.Category, count(*) FROM retail_trans rt JOIN customers cu ON rt.cust_id = cu.cust_id GROUP BY cu.FirstName, cu.LastName, rt.Category;

136

Queries
• SELECT
  SELECT id, name FROM emp_table WHERE designation = 'manager';
  SELECT count(*) FROM emp_table;
  SELECT designation, count(*) FROM emp_table GROUP BY designation;

• INSERT
  • INSERT OVERWRITE TABLE new_emp (SELECT * FROM emp_table WHERE id > 100);
  • Inserting into a local directory
  • INSERT OVERWRITE LOCAL DIRECTORY 'tmp/results' (SELECT * FROM emp_table WHERE id > 100);

• JOIN SELECT emp_table.*, detail.age FROM emp_table JOIN detail ON (emp_table.id = detail.id); 137

Partitioning & Bucketing • HIVE can organize tables into partitions based on columns • Partitioned are specified during the table creation time • When we load data into a partitioned table, the partition values are specified explicitly: LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');

• Bucketing • Bucketing imposes extra structure on the table • Makes sampling more efficient (see the sampling example below) CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS; 138
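A hedged example of how bucketing helps sampling (the source table users is hypothetical): TABLESAMPLE can read just one of the four buckets instead of scanning the whole table.

hive> SET hive.enforce.bucketing = true;   -- make inserts actually produce 4 buckets
hive> INSERT OVERWRITE TABLE bucketed_users SELECT id, name FROM users;

hive> SELECT * FROM bucketed_users
      TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);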

UDFs • UDFs have to be written in Java • They must subclass UDF (org.apache.hadoop.hive.ql.exec.UDF) • A UDF must implement at least one evaluate() method (the method body below is illustrative: it trims whitespace using Commons Lang StringUtils)
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private Text result = new Text();
  public Text evaluate(Text str) {
    if (str == null) { return null; }
    result.set(StringUtils.strip(str.toString()));  // remove leading/trailing whitespace
    return result;
  }
}
ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
SELECT strip('  bee  ') FROM dummy;
139

SQOOP 140

SQOOP

• Sqoop allows users to extract data from a structured data store into Hadoop for analysis • Sqoop can also export the data back to the structured stores • Installing & Configuring SQOOP • Download a version of Sqoop compatible with your Hadoop installation • Untar into a designated folder; this will be Sqoop's home directory >>tar xvf sqoop-x.y.z.tar.gz

• Configure • Environment Variables – add in .bash_profile • export SQOOP_HOME=//sqoop-x.y.z • export PATH=$PATH:$SQOOP_HOME/bin

• Verify Installation >>sqoop >>sqoop help

141

Importing Data
• [Diagram: Sqoop import flow from an RDBMS into a Hadoop cluster]
  1. The Sqoop client examines the table schema in the RDBMS
  2. Sqoop generates code for the table (e.g. MyClass.java)
  3. Sqoop launches multiple map tasks on the Hadoop cluster
  4. The map tasks use the generated code to read rows from the RDBMS and write them into HDFS
142

Importing Data • Copy the MySQL JDBC driver to Sqoop's lib directory • Sqoop does not ship with JDBC drivers

• Sample import >>sqoop import --connect jdbc:mysql://localhost/retail --table transactions -m 1 >>hadoop fs -ls transactions • The Import tool will run a MapReduce job that connects to the database and reads the table • By default, four map tasks are used • The output is written to a directory by the table name, under user’s HDFS home directory

• Generates comma-delimited text files by default • In addition to downloading data, the import tool also generates a java class as per the table schema 143

Codegen • The code can also be generated without import action >>sqoop codegen --connect jdbc:mysql://localhost/hadoopguide --table widgets --class-name Widget • The generated class can hold a single record retrieved from the table • The generated code can be used in MapReduce programs to manipulate the data

144

Working with Hive • Importing data into Hive • Generate a Hive table definition directly from the source >>sqoop create-hive-table --connect jdbc:mysql://localhost/retail --table transactions --fields-terminated-by ','

• Generate table definition and import data into Hive >>sqoop import --connect jdbc:mysql://localhost/retail --table transactions -m 1 --hive-import

• Exporting data from Hive • Create the table in MySQL database >>sqoop export --connect jdbc:mysql://localhost/retail -m 1 --table customers --export-dir /user/hive/warehouse/retail.db/customers --input-fields-terminated-by ',' 145

Administration 146

NameNode Persistent Data Structure • A newly formatted Namenode creates the directory structure shown below • VERSION: a Java properties file with the HDFS version • edits: every write operation, such as creating or moving a file, is logged into edits • fsimage: a persistent checkpoint of the filesystem metadata, updated whenever the edit log rolls over • fstime: records the time when fsimage was last updated

${dfs.name.dir}/
  current/
    VERSION
    edits
    fsimage
    fstime

147

Persistent Data Structure – DataNode & SNN
• Secondary Namenode directory structure
${fs.checkpoint.dir}/
  current/
    VERSION
    edits
    fsimage
    fstime
  previous.checkpoint/
    VERSION
    edits
    fsimage
    fstime

• Datanode directory structure • Need not be formatted explicitly • Datanodes create their storage directories on startup
${dfs.data.dir}/
  current/
    blk_<id>
    blk_<id>.meta
    ...
    subdir0/
    subdir1/
148

HDFS Safe Mode • When a Namenode starts it enters safe mode • It loads fsimage into memory and applies the edits from the edit log • During this time the filesystem is effectively read-only; no modifications or block replications are allowed • Safe mode is exited when the minimal replication condition is met, plus an extension time of 30 seconds

• Check whether Namenode is in safe mode hadoop dfsadmin -safemode get

• Wait until the safe mode is off hadoop dfsadmin -safemode wait

• Enter or leave safe mode hadoop dfsadmin -safemode enter / leave 149

HDFS Filesystem Check

• Hadoop provides fsck utility to check the health of HDFS hadoop fsck /

• Option to either move (to lost+found) or delete affected files hadoop fsck / -move hadoop fsck / -delete

• Finding Blocks for a given file hadoop fsck /user/hadoop/weather/1901 -files -blocks -racks 150

HDFS Block Scanner • Datanodes periodically run a block scanner to verify the blocks stored on them and guard against disk errors • The default period is 3 weeks (dfs.datanode.scan.period.hours; see the sketch below) • Corrupt blocks are reported to the Namenode for fixing • The block scan report for a datanode can be accessed at http://<datanode>:50075/blockScannerReport • The list of blocks can be accessed by appending ?listblocks to the above URL
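A small hdfs-site.xml sketch for tuning the scan period (3 weeks = 504 hours; the value is the default, shown only for illustration):

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>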

151

HDFS Balancer

• Over a period of time, the block distribution across the cluster may become unbalanced • This affects data locality for MapReduce and puts strain on highly utilized datanodes • Hadoop's balancer daemon redistributes blocks to restore the balance • The balancing act can be initiated through start-balancer.sh

• It produces a log file in the standard log directory • The bandwidth available to the balancer can be changed by setting the dfs.balance.bandwidthPerSec property in hdfs-site.xml (see the sketch below) • The default bandwidth is 1 MB/sec 152
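A sketch of raising the balancer bandwidth and running the balancer (the 10 MB/sec value and the 10% threshold are illustrative):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>   <!-- 10 MB/sec, up from the 1 MB/sec default -->
</property>

>>start-balancer.sh -threshold 10
>>stop-balancer.sh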

Logging • All Hadoop Daemons produce respective log files • Log files are stored under $HADOOP_INSTALL/logs • The location can be changed by setting the property HADOOP_LOG_DIR in hadoop-env.sh

• The log levels can be set under log4j.properties • Name Node • log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN

• Job Tracker • log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG

• Stack Traces • Stack traces for all the Hadoop daemons can be obtained from the /stacks page of the web UI that each daemon exposes • The Jobtracker stack trace can be found at • http://<jobtracker>:50030/stacks

153

Hadoop Routine Maintenance • Metadata Backup • Good practice to keep copies of different ages (one hour, one day, one week, etc.) • One way is to periodically archive the secondary namenode's previous.checkpoint directory to an offsite location • Test the integrity of the copy regularly

• Data Backup • HDFS replication is not a substitute for data back up • As the data volume is very high, it is a good practice to prioritize the data to be backed up • Business critical data • Data that can not be regenerated

• distcp is a good tool to backup from HDFS to other filesystems

• Run filesystem check (fsck) and balancer tools regularly 154

Commissioning of New Nodes • The datanodes that are permitted to connect to the Namenode are specified in a file pointed to by the property dfs.hosts • The tasktrackers that are permitted to connect to the Jobtracker are specified in a file pointed to by the property mapred.hosts • This prevents arbitrary machines from connecting to the cluster and compromising data integrity and security • To add new nodes (see the sketch below) • Add the network address of the new node to the above files • Run the commands to refresh the Namenode and Jobtracker hadoop dfsadmin -refreshNodes hadoop mradmin -refreshNodes

• Update the slaves file with the new nodes • Note that the slaves file is not used by hadoop daemons. It is used by control scripts for cluster-wide operations 155
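A hedged sketch of the include-file setup; the file path and hostname below are illustrative:

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts</name>
  <value>/home/hadoop/conf/include</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts</name>
  <value>/home/hadoop/conf/include</value>
</property>

>>echo "datanode05.example.com" >> /home/hadoop/conf/include
>>hadoop dfsadmin -refreshNodes
>>hadoop mradmin -refreshNodes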

Decommissioning of Nodes

• For removing nodes from the cluster, the Namenode and the Jobtracker must be informed • The decommissioning process is controlled by an exclude file. The file location is set through a property • For HDFS it is dfs.hosts.exclude • For MapReduce mapred.hosts.exclude

• To remove nodes from the cluster • add the network addresses to the respective exclude files • Run the commands to update Namenode and JobTracker • hadoop dfsadmin -refreshNodes • hadoop mradmin -refreshNodes

• During the decommissioning process the Namenode replicates the blocks from the decommissioned nodes to other datanodes • Remove the nodes from the include file as well as the slaves file 156

Cluster & Machines Considerations

• Several options • Build your own cluster from scratch • Use offerings that provide Hadoop as a service in the cloud

• While building your own, choose server-grade commodity machines (commodity does not mean low-end) • Unix / Linux platform • Hadoop is designed to use multiple cores and disks • A typical machine for running a datanode and tasktracker:

Processor: Two quad-core 2-2.5 GHz CPUs
Memory:    16-24 GB ECC RAM
Storage:   Four 1 TB SATA disks
Network:   Gigabit Ethernet

• Cluster size is typically estimated based on the storage capacity and its expected growth 157

Master Node Scenarios

• The machines running the master daemons should be resilient, as failure of these would lead to data loss and unavailability of the cluster • On a small cluster (a few tens of nodes) you can run all master daemons on a single machine • As the cluster grows, their memory requirements grow and they need to be run on separate machines • The control scripts should be run as follows • Run the HDFS control scripts from the namenode machine • The masters file should contain the address of the secondary namenode

• Run the MapReduce control scripts from the Jobtracker machine • The slaves file on both machines should be in sync so that each worker node runs one Datanode and one Tasktracker 158

Network Topology

• A common architecture consists of a two-level network topology
• [Diagram: a core (1 Gb or faster) switch connects the 1 Gb rack switches, with 30 to 40 servers per rack]
159

Network Topology • For a multirack cluster, the admin needs to map nodes to racks so Hadoop is network-aware and can place data as well as MapReduce tasks as close as possible to the data • Two ways to define the network map • Implement the Java interface DNSToSwitchMapping public interface DNSToSwitchMapping { public List<String> resolve(List<String> names); } • Have the property topology.node.switch.mapping.impl point to the implementing class. The namenode and jobtracker will make use of this

• Or use a user-defined script pointed to by the property topology.script.file.name (see the sketch below)

• The default behavior is to map all nodes to the same rack 160
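A minimal topology script sketch (the rack names and the IP-to-rack mapping are made up). Hadoop invokes the script referenced by topology.script.file.name with one or more node addresses as arguments and expects one rack path per argument on stdout:

#!/bin/bash
# Illustrative topology script: maps each node address passed as an argument to a rack path.
while [ $# -gt 0 ] ; do
  node=$1
  case $node in
    10.1.1.*) echo -n "/rack1 " ;;
    10.1.2.*) echo -n "/rack2 " ;;
    *)        echo -n "/default-rack " ;;
  esac
  shift
done
echo ""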

Cluster Setup and Installation • Use automated installation tools such as Kickstart or its Debian equivalent to install the software on the nodes • Create one master script and use it to automate the rest

• The following steps complete the cluster setup • Install Java (6 or later) on all nodes • Create a user account for Hadoop activities on all nodes • Use the same user name on all nodes • Having an NFS-mounted home directory makes SSH key distribution simpler

• Install Hadoop on all nodes and change the owner of files • Install SSH. Hadoop control scripts (not the daemons) rely on SSH to perform cluster-wide operations

• Configure • Generate an RSA key pair and share the public key with all nodes • Configure Hadoop. A better way of doing this is with tools like Chef or Puppet 161

Memory Requirements – Worker Node • The memory allocated to each daemon is controlled by HADOOP_HEAPSIZE setting in hadoop-env.sh • The default value is 1 GB

• The task tracker launches separate JVMs to run map and reduce tasks • The memory for the child JVM is set by mapred.child.java.opts. Default value is 200 MB

• The number of map and reduce tasks that can be run at any time is set by the property • Map - mapred.tasktracker.map.tasks.maximum • Reduce - mapred.tasktracker.reduce.tasks.maximum • The default is two for both map and reduce tasks 162
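A hedged mapred-site.xml sketch for the properties above; the slot counts and heap size are illustrative and should be sized to the hardware (they match the eight-core example on the next slide):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx400m</value>
</property>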

Memory Requirements – Worker Node

• The number of tasks that can run simultaneously on a tasktracker depends on the number of processors • A good rule of thumb is to have between one and two times more tasks than processor cores • If you have an eight-core machine • One core for the Datanode and Tasktracker daemons • On the remaining seven cores we can run 7 map and 7 reduce tasks • Increasing the memory for each child JVM to 400 MB, the total memory required is about 7.6 GB (14 tasks × 400 MB ≈ 5.6 GB, plus roughly 1 GB of heap each for the Datanode and Tasktracker daemons)

163

Other Properties to consider • Cluster Membership • Buffer Size • HDFS Block size • Reserved storage space • Trash • Job Scheduler • Reduce slow start • Task Memory Limits

164

Security • Hadoop uses Kerberos for authentication • Kerberos does not manage permissions for Hadoop • To enable Kerberos authentication, set the property hadoop.security.authentication in core-site.xml to kerberos • Enable service-level authorization by setting hadoop.security.authorization to true in the same file (see the sketch below) • To control which users and groups can do what, configure Access Control Lists (ACLs) in hadoop-policy.xml 165
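A core-site.xml sketch for the two switches mentioned above; enabling these also requires a working Kerberos KDC and per-daemon principals/keytabs, which are not shown here:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>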

Security Policies
• Allow only alice, bob and users in the mapreduce group to submit jobs
  security.job.submission.protocol.acl = alice,bob mapreduce

• Allow only users in the datanode group to communicate with the Namenode
  security.datanode.protocol.acl = datanode

• Allow any user to talk to the HDFS cluster as a DFSClient
  security.client.protocol.acl = *

(see the hadoop-policy.xml sketch below) 166
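The same ACLs as hadoop-policy.xml properties; a sketch assuming the usual "comma-separated users, then a space, then comma-separated groups" value format (a value starting with a space conventionally means groups only, no individual users):

<property>
  <name>security.job.submission.protocol.acl</name>
  <value>alice,bob mapreduce</value>
</property>
<property>
  <name>security.datanode.protocol.acl</name>
  <value> datanode</value>   <!-- leading space: no users, only the datanode group -->
</property>
<property>
  <name>security.client.protocol.acl</name>
  <value>*</value>
</property>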

Recommended Readings

167
