
Apache Hadoop – A course for undergraduates
Lecture 8

Hadoop Tools for Data Acquisition
Chapter 8.1

Data Acquisition with Hadoop
§ How to load data from an existing RDBMS into HDFS using Sqoop
§ How to manage real-time data such as log files using Flume

Chapter Topics
Hadoop Tools for Data Acquisition
§ Loading Data into HDFS from an RDBMS Using Sqoop
§ Managing Real-Time Data Using Flume

Importing Data from an RDBMS to HDFS
§ Typical scenario: data stored in an RDBMS is needed in a MapReduce job
  – Lookup tables
  – Legacy data
§ It is possible to read directly from an RDBMS in your Mapper
  – Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
  – In practice: don't do it!
§ Better idea: use Sqoop to import the data into HDFS beforehand

[diagram: RDBMS → sqoop → HDFS]

Sqoop: SQL to Hadoop (1)
§ Sqoop: an open source tool originally written at Cloudera
  – Now a top-level Apache Software Foundation project
§ Imports tables from an RDBMS into HDFS
  – Just one table
  – All tables in a database
  – Just portions of a table (Sqoop supports a WHERE clause)
§ Uses MapReduce to actually import the data
  – 'Throttles' the number of Mappers to avoid DDoS scenarios
  – Uses four Mappers by default; the value is configurable
§ Uses a JDBC interface
  – Should work with virtually any JDBC-compatible database
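For example, the WHERE-clause support and the Mapper throttling described above are both exposed as command-line options. A brief sketch, reusing the hypothetical connection details from the examples later in this chapter:

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
    --where "region = 'west'" \
    --num-mappers 2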

8-­‐6  

Sqoop: SQL to Hadoop (2)
§ Imports data to HDFS as delimited text files or SequenceFiles
  – Default is a comma-delimited text file
§ Can be used for incremental data imports
  – First import retrieves all rows in a table
  – Subsequent imports retrieve just rows created since the last import
§ Generates a class file which can encapsulate a row of the imported data
  – Useful for serializing and deserializing data in subsequent MapReduce jobs
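Both the storage format and the generated class are controlled by options; a brief sketch (the class and directory names are illustrative placeholders):

$ sqoop import --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
    --as-sequencefile \
    --class-name com.example.EmployeeRecord \
    --outdir ./generated-src

Here --as-sequencefile selects SequenceFile output, and --class-name names the generated record class, written as .java source to the --outdir directory.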

Custom  Sqoop  Connectors   § Cloudera  has  partnered  with  other  organiza6ons  to  create  custom  Sqoop   connectors   – Use  a  database’s  naHve  protocols  rather  than  JDBC   – Provides  much  faster  performance   § Current  systems  supported  by  custom  connectors  include:   – Netezza   – Teradata   – Oracle  Database  (connector  developed  with  Quest  SoWware)   § Others  are  in  development   § Custom  connectors  are  not  open  source,  but  are  free   – Available  from  the  Cloudera  Web  site  

Sqoop: Basic Syntax
§ Standard syntax:

sqoop tool-name [tool-options]

§ Tools include:

  import
  import-all-tables
  list-tables

§ Options include:

  --connect
  --username
  --password
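For instance, the list-tables tool is a quick way to see what is available before importing anything (the connection details match the hypothetical examples on the next slide):

$ sqoop list-tables --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel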

8-­‐9  

Sqoop: Example
§ Example: import a table called employees from a database called personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees

§ Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
    --where "id > 1000"

Importing an Entire Database with Sqoop
§ Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --fields-terminated-by '\t' \
    --warehouse-dir /mydata

Incremental Imports with Sqoop
§ What if new records are added to the database?
  – Could re-import all records, but this is inefficient
§ Sqoop's incremental append mode imports only new records
  – Based on the value of the last record in the specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821

Sqoop: Other Options
§ Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS with the command (see the sketch below)

$ sqoop export [options]

§ For general Sqoop help:

$ sqoop help

§ For help on a particular command:

$ sqoop help command
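A brief sketch of an export; the target table must already exist in the RDBMS, and the table and HDFS directory names here are illustrative:

$ sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --table order_summaries \
    --export-dir /mydata/summaries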

Chapter Topics
Hadoop Tools for Data Acquisition
§ Loading Data into HDFS from an RDBMS Using Sqoop
§ Managing Real-Time Data Using Flume

Flume: Basics
§ Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
§ Flume is open source
  – Initially developed by Cloudera
§ Flume's design goals:
  – Reliability
  – Scalability
  – Extensibility

Flume: High-Level Overview

[diagram: many Flume agents gather data and forward it – optionally batching, compressing, and encrypting it – through collector agents into HDFS]

• Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as required
• Optionally process incoming data: perform transformations, suppressions, metadata enrichment
• Each agent can be configured with an in-memory or durable channel

Flume Agent Characteristics
§ Each Flume agent has a source, a sink, and a channel
§ Source
  – Tells the node where to receive data from
§ Sink
  – Tells the node where to send data to
§ Channel
  – A queue between the Source and Sink
  – Can be in-memory only or 'durable'
  – Durable channels will not lose data if power is lost
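As a concrete sketch, a Flume agent is wired together in a properties file; the agent name, file paths, and component names below are illustrative, though exec sources, file channels, and hdfs sinks are standard Flume components:

$ cat > agent1.conf <<'EOF'
agent1.sources  = logsrc
agent1.channels = ch1
agent1.sinks    = hdfssink

# Source: tail a Web server log as it grows
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/httpd/access_log
agent1.sources.logsrc.channels = ch1

# Channel: durable (file-backed), so queued events survive a power loss
agent1.channels.ch1.type = file

# Sink: write the events into HDFS
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /user/flume/weblogs
agent1.sinks.hdfssink.channel = ch1
EOF
$ flume-ng agent --conf-file agent1.conf --name agent1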

Flume's Design Goals: Reliability
§ Channels provide Flume's reliability
§ Memory channel
  – Data will be lost if power is lost
§ File channel
  – Data is stored on disk
  – Guarantees durability of data in the face of a power loss
§ Data transfer between agents and channels is transactional
  – A failed data transfer to a downstream agent rolls back and retries
§ Can configure multiple agents with the same task
  – e.g., two agents doing the job of one 'collector' – if one agent fails, then upstream agents would fail over

Flume's Design Goals: Scalability
§ Scalability
  – The ability to increase system performance linearly by adding more resources to the system
  – Flume scales horizontally
  – As load increases, more machines can be added to the configuration

Flume's Design Goals: Extensibility
§ Extensibility
  – The ability to add new functionality to a system
§ Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
  – General Sources include data from files, syslog, and standard output from a process
  – General Sinks include files on the local filesystem or HDFS
  – Developers can write their own Sources or Sinks

Flume: Usage Patterns
§ Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls, and mail servers into HDFS
§ Currently in use in many large organizations, ingesting millions of events per day
  – At least one organization is using Flume to ingest over 200 million events per day
§ Flume is typically installed and configured by a system administrator
  – Check the Flume documentation if you intend to install it yourself

Key Points
§ Sqoop is a tool to load data from a database into HDFS
§ Flume is a tool for managing real-time data
  – e.g., importing data from log files into HDFS

Bibliography
The following offers more information on topics discussed in this chapter:
§ Incremental importing is described in the Sqoop documentation:
  – http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports

An Introduction to Oozie
Chapter 8.2

An Introduction to Oozie
§ What is Oozie?
§ Creating Oozie workflows

Chapter Topics
An Introduction to Oozie
§ Introduction to Oozie
§ Creating Oozie workflows

The Motivation for Oozie (1)
§ Many problems cannot be solved with a single MapReduce job
§ Instead, a workflow of jobs must be created
§ Simple workflow:
  – Run Job A
  – Use output of Job A as input to Job B
  – Use output of Job B as input to Job C
  – Output of Job C is the final required output
§ Easy if the workflow is linear like this (see the sketch below)
  – Can be created as standard Driver code

[diagram: Start Data → Job A → Job B → Job C → Final Result]
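In a linear case like this, the chain can even be expressed directly from the shell rather than in Java Driver code; the jar, class, and directory names here are hypothetical:

$ hadoop jar myjobs.jar org.example.JobA indata tmp_a && \
  hadoop jar myjobs.jar org.example.JobB tmp_a tmp_b && \
  hadoop jar myjobs.jar org.example.JobC tmp_b finalresult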

The Motivation for Oozie (2)
§ If the workflow is more complex, Driver code becomes much more difficult to maintain
§ Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
§ Example: including Hive or Pig jobs as part of the workflow

What is Oozie?
§ Oozie is a 'workflow engine'
§ Runs on a server
  – Typically outside the cluster
§ Runs workflows of Hadoop jobs
  – Including Pig, Hive, and Sqoop jobs
  – Submits those jobs to the cluster based on a workflow definition
§ Workflow definitions are submitted via HTTP
§ Jobs can be run at specific times
  – One-off or recurring jobs
§ Jobs can be run when data is present in a directory

Chapter Topics
An Introduction to Oozie
§ Introduction to Oozie
§ Creating Oozie workflows

Oozie Workflow Basics
§ Oozie workflows are written in XML
§ A workflow is a collection of actions
  – MapReduce jobs, Pig jobs, Hive jobs, etc.
§ A workflow consists of control flow nodes and action nodes
§ Control flow nodes define the beginning and end of a workflow
  – They provide methods to determine the workflow execution path
  – Example: run multiple jobs simultaneously
§ Action nodes trigger the execution of a processing task, such as
  – A MapReduce job
  – A Hive query
  – A Sqoop data import job

Simple Oozie Example (1)
§ Simple example workflow for WordCount:

[diagram: start → wordcount MapReduce action → end on success, kill on error]

Simple Oozie Example (2)

<workflow-app name='wordcount-wf' xmlns='uri:oozie:workflow:0.1'>
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>

Simple Oozie Example (3)

<workflow-app name='wordcount-wf' xmlns='uri:oozie:workflow:0.1'>
    ...
</workflow-app>

A workflow is wrapped in the workflow-app entity.

Simple Oozie Example (4)

<start to='wordcount'/>

The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by transitioning to the wordcount workflow node.

Simple Oozie Example (5)

<action name='wordcount'>
    <map-reduce>
        ...
    </map-reduce>
    ...
</action>

The wordcount action node defines a map-reduce action – a standard Java MapReduce job.

Simple Oozie Example (6)

<configuration>
    <property>
        <name>mapred.mapper.class</name>
        <value>org.myorg.WordCount.Map</value>
    </property>
    ...
</configuration>

Within the action, we define the job's properties.

Simple Oozie Example (7)

<ok to='end'/>
<error to='kill'/>

We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails, we go to the kill node.

Simple Oozie Example (8)

<kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
</kill>

If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.

Simple Oozie Example (9)

<end name='end'/>

Every workflow must have an end node. This indicates that the workflow has completed successfully.

Other Oozie Control Nodes
§ A decision control node allows Oozie to determine the workflow execution path based on some criteria
  – Similar to a switch-case statement
§ fork and join control nodes split one execution path into multiple execution paths which run concurrently
  – fork splits the execution path
  – join waits for all concurrent execution paths to complete before proceeding
  – fork and join are used in pairs (see the sketch below)
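A minimal sketch of a fork/join pair inside a workflow definition; the node and action names are hypothetical, but fork, path, and join are standard Oozie workflow elements:

<fork name='parallel-jobs'>
    <path start='jobA'/>
    <path start='jobB'/>
</fork>
<!-- the jobA and jobB actions each transition to the join on success -->
<join name='merge-results' to='final-job'/>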

Oozie Workflow Action Nodes

Node Name    Description
map-reduce   Runs either a Java MapReduce or Streaming job
fs           Create directories, move or delete files or directories
java         Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig          Runs a Pig script
hive         Runs a Hive query
sqoop        Runs a Sqoop job
email        Sends an e-mail message

Submitting an Oozie Workflow
§ To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie-server>/oozie \
    -config config_file -run

§ Oozie can also be called from within a Java program
  – Via the Oozie client API
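The config_file passed with -config is a Java properties file supplying the workflow's parameters; a brief sketch (the hostnames and paths are placeholders, and oozie.wf.application.path is the standard property naming the workflow's directory in HDFS):

nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8021
inputDir=/user/fred/input
outputDir=/user/fred/output
oozie.wf.application.path=${nameNode}/user/fred/wordcount-wf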

More on Oozie
§ Oozie installation and configuration: CDH Installation Guide, http://docs.cloudera.com
§ Oozie workflows and actions: https://oozie.apache.org
§ The procedure for running a MapReduce job using Oozie: https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
§ Oozie examples: included in the Oozie distribution; instructions for running them at http://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html

Key Points
§ Oozie is a workflow engine for Hadoop
§ Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation

Bibliography
The following offers more information on topics discussed in this chapter:
§ "Introduction to Oozie" article
  – http://www.infoq.com/articles/introductionOozie

Introduction to Pig
Chapter 8.3

Introduction to Pig
§ The key features Pig offers
§ How to use Pig for data processing and analysis
§ How to use Pig interactively and in batch mode

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

Apache Pig Overview
§ Apache Pig is a platform for data analysis and processing on Hadoop
  – It offers an alternative to writing MapReduce code directly
§ Originally developed as a research project at Yahoo
  – Goals: flexibility, productivity, and maintainability
  – Now an open-source Apache project

The Anatomy of Pig
§ Main components of Pig
  – The data flow language (Pig Latin)
  – The interactive shell where you can type Pig Latin statements (Grunt)
  – The Pig interpreter and execution engine

Pig Latin Script:

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

Pig Interpreter / Execution Engine:
  – Preprocess and parse Pig Latin
  – Check data types
  – Make optimizations
  – Plan execution
  – Generate MapReduce jobs
  – Submit job(s) to Hadoop
  – Monitor progress

Where to Get Pig
§ CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  – A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  – Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  – Simple installation
  – 100% free and open source
§ Installation is outside the scope of this course
  – Cloudera offers a training course for system administrators, Cloudera Administrator Training for Apache Hadoop

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

Pig Features
§ Pig is an alternative to writing low-level MapReduce code
§ Many features enable sophisticated analysis and processing (see the sketch below)
  – HDFS manipulation
  – UNIX shell commands
  – Relational operations
  – Positional references for fields
  – Common mathematical functions
  – Support for custom functions and data formats
  – Complex data structures
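A few of these features in a single hypothetical Grunt session (the file and field names are illustrative):

grunt> sales   = LOAD 'sales' USING PigStorage(',') AS (name:chararray, price:int);
grunt> taxed   = FOREACH sales GENERATE name, ROUND(price * 1.07);  -- math function
grunt> by_name = GROUP sales BY name;                               -- relational operation
grunt> totals  = FOREACH by_name GENERATE group, SUM(sales.price);
grunt> firsts  = FOREACH sales GENERATE $0;                         -- positional reference
grunt> DUMP totals;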

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

How Are Organizations Using Pig?
§ Many organizations use Pig for data analysis
  – Finding relevant records in a massive data set
  – Querying multiple data sets
  – Calculating values from input data
§ Pig is also frequently used for data processing
  – Reorganizing an existing data set
  – Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization
§ Pig can help you extract valuable information from Web server log files

Web Server Log Data:

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"

[diagram: processing the logs yields clickstream data for user sessions – e.g., recent activity for one user: search for 'Widget' → widget results → details for Widget X → order Widget X → track order → contact us → send complaint]

Use Case: Data Sampling
§ Sampling can help you explore a representative portion of a large data set
  – Allows you to examine this portion with tools that do not scale well
  – Supports faster iterations during development of analysis jobs

[diagram: random sampling reduces a 100 TB data set to a 50 MB subset]
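Pig's built-in SAMPLE operator does exactly this kind of random selection; a brief sketch with an illustrative input name and sampling fraction:

grunt> full   = LOAD 'weblogs' AS (line:chararray);
grunt> subset = SAMPLE full 0.01;  -- keep roughly 1% of the records
grunt> STORE subset INTO 'weblogs_sample';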

Use Case: ETL Processing
§ Pig is also widely used for Extract, Transform, and Load (ETL) processing

[diagram: data from operations, accounting, and call center systems flows through Pig jobs running on the Hadoop cluster – which validate data, fix errors, remove duplicates, and encode values – and into a data warehouse]

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

Using Pig Interactively
§ You can use Pig interactively, via the Grunt shell
  – Pig interprets each Pig Latin statement as you type it
  – Execution is delayed until output is required
  – Very useful for ad hoc data inspection
§ Example of how to start, use, and exit Grunt:

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

§ Can also execute a Pig Latin statement from the UNIX shell via the -e option

8-­‐61  

Interacting with HDFS
§ You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX
§ The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls;    -- lists HDFS files
grunt> sh ls;     -- lists local files

Running Pig Scripts
§ A Pig script is simply Pig Latin code stored in a text file
  – By convention, these files have the .pig extension
§ You can run a Pig script from within the Grunt shell via the run command
  – This is useful for automation and batch execution

grunt> run salesreport.pig;

§ It is common to run a Pig script directly from the UNIX shell:

$ pig salesreport.pig

MapReduce and Local Modes
§ As described earlier, Pig turns Pig Latin into MapReduce jobs
  – Pig submits those jobs for execution on the Hadoop cluster
§ It is also possible to run Pig in 'local mode' using the -x flag
  – This runs MapReduce jobs on the local machine instead of the cluster
  – Local mode uses the local filesystem instead of HDFS
  – Can be helpful for testing before deploying a job to production

$ pig -x local                     -- interactive
$ pig -x local salesreport.pig     -- batch

Client-Side Log Files
§ If a job fails, Pig may produce a log file to explain why
  – These log files are typically produced in your current working directory on the local (client) machine

Essential Points
§ Pig offers an alternative to writing MapReduce code directly
  – Pig interprets Pig Latin code in order to create MapReduce jobs
  – It then submits these MapReduce jobs to the Hadoop cluster
§ You can execute Pig Latin code interactively through Grunt
  – Pig delays job execution until output is required
§ It is also common to store Pig Latin code in a script for batch execution
  – Allows for automation and code reuse

Bibliography
The following offer more information on topics discussed in this chapter:
§ Apache Pig Web Site
  – http://pig.apache.org/
§ Process a Million Songs with Apache Pig
  – http://tiny.cloudera.com/dac03a
§ Powered By Pig
  – http://tiny.cloudera.com/dac03b
§ LinkedIn: User Engagement Powered By Apache Pig and Hadoop
  – http://tiny.cloudera.com/dac03c
§ Programming Pig (book)
  – http://tiny.cloudera.com/dac03d

Bibliography (cont'd)
§ The original paper on Pig, published by Yahoo in 2008:
  – http://www.research.yahoo.com/files/sigmod08.pdf
