
Apache Hadoop – A course for undergraduates
Lecture 8

Hadoop Tools for Data Acquisition
Chapter 8.1

Data Acquisition with Hadoop
§ How to load data from an existing RDBMS into HDFS using Sqoop
§ How to manage real-time data such as log files using Flume

Chapter Topics
Hadoop Tools for Data Acquisition
§ Loading Data into HDFS from an RDBMS Using Sqoop
§ Managing Real-Time Data Using Flume

Importing Data from an RDBMS to HDFS
§ Typical scenario: data stored in an RDBMS is needed in a MapReduce job
  – Lookup tables
  – Legacy data
§ It is possible to read directly from an RDBMS in your Mapper
  – Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
  – In practice: don't do it!
§ Better idea: use Sqoop to import the data into HDFS beforehand

[diagram: RDBMS → sqoop → HDFS]

Sqoop: SQL to Hadoop (1)
§ Sqoop: an open source tool originally written at Cloudera
  – Now a top-level Apache Software Foundation project
§ Imports tables from an RDBMS into HDFS
  – Just one table
  – All tables in a database
  – Just portions of a table (Sqoop supports a WHERE clause)
§ Uses MapReduce to actually import the data
  – 'Throttles' the number of Mappers to avoid DDoS scenarios
  – Uses four Mappers by default; the value is configurable
§ Uses a JDBC interface
  – Should work with virtually any JDBC-compatible database
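For example, the WHERE-clause support and the Mapper throttling described above are both exposed as command-line options. A brief sketch, reusing the hypothetical connection details from the examples later in this chapter:

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
    --where "region = 'west'" \
    --num-mappers 2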

8-­‐6  

Sqoop: SQL to Hadoop (2)
§ Imports data to HDFS as delimited text files or SequenceFiles
  – Default is a comma-delimited text file
§ Can be used for incremental data imports
  – First import retrieves all rows in a table
  – Subsequent imports retrieve just rows created since the last import
§ Generates a class file which can encapsulate a row of the imported data
  – Useful for serializing and deserializing data in subsequent MapReduce jobs
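Both the storage format and the generated class are controlled by options; a brief sketch (the class and directory names are illustrative placeholders):

$ sqoop import --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
    --as-sequencefile \
    --class-name com.example.EmployeeRecord \
    --outdir ./generated-src

Here --as-sequencefile selects SequenceFile output, and --class-name names the generated record class, written as .java source to the --outdir directory.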

Custom  Sqoop  Connectors   § Cloudera  has  partnered  with  other  organiza6ons  to  create  custom  Sqoop   connectors   – Use  a  database’s  naHve  protocols  rather  than  JDBC   – Provides  much  faster  performance   § Current  systems  supported  by  custom  connectors  include:   – Netezza   – Teradata   – Oracle  Database  (connector  developed  with  Quest  SoWware)   § Others  are  in  development   § Custom  connectors  are  not  open  source,  but  are  free   – Available  from  the  Cloudera  Web  site  

Sqoop: Basic Syntax
§ Standard syntax:

sqoop tool-name [tool-options]

§ Tools include:

  import
  import-all-tables
  list-tables

§ Options include:

  --connect
  --username
  --password
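For instance, the list-tables tool is a quick way to see what is available before importing anything (the connection details match the hypothetical examples on the next slide):

$ sqoop list-tables --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel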

8-­‐9  

Sqoop: Example
§ Example: import a table called employees from a database called personnel in a MySQL RDBMS

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees

§ Example: as above, but only records with an ID greater than 1000

$ sqoop import --username fred --password derf \
    --connect jdbc:mysql://database.example.com/personnel \
    --table employees \
    --where "id > 1000"

Importing an Entire Database with Sqoop
§ Import all tables from the database (fields will be tab-delimited)

$ sqoop import-all-tables \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --fields-terminated-by '\t' \
    --warehouse-dir /mydata

Incremental Imports with Sqoop
§ What if new records are added to the database?
  – Could re-import all records, but this is inefficient
§ Sqoop's incremental append mode imports only new records
  – Based on the value of the last record in the specified column

$ sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table orders \
    --incremental append \
    --check-column order_id \
    --last-value 6713821

Sqoop: Other Options
§ Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS with the command (see the sketch below)

$ sqoop export [options]

§ For general Sqoop help:

$ sqoop help

§ For help on a particular command:

$ sqoop help command
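A brief sketch of an export; the target table must already exist in the RDBMS, and the table and HDFS directory names here are illustrative:

$ sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --table order_summaries \
    --export-dir /mydata/summaries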

Chapter Topics
Hadoop Tools for Data Acquisition
§ Loading Data into HDFS from an RDBMS Using Sqoop
§ Managing Real-Time Data Using Flume

Flume: Basics
§ Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
§ Flume is open source
  – Initially developed by Cloudera
§ Flume's design goals:
  – Reliability
  – Scalability
  – Extensibility

Flume: High-Level Overview

[diagram: many Flume agents gather data and forward it – optionally batching, compressing, and encrypting it – through collector agents into HDFS]

• Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as required
• Optionally process incoming data: perform transformations, suppressions, metadata enrichment
• Each agent can be configured with an in-memory or durable channel

Flume Agent Characteristics
§ Each Flume agent has a source, a sink, and a channel
§ Source
  – Tells the node where to receive data from
§ Sink
  – Tells the node where to send data to
§ Channel
  – A queue between the Source and Sink
  – Can be in-memory only or 'durable'
  – Durable channels will not lose data if power is lost
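As a concrete sketch, a Flume agent is wired together in a properties file; the agent name, file paths, and component names below are illustrative, though exec sources, file channels, and hdfs sinks are standard Flume components:

$ cat > agent1.conf <<'EOF'
agent1.sources  = logsrc
agent1.channels = ch1
agent1.sinks    = hdfssink

# Source: tail a Web server log as it grows
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/httpd/access_log
agent1.sources.logsrc.channels = ch1

# Channel: durable (file-backed), so queued events survive a power loss
agent1.channels.ch1.type = file

# Sink: write the events into HDFS
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /user/flume/weblogs
agent1.sinks.hdfssink.channel = ch1
EOF
$ flume-ng agent --conf-file agent1.conf --name agent1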

Flume's Design Goals: Reliability
§ Channels provide Flume's reliability
§ Memory channel
  – Data will be lost if power is lost
§ File channel
  – Data is stored on disk
  – Guarantees durability of data in the face of a power loss
§ Data transfer between agents and channels is transactional
  – A failed data transfer to a downstream agent rolls back and retries
§ Can configure multiple agents with the same task
  – e.g., two agents doing the job of one 'collector' – if one agent fails, then upstream agents would fail over

Flume's Design Goals: Scalability
§ Scalability
  – The ability to increase system performance linearly by adding more resources to the system
  – Flume scales horizontally
  – As load increases, more machines can be added to the configuration

Flume's Design Goals: Extensibility
§ Extensibility
  – The ability to add new functionality to a system
§ Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
  – General Sources include data from files, syslog, and standard output from a process
  – General Sinks include files on the local filesystem or HDFS
  – Developers can write their own Sources or Sinks

Flume: Usage Patterns
§ Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls, and mail servers into HDFS
§ Currently in use in many large organizations, ingesting millions of events per day
  – At least one organization is using Flume to ingest over 200 million events per day
§ Flume is typically installed and configured by a system administrator
  – Check the Flume documentation if you intend to install it yourself

Key Points
§ Sqoop is a tool to load data from a database into HDFS
§ Flume is a tool for managing real-time data
  – e.g., importing data from log files into HDFS

Bibliography
The following offers more information on topics discussed in this chapter:
§ Incremental importing is described in the Sqoop documentation:
  – http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_incremental_imports

An Introduction to Oozie
Chapter 8.2

An Introduction to Oozie
§ What is Oozie?
§ Creating Oozie workflows

Chapter Topics
An Introduction to Oozie
§ Introduction to Oozie
§ Creating Oozie workflows

The Motivation for Oozie (1)
§ Many problems cannot be solved with a single MapReduce job
§ Instead, a workflow of jobs must be created
§ Simple workflow:
  – Run Job A
  – Use output of Job A as input to Job B
  – Use output of Job B as input to Job C
  – Output of Job C is the final required output
§ Easy if the workflow is linear like this (see the sketch below)
  – Can be created as standard Driver code

[diagram: Start Data → Job A → Job B → Job C → Final Result]
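In a linear case like this, the chain can even be expressed directly from the shell rather than in Java Driver code; the jar, class, and directory names here are hypothetical:

$ hadoop jar myjobs.jar org.example.JobA indata tmp_a && \
  hadoop jar myjobs.jar org.example.JobB tmp_a tmp_b && \
  hadoop jar myjobs.jar org.example.JobC tmp_b finalresult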

The Motivation for Oozie (2)
§ If the workflow is more complex, Driver code becomes much more difficult to maintain
§ Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
§ Example: including Hive or Pig jobs as part of the workflow

What is Oozie?
§ Oozie is a 'workflow engine'
§ Runs on a server
  – Typically outside the cluster
§ Runs workflows of Hadoop jobs
  – Including Pig, Hive, and Sqoop jobs
  – Submits those jobs to the cluster based on a workflow definition
§ Workflow definitions are submitted via HTTP
§ Jobs can be run at specific times
  – One-off or recurring jobs
§ Jobs can be run when data is present in a directory

Chapter Topics
An Introduction to Oozie
§ Introduction to Oozie
§ Creating Oozie workflows

Oozie Workflow Basics
§ Oozie workflows are written in XML
§ A workflow is a collection of actions
  – MapReduce jobs, Pig jobs, Hive jobs, etc.
§ A workflow consists of control flow nodes and action nodes
§ Control flow nodes define the beginning and end of a workflow
  – They provide methods to determine the workflow execution path
  – Example: run multiple jobs simultaneously
§ Action nodes trigger the execution of a processing task, such as
  – A MapReduce job
  – A Hive query
  – A Sqoop data import job

Simple Oozie Example (1)
§ Simple example workflow for WordCount:

[diagram: start → wordcount MapReduce action → end on success, kill on error]

Simple Oozie Example (2)

<workflow-app name='wordcount-wf' xmlns='uri:oozie:workflow:0.1'>
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
    </kill>
    <end name='end'/>
</workflow-app>

Simple Oozie Example (3)

<workflow-app name='wordcount-wf' xmlns='uri:oozie:workflow:0.1'>
    ...
</workflow-app>

A workflow is wrapped in the workflow-app entity.

Simple Oozie Example (4)

<start to='wordcount'/>

The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by transitioning to the wordcount workflow node.

Simple Oozie Example (5)

<action name='wordcount'>
    <map-reduce>
        ...
    </map-reduce>
    ...
</action>

The wordcount action node defines a map-reduce action – a standard Java MapReduce job.

Simple Oozie Example (6)

<configuration>
    <property>
        <name>mapred.mapper.class</name>
        <value>org.myorg.WordCount.Map</value>
    </property>
    ...
</configuration>

Within the action, we define the job's properties.

Simple Oozie Example (7)

<ok to='end'/>
<error to='kill'/>

We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails, we go to the kill node.

Simple Oozie Example (8)

<kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
</kill>

If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.

Simple Oozie Example (9)

<end name='end'/>

Every workflow must have an end node. This indicates that the workflow has completed successfully.

Other Oozie Control Nodes
§ A decision control node allows Oozie to determine the workflow execution path based on some criteria
  – Similar to a switch-case statement
§ fork and join control nodes split one execution path into multiple execution paths which run concurrently
  – fork splits the execution path
  – join waits for all concurrent execution paths to complete before proceeding
  – fork and join are used in pairs (see the sketch below)
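A minimal sketch of a fork/join pair inside a workflow definition; the node and action names are hypothetical, but fork, path, and join are standard Oozie workflow elements:

<fork name='parallel-jobs'>
    <path start='jobA'/>
    <path start='jobB'/>
</fork>
<!-- the jobA and jobB actions each transition to the join on success -->
<join name='merge-results' to='final-job'/>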

Oozie Workflow Action Nodes

Node Name    Description
map-reduce   Runs either a Java MapReduce or Streaming job
fs           Create directories, move or delete files or directories
java         Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig          Runs a Pig script
hive         Runs a Hive query
sqoop        Runs a Sqoop job
email        Sends an e-mail message

Submitting an Oozie Workflow
§ To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie-server>/oozie \
    -config config_file -run

§ Oozie can also be called from within a Java program
  – Via the Oozie client API
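The config_file passed with -config is a Java properties file supplying the workflow's parameters; a brief sketch (the hostnames and paths are placeholders, and oozie.wf.application.path is the standard property naming the workflow's directory in HDFS):

nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8021
inputDir=/user/fred/input
outputDir=/user/fred/output
oozie.wf.application.path=${nameNode}/user/fred/wordcount-wf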

More on Oozie
§ Oozie installation and configuration: CDH Installation Guide, http://docs.cloudera.com
§ Oozie workflows and actions: https://oozie.apache.org
§ The procedure for running a MapReduce job using Oozie: https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html
§ Oozie examples: included in the Oozie distribution; instructions for running them at http://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html

Key Points
§ Oozie is a workflow engine for Hadoop
§ Supports Java and Streaming MapReduce jobs, Sqoop jobs, Hive queries, Pig scripts, and HDFS file manipulation

Bibliography
The following offers more information on topics discussed in this chapter:
§ "Introduction to Oozie" article
  – http://www.infoq.com/articles/introductionOozie

Introduction to Pig
Chapter 8.3

Introduction to Pig
§ The key features Pig offers
§ How to use Pig for data processing and analysis
§ How to use Pig interactively and in batch mode

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

Apache Pig Overview
§ Apache Pig is a platform for data analysis and processing on Hadoop
  – It offers an alternative to writing MapReduce code directly
§ Originally developed as a research project at Yahoo
  – Goals: flexibility, productivity, and maintainability
  – Now an open-source Apache project

The Anatomy of Pig
§ Main components of Pig
  – The data flow language (Pig Latin)
  – The interactive shell where you can type Pig Latin statements (Grunt)
  – The Pig interpreter and execution engine

Pig Latin Script:

AllSales = LOAD 'sales' AS (cust, price);
BigSales = FILTER AllSales BY price > 100;
STORE BigSales INTO 'myreport';

Pig Interpreter / Execution Engine:
  – Preprocess and parse Pig Latin
  – Check data types
  – Make optimizations
  – Plan execution
  – Generate MapReduce jobs
  – Submit job(s) to Hadoop
  – Monitor progress

Where to Get Pig
§ CDH (Cloudera's Distribution including Apache Hadoop) is the easiest way to install Hadoop and Pig
  – A Hadoop distribution which includes core Hadoop, Pig, Hive, Sqoop, HBase, Oozie, and other ecosystem components
  – Available as RPMs, Ubuntu/Debian/SuSE packages, or a tarball
  – Simple installation
  – 100% free and open source
§ Installation is outside the scope of this course
  – Cloudera offers a training course for system administrators, Cloudera Administrator Training for Apache Hadoop

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

Pig Features
§ Pig is an alternative to writing low-level MapReduce code
§ Many features enable sophisticated analysis and processing (see the sketch below)
  – HDFS manipulation
  – UNIX shell commands
  – Relational operations
  – Positional references for fields
  – Common mathematical functions
  – Support for custom functions and data formats
  – Complex data structures
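A few of these features in a single hypothetical Grunt session (the file and field names are illustrative):

grunt> sales   = LOAD 'sales' USING PigStorage(',') AS (name:chararray, price:int);
grunt> taxed   = FOREACH sales GENERATE name, ROUND(price * 1.07);  -- math function
grunt> by_name = GROUP sales BY name;                               -- relational operation
grunt> totals  = FOREACH by_name GENERATE group, SUM(sales.price);
grunt> firsts  = FOREACH sales GENERATE $0;                         -- positional reference
grunt> DUMP totals;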

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

How Are Organizations Using Pig?
§ Many organizations use Pig for data analysis
  – Finding relevant records in a massive data set
  – Querying multiple data sets
  – Calculating values from input data
§ Pig is also frequently used for data processing
  – Reorganizing an existing data set
  – Joining data from multiple sources to produce a new data set

Use Case: Web Log Sessionization
§ Pig can help you extract valuable information from Web server log files

Web Server Log Data:

10.174.57.241 - - [03/May/2013:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
10.218.46.19 - - [03/May/2013:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
10.174.57.241 - - [03/May/2013:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
10.32.51.237 - - [03/May/2013:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
10.174.57.241 - - [03/May/2013:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
10.157.96.181 - - [03/May/2013:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
10.174.57.241 - - [03/May/2013:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
10.174.57.241 - - [03/May/2013:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"

[diagram: processing the logs yields clickstream data for user sessions – e.g., recent activity for one user: search for 'Widget' → widget results → details for Widget X → order Widget X → track order → contact us → send complaint]

Use Case: Data Sampling
§ Sampling can help you explore a representative portion of a large data set
  – Allows you to examine this portion with tools that do not scale well
  – Supports faster iterations during development of analysis jobs

[diagram: random sampling reduces a 100 TB data set to a 50 MB subset]
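Pig's built-in SAMPLE operator does exactly this kind of random selection; a brief sketch with an illustrative input name and sampling fraction:

grunt> full   = LOAD 'weblogs' AS (line:chararray);
grunt> subset = SAMPLE full 0.01;  -- keep roughly 1% of the records
grunt> STORE subset INTO 'weblogs_sample';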

Use Case: ETL Processing
§ Pig is also widely used for Extract, Transform, and Load (ETL) processing

[diagram: data from operations, accounting, and call center systems flows through Pig jobs running on the Hadoop cluster – which validate data, fix errors, remove duplicates, and encode values – and into a data warehouse]

Chapter Topics
Introduction to Pig
§ What is Pig?
§ Pig's Features
§ Pig Use Cases
§ Interacting with Pig

Using Pig Interactively
§ You can use Pig interactively, via the Grunt shell
  – Pig interprets each Pig Latin statement as you type it
  – Execution is delayed until output is required
  – Very useful for ad hoc data inspection
§ Example of how to start, use, and exit Grunt:

$ pig
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> quit;

§ Can also execute a Pig Latin statement from the UNIX shell via the -e option

8-­‐61  

Interacting with HDFS
§ You can manipulate HDFS with Pig, via the fs command

grunt> fs -mkdir sales/;
grunt> fs -put europe.txt sales/;
grunt> allsales = LOAD 'sales' AS (name, price);
grunt> bigsales = FILTER allsales BY price > 100;
grunt> STORE bigsales INTO 'myreport';
grunt> fs -getmerge myreport/ bigsales.txt;

Interacting with UNIX
§ The sh command lets you run UNIX programs from Pig

grunt> sh date;
Fri May 10 13:05:31 PDT 2013
grunt> fs -ls;    -- lists HDFS files
grunt> sh ls;     -- lists local files

Running Pig Scripts
§ A Pig script is simply Pig Latin code stored in a text file
  – By convention, these files have the .pig extension
§ You can run a Pig script from within the Grunt shell via the run command
  – This is useful for automation and batch execution

grunt> run salesreport.pig;

§ It is common to run a Pig script directly from the UNIX shell:

$ pig salesreport.pig

MapReduce and Local Modes
§ As described earlier, Pig turns Pig Latin into MapReduce jobs
  – Pig submits those jobs for execution on the Hadoop cluster
§ It is also possible to run Pig in 'local mode' using the -x flag
  – This runs MapReduce jobs on the local machine instead of the cluster
  – Local mode uses the local filesystem instead of HDFS
  – Can be helpful for testing before deploying a job to production

$ pig -x local                     -- interactive
$ pig -x local salesreport.pig     -- batch

Client-Side Log Files
§ If a job fails, Pig may produce a log file to explain why
  – These log files are typically produced in your current working directory on the local (client) machine

Essential Points
§ Pig offers an alternative to writing MapReduce code directly
  – Pig interprets Pig Latin code in order to create MapReduce jobs
  – It then submits these MapReduce jobs to the Hadoop cluster
§ You can execute Pig Latin code interactively through Grunt
  – Pig delays job execution until output is required
§ It is also common to store Pig Latin code in a script for batch execution
  – Allows for automation and code reuse

Bibliography
The following offer more information on topics discussed in this chapter:
§ Apache Pig Web Site
  – http://pig.apache.org/
§ Process a Million Songs with Apache Pig
  – http://tiny.cloudera.com/dac03a
§ Powered By Pig
  – http://tiny.cloudera.com/dac03b
§ LinkedIn: User Engagement Powered By Apache Pig and Hadoop
  – http://tiny.cloudera.com/dac03c
§ Programming Pig (book)
  – http://tiny.cloudera.com/dac03d

Bibliography (cont'd)
§ The original paper on Pig, published by Yahoo in 2008:
  – http://www.research.yahoo.com/files/sigmod08.pdf
