Lecture 7 - CS 246h

January 17, 2017 | Author: jeromeku | Category: N/A

Stanford CS 246H: Mining Massive Data Sets Hadoop Lab

Stanford CS 246H Winter '14

Machine Learning & Hadoop

Peanut Butter and Chocolate?
• The Promise of Big Data™
• Sounds great, but how?
  • Hadoop talent pool is small
  • ML talent pool is tiny
• Tools and toolkits starting to appear
  • Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.
• Summary: Hadoop is hard, and ML is hard
  1. Lots of people/companies are trying to make it easy
  2. Don't believe anyone who tells you they make it easy

Hadoop & ML: A Brief History
• 2005 – Taste project started on SourceForge
• 2007 – Mahout project started at Apache
• 2008 – Taste donated to Mahout
• … time passes …
• 2012 – Myrrix is launched
• 2013 – Cloudera ML project started on GitHub
• Late 2013 – Oryx project started on GitHub

Hadoop ML Family Tree
[Figure: family tree diagram with nodes Andrew Ng, Taste, Lucene, Mahout, Cloudera ML, Myrrix, Oryx]

Apache Mahout

What is Mahout?
• "Scalable machine learning"
  • not just Hadoop-oriented machine learning
  • not entirely, that is. Just mostly.
• Components
  • math library
  • clustering
  • classification
  • decompositions
  • recommendations

©MapR Technologies 2013

Mahout Math
• Goals are
  • basic linear algebra,
  • and statistical sampling,
  • and good clustering,
  • decent speed,
  • extensibility,
  • especially for sparse data
• But not
  • totally badass speed
  • comprehensive set of algorithms
  • optimization, root finders, quadrature

Caveat Emptor
• Mahout is a toolkit
• There is a command line interface
  • You can't always use it
• Very often end up writing code
• Documentation is… ahem… scant
  • Best reference is Mahout in Action
• Varying levels of maturity
• Varying levels of Hadoop support

Matrices and Vectors
• At the core:
  • DenseVector, RandomAccessSparseVector
  • DenseMatrix, SparseRowMatrix
• Highly composable API
  m.viewDiagonal().assign(v)
• Important ideas:
  • view*, assign and aggregate
  • iteration

Assign? View?
• Why assign?
  • Copying is the major cost for naïve matrix packages
  • In-place operations critical to reasonable performance
  • Many kinds of updates required, so functional style very helpful
• Why view?
  • In-place operations often required for blocks, rows, columns or diagonals
  • With views, we need #assign + #views methods
  • Without views, we need #assign x #views methods
• Synergies
  • With both views and assign, many loops become single line

Assign
• Matrices
  Matrix assign(double value);
  Matrix assign(double[][] values);
  Matrix assign(Matrix other);
  Matrix assign(DoubleFunction f);
  Matrix assign(Matrix other, DoubleDoubleFunction f);
• Vectors
  Vector assign(double value);
  Vector assign(double[] values);
  Vector assign(Vector other);
  Vector assign(DoubleFunction f);
  Vector assign(Vector other, DoubleDoubleFunction f);
  Vector assign(DoubleDoubleFunction f, double y);

Views
• Matrices
  Matrix viewPart(int[] offset, int[] size);
  Matrix viewPart(int row, int rlen, int col, int clen);
  Vector viewRow(int row);
  Vector viewColumn(int column);
  Vector viewDiagonal();
• Vectors
  Vector viewPart(int offset, int length);

Aggregates
• Matrices
  double zSum();
  Vector aggregateRows(VectorFunction f);
  Vector aggregateColumns(VectorFunction f);
  double aggregate(DoubleDoubleFunction combiner, DoubleFunction mapper);
• Vectors
  double zSum();
  double aggregate(DoubleDoubleFunction reduce, DoubleFunction map);
  double aggregate(Vector other,
      DoubleDoubleFunction aggregator,
      DoubleDoubleFunction combiner);

Predefined Functions
• Many handy functions
  ABS, ACOS, ASIN, ATAN, CEIL, COS, EXP, FLOOR, IDENTITY, INV, LOGARITHM,
  LOG2, NEGATE, RINT, SIGN, SIN, SQRT, SQUARE, SIGMOID, SIGMOIDGRADIENT, TAN

Examples
• A = α
  double alpha;
  a.assign(alpha);
• A = αB + β
  a.assign(b, Functions.chain(
      Functions.plus(beta),
      Functions.mult(alpha)));
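The A = αB + β example composes two unary functions so that each element of b is scaled by alpha first and shifted by beta second, and the result is written into a without a temporary. A minimal sketch of that composition in plain Java, using java.util.function.DoubleUnaryOperator as a stand-in for Mahout's function types. The class ChainSketch and its helpers plus, mult, chain and assign are illustrative names chosen to echo the slide; they are not Mahout's API.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

class ChainSketch {
    // Hypothetical helpers that mirror the names used on the slide.
    static DoubleUnaryOperator plus(double c) { return x -> x + c; }
    static DoubleUnaryOperator mult(double c) { return x -> x * c; }

    // chain(g, h) applies h first and then g, i.e. x -> g(h(x)).
    static DoubleUnaryOperator chain(DoubleUnaryOperator g, DoubleUnaryOperator h) {
        return g.compose(h);
    }

    // In-place update of a from b: a[i] = f(b[i]). No temporary array is allocated.
    static double[] assign(double[] a, double[] b, DoubleUnaryOperator f) {
        for (int i = 0; i < a.length; i++) {
            a[i] = f.applyAsDouble(b[i]);
        }
        return a;
    }

    public static void main(String[] args) {
        double alpha = 2.0;
        double beta = 1.0;
        double[] a = new double[3];
        double[] b = {1.0, 2.0, 3.0};
        assign(a, b, chain(plus(beta), mult(alpha)));
        System.out.println(Arrays.toString(a)); // prints [3.0, 5.0, 7.0]
    }
}
```

The order of the arguments to chain matters: swapping them would compute α(B + β) instead.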

Sparse Optimizations
• DoubleDoubleFunction abstract properties
  public boolean isLikeRightPlus();
  public boolean isLikeLeftMult();
  public boolean isLikeRightMult();
  public boolean isLikeMult();
  public boolean isCommutative();
  public boolean isAssociative();
  public boolean isAssociativeAndCommutative();
  public boolean isDensifying();
• And Vector properties
  public boolean isDense();
  public boolean isSequentialAccess();
  public double getLookupCost();
  public double getIteratorAdvanceCost();
  public boolean isAddConstantTime();

Examples
• The trace of a matrix
  m.viewDiagonal().zSum()
• Set diagonal to zero
  m.viewDiagonal().assign(0)
• Set diagonal to negative of row sums excluding the diagonal
  Vector diag = m.viewDiagonal().assign(0);
  diag.assign(m.rowSums().assign(Functions.NEGATE));
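The three one-liners above work because a view is a window onto the matrix's own storage, so assigning through the view mutates the matrix. A minimal sketch of that idea over a plain double[][], assuming a non-empty matrix. DiagonalViewSketch, DiagonalView, rowSums and negate are illustrative stand-ins and not Mahout classes.

```java
import java.util.Arrays;

class DiagonalViewSketch {
    // A view keeps a reference to the backing array, so writes go straight through.
    static final class DiagonalView {
        private final double[][] m;

        DiagonalView(double[][] m) { this.m = m; }

        int size() { return Math.min(m.length, m[0].length); }

        // Sum of the viewed elements; on the diagonal this is the trace.
        double zSum() {
            double s = 0;
            for (int i = 0; i < size(); i++) s += m[i][i];
            return s;
        }

        // Set every viewed element to the same value, in place.
        DiagonalView assign(double value) {
            for (int i = 0; i < size(); i++) m[i][i] = value;
            return this;
        }

        // Copy values element by element into the viewed positions, in place.
        DiagonalView assign(double[] values) {
            for (int i = 0; i < size(); i++) m[i][i] = values[i];
            return this;
        }
    }

    static double[] rowSums(double[][] m) {
        double[] sums = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (double x : m[i]) sums[i] += x;
        }
        return sums;
    }

    // Negates v in place and returns it, so calls can be chained.
    static double[] negate(double[] v) {
        for (int i = 0; i < v.length; i++) v[i] = -v[i];
        return v;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        DiagonalView diag = new DiagonalView(m);
        System.out.println(diag.zSum()); // trace: prints 15.0

        // Zero the diagonal first so the row sums exclude it, then write back the negated sums.
        diag.assign(0).assign(negate(rowSums(m)));
        System.out.println(Arrays.toString(new double[] {m[0][0], m[1][1], m[2][2]})); // prints [-5.0, -10.0, -15.0]
    }
}
```

With one view type and three assign overloads this needs four methods; a package without views would need a separate diagonal variant of every assign.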

Iteration
• Matrices are Iterable in Mahout
  // compute both row and column sums in one pass
  for (MatrixSlice row : m) {
    rSums.set(row.index(), row.zSum());
    cSums.assign(row, Functions.PLUS);
  }
• Vectors are densely or sparsely iterable
  double entropy = 0;
  for (Vector.Element e : v.iterateNonZero()) {
    entropy -= e.get() * Math.log(e.get());
  }
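Iterating only over nonzero entries is what keeps the entropy loop cheap on sparse data, since a zero entry contributes nothing to the sum. A minimal sketch in plain Java, where a Map from index to value stands in for a sparse vector and the values are assumed to be probabilities. SparseEntropySketch is an illustrative name, and the sketch uses the conventional sign, H = -Σ p ln p.

```java
import java.util.Map;
import java.util.TreeMap;

class SparseEntropySketch {
    // Sparse vector stand-in: only nonzero entries are stored, keyed by index.
    static double entropy(Map<Integer, Double> nonZeros) {
        double h = 0;
        for (double p : nonZeros.values()) {
            // Zero entries are never visited, so Math.log(0) is never evaluated.
            h -= p * Math.log(p);
        }
        return h;
    }

    public static void main(String[] args) {
        // A vector of nominal length 1,000,000 with just two nonzero probabilities.
        Map<Integer, Double> v = new TreeMap<>();
        v.put(3, 0.5);
        v.put(999_999, 0.5);
        System.out.println(entropy(v)); // ln 2, roughly 0.693
    }
}
```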

Random Sampling
• Samples from some type
  public interface Sampler<T> {
    T sample();
  }

  public abstract class AbstractSamplerFunction
      extends DoubleFunction
      implements Sampler<Double>
• Lots of kinds
  ChineseRestaurant, Empirical, IndianBuffet, Missing, Multinomial, MultiNormal, Normal, PoissonSampler, Sampler

Mahout Math Summary
• Matrices, Vectors
  • views
  • in-place assignment
  • aggregations
  • iterations
• Functions
  • lots built-in
  • cooperate with sparse vector optimizations
• Sampling
  • abstract samplers
  • samplers as functions
• Other stuff … clustering, SVD

Other Stuff
• Matrix Decomposition
• Classification
• Clustering
• Recommendations

Focus: Machine Learning
[Figure: Mahout architecture diagram. Top: Applications and Examples. Algorithm modules: Genetic, Freq. Pattern Mining, Classification, Clustering, Recommenders. Supporting layers: Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (primitives), Apache Hadoop.]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
©Lucid Imagination 2010

Prepare Data from Raw content
• Data Sources:
  • Lucene integration
    • bin/mahout lucenevector …
  • Document Vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  • Programmatically
    • See the Utils module in Mahout
  • Database
  • File system

Recommendations
• Extensive framework for collaborative filtering
• Recommenders
  • User based, Item based, ALS, SlopeOne, SVD, others
• Online and Offline support
  • Offline can utilize Hadoop
• Many different Similarity measures
  • Cosine, LLR, Tanimoto, Pearson, others

Clustering
• Document level
  • Group documents based on a notion of similarity
  • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  • Distance Measures
    • Manhattan, Euclidean, other
• Topic Modeling
  • Cluster words across documents to identify topics
  • Latent Dirichlet Allocation

Categorization
• Place new items into predefined categories:
  • Sports, politics, entertainment
• Mahout has several implementations
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Decision Forests
  • Logistic Regression (SGD)

Freq. Pattern Mining
• Identify frequently co-occurrent items
• Useful for:
  • Query Recommendations
    • Apple -> iPhone, orange, OS X
  • Related product placement
    • "Beer and Diapers"
    • http://www.amazon.com
  • Spam Detection
    • Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam

Evolutionary
• Map-Reduce ready fitness functions for genetic programming
• Integration with Watchmaker
  • http://watchmaker.uncommons.org/index.php
• Problems solved:
  • Traveling salesman
  • Class discovery
  • Many others

Singular Value Decomposition
• Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts
• Mahout has fully distributed Lanczos implementation
  /bin/mahout svd -Dmapred.input.dir=path/to/corpus --tempDir path/for/svd-output --rank 300 --numColumns --numRows
  /bin/mahout cleansvd --eigenInput path/for/svd-output --corpusInput path/to/corpus --output path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0
• https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

How to: Command Line
• Most algorithms have a Driver program
  • Shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the Data
  • Different algorithms require different setup
• Run the algorithm
  • Single Node
  • Hadoop
• Print out the results
  • Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.

Ugly Demo II - Prep
• Data Set: Reuters
  • http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
• Convert to Sequence File:
  bin/mahout seqdirectory --input --output --charset UTF-8
• Convert to Sparse Vector:
  bin/mahout seq2sparse --input /content/reuters/seqfiles/ --norm 2 --weight TF --output /content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90

Ugly Demo II: Topic Modeling
• Latent Dirichlet Allocation
  ./mahout lda --input /content/reuters/seqfiles-TF/vectors/ --output /content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 10
  ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state-19 --dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
• Good feature reduction (stopword removal) required

Ugly Demo III: Clustering
• K-Means
  • Same Prep as UD II, except use TFIDF weight
  ./mahout kmeans --input /content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output /content/reuters/seqfiles-TFIDF/output-kmeans --clusters /content/reuters/seqfiles-TFIDF/output-kmeans/clusters
• Print out the clusters:
  ./mahout clusterdump --seqFileDir /content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

Ugly Demo IV: Frequent Pattern Mining
• Data: http://fimi.cs.helsinki.fi/data/
• ./mahout fpg -i /content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
• ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

Cloudera ML

Cloudera ML
• Collection of Java libraries and command-line tools
• Goal: make data scientists more productive with CDH
  • Exploratory data analysis
  • Data preparation
  • Model fitting
  • Model evaluation
• Apache 2.0 licensed
• Developed on GitHub
  • http://github.com/cloudera/ml

Cloudera ML: Building Blocks
• Apache Hadoop
  • Scalable data storage (HDFS) and processing (MapReduce)
• Apache Hive
  • Metadata for structured data in HDFS
• Apache Crunch
  • Easy MapReduce pipelines
• Apache Mahout
  • Vector interface
• Apache Avro
  • Serialization format

Cloudera ML Workflow: Clustering

Cloudera ML: summary
• client/bin/ml summary
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --header-file examples/kdd99/header.csv (local FS)
    --summary-file examples/kdd99/s.json (local FS)

Cloudera ML: summary
[Figure: step 1, summary. HDFS holds kddcup.data_10_percent; the local FS holds header.csv and, after the step, s.json.]

Cloudera ML: summary
• s.json
  • Categorical features: histogram
  • Numerical features: distribution summary

Cloudera ML: normalize
• client/bin/ml normalize
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --summary-file examples/kdd99/s.json (local FS)
    --transform Z
    --output-path kdd99 (HDFS)
    --output-type avro
    --id-column category
    --compress
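The `--transform Z` option presumably selects z-score standardization, in which each numeric column is shifted by its mean and divided by its standard deviation, using statistics like those gathered by the summary step. A minimal plain-Java sketch under that assumption. ZScoreSketch and its methods are illustrative and are not part of Cloudera ML, and the choice of the population standard deviation is also an assumption.

```java
import java.util.Arrays;

class ZScoreSketch {
    static double mean(double[] col) {
        double sum = 0;
        for (double x : col) sum += x;
        return sum / col.length;
    }

    // Population standard deviation (divides by n, not n - 1).
    static double stdDev(double[] col, double mean) {
        double ss = 0;
        for (double x : col) ss += (x - mean) * (x - mean);
        return Math.sqrt(ss / col.length);
    }

    // Returns a new column with mean 0 and standard deviation 1.
    // A constant column has no spread, so it maps to all zeros.
    static double[] zTransform(double[] col) {
        double mu = mean(col);
        double sigma = stdDev(col, mu);
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) {
            out[i] = sigma == 0 ? 0 : (col[i] - mu) / sigma;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] durations = {2, 4, 4, 4, 5, 5, 7, 9}; // mean 5, standard deviation 2
        System.out.println(Arrays.toString(zTransform(durations)));
        // prints [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
    }
}
```

Standardizing matters for the later clustering steps because squared Euclidean distance would otherwise be dominated by whichever column has the largest raw scale.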

Cloudera ML: normalize
[Figure: step 2, normalize. HDFS holds kddcup.data_10_percent and, after the step, kdd99/; the local FS holds header.csv and s.json.]

Cloudera ML: normalize
• kdd99/part-m-0000[0|1].avro
  • Examples (rows)
    • Part 0: 442,454 vectors
    • Part 1: 51,567 vectors
    • Total: 494,021 vectors
  • Features (columns)
    • Before: 41 fields
    • After: 143 fields

Cloudera ML: ksketch
• client/bin/ml ksketch
    --input-paths kdd99 (HDFS)
    --format avro
    --points-per-iteration 500
    --output-file wc.avro (local FS)
    --seed 1729
    --iterations 5
    --cross-folds 2

Cloudera ML: ksketch
[Figure: step 3, ksketch. HDFS holds kddcup.data_10_percent and kdd99/; the local FS holds header.csv, s.json and, after the step, wc.avro.]

Cloudera ML: ksketch
• wc.avro
  • Examples (rows)
    • 2 "folds" of 2501 examples
    • 1 initial example
    • 500 examples from each iteration (5 iterations)
    • Each example has an associated weight
  • Features (columns)
    • 143 features (still)

Cloudera ML: kmeans
• client/bin/ml kmeans
    --input-file wc.avro (local FS)
    --centers-file centers.avro (local FS)
    --seed 19
    --clusters 1,10,25,35,45
    --best-of 2
    --num-threads 4
    --eval-stats-file kmeans_stats.csv (local FS)

Cloudera ML: kmeans
[Figure: step 4, kmeans. HDFS holds kddcup.data_10_percent and kdd99/; the local FS holds header.csv, s.json, wc.avro and, after the step, centers.avro and kmeans_stats.csv.]

Cloudera ML: kmeans
• centers.avro
  • 1 row for each run of k-means++
  • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
• kmeans_stats.csv
  • Clustering quality scores

Cloudera ML: kassign
• client/bin/ml kassign
    --input-paths kdd99 (HDFS)
    --format avro
    --centers-file centers.avro (local FS)
    --center-ids 4
    --output-path assigned (HDFS)
    --output-type csv

Cloudera ML: kassign
[Figure: step 5, kassign. HDFS holds kddcup.data_10_percent, kdd99/ and, after the step, assigned/; the local FS holds header.csv, s.json, wc.avro and centers.avro.]

Cloudera ML: kassign
• assigned/part-m-0000[0|1]
  • Rows
    • Part 0: 442,454
    • Part 1: 51,567
    • Total: 494,021
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster

Cloudera ML: sample
• client/bin/ml sample
    --input-paths assigned (HDFS)
    --format text
    --header-file examples/kdd99/kassign_header.csv (local FS)
    --weight-field squared_distance
    --group-fields clustering_id,closest_center_id
    --output-type csv
    --size 20
    --output-path extremal (HDFS)

Cloudera ML: sample
[Figure: step 6, sample. HDFS holds kddcup.data_10_percent, kdd99/, assigned/ and, after the step, extremal/; the local FS holds header.csv, s.json, wc.avro, centers.avro and kassign_header.csv.]

Cloudera ML: sample
• extremal/part-r-00000
  • Rows
    • Up to 20 examples from each cluster
    • Examples that are furthest from the center of the cluster
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster
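The output described above, up to 20 points per cluster that lie furthest from their center, can be approximated by grouping rows by cluster and keeping the largest squared distances in each group. A minimal plain-Java sketch of that grouping. It is a deterministic top-N selection, which is a simplification and an assumption: the `sample` command takes a weight field and may instead draw a weighted random sample. ExtremalSketch and Assigned are illustrative names, not Cloudera ML classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ExtremalSketch {
    // One row of the kassign output, reduced to the fields the sketch needs.
    static final class Assigned {
        final String pointId;
        final int clusterId;
        final double squaredDistance;

        Assigned(String pointId, int clusterId, double squaredDistance) {
            this.pointId = pointId;
            this.clusterId = clusterId;
            this.squaredDistance = squaredDistance;
        }
    }

    // Groups rows by cluster and keeps at most `size` rows per cluster,
    // ordered from largest to smallest squared distance.
    static Map<Integer, List<Assigned>> furthestPerCluster(List<Assigned> rows, int size) {
        Map<Integer, List<Assigned>> byCluster = new TreeMap<>();
        for (Assigned r : rows) {
            byCluster.computeIfAbsent(r.clusterId, k -> new ArrayList<>()).add(r);
        }
        for (Map.Entry<Integer, List<Assigned>> e : byCluster.entrySet()) {
            List<Assigned> group = e.getValue();
            group.sort(Comparator.comparingDouble((Assigned a) -> a.squaredDistance).reversed());
            if (group.size() > size) {
                e.setValue(new ArrayList<>(group.subList(0, size)));
            }
        }
        return byCluster;
    }

    public static void main(String[] args) {
        List<Assigned> rows = List.of(
            new Assigned("normal", 0, 1.0),
            new Assigned("smurf", 0, 9.0),
            new Assigned("neptune", 0, 4.0),
            new Assigned("normal", 1, 2.0));
        furthestPerCluster(rows, 2).forEach((cluster, group) -> {
            for (Assigned a : group) {
                System.out.println(cluster + "," + a.pointId + "," + a.squaredDistance);
            }
        });
    }
}
```

Points far from every center are the natural candidates for anomalies, which is why the extremal rows are the ones worth inspecting by hand.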

Oryx

2014: Lab to Factory

Data Science Will Be Operational Analytics

I Built A Model. Now What?
[Figure: cycle of Collect Input -> Build Model -> Query Model -> Repeat]

I Built A Model On Hadoop. Now What?
[Figure: the same cycle of Collect Input -> Build Model -> Query Model -> Repeat, marked with question marks]

Example: Oryx
[Image: www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg]

Gaps to fill, and Goals
• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source

Large-Scale or Real-Time?
Large-Scale, Offline, Batch
vs
Real-Time, Online, Streaming

Why Don't We Have Both?

λ!

Lambda Architecture
• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate

… λ?

jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
[Figure: diagram with a Serving/Speed layer and a Batch layer]

Two Layers
• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds "generation" from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale

* Apache Spark later

Collaborative Filtering: ALS
• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable
[Figure: matrix factorization into factors X and Y^T]

Clustering: k-means++
• Well-known and understood
• Parallelizable
• Clusters updateable

cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
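What distinguishes k-means++ from plain k-means is the seeding step: the first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. A minimal single-machine sketch of that seeding in plain Java. KMeansPlusPlusSketch and its method names are illustrative; this is not the Oryx or Cloudera ML implementation, which the slides do not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class KMeansPlusPlusSketch {
    static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Returns the indices of the points chosen as initial centers.
    static List<Integer> seed(double[][] points, int k, Random rnd) {
        int n = points.length;
        List<Integer> centers = new ArrayList<>();
        centers.add(rnd.nextInt(n)); // first center: uniform at random

        // d2[i] is the squared distance from point i to its nearest chosen center.
        double[] d2 = new double[n];
        Arrays.fill(d2, Double.POSITIVE_INFINITY);

        while (centers.size() < k) {
            double[] last = points[centers.get(centers.size() - 1)];
            double total = 0;
            for (int i = 0; i < n; i++) {
                d2[i] = Math.min(d2[i], squaredDistance(points[i], last));
                total += d2[i];
            }
            if (total == 0) {
                break; // every point coincides with an already chosen center
            }
            // Draw an index with probability d2[i] / total.
            double r = rnd.nextDouble() * total;
            double cumulative = 0;
            int pick = -1;
            for (int i = 0; i < n; i++) {
                if (d2[i] == 0) continue; // already a center, or a duplicate of one
                cumulative += d2[i];
                pick = i;                 // also serves as a fallback against rounding
                if (r < cumulative) break;
            }
            centers.add(pick);
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(seed(points, 2, new Random(1729)));
    }
}
```

Because a chosen center has squared distance zero to itself, it can never be drawn twice, and far-away points are strongly favored, which spreads the initial centers across the data.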

Classification / Regression: RDF
• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems
[Figure: example decision tree with nodes "age > 30", "female?" and "income > 20000", and Yes/No branches]

PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support
[Figure: PMML TreeModel XML listing; the markup was lost in extraction]

www.dmg.org/v4-1/TreeModel.html

HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
• GET : query
• POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.

GET /recommend/jwills

HTTP/1.1 200 OK
Content-Type: text/plain

"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
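The exchange above is simple enough to consume from Java with the JDK's built-in HTTP client (Java 11 or later). A minimal sketch, assuming the response body is exactly the quoted-name, comma, score lines shown on the slide. RecommendClientSketch is an illustrative name, and the base URL passed to recommend is an assumption, since the slide gives only the path.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;

class RecommendClientSketch {
    // Parses lines of the form "item name",score into a map that keeps response order.
    static Map<String, Double> parse(String body) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String line : body.split("\\R")) {
            line = line.trim();
            int comma = line.lastIndexOf(',');
            if (line.isEmpty() || comma < 0) {
                continue; // skip blank or malformed lines
            }
            String name = line.substring(0, comma).replace("\"", "");
            double score = Double.parseDouble(line.substring(comma + 1).trim());
            result.put(name, score);
        }
        return result;
    }

    // baseUrl is hypothetical, for example "http://localhost:8080".
    static Map<String, Double> recommend(String baseUrl, String userId) throws Exception {
        HttpRequest request = HttpRequest
            .newBuilder(URI.create(baseUrl + "/recommend/" + userId))
            .GET()
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return parse(response.body());
    }

    public static void main(String[] args) {
        // Parse the sample body from the slide without touching the network.
        String sample = "\"Ray LaMontagne\",0.951\n\"Fleet Foxes\",0.7905\n"
            + "\"The National\",0.688\n\"Shearwater\",0.3017\n";
        parse(sample).forEach((item, score) -> System.out.println(item + " -> " + score));
    }
}
```

Splitting on the last comma rather than the first keeps item names that themselves contain commas intact.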

Wish List
• Revamp workflow
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• More component-ized
• Less black-box service
• Emphasize integration
  • PMML, etc.
• "Pull" options
  • Kafka?
  • Hive / Impala?

Open Source

github.com/cloudera/oryx
100% Apache License 2.0
