Lecture 7 - CS 246h

January 17, 2017 | Author: jeromeku | Category: N/A

Stanford CS 246H: Mining Massive Data Sets Hadoop Lab

Stanford CS 246H Winter ‘14

Machine Learning & Hadoop

Peanut Butter and Chocolate?

• The Promise of Big Data™
• Sounds great, but how?
  • Hadoop talent pool is small
  • ML talent pool is tiny
• Tools and toolkits starting to appear
  • Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.
• Summary: Hadoop is hard, and ML is hard
  1. Lots of people/companies are trying to make it easy
  2. Don’t believe anyone who tells you they make it easy

Hadoop & ML: A Brief History

• 2005 – Taste project started on SourceForge
• 2007 – Mahout project started at Apache
• 2008 – Taste donated to Mahout
• … time passes …
• 2012 – Myrrix is launched
• 2013 – Cloudera ML project started on GitHub
• Late 2013 – Oryx project started on GitHub

Hadoop ML Family Tree

[Figure: family tree relating Andrew Ng, Taste, Lucene, Mahout, Cloudera ML, Myrrix, and Oryx]

Apache Mahout

What is Mahout?

• “Scalable machine learning”
  • not just Hadoop-oriented machine learning
  • not entirely, that is. Just mostly.
• Components
  • math library
  • clustering
  • classification
  • decompositions
  • recommendations

©MapR Technologies 2013

Mahout Math

• Goals are
  • basic linear algebra,
  • and statistical sampling,
  • and good clustering,
  • decent speed,
  • extensibility,
  • especially for sparse data
• But not
  • totally badass speed
  • comprehensive set of algorithms
  • optimization, root finders, quadrature

Caveat Emptor

• Mahout is a toolkit
• There is a command line interface
  • You can’t always use it
• Very often end up writing code
• Documentation is… ahem… scant
  • Best reference is Mahout in Action
• Varying levels of maturity
• Varying levels of Hadoop support

Matrices and Vectors

• At the core:
  • DenseVector, RandomAccessSparseVector
  • DenseMatrix, SparseRowMatrix
• Highly composable API
    m.viewDiagonal().assign(v)
• Important ideas:
  • view*, assign and aggregate
  • iteration

Assign? View?

• Why assign?
  • Copying is the major cost for naïve matrix packages
  • In-place operations critical to reasonable performance
  • Many kinds of updates required, so functional style very helpful
• Why view?
  • In-place operations often required for blocks, rows, columns or diagonals
  • With views, we need #assign + #views methods
  • Without views, we need #assign x #views methods
• Synergies
  • With both views and assign, many loops become single line
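The "#assign + #views" argument can be sketched without Mahout at all. The code below is a minimal illustration on plain Java arrays; `StridedView` and `Views` are hypothetical names invented for this sketch, not Mahout classes. Three kinds of view and three operations need six methods here, where a view-less API would need nine.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical stand-in for a vector view: a strided window onto a backing array. */
final class StridedView {
    private final double[] data;
    private final int offset;
    private final int stride;
    private final int length;

    StridedView(double[] data, int offset, int stride, int length) {
        this.data = data;
        this.offset = offset;
        this.stride = stride;
        this.length = length;
    }

    /** Overwrites every viewed cell in place; nothing is copied. */
    StridedView assign(double value) {
        for (int i = 0; i < length; i++) {
            data[offset + i * stride] = value;
        }
        return this;
    }

    /** Applies f to every viewed cell in place. */
    StridedView assign(DoubleUnaryOperator f) {
        for (int i = 0; i < length; i++) {
            int k = offset + i * stride;
            data[k] = f.applyAsDouble(data[k]);
        }
        return this;
    }

    /** Sums the viewed cells. */
    double zSum() {
        double sum = 0;
        for (int i = 0; i < length; i++) {
            sum += data[offset + i * stride];
        }
        return sum;
    }
}

/** View factories for a row-major n x n matrix stored in a single array. */
final class Views {
    static StridedView row(double[] m, int n, int r) {
        return new StridedView(m, r * n, 1, n);
    }

    static StridedView column(double[] m, int n, int c) {
        return new StridedView(m, c, n, n);
    }

    static StridedView diagonal(double[] m, int n) {
        return new StridedView(m, 0, n + 1, n);
    }
}
```

With these in place, zeroing a column is `Views.column(m, n, c).assign(0)` and the trace is `Views.diagonal(m, n).zSum()`, each a single line with no copying.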

Assign

• Matrices
    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);
• Vectors
    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);

Views

• Matrices
    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();
• Vectors
    Vector viewPart(int offset, int length);

Aggregates

• Matrices
    double zSum();
    Vector aggregateRows(VectorFunction f);
    Vector aggregateColumns(VectorFunction f);
    double aggregate(DoubleDoubleFunction combiner, DoubleFunction mapper);
• Vectors
    double zSum();
    double aggregate(DoubleDoubleFunction reduce, DoubleFunction map);
    double aggregate(Vector other, DoubleDoubleFunction aggregator, DoubleDoubleFunction combiner);

Predefined Functions

• Many handy functions
    ABS, ACOS, ASIN, ATAN, CEIL, COS, EXP, FLOOR, IDENTITY, INV, LOGARITHM,
    LOG2, NEGATE, RINT, SIGN, SIN, SQRT, SQUARE, SIGMOID, SIGMOIDGRADIENT, TAN

Examples

• A = α
    double alpha;
    a.assign(alpha);
• A = αB + β
    a.assign(b, Functions.chain(
        Functions.plus(beta),
        Functions.mult(alpha)));
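The A = αB + β example relies on composing small functions. Here is a rough, Mahout-free sketch of the same idea using `java.util.function.DoubleUnaryOperator`; the class `ChainSketch` and its helpers are hypothetical and only mirror the names on the slide. It assumes `chain(g, h)` means "apply h first, then g".

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical helpers mirroring the slide's plus, mult and chain. */
final class ChainSketch {
    static DoubleUnaryOperator mult(double alpha) {
        return x -> alpha * x;
    }

    static DoubleUnaryOperator plus(double beta) {
        return x -> x + beta;
    }

    /** Assumed semantics: chain(g, h) maps x to g(h(x)). */
    static DoubleUnaryOperator chain(DoubleUnaryOperator g, DoubleUnaryOperator h) {
        return g.compose(h);
    }

    /** Writes f(b[i]) into a[i] for every i, reusing a's storage. */
    static void assign(double[] a, double[] b, DoubleUnaryOperator f) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        for (int i = 0; i < a.length; i++) {
            a[i] = f.applyAsDouble(b[i]);
        }
    }
}
```

Calling `ChainSketch.assign(a, b, ChainSketch.chain(ChainSketch.plus(beta), ChainSketch.mult(alpha)))` multiplies by alpha first and then adds beta, which is the order the formula requires.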

Sparse Optimizations

• DoubleDoubleFunction abstract properties
    public boolean isLikeRightPlus();
    public boolean isLikeLeftMult();
    public boolean isLikeRightMult();
    public boolean isLikeMult();
    public boolean isCommutative();
    public boolean isAssociative();
    public boolean isAssociativeAndCommutative();
    public boolean isDensifying();
• And Vector properties
    public boolean isDense();
    public boolean isSequentialAccess();
    public double getLookupCost();
    public double getIteratorAdvanceCost();
    public boolean isAddConstantTime();

Examples

• The trace of a matrix
    m.viewDiagonal().zSum()
• Set diagonal to zero
    m.viewDiagonal().assign(0)
• Set diagonal to negative of row sums excluding the diagonal
    Vector diag = m.viewDiagonal().assign(0);
    diag.assign(m.rowSums().assign(Functions.NEGATE));
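For readers without Mahout at hand, the three diagonal examples can be spelled out on a plain `double[][]`. This is only a sketch of what the one-liners compute; `DiagonalOps` is a hypothetical name, and a square matrix is assumed.

```java
/** Plain-array versions of the diagonal examples; assumes a square matrix. */
final class DiagonalOps {
    /** Sum of the diagonal entries. */
    static double trace(double[][] m) {
        double sum = 0;
        for (int i = 0; i < m.length; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    /** Sets each diagonal entry to minus the sum of the other entries in its row. */
    static void setDiagonalToNegativeRowSums(double[][] m) {
        for (int i = 0; i < m.length; i++) {
            m[i][i] = 0; // exclude the old diagonal value from the row sum
            double rowSum = 0;
            for (int j = 0; j < m[i].length; j++) {
                rowSum += m[i][j];
            }
            m[i][i] = -rowSum;
        }
    }
}
```

After `setDiagonalToNegativeRowSums`, every row sums to zero, which is the usual reason for wanting this transformation.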

Iteration

• Matrices are Iterable in Mahout
    // compute both row and column sums in one pass
    for (MatrixSlice row: m) {
      rSums.set(row.index(), row.zSum());
      cSums.assign(row, Functions.PLUS);
    }
• Vectors are densely or sparsely iterable
    // entropy is -sum(p * log p) over the non-zero entries
    double entropy = 0;
    for (Vector.Element e: v.iterateNonZero()) {
      entropy -= e.get() * Math.log(e.get());
    }

Random Sampling

• Samples from some type
    public interface Sampler<T> {
      T sample();
    }

    public abstract class AbstractSamplerFunction
        extends DoubleFunction
        implements Sampler<Double>
• Lots of kinds
    ChineseRestaurant, Empirical, IndianBuffet, Missing, Multinomial,
    MultiNormal, Normal, PoissonSampler, Sampler
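To make the `Sampler` idea concrete, here is a minimal self-contained sketch of one sampler: drawing an index with probability proportional to a weight. `WeightedIndexSampler` is a hypothetical class written for illustration, not one of the Mahout samplers listed above; the local `Sampler<T>` interface just mirrors the shape shown on the slide.

```java
import java.util.Random;

/** Local copy of the one-method interface from the slide. */
interface Sampler<T> {
    T sample();
}

/** Hypothetical sampler: returns index i with probability weights[i] / sum(weights). */
final class WeightedIndexSampler implements Sampler<Integer> {
    private final double[] cumulative;
    private final double total;
    private final Random rng;

    WeightedIndexSampler(double[] weights, long seed) {
        cumulative = new double[weights.length];
        double running = 0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] < 0) {
                throw new IllegalArgumentException("negative weight");
            }
            running += weights[i];
            cumulative[i] = running;
        }
        if (running <= 0) {
            throw new IllegalArgumentException("weights must have a positive sum");
        }
        total = running;
        rng = new Random(seed);
    }

    @Override
    public Integer sample() {
        // Pick a point in [0, total) and find the first bucket that contains it.
        double u = rng.nextDouble() * total;
        for (int i = 0; i < cumulative.length; i++) {
            if (u < cumulative[i]) {
                return i;
            }
        }
        return cumulative.length - 1; // guards against floating-point round-off
    }
}
```

The linear scan keeps the sketch short; a binary search over `cumulative` would be the natural change for many categories.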

Mahout Math Summary

• Matrices, Vectors
  • views
  • in-place assignment
  • aggregations
  • iterations
• Functions
  • lots built-in
  • cooperate with sparse vector optimizations
• Sampling
  • abstract samplers
  • samplers as functions
• Other stuff … clustering, SVD

Other Stuff

• Matrix Decomposition
• Classification
• Clustering
• Recommendations

Focus: Machine Learning

[Figure: Mahout component diagram with boxes for Applications, Examples, Genetic, Freq. Pattern Mining, Classification, Clustering, Recommenders, Utilities (Lucene/Vectorizer), Math (Vectors/Matrices/SVD), Collections (primitives), and Apache Hadoop]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

©Lucid Imagination 2010

Prepare Data from Raw content

• Data Sources:
  • Lucene integration
    • bin/mahout lucenevector …
  • Document Vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  • Programmatically
    • See the Utils module in Mahout
  • Database
  • File system

Recommendations

• Extensive framework for collaborative filtering
• Recommenders
  • User based, Item based, ALS, SlopeOne, SVD, others
• Online and Offline support
  • Offline can utilize Hadoop
• Many different Similarity measures
  • Cosine, LLR, Tanimoto, Pearson, others
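Two of the similarity measures named above are simple enough to write out. The sketch below is a plain-Java illustration of the textbook definitions, not Mahout's implementation; `SimilaritySketch` is a hypothetical name. Cosine works on rating vectors, Tanimoto on the sets of items two users have interacted with.

```java
import java.util.HashSet;
import java.util.Set;

/** Textbook definitions of two similarity measures. */
final class SimilaritySketch {
    /** Cosine of the angle between two equal-length vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Tanimoto (Jaccard) coefficient: |A and B| / |A or B|; 0 if both sets are empty. */
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0 : (double) common.size() / union;
    }
}
```

Tanimoto ignores rating values entirely, which is why it suits implicit "did or did not" data, while cosine uses the magnitudes.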

Clustering

• Document level
  • Group documents based on a notion of similarity
  • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  • Distance Measures
    • Manhattan, Euclidean, other
• Topic Modeling
  • Cluster words across documents to identify topics
  • Latent Dirichlet Allocation

Categorization

• Place new items into predefined categories:
  • Sports, politics, entertainment
• Mahout has several implementations
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Decision Forests
  • Logistic Regression (SGD)

Freq. Pattern Mining

• Identify frequently co-occurrent items
• Useful for:
  • Query Recommendations
    • Apple -> iPhone, orange, OS X
  • Related product placement
    • “Beer and Diapers”
    • http://www.amazon.com
  • Spam Detection
    • Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam

Evolutionary

• Map-Reduce ready fitness functions for genetic programming
• Integration with Watchmaker
  • http://watchmaker.uncommons.org/index.php
• Problems solved:
  • Traveling salesman
  • Class discovery
  • Many others

Singular Value Decomposition

• Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts
• Mahout has fully distributed Lanczos implementation
    /bin/mahout svd -Dmapred.input.dir=path/to/corpus --tempDir path/for/svd-output --rank 300 --numColumns --numRows
    /bin/mahout cleansvd --eigenInput path/for/svd-output --corpusInput path/to/corpus --output path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0
• https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

How to: Command Line

• Most algorithms have a Driver program
  • Shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the Data
  • Different algorithms require different setup
• Run the algorithm
  • Single Node
  • Hadoop
• Print out the results
  • Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.

Ugly Demo II - Prep

• Data Set: Reuters
  • http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
• Convert to Sequence File:
    bin/mahout seqdirectory --input --output --charset UTF-8
• Convert to Sparse Vector:
    bin/mahout seq2sparse --input /content/reuters/seqfiles/ --norm 2 --weight TF --output /content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90

Ugly Demo II: Topic Modeling

• Latent Dirichlet Allocation
    ./mahout lda --input /content/reuters/seqfiles-TF/vectors/ --output /content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 10
    ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state-19 --dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
• Good feature reduction (stopword removal) required

Ugly Demo III: Clustering

• K-Means
  • Same Prep as UD II, except use TFIDF weight
    ./mahout kmeans --input /content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output /content/reuters/seqfiles-TFIDF/output-kmeans --clusters /content/reuters/seqfiles-TFIDF/output-kmeans/clusters
• Print out the clusters:
    ./mahout clusterdump --seqFileDir /content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

Ugly Demo IV: Frequent Pattern Mining

• Data: http://fimi.cs.helsinki.fi/data/
    ./mahout fpg -i /content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
    ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

Cloudera ML

Cloudera ML

• Collection of Java libraries and command-line tools
• Goal: make data scientists more productive with CDH
  • Exploratory data analysis
  • Data preparation
  • Model fitting
  • Model evaluation
• Apache 2.0 licensed
• Developed on GitHub
  • http://github.com/cloudera/ml

Cloudera ML: Building Blocks

• Apache Hadoop
  • Scalable data storage (HDFS) and processing (MapReduce)
• Apache Hive
  • Metadata for structured data in HDFS
• Apache Crunch
  • Easy MapReduce pipelines
• Apache Mahout
  • Vector interface
• Apache Avro
  • Serialization format

Cloudera ML Workflow: Clustering

Cloudera ML: summary

• client/bin/ml summary
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --header-file examples/kdd99/header.csv (local FS)
    --summary-file examples/kdd99/s.json (local FS)

Cloudera ML: summary

[Figure: step 1, summary, reads kddcup.data_10_percent from HDFS and header.csv from the local FS, and writes s.json to the local FS]

Cloudera ML: summary

• s.json
  • Categorical features: histogram
  • Numerical features: distribution summary

Cloudera ML: normalize

• client/bin/ml normalize
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --summary-file examples/kdd99/s.json (local FS)
    --transform Z
    --output-path kdd99 (HDFS)
    --output-type avro
    --id-column category
    --compress

Cloudera ML: normalize

[Figure: step 2, normalize, reads kddcup.data_10_percent from HDFS and s.json from the local FS, and writes kdd99/ to HDFS]

Cloudera ML: normalize

• kdd99/part-m-0000[0|1].avro
  • Examples (rows)
    • Part 0: 442,454 vectors
    • Part 1: 51,567 vectors
    • Total: 494,021 vectors
  • Features (columns)
    • Before: 41 fields
    • After: 143 fields
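The slides do not spell out what `--transform Z` does or why 41 fields become 143. A plausible reading, offered here only as an assumption, is that numeric fields are standardized to z-scores using the statistics in s.json and that each categorical field is expanded into one indicator column per level. The sketch below shows both steps on toy data; `NormalizeSketch` and its method names are hypothetical and are not the Cloudera ML API.

```java
/** Toy versions of the two transformations assumed above. */
final class NormalizeSketch {
    /** Returns (x - mean) / stdDev for each value, using the population standard deviation. */
    static double[] zScore(double[] column) {
        double mean = 0;
        for (double x : column) {
            mean += x;
        }
        mean /= column.length;

        double variance = 0;
        for (double x : column) {
            variance += (x - mean) * (x - mean);
        }
        variance /= column.length;
        double stdDev = Math.sqrt(variance);

        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // A constant column carries no information; map it to zeros.
            out[i] = stdDev == 0 ? 0 : (column[i] - mean) / stdDev;
        }
        return out;
    }

    /** Expands one categorical value into indicator columns, one per known level. */
    static double[] oneHot(String value, String[] levels) {
        double[] out = new double[levels.length];
        for (int i = 0; i < levels.length; i++) {
            if (levels[i].equals(value)) {
                out[i] = 1;
            }
        }
        return out;
    }
}
```

Under this reading, a categorical field with three levels would contribute three output columns, which is how the column count can grow well beyond the number of input fields.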

Cloudera ML: ksketch

• client/bin/ml ksketch
    --input-paths kdd99 (HDFS)
    --format avro
    --points-per-iteration 500
    --output-file wc.avro (local FS)
    --seed 1729
    --iterations 5
    --cross-folds 2

Cloudera ML: ksketch

[Figure: step 3, ksketch, reads kdd99/ from HDFS and writes wc.avro to the local FS]

Cloudera ML: ksketch

• wc.avro
  • Examples (rows)
    • 2 “folds” of 2501 examples each (1 + 5 × 500)
    • 1 initial example
    • 500 examples from each iteration (5 iterations)
    • Each example has an associated weight
  • Features (columns)
    • 143 features (still)

Cloudera ML: kmeans

• client/bin/ml kmeans
    --input-file wc.avro (local FS)
    --centers-file centers.avro (local FS)
    --seed 19
    --clusters 1,10,25,35,45
    --best-of 2
    --num-threads 4
    --eval-stats-file kmeans_stats.csv (local FS)

Cloudera ML: kmeans

[Figure: step 4, kmeans, reads wc.avro from the local FS and writes centers.avro and kmeans_stats.csv to the local FS]

Cloudera ML: kmeans

• centers.avro
  • 1 row for each run of k-means++
  • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
• kmeans_stats.csv
  • Clustering quality scores
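Each run mentioned above starts from k-means++ seeding. As a reminder of what that involves, here is a minimal sketch of the standard seeding rule: the first center is chosen uniformly at random, and each later center is chosen with probability proportional to its squared distance from the nearest center picked so far. This illustrates the textbook algorithm only; `KMeansPlusPlusSketch` is a hypothetical name and nothing here is taken from the Cloudera ML source.

```java
import java.util.Random;

/** Textbook k-means++ seeding on dense points. */
final class KMeansPlusPlusSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Picks k initial centers from points; assumes 1 <= k and points is non-empty. */
    static double[][] seed(double[][] points, int k, long seedValue) {
        Random rng = new Random(seedValue);
        int n = points.length;
        double[][] centers = new double[k][];
        centers[0] = points[rng.nextInt(n)];

        // nearest[i] holds the squared distance from point i to its closest chosen center.
        double[] nearest = new double[n];
        for (int i = 0; i < n; i++) {
            nearest[i] = Double.POSITIVE_INFINITY;
        }

        for (int c = 1; c < k; c++) {
            double total = 0;
            for (int i = 0; i < n; i++) {
                nearest[i] = Math.min(nearest[i], squaredDistance(points[i], centers[c - 1]));
                total += nearest[i];
            }
            // Sample an index with probability nearest[i] / total.
            double u = rng.nextDouble() * total;
            double running = 0;
            int chosen = n - 1;
            for (int i = 0; i < n; i++) {
                running += nearest[i];
                if (u < running) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = points[chosen];
        }
        return centers;
    }
}
```

Because points already chosen have distance zero, they cannot be picked again unless every remaining point coincides with an existing center.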

Cloudera ML: kassign

• client/bin/ml kassign
    --input-paths kdd99 (HDFS)
    --format avro
    --centers-file centers.avro (local FS)
    --center-ids 4
    --output-path assigned (HDFS)
    --output-type csv

Cloudera ML: kassign

[Figure: step 5, kassign, reads kdd99/ from HDFS and centers.avro from the local FS, and writes assigned/ to HDFS]

Cloudera ML: kassign

• assigned/part-m-0000[0|1]
  • Rows
    • Part 0: 442,454
    • Part 1: 51,567
    • Total: 494,021
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster
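The last two output columns come from a nearest-center lookup. A minimal sketch of that lookup follows, assuming dense points and Euclidean distance; `Assignment` and `NearestCenterSketch` are hypothetical names, and the real tool runs this step over the data in HDFS rather than in a single process.

```java
/** Result of assigning one point: which center won and how far away it is. */
final class Assignment {
    final int clusterId;
    final double squaredDistance;

    Assignment(int clusterId, double squaredDistance) {
        this.clusterId = clusterId;
        this.squaredDistance = squaredDistance;
    }
}

final class NearestCenterSketch {
    /** Scans all centers and returns the closest one; assumes centers is non-empty. */
    static Assignment assign(double[] point, double[][] centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centers[c][i];
                d += diff * diff;
            }
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return new Assignment(best, bestDistance);
    }
}
```

Keeping the squared distance avoids a square root per point and is enough for ranking points by how far they sit from their center.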

Cloudera ML: sample

• client/bin/ml sample
    --input-paths assigned (HDFS)
    --format text
    --header-file examples/kdd99/kassign_header.csv (local FS)
    --weight-field squared_distance
    --group-fields clustering_id,closest_center_id
    --output-type csv
    --size 20
    --output-path extremal (HDFS)

Cloudera ML: sample

[Figure: step 6, sample, reads assigned/ from HDFS and kassign_header.csv from the local FS, and writes extremal/ to HDFS]

Cloudera ML: sample

• extremal/part-r-00000
  • Rows
    • Up to 20 examples from each cluster
    • Examples that are furthest from the center of the cluster
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster

Oryx

2014: Lab to Factory

Data Science Will Be Operational Analytics

I Built A Model. Now What?

[Figure: cycle of Collect Input, Build Model, Query Model, Repeat]

I Built A Model On Hadoop. Now What?

[Figure: the same cycle of Collect Input, Build Model, Query Model, Repeat, annotated with question marks]

Example: Oryx

[Image: www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg]

Gaps to fill, and Goals

• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source

Large-Scale or Real-Time?

Large-Scale, Offline, Batch
vs
Real-Time, Online, Streaming

Why Don’t We Have Both?

λ

Lambda Architecture

• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate

… λ?

jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting

[Figure: Serving/Speed layer and Batch layer]

Two Layers

• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale

* Apache Spark later

Collaborative Filtering : ALS

• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable

[Figure: factor matrices X and Yᵀ]

Clustering : k-means++

• Well-known and understood
• Parallelizable
• Clusters updateable

cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

Classification / Regression : RDF

• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems

[Figure: example decision tree with tests age > 30, female?, and income > 20000, and Yes/No leaves]

PMML

• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support

[Figure: PMML TreeModel XML excerpt; see www.dmg.org/v4-1/TreeModel.html]

HTTP REST API

• Convention for RPC-like request / response
• HTTP verbs, transport
• GET : query
• POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.

    GET /recommend/jwills

    HTTP/1.1 200 OK
    Content-Type: text/plain

    "Ray LaMontagne",0.951
    "Fleet Foxes",0.7905
    "The National",0.688
    "Shearwater",0.3017
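On the client side, a response body like the one above is easy to consume. The sketch below parses it into item and score pairs; it assumes one `"item",score` pair per line exactly as in the example, and the class names `Recommendation` and `RecommendResponseParser` are hypothetical, not part of Oryx. Fetching the body over HTTP is left out so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** One recommended item and its score. */
final class Recommendation {
    final String item;
    final double score;

    Recommendation(String item, double score) {
        this.item = item;
        this.score = score;
    }
}

/** Parses a text/plain body with one "item",score pair per line (assumed format). */
final class RecommendResponseParser {
    static List<Recommendation> parse(String body) {
        List<Recommendation> out = new ArrayList<>();
        for (String rawLine : body.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            // Split on the last comma so item names containing commas still parse.
            int comma = line.lastIndexOf(',');
            if (comma < 0) {
                throw new IllegalArgumentException("expected item,score but got: " + line);
            }
            String item = line.substring(0, comma).trim();
            if (item.length() >= 2 && item.startsWith("\"") && item.endsWith("\"")) {
                item = item.substring(1, item.length() - 1);
            }
            double score = Double.parseDouble(line.substring(comma + 1).trim());
            out.add(new Recommendation(item, score));
        }
        return out;
    }
}
```

A plain-text line format like this is what makes the API "easy from browser, CLI, Java, Python, Scala": no JSON or XML library is needed to read it.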

Wish List

• Revamp workflow
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• More component-ized
• Less black-box service
• Emphasize integration
  • PMML, etc.
• “Pull” options
  • Kafka?
  • Hive / Impala?

Open Source

github.com/cloudera/oryx
100% Apache License 2.0
