Lecture 7 - CS 246h
January 17, 2017 | Author: jeromeku | Category: N/A
Stanford CS 246H: Mining Massive Data Sets Hadoop Lab
Stanford CS 246H Winter ‘14
Machine Learning & Hadoop
Peanut Butter and Chocolate?
• The Promise of Big Data™
• Sounds great, but how?
   • Hadoop talent pool is small
   • ML talent pool is tiny
• Tools and toolkits starting to appear
   • Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.
• Summary: Hadoop is hard, and ML is hard
   1. Lots of people/companies are trying to make it easy
   2. Don't believe anyone who tells you they make it easy

Hadoop & ML: A Brief History
• 2005 – Taste project started on SourceForge
• 2007 – Mahout project started at Apache
• 2008 – Taste donated to Mahout
• … time passes …
• 2012 – Myrrix is launched
• 2013 – Cloudera ML project started on GitHub
• Late 2013 – Oryx project started on GitHub

Hadoop ML Family Tree
[Diagram: family tree with nodes Andrew Ng, Taste, Lucene, Mahout, Cloudera ML, Myrrix, Oryx]

Apache Mahout

What is Mahout?
• "Scalable machine learning"
   • not just Hadoop-oriented machine learning
   • not entirely, that is. Just mostly.
• Components
   • math library
   • clustering
   • classification
   • decompositions
   • recommendations
©MapR Technologies 2013

Mahout Math
• Goals are
   • basic linear algebra,
   • and statistical sampling,
   • and good clustering,
   • decent speed,
   • extensibility,
   • especially for sparse data
• But not
   • totally badass speed
   • comprehensive set of algorithms
   • optimization, root finders, quadrature

Caveat Emptor
• Mahout is a toolkit
• There is a command line interface
   • You can't always use it
• Very often end up writing code
• Documentation is… ahem… scant
   • Best reference is Mahout in Action
• Varying levels of maturity
• Varying levels of Hadoop support

Matrices and Vectors
• At the core:
   • DenseVector, RandomAccessSparseVector
   • DenseMatrix, SparseRowMatrix
• Highly composable API
      m.viewDiagonal().assign(v)
• Important ideas:
   • view*, assign and aggregate
   • iteration

Assign? View?
• Why assign?
   • Copying is the major cost for naïve matrix packages
   • In-place operations critical to reasonable performance
   • Many kinds of updates required, so functional style very helpful
• Why view?
   • In-place operations often required for blocks, rows, columns or diagonals
   • With views, we need #assign + #views methods
   • Without views, we need #assign x #views methods
• Synergies
   • With both views and assign, many loops become single line

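The "#assign + #views" argument can be sketched without Mahout itself. What follows is a minimal, hypothetical illustration in plain Java: `ViewSketch`, `VectorView` and the `assign` shown here are stand-ins invented for this example, not Mahout's classes. Each view writes through to the backing array, so a single `assign` implementation serves every kind of view.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical stand-in for a matrix library; these are not Mahout's classes.
final class ViewSketch {
    /** A view exposes part of the backing storage without copying it. */
    interface VectorView {
        int size();
        double get(int i);
        void set(int i, double value);

        /** In-place update: one implementation works for every view. */
        default VectorView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) {
                set(i, f.applyAsDouble(get(i)));
            }
            return this;
        }

        /** Sum of all elements visible through the view. */
        default double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) {
                sum += get(i);
            }
            return sum;
        }
    }

    static VectorView viewDiagonal(double[][] m) {
        return new VectorView() {
            public int size() { return Math.min(m.length, m[0].length); }
            public double get(int i) { return m[i][i]; }
            public void set(int i, double v) { m[i][i] = v; }
        };
    }

    static VectorView viewRow(double[][] m, int row) {
        return new VectorView() {
            public int size() { return m[row].length; }
            public double get(int i) { return m[row][i]; }
            public void set(int i, double v) { m[row][i] = v; }
        };
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(viewDiagonal(m).zSum());  // trace of m
        viewDiagonal(m).assign(x -> 0.0);            // zero the diagonal in place
        System.out.println(m[0][0] + " " + m[1][1]);
    }
}
```

Adding a new view (say, a column) costs one small adapter, and adding a new update costs one function; neither requires touching the other.
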
Assign
• Matrices
      Matrix assign(double value);
      Matrix assign(double[][] values);
      Matrix assign(Matrix other);
      Matrix assign(DoubleFunction f);
      Matrix assign(Matrix other, DoubleDoubleFunction f);
• Vectors
      Vector assign(double value);
      Vector assign(double[] values);
      Vector assign(Vector other);
      Vector assign(DoubleFunction f);
      Vector assign(Vector other, DoubleDoubleFunction f);
      Vector assign(DoubleDoubleFunction f, double y);

Views
• Matrices
      Matrix viewPart(int[] offset, int[] size);
      Matrix viewPart(int row, int rlen, int col, int clen);
      Vector viewRow(int row);
      Vector viewColumn(int column);
      Vector viewDiagonal();
• Vectors
      Vector viewPart(int offset, int length);

Aggregates
• Matrices
      double zSum();
      Vector aggregateRows(VectorFunction f);
      Vector aggregateColumns(VectorFunction f);
      double aggregate(DoubleDoubleFunction combiner, DoubleFunction mapper);
• Vectors
      double zSum();
      double aggregate(DoubleDoubleFunction reduce, DoubleFunction map);
      double aggregate(Vector other, DoubleDoubleFunction aggregator, DoubleDoubleFunction combiner);

Predefined Functions
• Many handy functions
      ABS, ACOS, ASIN, ATAN, CEIL, COS, EXP, FLOOR, IDENTITY, INV, LOGARITHM,
      LOG2, NEGATE, RINT, SIGN, SIN, SQRT, SQUARE, SIGMOID, SIGMOIDGRADIENT, TAN

Examples
• A = α
      double alpha;
      a.assign(alpha);
• A = αB + β
      a.assign(b, Functions.chain(Functions.plus(beta), Functions.mult(alpha)));

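The composition in the A = αB + β example can be tried out with nothing but the JDK. The sketch below is an illustration, not Mahout code: `plus`, `mult`, `chain` and `assignAffine` are local helpers defined here, and it assumes that `chain(f, g)` means "apply g first, then f".

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Plain-Java sketch; plus, mult and chain are local helpers, not Mahout's Functions class.
final class ChainSketch {
    static DoubleUnaryOperator plus(double b) { return x -> x + b; }
    static DoubleUnaryOperator mult(double a) { return x -> x * a; }

    /** Assumed ordering: chain(f, g) applies g first, then f, i.e. x -> f(g(x)). */
    static DoubleUnaryOperator chain(DoubleUnaryOperator f, DoubleUnaryOperator g) {
        return f.compose(g);
    }

    /** Writes alpha * b[i] + beta into a[i] for every element. */
    static void assignAffine(double[] a, double[] b, double alpha, double beta) {
        DoubleUnaryOperator f = chain(plus(beta), mult(alpha));
        for (int i = 0; i < a.length; i++) {
            a[i] = f.applyAsDouble(b[i]);
        }
    }

    public static void main(String[] args) {
        double[] a = new double[3];
        assignAffine(a, new double[] {1, 2, 3}, 2.0, 1.0);
        System.out.println(Arrays.toString(a));
    }
}
```
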
Sparse Optimizations
• DoubleDoubleFunction abstract properties
      public boolean isLikeRightPlus();
      public boolean isLikeLeftMult();
      public boolean isLikeRightMult();
      public boolean isLikeMult();
      public boolean isCommutative();
      public boolean isAssociative();
      public boolean isAssociativeAndCommutative();
      public boolean isDensifying();
• And Vector properties
      public boolean isDense();
      public boolean isSequentialAccess();
      public double getLookupCost();
      public double getIteratorAdvanceCost();
      public boolean isAddConstantTime();

Examples
• The trace of a matrix
      m.viewDiagonal().zSum()
• Set diagonal to zero
      m.viewDiagonal().assign(0)
• Set diagonal to negative of row sums excluding the diagonal
      Vector diag = m.viewDiagonal().assign(0);
      diag.assign(m.rowSums().assign(Functions.NEGATE));

Iteration
• Matrices are Iterable in Mahout
      // compute both row and column sums in one pass
      for (MatrixSlice row : m) {
        rSums.set(row.index(), row.zSum());
        cSums.assign(row, Functions.PLUS);
      }
• Vectors are densely or sparsely iterable
      double entropy = 0;
      for (Vector.Element e : v.iterateNonZero()) {
        entropy -= e.get() * Math.log(e.get());
      }

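Sparse iteration pays off because zero entries contribute nothing to sums such as entropy. The sketch below shows the same idea in plain Java, assuming a sparse vector stored as a map from index to non-zero value; the class and method names are made up for this illustration and are not part of Mahout.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch: a sparse vector is modelled as index -> non-zero value.
final class SparseEntropySketch {
    /** Shannon entropy (natural log) of a sparse probability vector. */
    static double entropy(Map<Integer, Double> nonZero) {
        double h = 0;
        // Only stored entries are visited; zero entries add nothing to the sum.
        for (double p : nonZero.values()) {
            h -= p * Math.log(p);
        }
        return h;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new TreeMap<>();
        v.put(3, 0.5);        // a vector of any length with two non-zero entries
        v.put(1_000_000, 0.5);
        System.out.println(entropy(v));
    }
}
```

The loop cost depends on the number of non-zero entries, not on the nominal length of the vector.
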
Random Sampling
• Samples from some type
      public interface Sampler<T> {
        T sample();
      }

      public abstract class AbstractSamplerFunction
          extends DoubleFunction
          implements Sampler<Double>
• Lots of kinds
      ChineseRestaurant, Empirical, IndianBuffet,
      Missing, Multinomial, MultiNormal,
      Normal, PoissonSampler, Sampler

Mahout Math Summary
• Matrices, Vectors
   • views
   • in-place assignment
   • aggregations
   • iterations
• Functions
   • lots built-in
   • cooperate with sparse vector optimizations
• Sampling
   • abstract samplers
   • samplers as functions
• Other stuff … clustering, SVD

Other Stuff
• Matrix Decomposition
• Classification
• Clustering
• Recommendations

Focus: Machine Learning
[Diagram: Mahout component stack. Labels: Applications; Examples; Genetic; Freq. Pattern Mining; Classification; Clustering; Recommenders; Utilities (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop]
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
©Lucid Imagination 2010

Prepare Data from Raw Content
• Data Sources:
   • Lucene integration
      • bin/mahout lucenevector …
   • Document Vectorizer
      • bin/mahout seqdirectory …
      • bin/mahout seq2sparse …
   • Programmatically
      • See the Utils module in Mahout
   • Database
   • File system

Recommendations
• Extensive framework for collaborative filtering
• Recommenders
   • User based, Item based, ALS, SlopeOne, SVD, others
• Online and Offline support
   • Offline can utilize Hadoop
• Many different Similarity measures
   • Cosine, LLR, Tanimoto, Pearson, others

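Two of the similarity measures named above are simple enough to show directly. The sketch below uses the textbook definitions in plain Java: cosine similarity on rating vectors and the Tanimoto (Jaccard) coefficient on sets of items. The class and method names are invented for this illustration and are not Mahout's similarity classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Textbook definitions in plain Java; not Mahout's similarity implementations.
final class SimilaritySketch {
    /** Cosine of the angle between two equal-length rating vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Tanimoto coefficient: size of the intersection over size of the union. */
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
        Set<String> u1 = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> u2 = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println(tanimoto(u1, u2));
    }
}
```

Cosine uses the rating values, while Tanimoto only looks at which items two users have in common, which makes it a natural fit for implicit feedback.
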
Clustering
• Document level
   • Group documents based on a notion of similarity
   • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
   • Distance Measures
      • Manhattan, Euclidean, other
• Topic Modeling
   • Cluster words across documents to identify topics
   • Latent Dirichlet Allocation

Categorization
• Place new items into predefined categories:
   • Sports, politics, entertainment
• Mahout has several implementations
   • Naïve Bayes
   • Complementary Naïve Bayes
   • Decision Forests
   • Logistic Regression (SGD)

Freq. Pattern Mining
• Identify frequently co-occurrent items
• Useful for:
   • Query Recommendations
      • Apple -> iPhone, orange, OS X
   • Related product placement
      • http://www.amazon.com
      • "Beer and Diapers"
   • Spam Detection
      • Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam

Evolutionary
• Map-Reduce ready fitness functions for genetic programming
• Integration with Watchmaker
   • http://watchmaker.uncommons.org/index.php
• Problems solved:
   • Traveling salesman
   • Class discovery
   • Many others

Singular Value Decomposition
• Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts
• Mahout has fully distributed Lanczos implementation
      /bin/mahout svd -Dmapred.input.dir=path/to/corpus --tempDir path/for/svd-output --rank 300 --numColumns --numRows
      /bin/mahout cleansvd --eigenInput path/for/svd-output --corpusInput path/to/corpus --output path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0
• https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

How to: Command Line
• Most algorithms have a Driver program
   • Shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the Data
   • Different algorithms require different setup
• Run the algorithm
   • Single Node
   • Hadoop
• Print out the results
   • Several helper classes:
      • LDAPrintTopics, ClusterDumper, etc.

Ugly Demo II - Prep
• Data Set: Reuters
   • http://www.daviddlewis.com/resources/testcollections/reuters21578/
   • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
• Convert to Sequence File:
      bin/mahout seqdirectory --input --output --charset UTF-8
• Convert to Sparse Vector:
      bin/mahout seq2sparse --input /content/reuters/seqfiles/ --norm 2 --weight TF --output /content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90

Ugly Demo II: Topic Modeling
• Latent Dirichlet Allocation
      ./mahout lda --input /content/reuters/seqfiles-TF/vectors/ --output /content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 10
      ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input /content/reuters/seqfiles-TF/lda-output/state-19 --dict /content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output /content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
• Good feature reduction (stopword removal) required

Ugly Demo III: Clustering
• K-Means
   • Same Prep as UD II, except use TFIDF weight
      ./mahout kmeans --input /content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output /content/reuters/seqfiles-TFIDF/output-kmeans --clusters /content/reuters/seqfiles-TFIDF/output-kmeans/clusters
• Print out the clusters:
      ./mahout clusterdump --seqFileDir /content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir /content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary /content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

Ugly Demo IV: Frequent Pattern Mining
• Data: http://fimi.cs.helsinki.fi/data/
      ./mahout fpg -i /content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
      ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

Cloudera ML

• Collection of Java libraries and command-line tools
• Goal: make data scientists more productive with CDH
   • Exploratory data analysis
   • Data preparation
   • Model fitting
   • Model evaluation
• Apache 2.0 licensed
• Developed on GitHub
   • http://github.com/cloudera/ml

Cloudera ML: Building Blocks
• Apache Hadoop
   • Scalable data storage (HDFS) and processing (MapReduce)
• Apache Hive
   • Metadata for structured data in HDFS
• Apache Crunch
   • Easy MapReduce pipelines
• Apache Mahout
   • Vector interface
• Apache Avro
   • Serialization format

Cloudera ML Workflow: Clustering

Cloudera ML: summary
      client/bin/ml summary
        --input-paths kddcup.data_10_percent      (HDFS)
        --format text
        --header-file examples/kdd99/header.csv   (local FS)
        --summary-file examples/kdd99/s.json      (local FS)
[Diagram: step 1, summary. Inputs: kddcup.data_10_percent (HDFS), header.csv (local FS). Output: s.json (local FS).]
• s.json
   • Categorical features: histogram
   • Numerical features: distribution summary

Cloudera ML: normalize
      client/bin/ml normalize
        --input-paths kddcup.data_10_percent      (HDFS)
        --format text
        --summary-file examples/kdd99/s.json      (local FS)
        --transform Z
        --output-path kdd99                       (HDFS)
        --output-type avro
        --id-column category
        --compress
[Diagram: step 2, normalize. Inputs: kddcup.data_10_percent (HDFS), s.json (local FS). Output: kdd99/ (HDFS).]
• kdd99/part-m-0000[0|1].avro
   • Examples (rows)
      • Part 0: 442,454 vectors
      • Part 1: 51,567 vectors
      • Total: 494,021 vectors
   • Features (columns)
      • Before: 41 fields
      • After: 143 fields

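The slide does not say why 41 fields become 143. A plausible reading, and it is only an assumption here, is that numeric columns get a Z transform (subtract the mean, divide by the standard deviation) while each categorical column expands into one indicator column per level. The plain-Java sketch below illustrates both steps; the class, the method names and the three protocol levels are illustrative and are not taken from Cloudera ML.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a guess at what "--transform Z" plus categorical expansion could look like.
final class NormalizeSketch {
    /** Z transform: (value - mean) / population standard deviation. */
    static double[] zTransform(double[] values) {
        double mean = 0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;
        double variance = 0;
        for (double v : values) {
            variance += (v - mean) * (v - mean);
        }
        variance /= values.length;
        double sd = Math.sqrt(variance);
        double[] z = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            z[i] = sd == 0 ? 0.0 : (values[i] - mean) / sd;  // constant column maps to 0
        }
        return z;
    }

    /** One-hot encoding: one output column per known level of a categorical field. */
    static double[] oneHot(String value, List<String> levels) {
        double[] encoded = new double[levels.size()];
        int index = levels.indexOf(value);
        if (index >= 0) {
            encoded[index] = 1.0;
        }
        return encoded;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zTransform(new double[] {1, 2, 3})));
        System.out.println(Arrays.toString(oneHot("udp", Arrays.asList("icmp", "tcp", "udp"))));
    }
}
```

In this reading the mean, standard deviation and level lists would come from the summary file produced in the previous step, which is why `normalize` takes `--summary-file` as input.
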
Cloudera ML: ksketch
      client/bin/ml ksketch
        --input-paths kdd99                       (HDFS)
        --format avro
        --points-per-iteration 500
        --output-file wc.avro                     (local FS)
        --seed 1729
        --iterations 5
        --cross-folds 2
[Diagram: step 3, ksketch. Input: kdd99/ (HDFS). Output: wc.avro (local FS).]
• wc.avro
   • Examples (rows)
      • 2 "folds" of 2501 examples
      • 1 initial example
      • 500 examples from each iteration (5 iterations)
      • Each example has an associated weight
   • Features (columns)
      • 143 features (still)

Cloudera ML: kmeans
      client/bin/ml kmeans
        --input-file wc.avro                      (local FS)
        --centers-file centers.avro               (local FS)
        --seed 19
        --clusters 1,10,25,35,45
        --best-of 2
        --num-threads 4
        --eval-stats-file kmeans_stats.csv        (local FS)
[Diagram: step 4, kmeans. Input: wc.avro (local FS). Outputs: centers.avro, kmeans_stats.csv (local FS).]
• centers.avro
   • 1 row for each run of k-means++
   • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
• kmeans_stats.csv
   • Clustering quality scores

Cloudera ML: kassign
      client/bin/ml kassign
        --input-paths kdd99                       (HDFS)
        --format avro
        --centers-file centers.avro               (local FS)
        --center-ids 4
        --output-path assigned                    (HDFS)
        --output-type csv
[Diagram: step 5, kassign. Inputs: kdd99/ (HDFS), centers.avro (local FS). Output: assigned/ (HDFS).]
• assigned/part-m-0000[0|1]
   • Rows
      • Part 0: 442,454
      • Part 1: 51,567
      • Total: 494,021
   • Columns
      • Point ID (normal/attack type, in this case)
      • Index in centers.avro
      • Assigned cluster ID
      • Squared distance to nearest cluster

Cloudera ML: sample
      client/bin/ml sample
        --input-paths assigned                    (HDFS)
        --format text
        --header-file examples/kdd99/kassign_header.csv   (local FS)
        --weight-field squared_distance
        --group-fields clustering_id,closest_center_id
        --output-type csv
        --size 20
        --output-path extremal                    (HDFS)
[Diagram: step 6, sample. Inputs: assigned/ (HDFS), kassign_header.csv (local FS). Output: extremal/ (HDFS).]
• extremal/part-r-00000
   • Rows
      • Up to 20 examples from each cluster
      • Examples that are furthest from the center of the cluster
   • Columns
      • Point ID (normal/attack type, in this case)
      • Index in centers.avro
      • Assigned cluster ID
      • Squared distance to nearest cluster

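The output described above, up to 20 of the most distant points per cluster, can be pictured as a grouped top-N. The sketch below is a simplified, deterministic stand-in written in plain Java: it groups rows by cluster ID and keeps the `size` rows with the largest squared distance. It is not Cloudera ML's implementation (the `--weight-field` flag suggests the real tool may sample by weight rather than sort), and the `Assigned` record type is invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for "extremal" sampling; not Cloudera ML's implementation.
final class ExtremalSketch {
    /** One row of the kassign output (hypothetical in-memory form). */
    static final class Assigned {
        final String pointId;
        final int clusterId;
        final double squaredDistance;

        Assigned(String pointId, int clusterId, double squaredDistance) {
            this.pointId = pointId;
            this.clusterId = clusterId;
            this.squaredDistance = squaredDistance;
        }
    }

    /** Keeps, per cluster, the `size` rows that lie furthest from their center. */
    static Map<Integer, List<Assigned>> furthestPerCluster(List<Assigned> rows, int size) {
        Map<Integer, List<Assigned>> byCluster = new TreeMap<>();
        for (Assigned row : rows) {
            byCluster.computeIfAbsent(row.clusterId, k -> new ArrayList<>()).add(row);
        }
        for (List<Assigned> group : byCluster.values()) {
            // Largest squared distance first, then drop everything past `size`.
            group.sort(Comparator.comparingDouble((Assigned r) -> r.squaredDistance).reversed());
            if (group.size() > size) {
                group.subList(size, group.size()).clear();
            }
        }
        return byCluster;
    }

    public static void main(String[] args) {
        List<Assigned> rows = new ArrayList<>();
        rows.add(new Assigned("normal", 0, 1.0));
        rows.add(new Assigned("smurf", 0, 9.0));
        rows.add(new Assigned("neptune", 1, 2.0));
        for (Map.Entry<Integer, List<Assigned>> e : furthestPerCluster(rows, 1).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().get(0).pointId);
        }
    }
}
```

Points far from every center are the natural candidates for manual inspection, which is the usual reason to pull such a sample in an anomaly-detection workflow.
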
Oryx

2014: Lab to Factory

Data Science Will Be Operational Analytics

I Built A Model. Now What?
[Diagram: cycle of Collect Input, Build Model, Query Model, Repeat]

I Built A Model On Hadoop. Now What?
[Diagram: the same cycle of Collect Input, Build Model, Query Model, Repeat, annotated "? ? ?"]

Example: Oryx
www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg

Gaps to fill, and Goals
• Model Building
   • Large-scale
   • Continuous
   • Apache Hadoop™-based
   • Few, good algorithms
• Model Serving
   • Real-time query
   • Real-time update
• Algorithms
   • Parallelizable
   • Updateable
   • Works on diverse input
• Interoperable
   • PMML model format
   • Simple REST API
   • Open source

Large-Scale or Real-Time?
• Large-Scale, Offline, Batch
   vs
• Real-Time, Online, Streaming
Why Don't We Have Both?
λ!

Lambda Architecture
• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate
… λ?
jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
[Diagram: two layers labeled Serving/Speed and Batch]

Two Layers
• Computation Layer
   • Java-based server process
   • Client of Hadoop 2.x
   • Periodically builds "generation" from recent data and past model
   • Baby-sits MapReduce* jobs (or, locally in-core)
   • Publishes models
• Serving Layer
   • Apache Tomcat™-based server process
   • Consumes models from HDFS (or local FS)
   • Serves queries from model in memory
   • Updates from new input
   • Also writes input to HDFS
   • Replicas for scale
* Apache Spark later

Collaborative Filtering : ALS
• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable
[Diagram: factor matrices X and Yᵀ]

Clustering : k-means++
• Well-known and understood
• Parallelizable
• Clusters updateable
cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

Classification / Regression : RDF
• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems
[Diagram: decision tree with tests "age > 30", "female?", "income > 20000" and Yes / No outcomes]

PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support
www.dmg.org/v4-1/TreeModel.html

HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
• GET : query
• POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.

      GET /recommend/jwills

      HTTP/1.1 200 OK
      Content-Type: text/plain

      "Ray LaMontagne",0.951
      "Fleet Foxes",0.7905
      "The National",0.688
      "Shearwater",0.3017

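Because the response is plain text with one `"name",score` pair per line, a client needs very little code to consume it. The sketch below parses a body in the format shown on the slide; it makes no network call, and the class and method names are hypothetical helpers rather than part of any Oryx client library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper; parses the text/plain body shown on the slide.
final class RecommendResponseSketch {
    /** Returns item -> score in the order the server listed them. */
    static Map<String, Double> parse(String body) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String line : body.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            // Split on the last comma so that commas inside the quoted name survive.
            int comma = line.lastIndexOf(',');
            String name = line.substring(0, comma).replaceAll("^\"|\"$", "");
            scores.put(name, Double.parseDouble(line.substring(comma + 1)));
        }
        return scores;
    }

    public static void main(String[] args) {
        String body = "\"Ray LaMontagne\",0.951\n\"Fleet Foxes\",0.7905\n";
        parse(body).forEach((name, score) -> System.out.println(name + " = " + score));
    }
}
```
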
Wish List
• Revamp workflow
   • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
   • Well-solved
   • Bring your own
• More component-ized
• Less black-box service
• Emphasize integration
   • PMML, etc.
• "Pull" options
   • Kafka?
   • Hive / Impala?

Open Source
github.com/cloudera/oryx
100% Apache License 2.0