MACHINE LEARNING – AN OVERVIEW Tony Cooper Senior Data Scientist
[email protected]
July 2016 kpmg.com/nz
Agenda • Introduction • Machine Learning • - Deep Learning • - Transfer Learning • - Reinforcement Learning
Introduction • Last meeting – Machine Learning – what can it do • This meeting – Machine Learning – how does it work • Not covering – How to do Machine Learning (e.g. test / train split) • Not covering – Applications (see e.g. a long list at http://www.deeplearningpatterns.com/doku.php/applications)
Also not covering: • Speech • Text (Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo) • Audio • Time Series • Graphs • Internet of Things • Bots (e.g. Siri) • Big Data
Reminder - Last Meeting – Nickle Lu • Amazing things AI has done • Why AI can do those things • Why AI will eat everything • How I learned AI in my career • How can you apply it in yours
Presenter – Tony Cooper • 5 years Stanford PhD • Thesis – “Computer Intensive Statistics” project on numerical methods for the bootstrap (unfinished)
• DSIR – Consulting Statistician • Funds Management – Database Technology (not Big Data) • Double-Digit Numerics – Consulting Data Scientist • KPMG – Set up Data Science Innovation Lab • 3 years' experience with Deep Learning (2 years CNNs, 3 years H2O)
KPMG Innovation Lab • Big Data • Spark (Spark Meetup, Auckland, 5 September 2016 at KPMG) • R • H2O
• Machine Learning • Recommender Systems • Computational Advertising • Hyperpersonalisation (segmentation with segment size 1) • Computer Vision • GPU programming
KPMG Hardware • 7 Node Spark Cluster (7 x 2 Xeon) • 1 GPU Server
Tesla K80 GPU
4 x Tesla K80 (4 x 24 GB GPU RAM, 4 x 5,000 cores, 4 x 5.8 TeraFLOPs)
2 x Xeon, 14 cores each (56 threads), 1 TB RAM, 6 TB SSD
Agenda • Introduction • Machine Learning • - Deep Learning • - Transfer Learning • - Reinforcement Learning
Machine Learning
[Diagram: overlapping fields – Statistical Learning (SL), Machine Learning (ML), AI, Machine Intelligence]
Types of Machine Learning
• Supervised • Unsupervised • Semi-supervised
Machine Learning Resources
• Technical (especially sections 7.10.2 and 7.10.3)
• Practical – "Dummies" book (can download the R and Python code without buying the book)
• Experts – "Machine Learning's best kept secret"
• Interesting
• Courses, MOOCs (e.g. Udacity Deep Learning)
Internet • http://deeplearning.stanford.edu/tutorial/ • http://cs231n.stanford.edu/ • (some images in this presentation taken from there)
• Contests (esp Kaggle.com) • Glossaries • http://envisat.esa.int/handbooks/meris/CNTR4-2-5.html • http://www.wildml.com/deep-learning-glossary/ • http://deeplearning4j.org/glossary.html
Kaggle.com – competitions (Titanic a good starter), scripts (“kernels”), and real data
Tip – use containers • Docker • Can run Ubuntu on Windows • All set up for you e.g. Google TensorFlow course at Udacity
A Taste of Machine Learning – Regression Example – Recommender System
[Figure: a ratings matrix – three users (Alice, Bob, Chad) rating five items, with several ratings missing]
Recommender System
The ratings matrix factorises into a user-feature matrix times an item-feature matrix:
R = H x W
where R is the 3 x 5 ratings matrix (11 ratings observed, the rest missing), H is 3 x 2 (entries h11 … h32, one row of latent features per user) and W is 2 x 5 (entries w11 … w25, one column of latent features per item).
Each rating is a weighted sum of latent features, e.g. R12 = h11 w12 + h12 w22
11 equations in 16 unknowns
Generically: R = h1 w1 + h2 w2 + ... + hn wn
Solved using Alternating Least Squares. The machine chooses the features (latent features) – it did the work for us in deciding what features to use.
R code (Alternating Least Squares sketch)
# ratings matrix (NA = rating not observed)
R = matrix(nrow = 3, ncol = 5, data = c(4,2,1, NA,5,NA, 3,3,3, NA,4,2, 1,3,NA))
# initial users (latent feature) matrix
h = matrix(nrow = 3, ncol = 2, data = rnorm(6))
# initial items (latent feature) matrix
w = matrix(nrow = 5, ncol = 2, data = rnorm(10))
# find h, w to minimise the squared error over the observed ratings,
# alternating between the two least-squares updates
# (a small ridge term keeps the solves well-conditioned when a user
#  or item has very few observed ratings)
lambda = 0.01
for (iter in 1:5) {
  # update users, holding items fixed
  for (i in 1:3) {
    obs = !is.na(R[i, ])
    W = w[obs, , drop = FALSE]
    h[i, ] = solve(t(W) %*% W + lambda * diag(2), t(W) %*% R[i, obs])
  }
  # update items, holding users fixed
  for (j in 1:5) {
    obs = !is.na(R[, j])
    H = h[obs, , drop = FALSE]
    w[j, ] = solve(t(H) %*% H + lambda * diag(2), t(H) %*% R[obs, j])
  }
}
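Once the loop converges, the factorisation fills in the missing ratings; a one-line follow-up to the sketch above:
Rhat = h %*% t(w)   # predicted ratings for every user/item pair, including the missing ones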
Another Taste – Beyond Linear Regression
[Table: sample house sales (Mt Roskill / Mt Albert / Mt Eden) – columns: Suburb, List Price, Agreement Date, Type, Title, Sale Method, Bedrooms, Land Area, Floor Area, Existing/New, Valuation, Valuation Year, Sale Price; many cells missing]
Linear Regression (200 years old) • Essentially a weighted combination of the inputs e.g. • SalePrice = w0 + w1*Suburb + w2*ListPrice + w3*AgreementDate + w4*Type + w5*Title + w6*SaleMethod + w7*Bedrooms + w8*LandArea + w9*FloorArea + w10*Existing + w11*ValuationYear • Pros • Simple to understand and interpret (taught at high school) • Simple to compute in Excel (Least Squares)
• Problems • The world isn’t linear • Doesn’t handle interactions easily (Samuel Johnson: Your Manuscript Is Good and Original, But What is Original Is Not Good; What Is Good Is Not Original) • Doesn’t handle missing values at all • Doesn’t handle correlated inputs well
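A minimal R sketch of such a weighted combination; the tiny `sales` data frame below is made-up illustrative data standing in for the house-sales table, not the real numbers:
# hypothetical toy data standing in for the house-sales table
sales = data.frame(
  SalePrice = c(223000, 265000, 230000, 385000, 380000, 505000),
  ListPrice = c(173000, 249000, 230000, 359000, 340000, 375000),
  Bedrooms  = c(1, 1, 2, 2, 2, 2),
  FloorArea = c(32, 45, 51, 70, 70, 74)
)
# SalePrice as a weighted combination of the inputs, fitted by least squares
fit = lm(SalePrice ~ ListPrice + Bedrooms + FloorArea, data = sales)
coef(fit)            # the fitted weights w0, w1, w2, w3
predict(fit, sales)  # predicted sale prices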
[Plot: Simple Example – Actual Function; Response (0–12) vs Input (0–3000)]
[Plot: Simple Example – Noise Added; noisy response samples]
[Plot: Linear Fit (underfitting)]
[Plot: Cubic Fit]
[Plot: Quartic Fit]
[Plot: Quintic Fit]
[Plot: Overfitting Fit]
[Plot: Support Vector Regression Fit (Support Vector Machine, C = 127.578, gamma = 1.22)]
Sigmoid
Neural Network – 1 Neuron (sigmoid)
[Plot: Neural Network Fit, hidden layer size = 1]
Linear regression: R = h1 w1 + h2 w2 + ... + hn wn – the hi are constants (the inputs).
Neural network: R = S1 w1 + S2 w2 + ... + Sn wn – each Si is a sigmoid.
A neural network is just a bunch of weighted sigmoid regressions; n is the number of nodes.
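A minimal R sketch of the formula above, with hypothetical weights chosen purely for illustration: each hidden-node output Si is a sigmoid of a scaled and shifted copy of the input, and the prediction is their weighted sum.
sigmoid = function(x) 1 / (1 + exp(-x))
# n = 3 hidden nodes; a and b scale/shift the input into each node,
# w weights the node outputs (all values are made up for illustration)
nn_predict = function(x, a, b, w) {
  S = sigmoid(sweep(outer(x, a), 2, b, "+"))  # one column of sigmoid outputs per node
  as.vector(S %*% w)                          # weighted sum of the node outputs
}
x = seq(0, 3000, by = 100)
yhat = nn_predict(x, a = c(0.01, -0.005, 0.002), b = c(-5, 2, -1), w = c(6, 3, 2))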
Neural Network – 2 Neurons (sigmoid)
[Plot: Neural Network Fit, hidden layer size = 2]
Neural Network – 5 Neurons (sigmoid)
[Plot: Neural Network Fit, hidden layer size = 5]
Neural Network – 10 Neurons (sigmoid)
[Plot: Neural Network Fit, hidden layer size = 10 – overfitting: 10 is too many neurons]
Neural Network 5 Neurons (sigmoid)
playground.tensorflow.org
Neural Networks are just weighted regressions: R = h1 w1 + h2 w2 + ... + hn wn
Neural Network – there is a theorem (universal approximation) that says you can model anything with a single-layer neural network. But instead of going wide, it can be more effective to go deep.
Agenda • Introduction • Machine Learning • - Deep Learning • - Transfer Learning • - Reinforcement Learning
- Deep Learning • detect complex interactions among features • learn low-level features from minimally processed raw data • work with high-cardinality class memberships • work with unlabelled data
Lots of hype – but it’s mostly true Fosbury flop analogy, gold rush analogy “Unreasonably effective”
Example - Drive a car
Which way to turn the steering wheel?
Same Problem as before
[Table: the same house-sales data shown earlier – the inputs on the left, Sale Price as the output]
Build a model to predict output from input
Example – Two Variable Classification e.g. X1 = House Price, X2 = House Area, Y = whether or not house sells at auction
Model: Predict whether or not the house will sell at auction (obviously fake data for illustration only)
Linear Regression
(X1, X2) model and (X1, X2, X1*X2) model
Tree
(X1, X2) model and (X1, X2, X1*X2) model
Random Forest
(X1, X2) model and (X1, X2, X1*X2) model
Support Vector Machine
(X1, X2) model and (X1, X2, X1*X2) model
Gradient Boosting
(X1, X2) model and (X1, X2, X1*X2) model
Single Layer Neural Network 5 nodes (X1, X2) model and (X1, X2, X1*X2) model
Going Deeper
playground.tensorflow.org
Adding X1*X2
Dropping X1 and X2 – Feature Engineering
Feature Engineering and Feature Selection is hard! An art and a science, and computationally difficult – O(2^n) feature subsets, where n = number of features and n can be thousands. How did we know to add X1*X2 (a hand-engineered feature – see the sketch below)? Can we get the machine to do it for us? Yes – Deep Learning
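A hedged illustration of the manual approach, on made-up simulated data where only the interaction matters: the classical model needs X1*X2 added by hand, whereas a deep network can discover such interactions itself.
set.seed(1)
# hypothetical data: the house sells at auction only when X1*X2 is large enough
X1 = rnorm(500); X2 = rnorm(500)
y  = as.integer(X1 * X2 + rnorm(500, sd = 0.3) > 0)
# (X1, X2) model - cannot represent the interaction
m1 = glm(y ~ X1 + X2, family = binomial)
# (X1, X2, X1*X2) model - the hand-engineered feature fixes it
m2 = glm(y ~ X1 + X2 + I(X1 * X2), family = binomial)
mean((predict(m1, type = "response") > 0.5) == y)  # close to chance (~0.5)
mean((predict(m2, type = "response") > 0.5) == y)  # close to 1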
Deep Learning – no X1*X2
Go and Play!
(use ReLU)
ReLU (Rectified Linear Unit) (similar to sigmoid but has advantages)
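For reference, the two activation functions written out in R (the plotting range is an arbitrary choice):
sigmoid = function(x) 1 / (1 + exp(-x))  # squashes the input into (0, 1)
relu    = function(x) pmax(0, x)         # zero for negative inputs, identity otherwise
curve(sigmoid, -5, 5)                    # smooth S-shaped curve
curve(relu, -5, 5, add = TRUE, lty = 2)  # hinge at zero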
Types of Neural Networks (different plumbing) • Recurrent Neural Networks (including LSTM) • Deep Belief Networks • Deep Boltzmann Machines • Autoencoders • Convolutional • …
Recurrent Neural Networks (including LSTM) Exploit repeated patterns that occur over, say, time or across the words of a sentence, by feeding data back into the network repeatedly
Recurrent Neural Networks (including LSTM)
LSTM – Long Short-Term Memory
Autoencoders (strange but the most fun) Train the output to match the input
deeplearning4j.org
Autoencoders • Compression • Dimension Reduction (resembles PCA) • Noise reduction (MRI example) • Drawing stuff
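A hedged sketch of an autoencoder in H2O (one of the packages mentioned earlier); the stand-in data, layer sizes and epochs are illustrative assumptions, not the presenter's setup.
library(h2o)
h2o.init()
hex = as.h2o(iris[, 1:4])   # stand-in numeric data; any table of numbers will do
# autoencoder: train the output to match the input through a narrow middle layer
ae = h2o.deeplearning(x = 1:4, training_frame = hex, autoencoder = TRUE,
                      hidden = c(8, 2, 8), epochs = 50)
# the 2-node middle layer gives a learned 2-D representation (cf. PCA)
codes = h2o.deepfeatures(ae, hex, layer = 2)
# per-row reconstruction error - large values flag noisy or unusual rows
err = h2o.anomaly(ae, hex)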
Drawing Stuff
More Drawing Stuff (DeepDream) – messing with the picture instead of the weights. Optimise the picture: find the best picture that turns on the "dog" neuron
More Drawing Stuff (DeepDream) – messing with weights
More Drawing Stuff
- Deep Learning – Convolutional Neural Networks Essentially networks of weights connected by activation functions (e.g. sigmoid)
Convolutions • Just functions that combine pixels in a weighted way • A way of getting a correlation between a shape and parts of an image • Example: find red circles in an image, find edges • Measure how much the image part matches the shape
Convolutions – Example – Gabor filters Find correlations with these shapes in the image
Example - Find edges in Images
The mathematics behind convolutions (animated gif)
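A minimal base-R sketch of that arithmetic (the tiny image, the Sobel-style kernel and the convolve2d helper are all illustrative): the kernel is slid over the image and a weighted sum of pixels is taken at each position, giving large values where the image matches the shape – here, a vertical edge.
# a tiny 6 x 6 greyscale "image": dark left half, bright right half
img = matrix(c(rep(0, 18), rep(1, 18)), nrow = 6)
# a 3 x 3 vertical-edge kernel (Sobel-style weights)
kernel = matrix(c(-1, -2, -1, 0, 0, 0, 1, 2, 1), nrow = 3)
# slide the kernel over the image, taking the weighted sum at each position
convolve2d = function(img, kernel) {
  kh = nrow(kernel); kw = ncol(kernel)
  out = matrix(0, nrow(img) - kh + 1, ncol(img) - kw + 1)
  for (i in 1:nrow(out))
    for (j in 1:ncol(out))
      out[i, j] = sum(img[i:(i + kh - 1), j:(j + kw - 1)] * kernel)
  out
}
convolve2d(img, kernel)   # large responses only where the vertical edge sits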
Hierarchy of Image Features – "Max Pooling"
We rescale the picture to find what we are looking for at different sizes and to find more complicated shapes
A Convolutional Neural Network is just stacked layers of convolutions and pooling
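And a matching sketch of 2 x 2 max pooling (the 4 x 4 feature map is made up): each 2 x 2 block is replaced by its largest value, halving the resolution while keeping the strongest responses.
# a hypothetical 4 x 4 feature map (e.g. the output of a convolution layer)
fmap = matrix(1:16, nrow = 4)
maxpool2x2 = function(fmap) {
  out = matrix(0, nrow(fmap) / 2, ncol(fmap) / 2)
  for (i in 1:nrow(out))
    for (j in 1:ncol(out))
      out[i, j] = max(fmap[(2*i - 1):(2*i), (2*j - 1):(2*j)])
  out
}
maxpool2x2(fmap)   # a 2 x 2 summary at half the resolution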
(clarifai.com)
It creates a hierarchy of features at decreasing resolutions
Another example (Le Cun)
Google’s Inception V3 network
Google’s Inception V3 network performance
Inception V4 is out – see ArXiv 1602.07261
It’s not all Convolutions (but still weights)
Deep Learning Software (examples) • Theano, Keras, Caffe, CNTK, MXNet, H2O, Neon, Deeplearning4j, etc • TensorFlow – has lots of community support • NVIDIA DIGITS – easy to use – a web interface to Caffe • MXNet and H2O can be used from R
Agenda • Introduction • Machine Learning • - Deep Learning • - Transfer Learning • - Reinforcement Learning
- Transfer Learning (one more reason why the robots will be eating our lunch)
Example – Cat detector to turn on sprinklers
Transfer Learning Transfer Learning uses an existing set of weights for an existing Deep Learning network and adapts (retrains) some of the layers for a new set of images. This lets us (and robots) transfer learning to new tasks.
The next two examples show a case where the last three layers of a network downloaded from the internet are retrained to distinguish between cars and SUVs. The fruit image example does no retraining but takes the outputs of the second-last layer as inputs to a model trained in R.
- SUV / CAR example
ImageNet Network (MATLAB)
ImageNet Network (MATLAB)
Results - before
Results - after
Fruit Classification Demo
0 correct out of 57 images
Demo – Use the outputs from the second last layer (before classification), 1000 columns
Demo – R code using the 1000 columns
library(readr)
library(xgboost)
library(caret)
library(plyr)
setwd("~/MATLAB/matconvnet")
cat("reading data file\n")
train ...