
MACHINE LEARNING – AN OVERVIEW
Tony Cooper, Senior Data Scientist

July 2016 kpmg.com/nz

Agenda
• Introduction
• Machine Learning
  - Deep Learning
  - Transfer Learning
  - Reinforcement Learning


Introduction
• Last meeting – Machine Learning – what can it do
• This meeting – Machine Learning – how does it work
• Not covering – How to do Machine Learning (e.g. test / train split)
• Not covering – Applications (see e.g. a long list at http://www.deeplearningpatterns.com/doku.php/applications)

Also not covering:
• Speech
• Text (Buffalo buffalo Buffalo, buffalo buffalo, buffalo Buffalo buffalo)
• Audio
• Time Series
• Graphs
• Internet of Things
• Bots (e.g. Siri)
• Big Data

Reminder - Last Meeting – Nickle Lu
• Amazing things AI has done
• Why AI can do those things
• Why AI will eat everything
• How I learned AI in my career
• How can you apply it in yours

Presenter – Tony Cooper
• 5 years Stanford PhD
• Thesis – “Computer Intensive Statistics” project on numerical methods for the bootstrap (unfinished)
• DSIR – Consulting Statistician
• Funds Management – Database Technology (not Big Data)
• Double-Digit Numerics – Consulting Data Scientist
• KPMG – Set up Data Science Innovation Lab
• 3 years' experience with Deep Learning (2 years CNNs, 3 years H2O)

KPMG Innovation Lab
• Big Data
• Spark (Spark Meetup, Auckland, 5 September 2016 at KPMG)
• R
• H2O
• Machine Learning
  - Recommender Systems
  - Computational Advertising
  - Hyperpersonalisation (segmentation with segment size 1)
  - Computer Vision
  - GPU programming

KPMG Hardware
• 7 Node Spark Cluster (7 x 2 Xeon)
• 1 GPU Server
  - 4 x Tesla K80 GPU (4 x 24GB GPU RAM, 4 x 5000 cores, 4 x 5.8 TeraFLOPs)
  - 2 x Xeon 14 cores (56 threads), 1 TB RAM, 6TB SSD


Machine Learning
[Diagram relating Statistical Learning (SL), Machine Learning (ML), AI and Machine Intelligence]

Types of Machine Learning
• Unsupervised
• Supervised
• Semi-supervised

Machine Learning Resources

Technical

especially 7.10.2, 7.10.3

Practical

Dummies (can download the R and Python code without buying the book)

Experts Machine Learning’s best kept secret

Interesting

Courses, MOOCs (e.g. Udacity Deep Learning)

Internet
• http://deeplearning.stanford.edu/tutorial/
• http://cs231n.stanford.edu/ (some images in this presentation taken from there)
• Contests (esp Kaggle.com)
• Glossaries
  - http://envisat.esa.int/handbooks/meris/CNTR4-2-5.html
  - http://www.wildml.com/deep-learning-glossary/
  - http://deeplearning4j.org/glossary.html

Kaggle.com – competitions (Titanic a good starter), scripts (“kernels”), and real data

Tip – use containers
• Docker
• Can run Ubuntu on Windows
• All set up for you e.g. Google TensorFlow course at Udacity

A Taste of Machine Learning – Regression Example – Recommender System
[Slide figure: ratings given by Alice, Bob and Chad to five items, with some ratings missing]

Recommender System
The 3 × 5 ratings matrix R is written as the product of a 3 × 2 users matrix h and a 2 × 5 items matrix w:

$$
R =
\begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \\ h_{31} & h_{32} \end{pmatrix}
\times
\begin{pmatrix} w_{11} & w_{12} & w_{13} & w_{14} & w_{15} \\ w_{21} & w_{22} & w_{23} & w_{24} & w_{25} \end{pmatrix}
$$

For example, $R_{12} = h_{11} w_{12} + h_{12} w_{22}$

11 equations in 16 unknowns

Generically: $R = h_1 w_1 + h_2 w_2 + \dots + h_n w_n$
Solved using Alternating Least Squares (the machine chooses the features – latent features)
The machine did the work for us in deciding what features to use

R pseudo code
# ratings matrix
R = matrix(nr=3, nc=5, data=c(4,2,1,NA,5,NA,3,3,3,NA,4,2,1,3,NA))
# initial users matrix
h = matrix(nr=3, nc=2, data=rnorm(6))
# initial items matrix
w = matrix(nr=5, nc=2, data=rnorm(10))
# find h, w to minimize the squared error
for (iter in 1:5) {
  # update users: one least-squares solve per user, items held fixed
  for (i in 1:3) { h[i, ] = solve(...) }
  # update items: one least-squares solve per item, users held fixed
  for (j in 1:5) { w[j, ] = solve(...) }
}
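The alternating updates in the pseudo code can be sketched end to end. The following is a minimal Python illustration, not the presenter's implementation: the ratings are read from the R `matrix(...)` call above (R fills column by column), and the small ridge penalty `LAMBDA`, the seed and the iteration count are assumptions added so that every 2 × 2 solve is well-posed. As in the R code, `w` is stored with one row per item.

```python
import random

# Ratings from the R call above: rows are Alice, Bob, Chad; None marks a missing rating.
R = [
    [4, None, 3, None, 1],
    [2, 5, 3, 4, 3],
    [1, None, 3, 2, None],
]
K = 2          # number of latent features, as on the slide (the solver below assumes K == 2)
LAMBDA = 0.1   # assumed ridge penalty; not part of the original pseudo code


def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


def ridge_fit(features, targets):
    """Least-squares weights for targets ~ features, with ridge penalty LAMBDA."""
    a = [[LAMBDA if p == q else 0.0 for q in range(K)] for p in range(K)]
    b = [0.0] * K
    for f, t in zip(features, targets):
        for p in range(K):
            b[p] += f[p] * t
            for q in range(K):
                a[p][q] += f[p] * f[q]
    return solve_2x2(a, b)


def squared_error(h, w):
    """Sum of squared errors over the known ratings only."""
    return sum(
        (R[i][j] - sum(h[i][k] * w[j][k] for k in range(K))) ** 2
        for i in range(len(R)) for j in range(len(R[0])) if R[i][j] is not None
    )


def als(iterations=20, seed=0):
    rng = random.Random(seed)
    h = [[rng.gauss(0, 1) for _ in range(K)] for _ in R]       # users, 3 x 2
    w = [[rng.gauss(0, 1) for _ in range(K)] for _ in R[0]]    # items, 5 x 2
    errors = [squared_error(h, w)]
    for _ in range(iterations):
        # Update each user with the item factors held fixed.
        for i in range(len(R)):
            known = [j for j in range(len(R[0])) if R[i][j] is not None]
            h[i] = ridge_fit([w[j] for j in known], [R[i][j] for j in known])
        # Update each item with the user factors held fixed.
        for j in range(len(R[0])):
            known = [i for i in range(len(R)) if R[i][j] is not None]
            w[j] = ridge_fit([h[i] for i in known], [R[i][j] for i in known])
        errors.append(squared_error(h, w))
    return h, w, errors


h, w, errors = als()
print("squared error before and after:", errors[0], errors[-1])
```

A predicted rating for a missing cell is then `sum(h[i][k] * w[j][k] for k in range(K))`, which is the slide's R = h1 w1 + h2 w2 with two latent features.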

Another Taste - Beyond Linear Regression

Suburb      List Price  Agreement Date  Type
Mt Roskill  308000      40972           R
Mt Roskill  300000      40944           R
Mt Albert               41007           R
Mt Albert               40900           R
Mt Albert   695000      40728           R
Mt Roskill  760000      40862           R
Mt Albert               40961           R
Mt Albert               40996           R
Mt Albert               41016           R
Mt Eden                 40856           R
Mt Roskill  380000      40985           R
Mt Albert               40975           R
Mt Eden     160000      40757           R
Mt Eden                 40689           R
Mt Eden     173000      40996           APT
Mt Albert   249000      40967           APT
Mt Eden     359000      40819           R
Mt Albert   299000      40985           R
Mt Eden                 40709           R
Mt Eden                 40974           APT
Mt Albert   380000      40750           R
Mt Eden                 40788           R
Mt Albert               40985           R
Mt Albert   399000      40994           R
Mt Eden     665000      40966           R
Mt Albert   319000      40711           APT
Mt Albert   319000      40757           APT

Remaining columns (many cells empty), one row per line:
Title | Sale Method | Bedrooms | Land Area | Floor Area | Existing/New | Valuation | Valuation Year | Sale Price
P 2 308000
P 3 400000
C P 3 E 484000
F P 3 E 625000
P 4 695000
P 3 760000
C A 3 E 790000 2011 945000
F P 2 511 E 730000
F P 5 556 E 670000 815000
F A 3 612 E 570000 810000
P 3 754 400000
F A 3 809 E 730000 2011 780000
P 1 32 220000 223000
S P 1 45 E 220000 2008 265000
P 1 51 173000
S P 2 51 E 230000 2011 230000
P 2 54 336575
P 2 59 270000 289800
F P 2 60 E 405000
C T 2 70 385000
P 2 70 340000 370000
A 2 70 340000 380000
C A 2 74 E 505000
P 2 74 375000 395500
F P 3 202 80 640000
U P 2 81 E 305000
P 2 81 315000


Linear Regression (200 years old)
• Essentially a weighted combination of the inputs e.g.
• SalePrice = w0 + w1*Suburb + w2*ListPrice + w3*AgreementDate + w4*Type + w5*Title + w6*SaleMethod + w7*Bedrooms + w8*LandArea + w9*FloorArea + w10*Existing + w11*ValuationYear
• Pros
  - Simple to understand and interpret (taught at high school)
  - Simple to compute in Excel (Least Squares)
• Problems
  - The world isn’t linear
  - Doesn’t handle interactions easily (Samuel Johnson: Your Manuscript Is Good and Original, But What is Original Is Not Good; What Is Good Is Not Original)
  - Doesn’t handle missing values at all
  - Doesn’t handle correlated inputs well
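To make "Least Squares" and the missing-values problem concrete, here is a small Python sketch with a single input. The data rows are made up for illustration (they are not taken from the house sales table), and the function name `fit_line` is hypothetical. Rows with a missing value have to be dropped before fitting, which is the limitation listed above.

```python
def fit_line(rows):
    """Ordinary least squares for y ~ w0 + w1 * x, using complete rows only."""
    complete = [(x, y) for x, y in rows if x is not None and y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in complete)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    w1 = sxy / sxx                 # slope
    w0 = mean_y - w1 * mean_x      # intercept
    return w0, w1, n


# Made-up rows of (floor area in m^2, sale price in $000); None marks a missing value.
rows = [(50, 250), (60, 290), (None, 310), (70, 330), (80, None), (80, 370)]
w0, w1, used = fit_line(rows)
print("SalePrice = %.1f + %.1f * FloorArea, fitted on %d of %d rows"
      % (w0, w1, used, len(rows)))
```

With eleven inputs, as in the SalePrice formula, the same idea applies but the weights come from solving a set of simultaneous equations rather than two closed-form expressions.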

Simple Example – Actual Function
[Plot: Response (0 to 12) against Input (0 to 3000)]

Simple Example – Noise Added
[Plot: Response samples, same axes]

Linear Fit (underfit)
[Plot: Linear Fit (Underfitting), same axes]

Cubic Fit
[Plot: Cubic Fit, same axes]

Quartic Fit
[Plot: Quartic Fit, same axes]

Quintic Fit
[Plot: Quintic Fit, same axes]

Overfitting
[Plot: Overfitting Fit, same axes]
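The progression from underfitting to overfitting can be reproduced with polynomial least squares. The sketch below is an illustration under stated assumptions: the slides do not give the actual function, so `truth` is a stand-in curve, and the noise level, sample sizes and seed are arbitrary. It fits polynomials of increasing degree and prints the error on the training samples and on separate held-out samples.

```python
import math
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_poly(xs, ys, degree):
    """Least-squares coefficients (lowest power first) via the normal equations."""
    powers = range(degree + 1)
    a = [[sum(x ** (p + q) for x in xs) for q in powers] for p in powers]
    b = [sum(y * x ** p for x, y in zip(xs, ys)) for p in powers]
    return solve(a, b)


def predict(coeffs, x):
    return sum(c * x ** p for p, c in enumerate(coeffs))


def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    return math.sin(3 * x)   # assumed stand-in for the slide's "actual function"


rng = random.Random(1)
train_x = [-1 + 2 * i / 19 for i in range(20)]
train_y = [truth(x) + rng.gauss(0, 0.2) for x in train_x]
test_x = [-0.95 + 2 * i / 20 for i in range(20)]
test_y = [truth(x) + rng.gauss(0, 0.2) for x in test_x]

train_error = {}
for degree in (1, 3, 5):
    coeffs = fit_poly(train_x, train_y, degree)
    train_error[degree] = mse(coeffs, train_x, train_y)
    print(degree, train_error[degree], mse(coeffs, test_x, test_y))
```

The training error can only fall as the degree rises, because each higher-degree model contains the lower-degree ones. The held-out error carries no such guarantee, which is why it is the number to watch when judging overfitting.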

Support Vector Regression Fit
[Plot: Support Vector Machine Fit, C = 127.578, gamma = 1.22; Response (0 to 12) against Input (0 to 3000)]

Sigmoid

Neural Network 1 Neuron (sigmoid)
[Plot: Neural Network Fit, Hidden Layer Size = 1; Response (0 to 12) against Input (0 to 3000)]

Linear: $R = h_1 w_1 + h_2 w_2 + \dots + h_n w_n$, where each $h_i$ is constant

Neural Network: $R = S_1 w_1 + S_2 w_2 + \dots + S_n w_n$, where each $S_i$ is a sigmoid

A neural network is just a bunch of weighted sigmoid regressions; n is the number of nodes
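The weighted sum of sigmoids can be written out directly. This is a sketch with hand-picked parameters, not a trained network: the inner slope and offset of each sigmoid (`a`, `b`) and the output weights are assumptions chosen so that the curve rises and then falls over the 0 to 3000 input range used in the plots.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def network(x, hidden, weights, bias=0.0):
    """R = S1*w1 + ... + Sn*wn, where Si = sigmoid(a_i * x + b_i)."""
    return bias + sum(
        w * sigmoid(a * x + b) for (a, b), w in zip(hidden, weights)
    )


# Two hidden nodes with hand-picked (not trained) parameters: the first switches
# on near x = 1000, the second switches on near x = 2000 and pulls the output down.
hidden = [(0.01, -10.0), (0.01, -20.0)]
weights = [8.0, -6.0]

for x in (0, 500, 1000, 1500, 2000, 2500, 3000):
    print(x, round(network(x, hidden, weights), 3))
```

Training a network means choosing the values in `hidden` and `weights` to minimise the squared error, and adding nodes adds more bends to the curve.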

Neural Network 2 Neurons (sigmoid)
[Plot: Neural Network Fit, Hidden Layer Size = 2; Response (0 to 12) against Input (0 to 3000)]

Neural Network 5 Neurons (sigmoid)
[Plot: Neural Network Fit, Hidden Layer Size = 5; same axes]

Neural Network 10 Neurons (sigmoid)
[Plot: Neural Network Fit, Hidden Layer Size = 10; same axes; annotated “Overfitting”]
10 is too many neurons

Neural Network 5 Neurons (sigmoid)

playground.tensorflow.org

Neural Networks are just weighted regressions: $R = h_1 w_1 + h_2 w_2 + \dots + h_n w_n$

Neural Network – the universal approximation theorem says that a network with a single hidden layer and enough nodes can approximate any continuous function on a bounded range as closely as you like.
But instead of going wide it can be more effective going deep


- Deep Learning
• detect complex interactions among features
• learn low-level features from minimally processed raw data
• work with high-cardinality class memberships
• work with unlabelled data

Lots of hype – but it’s mostly true
Fosbury flop analogy, gold rush analogy
“Unreasonably effective”

Example - Drive a car

Which way to turn the steering wheel?

Same Problem as the house sales table from “Another Taste - Beyond Linear Regression” (Suburb, List Price, Agreement Date, Type, Title, Sale Method, Bedrooms, Land Area, Floor Area, Existing/New, Valuation, Valuation Year, Sale Price)

Build a model to predict output from input

Example – Two Variable Classification e.g. X1 = House Price, X2 = House Area, Y = whether or not house sells at auction

Model: Predict whether or not the house will sell at auction (obviously fake data for illustration only)

Linear Regression

(X1, X2) model and (X1, X2, X1*X2) model

Tree

(X1, X2) model and (X1, X2, X1*X2) model

Random Forest

(X1, X2) model and (X1, X2, X1*X2) model

Support Vector Machine

(X1, X2) model and (X1, X2, X1*X2) model

Gradient Boosting

(X1, X2) model and (X1, X2, X1*X2) model

Single Layer Neural Network 5 nodes (X1, X2) model and (X1, X2, X1*X2) model

Going Deeper

playground.tensorflow.org

Adding X1*X2

Dropping X1 and X2 – Feature Engineering

Feature Engineering and Feature Selection are hard!
An art and a science, computationally difficult – O(2^n), n = no. of features, n can be thousands
How did we know to add X1*X2?
Can we get the machine to do it for us?
Yes – Deep Learning
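Why the X1*X2 feature matters can be shown on a few made-up points. In this sketch (the points, the weight grid and the helper name `accuracy` are all illustrative assumptions) the label depends on whether X1 and X2 have the same sign. A brute-force search over linear rules on the raw inputs cannot classify every point, while a single threshold on the engineered feature X1*X2 can.

```python
from itertools import product

# Made-up points: the label is 1 when X1 and X2 have the same sign, else 0.
points = [(-2, -1), (-1, -2), (1, 2), (2, 1), (-2, 1), (-1, 2), (1, -2), (2, -1)]
labels = [1 if x1 * x2 > 0 else 0 for x1, x2 in points]


def accuracy(predict):
    return sum(predict(p) == y for p, y in zip(points, labels)) / len(points)


# Best linear rule on the raw inputs: try a grid of weights for w1*X1 + w2*X2 + b > 0.
grid = [v / 2 for v in range(-4, 5)]
best_raw = max(
    accuracy(lambda p, w=w: 1 if w[0] * p[0] + w[1] * p[1] + w[2] > 0 else 0)
    for w in product(grid, repeat=3)
)

# A single threshold on the engineered feature X1*X2 separates the classes exactly.
with_product = accuracy(lambda p: 1 if p[0] * p[1] > 0 else 0)

print("best linear rule on (X1, X2):", best_raw)
print("threshold on X1*X2:", with_product)

# Choosing which of n candidate features to keep means searching 2**n subsets.
print("subsets for n = 20 features:", 2 ** 20)
```

The last line is the O(2^n) cost in miniature: twenty candidate features already give over a million subsets to consider.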

Deep Learning – no X1*X2

Go and Play!

(use ReLU)

ReLU (Rectified Linear Unit) (similar to sigmoid but has advantages)
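For reference, the two activation functions side by side. The slide does not say which advantages it has in mind; one commonly cited advantage, shown here as an illustration, is that the slope of ReLU stays at 1 for any positive input, whereas the slope of the sigmoid shrinks towards 0 for large inputs, which slows learning in deep networks. ReLU is also cheaper to compute.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    """Rectified Linear Unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def relu_slope(z):
    return 1.0 if z > 0 else 0.0


for z in (-10.0, -1.0, 0.5, 10.0):
    print(z, relu(z), relu_slope(z), round(sigmoid(z), 5), round(sigmoid_slope(z), 5))
```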

Types of Neural Networks (different plumbing)
• Recurrent Neural Networks (including LSTM)
• Deep Belief Network
• Deep Boltzmann Machines
• Autoencoders
• Convolutional
• …

Recurrent Neural Networks (including LSTM)
Exploit repeated patterns that occur over, say, time or, say, sentences by feeding data repeatedly into the network

LSTM – Long Short-Term Memory
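"Feeding data repeatedly into the network" can be sketched with a single recurrent unit. This is a plain recurrent step with assumed, untrained weights (`w_state`, `w_input` are hypothetical values), not an LSTM, which adds gates that control what is remembered and forgotten. The point of the sketch is that the same weights are reused at every step and the carried-over state makes the result depend on the order of the inputs.

```python
import math


def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """One recurrent step: the new state mixes the old state with the new input."""
    return math.tanh(w_state * state + w_input * x)


def run(sequence):
    state = 0.0
    for x in sequence:          # the same weights are reused at every step
        state = rnn_step(state, x)
    return state


# The same three values in a different order leave the network in a different state.
print(run([1.0, 0.0, 0.0]), run([0.0, 0.0, 1.0]))
```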

Autoencoders (strange but the most fun) Train the output to match the input

deeplearning4j.org

Autoencoders
• Compression
• Dimension Reduction (resembles PCA)
• Noise reduction (MRI example)
• Drawing stuff

Drawing Stuff

More Drawing Stuff (DeepDream) – messing with weights
Optimise the picture instead of optimising the weights. Find the best picture that turns on the “dog” neuron

More Drawing Stuff

- Deep Learning – Convolutional Neural Networks
Essentially networks of weights connected by activation functions (e.g. sigmoid)

Convolutions
• Just functions that combine pixels in a weighted way
• A way of getting a correlation between a shape and parts of an image
• Example: find red circles in an image, find edges
• Measure how much the image part matches the shape

Convolutions – Example – Gabor filters Find correlations with these shapes in the image

Example - Find edges in Images

The mathematics behind convolutions (animated gif)
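In place of the animation, the arithmetic can be sketched in a few lines. The image and the edge kernel below are made up for illustration. The function slides the kernel over the image and records the weighted sum at each position, with no padding and a stride of 1, and without flipping the kernel (strictly a cross-correlation, which is what deep learning libraries usually call convolution).

```python
def convolve2d(image, kernel):
    """Weighted sum of the pixels under the kernel at every valid position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Made-up 4x6 grey-scale image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Hand-written vertical-edge kernel: right-hand column minus left-hand column.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

for row in convolve2d(image, kernel):
    print(row)
```

The output is large where the patch under the kernel looks like the shape the kernel encodes (here, a dark-to-bright vertical edge) and zero where the image is flat. In a convolutional network the kernel values are learned rather than written by hand.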

Hierarchy of Image Features – “Max Pooling”

We rescale the picture to find what we are looking for at different sizes and to find more complicated shapes
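The rescaling step can be sketched as 2 × 2 max pooling. The feature map values are made up; the function keeps the largest value in each non-overlapping block, so the output has half the resolution but still records where a feature matched strongly.

```python
def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 0, 0],
    [2, 9, 0, 4],
    [0, 0, 7, 1],
    [5, 0, 2, 6],
]
pooled = max_pool(feature_map)
print(pooled)
```

Applying a convolution to the pooled map lets a kernel of the same size cover a larger area of the original picture, which is how later layers come to detect larger and more complicated shapes.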

A Convolutional Neural Network is just stacked layers of convolutions and pooling

(clarifai.com)

It creates a hierarchy of features at decreasing resolutions

Another example (LeCun)

Google’s Inception V3 network

Google’s Inception V3 network performance

Inception V4 is out – see arXiv:1602.07261

It’s not all Convolutions (but still weights)

Deep Learning Software (examples)
• Theano, Keras, Caffe, CNTK, MXNet, H2O, Neon, Deeplearning4j, etc
• TensorFlow – has lots of community support
• NVIDIA DIGITS – easy to use – a web interface to Caffe
• MXNet and H2O can be used from R


- Transfer Learning (one more reason why the robots will be eating our lunch)

Example – Cat detector to turn on sprinklers

Transfer Learning
Transfer Learning takes the weights of an existing, already trained Deep Learning network and adapts (retrains) some of the layers for a new set of images. This lets us (and robots) transfer learning to new tasks.

The next two examples show a case where the last three layers of a network downloaded from the internet are retrained to distinguish between cars and SUVs. The fruit image example does no retraining but takes the outputs of the second-last layer as inputs to a model trained in R.
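The second approach (frozen network, new model on top of its outputs) can be sketched as follows. Everything here is a stand-in: `pretrained_features` is a hypothetical two-number substitute for the 1000 outputs of a real downloaded network, the "images" are made-up lists of four pixel intensities, and a small logistic regression replaces the model used in the demo. Only the new last layer is trained.

```python
import math


def pretrained_features(pixels):
    """Stand-in for the frozen layers of a downloaded network (hypothetical)."""
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]


def train_last_layer(features, labels, steps=2000, rate=0.5):
    """Fit only a new logistic-regression layer on top of the frozen features."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(steps):
        for f, y in zip(features, labels):
            z = bias + sum(w * v for w, v in zip(weights, f))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient step on the log loss; the frozen layers are never touched.
            weights = [w + rate * (y - p) * v for w, v in zip(weights, f)]
            bias += rate * (y - p)
    return weights, bias


def predict(pixels, weights, bias):
    z = bias + sum(w * v for w, v in zip(weights, pretrained_features(pixels)))
    return 1 if z > 0 else 0


# Made-up "images" (four pixel intensities each) for a new two-class task.
images = [
    [0.1, 0.2, 0.1, 0.2],
    [0.2, 0.1, 0.3, 0.2],
    [0.8, 0.9, 0.7, 0.9],
    [0.9, 0.8, 0.9, 0.7],
]
labels = [0, 0, 1, 1]

features = [pretrained_features(img) for img in images]
weights, bias = train_last_layer(features, labels)
print([predict(img, weights, bias) for img in images])
```

Because the expensive feature extraction is reused as-is, the new task needs far fewer labelled examples and far less computation than training a network from scratch.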

- SUV / CAR example

ImageNet Network (MATLAB)

ImageNet Network (MATLAB)

Results - before

Results - after

Fruit Classification Demo

0 correct out of 57 images

Demo – Use the outputs from the second last layer (before classification), 1000 columns

Demo – R code using the 1000 columns
library(readr)
library(xgboost)
library(caret)
library(plyr)

setwd("~/MATLAB/matconvnet")
cat("reading data file\n")
train
