Machine Learning

Nominal - is for mutually exclusive, but not ordered, categories. Ordinal - is one where the order matters but not the difference between values. Interval - is a measurement where the difference between two values is meaningful. Ratio - has all the properties of an interval variable, and also has a clear definition of 0.0.

Identify Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.

Correlation

Data Types

Variable Identification

Mean, Median, Mode, Min, Max, Range, Quartile, IQR, Variance, Standard Deviation, Skewness, Histogram, Box Plot

Continuous Features

Frequency, Histogram

Categorical Features

Covariance  Features should be uncorrelated with each other and highly correlated to the feature we’re trying to predict.

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
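
A minimal sketch (not part of the original map), assuming scikit-learn and NumPy, showing PCA keeping the two directions of largest variance:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 5)            # hypothetical data: 100 samples, 5 features
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)       # project onto the first two principal components
    print(pca.explained_variance_ratio_)   # share of variance captured by each component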

Singular Value Decomposition (SVD)

SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
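
A minimal sketch, assuming NumPy, of factorizing an arbitrary m×n matrix and rebuilding it from the factors:

    import numpy as np

    A = np.random.rand(6, 4)                          # hypothetical 6x4 matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_rebuilt = U @ np.diag(s) @ Vt                   # equals A up to floating-point error
    print(np.allclose(A, A_rebuilt))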

Dimensionality Reduction

Univariate Analysis  Feature Selection

Finds out the relationship between two variables.  Data Exploration

Correlation Plot - Heatmap  Two-way table  Bi-variate Analysis

Wrapper Methods

Importance

Stacked Column Chart  This test is used to derive the statistical significance of the relationship between the variables.

Chi-Square Test  Z-Test / T-Test  Embedded Methods

ANOVA  One may choose to either omit elements from a dataset that contain missing values or to impute a value.

Wrapper methods evaluate subsets of variables which allows, unlike filter approaches, to detect the possible interactions between variables. The two main disadvantages of these methods are: the increased overfitting risk when the number of observations is insufficient, and the significant computation time when the number of variables is large.

Feature Cleaning

Recursive Feature Elimination  Genetic Algorithms  Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of coefficients. Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of coefficients.
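
A minimal sketch, assuming scikit-learn, contrasting the L1 (Lasso) and L2 (Ridge) penalties on a made-up regression problem where only the first feature matters:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.rand(200, 10)
    y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

    lasso = Lasso(alpha=0.1).fit(X, y)  # L1: tends to drive irrelevant coefficients to exactly zero
    ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks coefficients but keeps them non-zero
    print(lasso.coef_)
    print(ridge.coef_)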

Machine Learning Data Processing  In One Hot Encoding, make sure the encodings are done in a way that all features are linearly independent.

Obvious inconsistencies  One Hot Encoding

Label Encoding

Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.

Hot-Deck

Selects donors from another dataset to complete missing data.  Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
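
A minimal sketch, assuming scikit-learn, of mean-substitution for missing values:

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
    imputer = SimpleImputer(strategy="mean")  # replace each NaN with its column mean
    print(imputer.fit_transform(X))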

Cold-Deck

Rescaling  Mean-substitution

A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.

Regression

Feature Normalisation or Scaling Methods

Feature Imputation

The simplest method is rescaling the range of features to scale the range in [0, 1] or [-1, 1].

Standardization

Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit variance.
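
A minimal sketch, assuming scikit-learn, contrasting rescaling to [0, 1] with standardization to zero mean and unit variance:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    print(MinMaxScaler().fit_transform(X))    # each column rescaled into [0, 1]
    print(StandardScaler().fit_transform(X))  # each column: zero mean, unit variance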

Scaling to unit length

Training Dataset

A set of examples used for learning.

Some Libraries...  Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.
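
A minimal sketch, assuming pandas, of decomposing that timestamp into categorical attributes (the bin edges for part_of_day are an arbitrary choice):

    import pandas as pd

    df = pd.DataFrame({"ts": pd.to_datetime(["2014-09-20T20:45:40Z"])})
    df["hour_of_the_day"] = df["ts"].dt.hour
    df["part_of_day"] = pd.cut(df["hour_of_the_day"],
                               bins=[0, 6, 12, 18, 24], right=False,
                               labels=["night", "morning", "afternoon", "evening"])
    print(df)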

Decompose

Test Dataset

Validation Dataset

Discretization  Categorical Features

Feature Engineering

Reframe Numerical Quantities

Creating new features as a combination of existing features. Could be multiplying numerical features, or combining categorical variables. This is a great way to add domain expertise to the dataset.
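
A minimal sketch, assuming pandas, of crossing existing columns into new features (the column names are made up):

    import pandas as pd

    df = pd.DataFrame({"rooms": [2, 3, 4], "area": [50.0, 70.0, 120.0],
                       "city": ["A", "B", "A"], "type": ["flat", "house", "flat"]})
    df["area_per_room"] = df["area"] / df["rooms"]     # numerical combination
    df["city_x_type"] = df["city"] + "_" + df["type"]  # categorical cross
    print(df)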

A set of examples used only to assess the performance of a fully-trained classifier.

Dataset Construction

Continuous Features

Changing from grams to kg, and losing detail might be both wanted and efficient for calculation.

Backward Elimination

Embedded methods try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.

Feature Encoding

Outliers

The technique then finds the first missing value and uses the cell value immediately prior to the data that are missing to impute the missing value.

Values for categorical features may be combined, particularly when there are few samples for some categories.

Chi-Square  Forward Selection

Machine Learning algorithms perform Linear Algebra on Matrices, which means all features must be numeric. Encoding helps us do this.  Special values
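
A minimal sketch, assuming pandas, of one-hot encoding a categorical column; dropping one level keeps the dummy columns linearly independent:

    import pandas as pd

    df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
    print(pd.get_dummies(df, columns=["colour"], drop_first=True))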

They should be detected, but not necessarily removed. Their inclusion in the analysis is a statistical decision.

Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies).
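
A minimal sketch, assuming pandas, of both variants with K = 4:

    import pandas as pd

    values = pd.Series([1, 2, 3, 4, 5, 10, 20, 50, 100, 200])
    print(pd.cut(values, bins=4))   # 4 bins of equal width
    print(pd.qcut(values, q=4))     # 4 bins holding ~25% of the data each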

Linear Discriminant Analysis  ANOVA: Analysis of Variance

Missing values

Numeric variables are endowed with several formalized special values including ±Inf, NA and NaN. Calculations involving special values often result in special values, and need to be handled/cleaned

A person's age cannot be negative, a man cannot be pregnant, and an under-aged person cannot possess a driver's license.

Correlation

Filter type methods select features based only on general metrics like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting.

Filter Methods

Scatter Plot

We can start analyzing the relationship by creating a two-way table of count and count%.

 A measure of how much two random variables change together. Math: dot(de_mean(x), de_mean(y)) / (n - 1)
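
A minimal sketch in plain Python of that formula:

    def de_mean(xs):
        mean = sum(xs) / len(xs)
        return [x - mean for x in xs]

    def covariance(xs, ys):
        n = len(xs)
        return sum(a * b for a, b in zip(de_mean(xs), de_mean(ys))) / (n - 1)

    print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))  # positive: the variables move together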

Crossing

Cross Validation

A set of examples used to tune the parameters of a classifier.

To scale the components of a feature vector such that the complete vector has length one.  To fit the parameters of the classifier in the Multilayer Perceptron, for instance, we would use the training set to find the “optimal” weights when using backpropagation. In the Multilayer Perceptron case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further. In the Multilayer Perceptron case, we would use the validation set to find the “optimal” number of hidden units or determine a stopping point for the back-propagation algorithm.

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
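
A minimal sketch, assuming scikit-learn, of 5-fold cross-validation with the scores averaged over the rounds:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())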

Plot the variance per feature and select the features with the largest variance.

Regression  When we are interested mainly in the predicted variable as a result of the inputs, but not in the way each of the inputs affects the prediction. In a real estate example, Prediction would answer the question of: Is my house over or under valued? Non-linear models are very good at these sorts of predictions, but not great for inference because the models are much less interpretable.  When we are interested in the way each one of the inputs affects the prediction. In a real estate example, Inference would answer the question of: How much would my house cost if it had a view of the sea? Linear models are more suited for inference because the models themselves are easier to understand than their non-linear counterparts.

A supervised problem where the outputs are continuous rather than discrete.  Inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one or more (multi-label classification) of these classes. This is typically tackled in a supervised way.

Classification Prediction Types

Motivation

 A set of inputs is to be divided into groups. Unlike in classification, the groups are not known beforehand, making this typically an unsupervised task.

Clustering

Inference

Finds the distribution of inputs in some space.

DensityEstimation

Simplifies inputs by mapping them into a lower-dimensional space.

DimensionalityReduction

Step 1: Making an assumption about the functional form or shape of our function (f), i.e.: f is linear, thus we will select a linear model.

Confusion Matrix  Parametric

Step 2: Selecting a procedure to fit or train our model. This means estimating the Beta parameters in the linear function. A common approach is (ordinary) least squares, amongst others.

Kind  Fraction of correct predictions; not reliable, as it is skewed when the data set is unbalanced (that is, when the number of samples in different classes varies greatly).

When we do not make assumptions about the form of our function (f). However, since these methods do not reduce the problem of estimating f to a small number of parameters, a large number of observations is required in order to obtain an accurate estimate for f. An example would be the thin-plate spline model.

Accuracy  Non-Parametric

Precision

The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.

Supervised

Out of all the examples the classifier labeled as positive, what fraction were correct?

No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

f1 score Categories

Recall

Unsupervised

Out of all the positive examples there were, what fraction did the classifier pick up?

A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). The program is provided feedback in terms of rewards and punishments as it navigates its problem space.

Reinforcement Learning

Harmonic Mean of Precision and Recall: (2 * p * r / (p + r))
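
A minimal sketch, assuming scikit-learn, of these metrics on hypothetical predictions:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]
    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred))  # of those labelled positive, fraction correct
    print(recall_score(y_true, y_pred))     # of actual positives, fraction picked up
    print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall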

Decision tree learning  Association rule learning

Performance  Analysis

ROC Curve - Receiver Operating Characteristics

Artificial neural networks  Deep learning  Inductive logic programming

True Positive Rate (Recall / Sensitivity) vs False Positive Rate (1 - Specificity)

Support vector machines Clustering

Bias refers to the amount of error that is introduced by approximating a real-life problem, which may be extremely complicated, by a simple model. If Bias is high, and/or if the algorithm performs poorly even on your training data, try adding more features, or a more flexible model.

 Approaches

Representation learning

Variance is the amount our model’s prediction would change when using a different training data set. High: remove features, or obtain more data.  1.0 - sum_of_squared_errors / total_sum_of_squares(y)

Bayesian networks  Reinforcement learning

Bias-Variance Tradeoff

Similarity and metric learning  Sparse dictionary learning  Genetic algorithms

Goodness of Fit = R^2

Rule-based machine learning

The mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations, that is, the difference between the estimator and what is estimated.
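
A minimal sketch, assuming NumPy, of MSE as the average squared difference between estimates and true values:

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])
    print(np.mean((y_true - y_pred) ** 2))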

Learning classifier systems  Mean Squared Error (MSE)

Mixtures of Gaussians, Mixtures of experts, Hidden Markov Models (HMM)

Machine Learning Concepts  Popular models  The proportion of mistakes made if we apply our estimated model function to the training observations in a classification setting.

Generative Methods

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.

Model class-conditional pdfs and prior probabilities. “Generative” since sampling can generate synthetic data points.

Taxonomy

Directly estimate posterior probabilities. No attempt to model underlying probability distributions. Focus computational resources on the given task for better performance.  Discriminative Methods

Cross-validation

Leave-p-out cross-validation  Leave-one-out cross-validation  k-fold cross-validation

Logistic regression, SVMs  Popular Models

Methods

Prediction  Accuracy vs Model Interpretability

Repeated random sub-sampling validation

For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. The first usage of these techniques was focused on neural networks. Since then, these methods have been extended to other models such as support vector machines or logistic regression.

Selection Criteria

Grid Search Numpy

Pandas

Since grid searching is an exhaustive and therefore potentially expensive method, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search.

Traditional neural networks, Nearest neighbor  Conditional Random Fields (CRF)

Holdout method

The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.
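
A minimal sketch, assuming scikit-learn, of a grid search over a hand-picked subset of SVM hyperparameters, scored by cross-validation:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(), param_grid, cv=5)  # exhaustive sweep over 3 x 3 settings
    search.fit(X, y)
    print(search.best_params_, search.best_score_)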

Gaussians, Naïve Bayes, Mixtures of multinomials Sigmoidal belief networks, Bayesian networks, Markov random fields

Error Rate

Hyperparameters  Random Search  Scikit-Learn  Tuning

There is an inherent tradeoff between Prediction Accuracy and Model Interpretability; that is to say, as models get more flexible in the way the function (f) is selected, they get obscured and are hard to interpret. Inflexible methods are better for inference, and flexible methods are preferable for prediction.  Adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.  Offers data structures and operations for manipulating numerical tables and time series.  It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Gradient-based optimization

Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit, and stop the algorithm then.

Early Stopping (Regularization)

When a given method yields a small training MSE (or cost), but a large test MSE (or cost), we are said to be overfitting the data. This happens because our statistical learning procedure is trying too hard to find patterns in the data that might be due to random chance, rather than a property of our function. In other words, the algorithm may be learning the training data too well. If the model overfits, try removing some features, decreasing degrees of freedom, or adding more data.  Opposite of Overfitting. Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. It occurs when the model or algorithm does not fit the data enough. Underfitting occurs if the model or algorithm shows low variance but high bias (to contrast, the opposite, overfitting, comes from high variance and low bias). It is often a result of an excessively simple model.  Test that applies Random Sampling with Replacement of the available data, and assigns measures of accuracy (bias, variance, etc.) to sample estimates.  An approach to ensemble learning that is based on bootstrapping. Shortly, given a training set, we produce multiple different training sets (called bootstrap samples) by sampling with replacement from the original dataset. Then, for each bootstrap sample, we build a model. The result is an ensemble of models, where each model votes with equal weight. Typically, the goal of this procedure is to reduce the variance of the model.

Tensorflow

Overfitting

Underfitting

Bootstrap

Bagging

Libraries

Components

Python

MXNet

Does lazy evaluation. Need to build the graph, and then run it in a session.

Is a modern open-source deep learning framework used to train and deploy deep neural networks. The MXNet library is portable and can scale to multiple GPUs and multiple machines. MXNet is supported by major Public Cloud providers including AWS and Azure. Amazon has chosen MXNet as its deep learning framework of choice at AWS.

Keras

Is an open source neural network library written in Python. It is capable of running on top of MXNet, Deeplearning4j, Tensorflow, CNTK or Theano. Designed to enable fast experimentation with deep neural networks, it focuses on being minimal, modular and extensible.

Torch

Torch is an open source machine learning library, a scientific computing framework, and a script language based on the Lua programming language. It provides a wide range of algorithms for deep machine learning, and uses the scripting language LuaJIT, and an underlying C implementation.

Find  Collect  Explore  Clean Features  Impute Features  Data

Engineer Features Select Features

Classification Regression

Encode Features

Is this A or B?

How much, or how many of these?  Anomaly Detection

Clustering

Is this anomalous?

Build Datasets Question

How can these elements be grouped?

Reinforcement Learning

Model

Machine Learning is math. In specific, performing Linear Algebra on Matrices. Our data values must be numeric.

Select Algorithm based on question and data available

What should I do now?  The cost function will provide a measure of how far my algorithm and its parameters are from accurately representing my training data.

Vision API Speech API

Cost Function Sometimes referred to as Cost or Loss function when the goal is to minimise it, or Objective function when the goal is to maximise it.

Jobs API Google Cloud Video Intelligence API Language API

SaaS - Pre-built Machine Learning models

Translation API

Optimization Machine Learning Process

Rekognition Lex

 AWS

Tuning

Having selected a cost function, we need a method to minimise the Cost function, or maximise the Objective function. Typically this is done by Gradient Descent or Stochastic Gradient Descent. Different algorithms have different Hyperparameters, which will affect the algorithm's performance. There are multiple methods for Hyperparameter Tuning, such as Grid and Random search.

Polly

Analyse the performance of each algorithm and discuss the results.

Direction … many others ML Engine  Amazon Machine Learning

Google Cloud  AWS

Tools: Jupyter / Datalab / Zeppelin

Results and Benchmarking Data Science and Applied Machine Learning

Is the ML algorithm training and inference completing in a reasonable timeframe?

… many others Tensorflow

Scaling

MXNet

How does my algorithm scale for both training and inference? How can feature manipulation be done for training and inference in real-time?

Machine Learning Research Torch … many others

 Are the results good enough for production?

Deployment and Operationalisation

How to make sure that the algorithm is retrained periodically and deployed into production? How will the ML algorithms be integrated with other systems? Can the infrastructure running the machine learning process scale?

Infrastructure

How is access to the ML algorithm provided? REST API? SDK? Is the infrastructure appropriate for the algorithm we are running? CPUs or GPUs?

Many cost functions are the result of applying Maximum Likelihood. For instance, the Least Squares cost function can be obtained via Maximum Likelihood. Cross-Entropy is another example. The likelihood of a parameter value (or vector of parameter values), θ, given outcomes x, is equal to the probability (density) assumed for those observed outcomes given those parameter values, that is L(θ | x) = P(x | θ). The natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques.

Maximum Likelihood Estimation (MLE)

Basic Operations: Addition, Multiplication, Transposition

Almost all Machine Learning algorithms use Matrix algebra in one way or another. This is a broad subject, too large to be included here in its full length. Here's a start: https://en.wikipedia.org/wiki/Matrix_(mathematics)

Transformations

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well defined in the case of the normal distribution and many other problems.

Matrices

Trace, Rank, Determinant, Inverse

http://setosa.io/ev/eigenvectors-and-eigenvalues/

In linear algebra, an eigenvector or characteristic vector of a linear transformation T from a vector space V over a field F into itself is a non-zero vector that does not change its direction when that linear transformation is applied to it.

Eigenvectors and Eigenvalues

Cross entropy can be used to define the loss function in machine learning and optimization. The true probability pi is the true label, and the given distribution qi is the predicted value of the current model.

Derivatives  Chain Rule

Leibniz Notation

Cross-Entropy

The matrix of all first-order partial derivatives of a vector-valued function. When the matrix is a square matrix, both the matrix and its determinant are referred to as the Jacobian in literature.

Jacobian Matrix  Cross-entropy error function and logistic regression

Linear Algebra  Logistic

The gradient is a multi-variable generalization of the derivative. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.

Gradient

The logistic loss function is defined as: V(f(x), y) = (1/ln 2) · ln(1 + e^(-y·f(x)))

Cost/Loss (Min)  Objective (Max)  Functions

The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is: L(x) = C(t - x)² for some constant C.

Quadratic  For Machine Learning purposes, a Tensor can be described as a Multidimensional Matrix. Depending on the dimensions, the Tensor can be a Scalar, a Vector, a Matrix, or a Multidimensional Matrix.

Tensors

In statistics and decision theory, a frequently used loss function is the 0-1 loss function.

0-1 Loss

When measuring the forces applied to an infinitesimal cube, one can store the force values in a multidimensional matrix.

The hinge loss is a loss function used for training classifiers. For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as: ℓ(y) = max(0, 1 - t·y)

Hinge Loss

When the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.

Curse of Dimensionality  Exponential

Mean  Value in the middle of an ordered list, or average of the two in the middle.

Median

Most Frequent Value

Measures of Central Tendency

Mode

Division of probability distributions based on contiguous intervals with equal probabilities. In short: dividing the observation numbers in a sample list equally.

It is used to quantify the similarity between two probability distributions. It is a type of f-divergence.

Hellinger Distance

To define the Hellinger distance in terms of measure theory, let P and Q denote two probability measures that are absolutely continuous with respect to a third probability measure λ. The square of the Hellinger distance between P and Q is defined as the quantity H²(P, Q) = (1/2) ∫ (√(dP/dλ) - √(dQ/dλ))² dλ.

Quantile  Range

The average of the absolute value of the deviation of each value from the mean.

Is a measure of how one probability distribution diverges from a second expected probability distribution. Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.

Mean Absolute Deviation (MAD)

Three quartiles divide the data in approximately four equally divided parts.  The average of the squared differences from the Mean. Formally, it is the expectation of the squared deviation of a random variable from its mean, and it informally measures how far a set of (random) numbers are spread out from their mean.

Inter-quartile Range (IQR)

Kullback-Leibler Divergence

Definition

Dispersion

Variance Types Continuous

Continuous

Discrete

Is a measure of the difference between an original spectrum P(ω) and an approximation P̂(ω) of that spectrum. Although it is not a perceptual measure, it is intended to reflect perceptual (dis)similarity.

Itakura–Saito distance

https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications

Discrete

https://en.wikipedia.org/wiki/Loss_functions_for_classification

The signed number of standard deviations an observation or datum is above the mean.

z-score/value/factor

Frequentist

Standard Deviation  Frequentist vs Bayesian Probability

sqrt(variance)

Random Variable

dot(de_mean(x), de_mean(y)) / (n - 1)

Statistics

Conditionality

Relationship

Benchmarks monotonic relationship (whether linear or not). Spearman's coefficient is appropriate for both continuous and discrete variables, including ordinal variables.

Spearman

Correlation

Is a statistic used to measure the ordinal association between two measured quantities.

Bayes Theorem (rule, law)

Kendall

Simple Form  With Law of Total Probability

Machine Learning Mathematics

The results are presented in a matrix format, where the cross tabulation of two fields is a cell value. The cell value represents the percentage of times that the two fields exist in the same events.

Co-occurrence

Probability

The marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables.

Concepts  Marginalisation

Is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. The null hypothesis is generally assumed to be true until evidence indicates otherwise.

Null Hypothesis

This demonstrates that specifying a direction (on a symmetric test statistic) halves the p-value (increases the significance) and can mean the difference between data being considered significant or not. Suppose a researcher flips a coin five times in a row and assumes a null hypothesis that the coin is fair. The test statistic of "total number of heads" can be one-tailed or two-tailed: a one-tailed test corresponds to seeing if the coin is biased towards heads, but a two-tailed test corresponds to seeing if the coin is biased either way. The researcher flips the coin five times and observes heads each time (HHHHH), yielding a test statistic of 5. In a one-tailed test, this is the upper extreme of all possible outcomes, and yields a p-value of (1/2)^5 = 1/32 ≈ 0.03. If the researcher assumed a significance level of 0.05, this result would be deemed significant and the hypothesis that the coin is fair would be rejected. In a two-tailed test, a test statistic of zero heads (TTTTT) is just as extreme, and thus the data of HHHHH would yield a p-value of 2 × (1/2)^5 = 1/16 ≈ 0.06, which is not significant at the 0.05 level.

Continuous

Discrete

Is a fundamental rule relating marginal probabilities to conditional probabilities. It expresses the total probability of an outcome which can be realized via several distinct events, hence the name.

Law of Total Probability  Five heads in a row  Example

p-value

Chain Rule

Techniques

In this method, as part of experimental design, before performing the experiment, one first chooses a model (the null hypothesis) and a threshold value for p, called the significance level of the test, traditionally 5% or 1%, and denoted as α. If the p-value is less than the chosen significance level (α), that suggests that the observed data is sufficiently inconsistent with the null hypothesis that the null hypothesis may be rejected. However, that does not prove that the tested hypothesis is true. For typical analysis, using the standard α = 0.05 cutoff, the null hypothesis is rejected when p < .05 and not rejected when p > .05. The p-value does not, in itself, support reasoning about the probabilities of hypotheses, but is only a tool for deciding whether to reject the null hypothesis.

Permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities.

p-hacking Definition

States that a random variable defined as the average of a large number of independent and identically distributed random variables is itself approximately normally distributed.

Is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence. When continuous, it is described by the Probability Density Function.

Central Limit Theorem

Is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
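
A minimal sketch in plain Python: gradient descent on the one-variable quadratic f(x) = (x - 3)^2, whose gradient is 2(x - 3):

    x, learning_rate = 0.0, 0.1
    for _ in range(100):
        gradient = 2 * (x - 3)
        x -= learning_rate * gradient  # step proportional to the negative gradient
    print(x)                           # converges towards the minimum at x = 3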

Gradient Descent  Uniform  Poisson

Normal (Gaussian)

Gradient descent uses the total gradient over all examples per update, SGD updates after only 1 or a few examples:

Stochastic Gradient Descent (SGD)  Distributions

Types (Density Function)

Optimization

Gradient descent uses the total gradient over all examples per update, SGD updates after only 1 example.

Mini-batch Stochastic Gradient Descent (SGD)

Idea: Add a fraction v of the previous update to the current one. When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum.  Adaptive learning rates for each parameter

Momentum

Manhattan Distance

L1 norm

L2-norm is also known as least squares. It is basically minimizing the sum of the squares of the differences (S) between the target value and the estimated values:

Euclidean Distance

L2 norm

Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit, and stop the algorithm then.

Gamma  Cumulative Distribution Function (CDF)

Dropout

To evaluate a language model, we should measure how much surprise it gives us for real sequences in that language. For each real word encountered, the language model will give a probability p. And we use -log(p) to quantify the surprise. And we average the total surprise over a long enough sequence. So, in the case of a 1000-letter sequence with 500 A and 500 B, the surprise given by the 1/3-2/3 model will be: [-500*log(1/3) - 500*log(2/3)] / 1000 = 1/2 * log(9/2). While the correct 1/2-1/2 model will give: [-500*log(1/2) - 500*log(1/2)] / 1000 = 1/2 * log(8/2). So, we can see, the 1/3, 2/3 model gives more surprise, which indicates it is worse than the correct model. Only when the sequence is long enough will the average effect mimic the expectation over the 1/2-1/2 distribution. If the sequence is short, it won't give a convincing result.

Cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution p.
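
A minimal sketch, assuming NumPy, of cross entropy H(p, q) = -sum(p * log(q)) for the 1/2-1/2 "true" distribution and the 1/3-2/3 model from the example above:

    import numpy as np

    p = np.array([0.5, 0.5])       # true distribution
    q = np.array([1/3, 2/3])       # model's predicted distribution
    print(-np.sum(p * np.log(q)))  # larger than the entropy of p (log 2), so the model is worse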

Cross Entropy  Sparse regularizer on columns  Regularization

Information Theory

Nuclear norm regularization  This regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks. This is useful for expressing prior information that each task is expected to share similarities with each other task. An example is predicting blood iron levels measured at different times of the day, where each task represents a different person.

Entropy is a measure of unpredictability of information content.

Entropy

Early Stopping

Is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.  This regularizer defines an L2 norm on each column and an L1 norm over all columns. It can be solved by proximal methods.

Binomial  Bernoulli

Adagrad

L1-norm is also known as least absolute deviations (LAD), least absolute errors (LAE). It is basically minimizing the sum of the absolute differences (S) between the target value and the estimated values.

This regularizer is similar to the mean-constrained regularizer, but instead enforces similarity between tasks within the same cluster. This can capture more complex prior information. This technique has been used to predict Netflix recommendations.

Bayesian inference derives the posterior probability as a consequence of two antecedents, a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem. It can be applied iteratively so as to update the confidence in our hypothesis.

Bayesian Inference

The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the significance. http://blog.vctr.me/posts/central-limit-theorem.html

Expectation (Expected Value) of a Random Variable

Two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of the other.

Independence

Pearson

Contrary to the Spearman correlation, the Kendall correlation is not affected by how far from each other ranks are, but only by whether the ranks between observations are equal or not, and is thus only appropriate for discrete variables but not defined for continuous variables.

Basic notion of probability: # Results / # Attempts  The probability is not a number, but a distribution itself.

In probability and statistics, a random variable, random quantity, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables.

Covariance

Benchmarks linear relationship, most appropriate for measurements taken from an interval scale, is a measure of the linear dependence between two variables.

Bayesian

http://www.behind-the-enemy-lines.com/2008/01/are-you-bayesian-or-frequentist-or.html

A measure of how much two random variables change together. http://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean

Joint Entropy

Mean-constrained regularization

Conditional Entropy

Mutual Information  Clustered mean-constrained regularization

More general than above, similarity between tasks can be defined by a function. The regularizer encourages the model to learn similar functions for similar tasks.

Kullback-Leibler Divergence  Mostly Non-Parametric. Parametric makes assumptions on my data/random variables, for instance, that they are normally distributed. Non-parametric does not.

Graph-based similarity

The methods are generally intended for description rather than formal inference.  Non-negative, real-valued, symmetric (it's a type of PDF that is symmetric)  Density Estimation

Integral over function is equal to 1  Non-parametric

Methods  Kernel Density Estimation

Calculates kernel distributions for every sample point, and then adds all the

Same, for continuous variables

A unit often refers to the activation function in a layer by which the inputs are transformed via a nonlinear activation function (for example by the logistic sigmoid function). Usually, a unit has several incoming connections and several outgoing connections.

With SGD, the training proceeds in steps, and at each step we consider a mini-batch x1...m of size m. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters.

Comprised of multiple real-valued inputs. Each input must be linearly independent from each other.

Input Layer

Layers other than the input and output layers. A layer is the highest-level building block in deep learning. A layer is a container that usually receives weighted input, transforms it with a set of mostly non-linear functions and then passes these values as output to the next layer.

Hidden Layers

Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by modern computing platforms. However, if you actually try that, the weights will change far too much each iteration, which will make them “overcorrect” and the loss will actually increase/diverge. So in practice, people usually multiply each derivative by a small value called the “learning rate” before they subtract it from its corresponding weight.

Unit (Neurons)

Linear Regression

Is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.  Identity  Batch Normalization

Generalised Linear Models (GLMs)  Inverse  Link Function  Regression

Neural networks are often trained by gradient descent on the weights. This means at each iteration we use backpropagation to calculate the derivative of the loss function with respect to each weight and subtract it from that weight.

Locally Estimated Scatterplot Smoothing (LOESS)

Simplest recipe: keep it fixed and use the same for all parameters.

Ridge Regression

Learning Rate

Least Absolute Shrinkage and Selection Operator (LASSO)

Reduce by 0.5 when validation error stops improving  Reduction by O(1/t) because of theoretical convergence guarantees, with hyperparameters ε0 and τ, where t is the iteration number

Logit

Cost Function is found via Maximum Likelihood Estimation

Tricks  Better results by allowing learning rates to decrease  Options:

Logistic Regression

numbers.

Logistic Function

Better yet: No hand-set learning rates by using AdaGrad  But, this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.  The implementation for weights might simply be drawing values from a normal distribution with zero mean, and unit standard deviation. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the "best guess" in expectation.

Naive Bayes Classifier. We neglect the denominator as we calculate for every class and pick the max of the numerator.

Naive Bayes

Multinomial Naive Bayes  Bayesian Belief Network (BBN)

Thus, you still want the weights to be very close to zero, but not identically zero. In this way, you can randomize these neurons to small numbers which are very close to zero, and it is treated as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.

This ensures that all neurons in the network initially have approximately the same output distribution and empirically improves the rate of convergence. The detailed derivations can be found from page 18 to 23 of the slides. Please note that, in the derivations, it does not consider the influence of ReLU neurons.

Bayesian

All Zero Initialization

Principal Component Analysis (PCA)  Partial Least Squares Regression (PLSR)  Initialization with Small Random Numbers

Dimensionality Reduction  Neural Networks

Partial Least Squares Discriminant Analysis

Machine Learning Models

Quadratic Discriminant Analysis (QDA)  Linear Discriminant Analysis (LDA)

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in (i.e., its number of inputs).

Neural Network taking a 4-dimensional vector representation of words.

Principal Component Regression (PCR)

Weight Initialization

k-nearest Neighbour (kNN)  Learning Vector Quantization (LVQ)  Instance Based

Calibrating the Variances

Self-Organising Map (SOM)  Locally Weighted Learning (LWL)  Random Forest  Classification and Regression Tree (CART)

Is a method used in artificial neural networks to calculate the error contribution of each neuron after a batch of data. It calculates the gradient of the loss function. It is commonly used in the gradient descent optimization algorithm. It is also called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers.

Decision Tree

Gradient Boosting Machines (GBM)  Conditional Decision Trees  Gradient Boosted Regression Trees (GBRT)  complete  single  Linkage

average  centroid

Backpropagation  Hierarchical Clustering

Euclidean  In this method, we reuse partial derivatives computed for higher layers in lower layers, for efficiency.

Dissimilarity Measure  Algorithms  Manhattan  k-Means  k-Medians

Clustering

Defines the output of that node given an input or set of inputs.

How many clusters do we select?

Fuzzy C-Means  Self-Organising Maps (SOM)

ReLU

Expectation Maximization  DBSCAN  Dunn Index

Sigmoid / Logistic

Data Structure Metrics

Connectivity  Silhouette Width

Validation

Binary

Non-overlap APN  Average Distance AD  Stability Metrics

Tanh

Softplus

Softmax

Activation Functions  Types

Figure of Merit FOM  Average Distance Between Means ADM

Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space.  The distance between two points measured along axes at right angles.
