Machine Learning
Nominal - is for mutually exclusive, but not ordered, categories. Ordinal - is one where the order matters but not the difference between values. Interval - is a measurement where the difference between two values is meaningful. Ratio - has all the properties of an interval variable, and also has a clear definition of 0.0.
Identify Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.
Correlation
Data Types
Variable Identification
Mean, Median, Mode, Min, Max, Range, Quartile, IQR, Variance, Standard Deviation, Skewness, Histogram, Box Plot
Continuous Features
Frequency, Histogram
Categorical Features
Covariance
Features should be uncorrelated with each other and highly correlated to the feature we're trying to predict.
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
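A minimal PCA sketch with scikit-learn, assuming a numeric feature matrix X (the data here is hypothetical):

```python
# Minimal PCA sketch: project data onto the directions of largest variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)            # hypothetical dataset: 100 samples, 5 features
pca = PCA(n_components=2)             # keep the 2 components with the largest variance
X_reduced = pca.fit_transform(X)      # orthogonal transformation of the data

print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```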
Singular Value Decomposition (SVD)
SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
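A short SVD sketch with NumPy on a hypothetical m×n matrix:

```python
# Factorize A (m x n) into U, singular values s, and V^T, then reconstruct it.
import numpy as np

A = np.random.rand(4, 3)                        # hypothetical m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_reconstructed = U @ np.diag(s) @ Vt           # recovers A up to floating-point error
print(np.allclose(A, A_reconstructed))          # True
```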
Dimensionality Reduction
Univariate Analysis
Feature Selection
Finds out the relationship between two variables.
Data Exploration
Correlation Plot - Heatmap
Two-way table
Bi-variate Analysis
Wrapper Methods
Importance
Stacked Column Chart
This test is used to derive the statistical significance of the relationship between the variables.
Chi-Square Test
Z-Test / T-Test
Embedded Methods
ANOVA
One may choose to either omit elements from a dataset that contain missing values or to impute a value.
Wrapper methods evaluate subsets of variables which allows, unlike filter approaches, detecting possible interactions between variables. The two main disadvantages of these methods are: the increased overfitting risk when the number of observations is insufficient, and the significant computation time when the number of variables is large.
Feature Cleaning
Recursive Feature Elimination
Genetic Algorithms
Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of coefficients. Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of coefficients.
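A rough scikit-learn sketch of the L1/L2 penalties described above (X, y and alpha are hypothetical):

```python
# L1 (Lasso) vs L2 (Ridge) regularized linear regression.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

X = np.random.rand(100, 10)          # hypothetical features
y = np.random.rand(100)              # hypothetical target

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives some coefficients to exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: shrinks coefficients towards zero

print(lasso.coef_)                   # sparse coefficient vector
print(ridge.coef_)                   # dense, shrunken coefficient vector
```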
Machine Learning Data Processing
In One Hot Encoding, make sure the encodings are done in a way that all features are linearly independent.
Obvious inconsistencies
One Hot Encoding
Label Encoding
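A minimal sketch of Label Encoding vs One Hot Encoding with pandas and scikit-learn (the "color" column is a hypothetical categorical feature):

```python
# Label Encoding maps categories to integers; One Hot Encoding creates one binary column per category.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

labels = LabelEncoder().fit_transform(df["color"])      # e.g. [2, 1, 0, 1]
one_hot = pd.get_dummies(df["color"], prefix="color")   # color_blue, color_green, color_red

print(labels)
print(one_hot)
```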
Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
Hot-Deck
Selects donors from another dataset to complete missing data. Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
Cold-Deck
Rescaling
Mean-substitution
A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
Regression
Feature Normalisation or Scaling Methods
Feature Imputation
The simplest method is rescaling the range of features to scale the range in [0, 1] or [−1, 1].
Standardization
Feature standardization makes the values of each feature in the data have zero mean (when subtracting the mean in the numerator) and unit variance.
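A small scikit-learn sketch of the two scaling methods above, on a hypothetical feature matrix:

```python
# Rescaling to [0, 1] vs standardization to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
X_standardized = StandardScaler().fit_transform(X)   # (x - mean) / std, per feature

print(X_rescaled)
print(X_standardized)
```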
Scaling to unit length
TrainingDataset
A set of examples used for learning
Some Libraries...
Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.
Decompose
Test Dataset
Validation Dataset
Discretization
Categorical Features
Feature Engineering
Reframe Numerical Quantities
Creating new features as a combination of existing features. Could be multiplying numerical features, or combining categorical variables. This is a great way to add domain expertise knowledge to the dataset.
A set of examples used only to assess the performance of a fully-trained classifier.
Dataset Construction
Continuous Features
Changing from grams to kg, and losing detail, might be both wanted and efficient for calculation.
Backward Elimination
Embedded methods try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.
Feature Encoding
Outliers
The technique then finds the first missing value and uses the cell value immediately prior to the data that are missing to impute the missing value.
Values for categorical features may be combined, particularly when there are few samples for some categories.
Chi-Square
Forward Selection
Machine Learning algorithms perform Linear Algebra on Matrices, which means all features must be numeric. Encoding helps us do this.
Special values
They should be detected, but not necessarily removed. Their inclusion in the analysis is a statistical decision.
Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies).
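A minimal pandas sketch of both discretization schemes (the "ages" series is hypothetical):

```python
# Equal-width bins (cut) vs equal-frequency bins (qcut).
import pandas as pd

ages = pd.Series([22, 25, 31, 35, 41, 47, 52, 60, 63, 70])

equal_width = pd.cut(ages, bins=4)   # 4 intervals of equal length
equal_freq = pd.qcut(ages, q=4)      # 4 intervals each holding ~25% of the data

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```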
Linear Discriminant Analysis
ANOVA: Analysis of Variance
Missing values
Numeric variables are endowed with several formalized special values including ±Inf, NA and NaN. Calculations involving special values often result in special values, and need to be handled/cleaned
A person's age cannot be negative, a man cannot be pregnant and an under-aged person cannot possess a drivers license.
Correlation
Filter type methods select features based only on general metrics like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting.
Filter Methods
Scatter Plot
We can start analyzing the relationship by creating a two-way table of count and count%.
A measure of how much two random variables change together. Math: dot(de_mean(x), de_mean(y)) / (n - 1)
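A direct translation of the covariance formula above into plain NumPy (x and y are hypothetical):

```python
# Covariance as written above: dot(de_mean(x), de_mean(y)) / (n - 1).
import numpy as np

def de_mean(v):
    return v - np.mean(v)            # deviations of each value from the mean

def covariance(x, y):
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(covariance(x, y))              # matches np.cov(x, y)[0, 1]
```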
Crossing
Cross Validation
A set of examples used to tune the parameters of a classifier.
To scale the components of a feature vector such that the complete vector has length one.
To fit the parameters of the classifier: in the Multilayer Perceptron, for instance, we would use the training set to find the "optimal" weights when using backpropagation. In the Multilayer Perceptron case, we would use the validation set to find the "optimal" number of hidden units or determine a stopping point for the back-propagation algorithm. In the Multilayer Perceptron case, we would use the test set to estimate the error rate after we have chosen the final model (MLP size and actual weights). After assessing the final model on the test set, YOU MUST NOT tune the model any further.
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
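A minimal k-fold cross-validation sketch with scikit-learn (features, labels, and the choice of classifier are hypothetical):

```python
# 5-fold cross-validation: average the validation score over 5 different partitions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 4)               # hypothetical features
y = np.random.randint(0, 2, size=100)    # hypothetical binary labels

scores = cross_val_score(LogisticRegression(), X, y, cv=5)   # 5 rounds
print(scores.mean(), scores.std())        # averaged validation result and its spread
```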
Plot the variance per feature and select the features with the largest variance.
Regression
When we are interested mainly in the predicted variable as a result of the inputs, but not in the way each of the inputs affects the prediction. In a real estate example, Prediction would answer the question: is my house over- or under-valued? Non-linear models are very good at this sort of prediction, but not great for inference because the models are much less interpretable.
When we are interested in the way each one of the inputs affects the prediction. In a real estate example, Inference would answer the question: how much would my house cost if it had a view of the sea? Linear models are more suited for inference because the models themselves are easier to understand than their non-linear counterparts.
A supervised problem where the outputs are continuous rather than discrete.
Inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one or more (multi-label classification) of these classes. This is typically tackled in a supervised way.
Classification
Prediction
Types
Motivation
A set of inputs is to be divided into groups. Unlike in classification, the groups are not known beforehand, making this typically an unsupervised task.
Clustering
Inference
Finds the distribution of inputs in some space.
DensityEstimation
Simplifies inputs by mapping them into a lower-dimensional space.
DimensionalityReduction
Step 1: Making an assumption about the functional form or shape of our function (f), i.e.: f is linear, thus we will select a linear model.
Confusion Matrix
Parametric
Step 2: Selecting a procedure to fit or train our model. This means estimating the Beta parameters in the linear function. A common approach is (ordinary) least squares, amongst others.
Kind
Fraction of correct predictions; not reliable, as it is skewed when the data set is unbalanced (that is, when the number of samples in different classes varies greatly).
When we do not make assumptions about the form of our function (f). However, since these methods do not reduce the problem of estimating f to a small number of parameters, a large number of observations is required in order to obtain an accurate estimate for f. An example would be the thin-plate spline model.
Accuracy
Non-Parametric
Precision
The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
Supervised
Out of all the examples the classifier labeled as positive, what fraction were correct?
No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
f1 score
Categories
Recall
Unsupervised
Out of all the positive examples there were, what fraction did the classifier pick up?
A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). The program is provided feedback in terms of rewards and punishments as it navigates its problem space.
Reinforcement Learning
Harmonic Mean of Precision and Recall: (2 * p * r / (p + r))
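A small scikit-learn sketch of the four metrics above, on hypothetical predictions:

```python
# Accuracy, precision, recall and F1 on toy labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))   # of predicted positives, fraction correct
print(recall_score(y_true, y_pred))      # of actual positives, fraction found
print(f1_score(y_true, y_pred))          # harmonic mean: 2pr / (p + r)
```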
Decision tree learning
Association rule learning
Performance Analysis
ROC Curve - Receiver Operating Characteristics
Artificial neural networks
Deep learning
Inductive logic programming
True Positive Rate (Recall / Sensitivity) vs False Positive Rate (1 - Specificity)
Support vector machines
Clustering
Bias refers to the amount of error that is introduced by approximating a real-life problem, which may be extremely complicated, by a simple model. If Bias is high, and/or if the algorithm performs poorly even on your training data, try adding more features, or a more flexible model.
Approaches
Representation learning
Variance is the amount our model's prediction would change when using a different training data set. High: remove features, or obtain more data. 1.0 - sum_of_squared_errors / total_sum_of_squares(y)
Bayesian networks
Reinforcement learning
Bias-Variance Tradeoff
Similarity and metric learning
Sparse dictionary learning
Genetic algorithms
Goodness of Fit = R^2
Rule-based machine learning
The mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations, that is, the difference between the estimator and what is estimated.
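A plain NumPy sketch of MSE and the R² (goodness of fit) formula given above, on hypothetical values:

```python
# MSE and R^2 = 1 - sum_of_squared_errors / total_sum_of_squares(y).
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, r2)
```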
Learning classifier systems
Mean Squared Error (MSE)
Mixtures of Gaussians, Mixtures of experts, Hidden Markov Models (HMM)
Machine Learning Concepts
Popular models
The proportion of mistakes made if we apply our estimated model function to the training observations in a classification setting.
Generative Methods
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
Model class-conditional pdfs and prior probabilities. "Generative" since sampling can generate synthetic data points.
Taxonomy
Directly estimate posterior probabilities. No attempt to model underlying probability distributions. Focus computational resources on the given task for better performance.
Discriminative Methods
Cross-validation
Leave-p-out cross-validation
Leave-one-out cross-validation
k-fold cross-validation
Logistic regression, SVMs
Popular Models
Methods
Prediction Accuracy vs Model Interpretability
Repeated random sub-sampling validation
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. The first usage of these techniques was focused on neural networks. Since then, these methods have been extended to other models such as support vector machines or logistic regression.
Selection Criteria
Grid Search Numpy
Pandas
Since grid searching is an exhaustive and therefore potentially expensive method, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search.
Traditional neural networks, Nearest neighbor
Conditional Random Fields (CRF)
Holdout method
The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.
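A rough scikit-learn sketch of grid search versus randomized search, scored by cross-validation (the estimator and parameter ranges are purely illustrative):

```python
# Exhaustive grid search vs sampled random search over SVM hyperparameters.
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X = np.random.rand(100, 4)               # hypothetical features
y = np.random.randint(0, 2, size=100)    # hypothetical labels

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)                   # full sweep
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=5, cv=3).fit(X, y)   # sampled sweep

print(grid.best_params_, rand.best_params_)
```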
Gaussians, Naïve Bayes, Mixtures of multinomials Sigmoidal belief networks, Bayesian networks, Markov random fields
Error Rate
Hyperparameters
Random Search
Scikit-Learn
Tuning
There is an inherent tradeoff between Prediction Accuracy and Model Interpretability: as models get more flexible in the way the function (f) is selected, they get obscured and are hard to interpret. Flexible methods are better suited for prediction, and inflexible (more interpretable) methods are preferable for inference.
NumPy: adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Pandas: offers data structures and operations for manipulating numerical tables and time series.
Scikit-Learn: features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Gradient-based optimization
Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit, and stop the algorithm then.
Early Stopping (Regularization)
Overfitting: when a given method yields a small training MSE (or cost) but a large test MSE (or cost), we are said to be overfitting the data. This happens because our statistical learning procedure is trying too hard to find patterns in the data that might be due to random chance, rather than a property of our function. In other words, the algorithm may be learning the training data too well. If the model overfits, try removing some features, decreasing degrees of freedom, or adding more data.
Underfitting: the opposite of overfitting. Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data, i.e. when the model or algorithm does not fit the data enough. Underfitting occurs if the model or algorithm shows low variance but high bias (in contrast to overfitting, which comes from high variance and low bias). It is often a result of an excessively simple model.
Bootstrap: a test that applies Random Sampling with Replacement to the available data, and assigns measures of accuracy (bias, variance, etc.) to sample estimates.
Bagging: an approach to ensemble learning that is based on bootstrapping. Shortly, given a training set, we produce multiple different training sets (called bootstrap samples) by sampling with replacement from the original dataset. Then, for each bootstrap sample, we build a model. The result is an ensemble of models, where each model votes with equal weight. Typically, the goal of this procedure is to reduce the variance of the model.
Tensorflow
Overfitting
Underfitting
Bootstrap
Bagging
Libraries
Components
Python
MXNet
Does lazy evaluation. Need to build the graph, and then run it in a session.
Is a modern open-source deep learning framework used to train and deploy deep neural networks. The MXNet library is portable and can scale to multiple GPUs and multiple machines. MXNet is supported by major public cloud providers including AWS and Azure. Amazon has chosen MXNet as its deep learning framework of choice at AWS.
Keras
Is an open source neural network library written in Python. It is capable of running on top of MXNet, Deeplearning4j, Tensorflow, CNTK or Theano. Designed to enable fast experimentation with deep neural networks, it focuses on being minimal, modular and extensible.
Torch
Torch is an open source machine learning library, a scientific computing framework, and a script language based on the Lua programming language. It provides a wide range of algorithms for deep machine learning, and uses the scripting language LuaJIT and an underlying C implementation.
Find
Collect
Explore
Clean Features
Impute Features
Data
Engineer Features
Select Features
Classification Regression
Encode Features
Is this A or B?
How much, or how many of these?
Anomaly Detection
Clustering
Is this anomalous?
Build Datasets Question
How can these elements be grouped?
Reinforcement Learning
Model
Machine Learning is math. In specific, performing Linear Algebra on Matrices. Our data values must be numeric.
Select Algorithm based on question and data available
What should I do now?
The cost function will provide a measure of how far my algorithm and its parameters are from accurately representing my training data.
Vision API Speech API
Cost Function Sometimes referred to as Cost or Loss function when the goal is to minimise it, or Objective function when the goal is to maximise it.
Jobs API
Google Cloud Video Intelligence API
Language API
SaaS - Pre-built Machine Learning models
Translation API
Optimization Machine Learning Process
Rekognition
Lex
AWS
Tuning
Having selected a cost function, we need a method to minimise the Cost function, or maximise the Objective function. Typically this is done by Gradient Descent or Stochastic Gradient Descent. Different algorithms have different Hyperparameters, which will affect the algorithm's performance. There are multiple methods for Hyperparameter Tuning, such as Grid and Random search.
Polly
Analyse the performance of each algorithm and discuss results.
Direction
… many others
ML Engine
Amazon Machine Learning
Google Cloud AWS
Tools: Jupyter / Datalab / Zeppelin
Results and Benchmarking Data Science and Applied Machine Learning
Is the ML algorithm training and inference completing in a reasonable timeframe?
… many others Tensorflow
Scaling
MXNet
How does my algorithm scale for both training and inference? How can feature manipulation be done for training and inference in real-time?
Machine Learning Research Torch … many others
Are the results good enough for production?
Deployment and Operationalisation
How to make sure that the algorithm is retrained periodically and deployed into production? How will the ML algorithms be integrated with other systems? Can the infrastructure running the machine learning process scale?
Infrastructure
How is access to the ML algorithm provided? REST API? SDK? Is the infrastructure appropriate for the algorithm we are running? CPUs or GPUs?
Many cost functions are the result of applying Maximum Likelihood. For instance, the Least Squares cost function can be obtained via Maximum Likelihood. Cross-Entropy is another example.
The likelihood of a parameter value (or vector of parameter values), θ, given outcomes x, is equal to the probability (density) assumed for those observed outcomes given those parameter values, that is, L(θ | x) = P(x | θ).
The natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques.
Maximum Likelihood Estimation (MLE)
Basic Operations: Addition, Multiplication, Transposition
Almost all Machine Learning algorithms use Matrix algebra in one way or another. This is a broad subject, too large to be included here in its full length. Here's a start: https://en.wikipedia.org/wiki/Matrix_(mathematics)
Transformations
In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well defined in the case of the normal distribution and many other problems.
Matrices
Trace, Rank, Determinant, Inverse
http://setosa.io/ev/eigenvectors-and-eigenvalues/
In linear algebra, an eigenvector or characteristic vector of a linear transformation T from a vector space V over a field F into itself is a non-zero vector that does not change its direction when that linear transformation is applied to it.
Eigenvectors and Eigenvalues
Cross entropy can be used to define the loss function in machine learning and optimization. The true probability pi is the true label, and the given distribution qi is the predicted value of the current model.
Derivatives
Chain Rule
Leibniz Notation
Cross-Entropy
The matrix of all first-order partial derivatives of a vector-valued function. When the matrix is a square matrix, both the matrix and its determinant are referred to as the Jacobian in literature.
Jacobian Matrix
Cross-entropy error function and logistic regression
Linear Algebra
Logistic
The gradient is a multi-variable generalization of the derivative. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.
Gradient
The logistic loss function is defined as: V(f(x), y) = log(1 + exp(−y·f(x))).
Cost/Loss (Min) and Objective (Max) Functions
The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is: L(t, y) = (t − y)².
Quadratic
For Machine Learning purposes, a Tensor can be described as a multidimensional matrix. Depending on the dimensions, the Tensor can be a Scalar, a Vector, a Matrix, or a Multidimensional Matrix.
Tensors
In statistics and decision theory, a frequently used loss function is the 0-1 loss function: L(ŷ, y) = 1 if ŷ ≠ y, else 0.
0-1 Loss
When measuring the forces applied to an infinitesimal cube, one can store the force values in a multidimensional matrix.
The hinge loss is a loss function used for training classifiers. For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as: ℓ(y) = max(0, 1 − t·y).
Hinge Loss
When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
Curse of Dimensionality
Exponential
Mean
Value in the middle of an ordered list, or the average of the two in the middle.
Median
Most Frequent Value
Measures of Central Tendency
Mode
Division of probability distributions based on contiguous intervals with equal probabilities. In short: dividing the observations in a sample list into equal-sized groups.
It is used to quantify the similarity between two probability distributions. It is a type of f-divergence.
Hellinger Distance
To define the Hellinger distance in terms of measure theory, let P and Q denote two probability measures that are absolutely continuous with respect to a third probability measure λ. The square of the Hellinger distance between P and Q is defined as the quantity H²(P, Q) = ½ ∫ (√(dP/dλ) − √(dQ/dλ))² dλ.
Quantile
Range
The average of the absolute value of the deviation of each value from the mean.
Is a measure of how one probability distribution diverges from a second, expected probability distribution. Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.
Mean Absolute Deviation (MAD)
Three quartiles divide the data into approximately four equally divided parts.
The average of the squared differences from the Mean. Formally, it is the expectation of the squared deviation of a random variable from its mean, and it informally measures how far a set of (random) numbers are spread out from their mean.
Inter-quartile Range (IQR)
Kullback-Leibler Divergence
Definition
Dispersion
Variance
Types
Continuous
Continuous
Discrete
Is a measure of the difference between an original spectrum P(ω) and an approximation P̂(ω) of that spectrum. Although it is not a perceptual measure, it is intended to reflect perceptual (dis)similarity.
Itakura–Saito distance
https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications
Discrete
https://en.wikipedia.org/wiki/ Loss_functions_for_classification
The signed number of standard deviations an observation or datum is above the mean.
z-score/value/factor
Frequentist
Standard Deviation
Frequentist vs Bayesian Probability
sqrt(variance)
Random Variable
dot(de_mean(x), de_mean(y)) / (n - 1)
Statistics
Conditionality
Relationship
Benchmarks monotonic relationship (whether linear or not). Spearman's coefficient is appropriate for both continuous and discrete variables, including ordinal variables.
Spearman
Correlation
Is a statistic used to measure the ordinal association between two measured quantities.
Bayes Theorem (rule, law)
Kendall
Simple Form
With Law of Total Probability
Machine Learning Mathematics
The results are presented in a matrix format, where the cross tabulation of two fields is a cell value. The cell value represents the percentage of times that the two fields exist in the same events.
Co-occurrence
Probability
The marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables.
Concepts
Marginalisation
Is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. The null hypothesis is generally assumed to be true until evidence indicates otherwise.
Null Hypothesis
This demonstrates that specifying a direction (on a symmetric test statistic) halves the p-value (increases the significance) and can mean the difference between data being considered significant or not. Suppose a researcher flips a coin five times in a row and assumes a null hypothesis that the coin is fair. The test statistic of "total number of heads" can be one-tailed or two-tailed: a one-tailed test corresponds to seeing if the coin is biased towards heads, but a two-tailed test corresponds to seeing if the coin is biased either way. The researcher flips the coin five times and observes heads each time (HHHHH), yielding a test statistic of 5. In a one-tailed test, this is the upper extreme of all possible outcomes, and yields a p-value of (1/2)^5 = 1/32 ≈ 0.03. If the researcher assumed a significance level of 0.05, this result would be deemed significant and the hypothesis that the coin is fair would be rejected. In a two-tailed test, a test statistic of zero heads (TTTTT) is just as extreme, and thus the data of HHHHH would yield a p-value of 2 × (1/2)^5 = 1/16 ≈ 0.06, which is not significant at the 0.05 level.
Continuous
Discrete
Is a fundamental rule relating marginal probabilities to conditional probabilities. It expresses the total probability of an outcome which can be realized via several distinct events, hence the name.
Law of Total Probability
Five heads in a row Example
p-value
Chain Rule
Techniques
In this method, as part of experimental design, before performing the experiment, one first chooses a model (the null hypothesis) and a threshold value for p, called the significance level of the test, traditionally 5% or 1% and denoted as α. If the p-value is less than the chosen significance level (α), that suggests that the observed data is sufficiently inconsistent with the null hypothesis that the null hypothesis may be rejected. However, that does not prove that the tested hypothesis is true. For typical analysis, using the standard α = 0.05 cutoff, the null hypothesis is rejected when p < .05 and not rejected when p > .05. The p-value does not, in itself, support reasoning about the probabilities of hypotheses, but is only a tool for deciding whether to reject the null hypothesis.
Permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities.
p-hacking
Definition
States that a random variable defined as the average of a large number of independent and identically distributed random variables is itself approximately normally distributed.
Is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence. When continuous, it is described by the Probability Density Function.
Central Limit Theorem
Is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
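A minimal gradient descent sketch in plain Python, minimizing a simple hypothetical function f(x) = (x - 3)²:

```python
# Step against the gradient f'(x) = 2(x - 3) until we approach the minimum at x = 3.
def gradient_descent(learning_rate=0.1, steps=100):
    x = 0.0                          # arbitrary starting point
    for _ in range(steps):
        grad = 2 * (x - 3)           # gradient of f at the current point
        x -= learning_rate * grad    # step in the negative gradient direction
    return x

print(gradient_descent())            # close to 3.0
```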
Gradient Descent
Uniform
Poisson
Normal (Gaussian)
Gradient descent uses the total gradient over all examples per update; SGD updates after only 1 or a few examples:
Stochastic Gradient Descent (SGD)
Distributions
Types (Density Function)
Optimization
Gradient descent uses the total gradient over all examples per update; SGD updates after only 1 example.
Mini-batch Stochastic Gradient Descent (SGD)
Idea: add a fraction v of the previous update to the current one. When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum.
Adaptive learning rates for each parameter
Momentum
Manhattan Distance
L1 norm
L2-norm is also known as least squares. It is basically minimizing the sum of the squares of the differences (S) between the target value and the estimated values: S = Σ (yᵢ − f(xᵢ))².
Euclidean Distance
L2 norm
Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit, and stop the algorithm then.
Gamma
Cumulative Distribution Function (CDF)
Dropout
To evaluate a language model, we should measure how much surprise it gives us for real sequences in that language. For each real word encountered, the language model will give a probability p, and we use -log(p) to quantify the surprise. We then average the total surprise over a long enough sequence. So, in the case of a 1000-letter sequence with 500 A and 500 B, the surprise given by the 1/3-2/3 model will be: [-500*log(1/3) - 500*log(2/3)] / 1000 = 1/2 * log(9/2), while the correct 1/2-1/2 model will give: [-500*log(1/2) - 500*log(1/2)] / 1000 = 1/2 * log(8/2). So we can see the 1/3, 2/3 model gives more surprise, which indicates it is worse than the correct model. Only when the sequence is long enough will the average effect mimic the expectation over the 1/2-1/2 distribution. If the sequence is short, it won't give a convincing result.
Cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution p.
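A small NumPy sketch of cross entropy between a true distribution p and a predicted distribution q (the distributions here are hypothetical):

```python
# Cross entropy H(p, q) = -sum_i p_i * log(q_i); lower means q is closer to p.
import numpy as np

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0, 0.0])    # true label as a one-hot distribution
q = np.array([0.7, 0.2, 0.1])    # model's predicted probabilities
print(cross_entropy(p, q))       # equals -log(0.7) here
```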
Cross Entropy
Sparse regularizer on columns
Regularization
Information Theory
Nuclear norm regularization
This regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks. This is useful for expressing prior information that each task is expected to share similarities with every other task. An example is predicting blood iron levels measured at different times of the day, where each task represents a different person.
Entropy is a measure of unpredictability of information content.
Entropy
Early Stopping
Is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
This regularizer defines an L2 norm on each column and an L1 norm over all columns. It can be solved by proximal methods.
Binomial
Bernoulli
Adagrad
L1-norm is also known as least absolute deviations (LAD) or least absolute errors (LAE). It is basically minimizing the sum of the absolute differences (S) between the target value and the estimated values.
This regularizer is similar to the mean-constrained regularizer, but instead enforces similarity between tasks within the same cluster. This can capture more complex prior information. This technique has been used to predict Netflix recommendations.
Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem. It can be applied iteratively so as to update the confidence in our hypothesis.
Bayesian Inference
The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the significance. http://blog.vctr.me/posts/central-limit-theorem.html
Expectation (Expected Value) of a Random Variable
Two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of the other.
Independence
Pearson
Contrary to the Spearman correlation, the Kendall correlation is not affected by how far from each other ranks are, but only by whether the ranks between observations are equal or not, and is thus only appropriate for discrete variables but not defined for continuous variables.
Basic notion of probability: # Results / # Attempts.
The probability is not a number, but a distribution itself.
In probability and statistics, a random variable, random quantity, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables.
Covariance
Benchmarks linear relationship, most appropriate for measurements taken from an interval scale; is a measure of the linear dependence between two variables.
Bayesian
http://www.behind-the-enemy-lines.com/2008/01/are-you-bayesian-or-frequentist-or.html
A measure of how much two random variables change together. http://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean
Joint Entropy
Mean-constrained regularization
Conditional Entropy
Mutual Information
Clustered mean-constrained regularization
More general than above, similarity between tasks can be defined by a function. The regularizer encourages the model to learn similar functions for similar tasks.
Kullback-Leibler Divergence
Mostly Non-Parametric. Parametric makes assumptions on my data/random variables, for instance, that they are normally distributed. Non-parametric does not.
Graph-based similarity
The methods are generally intended for description rather than formal inference.
Kernel properties: non-negative, real-valued, symmetric, and its integral over the function is equal to 1 (i.e. it is a type of PDF).
Density Estimation
non-parametric
Methods
Kernel Density Estimation
Calculates kernel distributions for every sample point, and then adds all the distributions together.
Same, for continuous variables.
A unit often refers to the activation function in a layer by which the inputs are transformed via a nonlinear activation function (for example by the logistic sigmoid function). Usually, a unit has several incoming connections and several outgoing connections.
With SGD, the training proceeds in steps, and at each step we consider a mini-batch x1...m of size m. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters.
Comprised of multiple real-valued inputs. Each input must be linearly independent from the others.
Input Layer
Layers other than the input and output layers. A layer is the highest-level building block in deep learning. A layer is a container that usually receives weighted input, transforms it with a set of mostly non-linear functions and then passes these values as output to the next layer.
Hidden Layers
Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by modern computing platforms. However, if you actually try that, the weights will change far too much each iteration, which will make them "overcorrect" and the loss will actually increase/diverge. So in practice, people usually multiply each derivative by a small value called the "learning rate" before they subtract it from its corresponding weight.
Unit (Neurons)
Linear Regression
Is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
Identity
Batch Normalization
Generalised Linear Models (GLMs)
Inverse
Link Function
Regression
Neural networks are often trained by gradient descent on the weights. This means at each iteration we use backpropagation to calculate the derivative of the loss function with respect to each weight and subtract it from that weight.
Locally Estimated Scatterplot Smoothing (LOESS)
Simplest recipe: keep it fixed and use the same for all parameters.
Ridge Regression
Learning Rate
Least Absolute Shrinkage and Selection Operator (LASSO)
Reduce by 0.5 when validation error stops improving. Reduction by O(1/t) because of theoretical convergence guarantees, with hyperparameters ε₀ and τ, where t is the iteration number.
Logit
Cost Function is found via Maximum Likelihood Estimation
Tricks
Better results by allowing learning rates to decrease. Options:
Logistic Regression
Logistic Function
Better yet: no hand-set learning rates, by using AdaGrad.
But this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same. The implementation for weights might simply draw values from a normal distribution with zero mean and unit standard deviation. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.
In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the "best guess" in expectation.
Naive Bayes Classifier. We neglect the denominator as we calculate for every class and pick the max of the numerator.
Naive Bayes
Multinomial Naive Bayes
Bayesian Belief Network (BBN)
Thus, you still want the weights to be very close to zero, but not identically zero. In this way, you can randomize these neurons to small numbers which are very close to zero, and it is treated as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.
This ensures that all neurons in the network initially have approximately the same output distribution and empirically improves the rate of convergence. The detailed derivations can be found from pages 18 to 23 of the slides. Please note that, in the derivations, it does not consider the influence of ReLU neurons.
Bayesian
All Zero Initialization
Principal Component Analysis (PCA)
Partial Least Squares Regression (PLSR)
Initialization with Small Random Numbers
Dimensionality Reduction
Neural Networks
Partial Least Squares Discriminant Analysis
Machine Learning Models
Quadratic Discriminant Analysis (QDA)
Linear Discriminant Analysis (LDA)
One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in (i.e., its number of inputs).
Neural Network taking a 4-dimensional vector representation of words.
Principal Component Regression (PCR)
Weight Initialization
k-nearest Neighbour (kNN)
Learning Vector Quantization (LVQ)
Instance Based
Calibrating the Variances
Self-Organising Map (SOM)
Locally Weighted Learning (LWL)
Random Forest
Classification and Regression Tree (CART)
Is a method used in artificial neural networks to calculate the error contribution of each neuron after a batch of data. It calculates the gradient of the loss function. It is commonly used in the gradient descent optimization algorithm. It is also called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers.
Decision Tree
Gradient Boosting Machines (GBM)
Conditional Decision Trees
Gradient Boosted Regression Trees (GBRT)
complete, single Linkage
average, centroid
Backpropagation
Hierarchical Clustering
Euclidean
In this method, we reuse partial derivatives computed for higher layers in lower layers, for efficiency.
Dissimilarity Measure
Algorithms
Manhattan
k-Means
k-Medians
Clustering
Defines the output of that node given an input or set of inputs.
How many clusters do we select?
Fuzzy C-Means
Self-Organising Maps (SOM)
ReLU
Expectation Maximization
DBSCAN
Dunn Index
Sigmoid / Logistic
Data Structure Metrics
Connectivity
Silhouette Width
Validation
Binary
Non-overlap (APN)
Average Distance (AD)
Stability Metrics
Tanh
Softplus
Softmax
Activation Functions
Types
Figure of Merit (FOM)
Average Distance Between Means (ADM)
Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space. The distance between two points measured along axes at right angles.