PROVOST (2013) Data Science for Business

December 12, 2016 | Author: Ariadna Setentaytres | Category: N/A

Summary of 2013; Provost; Data Science for Business Created by Ariadna73 using a beta version of EASY+

(Please click, and share the video. Thank you, Ari!)

Preface
- Give this book to your collaborators who want to design top-notch solutions
- This is a good textbook with minimal math (www.data-science-for-biz.com)
- Helps understand the alignment between business-technical development and data science teams
- General data strategies and proposals; how data science fits into an organization; ways to create teams
- Ways of thinking about how data science can lead to competitive advantage
- Tactical concepts for projects; how to extract knowledge from data
- Ways of thinking about data analytically; the data mining process; high-level data mining tasks

1. Introduction: Data-Analytic Thinking
Primary goals of this book:
- Help view business problems from a data perspective: structure and principles; intuition, creativity and common sense
- Understand principles of extracting useful knowledge from data
Data science aims to improve decision-making. Statistically, the more data-driven a firm is, the more productive it is.
Two types of decisions we will study in this book:
- Decisions for which "discoveries" need to be made within data (for example, the hurricane problem: what do people buy in a hurricane?)
- Decisions that repeat, so decision-making can benefit from even the smallest improvement in the accuracy of the data analysis (for example, the churn problem; advertising; credit fraud discovery)

Chemistry Is Not About Test Tubes: Data Science vs. the Work of the Data Scientist
- This book focuses on the science and not on the technology, so it will still be current in 20 years
Data Mining and Data Science, Revisited
- High-level definition of data science: a set of fundamental principles that guide the extraction of knowledge from data
- Data mining is the extraction of knowledge from data using technologies that incorporate the principles of data science
- A big part of this book deals with the extraction of patterns and models from large bodies of data
- Fundamental concept: extracting useful knowledge from data to solve business problems can be treated systematically by following a well-defined process
Data Processing and "Big Data"
- Many data-processing technologies are not Big Data science, but the technologies are important and also linked to productivity
From Big Data 1.0 to Big Data 2.0
- We are now in stage 1.0: firms are adopting Big Data and using it to try to improve their current operations
- Stage 2.0 will be "what can I now do that I couldn't do before, or do better than I could do before?"



Uses of data mining
Formulating data mining solutions and evaluating results involves careful consideration of the context:
- From a large mass of data, it can be used to find attributes of interest
- If you look hard enough you will find something, but it might not generalize
- Customer relationship management, customer behavior analysis, marketing
Examples
- Example: Hurricane Frances. Wal-Mart wanted to predict consumers' behavior during the hurricane. It can be intuitive to predict that people will buy more water, but not so intuitive to predict the amount of the increase in sales. More useful would be to find unusual local demand for products.

- Example: Predicting Customer Churn. Churn: when customers move from one provider to another. Imagine that you have to devise a plan for how the data science team should use the data to determine which customers to target with anti-churn offers. Customer retention is a major use of data mining.

Data and Data Science Capability as a Strategic Asset
- Data and decision-support talent are complementary
- Managers with data-analytic skills are really in demand
- Companies need multiple data-analytic managers because they need to make decisions in multiple areas of the business
- Example: Signet Bank bought the data it needed. It eventually became Capital One.
Summary: This book is about the extraction of useful information and knowledge from large volumes of data in order to improve business decision-making. To succeed, you must be able to think about how the fundamental concepts apply to particular business problems: to think data-analytically. This book is also complementary to other important technologies such as statistical hypothesis testing and database querying.

2. Business Problems and Data Science Solutions
A set of canonical data mining tasks
- The data scientist decomposes a business problem into subtasks: THIS IS A CRITICAL SKILL
- Avoids reinventing the wheel; helps focus talent; then the results can be reassembled
Important principle: Data Mining is a Process. It has two types of stages.
Stages that apply information technology: automated discovery of patterns. There are many algorithms, but only a handful of tasks:
- Classification: attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Classes are mostly mutually exclusive. Related task: "scoring" or "class probability estimation". Informally, this would predict "whether" something will happen.

- Regression ("value estimation"): attempts to predict, for each individual, the numerical value of some variable for that individual. Informally, regression will predict "how much" something will happen.

- Similarity matching: attempt to identify similar individuals, to better identify business opportunities.

- Clustering: group individuals in a population together by their similarity, but not driven by any specific purpose. Useful because the groups may suggest other data mining tasks or approaches, and generate input to decision-making processes.


- Co-occurrence grouping Aka Frequent itemset mining, association rule discovery, and market-basket analysis Attempts to find associations between entities based on their transactions

- Profiling Aka Behavior description Attempts to characterize the typical behavior of an individual, group, or population Can be described generally or down to the level of small groups or even individuals

- Link prediction Predict connections between data items For example, if you share 6 friends with X, then you and X should be friends

- Data reduction: take a large set of data and replace it with a smaller set that preserves the more important information. Involves loss of information. Example: instead of having a list of movies per individual, have a list of genres.

- Causal modeling Attempts to help understand what events and actions influence others

Stages that require knowledge, creativity and common sense (iteration is the rule, rather than the exception):
- Business Understanding: the design team should think carefully about the use scenario. What exactly do we want to do? How exactly would we do it? What parts of this use scenario constitute possible data mining models?
- Data Understanding: a critical part is estimating the costs and benefits of each data source and deciding whether further investment is merited. For example: the Medicare billing data have no reliable target variable indicating fraud; this problem usually requires unsupervised approaches.
- Data Preparation: aside from the cleaning and converting, it is very important to beware of "leaks": situations where a variable collected in historical data gives information on the target. For example: collecting whether a claim was legitimate in order to try to "predict" whether there was fraud. If we already know the answer to that variable, then there is nothing to predict!
- Modeling: the output is some sort of pattern. This is the part where more science and technology can be used.
- Evaluation: looking carefully to validate the patterns; ensure that the model satisfies the original business goals; quantitative and qualitative assessments; think about the comprehensibility of the model to stakeholders.

Deployment
- Put the model to real use in order to realize some ROI
- Reasons for deploying the data mining system itself, rather than the models produced by the system: the world changes faster than the data science team can adapt; a business has too many modeling tasks for its data science team to manually curate each model individually
- Could be as mundane as posting an instruction sheet

- This often returns to the business understanding phase


Data Mining and Its Results
Important distinction between the activity of data mining and the use given to the results:
- Mining data to find patterns and build models
- Using the results of data mining: should influence the data mining, but it is a distinct process
Implications for Managing the Data Science Team
- Data mining projects should be viewed as exploratory projects that iterate on approach and strategy rather than on software designs
- Software skills vs. analytical skills: evaluation metrics for software skills are the number of tickets closed or the amount of code written. For analytics it is more important to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze the results.

Supervised versus unsupervised data mining
Supervised: when we divide individuals into subgroups with a target. For example, who will leave, who will stay. Supervised tasks' results are much more useful. The value for the target variable is often called the individual's label.
- A must: there MUST be data on the target (like a benchmark)
- Acquiring data on the target is a key data science investment
Classification, regression, and causal modeling are solved with supervised methods.
- Classification and regression are distinguished by the type of target: regression involves a numeric target; classification involves a categorical (often binary) target

- For business applications we often want a numerical prediction over a categorical target
Unsupervised: when the subgroups do not have a target. Clustering, co-occurrence grouping, and profiling.
A vital part in the early stages of the data mining process is to decide whether the line of attack will be supervised or unsupervised, and if supervised, to produce a precise definition of the target variable.
Other Analytics Techniques and Technologies
- Statistics: two different uses in business analytics. As a term for the computation of values (summary statistics), which should pay attention to the business problem and to the distribution of the data being summarized. And as a term for what it really is: the field of statistics, the knowledge that underlies analytics, including the quantification of uncertainty into confidence intervals. Correlation is an important statistical term often used in data mining.
- Database querying: appropriate when analysts already have an idea of what subpopulation might be interesting. OLAP tools can be a useful complement to data mining tools for discovery from business data.
- Data warehousing: can be seen as a facilitating technology of data mining. Not always necessary for all projects.
- Regression analysis: extracting patterns that will generalize to other data. Big debate over explanatory vs. predictive modeling: not all explanatory modeling applies to prediction.
- Machine learning and data mining: a collection of methods for extracting predictive models from data. Data mining was born here.
Answering Business Questions with These Techniques
- Who are the most profitable customers? (a classification question)
- Is there really a difference between a profitable and a regular customer?
- How to identify these customers?
- Will a particular new customer be profitable? (a prediction question)


3. Introduction to Predictive Modeling
- Model = a simplified representation of reality
- Model induction: the creation of models from data. They are created with algorithms that build models for classification and regression. The input data for the algorithm is called the training data.
- This book studies mainly classification models because they are more relevant to business problems
- Predictive model: a formula for estimating the unknown value of interest: the target
- Supervised learning model creation: the model describes a relationship between a set of selected variables (attributes) and a predefined variable (target)
- Instance: a fact or data point, or a row in a table (a set of attributes)
- Think of predictive modeling as SUPERVISED SEGMENTATION: divide the entities into groups with respect to a target
- Descriptive model: used to gain insight into the underlying phenomenon or process. The distinction sometimes is not obvious.
Supervised Segmentation
Fundamental concept: how can we select one or more attributes that will best divide the sample with respect to our target?
One of the fundamentals of data mining: finding and selecting important, informative variables or attributes of the entities described by the data.
- Information is a quantity that reduces uncertainty about something
- Selection of the single most informative attribute has its complications: attributes rarely split a group perfectly
- Which is better: a few pure subsets, or lower overall impurity?
- How to compare attributes that are not binary? How should we think about creating supervised segmentations using numeric attributes?

We want our segments to be as pure as possible: all the members have the same target value.
- In the real world there is no absolute purity, but if we can reduce impurity, we can use the attribute in a predictive model
- A natural measure of impurity for numeric values is variance
- Entropy = disorder: how mixed (impure) the segment is with respect to the properties of interest

- A formula to evaluate how well each attribute splits the sample into segments, based on a "purity measure" The most common splitting criterion: information gain, based on entropy To measure how much an attribute improves entropy Defined as the change in entropy due to any amount of new information

Example: Attribute Selection with Information Gain
- Data set: descriptions of 23 species of mushrooms, with attributes such as color, shape, odor, and a lot more...

- Target variable is "edible?", with values: poisonous, edible, unknown edibility, not recommended

- Which single attribute is the most useful? Select an attribute, calculate the segments, and then calculate the weighted entropy. The most useful attribute is the one that reduces the total entropy the most.
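The entropy and information-gain calculation described above can be sketched in a few lines; the "edible?" labels and the candidate split below are hypothetical, not the book's actual mushroom data.

```python
# A minimal sketch of entropy and information gain, the splitting criterion
# described above. Data are hypothetical.
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """Reduction in entropy from splitting the parent into the partitions."""
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - weighted

# Hypothetical "edible?" labels split by a candidate attribute:
parent = ["edible"] * 5 + ["poisonous"] * 5
split = [["edible"] * 4 + ["poisonous"], ["poisonous"] * 4 + ["edible"]]
print(round(information_gain(parent, split), 3))
```

Comparing this gain across candidate attributes is exactly the "select the attribute that reduces total entropy the most" step.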

Visualizing Segmentations
- It is important to see how a classification partitions the instance space
- Common visualization form: the scatter plot
- Hyperplane: a general separating surface; a generalization of a line or a plane
- Predictive modeling technique: tree induction


- Attribute selection alone does not seem to be sufficient
- Create a tree where each path leads to a subset of the instances, and each final leaf has a target value. This is a decision tree.
- The goal of the tree is to partition the instances into subgroups with similar target values
- Apply the method recursively until reaching a desired level of purity or running out of variables to partition on
Trees as Sets of Rules: interpretation of trees as logical statements; IF THEN-ELSE statements to represent the tree
Probability Estimation Trees
- Contain nodes with probabilities rather than values; more informative than just a classification
- When you find leaves with few members, you apply the Laplace correction to minimize their influence on the overall calculation
Example: Addressing the Churn Problem with Tree Induction
- Before starting to build a classification tree, ask how good each of the variables is individually
- Put the highest information-gain feature at the root of the tree; the other features are evaluated not on the entire set of instances
- Stop subdividing the tree when detecting overfitting (discussed in chapter 5)
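The Laplace correction mentioned above is a one-line smoothing of the class-probability estimate at a leaf; this small sketch shows the idea for a binary target.

```python
# A sketch of the Laplace correction: smoothing the class probability at a
# tree leaf so that small leaves do not produce extreme estimates.
def leaf_probability(n_positive, n_total, n_classes=2, laplace=True):
    """Estimated probability of the positive class at a tree leaf."""
    if laplace:
        return (n_positive + 1) / (n_total + n_classes)
    return n_positive / n_total

# A leaf with 2 positives out of 2 instances: the raw estimate is an
# overconfident 1.0; the corrected estimate is pulled toward 0.5.
print(leaf_probability(2, 2, laplace=False))  # 1.0
print(leaf_probability(2, 2))                 # 0.75
```

As the leaf grows larger, the corrected estimate converges to the raw frequency, so the correction mainly tempers small leaves.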

4. Fitting a Model to Data
Crux of this chapter: what exactly do we mean when we say a model fits the data well?
Parameter learning (or parametric modeling): an alternative method for learning a predictive model from a data set
- Finding "optimal" model parameters based on data
- First specify the structure of the model and leave certain numeric parameters unspecified
- Then the data mining calculates the best parameter values based on the data
Optimizing an Objective Function (heavy math)
- Choosing the goal for data mining: we need to ask what should be our goal or objective in choosing the parameters
- Use a support vector machine: finding the best line to separate the classes
Linear Discriminant Functions
- It is helpful to represent the model mathematically: the function of the decision boundary is a linear combination (weighted sum) of the attributes
- Discriminates between the classes (complicated example with flowers)
- Used to decide how likely an instance is to be part of a class
Support Vector Machines, Briefly
- SVMs are linear discriminants: they classify instances based on a linear function of the features
- The SVM looks for the widest bar in the data and draws a line in the middle, creating the margin
Regression via Mathematical Functions
Class Probability Estimation and Logistic "Regression"
- Logistic Regression: Some Technical Details
- Logistic regression is one common procedure to give accurate estimates of class probability
Example: Logistic Regression versus Tree Induction
Nonlinear Functions, Support Vector Machines, and Neural Networks
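The linear discriminant and logistic-regression ideas above can be sketched directly: a weighted sum of the attributes gives a score, and the logistic function turns that score into a class probability. The weights below are hypothetical, not fitted to any data.

```python
# A minimal sketch of a linear discriminant and logistic class-probability
# estimation. The weights and bias are hypothetical "learned" parameters.
import math

def linear_score(weights, bias, x):
    """f(x) = w . x + b, the linear combination (weighted sum) of attributes."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def class_probability(weights, bias, x):
    """The logistic function squashes the score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_score(weights, bias, x)))

w, b = [2.0, -1.0], 0.5   # hypothetical parameters
print(round(class_probability(w, b, [1.0, 1.0]), 3))
```

Classifying by the sign of `linear_score` is the linear discriminant; reading off `class_probability` is the logistic-regression view of the same model.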


5. Overfitting and Its Avoidance
Fundamental concepts
Generalization
- We are interested in finding patterns in the data, so we can make generalizations
- But sometimes we generalize based on particular cases ("all women are the same"); that is called "overfitting" the data
- Generalization is the property that a model has of applying to new data, not just the data used to build the model
Overfitting
- Overfitting is creating a model out of the cases in the training data in such a way that it only works for that training set and is not useful for anything else
- Quote attributed to Ronald Coase (Nobel laureate): "If you torture the data long enough, it will confess"
- All data mining procedures overfit to some extent
Examples: Overfitting in Tree Induction

- A procedure that grows trees until the leaves are pure tends to overfit
- You need to use trial and error to find a "sweet spot" where the model fits the training data well enough but is loose enough to generalize. A procedure for finding it automatically has not been invented yet.
Overfitting in Mathematical Functions; Overfitting Linear Functions

Why Is Overfitting Bad?
- Overfitting hinders us from improving a model after a certain complexity
- It causes models to get worse
- The best strategy is to recognize overfitting and to manage complexity in a principled way
- When creating a model, do not use all the training data: "hold out" some of the data and then use it to help evaluate the model
From Holdout Evaluation to Cross-Validation
- Cross-validation is a more sophisticated holdout testing and training procedure that makes better use of a limited data set
- Divide the data into different samples (folds), then build and validate the model with each sample. This gives accuracy results that can be averaged.
The Churn Dataset Revisited
- Example of cross-validation
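The fold-and-average procedure above can be sketched as follows. To keep the example self-contained, the "model" is just a majority-class predictor learned from the training folds; the labels are hypothetical churn data.

```python
# A sketch of k-fold cross-validation: split the data into folds, hold each
# fold out in turn, train on the rest, and average the accuracies.
from collections import Counter

def k_fold_accuracy(labels, k=5):
    folds = [labels[i::k] for i in range(k)]   # simple interleaved folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [y for j, f in enumerate(folds) if j != i for y in f]
        majority = Counter(train).most_common(1)[0][0]   # the "model"
        scores.append(sum(y == majority for y in test) / len(test))
    return sum(scores) / k

labels = ["stay"] * 8 + ["churn"] * 2   # hypothetical churn labels
print(k_fold_accuracy(labels, k=5))
```

Replacing the majority-class step with any real learner gives the standard procedure; the averaging over folds is what makes better use of a limited data set.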

- Learning Curves: a plot of the generalization performance against the amount of training data
Overfitting Avoidance and Complexity Control
- Avoiding Overfitting with Tree Induction
- A General Method for Avoiding Overfitting
- Avoiding Overfitting for Parameter Optimization
Exemplary techniques
- Attribute selection
- Tree pruning: stop growing the tree before it gets too complex. Specify a minimum number of instances that must go in each leaf; there is no ideal number. Or grow the tree until it is too large, and then prune it.
- Regularization (extremely complex math examples)


6. Similarity, Neighbors, and Clusters
Fundamental concepts
Calculating similarity of objects described by data
- Data analysts are always looking for similarities
Using similarity for prediction
- To retrieve similar things (for example a list of prospects)
- To perform classification and regression
- To group members into clusters (in an unsupervised segmentation), for example to make movie recommendations
- Medicine and law: in medicine, diagnosis based on similar symptoms; in law, citing precedents

Clustering as similarity-based segmentation
Similarity and distance (formal definitions)
Searching for similar entities
- If you can represent an object as data, then you can more precisely compare objects
- Many methods in data science may be seen as methods for organizing the space of data instances so that instances near each other (for example on nearby branches of a tree) are seen as similar
- One good way of calculating distance is using simple geometry
Exemplary techniques
Nearest-Neighbor Reasoning
Nearest Neighbors for Predictive Modeling
- Point to a member who has an interesting target, and analyze the members that are closest to that one
- Classification is another strategy for using distance to find similar objects
- Probability estimation: used to make a more precise prediction
- Regression: to try to estimate a numeric variable
- Remember not to use the target variable to try to predict it
How Many Neighbors and How Much Influence?
- No simple answers, but odd numbers are convenient
- Nearest-neighbor algorithms are referred to as k-NN, where k is the number of neighbors considered similar
- If k = N then the entire dataset will be used for every prediction. For classification this would predict the majority class in the entire data set; for regression, the average of all the target values; for class probability estimation, the "base rate" probability.
- We can think of the procedure as "weighted scoring"
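The "simple geometry" distance and the k-NN voting described above fit in a short sketch; the training points and the churn/stay labels are hypothetical.

```python
# A minimal k-NN classification sketch: Euclidean distance plus a majority
# vote among the k nearest training instances. Data are hypothetical.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """train is a list of (attributes, label); majority vote of the k nearest."""
    neighbors = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [([1, 1], "churn"), ([1, 2], "churn"), ([2, 1], "churn"),
         ([8, 9], "stay"), ([9, 8], "stay")]
print(knn_classify(train, [1.5, 1.5], k=3))
```

Using an odd k, as the notes suggest, avoids ties in the binary vote; weighting each vote by inverse distance gives the "weighted scoring" variant.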

Geometric Interpretation, Overfitting, and Complexity Control
- In terms of overfitting, k in a k-NN model is a complexity parameter
- So, how do we choose k? That is a good question! Usually with algorithms.

Issues with Nearest-Neighbor Methods
- Intelligibility: how do you justify a specific decision?

"We recommended this movie to you because you liked a similar one" The intelligibility of an entire model

Sometimes the law prohibits some biases for classification
- Dimensionality and domain knowledge: some models might suffer from containing too many attributes (dimensions). Assign higher or lower weights to the attributes depending on the analyst's knowledge of the domain.

- Computational efficiency: applying these strategies can be very expensive in terms of data capture, storage and calculation. In some cases (such as applications requiring instant answers) the NN methods are inefficient.


Some Important Technical Details Relating to Similarities and Neighbors (complicated)
- Heterogeneous Attributes
- Other Distance Functions
- Combining Functions: Calculating Scores from Neighbors
Clustering
- Used for unsupervised segmentation: to find clusters of objects similar within the cluster, but very different from objects in other clusters
Hierarchical Clustering
- Start with each node as its own cluster, then merge clusters until only one is remaining
- Application: the "tree of life"
Nearest Neighbors Revisited: Clustering Around Centroids
- Focus on the clusters and represent each one by its center
- Most popular: k-means clustering
Example: Clustering Business News Stories
- Used clustering to analyze 14 months of stories and look for mentions of Apple
- First the data are prepared by analyzing the texts
- Then the found stories are grouped into 9 clusters
- And then the clusters are used to make predictions
Stepping Back: Solving a Business Problem Versus Data Exploration
- We should spend as much time as necessary understanding the business
- The CRISP data mining process
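The "clustering around centroids" loop above is short enough to sketch directly: assign each point to its nearest centroid, recompute each centroid as its cluster's mean, repeat. The points and starting centroids are hypothetical 1-D values for brevity.

```python
# A small k-means sketch in one dimension: alternate between assigning points
# to the nearest centroid and moving each centroid to its cluster's mean.
def k_means_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # keep an empty cluster's centroid where it was
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]   # two hypothetical groups
print([round(c, 3) for c in k_means_1d(points, centroids=[0.0, 5.0])])
```

With multi-dimensional points the same loop works once `abs(p - c)` is replaced by a Euclidean distance and the mean is taken coordinate-wise.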

Exemplary techniques: clustering methods; distance metrics for calculating similarity (similarity and distance)


7. Decision Analytic Thinking I: What Is a Good Model?
Fundamental concepts
Careful consideration of what is desired from data science results
- It is important for the data scientists and other stakeholders to consider carefully what they would like to achieve by mining data
- It is crucial to think carefully about what we'd really like to measure
Frameworks and metrics for tasks of classification
- Evaluating Classifiers: so, how do we evaluate whether a classification model is performing well?

We already know to hold out part of the data, but how should we measure generalization performance? Classification accuracy is a popular metric because it is very easy to measure.

- Accuracy is defined as the rate: # of correct decisions / total # of decisions
- The Confusion Matrix: rows = predicted class, columns = actual class
- This definition has several problems

Problems with Unbalanced Classes
- This problem occurs when one class is rare; accuracy is the wrong thing to measure
Problems with Unequal Costs and Benefits
- Accuracy makes no distinction between false positive and false negative errors
- Example: consider the cost of wrongly diagnosing an illness vs. the cost of wrongly failing to detect the illness
Generalizing Beyond Classification
- We must keep asking: what is important in the application? What is the goal? Is it meaningful? Is there a better metric?
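The unbalanced-class problem above is easy to demonstrate with a toy confusion matrix: a classifier that never predicts the rare class can still score high accuracy. The labels below are hypothetical.

```python
# A sketch of confusion-matrix accounting, showing why accuracy misleads
# when classes are unbalanced. Labels are hypothetical.
def confusion_matrix(actual, predicted, positive="p"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual = ["p"] * 1 + ["n"] * 9       # the positive class is rare
predicted = ["n"] * 10               # a classifier that ignores positives
tp, fp, fn, tn = confusion_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)
print(accuracy)   # high accuracy, yet every positive was missed
```

This is why the notes insist on asking what is important in the application rather than defaulting to accuracy.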

A Key Analytical Framework: Expected Value
The expected value computation decomposes data-analytic thinking into three aspects:
- The structure of the problem
- The elements of the analysis that can be extracted from the data
- The elements of the analysis that need to be acquired from other sources
Procedure
- The possible outcomes of a situation are enumerated
- Expected value = weighted average of the values of the different possible outcomes, where each weight is the outcome's probability of occurrence
Examples
- Using Expected Value to Frame Classifier Use: in use, we want to classify so we can act on the different classes. Sometimes the model will give such extremely low probabilities that everyone would fall into the negative category; here is where the equations of expected value come in handy.

- Using Expected Value to Frame Classifier Evaluation: shift focus from individual decisions to collections of decisions. Uses the confusion matrix to calculate the costs.
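The "weighted average of outcome values" procedure above can be sketched with a confusion matrix and a cost/benefit matrix; the counts and dollar values below are hypothetical.

```python
# A sketch of the expected-value framework: weight each confusion-matrix
# outcome's benefit or cost by its estimated probability of occurrence.
def expected_profit(counts, values):
    """counts and values are dicts keyed by outcome: tp, fp, fn, tn."""
    n = sum(counts.values())
    return sum((counts[o] / n) * values[o] for o in counts)

counts = {"tp": 50, "fp": 100, "fn": 10, "tn": 840}      # hypothetical holdout results
values = {"tp": 99.0, "fp": -1.0, "fn": 0.0, "tn": 0.0}  # hypothetical profit per decision
print(round(expected_profit(counts, values), 2))
```

The counts come from the data (the confusion matrix); the values typically must be acquired from other sources, which is exactly the decomposition the framework describes.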

Evaluation, Baseline Performance, and Implications for Investments in Data
- It is important to consider carefully what would be a reasonable baseline against which to compare a model's performance
- One good one is the majority classifier, taken from the training data
- Consideration of appropriate comparative baselines
Exemplary techniques
- Various evaluation metrics
- Estimating costs and benefits
- Calculating expected profit
- Creating baseline methods for comparison


8. Visualizing Model Performance
Fundamental concepts
- It is better to present visualizations rather than just calculations
- The ranking methodology is intended to support decisions not on individual cases, but on the multiple cases that rank at the top
Exemplary techniques
Profit curves
- With a ranking classifier we can produce a list of instances and their predicted scores
- Appropriate when you know the conditions under which a classifier will be used: the class priors or base rate (proportion of positives) are known, and the costs and benefits are known
ROC Graphs and Curves
- This is a method that can accommodate uncertainty by showing the entire space of performance possibilities
- Discrete classifier: outputs a class label (as opposed to a ranking)
Cumulative Response and Lift Curves
- This is an alternative visualization that is more intuitive than ROC
- Plots the "hit rate" (y axis), i.e. the percentage of positives correctly classified, as a function of the percentage of the population that is targeted (x axis)
- Sometimes it is called a lift curve
Example: Performance Analytics for Churn Modeling (very complicated)
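One point on the lift curve described above can be computed directly: rank instances by score, target the top fraction, and compare the hit rate there with the overall base rate. The scores and labels below are hypothetical.

```python
# A sketch of a lift calculation for a ranking classifier: the hit rate in
# the top-ranked fraction divided by the base rate. Data are hypothetical.
def lift_at(scored, fraction, positive=1):
    """scored is a list of (score, label); returns lift in the top fraction."""
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top = ranked[:cutoff]
    hit_rate = sum(label == positive for _, label in top) / cutoff
    base_rate = sum(label == positive for _, label in ranked) / len(ranked)
    return hit_rate / base_rate

scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0),
          (0.3, 0), (0.2, 0), (0.2, 1), (0.1, 0), (0.05, 0)]
print(lift_at(scored, fraction=0.2))
```

Sweeping `fraction` from 0 to 1 and plotting the results yields the full lift curve; plotting the hit rate itself gives the cumulative response curve.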

9. Evidence and Probabilities
Explicit evidence combination with Bayes' rule
- We can think about the things we know as the evidence for or against different values of the target
Combining Evidence Probabilistically
- Joint Probability and Independence
Applying Bayes' Rule to Data Science
- Conditional Independence and Naive Bayes
- Advantages and Disadvantages of Naive Bayes
Probabilistic reasoning via assumptions of conditional independence
- Conditional independence is when knowing about one event doesn't affect our knowledge of the other
- Naive Bayes classification: very efficient in terms of computation power
- The question is: how do different targets generate feature values? Apply Bayes' rule to answer the question "Which class most likely generated this example?"
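The "which class most likely generated this example?" question above can be sketched with Bayes' rule under the naive independence assumption; the spam/ham priors and word likelihoods are hypothetical.

```python
# A tiny Naive Bayes sketch: assume features are conditionally independent
# given the class, multiply per-feature likelihoods with the class prior,
# and normalize. All probabilities below are hypothetical.
def naive_bayes_posterior(priors, likelihoods, features):
    """P(class | features) via Bayes' rule with the independence assumption."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods[c][f]
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": {"free": 0.5, "meeting": 0.1},
               "ham": {"free": 0.05, "meeting": 0.4}}
posterior = naive_bayes_posterior(priors, likelihoods, ["free"])
print(round(posterior["spam"], 3))
```

The per-class products are the evidence combination; the normalization at the end is what turns them into class probabilities.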

10. Representing and Mining Text
The importance of constructing mining-friendly data representations: data preparation
- Why Text Is Important: it is everywhere
- Why Text Is Difficult: it is unstructured
Representation
- Bag of Words: ignores grammar, spelling, punctuation
- Term Frequency
- Measuring Sparseness: Inverse Document Frequency
- Combining Them: TFIDF
Beyond Bag of Words
- N-gram Sequences: transforming the text phrases into sequences (joined by underscores). Disadvantage: increases the size of the feature set.
- Named Entity Extraction
- Topic Models: have the words point to a topic
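The TF and IDF combination above fits in a short sketch over a toy bag-of-words corpus. TFIDF definitions vary across texts; this one uses raw term frequency and the natural log of N over the number of documents containing the term.

```python
# A sketch of TFIDF over a toy bag-of-words corpus (each document is a list
# of tokens). The corpus is hypothetical.
import math

def tfidf(term, doc, corpus):
    tf = doc.count(term)                          # term frequency in this doc
    n_containing = sum(term in d for d in corpus) # document frequency
    idf = math.log(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf

corpus = [["data", "science", "for", "business"],
          ["data", "mining", "process"],
          ["business", "strategy"]]
print(round(tfidf("mining", corpus[1], corpus), 3))
```

A term that appears in every document gets IDF 0, which is exactly the "measuring sparseness" idea: ubiquitous words carry little discriminating information.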


11. Decision Analytic Thinking II: Toward Analytical Engineering Analytical engineering is designing a solution to a business problem The point is to promote thinking about problems data analytically

12. Other Data Science Tasks and Techniques It is better to cast a new problem into a set of known problems Co-occurrences and Associations: Finding Items That Go Together Measuring Surprise: Lift and Leverage Example: Beer and Lottery Tickets Association and co-occurrences Associations among Facebook Likes Profiling: Finding Typical Behavior Link Prediction and Social Recommendation Causal reasoning from data Latent information mining Example: Movie recommendation Data Reduction, Latent Information, and Movie Recommendation
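The lift and leverage measures of surprise for co-occurring items (like the beer and lottery tickets example) can be sketched as below; the transaction baskets are invented toy data:

```python
# Sketch: lift and leverage for item co-occurrence.

def lift_and_leverage(transactions, a, b):
    """Compare observed co-occurrence of a and b with chance."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    lift = p_ab / (p_a * p_b)       # >1: co-occur more often than chance
    leverage = p_ab - p_a * p_b     # same comparison as a difference
    return lift, leverage

baskets = [{"beer", "lottery"}, {"beer", "lottery"}, {"beer"}, {"milk"}]
lift, leverage = lift_and_leverage(baskets, "beer", "lottery")
```

Lift expresses the surprise as a ratio and leverage as a difference; leverage behaves better when the individual probabilities are very small.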

13. Data Science and Business Strategy Discussion of the chapter: The interaction between Data Science and business strategy - High-level perspective on choosing problems to be solved with data science The fundamental concepts of data science allow us to think clearly about strategic issues As a whole, the array of concepts is useful for tactical business decisions such as evaluating proposals for data science projects Curator of data science capability Acquiring and sustaining competitive advantage via data science How does a business ensure that it gets the most out of its data? (i) The management must think analytically - Managers do not have to be Data Scientists They must understand the fundamentals

So they understand and appreciate the Data Science opportunities So they supply the appropriate resources to the Data Science teams So they are willing to invest in data experimentation They need to accept that, as data science projects span so much of a business, a diverse team is essential

Managers don't need to be data scientists, and data scientists do not necessarily need deep expertise in business solutions An effective data science team involves the two, and each one needs an understanding of the fundamentals of the other - If there is no management support, the company will be at a disadvantage to a competitor who invests in doing the data science well - If management has a vague idea of the potential of predictive modeling, but does not invest in proper training, then the success will be partial at best (ii) Management will create a culture to make data science and scientists thrive - A data-rich system is developed for one application - other applications throughout the business become obvious Achieving Competitive Advantage with Data Science Data and data science capabilities are complementary assets - Do we have a unique data asset? - If not, do we have an asset the utilization of which is better aligned with our strategy than with the strategy of our competitors? - ... Or are we able to take advantage of the data asset due to our better data science capability? The value of the asset must be carefully considered - The asset must be valuable in the context of the strategy - For example, big data can be more valuable for a web-based vendor than for a retail-based one



Sustaining Competitive Advantage with Data Science Even if we can achieve competitive advantage, can we sustain it? - Competitors must either not possess the asset or must not be able to obtain the same value from it - Plan to keep ahead of the competitors Keep investing in new data sets If you have a great team you may be able to keep ahead of the competition Always be developing new techniques and capabilities

- Plan to make it impossible for the competition to replicate your model The data products themselves can increase the cost to competitors of replicating the data set (think Google or Amazon) Formidable Historical Advantage Unique Intellectual Property Unique Intangible Collateral Assets

Your model is not what your data scientists design, it is what your engineers implement Superior Data Science Management - Good data science managers must possess exceptional abilities Be able to communicate well: Be respected by "techies" and "suits"

Coordinate complex technical activities Anticipate outcomes of Data Science projects Truly appreciate and understand the needs of the business Be ready to accept creative ideas from any source

Many times the least data-savvy workers can provide insight on how to improve the business with data science They need to do all this within the particular context of the firm

- Be ready to evaluate proposals for data science projects Example Data Mining Proposal Flaws in the Big Red Proposal Each stage in the data mining process reveals questions that should be asked both in formulating proposals for projects and evaluating them

Is the business problem well specified? Does the Data Science solution solve the problem? Is it clear how we would evaluate the solution? Would we be able to see evidence of success before making a huge investment in deployment? Does the firm have the data assets it needs? - Attracting and Nurturing Data Scientists and Their Teams The market for data scientists is very competitive

There is a huge variance in the quality and ability of data scientists In the yearly KDD competition, the same teams always win Only top-notch data scientists can hire superb data scientists Aspiring data scientists work as apprentices to masters They must be around top-notch data scientists A top-notch data scientist must have a strong professional network Data science problems are unique, and a one-hammer-for-all-nails approach does not fit here Data science is a craft learned by experience Example of visions for Data Scientists

Many want to have more individual influence Many want more responsibility and the experience that comes with the process of producing a Data Science solution Some want to become Chief Scientists for a firm Some may want to become entrepreneurs Some would simply enjoy the thrill of taking part in a fast-growing venture A firm can hire a PhD for $50k/year and have a good project done for a fraction of the price



Examine Data Science Case Studies - The best way to position oneself for success is to work through many examples of the application of data science to business problems - Mining data is helpful, but even more important is working through the connection between the business problem and the possible data science solutions. A Firm's Data Science Maturity How systematic and well founded are the processes used to guide the firm's data science projects? A firm with a medium level of maturity employs well-trained data scientists as well as business managers who understand the fundamental principles of Data Science High-end maturity firms are continuously trying to improve the data science processes Danger: to use the same sort of processes that work for software engineering, or worse, for manufacturing or operations. Doing so will send a firm's best data scientists out the door before the management even knows what happened

14. Conclusion The Fundamental Concepts of Data Science Data science = Analytical Engineering + Exploration. It has 3 types of fundamental concepts How to fit Data Science in the organization - How to build successful teams - Thinking about how Data Science leads to competitive advantage - Tactics for doing well with Data Science projects Ways of thinking data analytically - Data mining process - High-level Data Science tasks - Principles Data is an asset The expected-value framework helps us structure the problems so we can identify the data mining problems and the connections in terms of costs, benefits and constraints Generalization and overfitting Applying data science requires different levels of effort than practicing exploratory data mining
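The expected-value framing mentioned above can be sketched as a one-line targeting decision. All probabilities, benefits, and costs below are hypothetical numbers chosen for illustration:

```python
# Sketch: expected-value framework for a targeting decision.

def expected_profit(p_respond, benefit_if_respond, cost_of_targeting):
    """Expected profit of targeting one customer."""
    return p_respond * benefit_if_respond - cost_of_targeting

# Target a customer only when the expected profit is positive.
worth_it = expected_profit(0.02, 200.0, 1.0)      # 0.02 * 200 - 1 = 3.0
not_worth_it = expected_profit(0.02, 200.0, 5.0)  # 0.02 * 200 - 5 = -1.0
```

The value of the framework is that it separates what the model estimates (the probability) from what the business knows (costs and benefits), so each can be questioned independently.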

How to extract knowledge from data - Identifying informative attributes - Fitting a numeric function model by choosing an objective and finding the parameters - Controlling complexity to find a good balance between generalization and overfitting - Calculating similarities When we understand the concepts, we can see that every method and technique we learn is an instance of one of the principles Understanding the concepts facilitates the communication across the business You should be more confident that you will understand explanations on data science projects The author presents an example of how the principles help us better understand and attack a problem on how to target mobile devices with ads Changing the Way We Think about Solutions to Business Problems Be very aware of the times when you slightly change the problem because you have a particular arrangement in your data, because doing this may cause communication problems down the line What Data Can't Do: Humans in the Loop, Revisited Only humans can tell what is the best objective criterion to optimize for a particular problem The data used has been produced with some level of human intervention, so we should never expect the data to be representative of the objective truth We need to discern for which problems Data Science is likely to add value



Privacy, Ethics, and Mining Data about Individuals This is a big problem because the more personal the data, the more effective the Data Science techniques There are problems even defining privacy (See Daniel Solove's article "A Taxonomy of Privacy" (2006) and Helen Nissenbaum's book "Privacy in Context") Is There More to Data Science? Yes, there are more concepts not covered in this book Final Example: From Crowd-Sourcing to Cloud-Sourcing Final Words The concepts are always useful, even if you have decades of experience This is the Data Scientist's challenge: explaining exactly why your work is relevant to helping the business

A. Proposal Review Guide Business and data understanding What exactly is the business problem to be solved? Is the Data Science solution well formulated? What business entity does an instance correspond to? Is the problem supervised or unsupervised? If supervised - Is a target variable defined? Is it defined precisely? Think about the values it can take

- Will modeling this target variable improve the stated business problem? A subproblem? Is the rest of the business problem addressed? Does framing the problem help to structure the sub-tasks to be solved?

If unsupervised - Is there an "Exploratory Data Analysis" path well defined? - Where is the analysis going? Are the attributes defined precisely? Think about the values they can take Data preparation Will it be practical to create a single table? If not, what are the alternatives? If supervised, be super-careful with the target variable Where are the data drawn from? A population similar to that where the model will be applied? Modeling Question your choice of model Classification, class probability estimation, ranking, regression, clustering, etc. Does the model meet the other requirements of the task? Generalization performance, comprehensibility, speed of learning Is it compatible with prior knowledge of the problem? Should various models be tried and compared? Evaluation and Deployment Is there a plan for domain-knowledge validation? Is the evaluation metric appropriate for the business task? Are business costs and benefits considered? How is a classification threshold chosen? Is ranking more appropriate? Does the evaluation use holdout data?

