Quantitative Methods Management tutorials
Quantitative Methods Pre-Assessment Test Introduction Welcome to the pre-assessment test for the HBS Quantitative Methods Tutorial. Students with a strong statistics background may take the pre-assessment test to satisfy the quantitative methods requirement without taking the tutorial. To satisfy the requirement, you will need to answer at least 75% of the questions correctly. This is an open-book multiple-choice exam. To advance from one question to the next, you must select one of the four answer choices and click the Submit button. After submitting your answer, you will not be able to change it or return to the question, so make sure you are satisfied with your selection before you submit each answer. In the briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text. Your exam results will be displayed immediately upon completion of the exam. The exam results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for the questions you answered incorrectly. After completing the exam, you can review your test results at any time by returning to this screen and clicking OK. If you haven't yet taken the test, click Pre-Assessment Test on the navigation on the left to begin. Good luck! Frequently Asked Questions How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the course. Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book examination. May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question. Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with the material, but you may take longer if you need to. What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to the exam site. How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for any questions you answered incorrectly. Pre-Assessment Test [Exam content not shown] Overview & Introduction Welcome to QM... Welcome! You are about to embark on a journey that will introduce you to the basics of quantitative and statistical analysis. This course will help you develop your skills and instincts in applying quantitative methods to formulate, analyze, and solve management decision-making problems. Click on the link labeled "The Tutorial and its Method" in the left menu to get started. The Tutorial and its Method QM is designed to help you develop quantitative analysis skills in business contexts. Mastering its content will help you
evaluate management situations you will face not only in your studies but also as a manager. Click on the right arrow icon below to advance to the next page. This isn't a formal or comprehensive tutorial in quantitative methods. QM won't make you a statistician, but it will help you become a more effective manager. The tutorial's primary emphasis is on developing good judgment in analyzing management problems. Whether you are learning the material for the first time or are using QM to refresh your quantitative skills, you can expect the tutorial to improve your ability to formulate, analyze, and solve managerial problems.
You won't be learning quantitative analysis in the typical textbook fashion. QM's interactive nature provides frequent opportunities to assess your understanding of the concepts and how to apply them — all in the context of actual management problems. You should take 15 to 20 hours to run through the whole tutorial, depending on your familiarity with the material. QM offers many features we hope you will explore, utilize, and enjoy.
The Story and its Characters Naturally, the most appropriate setting for a course on statistics is a tropical island... Somehow, "internship" is not the way you'd describe your summer plans to your friends. You're flying out to Hawaii after all, staying at a 5-star hotel as a Summer Associate with Avio Consulting. This is a great learning opportunity, no doubt about it. To think that you had almost skipped over this summer internship, as you prepared to enroll in a two-year MBA program this fall.
You are also excited that the firm has assigned Alice, one of its rising stars, as your mentor. It seems clear that Avio partners consider you a high potential intern — they are willing to invest in you with the hope that you will later return after you complete your MBA program. Alice recently received the latest in a series of quick promotions at Avio. This is her first assignment as a project lead: providing consulting assistance to the Kahana, an exclusive resort hotel on the Hawaiian island Kauai.
Needless to say, one of the perks of the job is the lodging. The Kahana's brochure looks inviting — luxury suites, fine cuisine, a spa, sports activities. And above all, the pristine beach and glorious ocean. After your successful interview with Avio, Alice had given you a quick briefing on the hotel and its manager, Leo. Leo inherited the Kahana just three years ago. He has always been in the hospitality industry, but the sheer scope of the luxury hotel's operations has him slightly overwhelmed. He has asked for Avio's help to bring a more rigorous approach to his management decision-making processes.
Using the Tutorial: A Guide to Tutorial Resources Before you start packing your beach towel, read this section to learn how to use this tutorial to your greatest advantage. QM's structure and navigational tools are easy to master. If you're reading this text, you must have clicked on the link labeled "Using the Tutorial" on the left. These navigation links open interactive clips (like this one) here. There are three types of interactive clips: Kahana Clips, Explanatory Clips, and Exercise Clips. Kahana Clips pose problems that arise in the context of your consulting engagement at the Kahana. Typically, one clip will have Leo assign you and Alice a specific task. In a later Kahana Clip you will analyze the problem, and you and Alice will present your results to Leo for his consideration. The Kahana clips will give you exposure to the types of business problems that benefit from the analytical methods you'll be learning, and a context for practicing the methods and interpreting their results. To fully benefit from the tutorial, you should solve all of Leo's problems. At the end of the tutorial, a multiple-choice assessment exam will evaluate your understanding of the material. In Explanatory Clips, you will learn everything needed to analyze management problems like Leo's. Complementing the text are graphs, illustrations, and animations that will help you understand the material. Keep on your toes: you'll be asked questions even in Explanatory Clips that you should answer to check your understanding of the concepts. Some explanatory clips give you directions or tips on how to use the analytical and computational features of Microsoft Excel. Facility with the necessary Excel functions will be critical to solving the management decision problems in this course. QM is supplemented with spreadsheets of data relating to the examples and problems presented. When
you see a Briefcase link in a clip, we strongly encourage you to click on the link to access the data. Then, practice using the Excel functions to reproduce the graphs and analyses that appear in the clips. You will also see Data links that you should click to view summary data relating to the problem. Exercise Clips provide additional opportunities for you to test your understanding of the material. They are a resource that you can use to make sure that you have mastered the important concepts in each section. Work through exercises to solidify your knowledge of the material. Challenge exercises provide opportunities to tackle somewhat more advanced problems. The challenge exercises are optional; you do not need to complete them to gain the mastery needed to pass the tutorial assessment test. The arrow buttons immediately below are used for navigation within clips. If you've made it this far, you've been using the one on the right to move forward. Use the one on the left if you want to back up a page or two. In the upper right of the QM tutorial screen are six buttons. From left to right, they are links to Help, Discuss, Notes, Briefcase, Glossary, and Print. To access additional Help features, click on the Help icon. Use the discussion board to discuss course materials with your classmates, ask questions, and share any previous on-the-job experiences you may have had applying the concepts in the course. HBS staff and faculty will also use the discussion board to post clarifying information from time to time. To access the discussion board, click on the Discuss icon. The Notes summarize the content of the Explanatory Clips. Can't recall all the essential steps of a hypothesis test? Find them in the Notes. In your Briefcase you'll find all the data you'll need to complete the course, neatly stored as Excel workbooks. In many of the clips there will be links to specific documents in the Briefcase, but the entire Briefcase is available at any time. In the Glossary/Index you'll find a list of helpful definitions of terms used in the course, along with brief descriptions of the Excel functions it uses. At the end of the tutorial, you'll have the opportunity to evaluate it. In the meantime, as you work through QM, you may have comments or feedback on the material. We invite your feedback at any time: click on the Feedback icon on the navigation bar below. The page you are currently viewing will be recorded with your feedback. We encourage you to use all of QM's features and resources to the fullest. They are designed to help you build an intuition for quantitative analysis that you will need as an effective and successful manager.
... and Welcome to Hawaii! The day of departure has come, and you're in flight over the Pacific Ocean. Alice graciously let you take the window seat, and you watch as the foggy West Coast recedes behind you. I've been to Hawaii before, so I'll let you have the experience of seeing the islands from the air before you set foot on them.
This Leo sounds like quite a character. He's been in business all his life, involved in many ventures — some more successful than others. Apparently, he once owned and managed a gourmet spam restaurant! Spam is really popular among the islanders. Leo tried to open a second location in downtown Honolulu for the tourists, but that didn't do so well. He had to declare bankruptcy. Then, just three years ago, his aunt unexpectedly left him the Kahana. Now Leo is back in business, this time with a large operation on his hands. It sounds to me like he's the kind of manager who usually relies on gut instincts to make business decisions, and likes to take risks. I think he's hired Avio to help him make managerial decisions with, well, better judgment. He wants to learn how to approach management problems in a more sophisticated, analytical fashion. We'll be using some basic statistical tools and methods. I know you're no expert in statistics, but I'll fill you in along the way. You'll be surprised at how quickly they'll become second nature to you. I'm confident you'll be able to do quite a bit of the analytic work soon.
Leo and the Hotel Kahana
Once your plane touches down in Kauai, you quickly pick up your baggage and meet your host, Leo, outside the airport. Inheriting the Kahana came as a big surprise. My aunt had run the Kahana for a long time, but I never considered that she would leave it to me. Anyway, I've been trying my best to run the Kahana the way a hotel of its quality deserves. I've had some ups and downs. Things have been fairly smooth for the past year now, but I've realized that I have to get more serious about the way I make decisions. That's where you come into the picture. I used to be quite a risk-taker. I made a lot of decisions on impulse. Now, when I think of what I have to lose, I just want to get it right. After you arrive at the Kahana, Leo personally shows you to your rooms. "I have a table reserved for the three of us at 8 in the main restaurant," Leo announces. "You just have to try our new chef's mango and brie tart."
Basics: Data Description Leo's Data Mine After your welcome dinner in the Kahana's main restaurant, Leo asks you and Alice to meet him the next morning. You wake up early enough to take a short walk on the beach before you make your way to Leo's office. Good morning! I hope you found your rooms comfortable last night and are starting to recover from your trip.
Unfortunately, I don't have much time this morning. As you requested on the phone, I've assembled the most important data on the Kahana. It wasn't easy — this hasn't been the most organized hotel in the world, especially since I took over. There's just so much to keep track of. Thank you, Leo. We'll have a look at your data right away, so we can get a more detailed understanding of the Kahana and the type of data you have available for us to work with. Anything in particular that you'd like us to focus on as we peruse your files? Yes. There are two things in particular that have been on my mind recently. For one, we offer some recreational activities here at the Kahana, including a scuba diving certification course. I contract out the operations to a local diving school. The contract is up soon, and I need to renew it, hire another school, or discontinue offering scuba lessons altogether. I'd like you to get me some quotes from other diving schools on the island so I get an idea of the competition's pricing and how it compares to the school I've been using. I'm also very concerned about hotel occupancy rates. As you might imagine, the Kahana's occupancy fluctuates during the year, and I'd like to know how, when, and why. I'd love to have a better feeling for how many guests I can expect in a given month. These files contain some information about tourism on the island, but I'd really like you to help me make better sense of it. Somehow I feel that if I could understand the patterns in the data, I could better predict my own occupancy rates. That's what we're here to do. We'll take a look at your files to get better acquainted with the Kahana, and then focus on diving school prices and occupancy patterns. Thanks, or as we say in Hawaiian, Mahalo. By the way, we're not too formal here in Hawaii. As you probably noticed, your suite, Alice, includes a room that has been set up as an office. But feel free to take your work down to the beach or by the pool whenever you like. Thanks! We'll certainly take advantage of that.
Later, under a parasol at the beach, you pore over Leo's folders. Feeling a bit overwhelmed, you find yourself staring out to sea. Alice tells you not to worry: "We have a number of strategies we can use to compile a mountain of data like this into concise and useful information. But no matter what data you are working with, always make sure you really understand the data before doing a lot of analysis or making managerial decisions." What is Alice getting at when she tells you to "understand the data?" And how can you develop such an understanding? Describing and Summarizing Data Data can be represented by graphs like histograms. These visual displays allow you to quickly recognize patterns in the distribution of data.
Working with Data Information overload. Inventory costs. Payroll. Production volume. Asset utilization. What's a manager to do? The data we encounter each day have valuable information buried within them. As managers, we can greatly improve the quality of our decisions by correctly analyzing financial, production, or marketing data. Analyzing data can be revealing, but challenging. As managers, we want to extract as much of the relevant information and insight as possible from the data we have available. When we acquire a set of data, we should begin by asking some important questions: Where do the data come from? How were they collected? How can we help the data tell their story? Suppose a friend claims to have measured the heights of everyone in a building. She reports that the average height was three and a half feet. We might be surprised... ... until we learn that the building is an elementary school. We'd also want to know if our friend used a proper measuring stick. Finally, we'd want to be sure we knew how she measured height: with or without shoes. Before starting any type of formal data analysis, we should try to get a preliminary sense of the data. For example, we might first try to detect any patterns, trends, or relationships that exist in the data. We might start by grouping the data into logical categories. Grouping data can help us identify patterns within a single
category or across different categories. But how do we do this? And is this often time-consuming process worth it? Accountants think so. Balance Sheets and Profit and Loss Statements arrange information to make it easier to comprehend. In addition, accountants separate costs into categories such as capital investments, labor costs, and rent. We might ask: Are operating expenses increasing or decreasing? Do office space costs vary much from year to year? Comparing data across different years or different categories can give us further insight. Are selling costs growing more rapidly than sales? Which division has the highest inventory turns?
Histograms In addition to grouping data, we often graph them to better visualize any patterns in the data. Seeing data displayed graphically can significantly deepen our understanding of a data set and the situation it describes.
To see the value a graphical approach can add, let's look at worldwide consumption of oil and gas in 2000. What questions might we want to answer with the energy data? Which country is the largest consumer? How much energy do most countries use? In order to create a graph that provides good visual insight into these questions, we might sort the countries by their level of energy consumption, then group together countries whose consumption falls in the same range — e.g., the countries that use 100 to 199 million tonnes per year, or 200 to 299 million tonnes. We can find the number of countries in each range, and then create a bar graph in which the height of each bar represents the number of countries in each range. This graph is called a histogram. A histogram shows us where the data tend to cluster. What are the most common values? The least common? For example, we see that most countries consume less than 100 million tonnes per year, and the vast majority less than 200 million tonnes. Only three countries, Japan, Russia, and the US, consume more than 300 million tonnes per year.
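The grouping-and-counting step that underlies a histogram can be sketched in a few lines of code. In this Python sketch the consumption figures are hypothetical placeholders rather than the actual 2000 data; only the binning logic matters here.

# Rough sketch of the grouping step behind a histogram.
# The consumption values below are hypothetical, in million tonnes per year.
consumption = [12, 45, 78, 130, 95, 310, 22, 180, 64, 250, 8, 410]
bins = [(0, 99), (100, 199), (200, 299), (300, 399), (400, 499)]

counts = {b: 0 for b in bins}
for value in consumption:
    for low, high in bins:
        if low <= value <= high:
            counts[(low, high)] += 1
            break

for (low, high), n in counts.items():
    print(f"{low}-{high} million tonnes: {'#' * n} ({n} countries)")

Each printed row plays the role of one histogram bar: the number of '#' characters is the count of countries falling in that range.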
Why are there so many countries in the first range — the lowest consumption? What factors might influence this? Population might be our first guess. Yet despite a large population, India's energy consumption is significantly less than that of Germany, a much smaller nation. Why might this be? Clearly other factors, like climate and the extent of industrialization, influence a country's energy usage.
Outliers In many data sets, there are occasional values that fall far from the rest of the data. For example, if we graph the age distribution of students in a college course, we might see a data point at 75 years. Data points like this one that fall far from the rest of the data are known as outliers. How do we interpret them?
First, we must investigate why an outlier exists. Is it just an unusual but valid value? Could it be a data entry error? Was it collected in a different way than the rest of the data? At a different time? We might discover that the data point refers to a 75-year-old retiree taking the course for fun.
After making an effort to understand where an outlier comes from, we should have a deeper understanding of the situation the data represent. Then, we can think about how to handle the outlier in our analysis. Typically, we do one of three things: leave the outlier alone, or — very rarely — remove it or change it to a corrected value. A senior citizen in a college class may be an outlier, but his age represents a legitimate value in the data set. If we truly want to understand the age distribution of all students in the class, we would leave the point in. Or, if we now realize that what we really want is the age distribution of students in the course who are also enrolled in full-time degree-granting programs, we would exclude the senior citizen and all other non-degree program students enrolled in the course. Occasionally, we might change the value of an outlier. This should be done only after examining the underlying situation in great detail. For example, if we look at the inventory graph below, a data point showing 80 pairs of roller-blades in inventory would be highly unusual. Notice that the data point "80" was recorded on April 13th, and that the inventory was 10 pairs on April 12th, and 6 on April 14th. Based on our management understanding of how inventory levels rise and fall, we realize that the value of 80 is extraordinarily unlikely. We conclude that the data point was likely a data entry error. Further investigation of sales and purchasing records reveals that the actual inventory level on that day was 8, not 80. Having found a reliable value, we correct the data point. Excluding or changing data is not something we do often. We should never do it to help the data 'fit' a conclusion we want to draw. Such changes to a data set should be made on a case-by-case basis only after careful investigation of the situation.
Summary With any data set we encounter, we must find ways to allow the data to tell their story. Ordering and graphing data sets often expose patterns and trends, thus helping us to learn more about the data and the underlying situation. If data can provide insight into a situation, they can help us to make the right decisions.
Creating Histograms Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to create histograms using the Histogram tool. However, we suggest you read through the instructions to learn how Excel creates histograms so you can construct them in the future when you do have access to the Data Analysis ToolPak. To check if the ToolPak is installed on your computer, go to the Data tab on the Ribbon in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins" and highlight the "Analysis ToolPak" in the list and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak. Creating a histogram with Excel involves two steps: preparing our data, and processing them with the Data Analysis Histogram tool. To prepare the data, we enter or copy the values into a single column in an Excel worksheet. Often, we have specific ranges in mind for classifying the data. We can enter these ranges, which Excel calls "bins," into a second column of data.
On the Ribbon, select the Data tab, and then choose Data Analysis. In the Data Analysis pop-up window, choose Histogram and click OK. Click on the Input Range field and enter the range of data values by either typing the range or by dragging the cursor over the range. Next, to use the bins we specified, click on the Bin Range field and enter the appropriate range. Note: if we don't specify our own bins, Excel will create its own bins, which are often quite peculiar. Click the Chart Output checkbox to indicate that we want a histogram chart to be generated in addition to the summary table, which is created by default. Click New Worksheet Ply, and enter the name you would like to give the output sheet. Finally, click OK, and the histogram with the summary table will be created in a new sheet. Central Values for Data Graphs are very useful for gaining insight into data. However, sometimes we would like to summarize the data in a concise way with a single number.
The Mean Often, we'd like to summarize a set of data with a single number. We'd like that summary value to describe the data as well as possible. But how do we do this? Which single value best represents an entire set of data? That depends on the data we're investigating and the type of questions we'd like the data to answer. What number would best describe employee satisfaction data collected from annual review questionnaires? The numerical average would probably work quite well as a single value representing employees' experiences.
To calculate average — or mean — employee satisfaction, we take all the scores, sum them up, and divide the result by 11, the number of surveys. The Greek letter mu represents the mean of the data set. The mean is by far the most common measure used to describe the "center" or "central tendency" of a data set. However, it isn't always the best value to represent data. Outliers can exercise undue influence and pull the mean value towards one extreme. In addition, if the distribution has a tail that extends out to one side — a skewed distribution — the values on that side will pull the mean towards them. Here, the distribution is strongly skewed to the right: the high value of US consumption pulls the mean to a value higher than the consumption of most other countries. What other numbers can we use to find the central tendency of the data? The Median
Let's look at the revenues of the top 100 companies in the US. The mean revenue of these companies is about $42 billion. How should we interpret this number? How well does this average represent the revenues of these companies?
When we examine the revenue distribution graphically, we see that most companies bring in less than $42 billion of revenue a year. If this is true, why is the mean so high? As our intuition might tell us, the top companies have revenues that are much higher than $42 billion. These higher revenues pull up the average considerably. In cases like income, where the data are typically very skewed, the mean often isn't the best value to represent the data. In these cases, we can use another central value called the median. The median is the middle value of a data set whose values are arranged in numerical order. Half the values are higher than the median, and half are lower. For these revenue data, the median revenue of the top 100 US companies is $30 billion, significantly less than $42 billion. Half of all the companies earn less than $30 billion, and half earn more than $30 billion. Median revenue is a more informative revenue estimate because it is not pulled upwards by a small number of companies with very high revenues. How can we find the median? With an odd number of data points, listed in order, the median is simply the middle value. For example, consider this set of 7 data points. The median is the 4th data point, $32.51. In a data set with an even number of points, we average the two middle values — here, the fourth and fifth values — and obtain a median of $41.92. When deciding whether to use a mean or median to represent the central tendency of our data, we should weigh the pros and cons of each. The mean weighs the value of every data point, but is sometimes biased by outliers or by a highly skewed distribution. By contrast, the median is not biased by outliers and is often a better value to represent skewed data.
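For reference, here are the two definitions written as formulas, keeping the tutorial's notation of mu for the mean. For a data set with n values x_1, x_2, ..., x_n, the mean is

\mu = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i

and, with the values sorted in increasing order, the median is the middle value when n is odd, or the average of the two middle values, \frac{x_{n/2} + x_{n/2+1}}{2}, when n is even. This simply restates the verbal definitions above in symbols.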
The Mode A third statistic to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We might use the mode to represent data when knowing the average value isn't as important as knowing the most common value. In some cases, data may cluster around two or more points that occur especially frequently, giving the histogram more than one peak. A distribution that has two peaks is called a bimodal distribution.
Summary To summarize a data set using a single value, we can choose one of three values: the mean, the median, or the mode. They are often called summary statistics or descriptive statistics. All three give a sense of the "center" or "central tendency" of the data set, but we need to understand how they differ before using them: Finding The Mean In Excel To find the mean of a data set entered in Excel, we use the AVERAGE function. We can find the mean of numerical values by entering the values in the AVERAGE function, separated by commas. In most cases, it's easier to calculate a mean for a data set by indicating the range of cell references where the data are located. Excel ignores blank values in cells, but not zeros. Therefore, we must be careful not to put a zero in the data set if it does not represent an actual data point.
Finding The Median In Excel Excel can find the median, even if a data set is unordered, using the MEDIAN function. The easiest way to calculate a data set's median is to select a range of cell references.
Finding The Mode In Excel Excel can also find the most common value of a data set, the mode, using the MODE function. If more than one mode exists in a data set, Excel will find the one that occurs first in the data. Mean, median, and mode are fairly intuitive concepts. Already, Leo's mountain of data seems less intimidating.
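If you want to double-check Excel's results outside a spreadsheet, Python's standard statistics module offers direct counterparts to these three functions. The satisfaction scores below are made-up values, used only to show the calls.

import statistics

scores = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4, 5]   # hypothetical 1-5 satisfaction ratings

print("mean:  ", statistics.mean(scores))     # counterpart of Excel's AVERAGE
print("median:", statistics.median(scores))   # counterpart of Excel's MEDIAN
print("mode:  ", statistics.mode(scores))     # counterpart of Excel's MODE; tie handling varies by Python version

For these eleven scores the mean is about 3.9, the median is 4, and the mode is 4, mirroring what the three Excel functions would return for the same column of data.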
Variability The mean, median and mode give you a sense of the center of the data, but none of these indicate how far the data are spread around the center. "Two sets of data could have the same mean and median, and yet be distributed completely differently around the center value," Alice tells you. "We need a way to measure variation in the data."
The Standard Deviation It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or are the values widely dispersed?
Let's look at an example. To identify good target markets, a car dealership might look at several communities and find the average income of each. Two communities — Silverhaven and Brighton — have average household incomes of $95,500 and $97,800. If the dealer wants to target households with incomes above $90,000, he should focus on Brighton, right? We need to be more careful: the mean income doesn't tell the whole story. Are most of the incomes near the mean, or is there a wide range around the average income? A market might be less attractive if fewer households have an income above the dealer's target level. Based on average income alone, Brighton might look more attractive, but let's take a closer look at the data. Despite having a lower average income, incomes in Silverhaven have less variability, and more households are in the dealer's target income range. Without understanding the variability in the data, the dealer might have chosen Brighton, which has fewer targeted homes. Clearly it would be helpful to have a simple way to communicate the level of variability in the household incomes in two communities. Just as we have summary statistics like the mean, median, and mode to give us a sense of the 'central tendency' of a data set, we need a summary statistic that captures the level of dispersion in a set of data. The standard deviation is a common measure for describing how much variability there is in a set of data. We represent the standard deviation with the Greek letter sigma: The standard deviation emerges from a formula that looks a bit complicated initially, so let's try to understand it at a conceptual level first. Then we'll build up step by step to help understand where the formula comes from. The standard deviation tells us how far the data are spread out. A large standard deviation indicates that the data are widely dispersed. A smaller standard deviation tells us that the data points are more tightly clustered together.
Calculating A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a staffing plan for Saturdays, typically a heavy traffic day. In the hospitality industry, like many service industries, proper staffing can make the difference between unhappy guests and satisfied customers who want to return. On the other hand, overstaffing is a costly mistake. Knowing the average number of customer requests for services during a shift gives the manager an initial sense of her staffing needs; knowing the standard deviation gives her invaluable additional information about how those requests might vary across different days. The average number of customer requests is 172, but this doesn't tell us there are 172 requests every Saturday. To staff properly, the hotel manager needs a sense of whether the number of requests will typically be between 150 and 195, for example, or between 120 and 220.
To calculate the standard deviation for data — in this case the hotel traffic — we perform two steps. The first is to calculate a summary statistic called the variance. Each Saturday's number of requests lies a certain distance from 172, the mean number of requests. To find the variance, we first sum the squares of these differences. Why square the differences? A hotel manager would want information about the magnitude of each difference, which can be positive, negative, or zero. If we simply summed the differences between each Saturday's requests and the mean, positive and negative differences would cancel each other out. But we are interested in the magnitude of the differences, regardless of their sign. By squaring the differences, we get only positive numbers that do not cancel each other out in a sum. The formula for variance adds up the squared differences and divides by n-1 to get a type of "average" squared difference as a measure of variability. (The reason we divide by n-1 to get an average here is a technicality beyond the scope of this course.) The variance in the hotel's front desk requests is 637.2. Can we use this number to express the variability of the data? Sure, but variances don't come out in the most convenient form. Because we square the differences, we end up with a
value in 'squared' requests. What is a request-squared? Or a dollar-squared, if we were solving a problem involving money?
We would like a way to express variability that is in the same units as the original data — front-desk requests, for example. The standard deviation — the first formula we saw — accomplishes this. The standard deviation is simply the square root of the variance. It returns our measure to our original units. The standard deviation for the hotel's Saturday desk traffic is 25.2 requests.
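Written out explicitly, the two-step calculation described above is as follows, keeping the tutorial's notation of mu for the mean and sigma for the standard deviation, and dividing by n - 1 as noted:

\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n - 1} \qquad\qquad \sigma = \sqrt{\sigma^2}

For the front-desk data, the variance of 637.2 "squared requests" gives a standard deviation of \sqrt{637.2} \approx 25.2 requests.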
Interpreting What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been 50 requests. With a larger standard deviation, the data would be spread farther from the mean. A higher standard deviation would translate into more difficult staffing: when request traffic is unusually high, disgruntled customers wait in long lines; when traffic is very low, desk staff are idle. For a data set, a smaller standard deviation indicates that more data points are near the mean, and that the mean is more representative of the data. The lower the standard deviation, the more stable the traffic, thereby reducing both customer dissatisfaction and staff idle time. Fortunately, we almost never have to calculate a standard deviation by hand. Spreadsheet tools like Excel make it easy for us to calculate variance and standard deviation.
Summary The standard deviation measures how much data vary about their mean value.
Finding The Standard Deviation In Excel Excel's STDEV function calculates the standard deviation. To find the standard deviation, we can enter data values into the STDEV formula, one by one, separated by commas. In most cases, however, it's much easier to select a range of cell references to calculate a standard deviation. To calculate variance, we can use Excel's VAR function in the same way.
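As a cross-check outside Excel, Python's statistics module computes the same sample statistics, dividing by n - 1 just as STDEV and VAR do. The request counts below are illustrative placeholders rather than the hotel's actual data.

import statistics

requests = [148, 160, 172, 181, 195, 139, 210, 171]   # hypothetical Saturday request counts

print("variance:          ", statistics.variance(requests))   # sample variance, like Excel's VAR
print("standard deviation:", statistics.stdev(requests))      # sample standard deviation, like Excel's STDEV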
The Coefficient of Variation The standard deviation measures how much a data set varies from its mean. But the standard deviation only tells you so much. How can you compare the variability in different data sets? A standard deviation describes how much the data in a single data set vary. How can we compare the variability of two data sets? Do we just compare their standard deviations? If one standard deviation is larger, can we say that data set is "more variable"?
Standard deviations must be considered within the data's context. The standard deviations for two stock indices below — The Street.Com (TSC) Internet Index and the Pacific Exchange Technology (PET) Index — were roughly equivalent over the same period. But were the two indices equally variable? If the average price of an index is $200, a $20 standard deviation is relatively high (10% of the average); if the average is $700, $20 is relatively low (not quite 3% of the average). To gauge volatility, we'd certainly want to know that PET's average index price was over three and a half times higher than TSC's average index price. To get a sense of the relative magnitude of the variation in a data set, we want to compare the standard deviation of the data to the data's mean. We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which is simply the ratio of the standard deviation to the mean. It can be interpreted as the standard deviation expressed as a percent of the mean. To get a feeling for the coefficient of variation, let's compare a few data sets. Which set has the highest relative variation? Click the answer you select. Because the coefficient of variation has no units, we can use it to compare different kinds of data sets and find out which data set is most variable in this relative sense. The coefficient of variation describes the standard deviation as a fraction of the mean, giving you a standard measure of variability.
Summary
The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use it to compare variation in different data sets of different scales or units.
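As a formula, with sigma the standard deviation and mu the mean:

\text{coefficient of variation} = \frac{\sigma}{\mu}

In the stock index example above, a $20 standard deviation on a $200 average gives 20/200 = 0.10, or 10% of the mean, while the same $20 on a $700 average gives 20/700, about 0.03, even though the two standard deviations are identical.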
Applying Data Analysis After a good night's sleep, you meet Alice for breakfast. "It's time to get started on Leo's assignments. Could you get those price quotes from diving schools and prepare a presentation for Leo? We'll want to present our findings as neatly and concisely as possible. Use graphs and summary statistics wherever appropriate. Meanwhile, I'll start working on Leo's hotel occupancy problem." Pricing the Scuba Schools In addition to the school Leo is currently using, you find 20 other scuba services in the phone book. You call those 20 and get price quotes on how much they would charge the Kahana per guest for a Scuba Certification Course. Prices You create a histogram of the prices. Use the bin ranges provided in the data spreadsheet, or experiment with your own bins. If you do not have the Excel Analysis ToolPak installed, click on the Briefcase link labeled "Histogram" to see the finished histogram.
Prices Histogram This distribution is skewed to the right, since a tail of higher prices extends to the right side of the histogram. The shape of the distribution suggests that: a. The mean = the median This is not the best answer. When the histogram is skewed to one side, the mean and the median are different. If the histogram you constructed from the pricing data looks symmetric, try using the recommended bin sizes.
b. The mean > the median This is the best answer. The prices of the few expensive schools "pull" the mean towards the right.
c. The mean < the median This is not the best answer. When the histogram is skewed to the left, the mean is less than the median.
d. None of the above relationships can be determined from the histogram. This is not the best answer. It should be apparent from your histogram that the distribution is skewed to the right, in which case the mean is greater than the median. If the histogram you constructed from the pricing data looks symmetric, try using the recommended bin sizes.
Prices Histogram You calculate the key summary statistics. The correct values are (Mean, Median, Standard Deviation):
a. $307, $326, $60 This is not the correct answer. You may be confusing the mean and the median.
b. $307, $326, $67 This is not the correct answer. You may be confusing the mean and the median.
c. $326, $307, $60 This is not the correct answer. The standard deviation is $67.
d. $326, $307, $67
This is the correct answer.
Prices Histogram Your report looks good. This graphic is very helpful. At the moment, I'm paying $330 per guest, which is about average for the island. Clearly, I could get a cheaper deal — only 6 schools would charge a higher rate. On the other hand, maybe these more expensive schools offer a better diving experience? I wonder how satisfied my guests have been with the course offered by my current contractor... Exercise 1: VA Linux Stock Bonanza After a company completes its initial public offering, how is the ownership of common stock distributed among individuals in the firm, often termed "named insiders"? Let's examine a company, VA Linux, that chose to sell its stock in an Initial Public Offering (IPO) during the IPO craze in the late 1990s. According to its prospectus, after the IPO, VA Linux would have the following distribution of outstanding shares of common stock owned by insiders:
From the VA Linux common stock data, what could we learn by creating a histogram? (Choose the best answer)
a. The total number of shares of common stock owned by the named insiders. This is not the best answer. To find the total number of shares, it would be best to add up the raw data in tabulated form. Since histograms place data points in ranges, we'd have trouble finding the total of the individual values from a histogram.
b. The percentage of common stock owned by each of the named insiders in VA Linux's prospectus. This is not the best answer. The histogram specifies neither the exact number of shares owned by each individual nor the total number of outstanding shares of common stock, both of which we would need to compute the percentage of common stock owned by each insider.
c. How the ownership stakes are distributed among named insiders. This is the best answer. By converting the data into a histogram, the distribution of stock among the named insiders is apparent, and we get a good idea of how ownership is distributed inside this young company.
d. How the named insiders' shares compare to the holdings of outside investors who purchased shares in the IPO. This is not the best answer. Although this analysis would be interesting, we simply don't have the necessary data. We have no information about how much stock individuals other than the named stockholders will own after the IPO.
Exercise 2: Employee Turnover Here is a histogram graphing annual turnover rates at a consulting firm. Which summary statistic better describes these data?
a. The mean This is not the best answer. As you can see in the histogram, the data are strongly skewed to the right. A few years of uncharacteristically high turnover have a strong influence on the value of the mean. In cases such as this, the median is often a better descriptor for the center of the data.
b. The median This is the best answer. A few years of uncharacteristically high turnover have a strong influence on the value of the mean. In cases such as this, the median is often a better descriptor for the center of the data.
Exercise 3: Honidew Internship The J. B. Honidew Corporation offers a prestigious summer internship to first-year students at a local business school. The human resources department of Honidew wants to publish a brochure to advertise the position. To attract a suitable pool of applicants, the brochure should give an indication of Honidew's high academic expectations. The human resources manager calculates the mean GPA of the previous 8 interns, to include in the brochure. The mean GPA of the former interns is:
a. 3.86 This is the correct answer. Simply sum up the GPAs and divide by 8, the number of values in the data set.
b. 3.91 This is not the correct answer. Be sure you are calculating the mean, and not the median.
c. 3.93 This is not the correct answer. If we excluded the lowest GPA, 3.35, as an outlier, this would be the correct answer, but we must include it because it is an actual value of a previous intern's GPA.
Interns' GPAs In 1997, J. B. Honidew's grandson's girlfriend was awarded the internship, even though her GPA was only 3.35. In the presence of outliers or a strongly skewed data set, the median is often a better measure of the 'center'. What's the median GPA in this data set?
a. 3.87 This is not the correct answer. 3.87 is one of the two central GPA data points, but the median is the average of the two central points.
b. 3.91 This is the correct answer. The median is the average of the two central GPA data points, 3.87 and 3.95.
c. 4.0 This is not the correct answer. As the most frequently occurring data point, 4.0 is the mode of the sample.
Interns' GPAs Exercise 4: Scuba Regulations Safety equipment typically needs to fall within very precise specifications. Such specifications apply, for example, to scuba equipment using a device called a "rebreather" to recycle oxygen from exhaled air. Recycled air must be enriched with the right amount of oxygen from the tank before delivery to the diver. With too little oxygen, the diver can become disoriented; too much, and the diver can experience oxygen poisoning. Minimizing the deviation of oxygen concentration levels from the specified level is clearly a matter of life and death! A scuba equipment-testing lab compared the oxygen concentrations of two different brands of rebreathers, A and B. Examine the data. Without doing any calculations, for which of the two rebreathers does the oxygen concentration appear to have a lower standard deviation?
a. A This is the correct answer. Much more of the data are clustered near the mean of the data set: 21.00%.
b. B This is not the correct answer. The data for model B are spread farther from its mean of 20.98% than the data for model A are spread from its mean, 21.00%. Notice that data set A's extreme values lie closer to the center, and more of its data points cluster near the middle of the set. Even without calculations, we have a good knack for seeing which set is more variable. We can back up our observations: using the standard deviation formula or the STDEV function in Excel, we can calculate that the standard deviation of A is 0.58%, whereas that of B is 1.05%.
Exercise 5: Fluctuations in Energy Prices After decades of government control, states across the US are deregulating energy markets. In a deregulated market, electricity prices tend to spike in times of high demand. This volatility is a concern. A primary benefit to consumers in a regulated market is that prices are fairly stable. To provide a baseline measure for the volatility of prices prior to deregulation, we want to compute the standard deviation of prices during the 1990s, when electricity prices were largely regulated. From 1990 to 2000, the average national price in July of 500 kWh of electricity ranged between $45.02 and $50.55. What is the standard deviation of these eleven prices?
a. $2.02 This is the correct answer. Either using Excel or calculating the formula by hand, the standard deviation is $2.02, fairly low compared to the mean price of $48.40.
b. $4.08 This is not the correct answer. You may have forgotten to take the square root of the variance. Try using Excel's STDEV formula to double-check your answer.
c. $6.38
This is not the correct answer. If you calculated the standard deviation by hand, did you forget to divide by n-1?
Electricity Prices Excel makes the job much easier, because all that's required is entering the data into cells and inputting the range of cells into the =STDEV() function. The result is $2.02. On the other hand, to calculate the standard deviation by hand, use the formula: First, calculate the mean, $48.40. Then, find the difference between each data point and the mean, and square it. Calculate the sum of these squared differences, 40.79. Divide by the number of points minus one (11 - 1 = 10 in this case) to obtain 4.08. Taking the square root of 4.08 gives us the standard deviation, $2.02.
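Collecting the hand calculation just described into a single expression, with mu = 48.40 the mean of the eleven prices:

\sigma = \sqrt{\frac{\sum_{i=1}^{11}(x_i - \mu)^2}{11 - 1}} = \sqrt{\frac{40.79}{10}} = \sqrt{4.08} \approx 2.02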
Exercise 6: Big Mart Personal Care Products Suppose you are a purchasing agent for a wholesale retailer, Big-Mart. Big-Mart offers several generic versions of household items, like deodorant, to consumers at a considerable discount. Every 18 months, Big-Mart requests bids from personal care companies to produce these generic products. After simply choosing the lowest individual bidder for years, Big-Mart has decided to introduce a vendor "score card" that measures multiple aspects of each vendor's performance. One of the criteria on the score card is the level of year-to-year fluctuation in the vendor's pricing. Compare the variability of prices from each supplier. Which company's prices vary the least from year to year in relation to their average price, as measured by the coefficient of variation?
a. Personal Care International This is not the correct answer. The coefficient of variation is 0.17, between the other two companies' values of 0.12 and 0.20. Take the ratio of the standard deviation to the mean to find the coefficient of variation.
b. Beautica This is the correct answer. The coefficient of variation is 0.12, lower than for both of the other companies.
c. BMKIP This is not the correct answer. BMKIP's coefficient of variation is 0.20, the largest coefficient of variation of the three. Take the ratio of the standard deviation to the mean to find the coefficient of variation.
Summary Pleased with your work, Alice decides to teach you more data description techniques, so you can take over a greater share of the project.
Relationships Between Variables So far, you have learned how to work with a single variable, but many managerial problems involve several factors that must be considered simultaneously.
Two Variables We use histograms to help us answer questions about one variable. How do we start to investigate patterns and trends with two variables? Let's look at two data sets: heights and weights of athletes. What can we say about the two data sets? Is there a relationship between the two? Our intuition tells us that height and weight should be related. How can we use the data to inform that intuition? How can we let the data tell their story about the strength and nature of that relationship? As always, one of our first steps is to try to visualize the data. Because we know that each height and weight belong to a specific athlete, we first pair the two variables, with one height-weight pair for each athlete.
Plotting these data pairs on axes of height and weight — one data point for each athlete in our data set — we can see a relationship between height and weight. This type of graph is called a "scatter diagram." Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful in recognizing patterns in a relationship. The more data points we have, the more apparent the relationship becomes. In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier. We need to be careful not to draw conclusions about causality when we see these types of relationships. Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about our weights. Assuming causality in the other direction would be just plain wrong. Although we may wish otherwise, growing heavier certainly doesn't make us taller! The direction and extent of causality might be easy to understand with the height and weight example, but in business situations, these issues can be quite subtle. Managers who use data to make decisions without firm understanding of the underlying situation often make blunders that in
hindsight can appear as ludicrous as assuming that gaining weight can make us taller. Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a scatter diagram, we plot for each day the number of massages purchased at a spa resort versus the total number of guests visiting the resort.
We can see a relationship between the number of guests and the number of massages. The more guests that stay at the resort, the more massages purchased — to a point, where massages level off. Why does the number of massages reach a plateau? We should investigate further. Perhaps there are limited numbers of massage rooms at the spa. Scatter plots can give us insights that prompt us to ask good questions, those that deepen our understanding of the underlying context from which the data are drawn.
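A scatter diagram like the ones described here takes only a few lines to sketch with a plotting library such as matplotlib. The guest and massage counts below are invented solely to show the mechanics, not taken from the resort's data.

import matplotlib.pyplot as plt

# Hypothetical daily observations: guests at the resort and massages purchased
guests   = [120, 150, 180, 200, 230, 260, 300, 340]
massages = [15, 19, 24, 27, 30, 32, 33, 33]

plt.scatter(guests, massages)              # one marker per (guests, massages) day
plt.title("Massages Purchased vs. Resort Guests")
plt.xlabel("Guests visiting the resort")
plt.ylabel("Massages purchased")
plt.show()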
Variable and Time Sometimes, we are not as interested in the relationship between two variables as we are in the behavior of a single variable over time. In such cases, we can consider time as our second variable. Suppose we are planning the purchase of a large amount of high-speed computer memory from an electronics distributor. Experience tells us these components have high price volatility. Should we make the purchase now? Or wait? Assuming we have price data collected over time, we can plot a scatter diagram for memory price, in the same way we plotted height and weight. Because time is one of the variables, we call this graph a time series. Time series are extremely useful because they put data points in temporal order and show how data change over time. Have prices been steadily declining or rising? Or have prices been erratic over time? Are there seasonal patterns, with prices in some months consistently higher than in others? Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on visual analysis when looking for relationships and patterns.
False Relationships Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But we must be careful: human intuition isn't foolproof and often we infer relationships where there are none. We must be careful to avoid some of these common pitfalls. Let's look at an example. For US presidents of the last 150 years, there seems to be a connection between being elected in a year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in office. Abraham Lincoln (elected in 1860) was the first victim of this unfortunate relationship.
James Garfield (elected 1880) was assassinated during his first year in office, and William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and John F. Kennedy (1960) all died in office as well.
Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data suggest about the president elected in 2020? Probably nothing. Unless we have a reasonable theory about the connection between the two variables, the relationship is no more than an interesting coincidence.
Hidden Variables Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the relationship. We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they are instead mutually related to another underlying factor. Suppose we're examining sales of ice-hockey pucks and baseballs at a sporting goods store. The sales of the two products form a relationship on a scatter plot: when puck sales slump, baseball sales jump. But are the two data sets actually related? If so, why? A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey. In spring and summer, people play baseball. If we had simply plotted puck and baseball sales without thinking further, we might not have considered the time of year at all. We could have neglected a critical variable driving the sales of both products. In many business contexts, hidden variables can complicate the investigation of a relationship between almost any two variables.
A final point: Keep in mind that scatter plots don't prove anything about causality. They never prove that one variable causes the other, but simply illustrate how the data behave. Summary Plotting two variables helps us see relationships between two data sets. But even when relationships exist, we still need to be skeptical: is the relationship plausible? An apparent relationship between two variables may simply be coincidental, or may stem from a relationship each variable has with a third, often hidden variable.
Creating Scatter Diagrams To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then use Excel's built-in chart tools to plot the data.
To prepare our data, we need to be sure that each data point in the first set is aligned with its corresponding value in the other set. The sets don't need to be contiguous, but it's easier if the data are aligned side by side in two columns. If the data sets are next to each other, simply select both sets.
Next, on the Insert tab of the Ribbon, select Scatter in the Charts group, and choose the first type: Scatter with Only Markers. Excel will insert a basic scatter plot into the worksheet, with the first column of data represented on the X-axis and the second column of data on the Y-axis. We can add a chart title and axis labels by selecting Quick Layout from the Ribbon, choosing Layout 1, and then selecting and editing the placeholder text.
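The same chart can be drawn outside Excel as well. Here is a minimal Python sketch using matplotlib; the height and weight values are hypothetical paired observations, one pair per athlete.

import matplotlib.pyplot as plt

heights = [68, 70, 72, 74, 75, 77, 78]           # hypothetical athletes' heights (inches)
weights = [150, 160, 175, 185, 195, 205, 215]    # corresponding weights (pounds), one per athlete

plt.scatter(heights, weights)                    # first list goes on the x-axis, second on the y-axis
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.title("Athletes' height vs. weight")
plt.show()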
Finally, our scatter diagram is complete. You can explore more of Excel's new Chart Tools to edit and design elements of your chart. Correlation By plotting two variables on a scatter plot, we can examine their relationship. But can we measure the strength of that relationship? Can we describe the relationship in a standardized way? Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship between two variables looks strong ... ... or weak ... ... linear ... ... or nonlinear ... ... positive (when one variable increases, the other tends to increase) ... ... or negative (when one variable increases, the other tends to decrease).
Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively, we notice when data points are close to an imaginary line running through a scatter plot. Logically, the closer the data points are to that line, the more confidently we can say there is a linear relationship between the two variables. However, it is useful to have a simple measure to quantify and communicate to others what we so readily perceive visually. The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship between two variables. To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1. Here's a strong positive correlation (about 0.85) ... ... and here's a strong negative correlation (about -0.90). If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1. At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but what happens in the middle? Even when the correlation coefficient is 0, a relationship might exist, just not a linear relationship. As we've seen, scatter plots can reveal patterns and help us better understand the business context the data describe. To reinforce our understanding of how our intuition about the strength of a linear relationship between variables translates into a correlation coefficient, let's revisit the examples we analyzed visually earlier.
Influence of Outliers In some cases, the correlation coefficient may not tell the whole story. Managers want to understand the attendance patterns of their employees. For example, do workers' absence rates vary by time of year? Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature rises. After pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466.
While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship — suggesting that the weather might indeed be the culprit. But look at the data — aside from a few outliers, there isn't a clear relationship. Seeing the scatter plot, the manager might realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some workers homebound the previous year. If we don't examine the data ourselves, the correlation coefficient can lead us down false paths. If we exclude the outliers, the relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather. Why do the outliers influence our measure of linearity so much? As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value of every data point. Just as it does with the mean, this inclusiveness can get us into trouble... Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.
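To see this numerically, here is a small Python sketch with made-up temperature and absence figures. It is not the manager's actual data; it just mimics the pattern described above, where three strike-day outliers create an apparent correlation.

from statistics import mean, pstdev

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the standard deviations
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

temps    = [55, 60, 62, 65, 68, 70, 72, 75, 88, 90, 91]   # hypothetical daily temperatures (F)
absences = [ 3,  2,  4,  3,  2,  4,  3,  2, 14, 15, 16]   # absences; the last three are strike days

print(round(pearson(temps, absences), 2))                 # clearly positive with the outliers included
print(round(pearson(temps[:-3], absences[:-3]), 2))       # close to zero once the outliers are removed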
Summary
The correlation coefficient characterizes the strength and direction of a linear relationship between two data sets. The value of the correlation coefficient ranges between -1 and +1.
Finding in Excel Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to our data on athletes' height and weight. Enter the data set into the spreadsheet as two paired columns. We must make sure that each data point in the first set is aligned with its corresponding value in the other set. To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the CORREL function as shown below. The order in which the two data sets are selected does not matter, as long as the data "pairs" are maintained. With height and weight, both values certainly need to refer to the same person!
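Outside Excel, the same calculation is available in Python. The sketch below uses the statistics.correlation function (available in Python 3.10 and later) on the same kind of paired height and weight data; the numbers are hypothetical.

from statistics import correlation   # requires Python 3.10 or later

heights = [68, 70, 72, 74, 75, 77, 78]           # hypothetical paired data
weights = [150, 160, 175, 185, 195, 205, 215]    # weight i belongs to height i

r = correlation(heights, weights)    # same idea as =CORREL(range1, range2)
print(round(r, 2))                   # the order of the two lists does not affect r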
Occupancy and Arrivals Alice is eager to move forward: "With your new understanding of scatter diagrams and correlation, you'll be able to help me with Leo's hotel occupancy problem." In the hotel industry, one of the most important management performance measures is room occupancy rate, the percentage of available rooms occupied by guests. Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving on the island each month. On a geographically isolated location like Hawaii, visitors almost all arrive by airplane or cruise ship, so state agencies can gather very precise data on arrivals. Alice asks you to investigate the relationship between room occupancy rates and the influx of visitors, as measured by the average number of visitors arriving to Kauai per day in a given month. She wants a graphical overview of this relationship, and a measure of its strength. Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy rates in Kauai, as tracked by the Hawaii Department of Business, Economic Development, and Tourism.
Kauai Data Source The best way to graphically represent the relationship between arrivals and occupancy is:
a. A histogram This is not the best answer. A histogram is used to gain insight into the behavior of a single variable. It represents the frequency at which certain ranges of values of the variable occur in a data set.
b. A scatter diagram This is the best answer. We use scatter diagrams to represent the relationship between two variables.
c. A time series This is not the best answer. We use time series to display the behavior of a variable over time.
d. A series of concentric burning wheels This is not the best answer. It is simply a more exciting way of saying "none of the above," which is also not the best answer.
Kauai Data Source You generate the scatter diagram using the data file and Excel's chart tools. The relationship can be characterized as:
a. Weakly negative and linear This is not the best answer. The relationship is positive. Higher levels of occupancy generally correspond to higher numbers of arrivals.
b. Strongly negative and non-linear This is not the best answer. The relationship is positive. Higher levels of occupancy generally correspond to higher numbers of arrivals.
c. Strongly positive and linear This is the best answer. The relationship is positive. Higher levels of occupancy generally correspond to higher numbers of arrivals. The trend appears to be reasonably linear.
d. Strongly positive and non-linear This is not the best answer. The trend appears to be generally linear.
Kauai Data
Source You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.
Kauai Data Source To find the correlation coefficient, open the Kahana Data file. In any empty cell, type =CORREL(B2:B37,C2:C37). When you hit enter, the correct answer, 0.71, will appear. Kauai Data Together with Alice, you compile your findings and present them to Leo.
Source I see. The relationship between the number of people arriving on Kauai and the island's hotel occupancy rate follows a general trend, but not a precise pattern. Look at this: in two months with nearly the same average number of daily arrivals, the occupancy rates were very different — 68% in one month and 82% in the other. But why should they be so different? When people arrive on the island, they have to sleep somewhere. Do more campers come to Kauai in one month, and more hotel patrons in the other? Well, that might be one explanation. There could be differences in the type of tourists arriving. The vacation preferences of the arrivals would be what we call a hidden variable. Another hidden variable might be the average length of stay. If the length of stay varies month to month, then so will hotel occupancy. When 50 arrivals check into a hotel, the occupancy rate will be higher if they spend 10 days each at the hotel than if they spend only 3 days. I'm following you, but I'm beginning to see that the occupancy issue is more complex than I expected. Let's get back to it at a later time. The scuba school contract is more pressing at the moment.
Exercise 1: The Effectiveness of Search Engines As online retailing expands, many companies are interested in knowing how effective search engines are in helping consumers find goods online. Computer scientists study the effectiveness of such search engines and compare how many results search engines recall and the precision with which they recall them. "Precision" is another way of saying that the search found its target, for example a page containing both the phrases "winter parka" and "Eddie Bauer." What could you say about the relationship between the Precision and the number of Results Recalled?
a. The amount of information a search engine recalls decreases over time. This is not the best answer. Time isn't graphed on the scatter plot, and we do not know how it might be involved in a relationship between these two variables.
b. An increase in precision causes the amount retrieved to decrease. This is not the best answer. Although we do observe higher values of precision with lower values of recall, and vice versa, we have no idea if one causes the other. With a scatter diagram, we can never make claims about causality!
c. Recall and precision seem to be related: a large number of results typically pairs with low precision. This is the best answer. From the scatter plot, we can see that the variables demonstrate a relationship, but maybe not a linear one. However, even when we recognize a clear relationship, we cannot conclude that greater precision causes the amount of information recalled to decrease.
Source Exercise 2: Education and Income Is an education a good investment in your future? Some very successful business executives are college dropouts, but is there a relationship in the general population between income and education level? Consider the following scatter plot, which lists the income and years of formal education for 18 people. Is the correlation:
a. Strongly positive This is the best answer. The level of income is strongly associated with the number of years of education for our data.
b. Weakly positive This is not the best answer. The correlation between income and level of education is fairly pronounced. Weak correlations scatter widely around the imaginary line we can trace through the data.
c. Weakly negative This is not the best answer. In general, as education increases, incomes do as well. In a negative correlation, as education increases, income would decrease.
Source Though we should always calculate the correlation coefficient if we want to have a precise measure, it's good to have a rough feel for the correlation between two variables we see plotted on a scatter diagram. For the income-education data, the coefficient is nearest to:
a. 0.1 This is not the best answer. A correlation coefficient of 0.1 indicates data with a weak linear relationship, but for our data, the relationship is fairly strong.
b. -0.5 This is not the best answer. At -0.5, the correlation coefficient indicates a negative linear relationship. Education and income tend to increase at the same time, which occurs with a positive linear correlation.
c. 0.9 This is the best answer. A fairly strong linear relationship has a correlation coefficient closer to 1.0, making 0.9 a reasonable guess for what we see occurring between income and education level.
Sampling & Estimation Introduction: The Scuba Problem Leo asks you to help him evaluate the Kahana's contract with the scuba school. Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from their business activities. We have an excellent coral reef, and scuba diving is becoming very popular among vacationers and business travelers. We started our year-round diving program last year, contracting a local diving school to do a scuba certification course. The one-year trial contract is now up for renewal. Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a week, and we subsidize the costs associated with each course. So I want to get a good handle on how satisfied the guests are with the lessons before I decide whether or not to renew the contract. The hotel has a database with information about which guests took scuba lessons and when. Feel free to take a look at it, but I can't spend a fortune figuring this out. And I need to know as soon as possible, since our contract expires at the end of the month.
Alice convinces you to do some field research and join her for a scuba diving lesson. You return late that afternoon exhausted but exhilarated. Alice is especially enthusiastic. "Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet! "But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's clientele as a whole enjoyed the scuba certification course. After all, we may have caught the instructor on his best day this year." Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school. Generating Random Samples Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You have to survey a few guests, and from their opinions draw conclusions about hotel guests in general. The guests you choose to survey must be representative of all of the guests who have taken the scuba course at the resort. But how can you be sure you get a good sample?
How to Create a Representative and Unbiased Sample As managers, we often need to know something about a large group of people or products. For example, how many defective parts does a large plant produce each year? What are the average annual earnings of a Wall Street investment banker? How many people in our industry plan to attend the annual
conference? When it is too costly to gather the information we want to know about every person or every thing in an entire group, we often ask the question of a subset, or sample of the group. We then try to use that information to draw conclusions about the whole group. To take a sample, we first select elements from the entire group, or "population," at random. We then analyze that sample and try to infer something about the total population we're interested in. For example, we could select a sample of people in our industry, ask them if they plan to attend the annual conference, and then infer from their answers how many people in the entire industry plan to attend. For example, if 10% of the people in our sample say they will attend, we might feel quite confident saying that between 7% and 13% of our entire population will attend. This is the general structure of all the problems we'll address in this unit — we'll work out the details as we go forward. We want to know something about a population large enough to make examining every population member impractical. We first select elements from the population at random... ...then analyze that sample... ...and then draw an inference about the total population we're interested in. Taking a Random Sample The first trick to sampling is to make sure we select a sample that broadly represents the entire group we're interested in. For example, we couldn't just ask the conference organizers if they wanted to attend. They would not be representative of the whole group — they would be biased in favor of attending the conference! To get a good sample, we must make sure we select the sample "at random" from the full population. This means that every person or thing in the population is equally likely to be selected. If there are 15,000 people in the industry, and we are choosing a sample of 1,000, then every person needs to have the same chance — 1 out of 15 — of being selected. Selecting a random sample sounds easy, but actually doing it can be quite challenging. In this section, we'll see examples of some major mistakes people have made while trying to select a random sample, and provide some advice about how to avoid the most common types of sampling errors. In some cases, selecting a random sample can be fairly easy. If we have a complete list of each member of the group in a database, we can just assign a unique number to each member of the group. We then let a computer draw random numbers from the list. This would ensure that each element of the population has an equal likelihood of being selected. If the population about which we need to obtain information is not listed in an easy-to-access database, the task of selecting a sample at random becomes more difficult. In these cases, we have to be extremely careful not to introduce a bias in the way we select the sample. For example, if we want to know something about the opinions of an entire company, we cannot just pick employees from one department. We have to make sure that each employee has an equal chance of being included in the sample. A department as a whole might be biased in favor of one opinion.
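As a concrete illustration of drawing names at random from a database, here is a minimal Python sketch using the industry example above. The population is just a list of ID numbers, and random.sample picks members without replacement, so each person has the same 1-in-15 chance of being chosen.

import random

population = list(range(1, 15001))          # one ID for each of the 15,000 people in the industry
sample = random.sample(population, 1000)    # draw 1,000 IDs at random, without replacement

print(len(sample))                          # 1000
print(sample[:5])                           # the first few randomly chosen IDs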
Sample Size Learning about a Sample Once we select our sample, we need to make sure we obtain accurate information about each member of the sample. For example, if we want to learn about the number of defects a plant produces, we must carefully measure each item in the sample. When we want to learn something about a group of people and don't have any existing data, we often use a survey to learn about an issue of interest. Conducting a survey raises problems that can be surprisingly tricky to resolve. First, how do we phrase our questions? Is there a bias in any questions that might lead participants to answer them in a certain way? Are any questions worded ambiguously? If some of the people in the sample interpret a question one way, and others interpret it differently, our results will be meaningless! Second, how do we best conduct the survey? Should we send the survey in the mail, or conduct it over the phone? Should we interview survey participants in person, or distribute handouts at a meeting?
There are advantages and disadvantages to all methods. A survey sent through the mail may be relatively inexpensive, but might have a very low response rate. This is a major problem if those who respond have a different opinion than those who don't respond. After all, the sample is meant to learn
about the entire population, not just those with strong opinions! Conducting a telephone survey raises other issues: When do we call people? Who is home during regular business hours? Most likely not working professionals. On the other hand, if we call household numbers in the evening, the "happy hour crowd" might not be available. When we decide to conduct a survey in person, we have to consider whether the presence of the person asking the questions might influence the survey results. Are the survey participants likely to conceal certain information out of embarrassment? Are they likely to exaggerate? Clearly, every survey will have different issues that we need to confront before going into the field to collect the data.
Response Rates With any type of survey, we must pay close attention to the response rate. We have to be sure that those who respond to the survey answer questions in much the same way as those who don't respond would answer them. Otherwise, we will have a biased view of what the whole population thinks. Surveys with low response rates are particularly susceptible to bias. If we get a low response rate, we must try to follow up with the people who did not respond the first time. We either need to increase the response rate by getting answers from those who originally did not respond, or we must demonstrate that the non-respondents' opinions do not differ from those of the respondents on the issue of interest. Tracking down everyone in a sample and getting their response can be costly and time consuming. When our resources are limited, it is often better to take a small sample and relentlessly pursue a high response rate than to take a larger sample and settle for a low response rate.
Summary Often it makes sense to infer facts about a large population from a smaller sample. To make sound inferences, we must select the sample at random so that it represents the whole population, gather accurate and unbiased information about each member of the sample, and pursue a high response rate so that respondents' answers reflect those of the population as a whole.
Classic Sampling Mistakes To understand the importance of representative samples, let's go back in history and look at some mistakes made in the Literary Digest poll of 1936. The Literary Digest, a popular magazine in the 1930s, had correctly predicted the outcome of U.S. presidential elections from 1916 to 1932. When the results of the 1936 poll were announced, the public paid attention. Who would become the next president? Newscaster: "Once again, the Literary Digest sent out a survey to the American public, asking, "Whom will you vote for in this year's presidential election?" This may well be the largest poll in American history." Newscaster: "The Digest sent the survey to over 10 million Americans and over two million responded!" Newscaster: "And the survey results predict: Alf Landon will beat Franklin D. Roosevelt by a large margin and become President of the United States." As it turned out, Alf Landon did not become President of the United States. Instead, Franklin D. Roosevelt was re-elected to a second term in office in the largest landslide victory recorded to that date. This was a devastating blow to the Digest's reputation. What went wrong? How could such a large survey be so far off the mark? The Literary Digest made two mistakes that led it to predict the wrong election outcome. First, it mailed the survey to people on three different lists: the magazine's subscribers, car owners, and people listed in telephone directories. What was wrong with choosing a sample from these lists? The sample was not representative of the American public. Most lower-income people did not subscribe to the Digest and did not own phones or cars back in 1936. This led the poll to be biased towards higher-income households and greatly distorted the poll's results. Lower-income households were more likely to vote for the Democrat, Roosevelt, but they were largely excluded from the poll. Second, the magazine relied on people to voluntarily send their responses back to the magazine. Out of the ten million voters who were sent a poll, over two million responded. Two million is a huge number of people. What was wrong with this survey? The mistake was simple: Republicans, who wanted political change, felt more strongly about the election than Democrats. Democrats, who were generally happy with Roosevelt's policies, were less interested in returning the survey. Among those who received the survey, a disproportionate number of Republicans responded, and the results became even more biased.
The Digest had put an unprecedented effort into the poll and had staked its reputation on predicting the outcome of the election. Its reputation wounded, the Digest went out of business soon thereafter. During the same election year, a little-known psychologist named George Gallup correctly predicted what the Digest missed: Roosevelt's victory. What did Gallup do that the Literary Digest did not? Did he create an even bigger sample? Surprisingly, George Gallup used a much smaller sample. He knew that large samples were no guarantee of accurate results if they weren't randomly selected from the population. Gallup's team interviewed only 3,000 people, but made sure that the people they selected were truly representative of the US population. He also instructed his team to be persistent in asking the opinion of each person in the sample, which generated a high response rate. Gallup's correct prediction of the 1936 election winner boosted his reputation and Gallup's method of polling soon became a standard for public opinion polls. Today's polls usually consist of a sample of around a thousand randomly selected people who are truly representative of the underlying populations. For example, look at a poll reported in a leading newspaper: the sample size will likely be around a thousand. Another common survey mistake is phrasing the questions in a way that leads to a biased response. Let's take a look at a recent example of a biased question. In 1992, Ross Perot, an independent contender in the US presidential election, conducted a mail-in survey to show that the public supported his desire to abolish special interest groups. This is the question he asked: Source In Perot's mail-in survey, 99 percent of respondents said "yes" to that question. It seemed as if everyone in America agreed with Perot's stance. Source Soon after Perot's survey, Yankelovich Partners, an independent market research firm, conducted two interesting follow-up surveys. In the first survey, it used the same question that Perot asked and found that 80 percent of the population favored passing the law. Yankelovich attributed the difference to the fact that it was able to create a more representative sample than Perot. Source Interestingly, Yankelovich then conducted a similar survey, but rephrased the question in the following way: Source The response to this question was strikingly different. Only 40 percent of the sampled population agreed to prohibit contributions. As it turned out, the results of the survey all came down to the way the question was phrased. Source For any survey we conduct, it's critical to phrase the question in the most neutral way possible to avoid bias in the sample results. Source The real lesson of these two examples is this: How data are collected is at least as important as how data are analyzed. A sample that is unrepresentative, biased, or not drawn at random can give highly misleading results. How sample data are collected is at least as important as how they are analyzed. Knowing that sample data need to be representative and unbiased, you conduct a survey of the hotel guests.
Solving the Scuba Problem (Part I) How can you best determine if hotel guests are enjoying the scuba course? By searching the hotel database, you determine that 2,804 hotel guests took scuba trips in the past year. The scuba certification course was offered year-round. The database includes each guest's name, address, phone number, age, date of arrival, length of stay, and room number. Your first step is deciding what type of survey to conduct that will be inexpensive, quick, and will provide a good sample of all the guests who took scuba lessons. Should you mail a survey to the whole list of guests who took scuba lessons, expecting that a small percentage will respond, or conduct a telephone survey, which would likely provide a higher response rate, but cost more per guest contacted? To ensure a good response rate — and because Leo wants an answer quickly — you choose to contact customers by phone. Alice warns that to keep costs low, you can only contact 50 hotel guests, and reminds you to create a random, representative sample. You open up the list of names in the hotel database. The names were entered as guests arrived. To make things simple, you randomly select a date and then record the first 50 guests arriving after that date who took the course. You ask the hotel operator to call them for you, and tell him to be persistent. Eventually he is able to contact 45 of the guests on the list. He asks the guests to rate their scuba experience on a 1 to 6 scale and reports the results back to you. Click the link below to view your sample. Enter the average satisfaction level as a decimal number with one digit to the right of the decimal point (e.g., enter "5" as "5.0"). Round if necessary. Hotel Database You compute the average satisfaction level and find that it is 2.5. You give Leo the news. He explodes. Two point five! That's impossible! I know for sure that it must be higher than that! You'd better go over your data again. Back in your room, you look over your list of data. What should you tell Leo? a. You should have mailed out your survey. Perhaps you would have received a different result, but the fact that the survey was conducted via phone is not the main problem with your survey.
b. Your survey is not representative of the guests who took the scuba course. Your observation is correct. Although mailing out the survey might have changed your result, that was not the main problem with your survey.
c. Your survey is unbiased and representative, and Leo should accept the survey results as true. Don't talk to Leo yet! There is a problem with your survey.
What factor is biasing your results? a. By bothering people at home, you got negative responses. Although this may be the case, this is not the main problem with your survey.
b. The income levels of the customers you phoned were not representative of the scuba-diving guests. The hotel database does not record income levels of guests and there is no reason to think that the sample you selected was biased in regards to income level.
c. The dates that the surveyed customers visited the resort were not representative of the scuba-diving guests. Correct! Since you chose guests only from the month of April, any unusual event that happened in that period could bias your results. In addition, your sample would be biased if more of a certain type of guest (for example business travelers versus tourists) visited during April than during the rest of the year.
When you report this news to Leo, he begins to laugh. We were hit with a hurricane at the beginning of April. Half the scuba classes were cancelled, and the ones that did meet had to deal with choppy water and bad visibility. Even the weeks following the hurricane were bad. Usually guests see a manta ray every week, and the guests in April could barely see the underwater coral. No wonder they weren't happy. You assure Leo you will conduct the survey again with a more representative sample. This time, you make sure that the guests are truly randomly selected. Later, you have new data in your hands from 45 randomly chosen guests that show the average satisfaction rate to be 4.4 on a 1 to 6 scale. The standard deviation of the sample is 1.54. Exercise 1: The Bell Computer Problem Mr. Gavin Collins is the Chief Operating Officer of Bell Computers, a market leader in personal computers. This morning, he opened the latest issue of Business 4.0, a business journal, and noticed an article on Bell Computers. The article praised the high quality and low cost of the PCs made by Bell. However, it also included some negative comments about Bell's customer service. Currently, customer service is only available to customers of Bell Computers over the phone. Collins wants to understand more fully what customers think of Bell's customer service. His marketing department designs a survey that asks customers to rate Bell's customer service from 1 to 10. How should he conduct the survey?
a. Bell Computers should mail a survey to every customer in Bell's database asking them to write Bell about their experiences with the customer service department. This is not the best answer. This survey has a hidden bias. The customers who are irritated or frustrated with customer service offered by Bell Computers are more likely to respond than others.
b. Bell's sales peak during the holidays, when people give gifts, including computers. Bell should send a mail survey along with each of its outbound computer shipments in December. This is not the best answer. Because sales volume is high during the holiday season, the customer experience might be different than during other times of the year.
c. Bell is located in the Southern United States. 55% of Bell's customers are also located in the South. Bell should conduct a phone survey in one of the major Southern cities. This is not the best answer. If the survey focuses on the Southern United States, it will be biased towards Southern customers. Bell needs a sample that is representative of all of its customers.
d. Every month, on a random day and time, Bell should conduct a phone survey immediately after a Customer Service Representative has spoken to a customer. New answers should be added to a rolling average. This is the best answer. Conducting a phone survey immediately after a randomly chosen customer service session will create a random sample that is representative of all of Bell's customers.
Exercise 2: The Wave Problem "Wave" is a company that manufactures laundry detergent in several countries around the world. In India, the competition among laundry detergents is fierce. The sales per month of Wave have been constant for the past five years. Wave CEO Mr. Sharma instructed his marketing team to come up with a strong advertising campaign stressing Wave's superiority over other competitors. Wave conducted a survey in the month of June. They asked the following questions: "Have you heard of Wave?" "Do you think Wave is a good product?" "Do you notice a difference in the color of your clothes after using Wave?" Then, citing the results of their survey, Wave aired a major television campaign claiming that 75% of the population
thought that Wave was a good product. You are a new associate at Madison Consulting. With your partner, Ms. Mehta, you have been asked to conduct a study for Wave's main competitor, the Coral Reef Detergent Company, about whether Wave's claims hold water. Coral Reef wonders how the Wave results are possible, considering that Coral Reef holds over 45% of the current market share. Ms. Mehta has been going through the survey methodology, and she tells you, "This sample is obviously not representative and unbiased. Coral Reef can dispute Wave's claim!" What has Ms. Mehta noticed?
a. The sample was taken in the month of June and not over a whole year, so the sample is biased. This is not the best answer. We know that the sales per month of Wave have been constant over the past five years. So it is reasonable to assume that the month of the year is not a factor in laundry detergent sales.
b. The interviewers asked biased questions. This is the best answer. The interviewers should have asked neutral questions like, "Which detergent do you use?", "Which is the best detergent, in your opinion?", "Which detergents do you think are good products?" The questions asked by the interviewers had a bias towards Wave.
c. Ms. Mehta is mistaken. There is nothing wrong with the study. This is not the best answer. The study is flawed in one of the ways described in the other answer choices.
d. Wave should have given a range for the percent of people who think Wave is a good product, not just a number. This is not the best answer. There is nothing wrong with stating the most likely value as an estimate (called a "point estimate"). A range of values (called a "confidence interval") can, indeed, be stated, and provides more information about the accuracy of the estimate, but it is not wrong to make a point estimate.
Challenge: The Airport You have been asked to conduct a survey to determine the percentage of flights arriving at a small airport that were filled to capacity that morning. You decide to stand outside the airport's single exit door and ask a sample of 60 passengers leaving the airport how full their flight was. Your first thought is to just ask the first 60 passengers departing the airport how full their flight was, but you quickly realize that that could be a highly biased sample. Any 60 people leaving at the same time would likely have come from only a couple of flights, and you want to get a good sense of what percent of all flights arriving that morning were filled to capacity. Thus, you decide to randomly select 60 people from all the passengers departing the building that morning. After conducting your survey, you tally the results: 10 people decline to answer, 30 people tell you that their flight was filled to capacity, and 20 people tell you that their flight was not filled to capacity. What can you conclude from your survey results so far?
a. The best estimate is that 60% of the flights were filled to capacity. This is not the best answer. There is a problem with your survey approach.
b. The best estimate is that 50% of the flights were filled to capacity. This is not the best answer. There is a problem with the survey approach. However, this answer would be incorrect even if the survey approach had been valid: you should count only actual responses in your calculation of the percentage of full flights (30 out of 50 = 60%, instead of 30 out of 60 = 50%).
c. There is a problem with the survey approach. This is the correct answer. There is a problem with your survey. What is the problem with your survey?
a. A sample of 60 passengers is not large enough to provide a good estimate. This is not the correct answer. A sample size of 60 is not large, but the beauty of sampling is that you can use small samples to make fairly good estimates about large populations. There is a systematic bias in your sample that you have not identified yet.
b. Only those passengers that feel most strongly about the issue are likely to respond. This is not the correct answer. With 50 out of 60 people responding, you have obtained a response rate of 83%. You have to ask whether the people that responded might give different answers than those that did not respond.
In this case, passengers who did not respond were most likely in a hurry, which should not be a cause for a systematic bias about how full their planes were. There is a systematic bias in your sample, but it is due to a different problem.
c. Passengers from full planes are likely to be selected more frequently than passengers from relatively empty planes. This is the correct answer. There is a systematic bias in your sample: When you sample passengers at the exit door of an airport, you will, on average, select more people from full planes, simply because when a plane is full, there are more passengers on it - and hence more leaving the airport - than when a plane is relatively empty.
To see this, imagine that 10 planes have arrived that morning — five of which were full (having 100 passengers each) and five of which had only a single passenger on the plane. In this case, half of the planes were full. However, almost all of the passengers (500 of the total 505) departing from the airport would report (correctly!) that they had been on a full plane. Since people from a full plane are more likely to be selected, there is a systematic bias in your response. It is important, in every survey, to try to make your sample as representative as possible. In this case, your sample was not representative of the planes arriving to the airport. A better approach might be to ask the people you select what their flight number was, and then ask them how full their flight was. Make sure you have at least one passenger from every plane. Then count the responses of only one person from each flight. By including only one person per flight in your sample, you ensure that your sample is an accurate prediction of how many planes are filled to capacity. Sampling is complicated, and it is important to think through all the factors that might influence your results. In this case, the mistake is that you are trying to estimate a population of planes by sampling a population of passengers. This makes the sample unrepresentative of the underlying population. By randomly sampling the passengers rather than the flights, each flight is not equally likely to be selected, and the sample is biased. The Population Mean You report the results of your survey, the sample mean, and its standard deviation to Leo.
The Scuba Problem II A sample mean of 4.4 makes more sense to me, but I'm still a bit uneasy about your survey result. After all, you've only collected 45 responses.
If you'd chosen different people, they likely would have given different responses. What if — just by chance — these 45 people loved the scuba course, and no one else did? You have a good point there, Leo. Our intuition is that the average satisfaction rate for all guests isn't too far from 4.4, but at this point we're not sure exactly how far away it might be. Without more calculations, all we can say is that 4.4 is the best estimate we have. That is why... Wait a minute! This is very unsatisfying. Are you telling me that there's no way to gauge the accuracy of this survey result? If the results are a little off, that's not a problem. But you have to tell me how far off they might be. What if you're off by two whole points, and the true satisfaction of my hotel guests is 2.4, not 4.4? In that case, my decision would be completely different.
I need to know how accurately this sample reflects the opinions of all the hotel guests who went scuba diving! The sample mean is the best point estimate of the population mean, but it cannot tell you how accurately the sample reflects the population. Alice suggests giving Leo a range of values that is almost certain to contain the population mean. "We may not be able to pin down mean satisfaction precisely. But confining it to a range of likely values will provide Leo with enough information to make a sound business decision." That sounds like a good idea, but you wonder how to actually do it. Using Confidence Intervals The sample mean is the best estimate of our population mean. However, it is only a point estimate. It
does not give us a sense of how accurately the sample mean estimates the population mean. Think about it. If we know only the sample mean, what can we really say about the population mean? In the case of our scuba school, what can we say about the average satisfaction rate of all scuba-diving hotel guests? Could it be 4.3? 4.0? 4.7? 2.0?
To make decisions as a manager, we need to have more than just a good point estimate. We need to have a sense of how close or far away the true population mean might be from our estimate. We can indicate the most likely values of the true population mean by creating a range, or interval, around the sample mean. If we construct it correctly, this range will very likely contain the true population mean.
For example, by constructing a range, we might be able to tell Leo that we are very confident that the true average customer satisfaction for all scuba guests falls between 4.2 and 4.6. Knowing that the true average is almost certainly between 4.2 and 4.6, Leo is better equipped to make a decision than if he simply knew the estimated average of 4.4. Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the mean xbar, the standard deviation s, and the sample size n. We also need to know how "confident" we'd like to be that the range contains the true mean of the population. For any level of "confidence", there is a value we'll call z to put into the formula. We'll learn later in this unit exactly what we mean by "confidence," and how to compute z. For now, just keep in mind that for higher levels of confidence, we'll need to put in a larger value of z. Using these numbers, we can create a range around the sample mean according to the following formula: Before we actually use the formula, let's try to develop our intuition about the range we're creating. Where should the range be centered? How wide must the range be to make us confident that it contains the true population mean? What factors would lead us to need a wider or narrower range? Let's see how the statistics of the sample influence the location and width of the range. Let's start with the sample mean. The sample mean is our best estimate of the population mean. This suggests that the sample mean should always be the center of the range. Move the slider bar to see how the sample mean affects the range. Second, the width of the range depends on the standard deviation of the sample. When the sample standard deviation is large, we have greater uncertainty about the accuracy of the sample mean as an estimate of the population mean. Thus, we have to create a wider range to be confident that it includes the true population mean. On the other hand, if the sample standard deviation is small, we feel more confident that our sample mean is an accurate predictor of the true population mean. In this case, we can draw a more narrow range. The larger the standard deviation, the wider the range must be. Move the slider bar to see how the sample standard deviation affects the range. Third, the width of the range depends on the sample size. With a very small sample, it's quite possible that one or two atypical points in the sample could throw the sample mean off considerably from the true population mean. So with a small sample, we need to create a wide range to feel comfortable that the true mean is likely to be inside it. The larger the sample, the more certain we can be that the sample mean represents the population mean. With a large sample, even if our sample includes a few atypical points, there are likely to be many more typical points in the sample to compensate for the outliers. Thus, with a large sample, we can feel comfortable with a small range. Move the slider bar to see how the sample size influences the range. Finally, the width of the range depends on our desired level of confidence. The level of confidence states how certain we want to be that the range contains the mean of the population. The more confident we want to be that the range contains the true population mean, the wider we have to make the range. If our desired level of confidence is fairly low, we can draw a more narrow range. 
In the language of statistics, we indicate our level of confidence by saying, for example, that we are "95% confident" that the range contains the true population mean. This means there is a 95% chance that the range contains the true population mean. Move the slider bar to see how the confidence level affects the range. These variables determine the size of the range that we want to construct. We will learn exactly how to construct this range in a later section. For now, all we have to understand is that the population mean can best be estimated by a range of values and that the range depends on three sample statistics as well as the level of confidence that we want to assign to the range.
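The exact recipe for the range is developed later in the tutorial. As a preview, the sketch below assumes the standard large-sample form of the range, the sample mean plus or minus z times s divided by the square root of n, and plugs in the scuba survey statistics (mean 4.4, standard deviation 1.54, 45 responses) with a z-value of about 1.96 for 95% confidence.

import math

x_bar, s, n = 4.4, 1.54, 45                  # sample mean, sample standard deviation, sample size
z = 1.96                                     # z-value for 95% confidence (explained later in this unit)

half_width = z * s / math.sqrt(n)            # how far the range extends on either side of the mean
low, high = x_bar - half_width, x_bar + half_width
print(round(low, 2), round(high, 2))         # roughly 3.95 to 4.85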
Summary The sample mean is our best initial estimate of the population mean. To indicate how accurate this estimate is, we construct a range around the sample mean that likely contains the population mean. The width of the range is determined by the sample size, sample standard deviation, and the level of confidence. The confidence level measures how certain we are that the range we construct contains the true population mean.
The Normal Distribution Alice recommends taking a step back from sampling and learning about the normal distribution. The normal distribution helps us create a range around a sample mean that is likely to contain the true population mean. You can use the normal distribution to turn the intuitive notion of "confidence in your estimate" into a precisely defined concept. Understanding the normal distribution will also give you deeper insight into how sampling works. The normal distribution is a probability distribution that is centered at the mean. It is shaped like a bell, and is sometimes called
the "bell curve."
Like any probability distribution, the normal distribution is shown on two axes: the x-axis for the variable we're studying — women's heights, for example — and the y-axis for the likelihood that different values of the variable will occur. For example, few women are very short and few are very tall. Most are in the middle somewhere, with fairly average heights. Since women of average height are so much more common, the distribution of women's heights is much higher in the center near the average, which is about 63.5 inches. As it turns out, for a probability distribution like the normal distribution, the percent of all values falling into a specific range is equal to the area under the curve over that range. For example, the percentage of all women who are between 61 and 66 inches tall is equal to the area under the curve over that range. The percentage of all women taller than 66 inches is equal to the area under the curve to the right of 66 inches. Like any probability distribution, the total area under the curve is equal to 1, or 100%, because the height of every woman is represented in the curve. Over the years, statisticians have discovered that many populations have the properties of the normal distribution. For example, IQ test scores follow a normal distribution. The weights of pennies produced by U.S. mints have been shown to follow a normal distribution. But what is so special about this curve? First, the normal distribution's mean and median are equal. They are located exactly at the center of the distribution. Hence, the probability that a normal distribution will have a value less than the mean is 50%, and that the probability it will have a value greater than the mean is 50%. Second, the normal distribution has a unique symmetrical shape around this mean. How wide or narrow the curve is depends solely on the distribution's standard deviation. In fact, the location and width of any normal curve are completely determined by two variables: the mean and the standard deviation of the distribution. Large standard deviations make the curve very flat. Small standard deviations produce tight, tall curves with most of the values very close to the mean. How is this information useful? Regardless of how wide or narrow the curve, it always retains its bell-shaped form. Because of this unique shape, we can create a few useful "rules of thumb" for the normal distribution. For a normal distribution, about 68% (roughly two-thirds) of the probability is contained in the range reaching one standard deviation away from the mean on either side. It's easiest to see this with a standard normal curve, which has a mean of zero and a standard deviation of one. If we go two standard deviations away from the mean for a standard normal curve we'll cover about 95% of the probability.
The amazing thing about normal distributions is that these rules of thumb hold for any normal distribution, no matter what its mean or standard deviation. For example, about two thirds of all women have heights within one standard deviation, 2.5 inches, of the average height, which is 63.5 inches. 95% of women have heights within two standard deviations (or 5 inches) of the average height. To see how these rules of thumb translate into specific women's heights, we can label the x-axis twice to show which values correspond to being one standard deviation above or below the mean, which values correspond to being two standard deviations above or below the mean, and so on. Essentially, by labeling the x-axis twice we are translating the normal curve into a standard normal curve, which is easier to work with.
For women's height, the mean is 63.5 and the standard deviation is 2.5. So, one standard deviation above the mean is 63.5 + 2.5, and one standard deviation below the mean is 63.5 - 2.5. Thus, we can see that about 68% of all women have heights between 61 and 66 inches, since we know that about 68% of the probability is between -1 and +1 on a standard normal curve. Similarly, we can read the heights corresponding to two standard deviations above and below the mean to see that about 95% of all women have heights between 58.5 and 68.5 inches.
The z-statistic The unique shape of the normal curve allows us to translate any normal distribution into a standard normal curve, as we did with women's heights simply by re-labeling the x-axis. To do this more formally, we use something called the z-statistic. For a normal distribution, we usually refer to the number of standard deviations we must move away from the mean to cover a particular probability as "z", or the "z-value." For any value of z, there is a specific probability of being within z standard deviations of the mean. For example, for a z-value of 1, the probability of being within z standard deviations of the mean is about 68%, the probability of being between -1 and +1 on a standard normal curve. A good way to think about what the z-statistic can do is this analogy: if a giant tells you his house is four steps to the north, and you want to know how many steps it will take you to get there, what else do you need to know? You would need to know how much bigger his stride is than yours. Four steps could be a really long way. The same is true of a standard deviation. To know how far you must go from the mean to cover a certain area under the curve, you have to know the standard deviation of the distribution.
Using the z-statistic, we can then "standardize" the distribution, making it into a standard normal distribution with a mean of 0 and a standard deviation of 1. We are translating the real value in its
original units — inches in our example — into a z-value. The z-statistic translates any value into its corresponding z-value simply by subtracting the mean and dividing by the standard deviation. Thus, for the women's height of 66 inches, the z-value, z = (66-63.5)/2.5, equals 1. Therefore, 66 is exactly one standard deviation above the mean. Essentially, the z-statistic allows us to measure the distance from the mean in terms of standard deviations instead of real values. It gives everyone the same size feet in statistics. We can extend the rules of thumb we've developed beyond the two cases we've looked at. For example, we may want to know the likelihood of being within 1.5 standard deviations from the mean, or within three standard deviations from the mean.
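The arithmetic is simple enough to check directly. The short Python sketch below standardizes the 66-inch height from the example above.

mean_height = 63.5        # mean of women's heights (inches)
sd_height = 2.5           # standard deviation of women's heights (inches)

z = (66 - mean_height) / sd_height   # subtract the mean, divide by the standard deviation
print(z)                             # 1.0, so 66 inches is one standard deviation above the mean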
Select different values of z — that is, select different numbers of standard deviations from the mean — and see how the probability changes. Be sure to try z values of 1 and 2 to verify that our rules of thumb are on target! Sometimes we may want to go in the other direction, starting with the probability and figuring out how many standard deviations are necessary on either side of the mean to capture that probability. For example, suppose we want to know how many standard deviations we need to be from the mean to capture 95% of the probability.
Our second rule of thumb tells us that when we move two standard deviations from the mean, we capture about 95% of the probability. More precisely, to capture exactly 95% of the probability, we must be within 1.96 standard deviations of the mean. This means that for a normal distribution, there is a 95% probability of falling between -1.96 and 1.96 standard deviations from the mean. Select different probabilities and see how many standard deviations we have to move away from the mean to cover that probability. We can create a table that shows which values of z correspond to each probability or we can calculate z using a simple function in Microsoft Excel. We'll explain how to use both of these approaches in the next few clips.
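If you want to check these z-values yourself, Python's statistics.NormalDist offers an inverse cumulative function. The sketch below looks up the z-value that leaves 2.5% in each tail, which is where the 1.96 figure comes from.

from statistics import NormalDist

std_normal = NormalDist()                     # standard normal: mean 0, standard deviation 1

# For 95% in the middle, 2.5% remains in each tail, so we want the z with cumulative probability 0.975.
print(round(std_normal.inv_cdf(0.975), 2))    # 1.96
print(round(std_normal.inv_cdf(0.84), 2))     # about 1, matching the 68% rule of thumb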
z-table Remember, the probabilities and the rules of thumb we've described apply ONLY to a normal distribution. Don't think you can use them for any distribution! Sometimes, probabilities are shown in other forms. If we start at the very left side of the distribution, the area underneath the curve is called the cumulative probability. For example, the probability of being less than the mean is 0.5, or 50%. This is just one example of a cumulative probability. A cumulative probability of 70% corresponds to a point that has 70% of the area under the curve to its left. There are easy ways to find cumulative probabilities using spreadsheet packages such as Microsoft Excel. You'll have opportunities to practice solving these types of problems shortly. Cumulative probabilities can be used to find the probability of any range of values. For example, to find the percentage of all women who have heights between 63.5 and 68 inches, we would simply subtract the percent whose heights are less than 63.5 inches from the percent whose heights are less than 68 inches.
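As a quick illustration of that subtraction, using the women's height figures and rounded values from a z-table: a height of 68 inches corresponds to z = (68 - 63.5)/2.5 = 1.8, which has a cumulative probability of about 96%; a height of 63.5 inches, the mean, has a cumulative probability of 50%. Subtracting, roughly 96% - 50% = 46% of all women have heights between 63.5 and 68 inches.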
Summary The normal distribution has a unique symmetrical shape whose center and width are completely determined by its mean and its standard deviation. For every normal distribution, the probability of being within a specified number of standard deviations of the mean is the same. The distance from the mean, as measured in standard deviations, is known as the z-value. Using the properties of the normal distribution, we can calculate a probability associated with any range of values.
Using Excel's Normal Functions To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel function NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal curve with mean zero and standard deviation one. For example, to find the cumulative probability for the z-value 1, we enter the Excel function =NORMSDIST(1). The value returned, 0.84, is the area under the standard normal curve to the left of 1. This tells us that the probability of obtaining a value less than 1 for a standard normal curve is about 84%. We shouldn't be surprised that the probability of being less than 1 is 84%. Why? First, we know that the normal curve is symmetric, so there is a 50% chance of being below the mean. Next, we know that about 68% of the probability for a standard normal curve is between -1 and +1.
Since the normal curve is symmetric, half of that 68% — or 34% of the probability — must lie between 0 and 1. Putting these two facts together confirms that there is an 84% chance of obtaining a value less than 1 for a standard normal curve.
If we want to find the cumulative probability of a value in a general normal curve — one that does not
necessarily have a mean of zero and a standard deviation of one — we have two options. One option is to first standardize the value in question to find the equivalent z-value, and then use the NORMSDIST function to find the cumulative probability for that z-value. For example, if we have a normal distribution with mean 26 and standard deviation 8, we may wish to know the probability of obtaining a value less than 24. Standardizing can be done easily by hand, but Excel also has a STANDARDIZE function. We enter the function in a cell and insert three values: the value to be standardized, and the mean and standard deviation of the normal distribution. We find that the standardized value (or z-value) of 24 for a normal curve with mean 26 and standard deviation 8 is -0.25. Now, to find the cumulative probability for the z-value -0.25, we enter the Excel function =NORMSDIST(-0.25), which tells us that the probability of a value less than -0.25 on a standard normal curve is 40%. Thus, the probability of a value less than 24 on a normal curve with mean 26 and standard deviation 8 is 40%. The second way to find a cumulative probability in a general normal curve is to use the NORMDIST function. Here, we enter the function in a cell and insert four values: the number whose cumulative probability we want to find, the mean and standard deviation of the normal distribution, and the word "TRUE." As with our previous approach, we find that the probability of obtaining a value less than 24 on a normal curve with mean 26 and standard deviation 8 is 40%.
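For reference, here is how the two approaches to this example look when entered in Excel (results rounded):
=STANDARDIZE(24, 26, 8) returns -0.25
=NORMSDIST(-0.25) returns approximately 0.40
=NORMDIST(24, 26, 8, TRUE) also returns approximately 0.40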
The value "TRUE" tells Excel to return a cumulative probability. If instead of "TRUE" we enter "FALSE," Excel returns the y-value of the normal curve — something we are usually not interested in. Quite often, we have a cumulative probability, and want to work backwards, translating it into a value on a normal curve. Suppose we want to find the z-value associated with the cumulative probability 95%.
To translate a cumulative probability back to a z-value on the standard normal curve, we use the Excel function NORMSINV. Note once again the S, which tells us we are working with a standard normal curve. We find that the z-value associated with the cumulative probability 95% is 1.65. Sometimes we may want to translate a cumulative probability back to a value on a general normal curve. For example, we may want to find the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard deviation 8. If we want to translate a cumulative probability back to a value on a general normal curve, we use the NORMINV function. NORMINV requires three values: the cumulative probability, and the mean and standard deviation of the normal distribution in question. We find that the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard deviation 8 is 39.2.
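For reference, the corresponding Excel entries for these two examples are (results rounded):
=NORMSINV(0.95) returns approximately 1.645
=NORMINV(0.95, 26, 8) returns approximately 39.2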
Using the z-table Practice with Normal Curves
Find the cumulative probability associated with the z-value 2. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
Find the cumulative probability associated with the z-value 2.36. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
Find the cumulative probability associated with the z-value -1. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
Find the cumulative probability associated with the z-value 1.645. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
Find the cumulative probability associated with the z-value -1.645. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 115. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 80. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
For a normal curve with mean 100 and standard deviation 10, find the probability of obtaining a value greater than 80 but less than 115. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
For a normal curve with mean 80 and standard deviation 5, find the probability of obtaining a value greater than 85 but less than 95. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 45. Enter your answer in decimal notation with 3 digits to the right of the decimal (e.g., enter "5" as "5.000"). Round if necessary. z-table Excel
For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 38 but less than 45. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
Find the z-value associated with the cumulative probability of 60%. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
Find the z-value associated with the cumulative probability of 40%. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
Find the z-value associated with the cumulative probability of 2.5%. Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel
For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of 88%. Enter your answer as an integer (e.g., "5"). Round if necessary. z-table Excel
For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of 28%. Enter your answer as an integer (e.g., "5"). Round if necessary. z-table Excel
The Central Limit Theorem How can the normal distribution help you sample Leo's hotel guests? How do the unique properties of the normal distribution help us when we use a random sample to infer something about the underlying population? After all, when we sample a population, we usually have no idea whether or not the population is normally distributed. We're typically sampling because we don't even know the mean of the population! If the normal distribution is such a great tool, when can we use it?
It turns out that even if a population is not normally distributed, the properties of the normal distribution are very helpful to us in sampling. To see why, let's first learn about a well-established statistical fact known as the "Central Limit Theorem". Definition Roughly speaking, the Central Limit Theorem says that if we took many random samples from a population and plotted the means of each sample, then — assuming the samples we take are sufficiently large — the resulting plot of the sample means would look normally distributed. Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would be equal to the true mean of the population.
To repeat: no matter what type of distribution the population has — uniform, skewed, bi-modal, or completely bizarre — if we took enough samples, and the samples were sufficiently large, then the means of those samples would form a normal distribution centered around the true mean of the population.
The Central Limit Theorem is one of the subtlest aspects of basic statistics. It may seem odd to be drawing a distribution of the means of many samples, but that is exactly what we are doing. We'll call this distribution the Distribution of Sample Means. (Statisticians also often call it the Sampling Distribution of the Mean).
Let's walk through this step-by-step. If we have a population — any population — we can take a random sample. This sample has a mean. We can plot that mean on a graph. Then we take another sample. That sample also has a mean, which we also plot on the graph. Now, if we plot a lot of sample means in this way, they will start to form a normal distribution around the population's mean.
The more samples we take, the more the graph of the sample means would look like a normal distribution. Eventually, the graph of the sample means — the Distribution of the Sample Means — would form a nearly perfect replica of a normal distribution. Now, nobody would actually take a lot of samples, calculate all of the sample means, and then construct a normal distribution with them. We're taking a lot of samples here just to let you see that graphing the means of many samples would give you a normal curve. In the real world, we take a single sample and squeeze it for all the information it's worth. But what does the Central Limit Theorem allow us to say based on that single sample? The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More specifically, we know that the sample mean falls somewhere in a normal Distribution of Sample Means that is centered at the true population mean. The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the underlying distribution of the population we want to learn about. Since we know the Distribution of Sample Means is normally distributed and centered at the true population mean, we can completely disregard the underlying distribution of the population.
As we'll see shortly, because we know so much about the normal distribution, we can use the information about the Distribution of Sample Means to draw conclusions about the likelihood of different values of the actual population mean. Summary The Central Limit Theorem states that for any population distribution, the means of samples from that population are distributed approximately normally. The more samples, and the larger the sample size, the closer the Distribution of Sample Means fits a normal curve. The mean of a single sample lies on this normal curve, so we can use the normal curve's special properties to extract more information from a single sample mean.
Illustrating Let's see how the Central Limit Theorem works using a graphical illustration. The three icons are marked "Uniform," "Bimodal," and "Skewed." On a later page, clicking on each of the three sections in the navigation will display a different kind of distribution.
On the next page, clicking on "Uniform" will display a distribution that is uniform in shape, i.e. a distribution for which all values in a specified range are equally likely to occur. Clicking on "Bimodal" will display a distribution that has two separate areas where values are more likely to occur than elsewhere. Clicking on "Skewed" will display a distribution that is not symmetrical — values are more likely to fall above the mean than below. Uniform The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.
Bimodal The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.
Skewed The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.
The Central Limit Theorem states that the means of sufficiently large samples are approximately normally distributed, a key insight that will allow you to estimate the population mean from a sample. Confidence Intervals Using the properties of the normal distribution and the Central Limit Theorem, you can construct a range of values that is almost certain to contain the population mean.
Estimating a Population Mean II For a normal distribution, we know that if we select a value at random, it will be within two standard deviations of the distribution's mean 95% of the time. The Central Limit Theorem offers us two additional insights. First, we know that the means of sufficiently large samples are normally distributed, regardless of the distribution of the underlying population. Second, we know that the mean of the Distribution of Sample Means is equal to the true population mean. Combining these facts can give us a measure of how accurately the mean of a sample estimates the population mean.
Specifically, we can now conclude that if we take a sufficiently large sample — let's say at least 30 points — from a population, there is a 95% chance that the mean of that sample falls within two standard deviations of the true population mean. Let's build this up step by step to make sure we understand the logic.
First, we take a sample from a population and compute its mean. We know that the mean of that sample is a point on a normal distribution — the Distribution of Sample Means. Since the mean of our sample is a value randomly obtained from a normal distribution, there is a 95% chance that the sample mean is within two standard deviations of the mean of the distribution. The Central Limit Theorem tells us that the mean of that distribution is the same as the true population mean. Thus, we can conclude that there is a 95% chance that the sample mean is within two standard deviations of the population mean. We have argued that 95% of our samples will have a mean within the range shown around the true population mean.
Next we'll turn this around and look at intervals around sample means, because that's exactly what a confidence interval is. Let's look at intervals around the means of two different types of samples: those whose sample means fall within the 2 standard deviation range around the population mean (which should be the case for 95% of all samples) and those whose sample means fall outside the 2 standard deviation range around the population mean (which should be the case for 5% of all samples). First, let's look at a sample whose mean falls outside the 2 standard deviation range shown around the population mean.
Since this sample mean is outside the range, it must be more than 2 standard deviations away from the population mean. Since the population mean is more than 2 standard deviations away from this sample mean, an interval extending 2 standard deviations on either side of this sample mean could not contain the true population mean. We know that 5% of all samples should have sample means outside the 2 standard deviation range around the population mean. Therefore, 5% of all samples we obtain will have intervals that do not contain the population mean.
Now let's think about the remaining 95% of samples whose means do fall within the 2 standard deviation range around the population mean. If we draw an interval extending 2 standard deviations on either side of any one of these sample means, the interval would contain the true population mean. Thus, 95% of all samples we obtain will have intervals that contain the population mean.
We've just shown how to go from any sample mean — a point estimate — to a range around the sample mean — a 95% confidence interval. We've also argued that 95% of confidence intervals obtained in this way should contain the true population mean. It's important to emphasize: We are not saying that 95% of the time our sample mean is the population mean, but we are saying that 95% of the time a range extending two standard deviations on either side of the sample mean contains the population mean. To visualize the general concept of a confidence interval, imagine taking 20 different samples from a population and drawing a confidence interval around each. On average, 95% of these intervals — or 19 out of 20 — would actually contain the population mean. What does this insight mean for us as managers? When we set a confidence level of 95%, we are agreeing to an approach that 1 out of 20 times will give us an interval that does not contain the true population mean. If we aren't comfortable with those odds, we should raise the confidence level. If we increase the confidence level to 98%, we have only a 1 out of 50 chance of obtaining an interval that does not contain the true population mean. However, this higher confidence comes at a cost. If we keep the same sample size, then the confidence interval will widen, thereby decreasing the accuracy of our estimate. Alternatively, to keep the same interval width, we can increase our sample size. How do we know if an interval is too wide? Typically, if we would make a different decision for different values within an interval, that interval is too wide. Let's look at an example. To estimate the percent of people in our industry who will attend the annual conference, we might construct a confidence interval that ranges from 7% to 13%. If we would select a different conference venue if the true percentage is 7% than if it is 13%, we need to tighten our range.
Now, before we are ready to actually create our own confidence intervals, there is a technical point we need to be acquainted with. We need to know that the standard deviation of the Distribution of Sample Means is sigma, the standard deviation of the underlying population, divided by the square root of n, the sample size. We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about the Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large samples to be tightly clustered around the true population mean, and thereby form a narrow distribution.
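As a rough illustration using the women's height example from earlier (population standard deviation 2.5 inches): samples of size 25 would produce a Distribution of Sample Means with standard deviation 2.5/sqrt(25) = 0.5 inches, while samples of size 100 would narrow it to 2.5/sqrt(100) = 0.25 inches.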
Summary A confidence interval is an estimate for the mean of a population. It specifies a range that is likely to contain the population mean. A confidence interval is centered at the mean of a sample randomly drawn from the population under study. With a 95% confidence level, we expect the confidence intervals centered at 95 out of 100 such sample means to contain the population mean.
Finding a Confidence Interval You understand the theory behind a confidence interval. But how do you actually construct one? We can now translate the previous discussion into a simple method for finding a confidence interval for the mean of any population. First, we randomly select a sample of at least 30 points from the population. We then compute the mean and standard deviation of the sample. Next, we assign the sample mean as the center of the confidence interval. To find the width of the interval, we must know the level of confidence we want to assign to the interval. If we want a 95% confidence interval, the interval should extend 2 times the standard deviation of the population divided by the square root of n, the sample size, on either side of the sample mean. Since we typically don't know the standard deviation of the population, we substitute the best estimate that we do have — the standard deviation of the sample. Written as an equation, the 95% confidence interval is: sample mean plus or minus 2 * s/sqrt(n). If we want a level of confidence other than 95%, instead of multiplying s/sqrt(n) by 2, we multiply by the z-value corresponding to the desired level of confidence. We can use this formula to compute any confidence interval. There is one restriction: in order for it to work, the sample size has to be at least 30.
Wine Lover's Magazine Let's walk through an example. Wine Lover's Magazine's managers have asked us to help them estimate the average age of their subscribers so they can better target potential advertisers. We tell them we plan to survey a sample of their subscribers. They say they're comfortable with our working with a sample, but emphasize that they want to be 95% confident that the range we give them contains the true average age of their full set of subscribers.
We obtain survey results from 60 randomly-chosen subscribers and determine that the sample has a mean of 52 and a standard deviation of 40. To find an appropriate confidence interval, we incorporate information about the sample into the formula. The z-value for a 95% confidence interval is about 2, or more accurately, about 1.96, so the interval extends 1.96*40/sqrt(60), or about 10.12, on either side of the sample mean. This tells us that a 95% confidence interval would begin at 52 minus 10.12, or 41.88, and end at the mean plus 10.12, or 62.12. We give management the range from 41.88 to 62.12 as an estimate of the average age of its subscribers, telling them they can be 95% confident that the true population mean falls between these values. What if we want a confidence level other than 95%? We can use the sample mean, standard deviation, and size from the sample data, but how do we obtain the right z-value?
Obtaining the z-value The z-value for 95% confidence is well known to be about 2, but how do we find a z-value for a less common confidence interval? To be 98% confident that our interval contains the population mean, how do we obtain the appropriate z-value? To find the z-value for a 98% confidence level, we are essentially asking: How far to the left and right of the standard normal curve's mean do we have to go to capture 98% of the area? Capturing 98% of the area centered at the mean of the normal curve leaves two areas at the tails, each covering 1% of the area under the curve. The z-value of the right boundary is the z-value associated with a cumulative probability of 99% — the sum of the central 98% and the 1% in the left tail. Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is essential because Excel's NORMSINV function and the z-table work with cumulative probabilities. To find the z-value associated with a cumulative probability of 99%, enter into Excel =NORMSINV(0.99), which returns the z-value 2.33. Or, look in the z-table and find the cell that contains a cumulative probability closest to 0.9900. The z-value is 2.33, the sum of the row-value 2.3 and the column-value 0.03. Try finding a z-value yourself. Find the z-value associated with a 99.5% confidence level using the appropriate normal distribution function in Excel or using the Standard Normal Table (z-table) in your briefcase. The correct z-value for a confidence level of 99.5% is:
a. 2.00 This is not the correct answer.
b. 2.33 This is not the correct answer.
c. 2.57 This is not the correct answer.
d. 2.81 This is the correct answer.
z-table Excel Our first step is to convert the confidence level of 99.5% into the corresponding cumulative probability on the standard normal curve. To do this, note that to have 99.5% probability in the middle of the
standard normal curve, we must exclude a total area of 100% - 99.5% = 0.5% from the curve. That area is divided into two equal parts in the distribution's tails: 0.25% in each tail. We can now see that the cumulative probability associated with a confidence level of 99.5% is 100% - 0.25% = 99.75%. Thus, the z-value for a confidence level of 99.5% is the same as the z-value of a cumulative probability of 99.75%. We find the z-value in Excel by entering =NORMSINV(0.9975), which returns the value 2.81. Alternatively, we could find the z-value in the z-table by looking up the probability 0.9975.
Summary To calculate a confidence interval, we take a sample, compute its mean and standard deviation, and then build a range around the sample mean with a specified level of confidence. The confidence level indicates how confident we are that the interval we construct around the sample mean contains the population mean.
Using Small Samples We assumed in our confidence limit calculations that the sample size was at least 30. What if it isn't? What if we have only a small sample? Let's consider a different survey, one that concerns a delicate matter. The business manager of a large ocean liner, the Demiurgos, asks for our help. She wants us to find out the value of her guests' belongings. She needs this value to determine the correct insurance protection in case guest belongings disappear from their cabins, are destroyed in a fire, or sink with the ship. She has no idea how valuable her guests' belongings are, but she feels uneasy asking them for this information. She is willing to ask only 16 guests to estimate the total value of the belongings in their cabins. From this sample, we need to prepare an estimate. With a sample size less than 30, we cannot calculate confidence intervals in the same way as with a large sample size. A small sample increases our uncertainty about two important aspects of our estimate of the population mean. First, with a small sample, the consequences of the Central Limit Theorem are not assured, so we cannot be sure that the sample means follow a normal distribution. Second, with a small sample, we can't be sure that the sample standard deviation is a good estimate of the population standard deviation. Due to these additional uncertainties, we cannot use z-values to construct confidence intervals. Using a z-value would overstate our confidence in our estimate. Can we still create a confidence interval? Is there a way to estimate the population mean even if we have only a handful of data points? It depends: if we don't know anything about the underlying population, we cannot create a confidence interval with fewer than 30 data points. However, if the underlying population is normally distributed — or even roughly normally distributed — we can use a confidence interval to estimate the population mean. In practice, as long as we are sure the underlying population is not highly skewed or extremely bimodal, we can construct a confidence interval, even when we have a small sample. However, we do need to modify our approach slightly. To estimate the population mean with a small sample, we use a t-distribution, which was discovered in the early 20th century at the Guinness Brewing Company in Ireland. A t-distribution gives us t-values in much the same way as a normal distribution gives us z-values. What is the difference between the normal distribution and the t-distribution?
A t-distribution looks similar to a normal distribution, but is not as tall in the center and has thicker tails, because it is more likely than the normal distribution to have values fall farther away from the
mean. Therefore, the normal distribution's "rules of thumb" for 68% and 95% probabilities no longer hold. For example, we must go more than 2 standard deviations on either side of the mean to capture 95% of the probability for a t-distribution. Thus, to achieve the same level of confidence, a confidence interval based on a t-distribution will be wider than one based on a normal distribution. This reinforces our intuition: we have less certainty about our estimate with a smaller sample, so we need a wider interval to achieve a given level of confidence. The t-distribution is also different because it varies with the sample size: For each sample size, there is a different t-value associated with a given level of confidence. The smaller the sample size n, the shorter the height and the thicker the tails of the t-distribution curve, and the farther we have to go from the mean to reach a given level of confidence. On the other hand, as the sample size increases, the shape of the t-distribution becomes more and more like the shape of a normal distribution. Once we reach a sample size of 30, the t-distribution becomes virtually identical to the z-distribution, so t-values and z-values can be used interchangeably. Incidentally, we can use the t-distribution even for sample sizes larger than 30. However, most people use the z-distribution for larger samples, partially out of habit and partially because it's easier, since the z-value doesn't vary based on the sample size.
Finding the t-value To find the right t-value, we first have to identify the t-distribution that corresponds to our sample size. We do this by finding the number of "degrees of freedom" of the sample, which for our purposes is simply the sample size minus one. If our sample size is 16, we have 15 degrees of freedom, and so on. Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1 minus the level of confidence we want and the degrees of freedom into the Excel function TINV, Excel gives us the appropriate t-value. For example, for a 95% confidence interval and a sample size of n = 16, the Excel function TINV(0.05,15) would return the value 2.131.
Excel Once we find the t-value, we use it just like we used the z-value to find a confidence interval. For example, for t = 2.131 and a sample of n = 16, the appropriate confidence interval is: sample mean plus or minus 2.131 * s/sqrt(16).
Excel If we don't have Excel handy, we can use a t-distribution table to find the t-value associated with the degrees of freedom and the confidence level we specify. When using different t-value tables, we need to be careful to note which probability the table depicts.
Excel t-table Some tables report values associated with the confidence level, like 0.95. Others report values based on the area in the tails, which would be 0.05 for a 95% confidence interval. Our t-table, like many others, reports values associated with a cumulative probability, so for a 95% level of confidence, we would have to look at a cumulative probability of 97.5%.
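For example, for a 95% level of confidence and 15 degrees of freedom, we look up the cumulative probability 0.975 in the t-table and find 2.131, the same t-value TINV returned above.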
Excel t-table The Good Ship Demiurgos Returning to the good ship Demiurgos, let's determine an estimate of the average value of passengers' belongings. The manager samples 16 guests, and reports that they have an average of $10,200 worth of clothing, jewelry, and personal effects in their cabins. From her survey numbers, we calculate a standard deviation of $4,800. We need to double check that the distribution isn't too skewed, which we might expect, since some of the passengers are quite wealthy. The manager explains that the insurance policy has a limited liability clause that limits a passenger's maximum claim to $20,000. Above $20,000, passengers' own homeowners' policies must cover any losses. Thus, in the survey, if a guest reported values above $20,000, the manager simply reported $20,000 as the value to be covered for our data set. We sketch a graph of the 16 values that confirms that the distribution is not too asymmetric, so we feel comfortable
using the t-distribution. Since we have a sample of 16 passengers, there are 15 degrees of freedom. The Excel function =TINV(0.05,15) tells us that the appropriate t-value is 2.131.
Excel Using the confidence interval formula, the guests' valuables are worth $10,200 plus or minus 2.131 times $4,800 over the square root of 16. Thus, the interval extends 2.131*4,800/4 = $2,557 on either side of the sample mean, and we can report that we are 95% confident that the average value of passengers' belongings is between $7,643 and $12,757.
Excel t-table What if the Demiurgos' manager thinks this interval is too large? Excel t-table She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also increases the size of the denominator (the square root of n). Both factors narrow the confidence interval.
Excel t-table For example, if she asks 10 more guests, and the standard deviation of the sample does not change, the t-value would drop to 2.06 and the square root of n in the denominator would increase. The distance the interval extends on either side of the mean would decrease significantly, from $2,557 to $1,939.
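For reference, the arithmetic behind that narrower interval (assuming, as stated, that the sample standard deviation remains $4,800): with 26 guests there are 25 degrees of freedom, =TINV(0.05,25) returns roughly 2.06, and 2.06*4,800/sqrt(26) is about $1,939 on either side of the sample mean.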
Excel t-table Summary Confidence intervals can be constructed even with a sample size of less than 30, as long as the population is roughly normally distributed (or, at least not too skewed or bimodal). To find a confidence interval with a small sample, use a t-distribution. T-distributions are a set of distributions that resemble the normal distribution, but with shorter heights near the mean and thicker tails. To find a confidence interval for a small sample size, place the appropriate t-value into the confidence interval formula.
Choosing a Sample Size When we take a survey, we often want a specific level of accuracy in our estimate of the population mean. For example, when estimating car owners' average spending on car repairs each year, we might want to be 95% confident that our estimate is within $50 of the true mean. We know that the sample size of our survey directly affects the accuracy of our estimate. The larger the sample size, the tighter the confidence interval and the more accurate our estimate. A sample of size n gives us a confidence interval that extends a distance of d = z * sigma/sqrt(n) on either side of the mean. To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of sigma, the standard deviation of spending. If we do not have an estimate based on past data or some other source, we might take a preliminary survey to obtain a rough estimate of sigma.
In this example, we estimate sigma to be $300 based on past experience. Since we want a 95% level of confidence, we set z = 1.96. To ensure our desired accuracy — that d is no more than $50 — we must randomly sample at least 139 people. In general, to ensure a confidence interval extends a distance of at most d on either side of the mean, we choose a sample size n that satisfies n ≥ (z * sigma/d)^2. We can do this with simple algebra, or by using the attached Excel utility.
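As a check on the example above: n must be at least (1.96*300/50)^2 = (11.76)^2, or about 138.3, and rounding up to a whole number of people gives the required sample size of 139.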
Confidence Interval Utility Summary When estimating a population mean, we can ensure that our confidence interval extends a distance of at most d on either side of the mean by choosing an appropriate sample size.
Step-by-Step Guide Here is a step-by-step process for creating a confidence interval:
First, we choose a level of confidence and a sample size n appropriate to the decision context.
Second, we take a random sample and find the sample mean. This is our best estimate for the population mean.
Third, we find the sample's standard deviation.
Fourth, we find the z-value or t-value associated with the proper confidence level. If our sample size is over 30, we find the z-value for our confidence level. If not, we find the t-value for our confidence level and with degrees of freedom = sample size - 1.
Fifth, we calculate the end points of the confidence interval: sample mean plus or minus z * s/sqrt(n) for a large sample, or sample mean plus or minus t * s/sqrt(n) for a small sample.
Summary Construct confidence intervals using the steps outlined above. With a confidence interval derived from an unbiased random sample, we can say that the true population mean falls within the interval with the corresponding level of confidence.
Excel Utility Click here to open an Excel utility that allows you to create confidence intervals by providing the sample mean, standard deviation, size, and desired level of confidence. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the Wine Lover's Magazine and the Demiurgos examples. Solving the Scuba Problem II The sample you collected earlier has all the data you need to create a confidence interval for Leo's problem. You take another look at the survey you created earlier for Leo: you sampled 45 guests, and calculated that the average satisfaction rate of the sample was 4.4, with a standard deviation of 1.54. Using this information, you decide to create a 95% confidence interval for Leo.
Your calculations show the following: a. We can be 95% sure that the population mean is 4.4. This is not the correct answer. We cannot estimate a 95% confidence interval for the population mean with just a single number, or a point estimate.
b. We can be 95% sure that the population mean falls between 3.95 and 4.85. This is the correct answer.
c. We cannot create an estimate for the population mean with the information given. This is not the correct answer. We can create a 95% confidence interval for the population mean using the mean, standard deviation, and size of sample.
Confidence Interval Utility z-table t-table To create a 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence interval by going 0.45 points (1.96*1.54/sqrt(45) ≈ 0.45) above and below the sample mean of 4.4, which translates into a confidence interval from 3.95 to 4.85. You meet with Leo and tell him that you can be 95% certain that the population mean falls between 3.95 and 4.85. Leo looks at your numbers and shakes his head. That's just not accurate enough for me to make a decision. If the mean is close to 4.85, I'd be happy, but if it's closer to 4, I'm concerned. Can we narrow the range at all?
Looking over your notes, you think you can give Leo some options. a. We can lower the confidence level, perhaps to 90%. That way we will obtain a narrower range, but we will be less certain that the true population mean falls within it. This is not the best answer. It is not a good idea to choose your confidence level based on how wide you want your confidence interval to be. You need to pick the desired confidence level first and accept the results that this confidence level provides you.
b. We can survey a larger group of people. This is the best answer. By increasing the sample size, you can narrow your confidence interval even if the standard deviation stays constant. Why don't you create a larger sample and report the results back to me? You select another 40 guests at random and ask the hotel operator to conduct the survey for you again. He is able to reach 25 guests. You combine the two samples, which gives a new sample size of 70. For the combined sample, you find that the new sample mean is 4.5 and the new sample standard deviation is 1.2. Armed with more data, you create another confidence interval.
We can be 95% certain that the average satisfaction of all hotel guests with the scuba school is between: a. 3.95 and 4.85 This is not the correct answer. Remember that with this new data you need to recalculate your confidence interval.
b. 4.14 and 4.86 This is not the correct answer. Remember that you have a new sample standard deviation.
c. 4.22 and 4.78 This is the correct answer.
d. 4.12 and 4.68 This is not the correct answer. Remember that you have a new sample mean.
Confidence Interval Utility z-table t-table To create this 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence interval by going 0.28 points (1.96*1.2/sqrt(70) ≈ 0.28) above and below the sample mean of 4.5, which translates into a confidence interval from 4.22 to 4.78. Thank you. I am much happier with this result. I have enough information now to decide whether to keep the current scuba diving school.
Exercise 1: The Veetek VCR Gambit Toshi Matsumoto is the Chief Operating Officer of a consumer electronics retailer with over 150 stores spread throughout Japan. For over a year, the sales of high-end VCRs have lagged, due to a shift towards DVD players. Just today, Toshi heard that Veetek, a large South Asian electronics retailer, is looking to purchase a bulk shipment of highend VCRs. This would be a perfect opportunity for Toshi to liquidate slow-moving inventory currently languishing on the shelves of his stores. Before he calls Veetek, he wants to know how many high-end VCRs he can promise. After two days of furious phone calls, his deputy has gathered data from 36 representative outlets in his retail chain. The mean high-end VCR inventory in each store polled was 500 units. The standard deviation was 180. Toshi needs you to find a 95% confidence interval for the average VCR inventory per store. The interval is:
a. From 490 to 510 This is not the correct answer. Remember to take the square root of the sample size when calculating the confidence interval.
b. from 451 to 549 This is not the correct answer. For a 95% confidence level, use a z-value of 1.96.
c. From 441 to 559 This is the correct answer. Good work!
d. From 433 to 567 This is not the correct answer. For a 95% confidence level, use a z-value of 1.96.
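For reference, the computation behind the correct choice: 500 plus or minus 1.96*180/sqrt(36) = 500 plus or minus 1.96*30 = 500 plus or minus 58.8, which rounds to the interval from 441 to 559.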
Confidence Interval Utility z-table t-table Exercise 2: Pulluscular Pig Disorder Paul Segal manages the pig-farming division of the agricultural company Bowman-Lyons-Centerville. A rumored outbreak of Pulluscular Pig Disorder (PPD) in one of Paul's herds is on the verge of causing a public relations disaster. The main symptom of PPD is a shrinking brain, and the only certain way to diagnose PPD is by measuring brain size postmortem. Paul needs to know if his herd is affected by PPD, but he does not want to have to slaughter hundreds of swine to find out. At the preliminary stage, he can offer no more than 5 prime porkers to be slaughtered and diagnosed. For the pigs slaughtered, the mean brain weight was 0.18 lbs, with a standard deviation of 0.06 lbs. With 95% confidence, in what range does the herd's average brain weight lie?
a. [0.127 lbs, 0.233 lbs] This is not the correct answer. The sample size is less than 30, so using a t-value instead of a z-value is
appropriate.
b. [0.123 lbs, 0.237 lbs] This is not the correct answer. Make sure you are finding the t-value for a 95% confidence level, not a 90% confidence level.
c. [0.117 lbs, 0.243 lbs] This is not the correct answer. Make sure you are finding the t-value using n - 1 = 4 degrees of freedom, not n = 5 degrees of freedom.
d. [0.106 lbs, 0.254 lbs] This is the correct answer. The right t-value for a 95% confidence level and 4 degrees of freedom is 2.78.
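For reference, the computation behind the correct choice: the interval extends 2.78*0.06/sqrt(5), or about 0.074, on either side of the sample mean of 0.18, giving roughly 0.106 lbs to 0.254 lbs.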
Confidence Interval Utility z-table t-table Proportions The next morning, you and Alice are about to head off to the hotel pool when Leo calls you.
The Customer Response Problem I'm sorry to disturb you, but I have another problem, and I think you might be able to help. The Kahana is a very popular resort during the summer tourist season. But the number of leisure visitors drops significantly during the off-season, from September through February and then April through May. We usually have quite a few room vacancies during that period of time. We expect to have about 200 rooms vacant for weeklong periods during the slow season this year. I've developed a new program that rewards our best guests with a special discount if they book a weeklong stay during our slow period. They won't have complete date flexibility of course, but the steep discount should make the offer attractive for them. To see how many of our past guests would accept such an offer, I sent promotional brochures to 100 of them. The deadline by which they had to respond to the offer has passed. Ten guests responded with the required room deposit prior to the deadline — that's a solid 10 percent. I figure if we send out 2,000 promotions, we'll get about 200 responses. This is a nice idea Leo, but I'm concerned it could backfire. If more than 10% respond to this offer, you might end up disappointing some of the very guests you're trying to reward. Or, if too many respond and you give them all the discount, you'll have to turn away customers willing to pay full price. That is exactly my concern. I wonder how accurate the 10% response rate is. Just because it held for 100 guests, will it hold for 2,000? What if 11% actually respond to the promotions? Imagine what would happen if 220 guests responded. I don't want to anger 20 loyal customers by telling them the offer is not valid, but I also don't want to turn away full paying guests to accommodate the extra 20 guests at a discount. I'm willing to reserve 200 rooms for these discount weeklong stays during the slow season. How many return guests can I safely send the discount offer and be confident that no more than 200 will respond? You can tell that Leo is growing quite comfortable with relying on your statistical methods. He seems almost as interested in them as he is in your results. Confidence Intervals And Proportions Sometimes, the question we pose to members of a sample calls for a yes or no answer. We might survey people in a target market and ask if they plan to buy a new car this year. Or survey voters and ask if they plan to vote for the incumbent candidate for office. Or we might take a sample of the products our plant produced yesterday and count how many are defective.
Even though our question has only two answers, we still have to address an inherent uncertainty: We
know what values our data can take — yes or no — but we don't know how often each response will be given. In these cases, we usually convey the survey results by reporting the percentage of yes responses as a proportion, p-bar. This is our best estimate of p, the true percentage of "yes" responses in the underlying population. Suppose, for example, that we have posted advertisements in the subway cars on Boston's "Red Line," and want to know what percentage of all passengers remembers seeing our ad. We create a proper survey, and ask randomly selected Red Line passengers if they remember seeing our ad. 300 passengers respond to our survey, of which 100 passengers report remembering the ad. Then p-bar is simply 33%, which is the number of people that remember the ad, 100, divided by the number of respondents, 300. The remaining 200 passengers, or 67% of the sample, report not remembering the ad. The two proportions always add up to 1 because survey respondents report either remembering the ad or not. Once we know the proportion of the sample, we can draw conclusions about all Red Line passengers. Our best estimate, or point estimate, for p, the percentage of all passengers who remember seeing our ad, is 33%.
As managers, we typically want more than this simple point estimate — we want to know how accurate the estimate is. How far from 33% might the true percentage be? Can we say confidently that it is between 30% and 36%, for example? When we work with proportions, how do we find a confidence interval around our point estimate? The process for creating a confidence interval around a proportion is nearly identical to the process we've used before. The only difference is that we can approximate the standard deviation of the population with a simple formula rather than calculating it directly from the raw data. Based on our sample, our best estimate of the true population proportion is p-bar, the percentage of "yes" responses in our survey. Statistical theory tells us that our best estimate of the standard deviation of the underlying population is the square root of [(p-bar)*(1 - p-bar)]. We can use this approximate standard deviation to determine a confidence interval for the proportion: p-bar plus or minus z times the square root of [(p-bar)*(1 - p-bar)] divided by the square root of n. For our Red Line ad, we approximate the standard deviation with the square root of 0.33 times 0.67, or 0.47. A 95% confidence interval is 0.33 plus or minus 1.96 times 0.47 divided by the square root of 300. This is equal to 0.33 plus or minus 0.053, or 27.7% to 38.3%. Unfortunately, there is one catch when we calculate confidence intervals around proportions...
Sample Size Sample size matters, particularly when dealing with very small or very large proportions. Suppose we are sampling New Yorkers for Amyotrophic Lateral Sclerosis, commonly known as Lou Gehrig's Disease. In the U.S., the odds of having the disease are less than 1 in 10,000. Would our sample be useful if we surveyed 100 people? No. We probably wouldn't find a single person with the disease in our sample. Since the true proportion is very small, we need to have a large enough sample to make sure we find at least a few people with the disease. Otherwise, we will not have enough data to get a good estimate of the true proportion. There is a guideline we must meet to make sure that our sample is large enough when estimating proportions. Two conditions must be met: First, the product of the sample size and the proportion must be at least 5. Second, the product of the sample size and 1 minus the proportion must also be at least 5. If both these requirements are met, we can use the sample. Essentially, this guideline guarantees that our sample contains a reasonable number of "yes" and a reasonable number of "no" answers. Our sample will not be useful otherwise. To avoid an invalid sample, we need to create a large enough sample size to satisfy the requirements. However, since we don't know the proportion p-bar before sampling, we don't know if the two conditions are met before setting the sample size. How can we get around this problem?
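As a rough illustration of why this matters, using the Lou Gehrig's Disease figure above: with a proportion near 1 in 10,000 (0.0001), the first condition requires n*0.0001 to be at least 5, which means a sample of at least 50,000 people.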
Finding a Preliminary Estimate of p-bar We can obtain a preliminary estimate of p-bar using either of two methods: first, we can use past experience. For example, to estimate the rate of Lou Gehrig's disease, we can research the rate of occurrence in the general population. This is a reasonable first estimate for p-bar. In many cases, however, we are sampling for the first time. Without past experience, we don't know what p-bar might be. In this case, it may well be worth our time to take a small test sample to estimate the proportion, p-bar. For example, if the proportion of yes answers in our small test sample is 3%, then we can use 3% as our preliminary estimate of p-bar. Substituting 3% for p-bar in our two requirements, n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5, tells us that n must satisfy n*0.03 ≥ 5 and n*0.97 ≥ 5. Thus the sample size we need for our real sample must be at least 167. We would then use a real sample — with at least 167 respondents — to find an actual sample value of p-bar to create a confidence interval for the population proportion. Summary
Proportions are often used to indicate the frequency of some characteristic in a population. The sample proportion p-bar is the number of occurrences of the characteristic in the sample divided by the number of respondents, the sample size. It is our best estimate of the true proportion in the population. We can construct a confidence interval for the population proportion. Two guidelines for the sample size must be met for a valid confidence interval: n(p-bar) and n(1 - (p-bar)) must each be at least five. Solving the Customer Response Problem Creating confidence intervals around proportions is not much different from creating them around means. Finding the right number of Leo's promotional brochures to mail should be easy. Leo needs to know how accurate the 10 percent response rate of his 100-customer sample is. Will this response rate hold for 2,000 guests? To how many guests can he send the discount offer for his 200 rooms?
First, you calculate a 95% confidence interval for the response rate. Enter the lower bound as a decimal number with two digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Confidence Interval Utility The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% and 15.88%. You obtain that answer by using the sample data and applying the familiar formula:
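Plugging the sample figures into that formula: 0.10 plus or minus 1.96 times the square root of (0.10*0.90) divided by the square root of 100, which is 0.10 plus or minus 1.96*0.30/10 = 0.10 plus or minus 0.0588, giving the range 0.0412 to 0.1588.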
Then after giving Leo's questions some thought, you recommend to him that he send the mailing to a specific number of guests. Enter the number of guests as an integer (e.g., "5"). Round if necessary. z-table Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the discount offer (at the 95% confidence level) is 15.88%. So, if 15.88% of recipients were to respond and Leo has only 200 rooms, how many people should he send the offer to? Simply divide 200 by 0.1588 to get to the answer: Leo can send the offer to at most 1,259 past customers. Leo is pleased with your work. He tells you to relax and enjoy the resort.
Exercise 1: GMW Automotive GMW is a German auto manufacturer that has regional sales subsidiaries throughout the world. Arturo Lopez heads the Mexican sales division of the company's Latin American subsidiary. GMW earns additional profit when customers choose to finance their car purchase with a GMW financing package. Arturo has been asked to submit a report to the GMW CEO in Germany about the percentage of GMW customers who opt for financing. Arturo has asked you, a new member of the division sales team, to devise a way to estimate this percentage. You take a random sample of 64 cars sold in the Mexican sales division, and find that 13 of them, or about 20.3%, opted for GMW financing. If you want to be 95% confident in your report to Mr. Lopez, you should tell him that the percentage of all Mexican customers opting for GMW financing falls in the range:
a. from 12.0% to 28.6% This is not the correct answer. The appropriate z-value for a 95% confidence interval is 1.96.
b. from 10.4% to 30.2% This is the correct answer.
c. from 15.1% to 25.5% This is not the correct answer. The appropriate standard deviation for the sample is the square root of [p*(1 - p)] = square root of [0.203*(1 - 0.203)] = 0.40 .
d. You do not have sufficient information to solve this problem This is not the correct answer. All the data needed to solve this problem are present: the sample size, the sample proportion, and the desired confidence level. Random sampling and the leeway granted by the confidence interval allow us to infer facts about an entire population from a small sample.
z-table Confidence Interval Utility
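For reference, here is a minimal sketch in Python of the calculation behind the correct answer (illustrative only; the small difference from the 10.4% quoted above comes from rounding the sample proportion to 20.3%).

import math

# GMW financing data from the exercise: 13 of 64 customers chose GMW financing.
n, successes = 64, 13
p_bar = successes / n                              # about 0.203
sigma = math.sqrt(p_bar * (1 - p_bar))             # about 0.40, as noted in the feedback above
margin = 1.96 * sigma / math.sqrt(n)               # 95% confidence level

print(round(p_bar - margin, 3), round(p_bar + margin, 3))   # approximately 0.105 and 0.302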
Exercise 2: Crown Toothpaste Kayleigh Marlon is the Chief Buyer at Tar-Mart, a company that operates a chain of superstores selling discount merchandise. Tar-Mart has a huge national presence, and manufacturers compete fiercely to get their products onto Tar-Mart's shelves. Crown Toothpaste, a new entrant in the toothpaste market, is one of them. Kayleigh agreed to stock Crown for 4 weeks and display it prominently. After that period, she will stop stocking Crown unless 5% of Tar-Mart's customers bought Crown or were considering buying Crown within the next month. The trial period is now over. Kayleigh has asked you to take a sample of customers to see if Tar-Mart should continue stocking Crown. She would like you to be at least 95% confident in your answer. The first step is to decide how large a sample size to choose. Kayleigh tells you that, in the past, when Tar-Mart introduced a new product, the percentage of people who expressed interest ranged between 2% and 10%. What sample size should you use?
a. 50 This is not the best answer. With a sample size of 50, the rule of thumb n(p-bar) ≥ 5 would not be satisfied at the low end of the range: if the proportion were 2%, then 50*0.02 = 1, which is less than 5.
b. Anything between 50 and 250 This is not the best answer. Smaller sample sizes in this range would not satisfy n(p-bar) ≥ 5 if the proportion turned out to be near 2%; only a sample size of at least 250 is safe for the entire range.
c. 250 This is the best answer. This sample size will satisfy the two rules of thumb (n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5) for all proportions falling in the range 2% to 10%: even at 2%, 250*0.02 = 5.
You choose a sample size of 250. After conducting the survey, you find that 10 out of 250 people surveyed had bought Crown or were considering buying Crown within the next month. What is the 95% confidence interval for the population proportion?
a. From 1.6% to 6.4% This is the correct answer.
b. From 2.0% to 6.0% This is not the correct answer. The appropriate z-value for a 95% confidence interval is 1.96.
c. From 3.5% to 4.5% This is not the best answer. You may have forgotten to take the square root of p-bar*(1 - p-bar) when calculating the standard deviation in the formula for the confidence interval.
z-table Confidence Interval Utility First, you find the sample proportion: 10 out of 250 is a proportion of 4%. You verify that n(p-bar) = 250*0.04 = 10 ≥ 5 and n(1 - (p-bar)) = 250*0.96 = 240 ≥ 5. Then, using the formula, you find the confidence interval around the sample proportion. The endpoints of that interval are 1.6% and 6.4%. Challenge: OOPS! Package Deliveries OO-P-S is a small-package delivery service with worldwide operations. Celine Bedex, VP Marketing, has heard increasing complaints about late deliveries, and wants to know how many of the shipments are late by one day or more. Celine would like an estimate of the percentage of late deliveries. In a sample of 256 shipments, 2 were delivered late, a proportion of about 0.008, or 0.8%. If Celine wants to be 99% confident in the result of a confidence interval calculation, the interval is:
a. Between -0.6% and 2.2% This is not the correct answer. One of the rules of thumb for the sample size is not being satisfied.
b. Between 0.7% and 0.9% This is not the correct answer. One of the rules of thumb for the sample size is not being satisfied.
c. No valid inferences can be drawn from these data. This is the best answer. One of the rules of thumb for the sample size is not being satisfied: n(p-bar) = (256) (0.008) = 2 is less than 5. Celine collects a new sample, this time of 729 shipments. Of these, 8 were late. Celine can be 99% confident that the population proportion of late packages is between:
a. 0.1% and 2.1% This is the correct answer. The new sample size is sufficiently large to investigate a population proportion of 0.011.
b. 0.3% and 1.9% TThis is not the correct answer. The appropriate z-value for a confidence interval of 99% is 2.58.
c. 0.0% and 1.6% This is not the correct answer. The new sample has a new sample proportion of 0.011. Use this sample proportion in your confidence interval calculation.
d. The sample size is still too small to make a valid inference. This is not the correct answer. Both n(p-bar) and n(1 - (p-bar)) are greater than 5, so both rules of thumb are satisfied.
z-table Confidence Interval Utility First, calculate the sample proportion for the new sample: 8/729 = 0.011. Then, verify that the new sample size satisfies the rules of thumb. Both n(p-bar) and n(1 - (p-bar)) are greater than 5. Using the new sample size and sample proportion, calculate the confidence interval: [0.1%, 2.1%].
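The same checks and the 99% interval can be sketched in code as follows (a Python illustration; the tutorial itself uses the z-table and the Confidence Interval Utility).

import math

def proportion_ci(n, successes, z):
    """Confidence interval for a proportion, or None if the rules of thumb fail."""
    p_bar = successes / n
    if n * p_bar < 5 or n * (1 - p_bar) < 5:
        return None                                # rule of thumb not satisfied
    margin = z * math.sqrt(p_bar * (1 - p_bar)) / math.sqrt(n)
    return p_bar - margin, p_bar + margin

print(proportion_ci(256, 2, 2.58))                 # None: 256 * 0.008 = 2 is less than 5
print(proportion_ci(729, 8, 2.58))                 # roughly (0.001, 0.021), i.e., 0.1% to 2.1%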
Hypothesis Testing Introduction After finishing the sampling assignments, you and Alice decide to take some time off to enjoy the beach. Just as you are gathering your beach gear, Leo gives you another call.
Improving the Kahana Hi there! Don't let me keep you from enjoying the beach. I just wanted to let you know what I'd like you to help me with next. I've been working on ideas to increase the Kahana's profits. Is it possible to increase profits by raising the room prices? That would be an easy solution.
I wish it were that easy. Room prices are extremely competitive and are often the first thing potential guests take into consideration. So if we increase room prices, I'm afraid we'll have fewer guests. That might put us back where we started from with profits — or even worse. What other factors influence your profits? The two major ones are room occupancy rates and discretionary spending. "Discretionary spending" is the money guests spend on non-room amenities. You know, food, drinks, spa services, sports activities, and so on. As a manager I can affect a variety of factors that influence discretionary spending: the quality of the restaurant, for example, or the types of amenities offered. And you'd like us to help you understand your guests' discretionary spending patterns better. Right. Then I can explore new ways to increase profits on non-room amenities. I can also see if some of my recent efforts to increase guest spending have paid off. I'm particularly interested in restaurant operations. I've made some changes to the restaurants recently. For example, I hired a new executive chef last year. I'd like to know if restaurant revenues per person have changed since then. I'd also like to find out if the renovation of our premier cocktail lounge has resulted in higher spending on beverages. Finally, I've been wondering if discretionary spending patterns are different for leisure and business guests. If so, I might change our marketing campaigns to better suit each of those market segments. What records do you have for us to work with? We don't have a consolidated report for this year yet, so we'll need to conduct some surveys and analyze the results. You're really getting into these statistical methods, aren't you, Leo?
Definition Leo made some important changes to his business and he has some ideas of what the impact of these changes has been. How do you put his ideas to the test?
As managers, we often need to put our claims, ideas, or theories to the test before we make important decisions. Based on whether or not our claim is statistically supported, we may wish to take managerial action. Hypothesis testing is a statistical method for testing such claims. A hypothesis is simply a claim that we want to substantiate. To begin, we will learn how to test hypotheses about population means. For instance, suppose we know that the historical average number of defects in a production process is 3 defects per 1,000 units produced. We have a hunch that a certain change to the process — a new machine, say — has changed this number. The hypothesis we wish to substantiate is that the average defect rate has changed — that it is no longer 3 per 1,000. How do we conduct a hypothesis test? First, we collect a random sample of units produced by the process. Then, we see whether or not what we learn about the sample supports our hypothesis that the defect rate has changed. Suppose our sample has an average defect rate of 2.7 defects per 1,000. Based on this sample, can we confidently say that the defect rate has changed? That depends. To find out, we construct a range around the historical defect rate of 3 — the population mean that has been cast in doubt. We construct the range so that if the mean defect rate in the population is still 3, it is very likely for the mean of a sample taken from the population to fall within that range. The outcome of our test will depend on whether 2.7, the mean of the sample we have taken, falls within the range or not. If the sample mean of 2.7 falls outside of the range, we feel comfortable rejecting the hypothesis that the defect rate is still 3. However, if the sample mean falls within the range, we don't have enough evidence to support the claim that the defect rate has changed. This example captures the essence of hypothesis testing, but we need to formalize our intuition about the example and define our new statistical technique more precisely. To conduct a hypothesis test, we formulate two hypotheses: the so-called null hypothesis and the alternative hypothesis. Based on experience or conventional wisdom, we have an initial value of the population mean in mind. The null hypothesis states that the population mean is equal to that initial value: in our example, the null hypothesis states that the current population mean is 3 defects per 1,000. We use the Greek letter mu to represent the population mean, in this case the current average defect rate. The alternative hypothesis is the claim we are trying to substantiate. Here, the alternative hypothesis is that the average defect rate has changed. Note that the alternative hypothesis states that the null hypothesis does not hold. As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we gather from a sample, there are only two possible conclusions we can draw from a hypothesis test: either we reject the null hypothesis or we do not reject it. Since the alternative hypothesis states the opposite of the null hypothesis, by "rejecting" the null hypothesis we necessarily "accept" the alternative hypothesis. In our example, the evidence from our sample will help us determine whether or not we should reject the null hypothesis that the defect rate is still 3 in favor of the alternative hypothesis that the defect rate has changed. Based on our sample evidence, which conclusion should we draw? 
We reject the null hypothesis if it is highly unlikely that our sample mean would come from a population with the mean stated by the null hypothesis. For example, if the sample we drew had a defect rate of 14 per 1,000, we would reject the null
hypothesis. Drawing a sample with 14 defects from a population with an average defect rate of 3 would be very unlikely. We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a population with the mean stated by the null hypothesis. The null hypothesis may or may not be true: we simply don't have enough evidence to draw a definite conclusion. For example, if the sample we drew had a defect rate of 3.05 per 1,000, we could not reject the null hypothesis, since it wouldn't be unusual to randomly draw a sample with 3.05 defects from a population with an average defect rate of 3. Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we never say that we "accept" the null hypothesis — we simply don't reject it. It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to substantiate as the null hypothesis — such a test would never allow us to "accept" our claim! The only way we can substantiate our claim is to state it as the opposite of the null hypothesis, and then reject the null hypothesis based on the evidence. It is important that we understand exactly how to interpret the results of a hypothesis test. Let's illustrate the two types of conclusions with an analogy: a US jury trial. In the US judicial system, the accused is considered innocent until proven guilty. So, the null hypothesis is that the accused is innocent. The alternative hypothesis is that the accused is guilty: this is the claim that the prosecution is trying to prove. The two possible outcomes of a jury trial are "guilty" or "not guilty." The jury does not convict the accused unless it is certain beyond reasonable doubt that the accused is guilty. With insufficient evidence, the jury cannot conclude that the accused truly is innocent. The jury simply declares that the accused is "not guilty." Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not prove that the null hypothesis is true. We simply have failed to show it is false, and thus cannot reject it. A hypothesis is a claim or assertion that can be tested. On the basis of a hypothesis test we either reject or leave unchallenged a particular statement: the null hypothesis. Alice promises Leo that the two of you will drop by his office first thing in the morning to test if Leo's survey results support his claims that food and beverage spending patterns have changed.
Summary We use hypothesis tests to substantiate a claim about a population mean. The null hypothesis states that the population mean is equal to an initial value that is based on our experience or conventional wisdom. We test the null hypothesis to learn if we should reject it in favor of our claim, the alternative hypothesis, which states that the null hypothesis does not hold. Single Population Means The next morning, Leo explains the measures he has undertaken to increase customer spending on food and beverages. "I'd like to see if they've had a discernable impact on my guests' restaurant-related spending patterns." The Restaurant Revenue Problem Last year, I made two major changes to restaurant operations: I brought in a new executive chef and renovated the main cocktail lounge. The chef introduced a new menu: a fusion of traditional Hawaiian and French cuisine. She put some elaborate items on the menu, like that mango and brie tart I recommended to you. She also has offerings that cater to simpler tastes. But the question is, have restaurant profits been affected by the new chef? Since we set our food margins as a fixed percentage of food revenue, I know that if revenues have increased, profits have increased too. Based on last year's consolidated reports, the average spending on
food per person per day was $55. I'm curious to see if that has changed. In addition, I renovated the cocktail lounge. The old bar was designed poorly and used space inefficiently. Now more guests can be seated in the lounge, and more seats have good views of the ocean. I also invested in a large machine that makes a wide variety of frozen drinks. Frozen pina coladas are very, very popular. I hope my investments in the bar are paying off in terms of higher guest spending on drinks. Beverages have high margins, but I'm not sure if beverage sales have increased enough to cover the investments. Can we say, for beverages, as for food, that "changes in revenues" are a good proxy for "changes in profits?" Absolutely. I set my profit margins as a fixed percentage of revenues for beverages as well. Last year, the average spending on beverages per guest per day was $21. Isn't that high? Well, we have some very nice wines in our restaurants. We don't have the consolidated report yet, but I've already had my staff choose a random sample of guests. We pulled the restaurant and lounge receipts for the guests in the sample and noted three items: total food revenues, total beverage revenues, and number of guests at the table. Using this information, we should be able to estimate the daily spending on food and beverages per guest. You look at Leo's data and wonder how you can discern whether Leo's changes — the new chef and the bar renovations — have influenced the resort's profits. Hypothesis Tests for Single Population Means Leo has prepared data for you. How are you going to put it to use? Our first type of hypothesis test is used to study population means. Let's walk through an example of this type of test. Suppose the manager of a movie theater implemented a new strategy at the beginning of the year: he started showing old classics instead of recent releases. He knows that prior to the change in strategy, average customer satisfaction was 6.7 out of a possible 10 points. He would like to know if average customer satisfaction has changed since he altered his theater's artistic focus. The manager's null hypothesis states that the current mean satisfaction has not changed; it is still 6.7. We use the Greek letter mu to represent the current mean satisfaction rating of the theater's entire filmgoing population. His alternative hypothesis is the opposite of the null hypothesis: it states that average customer satisfaction is now different. To substantiate his claim that the mean has changed, the manager takes a random sample of 196 moviegoers. He is careful to sample across movies, show times, and dates. The mean satisfaction rating for the sample is 7.3, with a standard deviation of 2.8. Does the fact that the random sample's mean of 7.3 is higher than the historical mean of 6.7 indicate that this year's moviegoers really are more satisfied? Or, is the mean still the same, and the manager "just happened" to pick a sample with an unusually high average satisfaction rating? This is equivalent to asking the question: If the null hypothesis is true — the average satisfaction is still 6.7 — would we be likely to randomly draw the sample that we did, with average satisfaction 7.3? To answer this question, we have to first define what we mean by "likely." As in sampling and estimation,
we typically use 95% as our threshold level of likelihood. We then construct a range around the population mean specified by our null hypothesis. The range should be drawn so that if the null hypothesis is true, 95% of all samples drawn from the population would fall in that range. In other words, we create a range of likely sample means. The central limit theorem tells us that the distribution of sample means follows a normal curve, so we can use its familiar properties to find probabilities. Moreover, the distribution of sample means is centered at our assumed population mean, mu, and has standard deviation sigma/sqrt(n). We don't know sigma, the underlying population standard deviation, so we use the sample standard deviation as our best estimate. As we do when constructing 95% confidence intervals, we create a range with width z*s/sqrt(n) = 1.96*s/sqrt(n) on either side of the mean. However, when we conduct a hypothesis test, we center the range around the mean specified in the null hypothesis because we always start a hypothesis test by assuming the null hypothesis is true. In our example, the null hypothesis is that the population mean is 6.7, n is 196, and s is 2.8. Our 95% confidence level translates into a z-value of 1.96. We construct the range of likely sample means: This tells us that if the population mean is 6.7, there is a 95% chance that the mean of a randomly selected sample will fall between 6.3 and 7.1. Now, if we take a sample, and the mean does not fall within the range around 6.7, we can reject the null hypothesis. Why? Because if the population mean were 6.7, it would be unlikely to collect a sample whose mean falls outside this range. The region outside the range of likely sample means is called the "rejection region," since we reject the null hypothesis if our sample mean falls into it. In the movie theater example, the rejection region contains all values less than 6.3 and all values greater than 7.1. In this example, the sample mean, 7.3, falls in the rejection region, so we reject the null hypothesis. Whenever we reject the null hypothesis, we in effect accept the alternative hypothesis. We conclude that customer satisfaction has indeed changed from the historical mean value of 6.7. If our sample mean had fallen within the range around 6.7, we could not make a definite statement about moviegoers' satisfaction. We would not have enough evidence to state that things have changed, but we can never claim that they have definitely remained the same. Unless we poll every customer, we'll never know for sure if customer satisfaction has truly changed. Working only with sample data, there is always a chance that we'll draw the wrong conclusion about the population. We can go wrong in two ways: rejecting a null hypothesis that is in fact true or failing to reject a null hypothesis that is in fact false. Let's look at the first of these: the null hypothesis is true, but we reject it. We choose the confidence level so it is unlikely — but not impossible — for the sample mean to fall in the rejection region when the null hypothesis is true. In this case, we are using a 95% confidence level, so by unlikely we mean a 5% chance. However, 5% of all samples from a population with the null hypothesis mean would fall in the rejection region, so when we reject a null hypothesis, there is a 5% chance we will do so erroneously. Therefore, when the sample mean falls in the rejection region, we can only be 95% confident that we are justified in rejecting the null hypothesis. 
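As a compact numeric check of the movie theater calculation above, here is a minimal sketch in Python (shown only for illustration; the tutorial itself works with the z-table and the Excel utilities).

import math

# Movie theater example: null hypothesis mean 6.7, sample of 196 with s = 2.8.
mu_0, s, n, z = 6.7, 2.8, 196, 1.96
half_width = z * s / math.sqrt(n)                  # 1.96 * 2.8 / 14 = 0.392

lower, upper = mu_0 - half_width, mu_0 + half_width
print(round(lower, 1), round(upper, 1))            # 6.3 7.1

sample_mean = 7.3
print(sample_mean < lower or sample_mean > upper)  # True: the sample mean falls in the rejection region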
Hence we continue to speak of a confidence level of 95%. A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5% significance level says that there is a 5% chance of a sample mean falling in the rejection region when the null hypothesis is true. This is what people mean when they say that something is "statistically significant at a 5% significance level." If we increase our confidence level, we widen the range around the null hypothesis mean. At a 99% confidence level, our range captures 99% of all sample means. This reduces to 1% our chance of rejecting the null hypothesis erroneously. But doing this has a downside: by decreasing the chance of one type of error, we increase the chance of the other type.
The higher the confidence level the smaller the rejection region, and the less likely it is that we can reject the null hypothesis when it is in fact false. This decreases our chance of being able to substantiate the alternative hypothesis when it is true. As managers, we need to choose the confidence level of our test based on the relative costs of making each type of error. The range of likely sample means should not be confused with a confidence interval. Confidence intervals are always constructed around sample means, never around population means. When we construct a confidence interval, we don't even have an initial estimate of the population mean. Constructing a confidence interval is a process for estimating the population mean, not for testing particular claims about that mean. Summary In a hypothesis test for population means, we assume that the null hypothesis is true. Then, we construct a range of likely sample means around the null hypothesis mean. If the sample mean we collect falls in the rejection region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. The confidence level measures how confident we are that we are justified in rejecting the null hypothesis. One-sided Hypothesis Tests The movie theater manager did not have a strong conviction about the direction of change for customer satisfaction prior to performing the hypothesis test. He wanted to test for change in both directions — up or down — and thus he used a two-sided hypothesis test. The null hypothesis — that no change has taken place — could have been wrong in either of two ways: Customer satisfaction may have increased or decreased. The two-tailed nature of the test was reflected in the two-sided range we drew around the population mean. Sometimes, we may want to know if the actual population mean differs from our initial value of the population mean in a specific direction. For instance, if the theater manager were quite sure that satisfaction had not decreased, he wouldn't have to test in that direction; rather, he'd only have to test for positive change. In these cases, our alternative hypothesis should clearly state which direction of change we want to test for. These kinds of tests are called one-sided hypothesis tests. Here, we substantiate the claim that the mean has increased only if the sample mean is sufficiently higher than 6.7, so our rejection region extends only to the right. Let's outline how to formulate one- and two-sided tests. For a two-sided test we have an initial understanding of the population: the population mean is equal to a specified initial value. If we want to substantiate the claim that a population mean has changed, the null hypothesis should state that the mean still equals that initial value. The alternative hypothesis should state that the mean does not equal that initial value. If we want to know that the actual population mean is greater than the initial value — the null hypothesis mean — then the null hypothesis should state that the population mean has at most that value. The alternative hypothesis states that the mean is greater than the null hypothesis mean. Likewise, if we want to substantiate the claim that a population mean is less than the initial value, the null hypothesis should state that the mean is at least that initial value. The alternative hypothesis should state that the mean is less than the null hypothesis mean, and the rejection region extends only to the left. 
When we conduct a one-sided hypothesis test, we need to create a one-sided range of likely sample means. Suppose the theater manager claims that satisfaction improved. As usual, he states the claim he wants to substantiate as his alternative hypothesis. The 196-person sample has mean 7.3 and standard deviation 2.8. Does this sample provide sufficient evidence to substantiate the claim that mean satisfaction increased? To find out, the manager creates a one-sided range: he assumes the population mean is the null hypothesis mean, 6.7, and finds the range that contains the lower 95% of all sample means.
To find this range, all he needs to do is calculate its upper bound. For what value would 95% of all sample means be less than that value? To find out, we use what we know about the cumulative probability under the normal curve: a cumulative probability of 95% corresponds to a z-value of 1.645. z-table Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the z-value corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail. For a one-sided test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is excluded from the upper tail. z-table We now have all the information we need to find the upper bound on the range of likely sample means: 6.7 + 1.645*2.8/sqrt(196), or about 7.0. The rejection region is everything above the value 7.0. The sample mean falls in the rejection region, so the manager rejects the null hypothesis. He is confident that customer satisfaction is higher. Summary When we want to test for change in a specific direction, we use a one-sided test. Instead of finding a range containing 95% of all sample means centered at the null hypothesis mean, we find a one-sided range. We calculate its endpoint using the cumulative probability under the normal curve. Excel Utility (Single Populations) The Excel Utility link below allows you to perform hypothesis tests for single populations. Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the theater manager's example. Excel Utility for Single Populations Solving the Restaurant Revenue Problem A single-population hypothesis test tests a claim using a sample from a single population. With a plan in mind, you take a look at Leo's sample data. You are ready to analyze the impact of the changes Leo has made to his restaurant operations. You draw a table to organize the data from your sample on daily guest spending on restaurant food. One change Leo made to his restaurant operations was to hire a new chef. He wants to know whether average restaurant spending per guest has changed since she took over the menu and the kitchen. This is a clear case for a hypothesis test. Last year's average spending on food per person was $55; this gives you an initial value for the mean. Leo wants to know if mean spending has changed, so you use a two-sided test. You jot down your null hypothesis, which states that the average revenue per guest is still $55. If the null hypothesis is true, the difference between the sample mean of $64 and the initial value of $55 can be accounted for by chance. You add the alternative hypothesis to your notes. Next, you assume that the null hypothesis is true: the population mean is $55. Now you need to construct a range of likely sample means around $55 and ask: does the sample mean of $64 fall within that range? Or does it fall in the rejection region? Leo didn't specify what level of confidence he wanted for your results. You call him for clarification.
I suppose a 95% confidence level is okay. I'd like to be more confident, of course. After you point out that higher confidence would reduce his chances of being able to substantiate a change in spending if a change has taken place, he agrees to 95%. You pull out your trusty calculator and get ready to compute a range around the null hypothesis mean of $55. Consulting your notes, you find the correct formula: You find the range containing 95% of all sample means. Its endpoints are:
a. [$50.07; $59.93] This is not the correct answer. In a two-sided test, the z-value for 95% confidence is 1.96, the z-value corresponding to a cumulative probability of 0.975, not 0.95.
b. [$49.12; $60.88] This is the correct answer. The z-value for 95% confidence in a two-sided test is 1.96.
c. [$56.27; $71.73] This is not the correct answer. Make sure you center your range at the null hypothesis mean, not the sample mean.
z-table Utility for Single Populations You pause for a moment to reflect on the interpretation of this range. Suppose the null hypothesis is true. Then 19 out of 20 samples of this size from the population of hotel guests would have means that would fall in the calculated range. The sample mean of $64 falls outside of this range. You and Alice report your results to Leo. Looks like hiring that chef was a good decision. The evidence suggests that mean spending per person has increased. I'm glad to hear it. Now what about renovating the bar? Can you run a similar test to see if that has affected average beverage spending? Leo emphasizes that he can't imagine that his investments in the bar could have reduced average beverage spending per guest. He wants to know if spending has gone up. You decide to do a one-sided test. First, you write down all of Leo's data, along with the hypotheses: You need to find an upper bound such that 95% of all sample means are smaller than it. To do so, you use a z-value of 1.645. The upper bound is $24.29. What is the correct interpretation of this number? Given that the null hypothesis is true,
a. If the population mean is $21, 19 out of 20 samples have means LESS than $24.29 This is the correct answer. $24.29 is an upper bound: 95% of all sample means collected from a population with the null hypothesis mean fall below $24.29.
b. If the population mean is $21, 19 out of 20 samples have means EQUAL to $24.29 This is not the correct answer. Sample means will generally not cluster on one point. $24.29 is an upper bound: 95% of all sample means collected from a population with the null hypothesis mean fall below $24.29.
c. If the population mean is $21, 19 out of 20 samples have means GREATER than $24.29 This is not the correct answer. Since $24.29 is an upper bound, the range that contains 95% of the sample means collected from a population with mean $21 must lie below that bound. The range of likely sample means contains the collected sample mean of $24. This tells you that:
a. The null hypothesis should be rejected. This is not the correct answer. The null hypothesis should be rejected when the sample mean falls outside of the range of likely sample means.
b. The null hypothesis should NOT be rejected.
This is the correct answer. The difference between the sample mean and the population mean may well be due to the randomness of the sample. There is not enough evidence to reject the null hypothesis.
When you present your full report to Leo, he appears confused and disappointed. How is this possible? Why hasn't renovating the bar increased revenues? Even if the frozen drink machine didn't pay off, shouldn't the increase in seats have helped? First of all, we haven't concluded that average revenue has not increased. We just can't be sure that it has. The fact that our sample mean is $24 vs. $21 last year does not allow us to say anything definitive about the change in average beverage revenue. Remember, we set out to substantiate our hypothesis that spending has improved. Based just on this sample, we are unable to conclude that spending has increased. You added seats and now more people can be seated in your lounge. But a greater number of guests does not necessarily translate into more spending per person. That does make a lot of sense. Your overall revenues may have actually increased, because more guests can be seated in the lounge. Gosh, I'm glad to hear that. For a moment there, I thought I had made a really bad investment. I'm quite optimistic I'll see a jump in total beverage revenues in the consolidated report at the end of the year. Why don't we go fill three of those new seats right now? Exercise 1: Oma's Pretzels Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks, and advertises that these pretzels contain an average of 112 calories per serving. Recently, an independent consumer research organization conducted an experiment to see if this claim was true. Somewhat to their surprise, the researchers found that the average calorie content in a sample of 32 bags was 102 calories per serving. The standard deviation of the sample was 19. Blanche would like to know if the calorie content of Oma's pretzels really has changed, so she can market them appropriately. With 99% confidence, do these data indicate that the pretzels' calorie content has changed? a. No This is not the correct answer. The sample mean falls outside of the range of likely sample means around the null hypothesis mean.
b. Yes This is the correct answer. The data indicate that the null hypothesis should be rejected. The calorie content has probably changed.
c. The answer cannot be determined from the information provided. This is not the correct answer. To solve the problem, we need the initial value of the mean, the confidence level, and the sample's mean, size, and standard deviation, all of which are provided.
z-table Utility for Single Populations You begin any hypothesis test by formulating a null and an alternative hypothesis. The null hypothesis states that the population mean is equal to the initial value. In this problem, the null hypothesis is that the caloric content in the actual population is what Oma's has always advertised. The alternative hypothesis should contradict the null hypothesis. For a two-sided test, the alternative hypothesis simply states that the mean does not equal the initial value. A two-sided test is more appropriate in this problem, since Blanche only wants to know if the mean calorie content has changed.
You assume that the null hypothesis is true and construct a range of likely sample means around the population mean. Using the data and the appropriate formula, you find the range 112 ± 2.58*19/sqrt(32) = 112 ± 8.7, or approximately [103; 121]. The sample mean of 102 falls outside of that range, so you can reject the null hypothesis. Blanche can be 99% confident that the population mean is not 112. Why might Blanche have chosen a 99% confidence level rather than the more typical 95% level for her test? a. She is generally risk averse, and wants to minimize the chance of making any type of erroneous decision based on the test. This is not the correct answer. There is a tradeoff in setting confidence levels. A high confidence level decreases our chance of erroneously rejecting the null hypothesis, but increases our chance of not being able to reject a null hypothesis that is false.
b. She feels that it would be very costly to change her marketing campaign if there is in fact no change in the average number of calories. This is the correct answer. A high confidence level decreases our chance of erroneously rejecting the null hypothesis. In this case, Blanche wants to minimize the chance of saying that the caloric content has changed if it really is still 112 calories per serving.
c. She feels that it would be very costly to not change her marketing campaign if there is in fact a change in the average number of calories. This is not the correct answer. A high confidence level decreases our chance of erroneously rejecting the null hypothesis, but increases our chance of not being able to reject a null hypothesis that is false. Choosing a higher confidence level would thus increase the chance that if the caloric content has changed, she would not be able to substantiate that fact.
z-table Utility for Single Populations Exercise 2: The Clearwater Power Company The Clearwater Power Company produces electrical power from coal. A local environmental group claims that Clearwater's emissions have raised sulfur dioxide levels above permissible standards in Blue Sky, the town downwind of the plant. According to Environmental Protection Agency standards, an acceptable average sulfur dioxide level is 30 parts per billion (ppb). As Clearwater's PR consultant, you want to defend the company, and you try to anticipate the environmental group's argument. The environmental group collects 36 samples on randomly selected days over the course of a year. It finds a mean sulfur dioxide content of 35 ppb with a standard deviation of 24 ppb. The environmental group will use a hypothesis test to back up its claim that the sulfur dioxide levels are higher than permitted. Which of the following is an appropriate null hypothesis for this problem? a. The average sulfur dioxide level is no higher than 30 ppb, the EPA's standard of acceptability. This is the best answer. The null hypothesis states the conventional wisdom: that the mean of the population under investigation (the sulfur dioxide concentration of the air in Blue Sky) is less than or equal to 30 ppb, the acceptability standard for which the EPA does not require a remedy. The environmentalists will pose as the alternative hypothesis the claim they are trying to substantiate: that Blue Sky's levels exceed the acceptable standard.
b. The average sulfur dioxide level is higher than 30 ppb, the EPA's standard of acceptability. This is not the best answer. This would make a good alternative hypothesis, since this is the claim the environmentalists are trying to substantiate.
c. The average sulfur dioxide level is 35 ppb. This is not the best answer. The null hypothesis should state that the population mean takes on the initial value, not the value of the sample mean.
z-table Utility for Single Populations
The environmentalists' claim is that sulfur dioxide levels are higher, so they will want to run a one-sided test. The alternative hypothesis states that the sulfur dioxide levels are above the accepted standard. We assume they will choose a 95% confidence level. What is the range of likely sample means? a. All values above 22.16 ppb. This is not the correct answer. In a one-sided test, the z-value for 95% confidence is 1.645, the z-value corresponding to a cumulative probability of 0.95.
b. All values above 23.42 ppb. This is not the correct answer. We want to test for positive change, not negative change.
c. All values below 36.58 ppb. This is the correct answer.
d. All values below 37.84 ppb. This is not the correct answer. In a one-sided test, the z-value for 95% confidence is 1.645, the z-value corresponding to a cumulative probability of 0.95.
z-table Utility for Single Populations They calculate the one-sided range around the null hypothesis mean that contains 95% of all samples. The z-value for a one-sided 95% range is 1.645. The upper bound on the range of likely sample means is 36.58 ppb. Based on your calculations, you should: a. Reject the null hypothesis. The data indicate that the sulfur dioxide content in Blue Sky is above EPA standards. This is not the correct answer. You reject the null hypothesis only when the sample mean falls outside the range of likely sample means.
b. Accept the alternative hypothesis. This is not the correct answer. Accepting the alternative hypothesis is equivalent to rejecting the null hypothesis. You can reject the null hypothesis only when the sample mean falls outside of the range of likely sample means.
c. Accept the null hypothesis. This is not the correct answer. Since the sample mean falls within the range of likely sample means, you cannot reject the null hypothesis. However, this does not mean that you have proven that the null hypothesis is true. We never accept a null hypothesis based on sample data.
d. Do not reject the null hypothesis. This is the correct answer. 35 ppb falls within the range of likely sample means. At a 95% confidence level, these sample data do not provide enough evidence to reject the null hypothesis.
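For reference, the Clearwater calculation can be sketched as follows (a Python illustration based on the numbers in the exercise; the tutorial itself uses the z-table and the Excel utility).

import math

# Clearwater one-sided test: null hypothesis mean 30 ppb, 36 sampled days,
# sample mean 35 ppb, sample standard deviation 24 ppb.
mu_0, s, n, z = 30, 24, 36, 1.645                  # z for a one-sided 95% range

upper_bound = mu_0 + z * s / math.sqrt(n)          # 30 + 1.645 * 4 = 36.58
print(round(upper_bound, 2))                       # 36.58

sample_mean = 35
print(sample_mean > upper_bound)                   # False: cannot reject the null hypothesis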
z-table Utility for Single Populations Exercise 3: Neshey's Smooches You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its former capacity. If the machine isn't working at top capacity, you will need to have it replaced. Which type of hypothesis test is most appropriate for this problem? a. One-sided test This is the best answer. You want to know if the machine's performance has been impaired, not simply if the performance has changed.
b. Two-sided test This is not the best answer. You use a two-sided test when testing for change in either direction.
The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340 Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44. You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, you should: a. Have the machine replaced. This is the correct answer. The sample mean falls below the lower bound of the one-sided range of likely sample means around the null hypothesis mean. You can be 95% confident that the machine's performance has been impaired.
b. Continue to use the machine. The lower output in the sample hours you observed was due solely to chance. This is not the correct answer. Be sure you correctly calculated the lower bound of the range of likely sample means to be 327.
The null hypothesis is that mu ≥ 340. The alternative hypothesis is that mu < 340, since you are using a one-sided test and the claim you want to substantiate is that the new population mean is lower than the population mean before the flood. Identify the relevant values. The sample size n=32. The standard deviation s=44. The appropriate z-value is 1.645 if you want to capture 95% of all sample means in a one-sided range around the null hypothesis mean. Use the formula and calculate the lower bound: 340 - 1.645*44/sqrt(32), or about 327. The sample mean of 318 falls well outside of the calculated range of likely sample means. You accept this as strong evidence against the null hypothesis, substantiating the alternative hypothesis that the mean output rate has dropped. You should replace the machine. Single Population Proportions Happy with your work on restaurant spending, Leo jumps right into the next problem. "It's not just the revenue of the restaurants that I care about," Leo says, "It's also my guests' satisfaction with their restaurant experience." The Restaurant Ambiance Problem When I go out to eat, I expect more than just excellent food. The whole dining experience is essential — everything from the service, to the décor, to the design and quality of the silverware. And it's not just that all of these factors must be excellent individually — they have to fit together. The restaurant has to have ambiance! I'm sure my guests have similar expectations, and I want to be sure my restaurant meets them. Since my new chef introduced more sophisticated cuisine, I made some changes to the décor that I think have improved the ambiance. It took me a long time and a substantial amount of money to get everything right, but I'm pleased with the result: the restaurants are elegant and distinctly Hawaiian. Just like the new chef's cuisine. In the past, I've contracted a local market research firm to conduct surveys, asking guests to rate the Kahana's restaurants' ambiance on a scale of one to five. Historically, the percentage of people that rated ambiance the top score of 5 gave me a good idea of how well we were doing. That percentage has been very high: 72%. I've collected this year's data for you. Can you figure out if my guests are happier with my restaurants' ambiance?
Hypothesis Tests for Single Population Proportions Alice tells you that testing Leo's claim about a proportion will be very similar to testing a mean. Often the summary statistic we want to make a claim about is a proportion. How do we test a hypothesis about a population proportion instead of a population mean? We know from our work with confidence intervals that the processes for estimating population proportions and population means are virtually identical. Similarly, hypothesis tests for proportions are much like hypothesis tests for means. Because we are examining a population proportion instead of a population mean, we use slightly different notation: we use a lower case p to represent the population proportion in place of mu for a population mean. We construct a hypothesis test to test a claim about the value of p. Again, we formulate null and alternative hypotheses. Based on conventional wisdom or past experience, we have an initial understanding of the population proportion. The null hypothesis for a proportion test states the initial understanding. For example, in a two-sided test, the null hypothesis asserts that the population proportion, p, is equal to the initial value we had in mind. The alternative hypothesis is the claim we are using the hypothesis test to substantiate. The alternative hypothesis typically states the opposite of the null hypothesis: it states that our initial understanding is incorrect. As with population means, we collect a random sample and calculate the sample proportion, "p bar." However, for a hypothesis test about a population proportion, we don't need to calculate a standard deviation from the sample. Statistical theory tells us that sigma, the standard deviation of the population proportion, is the square root of [p*(1 - p)]. Since we always start the test assuming the null hypothesis is true, we will calculate sigma using the null hypothesis proportion. Analogously to population mean tests, we create a range of likely sample proportions around the null hypothesis proportion. To create the range, we substitute for sigma the standard deviation of the underlying null hypothesis population. If our sample proportion falls outside the range of likely sample proportions, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. Summary In a hypothesis test for population proportions, we assume that the null hypothesis is true. Then, we construct a range of likely sample proportions around the null hypothesis proportion. If the sample proportion we collect falls in the rejection region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. Solving the Restaurant Ambiance Problem Once you understand hypothesis testing for means, using the same techniques on proportions is easy. By now, you're familiar with the concept of testing a hypothesis. You recognize that Leo's restaurant ambiance problem calls for a hypothesis test for a population proportion. Leo wants you to find out if the proportion of his guests that rate restaurant ambiance "excellent" has increased. Historically, that population proportion has been 0.72. Since Leo wants to see if there has been positive change, you do a one-sided test. The appropriate pair of hypotheses is: a. Null hypothesis p = 0.72, alternative hypothesis: p ≠ 0.72 This is not the best answer. This would be a good pair of hypotheses for a two-sided test.
b. Null hypothesis p ≥ 0.72, alternative hypothesis: p < 0.72
This is not the best answer. This would be a good pair of hypotheses if we wanted to know if the population proportion had gone down.
c. Null hypothesis p ≤ 0.72, alternative hypothesis: p > 0.72 This is the correct answer.
You are doing a one-sided test to see if the proportion of guests rating the restaurant "excellent" has increased. The alternative hypothesis states that the proportion has increased, and the null hypothesis states that it has not increased. You look at Leo's data. The sample proportion is 0.81 and the sample size is 126. But what about the standard deviation? a. You have enough information to calculate the standard deviation. This is the correct answer. For proportions, you can calculate the standard deviation using the null hypothesis proportion.
b. You should call Leo and ask him to calculate the standard deviation for you. Leo's phone is busy. Perhaps you should reconsider the data, and see if you can figure out the standard deviation with the information you have.
z-table Utility for Single Populations Here's how you find the standard deviation for a proportion problem: using the appropriate formula, the square root of [p*(1 - p)] = square root of [0.72*(1 - 0.72)], you calculate the standard deviation to be 0.45. Leo wanted you to use a 95% confidence level. Now you're ready to construct a range of likely sample proportions around the null hypothesis value of the population proportion: 0.72. Find the range of likely sample proportions around the null hypothesis proportion, and formulate a short answer for Leo. a. There is not enough evidence to show that the proportion of guests rating the restaurant ambiance "excellent" has increased. This is not the correct answer. Check your calculations: the sample proportion falls well outside the range of likely sample proportions.
b. The evidence supports Leo's claim that the proportion of guests rating the restaurant ambiance "excellent" has increased. This is the correct answer.
z-table Utility for Single Populations A one-sided test calls for a one-sided range of likely sample proportions. You need to find the upper bound for this range such that the range captures the lower 95% of the sample proportions. The z-value for a one-sided 95% confidence level is 1.645. Substitute the null hypothesis proportion, 0.72, for p. The upper bound for the range containing the lower 95% of all sample proportions is 0.72 + 1.645*0.45/sqrt(126), or about 0.78. Since the sample proportion 0.81 falls in the rejection region, you reject the null hypothesis. The data provide sufficient evidence that the population proportion has, in fact, increased. Alice presents your findings to Leo, telling him that with 95% confidence, the data you collected indicate that the difference between the historical population proportion and the proportion of the random sample is not due to chance. The proportion of your guests that rate the restaurants' ambiance as "excellent" has increased.
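Restating the ambiance calculation as a minimal sketch (Python is used here only for illustration; all numbers come from the problem):

import math

# Restaurant ambiance one-sided proportion test: null hypothesis proportion 0.72,
# sample of 126 guests with a sample proportion of 0.81.
p_0, n, z = 0.72, 126, 1.645                       # z for a one-sided 95% range

sigma = math.sqrt(p_0 * (1 - p_0))                 # about 0.45, computed from the null proportion
upper_bound = p_0 + z * sigma / math.sqrt(n)
print(round(upper_bound, 3))                       # about 0.786, the bound of roughly 0.78 quoted above

p_bar = 0.81
print(p_bar > upper_bound)                         # True: reject the null hypothesis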
Just what I wanted to hear! Thanks, you two. Exercise 1: The Ventura Insurance Company Luther Lenya, the new product guru of The Ventura Automotive Insurance Company, is considering marketing a special insurance package to members of certain professional groups. In particular, Luther wants to create a special package for health professionals. To find out what rate to charge for this package, Luther conducts a preliminary study to see if health professionals are less likely to be involved in car accidents than the rest of his customer base. If the data indicate that health professionals are less likely to be involved in car accidents, then Ventura can offer health professionals a lower, more competitive rate. In the past 5 years, 8.3% of Ventura's customers have been involved in accidents. Which of the following is the correct pair of hypotheses for solving Luther's problem? a. Null hypothesis p = 8.3%; Alternative hypothesis p ≠ 8.3% This is not the best answer. These would be the appropriate hypotheses for a two-sided test.
b. Null hypothesis p ≤ 8.3%; Alternative hypothesis p > 8.3% This is not the best answer. Luther wants to know if medical professionals are better drivers. The alternative hypothesis should state that medical professionals are less likely to be in accidents.
c. Null hypothesis p ≥ 8.3%; Alternative hypothesis p < 8.3% This is the correct answer. Luther wants a one-sided test, because he wants to know if medical professionals are better drivers. The alternative hypothesis should state that medical professionals are less likely to be in accidents.
A sample of 240 customers in the health profession reveals that 12 (5.0%) have had accidents. If he uses a 95% confidence level, which of the following is the best conclusion Luther can come to? a. Health professionals should be charged more for car insurance. This is not the best answer. The sample proportion falls outside the range of likely sample proportions around the null hypothesis proportion, and the null hypothesis should be rejected. Based on our test, we shouldn't charge health professionals higher insurance rates. Health care professionals have fewer accidents; thus, if we change their rates at all, they should be charged lower rates.
b. The evidence suggests that health professionals are less likely to be involved with car accidents. This is the best answer. The range of likely sample proportions around the null hypothesis proportion does not contain the sample proportion, so we can reject the null hypothesis. With 95% confidence, the proportion of health professionals involved in car accidents is lower than the proportion of Ventura's population of drivers.
c. The data provide no evidence suggesting that health professionals are more or less likely to be involved in car accidents. This is not the best answer. The sample proportion falls outside the range of likely sample proportions around the null hypothesis proportion, so you should reject the null hypothesis.
z-table Utility for Single Populations You need to find a range of likely sample proportions. To find this range, you calculate a standard deviation using the null hypothesis proportion: the square root of [0.083*(1 - 0.083)] is about 0.28. For a one-sided test, a confidence level of 95% corresponds to a z-value of 1.645. The lower bound of this range is 0.083 minus 1.645 times the standard deviation divided by sqrt(240), which works out to 0.054 = 5.4%. The range of likely sample proportions does not contain 5.0%, so you should reject the null hypothesis.
With 95% confidence, the proportion of health professionals involved in car accidents is lower than the proportion of the overall population of drivers. P-Values After sleeping over your analysis of restaurant operations, Leo seems unsatisfied. Leo Demands a Deeper Understanding Don't get me wrong, I appreciate your hard work. But look here: these hypothesis tests result in a "reject/don't reject" decision. If I understand you correctly, it doesn't matter how close to the border of the rejection region our sample statistic falls: "reject" is "reject." But can't you tell me more? I want to know how strong the evidence against the null hypothesis is, not just if it is strong enough. I'm glad you brought that issue up, Leo. We have a second method of doing hypothesis tests, one that provides a measure of the strength of the evidence. P-Values The evening before, Alice had acquainted you with p-values: "We can use the p-value method of hypothesis testing to make 'reject/not reject' decisions in the same way we have been doing all along. But the p-value also measures the strength of evidence against a null hypothesis." In hypothesis tests we've done so far, we first chose the confidence level of the test. The confidence level tells us the significance level of the test, which is simply 1 minus the confidence level. Typically, we chose a 5% significance level — a 95% confidence level — as our threshold value for rejection. Assuming that the null hypothesis is true, we reasoned that certain sample mean values are less likely to appear than others. If the mean of the sample we collected was sufficiently unlikely to appear (that is, less than 5% likely) we considered the null hypothesis implausible and rejected it. Now, rather than simply checking whether the likelihood of collecting our sample is above or below our chosen threshold, we'll ask: if the null hypothesis is true, how likely is it to choose a sample with a mean at least as far from the null hypothesis mean as the sample mean we collected? The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that falls at least a certain distance from the null hypothesis mean. In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we reject our null hypothesis. The p-value does more than simply answer the question of whether or not we can reject the hypothesis. It also indicates the strength of the evidence for rejecting the null hypothesis. For example, if the p-value is 0.049, we barely have enough evidence to reject the null hypothesis at the 0.05 level of significance; if it is 0.001, we have strong evidence for rejecting the hypothesis. Let's look at an example. Recall the movie theater manager who wanted to know if the average satisfaction rate for his clientele had changed from its historical rate of 6.7. To find out, we constructed the range, 6.3 to 7.1, which would have contained 95% of the sample means if the null hypothesis mean had still been true. Since the mean of the sample of current moviegoers we collected, 7.3, fell outside of that range, we rejected the null hypothesis. Because 7.3 fell in the rejection region, we know that the likelihood of collecting a sample mean as extreme as 7.3 is less than 5% if the null hypothesis is true. Now let's find out exactly how unlikely it is by calculating the p-value. Calculating the p-value is a little tricky, but we have all the tools we need to do it. 
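The calculation is walked through in detail below; as a compact preview, here is a minimal sketch in Python (illustrative only) of the two-sided p-value for the movie theater sample.

import math

def normal_cdf(x):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Movie theater example: null mean 6.7, sample of 196 with mean 7.3 and s = 2.8.
mu_0, x_bar, s, n = 6.7, 7.3, 2.8, 196
z = (x_bar - mu_0) / (s / math.sqrt(n))            # (7.3 - 6.7) / 0.2 = 3

p_value = 2 * (1 - normal_cdf(abs(z)))             # both tails, since the test is two-sided
print(round(p_value, 4))                           # about 0.0027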
Recall that for samples of sufficient size, the sample means of any population are distributed normally. To calculate the likelihood of a certain range of sample mean values — in our example, sample mean values greater than 7.3 or less than 6.1 — we just need to find the appropriate area under the
distribution curve of the sample means. To calculate the p-value for this two-sided test, we want to find the area under the normal curve to the right of 7.3 and to the left of 6.1. The standard deviation in this example is 2.8, and the sample size is 196. We can calculate this probability by first calculating the z-value associated with the value 7.3. That zvalue is 3. Then, we find the probability of having a z-value less than -3 or greater than 3. The area to the left of the z-value of -3 is 0.00135. The area to the right of the z-value of +3 is the same size, so the total area is 0.0027. That is our p-value. These areas and the p-value can be found in Excel using the NORMSDIST(-3) function, in the z-table, or with the Excel utility provided. Excel z-table Utility for Single Populations Our p-value calculation tells us that the probability of collecting a sample mean at least as far from 6.7 as 7.3 is 0.0027. The p-value is lower than 0.05. Thus, at a significance level of 0.05, we would reject the null hypothesis and conclude that moviegoers' average satisfaction rating is no longer 6.7. Excel z-table Utility for Single Populations But the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at 0.0027, a much lower significance level. In other words, we can reject the null hypothesis with 99.73% confidence. In general, the lower the p-value, the higher our confidence in rejecting the null hypothesis. One-sided hypothesis tests are also easily conducted with p-values. For one-sided tests, the p-value is the area under one side of the curve. In our movie theater example, if the alternative hypothesis states that the population mean is larger than 6.7, the p-value is the area under the normal curve to the right of the sample mean of 7.3. Summary The p-value measures the strength of the evidence against the null hypothesis. It is the likelihood, assuming that the null hypothesis is true, of collecting a sample mean at least as far from the null hypothesis mean as the sample actually collected. We compare the p-value to the threshold significance level to make a reject/not reject decision. The p-value also tells us how comfortable we can be with that decision. Solving the Restaurant Revenue Problem (Part II) Now Alice explains the basics of p-values to Leo, so you can present the results of your restaurant revenue hypothesis test again. This time, you'll be able to give Leo an idea of how strong the statistical evidence is. Leo wants you to complete the p-value hypothesis test right there in his office. You're a little nervous — you've never had a client peering over your shoulder when you work. But you oblige him, because you're growing more confident of your statistical skills. Looking back at your notes on the problem, you find the data and the hypotheses. You make a mental note that you are doing a two-sided test to see whether or not average spending on food has changed from its historical level of $55. An eager Leo interrupts your thought process: When you ran the hypothesis test earlier I had you use a 95% confidence level. That corresponds to a significance level of 0.5, right? You politely respond:
a. Yes, Leo you've got it. You'd better rethink that. You wouldn't want to give Leo the wrong information.
b. I'm sorry, but I don't think that's right. Good choice. Leo is still a little confused, but you bring him up to speed.
To find the significance level corresponding to a confidence level of 95%, simply subtract 95% from 100%, and convert into decimal notation: 0.05. After you clarify Leo's mistake, he sits back and lets you finish your analysis without further interruption. First, you find the appropriate z-value. Enter the z-value as a decimal number with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. Excel z-table The correct z-value is 3.00, corresponding to a right-tail probability of 0.00135. You: a. Take that probability as your p-value. This is not the correct answer. You're conducting a two-sided test, you need to find the probability in both tails.
b. Double that probability to find the p-value. This is the correct answer. For a two-sided test, you calculate both tail probabilities.
c. Halve that probability to find the p-value. This is not the correct answer. You're conducting a two-sided test and you've only found the probability in one tail.
z-table Utility for Single Populations Your sample has a mean (x-bar = $64) that is $9 higher than the assumed population mean, $55. You want to calculate the likelihood of getting a sample mean that is at least as far from the population mean as x-bar. That likelihood is not just the tail probability to the right of the sample mean. Sample means on the other side of the normal curve are just as far from the population mean as x-bar. They must be included, too, when you calculate the p-value for a two-sided hypothesis test. Doubling the right-tail probability gives you the correct p-value: 0.0027. Alice summarizes your results for Leo. All we have to do is compare the p-value to the significance level. The p-value 0.0027 is less than the significance level 0.05. Our data are statistically significant at the 0.05 level. Just as we calculated earlier by constructing a range around the null hypothesis mean, the p-value method suggests that we reject the null hypothesis. With 95% confidence, average food spending per guest has changed. But now we can also see that the evidence is very strong, because the p-value is much lower than the significance level. We can claim that food spending has changed at the 0.0027 level of significance. Thanks, you two. I feel much more comfortable concluding that average guest spending in my restaurant has changed. Exercise 1: Oma's Revisited In the following exercise you will revisit an earlier problem, this time solving it with the p-value method. Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks. Each bag of pretzels contains one serving, and Oma's advertises that the pretzel snacks contain an average of 112 calories per serving.
In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true. The researchers found that the average calorie content in a sample of 32 bags was 102 calories per serving. The standard deviation of the sample was 19. Blanche would like to know if the calorie content of Oma's pretzels has really changed, so she can market them appropriately. At the significance level 0.01, do these data indicate that the pretzels' calorie content has changed? a. No. This is not the best answer. The sample mean lies outside of the range of likely sample means around the null hypothesis mean.
b. Yes. This is the best answer. The data indicate that the null hypothesis should be rejected. The calorie content has probably changed.
c. Can't be determined from the information given. This is not the best answer. All the information needed to solve the problem is given.
Utility for Single Populations Excel z-table In this problem, the null hypothesis is that the actual population mean is what Oma's has always advertised. A two-sided test is appropriate, since Blanche only wants to know whether the mean calorie content has changed. Assuming that the null hypothesis is true, you find a z-value for the sample mean of 102 using the appropriate formula. The z-value is -2.98. Using the Excel NORMSDIST function or the Standard Normal Table, you can find the corresponding left-tail probability of 0.0014. For a two-sided test, you double this number to find the p-value, in this case 0.0028. Since this p-value is less than the significance level, you can reject the null hypothesis. Moreover, you now can say that you are rejecting the null hypothesis at the 0.0028 level of significance. You can recommend to Blanche that she have the labeling changed on the pretzel bags, and adjust her marketing accordingly. Exercise 2: Neshey's Revisited In the following challenge you will revisit an earlier problem, this time solving it with the p-value method. You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its former capacity. If the machine isn't working at top capacity, you will need to have it replaced. The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340 Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44. You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, you should: a. Replace the machine. This is the best answer. The p-value is less than the significance level of 0.05. You reject the null hypothesis that the machine's performance is not impaired.
b. Keep the machine.
This is not the best answer. At the 0.05 significance level, the data suggest that the machine's performance has been impaired.
Utility for Single Populations Excel z-table The null hypothesis is that the population mean is no lower than 340. The alternative hypothesis is that the population mean is now less than 340. You are using a one-tailed test because you want to find out whether the new population mean is lower than the population mean before the flood. Identify the values of the relevant quantities. Use the appropriate formula and calculate the z-value. The z-value is -2.83. This z-value corresponds to a left-tail probability of 0.0023. This is the tail you are interested in, since you are conducting a one-sided test to see whether the actual population mean is less than it was in the past. This tail probability is the p-value. Since the p-value is less than the significance level, you reject the null hypothesis that the population mean is unchanged. Moreover, you now can say that you are rejecting the null hypothesis at the 0.0023 level of significance. You should replace the machine. Comparing Two Populations Now satisfied with your analysis of the restaurant, Leo asks you to compare the discretionary spending habits of two categories of guests: leisure and business. Leisure Guests vs. Business Guests: Who spends more? Every hotel manager wrestles with the problem of stretching limited marketing resources. I want to make sure that I'm wisely allocating each marketing dollar. Leisure guests, such as tourists and honeymooners, are especially attracted to Hawaii. Also, many professional associations like to have their conventions here, so our islands attract business travelers, who mix business and pleasure. Business travelers pay lower room prices because conferences book rooms in bulk. Bulk reservations are good for me because they keep my occupancy levels high. However, I don't have a good sense of whether the discretionary spending of my business guests is different from that of my leisure guests: they may take fewer scuba lessons but use the spa services more, for example. Can you help me figure out whether there is any significant difference between leisure and business travelers' discretionary spending habits? Your conclusions might influence my marketing efforts. I collected two random samples: one of leisure guests and one of business guests. Not including room, meal, and beverage charges, leisure travelers spent an average of $75 a day, compared to $64 a day for the business travelers. I knew that the difference between the two averages of the two samples could be due to chance, so I thought I'd have you do a hypothesis test to find out. When I was compiling the data for you, I realized that my samples were of different sizes. I was able to get 85 leisure guests to respond, but only 76 business guests returned my survey. Which figure will you use as the sample size? Or will you add them together? I also realized that with these data, you'd have to calculate two sample standard deviations, one for each sample. How do you go about solving a problem like this? Using Hypothesis Tests to Compare Two Population Means How do you test whether two populations have different means?
So far, we've used hypothesis tests to study the mean or proportion of a single population. Often, managers want to compare the means or proportions of two different populations: in this case, we use a two-population hypothesis test. Let's clarify when we use each type of test. We conduct single-population tests when we have an initial value for a population mean and want to test to see if it is correct. Single population tests are especially useful when we suspect that the population mean has changed. For example, we use a single-population test when we know the historical average of a population and want to test whether that historical average has changed. We conduct two-population tests to compare a characteristic of two groups for which we have access to sample data for each group. For example, we'd use a two-population test to study which of two educational software packages better prepares students for the GMAT. Do the students using package 1 perform better on the GMAT than the students using package 2? In two-population tests, we take two samples, one from each population. For each sample, we calculate the sample mean, standard deviation, and sample size. We can then use the two sets of sample data to test claims about differences between the two populations. For example, when we want to know whether two populations have different means, we formulate a null hypothesis stating that the means are not different: the first population mean is equal to the second. Let's look at the GMAT software package example more closely. The manager of one educational software company might wonder if the average GMAT score of students using her software is different from the average GMAT score of students using the competitor's software. Since the manager only wants to test if the average GMAT scores are different, she conducts a two-sided hypothesis test for two populations. The null hypothesis states that there is no difference between the average GMAT scores of the students who use the two companies' software. The alternative hypothesis states that the average GMAT scores of the students who use the two companies' software are different. We denote the average scores of the two populations by the Greek letter mu and distinguish them with subscripts. Our hypotheses are: To be 95% confident in the result of the test, we use a significance level of 0.05. We collect two samples, one from each population. We denote the sample means with the familiar x-bar, which we again distinguish with subscripts. We are able to collect the GMAT scores of 45 people who used the company's software, and 36 people who used the competitor's software. As we will see shortly, the different sample sizes will not pose a problem. The respective sample means are 650 and 630, and the standard deviations are 60 and 50. Could the two random samples we picked just happen to have different means by chance but really have come from populations that have the same population means? The null hypothesis states that there is no difference in the two population means. As with singlepopulation tests, we test the null hypothesis by asking how likely it would be to produce the sample results if the null hypothesis is in fact true. That is, if the average GMAT scores for students using the two different software packages actually are the same, what is the chance that two samples we collect would have sample means as different as 650 and 630? 
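The next paragraphs walk through this test step by step. As an optional supplement, here is the same GMAT calculation sketched in Python (again assuming SciPy is available): the z-value divides the difference in sample means by the combined standard error, and the two-sided p-value is the area in both tails beyond that z-value. The variable names are illustrative only.

from math import sqrt
from scipy.stats import norm

x1, s1, n1 = 650, 60, 45   # company's software: sample mean, sd, sample size
x2, s2, n2 = 630, 50, 36   # competitor's software: sample mean, sd, sample size

std_error = sqrt(s1**2 / n1 + s2**2 / n2)  # combined standard error of the difference
z = (x1 - x2) / std_error                  # null hypothesis: the difference is 0
p_value = 2 * (1 - norm.cdf(abs(z)))       # two-sided p-value
print(round(z, 2), round(p_value, 2))      # roughly 1.64 and 0.10

With a z-value of about 1.64 and a p-value of about 0.10, the sketch reproduces the conclusion reached below: the evidence is not strong enough to reject the null hypothesis at the 0.05 significance level.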
Our intuition tells us that the greater the difference between the means of the two samples, the more likely it is that the samples came from different populations. But how do we know when the numerical difference is large enough to be statistically significant? When do we have enough evidence to actually conclude that the two populations must be different? We use p-values to answer this question. First we calculate a z-value for the difference of the sample
means, incorporating the data from both populations. It looks a bit complicated: Let's compute the z-value for our example. Since we assume that the null hypothesis is true, we have: Using the formula, we find that the z-value is 1.64. For a two-sided test, a z-value of 1.64 translates into a probability in one tail of 0.05, and thus a p-value of 0.10. Since this p-value is greater than the significance level of 0.05, we cannot reject the null hypothesis. In other words, the high p-value tells us that there is insufficient evidence from the two samples to conclude that the average GMAT score of the students who use the company's software is different from the average GMAT score of students who use the competitor's software. Two-population hypothesis tests can be performed using the formula shown above, or you can click here to access the Excel utility for hypothesis testing. Summary In a hypothesis test for two population means, we assume a null hypothesis: that the two population means are equal. We collect a sample from each population and calculate its sample statistics. We calculate a p-value for the difference between the two samples. If the p-value is less than the significance level, we reject the null hypothesis. Hypothesis Tests for Two Population Proportions Often, managers want to know if two population proportions are equal. For example, a marketing manager of a packaged snack foods company might want to compare the snack food habits of different states in the US. The marketing manager might think that the proportion of consumers who favor potato chips in Texas is different from the proportion of consumers who favor potato chips in Oklahoma. Comparing two population proportions is similar to comparing two population means. We have two populations: the null hypothesis states that their proportions are the same; the alternative hypothesis states that they are different. We collect a sample from each population and calculate its sample size and sample proportion. As in the single population proportion test, we don't need to find the sample standard deviation, since we know that the population standard deviation is the square root of [p*(1 - p)]. Similarly to the hypothesis tests for comparing two population means, we calculate a z-value for the difference between the proportions using the formula below: We translate the z-value into a p-value just as we would for any other type of hypothesis test. If the pvalue is less than our significance level, we reject the null hypothesis and conclude that the proportions are different. If the p-value is greater than the significance level, we do not reject the null hypothesis. Optional Example Let's take a closer look at the study of snacking habits in Texas and Oklahoma. The manager does not wish to test for a particular direction of difference; he just wants to know if the proportions are different. Thus, he should use a two-sided test. The marketing manager wants to be 95% confident in the result of this test, so the significance level is 0.05. Suppose we collect responses from 400 people in Texas and 225 people in Oklahoma. The sample proportions are 45% and 35%, respectively. Could the two random samples we picked just happen to have different sample proportions? That is, if the true proportions of Texans and Oklahomans favoring potato chips actually are the same, what
would be the chance that the sample proportions are 45% and 35% respectively? We use p-values to answer this question. First, we calculate a z-value for the difference of the sample proportions that incorporates the data from both populations. The null hypothesis states that the population proportions are equal, so their difference is 0. The z-value is 2.48. For a two-sided test, a z-value of 2.48 translates into a probability in one tail of 0.0065 and hence a p-value of 0.013. Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis. In other words, the low p-value tells us that there is sufficient evidence from the samples to conclude that there is a difference between the proportions of Texan and Oklahoman potato chip lovers. We can make this claim at a 0.013 level of significance. Two-population hypothesis tests for population proportions can be performed using the formula shown above, or you can click here to access the Excel utility for hypothesis testing. Summary In a hypothesis test for two population proportions, we assume a null hypothesis: the two population proportions are equal. We collect two samples and calculate the sample proportions. We calculate a p-value for the difference between the sample proportions. If the p-value is less than the significance level, we reject the null hypothesis. Excel Utility (Two Populations) Click here to open an Excel Utility that allows you to perform hypothesis tests for two populations. Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the GMAT and potato chip examples. Solving the Leisure vs. Business Guest Spending Problem Two-population hypothesis tests help you determine whether two populations have different means. You use a two-population test to solve Leo's problem. You have to find out if leisure guests' average daily discretionary spending is different from business guests' average daily discretionary spending. Leo has provided these data: Now it's time to state the null hypothesis. The best formulation is: a. There is no difference between business and leisure guests' mean spending. This is the best answer. You want to know if two means are different, not if they differ in one particular direction. If Leo had asked you to conduct a test to learn only if business guests' spending was greater than that of leisure guests, the second answer would be correct.
b. On average, business guests' spending is less than leisure guests' spending. This is not the best answer. You want to know if two means are different, so the null hypothesis states that they are not different.
c. The average difference between business and leisure guests' spending is $11. This is not the best answer. We never use the summary statistics from the sample to formulate our hypotheses; the hypotheses must be specified before the samples are collected.
You want to find out if the means of two populations — average spending by leisure guests vs. average spending by business guests — are different. The two samples from those populations have different means: $75 and $64, respectively.
The samples may come from populations with the same means, and the numerical difference is due to chance in getting these particular samples. It could be that the first sample just happened to have a high mean and the second sample just happened to have a lower mean. You test the null hypothesis that the population means have the same value. You make note of your null hypothesis and the corresponding alternative hypothesis. You use a two-sided test because you don't have any reason to believe that one type of guest spends more than the other. At Leo's request you do a p-value test using a significance level of 0.05. To calculate the p-value, you first find the z-value. Enter the z-value as a decimal number with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. a. 2.10 b. 2.11 c. 2.12 d. -2.10 e. -2.11 f. -2.12 z-table Utility for Two Populations The z-value is +2.12 or -2.12, depending on how you set up the difference, i.e., in what order you subtract the sample means. Either way, the final conclusion will be the same. You use the z-value to calculate the p-value. The p-value is: a. 0.017 This is not the correct answer. Remember that for a two-sided test, you must calculate both tail probabilities.
b. 0.034 This is the correct answer. For a two-sided test, you calculate both tail probabilities.
c. 0.051 This is not the correct answer. For a two-sided test, you calculate the probabilities in both tails to get the p-value.
z-table Utility for Two Populations A z-value of 2.12 has a cumulative probability of 0.9830. You subtract this probability from the total probability, 1, for a right-tail probability of 0.017. Because we are doing a two-sided test, we want to measure the probability of extreme values on both sides. Thus, we double 0.017 to get a p-value of 0.034. Since the p-value 0.034 is less than the significance level 0.05, you recommend to Leo that he reject the null hypothesis. The average daily discretionary spending per person is different for leisure and business guests. Leo reads your report: I see. We can tell if two population means are different by running a hypothesis test on their difference. We test the null hypothesis that there is no difference.
As you see, the p-value is less than the significance level. This tells you that... ...the difference in the two sample means is probably not due to chance. The spending habits of the two types of guests are different. Got it. Exercise 1: The Burger Baron Claude Forbes is a regional manager of The Burger Baron restaurant chain. The Baron would like to open a franchise in Sappington. Two properties are up for sale, and Claude wants to choose the location that has the most traffic. On 54 randomly selected days over a six-month period, an average of 92 cars passed by location A during the lunch hour, from noon to 1PM. On 62 randomly selected days, 101 cars passed by location B during the lunch hour. The standard deviations for the two locations were 16 and 23, respectively. Is the difference between the amount of traffic at locations A and B statistically significant at the 0.01 level? a. Yes. This is not the correct answer. This problem requires a two-sided test, and the p-value is 0.0135.
b. No. This is the correct answer. The p-value is 0.0135, which is higher than 0.01.
z-table Utility for Two Populations First, you set up the hypotheses. The null hypothesis states that there is no difference in mean traffic flow during the lunch hour at the two locations. Since you are only interested in whether the traffic is different at the two locations, you conduct a two-sided hypothesis test. Next, you find the z-value for the difference between the two sample means using the appropriate formula. The z-value is -2.47. A z-value of -2.47 corresponds to a left-tail probability of roughly 0.0068. Since you are doing a two-sided test, you must double this probability to find the p-value: approximately 0.0135. The p-value is higher than the significance level. You cannot reject the null hypothesis. On the basis of these data, Claude has insufficient evidence to show that one location has more traffic than the other. Exercise 2: Karnivorous Kong vs. Peter the Pipsqueak The Magical Toy company manufactures a line of wrestling action figures. Maude Troston, the head of the doll department, must make the final decision regarding which of two models, "Karnivorous Kong" or "Peter the Pipsqueak," should be discontinued. Magical Toy ships crates containing equal numbers of Pipsqueaks and Kongs to retail outlets. 45 randomly selected toy stores have sold an average of 78% of all Pipsqueaks delivered to them. 48 other stores have sold an average of 84% of their Kongs. At a significance level of 0.05, do the data indicate that one action figure sells better than the other? a. Yes. This is not the correct answer. The p-value is greater than the significance level.
b. No. This is the correct answer. The data are inconclusive.
z-table Utility for Two Populations Maude wants to know if one figure sells better, but doesn't have a preconception about which figure might be flying off the shelves. Thus you use a two-sided test, and your alternative hypothesis simply
states that there is a difference between the two population proportions. Using the appropriate formula and the data, you find the z-value for the difference between the sample proportions. The z-value is -0.738. A z-value of -0.738 corresponds to a left-tail probability of 0.2303. You double this to find the p-value for the two-sided test. The p-value is 0.4606. The p-value is greater than the significance level. Therefore, you can't reject the null hypothesis. The data do not show conclusively that one action figure sells better than the other. Exercise 3: Grapefruit Bizarre The Regal Beverage Company makes the soft drink Grapefruit Bizarre. The marketing department wants to refocus its energies and resources. You have been asked to determine if there are regional differences in consumers' response to advertisements for Grapefruit Bizarre. Specifically, you must find out if the Midwest responds to Grapefruit Bizarre advertisements as well as the West Coast. The marketing department is clamoring to start a second campaign. It claims that ads that are effective on the West Coast do not go over as well in the Midwest. Management demands statistical evidence at a significance level of 0.05. In the context of a free movie screening, an ad for Grapefruit Bizarre is shown to 173 Midwesterners. The viewers had been randomly selected, and had not previously tasted the drink. When asked later, 33% claimed that they were at least mildly interested in trying Grapefruit Bizarre. In a similar survey conducted on the West Coast, 42% of 152 test subjects claimed at least a mild interest in trying Grapefruit Bizarre. You calculate the z-value for the difference in sample proportions. Enter your z-value as a decimal number with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. a. 1.67 b. 1.68 c. 1.69 d. 1.70 e. -1.67 f. -1.68 g. -1.69 h. -1.70 z-table Utility for Two Populations The z-value is + or -1.68, depending on how you set up the difference, i.e., in what order you subtract the sample proportions. A z-value of -1.68 corresponds to a left-tail probability of 0.0465. What do you report to the marketing department? a. The p-value is 0.0930 and the data indicate that Midwesterners respond less well to the ads.
This is not the best answer. If the p-value is larger than the significance level, the null hypothesis should not be rejected.
b. The p-value is 0.0930 and the data are inconclusive: the difference between the sample proportions may be due to chance. This is not the best answer. You may have conducted a two-sided test. Since marketing wants to know if Midwesterners respond less well to the ads than the West Coast, you should conduct a one-sided test.
c. The p-value is 0.0465 and the data are inconclusive: the difference between the sample proportions may be due to chance. This is not the best answer. The p-value is 0.0465, which is less than the significance level 0.05.
d. The p-value is 0.0465 and the data indicate that Midwesterners respond less well to the ads. This is the best answer. The p-value is 0.0465, which is less than the significance level 0.05.
z-table Utility for Two Populations You are conducting a one-sided hypothesis test. The alternative hypothesis states that the proportion of the Midwest sample is less than the proportion of the West Coast sample. Therefore, you are interested only in the left-tail probability. Your p-value is 0.0465. 0.0465 is less than the significance level 0.05. You should reject the null hypothesis. Midwesterners do not respond to the ads as well as people from the West Coast. Marketing's claims are valid. Challenge: LeMer Fashion Design Upscale fashion designer Marjorie LeMer must decide from which supplier she should purchase bolts of cloth. Rumor has it that BlueTex's product is superior to Southern Halifax's. Random 10-yard sections from 43 bolts of Halifax's cloth contain a mean of 1.8 flaws per yard. Similar sections from 42 bolts of BlueTex's product contain 1.6 flaws per yard. The standard deviations are 0.3 and 0.6, respectively. Marjorie wants you to find out if the rumors that BlueTex makes a better product are statistically warranted. You conduct a one-sided test. Which of the following is the best alternative hypothesis? a. There is no difference in the quality of the cloth. This is not the best answer. This is the null hypothesis of a two-sided test. Marjorie has asked you to test the hypothesis that BlueTex has a better product than Halifax.
b. There are fewer flaws per yard in Halifax's cloth than in BlueTex's. This is not the best answer. This would be an appropriate alternative hypothesis if we wanted to see if Halifax produced the better cloth.
c. There are more flaws per yard in Halifax's cloth than in BlueTex's. This is the best answer. The alternative hypothesis should state that Halifax's product is inferior to BlueTex's.
z-table Utility for Two Populations At what level are these data significant? a. significance level 0.05 This is the correct answer. The p-value is 0.0262, which is greater than 0.01, but less than 0.05.
b. significance level 0.01 This is not the correct answer. The p-value is greater than 0.01.
c. Both at significance level 0.05 and significance level 0.01 This is not the correct answer. To be significant at both levels, a p-value must be less than both 0.01 and 0.05.
d. Neither at significance level 0.05 nor at significance level 0.01 This is not the correct answer. The p-value, 0.0262, is less than 0.05, so the data are significant at the 0.05 level.
z-table Utility for Two Populations You find the z-value for the difference between the two sample means using the appropriate formula. The z-value is -1.94. The cumulative probability for z=-1.94 is 0.0262. This is the left-tail probability. Since you are running a one-sided test, 0.0262 is your p-value. 0.0262 is greater than 0.01. At this significance level, you would not reject the null hypothesis. 0.0262 is less than 0.05. At this significance level, you would reject the null hypothesis. Southern Halifax's product is slightly cheaper than BlueTex's. All other factors being equal, Marjorie would like to buy the less expensive product. Unless she is 99% confident that there is a difference in quality, she will go with the cheaper cloth. Based on this information and your calculations, Marjorie should: a. Buy cloth from Southern Halifax Textile This is the correct answer. The data are not significant at the 0.01 level, which corresponds to a confidence level of 99%.
b. Buy cloth from BlueTex This is not the correct answer. Based on these data, Marjorie can't be 99% confident that BlueTex's product is superior.
z-table Utility for Two Populations The data are not significant at the 0.01 level, so you can't reject the null hypothesis that there is no difference in quality between the two products. A significance level of 0.01 corresponds to a confidence level of 99%. So at a 99% confidence level, you can't reject the null hypothesis. Marjorie can't be 99% confident that BlueTex's product is better than Halifax's. Since Halifax's product is cheaper and you can't determine a difference in quality at the level of statistical significance Marjorie requires, you recommend Halifax to Marjorie. "Good work!" says Alice. "You're ready for a new challenge: investigating relationships between variables." Regression Basics Introduction As you relax in your room during a brief afternoon downpour, your phone rings. Leo's Bisque Debacle and the Staffing Problem Leo just called. He wants us to come to his office immediately. He sounds a little angry. We'd better not keep him waiting. I'm sorry if I was short on the phone. I'm very upset. We just had a little incident down in the restaurant. A server spilled a tureen of crab bisque on one of our most "favored" guests, Mr. Pitt. The Kahana's occupancy this year has been higher than I expected, and I had to hire extra help from a staffing agency. Those staffing agencies charge a fortune, which is especially irritating considering that the employees they refer to us are often poorly suited to customer service in an upscale hotel.
Really, this is my fault for not having a more effective staffing process. I just wish I could predict my needs better. Sometimes, when demand is lower than I expected, I'm overstaffed. Then I lose money paying idle bellhops. If I had a good sense of my staffing needs at least a month in advance, I could avoid hiring workers at the last minute and having idle staff. I had been thinking that the number of advance reservations would give me a good idea of how high my occupancy would be a month down the road. But clearly advance reservations don't tell me the whole story. I've been making way too many false predictions. Is there anything you can do to help me here? What predictions about occupancy can I make based on advance bookings? And how much can I trust them? We'll take a look at the data on advance bookings and occupancy and let you know what we find out. Introducing the Regression Line Alice seems confident that the two of you can offer useful advice on Leo's staffing problem: "This will be a great opportunity for you to learn regression. It's a powerful statistical tool used all the time in business: in finance, demand forecasting, market research to name just a few areas. I'm sure you'll use it in your MBA program. And it's a great chance to review what you've learned so far: sampling, confidence intervals, and hypothesis testing all play a part in regression." As we have seen, it is often useful to examine the relationship between two variables. Using scatter diagrams, we can visualize such relationships. We can learn more about the relationship by finding the correlation coefficient, which measures the strength of the linear relationship on a scale from -1 to 1. Regression is a statistical tool that goes even further: it can help us understand and characterize the specific structure of the relationship between two variables. Let's look at an example. Julius Tabin owns a small food processing company that produces the spreadable lunchmeat product EasyMeat. Julius is trying to understand the relationship between his firm's advertising and its sales. Total sales in the spreadable meat industry have been fairly flat over the last decade, and Julius' competitors' actions have been quite stable. Julius believes that his advertising levels influence his firm's sales positively, but he doesn't have a clear understanding of what the relationship looks like. Let's have a look at data on his firm's advertising and sales over the last 10 years. Click on the Excel link to create the scatter diagram yourself from an Excel spreadsheet. EasyMeat Data Plotting annual sales against annual advertising expenditures gives us a visual sense of the relationship between the two variables. Looking at the graph, we can see that as advertising has gone up, sales have generally increased. The relationship looks reasonably linear. EasyMeat Data The correlation coefficient for the two variables is 0.93, indicating a strong linear relationship between advertising and sales. EasyMeat Data What if we were to draw a line that characterizes this relationship? Which line would best fit the data? Our mind's eye already sees how the two variables are related, but how can we formalize our visual impression? EasyMeat Data Before we start any calculations, let's look at several lines that could describe the relationship. EasyMeat Data
One of these lines most accurately describes the relationship between the two variables: the "best-fit" or regression line. EasyMeat Data In our example, the best-fit line is Sales = -333,831 + 50*Advertising. For this line, the y-intercept is -333,831 and the slope is 50. EasyMeat Data In general, a regression line can be described by a simple linear equation, y = a + bx, with y-intercept a and slope b. EasyMeat Data In this equation, the y-variable, sales, is called the dependent variable, to suggest that we think Julius' sales depend to some degree on his advertising. The x-variable, advertising, is called the independent variable, or the explanatory variable. EasyMeat Data When we observe that a change in the independent variable (here advertising) is typically accompanied by a proportional change in the dependent variable (here sales), regression analysis can identify and formalize that relationship. EasyMeat Data Summary Regression analysis helps us find the mathematical relationship between two variables. We can use regression to describe a linear relationship: one that can be represented by a straight line and characterized by an equation of the form y = a + bx. The Uses of Regression What kinds of questions can regression analysis help answer? How does regression help us as managers? It can help in two ways: first, it helps us forecast. For example, we can make predictions about future values of sales based on possible future values of advertising. Second, it helps us deepen our understanding of the structure of the relationship between two variables by expressing the relationship mathematically. EasyMeat Data Using Regression for Forecasting Let's talk first about how managers can use regression to forecast. In our example, regression can help Julius predict his company's sales for a specified level of advertising. For example, if he plans to spend $65,000 in advertising next year, what might we expect sales to be? If we didn't know anything about the relationship, but only had the historical data, we might simply note that the last time Julius spent $65,000 on advertising, his sales were $3,200,200. But is this the best prediction we can make? EasyMeat Data Not at all. Regression analysis brings the entire data set to bear on our prediction. In general, this will allow us to make more accurate predictions than if we infer the future value of sales from a single observation of advertising and sales. Having identified the relationship between the two variables from the full data set, we can apply our
understanding of that relationship to our forecast. EasyMeat Data Using regression analysis, we found the regression line to be Sales = -333,831 + 50*Advertising. If Julius plans to spend $65,000 in advertising, what would we predict sales to be? a. Around $3,200,000. This is not the best answer.
b. Around $2,900,000. This is the best answer.
c. Around $2,500,000. This is not the best answer.
EasyMeat Data The point on the line shows us what level of sales to expect. In this case, we would expect sales of $2,916,169. EasyMeat Data With regression, we can forecast sales for any advertising level within the range of advertising levels we've seen historically. For example, even if Julius has never spent exactly $50,000 on advertising, we can still forecast a corresponding level of sales. EasyMeat Data We must be extremely cautious about forecasting sales for values of advertising beyond the range of values we have already observed. The further we are from the historical values of advertising, the more we should question the reliability of our forecast. EasyMeat Data For example, we might feel comfortable forecasting sales for advertising levels a bit above the observed range- perhaps as high as $100,000 or $105,000. But we shouldn't infer that if Julius spent $10 million on advertising, he would achieve $500 million in sales. The total market for spreadable meat is probably much less than $500 million annually! EasyMeat Data Likewise, we might feel comfortable forecasting sales for advertising levels just below the observed range. But we certainly shouldn't report that if Julius spent $0 on advertising he would have negative sales! EasyMeat Data If we try to use our regression equation to forecast sales for advertising levels outside of the historical range, we are implicitly assuming that the relationship between advertising and sales continues to be linear outside of the historical range. EasyMeat Data In reality, although the relationship may be quite linear for the range of values we've observed, the curve may well level off for advertising values much lower or much higher than those we've observed. With no observations outside the historical data range, we simply don't have evidence about what the relationship looks like there. EasyMeat Data Another critical caveat to keep in mind is that whenever we use historical data to predict future values, we are assuming that the past is a reasonable predictor of the future. Thus,
we should only use regression to predict the future if the general circumstances that held in the past, such as competition, industry dynamics, and economic environment, are expected to hold in the future. The Structure of a Relationship Regression can be used to deepen our understanding of the structural relationship between two variables. If we think about it, many business decisions are about increasing or decreasing one variable — investments or advertising, for example — to affect some other variable — productivity, brand recognition, or profits, for example. Regression can reveal the structure of relationships of this type. Our regression analysis stipulates a linear relationship between sales and advertising. Understanding "the structure" of this relationship translates into finding and interpreting the coefficients of the regression equation. As we've noted above, the constant term -333,831 may have no real managerial significance; it just "anchors" the regression line by telling us the y-intercept. We've never seen advertising levels close to $0, so we cannot infer that spending no money on advertising will lead to sales of -$333,831! The more important term is the advertising coefficient, 50, which gives us the slope of the line. The advertising coefficient tells us how sales have changed on average as advertising has increased. In the past, when advertising has increased by $10,000, what has been the average corresponding change in sales? a. Sales have stayed flat. This is not the correct answer. Double check your numbers and try again.
b. Sales have increased by $50. This is not the correct answer. Double check your numbers and try again.
c. Sales have increased by $500,000. This is the correct answer.
Assuming that the relationship between sales and advertising is linear, each $1 increase in advertising should be accompanied by the same average increase in sales. In our example, for every incremental $1 in advertising, sales increase on average by $50. Thus, for every incremental $10,000 in advertising, sales increase on average by $500,000. The regression line gives us insight into how two variables are related. As one variable increases, by how much does the other variable typically change? How much growth in sales can we anticipate from an incremental increase in advertising expenditures? Regression analysis helps managers answer questions like these. Summary We use regression analysis for two primary purposes: forecasting and studying the structure of the relationship between two variables. We can use regression to predict the value of the dependent variable for a specified value of the independent variable. The regression equation also tells us how the dependent variable has typically changed with changes in the independent variable. Exercise 1: Soft Drink Consumption Per-capita consumption of soft drink beverages is related to per-capita gross domestic product (GDP). Generally, the higher the GDP of a country, the more soda its citizens consume. Soft drink consumption is measured in number of 8-oz servings. Based on data from 12 countries, the relationship can be expressed mathematically as:
Consumption (servings per year) = 130 + 0.018*(per-capita GDP in $) Source Based on this relationship, you can expect that, on average, for each additional $1,000 of per-capita GDP a country's soda consumption increases by: a. 180 servings. This is not the correct answer. Double check your math and try again.
b. 148 servings. This is not the correct answer. Only the slope of 0.018 should be considered when finding the incremental increase in soda consumption. c. 130 servings. This is not the correct answer. The slope of 0.018 should be considered when finding the increase in soda consumption.
d. 18 servings. This is the best answer.
The regression equation tells us that in our data set, average soda consumption increases by 0.018 servings for every additional $1 of per-capita GDP. So, for an additional $1,000, average consumption increases by ($1,000)(0.018 servings/$) = 18 servings. The per-capita GDP in the Netherlands is $25,034. What do you predict is the average number of servings of soda consumed in the Netherlands per year? Enter predicted average soda consumption (in servings) as an integer (e.g., "5"). Round if necessary. a. 580 b. 581 The regression equation tells us that average soda consumption = 130 + 0.018*(per-capita GDP). Therefore, we anticipate the Netherlands' average soda consumption to be 580.6 servings. Although the regression predicts a soda consumption of around 581 servings per person for the Netherlands, the actual measured number of servings consumed is much lower: 362. The discrepancy in the actual and predicted consumption reinforces that per-capita GDP alone is not a perfect predictor of soda consumption. Calculating the Regression Line A regression line helps you understand the relationship between two variables and forecast future values of the dependent variable. Alice points out to you that these two features of regression analysis make it a powerful tool for managers who make important decisions in the uncertain world of business. But how do you generate a regression line from observed data? Of all the straight lines that you could draw through a scatter diagram, which one is the regression line? The Accuracy of a Line Let's return to Julius Tabin's sales and advertising data. As we can see from the graph, no straight line could be drawn that would pass through every point in the data set. This is not surprising. Typically, advertising is not a perfect predictor of sales, so we don't expect every data point to fall in a perfect line. The regression line depicts the best linear relationship between the two variables. We attribute the difference between the actual data
points and the line to the influence that other variables have on sales, or to chance alone. Since the regression line does not pass through every point, the line does not fit the data perfectly. How accurately does the regression line represent the data? To measure the accuracy of a line, we'll quantify the dispersion of the data around the line. Let's look at one line we could draw through our data set. Let's consider a second line. Click on the line that more closely fits the ten data points. Although in this example we can see which of two lines is more accurate, it is useful to have a precise measure of a line's accuracy. To quantify how accurately a line fits a data set, we measure the vertical distance between each data point and the line. Why don't we measure the shortest distance between the point and the line — the distance perpendicular to the line? Why do we measure vertically? We measure vertical distance because we are interested in how well the line predicts the value of the dependent variable. The dependent variable — in our case, sales — is measured on the vertical axis. For each data point, we want to know: how close is the value of sales predicted by the line to the historically observed value of sales? From now on we will refer to this vertical distance between a data point and the line as the error in prediction or the residual error, or simply the error. The error is the difference between the observed value and the line's prediction for our dependent variable. This difference may be due to the influence of other variables or to plain chance. Going forward, we will refer to the value of the dependent variable predicted by the line as yhat and to the actual value of the dependent variable as y. Then the error is y - (y-hat), the difference between the actual and predicted values of the dependent variable. The complete mathematical description of the relationship between the dependent and independent variables is y = a + bx + error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the regression line plus the error, y - (y-hat). Collectively, the errors in prediction for all the data points measure how accurately a line fits a set of data. To quantify the total size of the errors, we cannot just sum each of the vertical distances. If we did, positive and negative distances would cancel each other out. Instead, we take the square of each distance and then sum all the squares, similarly to what we do when we calculate variance. This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good measure of how accurately a line describes a set of data. The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors. Summary To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors. To find the Sum of Squared Errors, we calculate the vertical distances from the data points to the line, square the distances, and sum the squares. Identifying the Regression Line Now that you have a way to measure how well a line fits a set of data, you need a way to identify the line that "best fits" the data: the regression line. We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines will give us different Sums of Squared Errors. The line we are looking
for — the regression line — is the one with the smallest Sum of Squared Errors. Let's look at several lines that could describe the relationship between advertising and sales in our example. Our intuition tells us that the middle line is a much better fit than line a or line b. Let's check our intuition. For each line, we can calculate the Sum of Squared Errors to determine its accuracy. The lower the Sum of Squared Errors, the more precisely the line fits the data, and the higher the line's accuracy. The line that most accurately describes the relationship between advertising and sales — the regression line — is the line that minimizes the sum of squares. Finding the regression line for a set of data is a calculation-intensive process best left to statistical software. Summary The line that most accurately fits the data — the regression line — is the line for which the Sum of Squared Errors is minimized. Performing Regression Analysis in Excel 2007 Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to do regression analysis using the regression tool. However, we suggest you read through the following instructions to learn how Excel's regression tool works, so you can run regressions in the future, when you do have access to the Data Analysis Toolpak.
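If you do not have the ToolPak, or simply want to check the idea numerically, here is a small Python sketch of the Sum of Squared Errors comparison described above. The data points and the guessed line are made up for illustration; they are not the EasyMeat figures.

import numpy as np

# Made-up (x, y) data points, standing in for any small data set
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_of_squared_errors(a, b):
    # SSE for the candidate line y = a + b*x: square each vertical error and sum
    predicted = a + b * x
    return float(np.sum((y - predicted) ** 2))

guess_sse = sum_of_squared_errors(a=1.0, b=1.5)  # an arbitrary guessed line
slope, intercept = np.polyfit(x, y, deg=1)       # least-squares (regression) line
best_sse = sum_of_squared_errors(intercept, slope)
print(round(guess_sse, 2), round(best_sse, 2))   # the regression line has the smaller SSE

No other straight line through these points produces a smaller Sum of Squared Errors than the least-squares line; that is exactly what defines the regression line.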
Performing regression analysis by hand is a time-consuming process. Fortunately, statistical software packages and major spreadsheet programs — Excel, for example — can do the necessary calculations for you in a matter of seconds. Click on the Excel link to access the data file so you can practice doing the analysis in Excel as you read through the instructions. EasyMeat Data Let's go through the process step by step. We start with data entered in two columns in an Excel spreadsheet. Each column contains values of a variable. To perform regression analysis, there must be an equal number of entries in each column.
Under the Data tab in the toolbar we select the Data Analysis option. A window pops open containing an alphabetical list of statistical tools. We select "Regression" and click "OK". A new window opens offering several options for regression analysis.
In the regression window, we see a prompt field titled "Input Y Range." In it, we enter C1:C11, the range of cells containing the column label (C1) and the data (C2:C11) for the dependent variable: Sales ($). We repeat this for the prompt field titled "Input X Range," entering B1:B11 to include both the column label (B1) and the data (B2:B11) for the independent variable: Advertising ($). Since we included the column labels in row 1 in our ranges, we must check the "Labels" box. Including labels is helpful because Excel uses the labels to identify the variable coefficients in the output sheet. If you do not include the labels in your ranges, do not check the label box, or Excel will treat the first row of data as labels, excluding those entries from the regression. Finally, we select the output option "New Worksheet Ply:", enter the name for the new worksheet, and click "OK." Excel opens a new worksheet with the name we specified. In it, we see an intimidating array of data. For the moment, we are mainly interested in the entries in the cells labeled "Coefficients", which specify the intercept and slope of the regression line. Note that the label "Advertising ($)" has been carried over from the original data column. The coefficient in the "Advertising ($)" row is the slope of the regression line. For the exercises in this unit, we strongly recommend you find the relevant data in an Excel spreadsheet and perform the regression analyses yourself. If you do not have the Analysis Toolpak, you can open a file containing the relevant regression output.
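As another option for checking Excel's output independently, the same regression can be run in a few lines of Python with NumPy. The advertising and sales arrays below are placeholders standing in for the two columns of the EasyMeat spreadsheet, not the actual data; substitute the real columns before comparing results.

import numpy as np

# Placeholder columns: replace with the Advertising ($) and Sales ($) data
advertising = np.array([30000, 35000, 40000, 45000, 50000,
                        55000, 60000, 65000, 70000, 75000])
sales = np.array([1.10e6, 1.45e6, 1.60e6, 1.95e6, 2.20e6,
                  2.35e6, 2.70e6, 2.95e6, 3.10e6, 3.40e6])

slope, intercept = np.polyfit(advertising, sales, deg=1)  # least-squares coefficients

predicted = intercept + slope * advertising
residual_ss = np.sum((sales - predicted) ** 2)   # Residual Sum of Squares
total_ss = np.sum((sales - sales.mean()) ** 2)   # Total Sum of Squares
r_squared = 1 - residual_ss / total_ss           # fraction of variation explained
print(f"Sales = {intercept:,.0f} + {slope:.2f}*Advertising, R-squared = {r_squared:.3f}")

The slope and intercept printed here correspond to the two entries in Excel's "Coefficients" column, and the R-squared value corresponds to the R Square figure near the top of the same output.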
EasyMeat Data
EasyMeat Data EasyMeat Regression Exercise 1: Soft Drink Consumption Revisited To practice using Excel's regression tool, run a regression using the world soft drink consumption data from an earlier exercise. Use soft drink consumption for the dependent variable and per capita GDP for the independent variable. Soft Drink Consumption Data What is the slope of the regression line? Enter the slope as a decimal number with 3 digits to the right of the decimal point (e.g., enter "5" as "5.000"). Round if necessary. a. 0.018 Soft Drink Consumption Data Soft Drink Consumption Regression Source We run the regression by selecting range C1:C13 for the Y-range, the dependent variable consumption, and B1:B13 for the X-range, the independent variable GDP per capita. We check the label box, and see the output below. The slope of the regression line is the coefficient of the independent variable, GDP per capita. What is the intercept of the regression line? Enter the intercept as an integer (e.g., "5"). Round if necessary. a. 129 b. 130 Soft Drink Consumption Data Soft Drink Consumption Regression Source The intercept of the line is the coefficient labeled "Intercept." Deeper into Regression Equipped with the basic tools needed to find and interpret the regression line, you feel ready to tackle Leo's assignment. But Alice cautions you not to be hasty and urges you to consider some tricky questions: "How well does the regression line actually characterize the relationship in the data? Is a straight line even a good descriptor of the relationship?" Quantifying the Predictive Power of Regression How much does the relationship between advertising and sales help us understand and predict sales? We'd like to be able to quantify the predictive power of the relationship in determining sales levels. How much more do we know about sales thanks to the advertising data? To answer this question we need a benchmark telling us how much we know about the behavior of sales without the advertising data. Only then does it make sense to ask how much more information the advertising data give us. Without the advertising data, we have the sales data alone to work with. Using no information other than the sales data, the best predictor for future sales is simply the mean of previous sales. Thus, we use mean sales as our benchmark, and draw a "mean sales line" through the data. Let's compare the accuracy of the regression line and the mean sales line. We already have a measure of how accurately an individual line fits a set of data: the Sum of Squared Errors about the line. Now we want a measure of how much more accurate the regression line is than
the mean line. To obtain such a measure, we'll calculate the Sum of Squared Errors for each of the two lines, and see how much smaller the error is around the regression line than around the mean line. The Sum of Squared Errors for the mean sales line measures the total variation in the sales data. In fact, it is the same measure of variation we use to derive the standard deviation of sales. We call the Sum of Squared Errors for our benchmark — the mean sales line — the Total Sum of Squares. Here, the Total Sum of Squares is 8.01 trillion. The Sum of Squared Errors for the regression line is often called the Residual Sum of Squared Errors, or the Residual Sum of Squares. The Residual Sum of Squares is the variation left "unexplained" by the regression. Here, the Residual Sum of Squares is 1.13 trillion. The difference between the Total Sum of Squares and the Residual Sum of Squares, 6.88 trillion in this case, is called the Regression Sum of Squares. The Regression Sum of Squares measures the variation in sales "explained" by the regression line. Excel's regression output reports all three of these terms. A standardized measure of the regression line's explanatory power is called R-squared. R-squared is the fraction of the total variation in the dependent variable that is explained by the regression line. R-squared will always be between 0 and 1 — at worst, the regression line explains none of the variation in sales; at best it explains all of it. We find R-squared by dividing the variation explained by the regression line — the Regression Sum of Squares — by the total variation in the dependent variable — the Total Sum of Squares. R-squared is presented as a fraction, a percentage, or a decimal. We find that in the advertising and sales example, the R-squared value is 6.88 trillion/8.01 trillion = 0.859 = 85.9%. An equivalent approach to computing R-squared is somewhat less intuitive but more common. In this approach we first find the fraction of the total variation in the dependent variable that is NOT explained by the regression line: we divide the Residual Sum of Squares by the Total Sum of Squares. Then we subtract the fraction of unexplained variation from 1 to obtain R-squared. Fortunately, we don't need to calculate R-squared ourselves — Excel computes R-squared and includes it in the standard regression output. In a regression that has only one independent variable, R-squared is closely related to the correlation coefficient between the independent and dependent variables: the correlation coefficient is simply the positive or negative square root of R-squared; positive if the slope of the regression line is positive and negative if the slope of the regression line is negative. Excel's regression output always reports the positive square root of R-squared, which it labels "Multiple R." Summary R-squared measures how well the behavior of the independent variable explains the behavior of the dependent variable. R-squared is the ratio of the Regression Sum of Squares to the Total Sum of Squares. As such, it tells us what proportion of the total variation in the dependent variable is explained by its linear relationship with the independent variable. Residual Analysis Although the regression line is the line that best fits the observed data, the data points typically do not fall precisely on the line. Collectively, the vertical distances from the data to the line — the errors — measure how well the line fits the data. These errors are also known
as residuals. A careful study of the residuals can tell us a lot about a regression analysis and the validity of the assumptions we base it on. For example, when we run a regression, we assume that a straight line best describes the relationship between our two variables. In fact, sometimes the relationship may be better described by a curve. In this graph, we can clearly see a negative trend. If we run a regression on these data, we find a relatively high R-squared. How do we use the residuals to check our assumption that the relationship is linear? First, we measure the residuals: the distance from the data points to the regression line. Then we plot the residuals against the values of the independent variable. This graph — called a residual plot — helps us identify patterns in the residuals. We can recognize a pattern in the residual plot: a curve. This pattern strongly indicates that a straight line is not the best way to express the relationship between the variables: a curve would be a much better fit. A residual plot is often better than the original scatter plot for recognizing patterns because it isolates the errors from the general trend in the data. Residual plots are critical for studying error patterns in more advanced regressions with multiple independent variables. If the only pattern in the dependent variable is accounted for by a linear relationship with the independent variable, then we should see no systematic pattern in the residual plot. The residuals should be spread randomly around the horizontal axis. In fact, the distribution of the residuals should be a normal distribution, with mean zero, and a fixed variance. Residuals are called homoskedastic if their distributions have the same variance. If we see a pattern in the distribution of the residuals, then we can infer that there is more to the behavior of the dependent variable than what is explained by our linear regression. Other factors may be influencing the dependent variable, or the assumption that the relationship is linear may be unwarranted. We've already seen the pattern of a curved relationship. What other patterns might we see? Let's look at this scatter diagram and its corresponding residual plot. The residuals appear to be getting larger for higher values of the independent variable. This phenomenon is known as heteroskedasticity. Residual analysis reveals that the distribution of the residuals changes with the independent variable: the variance increases as the independent variable increases. Since the variance of the residuals — which contributes to the variation of the dependent variable — is affected by the behavior of the independent variable, we can conclude that there must be more to the story than just the linear relationship. There are a number of other assumptions about regression whose validity can be tested by performing a residual analysis. Although interesting, these uses of residual analysis are beyond the scope of this course. Summary A complete regression analysis should include a careful inspection of the residuals. Plot the residuals against the independent variable to reveal patterns in the distribution of the residuals. Graphing Regression Lines and Residuals in Excel 2007 Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able
to perform residual analysis using the regression tool. However, we suggest you read through the instructions to learn how Excel's regression tool works, so you can perform residual analysis in the future, when you do have access to the Data Analysis Toolpak. To study patterns in residuals and other regression data, visual representations can be very helpful. To plot a regression line we first generate a scatter diagram of the data we are studying. Once we have the scatter diagram we add the regression line by right-clicking on any one of the data points. A menu will pop up. From that menu, we select "Add Trendline." In the Trendline Options Menu, we select "Linear" under Trend/Regression Type. If we wish to display the regression equation and R-squared value on the same scatter plot, we can select the check boxes next to those items at the bottom and press "Close." The scatter diagram will now have been augmented by the regression line, the regression equation, and the R-squared value. This is a quick way to perform a simple regression analysis from a scatter plot, though it doesn't provide all of the output we'll want in order to review the results thoroughly. Residual analysis is an option in Excel's Regression tool. To calculate the residuals and generate the residual plot, we select "Residual Plots" in the Residuals section of the Regression Menu and click "OK." The residuals and their plot appear in the regression output. The Significance of Regression Coefficients R-squared and the residuals are not the only things to keep an eye on when running a regression. The regression line is the line that best fits our observed data. But are we sure that the regression line depicts the "true" linear relationship between the variables? In fact, the regression line is almost never a perfect descriptor of the true linear relationship between the variables. Why? Because the data we use to find the regression line typically represent only a sample from the entire population of data pertaining to the relationship. Returning to our spreadable lunchmeat example, advertising and sales data for a different set of years would likely have given us a different line. Since each regression line comes from a limited set of data, it gives us only an approximation of the "true" linear relationship between the variables. As we do in sampling, we'll use Greek letters to denote parameters describing the population, and Latin letters to denote the estimates of those parameters based on sample data. We'll use alpha and beta to represent the coefficients of the "true" linear relationship in the population, and a and b to represent our estimates of alpha and beta. When we calculate the coefficients of a regression line from a set of observed data, the value a is an estimate of alpha, the intercept of the "true" line. Similarly, the value b is an estimate of beta, the slope of the true line. Like all estimates based on sample data, the calculated estimate for a coefficient is probably different from the true coefficient. Just as in sampling, to find a range of likely values for the true coefficient, we construct confidence intervals around each estimated coefficient. Excel's output gives us 95% confidence intervals for each coefficient by default. The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 95% confident that the slope of the true line is between 33.5 and 66.5. We can specify any other confidence level when setting up the regression.
We might be interested in a 99% confidence interval for the slope of the EasyMeat regression line.
The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 99% confident that the slope of the true line is between 25.9 and 74.1. Testing for a Linear Relationship Since we don't know the exact value of the slope of the true advertising line, we might well question whether there actually IS a linear relationship between advertising and sales. How can we assure ourselves that the pattern we see in the sample data is not simply due to chance? If there truly were no linear relationship, then in the full population of relevant data, changes in advertising would not correspond systematically to proportional changes in sales. In that case, the slope of the best-fitting line for the true relationship would be zero. There are two quick ways to test if the slope of the true line might be zero. One is simply to look at the confidence interval for the slope coefficient and see if it includes zero. The 95% confidence interval for the slope in our case, [33.5, 66.5], does not include zero, so we can reject with 95% confidence the hypothesis that the slope of the true line is zero. We can say that we are 95% confident that there really is a linear relationship between advertising and sales. Alternatively, we can test how likely it is that the slope really is zero using a hypothesis test. We formulate the null hypothesis that there is no linear relationship: the true slope, beta, of the regression line is zero. Then we calculate the p-value: the likelihood of having a slope at least as far from zero as the slope calculated, b, assuming that the slope is in fact zero. In our example, the p-value tells us the likelihood of having a slope as far from zero as 50 if the true slope were in fact zero. Fortunately, Excel saves us the trouble of calculating the p-value and reports it along with the regression output. This p-value tells us exactly what we want to know: if there really is no linear relationship between the variables, how likely is it that our sample would have given us a slope as large as 50? In our example, the p-value for the advertising coefficient is 0.0001, indicating that we can be confident at the 99.99% level that beta, the slope of the true line, is different from zero. Thus, we are confident at the 99.99% level that there really is a linear relationship between advertising and sales. In general, if the p-value for a slope coefficient is less than 0.05, then we can reject the null hypothesis that the slope beta is zero, and conclude with 95% confidence that there is a linear relationship between the two variables. Moreover, the smaller the p-value, the more confident we are that a linear relationship exists. If the p-value for a slope coefficient is greater than 0.05, then we do not have enough evidence to conclude with 95% confidence that there is a significant linear relationship between the variables. Additionally, p-values can be used to test hypotheses about the intercept of a regression line. In our example, the p-value for the intercept is 0.498, much larger than 0.05. However, since the intercept simply anchors the regression line, whether or not it is zero is not particularly important. We won't use the standard error of the coefficient in this course, but will simply point out that the standard error is similar to a standard deviation of our distribution for the coefficient. Thus, the 95% confidence interval is approximately 50 +/- 2*(7.2). It's not exact, because the distribution isn't quite normal, but the concept is very similar.
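For readers who want to see where these intervals come from, here is a small sketch using the slope estimate of 50 and the standard error of 7.2 quoted above. The degrees of freedom, n - 2 = 8, are an assumption based on the ten observations in the EasyMeat data.

from scipy import stats

b, se, df = 50.0, 7.2, 8    # slope estimate, its standard error, degrees of freedom (assumed n = 10, so n - 2 = 8)

for level in (0.95, 0.99):
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)     # two-sided critical t value
    print(f"{level:.0%} confidence interval: [{b - t_crit * se:.1f}, {b + t_crit * se:.1f}]")
# Prints roughly [33.4, 66.6] and [25.8, 74.2], matching Excel's reported intervals
# up to rounding of the standard error.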
We won't use the t-stat in this course either, since the p-value gives the equivalent information in a more readily usable form. The t-stat tells us how many standard errors the coefficient is from the value zero. Thus, if the t-stat is greater than 2, we are quite sure
(approximately 95% confident) that the true coefficient is not zero. Summary The slope and intercept of the regression line are estimates based on sample data: how closely they approximate the actual values is uncertain. Confidence intervals for the regression coefficients specify a range of likely values for the regression coefficients. Excel reports a p-value for each coefficient. If the p-value for a slope coefficient is less than 0.05 we can be 95% confident that the slope is nonzero, and hence that there is a linear relationship between the independent and dependent variables. Revisiting R-squared and p It is important not to confuse the p-value for a coefficient with the R-squared for the regression. R-squared tells us what percentage of the variation seen in the dependent variable is explained by its relationship with the independent variable. The p-value tells us the likelihood that there is no real relationship between the dependent and independent variables - that the true coefficient of the line in the full population is zero. Let's look at an example that illustrates the difference in these two measures. This scatter diagram shows data tightly clustered around the regression line. R-squared is very high — very little variation in Y is left unexplained by the line. The p-value on the coefficient is very low — there is clearly a significant linear relationship between X and Y. The second scatter diagram has a much lower R-squared — a lot of variation in Y is left unexplained by the line. But there is no question that there is a significant relationship between X and Y, so the p-value on the coefficient is again very low. Here, the linear relationship may not explain as much variation as it does in the upper graph, but the relationship is clearly significant. A high p-value indicates a lack of confidence in the underlying linear relationship. We would not expect a questionable relationship to explain much variation. So if we have only one independent variable and it has a high p-value, we would not expect to find a very high R-squared. However, as we'll see shortly, the story becomes more complex when we have two or more independent variables. So far, we haven't discussed how sample size affects the accuracy of a regression analysis. The larger the sample we use to conduct the regression analysis, the more precise the information we obtain about the true nature of the relationship under investigation. Specifically, the larger the sample, the better our estimates for the slope and the intercept, and the tighter the confidence intervals around those estimates. Let's look at how sample size affects p-values in an example. This scatter plot is based on a large sample size — 50 observations. With a p-value of zero we are extremely confident that there is a linear relationship between X and Y, and our confidence interval for the slope coefficient is quite tight. The second scatter plot is based on a sample size of only 10. The p-value rises to 0.07, so we cannot be 95% confident that there is a linear relationship between X and Y. Our lack of confidence in our estimate is also evident in the wide confidence interval for the slope coefficient. Summary The p-value and R-squared provide different information. A linear relationship can be significant but not explain a large percentage of the variation, so having a low p-value does not ensure a high R-squared. Sample size is an important determinant of regression accuracy: as with all sampling, larger samples give more accurate estimates.
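A short simulation, using made-up numbers rather than any data from the tutorial, illustrates the same point: the identical underlying relationship can produce a tiny p-value with 50 observations and an inconclusive one with only 10.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_once(n):
    # Hypothetical relationship y = 2 + 0.5x plus noise, sampled with n observations.
    x = rng.uniform(0, 10, n)
    y = 2 + 0.5 * x + rng.normal(0, 3, n)
    return stats.linregress(x, y)

for n in (50, 10):
    result = run_once(n)
    print(f"n = {n:2d}: slope = {result.slope:.2f}, p-value = {result.pvalue:.3f}")
# With 50 observations the p-value is typically far below 0.05; with only 10,
# it is often too large to reject the hypothesis that the true slope is zero.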
Solving the Staffing Problem Aware of many of the subtleties of basic regression, you are ready to turn to Leo's staffing problem.
The Kahana has 500 guest rooms. Over the last three years, average yearly occupancy has varied a good deal: my record low was 250 rooms occupied, and my high was 484. That range of variation makes staff planning difficult. I'd like to be able to predict the Kahana's occupancy one month in advance. So far, I've been making educated guesses about the level of occupancy one month in the future based on the number of advance bookings. I take the number of bookings and add another 50%, to be on the safe side. Obviously, I haven't done very well with that method. You need to find out more precisely what the relationship is between advance bookings and occupancy. Alice asks you to run a regression analysis on Leo's occupancy and advance bookings data. Bookings and Occupancy Data The results of your analysis tell you that, for every 100 additional advance bookings, a. Leo can expect about 269 additional guests. This is not the best answer. Only the slope coefficient should be considered when finding the incremental number of guests. b. Leo can expect about 45 additional guests. This is not the best answer. You may be confusing the dependent and independent variables. The independent variable in this case is the number of advance bookings.
c. Leo can expect about 86 additional guests. This is the best answer. The regression output gives a slope of 0.86. So for each additional 100 advance bookings, Leo can expect about 86 additional guests.
d. Leo can expect about 183 additional guests. This is not the best answer. In the regression output, be sure you are looking at the slope of the regression line, not its intercept.
Bookings and Occupancy Data Bookings and Occupancy Regression When you run the regression analysis, be sure you choose the dependent and independent variables correctly. We are using advance bookings to predict occupancy, suggesting that we believe that occupancy levels "depend" on advance bookings. Thus, the occupancy is the dependent variable, and the number of advance bookings is the independent variable. In the regression output, find the coefficient of the slope of the regression line. This tells you how many additional guests Leo can expect for each additional advance booking. Based on these data, can you be 95% confident that the slope of the regression line is not 0? a. No, because the R-squared is lower than 0.95. This is not the best answer. R-squared does not directly measure the significance of the linear relationship - it tells us how much of the variation in occupancy is explained by advance bookings.
b. Yes, because the p-value of the slope coefficient is less than 0.05. This is the best answer. If the p-value of the slope coefficient is less than 0.05, you can be 95% confident that the slope of the regression line is not 0. In other words, there is a significant linear relationship between advance bookings and occupancy.
c. No, because the p-value of the slope coefficient is less than 0.05.
This is not the best answer. A p-value less than 0.05 tells us that we can be 95%
confident that the slope is not 0.
d. Yes, because the slope is larger than 0.5. This is not the best answer. The value of the slope by itself does not tell us anything about the likelihood of the slope not being 0.
Bookings and Occupancy Data Bookings and Occupancy Regression How much of the variation in occupancy is explained by the variation in the number of advance bookings? a. about 62% This is not the correct answer. 62% is the correlation between the two variables.
b. about 39% This is the best answer. R-squared — the statistic that measures the explanatory power of the independent variable — is easily located in the regression output. c. about 15% This is not the best answer. The regression output value labeled R-squared by itself measures the percentage of variation explained by the independent variable - it does not need to be squared or manipulated in any way to give you the information you need. d. over 99.99% This is not the best answer. You may be confusing R-squared and the p-value. The p-value tells us that we are more than 99.99% confident that the relationship is significant. It does not measure the percent of variation in occupancy explained by advance bookings.
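As a quick arithmetic check of how the first distractor relates to the correct answer: in a simple regression the correlation coefficient is the signed square root of R-squared, so a correlation of roughly 0.62 corresponds to an R-squared of roughly 0.62 x 0.62 = 0.38, or about 39% once rounding in the reported figures is taken into account.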
Bookings and Occupancy Data Bookings and Occupancy Regression You and Alice present your findings to Leo. So there is a positive relationship between advance bookings and occupancy. But the power of advance bookings to predict occupancy is pretty small. More than half of the variation in your room occupancy is due to other factors. I suppose the mathematical relationship you found will help me make slightly more informed staffing choices. But mostly I'll still be stumbling in the dark. Not so fast Leo: We may be able to identify other factors linked to occupancy, and use them to come up with an improved forecasting model. Why don't we meet tomorrow and discuss what other factors might help us predict occupancy? Alright. In the meantime, I'll be doing my best to appease Mr. Pitt. He's talking about filing suit against me. He says he was burned. Burned by a bisque! Exercise 1: Inventories and Capacity Utilization You have been asked to examine the relationship between two important macroeconomic quantities: the change in business inventories and factory capacity utilization levels.
If you wish to learn what percent of the variation in changes in inventories is explained by capacity utilization levels, which variable should you choose as your independent variable? a. Change in business inventories This is not the best answer. Variation in the independent variable explains changes in the dependent variable.
b. Capacity utilization This is the correct answer. Variation in the independent variable explains changes in the dependent variable.
Click here to access US economic data from the years 1971-1986. Run the regression with the change in business inventories as the dependent variable and the capacity utilization as the independent variable. Source Using the regression output, find the slope of the regression line. Enter the slope as a decimal number with 2 digits to the right of the decimal point (e.g., enter "5" as "5.00"). Round if necessary. a. 0.10 Inventories and Capacity Data Inventories and Capacity Regression Exercise 2: Clever Solutions Greta John is the human resources manager at the software consulting firm Clever Solutions. Recently, some of the programmers have been restless: the senior programmers feel that length of service and loyalty have not been rewarded in their compensation. The junior programmers think that seniority should not be a major basis for pay. Greta wants some hard data to inform the debate. As a preliminary step, she plots employees' salaries against their length of service. Using the data provided, perform a regression analysis with salary as the dependent variable and length of service as the independent variable. Clever Solutions Data What is the average increase in a Clever Solutions programmer's salary per year of service? a. About $97/year of service. This is not the correct answer. The regression reports the relationship between salary and months of service, whereas you have been asked to find the increased salary for an additional year of service.
b. About $1,165/year of service. This is the correct answer. The average salary increase per year of service is the slope of the regression line multiplied by the number of months in a year.
c. About $36,669/year of service. This is not the correct answer. Only the slope coefficient should be considered when finding the average salary increase per year of service.
d. About $37,737/year of service. This is not the correct answer. Only the slope coefficient should be considered when finding the average salary increase per year of service.
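As a rough check of the arithmetic behind the best answer, assuming the regression reports a slope of roughly $97 per month of service: multiplying that monthly slope by 12 months gives approximately $1,165 per additional year of service (the exact figure depends on the unrounded coefficient in the regression output).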
Clever Solutions Data Clever Solutions Regression Based on the regression analysis, Greta can tell that:
a. Approximately 65% of the variation in compensation can be explained by length of service. This is not the correct answer. R-squared measures how much of the variation in the dependent variable — compensation — is explained by the independent variable — length of service. You may be confusing the correlation coefficient with R-squared. b. Approximately 42% of the variation in compensation can be explained by length of service. This is the correct answer. R-squared measures how much of the variation in the dependent variable — compensation — is explained by the independent variable — length of service. In this case, R-squared is about 42%. c. Approximately 0.05% of the variation in compensation can be explained by length of service. This is not the correct answer. R-squared measures how much of the variation in the dependent variable — compensation — is explained by the independent variable — length of service. You may be confusing the p-value with R-squared. d. Approximately 99.95% of the variation in compensation can be explained by length of service. This is not the correct answer. R-squared measures how much of the variation in the dependent variable — compensation — is explained by the independent variable — length of service. You may be confusing the p-value with R-squared.
Clever Solutions Data Clever Solutions Regression Exercise 3: Productivity and Compensation Productivity measures a nation's average output per labor-hour. It is one of the most closely watched variables in economics: as workers produce more per hour, employers can pay them more without increasing the price of the product. Since wages can rise without provoking a corresponding rise in consumer prices, the growth of productivity is essential to a real increase in a nation's standard of living. Peter Agarwal, a student at the Harvard Business School, wants to investigate the relationship between change in productivity and change in real hourly compensation. Peter has data on change in productivity and change in compensation for 8 industrialized nations. The figures are annual averages over the period from 1979 to 1990. Productivity and Compensation Data Source Run a regression with change in compensation as the dependent variable and change in productivity as the independent variable. Productivity and Compensation Data Source How much of the variation in the change in compensation can be explained by the change in productivity? Enter the percentage as a decimal number with 2 digits to the right of the decimal point (e.g., enter ''50%'' as ''0.50''). Round if necessary. a. 0.46
b. .46 Productivity and Compensation Data Productivity and Compensation Regression Given these data, Peter finds that the relationship can be mathematically expressed as: Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant? a. Yes This is not the correct answer.
b. No This is the correct answer.
c. The answer can't be determined from the regression analysis. This is not the correct answer. The regression output gives you the information you need to solve this problem.
Productivity and Compensation Data Productivity and Compensation Regression The coefficient for the slope given by Excel is an estimate based on the data in Peter's sample. The estimate for the slope of the regression line is about 0.75. If the actual slope of the relationship is 0, there is no significant linear relationship between the change in productivity and the change in compensation. On the regression output, there are two ways to tell if the slope coefficient is significant at the 0.05 level. First, we can look at the 95% confidence interval provided and see that it ranges from -0.05 to +1.56. Since the 95% confidence interval contains zero, the coefficient is not significant at the 0.05 level. Alternatively, we can note that the p-value of the slope coefficient, 0.0625, is greater than 0.05. Peter cannot be 95% confident that the actual slope is not 0. Since Peter cannot be confident that the slope is not zero, he cannot be confident that there is a linear relationship between the two variables. Suppose Peter collects data on 8 more countries. Run the regression for the entire data set with change in compensation as the dependent variable and change in productivity as the independent variable. Expanded Productivity and Compensation Data The new data set indicates that the variables have a slightly different relationship: Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant? a. Yes This is the correct answer. The p-value for the slope in this regression is less than 0.05: it is highly unlikely that the actual slope is 0.
b. No This is not the correct answer. Note that the p-value for the slope in the regression output is less than 0.05.
Expanded Productivity and Compensation Data Expanded Productivity and Compensation Regression What might explain why the coefficient is significant in the second (combined) data set?
a. The total compensation of the countries in the combined data set is larger than the total compensation of the countries in the first set. This is not the best answer. The compensation in the countries has no bearing on the significance of the data. The countries are being considered as individual data points in the data sets.
b. The total population of the countries in the combined data set is larger than the total population of countries in the first set. This is not the best answer. The size of the country populations has no bearing on the significance of the data. The countries are being considered as individual data points in the data sets.
c. The number of countries in the combined data set is larger than in the first set. This is the best answer. The countries are being considered as individual data points in the data sets. Thus the number of countries in each data set is that set's sample size. With a larger sample size, the confidence interval for the slope coefficient is narrower, and hence less likely to contain zero. Thus, the p-value is smaller, indicating a higher likelihood of the slope estimate being statistically significant.
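To see how the reported figures hang together, here is a minimal sketch that backs the standard error out of the 95% confidence interval and recovers a p-value close to 0.0625. The degrees of freedom, n - 2 = 6, are an assumption based on the 8 countries in Peter's original sample.

from scipy import stats

slope, ci_low, ci_high, df = 0.75, -0.05, 1.56, 6    # 8 countries, so n - 2 = 6 degrees of freedom (assumed)

t_crit = stats.t.ppf(0.975, df)              # two-sided 95% critical t value
se = (ci_high - ci_low) / (2 * t_crit)       # standard error implied by the confidence interval
t_stat = slope / se                          # how many standard errors the slope is from zero
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value

print(f"standard error = {se:.3f}, t-stat = {t_stat:.2f}, p-value = {p_value:.3f}")
# The p-value comes out near 0.06 -- above 0.05, consistent with a confidence
# interval that contains zero.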
Expanded Productivity and Compensation Data Expanded Productivity and Compensation Regression Multiple Regression Introduction After brainstorming for several hours, you and Alice devise a plan to improve your regression analysis of the staffing problem, to help Leo better predict the Kahana's occupancy. The Staffing Problem (II) Good morning. I hope you had a good night's sleep and have come up with some ideas on how to better predict my occupancy. For my part, I slept horribly last night. This bisque lawsuit is taking years off my life. I'm sorry to hear that. I'm afraid we don't have any legal advice to give you. We do have some ideas about how to improve your ability to forecast your hotel's occupancy. We can do better than explaining 39% of the variation in occupancy as we did with our earlier regression using advance bookings as the independent variable. Remember when we first arrived, we analyzed the relationship between Kauai's average hotel occupancy rates and arrivals on the island? We found a fairly strong correlation between occupancy and arrivals: 71%. Source I took a look at the relationship between arrivals on Kauai and your hotel's occupancy numbers. In the regression of Kahana occupancy versus arrivals, arrivals explain 80% of the variation in occupancy. That's much better than the 39% explained by advance bookings. Bookings and Occupancy Data Wait a minute. Those numbers add up to more than 100%. It seems like your regression technique explains more variation than there is! Good observation, Leo. The numbers don't add up. The reason is that there is a statistical relationship between arrivals and advance bookings that we aren't taking into account when we run the two regressions separately. We intend to find one equation that incorporates the data on both arrivals and advance bookings and takes into account the relationship between them. We'll also investigate the impact of other factors, such as the business practices of your competitors. Who is your main competitor in the area? That would be the Hotel Excelsior. Its manager, Knut Steinkalt, is a real cut-throat. He's
always offering special promotions that undercut my room prices. Fortunately, the Excelsior, though very luxurious, is not nearly as inviting as the Kahana. That place feels like an undertaker's parlor! I've been able to keep ahead of old Knut by offering a better product. We'll study the Excelsior's promotions, and see if they've had a significant influence on the Kahana's occupancy. Thanks. Let me know as soon as you have some results. I have to warn you, though, I may be out: I'm going to see my lawyers in Honolulu this week to discuss Mr. Pitt's bisque lawsuit. What a mess! Introducing Multiple Regression "Most management problems are too complex to be completely described by the interactions between only two variables," Alice tells you. "Incorporating multiple independent variables can give managers a more accurate mathematical representation of their business." When buying a new home, we know that the house's size influences its selling price. All other factors being equal, we expect that the larger the home, the higher its price. We can gain a better sense of this relationship by graphing data on price and house size in a scatter diagram, and running a regression with price as the dependent variable and house size as the independent variable. We'll use data on 15 recent home purchases in the town of Silverhaven. Silverhaven Real Estate Data We can tell from the value of R-squared, 26%, that the relationship is fairly weak: house size variation explains only about a quarter of the variation in price. Silverhaven Real Estate Data Silverhaven Real Estate Regressions This should come as no surprise: house size is not the only variable affecting the price of a home. There are at least three other important factors, as any real estate agent will tell you: One facet of a desirable location is a low commuting time, so we'd expect average commuting time to be related to price. To study this relationship, we'll use a proxy variable: the house's distance from the business center in downtown Silverhaven. A proxy variable is a variable that is closely correlated to the variable we want to investigate, but typically has more readily available data. The data on distance for the same 15 houses reveals a negative relationship: the farther away from downtown, the less expensive houses tend to be. Again, the strength of the relationship is relatively weak: R-squared for the regression of price versus distance is 37%. Silverhaven Real Estate Data Silverhaven Real Estate Regressions We might be tempted to think that, if house size explains 26% of the price of a house and distance explains 37%, then the two variables together would explain 63% of the price. But what if there is a relationship between the two independent variables, house size and distance? In fact, the correlation coefficient of house size and distance is 31%. As we might expect, there is a positive relationship between the two variables: as we move farther from the city center, houses tend to be larger. How should we factor this relationship into our analysis of how house size and distance affect the price of a house? Rather than considering each individual relationship, we need to find a way to express the three-way relationship among all three variables: price, house size, and location. Instead of two separate equations describing price, each with a different independent variable, we need
one equation that includes both independent variables. We find this three-way relationship using multiple regression. In multiple regression, we adapt what we know about regression with one independent variable — often called simple regression — to situations in which we take into account the influence of several variables. Thinking about several variables simultaneously can be quite challenging. Given only the price of a home, we cannot make inferences about its size and location. A $500,000 home might be a mansion in the countryside, a modest house in the suburbs, or a cozy cardboard box on the corner of 1st and Main. Graphing data on more than two variables poses its own set of difficulties. Three variables can still be represented, but beyond that, visualization and graphical representation become essentially impossible. Although we can carry over many of the central ideas behind simple regression to the multivariate case, we'll have to consider several interesting complications. As managers, almost any quantity we wish to study will be influenced by more than one variable: to construct an accurate model of a business' dynamics, we'll usually need several variables. Multiple regression is an essential and powerful management tool for analyzing these situations. Incorporating more than one independent variable into your analysis of the Kahana's occupancy sounds like a good idea. But how do you adapt regression analysis to accommodate multiple variables? Summary Multiple regression is an extension of simple regression that allows us to analyze the relationships between multiple independent variables and a dependent variable. Relationships among independent variables complicate multivariate regression. With more than two independent variables, graphing multivariable relationships is impossible, so we must proceed with caution and conduct additional analyses to identify patterns. Incorporating more than one independent variable into your analysis of the Kahana's occupancy sounds like a good idea. But how do you adapt regression analysis to accommodate multiple variables? Adapting Basic Concepts "Multiple regression equations look very similar to regression equations with only one independent variable," Alice explains. "But be careful - you have to interpret them slightly differently." Interpreting the Multiple Regression Equation In our home price example, we found two regression equations, one for the relationship between price and house size, and one for the relationship between price and distance. What will the equation for the three-way relationship between price, the dependent variable, and the two independent variables, house size and distance, look like? The regression equation in our housing example will have the form below: house size and distance each have their own coefficients, and they are summed together along with the constant coefficient a. In general, the linear equation for a regression model with k different variables has the form below. Since the coefficients we obtain from the data are just estimates, we must distinguish between the idealized equation that represents the "true" relationship and the regression line that estimates that relationship. To express that even the "true" equation does not fit perfectly, we include an error term in the idealized equation. Running the regression gives us coefficients for house size and distance: 252 and -55,006, respectively. We can use this multiple regression equation to predict the price of other houses
not in our data set. To predict a house's price, we need to know only its size and its distance to downtown. Silverhaven Real Estate Data Silverhaven Real Estate Regressions Suppose "Windsor" is a modest mansion of 3,500 square feet, located in the outer suburbs of Silverhaven, approximately 11 miles from downtown. Based on our regression equation, how much would we expect Windsor to sell for? a. Around $277,000 That is not the best answer. The correct answer is $699,938.
b. Around $330,000 That is not the best answer. The correct answer is $699,938.
c. Around $700,000 That is the best answer.
d. Around $771,000 That is not the best answer. The correct answer is $699,938.
Silverhaven Real Estate Data Silverhaven Real Estate Regressions We simply enter Windsor's square footage and distance to downtown into the equation, and calculate an expected selling price of $699,938. Let's take a closer look at the coefficients in the housing example, focusing on the distance coefficient: -55,006. This coefficient is substantially different from the coefficient in the original simple regression: -39,505. Why is it so different? The coefficients in the simple regression and the coefficients in the multiple regression have very different meanings. In the simple regression equation of price versus distance, we interpret the coefficient, -39,505, in the following way: for every additional mile farther from downtown, we expect house price to decrease by an average of $39,505. We describe this average decrease of $39,505 as a gross effect - it is an average computed over the range of variation of all other factors that influence price. In the multiple regression of price versus size and distance, the value of the distance coefficient, -55,006, is different, because it has a different meaning. Here, the coefficient tells us that, for every additional mile, we should expect the price to decrease by $55,006, provided the size of the house stays the same. In other words, among houses that are similarly sized, we expect prices to decrease by $55,006 per mile of distance to downtown. We refer to this decrease as the net effect of distance on price. Alternatively, we refer to it as "the effect of distance on price controlling for house size". Two houses are similar in size, but located in different neighborhoods: "Shangri La" is five miles farther from downtown than "Xanadu." If Xanadu is valued at $450,000, how much would we expect Shangri La to cost? a. Around $175,000 That is the correct answer.
b. Around $252,000 That is not the correct answer. The correct answer is $450,000 - $55,006/mile * 5 miles, which is $174,970.
c. Around $725,000 That is not the correct answer. The correct answer is $450,000 - 5*$55,006/mile which is about
$174,970.
d. The answer cannot be determined from the information provided. That is not the correct answer. All the information needed to make a prediction of Shangri La's selling price is present.
Silverhaven Real Estate Data Silverhaven Real Estate Regressions Since the two houses are the same size, we use the net effect of distance on price, -$55,006/mile, to predict the expected difference in their selling prices. Shangri La is 5 additional miles from downtown, so its price should be $55,006/mile * 5 miles = $275,030 less than Xanadu's, or $450,000 - $275,030 = $174,970. "Valhalla" is another house located 5 miles farther from downtown than Xanadu. We have no information about the relative sizes of the two homes. If Xanadu's selling price is $450,000, what would we expect Valhalla's selling price to be? a. Around $175,000 That is not the correct answer. Since the houses are not known to be the same size, you should look at the gross effect of distance on price.
b. Around $252,500 That is the correct answer. Since the houses are not known to be the same size, you should look at the gross effect of distance on price.
c. Around $647,500 That is not the correct answer. Valhalla is farther from the central city than Xanadu, not closer.
d. The answer cannot be determined from the information provided. That is not the correct answer. All the information needed to make a prediction of Valhalla's selling price is present.
Silverhaven Real Estate Data Silverhaven Real Estate Regressions Since we cannot assume that the sizes of the two houses are equal, we should not control for size. Thus we use the gross effect of distance on price, -$39,505/mile, to predict the expected difference in the two homes' selling prices. Valhalla is 5 additional miles from downtown, so its price should be $39,505/mile * 5 miles = $197,525 less than Xanadu's, or $450,000 - $197,525 = $252,475. Let's try to build our intuition about the difference in the distance coefficients in the simple and multiple regressions. The coefficients are different because they have different meanings. But what exactly accounts for the drop from -39,505 to -55,006? In the multiple regression, by essentially considering only houses that are of equal size, we separate out the effect of house size on price. We are left with a distance coefficient that is net relative to house size. In the simple regression, the gross effect of distance, -$39,505/mile, represents an average over the range of house sizes. As such, it also captures some of the effect that house size has on price. Let's take a closer look at the distance and house size data. Calculating the correlation coefficient between sizes of homes and their distances from downtown Silverhaven, we see that there is a slight positive relationship, with a correlation coefficient of 31%. In other words, as we move farther from downtown, houses tend to be larger. We have seen that two things happen as we move farther from downtown — housing prices drop because the commute is longer, and house size increases. The fact that house size increases with distance complicates the pricing story, because larger houses tend to be more expensive.
Longer distances from downtown translate into two different effects on price. One effect of distance on price is negative: as distance increases, commute times increase and prices drop. A second effect of distance on price is positive: as distance increases, house size increases, and larger houses correspond to higher prices. Running a multiple regression with both size and distance as independent variables helps tease out these two separate effects. When we control for house size, we see the net effect of distance on price: prices drop by $55,006 per additional mile. When we don't control for house size, the effect of distance alone on price is confounded by the fact that house size tends to rise as distance increases. The "real" effect of distance on price is diminished by the relationship between price and house size. When we look at the net relationship between distance and price, we consider only similarly sized houses. Now we assume that as distance from downtown grows, house size stays the same. If house size didn't increase as we moved farther out, prices would drop more sharply: by $55,006 rather than $39,505 per additional mile. Let's analyze the house size coefficient in a similar fashion. In the multi-variable regression model of home prices, the house size coefficient is net relative to distance. The coefficient of 252 tells us to expect prices of homes equally distant from downtown to increase by an average of $252 for each additional square foot of size. The gross effect of house size on price is $167 per square foot, considerably less than $252, the net effect. When we do not control for distance, house size "picks up" some of the negative effect of distance: the fact that larger houses are typically located farther from the city offsets some of the effect of increased house size. We should always be careful to interpret regression coefficients properly. A coefficient is "net" with respect to all variables included in the regression, but "gross" with respect to all omitted variables. An included variable may be picking up the effects on price of a variable that is not included in the model — school district, for example. Finally, we should note that these coefficients are based on sample data, and as such are only estimates of the coefficients of the true relationship. For each independent variable, we must inspect its p-value in the regression output to make sure that its relationship with the dependent variable is significant. Since the p-value is less than 0.05 for both house size and distance to downtown, we can be 95% confident that the true coefficients of the two independent variables are not zero. In other words, we are confident that there are linear relationships between each independent variable and house price. There are four steps we should always follow when interpreting the coefficients of an independent variable in a multiple regression: Summary We use multiple regression to understand the structure of relationships between multiple variables and a dependent variable, and to forecast values of the dependent variable. A coefficient for an independent variable in a regression equation characterizes the net relationship between the independent variable and the dependent variable: the effect of the independent variable on the dependent variable when we control for the other independent variables included in the regression.
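A small sketch pulls the Silverhaven numbers together. The coefficients (252 per square foot, -55,006 per mile) and the gross distance effect (-39,505 per mile) come from the text above; the intercept is not quoted directly, so the value below is backed out from the Windsor prediction of $699,938 and should be read as an implied figure rather than Excel output.

# Predictions with the Silverhaven multiple regression equation.
# Coefficients are from the text; the intercept is implied by the Windsor prediction.
b_size, b_dist = 252, -55_006
intercept = 699_938 - b_size * 3_500 - b_dist * 11     # about 423,004 (implied, not quoted in the text)

def predicted_price(square_feet, miles_from_downtown):
    return intercept + b_size * square_feet + b_dist * miles_from_downtown

print(f"Windsor (3,500 sq ft, 11 miles): ${predicted_price(3_500, 11):,.0f}")   # $699,938

# Net vs. gross effect of distance for a house 5 miles farther out than Xanadu ($450,000):
print(f"Shangri La (same size, net effect):    ${450_000 - 55_006 * 5:,.0f}")   # about $174,970
print(f"Valhalla (size unknown, gross effect): ${450_000 - 39_505 * 5:,.0f}")   # about $252,475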
Performing Multiple Regression in Excel Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to perform multiple regression analysis using the regression tool. However, we suggest you read through the following instructions to learn how Excel's regression tool works, so you can perform multiple regression in the future, when you do have access to the Data Analysis Toolpak.
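For readers without the ToolPak, the same kind of coefficients and R-squared can be computed outside Excel. A minimal sketch in Python, using made-up columns in place of the Silverhaven data:

import numpy as np

# Hypothetical stand-ins for the Silverhaven columns: price, house size, distance to downtown.
size = np.array([1800, 2400, 3100, 1500, 2800, 2000, 3500, 2600])        # square feet
dist = np.array([3.0, 6.5, 9.0, 2.0, 7.5, 5.0, 11.0, 4.0])               # miles
price = np.array([480, 510, 690, 420, 600, 470, 720, 590]) * 1000        # dollars

# Design matrix: a column of ones for the intercept plus one column per independent variable.
X = np.column_stack([np.ones(len(price)), size, dist])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_size, b_dist = coef
print(f"Intercept = {intercept:,.0f}, size coefficient = {b_size:,.1f}, distance coefficient = {b_dist:,.0f}")

# R-squared = 1 - (Residual Sum of Squares / Total Sum of Squares).
residuals = price - X @ coef
r_squared = 1 - np.sum(residuals ** 2) / np.sum((price - price.mean()) ** 2)
print(f"R-squared = {r_squared:.2f}")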
Performing multiple regression in Excel is nearly identical to performing simple regressions. In the "Input Y range" field, enter the column with the data on the dependent variable as you would for a simple regression. To enter data on the independent variables, the columns of all the independent variables must be contiguous and have the same number of rows. In the "Input X Range" field, enter the cell reference of the top cell of the independent variable appearing farthest to the left. Following a colon, enter the cell reference of the bottom cell of the independent variable appearing farthest to the right. Select "Labels" so that Excel can properly label the independent variables in the output. Select "Residual Plots" to include the residuals and the residual plots against each independent variable in the output. Enter the desired "Confidence Level," or accept the default level of 95%. The regression output prints the values of the coefficients for each variable in separate rows, along with the p-values for the coefficients and the desired confidence intervals. Residual Analysis When running a multiple regression you must distinguish between net and gross relationships. What about R-squared? How do you measure the predictive power of a regression model with multiple independent variables? One of the Silverhaven houses in our data set, "Windsor," sold for $570,000, substantially less than its predicted price, $699,938. Rumor has it that Windsor is haunted, and it was hard to find a buyer for it. The difference between the actual price and the predicted price is the residual error, in this case -$129,938. In simple regression, the residuals — the differences between the actual and predicted values of the dependent variable — are easy to visualize on a scatterplot of the data. The residuals are the vertical distances from the regression line to the data points. Graphically representing relationships among three variables is a bit difficult. When we have only one independent variable, we create a scatter plot of the data points, then draw a line through the data representing the relationship defined by the regression equation: y = a + bx. With two independent variables, an ordinary scatter plot will no longer do. Instead, we keep track of the third variable by picturing a three-dimensional space. Our data set is "scattered" in the space with house size measured on one axis, distance to downtown measured on another axis, and the dependent variable, price, measured on the vertical axis. The regression equation with two independent variables defines a plane that passes through the data. The residuals — the differences between the actual and predicted prices in the data set — are the vertical distances from the regression plane to the data points. As in simple regression, these vertical distances are known as residuals or errors. The regression plane is the plane that "best fits" the data in the sense that it is the plane for which the sum of the squared errors is minimized. We can plot the residuals against the values of each independent variable to look for patterns that could indicate that our linear regression model is inadequate in some way. Here is the residual plot analyzing the behavior of the residuals over the range of the independent variable house size... ...and here is the residual plot analyzing the behavior of the residuals over the range of the independent variable distance.
When linear regression is a good model for the relationships studied, each of the residual plots
should reveal a random distribution of the residuals. The distribution should be normal, with mean zero and fixed variance. The residual plot against distance in the multiple regression looks quite different from the residual plot against distance in the simple regression. They represent different concepts: the first gives insight into the net relationship between distance and price controlling for house size, and the second gives insight into the gross relationship. With more than two independent variables, visualizing the residuals in the context of the full regression relationship becomes essentially impossible. We can only think of residuals in terms of their meaning: the differences between the actual and predicted values of the dependent variable. Because residual plots always involve only two variables — the magnitude of the residual and one of the independent variables — they provide an indispensable visual tool for detecting patterns such as heteroskedasticity and non-linearity in regressions with multiple independent variables. Summary The residuals, or errors, are the differences between the actual values of the dependent variable and the predicted values of the dependent variable. For a regression with two independent variables, the residuals are the vertical distances from the regression plane to the data points. We can graph a residual plot for each independent variable to help identify patterns such as heteroskedasticity or non-linearity. Quantifying the Predictive Power of Multiple Regression In simple regressions, R-squared measures the predictive or explanatory power of the independent variable: the percentage of the variation in the dependent variable explained by the independent variable. When we run the regression of house price versus house size, we find an R-squared of 26%. The R-squared for price versus distance from downtown is 37%. The multiple regression of price versus the two independent variables, house size and distance, also returns an R-squared value: 90%. Here, R-squared is the percentage of price variation explained by the variation in both of the independent variables. The multiple regression's R-squared is much higher than the R-squared of either simple regression. But how can we be sure we really gained predictive power by considering more than one variable? In fact, R-squared cannot decrease by adding another independent variable to a model — it can only stay the same or increase, even if the new independent variable is completely unrelated to the dependent variable. To understand why R-squared never decreases when we add another variable, let's return to the simple regression of house price versus distance from downtown. When we look at only two observations — two houses — we can draw a line through them that fits perfectly. R-squared for these two data points with this line is 100%. But clearly we can't use this line to explain the true relationship between house price and distance. The high R-squared is a result of the fact that the number of observations, 2, is so small relative to the number of independent variables, 1. If we have 3 observations — three houses — we can't find a line that fits perfectly. Here, R-squared is 35%. But when we add another independent variable — no matter how irrelevant — we can find a plane that fits the data points perfectly, increasing R-squared up to 100%. The "perfect" R-squared is again due to the fact that the number of observations, 3, is only one more than the number of independent variables, 2.
Improving R-squared by adding irrelevant variables is "cheating." We can always increase
R-squared to 100% by adding independent variables until we have one fewer than the number of observations. We are "over-fitting" the data when we obtain a regression equation in this way: the equation fits our particular data set exactly, but almost surely does not explain the true relationship between the independent and dependent variables. To balance out the effect of the difference between the number of observations and the number of independent variables, we modify R-squared by an adjustment factor. This transformation looks quite complicated, but notice that it is largely determined by n-k, the difference between the number of observations n and the number of independent variables k. This adjustment reduces R-squared slightly for each variable we add: unless the new variable explains enough additional variance to increase R-squared by more than the adjustment factor reduces it, we should not add the new variable to the model. Excel reports both the "raw" R-squared and the Adjusted R-squared. It is critical to use adjusted R-squared when comparing the predictive power of regressions with different numbers of independent variables. For example, since the adjusted R-squared of the multiple regression of house price versus house size and distance is greater than the adjusted R-squared of either simple regression, we can conclude that we gained real predictive power by considering both independent variables simultaneously. A final caveat: we should never compare R-squared or adjusted R-squared values for regressions with different dependent variables. An R-squared of 50% might be considered low when we are trying to explain product sales, since we expect to be able to identify and quantify many key drivers of sales. An R-squared of 50% would be considered high if we were trying to explain human personality traits, since they are influenced by so many factors and random events. Summary R-squared measures how well the behavior of the independent variables explains the behavior of the dependent variable. It is the percentage of variation in the dependent variable explained by its relationship with the independent variables. Because R-squared never decreases when independent variables are added to a regression, we multiply it by an adjustment factor. This adjustment balances out the apparent advantage gained just by increasing the number of independent variables. Solving the Staffing Problem (II) Alice urges you to use your newfound knowledge to analyze the relationship between the Kahana's occupancy and advanced bookings and Kauai's arrivals. You and Alice want to find the three-way relationship between the dependent variable, the Kahana's occupancy, and the two independent variables, arrivals on Kauai and the Kahana's advance bookings. The first thing Alice asks you to do is to measure the strength of the relationship between the two independent variables. Enter the correlation coefficient as a decimal number with three digits to the right of the decimal point (e.g., enter "5" as "5.000"). Round if necessary. Kahana Occupancy Data The correlation between arrivals and bookings isn't especially strong, 44%. Kahana Occupancy Data You run the multiple regression of occupancy versus Kauai arrivals and advance bookings. Kahana Occupancy Data Which of the following indicates that the multiple regression with two independent variables is an improvement over both simple regressions?
a. The adjusted R-squared of 86.4% for the multiple regression is higher than either of the adjusted R-squareds. This is not the best answer. R-squared (without the adjustment) is 86.4%. The adjusted R-squared is 85.6%.
b. The adjusted R-squared of 85.6% for the multiple regression is higher than either of the adjusted R-squareds. This is the best answer. Compare the adjusted R-squared of the multiple regression to the adjusted R-squareds of the simple regressions.
c. The R-squared of 86.4% for the multiple regression is higher than either of the adjusted R-squareds. This is not the best answer. Compare the adjusted R-squared of the multiple regression to the adjusted R-squareds of the simple regressions.
d. The adjusted R-squared of 85.6% for the multiple regression is lower than the sum of the adjusted R-squareds. This is not the best answer. The new regression has better explanatory power if the adjusted R-squared of the multiple regression is higher than the adjusted R-squareds of the simple regressions.
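For reference, the adjustment described earlier can be written out explicitly. One common form (conventions differ in exactly how the intercept is counted, so the tutorial's own formula may group the terms slightly differently) is adjusted R-squared = 1 - (1 - R-squared) × (n - 1) / (n - k - 1), where n is the number of observations and k is the number of independent variables. A minimal Python sketch with made-up numbers, not the Kahana data:

    # Minimal sketch of the R-squared adjustment, under the convention that
    # k counts the independent variables and the intercept is counted separately.
    def adjusted_r_squared(r2, n, k):
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    # Made-up numbers: adding a second variable nudges raw R-squared up,
    # but the adjustment can still pull adjusted R-squared down.
    print(adjusted_r_squared(0.50, n=20, k=1))   # about 0.472
    print(adjusted_r_squared(0.51, n=20, k=2))   # about 0.452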
Kahana Occupancy Data Kahana Occupancy Regressions Do the data indicate that the regression coefficients of the independent variables are significant at the 0.05 level? a. No. Neither of the coefficients is significant. This is not the correct answer. The p-values for both coefficients are less than 0.05.
b. No. The bookings coefficient is significant but the arrivals coefficient is not. This is not the correct answer. The p-values for both coefficients are less than 0.05.
c. No. The arrivals coefficient is significant but the bookings coefficient is not. This is not the correct answer. The p-values for both coefficients are less than 0.05.
d. Yes. Both coefficients are significant. This is the correct answer. The p-values for both coefficients are less than 0.05.
Kahana Occupancy Data Kahana Occupancy Regressions Suppose Leo has 200 advance bookings for the month of January and 250 advance bookings for February. What is the best estimate of how many more guests Leo can expect in February compared to January? a. 20 This is not the best answer. This would be the best answer if you knew that the number of arrivals on the island were the same in both months.
b. 43 This is the best answer. Without any information on the number of arrivals in those two months, using the coefficient from the simple regression of occupancy versus bookings will deliver the best estimate of the increase in occupants from January to February.
c. 50 This is not the best answer. Without any information on the number of arrivals in those two months, using the coefficient from the simple regression of occupancy versus bookings will deliver the best estimate of the increase in occupants from January to February.
d. There is not enough information given to find any such estimate. This is not the best answer. Using the coefficient from the simple regression of occupancy versus bookings will deliver an estimate of the increase in occupants from January to February.
Kahana Occupancy Data Kahana Occupancy Regressions You and Alice try to contact Leo, but he appears to be unavailable. Leo must be with his lawyers in Honolulu. I'm sure we'll be able to get a hold of him later. In the meantime, I can complete my research into the Excelsior's promotions. And there are some complexities of multiple regression that you should become familiar with... Exercise 1: Empire Learning Empire Learning is a developer of educational software. CEO Bill Hartborne is making a bid for a contract to create an e-learning module for a new client. Preparing the bid requires an estimate of the number of labor-hours it will take to create the new module. Bill believes that the length of a module and the complexity of its animations directly affect the amount of labor required to complete it. Bill has data on the labor-hours Empire used to complete previous courses. He also knows the number of pages and the animation run-time of each previous course — quantities he thinks are reasonable proxies for course length and animation complexity, respectively. Perform a simple regression analysis for each of the independent variables: number of pages and run-time of animations. Empire Learning Data Which factor explains more variation in labor hours? a. Number of pages This is the correct answer. The R-squared for the simple regression of labor-hours versus number of pages is 83%.
b. Run-time of animations This is not the correct answer. The R-squared for the simple regression of labor-hours versus run-time of animations is 69%, which is lower than the R-squared using number of pages as the independent variable.
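If you would like to reproduce this comparison outside Excel, the sketch below fits both simple regressions in Python and compares their R-squared values. The file name empire.csv and the column names labor_hours, pages, and runtime are illustrative assumptions, not the tutorial's actual data.

    # Minimal sketch: fit the two simple regressions and compare R-squared.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("empire.csv")  # hypothetical file name

    by_pages = smf.ols("labor_hours ~ pages", data=df).fit()
    by_runtime = smf.ols("labor_hours ~ runtime", data=df).fit()

    print("R-squared, pages only:  ", round(by_pages.rsquared, 3))
    print("R-squared, runtime only:", round(by_runtime.rsquared, 3))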
Empire Learning Data Empire Learning Regressions In the simple regressions, which of the independent variables contributes significantly to the number of labor-hours it takes Empire to create an e-learning course? a. Number of pages only This is not the correct answer. The p-value for the coefficient on the number of pages is 0.0002, well below 0.05, the most commonly used level of significance.
b. Run-time of animations only This is not the correct answer. The p-value for the coefficient on the run-time of animations is 0.003, well below 0.05, the most commonly used level of significance.
c. Both variables This is the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.
d. Neither variable This is not the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.
Empire Learning Data Empire Learning Regressions The p-values for the coefficients on animation run-time and number of pages are 0.003 and 0.0002 respectively; well below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute significantly in their respective simple regressions to the number of labor hours Empire takes to create an e-learning course. Run the multiple regression of labor-hours versus number of pages and run-time of animations. Empire Learning Data According to this multiple regression, which of the independent variables contributes significantly to the number of labor-hours it takes Empire to create an e-learning course? a. Number of pages only This is not the correct answer. The p-value for the coefficient on the number of pages is 0.001, well below 0.05, the most commonly used level of significance.
b. Run-time of animations only This is not the correct answer. The p-value for the coefficient on the run-time of animations is 0.015, well below 0.05, the most commonly used level of significance.
c. Both variables This is the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.
d. Neither variable This is not the correct answer. The p-values for both independent variables are less than 0.05, the most commonly used level of significance.
Empire Learning Data Empire Learning Regressions The p-values for the coefficients on animation run-time and number of pages are 0.014 and 0.0015 respectively; well below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute significantly in the multiple regression to the number of labor hours Empire takes to create an e-learning course. Exercise 2: The Empire Strikes Back For this exercise, refer to the regression analyses performed in Exercise 1 of this section. Empire Learning Data Empire Learning Regressions Bill Hartborne, CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages and the total run-time of animations as independent variables. Empire Learning Data Empire Learning Regressions In the multiple regression of labor-hours versus number of pages and run-time of animations, the coefficient of 0.84 for the number of pages tells us that: a. For every additional 100 pages of module length, the run-time of animations increases by an average of 84 seconds. This is not the best answer. The coefficient on an independent variable describes the mathematical relationship between the independent variable and the dependent variable (in this case, labor-hours),
not the relationship between two independent variables.
b. For every additional 100 pages of module length, the run-time of animations increases by 84 seconds when we control for labor-hours. This is not the best answer. The coefficient on an independent variable describes the mathematical relationship between the independent variable and the dependent variable (in this case, labor-hours), not the relationship between two independent variables.
c. For every additional 100 pages of module length, the number of labor-hours increases by an average of 84. This is not the best answer. The coefficient of 0.84 describes the net relationship between number of pages and labor-hours when controlling for animation run-time.
d. For every additional 100 pages of module length, the number of labor-hours increases by 84 when we control for animation run-time. This is the best answer. The coefficient of 0.84 describes the net relationship between number of pages and labor-hours when controlling for animation run-time.
Empire Learning Data Empire Learning Regressions In the multiple regression equation, the coefficient of the independent variable "number of pages" is gross relative to: a. The number of labor-hours. This is not the correct answer. An independent variable is not gross or net relative to the dependent variable.
b. The run-time of animations. This is not the correct answer. The run-time of animations is part of this multiple regression, so the number of pages is net relative to the run-time of animations.
c. The number of illustrations used in the module. This is the correct answer. The number of illustrations is omitted in the regression analysis, so the number of pages is gross relative to this variable.
d. Nothing. The number of pages is an all around pleasant and sanitary variable. This is not the correct answer. Perhaps you should review the clip on interpreting the multiple regression equation to find the meaning of "gross" in this context.
Empire Learning Data Empire Learning Regressions Challenge: Children of the Empire For this exercise, refer to the regression analyses performed in Exercise 1 of this section. Empire Learning Data Empire Learning Regressions Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages and the total run-time of animations as independent variables. Empire Learning Data Empire Learning Regressions Bill bills out his talent at $70/hour. Based on the multiple regression, how much should he charge for the labor content of a course with 400 pages and 170 seconds of animations? Enter the estimated cost of the labor (in $) as an integer (e.g., enter "$5.00" as "5"). Round if necessary. a. 76500
b. 77500 c. 77210 Empire Learning Data Empire Learning Regressions First use the regression equation to predict the number of labor-hours required to complete the course. Empire Learning Data Empire Learning Regressions Then multiply that number by Empire Learning's billing rate of $70/hour to find the total amount he should charge for the labor content of the course, $77,210. Empire Learning Data Empire Learning Regressions Bill is sure that the client will balk at a labor bill of over $70,000. He knows that animation is important to the client, so doesn't want to cut corners there. However, he believes that his lead writer can cover the content in fewer pages without compromising his renowned clear and engaging prose. Empire Learning Data Empire Learning Regressions To reduce total labor costs to $70,000, how many pages must Bill cut from the plan to meet his client's cost limits? a. Around 87 pages. This is not the correct answer. The coefficient for number of pages is 0.84.
b. Around 103 pages. This is not the correct answer. 103 is the number of labor-hours Bill needs to cut, not the number of pages.
c. Around 123 pages. This is the correct answer.
d. There aren't enough pages for Bill to cut to reduce the price below $70,000. This is not the correct answer. There are enough pages to cut to reduce the price below $70,000.
Empire Learning Data Empire Learning Regressions To reduce the labor bill from $77,210 to $70,000, Bill must reduce labor costs by $7,210. To achieve this reduction, Bill must cut 103 labor-hours from the contract, since he bills out his talent at $70/hour. Empire Learning Data Empire Learning Regressions Since the animation run-time will not change, we use the net relationship between labor-hours and number of pages, which tells us that each additional page consumes 0.84 labor-hours. Dividing 103 hours by 0.84 hours per page, Bill must reduce the number of pages by about 123. Empire Learning Data Empire Learning Regressions New Concepts in Multiple Regression
"I expect Leo will call us this evening after his meeting with the lawyers," Alice predicts. "I hope things are going well. If Mr. Pitt's lawsuit materializes, Leo might not have much of a business left to help him with." The Staffing Problem (III) I just spent the whole day at my lawyers' offices. Please give me some good news about the occupancy problem. Well, we've found a regression model that incorporates arrivals on Kauai and advance bookings. We're now able to explain about 86% of the variation in occupancy. Kahana Occupancy Regressions That's great. That's so much better than the 39% you calculated using only advance bookings as the independent variable. I should be able to make much more reliable predictions based on your new model! Unfortunately, no. Although this new model helps us understand why your occupancy varies, we can't exactly use the model to make predictions. You can use advance bookings to make a prediction about occupancy in a given month because the bookings are known to you ahead of time. But you won't get the data on the number of arrivals in a month until it's too late and your guests are already on your doorstep. That's terrible! Sure, it's nice to know how today's occupancy is affected by today's arrivals. But I need to make business decisions! I need to know one month in advance how many staffers to hire! Please, isn't there something you can do? Don't give up yet, Leo. We still have a number of statistical approaches at our disposal. We'll have something for you when you get back from Honolulu. Multicollinearity "We need some more advanced statistical tools to find a regression suitable to Leo's purposes," Alice tells you. "And there is still a pitfall you'll need to learn to avoid when using multiple regression." Another key factor that influences the price of a house is the size of the property it is built on its "lot size". Naturally, we'd pay more for a spacious acre of land than for 800 cramped square feet. Using new data on the lot sizes of the 15 sample houses in Silverhaven, run the simple regression of house price on lot size. If we don't control for any other factors, how much of the variation in price can be explained by variation in lot size? Enter R-squared as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if necessary. a. 0.30 b. 0.301 c. 0.302 d. 0.303 e. 0.304 f. 0.305 Silverhaven Real Estate Data Silverhaven Real Estate Regressions
Lot size accounts for 30 percent of a home's price. Do the data provide evidence that there is a significant linear relationship between house price and lot size? a. Yes. This is the correct answer. The p-value for the lot size coefficient is 0.033, which is less than the most commonly used significance level 0.05.
b. No. This is not the correct answer. The p-value for the lot size coefficient is 0.033, which is less than the most commonly used significance level 0.05.
c. This question cannot be answered without running a multiple regression. This is not the correct answer. The significance of the independent variable can be found in a simple regression, too.
Silverhaven Real Estate Data Silverhaven Real Estate Regressions The low p-value of 0.033 tells us that we can be confident that the gross relationship between lot size and home price is significant. What happens when we add lot size as a third independent variable in our multiple regression of price on three independent variables: house size, distance from downtown Silverhaven, and lot size? Run a multiple regression of price on the three independent variables: house size, distance, and lot size. How does the addition of the new independent variable, "lot size," affect the predictive power of the regression model? a. The explanatory power of the regression increases when lot size is added as an independent variable. This is the correct answer.
b. The explanatory power of the regression decreases when lot size is added as an independent variable. This is not the correct answer.
c. The explanatory power of the regression stays the same when lot size is added as an independent variable. This is not the correct answer.
Silverhaven Real Estate Data Silverhaven Real Estate Regressions By adding the independent variable "lot size", we improve adjusted R-squared slightly: from 89% to 91%, telling us that the predictive power of the regression has improved. What about the significance of the independent variables? Has adding the new variable changed the p-values of the coefficients? Silverhaven Real Estate Regressions Something odd has happened. In our earlier regression with two independent variables, the p-values for both the house size and the distance coefficients were less than 0.05. Now, adding lot size into the equation has somehow raised the p-value for house size. The new p-value for house size, 0.2179, is so high that there is no longer evidence of a significant linear relationship between price and house size after taking lot size into account. How do we explain this drop in significance? When a multiple regression delivers a surprising result such as this, we can usually attribute it to a relationship between two or more of the independent variables. Let's look at the data on house size and lot size. The high correlation coefficient between house size and lot size — 94% — is the culprit in the
Case of the Dropping Significance. When two of the independent variables are highly correlated, one is essentially a proxy for the other. This phenomenon is called multicollinearity. In our example, lot size is a good proxy for house size. Both house size and lot size contribute to the price of a home. But because these two variables are closely correlated in our data set, there is not enough information in the data to discern how their combined contributions should be attributed to the two independent variables. The net effect of house size on price should tell us how house size affects price when lot size is held fixed. However, we can't detect this effect in the data: house size and lot size are so closely related that we've never seen house size vary much when lot size is fixed. Would dropping the variable house size improve the predictive power of the regression model? The multiple regressions with and without house size have different numbers of independent variables, so we use adjusted R-squared to compare their predictive power. Without house size, adjusted R-squared is 90.89%, slightly lower than 91.40%, the adjusted R-squared for the regression including house size. Thus, although the regression model cannot accurately estimate the effect of house size when we control for lot size and distance, the addition of house size does help explain a bit more of the variance in selling price. Diagnosing and Treating Multicollinearity A common indication of lurking multicollinearity in a regression is a high adjusted R-squared value accompanied by low significance for one or more of the independent variables. One way to diagnose multicollinearity is to check whether the p-value on an independent variable rises when a new independent variable is added, suggesting strong correlation between those independent variables. How much of a problem is multicollinearity? That depends on what we are using the regression analysis for. If we're using it to make predictions, multicollinearity is not a problem, assuming as always that the historically observed relationships among the variables continue to hold going forward. In the house price example, we'd keep the house size variable in the model, because its presence improves adjusted R-squared and because our judgment would suggest that house size should have an impact on price separate from the effect of lot size. If we're trying to understand the net relationships of the independent variables, multicollinearity is a serious problem that must be addressed. One way to reduce multicollinearity is to increase the sample size. The more observations we have, the easier it will be to discern the net effects of the individual independent variables. We can also reduce or eliminate multicollinearity by removing one of the collinear independent variables. Identifying which one to remove requires a careful analysis of the relationships between the independent variables and the dependent variable. This is where a manager's deep understanding of the dynamics of the situation becomes invaluable. In our home price example, we'd expect both house size and lot size to have significant and discernible effects on the price of a home: a shack on an acre of land should cost less than a mansion on a similar property. To better understand the net effect of house size, we should probably gather a larger sample to reduce multicollinearity. If we didn't expect lot size and house size to have distinct effects on price, we might remove house size from the equation.
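A few quick computations can support this kind of diagnosis. The Python sketch below shows three of them: pairwise correlations among the independent variables, variance inflation factors (a standard diagnostic, though not one this tutorial introduces), and a comparison of p-values with and without the suspect variable. The file and column names are illustrative assumptions, not the actual Silverhaven data.

    # Minimal sketch of three multicollinearity checks.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("silverhaven.csv")  # hypothetical file name

    # 1. Pairwise correlations among the independent variables: a very high
    #    correlation (e.g., size vs. lot_size) is a warning sign.
    print(df[["size", "distance", "lot_size"]].corr())

    # 2. Variance inflation factors: values well above roughly 5-10 are commonly
    #    read as evidence of multicollinearity (a rule of thumb, not from the tutorial).
    X = sm.add_constant(df[["size", "distance", "lot_size"]])
    for i, name in enumerate(X.columns):
        print(name, variance_inflation_factor(X.values, i))

    # 3. Watch the p-values move: refit with and without the suspect variable.
    with_lot = smf.ols("price ~ size + distance + lot_size", data=df).fit()
    without_lot = smf.ols("price ~ size + distance", data=df).fit()
    print(with_lot.pvalues)
    print(without_lot.pvalues)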
Summary Multicollinearity occurs when some of the independent variables are strongly interrelated: distinguishing the respective effects of some of the independent variables on the dependent variable is not possible using the available data. Multicollinearity is typically not a problem when we use regression for forecasting. When using regression to understand the net
relationships between independent variables and the dependent variable, multicollinearity should be reduced or eliminated. Lagged Variables In the Silverhaven real estate example we looked at a number of houses and compared four characteristics: price, house size, lot size, and distance from the city center. These are cross-sectional data: we looked at a cross-section of the Silverhaven real estate market at a specific point in time. A time series is a set of data collected over a range of time: each data point pertains to a specific time period. Our EasyMeat data is an example of a time series — we have data on sales and advertising levels for ten consecutive years. Sometimes, the value of the dependent variable in a given period is affected by the value of an independent variable in an earlier period. We incorporate the delayed effect of an independent variable on a dependent variable using a lagged variable. The effects of variables such as advertising often carry over into later time periods. For example, last year's EasyMeat advertising levels may continue to affect this year's sales. To study this carry-over effect, we add the lagged variable "previous year's advertising." Our regression now has two independent variables: the current year's advertising level and last year's advertising level. To run a regression on the lagged EasyMeat advertising variable we first need to prepare the data for the lagged variable: we copy the column of advertising data over to a new column, shifted down by one row. The first and last data points require attention: the first has all the necessary data except a lagged advertising value, and the last has a lagged value but no other information. Since we need observations with data on all variables, we are forced to discard the first observation as well as the extraneous piece of information in the lagged variable column. In effect, by introducing a lagged variable, we lose a data point, because we have no value for "previous year's advertising" for the first observation. We run the regression as we would any ordinary multiple regression, and obtain the regression equation shown in the output. EasyMeat Sales Data EasyMeat Sales Regressions Comparing the output from the regressions with and without the lagged variable, we first notice that the number of observations has decreased from 10 to 9. Does the addition of the lagged advertising variable improve the regression? The lagged variable decreases adjusted R-squared substantially from 84.11% to 76.62%. Moreover, the lagged variable's high p-value — 0.33 — tells us that its effect is too insignificant for us to distinguish from the effect of the current year's advertising. We conclude that the addition of the lagged variable has not improved the regression. We can introduce variables with even longer lags. We might try to use advertising data from two or more years in the past to help explain the present sales level. However, for each additional period of lag we lose another data point. Running the EasyMeat regression with the current year's advertising, last year's advertising, and the advertising level from two years ago leaves us only eight observations. We lose predictive power: adjusted R-squared drops from 76.62% to 62.47%. Moreover, none of the coefficients is significant at the 0.05 level. Adding a lagged variable is costly in two ways. The loss of a data point decreases our sample size, which reduces the precision of our estimates of the regression coefficients.
At the same time, because we are adding another variable, we decrease adjusted R-squared.
Thus, we include a lagged variable only if we believe the benefits of adding it outweigh the loss of an observation and the "penalty" imposed by the adjustment to R-squared. Despite these costs, lagged variables can be very useful. Since they pertain to previous time periods, they are usually available ahead of time. Lagged variables are often good "leading indicators" that help us predict future values of a dependent variable. Summary We can use lagged variables when data consist of a time series, and we believe that the value of the dependent variable at one point in time is related to the value of an independent variable at a previous point in time. Lagged variables are especially useful for prediction since they are available ahead of time. However, they come at a cost: we lose one observation for each time series interval of delayed effect we incorporate into the lagged variable. Dummy Variables So far, we've been constructing regression models using only quantitative variables such as advertising expenditures, house size, or distance. These variables by their nature take on numerical values. But many variables we study are qualitative or categorical: they do not naturally take on numerical values, but can be classified into categories. In our house price example, for instance, a categorical variable might be the home's primary construction material: wood, brick, or straw. Assigning numerical values to such categories doesn't make sense — a house made of wood cannot be described as a brick house plus some number. How do we incorporate categorical variables into a regression analysis? Let's look at an example from Julius Tabin's EasyMeat business. We have found that EasyMeat sales are determined to a large extent by how much he spends on advertising. Another factor affecting sales is which flavor of EasyMeat Julius features in his advertising campaigns. Julius produces both EasyMeat Classic, made from beef for the most part, and EasyMeat Poulk!, a pork and poultry blend. Hoping to match the right spreadable meat flavor to the mood of the nation, Julius features Poulk! in his ad campaigns in some years. In other years, Classic takes the lead role. To study the effects of Julius' flavor-of-the-year choice, we use a type of categorical variable called a dummy variable. A dummy variable takes on one of two values: 0 or 1, to indicate which of two categories a data point falls into. This dummy variable — we'll call it "Poulk! flavor" — is set to 1 for years when EasyMeat ads feature Poulk!, and 0 for years when they feature Classic. Run the regression on the data, with sales as the dependent variable, and advertising and the Poulk! flavor dummy variable as the independent variables. What is the coefficient for the Poulk! flavor variable? Enter the coefficient for the Poulk! flavor variable as an integer (e.g., "5"). Round if necessary. a. 533024 b. 533,024 EasyMeat Sales Data EasyMeat Sales Regressions We find a coefficient of 533,024 for Poulk! flavor. What does this coefficient tell us? The coefficient 533,024 tells us that after controlling for the level of advertising
expenditures, average sales are $533,024 higher when Poulk! is the flavor-of-the-year, rather than EasyMeat Classic. This regression model can be expressed graphically: essentially we have two parallel regression lines, one for years in which Classic is promoted, and one for "Poulk! Years." The vertical distance between the lines is the average increase in sales Julius would expect if he chooses to feature Poulk! in a given year, after controlling for advertising level. The slope of each line is the same: it tells us the average increase in sales when Julius spends another dollar in advertising after controlling for which flavor is featured. In this example, we arbitrarily chose to set the dummy variable equal to zero for the Classic flavor, i.e., to make Classic the "base case" flavor. If we made Poulk! the base case flavor, we would obtain exactly the same graph. The coefficient on flavor now would be -533,024, but it would again indicate that Julius should expect average sales to be $533,024 lower in the years that his ads feature Classic rather than Poulk!. As for any independent variable in a regression, we can test a dummy variable's significance by calculating a p-value for its coefficient. When the p-value is less than 0.05, we can be 95% confident that the coefficient is not zero and that the dummy variable is significantly related to the dependent variable. If Julius produced and promoted more than two flavors, we'd need another dummy variable for each additional flavor. In general, we'd need one fewer dummy variable than the number of flavors: one variable for each of the flavors except for Classic, the base case. For each year that a given flavor is featured, its dummy variable is set to one, and all other flavor variables are set to zero. In a "Classic Year," all dummy variables are set to zero. Why don't we have exactly as many dummy variables as categories? Suppose we have just two categories — Poulk! and Classic — with separate dummy variables for each — P and C. In a Poulk! Year, P = 1 and C = 0. In a Classic Year, P = 0 and C = 1. The second dummy variable would be completely redundant: since there are only 2 flavors, if P = 0 we know the flavor is not Poulk!, so it must be Classic. Technically, including both dummy variables would be problematic because they would be perfectly correlated: whenever P = 1, C = 0, and whenever P = 0, C = 1. Perfectly correlated variables are perfectly collinear: we gain nothing by including both and must drop one to avoid crippling multicollinearity. It is useful to note that running a simple regression analysis with only a dummy variable is equivalent to running a hypothesis test for the difference between the means of two populations. In this case, one population would be sales in Poulk! years, and the other would be sales in Classic years. From the data we find the mean and standard deviation of sales in Poulk! and Classic years. The hypothesis test gives us a p-value of 0.43, so we cannot conclude that the means differ when the featured flavor is different. A simple regression of sales on the dummy variable flavor gives us the same result. The p-value on the dummy variable is 0.45, telling us that the flavor dummy variable is not statistically significant. The p-values differ slightly for the two approaches due to differences in the way Excel calculates the terms, but the tests are conceptually equivalent. Summary We can use dummy variables to incorporate qualitative variables into a regression analysis.
Dummy variables have the values 0 and 1: the value is 1 when an observation falls into a category of the qualitative variable and 0 when it doesn't. For qualitative variables with more than two categories, we need multiple dummy variables: one fewer than the number of categories. Solving the Staffing Problem (III) With one eye on the lookout for signs of lurking multicollinearity and two new types of
variables in your toolbox, you set out to settle the staffing problem for good. You might have noticed that the number of arrivals follows an annual seasonal pattern. Arrivals tend to drop off during the late summer, surge again in October, and drop very low for the rest of the year. They pick up briefly in February, but the tourist business slows through the spring. During the early summer, vacationers start arriving in droves, with arrivals peaking in June or July. Perhaps we can use the seasonality of arrivals in some way. We might come up with a lagged variable that functions as a proxy for the current month's arrivals. Then we can run a regression with the lagged variable. Since the values of a lagged variable would be based on historical data, Leo would know them ahead of time. Given that arrivals follow this seasonal pattern, which of the following variables is likely to be a good proxy for this month's arrivals on Kauai? a. Arrivals lagged by one month This is not the best answer. Since there is an annual seasonal pattern in the arrivals data, the number of arrivals in a given month will be similar to the number of arrivals in the same month of previous years.
b. Arrivals lagged by two months This is not the best answer. Since there is an annual seasonal pattern in the arrivals data, the number of arrivals in a given month will be similar to the number of arrivals in the same month of previous years.
c. Arrivals lagged by six months This is not the best answer. Since there is an annual seasonal pattern in the arrivals data, the number of arrivals in a given month will be similar to the number of arrivals in the same month of previous years.
d. Arrivals lagged by twelve months This is the best answer. Due to the annual seasonal pattern in arrivals, it makes the most sense to predict the arrivals in a given month using data on arrivals in the same month of previous years.
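Building such a lagged variable is mechanical: shift the arrivals column down twelve rows so that each month is paired with the arrivals observed in the same month one year earlier. A minimal Python sketch follows; the file and column names are illustrative assumptions, not the actual Kahana data.

    # Minimal sketch: construct a 12-month lagged-arrivals variable and regress on it.
    # Assumes a hypothetical monthly file "kahana.csv" with columns occupancy,
    # bookings, arrivals, sorted in chronological order.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("kahana.csv")  # hypothetical file name

    # Pair each month with the arrivals from the same month one year earlier.
    df["arrivals_lag12"] = df["arrivals"].shift(12)

    # The first 12 months have no lagged value, so those observations are dropped,
    # just as one row is lost when lagging the EasyMeat data by one year.
    df_lagged = df.dropna(subset=["arrivals_lag12"])

    model = smf.ols("occupancy ~ arrivals_lag12", data=df_lagged).fit()
    print(model.rsquared, model.rsquared_adj, model.pvalues)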
You run the simple regression of Kahana occupancy versus 12-month lagged arrivals. Kahana Occupancy Data What is the lowest level at which the "lagged arrivals" variable is significant? a. At the 0.01 level. This is the best answer. The p-value for the lagged arrivals coefficient is far below the significance level 0.01.
b. At the 0.05 level only. This is not the best answer. The p-value for the lagged arrivals coefficient is far below the significance level 0.01.
c. At the 0.10 level only. This is not the best answer. The p-value for the lagged arrivals coefficient is far below the significance level 0.01.
d. The number of arrivals suffering from jet-lag is completely insignificant. This is not the best answer. Perhaps you should review the clip on lagged variables.
Kahana Occupancy Data Kahana Occupancy Regressions
The regression output shows a p-value of 0.000 for the lagged arrivals coefficient, far below the significance level of 0.01. At 55%, the R-squared for the simple regression on lagged arrivals is much lower than for the simple regression on current arrivals, 80%. The adjusted R-squared is only 53%. Still, lagged arrivals have substantial predictive power. Run a multiple regression of occupancy versus lagged arrivals and advance bookings. Kahana Occupancy Data What is the adjusted R-squared for this multiple regression? Enter the adjusted R-squared as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Kahana Occupancy Data Kahana Occupancy Regressions The adjusted R-squared of 60% for the multiple regression is higher than the 53% for the simple regression. But is that the best you can do? For the moment, you report your analysis to Alice. This model with the lagged variable is more useful to Leo, even if its predictive power is still pretty low. But let's look at these data on Leo's competition that I dug up. Maybe we can use them to give us an even better model. Leo's competitor, Knut Steinkalt at the Hotel Excelsior, frequently launches promotion campaigns: in some months, Steinkalt slashes room prices dramatically. These data show the months in which the Excelsior's promotions took place in the last three years. We can use a dummy variable to see how the competition's promotions have affected Leo's occupancy. Kahana Occupancy Data The Excelsior has to advertise these promotional packages at least a month in advance to attract customers. As long as Leo keeps an eye on the Excelsior's advertising, he'll be alerted to these promotions in enough time to take them into account when he makes staffing decisions for the following month. Kahana Occupancy Data Using Alice's research, you run a simple regression of occupancy versus the promotions. Then you run the multiple regression of occupancy versus advance bookings, lagged arrivals, and Excelsior promotions. Kahana Occupancy Data Suppose Leo has used the multiple regression equation to predict July's occupancy using a given level of advance bookings and last July's arrivals. Just before he makes his staffing decisions, he learns that, unexpectedly, his rival Knut has cut the Excelsior's room prices for July. Leo should revise his predicted occupancy by: a. Subtracting 43 expected guests. This is not the correct answer. The gross effect of Excelsior promotions on Kahana occupancy is to reduce it by 43 guests. The advance bookings and lagged arrivals for July are still the same as they were before the Excelsior launched its campaign, so Leo wants to find the net effect of the promotion on his occupancy levels. b. Subtracting 60 expected guests. This is the correct answer. Whenever the Excelsior has a promotional deal, Kahana occupancy is reduced by an average of 60 guests. The advance bookings and lagged
arrivals for July are still the same as they were before the Excelsior launched its campaign, so Leo wants to find the net effect of the promotion on his occupancy levels. c. Adding 43 expected guests. This is not the correct answer. The effect of the Excelsior promotions is to reduce the Kahana's occupancy. d. Adding 60 expected guests. This is not the correct answer. The effect of the Excelsior promotions is to reduce the Kahana's occupancy.
Kahana Occupancy Data Kahana Occupancy Regressions The gross effect of Excelsior promotions on Kahana occupancy, a reduction of 43 guests, is not relevant here since we know the advance bookings and lagged arrivals for July are still the same as they were before the Excelsior launched its campaign. Leo wants to use the net effect of the promotion on his occupancy levels, assuming advance bookings and lagged arrivals remain fixed — an average reduction in occupancy of 60 guests. Adding Excelsior promotions to the regression analysis: a. Makes the lagged arrivals variable appear statistically insignificant at the 0.05 level. This is not the correct answer. The p-value for the coefficient on lagged arrivals is still less than 0.05.
b. Makes the advance bookings variable appear statistically insignificant at the 0.05 level. This is not the correct answer. The p-value for the coefficient on advance bookings is still less than 0.05.
c. Is an improvement in terms of R-squared over all of the simple regressions but not over the regression of occupancy versus lagged arrivals and advance bookings. This is not the correct answer. The adjusted R-squared of 84% is higher for this regression than for any simple regression or for the regression of occupancy versus lagged arrivals and advance bookings.
d. Is an improvement in terms of R-squared over all of the simple regressions and over the regression of occupancy versus lagged arrivals and advance bookings. This is the correct answer. The adjusted R-squared of 84% is higher for this regression than for any simple regression or for the regression of occupancy versus lagged arrivals and advance bookings.
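Pulling the pieces together, the model described here combines advance bookings, 12-month lagged arrivals, and a 0/1 dummy for months in which the Excelsior runs a promotion. The Python sketch below shows its general shape; the file name, column names, and the prediction inputs are illustrative assumptions, not the actual Kahana data or coefficients.

    # Minimal sketch: occupancy versus bookings, lagged arrivals, and a promotion dummy.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("kahana.csv")  # hypothetical file name
    df["arrivals_lag12"] = df["arrivals"].shift(12)
    df = df.dropna(subset=["arrivals_lag12"])

    # "promo" is 1 in months with an Excelsior promotion, 0 otherwise.
    model = smf.ols("occupancy ~ bookings + arrivals_lag12 + promo", data=df).fit()

    print(model.params)        # the promo coefficient is the net occupancy shift
    print(model.pvalues)       # significance of each coefficient
    print(model.rsquared_adj)  # compare against the model without the dummy

    # Predict next month's occupancy from values Leo knows in advance
    # (the numbers below are placeholders, not real data).
    new_month = pd.DataFrame({"bookings": [250], "arrivals_lag12": [9000], "promo": [1]})
    print(model.predict(new_month))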
Kahana Occupancy Data Kahana Occupancy Regressions Upon Leo's return, you eagerly report your results. This is good work. Naturally, I'd like to have even greater predictive power than 84%, but I realize you aren't psychics. This model will really help me when I hire staff. Thanks! I have some good news of my own. Mr. Pitt agreed not to file suit against me! I did have to promise him an extended stay in the Kahana's penthouse, free of charge. He's coming next month — there's one occupant I don't need to use statistics to predict. This time, I'll serve the bisque myself. Thanks again for all your help! Exercise 1: The Kiwana Quandary Linda Szewczyk, marketing director of Amalgamated Fruits Vegetables & Legumes (AFV&L), is researching the nation's fruit consumption habits. In particular, she would like greater
insight into household consumption of the kiwana, a cross-breed of kiwis and bananas that AFV&L pioneered. Naturally, one important determinant of household consumption is the size of the household — the number of members. Since AFV&L has positioned the kiwana as a "high end" fruit, Linda believes that household income may also influence its consumption. Run a multiple regression of household kiwana consumption versus household size and income. Make note of important regression parameters such as R-squared, adjusted R-squared, the coefficients, and the coefficients' significance. The income variable has a coefficient of 0.0004. Can a variable with such a small coefficient be statistically significant? Kiwana Consumption Data The independent variable, income, is statistically significant since its p-value is less than 0.05, the most common level of significance. The small coefficient tells us that for every additional $10,000 of income, average kiwana consumption increases by 4 lbs. a year. To date, AFV&L has focused its marketing campaigns on high-income, highly educated consumers. Linda would like to deepen her understanding of how the educational level of the household members might affect their appetite for kiwanas. Kiwana Consumption Data To incorporate education into her kiwana consumption analysis, Linda separated the households in her data set into three categories based on the highest level of education attained by any member of the household — no college degree, college degree but no post-graduate degree, and post-graduate degree. She represents these categories using two dummy variables — "college only" and "post-graduate." Run a regression on all four independent variables. Kiwana Consumption Data Controlling for household size, income, and post-graduate degree, how many more pounds of kiwanas are consumed in a household in which the highest educational level is a college degree, compared to a household in which no one holds a college degree? a. 40.8 This is not the correct answer. The coefficient for the dummy variable "college" that describes the net relationship between "college" and household kiwana consumption is 51.6
b. 51.6 This is the correct answer. The coefficient for the dummy variable "college" that describes the net relationship between "college" and household kiwana consumption is 51.6
c. 52.0 This is not the correct answer. The coefficient for the dummy variable "college" that describes the net relationship between "college" and household kiwana consumption is 51.6
d. 84.0 This is not the correct answer. The coefficient for the dummy variable "college" that describes the net relationship between "college" and household kiwana consumption is 51.6
Kiwana Consumption Data Kiwana Consumption Regressions The coefficient for the dummy variable "college," 51.6, tells us the expected difference in kiwana consumption for "college degrees only" households compared to the excluded educational category: households in which no one holds a college degree. The coefficient describes the net relationship between "college degrees only" and household kiwana consumption, controlling for household size, income, and post-graduate degree. Controlling for household size, how many more pounds of kiwanas are consumed in a
household in which the highest educational level is a post-graduate degree, compared to a household in which the highest educational level is a college degree? Enter the difference in consumption between the two households as a decimal number with two digits to the right of the decimal point. (e.g., enter "5" as "5.00"). Round if necessary. Kiwana Consumption Data Kiwana Consumption Regressions When you control for household size and income, college degree households consume 51.63 lbs more than non-college households. Post-graduate degree households consume 51.95 lbs more than non-college households. In other words, post-graduate households consume 0.32 lbs more than college households. The analysis of household kiwana consumption indicates: a. The presence of multicollinearity in the four-variable regression. This is not the best answer. There appears to be multicollinearity in the data, but answer choices "b" and "c" are correct, too.
b. That educational level and income are highly correlated. This is not the best answer. The low significance of income in the presence of the dummy variables on educational level indicates that educational level and income are highly correlated. But answer choices "a" and "c" are correct, too.
c. That household size contributes significantly to household kiwana consumption. This is not the best answer. The low p-value for the household size coefficient indicates that household size contributes significantly to household kiwana consumption, but answer choices "a" and "b" are correct, too.
d. All of the above. This is the best answer. The data and regression analyses indicate that all three answer choices are correct.
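To see how a three-category variable is encoded in practice, the Python sketch below builds the two education dummies, with "no college degree" as the base case, and fits the four-variable model. The file and column names are illustrative assumptions, not the actual AFV&L data.

    # Minimal sketch: two 0/1 dummies for a three-category education variable.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("kiwana.csv")  # hypothetical file name
    # Suppose "education" takes the values "none", "college", "postgrad".

    # "No college degree" is the base case, so it gets no dummy of its own.
    df["edu_college"] = (df["education"] == "college").astype(float)
    df["edu_postgrad"] = (df["education"] == "postgrad").astype(float)

    model = smf.ols(
        "consumption ~ household_size + income + edu_college + edu_postgrad", data=df
    ).fit()
    print(model.params)   # each dummy coefficient is the shift relative to the base case
    print(model.pvalues)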
Kiwana Consumption Data Kiwana Consumption Regressions Exercise 2: The Return of the Empire For this exercise, refer to the regression analyses you ran in exercise 1 of the previous section. Empire Learning Data Empire Learning Regressions Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages and the total animation run-time as independent variables. Empire Learning Data Empire Learning Regressions Bill believes that the number of illustrations used in the course may also have a significant impact on the number of labor-hours it takes to complete an e-learning course. He wants to add the number of illustrations to the model as another independent variable. Empire Learning Data Empire Learning Regressions Run the simple regression of labor-hours versus number of illustrations. Empire Learning Data Empire Learning Regressions
At which level is the number of illustrations a statistically significant independent variable? a. 0.01 This is not the best answer. The p-value for the illustrations coefficient is greater than the significance level 0.01.
b. 0.05, but not 0.01 This is the best answer. The p-value for the illustration coefficient is 0.0499.
c. 0.10, but not at 0.05 This is not the best answer. The p-value for the illustration coefficient is less than 0.05.
d. None of the above This is not the best answer. The p-value for the illustration coefficient is less than 0.05.
Empire Learning Data Empire Learning Regressions Run the multiple regression of labor-hours versus number of pages, illustrations, and animation run-time. Empire Learning Data Empire Learning Regressions Is there evidence of multicollinearity in the data? a. Yes, because R-squared is relatively high and all of the independent variables have statistically significant coefficients. This is not the correct answer. If all of the independent variables have statistically significant coefficients, there is no evidence of multicollinearity.
b. Yes, because R-squared is relatively high and some of the independent variables do not have statistically significant coefficients. This is the correct answer. If R-squared is relatively high but some of the independent variables do not have statistically significant coefficients, there is evidence of multicollinearity.
c. Yes, because R-squared is relatively high and the intercept coefficient is not significant. This is not the correct answer. The intercept coefficient is irrelevant when diagnosing multicollinearity.
d. No, because R-squared is relatively high and some of the independent variables do not have statistically significant coefficients. This is not the correct answer. This evidence suggests the existence - not the absence - of multicollinearity.
Empire Learning Data Empire Learning Regressions A common symptom of multicollinearity is a high adjusted R-squared — in this case 94% — accompanied by one or more independent variables with low significance. In this case, the coefficient for the number of illustrations is not significant at the 0.05 level, and the p-value for the number of pages has risen to 0.0291, up from 0.0015 in the regression without illustrations. Which of the following is the likely culprit of multicollinearity? a. A positive correlation between the number of illustrations and the number of pages This is the best answer. The correlation coefficient is 67%. Moderate to high correlations between independent variables are often the cause of multicollinearity.
b. A negative correlation between the number of illustrations and the number of pages This is not the best answer. The correlation coefficient between the number of illustrations and the
number of pages is positive.
c. A positive correlation between the number of illustrations and the run-time of animations This is not the best answer. The correlation coefficient between animations and illustrations is 22%, much lower than between illustrations and pages.
d. A negative correlation between the number of illustrations and the run-time of animations This is not the best answer. The correlation coefficient between animations and illustrations is positive.
Empire Learning Data Empire Learning Regressions
Multicollinearity occurs when the respective effects of two or more independent variables on the dependent variable are not distinguishable in the data. This can be the result of correlated independent variables. The fact that the p-value for the number of pages rises when we add the illustrations raises our suspicions that the number of illustrations and the number of pages might be correlated. We can compute the correlation between the number of pages and the number of illustrations: the correlation coefficient, 67%, is fairly high. We could also attempt to diagnose the cause of the multicollinearity by running a regression of labor-hours versus number of pages and number of illustrations - omitting animations. Here, the significance of illustrations is extremely low, with a p-value of 0.85.
Empire Learning Regressions
In the regression of labor-hours versus number of illustrations and run-time of animations - omitting pages - the respective effects of the independent variables on the dependent variable can be distinguished. Here, the p-values for both variables are much lower than 0.05. All of the evidence points to a linear relationship between number of pages and number of illustrations as the culprit for the multicollinearity.
Empire Learning Regressions
Bill wants to use the regression analysis to predict the number of labor-hours it will take to complete a new e-learning course. Comparing the two regression models - one with all three independent variables, one without illustrations - which should Bill use? a. With number of illustrations This is the best answer. The number of illustrations is at least statistically significant at the 0.10 level, and adjusted R-squared for this regression is higher than it is when we exclude the number of illustrations variable. For making predictions, multicollinearity is typically not a problem.
b. Without number of illustrations This is not the best answer. The number of illustrations is at least statistically significant at the 0.10 level, and adjusted R-squared for this regression is higher than it is when we exclude the number of illustrations variable. For making predictions, multicollinearity is typically not a problem.
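If you want to reproduce the correlation check described above outside of Excel, the short sketch below shows the idea. It is only an illustration: the column names and the tiny data set are hypothetical stand-ins for the Empire Learning workbook, so the printed correlations will not match the 67% and 22% figures reported above; the point is simply that a pairwise correlation matrix of the candidate independent variables is the standard first diagnostic for multicollinearity.

```python
# Illustrative sketch only: the column names and values are hypothetical
# placeholders standing in for the Empire Learning workbook referenced above.
import pandas as pd

courses = pd.DataFrame({
    "pages":         [120,  85, 200, 150,  95, 170],
    "illustrations": [ 30,  20,  55,  40,  22,  48],
    "animation_min": [ 12,   5,  20,  15,   8,  18],
    "labor_hours":   [410, 300, 690, 520, 330, 600],
})

# Pairwise correlations among the candidate independent variables.
# Moderate-to-high correlations (here, pages vs. illustrations) are the
# usual warning sign of multicollinearity in a multiple regression.
print(courses[["pages", "illustrations", "animation_min"]].corr().round(2))
```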
Empire Learning Data Empire Learning Regressions Decision Analysis Introduction Three more days before you leave Hawaii. As excited as you are by your imminent debut at business school, you don't look forward to your departure. You find yourself weighing the relative merits of business school and just staying on Kauai. Leo has an opening in the restaurant...
The shrill ring of the telephone interrupts your reverie. Sounding excited even by Leo standards, the hotelier demands your immediate attention in his office. Dining at Sea: The Chez Tethys Problem I've had an exciting business venture on my mind for some time: I want to run a floating restaurant. It's going to be amazing, offering spectacular views of volcanic silhouettes and the bright lights of resort towns, the best Hawaiian culinary artistry, tastefully alluring hula dancing and music, ... Slow down, slow down, Leo. I'm interested to hear your idea, but let's focus on the big picture first. A floating restaurant? I want to run a world-class restaurant on a yacht. Chez Tethys, I'll call it. I've had the idea for some time. Chez Tethys will be Hawaii's most luxurious dining experience. The yacht will travel from island to island, docking in a different harbor each night. Seating will be limited, and you can board only at selected times. The challenge alone of making a reservation will make it irresistible. It would become a matter of prestige to include a meal Chez Tethys on any trip to the islands. The Chez Tethys has been on my mind since before I inherited the Kahana. I'd been looking for investors, but without luck. Then the Kahana fell into my lap, and the hotel has preoccupied me since then.
I'm beginning to feel confident about running the hotel — thanks to you two, I should add. Now I feel like I'm ready to expand my operations. In addition to its own profits, just think of the prestige a high-profile floating restaurant would add to the Kahana! Listen, Leo, this is an intriguing proposition, but I'm not fully convinced that a floating restaurant would really catch on. Let's go over your business plan, and then we'll think about how likely it is that the Tethys will be a profitable venture. One quick question: How do you intend to finance the Tethys? I'm way ahead of you. I've already been to the bank, and I think I can take out a loan using the Kahana as collateral. I have a really good feeling about the Tethys. I'm pretty confident that I can make a floating restaurant work. "Leo's Chez Tethys idea is out of the ordinary, but very interesting," Alice muses. "But we'll need to guide him through the decision to expand his operations so extravagantly, especially since he's putting the Kahana at risk. We wouldn't want him to make a choice he'll regret."
Introducing Decision Analysis S&C Films, a small film production company, just purchased a script for a new film called Cloven. Cloven's unusual plot has the makings of a cult classic: a pastoral romance shattered by a livestock revolution and the creation of a poultry-dominated totalitarian farm state. S&C's owner and producer, Seth Chaplin, enthusiastically pitches the film: "It's Animal Farm meets Casablanca". Seth entered into negotiations with two major studios — Pony Pictures and K2 Classics — and hammered out a prospective deal with each of them. Pony Pictures liked the script and agreed to pay S&C $10 million to produce Cloven, covering the production costs and a profit margin for S&C. Seth estimates Cloven would cost S&C $9 million to produce. Under this agreement, Pony would acquire all rights to the movie. The second deal, negotiated with K2 Classics, stipulates that K2 would market and distribute the movie, and S&C would cover the production costs itself. Under this agreement, K2 and S&C would split ownership of the movie. Seth believes that the movie has huge potential. Even with partial ownership, S&C would reap enormous profits if Cloven becomes a blockbuster. This deal specifies that K2 would collect all revenues until its marketing and distribution costs are covered, then take 35% of any further revenues. S&C would collect the remaining 65%. If Cloven is a blockbuster, S&C will make a killing. If it flops, S&C will lose some or all of its production investment. Which deal should Seth choose?
Decision-making is an essential management responsibility. Managers are charged with choosing from multiple courses of action that can each lead to radically different consequences. Seth's decision about how to finance Cloven's production might lead to anything from blockbuster profits to financial disaster. When faced with a choice of actions and a range of possible outcomes, decision making can be difficult, in large part due to uncertainty. Although the course of action we choose influences the outcome, much of what happens is not only beyond our control, but beyond our powers to
predict with certainty. Nonetheless, decisions like Seth's that involve uncertainty can be analyzed logically and rigorously: decision analysis tools can help managers weigh alternative options and make informed and rational choices. In a world of uncertainty, applying decision analysis will not guarantee that each decision you make will lead to the best result, or even to a good result. But if you apply effective decision analysis consistently over the course of a managerial career, you are almost certain to gain a reputation for sound judgment.
Summary Decision analysis is a set of formal tools that can help managers make more informed decisions in the face of uncertainty. Although applying even the most rigorous decision analysis does not guarantee infallibility, it can help you make sound judgments over the course of your career.
Decision Trees "The problem with Leo's floating restaurant is that its success hinges on so many factors," Alice tells you. "Some, Leo can predict, like the approximate price of a yacht, or the cost of labor. But what about consumer demand? Truth be told, not even Leo knows whether the Chez Tethys will catch on. Leo has to incorporate this uncertainty into his decision-making process." Uncertainty and Probability Producer Seth Chaplin boasts that when Cloven is released in theaters, it will become an instant classic. But even an incurable enthusiast like Seth would never claim that success is certain, and if asked how confident he is that Cloven will achieve at least modest success, he might reply with a percentage: "80%." Uncertainty clearly has a quantitative dimension: we can distinguish between degrees of uncertainty. But our intuition about uncertainty is often underdeveloped. How do we measure uncertainty consistently and systematically? Probability is a measure of uncertainty. A meteorologist reports the probability of rain in her forecast: "We'll see cloudy skies tomorrow, with a 20% chance of rain." Probability is measured on a scale from 0 to 1, or 0% to 100%. Events that are impossible are said to have zero (or zero percent) probability, and events that are absolutely certain are said to have a probability of one (or 100%). But how do we interpret probabilities of 20% or 37.8%? Let's look at a very simple and familiar uncertain event: a spin of a wheel of fortune. Our wheel consists of two equally-sized areas: an orange half and a green half. Spin the wheel a few times and count the number of green and orange outcomes you get.
Since the two areas are of equal size, we'd expect "greens" and "oranges" to occur with equal frequency. About half of the spins will end up "green" and half will end up "orange." Of course, if we only spin a few times, we won't expect exactly half of the spins to result in "green." But the more times we spin, the closer we expect the percentage of greens will be to 50%. This is the probability of getting a "green" each time we spin. This interpretation of probability is a relative frequency interpretation. We have a set of events — in our case, spins — and a set of outcomes — "green" or "orange." We form the ratio of the number of times a particular color occurs to the total number of spins. The probability of that particular color is the value that ratio approaches as the total number of spins approaches infinity. Let's look at a different wheel. This wheel is also divided into two areas, but now the green area is larger: it covers three quarters of the wheel. What is the probability of a spin of this wheel resulting in the outcome "green"? a. 25% This is not the best answer. Think about the relative sizes of the green area and the total area.
b. 50% This is not the best answer. Think about the relative sizes of the green area and the total area.
c. 75% This is the best answer.
d. 100% This is not the best answer. Think about the relative sizes of the green area and the total area.
Since the green area covers three quarters of the whole wheel, we can expect that three out of four spins will result in a green outcome. This corresponds to a probability of 75%. For simple uncertain events like a wheel of fortune game, figuring out the probabilities of the outcomes is easy. There are a limited number of possible outcomes and there are no hidden factors affecting the outcome. Besides, if we had any difficulty believing that the probability of a "green" is 75%, we could simply spin the wheel over and over again to calculate an approximation of the probability. For more complex uncertain events like the weather or the success of a new venture, finding the probabilities of the outcomes is more difficult. Many factors can affect the outcome; indeed, we may not even know all the possible outcomes. Furthermore, there may be no simple experiment we can repeatedly run to generate approximations of the relative frequencies. How should we think about probabilities in these situations? Let's think about the weather. Anytime you leave your house, you face the decision of whether or not to take an umbrella. If the sky is perfectly blue, you probably won't take one. Nor will you do so if there are a few white clouds in the sky. On the other hand, if the sky is completely overcast, you might consider taking some kind of rain protection, and if it's already raining, you almost certainly would. You'd be able to give a rough estimate of the probability of rain; for instance if it's overcast, you might estimate a 40% chance of rain.
This estimate may be rough, but it isn't completely arbitrary, and it's clearly a better estimate than either 0% or 100%. Your subjective estimate may be based on an informal assessment of relative frequency. The estimate of "40%" might be shorthand for the reflection: "In my past experience, it has rained on a little under half the days when it has been this overcast in the morning at this time of the year, in this geographic location." Although you probably never sat down and tabulated the days according to season, atmospheric conditions, and rainfall, your informal assessment is powerful enough to make a reasonable choice about taking your umbrella. Even so, you may sometimes make the wrong choice.
Similarly, when Seth Chaplin gives his estimate of an 80% chance of Cloven's success, he is probably articulating the following: "In my experience, about four out of five movies of a similar genre, with a similar budget, released under similar conditions were successful." In the examples we've discussed thus far, our uncertainty about the events we faced has stemmed from the fact that the events had not yet happened. Sometimes we face uncertainty for a different reason: an event may have already occurred, but we simply don't know what the outcome was. Let's return to the wheel of fortune to see how we might think about uncertainty in these cases. Imagine you and a friend spin the wheel. You close your eyes and keep them shut while your friend observes the wheel. While the wheel is spinning, both you and your friend are uncertain about the outcome, and you would each assess the probability of "green" at 75%.
When the wheel stops, the result becomes definite. Your friend knows the outcome: for her, the probability of "green" is now either 0% or 100%. But as long as your eyes are closed, you remain uncertain about the outcome. If someone promised to give you $10 if you correctly guessed the outcome of that spin before you opened your eyes, which color would you choose? a. Green There is no correct answer to this question, of course, but on average, choosing "green" would be financially advantageous.
b. Orange There is no correct answer to this question, of course, but on average, choosing "green" would be financially advantageous. From your limited point of view, the best choice you can make is "green," even though the outcome may already be a definite "orange." If you played the same game over and over again and always chose "green," you would win three quarters of the games. The probability of the outcome turning out to have been "green" when you open your eyes is 75%. Even though the event has already occurred, the outcome is uncertain from your point of view. It makes sense to measure the uncertainty due to a lack of information the same way we measure the uncertainty due to our inability to predict the outcomes of future events.
Summary Uncertainty makes decision-making challenging. We can be uncertain about outcomes that have or have not occurred. To make the most informed decisions, we quantify our uncertainty using probability measures. Probability is measured on a scale from 0% to 100%: events with 0% probability are impossible; events with 100% probability are certain. An event's probability can often be determined by observing the relative frequency of its occurrence within a set of opportunities for the event to occur. For events for which relative frequencies are difficult to assess, we often make subjective estimates of the probabilities. Though not based on "hard" data, these estimates are often a sufficient basis for sound decision making.
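For readers who like to see the relative-frequency interpretation in action, the following minimal simulation (not required by the tutorial) spins a hypothetical 75%-green wheel repeatedly and shows the observed share of greens settling near the underlying probability as the number of spins grows.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def spin():
    # "green" covers three quarters of the wheel, "orange" one quarter
    return "green" if random.random() < 0.75 else "orange"

for n in (10, 100, 10_000, 1_000_000):
    greens = sum(spin() == "green" for _ in range(n))
    print(f"{n:>9} spins: share of greens = {greens / n:.3f}")
```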
Structuring Decision Trees Alice whips out a notepad and pencil, and begins drawing something that vaguely resembles a saguaro cactus. "Let's look at how we can analyze and structure decisions like Leo's." Graphical representations are often a highly efficient way to organize and convey information. Scatter plots and histograms efficiently communicate distributions of data; an organizational chart can quickly outline the structure of a company; and pie charts effectively express information about proportions and probabilities. Is there such a thing as a graphical tool that helps inform and organize the decision-making process?
The answer is yes, and the graphical tool is called a decision tree. Let's look at a simple decision tree. Seth Chaplin's business decision — whether to produce Cloven for Pony Pictures or in partnership with K2 Classics — involves two alternatives. At some point, Seth must make a decision. Until he does, there are two paths along which history can unfold. Cloven will either be owned by Pony or co-owned by S&C Films and K2 Classics. A different branch of the tree represents each alternative. Decisions aren't the only points at which alternatives can branch off. Events with uncertain outcomes also lead to branching alternatives. Once released in theaters, Cloven could be a "Blockbuster" hit, have a fair, but "Lackluster" performance, or "Flop" completely. Each of these outcomes corresponds to a separate branch on the decision tree. Decision trees have two types of branching points, or nodes. At decision nodes, the tree branches into the alternatives the decision-maker can choose. We use square boxes to represent decision nodes. At chance nodes, the tree branches because the uncertainty of the business world permits multiple possible outcomes. We represent chance nodes by circles. We arrange the nodes from left to right in the order in which we will eventually determine their results. In Seth's example, the first thing to be determined is his own decision about which deal to choose. The next thing to be determined is Cloven's box office performance. One of the challenges of creating a decision tree is determining the correct sequence of nodes: in what order will the relevant outcomes be determined? which events "depend" on the occurrence of other events? What alternatives are created or foreclosed by prior decisions or by the outcomes of uncertain events? Drawing a decision tree forces us to delineate each alternative and clarify our assumptions about those alternatives. Thus, even before we use a tree to make a decision, simply structuring the tree helps us clarify and organize our thoughts about a problem. A decision tree is also useful in its own right as a tool for communicating our understanding of a complex situation to others. Categorizing branching points as decision or chance nodes makes clear which events and outcomes are under our power and which are beyond our control. Writing down alternatives in a systematic fashion often allows us to think of ideas for new alternatives. For instance, after viewing the decision tree, Seth might realize that he would prefer a deal that pays part of his production costs but still allows a small ownership stake as opposed to either of the deals he has hammered out with Pony and K2.
Summary Decision trees are a graphical tool managers use to organize, structure, and inform their decision-making. Decision trees branch at two types of nodes: at decision nodes, the branches represent different courses of action the decision-maker can choose. At chance nodes, the branches represent different possible outcomes of an uncertain event - at the time of the initial decision, the decision maker does not know which of these outcomes will occur (or has occurred). The nodes are arranged from left to right, in the order in which the decision-maker will determine which of the possible branches actually occurs.
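One way to make the two node types concrete is to write a decision tree down as nested records. The sketch below is a hypothetical code representation of Seth's tree: a decision node (drawn as a square in the tutorial) for the choice of deal, with the K2 branch ending in a chance node (drawn as a circle) for the three box-office outcomes. No probabilities or dollar values appear yet; incorporating those is the subject of the next section.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionNode:          # drawn as a square box in the tutorial
    name: str
    options: dict = field(default_factory=dict)   # choice label -> next node or endpoint

@dataclass
class ChanceNode:            # drawn as a circle in the tutorial
    name: str
    outcomes: list = field(default_factory=list)  # possible outcome labels

cloven_tree = DecisionNode(
    name="Which deal?",
    options={
        "Sell to Pony Pictures": "Pony owns Cloven",
        "Partner with K2 Classics": ChanceNode(
            name="Box-office performance",
            outcomes=["Blockbuster", "Lackluster", "Flop"],
        ),
    },
)
print(cloven_tree)
```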
Incorporating Data into Decision Trees Laying out a decision tree — mapping a logical structure and identifying the alternative scenarios — is an essential first step in the decision-making process. But how do we compare different scenarios? On what criteria might we base our preference for one scenario over another? In the business world, success is often measured in profits. Seth Chaplin will consider Cloven a success if it brings in at least a modest profit. How should Seth incorporate possible profit levels into his decision analysis? Seth's decision tree contains four possible scenarios: four unique paths from his current decision node on the left to the ultimate outcomes on the right. He must assign to each scenario a dollar amount — S&C's profits. For example, if Seth chooses to produce Cloven for Pony Pictures, his expected profits will be $1 million. What if Seth chooses the K2 deal? If he has a significant ownership stake in the movie, his profits will depend on Cloven's box office success. With a "Blockbuster" movie come "Blockbuster" profits: Seth estimates around $6 million. If Cloven's performance is "Lackluster," Seth thinks the Cloven project would break even for S&C Films: S&C's expected profits would be roughly $0. If Cloven "Flops," S&C's production costs will exceed its revenues from the movie, leading to an expected loss of $2 million. Are these three the only possible outcomes? Clearly, no. S&C's profits on the release of Cloven could be $6.6 million or $15.03 million. In fact there is a range of outcomes stretching from profits in the hundreds of millions of dollars — the historical upper limit of movie profits — to a loss of $9 million — the amount S&C will invest in production. Although it is mathematically possible to analyze the full range of possible outcomes, the procedure for doing so is usually more complicated and time consuming than is warranted for many decision problems. In practice, considering only a few representative scenarios as we have done in the Cloven example will typically lead to good decisions. When we use a scenario to represent a range of possible outcomes, the outcome figure we assign to that scenario should represent the weighted average of all outcomes in the range that scenario represents. For example, in the Cloven decision, $0 million might represent the weighted average of all possible profits from -$3 million to +$3 million.
Decision Trees and Probabilities How can Seth use his decision tree and estimated profit values to choose the better deal? If Seth chooses the partnership with K2, S&C could make $6 million. Clearly, that scenario is preferable to the scenario in which S&C simply sells Cloven to Pony for a profit of $1 million. If S&C retains part ownership of Cloven and it "Flops" in the theaters, S&C stands to lose $2 million. That scenario is clearly worse than the scenario in which S&C sells Cloven to Pony. If Cloven is highly likely to "Flop" in theaters, then Seth should choose the Pony deal; if it's highly likely to be a "Blockbuster" he should choose the K2 deal. Clearly, the likelihood of the different outcomes in the K2 deal should weigh heavily in Seth's decision. To each branch emanating from a chance node we must associate a probability: the probability of that outcome occurring. These probabilities may be based on historical data or on our best judgment of the likelihood of each outcome.
Based on the quality of the script and his years of experience in the movie industry, Seth estimates the probability of Cloven becoming a "Blockbuster" at 30%, the probability of a fair, but "Lackluster" performance at 50%, and the probability of a "Flop" at 20%. If S&C sells Cloven to Pony, the level of box office success is largely irrelevant to Seth. S&C's profits are certain to be $1 million. When assigning outcomes and probabilities to chance nodes, we must meet two
requirements. First, outcomes that branch off from the same chance node must be "mutually exclusive." That is, two outcomes emanating from the same node cannot both occur at the same time: the occurrence of one outcome excludes the occurrence of any other. Second, the set of outcomes that branch off from the same chance node must be "collectively exhaustive": the branches must represent all possible outcomes.
In practice, we typically don't depict every possible outcome separately. For Cloven, we've reduced the vast range of possible financial outcomes to a manageable set of three representative outcomes. However, we consider these three outcomes collectively exhaustive in that they represent three ranges that together exhaust all possibilities. Also, we usually consider extremely unlikely but not impossible events to be included in one of the existing branches, or if sufficiently unlikely, to be irrelevant to decision-making. In the Cloven example, we don't include a branch representing the simultaneous destruction of all existing copies of the Cloven film. When we construct a chance node, we must make sure that the outcomes emanating from that node are mutually exclusive and collectively exhaustive. Since it is certain that one and only one of the possible scenarios will occur, the individual probabilities must add up to 100%.
Summary For a decision tree to effectively inform a decision, it must incorporate two types of relevant data: endpoint values corresponding to each scenario and the probabilities of the possible outcomes of each uncertain event. An outcome value is associated with a scenario - a unique path from the first node on the left of the tree to an endpoint on the right. We place the appropriate probability on each branch emanating from a chance node. Outcomes represented by branches emanating from the same chance node must be mutually exclusive and collectively exhaustive.
Exercise 1: The Tardytech Blame Game Ted Nesbit is a project manager at Tardytech, a software development company. Ted is planning the project schedule and budget for the development of a new piece of custom business software for a Fortune 500 client. The client has imposed a strict deadline for completion of the project. However, based on past experience with similar clients, Ted knows that there is a significant probability that Tardytech will not be able to meet its deadline, in most cases due to client-side delays.
Experience indicates that 56% of all Tardytech projects are completed on time, 42% are delayed due to client-side delays, and 2% are delayed due to errors and process failures at Tardytech. What is the probability that Tardytech's new project will not be completed on time? a. 56% This is not the best answer. Think about the two different causes of delays, then think about the total probability of a delay.
b. 44% This is the best answer.
c. 42% This is not the best answer. Think about the two different causes of delays, then think about the total probability of a delay.
d. 2% This is not the best answer. Think about the two different causes of delays, then think about the total probability of a delay. The total probability of a delay is the sum of the probabilities of a client-side delay and a Tardytech-side delay: 42% + 2% = 44%. A potential new client has asked Tardytech to report the probability that a project will not be delayed due to Tardytech errors or process failures. What probability should Ted report?
a. 98% This is the best answer.
b. 58%
This is not the best answer. Think about what can happen to a project besides being delayed by Tardytech errors and process failures.
c. 56% This is not the best answer. Think about what can happen to a project besides being delayed by Tardytech errors and process failures.
d. 44% This is not the best answer. Think about what can happen to a project besides being delayed by Tardytech errors and process failures. Only 2% of all projects are delayed due to problems caused by Tardytech. The remaining projects are either completed on time (56%) or delayed by client-side delays (42%). The probability that a project will not be delayed by Tardytech is the sum of the probabilities of those outcomes, 98%. Another potential client has asked Tardytech to report the percentage of all delayed projects whose delays could be attributed to Tardytech's errors and process failures. What percentage should Ted report?
a. Around 2.0% This is not the best answer. Think about the relative frequency of Tardytech delays among the population of delayed projects.
b. Around 4.5% This is the best answer.
c. Around 4.8% This is not the best answer. Think about the relative frequency of Tardytech delays among the population of delayed projects.
d. The answer cannot be determined from the information provided. This is not the best answer. Think about the relative frequency of Tardytech delays among the population of delayed projects. In the past, 44 out of 100 projects were not completed on time. On average, 2 projects out of 44 delayed projects were delayed due to Tardytech errors and process failures. Thus, the proportion of Tardytech-caused delays among all delays is 2/44, or 4.5%. We'll discuss and analyze similar questions in greater depth in the next unit.
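All three Tardytech answers follow from the same simple arithmetic on the stated frequencies (56% on time, 42% client-side delays, 2% Tardytech-side delays). A minimal sketch of that arithmetic:

```python
# Probabilities stated in the exercise
p_on_time         = 0.56
p_client_delay    = 0.42
p_tardytech_delay = 0.02

p_any_delay = p_client_delay + p_tardytech_delay            # 0.44 -> 44%
p_not_tardytech_delay = 1 - p_tardytech_delay               # 0.98 -> 98%
share_tardytech_among_delays = p_tardytech_delay / p_any_delay  # ~0.045 -> ~4.5%

print(p_any_delay, p_not_tardytech_delay, round(share_tardytech_among_delays, 3))
```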
Exercise 2: The First Bank of Silverhaven As a loan officer at the First Bank of Silverhaven, Carla Wu determines whether or not to grant loans to small business loan applicants. Of 108 recent successful small-business loan applicants with a full year of payment history, 28 were unable to meet at least one loan payment in the first year they had an outstanding loan.
What is the probability that a randomly selected small business in this pool missed a payment in the first year of the loan? Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500"). Round if necessary. The probability that a randomly selected small business missed a payment in the first year of the loan is simply the ratio of the number who missed at least one payment to the total number of loans: 28/108 ≈ 0.259, or 25.9%.
Exercise 3: The Shipping Bea Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her current fleet. If she leases the truck, she may be able to generate new business that would increase her company's profits. However, if additional business fails to materialize, she may not be able to cover the incremental leasing costs. Based on her assessment of future trends in the transportation sector, Robin identifies three scenarios that might transpire: "Boom," "Moderate Growth," and "Slowdown," depending upon the performance of the economy and its impact on the transportation sector. She associates with each scenario an estimate of the total profits an expanded fleet will generate for her firm. If she doesn't lease the new truck, Robin probably won't generate as much revenue, but her leasing costs will be lower. Again, she breaks down the possible outcomes if she does not lease the additional truck into three scenarios corresponding to the performance of the economy. Then she associates an estimate of total firm profits with each scenario.
Which of the following trees best represents Robin's decision?
a. a. This is not the best answer. Think about what kinds of nodes are present in this decision.
b. b. This is the best answer.
c. c. This is not the best answer. What order do the important events of this decision occur in?
d. d. This is not the best answer. What order do the important events of this decision occur in? The first branching of the tree occurs at the point when Robin makes her decision: to lease or not to lease a new truck. A square decision node should represent this decision. Each of Robin's options splits into three possible outcomes depending on the performance of the economy. Robin will learn which economic scenario transpires only after making her decision. The second branching represents this uncertainty. The nodes of these branches are chance nodes and should be drawn as circles.
Exercise 4: Wheel of Fortune You and a friend have a wheel of fortune that has a 75% chance of a "green" outcome and a 25% chance of a "red" outcome. Before you spin, the friend offers you $100 if you correctly predict the outcome. If you choose the wrong color, you receive nothing. The good news: you don't have to choose until the spin is complete. The bad news: you will have to keep your eyes closed until the wheel has stopped and you have made your prediction. You close your eyes and keep them shut while the wheel spins. When the wheel stops, your eyes are still closed. You now must decide whether to choose "red" or "green." Which of the following trees best represents your decision?
a. a. This is the best answer.
b. b. This is not the best answer. Think about the order in which you will discover the results of each node.
c. c. This is not the best answer. Think about the circumstances under which you win this game.
d. d. This is not the best answer. Think about the order in which you will discover the results of each node.
The nodes of a decision tree are arranged from left to right in the order in which we discover their results, not in the order in which the events actually occur. Thus, even though the wheel stops and the result is finalized before you make your decision, because you don't know the result until after your decision is made, the chance node for the spin result should appear after — to the right of — the decision node. Comparing the Outcomes Alice's decision trees are neat devices to help organize and structure a decision problem. But how are they going to help Leo evaluate his options and choose the best one? Introducing the Expected Monetary Value Seth Chaplin must decide how to produce the movie Cloven. He has mapped out the logical structure of his decision in a decision tree and has incorporated the appropriate data, evaluating each scenario in terms of its expected profits and its probability of occurring. Now, how should he use the tree to inform his decision? If S&C Films works in partnership with K2 and retains part ownership of Cloven, S&C could earn a profit of $6 million, a highly desirable outcome. However, that scenario is relatively unlikely: 30%. How do we balance the high value of that outcome against its low probability? The answer is elegant and simple: we multiply the outcome value by its probability: $6 million * 0.3 = $1.8 million. Essentially, we "credit" the "Blockbuster" outcome with only 30% of its value. Similarly, we credit a "Lackluster" performance of $0 profits with only half of its value.
The magnitude of the loss incurred if the film "Flops" is mitigated by its low likelihood of 20%.
Then we add these weighted values together, for a total of $1.4 million. This total is called the expected monetary value (EMV) of the K2 option. The EMV is a weighted average of the expected outcomes of the scenarios in which S&C retains part ownership of Cloven as stipulated in the K2 deal. Let's return to our wheel of fortune game to build our intuition for the concept of the EMV. This wheel consists of three areas: "blue," "green," and "red." The blue area is 30% of the total area. If a spin results in "blue," you win $6. "Green" covers 50% and "red" covers 20% of the wheel's area. If a spin results in "green," you gain nothing at all, if the result is "red," you lose $2. Suppose you play the game 100 times. How much money do you think you will have gained or lost over 100 spins? What do you estimate will be your average yield per game? About 30% of the time, a spin will result in "blue." In other words, you can expect 30 of the 100 spins to yield $6. About 50% of the time a spin will result in "green." You expect 50 of the 100 spins to yield $0. And about 20% of the time a spin will result in "red," — that is, you expect 20 of the 100 spins to cost you $2.
The total amount you can expect to win after playing the game 100 times is $140, and the average yield per spin is $1.40. That average is the expected monetary value (EMV) for a single spin. The expected monetary value for a single spin is $1.40. If you spin the wheel once, which of the following results is least likely to occur as the outcome? a. A loss of $2. That is not the correct answer. Compute the probabilities of the four outcomes. Which is the least likely to occur?
b. No gain or loss (i.e. $0). That is not the correct answer. Compute the probabilities of the four outcomes. Which is the least likely to occur?
c. A gain of $1.40. This is the best answer. The probability of gaining $1.40 is 0.
d. A gain of $6. That is not the correct answer. Compute the probabilities of the four outcomes. Which is the least likely to occur? The probability of each outcome is shown below. With a probability of 0%, $1.40 is the least likely outcome.
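As a quick check on the arithmetic, the sketch below computes the wheel's EMV directly as the probability-weighted average of the three possible payoffs described above.

```python
# Payoffs (in dollars) and probabilities for the three-color wheel described above.
outcomes = {"blue": (6, 0.30), "green": (0, 0.50), "red": (-2, 0.20)}

# EMV: weight each payoff by its probability and sum the weighted values.
emv = sum(value * prob for value, prob in outcomes.values())
print(f"EMV per spin: ${emv:.2f}")   # $1.40
```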
It is important to understand the nature of the expected monetary value. We do not actually "expect" an outcome of $1.40. In fact, $1.40 is not even a possible outcome. The EMV of $1.40 is the long-term average value of the outcomes of a large number spins. In the Cloven case, the EMV of $1.4 million is the average amount of profits Seth Chaplin can expect to make when he produces similar films in similar circumstances. We can use the EMV as a measure with which to compare alternative options. First, we calculate the EMV for each chance node, beginning at the right of the tree. For the chance node associated with the K2 deal, the EMV is $1.4 million. Now that we have calculated the EMV of the K2 chance node, we can "collapse" the branches emanating from the chance node to a single point. Going forward, we can treat the EMV of $1.4 million as the endpoint value of the K2 option. The EMV for the Pony Option is simply $1.0 million: the outcome value multiplied by its probability of 100%.
At a decision node, we choose the best EMV of all the branches emanating from that decision node. In our Cloven example, the best EMV is the one with the highest expected profits. The EMV of the K2 deal, $1.4 million, exceeds $1.0 million, the EMV of the Pony deal. Selecting the option with the best EMV and removing all other options from consideration is known as "pruning" the tree. Any decision tree — no matter how large or complex — can be analyzed using two simple procedures. At each chance node, calculate the EMV, collapse the branches to a point, and
replace the chance node with its EMV. At each decision node, compare the EMVs and prune the branches with less favorable EMVs. This entire process is known as folding back the decision tree. Summary We often use expected monetary value (EMV) to quantify the value of uncertain outcomes. The EMV is the sum of the values of the possible outcomes of an uncertain event after each has been weighted by its probability of occurring. The EMV can be interpreted as the expected average outcome value of the uncertain event, if that uncertain event were repeated a large number of times. To analyze a tree, we "fold it back:" we move from right (the future) to left, finding the EMV for each node. For chance nodes we calculate the EMV as described above. For decision nodes we simply choose the option with the best EMV - lower costs or higher profits - among the choices represented by a decision node's branches and prune the others.
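Because folding back is such a mechanical procedure, it is easy to express in a few lines of code. The sketch below is illustrative only (the tutorial does not require any programming): it represents chance nodes as lists of (probability, subtree) pairs and decision nodes as dictionaries of options, then folds back Seth's Cloven tree using the figures given earlier in this unit.

```python
# A node is one of:
#   a number                        -> an endpoint value (profit, in $ millions)
#   ("chance", [(p, subtree), ...]) -> a chance node with outcome probabilities
#   ("decide", {label: subtree})    -> a decision node with alternative choices

def fold_back(node):
    if isinstance(node, (int, float)):
        return node                           # endpoint: its value is its EMV
    kind, body = node
    if kind == "chance":
        # probabilities on a chance node must be collectively exhaustive
        assert abs(sum(p for p, _ in body) - 1.0) < 1e-9
        # EMV: probability-weighted average of the folded-back subtrees
        return sum(p * fold_back(sub) for p, sub in body)
    if kind == "decide":
        # keep the branch with the best (highest) EMV and prune the rest
        return max(fold_back(sub) for sub in body.values())
    raise ValueError(f"unknown node type: {kind}")

cloven = ("decide", {
    "Pony deal": 1.0,                          # certain $1.0 million profit
    "K2 deal": ("chance", [(0.30, 6.0),        # Blockbuster
                           (0.50, 0.0),        # Lackluster
                           (0.20, -2.0)]),     # Flop
})

print(round(fold_back(cloven), 2))             # 1.4 ($ millions): choose the K2 deal
```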
Relevant Costs Burning to use decision trees and EMVs, you start to draw up the square and circular nodes that make up Leo's Chez Tethys decision. Alice, however, urges caution: "Before you get too trigger-happy with those decision trees, you should be aware of a few common pitfalls." Jen Amato has been driving "Millie," an old jalopy of a car, for the past five years. Millie — an Oldsmobile Delta-88 — is twenty years old. A week ago, the air conditioner broke down, and Jen had it replaced at a cost of $500. Now, the drive train is worn out. Replacing it will cost $1,200. If she doesn't repair it, Jen can sell Millie "as is" for about $300. Jen must decide: should she sell her car or should she have the drive train replaced? If Jen sells her car now, she will not even recoup her $500 investment in the air conditioner. If she has the drive train replaced, her beloved Millie will last a little longer, and Jen will be able to enjoy the cool air she spent so much money on. How should the fact that Jen hasn't had the opportunity to benefit from her air conditioner investment affect her decision? With a broken drive train, Millie is worth about $300. According to her mechanic, if Jen pays $1,200 to replace the drive train, Millie will be worth approximately $1,100 in terms of resale value. In the analysis of Jen's decision, what role should her $500 investment in the air conditioner play?
In any scenario in which Jen sells her car, the $500 air conditioner cost has already been incurred, so it could be included in the total cost. Likewise, the air conditioner repair costs are part of the prehistory of any scenario in which Jen has Millie's drive train replaced. Here, too, the $500 investment in the air conditioner could be included as a cost. However, when we compare the total costs of the two options, we recognize that the $500 does not contribute to a difference in the outcome values. Because the $500 cost is incurred in both cases, it plays no role in the comparison of the two options, and thus is irrelevant to Jen's decision — she will have spent the $500 no matter what she decides to do now. Costs that were incurred or committed to in the past, before a decision is made, contribute to the total costs of all scenarios that could possibly unfold after the decision is made. As such these costs — called sunk costs — should not have any bearing on the decision, because we cannot devise a scenario in which they are avoided. It isn't wrong to include a sunk cost in the analysis as long as it is included in the value of every outcome. However, including sunk costs distracts from the differences between scenarios — the relevant costs. Imagine the complexity of Jen's tree if she included every sunk cost she's incurred from owning Millie — from the original purchase price of the car to the cost of all the gasoline she's pumped into Millie over the years, to her expenditure on a dashboard hula dancer. Misinterpreting a sunk cost as a cost that weighs on only some of the scenarios is a common error. After sinking $500 of repairs into her car, selling it for $300 will be quite painful for Jen. Nonetheless, good decisions are made based on possible future outcomes, not on the desire to correct or justify past decisions or mistakes.
Another common decision-making error is to omit from the analysis relevant costs that should have a bearing on the decision. If Jen has Millie's drive train replaced, the repair costs are just one of many costs involved with that option. Given Millie's age, there is a high likelihood of another repair cost soon. Similarly, if Jen sells Millie now, she will have costs associated with buying a new car or arranging other transportation services.
Opportunity costs are an important cost category that decision makers often neglect to include in their analyses. For example, selling the car will require Jen to devote 10 hours of her time that she would otherwise devote to her part-time job paying $12 per hour. Thus, Jen should add $120 in opportunity costs to the outcomes of the "sell Millie" decision. We should also take non-monetary costs into account. Jen will feel sad at leaving her trusted vehicle and companion on many a road trip behind. Although such costs can be difficult to quantify, they should not be neglected. We will see shortly how to use sensitivity analysis to incorporate non-monetary costs into a decision analysis.
Summary Among the most common errors in decision analysis is the failure to properly account for the costs involved in different possible scenarios. On the one hand, relevant costs such as opportunity costs or non-monetary consequences are often omitted. On the other hand, irrelevant costs such as sunk costs are incorrectly included in the analysis. Sunk costs are costs that were incurred or committed to prior to making the decision and cannot be recovered at the time the decision is being made. Since these costs factor into any possible future outcome, they can be safely omitted from the analysis; sunk costs must never be included in only selected branches of a decision tree.
Time Horizons Jen has decided to replace her old car, Millie the Oldsmobile. Her friend, Sven, is leaving the country for two years and doesn't want to pay to store his Mazda Miata. He offers Jen two options. In the first option, Jen leases the car, paying Sven $700 each year. When Sven returns from abroad, he reclaims his car. In the second option, Jen buys the car outright for $4,000. How should Jen compare these two options? What is the appropriate cost difference Jen should use to compare the two options Sven has offered?
a. $3,300 This is not the best answer. Read on to learn more.
b. $2,600 This is not the best answer. Read on to learn more.
c. $1,400 This is not the best answer. Read on to learn more.
d. None of the above. This is the best answer.
To begin, note that in the first option, the costs are spread across two years, whereas in the second option, Jen makes a single payment when she closes the deal with Sven. To compare the options meaningfully, we must define a time horizon for this problem, that is, we must compare their costs over a common time period. In this case, two years is a convenient time horizon. Next, note that we cannot directly compare $1,400, the simple sum of the two lease payments made at two different times, to the $4,000 one-time cost of the purchase option paid entirely at the beginning of the first year. We can only compare costs when they are valued at the same point in time.
We'll compare the present value of the costs associated with each option — that is, the value of the costs at the time Jen makes her decision. In order to compare the costs, we first need to convert the second installment of the leasing payment into its present value. Jen currently has sufficient cash for either alternative in an investment account with a 5% rate of return. What is the present value of the second installment of $700 paid under the leasing option?
a. $700.00 This is not the best answer. Remember that future cash flows must be discounted.
b. $666.67 This is the best answer.
c. $735.00 This is not the best answer. Think about how the value of money changes over time.
d. None of the above.
This is not the best answer. Think about how the value of money changes over time. The present value of the second installment of $700 is $666.67. At the end of one year, $666.67 residing in her investment account today will have increased in value to $666.67*(1.05) = $700, the amount she must pay for the second lease installment. The present value of the total cost of the leasing option is $1,366.67. Can we use this number to compare the two options?
Although $1,366.67 and $4,000 are now comparable costs because they are given at their present value, we have not yet considered the fact that under the purchase option, Jen will own the Miata at the end of the two-year period. In two years, Jen expects that the Miata's market value will be around $3,000. The value of an asset at the end of the time horizon is typically called its terminal value. What's the net present value of the cost of the purchase option - that is, the present value of future cash flows after subtracting, or "netting out," the initial payment to Sven? Recall that Jen currently has her money invested in accounts that earn a 5% average annual return.
a. $4,000.00 This is not the best answer. Think about the value of Jen's assets in two years.
b. $2,721.09 This is not the best answer. Think about the value of Jen's assets in two years.
c. $1,278.91 This is the best answer.
d. $1,142.86 This is not the best answer. Think about the value of Jen's assets in two years. If Jen resells the car in two years, she would receive $3,000. The present value of $3,000 received at the end of two years is $2,721.09. The net present value of the cost of the purchase option is $1,278.91. If cost is the only deciding factor in Jen's choice, she'd save $87.76 if she chose to purchase the Miata. Before finalizing her decision, Jen should make sure to weigh other considerations relevant to the decision, such as whether or not she will need to borrow money to cover other expenses this year if she spends $4,000 to buy the Miata now, what the borrowing rate is, etc.
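The present-value arithmetic in this section can be verified in a few lines of code. A minimal sketch, using only the figures stated above (a 5% return, two $700 lease payments, a $4,000 purchase price, and a $3,000 terminal value after two years):

```python
r = 0.05  # annual return on Jen's investment account

# Lease: $700 now plus $700 in one year, discounted to today
pv_lease = 700 + 700 / (1 + r)                 # ≈ $1,366.67

# Purchase: $4,000 now, less the discounted $3,000 terminal value in two years
npv_purchase = 4000 - 3000 / (1 + r) ** 2      # ≈ $1,278.91

print(round(pv_lease, 2), round(npv_purchase, 2), round(pv_lease - npv_purchase, 2))
# 1366.67 1278.91 87.76 -> on cost alone, buying is cheaper by $87.76
```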
Summary When we make a decision, we need to choose a time horizon over which we quantify the outcomes of our decision. To compare monetary values that take place at different times, we must account for the time value of money by comparing cash flows at their values at a common point in time. Generally, we compare the present values — or net present values — of different outcomes by discounting future cash flows at the appropriate discount rate. Terminal values of assets held at the end of the time horizon must be determined and discounted to their present value. Solving the Chez Tethys Problem "Now let's find Leo, and get to work on his decision problem," Alice says, snapping her laptop shut. I calculated that if the floating restaurant idea really takes off, I'd make over $2 million in profits over the next five years. And that's after I subtract my initial investment, including the purchase of the ship. How certain are you that the Tethys will be the success you envision?
Gosh, I don't know. It's almost like a flip of a coin to me — chances are about fifty/fifty that the Tethys will be a big hit. That sounds a little too optimistic. Let's take a close look at your business plan. The three of you spend the rest of the day in energetic discussion and research. Finally, you agree on two representative scenarios that might occur if Leo launches the Chez Tethys: "Phenomenon," or "Fad." If the Tethys becomes a cultural "Phenomenon" with staying power, Leo can expect $2 million in profits over five years, in terms of net present value. Leo grudgingly agrees that the likelihood of this scenario is only 35%. Alternatively, dining Chez Tethys might become a passing "Fad" for a couple of years, then be replaced by "The Next Big Thing." In that case, Leo would face substantial losses, estimated to have a present value of about $800,000. Leo grants that his brainchild has a 65% chance of just being a fad.
What is the EMV of launching the Chez Tethys? Enter the EMV in $millions as a decimal number with three digits to the right of the decimal point (e.g., enter ''$5,500,000'' as ''5.500''). Round if necessary. The expected monetary value of launching the Tethys is the outcome value of the "Phenomenon" scenario weighted by its probability added to the outcome value of the "Fad" scenario weighted by its probability: (0.35 * $2,000,000) + (0.65 * -$800,000) = $180,000. Hmm. I can see how this analysis is helpful. I'm glad that my expected profits are positive, but somehow I'm not satisfied. I don't feel very comfortable with some of our estimates.
Exercise 1: The Shipping Bea Flies Again Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her current fleet. Robin identifies three states of the transportation sector that might occur: "Boom," "Moderate Growth," and "Slowdown." Her firm's profits depend on whether she leases an additional truck and on the state of the sector. She associates an estimate of total firm profits with each of the six scenarios.
What is the EMV of the decision to lease the truck? Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as ''5.5''). Round if necessary. The EMV of the decision to lease the truck is $31,000. Weight each of the scenario's outcomes by the probabilities that they will occur, then add the weighted values.
What is the EMV of the decision to not lease the truck? Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as ''5.5''). Round if necessary. The EMV of the option to not lease the truck is $18,200. Weight each of the scenario's outcomes by the probabilities that they will occur, then add the weighted values.
Exercise 2: The SHAMH of the Century Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain operations. He can either apply for a government grant or run a local fundraiser, but the demands on his time are too high for him to be able to do both. Jari believes he has a 90% chance that he will win a grant of $25,000 if he submits the grant application. Grants in this category are for a fixed amount, so if he loses the grant, he'll have no money to run the SHAMH. Based on his past fundraising experience, he estimates that if he runs a local fundraiser, he has a 30% chance of raising $30,000 and 70% chance of raising $20,000.
What is the EMV of launching a fundraiser? Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as ''5.5''). Round if necessary. To find the EMV of launching the fundraiser, weight the $30,000 outcome value by its probability of 30% and add that to the $20,000 outcome value weighted by its probability of 70%.
What is the EMV of applying for the grant? Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as ''5.5''). Round if necessary. To find the EMV of applying for the grant, weight the grant award amount of $25,000 by the probability of winning it (90%) and add that to the value of not winning the grant ($0) weighted by the probability of losing (10%). The EMV of the fundraiser option, $23,000, is higher than the $22,500 EMV of applying for the grant. Based on this analysis, Jari should organize a fundraiser.
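A quick sketch of the two EMV calculations, using only the probabilities and dollar amounts given in the exercise:

```python
# EMV of each fundraising option, as a probability-weighted average of outcomes
emv_fundraiser = 0.30 * 30_000 + 0.70 * 20_000   # $23,000
emv_grant      = 0.90 * 25_000 + 0.10 * 0        # $22,500

print(emv_fundraiser, emv_grant)
# The fundraiser's EMV ($23,000) edges out the grant application's ($22,500).
```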
Exercise 3: The Gaiacorps Upgrade Marsha Ratulangi is the chief operating officer at Gaiacorps, a non-profit organization dedicated to preserving natural habitats around the world. Gaiacorps' IT hardware is aging, and Marsha must decide whether to extend the current lease on Gaiacorps' IT desktop computers or purchase new ones.
Which of the following costs is not relevant to Marsha's decision?
a. The purchase price of the new IT hardware. This is not the best answer. Think about when each cost is incurred, and on which branches of the decision it is incurred.
b. The cost of the lease extension. This is not the best answer. Think about when each cost is incurred, and on which branches of the decision it is incurred.
c. The future maintenance costs of the current hardware. This is not the best answer. Think about when each cost is incurred, and on which branches of the decision it is incurred.
d. None of the above. This is the best answer. The cost of the lease extension and the future maintenance costs of the current hardware are incurred only in the scenario in which Gaiacorps extends its current lease. The purchase price of the new hardware is incurred only in the scenario in which Gaiacorps purchases new hardware. Which of the three costs are incurred depends on which option Marsha chooses, so all three are highly relevant to her decision. Which of the following costs is relevant to Marsha's decision?
a. The maintenance costs already invested in the current hardware. This is not the best answer. Think about when each cost is incurred, and on which branches of the decision it is incurred.
b. The cost of last quarter's memory card upgrade on selected desktops. This is not the best answer. Think about when each cost is incurred, and on which branches of the decision it is incurred.
c. The cost of disposing of the new hardware when it becomes obsolete. This is the best answer.
d. The cost of Marsha's salary next year. This is not the best answer. Think about when each cost is incurred, and on which branches of the decision it is incurred. The cost of the memory card upgrade and the maintenance costs previously invested in the current hardware are sunk costs that have already been incurred. Marsha's salary is not a sunk cost, but is incurred in all possible scenarios. Neither of the options Marsha is considering will affect the amount of these three costs, thus all three costs are irrelevant to the decision. In contrast, the cost of disposing of the new hardware is relevant, since it is incurred only in the case that Marsha buys new desktop computers.
Exercise 4: Mopping up the Empire Under pressure from his company's ad hoc advisory board to lower operating expenses, the CEO of Empire Learning, Bill Hartborne, is considering canceling the biweekly cleaning service for the company offices. Each cleaning engagement costs $75. Instead of hiring a cleaning service, Bill could simply clean the office himself whenever a client visits the office. The probability of exactly one client visiting the office in a given month is 25%, and the probability that two clients will visit is 10%. The probability that no clients will visit is 65%. Bill draws up the structure of his decision in the tree depicted below. Given a one-month time horizon, what is your best estimate of the EMV of the cost of not hiring a cleaning service?
a. $150 This is not the best answer. Think about what costs Bill might have omitted from his analysis.
b. $33.75 This is not the best answer. Think about what costs Bill might have omitted from his analysis.
c. $0 This is not the best answer. Think about what costs Bill might have omitted from his analysis.
d. The answer cannot be determined from the information given. This is the best answer.
Instead of cleaning the office, Bill could be creating value for his company by making sales,
networking, boosting employee morale, or simply increasing his productivity by napping on the office sofa. We need to calculate the opportunity cost of the time Bill would spend cleaning the office, so we need to know how long he spends cleaning, and how highly his time is valued. Assuming that it always takes Bill 2 full hours to clean Empire Learning's offices, and that Bill's time is valued at $200/hour, what is the EMV of Bill cleaning the office? Enter the EMV in dollars as an integer (e.g., enter ''$5.00'' as ''5''). Round if necessary. The EMV of Bill cleaning the office is $180. Based on this analysis, Bill should continue employing the cleaning service.
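As a rough check on the figures above, here is a short Python sketch; the assumption that "biweekly" means two cleaning engagements per month is ours:

```python
# Opportunity-cost EMV of Bill cleaning the office himself (one-month horizon).
cost_per_visit = 2 * 200                      # 2 hours of Bill's time at $200/hour
visit_probs = {0: 0.65, 1: 0.25, 2: 0.10}     # clients visiting the office in a month

emv_bill_cleans = sum(p * n * cost_per_visit for n, p in visit_probs.items())
print(f"EMV of Bill cleaning: ${emv_bill_cleans:,.2f}")   # $180.00

service_cost = 2 * 75    # assumed two engagements per month at $75 each
print(f"Cleaning service cost per month: ${service_cost}")   # $150
```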
Exercise 5: The Eris Shoe Company Val Purcell, CEO of a supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a bid for a contract to re-engineer the supply chain of a potential new client: the Eris Shoe Company. Creating the bid will cost $16,000 in Val's time and legal fees. If his bid beats out the competition, Val expects the contract to return profits of $100,000 — from which the cost of preparing the bid has not yet been subtracted. Val believes he has a 20% chance of winning the bid. Eris will pay the consulting fee and Val will accrue his estimated $100,000 profits upon completion of the project, which is scheduled for one year from now. Under these terms, what is the expected monetary value of creating and submitting the bid?
a. -$16,000 This is not the best answer. Think about when all of the costs are incurred and when the cash flows come in.
b. $0 This is not the best answer. Think about when all of the costs are incurred and when the cash flows come in.
c. $60,000 This is not the best answer. Think about when all of the costs are incurred and when the cash flows come in.
d. The answer cannot be determined from the information given. This is the best answer. The answer cannot be determined without knowing Purcell's discount rate. If Purcell wins the bid, it won't receive its consulting fee until completion of the project one year from now. Cash flows related to the contract should be discounted at Purcell's discount rate. For simplicity assume that the cost/profit figures represent cash outflows and inflows, and that Purcell's discount rate is 15%.
What is the EMV for the option of submitting this bid? Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter ''$5,500'' as ''5.50''). Round if necessary. To find the EMV of creating the bid, first calculate the present value of the cash inflow from Purcell's profits on the contract, to be received a year from now upon completion of the project. To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs. Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios to determine the EMV of placing the bid.
Before Val starts work on the bid, Eris decides to move the project's completion date to two years after the contract is signed. Again assume that the cost/profit figures represent cash outflows and inflows, and that Purcell's discount rate is 15%. What is the EMV for submitting the bid now? Enter the EMV in dollars as an integer (e.g., enter ''$5.00'' as ''5''). Round if necessary. To find the new EMV of putting in the bid, first calculate the present value of the cash inflow
from Purcell's profits on the contract, to be received two years from now upon completion of the project. To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs. Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios and determine the EMV of placing the bid. The EMV of putting in the bid is -$877. If Val's payments are delayed by another year, he cannot expect the project to be profitable, so he should not submit a bid. Delaying the profits changes Val's optimal decision!
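A short sketch of both discounted EMV calculations (one-year and two-year completion), using only the figures stated above:

```python
# Discounted EMV of submitting the Eris bid at Purcell's 15% discount rate.
bid_cost, profit, p_win, rate = 16_000, 100_000, 0.20, 0.15

def bid_emv(years):
    pv_profit = profit / (1 + rate) ** years   # present value of the consulting profits
    npv_win = pv_profit - bid_cost             # net of the bid preparation cost
    return p_win * npv_win + (1 - p_win) * (-bid_cost)

print(f"Completion in 1 year:  {bid_emv(1):,.2f}")   # about 1,391.30
print(f"Completion in 2 years: {bid_emv(2):,.2f}")   # about -877.13
```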
Exercise 6: The Crumbling Empire Cap Winestone of Universal Learning is preparing a bid for a new building to accommodate its expanding operations. As luck would have it, the headquarters of former competitor Empire Learning is for sale through a sealed bid auction. The building is well suited to Universal's needs, containing computing equipment, network infrastructure and other important e-learning accessories. Cap estimates the building is worth about $900,000 to him, and is trying to decide what bid to place. To simplify his decision, he narrows down his bid choices to four possible bids. He muses: "With lower bids I gain more value. If I bid $600,000, I'll get a building worth $900,000 to me, so I gain $300,000 in value. In terms of the value I gain, the lower the bid, the better." "On the other hand," Cap continues, "with a low bid I'm not likely to win. From the point of view of winning the bid, the higher the bid, the better. How do I balance these two opposing factors?" Cap lays out a decision tree with the possible bid amounts, the likelihood of winning for each bid, and the outcome values. What amount should Cap bid?
a. $600,000 This is not the best answer. Think about the value of paying $600,000 for a building worth $900,000 to you, and the likelihood of winning the building with a $600,000 bid.
b. $700,000 This is not the best answer. Think about the value of paying $700,000 for a building worth $900,000 to you, and the likelihood of winning the building with a $700,000 bid.
c. $800,000 This is the best answer.
d. $900,000 This is not the best answer. Think about the value of paying $900,000 for a building worth $900,000 to you, and the likelihood of winning the building with a $900,000 bid. Folding back the tree, we find that the $800,000 bid gives the highest expected monetary value: it best balances the value gained against the probability of winning.
Sensitivity Analysis Still in Leo's office after you initially calculated the expected monetary value of launching the Chez Tethys, you listen to Leo's growing concerns about the estimates that inform his decision. Leo the Skeptic: The Uncertain Estimates Problem Now that you've mapped out the possible downside and its probability for me, I'm a little discouraged. Sure, the analysis indicates that ventures like the Tethys will be profitable on average, but the fact that I have almost a two-thirds chance of an $800,000 loss scares me. What if the potential loss is even greater? Or, what if it's even more likely that the Chez Tethys will just be a passing "Fad?" Both points are well taken, Leo. I'll tell you what: let's break for lunch, and then delve a little deeper into our analysis.
A Decision's Sensitivity to Outcome Estimates Over lunch, Alice comments on Leo's reaction to your analysis. "Leo's questions bring us to a crucial component of decision analysis: sensitivity analysis." Seth Chaplin has all but decided how to produce the film Cloven. Based on his initial analysis, he is inclined to produce Cloven in partnership with K2 Classics, thereby retaining part ownership of the film. The EMV of the K2 option is $1.4 million. The EMV of the alternative option — in which Seth's company produces Cloven for Pony Pictures and relinquishes ownership in the film — is $1.0 million. But Seth's calculations were based on estimates. The probabilities and the outcome values he used in his analysis were educated guesses. No matter how detailed and rigorous the methodology, a decision analysis is only as
good as the data on which it is based. What if Seth isn't completely certain about these data?
Seth is particularly unsure about the $6 million value he used to represent the profits associated with a "Blockbuster" film. What would the EMV of the K2 option be if $4 million were a more representative value for "Blockbuster" profits? Enter the EMV in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500,000'' as ''5.5''). Round if necessary. If $4 million is the expected value of S&C's profits for the "Blockbuster" scenario, then the EMV of the K2 option drops to $800,000, less than the $1 million EMV of the Pony option. Note that if the "Blockbuster" profit figure drops to $4 million, Seth's optimal decision changes: he should now choose the Pony option. Seth has learned that his optimal decision is sensitive to the figure he uses to represent S&C's profits if Cloven attains "Blockbuster" status. If Seth's optimal decision switches when the "Blockbuster" profit figure drops to $4 million, he might reasonably wonder what his decision would be for other values. What about $5 million? $4.5 million? How low would the "Blockbuster" profit figure have to be to make the Pony option preferable to the K2 option in terms of the EMV?
At what "Blockbuster" profit figure does Seth's optimal decision switch? Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter ''$5,500,000'' as ''5.50''). Round if necessary. To answer that question, we first write the EMV of the K2 option with a variable, B, to represent "Blockbuster" profits in millions of dollars. Now, we compare the EMV of the K2 option to the EMV of the Pony option. The K2 EMV is greater than the Pony EMV when the "Blockbuster" profit figure is greater than $4.67 million.
We call $4.67 million the breakeven value for "Blockbuster" profits: above it, the EMV criterion recommends the K2 option; below it, the EMV criterion recommends the Pony option. Seth may not be sure if the "Blockbuster" profits are best represented by $6 million or $5 million or $4.8 million, but as long as he is confident that the figure is greater than $4.67 million, he need not lose any sleep over finding a more accurate value. Knowing that his estimates for the "Blockbuster" profit figure are firmly on one side of the breakeven value allows him to stop worrying about the precise value of that number in the decision analysis, since knowing that number with greater precision will not change his decision. On the other hand, if Seth thinks the "Blockbuster" profit figure could be below $4.67 million, he may wish to invest additional resources to find a more accurate figure to represent "Blockbuster" profits. If Seth thinks the true number is around the breakeven value of $4.67 million, but isn't certain if it's slightly above or slightly below that figure, he can also stop worrying about the precise value of the number. As long as the figure is around $4.67 million, Seth should be indifferent between the K2 and Pony deals, since the EMVs of the two options are about the same. In fact, a good way to check that we have calculated a breakeven value correctly is to substitute the value into the EMV calculation: if the options have the same EMV, the breakeven value is correct. Once we calculate a breakeven value, we know whether or not expending additional time and other resources to find a more accurate estimate is worthwhile. The breakeven value establishes a comfort zone: as long as we are confident that the value we are estimating is within the zone, we can feel comfortable choosing the option recommended by our initial analysis, based on our original estimate. If we think the value we are estimating could be close to the breakeven value, we need to be more cautious. If the true value we are trying to estimate could lie outside of the comfort zone, we might want to try to make our estimate of that value more accurate before we reach a final decision. How confident we are that the true value we are estimating lies inside the comfort zone given by the breakeven analysis is a matter of judgment and experience. Sometimes, we might collect sample data to estimate an outcome value. In this case, we should look closely at the variation in the data to see how widely and in what way the data can vary.
Calculating a breakeven value for data used in a decision analysis is called sensitivity analysis: for each estimated value in the analysis, we check to see by how much it would have to change to affect our decision, assuming our estimates for all the other data are correct. Sensitivity analysis is an important and powerful tool for management decision-making. Managers who base decisions on an initial analysis without performing sensitivity analysis on critical data risk lulling themselves into a false sense of security in their decisions.
Summary After completing an initial decision analysis, always conduct a sensitivity analysis for each outcome value estimate you are uncomfortable with. First, calculate the outcome value's
breakeven value: the value for which the EMV of the option initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort zone: If we believe that the actual outcome value might be outside that zone — thereby changing the optimal decision — we should reconsider our analysis and refine our estimate of the outcome value in question. Evaluating Non-Monetary Consequences During his negotiations with K2 and Pony, Seth realized that he did not particularly look forward to working with the K2 team. Based on past experience, he knows that interpersonal frictions can be highly frustrating and can make a collaboration unpleasant. This frustration is clearly a cost — albeit a non-monetary cost — associated with the K2 option. Sensitivity analysis can give us a "reality check" on how highly we value non-monetary consequences such as frustration, reputation costs and benefits, and sentimental values. Although it may be difficult to assign a value to such consequences, we can often answer questions about the most we'd be willing to pay to avoid (or to obtain) them by calculating a threshold value. Clearly Seth wouldn't want to spoil any magical Hollywood days just to make an additional $5 in profits. But the more the K2 option pays relative to the alternatives, the more willing Seth might be to suffer working with the K2 team. How can Seth determine whether he should accept the K2 deal and bear the resulting interpersonal trials and tribulations? Sensitivity analysis can help us analyze non-monetary consequences such as frustration. Let's use "F" to represent the frustration cost (in $millions) Seth will incur if he has to work with the K2 team. Since frustration will occur in any scenario involving the K2 option, $F million must be subtracted from all outcomes associated with the K2 option.
How large must the cost of frustration be for Seth's optimal decision to switch to the Pony option? Enter F, the cost of frustration, in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500,000'' as ''5.5''). Round if necessary. Subtracting $F million from the K2 outcome values changes the EMV of the K2 option to $1.4 million - $F million. For Seth's decision to switch from the K2 deal to the Pony deal, the EMV of the K2 option, $1.4 million - $F million, must drop below $1.0 million. In other words, F must satisfy the inequality below. In order to lower the EMV of the K2 option below the EMV of the Pony option, the cost of frustration would have to be at least $400,000. Seth needs to ask himself if he would pay $400,000 to avoid the frustration of working with K2. Sensitivity analysis provides a clear, monetary upper bound against which he can measure the strength of his feelings. In the end, Seth decides that he can learn to love the K2 team for $400,000.
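The inequality referred to above is not reproduced here, but it can be reconstructed from the EMVs already given: EMV(K2) - F < EMV(Pony), that is, $1.4 million - F < $1.0 million, which solves to F > $0.4 million, or $400,000.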
Summary To incorporate non-monetary consequences into a decision, first find the option with the best EMV. Then, add the non-monetary consequence to the outcome values of all scenarios affected by that non-monetary consequence, and calculate the breakeven value for which the option recommended by the initial decision analysis ceases to have the best EMV. The breakeven value defines a comfort zone: If we believe that the actual value of the non-monetary consequences might be outside that zone — thereby changing the optimal decision — we should try to gain a firm estimate of the non-monetary consequence's value. A Decision's Sensitivity to Probability Estimates Seth is somewhat unsure about his estimates for the probabilities of how successful Cloven will be. He is quite sure that the probability of the film flopping is around 20%, but he's less sure about the probabilities of "Lackluster" and "Blockbuster" performance levels. He wants to know how sensitive his decision to choose the K2 deal is to the values of these probabilities. Let's call the probability of a "Blockbuster" "p." Seth is confident that the probability of a "Flop" is 20%. What is the probability of a "Lackluster" outcome?
a. 1.0 — p This is not the correct answer. Think about what probabilities must add up to 1.0.
b. 0.8 — p This is the correct answer.
c. p — 0.2 This is not the correct answer. Think about what probabilities must add up to 1.0.
d. The answer cannot be determined from the information provided. This is not the correct answer. Think about what probabilities must add up to 1.0.
Since the three outcomes are mutually exclusive and collectively exhaustive, their probabilities must add to 100%. Seth is confident that the probability of a "Flop" is 20%, so he knows that the probabilities of the remaining two outcomes ("Blockbuster" and "Lackluster") must add to 80%. Thus, the probability of a "Lackluster" performance is 0.8 - p. Using p to denote the probability of a "Blockbuster" outcome and 0.8 - p to denote the probability of a "Lackluster" outcome, what is the EMV of the K2 option? a. p * $6 million - $0.4 million This is the correct answer.
b. p * $6 million + $0.4 million This is not the correct answer. Remember that the ''Flop'' scenario involves a loss.
c. $0.08 million - p * $6 million This is not the correct answer. Remember that p is the probability of a ''Blockbuster.''
d. $0.08 million + p * $6 million This is not the correct answer. Remember that p is the probability of a ''Blockbuster.''
The EMV of the K2 option is p * $6 million - $0.4 million. For what values of p, the probability of Cloven becoming a "Blockbuster" film, is the Pony option preferred to the K2 option on the basis of EMV? a. p > 0.233 This is not the best answer. The higher the probability of a ''Blockbuster'', the more attractive the K2 option becomes relative to the Pony option.
b. p < 0.233 This is the best answer.
c. p > 0.300 This is not the best answer. The higher the probability of a ''Blockbuster'', the more attractive the K2 option becomes relative to the Pony option.
d. p < 0.300 This is not the best answer. At 0.29, for instance, the K2 option is still preferable to the Pony option. The breakeven value for the probability of a "Blockbuster" hit is 23.3%: when the probability is above 23.3%, the EMV of the K2 option is higher; when the probability is below 23.3%, the EMV of the Pony option is higher. If Seth is confident that the probability of a "Blockbuster" is at least 23.3%, he can feel comfortable choosing the K2 option. He need not expend additional effort trying to further refine his estimates for the probabilities of different levels of success. Decision-making is an iterative, multi-step process. When analyzing a decision, we should first construct and analyze a decision tree based on our best estimates of the outcomes and probabilities involved. After reaching a tentative decision, it is critical to scrutinize the data used in a decision analysis and conduct sensitivity analyses for each estimate that we feel unsure about. As long as we are comfortable that the true value we are estimating is within the range specified by the breakeven calculation, we can confidently proceed with our decision. If not, we should focus our efforts on refining our estimates for those values to which our decisions are most sensitive. Finally, we should note that managers sometimes need to test the sensitivity of their decisions to two or more estimates simultaneously. Sensitivity analysis techniques can be extended to address these situations; these techniques are beyond the scope of this course.
Summary After completing an initial decision analysis, always conduct a sensitivity analysis for each probability value you are uncomfortable with. First calculate the probability's breakeven value: the probability for which the EMV of the option initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort zone: If we believe that the actual probability might be outside that zone — thereby changing the optimal decision — we should reconsider our analysis and refine our estimate of the probability in question. Solving the Uncertain Estimates Problem Sensitivity analysis helps managers cope with the uncertainty that surrounds the estimates they base decisions on. With your new knowledge, you're ready to turn to Leo's problem. Your initial analysis recommends that Leo launch his floating restaurant, but Leo has expressed some doubt about the estimate of the losses he'd incur if the Tethys turns out to be a passing "Fad." How high do the losses have to be in the "Fad" scenario to make launching the Tethys ill-advised in terms of EMV?
a. Greater than $0.70 million. This is not the best answer. When you compute the EMV, make sure you are weighting the ''Fad'' outcome value by the likelihood that it will occur.
b. Less than $0.70 million. This is not the best answer. When you compute the EMV, make sure you are weighting the ''Fad'' outcome value by the likelihood that it will occur.
c. Greater than $1.08 million. This is the best answer.
d. Less than $1.08 million. This is not the best answer. Think about how the outcome value of a ''Fad'' would have to change to make launching the Tethys unprofitable in terms of the EMV. Launching the Tethys would be less attractive than not launching it if the EMV of launching is less than $0, the EMV of not launching. Solving the inequality below, we find that the losses in the event that the Tethys is just a "Fad" would have to exceed $1.08 million for Leo to abandon the project based on the EMV criterion.
Assume that, in fact, Leo's estimate — that he will incur an $800,000 loss if Chez Tethys turns out to be a "Fad" — is correct. For what probability of a "Fad" does launching the Tethys cease to be a worthwhile venture, in terms of the EMV? a. Greater than 71%. This is the best answer.
b. Less than 71%. This is not the best answer. Think about how the probability of a ''Fad'' would have to change to make the EMV of launching the Tethys lower than the EMV of not launching.
c. Greater than 29%. This is not the best answer. Make sure you are finding the probability of a ''Fad'' rather than the probability of a ''Phenomenon.''
d. Less than 29%. This is not the best answer. Make sure you are finding the probability of a ''Fad'' rather than the probability of a ''Phenomenon.'' Launching the Tethys is less attractive than not launching it if the EMV of launching is less than $0, the EMV of not launching. Solving the inequality below, we find that the probability of a "Fad" would have to be higher than 71% for Leo to prefer to abandon the project based on the EMV criterion.
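For readers who want to verify both breakeven figures, here is a sketch; the $2 million "Phenomenon" profit and the roughly 65% "Fad" probability are not stated explicitly in this section, so treat them as assumptions inferred from the earlier analysis:

```python
# Tethys breakeven checks (profit and probability figures are inferred, not given here).
phenomenon_profit = 2.0    # assumed profit if the Tethys is a "Phenomenon", in $millions
p_fad = 0.65               # assumed "almost two-thirds" chance of a "Fad"

# Breakeven "Fad" loss L: (1 - p_fad) * profit - p_fad * L = 0
breakeven_loss = (1 - p_fad) * phenomenon_profit / p_fad
print(f"Breakeven Fad loss: ${breakeven_loss:.2f} million")    # about 1.08

# Breakeven "Fad" probability p, with the loss fixed at $0.8 million:
fad_loss = 0.8
breakeven_p = phenomenon_profit / (phenomenon_profit + fad_loss)
print(f"Breakeven Fad probability: {breakeven_p:.0%}")         # about 71%
```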
Hmm, I'm fairly confident that our estimate of the possible loss is on target — certainly it won't be higher than a million dollars! But the more I think about it, the more I wish we had a better estimate for the probability that the Tethys will be the success I've always dreamed of. I have some ideas for ways to get a better handle on the likelihood of the Tethys' success. But I need to make some phone calls to see if I'm on the right track. Let's break for today and meet tomorrow morning. Exercise 1: Sensitive as a Truck Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her current fleet. Based on her assessment of future trends in the transportation sector, Robin identified three scenarios that might occur: "Boom," "Moderate Growth," and "Slowdown," corresponding to the performance of the economy and its impact on the transportation sector. She associated estimates of total firm profits with each scenario, and calculated the EMVs of "leasing" and "not leasing" a new truck: $31,000 and $18,200, respectively. Robin is afraid that she may have been a little overoptimistic in her estimate of $70,000 of total profits in the scenario in which she "leases a new truck" and the transportation sector "booms." She wants to know how sensitive her decision is to the outcome value of this scenario. What must the value of total profits of the scenario in which Robin "leases the truck" and the transportation sector "booms" be in order for the "don't lease truck" option to be the more attractive one?
a. Less than $27,333. This is the best answer.
b. Greater than $27,333. This is not the best answer. Think about how the EMV of the ''Lease truck'' option would have to change to affect Robin's decision.
c. Less than $14,000. This is not the best answer. Think about which outcome value we are scrutinizing and which ones we aren't.
d. Greater than $14,000. This is not the best answer. Think about which outcome value we are scrutinizing and which ones we aren't.
For sufficiently low profits generated by the new truck in the "Boom" scenario, the EMV of leasing the truck is lower than the EMV of not leasing the truck. To find the breakeven value X, we solve the inequality between the two EMVs shown below. The inequality is satisfied when X, the value of total profits in the "lease truck" and "Boom" scenario, is less than $27,333.
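As a sketch of the algebra behind this breakeven, note that the 30% probability of a "Boom" is not stated in the exercise text here; it is an assumption chosen to be consistent with the stated EMVs:

```python
# Breakeven "Boom" profit for the "lease truck" option (figures in dollars).
p_boom = 0.30                              # assumed, consistent with the stated EMVs
emv_lease, emv_no_lease = 31_000, 18_200
boom_profit = 70_000

other_branches = emv_lease - p_boom * boom_profit   # non-Boom branches (about $10,000)
breakeven_X = (emv_no_lease - other_branches) / p_boom
print(f"Breakeven Boom profit: ${breakeven_X:,.0f}")   # about $27,333
```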
Exercise 2: Sneakers of Discord Val Purcell, CEO of the supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a bid for a contract to re-engineer the procurement process for a potential new client: the Eris Shoe Company. Creating the bid will cost $16,000 in Val's time and legal fees. If he wins the bid, he will receive payment for the project upon completion, two years after signing the contract. He has estimated the present value of the profits if he wins the bid at $75,614. After factoring in the $16,000 cost of submitting the bid, the net present value of the profits if he wins the bid is $59,614.
Val believes he has a 20% chance of winning the bid, so the EMV for submitting the bid is -$877: under these circumstances, Val can't expect submitting a bid to be profitable. But Val hasn't been able to figure out how stiff the competition for the contract is, and he's fairly uncertain about the probability of winning. How high would the probability of winning the bid have to be to make Val change his mind and choose to put in a bid?
a. Less than 21.2% This is not the best answer. Will the probability of winning the bid have to be higher or lower for submitting the bid to become more attractive?
b. Greater than 21.2% This is the best answer.
c. Less than 78.8% This is not the best answer. Will the probability of winning the bid have to be higher or lower for submitting the bid to become more attractive?
d. Greater than 78.8% This is not the best answer. Will the probability of winning the bid have to be higher or lower for submitting the bid to become more attractive? To change Val's decision, the EMV of bidding will have to be higher than the EMV of not bidding: $0. To find out how high p, the probability of winning the bid would have to be, we solve the inequality below. Note that the probability of losing the bid (1 - p) must change as the probability of winning changes. The inequality is solved when the probability of winning the bid is greater than 21.2%.
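The inequality described above can be sketched as follows, using only figures stated in the exercise:

```python
# Breakeven probability of winning the Eris bid.
npv_win = 59_614      # net present value if the bid is won (after the $16,000 bid cost)
bid_cost = 16_000     # lost if the bid is not won

# EMV(bid) = p * npv_win - (1 - p) * bid_cost > 0  =>  p > bid_cost / (npv_win + bid_cost)
breakeven_p = bid_cost / (npv_win + bid_cost)
print(f"Breakeven probability of winning: {breakeven_p:.1%}")   # about 21.2%
```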
Exercise 3: The SHAMH Continues Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain operations. He can either apply for a government grant or run a local fundraiser, but the demands on his time are too high for him to be able to do both. Jari believes he has a 90% chance of winning a grant of $25,000 if he submits the grant application. He estimates that if he runs a local fundraiser, he has a 30% chance of raising $30,000 and a 70% chance of raising $20,000. The EMV of the fundraiser option is $23,000, higher than $22,500, the EMV of applying for the grant. Based on the initial analysis, Jari should organize a fundraiser. Jari has expressed uncertainty about his estimate for the probabilities of the two levels of fundraising success. How low would the probability of raising $30,000 have to be to change his decision?
For Jari to change his initial decision, the EMV of the fundraising option would have to be less than $22,500. That would be the case if the probability of the high fundraising success — valued at $30,000 — were less than 25%. Jari needs $20,000 to operate the SHAMH through the next year. Raising $25,000 or more would permit Jari to operate the SHAMH and expand the capacity of the operation, thereby allowing even more abandoned miniature horses to be saved. This year will be Jari's last running the SHAMH, and he really wants to expand its capacity before he leaves. If he's unable to raise at least $25,000 to cover the SHAMH expansion, he'll be very disappointed. How highly would Jari have to value his disappointment in order for him to prefer the government grant option that is more likely to secure funds for expansion? Assume that his original probability assessments for the success of the fundraiser are correct.
a. Less than $833.33. This is not the best answer. Think about what range of values of the disappointment cost will lead to an EMV for the grant-writing scenario that is higher than the EMV for the fundraiser scenario.
b. Greater than $833.33. This is the best answer.
c. Less than $714.29. This is not the best answer. Think about which scenarios will not bring in enough funds to expand the SHAMH.
d. Greater than $714.29. This is not the best answer. Think about which scenarios will not bring in enough funds to expand the SHAMH.
If Jari is only moderately successful in his fundraising activities, he won't be able to expand the SHAMH on his watch. The disappointment cost D should be subtracted from the outcome value for the scenario in which he takes in only $20,000 in contributions. After incorporating the disappointment cost, the EMV of the fundraiser option becomes $23,000 - 0.7D. However, if he applies for the grant and doesn't win it, he won't be able to expand, either. The disappointment cost D must also be subtracted from the outcome value in the scenario in which he writes the grant but doesn't win it. After incorporating the disappointment cost, the EMV of the grant-writing option becomes $22,500 - 0.1D. The breakeven value is the value of disappointment for which the EMVs of the two options are equal. Thus, to find the breakeven value of D, we must set up an inequality between the two EMVs and solve for D. In this case, the disappointment cost would have to be greater than $833.33 in order for Jari to prefer the grant-writing option.
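A minimal sketch of the disappointment-cost breakeven, using the EMV expressions derived above:

```python
# Breakeven disappointment cost D (in dollars) for the SHAMH decision.
def emv_fundraiser(D):
    return 0.3 * 30_000 + 0.7 * (20_000 - D)   # = 23,000 - 0.7 D

def emv_grant(D):
    return 0.9 * 25_000 + 0.1 * (0 - D)        # = 22,500 - 0.1 D

# Grant preferred when 22,500 - 0.1 D > 23,000 - 0.7 D, i.e., 0.6 D > 500.
breakeven_D = 500 / 0.6
print(f"Breakeven disappointment cost: ${breakeven_D:,.2f}")              # about $833.33
print(abs(emv_fundraiser(breakeven_D) - emv_grant(breakeven_D)) < 1e-6)   # True at the breakeven
```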
Decision Analysis II Conditional Probabilities After a Kahana breakfast of crab benedict, you meet once more with Leo.
Dining Chez Tethys: The Market Research Problem So, I've been thinking: my decision is heavily dependent on the likelihood that the Chez Tethys will be a success. Couldn't we do some market research to find out how interested our target market would be? Then I could make a better decision, one that would be less risky! Sure, Leo. But market research costs money. How much would you be willing to pay for information that would help you better predict the success of the Chez Tethys? Hmmm. Good question...frankly, I don't have a clue how to even start thinking about that. Can you two help me? "Whatever market research Leo has in mind," remarks Alice, "it won't reveal with certainty how consumers will take to the Tethys. In other words, we're uncertain about the Tethys' chance of success, and we'll be uncertain whether the market research we collect is accurate."
Joint and Marginal Probabilities Pondering two layers of uncertainty, you begin to feel vertigo... When analyzing a decision, we typically use estimates for the probabilities and financial implications of various outcomes that could occur. Often, we can imagine additional information — scientific tests, market research data, or professional expertise — that would help make our estimates more accurate. How much should we be willing to pay for this type of information? And how do we incorporate new information into our decision analysis?
Before we answer these questions, we'll need to expand our understanding of probability and introduce the important concepts of conditional probability and statistical independence. Let's look at an example first. British automaker Chariot's most sought-after model is the Ben Hur. Consumers love the Bennie, as it's affectionately called. It was offered as a limited edition: to date, only 1,000 Bennies are on the road in Britain. We'll take a closer look at two properties of the Bennie: its "Color" and its "Stereo." The Bennie comes in two color options: Burgundy and Champagne. Also, the Bennie is offered with a high-end car stereo system by Sweetone. Let's look at a table of the 1,000 Bennies currently on the road and see how the two properties — "Color" and "Stereo" — are distributed. Of the 1,000 Bennies, 150 are Burgundy and have a Sweetone stereo. 600 are Burgundy and do not have the Sweetone stereo. Furthermore, there are 50 Champagne Bennies with the stereo and 200 without. Let's add another column to our table and fill in the total numbers of Burgundy and Champagne Bennies. To calculate these numbers, we simply take the sums of the rows. For example, the number of Burgundy Bennies is simply the number of Burgundy Bennies with the Sweetone stereo added to the number of Burgundy Bennies without. Next, in the final row, we'll fill in the total numbers of Bennies with and without the Sweetone stereo. Here, we simply take the sums of the two columns. In the bottom right cell, we enter the total number of cars: 1,000. This number — 1,000 — should be equal to the sum of the numbers in the final row: the total number of Bennies with the Sweetone stereo added to the total number without one. Also, it should be equal to the sum of the numbers in the final column: the number of Burgundy Bennies plus the number of Champagne Bennies. Finally, since 1,000 is the total number of all
Bennies, it should be the sum of all the original numbers we entered into the table. A Venn diagram is a useful graphical way to represent the contents of the table. The Burgundy rectangle on top represents the set of all Burgundy Bennies. The Champagne rectangle below represents the set of all Champagne Bennies. The rectangles do not intersect because a Bennie cannot be both Champagne and Burgundy. We now add a patterned rectangle on the left to represent the set of Bennies with the Sweetone stereo. The area without the pattern represents the set of Bennies that are not equipped with the Sweetone. These areas intersect the rectangles that represent the distribution of "color": some Burgundy Bennies are Sweetone-equipped; some are not. Some Champagne Bennies are Sweetone-equipped; some are not. The sizes of the areas in the Venn diagram are directly proportional to the incidence of the different Bennie properties: Burgundy/Champagne and Sweetone/no Sweetone. We use Venn diagrams to effectively communicate information about sets of things — for example, Burgundy cars and Sweetone-equipped cars — and their interactions. The table is a useful tool for calculating proportions of potential interest to managers. For instance, we can find the proportion of Burgundy Bennies with the Sweetone stereo simply by locating the cell that contains their number and dividing it by the total number of cars: 15%. Or, to find the proportion of Burgundy Bennies in general, we find the cell that contains the total number of Burgundy Bennies and divide it by the total number of cars: 75%. In this way we can create an entire table of proportions. These proportions can be interpreted as probabilities. For example, since 15% of the Bennies on the road are Burgundy-colored and Sweetone-equipped, the probability that a randomly selected Bennie will be Burgundy-colored and have the Sweetone is 15%. The probability that a randomly selected Bennie will be Burgundy is 75%. Going forward, in talking about the table, we'll use the words "proportion" and "probability" interchangeably. The probabilities on the inside of the table are called joint probabilities: the probabilities of a single car having two particular Bennie features, for example, Burgundy-colored and Sweetone-equipped. We'll denote joint probabilities in the following way: P(Burgundy & Sweetone) is the probability that a Bennie is Burgundy-colored and Sweetone-equipped. The probabilities of each "Color" or "Stereo" option occurring in a given Bennie — Burgundy or Champagne, Sweetone-equipped or not Sweetone-equipped — are often called marginal probabilities. They are denoted simply as P(Burgundy), P(Champagne), P(Sweetone), and P(no Sweetone). Information about the distribution of properties in populations is often available in terms of probabilities, so the table of probabilities is a very natural way to represent the Bennie data. Summary For two events A and B with outcomes A1, A2, etc. and B1, B2, etc., respectively, the joint probability P(A1 & B1) is the probability that the uncertain event A has outcome A1 and the uncertain event B has outcome B1. The joint probabilities of all possible outcomes of two uncertain events can be summarized in a probability table. The marginal probability of the outcome A1 of the first uncertain event is the sum of the joint probabilities of outcomes A1 and all possible outcomes B1, B2, etc. of the second uncertain event. 
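To make the Bennie table concrete, here is a small Python sketch that rebuilds the joint and marginal probabilities from the counts given in the text:

```python
# Joint and marginal probabilities for the 1,000 Bennies described above.
counts = {
    ("Burgundy", "Sweetone"): 150,
    ("Burgundy", "No Sweetone"): 600,
    ("Champagne", "Sweetone"): 50,
    ("Champagne", "No Sweetone"): 200,
}
total = sum(counts.values())                                  # 1,000 cars

joint = {cell: n / total for cell, n in counts.items()}       # joint probabilities
p_color = {c: sum(v for (col, _), v in joint.items() if col == c)
           for c in ("Burgundy", "Champagne")}                # marginals for "Color"
p_stereo = {s: sum(v for (_, st), v in joint.items() if st == s)
            for s in ("Sweetone", "No Sweetone")}             # marginals for "Stereo"

print(round(joint[("Burgundy", "Sweetone")], 2),
      round(p_color["Burgundy"], 2),
      round(p_stereo["Sweetone"], 2))
# 0.15 0.75 0.2 -- P(Burgundy & Sweetone), P(Burgundy), P(Sweetone)
```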
Conditional Probabilities Automaker Chariot's limited edition Ben Hur model comes in two possible colors — Champagne or Burgundy — and with or without a high-end Sweetone stereo. The table below shows the distribution of these properties — "Color" and "Stereo" — in the population of 1,000 Bennies that Chariot has sold to date. Restricting our focus to the Burgundy Bennies, we ask the following question: among the set of Burgundy Bennies, what is the proportion of Burgundy Bennies with a Sweetone stereo?
Stated differently: what is the probability that a randomly selected Bennie has a Sweetone given that we know it is Burgundy? a. 15% This is not the best answer. Compare the total number of Burgundy Bennies that are Sweetone-equipped with the total number of Burgundy Bennies. b. 20% This is the best answer.
c. 25% This is not the best answer. Compare the total number of Burgundy Bennies that are Sweetone-equipped with the total number of Burgundy Bennies. d. 75% This is not the best answer. Compare the total number of Burgundy Bennies that are Sweetone-equipped with the total number of Burgundy Bennies.
To answer the question, we find the ratio of Sweetone-equipped Burgundy Bennies among the set of all Burgundy Bennies. That is, we divide the number of Bennies that are both Burgundy and have a Sweetone — 150 — by the total number of Bennies that are Burgundy — 750. The probability is 20%. This probability is called a conditional probability: the probability that a Bennie is Sweetone-equipped given that it is Burgundy. We'll denote this probability P(Sweetone | Burgundy), and read the vertical line as "given." We can calculate P(Sweetone | Champagne) as:
a. The number of Sweetone-equipped Bennies divided by the total number of Bennies. This is not the best answer. Think about what a conditional probability represents.
b. The number of Sweetone-equipped Bennies divided by the number of Champagne Bennies. This is not the best answer. Think about what a conditional probability represents.
c. The number of Champagne and Sweetone-equipped Bennies divided by the total number of Bennies. This is not the best answer. Think about what a conditional probability represents.
d. The number of Champagne and Sweetone-equipped Bennies divided by the number of Champagne Bennies. This is the best answer.
P(Sweetone | Champagne) is the number of Bennies that are Champagne and Sweetone-equipped divided by the total number of Champagne Bennies. What is P(No Sweetone | Burgundy)? Enter the percentage as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if necessary.
Using the table of actual numbers of cars we can calculate P(No Sweetone | Burgundy) and a fourth conditional probability, P(No Sweetone | Champagne). Earlier, we used the table of actual numbers of cars to calculate a table of probabilities. When given information as a table of probabilities, we can use the probabilities to calculate conditional probabilities as well: we simply form the ratios of the appropriate probabilities.
For example, the probability that a Bennie is Sweetone-equipped given that it is Burgundy is the probability of a Sweetone-equipped, Burgundy Bennie divided by the probability of a Burgundy Bennie. In fact, a conditional probability is formally defined in terms of the ratio of a joint probability to a marginal probability. If P(Sweetone | Burgundy) represents the probability that a Bennie is Sweetone-equipped given that it is Burgundy, what does P(Burgundy | Sweetone) represent?
a. The probability that a Bennie is Sweetone-equipped given that it is Burgundy. This is not the best answer. Think about the different ratios you can form in the table of probabilities.
b. The probability that a Bennie is Sweetone-equipped. This is not the best answer. Think about the different ratios you can form in the table of probabilities.
c. The probability that a Bennie is Burgundy given that it is Sweetone-equipped. This is the best answer.
d. The probability that a Bennie is Burgundy. This is not the best answer. Think about the different ratios you can form in the table of probabilities.
P(Burgundy | Sweetone) represents the probability that a Bennie is Burgundy given that it is Sweetone-equipped. Note that P(Sweetone | Burgundy) and P(Burgundy | Sweetone) are not the same. We can calculate the other conditional probabilities, this time conditioning the property "Color" on the property "Stereo." It is useful to write the conditional probabilities next to the table of probabilities to facilitate calculation: since P(Burgundy | Sweetone) and P(Champagne | Sweetone) require only the probabilities in the Sweetone column, we write them directly below the Sweetone column. Similarly, since P(Burgundy | No Sweetone) and P(Champagne | No Sweetone) require only the probabilities in the "No Sweetone" column, we write them directly below the "No Sweetone" column. Note that the new rows mimic the rows in the original table: Burgundy on top and Champagne below it. Similarly, we write the conditional probabilities for the property "Stereo" conditioned on the property "Color" to the right of the table. Again, we mimic the columns in the original table: a column for "Sweetone" and one for "No Sweetone." We now have a full table of joint, marginal, and conditional probabilities. It is important to double check which event we are conditioning on, and to do a "reality check" to make sure our calculations are realistic. For example, the probability that a randomly chosen Indian citizen is the prime minister is nearly zero, but the probability that the prime minister of India is an Indian citizen is 100%!
The table of joint probabilities informs our understanding of the likelihood of different properties or events in a crucial way: as we've seen in the Bennie example, once we have the joint probabilities, we can compute any conditional probability and any marginal probability. Thus, when presented with a decision problem in which the outcomes are influenced by multiple uncertain events, constructing the joint probabilities of these events is almost always a wise first step.
Summary The conditional probability P(A | B) is the probability of the outcome A of one uncertain event, given that the outcome B of a second uncertain event has already occurred. The table of joint probabilities provides all the information needed to compute all conditional probabilities. First calculate the marginal probabilities for each event, then compute the conditional probabilities as shown below:
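The formula referenced above is not reproduced here, but it is simply the ratio of a joint probability to a marginal probability: P(A1 | B1) = P(A1 & B1) / P(B1). In the Bennie example, P(Sweetone | Burgundy) = P(Burgundy & Sweetone) / P(Burgundy) = 15% / 75% = 20%.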
Statistical Independence Let's return once more to our Chariot Ben Hur example. The Bennie comes in two possible colors — Champagne or Burgundy — and with or without a high-end stereo by Sweetone. The probability table below shows the distribution of these properties - "Color" and "Stereo" - in the population of 1,000 Bennies that Chariot has sold to date. Note that the proportion of Sweetone-equipped Bennies in the Burgundy population is the
same as the proportion of Sweetone-equipped Bennies in the overall population: 20%. In the language of conditional probabilities, this is the same as saying that P(Sweetone | Burgundy) = P(Sweetone). In other words, if we randomly select a Bennie, then discovering that it is Burgundy gives us no additional information about whether or not it is equipped with a Sweetone stereo, beyond what we had before we knew its color: we still think there is a 20% chance it has a Sweetone. Similarly, the proportion of Burgundy Bennies in the Sweetone-equipped population is the same as the proportion of Burgundy Bennies in the overall population: P(Burgundy | Sweetone) = P(Burgundy) = 75%. A Bennie's color tells us nothing about its stereo system. Its stereo system tells us nothing about its color. When this is true, we say that the Bennie properties "Color" and "Stereo" are statistically independent, or, more simply, independent. In general, we can interpret the fact that two uncertain events are independent in the following way: knowing that one event has occurred gives us no additional information about whether or not the other event has. For example, the results of two spins of a wheel of fortune are independent. The first result does not reveal anything about the second.
We can confirm the independence of the Bennie's stereo and color by looking at our Venn diagram and noting that Sweetone-equipped Bennies occupy the same percentage in the population of Burgundy Bennies — 20% — as they do in the entire Bennie population. Thus, to find the joint probability that a Bennie has a Sweetone stereo and is Burgundy, we take 20% of the 75% of Bennies that are Burgundy, giving us a joint probability of 15%. This property is true for any two statistically independent properties: the joint probabilities are simply the products of the marginal probabilities. Although it may seem plausible to assume that certain properties are independent, managers who take statistical independence for granted do so at their peril. We need to verify the assumption that the properties are independent by looking at and evaluating data or by proving independence on the theoretical level.
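A quick numerical check of the multiplication rule just described, using the Bennie probabilities quoted above:

```python
# Independence check: the joint probability equals the product of the marginals.
p_burgundy, p_sweetone, p_both = 0.75, 0.20, 0.15
print(abs(p_both - p_burgundy * p_sweetone) < 1e-9)   # True: "Color" and "Stereo" are independent
```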
Statistical Dependence When are two outcomes statistically dependent? Let's look at another optional feature of the Bennie: a unique factory-installed theft discouragement system (TDS). Using Chariot's data, we can create the following table of the distribution of the properties "Stereo" and "Protection." Before going forward, practice your skills by calculating the joint, marginal, and conditional probabilities for these properties. Place them in the usual format in the complete probability table. The completed table is shown below. Which of the following is true?
a. "Protection" and "Stereo" are independent properties. This is not the best answer. Which probabilities must be compared to determine if "Protection" and "Stereo" are independent properties?
b. "Protection" and "Stereo" are not independent properties. This is the best answer.
c. Whether "Protection" and "Stereo" are independent or not cannot be determined from the information provided. This is not the best answer. Which probabilities must be compared to determine if "Protection" and "Stereo" are independent properties?
From the table, we can see that P(TDS | Sweetone) is not equal to P(TDS). We can infer that the properties "Stereo" and "Protection" are not statistically independent — once you know that a randomly selected Bennie has a Sweetone, you know the chances of it having a TDS system are greater than they would be if you didn't have that information. Once again our Venn diagram provides visual confirmation, in this case that the properties are not independent. In the overall population, the proportion of Bennies that are TDS-protected is only 35%. However, in the population of Sweetone-equipped Bennies, the proportion is significantly higher: 75%. Why might that be?
Bennie buyers who opt for the Sweetone stereo feel that their cars will be especially targeted for theft or vandalism, and are more likely to choose the TDS option than are buyers who choose the low-end car stereo. A savvy car thief knows that the next Bennie he passes has a 35% probability of being protected by a theft discouragement system. Once he identifies that a particular Bennie has a Sweetone stereo, he gains more information: then he knows that the car has a 75% chance of being TDS-protected. This may affect his decision about whether or not to break into that car. Summary Two uncertain events A and B are said to be statistically independent if knowing that A has occurred does not tell us anything about the probability of B occurring, and vice versa. Statistical independence of two events can be demonstrated based on data or proved from theory; it should never be assumed. Events that are not statistically independent are said to be statistically dependent.
Conditional Probabilities in Decision Analysis Probability theory is fascinating, sure, but what do conditional probabilities have to do with decision analysis?
To see how we might apply conditional probabilities in a decision analysis, let's revisit the Cloven film production example. Seth Chaplin of S&C Films is ecstatic. In a sushi lunch meeting with the agent of superstar actor Shawn Connelly, the agent agreed to try to convince Connelly to be the voice of the main character in Seth's new film, Cloven. Under Seth's charming yet irresistible pressure, the agent told Seth that — just between the two of them — he thought he had a 40% chance of persuading Connelly to take the role. Connelly's fame is such that by lending the prestige of his name to Cloven, the likelihood of Cloven's success increases dramatically. Seth has booked the services of a crack animation team, so he'll have to start production soon — probably before Connelly makes a decision. But that shouldn't hinder production because Connelly can add his voiceovers well after the animation has been created.
Seth has not yet signed either of the two deals he negotiated, and feels he should reconsider his options in light of Connelly's possible participation. With a decent shot at star power behind Cloven, Seth feels he can close a better deal with Pony. Connelly's voice would also substantially increase the likelihood of blockbuster success, making the deal with K2 potentially more lucrative. Seth returns to Pony and hammers out a second agreement, contingent on Connelly's participation. Pony will pay an additional $5 million — $15 million in total — if Seth can retain Connelly's voice acting services. Even after accounting for Connelly's salary, Seth expects to make a total of $2.2 million in profits from the Pony deal if Connelly agrees to take the role. How should all of this new information affect Seth's decision? Let's take a look at the new decision tree.
If Seth chooses the Pony deal, there are now two possible scenarios: in the first, Connelly agrees to lend his voice to Cloven, in the second, Connelly declines. These two scenarios are associated with two different profit outcomes: $2.2 million and $1 million, respectively. Based on Connelly's agent's assessment, there is a 40% chance that Connelly will take the role, and a 60% chance that he won't. What about the K2 option? Which of the following decision trees correctly reflects the newly introduced circumstances?
a. a This is not the best answer. Think about the order in which the outcomes of the uncertain events are revealed to Seth.
b. b This is the best answer.
The nodes of a decision tree are arranged from left to right in the order in which the decision
maker will eventually know their results. In this case, Seth will first discover whether or not Connelly takes the role; only then will he see how Cloven performs at the box office. What are the probabilities of the different branches? As in the Pony option, the first branching in the K2 option is a split between the scenarios in which Connelly signs onto the Cloven project and those in which he doesn't. These branches are associated with probabilities of 40% and 60%, respectively. What are the probabilities of the six final branches? Let's look closely at the top three branches. Once we know that Connelly has signed on, we need to know what the respective probabilities are of a "Blockbuster," a "Lackluster" performance, and a "Flop." In other words, we need the conditional probabilities of the three outcomes, given that Connelly takes part in Cloven. In any decision tree, to "be at a node" is to assume that all the events on the path leading to that node from the left have already taken place. Thus, the data we associate with any decision or chance node depend directly on the sequence of events on the unique path leading up to that node from the left. Specifically, the probabilities after a chance node must be conditioned on all events on the preceding path, and the outcome values must incorporate the effects of all of the preceding events.
Returning to Seth's decision: if Connelly takes part in Cloven, Seth estimates at 50% each the conditional probability that Cloven will be a "Blockbuster" and the conditional probability that it will have a "Lackluster" theater run. At this stage in Connelly's career, a "Flop" is virtually impossible. If Connelly doesn't take part, the conditional probabilities are simply the original probabilities of 30%, 50%, and 20% for the three possible outcomes. Using these conditional probabilities, we can determine which of the two options Seth should choose by calculating the EMVs of all the nodes and folding back the decision tree. Let's begin with the Pony option. What is its EMV? Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary. The EMV of the Pony option is $1.48 million.
Now we find the EMV of the K2 option, constructing it step by step. First, we must find the EMV for the event that Connelly decides to voice the main character in Cloven. That EMV is $3 million. Next, we find the EMV for the event in which Connelly refuses the part. That EMV is simply the EMV of the original K2 option: $1.4 million.
The total EMV for the K2 option is $2.04 million: the EMV with Connelly's participation weighted by the probability of his participation, plus the EMV of Cloven filmed without Connelly, weighted by the probability that he refuses the part. The possibility that Shawn Connelly might take part greatly enhances the attractiveness of the K2 option.
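Here is a sketch of the fold-back calculation; the $0 "Lackluster" and -$2 million "Flop" profit figures are assumptions consistent with the EMVs quoted in the text (all figures in $millions):

```python
# Folding back Seth's decision tree with Connelly's possible participation.
p_connelly = 0.40

emv_pony = p_connelly * 2.2 + (1 - p_connelly) * 1.0            # 1.48

emv_k2_with = 0.5 * 6 + 0.5 * 0                                 # Connelly signs: 3.0
emv_k2_without = 0.3 * 6 + 0.5 * 0 + 0.2 * (-2)                 # Connelly declines: 1.4
emv_k2 = p_connelly * emv_k2_with + (1 - p_connelly) * emv_k2_without

print(f"Pony EMV: ${emv_pony:.2f}M   K2 EMV: ${emv_k2:.2f}M")   # $1.48M vs $2.04M
```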
Summary The eventual outcomes of many decisions involve sequential uncertain events whose outcomes are determined over time. To conduct a decision analysis in such cases, we need the probabilities of the first uncertain event's possible outcomes and conditional probabilities of future uncertain events' possible outcomes, conditioned on the previous events' outcomes.
Exercise 1: Captain Ahab Fisheries Captain Ahab Fisheries cans herring and sardines in either spicy tomato sauce or vegetable oil. The table to the right summarizes the distribution of Ahab's product line. What is the marginal probability that a randomly selected can of fish is canned in tomato sauce? Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if necessary. The marginal probability of tomato sauce is 23%. Similarly, the marginal probability of vegetable oil is 77%.
What is the marginal probability that a randomly selected can of fish contains sardines? Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if necessary.
The marginal probability of sardines is 40%. Similarly, the marginal probability of herring is 60%.
What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can of sardines? Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if necessary. The conditional probability of tomato sauce given that a can contains sardines is 35%.
What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can of herring? Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if necessary. The conditional probability of tomato sauce given that a can contains herring is 15%.
Which is larger, the conditional probability that a randomly selected can of fish contains sardines, given that it is canned in vegetable oil, or the conditional probability that a randomly selected can of fish is canned in vegetable oil, given that it contains sardines? a. P(Sardines | Oil). This is not the best answer. Calculate both conditional probabilities and compare them.
b. P(Oil | Sardines). This is the best answer.
c. The two probabilities are the same. This is not the best answer. Calculate both conditional probabilities and compare them. The conditional probability of a can containing sardines given that it contains vegetable oil is 34%. The conditional probability of the fish being canned in vegetable oil given that the can contains sardines is 65%. Note that P(Sardines | Oil) is not the same as P(Oil | Sardines).
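The exercise's table is not reproduced here, but the joint probabilities implied by the answers above can be written down directly. A short Python sketch (ours, for illustration only) shows how the marginal and conditional probabilities relate:

    # Joint probabilities implied by the answers above (fish type, packing)
    joint = {
        ("sardines", "tomato"): 0.14,
        ("sardines", "oil"):    0.26,
        ("herring",  "tomato"): 0.09,
        ("herring",  "oil"):    0.51,
    }

    p_tomato   = sum(p for (fish, pack), p in joint.items() if pack == "tomato")
    p_oil      = 1 - p_tomato
    p_sardines = sum(p for (fish, pack), p in joint.items() if fish == "sardines")

    print(round(p_tomato, 2))                                  # 0.23
    print(round(joint[("sardines", "oil")] / p_oil, 2))        # P(Sardines | Oil) = 0.34
    print(round(joint[("sardines", "oil")] / p_sardines, 2))   # P(Oil | Sardines) = 0.65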
Exercise 2: Mutually Exclusive and Collectively Exhaustive The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities is less than 100% tells us:
a. The events are mutually exclusive. This is not the best answer. Try to think of an example of a set of events with these probabilities in which at least two events can occur simultaneously.
b. The events are not mutually exclusive. This is not the best answer. Try to think of an example of a set of mutually exclusive events with these probabilities.
c. Neither of the above can be inferred from the fact that the sum of the probabilities is less than 100%. This is the best answer. We cannot infer anything about whether or not these events are mutually exclusive from the information provided. The events A, B, and C at left are mutually exclusive. The events D, E, and F at right are not mutually exclusive. The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities is less than 100% tells us:
a. The events are collectively exhaustive. This is not the best answer. Think about what the probabilities of a set of collectively exhaustive events must sum to.
b. The events are not collectively exhaustive. This is the best answer.
c. Neither of the above can be inferred from the fact that the sum of the probabilities is less than 100%.
This is not the best answer. Think about what the probabilities of a set of collectively exhaustive events must sum to. For a set of events to be collectively exhaustive, the probabilities of the events must sum to at least 100%. The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities is greater than 100% tells us:
a. The events are mutually exclusive. This is not the best answer. Think about the largest number that the probabilities of a set of mutually exclusive events can add up to.
b. The events are not mutually exclusive. This is the best answer.
c. Neither of the above can be inferred from the fact that the sum of the probabilities is greater than 100%. This is not the best answer. Think about the largest number that the probabilities of a set of mutually exclusive events can add up to. The probabilities of the events in a mutually exclusive set cannot add up to more than 100%. If the probabilities of a set of events add up to more than 100%, then at least two of them are not mutually exclusive, i.e., they can occur simultaneously. The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities is greater than 100% tells us:
a. The events are collectively exhaustive. This is not the best answer. Try to think of an example of three events with these probabilities that are not collectively exhaustive.
b. The events are not collectively exhaustive. This is not the best answer. Try to think of an example of three events with these probabilities that are not collectively exhaustive.
c. Neither of the above can be inferred from the fact that the sum of the probabilities is greater than 100%. This is the best answer. We cannot infer anything about whether or not these events are collectively exhaustive from the information provided. Below are two examples of sets of events with these probabilities: one set is collectively exhaustive and the other isn't.
The Value of Information Market research can be expensive. If the cost of the research outweighs its value to Leo, then obviously he shouldn't spend money on the research. "We'll need to assess the value of Leo's market research," Alice tells you, "to determine whether or not paying for it is a wise choice." The Expected Value of Perfect Information The Eris Shoe Company is considering sourcing some of its production from the developing country Arboria. Arboria was a land of civil war and strife until a controversial UN intervention two years ago reconciled warring factions and helped install a national unity government. Today, guerrilla warfare persists in the mountains, but the major coastal cities are relatively secure. Eris CEO Emily Ville has identified Arboria as a candidate location due to its low labor costs and is considering opening a small buying office in the capital city. However, she also recognizes the potential for substantial risk if civil war should break out, including the loss of Eris' investments in the office and the disruption of its supply chain. Emily estimates the present value of sourcing from Arboria at $1 million in net savings as long as the Arborian production facility and infrastructure work at a high level of reliability. In her opinion, the probability of such a "Serene" scenario is 40%. Emily believes there is a 40% chance of a more "Troubled" environment beset with minor
supply chain disruptions, which would reduce Eris' expected cost savings to $0.5 million. Finally, she estimates the likelihood of major political unrest at 20%, and the net present value of the losses associated with such a "Chaotic" scenario at $1 million. Instead of sourcing from Arboria, Eris could continue its current sourcing agreements, which would not result in any cost reduction. The EMV of sourcing from Arboria is $0.4 million. Based on the EMV criterion, Emily should choose to source from Arboria. However, given the stakes involved, Emily would love to have additional information that would increase her understanding of Arboria's political situation. Ideally, she'd like to base her decision on perfect information, i.e., she'd like to know now exactly which one of the three outcomes will materialize. Such perfect information would be extremely valuable: how much should Emily be willing to pay for it? Managers often have the opportunity to gather more data or expertise to help inform their business decisions. Depending on the business context, new information can be obtained from statistical data, expert consultants, or scientific tests and experiments. In many cases, such information can improve the accuracy of the estimates incorporated into a decision analysis. However, the cost of such data can drag down the bottom line. Clearly, the cost of additional information must be weighed against its value. How can managers assess the value of information? Suppose — for the sake of argument — that Emily knows a fortune-teller, Frieda Featherlight, who has genuine psychic powers and can accurately predict the future, giving Emily the perfect information she craves. How much should Emily pay Frieda for her supernatural services? To answer this question, we calculate the Expected Value of Perfect Information — the EVPI — provided by Frieda. We can determine this value by framing a decision: should Emily purchase Frieda's information before she makes her decision to source from Arboria? Characterized as a decision problem, we can determine the EVPI using familiar tools: a decision tree and the expected monetary value. Let's construct a tree for Emily's decision, beginning with a decision node that branches into two options: "Buy the information" or "Don't buy the information." The lower branch — "Don't buy the information" — is simply the original tree for the decision about whether or not to source from Arboria. This tree uses only the information Emily already possesses. The EMV of this option is $0.4 million. If Emily engages Frieda's services and "buys the information," three possibilities emerge based on which scenario Frieda predicts — "Serene," "Troubled," or "Chaotic." If the probabilities of these scenarios actually occurring are 40%, 40%, and 20%, respectively, what is Emily's best assessment of the likelihood that Frieda will predict each scenario? a. The same as her best estimates of the probabilities that each scenario will occur: 40%, 40%, and 20%, respectively. This is the best answer.
b. Different from her best estimates of the probabilities that each scenario will occur: 40%, 40%, and 20%, respectively. This is not the best answer. What new information has Emily gained that would cause her to estimate different probabilities for Frieda's predictions?
Without further information, Emily has no reason to change her initial assessments. Unless she probes Frieda for information, Emily's best guess that Frieda will predict each outcome — "Serene," "Troubled," and "Chaotic" — is the same as her best estimates for the probabilities that those outcomes will actually occur: 40%, 40%, and 20%, respectively. Let's assume for the moment that Frieda won't charge Emily for her services. The beauty of perfect information is that Emily can delay her decision until after she has heard Frieda's
completely accurate prediction of what the Arborian political climate will be. If Frieda predicts a "Serene" sourcing experience, then Emily will choose to source from Arboria, and cost savings of about $1 million are certain. Likewise, if Frieda predicts a "Troubled" sourcing experience, Emily will source from Arboria and cost savings of about $0.5 million are certain. If Frieda predicts a "Chaotic" sourcing experience, then Emily will choose not to source from Arboria, thereby avoiding a certain loss in the $1 million range. In this case, Eris will receive no cost savings. Still working under the assumption that Frieda won't charge for her services, what is the EMV of "buying the information"? Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary. The EMV of "buying the information" is $600,000. At $600,000, the EMV of "buying the information" is $200,000 higher than the EMV of "not buying the information." This difference between the EMVs of the option with perfect information and without perfect information is the expected value of perfect information (EVPI). The EVPI of $200,000 is the maximum amount Emily should pay Frieda for a séance. The EVPI establishes an upper bound on what we should pay for perfect information. In some business cases, perfect — or near perfect — information may be available without supernatural means. For instance, suppose that aerospace company Airbus wants to know if it could perfect all of the technologies necessary to create a safe and reliable SpaceCruiser, a space shuttle for tourists that would have all the comforts of a luxury cruise ship. Airbus could build a small but completely functional prototype and see if it works. Although Airbus might collect near perfect information by doing so, the cost of developing the SpaceCruiser prototype would likely be higher than the expected value of that perfect information. Instead, Airbus might first use computer simulations and limited functionality prototypes to assess the technical viability of the SpaceCruiser. The information thus gained wouldn't be perfect, but it would be helpful and it would cost far less than a fully functional prototype. The EVPI establishes an upper bound on the value of any information, perfect or flawed. Through sampling, educated expertise, or imperfect testing, we might be able to gain better — though not perfect — estimates of the probabilities and the outcome values of our decision's possible scenarios. Shortly, we'll learn how to value imperfect information — information that reduces, but does not eliminate, our uncertainty about future events. For the time being, we know that we should never spend more for imperfect information than we are willing to pay for perfect information. In her quest for an infallible choice in the Arborian sourcing decision, if Emily won't pay more than $200,000 for perfect information, she certainly shouldn't pay more than $200,000 for imperfect information. If the price of imperfect information exceeds the EVPI, we should not expend resources on it. Summary As managers, we would like to know exactly which outcomes will occur so we can make the best decisions. We can calculate the expected value of such perfect information — EVPI — to find an upper limit on the amount we would be willing to pay for any additional information. To calculate the EVPI, we first frame the decision problem "to buy or not to buy the perfect information."
To find the EVPI, we subtract the EMV of not buying perfect information from the EMV of buying perfect information (assuming it's free). The Expected Value of Sample Information
In almost all circumstances, perfect information is either impossible to assemble or prohibitively expensive. In these cases, we use imperfect information — often called "sample" information — to inform our decision. As with perfect information, the cost of sample information must be weighed against its value. As managers, how do we assess the value of sample information? Let's return to the Eris Shoe Company's decision to source production in Arboria. Emily Ville has distinguished three possible scenarios for the Arborian political climate: "Serene," "Troubled," and "Chaotic," which she believes have probabilities of 40%, 40%, and 20%, respectively. The outcomes she associates with these scenarios — in cost savings for Eris — are $1 million, $0.5 million, and -$1 million, respectively. Emily would like to improve her estimates of the probabilities. She contacts PoliFor, a consulting company that specializes in business outlook intelligence. PoliFor produces an analysis of a country's or region's political climate, and then provides a risk assessment specific to the needs of its client, distinguishing between "High" and "Low" risk situations. In a world of uncertainty, PoliFor's assessments are not always accurate. Sometimes, regions with "Low" risk assessments burst into revolutionary flame. Sometimes, regions with "High" risk assessments turn out to be as stable and placid as the moon's orbit. How high a price should Emily pay for PoliFor's services, given that PoliFor's assessments are not perfectly reliable? Based on preliminary conversations with PoliFor's lead consultant, Emily assesses the probabilities that PoliFor will report "High" or "Low" risk for sourcing in Arboria at 30% and 70%, respectively. She then considers how each of these risk level assessments would affect her estimates of the relative likelihood of her three representative scenarios: "Serene," "Troubled," and "Chaotic." Emily believes that if PoliFor predicts "Low" risk, then the likelihood of a "Chaotic" political climate would be low: 5%, and the probabilities of "Troubled" or "Serene" scenarios would be 40% and 55%, respectively. If we use "Low" to represent PoliFor predicting a low-risk political environment, which of the following best expresses the beliefs Emily has summarized above? a. P(Serene) = 55%, P(Troubled) = 40%, P(Chaotic) = 5% This is not the best answer. What information would cause Emily to change her assessments of the probabilities of the three political scenarios?
b. P(Serene | Low) = 55%, P(Troubled | Low) = 40%, P(Chaotic | Low) = 5% This is the best answer.
c. P(Low | Serene) = 55%, P(Low | Troubled) = 40%, P(Low | Chaotic) = 5% This is not the best answer. Emily is updating her assessments of the likelihood of the political scenarios, not her assessment of PoliFor's prediction.
The information upon which Emily is basing her assessments is the PoliFor prediction of a "Low" risk political climate. Thus, she is estimating conditional probabilities that Eris' experience sourcing from Arboria will be "Serene," "Troubled," or "Chaotic," given that the consultants predict a "Low" level of risk. Similarly, Emily assesses the conditional probabilities that Eris' experience sourcing from Arboria will be "Serene," "Troubled," or "Chaotic" given that the consultants predict a "High" level of risk at 5%, 40%, and 55%, respectively. How do we calculate the expected value of the imperfect information PoliFor offers? As we did for perfect information, we frame this question as a decision problem: should Emily purchase the imperfect information from PoliFor? First, we'll construct a decision tree. The tree begins with a decision node for the choice Emily faces: "Buy the information" or "Don't buy the information." The lower branch for the option "Don't buy the information" is the original tree that Emily
constructed using her initial estimates for the outcome values and probabilities of the three basic scenarios. What should emanate to the right from the "Buy Information" branch? a. A decision node to source or not. This is not the best answer. After Emily buys the PoliFor report, what is the first thing she learns?
b. A chance node for the PoliFor forecast of the two risk levels: High and Low. This is the best answer.
c. A chance node for the three sourcing scenarios: Serene, Troubled, Chaotic. This is not the best answer. After Emily buys the PoliFor report, what is the first thing she learns?
d. The outcome values. This is not the best answer. After Emily buys the PoliFor report, what is the first thing she learns? She has decisions to make and uncertainties to resolve before she knows what endpoint values to assign to the outcomes!
If we are at the end of the "Buy Information" branch, we assume that Emily has purchased the information. In that case, the first thing that she will do is open the report envelope and examine the information to learn what level of risk PoliFor has predicted. Thus, the upper branch for the option "Buy the information" leads to a chance node that splits into two branches, one for each risk level the consulting company might predict. What should emanate from each of the "Low" and "High" branches? a. A decision node to source or not. This is the best answer.
b. A chance node for the three sourcing scenarios: Serene, Troubled, Chaotic. This is not the best answer. How will Emily actually use the PoliFor forecast? c. The outcome values. This is not the best answer. How will Emily actually use the PoliFor forecast?
The information that Emily has purchased is valuable only if she uses it to support her decision-making. The value of the information resides in Emily's ability to make her sourcing decision after learning the level of risk PoliFor predicts. Thus, after each prediction, we place a decision node: should she source from Arboria or continue to source from her current suppliers? To complete the decision tree, we must incorporate the possible scenarios that can occur if Emily chooses to source from Arboria: "Serene," Troubled," or "Chaotic." What probabilities should we assign to the three branches emanating from the chance node highlighted to the right? a. P(Serene), P(Troubled), P(Chaotic). This is not the best answer. What information would cause Emily to change her assessments of the probabilities of the three political scenarios?
b. P(Serene | High), P(Troubled | High), P(Chaotic | High). This is the best answer.
c. P(High | Serene), P(High | Troubled), P(High | Chaotic). This is not the best answer. Emily is updating her assessments of the political scenarios, not her assessment of PoliFor's prediction.
At the highlighted node, we assume that every event on the unique path leading from the node back to the beginning of the tree has transpired: Emily bought the information; PoliFor
reported "High" risk; and Emily chose to source from Arboria. Thus, the probabilities assigned to the branches must be conditioned on those events. For example, we assign P(Serene | High) to the "Serene" scenario on the "High" branch. Now we complete the tree, substituting the values for the conditional probabilities and placing the appropriate monetary values at the endpoints. As usual, we'll assume for the moment that the information is free, and start to fold back the tree. What is the EMV of the "High Risk" branch's decision node? a. -$0.3 million This is not the best answer. Think about how we usually calculate the EMV at a decision node.
b. $0 This is the best answer.
c. $0.9 million This is not the best answer. Think about how we usually calculate the EMV at a decision node.
d. -$1 million. This is not the best answer. Think about how we usually calculate the EMV at a decision node.
If PoliFor predicts a "High" risk level, Emily should not choose to source from Arboria since the EMV of the "Source" from Arboria node, -$0.3 million, is less than the EMV of the "Don't source" node, $0 million. Instead she should continue her current supply arrangements, in which case she will realize $0 in cost savings. Folding back the decision tree, we find an EMV of $490,000 for the option "Buy the information." What is the most Emily should pay for the imperfect information? The difference between the EMVs of the "Buy" and "Don't buy the information" options is $90,000, the expected value of imperfect (or sample) information (EVSI) provided by PoliFor. The EVSI is the maximum amount that Emily should be willing to pay for PoliFor's report. As we would anticipate, the expected value of this imperfect information is quite a bit less than $200,000, the expected value of perfect information we computed earlier. PoliFor's information, although imperfect, has value because it allows Emily to update her original probability estimates based on the risk level PoliFor predicts. She can make better decisions based on these more accurate updated probability assessments. The probability estimates we started with — Emily's estimates for P(Chaotic), etc. — are called prior probabilities. The updated conditional probabilities — P(Chaotic | High Risk), P(Troubled | High Risk), etc., — are called posterior probabilities. Summary By investing in professional expertise, statistical studies or tests, managers can often improve their understanding of uncertain events. Such imperfect (or "sample") information is itself a source of uncertainty because it isn't a perfect predictor of the outcomes of interest. Although imperfect information can improve our predictions, such information comes at a price. Responsible managers calculate the expected value of sample information (EVSI), and purchase information only if its expected value exceeds its cost. To calculate the EVSI, we pose the decision problem "to buy or not to buy the sample information," and subtract the EMV of buying the information — assuming that it's free — from the EMV of not buying it. Updating Prior Probabilities Zeke "Claw" Crankshaw owns and runs Oligarchol, one of the largest independent oil producers in the Western Hemisphere.
Updating Prior Probabilities Zeke "Claw" Crankshaw owns and runs Oligarchol, one of the largest independent oil producers in the Western Hemisphere. Recently, Zeke acquired the mineral rights to a piece of land near Chanchito Gordo Canyon. Based on his own professional assessment, Zeke believes that he has a 50% chance of finding an oil field if he drills in Chanchito Gordo, and a 50% chance of drilling nothing but "Dry" holes. Like any subjective probabilities, these estimates are the product of Zeke's educated guesswork. Clearly, the oil is either beneath the surface at Chanchito or it isn't. However, Zeke's experience tells him, for example, that in about half of all similar situations — similar terrain, similarly exposed rock strata, etc. — oil prospectors have struck oil. He estimates the value of striking oil — in present value of future profits — to be $8 million after netting out drilling costs. If all of his boreholes come up "Dry," Zeke will have lost the $2 million cost of drilling. With these assessments in mind, Zeke must decide if he should drill on the property. If Zeke decides not to drill, his investment in the mineral rights will not pay off in the way he had hoped. However, since the purchase price of the mineral rights has already been incurred, it is a sunk cost that shouldn't bear on the decision. What is the expected monetary value of drilling in Chanchito Gordo? Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary. The EMV of drilling in Chanchito Gordo is $3 million. Without further information, Zeke should start shipping drill bits and oil rigs over the dusty back-country roads to Chanchito. But old Zeke is not the impetuous young whippersnapper of the early days of Oligarchol, back when the company was derisively referred to in industry circles as "Pipe Dreams, Incorporated." Now, Zeke is happy to use the latest in seismic testing technology to gain better information and make more informed decisions. Zeke can order a $100,000 seismic test of the Chanchito Gordo property. Although the seismic test will improve his estimate of the likelihood of finding oil, it won't provide perfect information. The seismic test will report one of two possible outcomes: a "Positive" or a "Negative" result, depending on whether or not the test indicates the presence of oil. Because the seismic test's accuracy is a key determinant of its value, Zeke has gathered some historical data on the test's performance. Zeke believes that if there is an oil field on his property, then the probability that the test will return a "Positive" result is 90%. This is the conditional probability P(Positive | Oil). Clearly, the test isn't perfect — even if there is a reserve on his property, there is a 1 in 10 chance the test will fail to detect it. This is the conditional probability of a "False Negative" result: P(Negative | Oil) = 10%. And, unfortunately for Zeke, even when there isn't a drop of oil to be found, the test will still report a positive result — a false "Positive" — with probability 30%. The probability that the seismic test will indicate the absence of oil when there is, in fact, no oil to be found — P(Negative | Dry) — is 70%. How can Zeke use these data to calculate the EVSI — the expected value of the test information? As usual, we start with a decision node that branches into two options: "Test" or "Don't Test." The lower branch, "Don't Test," is the same as Zeke's original tree for whether to drill or not. The EMV for the "Don't Test" branch is $3 million; it is based only on the information Zeke previously had available to him.
If Zeke buys the test, he will learn the test's result — positive or negative — before he has to decide whether to drill or not. What probability should Zeke assign to the "Test Positive" branch? The probability that Zeke should assign to the "Test Positive" branch is P(Positive).
However, this probability is not in the information Zeke originally had available, which is summarized in the table to the right. To find P(Positive), we must first find the joint probabilities and enter them into the table. What is P(Oil & Positive)? Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500"). Round if necessary. To find P(Oil & Positive), we use the familiar relationship for conditional probabilities. If you wish to practice, calculate the other joint probabilities and P(Positive) and P(Negative) before advancing to the next screen. Adding the joint probabilities, we find that P(Positive) = 60% and P(Negative) = 40%. We are now ready to complete the tree. After Zeke learns the test result, he makes the decision about whether or not to drill. If he drills, he will either strike oil or not. What probability should we associate with the branch highlighted to the right? a. P(Positive | Oil) This is not the best answer. What event should Zeke condition this probability on?
b. P(Oil | Positive) This is the best answer.
c. P(Oil) This is not the best answer. Zeke must find a probability conditioned on preceding events.
d. P(Positive) This is not the best answer. What are the possible outcomes of drilling?
If Zeke is at the highlighted node, then all the events on the branch leading up to the node must have occurred: he decided to test, the test was positive, and he decided to drill. Thus, Zeke must condition the probability of striking oil on past events: specifically, on the fact that the test report was "Positive." Thus, he should associate the conditional probability P(Oil | Positive) with the highlighted "Oil" branch. We find the conditional probabilities in the usual way. The full table of joint, marginal, and conditional probabilities is shown to the right. Our tree now has all the necessary data. We fold it back and find that the EMV for conducting a seismic test — assuming the test is free — is $3.3 million. What is the EVSI of the seismic test? Enter the EVSI in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary. The EVSI is $300,000 — the difference between the EMV of conducting the test, $3.3 million, and the EMV of not conducting the test, $3 million. This is the maximum amount Zeke should be willing to spend on the seismic test. Since the expected value of the test exceeds its $100,000 cost, Zeke should order the test. Summary Often, information is not available in the form we need to inform our decision process and to calculate the EVSI. Frequently, for example, we have access to data about the reliability of a test. The reliability of a test is given as the set of conditional probabilities P(TRj | Si), where TRj is a test result and Si is one of the scenarios the test is intended to help predict. Collectively, these conditional probabilities reveal the test's ability to predict the outcome of the uncertain event in question. Using this information, we update our initial estimates for the probabilities of each scenario — the prior probabilities P(Si) — to obtain improved conditional probabilities of each scenario, given the result of the test — the posterior probabilities P(Si | TRj). We use the posterior probabilities to calculate the EVSI and aid our decision process.
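As a check on Zeke's numbers, here is a short Python sketch (ours, for illustration only) that updates the prior probabilities with the test's reliability data and folds back the tree; the variable names are not from the course.

    # Zeke's priors, outcomes ($ millions), and the seismic test's reliability
    p_oil = 0.5
    v_oil, v_dry = 8.0, -2.0
    p_pos_given_oil, p_pos_given_dry = 0.9, 0.3

    # Joint and marginal probabilities of the test results
    j_pos_oil = p_pos_given_oil * p_oil              # 0.45
    j_pos_dry = p_pos_given_dry * (1 - p_oil)        # 0.15
    p_pos = j_pos_oil + j_pos_dry                    # 0.60
    p_neg = 1 - p_pos                                # 0.40

    # Posterior probabilities of striking oil
    p_oil_given_pos = j_pos_oil / p_pos              # 0.75
    p_oil_given_neg = (p_oil - j_pos_oil) / p_neg    # 0.125

    def drill_emv(p):
        """EMV of drilling when the probability of oil is p."""
        return p * v_oil + (1 - p) * v_dry

    emv_no_test = max(drill_emv(p_oil), 0)                        # 3.0
    emv_test = (p_pos * max(drill_emv(p_oil_given_pos), 0)
                + p_neg * max(drill_emv(p_oil_given_neg), 0))     # 3.3
    print(round(emv_test - emv_no_test, 2))                       # 0.3 -> EVSI = $300,000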
Solving the Market Research Problem You know how to calculate the expected value of information and set an upper limit on the amount Leo should be willing to pay for market research. But what kind of market research could Leo have in mind? And would it be worth its cost? Leo wants to know how much money he should be willing to spend on additional information about consumers' likely reaction to a floating restaurant. First, you calculate the expected value of perfect information to give Leo an absolute upper bound on the amount he should pay for any kind of information. You begin by setting up a decision tree to help inform the decision of whether or not to buy perfect information. Which of the following best represents the "Buy Information" branch of the perfect information tree? a. a This is not the best answer. Think about the order in which the events must occur.
b. b This is not the best answer. Think about which branches split off due to uncertainty, and which split off when Leo makes a choice.
c. c This is the best answer.
d. d This is not the best answer. Think about the order in which the events must occur.
Whatever the source of perfect information, it will predict the occurrence of either a "Phenomenon" or a "Fad." Which one of these scenarios it will predict is uncertain; the respective probabilities are equal to the probabilities of the two scenarios: 35% and 65%. Once Leo has perfect information, he can choose to either launch the Tethys venture or not. What is the EMV of acquiring the perfect information, assuming that it is free? Enter the EMV in $millions as a decimal number with three digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary. If Leo is certain that the Chez Tethys will be a "Phenomenon," he will launch her and realize a $2 million profit. If he is certain that the Tethys would be a mere "Fad," he'll avoid the $800,000 loss by not launching, thereby realizing no profits or losses in the floating restaurant industry. The EMV of acquiring perfect information is $700,000. Recall that the EMV of launching the Tethys — based on Leo's original estimates — was $180,000. What is the expected value of the perfect information? Enter the EMV in $millions as a decimal number with three digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary. The expected value of perfect information is the difference between $700,000 — the EMV of acquiring the perfect information — and the EMV of not acquiring the perfect information. In this case, the EMV of not acquiring the perfect information is the original EMV of launching the Chez Tethys, $180,000. The difference is $520,000. That's a lot of money! I have an idea for a market research event that is expensive, but it's much cheaper than that. Let's get to it! Not so fast, Leo. Up to $520,000 is what you should be willing to pay for perfect information. But real world information is usually much less than perfect. What exactly do you have in
mind? You two are going to love this. I'll rent a cruise ship and do a week-long test cruise around the islands, inviting guests at the finest hotels on board for free. At dinner, I'll run an advertisement for the Tethys, then survey all the guests and ask them if they'd be interested in dining there! That's an audacious market research plan for an audacious business venture. I like the idea, though, and I think you'll get much better information than if you hire a firm to do conventional market research. Let's try to estimate the value of this information. The three of you debate the merits of Leo's marketing event. You split the results of Leo's event into two response scenarios based on how Leo's guests/subjects receive his demonstration: "Enthusiastic" and "Tepid," estimating the probabilities of these responses at 40% and 60%, respectively. If the subjects respond to Leo's demonstration "Enthusiastically," you believe that the Tethys becoming a "Phenomenon" is much more likely than your earlier estimate: 65%. On the other hand, if the reception is "Tepid," you estimate that the probability of a "Phenomenon" is a mere 15%. Given these data and the tree to the right, what is the expected value of Leo's sample information? Enter the EVSI in $millions as a decimal number with four digits to the right of the decimal point (e.g., enter "$5,555,000" as "5.5550"). Round if necessary. To calculate the EVSI, calculate the EMV of Leo's running his planned event. First, calculate the EMV of launching the Tethys given that the event participants' response is "Enthusiastic." The EMV is the sum of the outcomes of the "Phenomenon" and "Fad" scenarios, after they have been weighted by the conditional probabilities that these scenarios will occur given an "Enthusiastic" reception: $1.02 million. Since the EMV for launching the Tethys is higher than the EMV for not launching her, Leo should embark on the Tethys venture if his guests/subjects respond "Enthusiastically" to his event. Thus, the EMV if guests give an "Enthusiastic" response to Leo's event is $1.02 million. Next, calculate the EMV of launching the Tethys given that the event participants' response is "Tepid." The EMV is the sum of the outcomes of the "Phenomenon" and "Fad" scenarios, after they have been weighted by the conditional probabilities that these scenarios will occur given a "Tepid" reception: -$380,000. Since the EMV for launching the Tethys is lower than the EMV for maintaining the status quo at the Kahana, Leo should not embark on the Tethys venture if the participants' response is "Tepid". Thus, the EMV of a "Tepid" response to Leo's event is $0. The EMV of Leo's event is $408,000: the sum of the EMVs of the "Enthusiastic" and "Tepid" responses, after they have been weighted by the probabilities that these responses will occur. The expected value of Leo's sample information is the difference between the EMV of his event and $180,000 — the EMV of launching the Tethys without further information. The difference is $228,000. I talked to some friends last night, and the total cost of renting and running a cruise ship for a week, and shooting an advertising short comes to at least $240,000. That's higher than the expected value of this information we calculated: $228,000. Are you telling me that my event would not be worth its cost? I'm afraid so, Leo. But that doesn't necessarily mean that you shouldn't do it. You might choose to have your event even if its expected monetary value is low. It's late.
Let's break for tonight and meet over breakfast tomorrow.
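Before the story continues, the Tethys figures above can be double-checked with a Python sketch in the same style (ours, illustrative only); all names are our own.

    # Leo's prior beliefs and outcomes ($ millions)
    p_phenomenon = 0.35
    v_phenomenon, v_fad = 2.0, -0.8

    def launch_emv(p):
        """EMV of launching the Tethys when the probability of a 'Phenomenon' is p."""
        return p * v_phenomenon + (1 - p) * v_fad

    emv_no_info = max(launch_emv(p_phenomenon), 0)      # 0.18

    # Perfect information: launch only when a "Phenomenon" is certain,
    # so the EMV with perfect information is 0.35 * 2 + 0.65 * 0 = 0.7
    evpi = p_phenomenon * v_phenomenon - emv_no_info    # 0.52 -> $520,000

    # Leo's market research event: "Enthusiastic" vs. "Tepid" response
    p_enthusiastic = 0.40
    emv_event = (p_enthusiastic * max(launch_emv(0.65), 0)
                 + (1 - p_enthusiastic) * max(launch_emv(0.15), 0))   # 0.408
    evsi = emv_event - emv_no_info                      # 0.228 -> $228,000
    print(round(evpi, 3), round(evsi, 3))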
Exercise 1: The FUD Grocery Chain The grocery chain FUD offers two lines of private label products: "FUD Brand," and "Pleasant Valley Fare." The two brands each are used to label a set of grocery items with very little overlap. For example, "FUD Brand" is used for meat and cheese products; "Pleasant Valley Fare" for canned goods. Esther Smith, CEO of FUD, anticipates that eliminating the brand with weaker consumer loyalty and repackaging its product line under the label of the stronger brand would result in an additional $200,000 in profits from increased sales and from cost savings on packaging and advertising. If Esther eliminates the stronger brand, she anticipates a net bottom line improvement of $0 — reductions in sales volume or price discounts would roughly offset cost savings. Esther doesn't know which of the brands enjoys greater customer loyalty — since the product lines don't overlap, she can't directly compare the two brands' performances. Initially, she believes that there is a 50% chance that customers are more loyal to FUD Brand, and a 50% chance that they are more loyal to Pleasant Valley Fare (PVF). Based on this information, which of the two options - eliminating FUD Brand or Pleasant Valley Fare - has the higher EMV? a. Eliminate FUD Brand. This is not the best answer.
b. Eliminate Pleasant Valley Fare. This is not the best answer.
c. The EMVs for both options are the same. This is the best answer.
d. The answer cannot be determined from the information provided. This is not the best answer.
Each option has an EMV of $100,000. Based on Esther's current information, she has no basis for preferring one brand over the other. Suppose Esther had access to perfect information, i.e., she could discover with certainty which brand her customers prefer, so she could eliminate the weaker brand and increase FUD's profits by $200,000. How much should she be willing to pay for such perfect information? Enter the expected value of perfect information in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary. First, construct the tree for the decision to buy or not to buy the information. The EMV for not buying the information is $100,000 no matter which brand Esther chooses to eliminate. There is a 50% chance that the perfect information Esther acquires will reveal that customers prefer the FUD Brand and a 50% chance it will reveal that customers prefer Pleasant Valley Fare. Once Esther has the perfect information, she will choose to eliminate the brand that customers are less loyal to, adding $200,000 to FUD's bottom line. The EMV of buying perfect information is $200,000. The expected value of perfect information, EVPI, is the difference between the EMVs of buying and not buying perfect information: $100,000. Since Esther cannot get perfect information, she's willing to settle for information that is less than perfect and decides to hire a market research firm to determine which of FUD's two brands customers prefer.
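As a quick numerical check of the EVPI above, here is a one-off sketch (ours, with figures in $ thousands):

    # FUD: with a 50/50 prior, eliminating either brand has the same EMV
    gain_if_right, gain_if_wrong = 200, 0                    # $ thousands
    emv_no_info = 0.5 * gain_if_right + 0.5 * gain_if_wrong  # 100
    emv_perfect = gain_if_right                              # always eliminate the weaker brand
    print(emv_perfect - emv_no_info)                         # 100 -> EVPI = $100,000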
Esther believes that the probability that the market research firm will find that FUD Brand has higher customer loyalty is 50%. Likewise, the probability that the market research firm will predict that Pleasant Valley Fare has higher customer loyalty is 50%. Whichever brand the market research firm indicates is stronger, Esther believes that the finding will be correct with probability 85%. Thus, there is a 15% chance that the finding the firm provides will be wrong. What is the expected value of this sample information? Enter the expected value of sample information in $millions as a decimal number with three digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.500"). Round if necessary. First, construct the tree for the decision to buy or not to buy the sample information using the probabilities and outcomes cited above. The EMV for not buying the information is $100,000 no matter which brand Esther chooses. Once Esther has the sample information, she will choose to eliminate the brand that the market research indicates is weaker. The EMV of eliminating the brand that market research predicts enjoys less customer loyalty is $170,000. The EMV of buying this imperfect information is $170,000. Thus, the expected value of sample information, EVSI, is the difference between the EMVs of buying and not buying the sample information: $70,000. Exercise 2: PPD Immunity After a public relations debacle over a possible outbreak of pulluscular pig disorder (PPD) at one of Bowman-Lyons-Centerville's hog farms, Paul Segal, head of the hog farming division, is considering vaccinating another herd at a hog farm located near the original outbreak. Immunity to PPD is fairly common among pigs; the probability that the entire herd in question is immune is fairly high: 60%. However, if any pigs in the herd are not immune, the expected cost to Bowman-Lyons-Centerville is $150,000. This cost has been calculated based on the probability of a PPD outbreak and the ensuing costs to contain the disease and manage the expected PR fallout. The cost of vaccinating the herd is $40,000. What is the EMV of the cost of not vaccinating the herd? Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500" as "5.5"). Round if necessary.
The EMV of the cost of not vaccinating the herd is $60,000. Since the cost of vaccinating the herd — $40,000 — is lower than the expected cost of not vaccinating the herd, Paul should have the herd vaccinated, barring the emergence of any further information. There is a test that Paul could perform on the herd to determine if the herd is immune to PPD. This test delivers two results: "Positive" for immunity, or "Negative" for immunity. The test is not perfectly accurate: if the entire herd is immune, the test will report "Positive" with probability 85% and "Negative" with probability 15%. Similarly, if any of the pigs in the herd are not immune, the test will report "Positive" with probability 30% and "Negative" with probability 70%. Using Paul's prior probability for the immunity of the herd, calculate P(Positive), the marginal probability that the test will report a "Positive" result. Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500"). Round if necessary. First, calculate the joint probabilities P(Positive & Immune) and P(Positive and Not Immune), using the marginal probabilities for immunity and non-immunity — 60% and
40%, respectively — and the conditional probabilities that quantify the reliability of the test - P(Positive | Immune) and P(Positive | Not Immune). Then, sum the joint probabilities to find the marginal probability of a "Positive" test result: 63%. Since the test results "Positive" and "Negative" are mutually exclusive and collectively exhaustive events, the probability of a "Negative" is simply 100% minus the probability of a "Positive," i.e., P(Negative) is 37%. The tree below will help Paul determine whether or not ordering the immunity test will be worthwhile. The calculated probabilities of the test results are entered in the appropriate branches. However, to reach a decision, Paul needs the conditional probabilities of immunity and non-immunity conditioned on a "Positive" test result. What is P(Immune | Positive)? Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500"). Round if necessary. To calculate P(Immune | Positive), use the definition of conditional probability and divide the joint probability P(Positive & Immune) by the marginal probability P(Positive). P(Immune | Positive) is 81%. Since immunity and non-immunity are mutually exclusive and collectively exhaustive events, P(Not Immune | Positive) is simply 100% minus P(Immune | Positive) i.e., 19%. Calculate the remaining probabilities on the decision tree. What is the highest amount Paul should be willing to spend on the immunity test? Enter the expected value of sample information in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500" as "5.5"). Round if necessary. The expected value of not vaccinating given a "Positive" test result is $28,500. Since this is lower than the cost of vaccination, Paul should choose not to vaccinate on EMV grounds. The probability that the herd is immune given that the test reports "Negative" is 24.3%. Since the EMV of the cost of vaccinating the herd ($40,000) is lower than the EMV of the cost of not vaccinating ($113,550) Paul should choose to vaccinate if the test result is "Negative". The EMV of conducting the immunity test is $32,755, i.e., the sum of the EMVs of "Negative" and "Positive" test results, weighted by their respective probabilities. The expected value of sample information, EVSI, is the difference between the EMVs of testing and not testing for immunity, i.e., $7,245. The most Paul should be willing to pay for the immunity test is $7,245. Risk Analysis Just as an e-learning course must have a bittersweet last section, so too must your internship on Hawaii have a final day. A parting Pacific frolic is refreshing preparation for your meeting with Leo. Introducing Risk As you dry in the morning sun, Alice invites you to consider Leo's position should he wager the Kahana on the Tethys' success. "Even if the potential profits are substantial, losing the Kahana if the Tethys tanks would be a severe blow. I wonder how seriously he's considered the potential downside." Imagine: upon your arrival at business school you buy yourself a nice new car. It's a sleek, powerful, Burgundy Ben Hur by Chariot — with both a Sweetone stereo and a Theft Discouragement System — worth over $25,000. City driving can be rough: the probability that you'll be involved in an accident that "totals" your car — reduces its value from $25,000 to zero — in your first year is 0.01%. 
Collision insurance to cover such a loss in your new city is optional: it will cost you $1,600 per year above and beyond the premium for required liability insurance. Will you buy the collision
insurance? a. Yes, I will buy collision insurance. There is no single correct answer to this question. Your answer depends on your risk tolerance. Continue on with the text to explore ways to think about risk.
b. No, collision insurance is for the weak. There is no single correct answer to this question. Your answer depends on your risk tolerance. Continue on with the text to explore ways to think about risk.
c. A friend of mine in town can get me much cheaper insurance than that! Good for you! Please pass your friend's contact information on through this application's discussion board.
d. Could I please have the Bennie in Champagne instead? There is no single correct answer to this question, but there are an infinite number of wrong ways to answer, responding with an unrelated question being one of them.
Most people would pay the premium to mitigate the risks associated with a car accident leading to such large losses. But let's look at the decision tree for buying collision insurance. For simplicity, we'll ignore accidents that do not total your car. After a year in school, your Bennie's market value will still be $25,000 (in present value terms). If you don't buy collision insurance and the car is totaled, you'll lose the full value of the car — $25,000. If — as we all hope — you and your car survive the city's roads unscathed, you won't lose anything. Opting for collision insurance entails a $1,600 insurance premium. If you are spared the calamity of a wreck, you'll have incurred just the premium cost at the end of the year. If traffic tragedy befalls you, your insurance company will pay out the value of the car minus a $1,000 deductible, so at the end of the year you will be out $2,600: the premium plus the $1,000 deductible. Which option is better in terms of the EMV? a. Buying collision insurance. This is not the best answer. Think about what a "better" outcome is in this situation: a higher EMV or a lower EMV?
b. Not buying collision insurance. This is the best answer.
At an expected loss of just $2.50, the EMV of not buying collision insurance is much better than the EMV of buying the insurance. Essentially, if you buy the collision insurance, you are paying $1,600 to avoid an expected loss of $2.50! We shouldn't be surprised — insurance companies are not charities, and from their perspective, selling you insurance has a very favorable EMV. Still, a majority of drivers choose collision insurance, even though it has a significantly lower EMV. Why might this be the case? Let's consider a different situation. Suppose you owned a bright and shiny miniature windup toy Bennie worth $25. Would you take out a $1.60 insurance policy on it, even if the probability were upwards of 50% that it would be stepped on or lost in the coming year? Probably not. A $25 loss is not even a minor catastrophe. If you lost or accidentally destroyed your miniature Bennie, you'd either replace it quickly, or accept its departure without fuss. If instead of a nice, new, $25,000 Bennie you owned a 20-year-old beat-up and rusty Oldsmobile Delta-88, worth about $2,500, you'd also think twice about paying $160 to insure it against collision. A loss of $2,500 — though stinging — is unlikely to be a major setback, especially when weighed against a $160 premium.
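For reference, a quick sketch (ours) of the two expected values behind the $2.50 figure quoted above:

    # Collision insurance decision: expected end-of-year losses in dollars
    p_totaled = 0.0001
    emv_no_insurance = p_totaled * (-25_000)                            # -2.50
    emv_insurance = (1 - p_totaled) * (-1_600) + p_totaled * (-2_600)   # about -1,600.10
    print(emv_no_insurance, round(emv_insurance, 2))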
But, imagine that you — assisted by your MBA degree — amass a multi-million dollar fortune over the next decades. Would you still insure a $25,000 car against collision? Perhaps not. Clearly, the value of the potential loss relative to your net worth has an important influence on your willingness to take on the risk associated with uncertainty. Suppose you are invited to play a gambling game in which you could win $10 million with 50% probability or lose $2 million with 50% probability. Even though the EMV of playing this game is $4 million, you'd probably decline participation unless you can afford a $2 million loss. For most of us, the pain of the $2 million loss would weigh more heavily than the joy of winning $10 million, so we would not accept this gamble. For someone who can afford a $2 million loss, this game might be very profitable, especially if played multiple times. There are formal ways to measure such assessments by assigning a personal utility value to each outcome. Then we can show that maintaining the status quo — our current assets — provides greater utility than playing the game does. Rigorous methods of quantifying utility are beyond this course's scope. For now, let's look at how we might gain insight into the personal utility we associate with different monetary outcomes. Returning to collision insurance, let's summarize the scenarios and their probabilities and outcome values. You have two options: "Buy collision insurance" and "Don't buy collision insurance." If you insure against collision, there are two possible outcomes: either you avoid a wreck for one year and lose the insurance premium of $1,600, or misfortune strikes and your Bennie is totaled. In the latter case, you lose the premium and the deductible. The probabilities associated with these two outcomes are 99.99% and 0.01%, respectively. The possible outcomes and their respective probabilities are listed in the table below. If you don't insure against collision and park your car safely at year's end, you won't sustain any loss at all. If the unfortunate alternative occurs and your Bennie is totaled, you'll have lost its value of $25,000. Again, the probabilities of these outcomes are 99.99% and 0.01%, respectively. We now have two tables that completely summarize the possible scenarios of our decision. The monetary values of these scenarios are arranged from top to bottom, from the best outcomes to the worst. These tables — one for each option — are together referred to as the risk profiles for the decision. From the risk profiles, it is easy to recognize that the option of not buying insurance will deliver the most preferred outcome if you don't have an accident but will deliver the least desirable outcome if you do. This insight provides the basis for you to prefer to buy the collision insurance: it exposes you to lower risk, even though it has a lower EMV. Summary The EMV criterion is not the only criterion that informs decisions. Besides wanting to maximize our average outcomes in the long run, we want to minimize our exposure to risk, especially when potential losses are high relative to our own net worth. Risk Attitudes Recall Seth Chaplin and S&C Films' production of the movie Cloven. Seth has a choice between two options. In a deal with Pony Pictures, S&C produces the film, and Pony acquires the complete rights to Cloven for a flat production fee. In a second deal with K2 Classics, S&C retains part ownership, and the financial outcomes for S&C depend on Cloven's performance at the box office.
Seth knows from previous analysis that the K2 deal has a higher EMV: $2.04 million vs. $1.48 million for the Pony deal. However, he also knows the K2 deal is riskier. Before making a final decision, Seth wants to understand the full implications of his two options. Let's build risk profiles for each deal.
There are seven possible scenarios that can occur. The outcome values of these scenarios depend on Seth's choice of a production partner, on whether or not superstar actor Shawn Connelly lends his gravelly baritone to the film's lead character, and on the audience's reception of the film. Let's summarize the possible scenarios. If Seth chooses the Pony option, two scenarios could occur. In one, Connelly participates in the movie and Seth's profits are $2.2 million. In the other, Connelly does not participate in the movie and Seth's profits are only $1 million. If Seth opts for the Pony deal, the probabilities of these two scenarios occurring are 40% and 60%, respectively. If Seth takes the K2 option, there are five possible scenarios, but only three possible outcomes: "Blockbuster" success, a "Lackluster" performance, and a "Flop. A "Blockbuster" success can occur whether or not Connelly participates in the production. The probability of a "Blockbuster" is 38%: this probability is calculated by adding the joint probability that Cloven stars Connelly and is a "Blockbuster" to the joint probability that Cloven is a "Blockbuster" without Connelly's participation. A "Lackluster" performance can also occur whether or not Connelly participates in the production. The probability of a "Lackluster" performance is 50%: this probability is calculated by adding the joint probability that Cloven stars Connelly and has a "Lackluster" performance to the joint probability that Cloven has a "Lackluster" performance without Connelly's participation. A "Flop" occurs only when Connelly doesn't take part in the production of Cloven. The total probability of a "Flop" is the joint probability of Cloven being a "Flop" and Connelly not taking part: 12%. Seth now has complete risk profiles for the two options. The risk profiles give Seth a quick overview of the possible outcomes — and their respective probabilities — for each choice. Note that we've combined scenarios so that we now have probabilities for each distinct outcome value. How does a manager like Seth use risk profiles to inform a decision? Risk profiles allow Seth to compare options not just in terms of their EMVs but in terms of the risk they expose him to. From Seth's risk profile we can see that, although the K2 option has the higher EMV, it is also associated with greater risk: a $2 million loss is possible, whereas the Pony option doesn't include any possible loss. Which option Seth should choose ultimately depends on how comfortable he is with the risk associated with the K2 option. The Pony option is certain to deliver a profit. The K2 option might generate a substantial loss. If a $2 million loss is more than Seth believes his company can — or should — sustain, he may want to choose the Pony option despite its lower EMV. If Seth chooses the Pony option, he will be demonstrating that he is risk averse: he is choosing an option with a lower EMV in return for lower exposure to risk. Most people are at least slightly risk averse, as shown by their inclinations to buy insurance. Some people tend to be risk seeking, that is, they are willing to forgo options with higher EMVs and accept higher levels of risk in return for the possibility of extremely high returns. Although risk-seeking behavior is generally rare in the population when significant downside risk is involved, many people tend to be somewhat risk seeking when the value of the loss risked is low. For example, the EMV for playing the lottery is lower than for not playing the lottery. 
However, since the amount risked by playing the lottery is so low — the price of a lottery ticket — many people choose to pay it in return for the unlikely outcome that they will win a multi-million dollar prize. When we use the EMV as the basis for decision making, we are acting in a risk neutral manner. Most people are risk neutral when they make routine "day in, day out" types of decisions — those for which potential losses are not terribly harmful to the decision-maker or her organization.
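To make the comparison concrete, here is a minimal sketch (not part of the original tutorial) of how the two risk profiles could be encoded and their EMVs computed. The Pony payoffs and all of the probabilities come from the discussion above; the K2 "Blockbuster" and "Lackluster" payoffs are not stated in this section, so the figures below are hypothetical placeholders chosen only so that the K2 EMV matches the stated $2.04 million.

```python
# Sketch: compare two risk profiles by EMV and by downside exposure.
# Payoffs are in $ millions. K2 "Blockbuster" and "Lackluster" payoffs are
# HYPOTHETICAL placeholders; only the probabilities, the Pony payoffs, and
# the K2 "Flop" loss of $2 million are taken from the text.

def emv(profile):
    """Expected monetary value: sum of probability * payoff over all outcomes."""
    total_prob = sum(p for p, _ in profile)
    assert abs(total_prob - 1.0) < 1e-9, "probabilities in a risk profile must sum to 1"
    return sum(p * payoff for p, payoff in profile)

pony = [(0.40, 2.2),    # Connelly participates
        (0.60, 1.0)]    # Connelly does not participate

k2 = [(0.38, 5.00),     # "Blockbuster" (hypothetical payoff)
      (0.50, 0.76),     # "Lackluster"  (hypothetical payoff)
      (0.12, -2.00)]    # "Flop"        (loss stated in the text)

print(f"Pony EMV: ${emv(pony):.2f} million")                  # $1.48 million
print(f"K2   EMV: ${emv(k2):.2f} million")                    # $2.04 million
print(f"Worst case, Pony: ${min(v for _, v in pony):.2f} million")  # $1.00 million profit
print(f"Worst case, K2:   ${min(v for _, v in k2):.2f} million")    # $2.00 million loss
```

Under these assumptions the sketch reproduces the EMVs quoted above and makes the downside difference explicit: the worst Pony outcome is still a $1 million profit, while the worst K2 outcome is a $2 million loss.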
People feel comfortable using the EMV for routine decisions because they feel comfortable "playing the averages": some decisions will result in losses and some in gains, but over the long run, the outcomes will average out. We expect that, over time, choosing options with the highest EMVs will lead to the highest total value. Most people feel comfortable using EMV as the basis for decisions that are not made repeatedly — provided the decisions have outcome values similar to those of other, more routine decisions that they make on a regular basis.

Why would Pony Pictures agree to purchase the rights to Cloven and take on all the risk associated with marketing and distributing the film? For Pony, the deal with Seth is a routine decision. Pony is a large enough company to sustain a potential multi-million dollar loss if Cloven flops. Pony also distributes 10 to 12 movies a year, so the positive EMV of each film release means that, in the long run, Pony's operations will be profitable. Seth knows that the K2 deal is very risky: if he chooses the K2 deal, he will tie his company's financial success to box office performance and to Shawn Connelly's whims. If Cloven "Flops," S&C Films might well go bankrupt. But Seth likes a gamble...

Summary
Risk profiles allow us to assess the utility that different outcomes bring us, as opposed to their monetary value alone. The concise summary a risk profile provides helps us compare and contrast our decision options, allowing us to choose the option we prefer based on our attitude toward risk: risk averse, risk seeking, or risk neutral.

Solving the Market Research Problem (II)
"Although Leo's market research event doesn't pay off in terms of its expected value," Alice explains, "it could help him manage his risk." Let's assemble the risk profile for Leo's decision, including the decision about whether or not to run his market research event. We have to keep in mind that the cost of running his event is $240,000. Which of the following tables correctly summarizes the outcomes and probabilities if Leo chooses to run his market research event?
a. [table not shown] This is not the best answer. Think about the costs associated with the market research event.
b. [table not shown] This is not the best answer. Think about the costs associated with the market research event.
c. [table not shown] This is not the best answer. Think about what the probabilities of any set of mutually exclusive and collectively exhaustive events must add up to.
d. [table not shown] This is the best answer.
If Leo chooses to run his market research event, then the participants' response can be either "Enthusiastic" or "Tepid." If it's "Tepid," Leo will choose to stay out of the floating restaurant business, so there is a 60% chance that Leo will incur only the cost of the event, $240,000. If the response to the event is "Enthusiastic," then there is a 65% chance that Leo will make a profit of $1.76 million — the original profit estimate of $2 million, minus the event's cost of $240,000. The likelihood of a $1.76 million profit is (40%)(65%), or 26%. If the response to the event is "Enthusiastic," there is also a 35% chance that Leo will suffer a loss of $1.04 million — the original loss estimate of $800,000 plus the event's cost of $240,000. The likelihood of a $1.04 million loss is (40%)(35%), or 14%.

You next complete the risk profile for Leo's profits if he chooses not to run the event. You present your risk analysis to Leo to show him why he might want to run his event even if its expected monetary value is not optimal.

I see. So if I run my event, I'm most likely to incur the cost of the event, and nothing else. But, if the response is "Enthusiastic," I should go ahead with my plan. Then there is a small chance that I will incur a large loss if the restaurant becomes a mere "Fad," and a good chance that the Tethys will become a "Phenomenon." On the other hand, if I don't run my event, although my potential profits are slightly higher, the risk is too great. I just have too much to lose at this point. At the same time, I think I can afford to pay for a market research event now, without going to the bank. Then if my guests react favorably, I might consider diving into the Tethys adventure. I want you two to enjoy your last day here and have dinner with me tonight. Anywhere special you'd like to go? Hmmm. I know this wonderful place that serves the most delightful mix of French and Hawaiian cuisines... Aw, shucks.

Exercise 1: The Frivolous Lawsuit of D. Pitt
Danforth Pitt was injured in a bisque-spilling incident at a Hawaiian luxury resort restaurant. Danforth, who insists he hasn't been able to remove the scent of crab from his knees, is pursuing his legal options. After initial investigation and consultation, Pitt's lawyers explain to Danforth that he can expect to win $250,000 in damages with 5% probability or $100,000 with 25% probability, before subtracting the attorney's fees. The fees associated with pursuing legal action are estimated at $50,000, which Pitt must pay whether he wins or loses the suit. What is the EMV of pursuing legal action against the resort hotel? Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,000" as "5.00"). Round if necessary.
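As a quick check, here is a small sketch (not part of the original exercise) that computes this EMV directly from the figures in the problem statement; all amounts are in $ thousands.

```python
# Sketch: EMV of Danforth's lawsuit, in $ thousands.
# With 5% probability he wins 250, with 25% probability he wins 100, and with
# the remaining 70% he wins nothing; the 50 in attorney's fees is paid regardless.

p_win_large, v_large = 0.05, 250.0
p_win_small, v_small = 0.25, 100.0
fees = 50.0

emv_lawsuit = p_win_large * v_large + p_win_small * v_small - fees
print(f"EMV of pursuing legal action: {emv_lawsuit:.2f} ($ thousands)")  # -12.50
```

The remaining 70% of the time Danforth wins nothing but still pays the $50,000 in fees, which is what drags the EMV below zero.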
The EMV of pursuing legal action is (5%)($250,000) + (25%)($100,000) - $50,000 = -$12,500, or -12.50 in $thousands.

To avoid a legal battle, the hotel's owner agrees to settle out of court, offering Danforth two full weeks in the hotel's penthouse suite free of charge. Including amenities, the value of this offer is $12,000. If Danforth chooses to reject the offer to settle out of court, and to sue the hotel instead, he should be characterized as which of the following?
a. Risk averse. This is not the best answer. Think about why a risk-averse person chooses a lower EMV.
b. Risk neutral. This is not the best answer. Think about which criterion the risk neutral decision-maker uses to make his or her decision.
c. Risk seeking. This is the best answer.
d. Demented. This is not the best answer. Think about which risk attitudes we have defined in this unit.
Since the EMV of pursuing legal action is lower than the EMV of settling out of court, but the possible gains of legal action, though relatively unlikely, are much higher than the EMV of settling, Danforth would have to be characterized as risk seeking. Final Assessment Test I Introduction Welcome to the post-assessment test for the HBS Quantitative Methods Tutorial. Most students choose to complete the tutorial and take the final assessment test to satisfy the quantitative methods requirement. To satisfy the requirement, you must pass one of the two post-assessment tests. To pass a test, you must answer at least 65% of the questions correctly. This is an open-book multiple-choice exam. To advance from one question to the next, you must select one of the four answer choices and click the Submit button. After submitting your answer, you will not be able to change it or return to the question, so make sure you are satisfied with your selection before you submit each answer. In the briefcase, links to Excel spreadsheets containing z-value and t-value tables as well as utilities for finding confidence intervals and conducting hypothesis tests are provided for your convenience. For some questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text. Your exam results will be displayed immediately upon completion of the exam. The exam results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for the questions you answered incorrectly. After completing the exam, you can review your test results at any time by returning to this screen and clicking OK. If you haven't yet taken the test, click Final Assessment Test I on the navigation on the left to begin. Good luck! Frequently Asked Questions How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the course. Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book examination. May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question. Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with the material, but you may take longer if you need to. What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to the exam site. How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for any questions you answered incorrectly. Final Assessment Test I [Exam content not shown] Final Assessment Test II Introduction Welcome to the second post-assessment test for the HBS Quantitative Methods Tutorial. IMPORTANT: You should take this test only after taking the first final assessment test. There are only two final assessment exams. Thus, if you did not receive a passing score on the first exam, it
is critical that you review the course material carefully before taking this exam. To pass this test, you must answer at least 65% of the questions correctly. This is an open-book multiple-choice exam. To advance from one question to the next, you must select one of the four answer choices and click the Submit button. After submitting your answer, you will not be able to change it or return to the question, so make sure you are satisfied with your selection before you submit each answer. In the briefcase, links to Excel spreadsheets containing z-value and t-value tables as well as utilities for finding confidence intervals and conducting hypothesis tests are provided for your convenience. For some questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text. Your exam results will be displayed immediately upon completion of the exam. The exam results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for the questions you answered incorrectly. After completing the exam, you can review your test results at any time by returning to this screen and clicking OK. If you haven't yet taken the test, click Final Assessment Test II on the navigation on the left to begin. Good luck! Frequently Asked Questions How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the course. Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book examination. May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question. Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with the material, but you may take longer if you need to. What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to the exam site. How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The results screen will indicate which questions you answered correctly, and which area of the tutorial you should review for any questions you answered incorrectly. Final Assessment Test II [Exam content not shown]