Quantitative Methods Online Course
Short Description
This is an online course on quantitative methods used in MBA programs.
Description
Pre-Assessment Test Introduction
Welcome to the pre-assessment test for the HBS Quantitative Methods Tutorial. All questions must be answered for your exam to be scored.
Navigation: To advance from one question to the next, select one of the answer choices or, if applicable, complete the answer with your own choice, and click the "Submit" button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your selection before you submit each answer. You may also skip a question by pressing the forward advance arrow. Please note that you can return to "skipped" questions using the "Jump to unanswered question" selection menu or the navigational arrows at any time. Although you can skip a question, you must navigate back to it and answer it: all questions must be answered for the exam to be scored.
In the briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.
Your results will be displayed immediately upon completion of the exam. After completion, you can review your answers at any time by returning to the exam. Good luck!
Frequently Asked Questions
How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the course.
Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book examination.
May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.
Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with the material, but you may take longer if you need to.
What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will be recorded for the questions you were able to complete, and you will be able to pick up where you left off when you return to the exam site.
How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The results screen will indicate which questions you answered correctly.
Overview & Introduction Welcome to QM... Welcome! You are about to embark on a journey that will introduce you to the basics of quantitative and statistical analysis. This course will help you develop your skills and instincts in applying quantitative methods to formulate, analyze, and solve management decision-making problems. Click on the link labeled "The Tutorial and its Method" in the left menu to get started.
The Tutorial and its Method
QM is designed to help you develop quantitative analysis skills in business contexts. Mastering its content will help you evaluate management situations you will face not only in your studies but also as a manager. Click on the right arrow icon below to advance to the next page. This isn't a formal or comprehensive tutorial in quantitative methods. QM won't make you a statistician, but it will help you become a more effective manager. The tutorial's primary emphasis is on developing good judgment in analyzing management problems. Whether you are learning the material for the first time or are using QM to refresh your quantitative skills, you can expect the tutorial to improve your ability to formulate, analyze, and solve managerial problems. You won't be learning quantitative analysis in the typical textbook fashion. QM's interactive nature provides frequent opportunities to assess your understanding of the concepts and how to apply them — all in the context of actual management problems. You should take 15 to 20 hours to run through the whole tutorial, depending on your familiarity with the material. QM offers many features we hope you will explore, utilize, and enjoy.
The Story and its Characters Naturally, the most appropriate setting for a course on statistics is a tropical island... Somehow, "internship" is not the way you'd describe your summer plans to your friends. You're flying out to Hawaii after all, staying at a five-star hotel as a Summer Associate with Avio Consulting. This is a great learning opportunity, no doubt about it. To think that you had almost skipped over this summer internship, as you prepared to enroll in a two-year MBA program this fall. You are also excited that the firm has assigned Alice, one of its rising stars, as your mentor. It seems clear that Avio partners consider you a high-potential intern — they are willing to invest in you with the hope that you will later return after you complete your MBA program. Alice recently received the latest in a series of quick promotions at Avio. This is her first assignment as a project lead: providing consulting assistance to the Kahana, an exclusive resort hotel on the Hawaiian island of Kauai. Needless to say, one of the perks of the job is the lodging. The Kahana's brochure looks inviting — luxury suites, fine cuisine, a spa, sports activities. And above all, the pristine beach and glorious ocean. After your successful interview with Avio, Alice had given you a quick briefing on the hotel and its manager, Leo. Leo inherited the Kahana just three years ago. He has always been in the hospitality industry, but the sheer scope of the luxury hotel's operations has him slightly overwhelmed. He has asked for Avio's help to bring a more rigorous approach to his management decision-making processes.
Using the Tutorial: A Guide to Tutorial Resources Before you start packing your beach towel, read this section to learn how to use this tutorial to your greatest advantage. QM's structure and navigational tools are easy to master. If you're reading this text, you must have clicked on the link labeled "Using the Tutorial" on the left. These navigation links open interactive clips (like this one) here. There are three types of interactive clips: Kahana Clips, Explanatory Clips, and Exercise Clips. Kahana Clips pose problems that arise in the context of your consulting engagement at the Kahana. Typically, one clip will have Leo assign you and Alice a specific task. In a later Kahana Clip you will analyze the problem, and you and Alice will present your results to Leo for his consideration. The Kahana clips will give you exposure to the types of business problems that benefit from the analytical methods you'll be learning, and a context for practicing the methods and interpreting their results. To fully benefit from the tutorial, you should solve all of Leo's problems. At the end of the tutorial, a multiple-choice assessment exam will evaluate your understanding of the material. In Explanatory Clips, you will learn everything needed to analyze management problems like Leo's.
Complementing the text are graphs, illustrations, and animations that will help you understand the material. Keep on your toes: you'll be asked questions even in Explanatory Clips that you should answer to check your understanding of the concepts. Some explanatory clips give you directions or tips on how to use the analytical and computational features of Microsoft Excel. Facility with the necessary Excel functions will be critical to solving the management decision problems in this course. QM is supplemented with spreadsheets of data relating to the examples and problems presented. When you see a Briefcase link in a clip, we strongly encourage you to click on the link to access the data. Then, practice using the Excel functions to reproduce the graphs and analyses that appear in the clips. You will also see Data links that you should click to view summary data relating to the problem. Exercise Clips provide additional opportunities for you to test your understanding of the material. They are a resource that you can use to make sure that you have mastered the important concepts in each section. Work through exercises to solidify your knowledge of the material. Challenge exercises provide opportunities to tackle somewhat more advanced problems. The challenge exercises are optional; you should not need to complete them to gain the mastery needed to pass the tutorial assessment test. The arrow buttons immediately below are used for navigation within clips. If you've made it this far, you've been using the one on the right to move forward. Use the one on the left if you want to back up a page or two. In the upper right of the QM tutorial screen are three buttons. From left to right they are links to the Help, Briefcase, and Glossary. To access additional Help features, click on the Help icon. In your Briefcase you'll find all the data you'll need to complete the course, neatly stored as Excel workbooks. In many of the clips there will be links to specific documents in the Briefcase, but the entire Briefcase is available at any time. In the Glossary/Index you'll find a list of helpful definitions of terms used in the course, along with brief descriptions of the Excel functions used in the course. We encourage you to use all of QM's features and resources to the fullest. They are designed to help you build an intuition for quantitative analysis that you will need as an effective and successful manager.
... and Welcome to Hawaii! The day of departure has come, and you're in flight over the Pacific Ocean. Alice graciously let you take the window seat, and you watch as the foggy West Coast recedes behind you. I've been to Hawaii before, so I'll let you have the experience of seeing the islands from the air before you set foot on them. This Leo sounds like quite a character. He's been in business all his life, involved in many ventures — some more successful than others. Apparently, he once owned and managed a gourmet spam restaurant! Spam is really popular among the islanders. Leo tried to open a second location in downtown Honolulu for the tourists, but that didn't do so well. He had to declare bankruptcy. Then, just three years ago, his aunt unexpectedly left him the Kahana. Now Leo is back in business, this time with a large operation on his hands. It sounds to me like he's the kind of manager who usually relies on gut instincts to make business decisions, and likes to take risks. I think he's hired Avio to help him make managerial decisions with, well, better judgment. He wants to learn how to approach management problems in a more sophisticated, analytical fashion. We'll be using some basic statistical tools and methods. I know you're no expert in statistics, but I'll fill you in along the way. You'll be surprised at how quickly they'll become second nature to you. I'm confident you'll be able to do quite a bit of the analytic work soon.
Leo and the Hotel Kahana Once your plane touches down in Kauai, you quickly pick up your baggage and meet your host, Leo, outside the airport. Inheriting the Kahana came as a big surprise. My aunt had run the Kahana for a long time, but I never considered that she would leave it to me. Anyway, I've been trying my best to run the Kahana the way a hotel of its quality deserves. I've had some ups and downs. Things have been fairly smooth for the past year now, but I've realized that I have to get more serious about the way I make decisions. That's where you come into the picture. I used to be quite a risk-taker. I made a lot of decisions on impulse. Now, when I think of what I have to lose, I just want to get it right. After you arrive at the Kahana, Leo personally shows you to your rooms. "I have a table reserved for the three of us at 8 in the main restaurant," Leo announces. "You just have to try our new chef's mango and brie tart."
Basics: Data Description Leo's Data Mine After your welcome dinner in the Kahana's main restaurant, Leo asks you and Alice to meet him the next morning. You wake up early enough to take a short walk on the beach before you make your way to Leo's office. Good morning! I hope you found your rooms comfortable last night and are starting to recover from your trip. Unfortunately, I don't have much time this morning. As you requested on the phone, I've assembled the most important data on the Kahana. It wasn't easy — this hasn't been the most organized hotel in the world, especially since I took over. There's just so much to keep track of. Thank you, Leo. We'll have a look at your data right away, so we can get a more detailed understanding of the Kahana and the type of data you have available for us to work with. Anything in particular that you'd like us to focus on as we peruse your files? Yes. There are two things in particular that have been on my mind recently. For one, we offer some recreational activities here at the Kahana, including a scuba diving certification course. I contract out the operations to a local diving school. The contract is up soon, and I need to renew it, hire another school, or discontinue offering scuba lessons altogether. I'd like you to get me some quotes from other diving schools on the island so I get an idea of the competition's pricing and how it compares to the school I've been using. I'm also very concerned about hotel occupancy rates. As you might imagine, the Kahana's occupancy fluctuates during the year, and I'd like to know how, when, and why. I'd love to have a better feeling for how many guests I can expect in a given month. These files contain some information about tourism on the island, but I'd really like you to help me make better sense of it. Somehow I feel that if I could understand the patterns in the data, I could better predict my own occupancy rates. That's what we're here to do. We'll take a look at your files to get better acquainted with the Kahana, and then focus on diving school prices and occupancy patterns. Thanks, or as we say in Hawaiian, Mahalo. By the way, we're not too formal here on Hawaii. As you probably noticed, your suite, Alice, includes a room that has been set up as an office. But feel free to take your work down to the beach or by the pool whenever you like. Thanks! We'll certainly take advantage of that. Later, under a parasol at the beach, you pore over Leo's folders. Feeling a bit overwhelmed, you find yourself staring out to sea. Alice tells you not to worry: "We have a number of strategies we can use to compile a mountain of data like this into concise and useful information. But no matter what data you are working with, always make sure you really understand the data before doing a lot of analysis or making managerial decisions."
What is Alice getting at when she tells you to "understand the data?" And how can you develop such an understanding?
Describing and Summarizing Data Data can be represented by graphs like histograms. These visual displays allow you to quickly recognize patterns in the distribution of data.
Working with Data Information overload. Inventory costs. Payroll. Production volume. Asset utilization. What's a manager to do? The data we encounter each day have valuable information buried within them. As managers, correctly analyzing financial, production, or marketing data can greatly improve the quality of the decisions we make. Analyzing data can be revealing, but challenging. As managers, we want to extract as much of the relevant information and insight as possible from the data we have available. When we acquire a set of data, we should begin by asking some important questions: Where do the data come from? How were they collected? How can we help the data tell their story? Suppose a friend claims to have measured the heights of everyone in a building. She reports that the average height was three and a half feet. We might be surprised... ... until we learn that the building is an elementary school. We'd also want to know if our friend used a proper measuring stick. Finally, we'd want to be sure we knew how she measured height: with or without shoes. Before starting any type of formal data analysis, we should try to get a preliminary sense of the data. For example, we might first try to detect any patterns, trends, or relationships that exist in the data. We might start by grouping the data into logical categories. Grouping data can help us identify patterns within a single category or across different categories. But how do we do this? And is this often time-consuming process worth it? Accountants think so. Balance Sheets and Profit and Loss Statements arrange information to make it easier to comprehend. In addition, accountants separate costs into categories such as capital investments, labor costs, and rent. We might ask: Are operating expenses increasing or decreasing? Do office space costs vary much from year to year? Comparing data across different years or different categories can give us further insight. Are selling costs growing more rapidly than sales? Which division has the highest inventory turns?
Histograms In addition to grouping data, we often graph them to better visualize any patterns in the data. Seeing data displayed graphically can significantly deepen our understanding of a data set and the situation it describes. To see the value a graphical approach can add, let's look at worldwide consumption of oil and gas in 2000. What questions might we want to answer with the energy data? Which country is the largest consumer? How much energy do most countries use? In order to create a graph that provides good visual insight into these questions, we might sort the countries by their level of energy consumption, then group together countries whose consumption falls in the same range — e.g., the countries that use 100 to 199 million tonnes per year, or 200 to 299 million tonnes. We can find the number of countries in each range, and then create a bar graph in which the height of each bar represents the number of countries in each range. This graph is called a histogram. A histogram shows us where the data tend to cluster. What are the most common values? The least common? For example, we see that most countries consume less than 100 million tonnes per year, and the vast majority less than 200 million tonnes. Only three countries, Japan, Russia, and the US, consume more than 300 million tonnes per year. Why are there so many countries in the first range — the lowest consumption range? What factors might influence this? Population might be our first guess. Yet despite a large population, India's energy consumption is significantly less than that of Germany, a much smaller nation. Why might this be? Clearly other factors, like climate and the extent of industrialization, influence a country's energy usage.
Outliers In many data sets, there are occasional values that fall far from the rest of the data. For example, if we graph the age distribution of students in a college course, we might see a data point at 75 years. Data points like this one that fall far from the rest of the data are known as outliers. How do we interpret them? First, we must investigate why an outlier exists. Is it just an unusual, but valid, value? Could it be a data entry error? Was it collected in a different way than the rest of the data? At a different time? We might discover that the data point refers to a 75-year-old retiree, taking the course for fun. After making an effort to understand where an outlier comes from, we should have a deeper understanding of the situation the data represent. Then, we can think about how to handle the outlier in our analysis. Typically, we do one of three things: leave the outlier alone, or — very rarely — remove it or change it to a corrected value. A senior citizen in a college class may be an outlier, but his age represents a legitimate value in the data set. If we truly want to understand the age distribution of all students in the class, we would leave the point in. Or, if we now realize that what we really want is the age distribution of students in the course who are also enrolled in full-time degree-granting programs, we would exclude the senior citizen and all other non-degree program students enrolled in the course. Occasionally, we might change the value of an outlier. This should be done only after examining the underlying situation in great detail. For example, if we look at the inventory graph below, a data point showing 80 pairs of rollerblades in inventory would be highly unusual. Notice that the data point "80" was recorded on April 13th, and that the inventory was 10 pairs on April 12th, and 6 on April 14th. Based on our management understanding of how inventory levels rise and fall, we realize that the value of 80 is extraordinarily unlikely. We conclude that the data point was likely a data entry error. Further investigation of sales and purchasing records reveals that the actual inventory level on that day was 8, not 80. Having found a reliable value, we correct the data point. Excluding or changing data is not something we do often. We should never do it to help the data 'fit' a conclusion we want to draw. Such changes to a data set should be made on a case-by-case basis only after careful investigation of the situation.
Summary With any data set we encounter, we must find ways to allow the data to tell their story. Ordering and graphing data sets often expose patterns and trends, thus helping us to learn more about the data and the underlying situation. If data can provide insight into a situation, they can help us to make the right decisions.
Creating Histograms Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to create histograms using the Histogram tool. However, we suggest you read through the instructions to learn how Excel creates histograms so you can construct them in the future when you do have access to the Data Analysis ToolPak. To check if the ToolPak is installed on your computer, go to the Data tab in the toolbar in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins," highlight the "Analysis ToolPak" in the list, and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak. Creating a histogram with Excel involves two steps: preparing our data, and processing them with the Data Analysis Histogram tool. To prepare the data, we enter or copy the values into a single column in an Excel worksheet. Often, we have specific ranges in mind for classifying the data. We can enter these ranges, which Excel calls "bins," into a second column of data. In the toolbar, select the Data tab, and then choose Data Analysis. In the Data Analysis popup window, choose Histogram and click OK. Click on the Input Range field and enter the range of data values by either typing the range or by dragging the cursor over the range. Next, to use the bins we specified, click on the Bin Range field and enter the appropriate range. Note: if we don't specify our own bins, Excel will create its own bins, which are often quite peculiar. Click the Chart Output checkbox to indicate that we want a histogram chart to be generated in addition to the summary table, which is created by default. Click New Worksheet Ply, and enter the name you would like to give the output sheet. Finally, click OK, and the histogram with the summary table will be created in a new sheet.
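If you prefer to work outside Excel, the same kind of histogram takes only a few lines of Python with the matplotlib library. This is an illustrative sketch, not part of the course; the consumption figures below are placeholders rather than the actual energy data from the clip.

    import matplotlib.pyplot as plt

    # Placeholder data: annual oil and gas consumption (million tonnes) for several countries.
    consumption = [12, 45, 87, 150, 210, 95, 60, 310, 520, 30, 75]

    # Explicit bin edges play the same role as Excel's "bins": 0-99, 100-199, 200-299, ...
    bins = [0, 100, 200, 300, 400, 500, 600]

    plt.hist(consumption, bins=bins, edgecolor="black")
    plt.xlabel("Annual consumption (million tonnes)")
    plt.ylabel("Number of countries")
    plt.title("Worldwide energy consumption")
    plt.show()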
Central Values for Data Graphs are very useful for gaining insight into data. However, sometimes we would like to summarize the data in a concise way with a single number.
The Mean Often, we'd like to summarize a set of data with a single number. We'd like that summary value to describe the data as well as possible. But how do we do this? Which single value best represents an entire set of data? That depends on the data we're investigating and the type of questions we'd like the data to answer. What number would best describe employee satisfaction data collected from annual review questionnaires? The numerical average would probably work quite well as a single value representing employees' experiences. To calculate average — or mean — employee satisfaction, we take all the scores, sum them up, and divide the result by 11, the number of surveys. The Greek letter mu represents the mean of the data set. The mean is by far the most common measure used to describe the "center" or "central tendency" of a data set. However, it isn't always the best value to represent data. Outliers can exercise undue influence and pull the mean value towards one extreme. In addition, if the distribution has a tail that extends out to one side — a skewed distribution — the values on that side will pull the mean towards them. Here, the distribution is strongly skewed to the right: the high value of US consumption pulls the mean to a value higher than the consumption of most other countries. What other numbers can we use to find the central tendency of the data?
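For reference, the definition behind the clip's graphic is the standard formula for the mean: for n data points x_1, x_2, ..., x_n,

    \mu = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i

In the employee satisfaction example, μ is the sum of the 11 survey scores divided by 11.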
The Median Let's look at the revenues of the top 100 companies in the US. The mean revenue of these companies is about $42 billion. How should we interpret this number? How well does this average represent the revenues of these companies? When we examine the revenue distribution graphically, we see that most companies bring in less than $42 billion of revenue a year. If this is true, why is the mean so high? As our intuition might tell us, the top companies have revenues that are much higher than $42 billion. These higher revenues pull up the average considerably. In cases like income, where the data are typically very skewed, the mean often isn't the best value to represent the data. In these cases, we can use another central value called the median. The median is the middle value of a data set whose values are arranged in numerical order. Half the values are higher than the median, and half are lower. The median revenue of the top 100 US companies is $30 billion, significantly less than $42 billion. Half of all the companies earn less than $30 billion, and half earn more than $30 billion. Median revenue is a more informative revenue estimate because it is not pulled upwards by a small number of high-revenue earners. How can we find the median? With an odd number of data points, listed in order, the median is simply the middle value. For example, consider this set of 7 data points. The median is the 4th data point, $32.51. In a data set with an even number of points, we average the two middle values — here, the fourth and fifth values — and obtain a median of $41.92. When deciding whether to use a mean or median to represent the central tendency of our data, we should weigh the pros and cons of each. The mean weighs the value of every data point, but is sometimes biased by outliers or by a highly skewed distribution. By contrast, the median is not biased by outliers and is often a better value to represent skewed data.
The Mode A third statistic to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We might use the mode to represent data when knowing the average value isn't as important as knowing the most common value. In some cases, data may cluster around two or more points that occur especially frequently, giving the histogram more than one peak. A distribution that has two peaks is called a bimodal distribution.
Summary To summarize a data set using a single value, we can choose one of three values: the mean, the median, or the mode. They are often called summary statistics or descriptive statistics. All three give a sense of the "center" or "central tendency" of the data set, but we need to understand how they differ before using them.
Finding The Mean In Excel To find the mean of a data set entered in Excel, we use the AVERAGE function. We can find the mean of numerical values by entering the values in the AVERAGE function, separated by commas. In most cases, it's easier to calculate a mean for a data set by indicating the range of cell references where the data are located. Excel ignores blank values in cells, but not zeros. Therefore, we must be careful not to put a zero in the data set if it does not represent an actual data point.
Finding The Median In Excel
Excel can find the median, even if a data set is unordered, using the MEDIAN function. The easiest way to calculate a data set's median is to select a range of cell references.
Finding The Mode In Excel Excel can also find the most common value of a data set, the mode, using the MODE function. If more than one mode exists in a data set, Excel will find the one that occurs first in the data. Mean, median, and mode are fairly intuitive concepts. Already, Leo's mountain of data seems less intimidating.
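For those working outside Excel, Python's standard library computes the same three summary statistics. A minimal sketch, using made-up satisfaction scores:

    import statistics

    scores = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4, 5]  # hypothetical survey responses

    print(statistics.mean(scores))    # arithmetic mean, like Excel's AVERAGE
    print(statistics.median(scores))  # middle value, like MEDIAN (averages the two middle values when n is even)
    print(statistics.mode(scores))    # most frequent value, like MODE (first mode encountered if there are several)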
Variability The mean, median and mode give you a sense of the center of the data, but none of these indicate how far the data are spread around the center. "Two sets of data could have the same mean and median, and yet be distributed completely differently around the center value," Alice tells you. "We need a way to measure variation in the data."
The Standard Deviation It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or are the values widely dispersed? Let's look at an example. To identify good target markets, a car dealership might look at several communities and find the average income of each. Two communities — Silverhaven and Brighton — have average household incomes of $95,500 and $97,800. If the dealer wants to target households with incomes above $90,000, he should focus on Brighton, right? We need to be more careful: the mean income doesn't tell the whole story. Are most of the incomes near the mean, or is there a wide range around the average income? A market might be less attractive if fewer households have an income above the dealer's target level. Based on average income alone, Brighton might look more attractive, but let's take a closer look at the data. Despite having a lower average income, incomes in Silverhaven have less variability, and more households are in the dealer's target income range. Without understanding the variability in the data, the dealer might have chosen Brighton, which has fewer targeted homes. Clearly it would be helpful to have a simple way to communicate the level of variability in the household incomes in the two communities. Just as we have summary statistics like the mean, median, and mode to give us a sense of the 'central tendency' of a data set, we need a summary statistic that captures the level of dispersion in a set of data. The standard deviation is a common measure for describing how much variability there is in a set of data. We represent the standard deviation with the Greek letter sigma (σ). The standard deviation emerges from a formula that looks a bit complicated initially, so let's try to understand it at a conceptual level first. Then we'll build up step by step to help understand where the formula comes from. The standard deviation tells us how far the data are spread out. A large standard deviation indicates that the data are widely dispersed. A smaller standard deviation tells us that the data points are more tightly clustered together.
Calculating A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a staffing plan for Saturdays, typically a heavy-traffic day. In the hospitality industry, like many service industries, proper staffing can make the difference between unhappy guests and satisfied customers who want to return. On the other hand, overstaffing is a costly mistake. Knowing the average number of customer requests for services during a shift gives the manager an initial sense of her staffing needs; knowing the standard deviation gives her invaluable additional information about how those requests might vary across different days. The average number of customer requests is 172, but this doesn't tell us there are 172 requests every Saturday. To staff properly, the hotel manager needs a sense of whether the number of requests will typically be between 150 and 195, for example, or between 120 and 220. To calculate the standard deviation for data — in this case the hotel traffic — we perform two steps. The first is to calculate a summary statistic called the variance. Each Saturday's number of requests lies a certain distance from 172, the mean number of requests. To find the variance, we first sum the squares of these differences. Why square the differences? A hotel manager would want information about the magnitude of each difference, which can be positive, negative, or zero. If we simply summed the differences between each Saturday's requests and the mean, positive and negative differences would cancel each other out. But we are interested in the magnitude of the differences, regardless of their sign. By squaring the differences, we get only positive numbers that do not cancel each other out in a sum. The formula for variance adds up the squared differences and divides by n − 1 to get a type of "average" squared difference as a measure of variability. (The reason we divide by n − 1 rather than n to get an average here is a technicality beyond the scope of this course.) The variance in the hotel's front desk requests is 637.2. Can we use this number to express the variability of the data? Sure, but variances don't come out in the most convenient form. Because we square the differences, we end up with a value in 'squared' requests. What is a request-squared? Or a dollar-squared, if we were solving a problem involving money? We would like a way to express variability that is in the same units as the original data — front-desk requests, for example. The standard deviation — the first formula we saw — accomplishes this. The standard deviation is simply the square root of the variance. It returns our measure to our original units. The standard deviation for the hotel's Saturday desk traffic is 25.2 requests.
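Written out in standard notation (the clip's formula graphics are not reproduced in this text), the variance and standard deviation of n data points x_1, ..., x_n with mean μ are

    \sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2 \qquad \text{and} \qquad \sigma = \sqrt{\sigma^2}

For the front-desk data, σ² = 637.2 requests-squared, so σ = √637.2 ≈ 25.2 requests.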
Interpreting What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been 50 requests. With a larger standard deviation, the data would be spread farther from the mean. A higher standard deviation would translate into more difficult staffing: when request traffic is unusually high, disgruntled customers wait in long lines; when traffic is very low, desk staff are idle. For a data set, a smaller standard deviation indicates that more data points are near the mean, and that the mean is more representative of the data. The lower the standard deviation, the more stable the traffic, thereby reducing both customer dissatisfaction and staff idle time. Fortunately, we almost never have to calculate a standard deviation by hand. Spreadsheet tools like Excel make it easy for us to calculate variance and standard deviation.
Summary The standard deviation measures how much data vary about their mean value.
Finding in Excel Excel's STDEV function calculates the standard deviation. To find the standard deviation, we can enter data values into the STDEV formula, one by one, separated by commas. In most cases, however, it's much easier to select a range of cell references to calculate a standard deviation. To calculate variance, we can use Excel's VAR function in the same way.
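For reference, Python's statistics module mirrors these two functions (both use the same n − 1 divisor as STDEV and VAR). A quick sketch with placeholder request counts:

    import statistics

    requests = [140, 172, 195, 160, 188, 151, 198]  # hypothetical Saturday request counts

    print(statistics.variance(requests))  # sample variance, like Excel's VAR
    print(statistics.stdev(requests))     # sample standard deviation, like Excel's STDEV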
The Coefficient of Variation The standard deviation measures how much a data set varies from its mean. But the standard deviation only tells you so much. How can you compare the variability in different data sets?
A standard deviation describes how much the data in a single data set vary. How can we compare the variability of two data sets? Do we just compare their standard deviations? If one standard deviation is larger, can we say that data set is "more variable"? Standard deviations must be considered within the data's context. The standard deviations for the two stock indices below — the TheStreet.com Internet Index (TSC) and the Pacific Exchange Technology Index (PET) — were roughly equivalent over the period shown. But were the two indices equally variable? If the average price of an index is $200, a $20 standard deviation is relatively high (10% of the average); if the average is $700, $20 is relatively low (not quite 3% of the average). To gauge volatility, we'd certainly want to know that PET's average index price was over three and a half times higher than TSC's average index price. To get a sense of the relative magnitude of the variation in a data set, we want to compare the standard deviation of the data to the data's mean. We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which is simply the ratio of the standard deviation to the mean. It can be interpreted as the standard deviation expressed as a percent of the mean. To get a feeling for the coefficient of variation, let's compare a few data sets. Which set has the highest relative variation? Click the answer you select. Because the coefficient of variation has no units, we can use it to compare different kinds of data sets and find out which data set is most variable in this relative sense. The coefficient of variation describes the standard deviation as a fraction of the mean, giving you a standard measure of variability.
Summary The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use it to compare variation in different data sets of different scales or units.
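As a quick numerical check of the index example above, here is a sketch that computes the coefficient of variation for the two cases described earlier (a $20 standard deviation on a $200 mean versus a $700 mean):

    def coefficient_of_variation(std_dev: float, mean: float) -> float:
        """Standard deviation expressed as a fraction of the mean."""
        return std_dev / mean

    print(coefficient_of_variation(20, 200))  # 0.10 -> relatively high variability
    print(coefficient_of_variation(20, 700))  # ~0.029 -> relatively low variability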
Applying Data Analysis After a good night's sleep, you meet Alice for breakfast. "It's time to get started on Leo's assignments. Could you get those price quotes from diving schools and prepare a presentation for Leo? We'll want to present our findings as neatly and concisely as possible. Use graphs and summary statistics wherever appropriate. Meanwhile, I'll start working on Leo's hotel occupancy problem."
Pricing the Scuba Schools In addition to the school Leo is currently using, you find 20 other scuba services in the phone book. You call those 20 and get price quotes on how much they would charge the Kahana per guest for a Scuba Certification Course. Prices You create a histogram of the prices. Use the bin ranges provided in the data spreadsheet, or experiment with your own bins. If you do not have the Excel Analysis ToolPak installed, click on the Briefcase link labeled "Histogram" to see the finished histogram. Prices Histogram This distribution is skewed to the right, since a tail of higher prices extends to the right side of the histogram. The shape of the distribution suggests that: Prices
Histogram You calculate the key summary statistics. The correct values are (Mean, Median, Standard Deviation): Prices Histogram Your report looks good. This graphic is very helpful. At the moment, I'm paying $330 per guest, which is about average for the island. Clearly, I could get a cheaper deal — only 6 schools would charge a higher rate. On the other hand, maybe these more expensive schools offer a better diving experience? I wonder how satisfied my guests have been with the course offered by my current contractor...
Exercise 1: VA Linux Stock Bonanza After a company completes its initial public offering, how is the ownership of common stock distributed among individuals in the firm, often termed "named insiders"? Let's examine a company, VA Linux, that chose to sell its stock in an Initial Public Offering (IPO) during the IPO craze in the late 1990s. According to its prospectus, after the IPO, VA Linux would have the following distribution of outstanding shares of common stock owned by insiders: From the VA Linux common stock data, what could we learn by creating a histogram? (Choose the best answer)
Exercise 2: Employee Turnover Here is a histogram graphing annual turnover rates at a consulting firm. Which summary statistic better describes these data?
Exercise 3: Honidew Internship The J. B. Honidew Corporation offers a prestigious summer internship to first-year students at a local business school. The human resources department of Honidew wants to publish a brochure to advertise the position. To attract a suitable pool of applicants, the brochure should give an indication of Honidew's high academic expectations. The human resources manager calculates the mean GPA of the previous 8 interns, to include in the brochure. The mean GPA of the former interns is: Interns' GPAs In 1997, J. B. Honidew's grandson's girlfriend was awarded the internship, even though her GPA was only 3.35. In the presence of outliers or a strongly skewed data set, the median is often a better measure of the 'center'. What's the median GPA in this data set? Interns' GPAs
Exercise 4: Scuba Regulations Safety equipment typically needs to fall within very precise specifications. Such specifications apply, for example, to scuba equipment using a device called a "rebreather" to recycle oxygen from exhaled air. Recycled air must be enriched with the right amount of oxygen from the tank before delivery to the diver. With too little oxygen, the diver can become disoriented; too much, and the diver can experience oxygen poisoning. Minimizing the deviation of oxygen concentration levels from the specified level is clearly a matter of life and death! A scuba equipment-testing lab compared the oxygen concentrations of two different brands of rebreathers, A and B. Examine the data. Without doing any calculations, for which of the two rebreathers does the oxygen concentration appear to have a lower standard deviation?
Notice that data set A's extreme values are closer to the center, and more of its data points cluster near the center of the set. Even without calculations, we have a good knack for seeing which set is more variable. We can back up our observations: using the standard deviation formula or the STDEV function in Excel, we can calculate that the standard deviation of A is 0.58%, whereas that of B is 1.05%.
Exercise 5: Fluctuations in Energy Prices After decades of government control, states across the US are deregulating energy markets. In a deregulated market, electricity prices tend to spike in times of high demand. This volatility is a concern. A primary benefit to consumers in a regulated market is that prices are fairly stable. To provide a baseline measure for the volatility of prices prior to deregulation, we want to compute the standard deviation of prices during the 1990s, when electricity prices were largely regulated. From 1990 to 2000, the average national price in July of 500 kWh of electricity ranged between $45.02 and $50.55. What is the standard deviation of these eleven prices? Electricity Prices Excel makes the job much easier, because all that's required is entering the data into cells and inputting the range of cells into the =STDEV() function. The result is $2.02. On the other hand, to calculate the standard deviation by hand, use the standard deviation formula given earlier. First, calculate the mean, $48.40. Then, find the difference between each data point and the mean, and square it. The sum of these squared differences is 40.79. Divide by the number of points minus one (11 − 1 = 10 in this case) to obtain 4.08. Taking the square root of 4.08 gives us the standard deviation, $2.02.
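The same by-hand steps translate directly into code. This sketch assumes the eleven July prices have been copied from the Electricity Prices spreadsheet into the list named prices; the three values shown are placeholders, not the actual data set:

    import math

    prices = [45.02, 50.55, 48.40]  # placeholder; substitute the eleven prices from the spreadsheet

    mean = sum(prices) / len(prices)                   # step 1: the mean
    squared_diffs = [(p - mean) ** 2 for p in prices]  # step 2: squared differences from the mean
    variance = sum(squared_diffs) / (len(prices) - 1)  # step 3: divide by n - 1
    std_dev = math.sqrt(variance)                      # step 4: take the square root

    print(std_dev)  # with the actual data, this reproduces $2.02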
Exercise 6: Big Mart Personal Care Products Suppose you are a purchasing agent for a wholesale retailer, BigMart. BigMart offers several generic versions of household items, like deodorant, to consumers at a considerable discount. Every 18 months, BigMart requests bids from personal care companies to produce these generic products. After simply choosing the lowest individual bidder for years, BigMart has decided to introduce a vendor "score card" that measures multiple aspects of each vendor's performance. One of the criteria on the score card is the level of year-to-year fluctuation in the vendor's pricing. Compare the variability of prices from each supplier. Which company's prices vary the least from year to year in relation to their average price, as measured by the coefficient of variation?
Summary Pleased with your work, Alice decides to teach you more data description techniques, so you can take over a greater share of the project.
Relationships Between Variables So far, you have learned how to work with a single variable, but many managerial problems involve several factors that need to be considered simultaneously.
Two Variables We use histograms to help us answer questions about one variable. How do we start to investigate patterns and trends with two variables? Let's look at two data sets: heights and weights of athletes. What can we say about the two data sets? Is there a relationship between the two? Our intuition tells us that height and weight should be related. How can we use the data to inform that intuition? How can we let the data tell their story about the strength and nature of that relationship?
As always, one of our first steps is to try to visualize the data. Because we know that each height and weight belong to a specific athlete, we first pair the two variables, with one heightweight pair for each athlete. Plotting these data pairs on axes of height and weight — one data point for each athlete in our data set — we can see a relationship between height and weight. This type of graph is called a "scatter diagram." Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful in recognizing patterns in a relationship. The more data points we have, the more apparent the relationship becomes. In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier. We need to be careful not to draw conclusions about causality when we see these types of relationships. Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about our weights. Assuming causality in the other direction would be just plain wrong. Although we may wish otherwise, growing heavier certainly doesn't make us taller! The direction and extent of causality might be easy to understand with the height and weight example, but in business situations, these issues can be quite subtle. Managers who use data to make decisions without firm understanding of the underlying situation often make blunders that in hindsight can appear as ludicrous as assuming that gaining weight can make us taller. Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a scatter diagram, we plot for each day the number of massages purchased at a spa resort versus the total number of guests visiting the resort. We can see a relationship between the number of guests and the number of massages. The more guests that stay at the resort, the more massages purchased — to a point, where massages level off. Why does the number of massages reach a plateau? We should investigate further. Perhaps there are limited numbers of massage rooms at the spa. Scatter plots can give us insights that prompt us to ask good questions, those that deepen our understanding of the underlying context from which the data are drawn.
Variable and Time Sometimes, we are not as interested in the relationship between two variables as we are in the behavior of a single variable over time. In such cases, we can consider time as our second variable. Suppose we are planning the purchase of a large amount of highspeed computer memory from an electronics distributor. Experience tells us these components have high price volatility. Should we make the purchase now? Or wait? Assuming we have price data collected over time, we can plot a scatter diagram for memory price, in the same way we plotted height and weight. Because time is one of the variables, we call this graph a time series. Time series are extremely useful because they put data points in temporal order and show how data change over time. Have prices been steadily declining or rising? Or have prices been erratic over time? Are there seasonal patterns, with prices in some months consistently higher than in others? Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on visual analysis when looking for relationships and patterns.
False Relationships Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But human intuition isn't foolproof, and often we infer relationships where there are none. We must be careful to avoid some of these common pitfalls. Let's look at an example. For US presidents of the last 150 years, there seems to be a connection between being elected in a year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in office. Abraham Lincoln (elected 1860) was the first victim of this unfortunate relationship.
James Garfield (elected 1880), William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and John F. Kennedy (1960) all died in office as well. Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data suggest about the president elected in 2020? Probably nothing. Unless we have a reasonable theory about the connection between the two variables, the relationship is no more than an interesting coincidence.
Hidden Variables Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the relationship. We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they are instead mutually related to another underlying factor. Suppose we're examining sales of ice hockey pucks and baseballs at a sporting goods store. The sales of the two products form a relationship on a scatter plot: when puck sales slump, baseball sales jump. But are the two data sets actually related? If so, why? A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey. In spring and summer, people play baseball. If we had simply plotted puck and baseball sales without thinking further, we might not have considered the time of year at all. We could have neglected a critical variable driving the sales of both products. In many business contexts, hidden variables can complicate the investigation of a relationship between almost any two variables. A final point: keep in mind that scatter plots don't prove anything about causality. They never prove that one variable causes the other; they simply illustrate how the data behave.
Summary Plotting two variables helps us see relationships between two data sets. But even when relationships exist, we still need to be skeptical: is the relationship plausible? An apparent relationship between two variables may simply be coincidental, or may stem from a relationship each variable has with a third, often hidden variable.
Creating Scatter Diagrams To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then use Excel's built-in chart tools to plot the data. To prepare our data, we need to be sure that each data point in the first set is aligned with its corresponding value in the other set. The sets don't need to be contiguous, but it's easier if the data are aligned side by side in two columns. If the data sets are next to each other, simply select both sets. Next, from the Insert tab in the toolbar, select Scatter in the Charts bin from the Ribbon, and choose the first type: Scatter with Only Markers. Excel will insert a nonspecific scatter plot into the worksheet, with the first column of data represented on the X-axis and the second column of data on the Y-axis. We can include a chart title and label the axes by selecting Quick Layout from the Ribbon and choosing Layout 1. Then we can add the chart title and label the axes by selecting and editing the text.
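The equivalent chart outside Excel is just as quick. A sketch with invented height-weight pairs (one point per athlete):

    import matplotlib.pyplot as plt

    heights = [68, 70, 72, 74, 75, 77]        # inches (hypothetical)
    weights = [150, 165, 175, 190, 200, 215]  # pounds (hypothetical)

    plt.scatter(heights, weights)
    plt.xlabel("Height (inches)")
    plt.ylabel("Weight (pounds)")
    plt.title("Athletes' height vs. weight")
    plt.show()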
Finally, our scatter diagram is complete. You can explore more of Excel's new Chart Tools to edit and design elements of your chart.
Correlation By plotting two variables on a scatter plot, we can examine their relationship. But can we measure the strength of that relationship? Can we describe the relationship in a standardized way? Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship between two variables looks strong ... ... or weak ... ... linear ... ... or nonlinear ... ... positive (when one variable increases, the other tends to increase) ... ... or negative (when one variable increases, the other tends to decrease). Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively, we notice when data points are close to an imaginary line running through a scatter plot. Logically, the closer the data points are to that line, the more confidently we can say there is a linear relationship between the two variables. However, it is useful to have a simple measure to quantify and communicate to others what we so readily perceive visually. The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship between two variables. To describe the strength of a linear relationship, the correlation coefficient takes on values between −1 and +1. Here's a strong positive correlation (about +0.85) ... ... and here's a strong negative correlation (about −0.90). If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly −1. At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but what happens in the middle? Even when the correlation coefficient is 0, a relationship might exist — just not a linear relationship. As we've seen, scatter plots can reveal patterns and help us better understand the business context the data describe. To reinforce our understanding of how our intuition about the strength of a linear relationship between variables translates into a correlation coefficient, let's revisit the examples we analyzed visually earlier.
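For reference, the measure described here is the standard (Pearson) correlation coefficient. For n paired observations (x_i, y_i) with means \bar{x} and \bar{y},

    r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

The numerator is positive when the two variables tend to move together and negative when they move in opposite directions; the denominator scales r so that it always lies between −1 and +1.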
Influence of Outliers In some cases, the correlation coefficient may not tell the whole story. Managers want to understand the attendance patterns of their employees. For example, do workers' absence rates vary by time of year? Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature rises. After pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466. While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship — suggesting that the weather might indeed be the culprit. But look at the data — besides a few outliers, there isn't a clear relationship. Seeing the scatter plot, the manager might realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some workers homebound the previous year. Without looking at the data, the correlation coefficient can lead us down false paths. If we exclude the outliers, the relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather. Why do the outliers influence our measure of linearity so much? As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value of every data point. Just as it does with the mean, this inclusiveness can get us into trouble... Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.
Summary The correlation coefficient characterizes the strength and direction of a linear relationship between two data sets. The value of the correlation coefficient ranges between -1 and +1.
Finding in Excel Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to our data on athletes' height and weight. Enter the data set into the spreadsheet as two paired columns. We must make sure that each data point in the first set is aligned with its corresponding value in the other set. To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the CORREL function as shown below. The order in which the two data sets are selected does not matter, as long as the data "pairs" are maintained. With height and weight, both values certainly need to refer to the same person!
Occupancy and Arrivals Alice is eager to move forward: "With your new understanding of scatter diagrams and correlation, you'll be able to help me with Leo's hotel occupancy problem." In the hotel industry, one of the most important management performance measures is room occupancy rate, the percentage of available rooms occupied by guests. Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving on the island each month. On a geographically isolated location like Hawaii, visitors almost all arrive by airplane or cruise ship, so state agencies can gather very precise data on arrivals. Alice asks you to investigate the relationship between room occupancy rates and the influx of visitors, as measured by the average number of visitors arriving to Kauai per day in a given month. She wants a graphical overview of this relationship, and a measure of its strength. Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy rates in Kauai, as tracked by the Hawaii Department of Business, Economic Development, and Tourism. Kauai Data Source The best way to graphically represent the relationship between arrivals and occupancy is: Kauai Data Source You generate the scatter diagram using the data file and Excel's Chart Wizard. The relationship can be characterized as: Kauai Data Source You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. Kauai Data Source
To find the correlation coefficient, open the Kahana Data file. In any empty cell, type =CORREL(B2:B37,C2:C37). When you hit enter, the correct answer, 0.71, will appear. Kauai Data Together with Alice, you compile your findings and present them to Leo. Source I see. The relationship between the number of people arriving on Kauai and the island's hotel occupancy rate follows a general trend, but not a precise pattern. Look at this: in two months with nearly the same average number of daily arrivals, the occupancy rates were very different — 68% in one month and 82% in the other. But why should they be so different? When people arrive on the island, they have to sleep somewhere. Do more campers come to Kauai in one month, and more hotel patrons in the other? Well, that might be one explanation. There could be differences in the type of tourists arriving. The vacation preferences of the arrivals would be what we call a hidden variable. Another hidden variable might be the average length of stay. If the length of stay varies month to month, then so will hotel occupancy. When 50 arrivals check into a hotel, the occupancy rate will be higher if they spend 10 days each at the hotel than if they spend only 3 days. I'm following you, but I'm beginning to see that the occupancy issue is more complex than I expected. Let's get back to it at a later time. The scuba school contract is more pressing at the moment.
Exercise 1: The Effectiveness of Search Engines As online retailing expands, many companies are interested in knowing how effective search engines are in helping consumers find goods online. Computer scientists study the effectiveness of such search engines and compare how many results search engines recall and the precision with which they recall them. "Precision" is another way of saying that the search found its target, for example a page containing both the phrases "winter parka" and "Eddie Bauer." What could you say about the relationship between the Precision and the number of Results Recalled? Source
Exercise 2: Education and Income Is an education a good investment in your future? Some very successful business executives are college dropouts, but is there a relationship in the general population between income and education level? Consider the following scatter plot, which lists the income and years of formal education for 18 people. Is the correlation: Source Though we should always calculate the correlation coefficient if we want to have a precise measure, it's good to have a rough feel for the correlation between two variables we see plotted on a scatter diagram. For the income-education data, the coefficient is nearest to:
Sampling & Estimation Introduction: The Scuba Problem Leo asks you to help him evaluate the Kahana's contract with the scuba school. Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from their business activities. We have an excellent coral reef, and scuba diving is becoming very popular among vacationers and business travelers. We started our year-round diving program last year, contracting a local diving school to do a scuba certification course. The one-year trial contract is now up for renewal.
Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a week, and we subsidize the costs associated with each course. So I want to get a good handle on how satisfied the guests are with the lessons before I decide whether or not to renew the contract. The hotel has a database with information about which guests took scuba lessons and when. Feel free to take a look at it, but I can't spend a fortune figuring this out. And I need to know as soon as possible, since our contract expires at the end of the month. Alice convinces you to do some field research and join her for a scuba diving lesson. You return late that afternoon exhausted but exhilarated. Alice is especially enthusiastic. "Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet! "But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's clientele as a whole enjoyed the scuba certification course. After all, we may have caught the instructor on his best day this year." Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school.
Generating Random Samples Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You have to survey a few guests, and from their opinions draw conclusions about hotel guests in general. The guests you choose to survey must be representative of all of the guests who have taken the scuba course at the resort. But how can you be sure you get a good sample?
How to Create a Representative and Unbiased Sample As managers, we often need to know something about a large group of people or products. For example, how many defective parts does a large plant produce each year? What are the average annual earnings of a Wall Street investment banker? How many people in our industry plan to attend the annual conference? When it is too costly to gather the information we want to know about every person or thing in an entire group, we often ask the question of a subset, or sample, of the group. We then try to use that information to draw conclusions about the whole group. To take a sample, we first select elements from the entire group, or "population," at random. We then analyze that sample and try to infer something about the total population we're interested in. For example, we could select a sample of people in our industry, ask them if they plan to attend the annual conference, and then infer from their answers how many people in the entire industry plan to attend. For example, if 10% of the people in our sample say they will attend, we might feel quite confident saying that between 7% and 13% of our entire population will attend. This is the general structure of all the problems we'll address in this unit — we'll work out the details as we go forward. We want to know something about a population large enough to make examining every population member impractical. We first select elements from the population at random... ...then analyze that sample... ...and then draw an inference about the total population we're interested in.
Taking a Random Sample The first trick to sampling is to make sure we select a sample that broadly represents the entire group we're interested in. For example, we couldn't just ask the conference organizers if they wanted to attend. They would not be representative of the whole group — they would be biased in favor of attending the conference! To get a good sample, we must make sure we select the sample "at random" from the full population. This means that every person or thing in the population is equally likely to be selected. If there are 15,000 people in the industry, and we are choosing a sample of 1,000, then every person needs to have the same chance — 1 out of 15 — of being selected. Selecting a random sample sounds easy, but actually doing it can be quite challenging. In this section, we'll see
examples of some major mistakes people have made while trying to select a random sample, and provide some advice about how to avoid the most common types of sampling errors. In some cases, selecting a random sample can be fairly easy. If we have a complete list of each member of the group in a database, we can just assign a unique number to each member of the group. We then let a computer draw random numbers from the list. This would ensure that each element of the population has an equal likelihood of being selected. If the population about which we need to obtain information is not listed in an easytoaccess database, the task of selecting a sample at random becomes more difficult. In these cases, we have to be extremely careful not to introduce a bias in the way we select the sample. For example, if we want to know something about the opinions of an entire company, we cannot just pick employees from one department. We have to make sure that each employee has an equal chance of being included in the sample. A department as a whole might be biased in favor of one opinion.
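As a sketch of the easy case, here is how a computer can draw a simple random sample from a complete list, using the conference example's figures of 15,000 people and a sample of 1,000. The names are placeholders.

```python
# Minimal sketch: drawing a simple random sample from a complete list.
# Every member gets the same chance of selection; names are hypothetical.
import random

population = [f"member_{i}" for i in range(1, 15001)]  # 15,000 people
sample = random.sample(population, 1000)               # each has a 1-in-15 chance
print(len(sample), sample[:3])
```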
Sample Size Once we have decided how to select a sample, we have to ask how large our sample needs to be. How many members of the group do we need to study to get a good estimate about what we want to know about the entire population? The answer is: It depends on how "accurate" we want our estimate to be. We might expect that the larger the population, the larger the sample size needed to achieve a given level of accuracy, but this is not true. A sample size of 1,000 randomly selected individuals can often give a satisfactory estimation about the underlying population, as long as the sample is representative of the whole population. This is true regardless of whether the population consists of thousands of employees or millions of factory parts. Sometimes, a sample size of 100 or even 50 might be enough when we are not that concerned about the accuracy of our estimate. Other times, we might need to sample thousands to obtain the accuracy we require. Later in this unit, we will find out how to calculate a good sample size. For now, it's important to understand that the sample size depends on the level of accuracy we require, not on the size of the population.
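One way to preview this idea: the precision of a sample mean improves with the square root of the sample size, independent of how big the population is. The sketch below uses an assumed sample standard deviation purely for illustration; the underlying formula is developed later in this unit.

```python
# Sketch: the precision of a sample mean grows with the square root of the
# sample size, not with the size of the population. Figures are illustrative.
import math

s = 1.5  # an assumed sample standard deviation
for n in (50, 100, 1000, 4000):
    print(n, round(s / math.sqrt(n), 3))  # the "standard error" shrinks as n grows
```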
Learning about a Sample Once we select our sample, we need to make sure we obtain accurate information about each member of the sample. For example, if we want to learn about the number of defects a plant produces, we must carefully measure each item in the sample. When we want to learn something about a group of people and don't have any existing data, we often use a survey to learn about an issue of interest. Conducting a survey raises problems that can be surprisingly tricky to resolve. First, how do we phrase our questions? Is there a bias in any questions that might lead participants to answer them in a certain way? Are any questions worded ambiguously? If some of the people in the sample interpret a question one way, and others interpret it differently, our results will be meaningless! Second, how do we best conduct the survey? Should we send the survey in the mail, or conduct it over the phone? Should we interview survey participants in person, or distribute handouts at a meeting? There are advantages and disadvantages to all methods. A survey sent through the mail may be relatively inexpensive, but might have a very low response rate. This is a major problem if those who respond have a different opinion than those who don't respond. After all, the sample is meant to learn about the entire population, not just those with strong opinions! Creating a telephone survey creates other issues: When do we call people? Who is home during regular business hours? Most likely not working professionals. On the other hand, if we call household numbers in the evening the "happy hour crowd" might not be available. When we decide to conduct a survey in person, we have to consider whether the presence of the person asking the questions might influence the survey results. Are the survey participants likely to conceal certain information out of embarrassment? Are they likely to exaggerate? Clearly, every survey will have different issues that we need to confront before going into the field to collect the data.
Response Rates With any type of survey, we must pay close attention to the response rate. We have to be sure that those who respond to the survey answer questions in much the same way as those who don't respond would answer them. Otherwise, we will have a biased view of what the whole population thinks. Surveys with low response rates are particularly susceptible to bias. If we get a low response rate, we must try to follow up with the people who did not respond the first time. We either need to increase the response rate by getting answers from those who originally did not respond, or we must demonstrate that the nonrespondents' opinions do not differ from those of the respondents on the issue of interest. Tracking down everyone in a sample and getting their response can be costly and time-consuming. When our resources are limited, it is often better to take a small sample and relentlessly pursue a high response rate than to take a larger sample and settle for a low response rate.
Summary Often it makes sense to infer facts about a large population from a smaller sample. To make sound inferences: select the sample at random, so it represents the population and is free of bias; choose a sample size that matches the accuracy you require; collect accurate data, phrasing any survey questions neutrally and unambiguously; and pursue a high response rate, following up with those who do not respond at first.
Classic Sampling Mistakes To understand the importance of representative samples, let's go back in history and look at some mistakes made in the Literary Digest poll of 1936. The Literary Digest, a popular magazine in the 1930s, had correctly predicted the outcome of U.S. presidential elections from 1916 to 1932. When the results of the 1936 poll were announced, the public paid attention. Who would become the next president? Newscaster: "Once again, the Literary Digest sent out a survey to the American public, asking, "Whom will you vote for in this year's presidential election?" This may well be the largest poll in American history." Newscaster: "The Digest sent the survey to over 10 million Americans and over two million responded!" Newscaster: "And the survey results predict: Alf Landon will beat Franklin D. Roosevelt by a large margin and become President of the United States." As it turned out, Alf Landon did not become President of the United States. Instead, Franklin D. Roosevelt was re-elected to a second term in office in the largest landslide victory recorded to that date. This was a devastating blow to the Digest's reputation. What went wrong? How could such a large survey be so far off the mark? The Literary Digest made two mistakes that led it to predict the wrong election outcome. First, it mailed the survey to people on three different lists: the magazine's subscribers, car owners, and people listed in telephone directories. What was wrong with choosing a sample from these lists? The sample was not representative of the American public. Most lower-income people did not subscribe to the Digest and did not own phones or cars back in 1936. This led the poll to be biased towards higher-income households and greatly distorted the poll's results. Lower-income households were more likely to vote for the Democrat, Roosevelt, but they were not included in the poll. Second, the magazine relied on people to voluntarily send their responses back to the magazine. Out of the ten million voters who were sent a poll, over two million responded. Two million is a huge number of people. What was wrong with this survey? The mistake was simple: Republicans, who wanted political change, felt more strongly about the election than Democrats. Democrats, who were generally happy with Roosevelt's policies, were less interested in returning the survey. Among those who received the survey, a disproportionate number of Republicans responded, and the results became even more biased. The Digest had put an unprecedented effort into the poll and had staked its reputation on predicting the outcome of the election. Its reputation wounded, the Digest went out of business soon thereafter. During the same election year, a little-known psychologist named George Gallup correctly predicted what the Digest missed: Roosevelt's victory. What did Gallup do that the Literary Digest did not? Did he create an even bigger sample?
Surprisingly, George Gallup used a much smaller sample. He knew that large samples were no guarantee of accurate results if they weren't randomly selected from the population. Gallup's team interviewed only 3,000 people, but made sure that the people they selected were truly representative of the US population. He also instructed his team to be persistent in asking the opinion of each person in the sample, which generated a high response rate. Gallup's correct prediction of the 1936 election winner boosted his reputation, and Gallup's method of polling soon became a standard for public opinion polls. Today's polls usually consist of a sample of around a thousand randomly selected people who are truly representative of the underlying populations. For example, look at a poll reported in a leading newspaper: the sample size will likely be around a thousand. Another common survey mistake is phrasing the questions in a way that leads to a biased response. Let's take a look at a recent example of a biased question. In 1992, Ross Perot, an independent contender for the US Presidential election, conducted a mail-in survey to show that the public supported his desire to abolish special interest groups. This is the question he asked: Source In Perot's mail-in survey, 99 percent of respondents said "yes" to that question. It seemed as if everyone in America agreed with Perot's stance. Source Soon after Perot's survey, Yankelovich Partners, an independent market research firm, conducted two interesting follow-up surveys. In the first survey, it used the same question that Perot asked and found that 80 percent of the population favored passing the law. YP attributed the difference to the fact that it was able to create a more representative sample than Perot. Source Interestingly, Yankelovich then conducted a similar survey, but rephrased the question in the following way: Source The response to this question was strikingly different. Only 40 percent of the sampled population agreed to prohibit contributions. As it turned out, the results of the survey all came down to the way the question was phrased. Source For any survey we conduct, it's critical to phrase the question in the most neutral way possible to avoid bias in the sample results. Source The real lesson of these two examples is this: How sample data are collected is at least as important as how they are analyzed. A sample that is unrepresentative, biased, or not drawn at random can give highly misleading results. Knowing that sample data need to be representative and unbiased, you conduct a survey of the hotel guests.
Solving the Scuba Problem (Part I) How can you best determine if hotel guests are enjoying the scuba course? By searching the hotel database, you determine that 2,804 hotel guests took scuba trips in the past year. The scuba certification course was offered year-round. The database includes each guest's name, address, phone number, age, date of arrival, length of stay, and room number. Your first step is deciding what type of survey to conduct that will be inexpensive, quick, and will provide a good sample of all the guests who took scuba lessons. Should you mail a survey to the whole list of guests who took scuba lessons, expecting that a small percentage will respond, or conduct a telephone survey, which would likely provide a higher response rate, but cost more per guest
contacted? To ensure a good response rate — and because Leo wants an answer quickly — you choose to contact customers by phone. Alice warns that to keep costs low, you can only contact 50 hotel guests, and reminds you to create a random, representative sample. You open up the list of names in the hotel database. The names were entered as guests arrived. To make things simple, you randomly select a date and then record the first 50 guests arriving after that date who took the course. You ask the hotel operator to call them for you, and tell him to be persistent. Eventually he is able to contact 45 of the guests on the list. He asks the guests to rate their scuba experience on a 1 to 6 scale and reports the results back to you. Click the link below to view your sample. Enter the average satisfaction level as a decimal number with one digit to the right of the decimal point (e.g., enter "5" as "5.0"). Round if necessary. Hotel Database You compute the average satisfaction level and find that it is 2.5. You give Leo the news. He explodes. Two point five! That's impossible! I know for sure that it must be higher than that! You'd better go over your data again. Back in your room, you look over your list of data. What should you tell Leo? What factor is biasing your results? When you report this news to Leo, he begins to laugh. We were hit with a hurricane at the beginning of April. Half the scuba classes were cancelled, and the ones that did meet had to deal with choppy water and bad visibility. Even the weeks following the hurricane were bad. Usually guests see a manta ray every week, and the guests in April could barely see the underwater coral. No wonder they weren't happy. You assure Leo you will conduct the survey again with a more representative sample. This time, you make sure that the guests are truly randomly selected. Later, you have new data in your hands from 45 randomly chosen guests that show the average satisfaction rate to be 4.4 on a 1 to 6 scale. The standard deviation of the sample is 1.54.
Exercise 1: The Bell Computer Problem Mr. Gavin Collins is the Chief Operating Officer of Bell Computers, a market leader in personal computers. This morning, he opened the latest issue of Business 4.0, a business journal, and noticed an article on Bell Computers. The article praised the high quality and low cost of the PCs made by Bell. However, it also included some negative comments about Bell's customer service. Currently, customer service is only available to customers of Bell Computers over the phone. Collins wants to understand more fully what customers think of Bell's customer service. His marketing department designs a survey that asks customers to rate Bell's customer service from 1 to 10. How should he conduct the survey?
Exercise 2: The Wave Problem "Wave" is a company that manufactures laundry detergent in several countries around the world. In India, the competition among laundry detergents is fierce. The sales per month of Wave have been constant for the past five years. Wave CEO Mr. Sharma instructed his marketing team to come up with a strong advertising campaign stressing Wave's superiority over other competitors. Wave conducted a survey in the month of June. They asked the following questions: "Have you heard of Wave?" "Do you think Wave is a good product?" "Do you notice a difference in the color of your clothes after using Wave?" Then, citing the results of their survey, Wave aired a major television campaign claiming that 75% of the population thought that Wave was a good product. You are a new associate at Madison Consulting. With your partner, Ms. Mehta, you have been asked to conduct a
study for Wave's main competitor, the Coral Reef Detergent Company, about whether Wave's claims hold water. Coral Reef wonders how the Wave results are possible, considering that Coral Reef holds over 45% of the current market share. Ms. Mehta has been going through the survey methodology, and she tells you, "This sample is clearly neither representative nor unbiased. Coral Reef can dispute Wave's claim!" What has Ms. Mehta noticed?
Challenge: The Airport You have been asked to conduct a survey to determine the percentage of flights arriving at a small airport that were filled to capacity that morning. You decide to stand outside the airport's single exit door and ask a sample of 60 passengers leaving the airport how full their flight was. Your first thought is to just ask the first 60 passengers departing the airport how full their flight was, but you quickly realize that that could be a highly biased sample. Any 60 people leaving at the same time would likely have come from only a couple of flights, and you want to get a good sense of what percent of all flights arriving that morning were filled to capacity. Thus, you decide to randomly select 60 people from all the passengers departing the building that morning. After conducting your survey, you tally the results: 10 people decline to answer, 30 people tell you that their flight was filled to capacity, and 20 people tell you that their flight was not filled to capacity. What can you conclude from your survey results so far? What is the problem with your survey? To see this, imagine that 10 planes have arrived that morning — five of which were full (having 100 passengers each) and five of which had only a single passenger on the plane. In this case, half of the planes were full. However, almost all of the passengers (500 of the total 505) departing from the airport would report (correctly!) that they had been on a full plane. Since people from a full plane are more likely to be selected, there is a systematic bias in your response. It is important, in every survey, to try to make your sample as representative as possible. In this case, your sample was not representative of the planes arriving to the airport. A better approach might be to ask the people you select what their flight number was, and then ask them how full their flight was. Make sure you have at least one passenger from every plane. Then count the responses of only one person from each flight. By including only one person per flight in your sample, you ensure that your sample is an accurate prediction of how many planes are filled to capacity. Sampling is complicated, and it is important to think through all the factors that might influence your results. In this case, the mistake is that you are trying to estimate a population of planes by sampling a population of passengers. This makes the sample unrepresentative of the underlying population. By randomly sampling the passengers rather than the flights, each flight is not equally likely to be selected, and the sample is biased.
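A short simulation shows how severe this bias is. It uses the same hypothetical numbers as the example above: five full planes of 100 passengers each and five planes carrying a single passenger.

```python
# Simulation of the airport problem: sampling passengers instead of flights
# overweights full planes. Numbers follow the example in the text.
import random

flights = [100] * 5 + [1] * 5   # 5 full planes, 5 nearly empty ones
passengers = []                 # one entry per passenger, tagged by plane size
for size in flights:
    passengers += [size] * size

picks = [random.choice(passengers) for _ in range(60)]
full_share = sum(1 for p in picks if p == 100) / 60
print(round(full_share, 2))  # close to 500/505 = 0.99, not the true 0.50
```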
The Population Mean You report the results of your survey, the sample mean, and its standard deviation to Leo.
The Scuba Problem II A sample mean of 4.4 makes more sense to me, but I'm still a bit uneasy about your survey result. After all, you've only collected 45 responses. If you'd chosen different people, they likely would have given different responses. What if — just by chance — these 45 people loved the scuba course, and no one else did? You have a good point there, Leo. Our intuition is that the average satisfaction rate for all guests isn't too far from 4.4, but at this point we're not sure exactly how far away it might be. Without more calculations, all we can say is that 4.4 is the best estimate we have. That is why... Wait a minute! This is very unsatisfying. Are you telling me that there's no way to gauge the accuracy of this survey result? If the results are a little off, that's not a problem. But you have to tell me how far off they might be. What if you're off by two whole points, and the true satisfaction of my hotel guests is 2.4, not 4.4? In that case, my decision would be
completely different. I need to know how accurately this sample reflects the opinions of all the hotel guests who went scuba diving! The sample mean is the best point estimate of the population mean, but it cannot tell you how accurately the sample reflects the population. Alice suggests giving Leo a range of values that is almost certain to contain the population mean. "We may not be able to pin down mean satisfaction precisely. But confining it to a range of likely values will provide Leo with enough information to make a sound business decision." That sounds like a good idea, but you wonder how to actually do it.
Using Confidence Intervals The sample mean is the best estimate of our population mean. However, it is only a point estimate. It does not give us a sense of how accurately the sample mean estimates the population mean. Think about it. If we know only the sample mean, what can we really say about the population mean? In the case of our scuba school, what can we say about the average satisfaction rate of all scuba-diving hotel guests? Could it be 4.3? 4.0? 4.7? 2.0? To make decisions as a manager, we need to have more than just a good point estimate. We need to have a sense of how close or far away the true population mean might be from our estimate. We can indicate the most likely values of the true population mean by creating a range, or interval, around the sample mean. If we construct it correctly, this range will very likely contain the true population mean. For example, by constructing a range, we might be able to tell Leo that we are very confident that the true average customer satisfaction for all scuba guests falls between 4.2 and 4.6. Knowing that the true average is almost certainly between 4.2 and 4.6, Leo is better equipped to make a decision than if he simply knew the estimated average of 4.4. Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the mean x-bar, the standard deviation s, and the sample size n. We also need to know how "confident" we'd like to be that the range contains the true mean of the population. For any level of "confidence", there is a value we'll call z to put into the formula. We'll learn later in this unit exactly what we mean by "confidence," and how to compute z. For now, just keep in mind that for higher levels of confidence, we'll need to put in a larger value of z. Using these numbers, we can create a range around the sample mean according to the following formula: x-bar ± z * s/√n. Before we actually use the formula, let's try to develop our intuition about the range we're creating. Where should the range be centered? How wide must the range be to make us confident that it contains the true population mean? What factors would lead us to need a wider or narrower range? Let's see how the statistics of the sample influence the location and width of the range. Let's start with the sample mean. The sample mean is our best estimate of the population mean. This suggests that the sample mean should always be the center of the range. Move the slider bar to see how the sample mean affects the range. Second, the width of the range depends on the standard deviation of the sample. When the sample standard deviation is large, we have greater uncertainty about the accuracy of the sample mean as an estimate of the population mean. Thus, we have to create a wider range to be confident that it includes the true population mean. On the other hand, if the sample standard deviation is small, we feel more confident that our sample mean is an accurate predictor of the true population mean. In this case, we can draw a narrower range. The larger the standard deviation, the wider the range must be. Move the slider bar to see how the sample standard deviation affects the range. Third, the width of the range depends on the sample size. With a very small sample, it's quite possible that one or two atypical points in the sample could throw the sample mean off considerably from the true population mean. So with a small sample, we need to create a wide range to feel comfortable that the true mean is likely to be inside it.
The larger the sample, the more certain we can be that the sample mean represents the population mean. With a large sample, even if our sample includes a few atypical points, there are likely to be many more typical points in the sample to compensate for the outliers. Thus, with a large sample, we can feel comfortable with a small range. Move the slider bar to see how the sample size influences the range. Finally, the width of the range depends on our desired level of confidence. The level of confidence states how certain we want to be that the range contains the mean of the population. The more confident we want to be that the range contains the true population mean, the wider we have to make the range. If our desired level of confidence is fairly low, we can draw a narrower range. In the language of statistics, we indicate our level of confidence by saying, for example, that we are "95% confident" that the range contains the true population mean. This means there is a 95% chance that the range contains the true population mean. Move the slider bar to see how the confidence level affects the range. These variables determine the size of the range that we want to construct. We will learn exactly how to construct this range in a later section. For now, all we have to understand is that the population mean can best be estimated by a range of values and that the range depends on three sample statistics as well as the level of confidence that we want to assign to the range.
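As a preview of the construction, here is a minimal sketch that applies the formula to the scuba survey statistics reported earlier (sample mean 4.4, standard deviation 1.54, sample size 45). The z-value of 1.96 corresponds to 95% confidence; where that value comes from is explained later in this unit.

```python
# Sketch of the range around the sample mean, using the scuba survey
# numbers from the text: x-bar = 4.4, s = 1.54, n = 45, z = 1.96 (95%).
import math

x_bar, s, n, z = 4.4, 1.54, 45, 1.96
half_width = z * s / math.sqrt(n)
print(round(x_bar - half_width, 2), round(x_bar + half_width, 2))  # about 3.95 to 4.85
```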
Summary The sample mean is our best initial estimate of the population mean. To indicate how accurate this estimate is, we construct a range around the sample mean that likely contains the population mean. The width of the range is determined by the sample size, sample standard deviation, and the level of confidence. The confidence level measures how certain we are that the range we construct contains the true population mean. Alice recommends taking a step back from sampling and learning about the normal distribution.
The Normal Distribution The normal distribution helps us create a range around a sample mean that is likely to contain the true population mean. You can use the normal distribution to turn the intuitive notion of "confidence in your estimate" into a precisely defined concept. Understanding the normal distribution will also give you deeper insight into how sampling works. The normal distribution is a probability distribution that is centered at the mean. It is shaped like a bell, and is sometimes called the "bell curve." Like any probability distribution, the normal distribution is shown on two axes: the x-axis for the variable we're studying — women's heights, for example — and the y-axis for the likelihood that different values of the variable will occur. For example, few women are very short and few are very tall. Most are in the middle somewhere, with fairly average heights. Since women of average height are so much more common, the distribution of women's heights is much higher in the center near the average, which is about 63.5 inches. As it turns out, for a probability distribution like the normal distribution, the percent of all values falling into a specific range is equal to the area under the curve over that range. For example, the percentage of all women who are between 61 and 66 inches tall is equal to the area under the curve over that range. The percentage of all women taller than 66 inches is equal to the area under the curve to the right of 66 inches. Like any probability distribution, the total area under the curve is equal to 1, or 100%, because the height of every woman is represented in the curve. Over the years, statisticians have discovered that many populations have the properties of the normal distribution. For example, IQ test scores follow a normal distribution. The weights of pennies produced by U.S. mints have been shown to follow a normal distribution.
But what is so special about this curve? First, the normal distribution's mean and median are equal. They are located exactly at the center of the distribution. Hence, the probability that a normal distribution will have a value less than the mean is 50%, and the probability that it will have a value greater than the mean is 50%. Second, the normal distribution has a unique symmetrical shape around this mean. How wide or narrow the curve is depends solely on the distribution's standard deviation. In fact, the location and width of any normal curve are completely determined by two variables: the mean and the standard deviation of the distribution. Large standard deviations make the curve very flat. Small standard deviations produce tight, tall curves with most of the values very close to the mean. How is this information useful? Regardless of how wide or narrow the curve, it always retains its bell-shaped form. Because of this unique shape, we can create a few useful "rules of thumb" for the normal distribution. For a normal distribution, about 68% (roughly two-thirds) of the probability is contained in the range reaching one standard deviation away from the mean on either side. It's easiest to see this with a standard normal curve, which has a mean of zero and a standard deviation of one. If we go two standard deviations away from the mean for a standard normal curve, we'll cover about 95% of the probability. The amazing thing about normal distributions is that these rules of thumb hold for any normal distribution, no matter what its mean or standard deviation. For example, about two-thirds of all women have heights within one standard deviation, 2.5 inches, of the average height, which is 63.5 inches. 95% of women have heights within two standard deviations (or 5 inches) of the average height. To see how these rules of thumb translate into specific women's heights, we can label the x-axis twice to show which values correspond to being one standard deviation above or below the mean, which values correspond to being two standard deviations above or below the mean, and so on. Essentially, by labeling the x-axis twice we are translating the normal curve into a standard normal curve, which is easier to work with. For women's height, the mean is 63.5 and the standard deviation is 2.5. So, one standard deviation above the mean is 63.5 + 2.5, and one standard deviation below the mean is 63.5 - 2.5. Thus, we can see that about 68% of all women have heights between 61 and 66 inches, since we know that about 68% of the probability is between -1 and +1 on a standard normal curve. Similarly, we can read the heights corresponding to two standard deviations above and below the mean to see that about 95% of all women have heights between 58.5 and 68.5 inches.
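These rules of thumb are easy to verify with any tool that computes normal probabilities. Here is a minimal check in Python using the standard normal cumulative distribution function.

```python
# Verifying the rules of thumb with the standard normal CDF (Phi):
# Phi(z) - Phi(-z) is the probability of falling within z standard deviations.
from statistics import NormalDist

phi = NormalDist(0, 1).cdf
print(round(phi(1) - phi(-1), 3))   # about 0.683 (the "two-thirds" rule)
print(round(phi(2) - phi(-2), 3))   # about 0.954 (the "95%" rule)
```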
The z-statistic The unique shape of the normal curve allows us to translate any normal distribution into a standard normal curve, as we did with women's heights simply by relabeling the x-axis. To do this more formally, we use something called the z-statistic. For a normal distribution, we usually refer to the number of standard deviations we must move away from the mean to cover a particular probability as "z", or the "z-value." For any value of z, there is a specific probability of being within z standard deviations of the mean. For example, for a z-value of 1, the probability of being within z standard deviations of the mean is about 68%, the probability of being between -1 and +1 on a standard normal curve. A good way to think about what the z-statistic can do is this analogy: if a giant tells you his house is four steps to the north, and you want to know how many steps it will take you to get there, what else do you need to know?
You would need to know how much bigger his stride is than yours. Four steps could be a really long way. The same is true of a standard deviation. To know how far you must go from the mean to cover a certain area under the curve, you have to know the standard deviation of the distribution. Using the z-statistic, we can then "standardize" the distribution, making it into a standard normal distribution with a mean of 0 and a standard deviation of 1. We are translating the real value in its original units — inches in our example — into a z-value. The z-statistic translates any value into its corresponding z-value simply by subtracting the mean and dividing by the standard deviation. Thus, for the women's height of 66 inches, the z-value, z = (66 - 63.5)/2.5, equals 1. Therefore, 66 is exactly one standard deviation above the mean. Essentially, the z-statistic allows us to measure the distance from the mean in terms of standard deviations instead of real values. It gives everyone the same size feet in statistics. We can extend the rules of thumb we've developed beyond the two cases we've looked at. For example, we may want to know the likelihood of being within 1.5 standard deviations from the mean, or within three standard deviations from the mean. Select different values of z — that is, select different numbers of standard deviations from the mean — and see how the probability changes. Be sure to try z-values of 1 and 2 to verify that our rules of thumb are on target! Sometimes we may want to go in the other direction, starting with the probability and figuring out how many standard deviations are necessary on either side of the mean to capture that probability. For example, suppose we want to know how many standard deviations we need to be from the mean to capture 95% of the probability. Our second rule of thumb tells us that when we move two standard deviations from the mean, we capture about 95% of the probability. More precisely, to capture exactly 95% of the probability, we must be within 1.96 standard deviations of the mean. This means that for a normal distribution, there is a 95% probability of falling between -1.96 and +1.96 standard deviations from the mean. Select different probabilities and see how many standard deviations we have to move away from the mean to cover that probability. We can create a table that shows which values of z correspond to each probability, or we can calculate z using a simple function in Microsoft Excel. We'll explain how to use both of these approaches in the next few clips. z-table Remember, the probabilities and the rules of thumb we've described apply ONLY to a normal distribution. Don't think you can use them for any distribution! Sometimes, probabilities are shown in other forms. If we start at the very left side of the distribution, the area underneath the curve is called the cumulative probability. For example, the probability of being less than the mean is 0.5, or 50%. This is just one example of a cumulative probability. A cumulative probability of 70% corresponds to a point that has 70% of the area under the curve to its left. There are easy ways to find cumulative probabilities using spreadsheet packages such as Microsoft Excel. You'll have opportunities to practice solving these types of problems shortly. Cumulative probabilities can be used to find the probability of any range of values.
For example, to find the percentage of all women who have heights between 63.5 and 68 inches, we would simply subtract the percent whose heights are less than 63.5 inches from the percent whose heights are less than 68 inches.
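Here is that height example carried out as a difference of two cumulative probabilities, using the mean of 63.5 inches and standard deviation of 2.5 inches from this section.

```python
# Sketch of the women's-height example: the probability of a height between
# 63.5 and 68 inches is a difference of two cumulative probabilities.
from statistics import NormalDist

heights = NormalDist(mu=63.5, sigma=2.5)
p = heights.cdf(68) - heights.cdf(63.5)
print(round(p, 3))  # about 0.464
```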
Summary The normal distribution has a unique symmetrical shape whose center and width are completely determined by its mean and its standard deviation. For every normal distribution, the probability of being within a specified number of standard deviations of the mean is the same. The distance from the mean, as measured in standard deviations, is known as the z-value. Using the properties of the normal distribution, we can calculate a probability associated with any range of values.
Using Excel's Normal Functions To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel function NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal curve with mean zero and standard deviation one. For example, to find the cumulative probability for the z-value 1, we enter the Excel function =NORMSDIST(1). The value returned, 0.84, is the area under the standard normal curve to the left of 1. This tells us that the probability of obtaining a value less than 1 for a standard normal curve is about 84%. We shouldn't be surprised that the probability of being less than 1 is 84%. Why? First, we know that the normal curve is symmetric, so there is a 50% chance of being below the mean. Next, we know that about 68% of the probability for a standard normal curve is between -1 and +1. Since the normal curve is symmetric, half of that 68% — or 34% of the probability — must lie between 0 and 1. Putting these two facts together confirms that there is an 84% chance of obtaining a value less than 1 for a standard normal curve. If we want to find the cumulative probability of a value in a general normal curve — one that does not necessarily have a mean of zero and a standard deviation of one — we have two options. One option is to first standardize the value in question to find the equivalent z-value, and then use the NORMSDIST function to find the cumulative probability for that z-value. For example, if we have a normal distribution with mean 26 and standard deviation 8, we may wish to know the probability of obtaining a value less than 24. Standardizing can be done easily by hand, but Excel also has a STANDARDIZE function. We enter the function in a cell and insert three values: the value to be standardized, and the mean and standard deviation of the normal distribution. We find that the standardized value (or z-value) of 24 for a normal curve with mean 26 and standard deviation 8 is -0.25. Now, to find the cumulative probability for the z-value -0.25, we enter the Excel function =NORMSDIST(-0.25), which tells us that the probability of a value less than -0.25 on a standard normal curve is 40%. Thus, the probability of a value less than 24 on a normal curve with mean 26 and standard deviation 8 is 40%. The second way to find a cumulative probability in a general normal curve is to use the NORMDIST function. Here, we enter the function in a cell and insert four values: the number whose cumulative probability we want to find, the mean and standard deviation of the normal distribution, and the word "TRUE." As with our previous approach, we find that the probability of obtaining a value less than 24 on a normal curve with mean 26 and standard deviation 8 is 40%. The value "TRUE" tells Excel to return a cumulative probability. If instead of "TRUE" we enter "FALSE," Excel returns the y-value of the normal curve — something we are usually not interested in. Quite often, we have a cumulative probability, and want to work backwards, translating it into a value on a normal curve. Suppose we want to find the z-value associated with the cumulative probability 95%. To translate a cumulative probability back to a z-value on the standard normal curve, we use the Excel function NORMSINV. Note once again the S, which tells us we are working with a standard normal curve. We find that the z-value associated with the cumulative probability 95% is about 1.645.
Sometimes we may want to translate a cumulative probability back to a value on a general normal curve. For example, we may want to find the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard deviation 8.
If we want to translate a cumulative probability back to a value on a general normal curve, we use the NORMINV function. NORMINV requires three values: the cumulative probability, and the mean and standard deviation of the normal distribution in question. We find that the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard deviation 8 is 39.2.
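For readers working outside Excel, Python's statistics.NormalDist provides rough equivalents of the functions above. The comments map each line to the Excel call it mirrors, and the numbers reproduce the examples from the text.

```python
# Approximate Python equivalents of the Excel functions discussed above,
# via statistics.NormalDist (values match the examples in the text).
from statistics import NormalDist

std = NormalDist(0, 1)
curve = NormalDist(mu=26, sigma=8)

print(round(std.cdf(1), 2))          # NORMSDIST(1)           -> 0.84
print(round((24 - 26) / 8, 2))       # STANDARDIZE(24,26,8)   -> -0.25
print(round(curve.cdf(24), 2))       # NORMDIST(24,26,8,TRUE) -> 0.40
print(round(std.inv_cdf(0.95), 3))   # NORMSINV(0.95)         -> 1.645
print(round(curve.inv_cdf(0.95), 1)) # NORMINV(0.95,26,8)     -> 39.2
```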
Using the z-table The previous clip shows us how to use software programs like Excel to calculate z-values and cumulative probabilities for the normal curve. Another way to find z-values and cumulative probabilities is to use a z-table. Using z-tables is a bit more cumbersome than using Excel, but it helps reinforce the concepts. Let's use the z-table to find a cumulative probability. Women's heights are distributed normally, with mean around 63.5 inches, and standard deviation 2.5 inches. What percentage of women are shorter than 65.6 inches? First, we calculate the z-value for 65.6 inches: 0.84. The cumulative probability associated with the z-value is the area under the standard normal curve to the left of the z-value. This cumulative probability is the percentage of women who are shorter than 65.6 inches. We next use the table to find the cumulative probability corresponding to a z-value of 0.84. First, we find the row by locating the z-value up to the first digit to the right of the decimal point, 0.8. Then we choose the column corresponding to the remainder of the z-value (0.84 - 0.8 = 0.04). The cumulative probability is 0.7995. About 80% of women are shorter than 65.6 inches. Finding the cumulative probability for a value less than the mean is a bit trickier. For example, we might want to know what percentage of women are shorter than 61.6 inches. We find that the z-value for a height of 61.6 inches is a negative number: -0.76. When a z-value is negative, we must first use the table to find the cumulative probability corresponding to the positive z-value, in this case +0.76. Then, since the normal curve is symmetric, we will be able to conclude that the probability of being less than the z-value -0.76 is the same as the probability of being greater than the z-value +0.76. We find the cumulative probability for +0.76 by locating the row corresponding to the z-value up to the first digit to the right of the decimal point, 0.7, and the column corresponding to the remainder of the z-value (0.76 - 0.7 = 0.06). The cumulative probability is 0.7764. Since the probability of being less than a z-value of +0.76 is 0.7764, the probability of being greater than a z-value of +0.76 is 1 - 0.7764 = 0.2236. Thus, we can conclude that the probability of being less than a z-value of -0.76 is also 0.2236. Finally, we reach our conclusion. About 22.36% of women are shorter than 61.6 inches.
Practice with Normal Curves Find the cumulative probability associated with the z-value 2. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel Find the cumulative probability associated with the z-value 2.36. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel Find the cumulative probability associated with the z-value 1. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.
z-table Excel Find the cumulative probability associated with the z-value 1.645. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel Find the cumulative probability associated with the z-value -1.645. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 115. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 80. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel For a normal curve with mean 100 and standard deviation 10, find the probability of obtaining a value greater than 80 but less than 115. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel For a normal curve with mean 80 and standard deviation 5, find the probability of obtaining a value greater than 85 but less than 95. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 45. Enter your answer in decimal notation with 3 digits to the right of the decimal, (e.g., enter "5" as "5.000"). Round if necessary. z-table Excel For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 38 but less than 45. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary.
z-table Excel Find the z-value associated with the cumulative probability of 60%. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel Find the z-value associated with the cumulative probability of 40%. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel Find the z-value associated with the cumulative probability of 2.5%. Enter your answer in decimal notation with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00"). Round if necessary. z-table Excel For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of 88%. Enter your answer as an integer (e.g., "5"). Round if necessary. z-table Excel For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of 28%. Enter your answer as an integer (e.g., "5"). Round if necessary. z-table Excel
The Central Limit Theorem How can the normal distribution help you sample Leo's hotel guests? How do the unique properties of the normal distribution help us when we use a random sample to infer something about the underlying population? After all, when we sample a population, we usually have no idea whether or not the population is normally distributed. We're typically sampling because we don't even know the mean of the population! If the normal distribution is such a great tool, when can we use it? It turns out that even if a population is not normally distributed, the properties of the normal distribution are very helpful to us in sampling. To see why, let's first learn about a wellestablished statistical fact known as the "Central Limit Theorem".
Definition
Roughly speaking, the Central Limit Theorem says that if we took many random samples from a population and plotted the means of each sample, then — assuming the samples we take are sufficiently large — the resulting plot of the sample means would look normally distributed. Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would be equal to the true mean of the population.
To repeat: no matter what type of distribution the population has — uniform, skewed, bimodal, or completely bizarre — if we took enough samples, and the samples were sufficiently large, then the means of those samples would form a normal distribution centered around the true mean of the population.

The Central Limit Theorem is one of the subtlest aspects of basic statistics. It may seem odd to be drawing a distribution of the means of many samples, but that is exactly what we are doing. We'll call this distribution the Distribution of Sample Means. (Statisticians also often call it the Sampling Distribution of the Mean.)

Let's walk through this step by step. If we have a population — any population — we can take a random sample. This sample has a mean. We can plot that mean on a graph. Then we take another sample. That sample also has a mean, which we also plot on the graph. Now, if we plot a lot of sample means in this way, they will start to form a normal distribution around the population's mean. The more samples we take, the more the graph of the sample means would look like a normal distribution. Eventually, the graph of the sample means — the Distribution of Sample Means — would form a nearly perfect replica of a normal distribution.

Now, nobody would actually take a lot of samples, calculate all of the sample means, and then construct a normal distribution with them. We're taking a lot of samples here just to let you see that graphing the means of many samples would give you a normal curve. In the real world, we take a single sample and squeeze it for all the information it's worth. But what does the Central Limit Theorem allow us to say based on that single sample?

The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More specifically, we know that the sample mean falls somewhere in a normal Distribution of Sample Means that is centered at the true population mean. The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the underlying distribution of the population we want to learn about. Since we know the Distribution of Sample Means is normally distributed and centered at the true population mean, we can completely disregard the underlying distribution of the population. As we'll see shortly, because we know so much about the normal distribution, we can use the information about the Distribution of Sample Means to draw conclusions about the likelihood of different values of the actual population mean.
Summary
The Central Limit Theorem states that for any population distribution, the means of samples from that population are distributed approximately normally. The more samples, and the larger the sample size, the closer the Distribution of Sample Means fits a normal curve. The mean of a single sample lies on this normal curve, so we can use the normal curve's special properties to extract more information from a single sample mean.
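If you would like to see this outside the tutorial's interactive illustration, the brief sketch below simulates the theorem in Python with the numpy library (our own choice of tools; the course itself works in Excel). It draws many samples of size 30 from a deliberately skewed population and shows that the average of the sample means lands very close to the true population mean; a histogram of the sample means would look approximately normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately non-normal population: exponential, highly skewed, mean ~10.
population = rng.exponential(scale=10, size=100_000)

# Draw many samples of size 30 and record each sample mean.
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print(f"population mean:      {population.mean():.2f}")
print(f"mean of sample means: {np.mean(sample_means):.2f}")  # close to the population mean
# Plotting a histogram of sample_means would show a roughly normal curve,
# even though the population itself is far from normal.
```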
Illustrating
Let's see how the Central Limit Theorem works using a graphical illustration. The three icons are marked "Uniform," "Bimodal," and "Skewed." On the next page, clicking on each of the three sections in the navigation will display a different kind of distribution. Clicking on "Uniform" will display a distribution that is uniform in shape, i.e., a distribution for which all values in a specified range are equally likely to occur. Clicking on "Bimodal" will display a distribution that has two separate areas where values are more likely to occur than elsewhere. Clicking on "Skewed" will display a distribution that is not symmetrical — values are more likely to fall above the mean than below.
Uniform
The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample
mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.
Bimodal
The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.
Skewed
The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean. Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a graph. We repeat this process several times to create our distribution. Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of the sample means. As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts. The Central Limit Theorem states that the means of sufficiently large samples are always normally distributed, a key insight that will allow you to estimate the population mean from a sample.
Confidence Intervals
Using the properties of the normal distribution and the Central Limit Theorem, you can construct a range of values that is almost certain to contain the population mean.
Estimating a Population Mean II
For a normal distribution, we know that if we select a value at random, it will be within two standard deviations of the distribution's mean 95% of the time. The Central Limit Theorem offers us two additional insights. First, we know that the means of sufficiently large samples are normally distributed, regardless of the distribution of the underlying population. Second, we know that the mean of the Distribution of Sample Means is equal to the true population mean.

Combining these facts can give us a measure of how accurately the mean of a sample estimates the population mean. Specifically, we can now conclude that if we take a sufficiently large sample — let's say at least 30 points — from a population, there is a 95% chance that the mean of that sample falls within two standard deviations of the true population mean.

Let's build this up step by step to make sure we understand the logic. First, we take a sample from a population and compute its mean. We know that the mean of that sample is a point on a normal distribution — the Distribution of Sample Means. Since the mean of our sample is a value randomly obtained from a normal distribution, there is a 95% chance that the sample mean is within two standard deviations of the mean of the distribution.
The Central Limit Theorem tells us that the mean of that distribution is the same as the true population mean. Thus, we can conclude that there is a 95% chance that the sample mean is within two standard deviations of the population mean.

We have argued that 95% of our samples will have a mean within the range shown around the true population mean. Next we'll turn this around and look at intervals around sample means, because that's exactly what a confidence interval is. Let's look at intervals around the means of two different types of samples: those whose sample means fall within the 2-standard-deviation range around the population mean (which should be the case for 95% of all samples) and those whose sample means fall outside the 2-standard-deviation range around the population mean (which should be the case for 5% of all samples).

First, let's look at a sample whose mean falls outside the 2-standard-deviation range shown around the population mean. Since this sample mean is outside the range, it must be more than 2 standard deviations away from the population mean. Since the population mean is more than 2 standard deviations away from this sample mean, an interval extending 2 standard deviations on either side of this sample mean could not contain the true population mean. We know that 5% of all samples should have sample means outside the 2-standard-deviation range around the population mean. Therefore 5% of all samples we obtain will have intervals that do not contain the population mean.

Now let's think about the remaining 95% of samples whose means do fall within the 2-standard-deviation range around the population mean. If we draw an interval extending 2 standard deviations on either side of any one of these sample means, the interval would contain the true population mean. Thus, 95% of all samples we obtain will have intervals that contain the population mean.

We've just shown how to go from any sample mean — a point estimate — to a range around the sample mean — a 95% confidence interval. We've also argued that 95% of confidence intervals obtained in this way should contain the true population mean. It's important to emphasize: we are not saying that 95% of the time our sample mean is the population mean, but we are saying that 95% of the time a range extending two standard deviations on either side of the sample mean contains the population mean.

To visualize the general concept of a confidence interval, imagine taking 20 different samples from a population and drawing a confidence interval around each. As the diagram shows, on average 95% of these intervals — or 19 out of 20 — would actually contain the population mean.

What does this insight mean for us as managers? When we set a confidence level of 95%, we are agreeing to an approach that 1 out of 20 times will give us an interval that does not contain the true population mean. If we aren't comfortable with those odds, we should raise the confidence level. If we increase the confidence level to 98%, we have only a 1 out of 50 chance of obtaining an interval that does not contain the true population mean. However, this higher confidence comes at a cost. If we keep the same sample size, then the confidence interval will widen, thereby decreasing the accuracy of our estimate. Alternatively, to keep the same interval width, we can increase our sample size.

How do we know if an interval is too wide? Typically, if we would make a different decision for different values within an interval, that interval is too wide. Let's look at an example.
To estimate the percent of people in our industry who will attend the annual conference, we might construct a confidence interval that ranges from 7% to 13%. If we would select a different conference venue if the true percentage is 7% than if it is 13%, we need to tighten our range.

Now, before we are ready to actually create our own confidence intervals, there is a technical point we need to be acquainted with. We need to know that the standard deviation of the Distribution of Sample Means is σ, the standard deviation of the underlying population, divided by the square root of n, the sample size. We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about the Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large samples to be tightly clustered around the true population mean, and thereby form a narrow distribution.
Summary
A confidence interval is an estimate for the mean of a population. It specifies a range that is likely to contain the population mean. A confidence interval is centered at the mean of a sample randomly drawn from the population under study. When we use a confidence level of 95%, we expect 95 out of 100 such equally wide intervals, each centered at its own sample mean, to contain the population mean.
Finding a Confidence Interval
You understand the theory behind a confidence interval. But how do you actually construct one? We can now translate the previous discussion into a simple method for finding a confidence interval for the mean of any population. First, we randomly select a sample of size at least 30 from the population. We then compute the mean and standard deviation of the sample. Next, we assign the sample mean as the center of the confidence interval. To find the width of the interval, we must know the level of confidence we want to assign to the interval. If we want a 95% confidence interval, the interval should extend 2 times the standard deviation of the population divided by the square root of n, the sample size, on either side of the sample mean. Since we typically don't know the standard deviation of the population, we substitute the best estimate that we do have — the standard deviation of the sample. The interval is therefore the sample mean plus or minus 2 × s/sqrt(n). If we want a level of confidence other than 95%, instead of multiplying s/sqrt(n) by 2, we multiply by the z-value corresponding to the desired level of confidence. We can use this formula to compute any confidence interval. There is one restriction: in order for it to work, the sample size has to be at least 30.
Wine Lover's Magazine
Let's walk through an example. Wine Lover's Magazine's managers have asked us to help them estimate the average age of their subscribers so they can better target potential advertisers. We tell them we plan to survey a sample of their subscribers. They say they're comfortable with our working with a sample, but emphasize that they want to be 95% confident that the range we give them contains the true average age of the full set of subscribers. We obtain survey results from 60 randomly chosen subscribers and determine that the sample has a mean of 52 and a standard deviation of 40. To find an appropriate confidence interval, we incorporate this information into the formula: 52 ± z × 40/sqrt(60). The z-value for a 95% confidence interval is about 2, or more accurately, about 1.96, so the interval extends 1.96 × 40/sqrt(60) = 10.12 on either side of the mean. This tells us that a 95% confidence interval would begin at 52 minus 10.12, or 41.88, and end at the mean plus 10.12, or 62.12. We give management the range from 41.88 to 62.12 as an estimate of the average age of its subscribers, telling them they can be 95% confident that the true population mean falls between these values. What if we want a confidence level other than 95%? We can use the sample mean, standard deviation, and size from the sample data, but how do we obtain the right z-value?
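If you would like to check this arithmetic outside of Excel, here is a minimal sketch in Python (our choice of language; the course materials themselves rely on Excel and the z-table):

```python
import math

n, mean, s = 60, 52, 40   # Wine Lover's Magazine sample
z = 1.96                  # z-value for 95% confidence
half_width = z * s / math.sqrt(n)
print(f"95% CI: {mean - half_width:.2f} to {mean + half_width:.2f}")
# 95% CI: 41.88 to 62.12
```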
Obtaining the z-value
The z-value for 95% confidence is well known to be about 2, but how do we find a z-value for a less common confidence interval? To be 98% confident that our interval contains the population mean, how do we obtain the appropriate z-value? To find the z-value for a 98% confidence level, we are essentially asking: how far to the left and right of the standard normal curve's mean do we have to go to capture 98% of the area?
Capturing 98% of the area centered at the mean of the normal curve leaves two areas at the tails, each covering 1% of the area under the curve. The z-value of the right boundary is the z-value associated with a cumulative probability of 99% — the sum of the central 98% and the 1% in the left tail. Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is essential because Excel's NORMSINV function and the z-table work with cumulative probabilities. To find the z-value associated with a cumulative probability of 99%, enter =NORMSINV(0.99) into Excel, which returns the z-value 2.33. Or, look in the z-table and find the cell that contains a cumulative probability closest to 0.9900. The z-value is 2.33, the sum of the row value 2.3 and the column value 0.03.

Try finding a z-value yourself. Find the z-value associated with a 99.5% confidence level using the appropriate normal distribution function in Excel or using the Standard Normal Table (z-table) in your briefcase. The correct z-value for a confidence level of 99.5% is:

Our first step is to convert the confidence level of 99.5% into the corresponding cumulative probability on the standard normal curve. To do this, note that to have 99.5% probability in the middle of the standard normal curve, we must exclude a total area of 100% − 99.5% = 0.5% from the curve. That area is divided into two equal parts in the distribution's tails: 0.25% in each tail. We can now see that the cumulative probability associated with a confidence level of 99.5% is 100% − 0.25% = 99.75%. Thus, the z-value for a confidence level of 99.5% is the same as the z-value of a cumulative probability of 99.75%. We find the z-value in Excel by entering =NORMSINV(0.9975), which returns the value 2.81. Alternatively, we could find the z-value in the z-table by looking up the probability 0.9975.
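For readers who prefer Python to Excel, the same conversion from a confidence level to a z-value can be sketched as follows; scipy's norm.ppf plays the role of NORMSINV here (scipy is our assumption and is not part of the course materials):

```python
from scipy.stats import norm

def z_for_confidence(level):
    # Split the excluded area (1 - level) equally between the two tails,
    # then look up the z-value for the resulting cumulative probability.
    tail = (1 - level) / 2
    return norm.ppf(1 - tail)  # same role as Excel's NORMSINV

for level in (0.95, 0.98, 0.995):
    print(f"{level:.1%} confidence -> z = {z_for_confidence(level):.2f}")
# 95.0% -> 1.96, 98.0% -> 2.33, 99.5% -> 2.81
```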
Summary
To calculate a confidence interval, we take a sample, compute its mean and standard deviation, and then build a range around the sample mean with a specified level of confidence. The confidence level indicates how confident we are that the interval we constructed contains the population mean.
Using Small Samples
We assumed in our confidence interval calculations that the sample size was at least 30. What if it isn't? What if we have only a small sample? Let's consider a different survey, one that concerns a delicate matter. The business manager of a large ocean liner, the Demiurgos, asks for our help. She wants us to find out the value of her guests' belongings. She needs this value to determine the correct insurance protection in case guest belongings disappear from their cabins, are destroyed in a fire, or sink with the ship. She has no idea how valuable her guests' belongings are, but she feels uneasy asking them for this information. She is willing to ask only 16 guests to estimate the total value of the belongings in their cabins. From this sample, we need to prepare an estimate.

With a sample size of less than 30, we cannot calculate confidence intervals in the same way as with a large sample size. A small sample increases our uncertainty about two important aspects of our estimate of the population mean. First, with a small sample, the consequences of the Central Limit Theorem are not assured, so we cannot be sure that the sample means follow a normal distribution. Second, with a small sample, we can't be sure that the sample standard deviation is a good estimate of the population standard deviation. Due to these additional uncertainties, we cannot use z-values to construct confidence intervals. Using a z-value would overstate our confidence in our estimate. Can we still create a confidence interval? Is there a way to estimate the population mean even if we have only a handful of data points? It depends: if we don't know anything about the underlying population, we cannot create a confidence interval
with fewer than 30 data points. However, if the underlying population is normally distributed — or even roughly normally distributed — we can use a confidence interval to estimate the population mean. In practice, as long as we are sure the underlying population is not highly skewed or extremely bimodal, we can construct a confidence interval, even when we have a small sample. However, we do need to modify our approach slightly.

To estimate the population mean with a small sample, we use a t-distribution, which was discovered in the early 20th century at the Guinness Brewing Company in Ireland. A t-distribution gives us t-values in much the same way as a normal distribution gives us z-values. What is the difference between the normal distribution and the t-distribution? A t-distribution looks similar to a normal distribution, but is not as tall in the center and has thicker tails, because it is more likely than the normal distribution to have values fall farther away from the mean. Therefore, the normal distribution's "rules of thumb" for 68% and 95% probabilities no longer hold. For example, we must go more than 2 standard deviations on either side of the mean to capture 95% of the probability for a t-distribution. Thus, to achieve the same level of confidence, a confidence interval based on a t-distribution will be wider than one based on a normal distribution. This reinforces our intuition: we have less certainty about our estimate with a smaller sample, so we need a wider interval to achieve a given level of confidence.

The t-distribution is also different because it varies with the sample size: for each sample size, there is a different t-value associated with a given level of confidence. The smaller the sample size n, the shorter the height and the thicker the tails of the t-distribution curve, and the farther we have to go from the mean to reach a given level of confidence. On the other hand, as the sample size increases, the shape of the t-distribution becomes more and more like the shape of a normal distribution. Once we reach a sample size of 30, the t-distribution becomes virtually identical to the z-distribution, so t-values and z-values can be used interchangeably. Incidentally, we can use the t-distribution even for sample sizes larger than 30. However, most people use the z-distribution for larger samples, partially out of habit and partially because it's easier, since the z-value doesn't vary based on the sample size.
Finding the t-value
To find the right t-value, we first have to identify the t-distribution that corresponds to our sample size. We do this by finding the number of "degrees of freedom" of the sample, which for our purposes is simply the sample size minus one. If our sample size is 16, we have 15 degrees of freedom, and so on. Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1 minus the level of confidence we want and the degrees of freedom into the Excel function TINV, Excel gives us the appropriate t-value. For example, for a 95% confidence interval and a sample size of n = 16, the Excel function =TINV(0.05,15) would return the value 2.131.

Once we find the t-value, we use it just like we used the z-value to find a confidence interval. For example, for t = 2.131, the appropriate confidence interval is the sample mean plus or minus 2.131 × s/sqrt(n).

If we don't have Excel handy, we can use a t-distribution table to find the t-value associated with the degrees of freedom and the confidence level we specify. When using different t-value tables, we need to be careful to note which probability the table depicts. Some tables report values associated with the confidence level, like 0.95. Others report values based on the area in the tails, which would be 0.05 for a 95% confidence interval. Our t-table, like many others, reports values associated with a cumulative probability, so for a 95% level of confidence, we would have to look at a cumulative
probability of 97.5%.
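The same lookup works in Python; the sketch below uses scipy's t.ppf (our assumption; the course uses Excel's TINV and the t-table):

```python
from scipy.stats import t

df = 15  # sample size 16 -> 15 degrees of freedom
# Excel's TINV(0.05, 15) is two-tailed; the equivalent cumulative
# probability here is 0.975 (95% in the middle plus 2.5% in the left tail).
print(round(t.ppf(0.975, df), 3))  # 2.131
```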
The Good Ship Demiurgos
Returning to the good ship Demiurgos, let's determine an estimate of the average value of passengers' belongings. The manager samples 16 guests, and reports that they have an average of $10,200 worth of clothing, jewelry, and personal effects in their cabins. From her survey numbers, we calculate a standard deviation of $4,800. We need to double check that the distribution isn't too skewed, which we might expect, since some of the passengers are quite wealthy. The manager explains that the insurance policy has a limited liability clause that limits a passenger's maximum claim to $20,000. Above $20,000, passengers' own homeowners' policies must cover any losses. Thus, in the survey, if a guest reported values above $20,000, the manager simply reported $20,000 as the value to be covered for our data set. We sketch a graph of the 16 values that confirms that the distribution is not too asymmetric, so we feel comfortable using the t-distribution.

Since we have a sample of 16 passengers, there are 15 degrees of freedom. The Excel function =TINV(0.05,15) tells us that the appropriate t-value is 2.131. Using the confidence interval formula, the guests' valuables are worth $10,200 plus or minus 2.131 times $4,800 over the square root of 16. Thus, the interval extends 2.131 × 4,800/4 = $2,557 on either side of the mean, and we can report that we are 95% confident that the average value of passengers' belongings is between $7,643 and $12,757.

What if the Demiurgos' manager thinks this interval is too large? She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also increases the size of the denominator (the square root of n). Both factors narrow the confidence interval. For example, if she asks 10 more guests, and the standard deviation of the sample does not change, the t-value would drop to 2.06 and the square root of n in the denominator would increase. The distance the interval extends on either side of the mean would decrease significantly, from $2,557 to $1,939.
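As a quick check on these figures, here is a small Python sketch of both calculations (again assuming the scipy library, which is not part of the course materials; tiny differences from the text come from rounding the t-value to 2.131):

```python
import math
from scipy.stats import t

mean, s, n = 10_200, 4_800, 16
half = t.ppf(0.975, n - 1) * s / math.sqrt(n)
print(f"n = 16: {mean - half:,.0f} to {mean + half:,.0f}")  # ~7,643 to 12,757

# With 10 more guests (n = 26) and the same standard deviation,
# the interval narrows noticeably:
n2 = 26
half2 = t.ppf(0.975, n2 - 1) * s / math.sqrt(n2)
print(f"n = 26: half-width {half2:,.0f}")  # ~1,939
```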
Summary
Confidence intervals can be constructed even with a sample size of less than 30, as long as the population is roughly normally distributed (or, at least, not too skewed or bimodal). To find a confidence interval with a small sample, use a t-distribution. t-distributions are a set of distributions that resemble the normal distribution, but with shorter heights near the mean and thicker tails. To find a confidence interval for a small sample size, place the appropriate t-value into the confidence interval formula.
Choosing a Sample Size
When we take a survey, we often want a specific level of accuracy in our estimate of the population mean. For
example, when estimating car owners' average spending on car repairs each year, we might want to be 95% confident that our estimate is within $50 of the true mean. We know that the sample size of our survey directly affects the accuracy of our estimate. The larger the sample size, the tighter the confidence interval and the more accurate our estimate. A sample of size n gives us a confidence interval that extends a distance of d = z × s/sqrt(n) on either side of the mean. To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of sigma, the standard deviation of spending. If we do not have an estimate based on past data or some other source, we might take a preliminary survey to obtain a rough estimate of sigma. In this example, we estimate sigma to be $300 based on past experience. Since we want a 95% level of confidence, we set z = 1.96. To ensure our desired accuracy — that d is no more than $50 — we must randomly sample at least 139 people. In general, to ensure a confidence interval extends a distance of at most d on either side of the mean, we choose a sample size n that satisfies n ≥ (z × sigma/d)². We can do this with simple algebra, or by using the attached Excel utility (Confidence Interval Utility).
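The arithmetic behind the figure of 139 can be sketched in a few lines of Python (our choice of language; the course provides an Excel utility for this calculation):

```python
import math

sigma, d, z = 300, 50, 1.96  # estimated std dev, desired half-width, 95% z-value
n = math.ceil((z * sigma / d) ** 2)
print(n)  # 139
```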
Summary When estimating a population mean, we can ensure that our confidence interval extends a distance of at most d on either side of the mean by choosing an appropriate sample size.
Step-by-Step Guide
Here is a step-by-step process for creating a confidence interval. First, we choose a level of confidence and a sample size n appropriate to the decision context. Second, we take a random sample and find the sample mean. This is our best estimate for the population mean. Third, we find the sample's standard deviation. Fourth, we find the z-value or t-value associated with the chosen confidence level. If our sample size is at least 30, we find the z-value for our confidence level. If not, we find the t-value for our confidence level, with degrees of freedom = sample size − 1. Fifth, we calculate the end points of the confidence interval as the sample mean plus or minus the z-value (or t-value) times s/sqrt(n).
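The whole procedure fits naturally into one small function. Here is a sketch in Python (assuming the scipy library; the course's own tool for this is the Excel utility described below), demonstrated with the scuba survey numbers that appear later in this section:

```python
import math
from scipy.stats import norm, t

def confidence_interval(mean, s, n, level=0.95):
    # z-based critical value for n >= 30, t-based otherwise.
    tail = (1 - level) / 2
    crit = norm.ppf(1 - tail) if n >= 30 else t.ppf(1 - tail, n - 1)
    half = crit * s / math.sqrt(n)
    return mean - half, mean + half

# Scuba survey: n = 45, sample mean 4.4, sample std dev 1.54
low, high = confidence_interval(4.4, 1.54, 45)
print(f"{low:.2f} to {high:.2f}")  # 3.95 to 4.85
```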
Summary
Construct confidence intervals using the steps outlined above. With a confidence interval derived from an unbiased random sample, we can say that the true population mean falls within the interval with the corresponding level of confidence.
Excel Utility Click here to open an Excel utility that allows you to create confidence intervals by providing the sample mean, standard deviation, size, and desired level of confidence. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the Wine Lover's Magazine and the Demiurgos examples.
Solving the Scuba Problem II
The sample you collected earlier has all the data you need to create a confidence interval for Leo's problem. You take another look at the survey you created earlier for Leo: you sampled 45 guests, and calculated that the average satisfaction rate of the sample was 4.4, with a standard deviation of 1.54. Using this information, you decide to create a 95% confidence interval for Leo. Your calculations show the following:
To create a 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence interval by going 0.45 points above and below the sample mean of 4.4, which translates into a confidence interval from 3.95 to 4.85.

You meet with Leo and tell him that you can be 95% certain that the population mean falls between 3.95 and 4.85. Leo looks at your numbers and shakes his head.

That's just not accurate enough for me to make a decision. If the mean is close to 4.85, I'd be happy, but if it's closer to 4, I'm concerned. Can we narrow the range at all?

Looking over your notes, you think you can give Leo some options.

Why don't you create a larger sample and report the results back to me?

You select another 40 guests at random and ask the hotel operator to conduct the survey for you again. He is able to reach 25 guests. You combine the two samples, which gives a new sample size of 70. For the combined sample, you find that the new sample mean is 4.5 and the new sample standard deviation is 1.2. Armed with more data, you create another confidence interval. We can be 95% certain that the average satisfaction of all hotel guests with the scuba school is between:

To create this 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence interval by going 0.28 points above and below the sample mean of 4.5, which translates into a confidence interval from 4.22 to 4.78.

Thank you. I am much happier with this result. I have enough information now to decide whether to keep the current scuba diving school.
Exercise 1: The Veetek VCR Gambit
Toshi Matsumoto is the Chief Operating Officer of a consumer electronics retailer with over 150 stores spread throughout Japan. For over a year, the sales of high-end VCRs have lagged, due to a shift toward DVD players. Just today, Toshi heard that Veetek, a large South Asian electronics retailer, is looking to purchase a bulk shipment of high-end VCRs. This would be a perfect opportunity for Toshi to liquidate slow-moving inventory currently languishing on the shelves of his stores. Before he calls Veetek, he wants to know how many high-end VCRs he can promise. After two days of furious phone calls, his deputy has gathered data from 36 representative outlets in his retail chain. The mean high-end VCR inventory in each store polled was 500 units. The standard deviation was 180. Toshi needs you to find a 95% confidence interval for the average VCR inventory per store. The interval is:
Exercise 2: Pulluscular Pig Disorder
Paul Segal manages the pig-farming division of the agricultural company BowmanLyonsCenterville. A rumored outbreak of Pulluscular Pig Disorder (PPD) in one of Paul's herds is on the verge of causing a public relations disaster. The main symptom of PPD is a shrinking brain, and the only certain way to diagnose PPD is by measuring brain size postmortem.
Paul needs to know if his herd is affected by PPD, but he does not want to have to slaughter hundreds of swine to find out. At the preliminary stage, he can offer no more than 5 prime porkers to be slaughtered and diagnosed. For the pigs slaughtered, the mean brain weight was 0.18 lbs, with a standard deviation of 0.06 lbs. With 95% confidence, in what range does the herd's average brain weight lie?
Proportions
The next morning, you and Alice are about to head off to the hotel pool when Leo calls you.
The Customer Response Problem
I'm sorry to disturb you, but I have another problem, and I think you might be able to help. The Kahana is a very popular resort during the summer tourist season. But the number of leisure visitors drops significantly during the off-season, from September through February and then April through May. We usually have quite a few room vacancies during that period of time. We expect to have about 200 rooms vacant for week-long periods during the slow season this year. I've developed a new program that rewards our best guests with a special discount if they book a week-long stay during our slow period. They won't have complete date flexibility of course, but the steep discount should make the offer attractive for them. To see how many of our past guests would accept such an offer, I sent promotional brochures to 100 of them. The deadline by which they had to respond to the offer has passed. Ten guests responded with the required room deposit prior to the deadline — that's a solid 10 percent. I figure if we send out 2,000 promotions, we'll get about 200 responses.

This is a nice idea Leo, but I'm concerned it could backfire. If more than 10% respond to this offer, you might end up disappointing some of the very guests you're trying to reward. Or, if too many respond and you give them all the discount, you'll have to turn away customers willing to pay full price.

That is exactly my concern. I wonder how accurate the 10% response rate is. Just because it held for 100 guests, will it hold for 2,000? What if 11% actually respond to the promotions? Imagine what would happen if 220 guests responded. I don't want to anger 20 loyal customers by telling them the offer is not valid, but I also don't want to turn away full-paying guests to accommodate the extra 20 guests at a discount. I'm willing to reserve 200 rooms for these discount week-long stays during the slow season. To how many return guests can I safely send the discount offer and be confident that no more than 200 will respond?

You can tell that Leo is growing quite comfortable with relying on your statistical methods. He seems almost as interested in them as he is in your results.
Confidence Intervals and Proportions
Sometimes, the question we pose to members of a sample calls for a yes or no answer. We might survey people in a target market and ask if they plan to buy a new car this year. Or survey voters and ask if they plan to vote for the incumbent candidate for office. Or we might take a sample of the products our plant produced yesterday and count how many are defective. Even though our question has only two answers, we still have to address an inherent uncertainty: we know what values our data can take — yes or no — but we don't know how often each response will be given. In these cases, we usually convey the survey results by reporting the percentage of yes responses as a proportion, p-bar. This is our best estimate of p, the true percentage of "yes" responses in the underlying population.
Suppose, for example, that we have posted advertisements in the subway cars on Boston's "Red Line," and want to know what percentage of all passengers remembers seeing our ad. We create a proper survey, and ask randomly selected Red Line passengers if they remember seeing our ad. 300 passengers respond to our survey, of which 100 passengers report remembering the ad. Then p-bar is simply 33%, which is the number of people that remember the ad, 100, divided by the number of respondents, 300. The remaining 200 passengers, or 67% of the sample, report not remembering the ad. The two proportions always add up to 1 because survey respondents report either remembering the ad or not.

Once we know the proportion of the sample, we can draw conclusions about all Red Line passengers. Our best estimate, or point estimate, for p, the percentage of all passengers who remember seeing our ad, is 33%. As managers, we typically want more than this simple point estimate — we want to know how accurate the estimate is. How far from 33% might the true percentage be? Can we say confidently that it is between 30% and 36%, for example? When we work with proportions, how do we find a confidence interval around our point estimate?

The process for creating a confidence interval around a proportion is nearly identical to the process we've used before. The only difference is that we can approximate the standard deviation of the population with a simple formula rather than calculating it directly from the raw data. Based on our sample, our best estimate of the true population proportion is p-bar, the percentage of "yes" responses in our survey. Statistical theory tells us that our best estimate of the standard deviation of the true population proportion is the square root of p-bar × (1 − p-bar). We can use this approximate standard deviation to determine a confidence interval for the proportion. For our Red Line ad, we approximate the standard deviation with the square root of 0.33 times 0.67, or 0.47. A 95% confidence interval is 0.33 plus or minus 1.96 times 0.47 divided by the square root of 300. This is equal to 0.33 plus or minus 0.053, or 27.7% to 38.3%. Unfortunately, there is one catch when we calculate confidence intervals around proportions...
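Here is the Red Line calculation as a short Python sketch (our illustration; the course does this with the z-table or the Excel utility):

```python
import math

p_bar, n, z = 0.33, 300, 1.96       # sample proportion (rounded, as in the text)
s = math.sqrt(p_bar * (1 - p_bar))  # approximate standard deviation, ~0.47
half = z * s / math.sqrt(n)
print(f"{p_bar - half:.3f} to {p_bar + half:.3f}")  # 0.277 to 0.383
```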
Sample Size
Sample size matters, particularly when dealing with very small or very large proportions. Suppose we are sampling New Yorkers for Amyotrophic Lateral Sclerosis, commonly known as Lou Gehrig's Disease. In the U.S., the odds of having the disease are less than 1 in 10,000. Would our sample be useful if we surveyed 100 people? No. We probably wouldn't find a single person with the disease in our sample. Since the true proportion is very small, we need to have a large enough sample to make sure we find at least a few people with the disease. Otherwise, we will not have enough data to get a good estimate of the true proportion.

There is a guideline we must meet to make sure that our sample is large enough when estimating proportions. Two conditions must be met: first, the product of the sample size and the proportion must be at least 5. Second, the product of the sample size and 1 minus the proportion must also be at least 5. If both these requirements are met, we can use the sample. Essentially, this guideline guarantees that our sample contains a reasonable number of "yes" and a reasonable number of "no" answers. Our sample will not be useful otherwise. To avoid an invalid sample, we need to create a large enough sample size to satisfy the requirements. However, since we don't know the proportion p-bar before sampling, we don't know if the two conditions are met before setting the sample size. How can we get around this problem?
Finding a Preliminary Estimate of p-bar
We can obtain a preliminary estimate of p-bar using either of two methods. First, we can use past experience. For example, to estimate the rate of Lou Gehrig's disease, we can research the rate of occurrence in the general population. This is a reasonable first estimate for p-bar. In many cases, however, we are sampling for the first time. Without past experience, we don't know what p-bar might be. In this case, it may well be worth our time to take a small test sample to estimate the proportion, p-bar.
For example, if the proportion of yes answers in our small test sample is 3%, then we can use 3% as our preliminary estimate of p-bar. Substituting 3% for p-bar in our two requirements, n × p-bar ≥ 5 and n × (1 − p-bar) ≥ 5, tells us that n must satisfy n × 0.03 ≥ 5 and n × 0.97 ≥ 5. Thus the sample size we need for our real sample must be at least 167. We would then use a real sample — with at least 167 respondents — to find an actual sample value of p-bar to create a confidence interval for the population proportion.
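This check is easy to script; the sketch below (a Python illustration of ours, not part of the course materials) returns the minimum sample size for any preliminary estimate of p-bar:

```python
import math

def min_sample_size(p_est):
    # Both n * p and n * (1 - p) must be at least 5.
    return math.ceil(max(5 / p_est, 5 / (1 - p_est)))

print(min_sample_size(0.03))  # 167
```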
Summary
Proportions are often used to indicate the frequency of some characteristic in a population. The sample proportion p-bar is the number of occurrences of the characteristic in the sample divided by the number of respondents, the sample size. It is our best estimate of the true proportion in the population. We can construct a confidence interval for the population proportion. Two guidelines for the sample size must be met for a valid confidence interval: n × p-bar and n × (1 − p-bar) must each be at least five.
Solving the Customer Response Problem
Creating confidence intervals around proportions is not much different from creating them around means. Finding the right number of Leo's promotional brochures to mail should be easy. Leo needs to know how accurate the 10 percent response rate of his 100-customer sample is. Will this response rate hold for 2,000 guests? To how many guests can he send the discount offer for his 200 rooms?

First, you calculate a 95% confidence interval for the response rate. Enter the lower bound as a decimal number with two digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% to 15.88%. You obtain that answer by applying the familiar formula to the sample data: 0.10 ± 1.96 × sqrt(0.10 × 0.90)/sqrt(100) = 0.10 ± 0.0588.

Then, after giving Leo's questions some thought, you recommend to him that he send the mailing to a specific number of guests. Enter the number of guests as an integer (e.g., "5"). Round if necessary.

Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the discount offer (at the 95% confidence level) is 15.88%. So, if 15.88% of people were to respond for 200 rooms, how many people should Leo send the offer to? Simply divide 200 by 0.1588 to get the answer: Leo can send the offer to at most 1,259 past customers. Leo is pleased with your work. He tells you to relax and enjoy the resort.
Exercise 1: GMW Automotive
GMW is a German auto manufacturer that has regional sales subsidiaries throughout the world. Arturo Lopez heads the Mexican sales division of the company's Latin American subsidiary. GMW earns additional profit when customers choose to finance their car purchase with a GMW financing package. Arturo has been asked to submit a report to the GMW CEO in Germany about the percentage of GMW customers who opt for financing. Arturo has asked you, a new member of the division sales team, to devise a way to estimate this percentage. You take a random sample of 64 cars sold in the Mexican sales division, and find that 13 of them, or about 20.3%, opted for GMW financing. If you want to be 95% confident in your report to Mr. Lopez, you should tell him that the percentage of all Mexican customers opting for GMW financing falls in the range:
Exercise 2: Crown Toothpaste
Kayleigh Marlon is the Chief Buyer at TarMart, a company that operates a chain of superstores selling discount merchandise. TarMart has a huge national presence, and manufacturers compete fiercely to get their products onto TarMart's shelves. Crown Toothpaste, a new entrant in the toothpaste market, is one of them. Kayleigh agreed to stock Crown for 4 weeks and display it prominently. After that period, she will stop stocking Crown unless 5% of TarMart's customers bought Crown or were considering buying Crown within the next month. The trial period is now over. Kayleigh has asked you to take a sample of customers to see if TarMart should continue stocking Crown. She would like you to be at least 95% confident in your answer.

The first step is to decide how large a sample size to choose. Kayleigh tells you that, in the past, when TarMart introduced a new product, the percentage of people who expressed interest ranged between 2% and 10%. What sample size should you use? You choose a sample size of 250.

After conducting the survey, you find that 10 out of 250 people surveyed had bought Crown or were considering buying Crown within the next month. What is the 95% confidence interval for the population proportion?

First, you find the sample proportion: 10 out of 250 is a proportion of 4%. You verify that n × p-bar = 250 × 0.04 = 10 ≥ 5 and n × (1 − p-bar) = 250 × 0.96 = 240 ≥ 5. Then, using the formula, you find the confidence interval around the sample proportion. The endpoints of that interval are 1.6% and 6.4%.
Challenge: OOPS! Package Deliveries
OOPS is a small-package delivery service with worldwide operations. Celine Bedex, VP Marketing, has heard increasing complaints about late deliveries, and wants to know how many of the shipments are late by one day or more. Celine would like an estimate of the percentage of late deliveries.

In a sample of 256 shipments, 2 were delivered late, a proportion of about 0.008, or 0.8%. If Celine wants to be 99% confident in the result of a confidence interval calculation, the interval is: (Note that this sample fails the sample size guideline, since n × p-bar = 256 × 0.008 = 2, which is less than 5, so a valid confidence interval cannot be constructed from it.)

Celine collects a new sample, this time of 729 shipments. Of these, 8 were late. Celine can be 99% confident that the population proportion of late packages is between:

First, calculate the sample proportion for the new sample: 8/729 = 0.011. Then, verify that the new sample size satisfies the rules of thumb: both n × p-bar and n × (1 − p-bar) are greater than 5. Using the new sample size and sample proportion, calculate the confidence interval: [0.1%, 2.1%].
Hypothesis Testing
Introduction
After finishing the sampling assignments, you and Alice decide to take some time off to enjoy the beach. Just as you are gathering your beach gear, Leo gives you another call.
Improving the Kahana
Hi there! Don't let me keep you from enjoying the beach. I just wanted to let you know what I'd like you to help me with next. I've been working on ideas to increase the Kahana's profits.
Is it possible to increase profits by raising the room prices? That would be an easy solution.

I wish it were that easy. Room prices are extremely competitive and are often the first thing potential guests take into consideration. So if we increase room prices, I'm afraid we'll have fewer guests. That might put us back where we started from with profits — or even worse.

What other factors influence your profits?

The two major ones are room occupancy rates and discretionary spending. "Discretionary spending" is the money guests spend on non-room amenities. You know, food, drinks, spa services, sports activities, and so on. As a manager I can affect a variety of factors that influence discretionary spending: the quality of the restaurant, for example, or the types of amenities offered.

And you'd like us to help you understand your guests' discretionary spending patterns better.

Right. Then I can explore new ways to increase profits on non-room amenities. I can also see if some of my recent efforts to increase guest spending have paid off. I'm particularly interested in restaurant operations. I've made some changes to the restaurants recently. For example, I hired a new executive chef last year. I'd like to know if restaurant revenues per person have changed since then. I'd also like to find out if the renovation of our premier cocktail lounge has resulted in higher spending on beverages. Finally, I've been wondering if discretionary spending patterns are different for leisure and business guests. If so, I might change our marketing campaigns to better suit each of those market segments.

What records do you have for us to work with?

We don't have a consolidated report for this year yet, so we'll need to conduct some surveys and analyze the results.

You're really getting into these statistical methods, aren't you, Leo?
Definition
Leo made some important changes to his business and he has some ideas of what the impact of these changes has been. How do you put his ideas to the test?

As managers, we often need to put our claims, ideas, or theories to the test before we make important decisions. Based on whether or not our claim is statistically supported, we may wish to take managerial action. Hypothesis testing is a statistical method for testing such claims. A hypothesis is simply a claim that we want to substantiate. To begin, we will learn how to test hypotheses about population means.

For instance, suppose we know that the historical average number of defects in a production process is 3 defects per 1,000 units produced. We have a hunch that a certain change to the process — a new machine, say — has changed this number. The hypothesis we wish to substantiate is that the average defect rate has changed — that it is no longer 3 per 1,000. How do we conduct a hypothesis test? First, we collect a random sample of units produced by the process. Then, we see whether or not what we learn about the sample supports our hypothesis that the defect rate has changed.

Suppose our sample has an average defect rate of 2.7 defects per 1,000. Based on this sample, can we confidently say that the defect rate has changed? That depends. To find out, we construct a range around the historical defect rate of 3 — the population mean that has been cast in doubt. We construct the range so that if the mean defect rate in the population is still 3, it is very likely for the mean of a sample taken from the population to fall within that range. The outcome of our test will depend on whether 2.7, the mean of the sample we have taken, falls within the range or not. If the sample mean of 2.7 falls outside of the range, we feel comfortable rejecting the hypothesis that the defect rate is still 3. However, if the sample mean falls within the range, we don't have enough evidence to support the claim that the
defect rate has changed. This example captures the essence of hypothesis testing, but we need to formalize our intuition about the example and define our new statistical technique more precisely.

To conduct a hypothesis test, we formulate two hypotheses: the so-called null hypothesis and the alternative hypothesis. Based on experience or conventional wisdom, we have an initial value of the population mean in mind. The null hypothesis states that the population mean is equal to that initial value: in our example, the null hypothesis states that the current population mean is 3 defects per 1,000. We use the Greek letter mu to represent the population mean, in this case the current average defect rate. The alternative hypothesis is the claim we are trying to substantiate. Here, the alternative hypothesis is that the average defect rate has changed. Note that the alternative hypothesis states that the null hypothesis does not hold.

As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we gather from a sample, there are only two possible conclusions we can draw from a hypothesis test: either we reject the null hypothesis or we do not reject it. Since the alternative hypothesis states the opposite of the null hypothesis, by "rejecting" the null hypothesis we necessarily "accept" the alternative hypothesis. In our example, the evidence from our sample will help us determine whether or not we should reject the null hypothesis that the defect rate is still 3 in favor of the alternative hypothesis that the defect rate has changed. Based on our sample evidence, which conclusion should we draw?

We reject the null hypothesis if it is highly unlikely that our sample mean would come from a population with the mean stated by the null hypothesis. For example, if the sample we drew had a defect rate of 14 per 1,000, we would reject the null hypothesis. Drawing a sample with 14 defects from a population with an average defect rate of 3 would be very unlikely.

We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a population with the mean stated by the null hypothesis. The null hypothesis may or may not be true: we simply don't have enough evidence to draw a definite conclusion. For example, if the sample we drew had a defect rate of 3.05 per 1,000, we could not reject the null hypothesis, since it wouldn't be unusual to randomly draw a sample with 3.05 defects from a population with an average defect rate of 3. Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we never say that we "accept" the null hypothesis — we simply don't reject it.

It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to substantiate as the null hypothesis — such a test would never allow us to "accept" our claim! The only way we can substantiate our claim is to state it as the opposite of the null hypothesis, and then reject the null hypothesis based on the evidence.

It is important that we understand exactly how to interpret the results of a hypothesis test. Let's illustrate the two types of conclusions with an analogy: a US jury trial. In the US judicial system, the accused is considered innocent until proven guilty. So, the null hypothesis is that the accused is innocent. The alternative hypothesis is that the accused is guilty: this is the claim that the prosecution is trying to prove.
The two possible outcomes of a jury trial are "guilty" or "not guilty." The jury does not convict the accused unless it is certain beyond reasonable doubt that the accused is guilty. With insufficient evidence, the jury cannot conclude that the accused truly is innocent. The jury simply declares that the accused is "not guilty." Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not prove that the null hypothesis is true. We simply have failed to show it is false, and thus cannot reject it.

A hypothesis is a claim or assertion that can be tested. On the basis of a hypothesis test, we either reject or leave unchallenged a particular statement: the null hypothesis. Alice promises Leo that the two of you will drop by his office first thing in the morning to test whether Leo's survey results support his claims that food and beverage spending patterns have changed.
Summary
We use hypothesis tests to substantiate a claim about a population mean. The null hypothesis states that the population mean is equal to an initial value that is based on our experience or conventional wisdom. We test the null hypothesis to learn if we should reject it in favor of our claim, the alternative hypothesis, which states that the null hypothesis does not hold.
Single Population Means
The next morning, Leo explains the measures he has undertaken to increase customer spending on food and beverages. "I'd like to see if they've had a discernible impact on my guests' restaurant-related spending patterns."
The Restaurant Revenue Problem
Last year, I made two major changes to restaurant operations: I brought in a new executive chef and renovated the main cocktail lounge. The chef introduced a new menu: a fusion of traditional Hawaiian and French cuisine. She put some elaborate items on the menu, like that mango and brie tart I recommended to you. She also has offerings that cater to simpler tastes. But the question is, have restaurant profits been affected by the new chef? Since we set our food margins as a fixed percentage of food revenue, I know that if revenues have increased, profits have increased too. Based on last year's consolidated reports, the average spending on food per person per day was $55. I'm curious to see if that has changed.

In addition, I renovated the cocktail lounge. The old bar was designed poorly and used space inefficiently. Now more guests can be seated in the lounge, and more seats have good views of the ocean. I also invested in a large machine that makes a wide variety of frozen drinks. Frozen pina coladas are very, very popular. I hope my investments in the bar are paying off in terms of higher guest spending on drinks. Beverages have high margins, but I'm not sure if beverage sales have increased enough to cover the investments.

Can we say, for beverages, as for food, that "changes in revenues" are a good proxy for "changes in profits"?

Absolutely. I set my profit margins as a fixed percentage of revenues for beverages as well. Last year, the average spending on beverages per guest per day was $21.

Isn't that high?

Well, we have some very nice wines in our restaurants. We don't have the consolidated report yet, but I've already had my staff choose a random sample of guests. We pulled the restaurant and lounge receipts for the guests in the sample and noted three items: total food revenues, total beverage revenues, and number of guests at the table. Using this information, we should be able to estimate the daily spending on food and beverages per guest.

You look at Leo's data and wonder how you can discern whether Leo's changes — the new chef and the bar renovations — have influenced the resort's profits.
Hypothesis Tests for Single Population Means Leo has prepared data for you. How are you going to put it to use? Our first type of hypothesis test is used to study population means. Let's walk through an example of this type of test. Suppose the manager of a movie theater implemented a new strategy at the beginning of the year: he started showing old classics instead of recent releases. He knows that prior to the change in strategy, average customer satisfaction was 6.7 out of a possible 10 points. He would like to know if average customer satisfaction has changed since he altered his theater's artistic focus. The manager's null hypothesis states that the current mean satisfaction has not changed; it is still 6.7. We use the
Greek letter mu to represent the current mean satisfaction rating of the theater's entire filmgoing population. His alternative hypothesis is the opposite of the null hypothesis: it states that average customer satisfaction is now different. To substantiate his claim that the mean has changed, the manager takes a random sample of 196 moviegoers. He is careful to sample across movies, show times, and dates. The mean satisfaction rating for the sample is 7.3, with a standard deviation of 2.8. Does the fact that the random sample's mean of 7.3 is higher than the historical mean of 6.7 indicate that this year's moviegoers really are more satisfied? Or is the mean still the same, and the manager "just happened" to pick a sample with an unusually high average satisfaction rating? This is equivalent to asking: if the null hypothesis is true — the average satisfaction is still 6.7 — would we be likely to randomly draw the sample that we did, with average satisfaction 7.3?

To answer this question, we first have to define what we mean by "likely." As in sampling and estimation, we typically use 95% as our threshold level of likelihood. We then construct a range around the population mean specified by our null hypothesis. The range should be drawn so that if the null hypothesis is true, 95% of all samples drawn from the population would have means falling in that range. In other words, we create a range of likely sample means. The central limit theorem tells us that the distribution of sample means follows a normal curve, so we can use its familiar properties to find probabilities. Moreover, the distribution of sample means is centered at our assumed population mean, mu, and has standard deviation sigma/sqrt(n). We don't know sigma, the underlying population standard deviation, so we use the sample standard deviation s as our best estimate. As we do when constructing 95% confidence intervals, we create a range with width z*s/sqrt(n) = 1.96*s/sqrt(n) on either side of the mean. However, when we conduct a hypothesis test, we center the range around the mean specified in the null hypothesis, because we always start a hypothesis test by assuming the null hypothesis is true.

In our example, the null hypothesis is that the population mean is 6.7, n is 196, and s is 2.8. Our 95% confidence level translates into a z-value of 1.96. We construct the range of likely sample means: 6.7 ± 1.96*2.8/sqrt(196) = 6.7 ± 0.39, i.e., from about 6.3 to 7.1. This tells us that if the population mean is 6.7, there is a 95% chance that the mean of a randomly selected sample will fall between 6.3 and 7.1. Now, if we take a sample and the mean does not fall within the range around 6.7, we can reject the null hypothesis. Why? Because if the population mean were 6.7, it would be unlikely to collect a sample whose mean falls outside this range. The region outside the range of likely sample means is called the "rejection region," since we reject the null hypothesis if our sample mean falls into it. In the movie theater example, the rejection region contains all values less than 6.3 and all values greater than 7.1. In this example, the sample mean, 7.3, falls in the rejection region, so we reject the null hypothesis. Whenever we reject the null hypothesis, we in effect accept the alternative hypothesis. We conclude that customer satisfaction has indeed changed from the historical mean value of 6.7. If our sample mean had fallen within the range around 6.7, we could not make a definite statement about moviegoers' satisfaction.
We would not have enough evidence to state that things have changed, but we can never claim that they have definitely remained the same. Unless we poll every customer, we'll never know for sure if customer satisfaction has truly changed. Working only with sample data, there is always a chance that we'll draw the wrong conclusion about the population. We can go wrong in two ways: rejecting a null hypothesis that is in fact true, or failing to reject a null hypothesis that is in fact false. Let's look at the first of these: the null hypothesis is true, but we reject it. We choose the confidence level so it is unlikely — but not impossible — for the sample mean to fall in the rejection region when the null hypothesis is true. In this case, we are using a 95% confidence level, so by "unlikely" we mean a 5% chance. However, 5% of all samples from a population with the null hypothesis mean would fall in the rejection region, so when we reject a null hypothesis, there is a 5% chance we will do so erroneously. Therefore, when the sample mean falls in the rejection region, we can only be 95% confident that we are justified in rejecting the null hypothesis. Hence we continue to speak of a confidence level of 95%.
A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5% significance level says that there is a 5% chance of a sample mean falling in the rejection region when the null hypothesis is true. This is what people mean when they say that something is "statistically significant at a 5% significance level." If we increase our confidence level, we widen the range around the null hypothesis mean. At a 99% confidence level, our range captures 99% of all sample means. This reduces to 1% our chance of rejecting the null hypothesis erroneously. But doing this has a downside: by decreasing the chance of one type of error, we increase the chance of the other type. The higher the confidence level, the smaller the rejection region, and the less likely it is that we can reject the null hypothesis when it is in fact false. This decreases our chance of being able to substantiate the alternative hypothesis when it is true. As managers, we need to choose the confidence level of our test based on the relative costs of making each type of error. The range of likely sample means should not be confused with a confidence interval. Confidence intervals are always constructed around sample means, never around population means. When we construct a confidence interval, we don't even have an initial estimate of the population mean. Constructing a confidence interval is a process for estimating the population mean, not for testing particular claims about that mean.
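The meaning of a significance level can be seen directly in a small simulation. The sketch below is our own illustration (the course itself works with Excel utilities, not code): it repeatedly draws samples from a population in which the null hypothesis is true and counts how often the sample mean lands in the rejection region. Roughly 5% of the tests reject erroneously, matching the 5% significance level.

```python
# A minimal simulation (our illustration, not part of the course) of the
# 5% significance level: when the null hypothesis is true, about 5% of
# sample means still land in the rejection region.
import random
import statistics

random.seed(42)
mu_null, sigma, n, trials = 6.7, 2.8, 196, 10_000
half_width = 1.96 * sigma / n ** 0.5       # 1.96 * 2.8 / 14 = 0.392

false_rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu_null, sigma) for _ in range(n)]
    x_bar = statistics.mean(sample)
    if abs(x_bar - mu_null) > half_width:  # sample mean falls in the rejection region
        false_rejections += 1

print(f"Erroneous rejection rate: {false_rejections / trials:.3f}")  # close to 0.05
```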
Summary In a hypothesis test for population means, we assume that the null hypothesis is true. Then, we construct a range of likely sample means around the null hypothesis mean. If the sample mean we collect falls in the rejection region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis. The confidence level measures how confident we are that we are justified in rejecting the null hypothesis.
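If you would like to retrace the theater example numerically, here is a minimal Python sketch of the two-sided test. The variable names are ours; the course performs these steps by hand or in the Excel utility.

```python
# Two-sided hypothesis test for the movie theater example (our illustration).
from math import sqrt

mu_null = 6.7   # population mean under the null hypothesis
x_bar   = 7.3   # sample mean
s       = 2.8   # sample standard deviation
n       = 196   # sample size
z       = 1.96  # z-value for a two-sided test at 95% confidence

half_width = z * s / sqrt(n)               # 1.96 * 2.8 / 14 = 0.392
low, high = mu_null - half_width, mu_null + half_width
print(f"Range of likely sample means: [{low:.1f}, {high:.1f}]")  # [6.3, 7.1]

if x_bar < low or x_bar > high:
    print("Sample mean is in the rejection region: reject the null hypothesis.")
else:
    print("Cannot reject the null hypothesis.")
```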
One-sided Hypothesis Tests The movie theater manager did not have a strong conviction about the direction of change for customer satisfaction prior to performing the hypothesis test. He wanted to test for change in both directions — up or down — and thus he used a two-sided hypothesis test. The null hypothesis — that no change has taken place — could have been wrong in either of two ways: customer satisfaction may have increased or decreased. The two-tailed nature of the test was reflected in the two-sided range we drew around the population mean. Sometimes, we may want to know if the actual population mean differs from our initial value of the population mean in a specific direction. For instance, if the theater manager were quite sure that satisfaction had not decreased, he wouldn't have to test in that direction; rather, he'd only have to test for positive change. In these cases, our alternative hypothesis should clearly state which direction of change we want to test for. These kinds of tests are called one-sided hypothesis tests. Here, we substantiate the claim that the mean has increased only if the sample mean is sufficiently higher than 6.7, so our rejection region extends only to the right.

Let's outline how to formulate one- and two-sided tests. For a two-sided test, we have an initial understanding of the population: the population mean is equal to a specified initial value. If we want to substantiate the claim that a population mean has changed, the null hypothesis should state that the mean still equals that initial value. The alternative hypothesis should state that the mean does not equal that initial value. If we want to show that the actual population mean is greater than the initial value — the null hypothesis mean — then the null hypothesis should state that the population mean has at most that value. The alternative hypothesis states that the mean is greater than the null hypothesis mean. Likewise, if we want to substantiate the claim that a population mean is less than the initial value, the null hypothesis should state that the mean is at least that initial value. The alternative hypothesis should state that the mean is less than the null hypothesis mean, and the rejection region extends only to the left.

When we conduct a one-sided hypothesis test, we need to create a one-sided range of likely sample means. Suppose the theater manager claims that satisfaction improved. As usual, he states the claim he wants to substantiate as his alternative hypothesis. The 196-person sample has mean 7.3 and standard deviation 2.8. Does this sample provide sufficient evidence to substantiate the claim that mean satisfaction increased? To find out, the manager creates a one-sided range: he
assumes the population mean is the null hypothesis mean, 6.7, and finds the range that contains the lower 95% of all sample means. To find this range, all he needs to do is calculate its upper bound: the value below which 95% of all sample means would fall. To find it, we use what we know about the cumulative probability under the normal curve: a cumulative probability of 95% corresponds to a z-value of 1.645. z-table Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the z-value corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail. For a one-sided test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is excluded from the upper tail. z-table We now have all the information we need to find the upper bound on the range of likely sample means: 6.7 + 1.645*2.8/sqrt(196) ≈ 7.0. The rejection region is everything above the value 7.0. The sample mean, 7.3, falls in the rejection region, so the manager rejects the null hypothesis. He is confident that customer satisfaction is higher.
Summary When we want to test for change in a specific direction, we use a one-sided test. Instead of finding a range containing 95% of all sample means centered at the null hypothesis mean, we find a one-sided range. We calculate its endpoint using the cumulative probability under the normal curve.
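The one-sided version of the theater test can be sketched the same way. Again, this is an illustration we have added, using z = 1.645 for a one-sided test at 95% confidence.

```python
# One-sided (upper-tail) version of the theater test (our illustration).
from math import sqrt

mu_null, x_bar, s, n = 6.7, 7.3, 2.8, 196
z = 1.645                                  # one-sided test at 95% confidence

upper_bound = mu_null + z * s / sqrt(n)    # 6.7 + 1.645 * 0.2 = 7.03
print(f"Upper bound of likely sample means: {upper_bound:.2f}")

if x_bar > upper_bound:                    # 7.3 > 7.03: reject
    print("Reject the null hypothesis: satisfaction has increased.")
else:
    print("Cannot reject the null hypothesis.")
```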
Excel Utility (Single Populations) The Excel Utility link below allows you to perform hypothesis tests for single populations. Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to reproduce the results for the theater manager's example. Excel Utility for Single Populations
Solving the Restaurant Revenue Problem A single-population hypothesis test tests a claim using a sample from a single population. With a plan in mind, you take a look at Leo's sample data. You are ready to analyze the impact of the changes Leo has made to his restaurant operations. You draw a table to organize the data from your sample on daily guest spending on restaurant food. One change Leo made to his restaurant operations was to hire a new chef. He wants to know whether average restaurant spending per guest has changed since she took over the menu and the kitchen. This is a clear case for a hypothesis test. Last year's average spending on food per person was $55; this gives you an initial value for the mean. Leo wants to know if mean spending has changed, so you use a two-sided test. You jot down your null hypothesis, which states that the average revenue per guest is still $55. If the null hypothesis is true, the difference between the sample mean of $64 and the initial value of $55 can be accounted for by chance. You add the alternative hypothesis to your notes. Next, you assume that the null hypothesis is true: the population mean is $55. Now you need to construct a range of likely sample means around $55 and ask: does the sample mean of $64 fall within that range? Or does it fall in the rejection region?
Leo didn't specify what level of confidence he wanted for your results. You call him for clarification. "I suppose a 95% confidence level is okay. I'd like to be more confident, of course." After you point out that higher confidence would reduce his chances of being able to substantiate a change in spending if a change has taken place, he agrees to 95%. You pull out your trusty calculator and get ready to compute a range around the null hypothesis mean of $55. Consulting your notes, you find the correct formula and compute the endpoints of the range containing 95% of all sample means. z-table Utility for Single Populations

You pause for a moment to reflect on the interpretation of this range. Suppose the null hypothesis is true. Then 19 out of 20 samples of this size from the population of hotel guests would have means that fall in the calculated range. The sample mean of $64 falls outside of this range. You and Alice report your results to Leo: it looks like hiring that chef was a good decision. The evidence suggests that mean spending per person has increased. "I'm glad to hear it. Now what about renovating the bar? Can you run a similar test to see if that has affected average beverage spending?" Leo emphasizes that he can't imagine that his investments in the bar could have reduced average beverage spending per guest. He wants to know if spending has gone up. You decide to do a one-sided test. First, you write down all of Leo's data, along with the hypotheses. You need to find an upper bound such that 95% of all sample means are smaller than it. To do so, you use a z-value of 1.645. The upper bound is $24.29. What is the correct interpretation of this number? Given that the null hypothesis is true, 95% of all sample means would fall below $24.29. The range of likely sample means contains the collected sample mean of $24. This tells you that you cannot reject the null hypothesis.

When you present your full report to Leo, he appears confused and disappointed. "How is this possible? Why hasn't renovating the bar increased revenues? Even if the frozen drink machine didn't pay off, shouldn't the increase in seats have helped?" First of all, we haven't concluded that average revenue has not increased. We just can't be sure that it has. The fact that our sample mean is $24 vs. $21 last year does not allow us to say anything definitive about the change in average beverage revenue. Remember, we set out to substantiate our hypothesis that spending has improved. Based just on this sample, we are unable to conclude that spending has increased. You added seats, and now more people can be seated in your lounge. But a greater number of guests does not necessarily translate into more spending per person. "That does make a lot of sense." Your overall revenues may have actually increased, because more guests can be seated in the lounge. "Gosh, I'm glad to hear that. For a moment there, I thought I had made a really bad investment. I'm quite optimistic I'll see a jump in total beverage revenues in the consolidated report at the end of the year. Why don't we go fill three of those new seats right now?"
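Leo's sample size and standard deviation are not reproduced in this write-up, but the reported upper bound of $24.29 implies a standard error of about $2 (since 1.645 × SE = $24.29 − $21). The sketch below — our own illustration, with that implied standard error as an explicitly labeled assumption — retraces the one-sided beverage test.

```python
# Retracing Leo's one-sided beverage test (our illustration). The sample's
# n and s are not shown in the text, so we back out the implied standard
# error from the reported $24.29 upper bound (an assumption on our part).
mu_null = 21.0                   # last year's mean beverage spending per guest per day
x_bar   = 24.0                   # sample mean
z       = 1.645                  # one-sided test at 95% confidence
se      = (24.29 - mu_null) / z  # implied standard error s/sqrt(n), about 2.0

upper_bound = mu_null + z * se   # 24.29
if x_bar > upper_bound:
    print("Reject the null hypothesis: spending has increased.")
else:                            # 24 < 24.29: this branch runs
    print("Cannot reject the null hypothesis.")
```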
Exercise 1: Oma's Pretzels Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks, and advertises that these pretzels contain an average of 112 calories per serving. In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true. Somewhat to their surprise, the researchers found that the average calorie content in a sample of 32 bags was 102 calories per serving. The standard deviation of the sample was 19.
Blanche would like to know if the calorie content of Oma's pretzels really has changed, so she can market them appropriately. With 99% confidence, do these data indicate that the pretzels' calorie content has changed? z-table Utility for Single Populations You begin any hypothesis test by formulating a null and an alternative hypothesis. The null hypothesis states that the population mean is equal to the initial value. In this problem, the null hypothesis is that the caloric content in the actual population is what Oma's has always advertised: 112 calories per serving. The alternative hypothesis should contradict the null hypothesis. For a two-sided test, the alternative hypothesis simply states that the mean does not equal the initial value. A two-sided test is more appropriate in this problem, since Blanche only wants to know if the mean calorie content has changed. You assume that the null hypothesis is true and construct a range of likely sample means around the population mean. Using the data and the appropriate formula (112 ± 2.576 × 19/sqrt(32)), you find the range [103.3; 120.7]. The sample mean of 102 falls outside of that range, so you can reject the null hypothesis. Blanche can be 99% confident that the population mean is not 112. Why might Blanche have chosen a 99% confidence level rather than the more typical 95% level for her test? z-table Utility for Single Populations
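To check these numbers yourself, here is a minimal Python sketch of the two-sided 99% test (our illustration; the course uses the Excel utility).

```python
# Oma's Pretzels: two-sided test at 99% confidence (our illustration).
from math import sqrt

mu_null, x_bar, s, n = 112, 102, 19, 32
z = 2.576                                  # z-value for a two-sided test at 99% confidence

half_width = z * s / sqrt(n)               # about 8.7
low, high = mu_null - half_width, mu_null + half_width
print(f"Range of likely sample means: [{low:.1f}, {high:.1f}]")  # [103.3, 120.7]

if x_bar < low or x_bar > high:            # 102 < 103.3: reject
    print("Reject the null hypothesis: calorie content has changed.")
else:
    print("Cannot reject the null hypothesis.")
```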
Exercise 2: The Clearwater Power Company The Clearwater Power Company produces electrical power from coal. A local environmental group claims that Clearwater's emissions have raised sulfur dioxide levels above permissible standards in Blue Sky, the town downwind of the plant. According to Environmental Protection Agency standards, an acceptable average sulfur dioxide level is 30 parts per billion (ppb). As Clearwater's PR consultant, you want to defend the company, and you try to anticipate the environmentalists' argument. The environmental group collects 36 samples on randomly selected days over the course of a year. It finds a mean sulfur dioxide content of 35 ppb with a standard deviation of 24 ppb. The environmental group will use a hypothesis test to back up its claim that the sulfur dioxide levels are higher than permitted. What is an appropriate null hypothesis for this problem? Since the environmentalists' claim is that sulfur dioxide levels are higher, they will want to run a one-sided test: the null hypothesis states that the mean sulfur dioxide level is at most 30 ppb, and the alternative hypothesis states that the sulfur dioxide levels are above the accepted standard. We assume they will choose a 95% confidence level. What is the range of likely sample means? z-table Utility for Single Populations They calculate the one-sided range around the null hypothesis mean that contains 95% of all sample means. The z-value for a one-sided 95% range is 1.645. The upper bound on the range of likely sample means is 30 + 1.645 × 24/sqrt(36) = 36.58 ppb. Based on your calculations, the sample mean of 35 ppb falls below the upper bound and thus outside the rejection region: the environmentalists cannot reject the null hypothesis, so their data do not substantiate the claim that average sulfur dioxide levels exceed the standard. z-table Utility for Single Populations
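The same steps can be verified in code; the sketch below is our illustration of the one-sided test the environmentalists would run.

```python
# Clearwater: one-sided (upper-tail) test at 95% confidence (our illustration).
from math import sqrt

mu_null, x_bar, s, n = 30, 35, 24, 36
z = 1.645                                  # one-sided test at 95% confidence

upper_bound = mu_null + z * s / sqrt(n)    # 30 + 1.645 * 4 = 36.58
print(f"Upper bound: {upper_bound:.2f} ppb")

if x_bar > upper_bound:
    print("Reject the null hypothesis: levels exceed the standard.")
else:                                      # 35 < 36.58: cannot reject
    print("Cannot reject the null hypothesis.")
```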
Exercise 3: Neshey's Smooches You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its former capacity.
If the machine isn't working at top capacity, you will need to have it replaced. Which type of hypothesis test is most appropriate for this problem? Since you only care whether output has fallen below its former level, a one-sided test for a single population mean is most appropriate. The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340 Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44. You conduct a one-sided hypothesis test using a 95% confidence level. The null hypothesis is that µ ≥ 340; the alternative hypothesis is that µ < 340. The lower bound of the range of likely sample means is 340 − 1.645 × 44/sqrt(32) ≈ 327.2. The sample mean of 318 falls below this bound, in the rejection region, so according to your calculations you should reject the null hypothesis and replace the machine.
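As with the earlier exercises, the arithmetic can be retraced in a short sketch (our illustration, not part of the course).

```python
# Neshey's Smooches: one-sided (lower-tail) test at 95% confidence (our illustration).
from math import sqrt

mu_null, x_bar, s, n = 340, 318, 44, 32
z = 1.645                                  # one-sided test at 95% confidence

lower_bound = mu_null - z * s / sqrt(n)    # 340 - 1.645 * 7.78, about 327.2
print(f"Lower bound of likely sample means: {lower_bound:.1f}")

if x_bar < lower_bound:                    # 318 < 327.2: reject
    print("Reject the null hypothesis: replace the machine.")
else:
    print("Cannot reject the null hypothesis.")
```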