Statistics 1 - Notes
Edexcel Notes S1
Statistics 1
Mathematical Model
A mathematical model is a simplification of a real world problem.
1. A real world problem is observed.
2. A mathematical model is thought up.
3. The model is used to make predictions: "What happens if...?"
4. Real world data is collected.
5. Predicted results are obtained.
6. These are compared using statistical tests.
7. The model is refined as required, and then it's back to stage 3.
Advantages of using mathematical models are:
• They simplify a real world problem.
• They improve our understanding of a real world problem.
• They are quicker and cheaper.
• They can be used to predict future outcomes.
Disadvantages of using mathematical models are:
• They only give a partial description of the real problem.
• They only work for a restricted range of values.
Stem and Leaf
One of the simplest ways of ordering and presenting data is to place it in a stem and leaf diagram. For example, take the following data:

Person   Weight (lb)   Height (cm)
1        166           161
2        164           160
3        143           160
4        189           199
5        191           167
6        178           178
7        165           169
8        159           174
9        189           172
10       191           178
11       167           176

Unordered Stem and Leaf
Height in cm. 16 | 1 represents 161 cm.
15 |
16 | 1 0 0 7 9
17 | 8 4 2 8 6
18 |
19 | 9

Ordered Stem and Leaf
Height in cm. 16 | 0 represents 160 cm.
15 |
16 | 0 0 1 7 9
17 | 2 4 6 8 8
18 |
19 | 9
1 Liverpool F.C.™
As you can see from the key, the | divides the tens (and hundreds) from the units. Stem and leaf diagrams can also be drawn back to back, if you have two sets of data to display. Using the data above:

Weight in pounds: 3 | 14 represents 143 lb. Height in cm: 16 | 0 represents 160 cm.

Weight (lb) |    | Height (cm)
          3 | 14 |
          9 | 15 |
    7 6 5 4 | 16 | 0 0 1 7 9
          8 | 17 | 2 4 6 8 8
        9 9 | 18 |
        1 1 | 19 | 9
Stem and leaf diagrams can give us an indication of the distribution. In this example, the weights are spread much more widely than the heights. If we were comparing something like scores on two exams, we could compare the medians.
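The ordered height diagram can be rebuilt with a few lines of Python. This is only a sketch: the function name `stem_and_leaf` is made up for illustration, and it assumes whole-number data where the stem is everything except the units digit.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's units digit (leaf) under its remaining digits (stem)."""
    rows = defaultdict(list)
    for v in sorted(values):      # sorting first gives an ordered diagram
        rows[v // 10].append(v % 10)
    return dict(rows)

# Heights (cm) as they appear in the height diagrams.
heights = [161, 160, 160, 199, 167, 178, 169, 174, 172, 178, 176]
diagram = stem_and_leaf(heights)

# Print every stem in the range, including stems with no leaves.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem} | {leaves}")
```

Leaving out the `sorted` call would give the unordered diagram instead.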
Frequency Tables

Amount (£x)     Frequency (f)
0 ≤ x < 20      5
20 ≤ x < 40     9
40 ≤ x < 60     20
60 ≤ x < 80     25
80 ≤ x < 100    9
Cumulative Frequency
One way we can interpret the data is by working out the cumulative frequency. This simply means adding up the frequencies as you go along. Cumulative frequency is plotted against the upper class boundary. From the above example, we get:

Amount (£x)     Frequency (f)   Upper class boundary   Cumulative frequency
0 ≤ x < 20      5               20                     5
20 ≤ x < 40     9               40                     14 (5+9)
40 ≤ x < 60     20              60                     34 (5+9+20)
60 ≤ x < 80     25              80                     59 (5+9+20+25)
80 ≤ x < 100    9               100                    68 (5+9+20+25+9)
Total           68
To check your cumulative frequencies, add up the frequency column: the total should equal the final cumulative frequency. The question will often say something like "a survey of 68 people...", which is an even easier check. Once we have the cumulative frequency column, we can draw a cumulative frequency curve.
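As a quick illustration, the running total can be produced with `itertools.accumulate` from the Python standard library. The variable names are chosen here for the example; the numbers are the ones from the frequency table.

```python
from itertools import accumulate

upper_bounds = [20, 40, 60, 80, 100]   # upper class boundaries
frequencies = [5, 9, 20, 25, 9]

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Points to plot for the cumulative frequency curve:
# (upper class boundary, cumulative frequency).
points = list(zip(upper_bounds, cumulative))
print(points)
```

The last cumulative value should match the total frequency, which is the check described above.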
Using this, we can also create a box plot. Read across from the quartile positions on the y-axis (cumulative frequency) to the curve, and then down to find the corresponding x-values:
Box plots are useful because they show a lot of information: the median, the spread (through the IQR), whether there are any outliers, and whether the data is symmetrical, positively skewed or negatively skewed. The IQR is a measure of spread.
IQR = Q₃ − Q₁
Outliers are extreme values. They are usually represented as a cross:
They can be either too low or too high, and are usually worked out using:
Q₁ − 1.5 × IQR (anything less than this figure is an outlier)
Q₃ + 1.5 × IQR (anything greater than this figure is an outlier)
The exam question will always state how to work out the outliers, so this is one thing you don't have to worry about remembering (just as long as you know how to use the formula). Once you've identified the outliers, where does the end of the box plot's whisker go? You can either use the next highest/lowest data value that is not an outlier, or use the boundary value worked out from the formula.
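The outlier rule can be sketched in Python as below. The function names and the sample data and quartiles (Q₁ = 20, Q₃ = 30) are invented purely for illustration; the multiplier 1.5 is the one quoted above, and an exam question may specify a different one.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which values count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3, k=1.5):
    """List the values lying outside the limits."""
    low, high = outlier_limits(q1, q3, k)
    return [x for x in data if x < low or x > high]

# Hypothetical quartiles and data, not taken from the notes.
low, high = outlier_limits(q1=20, q3=30)
outliers = find_outliers([2, 18, 25, 31, 44, 50], q1=20, q3=30)
print(low, high, outliers)
```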
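Linear Interpolation
To work out the median, find the $\frac{n+1}{2}$th value. For Q₁ work out the $\frac{n+1}{4}$th value and for Q₃ find the $\frac{3(n+1)}{4}$th value.
Percentiles mean a percentage of the way through the cumulative frequency. To work out P₁₂, for example, work out the $\frac{12(n+1)}{100}$th value.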
For a grouped frequency table, it can be difficult to calculate the median and quartiles exactly. There is a way of estimating an answer, however, and this is called linear interpolation.

Time (sec)       Frequency   Cumulative Frequency   Class width
0 ≤ x < 10       0           0                      10
10 ≤ x < 15      8           8                      5
15 ≤ x < 17.5    3           11                     2.5
17.5 ≤ x < 20    7           18                     2.5
20 ≤ x < 24      12          30                     4

The first step is to find the $\frac{n+1}{2}$th value. In this example n = 30, so it is the 15.5th value, which lies in the 17.5 ≤ x < 20 class.
We take away 11 (the cumulative frequency before that class) and then divide by 7 (the frequency of the class the 15.5th value is found in). Next we multiply by 2.5 (the class width of that class). Finally we add on 17.5 (the lower class boundary of that class) and the answer appears, 19.1:
$$\text{Median} \approx 17.5 + \frac{15.5 - 11}{7} \times 2.5 = 19.1 \text{ (3 s.f.)}$$
The only difference for the percentiles and other quartiles is replacing $\frac{n+1}{2}$ by the position of whatever you want to find.
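The interpolation steps above can be written as a small Python function. This is a sketch under the assumptions of the worked example: the function name `interpolate` is made up, and the position is taken as (n + 1)/2, which is what gives the 15.5 used above.

```python
def interpolate(position, lower_bounds, widths, frequencies):
    """Estimate the value at a given position in a grouped frequency table."""
    cumulative_before = 0
    for lower, width, freq in zip(lower_bounds, widths, frequencies):
        # The position falls in the first class whose cumulative
        # frequency reaches it.
        if freq > 0 and position <= cumulative_before + freq:
            return lower + (position - cumulative_before) / freq * width
        cumulative_before += freq
    raise ValueError("position is beyond the total frequency")

lower_bounds = [0, 10, 15, 17.5, 20]
widths = [10, 5, 2.5, 2.5, 4]
frequencies = [0, 8, 3, 7, 12]

n = sum(frequencies)
median = interpolate((n + 1) / 2, lower_bounds, widths, frequencies)
print(round(median, 1))
```

Passing a different position (for a quartile or percentile) reuses the same function.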
Mean from frequency table
It's easy enough to work out the mean from ordinary data, using the simple formula:
$$\bar{x} = \frac{\sum x}{n}$$
(In other words, add them all up and divide by how many there are.)

Time (sec)   Frequency (f)
0 - 9        0
10 - 14      8
15 - 17      3
18 - 20      7
21 - 24      12

For a grouped frequency table, you'll need to work out the midpoint of each class for the x variable. The formula is:
$$\text{Midpoint} = \frac{\text{lower class boundary} + \text{upper class boundary}}{2}$$
Once you have the midpoints, you need to multiply f and x:

Time (sec)   Frequency (f)   Midpoint (x)   fx
0 - 9        0               4.75           0
10 - 14      8               12             96
15 - 17      3               16             48
18 - 20      7               19             133
21 - 24      12              23             276
Total        30                             553

Add up the fx column and then divide by the total of the frequency column to find the mean:
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{553}{30} = 18.4 \text{ (3 s.f.)}$$
Standard Deviation
For an ordinary set of data, the standard deviation is found by the following:
$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$
(Variance is the same formula, but without the square root.) For a frequency table, or grouped frequency table, though, we again have a slightly different formula:
$$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$$
Taking the above as an example, we need to add an fx² column. Be careful with this: only the x is squared, not the whole of (fx).
Time (sec)   Frequency (f)   Midpoint (x)   fx     fx²
0 - 9        0               4.75           0      0
10 - 14      8               12             96     1152
15 - 17      3               16             48     768
18 - 20      7               19             133    2527
21 - 24      12              23             276    6348
Total        30                             553    10795

Now add up the fx² and f columns, and write in the mean squared:
$$\sigma = \sqrt{\frac{10795}{30} - \left(\frac{553}{30}\right)^2}$$
Stick all that in your calculator and you'll get the answer: 4.48 (3 s.f.)
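The table calculation can be checked with a short Python sketch. The function name `grouped_mean_sd` is hypothetical, and the midpoints are used exactly as listed in the table.

```python
from math import sqrt

def grouped_mean_sd(midpoints, frequencies):
    """Mean and standard deviation estimated from a grouped frequency table."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, frequencies))
    mean = sum_fx / n
    variance = sum_fx2 / n - mean ** 2   # mean of the squares minus square of the mean
    return mean, sqrt(variance)

midpoints = [4.75, 12, 16, 19, 23]   # as listed in the table
frequencies = [0, 8, 3, 7, 12]

mean, sd = grouped_mean_sd(midpoints, frequencies)
print(round(mean, 1), round(sd, 2))
```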
Coding
When the numbers are too large to be worked with comfortably, there is another option for finding the mean: coding. This replaces x (the midpoint) with y, which is connected to x by a formula that makes it a smaller number.
Use the code
$$y = \frac{x - 45.5}{10}$$
to calculate the mean and standard deviation of the following frequency table:

x      Frequency (f)
15.5   8
25.5   12
35.5   15
45.5   16
55.5   11
65.5   6
75.5   2

We need to add the code column, work out y, and then add columns for fy and fy² (rather than fx and fx²):

x       f     y     fy     fy²
15.5    8     -3    -24    72
25.5    12    -2    -24    48
35.5    15    -1    -15    15
45.5    16    0     0      0
55.5    11    1     11     11
65.5    6     2     12     24
75.5    2     3     6      18
Total   70          -34    188
Next, work out the mean of y, using the formula:
$$\bar{y} = \frac{\sum fy}{\sum f} = \frac{-34}{70} = -0.486 \text{ (3 s.f.)}$$
We think back to the original code:
$$y = \frac{x - 45.5}{10}$$
If we replace y with $\bar{y}$ here, we can replace x with $\bar{x}$. Put in the numbers, and rearrange to make $\bar{x}$ the subject of the formula:
$$\bar{x} = 10\bar{y} + 45.5 = 40.6 \text{ (3 s.f.)}$$
and that's your answer! For the standard deviation the approach is the same: work out the standard deviation of y, then undo the code. Thinking about dispersion, adding and subtracting won't affect the standard deviation, but dividing and multiplying will, so here $\sigma_x = 10\sigma_y$.
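A short Python sketch of the coding method, assuming the code y = (x − 45.5)/10 implied by the y column of the table. The variable names are chosen for the example.

```python
from math import sqrt

xs = [15.5, 25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
fs = [8, 12, 15, 16, 11, 6, 2]

# Code the midpoints: y = (x - 45.5) / 10
ys = [(x - 45.5) / 10 for x in xs]

n = sum(fs)
mean_y = sum(f * y for f, y in zip(fs, ys)) / n
sd_y = sqrt(sum(f * y * y for f, y in zip(fs, ys)) / n - mean_y ** 2)

# Decode: the mean undoes the whole code, the standard deviation
# is only affected by the division by 10.
mean_x = 10 * mean_y + 45.5
sd_x = 10 * sd_y
print(round(mean_x, 1), round(sd_x, 1))
```

Working directly with the uncoded x values gives the same mean, which is a useful check on the decoding step.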
Histograms
Histograms are used for representing data that is continuous and is summarised in a grouped frequency distribution.
• There are no gaps between the bars.
• The area of each bar is proportional to the frequency.
Example: The heights of twenty children (to the nearest cm) were recorded in the following frequency table. Draw a histogram to represent the data.

Height (cm)   Frequency (f)
120-124       1
125-129       5
130-134       7
135-139       4
140-149       3

There are two columns that we need to add: the class width and the frequency density.
Class width is the width of each group. Be careful to work it out from the lower class boundary and the upper class boundary. For example, 120-124 is actually 119.5 to 124.5, and so the class width is 5. Frequency density is the frequency divided by the class width.

Height (cm)   Frequency (f)   Class Width   Frequency Density
120-124       1               5             0.2
125-129       5               5             1
130-134       7               5             1.4
135-139       4               5             0.8
140-149       3               10            0.3

When we have these values, we plot the lower and upper class boundaries on the x-axis and the frequency density on the y-axis.
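The two extra columns can be computed as in the sketch below. It assumes heights recorded to the nearest cm, so each class boundary sits 0.5 either side of the stated limits; the variable names are illustrative only.

```python
classes = [(120, 124), (125, 129), (130, 134), (135, 139), (140, 149)]
frequencies = [1, 5, 7, 4, 3]

rows = []
for (low, high), f in zip(classes, frequencies):
    lower_boundary = low - 0.5     # e.g. 120 becomes 119.5
    upper_boundary = high + 0.5    # e.g. 124 becomes 124.5
    width = upper_boundary - lower_boundary
    density = f / width            # frequency density = frequency / class width
    rows.append((lower_boundary, upper_boundary, width, density))

widths = [row[2] for row in rows]
densities = [row[3] for row in rows]
print(widths, densities)
```

Multiplying each density by its width gives back the frequency, which is why the area of each bar represents the frequency.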
Skewness
From the histogram above, we see a slight positive skew: more of the values lie towards the lower end, with a tail stretching towards the higher values. Data can be positively skewed, symmetrical or negatively skewed, and there are tests to differentiate between them:

            Positive Skew          Symmetrical            Negative Skew
Averages    Mean > Median > Mode   Mean = Median = Mode   Mean < Median < Mode
Quartiles   Q₂ − Q₁ < Q₃ − Q₂      Q₂ − Q₁ = Q₃ − Q₂      Q₂ − Q₁ > Q₃ − Q₂
Correlation
Correlation is a measure of the relationship between two or more variables. When we have two sets of data, we can draw a scatter diagram to see if there is any correlation between them.
Data: The marks of 10 candidates in Maths and Physics are shown below:

Candidate   Physics (x)   Maths (y)
1           18            42
2           20            54
3           30            60
4           40            54
5           46            62
6           54            68
7           60            80
8           80            66
9           88            80
10          92            100
From the data, we can plot each x value against its corresponding y value. The only difference from a line graph is that we don't join the crosses with a line. We can already see that the data is positively correlated. A way to test this is to divide the graph into four quadrants, and then look at where the majority of the points lie:
If most points lie in the 1st and 3rd quadrants, we have a positive correlation.
If most points lie in the 2nd and 4th quadrants, we have a negative correlation.
If points lie in all four quadrants randomly, we have no correlation.
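However, just looking at the scatter diagram is a bit imprecise. It's much better to calculate the strength of the correlation. There's a formula for this called the PMCC (Product Moment Correlation Coefficient):
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
How to calculate Sxy, Sxx and Syy:
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \quad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \quad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$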
From the above information, we complete the following table:

x     y      x²      y²       xy
18    42     324     1764     756
20    54     400     2916     1080
30    60     900     3600     1800
40    54     1600    2916     2160
46    62     2116    3844     2852
54    68     2916    4624     3672
60    80     3600    6400     4800
80    66     6400    4356     5280
88    80     7744    6400     7040
92    100    8464    10000    9200

Σx = 528, Σy = 666, Σx² = 34464, Σy² = 46820, Σxy = 38640
If you're lucky the question will already give you these figures, and all you'll be asked to do is use them.
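Now using the PMCC formula:
$$S_{xy} = 38640 - \frac{528 \times 666}{10} = 3475.2$$
$$S_{xx} = 34464 - \frac{528^2}{10} = 6585.6$$
$$S_{yy} = 46820 - \frac{666^2}{10} = 2464.4$$
$$r = \frac{3475.2}{\sqrt{6585.6 \times 2464.4}} = 0.863 \text{ (3 s.f.)}$$
The PMCC always satisfies −1 ≤ r ≤ 1, with −1 being perfect negative correlation, 0 being no correlation and +1 being perfect positive correlation. 0.863 is strong positive correlation. Even if we code the data, the PMCC remains the same.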
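The PMCC for the Physics and Maths marks can be checked in Python. This is a sketch: the helper name `s` and the particular coding of the Physics marks are made up for the example, using the standard summation form of Sxy, Sxx and Syy.

```python
from math import sqrt

physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]

def s(us, vs):
    """S_uv = sum(uv) - sum(u) * sum(v) / n; gives S_uu when us is vs."""
    n = len(us)
    return sum(u * v for u, v in zip(us, vs)) - sum(us) * sum(vs) / n

sxy, sxx, syy = s(physics, maths), s(physics, physics), s(maths, maths)
r = sxy / sqrt(sxx * syy)

# Coding one variable (an arbitrary linear code) leaves r unchanged.
coded = [(x - 50) / 2 for x in physics]
r_coded = s(coded, maths) / sqrt(s(coded, coded) * syy)

print(round(r, 3), round(r_coded, 3))
```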
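Least squares regression line
The regression line of y on x has the form:
$$y = a + bx, \quad b = \frac{S_{xy}}{S_{xx}}, \quad a = \bar{y} - b\bar{x}$$
We can work out b easily enough from the data above:
$$b = \frac{3475.2}{6585.6} = 0.528 \text{ (3 s.f.)}$$
For a we need the means:
$$\bar{y} = \frac{666}{10} = 66.6, \quad \bar{x} = \frac{528}{10} = 52.8$$
$$a = 66.6 - 0.528 \times 52.8 = 38.7 \text{ (3 s.f.)}$$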
If the question asks you to draw the regression line, an easy way is to plot the mean point $(\bar{x}, \bar{y})$ on the scatter diagram, and then draw the line from the y-axis intercept through this point. The mean point always lies on the line. If the data is coded, we need to decode when finding the mean.
An independent (explanatory) variable is one that is set independently of the other variable. (Plotted on the x-axis.)
A dependent (response) variable is one whose values are determined by the values of the independent variable. (Plotted on the y-axis.)
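A minimal sketch of the regression calculation for the Physics and Maths marks, assuming the usual form y = a + bx with b = Sxy/Sxx and a = ȳ − b x̄. The `predict` helper is hypothetical.

```python
physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]    # x
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]     # y

n = len(physics)
mean_x = sum(physics) / n
mean_y = sum(maths) / n

sxy = sum(x * y for x, y in zip(physics, maths)) - sum(physics) * sum(maths) / n
sxx = sum(x * x for x in physics) - sum(physics) ** 2 / n

b = sxy / sxx                # gradient
a = mean_y - b * mean_x      # intercept on the y-axis

def predict(x):
    """Estimate y from x; only reliable inside the range of the data."""
    return a + b * x

print(round(a, 1), round(b, 3), round(predict(mean_x), 1))
```

Evaluating `predict` at the mean of x returns the mean of y, which is the "mean point lies on the line" property used for drawing the line.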
Interpolation is when you estimate the value of a dependent variable within the range of the data. Extrapolation is when you estimate a value outside the range of the data. Values estimated by extrapolation can be unreliable.
Probability
If A is an event, the probability of it occurring is the number of ways A can occur, divided by the size of the sample space (the total number of outcomes, S):
$$P(A) = \frac{n(A)}{n(S)}$$
A probability always satisfies 0 ≤ p ≤ 1. If you have a probability p(A), the probability of not getting A is written as p(A'). To find p(A'), we simply take p(A) away from 1:
$$p(A') = 1 - p(A)$$
A ∩ B means A "intersection" B: all elements that are in A and in B. We can see this on a Venn diagram:
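A ∪ B means A "union" B: all elements that are in A or in B (or both). On a Venn diagram this is:
Addition Rule
This is the addition rule for finding P(A ∪ B):
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
We can rearrange this to get:
$$P(A \cap B) = P(A) + P(B) - P(A \cup B)$$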
Example: There are 15 books on a bookshelf. 10 of these are fiction, 4 of which are hardback. 6 books in total are hardback and the remaining 9 are paperback. Find the probability that a book chosen at random is hardback fiction.
The first stage is to draw a Venn diagram and write in all the numbers:
We're looking for p(H ∩ F), so where is it both H and F? Where the two circles overlap, so 4/15.
Find the probability that a hardback is chosen but is not fiction.
We want p(H ∩ F'), which is 2/15.
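The bookshelf example can be modelled with Python sets. The book labels 0 to 14 and which labels are hardback are arbitrary choices made for illustration; only the counts come from the question.

```python
from fractions import Fraction

books = set(range(15))             # 15 books, labelled 0-14
fiction = set(range(10))           # 10 fiction books: labels 0-9
hardback = {0, 1, 2, 3, 10, 11}    # 4 fiction hardbacks + 2 non-fiction hardbacks

def p(event):
    """Probability of picking a book in `event` at random."""
    return Fraction(len(event), len(books))

p_h_and_f = p(hardback & fiction)        # intersection
p_h_not_f = p(hardback - fiction)        # hardback but not fiction
p_h_or_f = p(hardback | fiction)         # union

print(p_h_and_f, p_h_not_f, p_h_or_f)
```

The union also equals p(H) + p(F) − p(H ∩ F), which is the addition rule.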
Conditional Probability
This occurs when the probability of A is conditional upon B having already occurred: given B, find the probability of A. It's written as p(A|B).
We use tree diagrams to solve conditional probability problems.
Example: A bag contains 6 red and 4 blue balls. 2 balls are picked at random and retained. Find the probability that both balls are red.
First, draw out a tree diagram.
We want p(R ∩ R), so we just follow the tree diagram along:
6/10 × 5/9 = 30/90 = 1/3.
Find the probability that the balls are different colours.
We want p(R ∩ B) and p(B ∩ R). Multiply along both branches and then add these together:
p(R ∩ B) = 6/10 × 4/9 = 24/90
p(B ∩ R) = 4/10 × 6/9 = 24/90
24/90 + 24/90 = 48/90 = 8/15.
Find the probability that the second ball is red, given the first is blue.
We want p(R|B), so we use the formula $p(A \mid B) = \frac{p(A \cap B)}{p(B)}$:
$$p(R \mid B) = \frac{p(B \cap R)}{p(B)} = \frac{24/90}{4/10} = \frac{2}{3}$$
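Instead of a tree diagram, the three answers can be checked by listing every ordered pair of balls. This is a brute-force sketch; the helper `prob` and the labelling of the balls are assumptions for the example.

```python
from fractions import Fraction
from itertools import permutations

balls = ["R"] * 6 + ["B"] * 4

# Every ordered way of drawing two different balls (no replacement): 10 x 9 = 90.
draws = list(permutations(range(len(balls)), 2))

def prob(condition):
    """Probability that the (first, second) colours satisfy `condition`."""
    favourable = sum(1 for i, j in draws if condition(balls[i], balls[j]))
    return Fraction(favourable, len(draws))

both_red = prob(lambda first, second: first == "R" and second == "R")
different = prob(lambda first, second: first != second)

# Conditional probability: p(second red | first blue) = p(B then R) / p(first B)
blue_then_red = prob(lambda first, second: first == "B" and second == "R")
first_blue = prob(lambda first, second: first == "B")
red_given_blue = blue_then_red / first_blue

print(both_red, different, red_given_blue)
```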
Independent Events
Events are independent when the outcome of one doesn't affect the probability of the next. For example, if balls are taken from a bag and replaced, the probability of a red ball is the same no matter how many times you pick from the bag. This means:
$$P(A \cap B) = P(A) \times P(B)$$
If events are mutually exclusive, they cannot occur at the same time and p(A ∩ B) is 0.
This means that:
$$P(A \cup B) = P(A) + P(B)$$
Sample Space Diagram
Example: A dice is thrown twice and the scores obtained are added together. Find the probability that the total score is 6.

Second Throw
  6 |  7   8   9  10  11  12
  5 |  6   7   8   9  10  11
  4 |  5   6   7   8   9  10
  3 |  4   5   6   7   8   9
  2 |  3   4   5   6   7   8
  1 |  2   3   4   5   6   7
    +------------------------
       1   2   3   4   5   6   First Throw

There are 36 equally likely outcomes.
5 of the outcomes result in a total of 6, so the probability is 5/36.
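The sample space can also be listed in Python rather than drawn. This is a small sketch using `itertools.product` from the standard library, with variable names chosen for the example.

```python
from fractions import Fraction
from itertools import product

# All (first throw, second throw) pairs for a six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Keep the pairs whose total is 6.
favourable = [pair for pair in outcomes if sum(pair) == 6]

probability = Fraction(len(favourable), len(outcomes))
print(favourable, probability)
```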
Discrete Random Variables
A discrete random variable takes a set of separate numerical values, each with its own probability, such as the "number shown on a fair die". The probability that the variable X takes the value x is written as P(X = x).
Example: A tetrahedral die has the numbers 1, 2, 3, 4 on its faces. The die is biased in such a way that:
P(X = x) = k for x = 1, 2, 3
P(X = x) = 3k for x = 4
If we draw this out in a probability distribution table we get:

x          1    2    3    4
P(X = x)   k    k    k    3k

All the probabilities added together = 1:
k + k + k + 3k = 1
6k = 1
k = 1/6
Therefore, we can write out the probability distribution:

x          1     2     3     4
P(X = x)   1/6   1/6   1/6   1/2
We can also find the cumulative distribution, F(x):

x      1     2     3     4
F(x)   1/6   1/3   1/2   1

The cumulative probability always reaches 1 at the largest value.
P(X ≤ 2) means the probability of getting an X value less than or equal to 2. We add up the probabilities we have, and so, in the above example, P(X ≤ 2) = 1/6 + 1/6 = 1/3.
F(x) means P(X ≤ x), so F(2) = 1/3.
If a question asks for something like F(3.5), X cannot take the value 3.5 in our example. Therefore, we use F(3) instead, which is 1/2.
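The biased die can be sketched in Python with exact fractions. It assumes the faces 1, 2, 3 are equally likely and the face 4 is three times as likely as each of them; the names `weights`, `pmf` and `F` are chosen for the example.

```python
from fractions import Fraction

# Relative weights: faces 1, 2, 3 are equally likely, face 4 is three times as likely.
weights = {1: 1, 2: 1, 3: 1, 4: 3}
total = sum(weights.values())

# Scale so that the probabilities add up to 1.
pmf = {x: Fraction(w, total) for x, w in weights.items()}

def F(value):
    """Cumulative distribution: P(X <= value)."""
    return sum(p for x, p in pmf.items() if x <= value)

print(pmf[4], F(2), F(3.5), F(4))
```

Because `F` adds up every probability at or below the given value, F(3.5) automatically comes out the same as F(3).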
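Mean and Variance
Finding the mean and variance of a discrete random variable is almost identical to finding them for a frequency table. The formula for the mean:
$$E(X) = \sum x P(X = x)$$
For the variance, we have the formula:
$$\text{Var}(X) = E(X^2) - [E(X)]^2$$
To find $E(X^2)$:
$$E(X^2) = \sum x^2 P(X = x)$$
Example: X is a discrete random variable with the following distribution.

x             0     1     2     Total
P(X = x)      0.4   0.5   0.1   1
x P(X = x)    0     0.5   0.2   0.7
x² P(X = x)   0     0.5   0.4   0.9

Therefore,
$$E(X) = 0.7, \quad \text{Var}(X) = 0.9 - 0.7^2 = 0.41$$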
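Suppose Y is the random variable given by Y = 3X − 2, obtained by coding the above table. The table would now look like this:

y             -2     1     4     Total
P(Y = y)      0.4    0.5   0.1   1
y P(Y = y)    -0.8   0.5   0.4   0.1
y² P(Y = y)   1.6    0.5   1.6   3.7

So E(Y) = 0.1 and Var(Y) = 3.7 − 0.1² = 3.69.
Remember the code: Y = 3X − 2. To decode back:
$$E(X) = \frac{E(Y) + 2}{3} = 0.7, \quad \text{Var}(X) = \frac{\text{Var}(Y)}{3^2} = 0.41$$
In general:
$$E(aX + b) = aE(X) + b, \quad \text{Var}(aX + b) = a^2 \text{Var}(X)$$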
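The two tables and the general rule can be checked with exact fractions in Python. This sketch assumes the code Y = 3X − 2, which is what the y values −2, 1, 4 imply; the function names are hypothetical.

```python
from fractions import Fraction

def expectation(dist):
    """E(X) = sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - E(X)^2."""
    mean = expectation(dist)
    return sum(x * x * p for x, p in dist.items()) - mean ** 2

X = {0: Fraction(4, 10), 1: Fraction(5, 10), 2: Fraction(1, 10)}

# Code the variable: Y = aX + b with a = 3, b = -2.
a, b = 3, -2
Y = {a * x + b: p for x, p in X.items()}

print(expectation(X), variance(X), expectation(Y), variance(Y))
```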
A discrete uniform distribution is one where each value of the random variable has the same probability. For example, when X is the score on a fair 6-sided die, each probability would be 1/6. For a discrete uniform distribution over the values 1, 2, 3, …, n:
$$P(X = x) = \frac{1}{n}, \quad E(X) = \frac{n + 1}{2}, \quad \text{Var}(X) = \frac{n^2 - 1}{12}$$
Example: A tetrahedral dice has its faces numbered 1, 2, 3 and 4. X is the score obtained when the dice is rolled.
X therefore has a discrete uniform distribution, with P(X = x) = 1/4 for x = 1, 2, 3, 4.
$$E(X) = \frac{4 + 1}{2} = 2.5$$
The Normal Distribution
- Symmetrical about the mean.
- Total area under the curve = 1.
- Probabilities correspond to areas under the curve.
- A continuous distribution (therefore there is no difference between P(X < a) and P(X ≤ a)).
- 68% of the distribution lies within 1 standard deviation of the mean.
- 95% of the distribution lies within 2 standard deviations of the mean.
- 99.7% of the distribution lies within 3 standard deviations of the mean.
Examples:
- The masses of new born babies.
- IQ of school students.
- Hand span of adult females.
- Height of plants growing in a field.
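The 68%, 95% and 99.7% figures can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a numerical check on the standard normal distribution, not part of the exam method, which uses tables.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # standard normal distribution

# Area under the curve within k standard deviations of the mean.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(k, round(area, 4))
```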
Working out Probabilities using tables
• If P(Z < a) is greater than 0.5, then a will be > 0.
• If P(Z < a) is less than 0.5, then a will be < 0.
• If P(Z > a) is less than 0.5, then a will be > 0.
• If P(Z > a) is more than 0.5, then a will be < 0.
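These sign rules can be illustrated with `NormalDist.inv_cdf` from the Python standard library, which plays the role of reading the tables backwards. The probabilities 0.8, 0.3 and 0.2 are arbitrary values picked for the example.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

# P(Z < a) = 0.8 is greater than 0.5, so a should be positive.
a_above = Z.inv_cdf(0.8)

# P(Z < a) = 0.3 is less than 0.5, so a should be negative.
a_below = Z.inv_cdf(0.3)

# P(Z > a) = 0.2 is the same as P(Z < a) = 1 - 0.2, so a is positive again.
a_tail = Z.inv_cdf(1 - 0.2)

print(round(a_above, 2), round(a_below, 2), round(a_tail, 2))
```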