Graphs and Numerical Summaries
1 Data Types
2 Graphs
3 Measures of Center
4 Measures of Spread
5 Five-Number Summaries, Boxplots, z-Scores
6 Misleading Graphs
www.apsu.edu/jonesmatt
1 Data Types
Categorical variables take values from a set of categories.
Quantitative discrete variables take values from a finite or countable set of numbers, like {0, 1, 2, 3, ...}.
Quantitative continuous variables take values from an interval of numbers, like [0, ∞) or [0, 10].
Examples:
Relative Frequencies: Proportions and Percentages
proportion of a category = (frequency of category) / (total number of observations)
percentage of a category = proportion × 100%
A frequency table lists the possible variable values with the frequencies for each value.
Examples:
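As a minimal sketch of a frequency table with proportions and percentages, the Python below counts a small categorical data set. The eye-color observations are hypothetical illustration data, not values from the course.

```python
from collections import Counter

# Hypothetical categorical observations (illustration only).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "hazel", "brown"]

counts = Counter(eye_colors)   # frequency of each category
total = len(eye_colors)        # total number of observations

# category -> (frequency, proportion, percentage)
table = {
    category: (freq, freq / total, 100 * freq / total)
    for category, freq in counts.items()
}

print(f"{'category':<10}{'freq':>6}{'prop':>8}{'pct':>8}")
for category, (freq, prop, pct) in sorted(table.items()):
    print(f"{category:<10}{freq:>6}{prop:>8.3f}{pct:>7.1f}%")
```

The proportions always add up to 1 and the percentages to 100%, which is a quick check on any frequency table.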
2 Graphs
Group qualitative data using pie charts or bar charts. Group quantitative data using histograms, stem and leaf plots, or dot plots.
The distribution of a set of data is a graph, table, or mathematical formula that indicates the different kinds of possible observations and how often they occur. Distributions of quantitative data have shape, and the shape of a distribution can be determined by looking at histograms, stem and leaf plots, or dot plots.
Example 1 Use graphs to describe some categorical variables for your classmates.
Example 2 Use graphs to describe some quantitative variables for your classmates.
3 Measures of Center
Summation Notation
Given numbers $x_1, x_2, x_3, \ldots, x_n$, we succinctly express their sum $x_1 + x_2 + x_3 + x_4 + \cdots + x_n$ as $\sum_i x_i$.
Examples:
Mean, Median, Mode
A measure of center is a one-number description of a distribution or data set, and we focus on three.
Mean or average: the sum of the numbers divided by the number of numbers: $\text{mean} \equiv \frac{\sum_i x_i}{n}$
Median or 50th percentile: a number that separates the lower 50% and upper 50% of the numbers.
Mode: the number that occurs most frequently in the set. There can be more than one mode.
Which measure of center is best?
Example 3 In the 2002 Winter Olympics, figure skater Michelle Kwan competed in the short program ladies single event. She received the following scores for technical merit: 5.8, 5.7, 5.9, 5.7, 5.5, 5.7, 5.7, 5.7, 5.6
Find the mean, median, and mode.
Throw out the score of 5.5 and again find the mean, median, and mode.
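One way to check Example 3 is with Python's standard `statistics` module. This is a sketch, not the course's prescribed method; the helper name `summarize` is made up for illustration.

```python
from statistics import mean, median, multimode

scores = [5.8, 5.7, 5.9, 5.7, 5.5, 5.7, 5.7, 5.7, 5.6]

def summarize(data):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(data), median(data), multimode(data)

all_scores = summarize(scores)

# Throw out the single score of 5.5 and recompute.
trimmed = [s for s in scores if s != 5.5]
without_low = summarize(trimmed)

print("all nine scores:", all_scores)
print("without 5.5:    ", without_low)
```

With all nine scores the mean, median, and mode are all 5.7. Dropping 5.5 raises the mean to about 5.725, while the median and mode stay at 5.7.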
Properties
Resistant measures are not sensitive to extreme data. The median is resistant; the mean is not.
Example 4 Compare the mean and median of the salaries $13,000, $32,000, $45,000 with the mean and median of the salaries $13,000, $32,000, $250,000.
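The comparison in Example 4 can be sketched in a few lines of Python. The variable names `salaries_a` and `salaries_b` are illustrative labels for the two salary sets.

```python
from statistics import mean, median

salaries_a = [13_000, 32_000, 45_000]
salaries_b = [13_000, 32_000, 250_000]  # same data with one extreme value

for label, data in (("A", salaries_a), ("B", salaries_b)):
    print(f"set {label}: mean = {mean(data):,.2f}, median = {median(data):,.2f}")
```

The median is $32,000 for both sets, but the mean jumps from $30,000 to about $98,333 when the largest salary becomes $250,000. That is what it means for the median to be resistant and the mean not.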
Population Mean and Sample Mean
The population mean $\mu$ (pronounced myoo) of a population of size $N$ is the average of all values $x_1, x_2, \ldots, x_N$ in the population: $\mu = \frac{1}{N} \sum_i x_i$
The sample mean $\bar{x}$ (say x-bar) of a sample of size $n$ is the average of all values $x_1, x_2, \ldots, x_n$ in the sample: $\bar{x} = \frac{1}{n} \sum_i x_i$
4 Measures of Spread
Measures of spread summarize how far data are spread out. We focus on three.
Standard deviation: used when the mean is the measure of center. It is the most important measure of spread (definition to follow).
Range: the largest value minus the smallest value.
Interquartile range: used when the median is the measure of center (definition to follow).
Other Important Sums (leading up to measuring the spread)
Sum of squared distances from 0: $\sum_i x_i^2$
Sum of distances from the mean: $\sum_i (x_i - \bar{x})$
Sum of squared distances from the mean: $\sum_i (x_i - \bar{x})^2$
Example 5 History Exam Scores
| $x$ | $x^2$ | $x - \bar{x}$ | $(x - \bar{x})^2$ |
| 91 | | | |
| 95 | | | |
| 92 | | | |
| 76 | | | |
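The columns of the Example 5 table can be filled in with a short Python sketch. The last printed row holds the column sums, which are the three sums introduced above plus the plain sum of the scores.

```python
scores = [91, 95, 92, 76]
n = len(scores)
x_bar = sum(scores) / n                      # mean of the four scores

squares = [x**2 for x in scores]             # x^2
deviations = [x - x_bar for x in scores]     # x - x_bar
sq_deviations = [d**2 for d in deviations]   # (x - x_bar)^2

print(f"{'x':>4}{'x^2':>8}{'x - x_bar':>12}{'(x - x_bar)^2':>16}")
for x, x2, d, d2 in zip(scores, squares, deviations, sq_deviations):
    print(f"{x:>4}{x2:>8}{d:>12.2f}{d2:>16.2f}")

# Column totals
print(f"{sum(scores):>4}{sum(squares):>8}"
      f"{sum(deviations):>12.2f}{sum(sq_deviations):>16.2f}")
```

The mean is 88.5, the deviations from the mean sum to 0 (as they always do), and the squared deviations sum to 217. Because the plain deviations cancel, the squared deviations are what measures of spread are built on.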
Population and Sample Standard Deviation
Roughly speaking, the standard deviation measures the variation in a population or data set by indicating how far, on average, each number is from the mean.
• Population standard deviation $\sigma$: $\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
• Sample standard deviation $s$: $s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$
Example 6 The ages in years of all seven MATH 4270 students are 26, 22, 24, 21, 23, 32, 24.
Find the population standard deviation and the range.
Take a simple random sample of size three. Then find the sample standard deviation and range of three students’ ages.
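A possible way to work Example 6 in Python is sketched below. `pstdev` divides by $N$ and `stdev` divides by $n - 1$, matching the two formulas. The seed value is an arbitrary assumption, used only so that the same sample is drawn on every run; a different seed gives a different sample and a different $s$.

```python
import random
from statistics import pstdev, stdev

ages = [26, 22, 24, 21, 23, 32, 24]   # the whole population (N = 7)

sigma = pstdev(ages)                  # population standard deviation
pop_range = max(ages) - min(ages)

rng = random.Random(4270)             # arbitrary seed, for reproducibility only
sample = rng.sample(ages, 3)          # simple random sample, without replacement
s = stdev(sample)                     # sample standard deviation
sample_range = max(sample) - min(sample)

print(f"population: sigma = {sigma:.3f}, range = {pop_range}")
print(f"sample {sample}: s = {s:.3f}, range = {sample_range}")
```

For the full population, $\sigma \approx 3.375$ years and the range is 11 years. The sample values depend on which three students are drawn.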
Facts About Standard Deviation
The more variation among data in a sample, the larger the standard deviation.
Like the mean, the standard deviation is not resistant because its value is affected by extreme data points.
Empirical Rule: For bell-shaped distributions,
• about 68.27% of all possible observations lie within one $\sigma$ of $\mu$.
• about 95.45% of all possible observations lie within two $\sigma$s of $\mu$.
• about 99.73% of all possible observations lie within three $\sigma$s of $\mu$.
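The three percentages in the Empirical Rule can be reproduced with a short sketch, assuming the bell-shaped distribution is exactly normal. Under that assumption, the probability of falling within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$. The function name `prob_within` is illustrative.

```python
import math

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

percentages = {k: 100 * prob_within(k) for k in (1, 2, 3)}
for k, pct in percentages.items():
    print(f"within {k} sigma: {pct:.2f}%")
```

Real data that are only roughly bell-shaped will match these figures only approximately.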
5 Five-Number Summaries, Boxplots, z-Scores
The first quartile $Q_1$ is the 25th percentile, and is the same as the median of the data at or below the median.
The second quartile $Q_2$ is the 50th percentile, and is the same as the median.
The third quartile $Q_3$ is the 75th percentile, and is the same as the median of the data at or above the median.
Example 7 Twenty people reportedly watched the following numbers of hours of TV weekly: 8, 22, 34, 16, 13, 26, 19, 23, 25, 31, 34, 30, 31, 20, 22, 41, 32, 30, 39, 29
Find the quartiles.
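The quartile definitions above can be sketched in Python as "the median of each half". This is one reading of the definition: the halves are split by position in the sorted data, so when the number of values is odd the median belongs to both halves. Other conventions, including the defaults in many software packages, can give slightly different quartiles.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-each-half rule."""
    s = sorted(data)
    n = len(s)
    lower = s[: (n + 1) // 2]   # data at or below the median's position
    upper = s[n // 2 :]         # data at or above the median's position
    return median(lower), median(s), median(upper)

tv_hours = [8, 22, 34, 16, 13, 26, 19, 23, 25, 31,
            34, 30, 31, 20, 22, 41, 32, 30, 39, 29]

q1, q2, q3 = quartiles(tv_hours)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

For the Example 7 data this gives $Q_1 = 21$, $Q_2 = 27.5$, and $Q_3 = 31.5$.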
The interquartile range (IQR) is computed as $Q_3 - Q_1$ (this is our third measure of spread). The IQR is not sensitive to extreme values and is therefore a resistant measure of spread. The IQR is used as the measure of spread when the median is used as the measure of center.
Example 8 Compute the range and IQR for the data in Example 7.
The five-number summary of data consists of the min, max, $Q_1$, $Q_2$, and $Q_3$.
Example 9 Write the five-number summary for the following 100 meter race times (in seconds): 10.69, 11.11, 11.18, 12.44, 10.76, 10.88, 10.64
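A sketch of the five-number summary for the Example 9 race times follows. It assumes the median-of-each-half rule for quartiles, with the median included in both halves because the number of values is odd. The function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using the median-of-each-half rule."""
    s = sorted(data)
    n = len(s)
    q1 = median(s[: (n + 1) // 2])
    q3 = median(s[n // 2 :])
    return s[0], q1, median(s), q3, s[-1]

race_times = [10.69, 11.11, 11.18, 12.44, 10.76, 10.88, 10.64]  # seconds

low, q1, q2, q3, high = five_number_summary(race_times)
data_range = high - low
iqr = q3 - q1

print(f"min = {low}, Q1 = {q1:.3f}, Q2 = {q2}, Q3 = {q3:.3f}, max = {high}")
print(f"range = {data_range:.2f}, IQR = {iqr:.2f}")
```

Here the range is 1.80 seconds but the IQR is only 0.42 seconds, because the range is stretched by the single slow time of 12.44.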
Outliers
Outlier(s): data value(s) that is (are) far from most of the data.
Lower limit: $Q_1 - (1.5)(\text{IQR})$
Upper limit: $Q_3 + (1.5)(\text{IQR})$
Data greater than the upper limit or less than the lower limit are potential outliers.
Examples: human heights of 8' 11", miles-per-gallon ratings greater than 95, or people struck by lightning more than five times.
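The limits can be sketched as a small Python helper. The example applies it to the Example 7 TV data, taking $Q_1 = 21$ and $Q_3 = 31.5$ as given (these values come from the median-of-each-half rule). The function names are illustrative.

```python
def outlier_limits(q1, q3):
    """Return (lower limit, upper limit) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(data, q1, q3):
    """Return the values lying outside the lower and upper limits."""
    lower, upper = outlier_limits(q1, q3)
    return [x for x in data if x < lower or x > upper]

tv_hours = [8, 22, 34, 16, 13, 26, 19, 23, 25, 31,
            34, 30, 31, 20, 22, 41, 32, 30, 39, 29]

lower, upper = outlier_limits(21, 31.5)
flagged = potential_outliers(tv_hours, 21, 31.5)
print(f"limits: [{lower}, {upper}], potential outliers: {flagged}")
```

The limits are 5.25 and 47.25 hours, and every observation lies between them, so this data set has no potential outliers.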
Boxplots
• Determine the 5-number summary.
• Compute lower & upper limits.
• Mark and label the quartiles with vertical lines and box them in.
• Indicate all potential outliers with ∗ and label them.
• Mark and label the smallest & largest values occurring within upper and lower limits with vertical lines, and connect the lines to the box (these are called adjacent values).
Example 10 Make a boxplot for the following eye pressures (in mmHg) of fifteen Caucasians and African Americans: 16.2 16.7 15.3 15.9 24.6 18.4 17.2 15.8 16.7 17.8 16.1 14.9 16.6 21.2
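The numbers needed to draw the boxplot can be computed with the sketch below, which uses the fourteen eye pressures listed in Example 10 and assumes the median-of-each-half rule for quartiles. It only computes the ingredients; drawing the plot itself would need a plotting library.

```python
from statistics import median

pressures = [16.2, 16.7, 15.3, 15.9, 24.6, 18.4, 17.2,
             15.8, 16.7, 17.8, 16.1, 14.9, 16.6, 21.2]  # mmHg, as listed

s = sorted(pressures)
n = len(s)
q1, q2, q3 = median(s[: (n + 1) // 2]), median(s), median(s[n // 2 :])

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Potential outliers are drawn as * on the boxplot.
outliers = [x for x in s if x < lower_limit or x > upper_limit]

# Adjacent values: most extreme observations still inside the limits.
inside = [x for x in s if lower_limit <= x <= upper_limit]
adjacent = (inside[0], inside[-1])

print(f"Q1 = {q1}, Q2 = {q2:.2f}, Q3 = {q3}")
print(f"limits = ({lower_limit:.2f}, {upper_limit:.2f})")
print(f"adjacent values = {adjacent}, potential outliers = {outliers}")
```

The box runs from 15.9 to 17.8 with the median line at 16.65, the whiskers end at the adjacent values 14.9 and 18.4, and 21.2 and 24.6 are marked as potential outliers.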
z-Scores (Standardized Data)
Data can be standardized so that different data sets can be compared, or to compare values within the same data set.
Example 11 The average height of men is 69 inches with a standard deviation of 2.8 inches. The average height of women is 63.6 inches with a standard deviation of 2.5 inches. Michael Jordan is 78 inches tall. Rebecca Lobo is 76 inches tall. Relatively speaking, who is taller?
Jordan's and Lobo's heights should be standardized relative to those of their genders so their heights can be compared. If $x$ is a variable, then $z = (x - \mu)/\sigma$ is the standardized version or z-score of $x$. Calculate the z-scores of Jordan's and Lobo's heights.
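Example 11 can be worked with a one-line z-score function, sketched here in Python. The name `z_score` is illustrative.

```python
def z_score(x, mu, sigma):
    """Standardize x relative to a distribution with mean mu and std. dev. sigma."""
    return (x - mu) / sigma

z_jordan = z_score(78, mu=69, sigma=2.8)     # relative to men's heights
z_lobo = z_score(76, mu=63.6, sigma=2.5)     # relative to women's heights

print(f"Jordan: z = {z_jordan:.2f}")
print(f"Lobo:   z = {z_lobo:.2f}")
```

Jordan's height is about 3.21 standard deviations above the men's mean, while Lobo's is 4.96 standard deviations above the women's mean, so relatively speaking Lobo is taller.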
Facts About z-Scores
The mean of the z-scores of a population is always 0.
The standard deviation of the z-scores of a population is always 1.
Most z-scores will fall between -3 and 3.
z-scores never have units!
Example 12 Body temperatures of healthy human children have mean $\mu = 98.60$°F and standard deviation $\sigma = 0.62$°F. Your child has a temperature of 101°F. What should you do?
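The first two facts can be checked numerically on any population. The sketch below uses the seven ages from Example 6 as the population, then computes the z-score for the temperature in Example 12.

```python
from statistics import mean, pstdev

ages = [26, 22, 24, 21, 23, 32, 24]          # Example 6 population
mu, sigma = mean(ages), pstdev(ages)

z = [(x - mu) / sigma for x in ages]         # standardized ages (no units)
z_mean = mean(z)
z_sd = pstdev(z)
print(f"mean of z-scores = {z_mean:.6f}, st. dev. of z-scores = {z_sd:.6f}")

# Example 12: a temperature of 101 degrees F relative to healthy children.
z_temp = (101 - 98.60) / 0.62
print(f"z-score of 101 degrees F = {z_temp:.2f}")
```

The z-scores have mean 0 and standard deviation 1, up to rounding error. The temperature has $z \approx 3.87$, outside the range from -3 to 3 where most z-scores fall, so it is unusually high for a healthy child.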
6 Misleading Graphs
Truncated graphs magnify relative frequency differences between categories.
Example 13 Value of a mutual fund over time.
Improper scaling gives incorrect impressions about the relative differences between categories.
Example 14 Golf Balls.
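The effect of truncating an axis can be put in numbers. In the sketch below, the two values (50 and 52) and the truncated baseline (48) are hypothetical, chosen only to illustrate the idea; they are not taken from Example 13 or 14.

```python
def drawn_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for b versus a when the axis starts at baseline."""
    return (b - baseline) / (a - baseline)

value_a, value_b = 50.0, 52.0   # hypothetical category values

true_ratio = drawn_height_ratio(value_a, value_b)                  # axis starts at 0
truncated_ratio = drawn_height_ratio(value_a, value_b, baseline=48.0)

print(f"axis from 0:  second bar is {true_ratio:.2f} times as tall")
print(f"axis from 48: second bar is {truncated_ratio:.2f} times as tall")
```

A 4% difference looks like a doubling once the axis starts at 48, which is how a truncated graph magnifies the difference between categories.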