READING GUIDE - DAY 2 UNIVARIATE DATA SUMMARY ECO 1005, 2017 FALL INSTRUCTOR : JUNGMO YOON HANYANG UNIVERSITY
Read Chapters 2.3, 3.1, and 4.1-4.3. We will have a brief R session.

1. Graphical Techniques

Graphical techniques to describe a dataset: histogram, scatter plot, and box plot.
(i) A histogram visualizes the distribution of a variable.
(ii) A scatter plot visualizes the relationship between two variables.
(iii) A box plot visualizes the distribution of a variable. It is useful for comparing the distributions of multiple variables.

Illustration: Class size and test scores. We will use the California Test Score Data frequently.
(i) The California Standardized Testing and Reporting dataset contains data on test performance, school characteristics, and student demographic backgrounds.
(ii) The data are from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999.
(iii) Test scores are the reading and math scores on the Stanford 9 standardized test administered to 5th grade students.
(iv) The student-teacher ratio is the number of students divided by the number of full-time equivalent teachers in the district.
(v) School characteristics (averaged across the district) include enrollment, number of teachers (measured as full-time equivalents), number of computers per classroom, and expenditures per student.
(vi) We also have demographic variables for the students.

Implementation: First, make a directory ‘stat’. Download ‘caschool.csv’ (from the class website) and save it in the directory.

I. R
In R, change your working directory to ‘stat’. You are now ready. Type
> data <- read.csv("caschool.csv")
The data set is now in memory. Type
> hist(data$str)
> hist(data$read_scr)
> plot(data$str, data$read_scr)
> mean(data$str)
> summary(data$str)

II. Excel
Use Tools → Data Analysis → Histogram. Read page 74 for additional information.

2. Numerical Descriptive Statistics

2.1. Central Tendency. A measure of central location describes the center of the distribution of the data. The most common measure is the mean. The mean of a variable is the long-run average value of the variable over many repeated trials. Other measures of central location include
• Median: The value of the observation that falls in the middle when you sort all the observations in ascending order. It is the value that divides the ordered data into two halves.
• Mode: The value that occurs with the greatest frequency.

Formally, let us denote the observations in a sample of size n by
$x_1, x_2, \ldots, x_n$
where $x_1$ is the value of the first observation, $x_2$ is the value of the second observation, and so on. The mean of a sample (or simply the sample mean) $\bar{x}$ is the arithmetic mean of the n observations:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The median (or the mode) does not have a closed-form expression, so we will use an example to illustrate.

Ex 1) Suppose that in your sample, the sample size is n = 5, and $x_1 = 4$, $x_2 = 1$, $x_3 = 2$, $x_4 = 5$, $x_5 = 4$. (a) What is the sample mean? (b) The sample median? (c) The sample mode?

2.2. Measure of Variability or Data Dispersion. A measure of dispersion describes the variability or spread of the data. The most common choice is the standard deviation.

A natural measure of spread is the average value of $x_i - \bar{x}$, namely, $n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})$. This quantity turns out to be useless (why?). So instead we use $\sum_{i=1}^{n} (x_i - \bar{x})^2$ as a measure of variability. The sample variance $s^2$ is the average value of squared deviations,

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

One problem with the variance is that it is measured in units that are different from the units of the original variable. To go back to the original units, we take the square root, and this gives us the standard deviation.

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
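As a minimal sketch of the definitions in Sections 2.1 and 2.2, the R code below computes the sample mean, variance, and standard deviation directly from their formulas and compares them with R's built-in `mean()`, `var()`, and `sd()`. The vector `x` is a made-up sample (an assumption for illustration, not course data). Base R has no built-in function for the statistical mode, so tabulating with `table()` is just one possible approach.

```r
# Hypothetical sample (illustrative values only, not from the course data)
x <- c(3, 7, 7, 2, 9, 8)
n <- length(x)

# Sample mean: (1/n) * sum of x_i
x_bar <- sum(x) / n

# Median: middle value of the sorted data (built-in)
x_med <- median(x)

# Mode: most frequent value; table() counts occurrences of each value
counts <- table(x)
x_mode <- as.numeric(names(counts)[counts == max(counts)])

# Sample variance: sum of squared deviations divided by (n - 1)
s2 <- sum((x - x_bar)^2) / (n - 1)

# Sample standard deviation: square root of the variance
s <- sqrt(s2)

# The hand-computed values should agree with the built-ins
print(c(mean = x_bar, median = x_med, mode = x_mode, var = s2, sd = s))
print(all.equal(x_bar, mean(x)))
print(all.equal(s2, var(x)))
print(all.equal(s, sd(x)))
```

Note that `var()` and `sd()` divide by n − 1, matching the formulas above rather than a division by n.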
How to use the standard deviation? How to interpret its value?
The Empirical Rule states the following for data whose histogram is approximately bell-shaped. Suppose that the central location is zero. Then
• $\Pr(-s \le x_i \le s) \approx 68\%$,
• $\Pr(-2s \le x_i \le 2s) \approx 95\%$,
• $\Pr(-3s \le x_i \le 3s) \approx 99.7\%$.

Suppose that the central location (denoted by $\mu$) is not zero. Then
• $\Pr(-s \le x_i - \mu \le s) \approx 68\%$,
• $\Pr(-2s \le x_i - \mu \le 2s) \approx 95\%$,
• $\Pr(-3s \le x_i - \mu \le 3s) \approx 99.7\%$.

Other measures of variability include
• Range: Largest observation − Smallest observation.
• Interquartile Range: $Q_3 - Q_1$.∗
• Average Absolute Deviation: $n^{-1} \sum_{i=1}^{n} |x_i - \bar{x}|$.
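The three measures listed above can be sketched in R as follows. The vector `x` is a made-up sample (an assumption for illustration). Two caveats: R's `range()` returns the pair (minimum, maximum) rather than their difference, and `IQR()` uses R's default quantile definition, which interpolates between observations and may differ slightly from a textbook's hand method.

```r
# Hypothetical sample (illustrative values only)
x <- c(3, 7, 7, 2, 9, 8)

# Range: largest observation minus smallest observation
# (R's range() returns c(min, max), not the difference)
x_range <- max(x) - min(x)

# Interquartile range: Q3 - Q1, using R's default quantile definition
x_iqr <- IQR(x)

# Average absolute deviation: mean of |x_i - x_bar|
aad <- mean(abs(x - mean(x)))

print(c(range = x_range, IQR = x_iqr, AAD = aad))
```

The average absolute deviation is computed by hand here because R's built-in `mad()` computes a different quantity (a scaled median absolute deviation).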
In R, type
> var(data$str)
> sd(data$read_scr)
> summary(data$read_scr)
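The Empirical Rule from Section 2.2 can be checked numerically. The sketch below uses simulated bell-shaped data (an assumption for illustration; the mean of 20 and standard deviation of 2 are arbitrary choices, not course values) and uses the sample mean in place of µ. The helper `share_within()` is a hypothetical name introduced here.

```r
# Simulated bell-shaped data (an assumption for illustration; not the course data)
set.seed(1)                         # fix the random seed for reproducibility
x <- rnorm(10000, mean = 20, sd = 2)

x_bar <- mean(x)
s <- sd(x)

# Share of observations within k standard deviations of the mean
share_within <- function(k) mean(abs(x - x_bar) <= k * s)

shares <- sapply(1:3, share_within)
names(shares) <- c("within 1 s", "within 2 s", "within 3 s")
print(round(shares, 3))
```

The same check could be run on a real variable such as `data$str`, where the shares would match the rule only to the extent that the variable's histogram is bell-shaped.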
∗ See the next section (2.3) for the definitions of $Q_1$ and $Q_3$.

2.3. Quartiles, Deciles, and Percentiles. Just as the median is the point that divides an ordered sample into two equal halves, other divisions of the sample are possible. The lower quartile is the point where one quarter of the observations lie below and three quarters lie above. The upper quartile is the point where three quarters of the observations lie below and one quarter lie above. You can extend this further.

Percentile. The p-th percentile is the value for which p% of the observations are less than that value. So
The lower (first) quartile $Q_1$ is the 25th percentile.
The middle (second) quartile $Q_2$ is the median.
The upper (third) quartile $Q_3$ is the 75th percentile.
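As a small sketch of the quartile and percentile definitions, the R code below calls `quantile()` on a made-up sample (an assumption for illustration). R's default method interpolates between neighboring observations, so its values may differ slightly from a textbook rule that always picks an observed data point.

```r
# Hypothetical sample (illustrative values only)
x <- c(3, 7, 7, 2, 9, 8, 1, 5, 6)

# Quartiles: the 25th, 50th, and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# The second quartile is the median
print(isTRUE(all.equal(unname(q[2]), median(x))))

# Any other percentile works the same way, e.g. the 90th
p90 <- quantile(x, probs = 0.90)
print(p90)
```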
Deciles split the sample into tenths. The first decile is the 10th percentile, and the second decile is the 20th percentile.

A Box Plot summarizes the data by displaying five statistics: the minimum, the maximum, and the three quartiles.

In R, type
> quantile(data$read_scr, probs = (1:9)/10)
> boxplot(data$str)

Exercise 1. A sample of 10 adults was asked to report the number of hours they spent on the Internet the previous month. Calculate the mean and median.
0 7 12 5 33 14 8 0 9 22

Exercise 2. The data collector recorded 133 instead of 33 by mistake. Calculate the mean and median of this mis-measured sample.
0 7 12 5 133 14 8 0 9 22

Exercise 3. A sample of six students reported the number of summer jobs they applied for. Find the sample variance.
17 15 23 7 9 13

Exercise 4. Show that the following two expressions are identical (meaning that the two expressions always agree):

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \left( \frac{1}{n-1} \sum_{i=1}^{n} x_i^2 \right) - \frac{n}{n-1} \bar{x}^2.$$

Hint) Recall from the Day 1 reading guide that $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$.
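Before attempting the algebra in Exercise 4, it can help to confirm numerically that the two expressions agree. The sketch below does this for one made-up sample (an assumption for illustration); a numerical check on one sample is not a proof, which still requires the hint above.

```r
# Hypothetical sample (illustrative values only)
x <- c(4, 10, 6, 1, 9)
n <- length(x)
x_bar <- mean(x)

# Left-hand side: definition of the sample variance
lhs <- sum((x - x_bar)^2) / (n - 1)

# Right-hand side: the second expression in Exercise 4
rhs <- sum(x^2) / (n - 1) - n / (n - 1) * x_bar^2

# Both should agree with each other and with R's built-in var()
print(c(lhs = lhs, rhs = rhs, builtin = var(x)))
print(all.equal(lhs, rhs))
```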