5.1 Describing one variable
5.1.1 Summary statistics
A useful first step in analyzing the distribution of scores on a single numeric variable is to calculate the relevant summary statistics. Use the summary() function for a quick, general overview. For a numeric variable, it returns the minimum, 1st quartile, median (2nd quartile), mean, 3rd quartile, and maximum. For a character variable it reports only the length and class, and for a factor it counts the cases at each level.
summary(dcps) # for every variable in the data frame
##     SchCode        SchName                SchType
##  Min.   :202   Length:108         Elementary:64
##  1st Qu.:264   Class :character   Middle    :25
##  Median :318   Mode  :character   High      :19
##  Mean   :340
##  3rd Qu.:414
##  Max.   :943
##    NumTested        ProfLang        ProfMath
##  Min.   :  12   Min.   : 0.0   Min.   : 0.00
##  1st Qu.: 112   1st Qu.:12.3   1st Qu.: 9.38
##  Median : 146   Median :19.1   Median :20.56
##  Mean   : 180   Mean   :29.7   Mean   :26.96
##  3rd Qu.: 212   3rd Qu.:40.0   3rd Qu.:36.88
##  Max.   :1423   Max.   :94.1   Max.   :82.76
##  DataVERSION
##  Length:108
##  Class :character
##  Mode  :character
##
##
##
summary(dcps$ProfLang) # for a specific variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0    12.3    19.1    29.7    40.0    94.1
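To see where these six numbers come from, each one can be reproduced with a dedicated base R function. The sketch below is illustrative only: it uses a small hypothetical vector, scores, rather than the dcps data, and the name by_hand is an assumption made for the example.

```r
# Hypothetical proficiency rates (illustration only, not taken from dcps)
scores <- c(5, 12, 19, 27, 40, 58, 94)

# quantile() returns the minimum, quartiles, and maximum in one call
q <- quantile(scores, probs = c(0, 0.25, 0.5, 0.75, 1))

# Assemble the same six statistics that summary() reports
by_hand <- c(
  Min    = min(scores),
  Q1     = unname(q["25%"]),
  Median = median(scores),
  Mean   = mean(scores),
  Q3     = unname(q["75%"]),
  Max    = max(scores)
)

print(by_hand)
print(summary(scores))  # should agree with by_hand, up to display rounding
```

Comparing the two printed results is a quick way to confirm that summary() is simply bundling min(), quantile(), median(), mean(), and max().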
For specific inquiries, use the summarize() function from the dplyr package and customize your report. For example:
library(dplyr) # provides %>% and summarize()
dcps %>% # start by piping in the dataset
  summarize(
    Avg = mean(ProfLang), # calculates the mean
    StdDev = sd(ProfLang), # standard deviation
    Range = max(ProfLang) - min(ProfLang) # maximum minus minimum
  )
## # A tibble: 1 x 3
##     Avg StdDev Range
##   <dbl>  <dbl> <dbl>
## 1  29.7   24.6  94.1
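The same pattern extends to any statistic that returns a single value. The following is a minimal sketch, assuming dplyr is installed; the data frame toy and the column names N, Mid, and Spread are hypothetical and stand in for dcps only for illustration.

```r
library(dplyr)

# Hypothetical stand-in for dcps (illustration only)
toy <- data.frame(ProfLang = c(5, 12, 19, 27, 40, 58, 94))

report <- toy %>%
  summarize(
    N      = n(),               # number of rows (schools)
    Mid    = median(ProfLang),  # middle score
    Spread = IQR(ProfLang)      # 3rd quartile minus 1st quartile
  )

print(report)
```

If the variable contained missing values, each of these functions would return NA unless called with na.rm = TRUE, for example median(ProfLang, na.rm = TRUE).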
5.1.2 Graphing the distribution
We typically use a histogram or box plot to visualize the distribution of scores on a numeric variable.
# Basic histogram
hist(dcps$ProfLang)
# Basic boxplot
boxplot(dcps$ProfLang, horizontal = TRUE)
See the chapter on data visualization to learn how to format these graphs appropriately for academic or professional settings.
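As a small first step in that direction, both functions accept a title and axis labels, and both invisibly return the numbers they draw. The sketch below uses a hypothetical vector, scores, rather than the dcps data; the bin width and label text are assumptions chosen for illustration.

```r
# Hypothetical proficiency rates (illustration only, not taken from dcps)
scores <- c(5, 8, 12, 15, 19, 22, 27, 33, 40, 58, 71, 94)

# Histogram with explicit bin edges, a title, and axis labels
h <- hist(
  scores,
  breaks = seq(0, 100, by = 10),  # ten-point bins from 0 to 100
  main = "Distribution of proficiency rates",
  xlab = "Percent proficient",
  ylab = "Number of schools"
)

# Horizontal box plot with the same axis label
b <- boxplot(scores, horizontal = TRUE, xlab = "Percent proficient")

# Inspect the values behind each graph
h$counts  # number of observations in each bin
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
```

Checking h$counts and b$stats against the picture is a useful habit when a graph looks surprising.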
5.1.3 Testing hypotheses
A one-sample \(t\)-test (t.test()) compares the observed mean on a numeric variable to a hypothesized mean. The resulting \(p\)-value indicates the probability of observing a sample mean at least as extreme as the one in your data if the sample came from a population defined by the null hypothesis (mu =).
For example, evaluate the argument that at least half of DC public school pupils read at or above grade level (i.e., \(H_0:~\mu \geq 50\)). Setting alternative = 'less' tests this null against the one-sided alternative \(H_A:~\mu < 50\).
t.test(dcps$ProfLang, mu = 50, alternative = 'less')
##
## One Sample t-test
##
## data: dcps$ProfLang
## t = -8.6, df = 107, p-value = 5e-14
## alternative hypothesis: true mean is less than 50
## 95 percent confidence interval:
## -Inf 33.66
## sample estimates:
## mean of x
## 29.73
The test results suggest that it is extremely unlikely (\(t=-8.6\), \(p<0.001\)) that we would observe a sample mean this low if at least half of DC public school pupils read at or above grade level. We can reject the null hypothesis.
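To see what t.test() is doing, the statistic can be recomputed by hand from the summary values reported in this section: the mean (29.73), the standard deviation (24.6), and the sample size (108, implied by df = 107). This is a sketch under the assumption that those rounded values are close enough; the variable names are chosen for illustration, and the results should land near the output above, with small differences due to rounding.

```r
# Summary values reported earlier in this section
x_bar <- 29.73  # sample mean of ProfLang
s     <- 24.6   # sample standard deviation (rounded)
n     <- 108    # number of schools (df = 107 implies n = 108)
mu_0  <- 50     # hypothesized mean under the null

se     <- s / sqrt(n)             # standard error of the mean
t_stat <- (x_bar - mu_0) / se     # one-sample t statistic
p_val  <- pt(t_stat, df = n - 1)  # lower-tail p-value, as with alternative = 'less'
upper  <- x_bar + qt(0.95, df = n - 1) * se  # upper bound of the one-sided 95% interval

print(round(c(t = t_stat, upper = upper), 2))
print(p_val)
```

Because the alternative is "less", only the lower tail of the \(t\) distribution contributes to the \(p\)-value, and the confidence interval has no lower bound.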