Lab 11: Categorical Data Analysis
Please note that all images were created with modifications to the defaults to make them digitally accessible. If you recreate this code in another environment, your plots may have different colors and backgrounds.
1 Getting Started
Be sure to load the packages ggformula, mosaic, and tidyverse using the library() function. Remember, you need to do this in each new Quarto document or R session. Add a package name in each of the blanks below to load the indicated packages.
library() loads a package. Supply the name of the package you want to load inside the parentheses.
library(ggformula) #for graphs
library(mosaic) #for statistics
library(tidyverse) #for data management
2 Hair and Eye Color
In 1974 a Statistics professor collected data on 592 randomly selected students at the University of Delaware to help the students learn about analysis of categorical data. Import the data eye-hair-color.csv.
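The import step might look like the sketch below. The file location (the working directory) and the column names Hair and Eye are assumptions based on the code used later in this lab. If the file is not found, the sketch falls back on R's built-in HairEyeColor table, which appears to record the same 592 students.

```r
# Hypothetical import step: the path and column names are assumptions.
csv_path <- "eye-hair-color.csv"

if (file.exists(csv_path)) {
  eye_hair <- read.csv(csv_path)
} else {
  # Fallback: expand the built-in HairEyeColor table to one row per student.
  counts <- as.data.frame(HairEyeColor)  # columns: Hair, Eye, Sex, Freq
  eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]
  # Store colors as text so categories sort alphabetically, as they would from a CSV.
  eye_hair$Hair <- as.character(eye_hair$Hair)
  eye_hair$Eye <- as.character(eye_hair$Eye)
  rownames(eye_hair) <- NULL
}

nrow(eye_hair)  # number of students in the data
```

Storing the colors as text matters later: the goodness-of-fit test matches the hypothesized proportions to the categories by position, so the category order needs to be known.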
2.1 Exploratory Data Analysis
Here are the summary statistics and data visualizations we can create for categorical data.
2.1.1 Calculate Frequency/Count
We can use the tally() function to find the counts/frequency of the hair and eye color of the individuals in the study.
tally(~x, data = mydata)
Here is the frequency distribution for hair color in our sample.
Calculate the frequency distribution of eye color in the sample:
tally(~Eye, data = eye_hair)
2.1.2 Calculate Relative Frequency/Proportion
We can adjust the arguments in the tally() function to find the proportion/relative frequency of the hair and eye color of the individuals in the study.
tally(~x, data = mydata, format = "prop")
Here is the relative frequency distribution for hair color in our sample.
Calculate the relative frequency distribution of eye color in the data:
tally(~Eye, data = eye_hair, format = "prop")
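A relative frequency is a count divided by the total sample size. As a cross-check on tally(), here is a minimal base R sketch. It assumes the built-in HairEyeColor table can stand in for eye-hair-color.csv; the object names are for illustration only.

```r
# Eye color counts, summed over hair color and sex (margin 2 is Eye).
eye_counts <- apply(HairEyeColor, 2, sum)

# Relative frequency = count / total number of students.
n <- sum(eye_counts)
eye_props <- eye_counts / n

eye_counts
round(eye_props, 3)
```

The proportions should add up to 1, which is a quick way to confirm that no category was left out.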
2.1.3 Creating a Contingency Table
Suppose we want to compare the distribution of eye color across different hair colors. We can create a contingency table using the same tally() function.
tally(y ~ x, data = mydata) # y defines the rows, x defines the columns
Here is the frequency distribution for hair and eye color in our sample.
2.1.4 Calculate Conditional Proportions
In order to calculate the conditional proportions, we can adjust our code a little bit to make it clear which variable we are conditioning on.
tally(~y | x, data = mydata, format = "prop") # y conditioned on x
Here are the conditional proportions of hair color given eye color in our sample.
Now calculate the conditional proportions of eye color given hair color:
tally(~ Eye | Hair, data = eye_hair, format = "prop")
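To see what "conditioned on hair color" means numerically, we can divide each cell of the contingency table by its hair color total. This base R sketch again assumes the built-in HairEyeColor table as a stand-in for the lab data.

```r
# Contingency table with Eye in the rows and Hair in the columns.
eye_by_hair <- apply(HairEyeColor, c(2, 1), sum)

# Divide every column by its total, so each hair color column sums to 1.
eye_given_hair <- prop.table(eye_by_hair, margin = 2)

round(eye_given_hair, 3)
colSums(eye_given_hair)  # each hair color column should sum to 1
```

If we conditioned on eye color instead (margin = 1), the rows would sum to 1 and the table would answer a different question.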
2.1.5 Bar Plots of Counts
We can choose from several options for our bar plot for a single categorical variable.
gf_bar(~x, data = mydata)
Here is the bar plot for hair color counts in our sample.
Create a bar plot for the eye color counts in our sample:
gf_bar(~Eye, data = eye_hair,
       xlab = "Color of Eye",
       ylab = "Number of Statistics Students at Univ. of Del. in 1974")
2.1.6 Bar Plots of Proportions
To plot proportions instead of counts for a single categorical variable, we can use gf_props().
gf_props(~x, data = mydata)
Here is the bar plot for hair color proportions in our sample.
Create a bar plot for the eye color proportions in our sample:
gf_props(~Eye, data = eye_hair,
         xlab = "Color of Eye",
         ylab = "Proportion of Statistics Students at Univ. of Del. in 1974")
2.1.7 Bar Plots of Conditional Proportions
Here is an example of a plot where we look at the conditional proportion for eye color conditioned on hair color.
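One way such a plot could be drawn is sketched below. This is not necessarily the code behind the lab's figure: it assumes the built-in HairEyeColor table as a stand-in for the data, and the axis labels are made up for illustration. Setting position = "fill" stretches every bar to height 1, so the segments within a bar show the eye color proportions for that hair color.

```r
library(ggformula)

# Expand the built-in table to one row per student (assumed stand-in for the CSV).
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# One bar per hair color, filled by eye color and scaled to proportions.
cond_plot <- gf_bar(~Hair, fill = ~Eye, data = eye_hair, position = "fill",
                    xlab = "Hair Color",
                    ylab = "Proportion of Students")
cond_plot
```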
What do you notice about how the distribution of eye color varies among those with different hair colors?
3 Comparing Eye Color Distributions from 1974 and 2025
In 2025 a study was published using all 50 states’ Department of Motor Vehicles information to identify eye and hair color for all registered individuals. They found the following:
| Eye Color | Percent |
|---|---|
| Blue | 25% |
| Brown | 55% |
| Green | 9% |
| Hazel | 11% |
We want to determine if the distribution of eye color of statistics students at the University of Delaware in 1974 is similar to the distribution of eye color in the United States in 2025.
3.1 Distribution of Blue Eye Color
Let’s first practice with a simple hypothesis test to determine if the true proportion of blue eyes was higher among statistics students at the University of Delaware in 1974 compared to the proportion of blue eyes in the US in 2025.
3.1.1 Null Hypothesis
3.1.2 Alternative Hypothesis
Be sure you can also write the hypotheses verbally in context. Bring your answers to office hours or the CLC for review.
3.1.3 Calculate the p-value using the Binomial Distribution
Adjust the code below to calculate the correct p-value for the test using the binomial distribution.
1 - pbinom(215 - 1, size = 592, prob = 0.25) # P(X >= 215) when the true proportion is 0.25
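Why subtract 1 from 215? pbinom(q, ...) returns P(X <= q), so 1 - pbinom(214, ...) is P(X >= 215), the probability of seeing at least as many blue-eyed students as we observed. The sketch below computes this upper tail three ways to show they agree. It uses the 25% from the 2025 table as the null proportion; the object names are for illustration only.

```r
n_students <- 592   # sample size
n_blue <- 215       # observed number of blue-eyed students
null_prop <- 0.25   # proportion of blue eyes in the 2025 table

# 1. Complement of the lower tail: 1 - P(X <= 214).
p_complement <- 1 - pbinom(n_blue - 1, size = n_students, prob = null_prop)

# 2. Ask for the upper tail directly: P(X > 214).
p_upper <- pbinom(n_blue - 1, size = n_students, prob = null_prop,
                  lower.tail = FALSE)

# 3. Exact binomial test with a one-sided "greater" alternative.
p_test <- binom.test(n_blue, n_students, p = null_prop,
                     alternative = "greater")$p.value

c(p_complement, p_upper, p_test)
```

The lower.tail = FALSE form is usually preferable for very small p-values because it avoids the rounding that can occur when subtracting a number close to 1 from 1.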
3.2 Evaluate the Chi-Square Goodness-of-Fit
3.2.1 Null Hypothesis
Now we are going to test the full hypothesis that the true proportions for the eye color for statistics students at the University of Delaware in 1974 match the current distribution of eye colors in the United States in 2025, i.e.
\[H_0: \pi_{blue} = 0.25,~\pi_{brown} = 0.55,~\pi_{green} = 0.09,~\pi_{hazel} = 0.11\]
3.2.2 Alternative Hypothesis
Reorder the values below to write out the verbal alternative hypothesis, using the same order for the expected proportions as listed in the symbolic null hypothesis. Place the population in the last position.
3.2.3 Calculate the p-value
xchisq.test(~Eye, data = eye_hair, p = c(0.25, 0.55, 0.09, 0.11))
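To see where the test statistic comes from, we can compute it by hand: find the expected count for each eye color under the null hypothesis (sample size times the hypothesized proportion), then add up (observed - expected)^2 / expected over the four categories. This sketch assumes the built-in HairEyeColor table as a stand-in for the lab data and compares the result with base R's chisq.test().

```r
# Observed eye color counts, reordered to match the null hypothesis.
eye_counts <- apply(HairEyeColor, 2, sum)
observed <- eye_counts[c("Blue", "Brown", "Green", "Hazel")]

# Hypothesized proportions from the 2025 table, in the same order.
null_p <- c(Blue = 0.25, Brown = 0.55, Green = 0.09, Hazel = 0.11)

# Expected counts if the null hypothesis were true.
expected <- sum(observed) * null_p

# Chi-square statistic and its degrees of freedom (categories - 1).
chi_sq <- sum((observed - expected)^2 / expected)
deg_free <- length(observed) - 1

# p-value: upper tail of the chi-square distribution.
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

# Cross-check against the built-in test.
check <- chisq.test(observed, p = null_p)
c(by_hand = chi_sq, built_in = unname(check$statistic))
```

The order of the values in p must match the order of the categories in the data. Here the counts are reordered explicitly so that the two line up.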
3.2.4 What do we conclude from our test?
3.3 Checking Conditions for Goodness-of-Fit Test
Check all the conditions we meet in our scenario to determine if we can use the Chi-Square Distribution as a model for the null distribution and trust our calculated p-value.
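One condition that can be checked numerically is whether every expected count is large enough. The sketch below uses a threshold of 5, which is a common rule of thumb and an assumption here; use the threshold given in your course materials if it differs. The other conditions, such as independent observations, have to be judged from how the data were collected.

```r
n_students <- 592
null_p <- c(Blue = 0.25, Brown = 0.55, Green = 0.09, Hazel = 0.11)

# Expected count for each eye color under the null hypothesis.
expected <- n_students * null_p

min_expected <- 5  # assumed rule-of-thumb threshold
large_enough <- expected >= min_expected

expected
all(large_enough)  # TRUE when every category meets the threshold
```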