Lab 11: Categorical Data Analysis

Digital Accessibility

Please note that all images were created with modifications to the defaults to make them digitally accessible. If you recreate this code in another environment, your plots may have different colors and backgrounds.

1 Getting Started

Be sure to load the packages ggformula, mosaic, and tidyverse using the library() function. Remember, you need to do this in each new Quarto document or R session. Add the package names in each of the blanks below to load the indicated packages.

library() loads a package. Supply the name of the package you want to load inside the parentheses.

library(ggformula) #for graphs
library(mosaic) #for statistics
library(tidyverse) #for data management

2 Hair and Eye Color

In 1974 a Statistics professor collected data on 592 randomly selected students at the University of Delaware to help the students learn about analysis of categorical data. Import the data eye-hair-color.csv.
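
A minimal sketch of the import step, assuming the CSV sits in the working directory and is stored in a data frame named eye_hair (the name used in the plotting code later in this lab). The fallback branch is an assumption and not part of the lab: it rebuilds one row per student from R's built-in HairEyeColor table, which appears to record the same 1974 survey. The column names Hair and Eye in the fallback come from that built-in table.

```r
# Path is an assumption: adjust it to wherever eye-hair-color.csv is saved.
csv_path <- "eye-hair-color.csv"

if (file.exists(csv_path)) {
  eye_hair <- read.csv(csv_path)
} else {
  # Fallback (assumption): expand R's built-in HairEyeColor table
  # into one row per student.
  counts <- as.data.frame(HairEyeColor)
  eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]
}

nrow(eye_hair)  # number of students in the sample
```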

2.1 Exploratory Data Analysis

Here are the summary statistics and data visualizations we can create for categorical data.

2.1.1 Calculate Frequency/Count

We can use the tally() function to find the counts/frequency of the hair and eye color of the individuals in the study.

tally(~x , data = mydata)

Here is the frequency distribution for hair color in our sample.
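
As an illustration of the template above, the hair color counts could be computed as follows. This is a sketch under assumptions: the data frame is rebuilt from R's built-in HairEyeColor table rather than read from the lab's CSV, and the hair color column is assumed to be named Hair.

```r
library(mosaic)

# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# One count per hair color category.
hair_counts <- tally(~Hair, data = eye_hair)
hair_counts
```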

Calculate the frequency distribution of eye color in the sample:

2.1.2 Calculate Relative Frequency/Proportion

We can adjust the arguments in the tally() function to find the proportion/relative frequency of the hair and eye color of the individuals in the study.

tally(~x , data = mydata, format = "prop")

Here is the relative frequency distribution for hair color in our sample.
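
A sketch of the same call for hair color, again assuming a stand-in data frame built from R's HairEyeColor table with a column named Hair. Each proportion is the category count divided by the sample size, so the proportions should sum to 1.

```r
library(mosaic)

# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# Proportion of the sample in each hair color category.
hair_props <- tally(~Hair, data = eye_hair, format = "prop")
hair_props
sum(hair_props)  # sanity check: proportions add up to 1
```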

Calculate the relative frequency distribution of eye color in the data:

2.1.3 Creating a Contingency Table

Suppose we want to compare the distribution of eye color across different hair colors. We can create a contingency table using the same tally() function.

tally(y ~ x, data = mydata) # y defines the rows, x defines the columns

Here is the contingency table of counts for hair and eye color in our sample.
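
One possible version of this table puts eye color in the rows and hair color in the columns; which variable the lab placed in each position is an assumption here, as are the column names Hair and Eye and the stand-in data built from R's HairEyeColor table.

```r
library(mosaic)

# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# Rows are eye colors (y position), columns are hair colors (x position).
eye_by_hair <- tally(Eye ~ Hair, data = eye_hair)
eye_by_hair
```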

2.1.4 Calculate Conditional Proportions

In order to calculate the conditional proportions, we can adjust our code a little bit to make it clear which variable we are conditioning on.

tally(~y | x , data = mydata, format = "prop") # y conditioned on x

Here are the conditional proportions of hair color given eye color in our sample.
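
A sketch of hair color conditioned on eye color, assuming the column names Hair and Eye and a stand-in data frame built from R's HairEyeColor table. Because the proportions are computed within each eye color, each eye color column should sum to 1.

```r
library(mosaic)

# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# Distribution of hair color within each eye color.
hair_given_eye <- tally(~Hair | Eye, data = eye_hair, format = "prop")
hair_given_eye
colSums(hair_given_eye)  # sanity check: one total per eye color
```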

Now calculate the conditional proportions of eye color given hair color:

2.1.5 Bar Plots of Counts

We can choose from several options for our bar plot for a single categorical variable.

gf_bar(~x , data = mydata)

Here is the bar plot for hair color counts in our sample.
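
A sketch of how such a plot could be produced, assuming a hair color column named Hair and a stand-in data frame built from R's HairEyeColor table. The axis labels are illustrative, and the colors will follow your own environment's defaults.

```r
library(ggformula)

# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# Bar height is the number of students with each hair color.
hair_bar <- gf_bar(~Hair, data = eye_hair,
                   xlab = "Color of Hair",
                   ylab = "Number of Students")
hair_bar
```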

Create a bar plot for the eye color counts in our sample:

gf_bar(~Eye, data = eye_hair, 
       xlab = "Color of Eye",
       ylab = "Number of Statistics Students at Univ. of Del. in 1974")

2.1.6 Bar Plots of Proportions

To show proportions rather than counts for a single categorical variable, use gf_props() in place of gf_bar().

gf_props(~x , data = mydata)

Here is the bar plot for hair color proportions in our sample.
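
A sketch of the proportion version for hair color, under the same assumptions as before: a column named Hair and a stand-in data frame built from R's HairEyeColor table. The axis labels are illustrative.

```r
library(ggformula)

# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# Bar height is the proportion of students with each hair color.
hair_props_plot <- gf_props(~Hair, data = eye_hair,
                            xlab = "Color of Hair",
                            ylab = "Proportion of Students")
hair_props_plot
```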

Create a bar plot for the eye color proportions in our sample:

gf_props(~Eye, data = eye_hair,
         xlab = "Color of Eye",
         ylab = "Proportion of Statistics Students at Univ. of Del. in 1974")

2.1.7 Bar Plots of Conditional Proportions

Here is an example of a plot where we look at the conditional proportion for eye color conditioned on hair color.
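
The lab's own plot is not reproduced here. One way to draw eye color conditioned on hair color is a stacked bar chart in which every hair color bar is rescaled to a total height of 1. This is a sketch, not necessarily the plot the lab used; it assumes columns named Hair and Eye and a stand-in data frame built from R's HairEyeColor table.

```r
library(ggformula)

# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# position = "fill" rescales each hair color bar to height 1, so each
# segment shows the proportion of an eye color within that hair color.
cond_plot <- gf_bar(~Hair, fill = ~Eye, data = eye_hair, position = "fill",
                    xlab = "Color of Hair",
                    ylab = "Proportion of Eye Color within Hair Color")
cond_plot
```

Comparing segment heights across bars shows how the eye color distribution shifts from one hair color to the next.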

What do you notice about how the distribution of eye color varies among those with different hair colors?

3 Comparing Eye Color Distribution from 1974 and 2025.

In 2025 a study was published using all 50 states’ Department of Motor Vehicles information to identify eye and hair color for all registered individuals. They found the following:

Eye Color	Percent
Blue	25%
Brown	55%
Green	9%
Hazel	11%

We want to determine if the distribution of eye color of statistics students at the University of Delaware in 1974 is similar to the distribution of eye color in the United States in 2025.

3.1 Distribution of Blue Eye Color

Let’s first practice with a simple hypothesis test to determine if the true proportion of blue eyes was higher among statistics students at the University of Delaware in 1974 compared to the proportion of blue eyes in the US in 2025.

3.1.1 Null Hypothesis

3.1.2 Alternative Hypothesis

Be sure you can also write the hypotheses verbally in context. Bring your answers to office hours or the CLC for review.

3.1.3 Calculate the p-value using the Binomial Distribution

Adjust the code below to calculate the correct p-value for the test using the binomial distribution.
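
The lab's starter code is not reproduced here. As a hedged sketch, an upper-tail binomial p-value for the null value 0.25 (the 2025 percentage for blue eyes) could be computed as below. The names n, x_blue, and pi_0 are illustrative, and the data frame is a stand-in built from R's HairEyeColor table with a column assumed to be named Eye.

```r
# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

n <- nrow(eye_hair)                    # sample size
x_blue <- sum(eye_hair$Eye == "Blue")  # observed number with blue eyes
pi_0 <- 0.25                           # null proportion from the 2025 table

# Upper-tail p-value: P(X >= x_blue) when X ~ Binomial(n, pi_0).
# pbinom() with lower.tail = FALSE gives P(X > q), so q is x_blue - 1.
p_value <- pbinom(x_blue - 1, size = n, prob = pi_0, lower.tail = FALSE)
p_value
```

Base R's binom.test(x_blue, n, p = pi_0, alternative = "greater") reports the same upper-tail probability along with a confidence interval.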

3.2 Evaluate the Chi-Square Goodness-of-Fit

3.2.1 Null Hypothesis

Now we are going to test the full hypothesis that the true proportions for the eye color for statistics students at the University of Delaware in 1974 match the current distribution of eye colors in the United States in 2025, i.e.

\[H_0: \pi_{blue} = 0.25,~\pi_{brown} = 0.55,~\pi_{green} = 0.09,~\pi_{hazel} = 0.11\]

3.2.2 Alternative Hypothesis

Reorder the values below to write out the verbal alternative hypothesis, using the same order for the expected proportions as listed in the symbolic null hypothesis. Place the population in the last position.

3.2.3 Calculate the p-value
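
A minimal sketch of a chi-square goodness-of-fit test using base R's chisq.test(), with the null proportions taken from the 2025 table. The stand-in data frame (built from R's HairEyeColor table), the column name Eye, and the object names are assumptions. The observed counts are reordered by name so that they line up with the null proportions.

```r
# Assumption: stand-in for the lab's CSV, built from R's HairEyeColor table.
counts <- as.data.frame(HairEyeColor)
eye_hair <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# Null proportions in the order used in the symbolic null hypothesis.
null_props <- c(Blue = 0.25, Brown = 0.55, Green = 0.09, Hazel = 0.11)

# Observed counts, reordered to match the names in null_props.
observed <- table(eye_hair$Eye)[names(null_props)]

gof <- chisq.test(observed, p = null_props)
gof$expected   # expected counts under the null hypothesis
gof$statistic  # chi-square test statistic
gof$p.value    # p-value, with df = number of categories - 1
```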

3.2.4 What do we conclude from our test?

3.3 Checking Conditions for Goodness-of-Fit Test

Check all the conditions we meet in our scenario to determine if we can use the Chi-Square Distribution as a model for the null distribution and trust our calculated p-value.
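
One condition that is commonly checked is that every expected count is large enough. The sketch below computes the expected counts from the sample size of 592 and the 2025 percentages. The threshold of 5 is a common textbook rule and an assumption here, since the lab's own checklist is not shown.

```r
n <- 592  # sample size stated in the lab
null_props <- c(Blue = 0.25, Brown = 0.55, Green = 0.09, Hazel = 0.11)

# Expected count for each eye color if the null hypothesis is true.
expected <- n * null_props
expected

# Assumed rule of thumb: every expected count should be at least 5.
all(expected >= 5)
```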
