Lab 5: Two Group EDA and QQ Plot Primer
Please note that all images were created with modifications to the defaults to make them digitally accessible. If you recreate this code in another environment, your plots may have different colors and backgrounds.
1 Getting Started
Be sure to load the packages ggformula, mosaic, and tidyverse using the library() function. Remember, you need to do this in each new Quarto document or R session. Add the package names in the blanks below to load the indicated packages.
library() loads packages. Supply the name of the package you want to load inside the parentheses.
library(ggformula) #for graphs
library(mosaic) #for statistics
library(tidyverse) #for data management
2 Sex Bias in Professor Ratings
Sex bias stems from a perceived mismatch with the role or characteristics expected of someone based on their sex. Studies have shown that both men and women hold unconscious sex biases against women in traditionally male-dominated fields (such as the sciences) or with respect to characteristics (such as leadership qualities). These biases often cause equally qualified women to be seen as less likable or less qualified than men. (These links are to descriptions of two well-known studies, but there are plenty of other good resources.)
Researchers are interested in whether this sex bias also exists in traditionally female-dominated jobs, such as teaching. Students are asked to watch a video of an animated classroom and rate the professor. Each student is randomly assigned to one of two animations; the videos are identical except for the sex of the professor drawn. You have been asked to analyze the data for the researchers to determine whether the female-identifying professor is rated lower, on a 1 to 7 scale (with 7 being the best), than the male-identifying professor.
Run the following code chunk to read in the data and view the variable names and first 6 rows of the data.
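If you want to see what reading in a file and previewing it looks like, here is a minimal base R sketch. It does not use the lab's data file: the temporary CSV, the object name bias_demo, and every value in it are made up for illustration, and only the column names Sex and Rating are taken from the code used later in this lab.

```r
# Hypothetical stand-in for the lab's data file: a few made-up rows are
# written to a temporary CSV so that this sketch runs on its own.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Sex,Rating",
             "Female,5", "Male,6", "Female,4", "Male,5",
             "Female,6", "Male,4", "Female,3", "Male,7"), tmp)

bias_demo <- read.csv(tmp)  # read the CSV into a data frame
names(bias_demo)            # view the variable names
head(bias_demo)             # view the first 6 rows (the default for head())
```

head() shows six rows unless you ask for a different number, for example head(bias_demo, 3).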
2.1 Identify Key Components of the Scenario
For each variable, identify whether it is the explanatory or response variable in our analysis, and the type of variable.
Sex:
Rating:
Identify the study design of this study.
Be sure you are able to provide a full justification.
Identify the study type of this study.
Be sure you are able to provide a full justification.
2.2 Exploratory Data Analysis
Calculate the summary statistics for each sample by modifying the code below.
df_stats(Rating ~ Sex, data = bias)
mean(Rating ~ Sex, data = bias)
var(Rating ~ Sex, data = bias)
sd(Rating ~ Sex, data = bias)
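To see what the formula Rating ~ Sex is asking for, the sketch below computes the same kinds of group-wise statistics in base R with aggregate(), which accepts the same formula. The data frame demo and its values are made up for illustration and are not the study data.

```r
# Made-up ratings for illustration only; this is not the study data.
demo <- data.frame(
  Sex    = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  Rating = c(4, 5, 6, 5, 4, 6, 6, 6)
)

# "Rating ~ Sex" means: apply the statistic to Rating within each level of Sex.
group_mean <- aggregate(Rating ~ Sex, data = demo, FUN = mean)
group_var  <- aggregate(Rating ~ Sex, data = demo, FUN = var)
group_sd   <- aggregate(Rating ~ Sex, data = demo, FUN = sd)

group_mean
group_var
group_sd

# The standard deviation is the square root of the variance, group by group.
all.equal(group_sd$Rating, sqrt(group_var$Rating))
```

Each result has one row per group, which is the structure to look for when you compare the two samples.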
2.2.1 Create a boxplot of the data by modifying the code below.
Recall that ~, which translates as “by” or “as a function of,” lets us graph one variable split by another (for a boxplot, a numeric variable split by a categorical variable). Be sure to add fully descriptive axis labels. The solution provides its own labels, but any labels that contain similar information are good labels!
gf_boxplot(Rating ~ Sex, data = bias,
           ylab = "Rating of Professor (1-7)",
           xlab = "Sex of the Professor")
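A boxplot is a picture of a few order statistics: the line inside each box marks the median, and the box runs from the first quartile to the third quartile, so its length is the interquartile range (IQR). As a rough check on what you read off a boxplot, the base R sketch below computes those values per group. The data frame demo and its values are made up for illustration and are not the study data.

```r
# Made-up ratings for illustration only; this is not the study data.
demo <- data.frame(
  Sex    = rep(c("Female", "Male"), each = 5),
  Rating = c(3, 4, 5, 5, 6,   # Female
             4, 5, 6, 6, 7)   # Male
)

# Median of Rating within each level of Sex (the line inside each box).
group_median <- tapply(demo$Rating, demo$Sex, median)

# Interquartile range within each level of Sex (the length of each box).
group_iqr <- tapply(demo$Rating, demo$Sex, IQR)

group_median
group_iqr
```

Both statistics are based on ranks rather than on every value, which is why they pair naturally with a boxplot.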
2.2.2 Based on your boxplot,
What is the approximate median rating for females?
What is the approximate median rating for males?
Which of the following is an appropriate statistic to describe the variability (i.e. dispersion, spread) in rating of male-identifying professors?
Which of the following is an appropriate statistic to describe the variability (i.e. dispersion, spread) in rating of female-identifying professors?
2.3 Assessing Normality of the Ratings by Group
Construct a Quantile-Quantile Plot (QQ Plot) to evaluate the normality of the data. Remember that for two independent groups, you must look at the QQ Plot split by group (~ y | x).
gf_qq(~Rating | Sex, data = bias,
      xlab = "Theoretical Z-Scores",
      ylab = "Ratings of Professors") |>
  gf_qqline()
2.4 Interpreting QQ Plots
Remember to interpret the QQ Plot completely by addressing each of the following:
General Pattern: Discuss the pattern compared to the 1:1 line that represents perfect normality.
Specifics: Describe specific details about the pattern such as if there are any deviations, where the deviations are located along the 1:1 line, and the magnitude of the deviations.
Consideration for n: Evaluate the severity of the deviations from the 1:1 line relative to the sample size, which should be cited.
Cumulative Frequency: Discuss the cumulative frequency rate of increase of the data points/residuals relative to the 1:1 line.
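It can help to see how the points and the reference line on a QQ Plot are built. The base R sketch below is one way to construct them by hand, under two assumptions: the sample is simulated (34 made-up values, not the lab data), and the reference line is drawn through the first and third quartiles, which is the default for base R's qqline(). The line drawn by gf_qqline() may be configured differently, so treat this as an illustration rather than a description of that function.

```r
set.seed(1)                       # make the simulated sample reproducible
x <- rnorm(34, mean = 5, sd = 1)  # simulated values, not the lab data
n <- length(x)

sample_q <- sort(x)               # observed quantiles: the data in order
theory_q <- qnorm(ppoints(n))     # matching standard normal quantiles (z-scores)

# Reference line through the first and third quartiles on each axis.
y_q <- quantile(x, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))
slope     <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])

# Vertical distance of each point from the line. Where these are large,
# and in which tail, is what the "Specifics" component should describe.
deviation <- sample_q - (intercept + slope * theory_q)

head(data.frame(theory_q, sample_q, deviation))
```

Running plot(theory_q, sample_q) followed by abline(intercept, slope) draws the result. Changing the seed shows how much the tails wander even when the data really are normal, which is the reason deviations are judged relative to n.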
2.4.1 Based on the QQ Plot for Female Professors, evaluate the following statements and determine whether each component is Complete, Incomplete, or Missing.
The female-identifying scores seem normally distributed as they follow along the line with no apparent outliers.
General Pattern:
Specifics:
Consideration for n:
Cumulative frequency rate of increase:
The data does not follow the 1:1 line, points start above the line and then end below the line. That means there are more data points, especially on the left tail, than expected. The cumulative frequency increases too quickly.
General Pattern:
Specifics:
Consideration for n:
Cumulative frequency rate of increase:
The majority of data points in the center of the Female plot are close to the 1:1 line, but deviates at the tails. The sample size is \(n=34\) so we can expect variability. This implies that the plot is mostly normally distributed.
General Pattern:
Specifics:
Consideration for n:
Cumulative frequency rate of increase:
The female plot trends normally with most of the plot points trending relatively close to the 1:1 line. There is only one large deviation in the upper tail. This means the cumulative frequency rate increases.
General Pattern:
Specifics:
Consideration for n:
Cumulative frequency rate of increase:
The center of each data group is a little skewed and does not follow the 1:1 line. There are deviations apparent, given our sample size of 34. This means that the rate of cumulative frequency increase is not relatively normal.
General Pattern:
Specifics:
Consideration for n:
Cumulative frequency rate of increase:
3 Parents' Age Gap
A recent study found that, over the past 250,000 years, the average Homo sapiens father has consistently been older than the average Homo sapiens mother. However, the age gap has dwindled in the last 5,000 years, largely due to mothers having children at older ages. With a declining teen birth rate, rising birth rates among older women, and women pursuing higher education and careers before starting families, the average age of all mothers giving birth in the United States increased to nearly 30 in 2023. While the age gap between parents can vary greatly, it’s common for fathers to be just a few years older than mothers.
The National Vital Statistics System (NVSS) is a collaborative effort between state and local governments to compile and publish reports on all vital events - births, deaths, marriages, and divorces. A random sample of 50 births in 2022 was extracted from NVSS. The data for mother’s age and father’s age can be found in nvss-births-2022.csv.
3.0.1 For each variable, identify whether it is the explanatory or response variable in our analysis, and the type of variable.
Sex:
Age:
Identify the study design of this study. Be sure you are able to provide a full justification.
Identify the study type of this study. Be sure you are able to provide a full justification.
3.0.2 Calculate the summary statistics for the differences.
Modify the code below to calculate the summary statistics. We will focus on the difference between the Father’s age and the Mother’s age.
df_stats(~(FatherAge - MotherAge), data = parents)
mean(~(FatherAge - MotherAge), data = parents)
var(~(FatherAge - MotherAge), data = parents)
sd(~(FatherAge - MotherAge), data = parents)
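For matched data, the statistics are computed on one difference per pair, not on the two columns separately. The base R sketch below shows why that matters. The data frame demo and its ages are made up for illustration and are not the NVSS sample; only the column names FatherAge and MotherAge are taken from the code above.

```r
# Made-up ages for illustration only; this is not the NVSS sample.
demo <- data.frame(
  FatherAge = c(31, 28, 40, 35, 27),
  MotherAge = c(29, 27, 33, 36, 25)
)

# One difference per birth: father's age minus mother's age.
demo$Diff <- demo$FatherAge - demo$MotherAge

mean(demo$Diff)
var(demo$Diff)
sd(demo$Diff)

# The mean of the differences equals the difference of the means ...
all.equal(mean(demo$Diff), mean(demo$FatherAge) - mean(demo$MotherAge))

# ... but the SD of the differences is not the difference of the SDs,
# so the spread has to be computed from the differences themselves.
sd(demo$Diff)
sd(demo$FatherAge) - sd(demo$MotherAge)
```

A negative difference simply means the mother is older than the father in that pair.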
3.0.3 Create a boxplot of the differences by modifying the code below.
Be sure to add fully descriptive axis labels. As long as your axis label contains similar information, it is fine.
gf_boxplot(~(FatherAge - MotherAge), data = parents,
           xlab = "Difference between Father's and Mother's Age of Newborn Infants in 2022")
3.1 Assessing Normality of the Differences
Construct a Quantile-Quantile Plot (QQ Plot) to evaluate the normality of the differences. Remember that for two dependent/matched groups, you must look at the QQ Plot of the differences (~ (x1 - x2)).
gf_qq(~(FatherAge - MotherAge), data = parents,
      xlab = "Theoretical Z-Scores",
      ylab = "Difference between Father's and Mother's Age") |>
  gf_qqline()
Practice writing a description of whether or not the differences are normally distributed based on the QQ Plot. (Hint: the differences do not meet the conditions to be considered normally distributed!)