PSTAT 5A: Lecture 23

Tying Up Loose Ends

Annie Adams

2023-09-09

Smoking

I previously emphasized that observational studies cannot be used to draw causal conclusions, even if they reveal associations.
I mentioned this is because of the potential presence of so-called confounding variables.
Let’s think through a concrete example together.

Setup

A researcher would like to determine whether or not there is an association between smoking and increased rates of lung cancer. To that end, they perform an observational study in which 50 people who regularly smoke were observed along with 50 people who do not smoke. Lung cancer rates within each group were recorded at the end of the study, and the data clearly (and statistically) displays that the average lung cancer rates among smokers is higher than that among non-smokers.

Smoking

Again, we cannot then simply conclude that smoking causes lung cancer.
All we can conclude is precisely what was stated above: there is statistical evidence to suggest that smoking is associated with higher lung cancer rates.
Why? It really boils down to asking: was the control (i.e. non-smoking) group truly similar to the treatment (i.e. smoking) group?
For example, what if it turns out that the group of smokers that were selected were also heavy drinkers? In that case, whether or not someone regularly drinks could be a confounding variable as it is one the researcher did not explicitly control for, but that could potentially skew results.
Additionally, some studies have shown that smokers tend to be predominantly male. As such, gender could also be a confounding variable.

Confounding Variables

The main point is: there are lots of variables that were not controlled for in this study, but that could be also contributing to the increased rates of lung cancer that was observed.
Hence, the study (as it was conducted above) cannot be used to say that smoking definitively causes lung cancer.
Now, I’d also like to stress- even if the researcher were to re-do the study as an experiment, we still wouldn’t be able to simply declare that smoking causes lung cancer.
To truly establish causal relationships, one must use results from causal inference (which is outside the scope of this course).

Case Study: Admissions at UC Berkeley

In the 1970’s, UC Berkeley conducted an observational study to determine whether or not there was gender bias in the graduate student admittance practices at the university.
- A disclaimer: at the time, “gender” was treated as a binary variable with values male and female.
Overall, the survey included 8,422 men and 4,321 women.
Of the men 44% were admitted; of the women only 35% were admitted.
- This difference was also deemed statistically significant.
So, on the surface, it does appear as though women are being disproportionately denied entry.

Case Study: Admissions at UC Berkeley

Something puzzling happens, however, when we take a look at the data after grouping by major:

	Men		Women
Major	Num. Applicants	% Admitted	Num. Applicants	% Admitted
A	825	62	108	82
B	560	63	25	68
C	325	37	593	34
D	417	33	375	35
E	191	28	393	24
F	373	6	341	7

Case Study: Admissions at UC Berkeley

Nearly none of the majors on their own display this bias against women.
- Indeed, in Major A there almost appears to be a bias against men
So, what’s going on? How can it be that none of the majors individually display a discrimination against women, but overall they display discrimination against women?
The answer actually lies in how difficult each major was to get into.
For instance, Major A appears to have an overall 64% acceptance rate, whereas Major E appears to have an overall 53.62% acceptance rate.
- Major A seems to be harder to get into than, say, Major E.
- Indeed, Majors A and B are easier to get into than majors C through F.

Case Study: Admissions at UC Berkeley

Indeed, if we look at the Num. Applicants column within each gender, we see that, on the aggregate, men were applying to easier majors!
- Over half (i.e. $(825 + 560) / (825 + 560 + 325 + 417 + 191 + 373) \approx 51.5\%$) of men applied to Majors A and B, whereas nearly 90% (i.e. $(593 + 375 + 393 + 341) / (108 + 25 + 593 + 375 + 393 + 341) $) of women applied to Majors C through F.
In other words, difficulty of major was a confounding variable that influenced the acceptance rates.
- After controlling for this variable, it was actually found that there was no significant difference in addmittance rates between men and women.

What Now?

Alright, so that was the last bit of new material I wanted to cover in this class.
But…. why did we do all of this?
What was the point of this course?
- If you say “the point was for me to finish my major”, well, then, fair enough!
But, more fundamentally, this course is designed to try and provide an introduction to Data Science.

What Now?

The key operating word in this is “introduction-” we only just scratched the surface of the topics we discussed!
- I sometimes refer to PSTAT 5A as a “table of contents” of statistics and data science.
Even within our own PSTAT department, there are lots of different courses that dive deeper into the subjects we discussed.

What Now?

If you’d like to learn some more programming (and get a recap of some probability), take PSTAT 10 (where you will learn a very popular programming language among statisticians, called R).
PSTAT 120A provides a deeper look at Probability, and some more sophisticated probability tools.
PSTAT 120B and 120C provide a deeper look at inferential statistics, and how to answer much more interesting and complex problems than those we looked at in this course.
- You’ll even learn a third, strange yet useful appraoach to probability called the Bayesian approach.
Interested in Experimental Design? PSTAT 122 is devoted entirely to that!
Want to learn more about regression (including logistic regression)? Take PSTAT 126 and 131!

Wherever there is data, there is the need for a data scientist.
Whenever there is uncertainty, there is the need for a statistician.
Statistics and Data Science have far reaching applications in so many fields!
There is a famous quote from an extremely influential statistician named John Tukey:

The best thing about being a statistician is that you get to play in everyone’s backyard.

Thank you for a great quarter!