Annie Adams
2023-09-09
I previously emphasized that observational studies cannot be used to draw causal conclusions, even if they reveal associations.
I mentioned this is because of the potential presence of so-called confounding variables.
Let’s think through a concrete example together.
Setup
A researcher would like to determine whether or not there is an association between smoking and increased rates of lung cancer. To that end, they perform an observational study in which 50 people who regularly smoke are observed along with 50 people who do not smoke. Lung cancer rates within each group are recorded at the end of the study, and the data clearly (and with statistical significance) show that the lung cancer rate among smokers is higher than that among non-smokers.
Again, we cannot then simply conclude that smoking causes lung cancer.
All we can conclude is precisely what was stated above: there is statistical evidence to suggest that smoking is associated with higher lung cancer rates.
Why? It really boils down to asking: was the control (i.e. non-smoking) group truly similar to the treatment (i.e. smoking) group?
For example, what if it turns out that the smokers who were selected were also heavy drinkers? In that case, whether or not someone regularly drinks could be a confounding variable: it is one the researcher did not explicitly control for, but it could still skew the results.
Additionally, some studies have shown that smokers tend to be predominantly male. As such, gender could also be a confounding variable.
The main point is: there are lots of variables that were not controlled for in this study, but that could also be contributing to the increased rates of lung cancer that were observed.
Hence, the study (as it was conducted above) cannot be used to say that smoking definitively causes lung cancer.
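To see how a confounder can manufacture an association all by itself, here is a minimal sketch of a toy world. Every number in it is invented purely for illustration (none of it is real data), and the names `P_DRINKER`, `P_SMOKER_GIVEN_DRINKER`, `P_CANCER_GIVEN_DRINKER`, and `cancer_rate` are hypothetical. In this toy world, lung cancer risk depends only on drinking and not on smoking at all, yet drinkers are more likely to smoke.

```python
# Toy illustration of confounding. ALL numbers are invented assumptions,
# not real data. In this made-up world, cancer risk depends ONLY on drinking.
P_DRINKER = 0.5                                      # share of heavy drinkers
P_SMOKER_GIVEN_DRINKER = {True: 0.8, False: 0.2}     # drinkers smoke more often
P_CANCER_GIVEN_DRINKER = {True: 0.20, False: 0.05}   # no dependence on smoking


def cancer_rate(smoker: bool) -> float:
    """P(cancer | smoking status), averaging over the hidden drinking variable."""
    p_status = 0.0             # P(smoking status)
    p_cancer_and_status = 0.0  # P(cancer and smoking status)
    for drinker in (True, False):
        p_drinker = P_DRINKER if drinker else 1 - P_DRINKER
        p_smoke = P_SMOKER_GIVEN_DRINKER[drinker]
        joint = p_drinker * (p_smoke if smoker else 1 - p_smoke)
        p_status += joint
        p_cancer_and_status += joint * P_CANCER_GIVEN_DRINKER[drinker]
    return p_cancer_and_status / p_status


rate_smokers = cancer_rate(True)
rate_non_smokers = cancer_rate(False)
print(f"Cancer rate among smokers:     {100 * rate_smokers:.1f}%")
print(f"Cancer rate among non-smokers: {100 * rate_non_smokers:.1f}%")
```

With these made-up inputs, smokers come out at about 17% and non-smokers at about 8%, even though smoking has no effect whatsoever in the toy world. The whole gap comes from the fact that smokers are more likely to be drinkers, which is exactly why an association on its own cannot settle the causal question.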
Now, I’d also like to stress: even if the researcher were to re-do the study as an experiment, we still wouldn’t be able to simply declare that smoking causes lung cancer.
To truly establish causal relationships, one must use results from causal inference (which is outside the scope of this course).
In the 1970s, UC Berkeley conducted an observational study to determine whether or not there was gender bias in the university's graduate admissions practices.
Applicants were recorded as either male or female. Overall, the study included 8,442 men and 4,321 women.
Of the men, 44% were admitted; of the women, only 35% were admitted.
So, on the surface, it does appear as though women were being disproportionately denied entry.
Consider, however, the admissions figures broken down by major:
Major | Men: Num. Applicants | Men: % Admitted | Women: Num. Applicants | Women: % Admitted |
A | 825 | 62 | 108 | 82 |
B | 560 | 63 | 25 | 68 |
C | 325 | 37 | 593 | 34 |
D | 417 | 33 | 375 | 35 |
E | 191 | 28 | 393 | 24 |
F | 373 | 6 | 341 | 7 |
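Most of the majors on their own do not display this bias against women: in four of the six (A, B, D, and F), women were admitted at a higher rate than men, and in the other two (C and E) the gap is only a few percentage points.
So, what’s going on? How can it be that the majors individually show little or no discrimination against women, yet the overall figures appear to?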
The answer actually lies in how difficult each major was to get into.
For instance, Major A appears to have an overall acceptance rate of about 64%, whereas Major E appears to have an overall acceptance rate of only about 25%.
Major A seems to be easier to get into than, say, Major E.
Indeed, Majors A and B are easier to get into than Majors C through F.
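If you'd like to check the difficulty of each major yourself, here is a minimal sketch that combines the men's and women's rows of the table into one overall acceptance rate per major, weighting each percentage by its number of applicants. The names `ADMISSIONS`, `combined_rate`, and `combined` are just illustrative choices, and because the table only reports rounded percentages, the results are approximate.

```python
# (major, men applicants, men % admitted, women applicants, women % admitted),
# copied from the table above. Percentages are rounded, so results are approximate.
ADMISSIONS = [
    ("A", 825, 62, 108, 82),
    ("B", 560, 63, 25, 68),
    ("C", 325, 37, 593, 34),
    ("D", 417, 33, 375, 35),
    ("E", 191, 28, 393, 24),
    ("F", 373, 6, 341, 7),
]


def combined_rate(men_n, men_pct, women_n, women_pct):
    """Overall % admitted for one major, weighting each gender by its applicants."""
    admitted = men_n * men_pct / 100 + women_n * women_pct / 100
    return 100 * admitted / (men_n + women_n)


combined = {
    major: combined_rate(men_n, men_pct, women_n, women_pct)
    for major, men_n, men_pct, women_n, women_pct in ADMISSIONS
}

# List the majors from easiest to hardest to get into.
for major, rate in sorted(combined.items(), key=lambda item: -item[1]):
    print(f"Major {major}: about {rate:.0f}% admitted overall")
```

Notice that we cannot simply average the two percentages in a row: a major with 825 male applicants and 108 female applicants is dominated by the men's rate, so the weights matter.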
Moreover, if we look at the Num. Applicants columns for each gender, we see that, in the aggregate, men tended to apply to the easier majors!
In other words, difficulty of major was a confounding variable that influenced the acceptance rates.
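Here is a minimal sketch of that reversal, using only the six majors in the table. For each gender it computes the pooled acceptance rate and the share of applicants who applied to Majors A and B. The names `ADMISSIONS`, `EASIER_MAJORS`, and `summarize` are hypothetical, and since the table covers only six majors and reports rounded percentages, the pooled rates will not exactly match the campus-wide figures quoted earlier.

```python
# major -> {gender: (number of applicants, % admitted)}, copied from the table.
ADMISSIONS = {
    "A": {"men": (825, 62), "women": (108, 82)},
    "B": {"men": (560, 63), "women": (25, 68)},
    "C": {"men": (325, 37), "women": (593, 34)},
    "D": {"men": (417, 33), "women": (375, 35)},
    "E": {"men": (191, 28), "women": (393, 24)},
    "F": {"men": (373, 6), "women": (341, 7)},
}
EASIER_MAJORS = ("A", "B")  # the two majors with the highest acceptance rates


def summarize(gender):
    """Pooled % admitted and % applying to the easier majors, for one gender."""
    rows = [by_gender[gender] for by_gender in ADMISSIONS.values()]
    applicants = sum(n for n, _ in rows)
    admitted = sum(n * pct / 100 for n, pct in rows)  # approximate counts
    to_easier = sum(ADMISSIONS[major][gender][0] for major in EASIER_MAJORS)
    return {
        "admit_rate": 100 * admitted / applicants,
        "share_easier": 100 * to_easier / applicants,
    }


men = summarize("men")
women = summarize("women")
for label, stats in (("Men", men), ("Women", women)):
    print(
        f"{label}: {stats['admit_rate']:.1f}% admitted across the six majors, "
        f"{stats['share_easier']:.1f}% applied to Majors A or B"
    )
```

Across these six majors, roughly half of the men applied to Majors A or B, compared with well under a tenth of the women. That difference in where people applied is enough to make the pooled rate for men higher, even though women do as well as or better than men in most individual majors.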
Alright, so that was the last bit of new material I wanted to cover in this class.
But… why did we do all of this?
What was the point of this course?
Most fundamentally, this course is designed to provide an introduction to Data Science.
If you’d like to learn some more programming (and get a recap of some probability), take PSTAT 10 (where you will learn R, a programming language that is very popular among statisticians).
PSTAT 120A provides a deeper look at Probability, and some more sophisticated probability tools.
PSTAT 120B and 120C provide a deeper look at inferential statistics, and at how to tackle much more interesting and complex problems than those we looked at in this course.
Interested in Experimental Design? PSTAT 122 is devoted entirely to that!
Want to learn more about regression (including logistic regression)? Take PSTAT 126 and 131!
Wherever there is data, there is the need for a data scientist.
Whenever there is uncertainty, there is the need for a statistician.
Statistics and Data Science have far-reaching applications in so many fields!
There is a famous quote from an extremely influential statistician named John Tukey:
The best thing about being a statistician is that you get to play in everyone’s backyard.