It’s finally time for us to revisit our notions of descriptive statistics (from Week 1 of the course), now in the context of Python!
Modules, Revisited
Before we talk about plotting, we will need to quickly talk about modules again. Recall from Lab01 that modules are Python files containing definitions for functions and classes. Up until now, we’ve been importing all functions and classes from a module using the command
from <module name> import *
There is another way to import modules, which is the following:
import <module name> as <abbreviation>
For example,
import numpy as np
not only imports the numpy
module but imports it with the abbreviation (i.e. nickname) np
so that we can simply write np
in place of numpy
.
This is particularly useful because module names can sometimes be quite long, so referring to a module by a shortened nickname saves a lot of typing!
In general, if we import a module using
import <module name> as <abbreviation>
we reference functions from <module name>
using the syntax
<abbreviation>.<function name>()
For example, after having imported the numpy
module with the nickname np
, we access the sin()
function contained in the numpy
module by calling
np.sin()
Note, however, that after importing numpy as np, running numpy.sin() would raise a NameError, because only the nickname np has been defined.
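Here is a minimal sketch of the nickname syntax in action. The variable names value and full_name_works are made up for illustration; the try/except is only there so the cell keeps running after the error.

```python
import numpy as np  # bind the numpy module to the nickname np

# The nickname works as a prefix for any function in the module.
value = np.sin(0.0)
print(value)

# The full name numpy was never defined, so using it raises a NameError.
try:
    numpy.sin(0.0)
    full_name_works = True
except NameError:
    full_name_works = False
print(full_name_works)
```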
Numerical Summaries
Measures of Central Tendency
Recall that for a list of numbers \(X = \{x_i\}_{i=1}^{n}\), the mean is defined to be \[ \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} (x_1 + \cdots + x_n) \] Computing the mean of a list or array of numbers in Python is relatively simple, using the np.mean()
function [recall that we imported the numpy
module with the abbreviation np
, meaning np.mean()
is a shorthand for numpy.mean()
]. Similarly, to compute the median of a list or array we can use np.median()
.
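As a quick illustration, here is a sketch that computes both measures for a small made-up list called scores (the data and variable names are assumptions, not part of the lab):

```python
import numpy as np

scores = [2, 4, 4, 5, 10]  # made-up data, for illustration only

mean_score = np.mean(scores)      # (2 + 4 + 4 + 5 + 10) / 5
median_score = np.median(scores)  # middle value of the sorted list

print(mean_score)
print(median_score)
```

Notice that the single large value 10 pulls the mean above the median.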
Measures of Spread
Recall that we also discussed several measures of spread:
- Standard deviation
- IQR (Interquartile Range)
- Range
Sure enough, the numpy
module contains several functions which help us compute these measures. Let’s examine each separately.
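The range translates directly from its definition: the largest value minus the smallest. Here is one possible sketch using np.max() and np.min() on a made-up list scores; the lab's own Task may compute it differently.

```python
import numpy as np

scores = [2, 4, 4, 5, 10]  # made-up data, for illustration only

# Range = maximum - minimum
data_range = np.max(scores) - np.min(scores)
print(data_range)
```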
Next, we tackle a slightly peculiar function: np.std()
. We expect this to compute the standard deviation of a list/array, but…
The result of the previous Task is the following: given a list x = [x1, x2, ..., xn]
, running np.std(x)
actually computes \[ \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2 } \] as opposed to our usual definition of standard deviation \[ s_X = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \overline{x})^2} \] We can actually fix this issue by passing in an additional argument to the np.std()
function:
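The argument in question is presumably ddof ("delta degrees of freedom"): np.std() divides by n - ddof, and ddof defaults to 0. A minimal sketch, with a made-up list x:

```python
import numpy as np

x = [1, 2, 3, 4]  # made-up data, for illustration only

population_sd = np.std(x)       # default ddof=0: divides by n
sample_sd = np.std(x, ddof=1)   # ddof=1: divides by n - 1

print(population_sd)
print(sample_sd)
```

Here the squared deviations sum to 5, so the first call returns the square root of 5/4 and the second returns the square root of 5/3, which matches our usual definition.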
Finally, we turn to the IQR: to compute the IQR of a list/array x
, we use (after importing numpy
as np
)
np.diff(np.percentile(x, [25, 75]))[0]
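To see why this one-liner works, it can help to unpack it. In the sketch below (the list x and the names q1, q3, and iqr are made up for illustration), np.percentile() returns the first and third quartiles, np.diff() subtracts them, and [0] pulls the single number out of the resulting array.

```python
import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data, for illustration only

# Step 1: the 25th and 75th percentiles (first and third quartiles)
q1, q3 = np.percentile(x, [25, 75])

# Step 2: np.diff returns an array of consecutive differences;
# with two inputs there is exactly one difference, so we take element 0.
iqr = np.diff(np.percentile(x, [25, 75]))[0]

print(q1, q3, iqr)
```

Keep in mind that np.percentile() interpolates linearly between data points by default, so on some datasets its quartiles can differ slightly from ones computed by hand.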
Visualizations
It’s finally time to make pretty pictures! The module we will use to generate visualizations in this class is the matplotlib
module (though there are quite a few other modules that work for visualizations as well). The official website for matplotlib
can be found at https://matplotlib.org/.
Before we generate any plots, we will need to run the following code once:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
Here’s what these lines of code are doing:
- %matplotlib inline tells Jupyter to actually display our plots in our notebook (if we didn’t include this line, our plots wouldn’t display)
- import matplotlib imports the matplotlib module
- import matplotlib.pyplot as plt imports the pyplot submodule (a submodule is just a module contained within another larger module) with the abbreviation plt
- plt.style.use('seaborn-v0_8-whitegrid') tells matplotlib to use a specific theme (called seaborn-v0_8-whitegrid) when generating plots.
Again, notice the beauty of the import <module> as <abbreviation>
syntax: after running the third line above, we no longer need to write matplotlib.pyplot
, just plt
! Also, there are lots of other themes you can use when generating your plots: after completing this lab, I encourage you to consult this reference guide for a list of a few other pyplot
themes.
Boxplots and Histograms
Now, let’s proceed on to make some plots. The first two types of plots we will look at are the two we used to describe numerical data: namely, boxplots and histograms. The functions we will use are the plt.boxplot()
and plt.hist()
functions, respectively.
Of course, boxplots are not the only way to summarize numerical variables: we also have histograms!
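Here is a rough sketch of both plot types, assuming some randomly generated data (the variable data, the seed, and the choice of 15 bins are all made up for illustration). Each call to plt.figure() starts a fresh plot so that the two graphs don't land on top of each other.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 200 draws from a normal distribution
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=50, scale=10, size=200)

# Boxplot of the data
plt.figure(figsize=(4.5, 2.25))
box = plt.boxplot(data)
plt.title("Boxplot of made-up data")

# Histogram of the same data, using 15 bins
plt.figure(figsize=(4.5, 2.25))
counts, bin_edges, patches = plt.hist(data, bins=15)
plt.title("Histogram of made-up data")
```

plt.hist() also hands back the bin counts and bin edges, which can be handy for checking what was actually drawn.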
Scatterplots
We should also quickly discuss how to generate scatterplots in Python.
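One way to do this is with the plt.scatter() function, which takes the x-values and the y-values as its first two arguments. A minimal sketch with made-up data (the variables hours and scores, and the relationship between them, are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: scores that tend to increase with hours, plus noise
rng = np.random.default_rng(seed=1)
hours = rng.uniform(0, 10, size=50)
scores = 50 + 4 * hours + rng.normal(0, 5, size=50)

plt.figure(figsize=(4.5, 2.25))
points = plt.scatter(hours, scores)  # one dot per (hours, scores) pair
plt.xlabel("hours")
plt.ylabel("scores")
```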
Plotting a Function
Finally, I’d like to take a quick detour from descriptive statistics and talk about how to plot a function using Python. As a concrete example, let’s try and plot a sine curve from \(0\) to \(2\pi\).
If you recall, on Lab01 we used the sin()
function from the math
module; it turns out that the numpy
module (which, recall, we have imported as np
) also has a sin()
function, so let’s use that one today:
np.sin()
Next, we create a set of finely-spaced points between our two desired endpoints (in this case, \(0\) and \(2\pi\), respectively). We will do so using the np.linspace()
function, which works as follows:
np.linspace(start, stop, num)
creates a set of num
evenly-spaced values between start
and stop
, with both endpoints included. For instance:
np.linspace(0, 1, 10)
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])
In the context of plotting, the more points we generate the smoother our plot will seem (you will see what this means in a minute). As such, let’s start with 150
points between 0
and 2 * pi
:
x = np.linspace(0, 2 * np.pi, 150)
Finally, we call the plt.plot()
function on x
and np.sin(x)
to generate our plot:
plt.figure(figsize=(4.5, 2.25))
plt.plot(x, np.sin(x))
Let’s see what would have happened if we used fewer values in our np.linspace()
call:
xnew = np.linspace(0, 2 * np.pi, 10)
plt.plot(xnew, np.sin(xnew))
So, the more points we include in our call to np.linspace()
, the smoother our final plot will look!
So, to summarize, here is the general “recipe” to plot a function f()
between two values a
and b
in Python:
- Let
x = np.linspace(a, b, <some large value>)
- Call
plt.plot(x, f(x))
- Add labels/titles as necessary
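The recipe above can be sketched as follows. The function f() here is a made-up example (a parabola), and the endpoints and number of points are arbitrary choices; the only requirement is that f() works on a whole numpy array at once.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    """A made-up example function: f(x) = x^2 - 1."""
    return x ** 2 - 1

a, b = -2, 2  # endpoints, chosen arbitrarily for illustration

# Step 1: finely-spaced points between a and b
x = np.linspace(a, b, 200)

# Step 2: plot f over those points
plt.figure(figsize=(4.5, 2.25))
line, = plt.plot(x, f(x))

# Step 3: labels and a title
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Graph of f")
```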
Overlaying Plots
Sometimes it will be useful to overlay two plots on top of each other. Recall that, for a function f()
and a variable x
that has been assigned a value resulting from a call to numpy.linspace()
, we generate a graph of f()
using (assuming matplotlib.pyplot
has been imported as plt
)
plt.plot(x, f(x));
It stands to reason, then, that given another function g()
we should be able to superimpose the graph of g()
onto the graph of f()
by simply adding another call to plt.plot()
:
plt.plot(x, f(x));
plt.plot(x, g(x));
Now, as it stands, it’s a bit difficult to determine which curve corresponds to the sine curve and which corresponds to the cosine curve. As such, we should add some labels!
Hm, doesn’t look like anything changed… That’s because we didn’t add a legend to our plot! To add a legend, we simply tack on a call to plt.legend()
after our code from above.
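Putting the two steps together for the sine and cosine curves might look like the sketch below. The label text is passed through the label argument of plt.plot(); the particular label strings are just one reasonable choice.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 150)

plt.figure(figsize=(4.5, 2.25))
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), label="cos(x)")

# Labels are only stored by plt.plot(); plt.legend() is what draws them.
legend = plt.legend()
```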
Okay, we’re almost there! The only issue is that now the legend is covered up by the actual graphs. One way we can fix this is by extending the \(y\)-axis further, using the function plt.ylim()
:
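plt.ylim() takes the lower and upper limits of the \(y\)-axis, in that order. In the sketch below the limits -1.1 and 1.8 are arbitrary choices that leave some empty space above the curves for the legend to sit in; other values would work just as well.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 150)

plt.figure(figsize=(4.5, 2.25))
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), label="cos(x)")

# Both curves stay between -1 and 1, so raising the upper limit
# leaves room at the top of the plot for the legend.
plt.ylim(-1.1, 1.8)
plt.legend()
```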
Finally, it is sometimes considered bad form to rely too heavily on colors in plots. This is because doing so alienates readers who are colorblind. One way around this is to rely on different line types; e.g. use dashed lines for one graph and dotted lines for another.
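In matplotlib, the line type can be set with the linestyle argument of plt.plot(): "--" gives a dashed line and ":" gives a dotted line. A minimal sketch, again using the sine and cosine curves (the variable names sine_line and cosine_line are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 150)

plt.figure(figsize=(4.5, 2.25))
# Dashed line for sine, dotted line for cosine
sine_line, = plt.plot(x, np.sin(x), linestyle="--", label="sin(x)")
cosine_line, = plt.plot(x, np.cos(x), linestyle=":", label="cos(x)")
plt.legend()
```

The legend picks up the line types automatically, so the two curves can be told apart even when printed in grayscale.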
What to Turn In
Congrats on finishing Lab 03! Download the .ipynb
version of your notebook and upload it to Gradescope!