{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Analysis with Pandas DataFrames\n",
"\n",
"### What we will cover\n",
"* Python Data Analysis Library\n",
"* Explore and plot data\n",
"* Common workflows for data analysis\n",
"\n",
"We could spend a full week on data frames and analysis workflows alone, but today, we will focus on the foundation so you can explore more advanced options on your own.\n",
"\n",
"Lesson plan comes from Data Carpentry:\n",
"* https://datacarpentry.org/python-ecology-lesson/02-starting-with-data/index.html\n",
"* https://datacarpentry.org/python-ecology-lesson/03-index-slice-subset/index.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup\n",
"* Start from `~/Desktop/carpentries/` so that we're in the same place\n",
"* You can check this by running `%pwd` in a code cell"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'/home/charles/Desktop/carpentries'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%pwd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Today, we're going to work with an ecology dataset for hands-on practice. A subset of Ernst et al Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA.\n",
"\n",
"We are studying the species and weight of animals caught in different sites in our study area.\n",
"\n",
"* Data link: https://drive.google.com/open?id=1UFs_lqcei76-EI3oaVM7o1mf-HW57eso\n",
"\n",
"Each column has a feature, and each row is an observation. A few observations are missing features, and that is okay."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importing pandas\n",
"* Python Data Analysis Library (Pandas)\n",
"* Integrates with `matplotlib` and `numpy`\n",
"\n",
"Some motivation for those who might use Excel already: Pandas is really flexible, which means it can be difficult to learn at first but ends up being really powerful with some practice.\n",
"* Really good at automating data-set cleanup and merging\n",
"* Quick ad-hoc analysis with missing data\n",
"* Integrated into other Python libraries (machine learning, instrumentation, etc)\n",
"* groupby analysis (similar to pivot tables)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Now we use the syntax LibraryName.FunctionName, eg pd.read_csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read in CSV data\n",
"* Can also use tab-separated, colon-separated, etc, different formats\n",
"* General idea is that we're working with *table* data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
record_id
\n",
"
month
\n",
"
day
\n",
"
year
\n",
"
plot_id
\n",
"
species_id
\n",
"
sex
\n",
"
hindfoot_length
\n",
"
weight
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
2
\n",
"
NL
\n",
"
M
\n",
"
32.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
1
\n",
"
2
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
3
\n",
"
NL
\n",
"
M
\n",
"
33.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
2
\n",
"
3
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
2
\n",
"
DM
\n",
"
F
\n",
"
37.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
3
\n",
"
4
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
7
\n",
"
DM
\n",
"
M
\n",
"
36.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
4
\n",
"
5
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
3
\n",
"
DM
\n",
"
M
\n",
"
35.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
35544
\n",
"
35545
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
15
\n",
"
AH
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
35545
\n",
"
35546
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
15
\n",
"
AH
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
35546
\n",
"
35547
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
10
\n",
"
RM
\n",
"
F
\n",
"
15.0
\n",
"
14.0
\n",
"
\n",
"
\n",
"
35547
\n",
"
35548
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
7
\n",
"
DO
\n",
"
M
\n",
"
36.0
\n",
"
51.0
\n",
"
\n",
"
\n",
"
35548
\n",
"
35549
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
5
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
35549 rows × 9 columns
\n",
"
"
],
"text/plain": [
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"0 1 7 16 1977 2 NL M 32.0 \n",
"1 2 7 16 1977 3 NL M 33.0 \n",
"2 3 7 16 1977 2 DM F 37.0 \n",
"3 4 7 16 1977 7 DM M 36.0 \n",
"4 5 7 16 1977 3 DM M 35.0 \n",
"... ... ... ... ... ... ... ... ... \n",
"35544 35545 12 31 2002 15 AH NaN NaN \n",
"35545 35546 12 31 2002 15 AH NaN NaN \n",
"35546 35547 12 31 2002 10 RM F 15.0 \n",
"35547 35548 12 31 2002 7 DO M 36.0 \n",
"35548 35549 12 31 2002 5 NaN NaN NaN \n",
"\n",
" weight \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"... ... \n",
"35544 NaN \n",
"35545 NaN \n",
"35546 14.0 \n",
"35547 51.0 \n",
"35548 NaN \n",
"\n",
"[35549 rows x 9 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# variable naming:\n",
"# \"surveys\" for dataset name, \"_df\" for dataframe\n",
"surveys_df = pd.read_csv('data/surveys.csv')\n",
"surveys_df"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"0 1 7 16 1977 2 NL M 32.0 \n",
"1 2 7 16 1977 3 NL M 33.0 \n",
"2 3 7 16 1977 2 DM F 37.0 \n",
"3 4 7 16 1977 7 DM M 36.0 \n",
"4 5 7 16 1977 3 DM M 35.0 \n",
"... ... ... ... ... ... ... ... ... \n",
"35544 35545 12 31 2002 15 AH NaN NaN \n",
"35545 35546 12 31 2002 15 AH NaN NaN \n",
"35546 35547 12 31 2002 10 RM F 15.0 \n",
"35547 35548 12 31 2002 7 DO M 36.0 \n",
"35548 35549 12 31 2002 5 NaN NaN NaN \n",
"\n",
" weight \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"... ... \n",
"35544 NaN \n",
"35545 NaN \n",
"35546 14.0 \n",
"35547 51.0 \n",
"35548 NaN \n",
"\n",
"[35549 rows x 9 columns]\n"
]
}
],
"source": [
"# You may notice that pandas data-frames start at 0 indexing\n",
"# A slightly different view:\n",
"print(surveys_df)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"0 1 7 16 1977 2 NL M 32.0 \n",
"1 2 7 16 1977 3 NL M 33.0 \n",
"2 3 7 16 1977 2 DM F 37.0 \n",
"3 4 7 16 1977 7 DM M 36.0 \n",
"4 5 7 16 1977 3 DM M 35.0 \n",
"\n",
" weight \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Access methods with `df_object.method()`\n",
"# For example:\n",
"surveys_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(35549, 9)\n",
"\n",
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"0 1 7 16 1977 2 NL M 32.0 \n",
"1 2 7 16 1977 3 NL M 33.0 \n",
"2 3 7 16 1977 2 DM F 37.0 \n",
"3 4 7 16 1977 7 DM M 36.0 \n",
"4 5 7 16 1977 3 DM M 35.0 \n",
"5 6 7 16 1977 1 PF M 14.0 \n",
"6 7 7 16 1977 2 PE F NaN \n",
"7 8 7 16 1977 1 DM M 37.0 \n",
"8 9 7 16 1977 1 DM F 34.0 \n",
"9 10 7 16 1977 6 PF F 20.0 \n",
"10 11 7 16 1977 5 DS F 53.0 \n",
"11 12 7 16 1977 7 DM M 38.0 \n",
"12 13 7 16 1977 3 DM M 35.0 \n",
"13 14 7 16 1977 8 DM NaN NaN \n",
"14 15 7 16 1977 6 DM F 36.0 \n",
"\n",
" weight \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"5 NaN \n",
"6 NaN \n",
"7 NaN \n",
"8 NaN \n",
"9 NaN \n",
"10 NaN \n",
"11 NaN \n",
"12 NaN \n",
"13 NaN \n",
"14 NaN \n",
"\n",
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"35544 35545 12 31 2002 15 AH NaN NaN \n",
"35545 35546 12 31 2002 15 AH NaN NaN \n",
"35546 35547 12 31 2002 10 RM F 15.0 \n",
"35547 35548 12 31 2002 7 DO M 36.0 \n",
"35548 35549 12 31 2002 5 NaN NaN NaN \n",
"\n",
" weight \n",
"35544 NaN \n",
"35545 NaN \n",
"35546 14.0 \n",
"35547 51.0 \n",
"35548 NaN \n"
]
}
],
"source": [
"# Challenge - DataFrames\n",
"# See what the following expressions return:\n",
"print(surveys_df.shape)\n",
"print() # Blank line\n",
"print(surveys_df.head(15))\n",
"print() # Blank line\n",
"print(surveys_df.tail())"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 NL\n",
"1 NL\n",
"2 DM\n",
"3 DM\n",
"4 DM\n",
" ... \n",
"35544 AH\n",
"35545 AH\n",
"35546 RM\n",
"35547 DO\n",
"35548 NaN\n",
"Name: species_id, Length: 35549, dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Grab a single column with df_object['column_name'], or\n",
"# df_object.column_name (if column name is a single word)\n",
"surveys_df.species_id\n",
"surveys_df['species_id'] # equivalent to above"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"^ We call a single-column a `Series`. Most operations you can do with a `DataFrame`, you can also do with a `Series`.\n",
"\n",
"For all possible `Series` attributes and methods, see: https://pandas.pydata.org/pandas-docs/stable/reference/series.html"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',\n",
" 'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',\n",
" 'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',\n",
" 'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',\n",
" 'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's grab the different species\n",
"pd.unique(surveys_df['species_id'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise - exploring columns\n",
"* Create a list of unique site ID’s (`'plot_id'`) found in the surveys data. Call it `site_names`. How many unique sites are there in the data?\n",
"\n",
"* Is there a difference between `len(site_names)` and `surveys_df['plot_id'].nunique()`?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of unique sites, calculated 2 ways:\n",
"len(site_names) = 24\n",
"surveys_df['plot_id'].nunique() = 24\n"
]
}
],
"source": [
"# Solution to exercise\n",
"site_names = pd.unique(surveys_df['plot_id'])\n",
"\n",
"print('Number of unique sites, calculated 2 ways:')\n",
"print('len(site_names) =', len(site_names))\n",
"print(\"surveys_df['plot_id'].nunique() =\", surveys_df['plot_id'].nunique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary Statistics\n",
"We often want to calculate summary statistics to get intuition for a large dataset."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 32283.000000\n",
"mean 42.672428\n",
"std 36.631259\n",
"min 4.000000\n",
"25% 20.000000\n",
"50% 37.000000\n",
"75% 48.000000\n",
"max 280.000000\n",
"Name: weight, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"surveys_df['weight'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"32283"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract your standard metrics one at a time\n",
"surveys_df['weight'].min() # minimum of this column\n",
"surveys_df['weight'].max() # maximum of this column\n",
"surveys_df['weight'].mean() # average (mean) of this column\n",
"surveys_df['weight'].std() # standard deviation of this column\n",
"surveys_df['weight'].count() # number of non-missing values in this column"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
record_id
\n",
"
month
\n",
"
day
\n",
"
year
\n",
"
plot_id
\n",
"
hindfoot_length
\n",
"
weight
\n",
"
\n",
" \n",
" \n",
"
\n",
"
count
\n",
"
35549.000000
\n",
"
35549.000000
\n",
"
35549.000000
\n",
"
35549.000000
\n",
"
35549.000000
\n",
"
31438.000000
\n",
"
32283.000000
\n",
"
\n",
"
\n",
"
mean
\n",
"
17775.000000
\n",
"
6.474022
\n",
"
16.105966
\n",
"
1990.475231
\n",
"
11.397001
\n",
"
29.287932
\n",
"
42.672428
\n",
"
\n",
"
\n",
"
std
\n",
"
10262.256696
\n",
"
3.396583
\n",
"
8.256691
\n",
"
7.493355
\n",
"
6.799406
\n",
"
9.564759
\n",
"
36.631259
\n",
"
\n",
"
\n",
"
min
\n",
"
1.000000
\n",
"
1.000000
\n",
"
1.000000
\n",
"
1977.000000
\n",
"
1.000000
\n",
"
2.000000
\n",
"
4.000000
\n",
"
\n",
"
\n",
"
25%
\n",
"
8888.000000
\n",
"
4.000000
\n",
"
9.000000
\n",
"
1984.000000
\n",
"
5.000000
\n",
"
21.000000
\n",
"
20.000000
\n",
"
\n",
"
\n",
"
50%
\n",
"
17775.000000
\n",
"
6.000000
\n",
"
16.000000
\n",
"
1990.000000
\n",
"
11.000000
\n",
"
32.000000
\n",
"
37.000000
\n",
"
\n",
"
\n",
"
75%
\n",
"
26662.000000
\n",
"
9.000000
\n",
"
23.000000
\n",
"
1997.000000
\n",
"
17.000000
\n",
"
36.000000
\n",
"
48.000000
\n",
"
\n",
"
\n",
"
max
\n",
"
35549.000000
\n",
"
12.000000
\n",
"
31.000000
\n",
"
2002.000000
\n",
"
24.000000
\n",
"
70.000000
\n",
"
280.000000
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" record_id month day year plot_id \\\n",
"count 35549.000000 35549.000000 35549.000000 35549.000000 35549.000000 \n",
"mean 17775.000000 6.474022 16.105966 1990.475231 11.397001 \n",
"std 10262.256696 3.396583 8.256691 7.493355 6.799406 \n",
"min 1.000000 1.000000 1.000000 1977.000000 1.000000 \n",
"25% 8888.000000 4.000000 9.000000 1984.000000 5.000000 \n",
"50% 17775.000000 6.000000 16.000000 1990.000000 11.000000 \n",
"75% 26662.000000 9.000000 23.000000 1997.000000 17.000000 \n",
"max 35549.000000 12.000000 31.000000 2002.000000 24.000000 \n",
"\n",
" hindfoot_length weight \n",
"count 31438.000000 32283.000000 \n",
"mean 29.287932 42.672428 \n",
"std 9.564759 36.631259 \n",
"min 2.000000 4.000000 \n",
"25% 21.000000 20.000000 \n",
"50% 32.000000 37.000000 \n",
"75% 36.000000 48.000000 \n",
"max 70.000000 280.000000 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You can use describe over the entire DataFrame\n",
"surveys_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split-apply-combine\n",
"We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to group by species, sex, or observation site. We use the `.groupby` method to summarize by one or more variables."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For example, let's group data by sex\n",
"grouped_data = surveys_df.groupby('sex')\n",
"grouped_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a new datatype used to hold a list of related dataframes. For extended examples, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html\n",
"\n",
"For all possible attributes and methods, see: https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
record_id
\n",
"
month
\n",
"
day
\n",
"
year
\n",
"
plot_id
\n",
"
hindfoot_length
\n",
"
weight
\n",
"
\n",
"
\n",
"
sex
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
F
\n",
"
18036.412046
\n",
"
6.583047
\n",
"
16.007138
\n",
"
1990.644997
\n",
"
11.440854
\n",
"
28.836780
\n",
"
42.170555
\n",
"
\n",
"
\n",
"
M
\n",
"
17754.835601
\n",
"
6.392668
\n",
"
16.184286
\n",
"
1990.480401
\n",
"
11.098282
\n",
"
29.709578
\n",
"
42.995379
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" record_id month day year plot_id \\\n",
"sex \n",
"F 18036.412046 6.583047 16.007138 1990.644997 11.440854 \n",
"M 17754.835601 6.392668 16.184286 1990.480401 11.098282 \n",
"\n",
" hindfoot_length weight \n",
"sex \n",
"F 28.836780 42.170555 \n",
"M 29.709578 42.995379 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Provide the mean for each numeric column by sex\n",
"grouped_data.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise - `groupby`\n",
"\n",
"1. How many recorded individuals are female `F` and how many male `M`?\n",
" * Remember that you can use `.mean()`, `.std()`, etc on `grouped_data`\n",
"2. Summarize weight values for each site (`'plot_id'`) in your data. You can use the following syntax to create summary statistics for a single column in your data:\n",
" * `by_site = `\n",
" * `by_site['weight'].describe()`"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sex\n",
"F 15690\n",
"M 17348\n",
"dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Solution to exercise - groupby #1\n",
"grouped_data = surveys_df.groupby('sex')\n",
"grouped_data.count()['record_id'] # works as long as columns are all filled out\n",
"grouped_data.size() # also similar"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# No title / axis units, but we'll get to that\n",
"# in the upcoming lesson dedicated to plotting\n",
"\n",
"# We can also look at how many animals were measured in each site:\n",
"total_count = surveys_df.groupby('plot_id')['record_id'].nunique()\n",
"# Let's plot that too\n",
"total_count.plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise - Plots\n",
"Create a plot of average weight across all species per site."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAEKCAYAAAALoA6YAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAASxUlEQVR4nO3dfbBcdX3H8feXBBTkMXB50BAvYxFqVYK9oBVblSCNohJb6IgORoea6fgAFR2J9sHq2DZ0Wh+mVttU1PiAiCgGtYAUiY5WgQQCAQMDIiAlksiDQttRwW//OL/outmnu3fvvfkl79fMmT17zvnu+d3dPZ/727PnnI3MRJJUn11muwGSpOEY4JJUKQNckiplgEtSpQxwSaqUAS5JlZo7kys74IADcnx8fCZXKUnVW7du3Y8zc6x9+owG+Pj4OGvXrp3JVUpS9SLirk7T3YUiSZUywCWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqtSMnsgjSdur8eVf7TrvzhUnzWBLBmcPXJIqZYBLUqUMcEmqlAEuSZUywCWpUga4JFWqusMIazzUR5Kmgz1wSarUrPXA7UlL0tRUtwtF2wf/AUuzz10oklSpgXrgEXEn8DDwGPBoZk5ExDzgc8A4cCfwJ5n54PQ0U5LUbjI98Bdm5sLMnCj3lwNXZubhwJXlviRphkxlF8rJwKoyvgpYMvXmSJIGNWiAJ/C1iFgXEcvKtIMycxNAuT1wOhooSeps0KNQjsvMeyPiQOCKiLhl0BWUwF8GsGDBgiGaKEnqZKAeeGbeW243AxcDxwL3RcQhAOV2c5falZk5kZkTY2Njo2m1JKl/DzwingDskpkPl/ETgfcAlwBLgRXldvV0NlQahsera0c2yC6Ug4CLI2Lr8udn5mURcS1wYUScAdwNnDp9zZQktesb4Jl5B3BUh+n3A4umo1GSpP48lV7SdsndX/15Kr0kVcoAl6RKuQuljx39Y9yO/veps26vu695XeyBS1KlDHBJqpQBLkmVMsAlqVIGuCRVyqNQpBHyqB7NJHvgklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKc/ElKRZMIqzdg1wVcFT1KVtuQtFkiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKGeCSVCkDXJIqNfCZmBExB1gL/HdmvjQiDgMuAOYB1wGnZ+bPp6eZ9fHMQUnTbTKn0p8FbAT2LvfPBd6fmRdExL8CZwAfGXH7pFnhP2DVYKBdKBExHzgJ+Gi5H8DxwEVlkVXAkulooCSps0F74B8A3g7sVe7vDzyUmY+W+/cAT+pUGBHLgGUACxYsGL6lkmaVn0q2P3174BHxUmBzZq5rndxh0exUn5krM3MiMyfGxsaGbKYkqd0gPfDjgJdHxEuAx9PsA/8AsG9EzC298PnAvdPXTElSu7498Mx8R2bOz8xx4JXA1zPz1cBVwCllsaXA6mlrpSRpG1M5Dvwc4OyIuJ1mn/h5o2mSJGkQk/pFnsxcA6wp43cAx46+SZKkQXgmpiRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlZrUceCafl4wSNKg7IFLUqUMcEmqlAEuSZUywCWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpErtNJeT9TKtknY09sAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpfoGeEQ8PiKuiYgbIuLmiHh3mX5YRFwdEbdFxOciYrfpb64kaatBeuA/A47PzKOAhcDiiHgOcC7w/sw8HHgQOGP6milJatc3wLPxSLm7axkSOB64qExfBSyZlhZKkjoaaB94RMyJiPXAZuAK4PvAQ5n5aFnkHuBJXWqXRcTaiFi7ZcuWUbRZksSAAZ6Zj2XmQmA+cCzw250W61K7MjMnMnNibGxs+JZKkn7DpI5CycyHgDXAc4B9I2LrtVTmA/eOtmmSpF4GOQplLCL2LeO7AycAG4GrgFPKYkuB1dPVSEnStga5GuEhwKqImEMT+Bdm5lci4nvABRHxXuB64LxpbKckqU3fAM/MG4GjO0y/g2Z/uCRpFngmpiRVygCXpEoZ4JJUKQNckiplgEtSpQxwSarUTvOr9NL2bHz5V7vOu3PFSTPYEtXEHrgkVcoAl6RKGeCSVCkDXJIqZYBLUqUMcEmqlAEuSZUywCWpUp7II1XME4C2tTM9J/bAJalSBrgkVcoAl6RKGeCSVCkDXJIqZYBLUqUMcEmqlAEuSZUywCWpUga4JFXKAJekShngklQpL2YlaVrtTBeXmml9e+ARcWhEXBURGyPi5og4q0yfFxFXRMRt5Xa/6W+uJGmrQXrgjwJvzczrImIvYF1EXAG8FrgyM1dExHJgOXDO9DVVvdjLkXY+fXvgmbkpM68r4w8DG4EnAScDq8piq4Al09VISdK2JvUlZkSMA0cDVwMHZeYmaEIeOHDUjZMkdTdwgEfEnsAXgD/PzJ9Oom5ZRKyNiLVbtmwZpo2SpA4GOgolInalCe/PZOYXy+T7IuKQzNwUEYcAmzvVZuZKYCXAxMREjqDNkrTdmM3vnwY5CiWA84CNmfm+llmXAEvL+FJg9eibJ0nqZpAe+HHA6cCGiFhfpr0TWAFcGBFnAHcDp05PEyVJnfQN8Mz8FhBdZi8abXMkSYPyTEzNKI9Xl0bHa6FIUqUMcEmqlAEuSZUywCWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKGeCSVCkDXJIqZYBLUqUMcEmqlAEuSZUywCWpUn0DPCI+FhGbI+KmlmnzIuKKiLit3O43vc2UJLUbpAf+CWBx27TlwJWZeThwZbkvSZpBfQM8M78JPNA2+WRgVRlfBSwZcbskSX0Muw/8oMzcBFBuDxxdkyRJg5j2LzEjYllErI2ItVu2bJnu1UnSTmPYAL8vIg4BKLebuy2YmSszcyIzJ8bGxoZcnSSp3bABfgmwtIwvBVaPpjmSpEENchjhZ4HvAEdExD0RcQawAnhRRNwGvKjclyTNoLn9FsjM07rMWjTitkiSJsEzMSWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKGeCSVCkDXJIqZYBLUqUMcEmqlAEuSZUywCWpUga4JFXKAJekShngklQpA1ySKmWAS1KlphTgEbE4Im6NiNsjYvmoGiVJ6m/oAI+IOcC/AC8GngacFhFPG1XDJEm9TaUHfixwe2bekZk/By4ATh5NsyRJ/URmDlcYcQqwODP/tNw/HXh2Zr6pbbllwLJy9wjg1i4PeQDw4yGaYt3o6mpoo3XW7Yx1T87MsW2mZuZQA3Aq8NGW+6cD/zyFx1tr3ezW1dBG66yz7tfDVHah3AMc2nJ/PnDvFB5PkjQJUwnwa4HDI+KwiNgNeCVwyWiaJUnqZ+6whZn5aES8CbgcmAN8LDNvnkJbVlo363U1tNE666wrhv4SU5I0uzwTU5IqZYBLUqUMcEmqVHUBHhFHRsSiiNizbfriHjXHRsQxZfxpEXF2RLxkiHV/coia55X1ndhnuWdHxN5lfPeIeHdEfDkizo2IfXrUnRkRh3ab36Nut4h4TUScUO6/KiI+FBFvjIhd+9Q+JSLeFhEfjIh/iog/69VGSb1FxIFD1W1vX2JGxOsy8+Nd5p0JvBHYCCwEzsrM1WXedZn5rA4176K5Xstc4Arg2cAa4ATg8sz82y7raj8kMoAXAl8HyMyXd6m7JjOPLeOvL+29GDgR+HJmruhSdzNwVDm6ZyXwv8BFwKIy/Y+61P0E+B/g+8Bngc9n5pZOy7bVfYbmOdkDeAjYE/hiWV9k5tIudWcCLwO+AbwEWA88CLwCeENmrum37p1FRByYmZtncH37Z+b9M7W+6VQ6BO8AlgBbz0DcDKwGVmTmQ0M85qWZ+eIO0/cu65oPXJqZ57fM+3BmvqHL4x0MvAv4JfDXwJuBP6bJp7Myc1OXunntk4B1wNE0294DA/9Rw5wxNJ0DcHePeRuAPcv4OLC2PFEA1/eomUMTVD8F9i7Tdwdu7LGu64BPAy8Anl9uN5Xx5/eou75l/FpgrIw/AdjQo25j67rb5q3vtT6aT1InAucBW4DLgKXAXj3qbiy3c4H7gDnlfvR5Xja0LLsHsKaML+j2GpT5+wArgFuA+8uwsUzbd8j3yqU95u0N/D3wKeBVbfM+3KPuYOAjNBdq2x/4m/I3Xwgc0qNuXtuwP3AnsB8wr0fd4rbn6DzgRuB84KAedSuAA8r4BHAHcDtwV5/353XAXwJPmeRzPQFcVbaJQ2k6Qz8p7/Gje9TtCbwHuLksvwX4LvDaPuu7HDgHOLjttTkHuKJH3bO6DL8LbOpS84XyfC6hOZflC8DjOm2LbXWX0YT28vKanVO2gzcDq3vU/RL4Qdvwi3J7x6Rel2E2nKkO5Y/tNGwAftaj7nsd3hyXAe+jS8jxm4F6fdu8XsG4C/CW8kZdWKb1fXKBG8pGuz9tp8a2r79t3ueB15XxjwMTZfypwLU96trDflfg5TS98S096m4CdittfZgSMsDjafln0qFuQ8ubez9gXetj9qibsQ2y1FWxUba2Bfgo8F7gyeW996Ver0PL+FXAMS3vl66nZJf2/CNwN3BNWc8TB3hfX0PzSfY04IfAKWX6IuA7PepWA6+l6d2eDfwVcDiwCvi7HnW3DjnvMZpPyVd1GP6vS836tvt/AXybZhvu9V5pzZa7ez1m27y3lffZM1pfl36vQcfHGqZoqgNNj29heaO2DuPAvT3qvk4J05Zpc4FPAo91qbka2KOM79IyfZ9eL07LcvNpwvVD7S9Sl+XvpOkN/aDcHlym79nnRd0H+ATNrpCry8Z/B82uiqMGeRN1mLd7j3lvKY9/F3AmcCXw7zQB/a4edWfRBNtKmt701n86Y8A3e9TN2AZZ6qrYKPnNAG9vc6/13QLMLePfbZvX65Ne6/p+H/gw8KPyfC4b8nnp9R68oe3+teV2F+CWHnVfA95Oy6cQ4CCaf6j/2aPuJuDwLvN+2GX6RlqyoUxbSvOp4a5B/jbgvYO+BmX+1lx5H7AXk+x5/+pxhima6kDzMfF5Xead3+ePPrjLvOO6TH9cl+kHtG5sA7T5JHr0GAao3wM4bIDl9gKOoulhdv0I3bL8U6fQpidSel/AvsApwLED1P1OWfbISaxrxjbIMq+KjZLmmkJnA2+l+YcaLfN67cp6c3lOj6fZzfMB4A+AdwOf6lG3zT8vml2Mi4GP96j7Ds1uulNp/ukvKdOfT+8e/39t3dZpvju5vGVer3/c+wHn0vyjehB4oLym59J7l9QpwBFd5i3pMv0fgBM6TF8M3NZjXe+h7NJtm/5bwEUDbhcvo9ml9KNBlt+mfpgiB4fJDm0b5ANtG+R+PeomvUGWeVVslDRfgrUOW78zORj4ZJ/aFwCfo/keZAPwHzSXbp7bo+aCIV+/o2h2g10KHAl8kObL75uB5/aoeybN7peHgG9ROhw0n9jO7LPOI2kONtizbfriAeoWTaauR82LR72u9jqa7+OePkjdNo8zzIvp4DDKgbIbZkeqa9sot9t2bq91NLv1bgW+RLNb8uSWeb12gU26jubTzDDrGrZuqL+t42MN88Q7OIxyYIDvFqzbueoY4oizYetmcl1Tqes0DH01QmkyIuLGbrNo9oVbZ12rOZn5CEBm3hkRLwAuiognl9pR1s3kuqZStw0DXDPlIOAPab6QahU0X3RZZ12rH0XEwsxcD5CZj0TES4GPAc8Ycd1MrmsqddswwDVTvkLzsXF9+4yIWGOddW1eAzzaOiEzHwVeExH/NuK6mVzXVOq2sd2dSi9JGkx1F7OSJDUMcEmqlAEuSZUywLVDi4g1ETHRZ5l3DvA4HY+YiIhPRMQpw7ZPmgoDXIK+AZ6Zz52JhkiTYYBrhxAR4xFxS0SsiogbI+KiiNijbZnTImJDRNwUEeeWaSuA3SNiffmRi26P/0i5jfLLRd+LiK8CQ/2SijQKBrh2JEcAKzPzmTQ/3vGrX1KJiCfSXDjreJpLGR8TEUsycznNZWkXZuarB1jHK8p6ngG8HrBnrlljgGtH8sPM/HYZ/zTwvJZ5x9D8etCWctLEZ2guvzpZfwB8NjMfy8x7KT+xJ80GA1w7kvaz0lrvT+oaE5NcjzQrDHDtSBZExO+V8dNorj+91dXA8yPigIiYU+Z/o8z7RUTsOuA6vgm8MiLmRMQhND90Lc0KA1w7ko3A0nIFvHk0P04MQDa/EP4Omp8Ou4Hmusury+yVwI29vsRscTFwG80lQT/Cr/8JSDPOa6FohxAR48BXMvPps9wUacbYA5ekStkDl4qI2B+4ssOsRZl5/0y3R+rHAJekSrkLRZIqZYBLUqUMcEmqlAEuSZUywCWpUv8PS3ayeUs1tAUAAAAASUVORK5CYII=\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Solution to exercise - plots\n",
"by_site = surveys_df.groupby('plot_id')\n",
"average_weight_by_site = by_site['weight'].mean()\n",
"average_weight_by_site.plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary of Lesson 2\n",
"* Pandas is a popular library for working with data.\n",
"* A Dataframe is a data structure for working with spreadsheet data.\n",
"* `df = pd.read_csv(filename)` reads data into the variable `df`. Then:\n",
" * `df.head()` prints the first 5 rows\n",
" * `df.describe()` provides summary statistics\n",
" * You can also use `.sum()`, `.count()`, `.median()`, `.std()`, `.var()`, etc.\n",
" * `df.groupby('column_name)` aggregates data into groups of the same value for analysis\n",
" * `df.plot.bar()` creates a bar-plot\n",
"* split-apply-combine using `.groupby` allows you to analyze data across many different axes\n",
"\n",
"Recommended workflow for data analysis\n",
"* Format raw data into rows of observations and columns\n",
"* Load data into Python\n",
"* Get basic intuition using `df.head()` and `df.describe()`\n",
"* Start analyzing across groups using `.groupby().describe()`\n",
"* Visualize groups using `.plot.bar()` or other plot tpes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lesson 3: Indexing, slicing, and subsetting DataFrames\n",
"\n",
"Goal: let's filter out subsets of our data to analyze."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# If you skipped lesson 2,\n",
"# make sure pandas is loaded,\n",
"import pandas as pd\n",
"\n",
"# and read in the survey CSV\n",
"surveys_df = pd.read_csv(\"data/surveys.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Indexing and Slicing\n",
"We often want to work with subsets of a DataFrame object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific `[row, column]` index locations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering data\n",
"We can also select a subset of our data using criteria. For example, we can select all rows that have a year value of 2002:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
record_id
\n",
"
month
\n",
"
day
\n",
"
year
\n",
"
plot_id
\n",
"
species_id
\n",
"
sex
\n",
"
hindfoot_length
\n",
"
weight
\n",
"
\n",
" \n",
" \n",
"
\n",
"
33320
\n",
"
33321
\n",
"
1
\n",
"
12
\n",
"
2002
\n",
"
1
\n",
"
DM
\n",
"
M
\n",
"
38.0
\n",
"
44.0
\n",
"
\n",
"
\n",
"
33321
\n",
"
33322
\n",
"
1
\n",
"
12
\n",
"
2002
\n",
"
1
\n",
"
DO
\n",
"
M
\n",
"
37.0
\n",
"
58.0
\n",
"
\n",
"
\n",
"
33322
\n",
"
33323
\n",
"
1
\n",
"
12
\n",
"
2002
\n",
"
1
\n",
"
PB
\n",
"
M
\n",
"
28.0
\n",
"
45.0
\n",
"
\n",
"
\n",
"
33323
\n",
"
33324
\n",
"
1
\n",
"
12
\n",
"
2002
\n",
"
1
\n",
"
AB
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
33324
\n",
"
33325
\n",
"
1
\n",
"
12
\n",
"
2002
\n",
"
1
\n",
"
DO
\n",
"
M
\n",
"
35.0
\n",
"
29.0
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
35544
\n",
"
35545
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
15
\n",
"
AH
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
35545
\n",
"
35546
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
15
\n",
"
AH
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
35546
\n",
"
35547
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
10
\n",
"
RM
\n",
"
F
\n",
"
15.0
\n",
"
14.0
\n",
"
\n",
"
\n",
"
35547
\n",
"
35548
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
7
\n",
"
DO
\n",
"
M
\n",
"
36.0
\n",
"
51.0
\n",
"
\n",
"
\n",
"
35548
\n",
"
35549
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
5
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
2229 rows × 9 columns
\n",
"
"
],
"text/plain": [
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"33320 33321 1 12 2002 1 DM M 38.0 \n",
"33321 33322 1 12 2002 1 DO M 37.0 \n",
"33322 33323 1 12 2002 1 PB M 28.0 \n",
"33323 33324 1 12 2002 1 AB NaN NaN \n",
"33324 33325 1 12 2002 1 DO M 35.0 \n",
"... ... ... ... ... ... ... ... ... \n",
"35544 35545 12 31 2002 15 AH NaN NaN \n",
"35545 35546 12 31 2002 15 AH NaN NaN \n",
"35546 35547 12 31 2002 10 RM F 15.0 \n",
"35547 35548 12 31 2002 7 DO M 36.0 \n",
"35548 35549 12 31 2002 5 NaN NaN NaN \n",
"\n",
" weight \n",
"33320 44.0 \n",
"33321 58.0 \n",
"33322 45.0 \n",
"33323 NaN \n",
"33324 29.0 \n",
"... ... \n",
"35544 NaN \n",
"35545 NaN \n",
"35546 14.0 \n",
"35547 51.0 \n",
"35548 NaN \n",
"\n",
"[2229 rows x 9 columns]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"surveys_df[surveys_df.year == 2002]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
record_id
\n",
"
month
\n",
"
day
\n",
"
year
\n",
"
plot_id
\n",
"
species_id
\n",
"
sex
\n",
"
hindfoot_length
\n",
"
weight
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
2
\n",
"
NL
\n",
"
M
\n",
"
32.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
1
\n",
"
2
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
3
\n",
"
NL
\n",
"
M
\n",
"
33.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
2
\n",
"
3
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
2
\n",
"
DM
\n",
"
F
\n",
"
37.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
3
\n",
"
4
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
7
\n",
"
DM
\n",
"
M
\n",
"
36.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
4
\n",
"
5
\n",
"
7
\n",
"
16
\n",
"
1977
\n",
"
3
\n",
"
DM
\n",
"
M
\n",
"
35.0
\n",
"
NaN
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
33315
\n",
"
33316
\n",
"
12
\n",
"
16
\n",
"
2001
\n",
"
11
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
33316
\n",
"
33317
\n",
"
12
\n",
"
16
\n",
"
2001
\n",
"
13
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
33317
\n",
"
33318
\n",
"
12
\n",
"
16
\n",
"
2001
\n",
"
14
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
33318
\n",
"
33319
\n",
"
12
\n",
"
16
\n",
"
2001
\n",
"
15
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
"
\n",
"
33319
\n",
"
33320
\n",
"
12
\n",
"
16
\n",
"
2001
\n",
"
16
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
33320 rows × 9 columns
\n",
"
"
],
"text/plain": [
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"0 1 7 16 1977 2 NL M 32.0 \n",
"1 2 7 16 1977 3 NL M 33.0 \n",
"2 3 7 16 1977 2 DM F 37.0 \n",
"3 4 7 16 1977 7 DM M 36.0 \n",
"4 5 7 16 1977 3 DM M 35.0 \n",
"... ... ... ... ... ... ... ... ... \n",
"33315 33316 12 16 2001 11 NaN NaN NaN \n",
"33316 33317 12 16 2001 13 NaN NaN NaN \n",
"33317 33318 12 16 2001 14 NaN NaN NaN \n",
"33318 33319 12 16 2001 15 NaN NaN NaN \n",
"33319 33320 12 16 2001 16 NaN NaN NaN \n",
"\n",
" weight \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"... ... \n",
"33315 NaN \n",
"33316 NaN \n",
"33317 NaN \n",
"33318 NaN \n",
"33319 NaN \n",
"\n",
"[33320 rows x 9 columns]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Or we can select all rows that do not contain the year 2002:\n",
"surveys_df[surveys_df.year != 2002]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
record_id
\n",
"
month
\n",
"
day
\n",
"
year
\n",
"
plot_id
\n",
"
species_id
\n",
"
sex
\n",
"
hindfoot_length
\n",
"
weight
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2270
\n",
"
2271
\n",
"
1
\n",
"
15
\n",
"
1980
\n",
"
8
\n",
"
DO
\n",
"
M
\n",
"
35.0
\n",
"
53.0
\n",
"
\n",
"
\n",
"
2271
\n",
"
2272
\n",
"
1
\n",
"
15
\n",
"
1980
\n",
"
11
\n",
"
PF
\n",
"
F
\n",
"
16.0
\n",
"
10.0
\n",
"
\n",
"
\n",
"
2272
\n",
"
2273
\n",
"
1
\n",
"
15
\n",
"
1980
\n",
"
18
\n",
"
DM
\n",
"
F
\n",
"
34.0
\n",
"
33.0
\n",
"
\n",
"
\n",
"
2273
\n",
"
2274
\n",
"
1
\n",
"
15
\n",
"
1980
\n",
"
11
\n",
"
DM
\n",
"
M
\n",
"
38.0
\n",
"
37.0
\n",
"
\n",
"
\n",
"
2274
\n",
"
2275
\n",
"
1
\n",
"
15
\n",
"
1980
\n",
"
8
\n",
"
DO
\n",
"
F
\n",
"
33.0
\n",
"
29.0
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
11222
\n",
"
11223
\n",
"
12
\n",
"
8
\n",
"
1985
\n",
"
4
\n",
"
DM
\n",
"
M
\n",
"
36.0
\n",
"
40.0
\n",
"
\n",
"
\n",
"
11223
\n",
"
11224
\n",
"
12
\n",
"
8
\n",
"
1985
\n",
"
11
\n",
"
DM
\n",
"
M
\n",
"
37.0
\n",
"
49.0
\n",
"
\n",
"
\n",
"
11224
\n",
"
11225
\n",
"
12
\n",
"
8
\n",
"
1985
\n",
"
7
\n",
"
PE
\n",
"
M
\n",
"
20.0
\n",
"
18.0
\n",
"
\n",
"
\n",
"
11225
\n",
"
11226
\n",
"
12
\n",
"
8
\n",
"
1985
\n",
"
1
\n",
"
DM
\n",
"
M
\n",
"
38.0
\n",
"
47.0
\n",
"
\n",
"
\n",
"
11226
\n",
"
11227
\n",
"
12
\n",
"
8
\n",
"
1985
\n",
"
15
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
8957 rows × 9 columns
\n",
"
"
],
"text/plain": [
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"2270 2271 1 15 1980 8 DO M 35.0 \n",
"2271 2272 1 15 1980 11 PF F 16.0 \n",
"2272 2273 1 15 1980 18 DM F 34.0 \n",
"2273 2274 1 15 1980 11 DM M 38.0 \n",
"2274 2275 1 15 1980 8 DO F 33.0 \n",
"... ... ... ... ... ... ... ... ... \n",
"11222 11223 12 8 1985 4 DM M 36.0 \n",
"11223 11224 12 8 1985 11 DM M 37.0 \n",
"11224 11225 12 8 1985 7 PE M 20.0 \n",
"11225 11226 12 8 1985 1 DM M 38.0 \n",
"11226 11227 12 8 1985 15 NaN NaN NaN \n",
"\n",
" weight \n",
"2270 53.0 \n",
"2271 10.0 \n",
"2272 33.0 \n",
"2273 37.0 \n",
"2274 29.0 \n",
"... ... \n",
"11222 40.0 \n",
"11223 49.0 \n",
"11224 18.0 \n",
"11225 47.0 \n",
"11226 NaN \n",
"\n",
"[8957 rows x 9 columns]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We can define sets of criteria too:\n",
"surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
record_id
\n",
"
month
\n",
"
day
\n",
"
year
\n",
"
plot_id
\n",
"
species_id
\n",
"
sex
\n",
"
hindfoot_length
\n",
"
weight
\n",
"
\n",
" \n",
" \n",
"
\n",
"
62
\n",
"
63
\n",
"
8
\n",
"
19
\n",
"
1977
\n",
"
3
\n",
"
DM
\n",
"
M
\n",
"
35.0
\n",
"
40.0
\n",
"
\n",
"
\n",
"
63
\n",
"
64
\n",
"
8
\n",
"
19
\n",
"
1977
\n",
"
7
\n",
"
DM
\n",
"
M
\n",
"
37.0
\n",
"
48.0
\n",
"
\n",
"
\n",
"
64
\n",
"
65
\n",
"
8
\n",
"
19
\n",
"
1977
\n",
"
4
\n",
"
DM
\n",
"
F
\n",
"
34.0
\n",
"
29.0
\n",
"
\n",
"
\n",
"
65
\n",
"
66
\n",
"
8
\n",
"
19
\n",
"
1977
\n",
"
4
\n",
"
DM
\n",
"
F
\n",
"
35.0
\n",
"
46.0
\n",
"
\n",
"
\n",
"
66
\n",
"
67
\n",
"
8
\n",
"
19
\n",
"
1977
\n",
"
7
\n",
"
DM
\n",
"
M
\n",
"
35.0
\n",
"
36.0
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
35540
\n",
"
35541
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
15
\n",
"
PB
\n",
"
F
\n",
"
24.0
\n",
"
31.0
\n",
"
\n",
"
\n",
"
35541
\n",
"
35542
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
15
\n",
"
PB
\n",
"
F
\n",
"
26.0
\n",
"
29.0
\n",
"
\n",
"
\n",
"
35542
\n",
"
35543
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
15
\n",
"
PB
\n",
"
F
\n",
"
27.0
\n",
"
34.0
\n",
"
\n",
"
\n",
"
35546
\n",
"
35547
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
10
\n",
"
RM
\n",
"
F
\n",
"
15.0
\n",
"
14.0
\n",
"
\n",
"
\n",
"
35547
\n",
"
35548
\n",
"
12
\n",
"
31
\n",
"
2002
\n",
"
7
\n",
"
DO
\n",
"
M
\n",
"
36.0
\n",
"
51.0
\n",
"
\n",
" \n",
"
\n",
"
30676 rows × 9 columns
\n",
"
"
],
"text/plain": [
" record_id month day year plot_id species_id sex hindfoot_length \\\n",
"62 63 8 19 1977 3 DM M 35.0 \n",
"63 64 8 19 1977 7 DM M 37.0 \n",
"64 65 8 19 1977 4 DM F 34.0 \n",
"65 66 8 19 1977 4 DM F 35.0 \n",
"66 67 8 19 1977 7 DM M 35.0 \n",
"... ... ... ... ... ... ... .. ... \n",
"35540 35541 12 31 2002 15 PB F 24.0 \n",
"35541 35542 12 31 2002 15 PB F 26.0 \n",
"35542 35543 12 31 2002 15 PB F 27.0 \n",
"35546 35547 12 31 2002 10 RM F 15.0 \n",
"35547 35548 12 31 2002 7 DO M 36.0 \n",
"\n",
" weight \n",
"62 40.0 \n",
"63 48.0 \n",
"64 29.0 \n",
"65 46.0 \n",
"66 36.0 \n",
"... ... \n",
"35540 31.0 \n",
"35541 29.0 \n",
"35542 34.0 \n",
"35546 14.0 \n",
"35547 51.0 \n",
"\n",
"[30676 rows x 9 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We can also remove all rows with missing data\n",
"surveys_dropna_df = surveys_df.dropna()\n",
"surveys_dropna_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lesson 3 summary\n",
"* Data can be accessed with indices, slices, labels, or conditions\n",
"* Lots of ways to filter for the data you want to analyze. If you can specify your criteria, you can use the criteria to filter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Helpful resources\n",
"* [Pandas quick reference](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)\n",
"* [10 minutes to pandas (tutorial)](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)\n",
"* [Advanced Pandas examples](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}