Python for humanists: Instructor Notes

Challenge solutions

Checking installations.

In the include/scripts directory, you will find a script called check_env.py This checks the functionality of the Anaconda install.

By default, Data Carpentry does not have people pull the whole repository with all the scripts and addenda. Therefore, you, as the instructor, get to decide how you’d like to provide this script to learners, if at all. To use this, students can navigate into includes/scripts terminal, and execute the following:

python check_env.py 

If learners receive an AssertionError, it will inform you how to help them correct this installation. Otherwise, it will tell you that the system is good to go and ready for Data Carpentry!

00-short-introduction-to-Python

Tuples Challenges

Dictionaries Challenges

Make sure it is also clear that access to ‘the second value’ is actually just about the key name. Add for example rev[10] = "ten" to clarify it is not about the position.

rev[2] = "apple-sauce"
{1: 'one', 2: 'apple-sauce', 3: 'three'}

01-starting-with-data

###Bug Note:

Pandas < .18.1 has a bug where surveys_df[‘weight’].describe() may return a runtime error.

Dataframe Challenges

Calculating Statistics Challenges

len(plot_names) includes duplicate names.

Grouping Challenges

surveys_df.groupby(['plot_id','sex']).agg({"year": 'min', 
                                           "hindfoot_length": 'median', 
                                           "weight": 'mean'})`
surveys_df.groupby(['plot_id'])['weight'].describe()

Plotting Challenges

surveys_df.groupby('plot_id').mean()["weight"].plot(kind='bar')

average weight across all species for each plot

surveys_df.groupby('sex').count()["record_id"].plot(kind='bar')

total males versus total females for the entire dataset

02-index-slice-subset

Indexing Challenges

Selection Challenges

Advanced Selection Challenges

surveys_df[~surveys_df["sex"].isin(['M', 'F'])]

Masking Challenges

new = surveys_df[~surveys_df['sex'].isin(['M', 'F'])].copy()
new['sex']='x'
print(len(new))

Can verify the number of Nan values with sum(surveys_df['sex'].isnull()), which is equal to the number of none female/male records.

# selection of the data with isin
stack_selection = surveys_df[(surveys_df['sex'].isin(['M', 'F'])) & 
							surveys_df["weight"] > 0.][["sex", "weight", "plot_id"]]
# calculate the mean weight for each plot id and sex combination: 
stack_selection = stack_selection.groupby(["plot_id", "sex"]).mean().unstack()
# and we can make a stacked bar plot from this:
stack_selection.plot(kind='bar', stacked=True)

Suggestion: As we now the other values are all Nan values, we could also select all not null values (just preview, more on this in next lesson):

stack_selection = surveys_df[(surveys_df['sex'].notnull()) & 
					surveys_df["weight"] > 0.][["sex", "weight", "plot_id"]]

average weight for each plot per sex

However, due to the unstack command, the legend header contains two levels. In order to remove this, the column naming needs to be simplified :

stack_selection.columns = stack_selection.columns.droplevel()

average weight for each plot per sex

03-data-types-and-format

Challenge - Changing Types

Pandas cannot convert types from float to int if the column contains NaN values.

Challenge - Counting

surveys_df.isnull()

04-merging-data

# read the files:
survey2001 = pd.read_csv("data/survey2001.csv")
survey2002 = pd.read_csv("data/survey2002.csv")
# concatenate
survey_all = pd.concat([survey2001, survey2002], axis=0)
# get the weight for each year, grouped by sex:
weight_year = survey_all.groupby(['year', 'sex']).mean()["wgt"].unstack()
# plot: 
weight_year.plot(kind="bar")
plt.tight_layout()  # tip(!)

average weight for each year, grouped by sex

# writing to file:
weight_year.to_csv("weight_for_year.csv")
# reading it back in:
pd.read_csv("weight_for_year.csv", index_col=0)
merged_left = pd.merge(left=surveys_df,right=species_df, how='left', on="species_id")

Then calculate and plot the distribution of:

1. taxa per plot (number of species of each taxa per plot):

Species distribution (number of taxa for each plot) can be derived as follows:

merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')

taxa per plot

Suggestion: It is also possible to plot the number of individuals for each taxa in each plot (stacked bar chart):

merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05))

(the legend otherwise overlaps the bar plot)

taxa per plot

2. taxa by sex by plot: Providing the Nan values with the M|F values (can also already be changed to ‘x’):

merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'

Number of taxa for each plot/sex combination:

ntaxa_sex_plot = merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
ntaxa_sex_plot = ntaxa_sex_plot.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_plot.index)
ntaxa_sex_plot.plot(kind="bar", legend=False)
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
           fontsize='small', frameon=False)

taxa per plot per sex

Suggestion (for discussion only):

The number of individuals for each taxa in each plot per sex can be derived as well.

sex_taxa_plot  = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
sex_taxa_plot.unstack(level=[1, 2]).plot(kind='bar', logy=True)
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15), 
           fontsize='small', frameon=False)

taxa per plot per sex

This is not really the best plot choice: not readable,… A first option to make this better, is to make facets. However, pandas/matplotlib do not provide this by default. Just as a pure matplotlib example (M|F if for not-defined sex records):

fig, axs = plt.subplots(3, 1)
for sex, ax in zip(["M", "F", "M|F"], axs):
    sex_taxa_plot[sex_taxa_plot["sex"] == sex].plot(kind='bar', ax=ax, legend=False)
    ax.set_ylabel(sex)
    if not ax.is_last_row():
        ax.set_xticks([])
        ax.set_xlabel("")
axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3), 
              fontsize='small', frameon=False)

taxa per plot per sex

However, it would be better to link to Seaborn and Altair for tis kind of multivariate visualisations.

plot_info = pd.read_csv("data/plots.csv")
plot_info.groupby("plot_type").count()
merged_plot_type = pd.merge(merged_left, plot_info, on='plot_id')
# For each plot, get the number of species for each plot
nspecies_plot = merged_plot_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
# For each plot, get the number of individuals
nindividuals_plot = merged_plot_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
# combine the two series
diversity_index = pd.concat([nspecies_plot, nindividuals_plot], axis=1)
# calculate the diversity index
diversity_index['diversity'] = diversity_index['nspecies']/diversity_index['nindiv']

Making a bar chart:

diversity_index['diversity'].plot(kind="barh")
plt.xlabel("Diversity index")

taxa per plot per sex

05-loops-and-functions

Basic Loop Challenges

for creature in animals:
    print(creature+',', end='')

This loop also adds a comma after the last animal. A better, loop-free solution would be: ','.join(animals)

Looping Over Dataframe Challenges

surveys_year = surveys_df[surveys_df.year == year].dropna()
n_year = 5  # better overview by making variable from it
first_year = surveys_df['year'].min()
last_year = surveys_df['year'].max()

for year in range(first_year, last_year, n_year):
    print(year)

    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year].dropna()

Functions Challenges

Additional Functions Challenges

def one_year_csv_writer(this_year, all_data, folder_to_save, root_name):
    """
    Writes a csv file for data from a given year.

    Parameters
    ---------
    this_year : int
        year for which data is extracted
    all_data: pd.DataFrame
        DataFrame with multi-year data 
    folder_to_save : str
        folder to save the data files
    root_name: str
        root of the filenames to save the data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = os.path.join(folder_to_save, ''.join([root_name, str(this_year), '.csv']))
    surveys_year.to_csv(filename)

Also adapt function yearly_data_csv_writer with the additional inputs.

Additional Functions Challenges

Output Management Challenges

Implementation inside the function:

filenames = []
for year in range(start_year, end_year+1):
    filenames.append(one_year_csv_writer(year, all_data, folder_to_save, root_name))
return filenames
yearly_data_arg_test(surveys_df, end_year=2001)
def one_year_csv_writer(this_year, all_data, folder_to_save='./', root_name='survey'):
    """
    Writes a csv file for data from a given year.

    Parameters
    ---------
    this_year : int
        year for which data is extracted
    all_data: pd.DataFrame
        DataFrame with multi-year data 
    folder_to_save : str
        folder to save the data files
    root_name: str
        root of the filenames to save the data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = os.path.join(folder_to_save, ''.join([root_name, str(this_year), '.csv']))
    surveys_year.to_csv(filename)
    # Write the new DataFrame to a csv file
    if len(surveys_year) > 0:
        filename = os.path.join(folder_to_save, ''.join([root_name, str(this_year), '.csv']))
        surveys_year.to_csv(filename)
    else:
        print("No data for year " + str(this_year))
def yearly_data_csv_writer(all_data, yearcolumn="year", 
                           folder_to_save='./', root_name='survey'):
    """
    Writes separate csv files for each year of data.

    all_data --- DataFrame with multi-year data
    yearcolumn --- column name containing the year of the data
    folder_to_save --- folder name to store files
    root_name --- start of the file names stored
    """
    years = all_data["year"].unique()

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    filenames = []
    for year in years:
        filenames.append(one_year_csv_writer(year, all_data, folder_to_save, root_name))
    return filenames