{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "CiaKDpAltTxe" }, "source": [ "# Visualization in Python \n", "\n", "#### Computing Showcase \n", "\n", "**April 6, 2022**\n", "\n", "\n", "*Mobility data and example adapted from UGBA 88 course materials*\n" ] }, { "cell_type": "markdown", "metadata": { "id": "K9U62tY3iR2p" }, "source": [ "\n", "## Workshop Goals\n", "\n", "In this workshop, we will explore visualization of the data looking at College Mobility. We will focus on public universities and community colleges in Michigan. An important justification for public spending on higher education is that colleges and universities may be seen as the 'engines of social mobility'. \n", "\n", "We will do three things. First, we will investigate how access, success, and upward mobility rates vary across institutions. Second, we will explore how access has changed over time, as Michigan’s spending on public higher education has declined or stagnated. Third, we will write a function that generates a Report Card for a provided institution.\n", "\n", "The exercises are intended to illustrate how visualizations can provide valuable insights and motivate new questions.\n", "\n", "\n", "## Economic Mobility at Universities\n", "\n", "\n", "In 2017, a team of researchers used anonymized data from the federal government to publish statistics for each college in the U.S. on the distribution of students’ earnings in their thirties and their parents’ incomes. They showed that students from low-income families have excellent long-term outcomes after attending selective schools, but that there are very few low-income students at these schools.\n", "\n", "This work was highlighted in several news sites: \n", "\n", "* [NYTimes](https://www.nytimes.com/interactive/2017/01/18/upshot/some-colleges-have-more-students-from-the-top-1-percent-than-the-bottom-60.html) including interactive visualizations\n", "* [Vox](https://www.vox.com/policy-and-politics/2017/2/28/14359140/chetty-friedman-college-mobility)\n", "\n", "Full details on the data used here as well as many related data sets can be found at [opportunityinsights.org](https://opportunityinsights.org/education/)\n", "\n", "\n", "\n", "\n", "### Table of Contents\n", "1 - [Comparing Outcomes Across Institutions](#compare)
\n", "2 - [How Does Access Vary Over Time?](#access)
\n", "3 - [Creating a College Report Card](#card)
\n", "\n", "\n", "**Dependencies:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9n25VHRQtS14" }, "outputs": [], "source": [ "# import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# data Visualization libraries\n", "import matplotlib.pyplot as plt\n", "import seaborn as sb\n", "%matplotlib inline\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Zri0aZmvjPg3" }, "source": [ "## 1. Comparing Outcomes Mobility Across Institutions \n", "The first dataset we'll use has one row of data for each college and university in the US.\n", "\n", "(Though we discuss the columns we'll use in this lab, look [here](http://www.equality-of-opportunity.org/data/college/Codebook%20MRC%20Table%202.pdf) for more documentation on the remaining contents of these data.)\n", "\n", "\n", "First, let's load the data and the specific columns we'll use in this lab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 400 }, "id": "lZOdPgmYjQJg", "outputId": "06e41f2e-a370-48c3-8066-1631e5d1434b" }, "outputs": [], "source": [ "mobility = pd.read_csv(\"data/mrc_q1.csv\")\n", "\n", "print(\"Data Dimensions:\", mobility.shape[1] , \"X\" , mobility.shape[0])\n", "mobility.head(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "iLK_4eY3jSFP" }, "source": [ "In this exercise, we will focus on Michigan public institutions. Let’s filter the data to reflect this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mi_pub_mobility = mobility[(mobility['type']=='Public') & (mobility['state']=='MI')]\n", "\n", "\n", "print(\"Data Dimensions:\", mi_pub_mobility.shape[1] , \"X\" , mi_pub_mobility.shape[0])\n", "mi_pub_mobility.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are left with 40 institutions. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Data \n", "\n", "We will first describe the distributions of _access, success rates, and mobility rates_ across institutions. We use the same definitions of these terms used in the paper and described in lecture:\n", "\n", "- **`access`:** the percentage of students enrolled that are ‘low income’–those whose parents' income is in the bottom quintile (bottom 20%) of the parental income distribution. Note: values range from 0 to 100.\n", "\n", "- **`success`:** the percentage of low income students with post-graduation incomes in the top quintile (top 20%) of the student income distribution, measured at age 32-34.\n", "\n", "- **`mobility`:** the percentage of students enrolled that are both ‘low income’ and later have earnings in the top quintile (top 20%) of the student income distribution.\n", "\n", "Recall that `mobility` $=$ `access` $\\times$ `success`. Hence, institutions with high mobility will tend to have more low income students and high 'success' rates with those students." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Success Rates\n", "\n", "
\n", "EXERCISE 1.1: Plot a histogram of `success` across institutions.\n", "
\n", "\n", "We can make our first plot using the `matplotlib` library and `pandas`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using pandas with matplotlib \n", "mi_pub_mobility.hist(column='success', bins=10)\n", "plt.xlabel('Success Rate: P(Child in Q5 | Parent in Q1)')\n", "plt.ylabel('Count')\n", "plt.title('Distribution of \"Success\" at MI Public Institutions');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can recreate this histogram with `matplotlib`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using matplotlib\n", "plt.figure(figsize=(7,5)) # change figure size\n", "plt.hist(mi_pub_mobility[\"success\"], # plot histogram\n", " bins=20, \n", " label='Success') \n", "\n", "# adding some plot information\n", "plt.xlabel(\"Success Rate: P(Child in Q5 | Parent in Q1)\") # display x-label\n", "plt.ylabel(\"Count\") # display y-label\n", "plt.title('Distribution of \"Success\" at MI Public Institutions'); # display title\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Box Plots \n", "\n", "Box plots are used to compactly show many pieces of information about a variable's distribution including some summary statistics.\n", "\n", "A box plot of `success` and `access` can be created. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "...\n", "plt.ylabel(\"Percentages\")\n", "plt.xlabel(\"Mobility Factors\")\n", "plt.title(\"Boxplot of Success and Access\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "...\n", "plt.ylabel(\"Percentages\")\n", "plt.xlabel(\"Mobility Factors\")\n", "plt.title(\"Boxplot of Success and Access\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create a boxplot for a single attribute `mobility` and see how it varies with another attribute `iclevel`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# boxplot using seaborn\n", "# display each variable at new line for better visibility and readability\n", "sb.boxplot(x='iclevel', \n", " y='mobility', \n", " hue='iclevel',\n", " data=mi_pub_mobility[['mobility', 'iclevel']])\n", "plt.title(\"Boxplot of Mobility by Level Type\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "...\n", "#plt.title(\"Boxplot of Petal Width by Iris Type\")\n", "plt.xlabel(\"iclevel\")\n", "plt.ylabel(\"Mobility\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's revisit a plot from above on the distribution of `success`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using matplotlib\n", "plt.figure(figsize=(7,5)) # change figure size\n", "plt.hist(mi_pub_mobility[\"success\"], # plot histogram\n", " bins=20, \n", " label='Success') \n", "\n", "# adding some plot information\n", "plt.xlabel(\"Success Rate: P(Child in Q5 | Parent in Q1)\") # display x-label\n", "plt.ylabel(\"Count\") # display y-label\n", "plt.title('Distribution of \"Success\" at MI Public Institutions'); # display title\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, that there is are two institutions that have a larger `success` value than the other institutions. We also see this in the box plot with two points above the whiskers. \n", "\n", "This type of data point(s), that does not fit the overall pattern of the data, is often referred to as an **outlier**. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE 1.2: What is that outlier?\n", "
\n", "\n", "To find this, we can filter the table to look at rows where `success` is sufficiently large. \n", "\n", "Set `success_outlier` to the name of the outlier institution(s)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "success_outlier = mi_pub_mobility[mi_pub_mobility['success']>40][\"name\"]\n", "print(success_outlier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE 1.3: Next, let's examine the relationship between `access` and `success`. Create a scatterplot with `access` on the horizontal axis and `success` on the vertical access. Try the function `scatter` that can be called on a pandas DataFrame. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create scatter plot\n", "mi_pub_mobility.plot.scatter(...)\n", "\n", "#the code below will label the axes and title of your scatter plot\n", "plt.title('Access versus Success at Michigan Public Institutions')\n", "plt.xlabel('Share of Parents in Bottom Quintile')\n", "plt.ylabel('100 * P(Child in Q5 | Parent in Q1)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, despite the clear relationship between `access` and `success` you've noted above, there is still a lot of variation in `access` among institutions with similar `success` rates. You can see that from the following figure (which includes all US colleges and universities, not just public Michigan schools):\n", "\n", "\"Drawing\"\n", "\n", "Among schools at the 75th percentile of `success`, the stadard deviation is relatively large at 6.88%. This suggests an interesting policy question: how are institutions producing students of similar 'quality' (as measured by earnings) yet providing very different levels of access? What can be learned from the more accessible colleges and universities?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mobility Rates \n", "\n", "Finally, let's investigate `mobility` rates. Recall that `mobility` measures the percentage of students enrolled that are both ‘low income’ and later have earnings in the top quintile (top 20%) of the student income distribution.\n", "\n", "
\n", "EXERCISE 1.4: Plot a histogram of `mobility` across institutions. Follow the example above on `access`\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create histogram of mobility column\n", "...\n", "\n", "#the code below will label the axes and title of your histogram\n", "plt.title('Distribution of \"Mobility\" at Michigan Public Institutions')\n", "plt.xlabel('\"Mobility\"')\n", "plt.ylabel('Count');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see again points that separate themselves from the distribution. What institution(s) is that? Set `mobility_outlier` to the name of the institution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mobility_outlier = mi_pub_mobility[...][\"name\"]\n", "print(mobility_outlier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. How Does Access Vary Over Time? \n", "\n", "In this section we will study how low income access to Michigan public institutions has changed over time. \n", "\n", "
\n", "EXERCISE 2.1: We will begin by loading a new dataset, which is described in more detail below.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mobility_panel = pd.read_csv('data/mrc_q2.csv')\n", "\n", "#restrict to California public and private (non-profit) institutions\n", "mi_mobility_panel = mobility_panel[(mobility_panel['state']=='MI') & (mobility_panel['type'].isin(['Public', 'Private Non-profit']))]\n", "\n", "#drop missing values\n", "mi_mobility_panel = mi_mobility_panel[(mi_mobility_panel['access']>0)]\n", "\n", "mi_mobility_panel.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These data are **longitudinal data** (also known as **panel data**), which means they follow the same object over time with repeated observations. In this case, the data follow institutions over time.\n", "\n", "These particular longitudinal data are organized by **cohort**. In general, a cohort is a group of individuals that share some common factor, of a year of birth or year of matriculation. In this case, cohorts are defined by the student's year of birth. For each institution, there is now a separate row of data for students born in each year, ranging from 1980 to 1991.\n", "\n", "The column `count` records the number of students from each cohort that were included in the underlying data.\n", "\n", "\n", "
\n", "EXERCISE 2.2: Let's measure `access` over time (by cohort), averaging across all public institutions. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#plot `access` by cohort\n", "mi_mobility_panel_public = mi_mobility_panel[mi_mobility_panel['type']=='Public'].groupby('cohort').mean()\n", "mi_mobility_panel_public['cohort'] = mi_mobility_panel_public.index\n", "\n", "#When plotting we must first select the columns we want to plot\n", "...\n", "plt.title('Low-Income Percent of Enrollment in Michigan Public Institutions');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE 2.3: Now, let's separate this figure by institution type.\n", "
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mi_mobility_two_year = mi_mobility_panel[(mi_mobility_panel['type']=='Public') & \n", " (mi_mobility_panel['iclevel']=='Two-year')].groupby('cohort').mean()\n", "mi_mobility_two_year['cohort'] = mi_mobility_two_year.index\n", "\n", "mi_mobility_four_year = mi_mobility_panel[(mi_mobility_panel['type']=='Public') & \n", " (mi_mobility_panel['iclevel']=='Four-year')].groupby('cohort').mean()\n", "mi_mobility_four_year['cohort'] = mi_mobility_four_year.index\n", "\n", "...\n", "plt.title('Low-Income Percent of Enrollment in Michigan Public Institutions, by Type')\n", "plt.legend()\n", "plt.xlabel(\"cohort\")\n", "plt.ylabel(\"access mean\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A couple of key takeaways from this figure:\n", "* the *level* of `access` is significantly higher at two-year colleges.\n", "* both two-year and four-year colleges are seeing increases of `access`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE 2.4: Finally, for comparison's sake, let’s check how low-income access is evolving at private non-profit 4-year institutions in Michigan. \n", "
\n", "\n", "Perhaps there is some substitution to these institutions, some of which have increased their financial aid offerings over time.\n", "\n", "For this exercise you will need to use the following columns:\n", "\n", "- **`iclevel`:** indicates whether an institution is a 4-year, 2-year, or less than 2-year college.\n", "\n", "- **`type`:** indicates whether an institution is a Public, Private Non-profit, or Private For-profit institution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#use similar code as above, except replace public two-year institutions with private four-year institutions\n", "mi_mobility_private = ...\n", "\n", "mi_mobility_public = ...\n", "\n", "\n", "plt.plot(mi_mobility_private['cohort'], mi_mobility_private['access'], label = 'Private')\n", "plt.plot(mi_mobility_public['cohort'], mi_mobility_public['access'], label = 'Public')\n", "plt.title('Low-Income Percent of Enrollment in Michigan Public Institutions, by Type')\n", "plt.legend()\n", "plt.xlabel(\"cohort\")\n", "plt.ylabel(\"access mean\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Creating a College Report Card \n", "\n", "\n", "\n", "The main output of the Chetty et al. (2017) project is a Mobility Report Card for each school included in their data. The Report Card shows the composition of an institution's students by parental income quintile, and success rates by parental income quintile. Report Cards for each institution can be found [here](https://sites.google.com/site/dannyyagan/college).\n", "\n", "Above, you can see the Report Card for Michigan Tech. The figure includes a bar chart for the distribution of students by parental income quintile, and a line plot the show success rates by parental income quintile. The figure is effective--it presents a lot of information without too much clutter.\n", "\n", "In this section we will create a function that generates a Report Card comparing two institutions.\n", "\n", "
\n", "EXERCISE 3.1: For this exercise, it will be easier to work with the first dataset in a different format. Again, we will restrict to public Michigan colleges and universities.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#read in data\n", "mobility_long = pd.read_csv(\"data/mrc_q3.csv\")\n", "\n", "#restrict to MI public institutions again\n", "mi_pub_mobility_long = mobility_long[(mobility_long['type']=='Public') & \n", " (mobility_long['state']=='MI')]\n", "\n", "mi_pub_mobility_long.head(10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that now there are *5* observations per institution. While each row represented an institution in the first table, in this table each row represents an institution by parental income quintile _pair_. The latter is denoted by the column `parq`. \n", "\n", "(What we have done is transformed the data from *wide* to *long* format. The details of this are beyond the scope of this workshop.)\n", "\n", "There are two other columns that require explanation:\n", "\n", "- **`percent`**: this is the percent of students at the institution with parental income in the quintile indicated by `parq`. Across the 5 rows for each institution, these values will sum to 100.\n", "\n", "- **`success_by_q`**: this is the 'success rate' for students from a particular institution and parental income quintile. In other words, it is the percentage of students that reach the top quintile of the children's income distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE 3.2: First, create the bar chart portion of the Report Card for Michigan Tech.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create table with just MTU data\n", "mtu_mobility_long = mi_pub_mobility_long[(mi_pub_mobility_long['name']=='Michigan Technological University')]\n", "\n", "\n", "#create bar chart\n", "plt.bar(mtu_mobility_long['parq'], mtu_mobility_long['percent'], \n", " label = 'Michigan Technological University');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE 3.3: Next, create the line plot portion. Specify which columns labels belong on the x and y axes, take those columns from the relevant table, and use .plot from `matplotlib` to create the scatter plot.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create line plot\n", "plt.plot(mtu_mobility_long['parq'], mtu_mobility_long['success_by_q'], marker='o')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the difference in vertical scales for the two figures.\n", "\n", "
\n", "EXERCISE 3.4: Let's put the last two pieces together in one figure as in the offical Report Cards.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# copy code from prior two code cells \n", "...\n", "\n", "# add annotations\n", "plt.xlabel('Parent Income Quintile')\n", "plt.ylabel('Percent of Students (%)')\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're almost there! We just need to combine the data from two institutions in one plot. The code below generates a Report Card that compares Michigan Tech and Michigan State. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#create report card that compares two institutions\n", "bar_width = 0.3 # default: 0.8\n", "\n", "school1 = mi_pub_mobility_long[mi_pub_mobility_long['name']=='Michigan Technological University']\n", "school2 = mi_pub_mobility_long[mi_pub_mobility_long['name']=='Michigan State University']\n", "\n", "#create the bar charts\n", "plt.bar(school1['parq'] + bar_width/2 + .05, school1['percent'], bar_width, \n", " label = 'Michigan Technological University')\n", "plt.bar(school2['parq'] - bar_width/2 - .05, school2['percent'], bar_width, \n", " label = 'Michigan State University')\n", "\n", "#create the line plots\n", "plt.plot(school1['parq'], school1['success_by_q'], marker='o')\n", "plt.plot(school2['parq'], school2['success_by_q'], marker='o')\n", "\n", "plt.legend()\n", "plt.xlabel('Parent Income Quintile')\n", "plt.ylabel('Percent of Students (%)')\n", "plt.title('Student Success (line) and Parent Incomes (bar)')\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "EXERCISE 3.5: Create a function that takes two institution names as arguments and returns a Report Card that compares the two.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#turn into function\n", "#hint: you should first copy the code from the cell above and then make some minor changes so that\n", "#'Michigan Tech' and 'Michigan State' are replaced by the names for the function arguments.\n", "\n", "def report_card(a, b):\n", "\n", " # select schools\n", " ...\n", "\n", " #create the bar charts\n", " ...\n", "\n", " #create the line plots\n", " ...\n", "\n", " plt.legend()\n", " plt.xlabel('Parent Income Quintile')\n", " plt.ylabel('Percent of Students (%)')\n", " plt.title('Student Success (line) and Parent Incomes (bar)')\n", "\n", " return plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "EXERCISE 3.6: Generate a report card using two institutions of your choosing. Describe the comparison.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "report_card('Michigan Technological University', 'University Of Michigan - Ann Arbor')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IF you're interested in these data, you can play around with [this data exploration tool](https://www.nytimes.com/interactive/projects/college-mobility/) put together by the New York Times." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "exercise-visualization.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 1 }