Women in Computer Science

Data Science by Laura Brown

June 29, 2021

During this session you will learn about Data Science, a growing field that combines computational and inferential thinking (computer science + statistics).

Activities

  1. An overview of Data Science (slides)

    • what is data science?
    • why is it important?
    • examples and applications
  2. First Steps in Data Science

    • introduction to Jupyter notebooks
    • basics of Python in Jupyter
    • basics of data science in Jupyter
  3. Exercise: Explore Colleges

    • show how we explore a data set to better understand differences in colleges

Introduction to Data Science

Materials adapted from How to Think Like a Data Scientist

This introduction will cover the definition of Data Science. It will explore the history and current state of the discipline, explaining how data science began and where it will be going in the future. We will also explore how data science leverages data analysis and data visualization.

Throughout this process, you will also be learning Python to perform the analyses described.

What is Data Science?

In 2016, a study reported that 90% of the data in the world at that time had been created in the preceding two years alone. This is the result of the continuing acceleration of the rate at which we store data. Some estimates indicate that roughly 2.5 quintillion bytes of data are generated per day; that's 2,500,000,000,000,000,000 bytes! By comparison, all the data in the Library of Congress adds up to about 200 TB, a mere 200,000,000,000,000 bytes. This means that we are capturing the equivalent of 12,500 Libraries of Congress per day!
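The comparison above is simple division, and it can be checked in a few lines of Python. This is a minimal sketch that uses only the two estimates quoted in the text; the function name `libraries_per_day` is made up for illustration, and 1 TB is assumed to mean 10^12 bytes.

```python
# Estimates quoted in the text (rough figures, not measurements).
BYTES_PER_DAY = 2_500_000_000_000_000_000   # 2.5 quintillion bytes per day
LIBRARY_OF_CONGRESS_BYTES = 200 * 10**12    # about 200 TB, assuming 1 TB = 10**12 bytes


def libraries_per_day(bytes_per_day, library_bytes):
    """Return how many library-sized collections fit into one day's data."""
    # Integer division keeps the result exact for very large numbers.
    return bytes_per_day // library_bytes


print(libraries_per_day(BYTES_PER_DAY, LIBRARY_OF_CONGRESS_BYTES))  # prints 12500
```

Python integers can be arbitrarily large, so numbers of this size need no special handling.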

The amount of data that Google alone stores in its servers is estimated to be 15 exabytes (15 followed by 18 zeros!). For those of you who remember punch cards, you can visualize 15 exabytes as a pile of cards three miles high, covering all of New England. Everywhere you go, someone or something is collecting data about you: what you buy, what you read, where you eat, where you stay, how and when you travel, and so much more. By 2025, it is estimated that 463 exabytes of data will be created each day globally, and the entire digital universe was expected to reach 44 zettabytes by 2020. That would mean 40 times more bytes than there are stars in the observable universe.
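To get a feel for these units, the figures in this section can be written out in bytes. The sketch below assumes decimal (SI) prefixes, so 1 exabyte = 10^18 bytes and 1 zettabyte = 10^21 bytes. All variable names are illustrative. The text gives no star count, so the last line only derives the count that the "40 times" comparison implies.

```python
# Decimal (SI) prefixes, assumed here: exa = 10**18, zetta = 10**21.
EXABYTE = 10**18
ZETTABYTE = 10**21

# Estimates quoted in the text.
google_bytes = 15 * EXABYTE                # Google's stored data
current_daily_bytes = 25 * 10**17          # 2.5 quintillion bytes per day
daily_2025_bytes = 463 * EXABYTE           # projected daily data in 2025
digital_universe_bytes = 44 * ZETTABYTE    # projected total digital universe

# How many times larger the 2025 daily estimate is than the current one.
growth_factor = daily_2025_bytes / current_daily_bytes

# Star count implied by "40 times more bytes than stars" (derived, not from the text).
implied_stars = digital_universe_bytes // 40

print(f"Google stores about {google_bytes:,} bytes")
print(f"Daily data would grow by a factor of about {growth_factor:.1f}")
print(f"Implied number of stars: {implied_stars:.1e}")
```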

What does it all mean?

Often, this data is collected and stored with little idea about how to use it, because technology makes it so easy to capture. Other times, the data is collected quite intentionally. The big question is: what does it all mean? That's where data science comes in. Data science is an emerging and interdisciplinary field that brings together ideas that have been around for years, or even centuries. Most people define data science as "an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms".

Data science has spawned many new jobs in which people and computers extract valuable insights from this data. These range from the simple scaling of functions that existed previously, to completely new jobs processing data that was never previously captured. For example, the owner of a general store 100 years ago kept a log, both on paper and in their head, of the items their customers purchased and how those items varied with the seasons. Based on this knowledge, they would decide how many of each product to order to meet their customers' needs, while keeping their stock to a minimum. With data science, this job can be done on the scale of thousands of supermarkets spread across the country and can factor in a myriad of signals that would have been too hard for the store owner to track, such as unemployment, inflation, or even weather forecasts.

At the other end of the spectrum, we are now able to track the pressure applied to various points on the sole of an athletic shoe with a precision that was impossible just a few years ago. Understanding this data allows manufacturers to design more efficient and comfortable footwear.

What does a Data Scientist do?

Here is a video about what a data scientist's job is like.

Data Science as an Interdisciplinary Field

As an interdisciplinary field of inquiry, data science combines statistics, computer science, writing, art, and ethics. It has applications across the entire curriculum: biology, economics, management, English, history, music, pretty much everything.

The best data scientists have one thing in common: unbelievable curiosity.
- D.J. Patil, Chief Data Scientist of the United States from 2015 to 2017

The diagram below is widely used to answer the question "What is Data Science?" Some computer science, some statistics, and something from one of the many majors available at a college, all of which are looking for people with data skills!

Venn Diagram depicting the different components of Data Science: Hacking Skills, Substantive Expertise, and Math and Statistics Knowledge
Venn Diagram (CC BY-NC) by Drew Conway

According to Eric Haller, Executive Vice President & Global Head, Experian DataLabs (a global information services company), when interviewed by the Chicago Tribune:

A data scientist is an explorer, scientist, and analyst all combined into one role. They have the curiosity and passion of an explorer for jumping into new problems, new datasets, and new technologies. They love going where no person has gone before in taking on a new approach to taking on age old challenges or coming up with an approach for a very new problem where nobody has tried to solve it in the past.

They can write their own code and develop their own algorithms. They can keep up with the scientific breakthrough of the day and regularly apply them to their own work. And as an analyst, they have a penchant for detail, continually diving deeper to find answers. Finding treasure in the data, analysis, and the details give them an adrenaline rush.

Our data scientists tend to operate with a noble purpose of trying to do good things for people, businesses and society with data.

However, all of this exploration and analysis means nothing if you cannot communicate it to people. In a Harvard Business Review article titled "A Data Scientist's Real Job: Storytelling," Jeff Bladt and Bob Filbin elaborate:

Using Big Data successfully requires human translation and context whether it's for your staff or the people your organization is trying to reach. Without a human frame, like photos or words that make emotion salient, data will only confuse, and certainly won't lead to smart organizational behavior.
- Harvard Business Review

Stories are great, but in data science, you need to make sure they are true, especially when you are dealing with stories about numbers. In an article titled "The Ethical Data Scientist," the subtitle really tells the story: "People have too much trust in numbers to be intrinsically objective." A better-known phrase is "Statistics don’t lie, but statisticians sometimes do." The challenge for the data scientist is to avoid the trap of choosing only the statistics that tell the story they want to tell.

The ethical data scientist would strive to improve the world, not repeat it. That would mean deploying tools to explicitly construct fair processes. As long as our world is not perfect, and as long as data is being collected on that world, we will not be building models that are improvements on our past unless we specifically set out to do so.

Data Scientist Skills / Tools

One way I like to look at Data Science is as a collection of tools used to answer questions. Below you can see a simple visualization showing the spectrum of tools or skills a Data Scientist may need to employ.

The Data Science Spectrum of Skills

This can be examined further to look at detailed skills in each area.

The Data Science Spectrum of Skills Detailed