An Introduction

Interacting with data is something we do every day, whether consciously or subconsciously.

If you are interested in going into any field of STEM - whether the medical field, data science, or computer science - you will often be working with large datasets full of useful AND useless information.

A skill that is becoming increasingly important in these areas of work is the ability to effectively query and filter large datasets and draw conclusions from the filtered results.

Pandas

Pandas is a Python library that allows for the manipulation, querying, and filtering of data.

Over time, it has become one of the most popular Python libraries for data analysis.

Here is the documentation link: https://pandas.pydata.org/docs/
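To give a feel for what "manipulation, querying, and filtering" look like in practice, here is a minimal sketch using a tiny DataFrame built from an inline dictionary (the values are made up for illustration):

```python
import pandas as pd

# A tiny DataFrame built from an inline dictionary (illustrative values only)
demo = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "MeanHouseholdIncome": [71964, 98811, 84380],
})

# Boolean indexing keeps only the rows where the condition holds
high = demo[demo["MeanHouseholdIncome"] > 80000]
print(high["State"].tolist())  # ['Alaska', 'Arizona']
```

The expression inside the brackets produces a column of True/False values, and Pandas keeps only the rows marked True - this pattern is the basis of most filtering you will do.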

Our Data

Data comes in many formats. As someone working with datasets, you should be able to recognize which formats are easiest to understand, both for you and for any program that you write. Here are two of the most common data formats:

  1. JSON (JavaScript Object Notation): a standard text format that is easy for both humans and computers to read and write, and a common way to store structured data.

  2. CSV (comma-separated values): plain-text files in which each line is a record and the values are separated by comma delimiters (separators).
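Pandas can load both formats directly. As a self-contained sketch, the snippet below encodes the same two records as CSV text and as JSON text and reads each with the matching Pandas function (StringIO stands in for a file path here; in practice you would pass a path such as the one used later in this notebook):

```python
import pandas as pd
from io import StringIO

# The same two records expressed in each of the two formats discussed above
csv_text = "State,MeanHouseholdIncome\nAlabama,71964\nAlaska,98811"
json_text = ('[{"State": "Alabama", "MeanHouseholdIncome": 71964},'
             ' {"State": "Alaska", "MeanHouseholdIncome": 98811}]')

# StringIO makes the sketch self-contained; normally you would pass a file path
df_from_csv = pd.read_csv(StringIO(csv_text))
df_from_json = pd.read_json(StringIO(json_text))

# Both loaders produce the same DataFrame
print(df_from_csv.equals(df_from_json))  # True
```

Whichever format the data arrives in, the result is an ordinary DataFrame, so everything that follows applies equally to CSV and JSON sources.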

For the purpose of this notebook, we’ll use a JSON file, income.json, containing data about the average household income in each of the 50 states in the USA. (The code below reads it via the relative path files/income.json; adjust the path to match wherever the file lives in your project.)

We will first look to understand and interpret the data ourselves, then use Pandas and NumPy to surface statistical insights about the dataset that would not be as easy to find by hand.

import pandas as pd 

df = pd.read_json('files/income.json')

# This defines the dataframe in question: read_json (hence the name) parses the JSON file income.json into a DataFrame.

# It is very important that the relative file path is correct; otherwise Pandas won't read the file you want (and will raise an error if the path doesn't exist).

display(df)

# This is now displaying the readable JSON data.


State MeanHouseholdIncome
0 Alabama 71964
1 Alaska 98811
2 Arizona 84380
3 Arkansas 69357
4 California 111622
5 Colorado 100933
6 Connecticut 115337
7 Delaware 92308
8 Florida 83104
9 Georgia 85961
10 Hawaii 107348
11 Idaho 77399
12 Illinois 95115
13 Indiana 76984
14 Iowa 80316
15 Kansas 82103
16 Kentucky 72318
17 Louisiana 73759
18 Maine 78301
19 Maryland 114236
20 Massachusetts 115964
21 Michigan 80803
22 Minnesota 96814
23 Mississippi 65156
24 Missouri 78194
25 Montana 76834
26 Nebraska 82306
27 Nevada 84350
28 New Hampshire 101292
29 New Jersey 117868
30 New Mexico 70241
31 New York 105304
32 North Carolina 79620
33 North Dakota 85506
34 Ohio 78796
35 Oklahoma 74195
36 Oregon 88137
37 Pennsylvania 87262
38 Rhode Island 92427
39 South Carolina 76390
40 South Dakota 77932
41 Tennessee 76937
42 Texas 89506
43 Utah 94452
44 Vermont 83767
45 Virginia 106023
46 Washington 103669
47 West Virginia 65332
48 Wisconsin 82757
49 Wyoming 83583

Dataset statistics

Let’s find and display some statistics from the dataset.

dfmean = df["MeanHouseholdIncome"].mean()

# dfmean stores the result of the mean() operation (the average) applied to the MeanHouseholdIncome column of the dataframe.

print("Mean Household Income: $" + str(dfmean))

# A label is printed so the user knows what is being computed. The dfmean value is converted to a string so it can be concatenated to the label, leaving no space between the dollar sign and the value.

Mean Household Income: $87461.46
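As a variation on the string-concatenation approach above, an f-string can round and format the mean in one step. A minimal sketch on an illustrative four-state subset (the values are taken from the table above, so the mean here differs from the full-dataset mean):

```python
import pandas as pd

# Illustrative four-state subset of the income data
df = pd.DataFrame({"MeanHouseholdIncome": [71964, 98811, 84380, 69357]})

dfmean = df["MeanHouseholdIncome"].mean()

# The :,.2f format spec adds a thousands separator and fixes two decimal places
print(f"Mean Household Income: ${dfmean:,.2f}")  # Mean Household Income: $81,128.00
```

This avoids the manual str() conversion and keeps currency values formatted consistently regardless of how many decimal places the raw mean happens to have.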

Dataframe sort, Household Income

In this example, the numerical data is sorted.

df.sort_values(by="MeanHouseholdIncome")

# The 50 states are sorted by the "MeanHouseholdIncome" column in ascending order (the default). Note that sort_values returns a new, sorted dataframe; df itself is left unchanged.
State MeanHouseholdIncome
23 Mississippi 65156
47 West Virginia 65332
3 Arkansas 69357
30 New Mexico 70241
0 Alabama 71964
16 Kentucky 72318
17 Louisiana 73759
35 Oklahoma 74195
39 South Carolina 76390
25 Montana 76834
41 Tennessee 76937
13 Indiana 76984
11 Idaho 77399
40 South Dakota 77932
24 Missouri 78194
18 Maine 78301
34 Ohio 78796
32 North Carolina 79620
14 Iowa 80316
21 Michigan 80803
15 Kansas 82103
26 Nebraska 82306
48 Wisconsin 82757
8 Florida 83104
49 Wyoming 83583
44 Vermont 83767
27 Nevada 84350
2 Arizona 84380
33 North Dakota 85506
9 Georgia 85961
37 Pennsylvania 87262
36 Oregon 88137
42 Texas 89506
7 Delaware 92308
38 Rhode Island 92427
43 Utah 94452
12 Illinois 95115
22 Minnesota 96814
1 Alaska 98811
5 Colorado 100933
28 New Hampshire 101292
46 Washington 103669
31 New York 105304
45 Virginia 106023
10 Hawaii 107348
4 California 111622
19 Maryland 114236
6 Connecticut 115337
20 Massachusetts 115964
29 New Jersey 117868
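To sort from highest to lowest instead, pass ascending=False to sort_values. A sketch on a small illustrative subset of the data:

```python
import pandas as pd

# Three states from the dataset, for illustration
df = pd.DataFrame({
    "State": ["Mississippi", "New Jersey", "Texas"],
    "MeanHouseholdIncome": [65156, 117868, 89506],
})

# ascending=False reverses the default order: highest income first
top = df.sort_values(by="MeanHouseholdIncome", ascending=False)
print(top["State"].tolist())  # ['New Jersey', 'Texas', 'Mississippi']
```

A descending sort is what you would use to rank states by income, as in the output above read bottom-to-top.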

Dataframe sort, State

In this example, the categorical data is sorted.

df.sort_values(by="State")

# The data is sorted alphabetically based on the "State" column.

State MeanHouseholdIncome
0 Alabama 71964
1 Alaska 98811
2 Arizona 84380
3 Arkansas 69357
4 California 111622
5 Colorado 100933
6 Connecticut 115337
7 Delaware 92308
8 Florida 83104
9 Georgia 85961
10 Hawaii 107348
11 Idaho 77399
12 Illinois 95115
13 Indiana 76984
14 Iowa 80316
15 Kansas 82103
16 Kentucky 72318
17 Louisiana 73759
18 Maine 78301
19 Maryland 114236
20 Massachusetts 115964
21 Michigan 80803
22 Minnesota 96814
23 Mississippi 65156
24 Missouri 78194
25 Montana 76834
26 Nebraska 82306
27 Nevada 84350
28 New Hampshire 101292
29 New Jersey 117868
30 New Mexico 70241
31 New York 105304
32 North Carolina 79620
33 North Dakota 85506
34 Ohio 78796
35 Oklahoma 74195
36 Oregon 88137
37 Pennsylvania 87262
38 Rhode Island 92427
39 South Carolina 76390
40 South Dakota 77932
41 Tennessee 76937
42 Texas 89506
43 Utah 94452
44 Vermont 83767
45 Virginia 106023
46 Washington 103669
47 West Virginia 65332
48 Wisconsin 82757
49 Wyoming 83583

Statistical summary

In this example, all the summary statistics are generated using df.describe().

print(df.describe())
       MeanHouseholdIncome
count            50.000000
mean          87461.460000
std           13945.982845
min           65156.000000
25%           77532.250000
50%           83675.000000
75%           96389.250000
max          117868.000000

Statistical Review

As seen in the output above, describe() summarizes each column for which the statistics are applicable. The “count” statistic, for example, is the number of non-empty cells in the mean household income column. The mean is the average mean household income across all 50 states, and the standard deviation measures how much the values in that column deviate from the mean.

It is important to note that many more complex datasets have multiple numeric columns. In those cases, df.describe() summarizes all of them at once, so to focus on a single measure you can select that column first, e.g. df["MeanHouseholdIncome"].describe().
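A minimal sketch of both behaviors, using a hypothetical second numeric column (the Population values here are made up for illustration):

```python
import pandas as pd

# A frame with two numeric columns; Population values are illustrative only
df = pd.DataFrame({
    "MeanHouseholdIncome": [71964, 98811, 84380],
    "Population": [5_024_279, 733_391, 7_151_502],
})

# describe() summarizes every numeric column by default...
print(df.describe())

# ...while selecting a column first restricts the summary to that column
print(df["MeanHouseholdIncome"].describe())
```

Selecting the column first returns a Series of statistics (count, mean, std, min, quartiles, max) for just that measure, which is usually easier to read when a dataset has many columns.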

Conclusion

What are the key takeaways from this lesson?

The purpose is to obtain a basic understanding of working with a dataset using Pandas dataframes. To build a more comprehensive understanding of Pandas’ capabilities, research operations such as filtering data by certain criteria, grouping data, or performing calculations across multiple columns. Additional work can be done with related Python modules (e.g., NumPy, Matplotlib).

Each example should be explained briefly, along with a real-world scenario where the operation would be useful. Every dataset that you work with should have a purpose - that’s what the field of data science is all about.

For instance, in the household income example, we analyzed a dataset of mean household income by state. This could be useful if someone wanted to find the most affordable place to live. Possible next steps:

  • Find the minimum household income
  • Expand data to look at affordability of areas within state
  • Perhaps add other factors like employment in those areas
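The first bullet can be sketched with min() and idxmin() on an illustrative subset of the data:

```python
import pandas as pd

# Three states from the dataset, for illustration
df = pd.DataFrame({
    "State": ["Mississippi", "West Virginia", "Maryland"],
    "MeanHouseholdIncome": [65156, 65332, 114236],
})

# idxmin() returns the index label of the smallest value,
# which .loc can then use to look up the corresponding state
lowest_idx = df["MeanHouseholdIncome"].idxmin()
print(df.loc[lowest_idx, "State"])          # Mississippi
print(df["MeanHouseholdIncome"].min())      # 65156
```

min() alone gives the lowest income; idxmin() is what ties that number back to the state it belongs to.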

Additional Resources

  1. Pandas Documentation
    • This is an essential resource for learning about Pandas and its various functionalities. It provides detailed documentation, examples, and explanations of different methods and operations.
  2. Data Science Applications
    • This resource provides an overview of major applications of data science across various domains. It can help students understand the practical implications of data analysis and how it is used in different industries.
  3. Kaggle Datasets
    • Kaggle is a popular platform for data science and machine learning. It offers a wide range of datasets for practice and exploration. Students can find interesting datasets on different topics to apply their Pandas learning and gain hands-on experience.
  4. NumPy Documentation
    • NumPy is another important Python library often used in conjunction with Pandas for numerical operations and scientific computing. The official NumPy documentation provides in-depth explanations and examples of working with arrays, mathematical functions, and more.
  5. Matplotlib Documentation
    • Matplotlib is a powerful data visualization library in Python. It allows students to create a wide range of plots and charts to visualize their data. The Matplotlib documentation offers comprehensive guidance on creating different types of visualizations, customizing plots, and using various plotting functions.

By referring to these resources, students can further expand their knowledge and explore advanced topics in Pandas, NumPy, and data visualization.

Hacks

  1. Find a CSV/JSON Dataset that interests you. Refer to Kaggle Datasets mentioned above.

  2. Try to show your Pandas learning by illustrating 5 different numerical analysis operations being done on the dataset. After showing each operation in a separate code block, add a sentence explaining what that operation is showing and what real-world implication it has. It is important to make sure that you are not only able to run code to analyze data, but also understand its implications.

  3. EXTRA: Research the Matplotlib Documentation mentioned above and implement a code block that creates a graph visualizing a relationship in your chosen dataset. Then, add a sentence or two explaining what the relationship shows.