An Introduction

Interacting with data is something we do everyday, whether consciously or subconsciously.

If you are interested into going into any field of STEM - whether it be the medical field, data science, or computer science, you will often be working with large datasets with tons of useful AND useless information.

A skillset that is becoming increasingly important in these areas of work is the ability to effectively query and filter for data in large datasets and draw conclusions based on these filters.

Pandas

Pandas is a Python library that allows for the manipulation, querying, and filtering of data.

Over time, it has become one of the most popular Python libraries for data analysis.

Here is the documentation link: https://pandas.pydata.org/docs/

Our Data

Data is readable in many formats. As someone who is working with datasets, you should be able to recognize what formats are easiest to understand for you and for any program that you write. Here are two of the most common data formats:

  1. JSON: This is a standard file format that is very easy for humans and computers to read and write. It is a compact way to store data.

  2. CSV: These are comma-separated value files. This is where data has comma delimiters (separaters).

For the purpose of this notebook, we’ll use a JSON file containing data about the average income level in each of the 50 states in the USA (located in /assets/datasets/income.json)

We will look to first understand and interpret the data ourselves and use Pandas and Numpy to provide insightful statistical information about the dataset that we may not be as easy to find by ourselves.

# uncomment the following line to install the pandas library
# !pip install pandas 

import pandas as pd 

# Use wget to obtain file, be sure to to place in _notebooks/files 
# wget https://raw.githubusercontent.com/nighthawkcoders/teacher_portfolio/main/_notebooks/files/income.json
df = pd.read_json('data/income.json')

# Add a new column that is the mean household income in dollars
df["MeanHouseholdIncomeDollars"] = df["MeanHouseholdIncome"].apply(lambda n: "${:,.2f}".format(n))

# display is alternative to print
# ... it is a part of Jupyter Notebooks
# ... it shows the dataframe as a nice table
display(df)

State MeanHouseholdIncome MeanHouseholdIncomeDollars
0 Alabama 71964 $71,964.00
1 Alaska 98811 $98,811.00
2 Arizona 84380 $84,380.00
3 Arkansas 69357 $69,357.00
4 California 111622 $111,622.00
5 Colorado 100933 $100,933.00
6 Connecticut 115337 $115,337.00
7 Delaware 92308 $92,308.00
8 Florida 83104 $83,104.00
9 Georgia 85961 $85,961.00
10 Hawaii 107348 $107,348.00
11 Idaho 77399 $77,399.00
12 Illinois 95115 $95,115.00
13 Indiana 76984 $76,984.00
14 Iowa 80316 $80,316.00
15 Kansas 82103 $82,103.00
16 Kentucky 72318 $72,318.00
17 Louisiana 73759 $73,759.00
18 Maine 78301 $78,301.00
19 Maryland 114236 $114,236.00
20 Massachusetts 115964 $115,964.00
21 Michigan 80803 $80,803.00
22 Minnesota 96814 $96,814.00
23 Mississippi 65156 $65,156.00
24 Missouri 78194 $78,194.00
25 Montana 76834 $76,834.00
26 Nebraska 82306 $82,306.00
27 Nevada 84350 $84,350.00
28 New Hampshire 101292 $101,292.00
29 New Jersey 117868 $117,868.00
30 New Mexico 70241 $70,241.00
31 New York 105304 $105,304.00
32 North Carolina 79620 $79,620.00
33 North Dakota 85506 $85,506.00
34 Ohio 78796 $78,796.00
35 Oklahoma 74195 $74,195.00
36 Oregon 88137 $88,137.00
37 Pennsylvania 87262 $87,262.00
38 Rhode Island 92427 $92,427.00
39 South Carolina 76390 $76,390.00
40 South Dakota 77932 $77,932.00
41 Tennessee 76937 $76,937.00
42 Texas 89506 $89,506.00
43 Utah 94452 $94,452.00
44 Vermont 83767 $83,767.00
45 Virginia 106023 $106,023.00
46 Washington 103669 $103,669.00
47 West Virginia 65332 $65,332.00
48 Wisconsin 82757 $82,757.00
49 Wyoming 83583 $83,583.00

Dataset statistics

Let’s find and display some statistics from the dataset.. The dataframe is being described from the information for the column where applicable statistics are present.

  • The “count” statistic for example, is the number of not-empty cells in the mean household income column.
  • The mean is the average mean household income across all 50 states
  • The standard deviation is how much the values within the mean household income column deviate from the mean.

It is important to note that in many more complex datasets, there will be multiple columns with explanatory data. In those cases, the df.describe() method will need to be specified based on a specific column.

# Use built in stats
print("Statistical Summary of the Data")
display(df.describe())

# Define a specific value from the dataframe
# ... using the "mean" method (finds average) from the dataframen
# ... in the [] brackets is a dataframe label describing what is to be computed
dfmean = df["MeanHouseholdIncome"].mean()

# Format the mean household income
# ... allows it to be formatted into a message
# ... so that there is a dollar sign and commas in the number
print("USA Mean Household Income: ${:,.2f}".format(dfmean))
Statistical Summary of the Data
MeanHouseholdIncome
count 50.000000
mean 87461.460000
std 13945.982845
min 65156.000000
25% 77532.250000
50% 83675.000000
75% 96389.250000
max 117868.000000
USA Mean Household Income: $87,461.46

Dataframe sort, Household Income

In this example, analytical data is sorted.

# Format the "MeanHouseholdIncome" column with commas as thousands separators and a dollar sign

# Sort the DataFrame by the "MeanHouseholdIncome" column
# ... in ascending order
# ... purpose is to find best paying states
# ... note, calculation are on unformatted numbers
df = df.sort_values(by="MeanHouseholdIncome", ascending=False)


# Print the DataFrame
# ... the display only the nicely formatted numbers
display(df[df.columns.difference(['MeanHouseholdIncome'])]) 
MeanHouseholdIncomeDollars State
29 $117,868.00 New Jersey
20 $115,964.00 Massachusetts
6 $115,337.00 Connecticut
19 $114,236.00 Maryland
4 $111,622.00 California
10 $107,348.00 Hawaii
45 $106,023.00 Virginia
31 $105,304.00 New York
46 $103,669.00 Washington
28 $101,292.00 New Hampshire
5 $100,933.00 Colorado
1 $98,811.00 Alaska
22 $96,814.00 Minnesota
12 $95,115.00 Illinois
43 $94,452.00 Utah
38 $92,427.00 Rhode Island
7 $92,308.00 Delaware
42 $89,506.00 Texas
36 $88,137.00 Oregon
37 $87,262.00 Pennsylvania
9 $85,961.00 Georgia
33 $85,506.00 North Dakota
2 $84,380.00 Arizona
27 $84,350.00 Nevada
44 $83,767.00 Vermont
49 $83,583.00 Wyoming
8 $83,104.00 Florida
48 $82,757.00 Wisconsin
26 $82,306.00 Nebraska
15 $82,103.00 Kansas
21 $80,803.00 Michigan
14 $80,316.00 Iowa
32 $79,620.00 North Carolina
34 $78,796.00 Ohio
18 $78,301.00 Maine
24 $78,194.00 Missouri
40 $77,932.00 South Dakota
11 $77,399.00 Idaho
13 $76,984.00 Indiana
41 $76,937.00 Tennessee
25 $76,834.00 Montana
39 $76,390.00 South Carolina
35 $74,195.00 Oklahoma
17 $73,759.00 Louisiana
16 $72,318.00 Kentucky
0 $71,964.00 Alabama
30 $70,241.00 New Mexico
3 $69,357.00 Arkansas
47 $65,332.00 West Virginia
23 $65,156.00 Mississippi

Dataframe sort, State

In this example, categorical can be sorted.

df.sort_values(by="State")

# The data is sorted alphabetically based on the "State" column.

State MeanHouseholdIncome MeanHouseholdIncomeDollars
0 Alabama 71964 $71,964.00
1 Alaska 98811 $98,811.00
2 Arizona 84380 $84,380.00
3 Arkansas 69357 $69,357.00
4 California 111622 $111,622.00
5 Colorado 100933 $100,933.00
6 Connecticut 115337 $115,337.00
7 Delaware 92308 $92,308.00
8 Florida 83104 $83,104.00
9 Georgia 85961 $85,961.00
10 Hawaii 107348 $107,348.00
11 Idaho 77399 $77,399.00
12 Illinois 95115 $95,115.00
13 Indiana 76984 $76,984.00
14 Iowa 80316 $80,316.00
15 Kansas 82103 $82,103.00
16 Kentucky 72318 $72,318.00
17 Louisiana 73759 $73,759.00
18 Maine 78301 $78,301.00
19 Maryland 114236 $114,236.00
20 Massachusetts 115964 $115,964.00
21 Michigan 80803 $80,803.00
22 Minnesota 96814 $96,814.00
23 Mississippi 65156 $65,156.00
24 Missouri 78194 $78,194.00
25 Montana 76834 $76,834.00
26 Nebraska 82306 $82,306.00
27 Nevada 84350 $84,350.00
28 New Hampshire 101292 $101,292.00
29 New Jersey 117868 $117,868.00
30 New Mexico 70241 $70,241.00
31 New York 105304 $105,304.00
32 North Carolina 79620 $79,620.00
33 North Dakota 85506 $85,506.00
34 Ohio 78796 $78,796.00
35 Oklahoma 74195 $74,195.00
36 Oregon 88137 $88,137.00
37 Pennsylvania 87262 $87,262.00
38 Rhode Island 92427 $92,427.00
39 South Carolina 76390 $76,390.00
40 South Dakota 77932 $77,932.00
41 Tennessee 76937 $76,937.00
42 Texas 89506 $89,506.00
43 Utah 94452 $94,452.00
44 Vermont 83767 $83,767.00
45 Virginia 106023 $106,023.00
46 Washington 103669 $103,669.00
47 West Virginia 65332 $65,332.00
48 Wisconsin 82757 $82,757.00
49 Wyoming 83583 $83,583.00

Conclusion

What are the key takeaways from this lesson?

The purpose is to obtain a basic understanding of working with a dataset, using Pandas dataframes. To obtain a more comprehensive understanding of Pandas capabilities, research operations such as filtering data based on certain criteria, grouping data, or performing calculations on multiple columns. Additional work can be done with these Python modules (ie numpy, matplotlib).

Explain each example briefly and provide a real-world scenario where such an operation would be useful. Every dataset that you work with should have a purpose - that’s what the field of data science is all about.

For instance, in the Household income example, we analyzed a mean household income by state dataset. This could be applicable if someone wanted to find out where the most affordable place to live.

  • Find the minimum household income
  • Expand data to look at affordability of areas within state
  • Perhaps add other factors like employment in those areas

Additional Resources

  1. Pandas Documentation
    • This is an essential resource for learning about Pandas and its various functionalities. It provides detailed documentation, examples, and explanations of different methods and operations.
  2. Data Science Applications
    • This resource provides an overview of major applications of data science across various domains. It can help students understand the practical implications of data analysis and how it is used in different industries.
  3. Kaggle Datasets
    • Kaggle is a popular platform for data science and machine learning. It offers a wide range of datasets for practice and exploration. Students can find interesting datasets on different topics to apply their Pandas learning and gain hands-on experience.
  4. NumPy Documentation
    • NumPy is another important Python library often used in conjunction with Pandas for numerical operations and scientific computing. The official NumPy documentation provides in-depth explanations and examples of working with arrays, mathematical functions, and more.
  5. Matplotlib Documentation
    • Matplotlib is a powerful data visualization library in Python. It allows students to create a wide range of plots and charts to visualize their data. The Matplotlib documentation offers comprehensive guidance on creating different types of visualizations, customizing plots, and using various plotting functions. By referring to these resources, students can further expand their knowledge and explore advanced topics in Pandas, NumPy, and data visualization.

Hacks

  1. Find a CSV/JSON Dataset that interests you. Refer to Kaggle Datasets mentioned above.

  2. Try to show your Pandas learning by illustrating 5 different numerical analysis operations being done on the dataset. After showing each operation in a separate code block, add a sentence explaining what that operation is showing and what real-world implication it has. It is important to make sure that you are not only able to run code to analyze data, but also understand its implications.

  3. EXTRA: Research Matplotlib Documentation mentioned and implement a code block where you create a graph showing and visualize relationship in your chosen dataset. Then, add a sentence or two explaining what the relationship shows.