An Introduction

Interacting with data is something we do every day, whether consciously or subconsciously.

If you are interested in going into any field of STEM - whether the medical field, data science, or computer science - you will often be working with large datasets full of useful AND useless information.

A skill that is becoming increasingly important in these areas of work is the ability to effectively query and filter large datasets and draw conclusions from the filtered results.

Pandas

Pandas is a Python library that allows for the manipulation, querying, and filtering of data.

Over time, it has become one of the most popular Python libraries for data analysis.

Here is the documentation link: https://pandas.pydata.org/docs/
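To give a feel for what "manipulation, querying, and filtering" look like in practice, here is a minimal sketch using a tiny DataFrame built from an inline dictionary (the values are made up for illustration):

```python
import pandas as pd

# A tiny DataFrame built from an inline dictionary (illustrative values only)
demo = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "MeanHouseholdIncome": [71964, 98811, 84380],
})

# Boolean indexing keeps only the rows where the condition holds
high = demo[demo["MeanHouseholdIncome"] > 80000]
print(high["State"].tolist())  # ['Alaska', 'Arizona']
```

The expression inside the brackets produces a column of True/False values, and Pandas keeps only the rows marked True - this pattern is the basis of most filtering you will do.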

Our Data

Data comes in many formats. As someone working with datasets, you should be able to recognize which formats are easiest to understand, both for you and for any program that you write. Here are two of the most common data formats:

  1. JSON (JavaScript Object Notation): a standard text format that is easy for both humans and computers to read and write, and a common way to store structured data.

  2. CSV (comma-separated values): plain-text files in which each line is a record and the values are separated by comma delimiters (separators).
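Pandas can load both formats directly. As a self-contained sketch, the snippet below encodes the same two records as CSV text and as JSON text and reads each with the matching Pandas function (StringIO stands in for a file path here; in practice you would pass a path such as the one used later in this notebook):

```python
import pandas as pd
from io import StringIO

# The same two records expressed in each of the two formats discussed above
csv_text = "State,MeanHouseholdIncome\nAlabama,71964\nAlaska,98811"
json_text = ('[{"State": "Alabama", "MeanHouseholdIncome": 71964},'
             ' {"State": "Alaska", "MeanHouseholdIncome": 98811}]')

# StringIO makes the sketch self-contained; normally you would pass a file path
df_from_csv = pd.read_csv(StringIO(csv_text))
df_from_json = pd.read_json(StringIO(json_text))

# Both loaders produce the same DataFrame
print(df_from_csv.equals(df_from_json))  # True
```

Whichever format the data arrives in, the result is an ordinary DataFrame, so everything that follows applies equally to CSV and JSON sources.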

For the purpose of this notebook, we’ll use a JSON file, income.json, containing data about the average household income in each of the 50 states in the USA. (The code below reads it via the relative path files/income.json; adjust the path to match wherever the file lives in your project.)

We will first look to understand and interpret the data ourselves, then use Pandas and NumPy to surface statistical insights about the dataset that would not be as easy to find by hand.

import pandas as pd 

df = pd.read_json('files/income.json')

# This defines the dataframe in question: read_json (hence the name) parses the JSON file income.json into a DataFrame.

# It is very important that the relative file path is correct; otherwise Pandas won't read the file you want (and will raise an error if the path doesn't exist).

display(df)

# This is now displaying the readable JSON data.


State MeanHouseholdIncome
0 Alabama 71964
1 Alaska 98811
2 Arizona 84380
3 Arkansas 69357
4 California 111622
5 Colorado 100933
6 Connecticut 115337
7 Delaware 92308
8 Florida 83104
9 Georgia 85961
10 Hawaii 107348
11 Idaho 77399
12 Illinois 95115
13 Indiana 76984
14 Iowa 80316
15 Kansas 82103
16 Kentucky 72318
17 Louisiana 73759
18 Maine 78301
19 Maryland 114236
20 Massachusetts 115964
21 Michigan 80803
22 Minnesota 96814
23 Mississippi 65156
24 Missouri 78194
25 Montana 76834
26 Nebraska 82306
27 Nevada 84350
28 New Hampshire 101292
29 New Jersey 117868
30 New Mexico 70241
31 New York 105304
32 North Carolina 79620
33 North Dakota 85506
34 Ohio 78796
35 Oklahoma 74195
36 Oregon 88137
37 Pennsylvania 87262
38 Rhode Island 92427
39 South Carolina 76390
40 South Dakota 77932
41 Tennessee 76937
42 Texas 89506
43 Utah 94452
44 Vermont 83767
45 Virginia 106023
46 Washington 103669
47 West Virginia 65332
48 Wisconsin 82757
49 Wyoming 83583

Dataset statistics

Let’s find and display some statistics from the dataset.

dfmean = df["MeanHouseholdIncome"].mean()

# dfmean stores the result of the mean() operation (the average) applied to the MeanHouseholdIncome column of the dataframe.

print("Mean Household Income: $" + str(dfmean))

# A label is printed so the user knows what is being computed. The dfmean value is converted to a string so it can be concatenated to the label, leaving no space between the dollar sign and the value.

Mean Household Income: $87461.46
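As a variation on the string-concatenation approach above, an f-string can round and format the mean in one step. A minimal sketch on an illustrative four-state subset (the values are taken from the table above, so the mean here differs from the full-dataset mean):

```python
import pandas as pd

# Illustrative four-state subset of the income data
df = pd.DataFrame({"MeanHouseholdIncome": [71964, 98811, 84380, 69357]})

dfmean = df["MeanHouseholdIncome"].mean()

# The :,.2f format spec adds a thousands separator and fixes two decimal places
print(f"Mean Household Income: ${dfmean:,.2f}")  # Mean Household Income: $81,128.00
```

This avoids the manual str() conversion and keeps currency values formatted consistently regardless of how many decimal places the raw mean happens to have.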

Dataframe sort, Household Income

In this example, the numerical data is sorted.

df.sort_values(by="MeanHouseholdIncome")

# The 50 states are sorted by the "MeanHouseholdIncome" column in ascending order (the default). Note that sort_values returns a new, sorted dataframe; df itself is left unchanged.
State MeanHouseholdIncome
23 Mississippi 65156
47 West Virginia 65332
3 Arkansas 69357
30 New Mexico 70241
0 Alabama 71964
16 Kentucky 72318
17 Louisiana 73759
35 Oklahoma 74195
39 South Carolina 76390
25 Montana 76834
41 Tennessee 76937
13 Indiana 76984
11 Idaho 77399
40 South Dakota 77932
24 Missouri 78194
18 Maine 78301
34 Ohio 78796
32 North Carolina 79620
14 Iowa 80316
21 Michigan 80803
15 Kansas 82103
26 Nebraska 82306
48 Wisconsin 82757
8 Florida 83104
49 Wyoming 83583
44 Vermont 83767
27 Nevada 84350
2 Arizona 84380
33 North Dakota 85506
9 Georgia 85961
37 Pennsylvania 87262
36 Oregon 88137
42 Texas 89506
7 Delaware 92308
38 Rhode Island 92427
43 Utah 94452
12 Illinois 95115
22 Minnesota 96814
1 Alaska 98811
5 Colorado 100933
28 New Hampshire 101292
46 Washington 103669
31 New York 105304
45 Virginia 106023
10 Hawaii 107348
4 California 111622
19 Maryland 114236
6 Connecticut 115337
20 Massachusetts 115964
29 New Jersey 117868
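To sort from highest to lowest instead, pass ascending=False to sort_values. A sketch on a small illustrative subset of the data:

```python
import pandas as pd

# Three states from the dataset, for illustration
df = pd.DataFrame({
    "State": ["Mississippi", "New Jersey", "Texas"],
    "MeanHouseholdIncome": [65156, 117868, 89506],
})

# ascending=False reverses the default order: highest income first
top = df.sort_values(by="MeanHouseholdIncome", ascending=False)
print(top["State"].tolist())  # ['New Jersey', 'Texas', 'Mississippi']
```

A descending sort is what you would use to rank states by income, as in the output above read bottom-to-top.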

Dataframe sort, State

In this example, the categorical data is sorted.

df.sort_values(by="State")

# The data is sorted alphabetically based on the "State" column.

State MeanHouseholdIncome
0 Alabama 71964
1 Alaska 98811
2 Arizona 84380
3 Arkansas 69357
4 California 111622
5 Colorado 100933
6 Connecticut 115337
7 Delaware 92308
8 Florida 83104
9 Georgia 85961
10 Hawaii 107348
11 Idaho 77399
12 Illinois 95115
13 Indiana 76984
14 Iowa 80316
15 Kansas 82103
16 Kentucky 72318
17 Louisiana 73759
18 Maine 78301
19 Maryland 114236
20 Massachusetts 115964
21 Michigan 80803
22 Minnesota 96814
23 Mississippi 65156
24 Missouri 78194
25 Montana 76834
26 Nebraska 82306
27 Nevada 84350
28 New Hampshire 101292
29 New Jersey 117868
30 New Mexico 70241
31 New York 105304
32 North Carolina 79620
33 North Dakota 85506
34 Ohio 78796
35 Oklahoma 74195
36 Oregon 88137
37 Pennsylvania 87262
38 Rhode Island 92427
39 South Carolina 76390
40 South Dakota 77932
41 Tennessee 76937
42 Texas 89506
43 Utah 94452
44 Vermont 83767
45 Virginia 106023
46 Washington 103669
47 West Virginia 65332
48 Wisconsin 82757
49 Wyoming 83583

Statistical summary

In this example, all the summary statistics are generated using df.describe().

print(df.describe())
       MeanHouseholdIncome
count            50.000000
mean          87461.460000
std           13945.982845
min           65156.000000
25%           77532.250000
50%           83675.000000
75%           96389.250000
max          117868.000000

Statistical Review

As seen in the output above, describe() summarizes each column for which the statistics are applicable. The “count” statistic, for example, is the number of non-empty cells in the mean household income column. The mean is the average mean household income across all 50 states, and the standard deviation measures how much the values in that column deviate from the mean.

It is important to note that many more complex datasets have multiple numeric columns. In those cases, df.describe() summarizes all of them at once, so to focus on a single measure you can select that column first, e.g. df["MeanHouseholdIncome"].describe().
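A minimal sketch of both behaviors, using a hypothetical second numeric column (the Population values here are made up for illustration):

```python
import pandas as pd

# A frame with two numeric columns; Population values are illustrative only
df = pd.DataFrame({
    "MeanHouseholdIncome": [71964, 98811, 84380],
    "Population": [5_024_279, 733_391, 7_151_502],
})

# describe() summarizes every numeric column by default...
print(df.describe())

# ...while selecting a column first restricts the summary to that column
print(df["MeanHouseholdIncome"].describe())
```

Selecting the column first returns a Series of statistics (count, mean, std, min, quartiles, max) for just that measure, which is usually easier to read when a dataset has many columns.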

Conclusion

What are the key takeaways from this lesson?

The purpose is to obtain a basic understanding of working with a dataset using Pandas dataframes. To build a more comprehensive understanding of Pandas’ capabilities, research operations such as filtering data by certain criteria, grouping data, or performing calculations across multiple columns. Additional work can be done with related Python modules (e.g., NumPy, Matplotlib).

Each example should be explained briefly, along with a real-world scenario where the operation would be useful. Every dataset that you work with should have a purpose - that’s what the field of data science is all about.

For instance, in the household income example, we analyzed a dataset of mean household income by state. This could be useful if someone wanted to find the most affordable place to live. Possible next steps:

  • Find the minimum household income
  • Expand data to look at affordability of areas within state
  • Perhaps add other factors like employment in those areas
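The first bullet can be sketched with min() and idxmin() on an illustrative subset of the data:

```python
import pandas as pd

# Three states from the dataset, for illustration
df = pd.DataFrame({
    "State": ["Mississippi", "West Virginia", "Maryland"],
    "MeanHouseholdIncome": [65156, 65332, 114236],
})

# idxmin() returns the index label of the smallest value,
# which .loc can then use to look up the corresponding state
lowest_idx = df["MeanHouseholdIncome"].idxmin()
print(df.loc[lowest_idx, "State"])          # Mississippi
print(df["MeanHouseholdIncome"].min())      # 65156
```

min() alone gives the lowest income; idxmin() is what ties that number back to the state it belongs to.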

Additional Resources

  1. Pandas Documentation
    • This is an essential resource for learning about Pandas and its various functionalities. It provides detailed documentation, examples, and explanations of different methods and operations.
  2. Data Science Applications
    • This resource provides an overview of major applications of data science across various domains. It can help students understand the practical implications of data analysis and how it is used in different industries.
  3. Kaggle Datasets
    • Kaggle is a popular platform for data science and machine learning. It offers a wide range of datasets for practice and exploration. Students can find interesting datasets on different topics to apply their Pandas learning and gain hands-on experience.
  4. NumPy Documentation
    • NumPy is another important Python library often used in conjunction with Pandas for numerical operations and scientific computing. The official NumPy documentation provides in-depth explanations and examples of working with arrays, mathematical functions, and more.
  5. Matplotlib Documentation
    • Matplotlib is a powerful data visualization library in Python. It allows students to create a wide range of plots and charts to visualize their data. The Matplotlib documentation offers comprehensive guidance on creating different types of visualizations, customizing plots, and using various plotting functions.

By referring to these resources, students can further expand their knowledge and explore advanced topics in Pandas, NumPy, and data visualization.

Hacks

  1. Find a CSV/JSON Dataset that interests you. Refer to Kaggle Datasets mentioned above.

  2. Try to show your Pandas learning by illustrating 5 different numerical analysis operations being done on the dataset. After showing each operation in a separate code block, add a sentence explaining what that operation is showing and what real-world implication it has. It is important to make sure that you are not only able to run code to analyze data, but also understand its implications.

  3. EXTRA: Research the Matplotlib Documentation mentioned above and implement a code block that creates a graph visualizing a relationship in your chosen dataset. Then, add a sentence or two explaining what the relationship shows.