An Introduction
Interacting with data is something we do every day, whether consciously or subconsciously.
If you are interested in going into any field of STEM - whether it be the medical field, data science, or computer science - you will often be working with large datasets full of useful AND useless information.
A skill that is becoming increasingly important in these areas of work is the ability to effectively query and filter data in large datasets and draw conclusions based on those filters.
Pandas
Pandas is a Python library that allows for the manipulation, querying, and filtering of data.
Over time, it has become one of the most popular Python libraries for data analysis.
Here is the documentation link: https://pandas.pydata.org/docs/
Our Data
Data is readable in many formats. As someone who is working with datasets, you should be able to recognize what formats are easiest to understand for you and for any program that you write. Here are two of the most common data formats:
- JSON: This is a standard file format that is very easy for humans and computers to read and write. It is a compact way to store data.
- CSV: These are comma-separated value files, where data fields are separated by comma delimiters (separators).
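To make the difference concrete, here is a small illustrative sketch (not the actual dataset, just two rows borrowed from the table further down) that prints the same records in both formats:

import pandas as pd

records = [{"State": "Alabama", "MeanHouseholdIncome": 71964},
           {"State": "Alaska", "MeanHouseholdIncome": 98811}]
demo = pd.DataFrame(records)

print(demo.to_json(orient="records"))
# JSON: [{"State":"Alabama","MeanHouseholdIncome":71964},{"State":"Alaska","MeanHouseholdIncome":98811}]

print(demo.to_csv(index=False))
# CSV:
# State,MeanHouseholdIncome
# Alabama,71964
# Alaska,98811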
For the purpose of this notebook, we’ll use a JSON file containing data about the mean household income in each of the 50 states in the USA (located in /assets/datasets/income.json).
We will first look to understand and interpret the data ourselves, and then use Pandas and NumPy to provide insightful statistical information about the dataset that may not be as easy to find on our own.
import pandas as pd

df = pd.read_json('files/income.json')
# This creates the dataframe by reading the JSON file income.json (hence read_json).
# It is very important that the relative file path is correct, otherwise it won't be reading the file that you want it to.

display(df)
# This displays the dataframe built from the JSON data.
| State | MeanHouseholdIncome |
---|---|---|
0 | Alabama | 71964 |
1 | Alaska | 98811 |
2 | Arizona | 84380 |
3 | Arkansas | 69357 |
4 | California | 111622 |
5 | Colorado | 100933 |
6 | Connecticut | 115337 |
7 | Delaware | 92308 |
8 | Florida | 83104 |
9 | Georgia | 85961 |
10 | Hawaii | 107348 |
11 | Idaho | 77399 |
12 | Illinois | 95115 |
13 | Indiana | 76984 |
14 | Iowa | 80316 |
15 | Kansas | 82103 |
16 | Kentucky | 72318 |
17 | Louisiana | 73759 |
18 | Maine | 78301 |
19 | Maryland | 114236 |
20 | Massachusetts | 115964 |
21 | Michigan | 80803 |
22 | Minnesota | 96814 |
23 | Mississippi | 65156 |
24 | Missouri | 78194 |
25 | Montana | 76834 |
26 | Nebraska | 82306 |
27 | Nevada | 84350 |
28 | New Hampshire | 101292 |
29 | New Jersey | 117868 |
30 | New Mexico | 70241 |
31 | New York | 105304 |
32 | North Carolina | 79620 |
33 | North Dakota | 85506 |
34 | Ohio | 78796 |
35 | Oklahoma | 74195 |
36 | Oregon | 88137 |
37 | Pennsylvania | 87262 |
38 | Rhode Island | 92427 |
39 | South Carolina | 76390 |
40 | South Dakota | 77932 |
41 | Tennessee | 76937 |
42 | Texas | 89506 |
43 | Utah | 94452 |
44 | Vermont | 83767 |
45 | Virginia | 106023 |
46 | Washington | 103669 |
47 | West Virginia | 65332 |
48 | Wisconsin | 82757 |
49 | Wyoming | 83583 |
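As the comment above points out, a wrong relative path is a common source of errors when loading the file. Here is a minimal, optional sketch (assuming the same 'files/income.json' path used above) that fails with a clearer message when the path is wrong:

from pathlib import Path
import pandas as pd

data_path = Path('files/income.json')  # same relative path as above; adjust to your folder layout
if not data_path.exists():
    raise FileNotFoundError(f"Expected the dataset at {data_path.resolve()}")
df = pd.read_json(data_path)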
Dataset statistics
Let’s find and display some statistics from the dataset.
dfmean = df["MeanHouseholdIncome"].mean()
# Computes the mean (average) of the "MeanHouseholdIncome" column and stores it in dfmean.

print("Mean Household Income: $" + str(dfmean))
# A label is printed so the reader knows what was computed.
# The dfmean value is converted to a string so it can be concatenated with the label;
# concatenation also avoids the extra space that print's comma separator would add between the dollar sign and the value.
Mean Household Income: $87461.46
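An optional variation (not part of the original notebook) is to use an f-string, which converts and formats the number in a single step:

dfmean = df["MeanHouseholdIncome"].mean()
print(f"Mean Household Income: ${dfmean:,.2f}")
# Prints something like: Mean Household Income: $87,461.46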
Dataframe sort, Household Income
In this example, the numerical data (household income) is sorted.
df.sort_values(by="MeanHouseholdIncome")
# 50 states are being sorted based on the "MeanHouseholdIncome" column in ascending order
| State | MeanHouseholdIncome |
---|---|---|
23 | Mississippi | 65156 |
47 | West Virginia | 65332 |
3 | Arkansas | 69357 |
30 | New Mexico | 70241 |
0 | Alabama | 71964 |
16 | Kentucky | 72318 |
17 | Louisiana | 73759 |
35 | Oklahoma | 74195 |
39 | South Carolina | 76390 |
25 | Montana | 76834 |
41 | Tennessee | 76937 |
13 | Indiana | 76984 |
11 | Idaho | 77399 |
40 | South Dakota | 77932 |
24 | Missouri | 78194 |
18 | Maine | 78301 |
34 | Ohio | 78796 |
32 | North Carolina | 79620 |
14 | Iowa | 80316 |
21 | Michigan | 80803 |
15 | Kansas | 82103 |
26 | Nebraska | 82306 |
48 | Wisconsin | 82757 |
8 | Florida | 83104 |
49 | Wyoming | 83583 |
44 | Vermont | 83767 |
27 | Nevada | 84350 |
2 | Arizona | 84380 |
33 | North Dakota | 85506 |
9 | Georgia | 85961 |
37 | Pennsylvania | 87262 |
36 | Oregon | 88137 |
42 | Texas | 89506 |
7 | Delaware | 92308 |
38 | Rhode Island | 92427 |
43 | Utah | 94452 |
12 | Illinois | 95115 |
22 | Minnesota | 96814 |
1 | Alaska | 98811 |
5 | Colorado | 100933 |
28 | New Hampshire | 101292 |
46 | Washington | 103669 |
31 | New York | 105304 |
45 | Virginia | 106023 |
10 | Hawaii | 107348 |
4 | California | 111622 |
19 | Maryland | 114236 |
6 | Connecticut | 115337 |
20 | Massachusetts | 115964 |
29 | New Jersey | 117868 |
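A couple of optional variations on the same sort are often useful, for example sorting in descending order or keeping only the extremes (a sketch based on the dataframe above):

lowest_five = df.sort_values(by="MeanHouseholdIncome").head(5)
# The 5 states with the lowest mean household income (the top of the ascending sort above).

highest_five = df.sort_values(by="MeanHouseholdIncome", ascending=False).head(5)
# ascending=False reverses the order, so the highest-income states come first.

print(lowest_five)
print(highest_five)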
Dataframe sort, State
In this example, the categorical data (state names) is sorted.
df.sort_values(by="State")
# The data is sorted alphabetically based on the "State" column.
| State | MeanHouseholdIncome |
---|---|---|
0 | Alabama | 71964 |
1 | Alaska | 98811 |
2 | Arizona | 84380 |
3 | Arkansas | 69357 |
4 | California | 111622 |
5 | Colorado | 100933 |
6 | Connecticut | 115337 |
7 | Delaware | 92308 |
8 | Florida | 83104 |
9 | Georgia | 85961 |
10 | Hawaii | 107348 |
11 | Idaho | 77399 |
12 | Illinois | 95115 |
13 | Indiana | 76984 |
14 | Iowa | 80316 |
15 | Kansas | 82103 |
16 | Kentucky | 72318 |
17 | Louisiana | 73759 |
18 | Maine | 78301 |
19 | Maryland | 114236 |
20 | Massachusetts | 115964 |
21 | Michigan | 80803 |
22 | Minnesota | 96814 |
23 | Mississippi | 65156 |
24 | Missouri | 78194 |
25 | Montana | 76834 |
26 | Nebraska | 82306 |
27 | Nevada | 84350 |
28 | New Hampshire | 101292 |
29 | New Jersey | 117868 |
30 | New Mexico | 70241 |
31 | New York | 105304 |
32 | North Carolina | 79620 |
33 | North Dakota | 85506 |
34 | Ohio | 78796 |
35 | Oklahoma | 74195 |
36 | Oregon | 88137 |
37 | Pennsylvania | 87262 |
38 | Rhode Island | 92427 |
39 | South Carolina | 76390 |
40 | South Dakota | 77932 |
41 | Tennessee | 76937 |
42 | Texas | 89506 |
43 | Utah | 94452 |
44 | Vermont | 83767 |
45 | Virginia | 106023 |
46 | Washington | 103669 |
47 | West Virginia | 65332 |
48 | Wisconsin | 82757 |
49 | Wyoming | 83583 |
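One detail worth knowing: sort_values returns a new, sorted dataframe and leaves df itself unchanged, so assign the result if you want to keep the sorted order. A small sketch (ignore_index requires a reasonably recent pandas version):

df_by_state = df.sort_values(by="State", ignore_index=True)
# ignore_index=True renumbers the rows 0..49 instead of keeping the original index labels.
print(df_by_state.head())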
Statistical summary
In this example, all the summary statistics are generated using df.describe().
print(df.describe())
MeanHouseholdIncome
count 50.000000
mean 87461.460000
std 13945.982845
min 65156.000000
25% 77532.250000
50% 83675.000000
75% 96389.250000
max 117868.000000
Statistical Review
As seen in the output above, describe() summarizes the numeric column of the dataframe. The “count” statistic, for example, is the number of non-empty cells in the mean household income column. The mean is the average mean household income across all 50 states, and the standard deviation measures how much the values in that column deviate from the mean.
It is important to note that many more complex datasets will have multiple columns of explanatory data. In those cases, df.describe() summarizes every numeric column at once; to focus on a specific column, select it first and call describe() on that column, as shown below.
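For example, here is a short sketch of selecting a single column before describing it, and of passing custom percentiles:

print(df["MeanHouseholdIncome"].describe())
# Summarizes only the MeanHouseholdIncome column.

print(df.describe(percentiles=[0.1, 0.9]))
# describe() also accepts custom percentiles in addition to the defaults.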
Conclusion
What are the key takeaways from this lesson?
The purpose is to obtain a basic understanding of working with a dataset using Pandas dataframes. To obtain a more comprehensive understanding of Pandas capabilities, research operations such as filtering data based on certain criteria, grouping data, or performing calculations across multiple columns. Additional work can be done with related Python modules (e.g., NumPy, Matplotlib).
When analyzing a dataset, explain each operation briefly and give a real-world scenario where it would be useful. Every dataset that you work with should have a purpose - that’s what the field of data science is all about.
For instance, in the household income example, we analyzed a mean-household-income-by-state dataset. This could be applicable if someone wanted to find the most affordable place to live. Possible next steps:
- Find the minimum household income (see the sketch after this list)
- Expand the data to look at the affordability of areas within each state
- Perhaps add other factors, like employment, in those areas
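For the first item above, a minimal sketch of finding the state with the lowest mean household income, using the dataframe from this notebook:

lowest_idx = df["MeanHouseholdIncome"].idxmin()
# idxmin() returns the row label of the smallest value in the column.
print(df.loc[lowest_idx])
# Based on the table above, this is Mississippi at 65156.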
Additional Resources
- Pandas Documentation
  - This is an essential resource for learning about Pandas and its various functionalities. It provides detailed documentation, examples, and explanations of different methods and operations.
- Data Science Applications
  - This resource provides an overview of major applications of data science across various domains. It can help students understand the practical implications of data analysis and how it is used in different industries.
- Kaggle Datasets
  - Kaggle is a popular platform for data science and machine learning. It offers a wide range of datasets for practice and exploration. Students can find interesting datasets on different topics to apply their Pandas learning and gain hands-on experience.
- NumPy Documentation
  - NumPy is another important Python library often used in conjunction with Pandas for numerical operations and scientific computing. The official NumPy documentation provides in-depth explanations and examples of working with arrays, mathematical functions, and more.
- Matplotlib Documentation
  - Matplotlib is a powerful data visualization library in Python. It allows students to create a wide range of plots and charts to visualize their data. The Matplotlib documentation offers comprehensive guidance on creating different types of visualizations, customizing plots, and using various plotting functions.

By referring to these resources, students can further expand their knowledge and explore advanced topics in Pandas, NumPy, and data visualization.
Hacks
- Find a CSV/JSON dataset that interests you. Refer to the Kaggle Datasets resource mentioned above.
- Try to show your Pandas learning by illustrating 5 different numerical analysis operations on the dataset. After showing each operation in a separate code block, add a sentence explaining what that operation shows and what real-world implication it has. It is important to make sure that you are not only able to run code to analyze data, but also understand its implications.
- EXTRA: Research the Matplotlib Documentation mentioned above and implement a code block where you create a graph visualizing a relationship in your chosen dataset. Then, add a sentence or two explaining what the relationship shows.
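If you attempt the extra step, here is one possible starting point (a sketch using the income dataframe from this notebook; adapt the column names to your own dataset):

import matplotlib.pyplot as plt

top10 = df.sort_values(by="MeanHouseholdIncome", ascending=False).head(10)
plt.barh(top10["State"], top10["MeanHouseholdIncome"])  # horizontal bar chart, one bar per state
plt.xlabel("Mean Household Income ($)")
plt.title("Top 10 states by mean household income")
plt.tight_layout()
plt.show()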