# Lesson: Exploring Materials Project Data and Predicting Material Properties

## Step 1: Importing Libraries

In this lesson, we'll explore Materials Project data and perform various data analysis tasks using Python. First, we import the libraries needed to work with the data, visualize it, and perform machine learning.
from mp_api.client import MPRester
import pandas as pd
from pymatgen.core import Composition
import datetime
from math import inf
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict, cross_validate, KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
import seaborn as sns
#from sklearn.model_selection import train_test_split
#from sklearn.model_selection import cross_val_predict, cross_validate, KFold
#from sklearn.metrics import r2_score
#from sklearn.cluster import KMeans
#from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
#from sklearn.linear_model import LogisticRegression
#from sklearn.cluster import DBSCAN
## Step 2: Accessing Materials Project Data

The Materials Project is a database of computed materials properties. We'll use the MPRester client to query it for ABO3 compounds, fetching the material_id, formula_pretty, nsites, band_gap, formation_energy_per_atom, theoretical, and energy_above_hull fields.
mpr = MPRester(api_key="YOUR_API_KEY")  # use your own Materials Project API key here
ABO3 = mpr.summary.search(formula="**O3", fields=["material_id", "formula_pretty", "nsites", "band_gap", "formation_energy_per_atom", "theoretical", "energy_above_hull"])
ABO3_dicts = [i.dict() for i in ABO3]
mp_database = pd.DataFrame(data=ABO3_dicts)
## Step 3: Data Manipulation and Cleaning

Now that the Materials Project data is in a DataFrame named 'mp_database', let's clean and manipulate it to make it more suitable for analysis.
mp_database = mp_database.drop("fields_not_requested", axis=1)
len(mp_database)
2544
grouped = mp_database.groupby('formula_pretty').size()
grouped.mean()
1.912781954887218
mp_database.nunique()
nsites 18
formula_pretty 1330
material_id 2544
formation_energy_per_atom 2544
energy_above_hull 2199
band_gap 1297
theoretical 2
dtype: int64
mp_database
| | nsites | formula_pretty | material_id | formation_energy_per_atom | energy_above_hull | band_gap | theoretical |
|---|---|---|---|---|---|---|---|
0 | 20 | NaNbO3 | mp-4681 | -2.833432 | 0.012851 | 2.3684 | False |
1 | 10 | MnZnO3 | mp-754318 | -1.814413 | 0.005794 | 1.3225 | True |
2 | 20 | MnAlO3 | mp-1368992 | -2.573144 | 0.161167 | 0.5506 | True |
3 | 5 | BaBiO3 | mp-545783 | -2.222888 | 0.021211 | 0.0000 | False |
4 | 20 | MnTlO3 | mp-770870 | -1.505971 | 0.053737 | 0.0000 | True |
... | ... | ... | ... | ... | ... | ... | ... |
2539 | 5 | BaCdO3 | mp-1183292 | -1.645883 | 0.202881 | 0.0000 | True |
2540 | 10 | LaMnO3 | mp-1205375 | -2.968812 | 0.161878 | 0.0000 | False |
2541 | 5 | TlSiO3 | mp-1187529 | -1.862867 | 0.534038 | 0.0000 | True |
2542 | 5 | GdBeO3 | mp-1184468 | -2.874174 | 0.315310 | 0.0000 | True |
2543 | 5 | InSiO3 | mp-1184751 | -2.064795 | 0.561967 | 0.0000 | True |
2544 rows × 7 columns
mp_database['theoretical'].value_counts()[False]
836
mp_database['formation_energy_per_atom_ev_mol'] = mp_database['formation_energy_per_atom'] * 96491.5666
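The multiplier converts the formation energy from eV per atom to J per mol, so the new column is effectively in J/mol despite its `_ev_mol` suffix. A quick sanity check of the factor, using the defined values of the elementary charge and the Avogadro constant (a sketch):

```python
# 1 eV per atom, expressed per mole of atoms: (J per eV) x (atoms per mol)
eV_in_joules = 1.602176634e-19
avogadro = 6.02214076e23
print(eV_in_joules * avogadro)  # ~9.6485e4 J/mol for each eV/atom
```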
mp_database
| | nsites | formula_pretty | material_id | formation_energy_per_atom | energy_above_hull | band_gap | theoretical | formation_energy_per_atom_ev_mol |
|---|---|---|---|---|---|---|---|---|
0 | 20 | NaNbO3 | mp-4681 | -2.833432 | 0.012851 | 2.3684 | False | -273402.322598 |
1 | 10 | MnZnO3 | mp-754318 | -1.814413 | 0.005794 | 1.3225 | True | -175075.572015 |
2 | 20 | MnAlO3 | mp-1368992 | -2.573144 | 0.161167 | 0.5506 | True | -248286.682798 |
3 | 5 | BaBiO3 | mp-545783 | -2.222888 | 0.021211 | 0.0000 | False | -214489.986143 |
4 | 20 | MnTlO3 | mp-770870 | -1.505971 | 0.053737 | 0.0000 | True | -145313.457382 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2539 | 5 | BaCdO3 | mp-1183292 | -1.645883 | 0.202881 | 0.0000 | True | -158813.786775 |
2540 | 10 | LaMnO3 | mp-1205375 | -2.968812 | 0.161878 | 0.0000 | False | -286465.330840 |
2541 | 5 | TlSiO3 | mp-1187529 | -1.862867 | 0.534038 | 0.0000 | True | -179750.985874 |
2542 | 5 | GdBeO3 | mp-1184468 | -2.874174 | 0.315310 | 0.0000 | True | -277333.565860 |
2543 | 5 | InSiO3 | mp-1184751 | -2.064795 | 0.561967 | 0.0000 | True | -199235.298396 |
2544 rows × 8 columns
## Step 4: Classifying Materials

Let's classify the materials based on their stability and band gap properties. We'll define two functions, 'stability_category' and 'bandgap_category', to perform the classification.
# Assuming you already have a pandas DataFrame named 'mp_database' with columns 'energy_above_hull' and 'band_gap'
# Function to classify materials based on energy_above_hull
def stability_category(energy):
    if energy > 0.03:
        return 'unstable'
    else:
        return 'potentially_stable'

# Function to classify materials based on band_gap
def bandgap_category(band_gap):
    if band_gap == 0:
        return 'metal'
    elif 0 < band_gap <= 1:
        return 'small_bandgap'
    else:
        return 'large_bandgap'
# Apply the classification functions to the DataFrame
mp_database['stability'] = mp_database['energy_above_hull'].apply(stability_category)
mp_database['bandgap_type'] = mp_database['band_gap'].apply(bandgap_category)
# Create a pivot table to get the desired table structure
new_table = pd.pivot_table(mp_database, index='stability', columns='bandgap_type', aggfunc='size', fill_value=0)
# Reorder the columns to match the desired order
new_table = new_table[['metal', 'small_bandgap', 'large_bandgap']]
print(new_table)
bandgap_type metal small_bandgap large_bandgap
stability
potentially_stable 197 103 525
unstable 1042 212 465
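The same counts can be obtained without a pivot table, for example with `pd.crosstab` (a sketch using the two label columns created above):

```python
# Cross-tabulate the two categorical labels; the counts should match the pivot table above.
counts = pd.crosstab(mp_database['stability'], mp_database['bandgap_type'])
print(counts[['metal', 'small_bandgap', 'large_bandgap']])
```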
mp_database
| | nsites | formula_pretty | material_id | formation_energy_per_atom | energy_above_hull | band_gap | theoretical | formation_energy_per_atom_ev_mol | stability | bandgap_type |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 20 | NaNbO3 | mp-4681 | -2.833432 | 0.012851 | 2.3684 | False | -273402.322598 | potentially_stable | large_bandgap |
1 | 10 | MnZnO3 | mp-754318 | -1.814413 | 0.005794 | 1.3225 | True | -175075.572015 | potentially_stable | large_bandgap |
2 | 20 | MnAlO3 | mp-1368992 | -2.573144 | 0.161167 | 0.5506 | True | -248286.682798 | unstable | small_bandgap |
3 | 5 | BaBiO3 | mp-545783 | -2.222888 | 0.021211 | 0.0000 | False | -214489.986143 | potentially_stable | metal |
4 | 20 | MnTlO3 | mp-770870 | -1.505971 | 0.053737 | 0.0000 | True | -145313.457382 | unstable | metal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2539 | 5 | BaCdO3 | mp-1183292 | -1.645883 | 0.202881 | 0.0000 | True | -158813.786775 | unstable | metal |
2540 | 10 | LaMnO3 | mp-1205375 | -2.968812 | 0.161878 | 0.0000 | False | -286465.330840 | unstable | metal |
2541 | 5 | TlSiO3 | mp-1187529 | -1.862867 | 0.534038 | 0.0000 | True | -179750.985874 | unstable | metal |
2542 | 5 | GdBeO3 | mp-1184468 | -2.874174 | 0.315310 | 0.0000 | True | -277333.565860 | unstable | metal |
2543 | 5 | InSiO3 | mp-1184751 | -2.064795 | 0.561967 | 0.0000 | True | -199235.298396 | unstable | metal |
2544 rows × 10 columns
## Step 5: Data Visualization - Formation Energy Distribution

Visualizing data is crucial for understanding its distribution and characteristics. Let's create a histogram to visualize the distribution of formation energies per atom.
import pandas as pd
import matplotlib.pyplot as plt
# Assuming you have a pandas DataFrame named 'mp_database' with the 'formation_energy_per_atom' column
# Calculate the average and standard deviation
average_energy = mp_database['formation_energy_per_atom'].mean()
std_deviation = mp_database['formation_energy_per_atom'].std()
# Plot the distribution of the column
plt.figure(figsize=(10, 6))
plt.hist(mp_database['formation_energy_per_atom'], bins=30, edgecolor='black', alpha=0.7)
# Add labels and title to the plot
plt.xlabel('Formation Energies per Atom')
plt.ylabel('Frequency')
plt.title('Distribution of Formation Energies per Atom')
# Annotate the plot with average and standard deviation
plt.axvline(average_energy, color='red', linestyle='dashed', linewidth=2, label='Average')
plt.axvline(average_energy + std_deviation, color='orange', linestyle='dashed', linewidth=2, label='Standard Deviation')
plt.axvline(average_energy - std_deviation, color='orange', linestyle='dashed', linewidth=2)
plt.annotate(f'Average: {average_energy:.2f}', xy=(average_energy, 300), xytext=(average_energy + 0.5, 400),
arrowprops=dict(facecolor='black', shrink=0.05), fontsize=12)
plt.annotate(f'Standard Deviation: {std_deviation:.2f}', xy=(average_energy + std_deviation, 250),
xytext=(average_energy + std_deviation + 1, 350), arrowprops=dict(facecolor='black', shrink=0.05),
fontsize=12)
# Show legend
plt.legend()
# Display the plot
plt.show()
## Step 6: Filtering and Preparing Data for Analysis

Next, we'll filter the data to keep only the most relevant materials. We'll sort the data based on formation energy and remove duplicate formulas, keeping only the material with the lowest formation energy for each formula.
orig_data = pd.read_csv("data2022.csv", na_filter=False)
len(orig_data)
681
orig_data
| | Unnamed: 0 | task_id | formula | formation_energy_per_atom | e_above_hull | band_gap | has_bandstructure |
|---|---|---|---|---|---|---|---|
0 | 297 | mp-570140 | AuBr | -0.125181 | 0.011579 | 1.9716 | True |
1 | 573 | mp-1206190 | ZnI6 | 0.387945 | 0.652470 | 0.1076 | False |
2 | 579 | mp-1208424 | TeBr4 | -0.390060 | 0.000000 | 2.5202 | False |
3 | 738 | mp-570480 | TcBr4 | -0.404404 | 0.000000 | 0.6025 | True |
4 | 880 | mp-1064459 | PBr | 1.067789 | 1.327882 | 0.0000 | False |
... | ... | ... | ... | ... | ... | ... | ... |
676 | 118973 | mp-974371 | Rh3Br | 0.670866 | 0.820678 | 0.0000 | True |
677 | 121600 | mp-867231 | TbI3 | -1.382737 | 0.000000 | 2.0937 | True |
678 | 121602 | mp-867247 | TbBr3 | -1.860767 | 0.000000 | 2.8570 | True |
679 | 121619 | mp-867355 | TeBr2 | -0.186842 | 0.140200 | 0.6529 | True |
680 | 121834 | mp-867893 | SmI3 | -1.399012 | 0.000000 | 2.0145 | True |
681 rows × 7 columns
import pandas as pd
# Assuming you have a pandas DataFrame named 'orig_data' with 'formation_energy_per_atom' and 'formula' columns
# Sort the DataFrame by 'formation_energy_per_atom' in ascending order
orig_data.sort_values(by='formation_energy_per_atom', ascending=True, inplace=True)
# Drop duplicates based on the 'formula' column and keep the first occurrence (smallest formation_energy_per_atom)
filtered_data = orig_data.drop_duplicates(subset='formula', keep='first')
# Display the resulting DataFrame
filtered_data
| | Unnamed: 0 | task_id | formula | formation_energy_per_atom | e_above_hull | band_gap | has_bandstructure |
|---|---|---|---|---|---|---|---|
191 | 29313 | mp-27456 | BaBr2 | -2.227832 | 0.000000 | 4.3731 | True |
475 | 80856 | mp-567744 | SrBr2 | -2.156160 | 0.000000 | 4.4697 | True |
219 | 35022 | mp-1071541 | YbBr2 | -2.109686 | 0.000000 | 4.3169 | False |
414 | 70294 | mp-27972 | AcBr3 | -2.098107 | 0.000000 | 4.1045 | True |
304 | 51723 | mp-22888 | CaBr2 | -2.052632 | 0.000000 | 4.5304 | True |
... | ... | ... | ... | ... | ... | ... | ... |
268 | 44109 | mp-1184756 | IrI | 0.949140 | 1.047085 | 0.0000 | False |
601 | 105675 | mp-1197830 | CI | 0.950707 | 0.950707 | 3.4057 | False |
541 | 93486 | mp-975905 | Mo3I | 0.977325 | 1.122018 | 0.0000 | True |
517 | 88638 | mp-1209830 | PBr2 | 1.261139 | 1.607930 | 1.7322 | False |
587 | 102482 | mp-974474 | ReBr | 1.306459 | 1.507749 | 0.0000 | True |
392 rows × 7 columns
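An equivalent, more compact route (a sketch; `filtered_data_alt` is just an illustrative name) keeps the lowest-energy entry per formula with `groupby` and `idxmin`:

```python
# Index of the minimum formation energy within each formula group, then select those rows.
lowest_idx = orig_data.groupby('formula')['formation_energy_per_atom'].idxmin()
filtered_data_alt = orig_data.loc[lowest_idx]
```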
filtered_data = filtered_data[filtered_data["formation_energy_per_atom"] < 0]
filtered_data
| | Unnamed: 0 | task_id | formula | formation_energy_per_atom | e_above_hull | band_gap | has_bandstructure |
|---|---|---|---|---|---|---|---|
191 | 29313 | mp-27456 | BaBr2 | -2.227832 | 0.000000 | 4.3731 | True |
475 | 80856 | mp-567744 | SrBr2 | -2.156160 | 0.000000 | 4.4697 | True |
219 | 35022 | mp-1071541 | YbBr2 | -2.109686 | 0.000000 | 4.3169 | False |
414 | 70294 | mp-27972 | AcBr3 | -2.098107 | 0.000000 | 4.1045 | True |
304 | 51723 | mp-22888 | CaBr2 | -2.052632 | 0.000000 | 4.5304 | True |
... | ... | ... | ... | ... | ... | ... | ... |
222 | 35973 | mp-862773 | TeI2 | -0.036698 | 0.070018 | 0.7144 | False |
481 | 82174 | mp-1185549 | Mg149Br | -0.022094 | 0.000000 | 0.0021 | False |
600 | 105624 | mp-1185590 | Mg149I | -0.017586 | 0.000000 | 0.0025 | False |
512 | 87707 | mp-1214775 | AuBr2 | -0.007624 | 0.143039 | 0.0000 | False |
544 | 94095 | mp-1179985 | PBr | -0.007386 | 0.252708 | 1.3954 | False |
338 rows × 7 columns
## Step 7: Calculating Composition-Averaged Properties

We will now calculate the composition-averaged atomic radius for each material in 'filtered_data', using the element properties stored in the 'element_data' DataFrame.
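For example, for BaBr2 the composition-averaged atomic radius is (1 × 2.15 + 2 × 1.15) / 3 ≈ 1.483, using the AtomicRadius values tabulated for Ba and Br; this matches the first row of the `atomic_radius` result computed below.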
element_data = pd.read_csv("element_properties.csv", index_col=0)
element_data
| | AtomicRadius | AtomicVolume | AtomicWeight | BoilingT | CovalentRadius | Density | ElectronAffinity | Electronegativity | FirstIonizationEnergy | HeatCapacityMass | SecondIonizationEnergy |
|---|---|---|---|---|---|---|---|---|---|---|---|
H | 0.25 | 18618.051940 | 1.007940 | 20.13 | 31 | 0.0899 | 72.8 | 2.20 | 13.598443 | 14.304 | NaN |
He | NaN | 37236.035560 | 4.002602 | 4.07 | 28 | 0.1785 | 0.0 | NaN | 24.587387 | 5.130 | 54.41776 |
Li | 1.45 | 21.544058 | 6.941000 | 1615.00 | 128 | 535.0000 | 59.6 | 0.98 | 5.391719 | 3.582 | 75.64000 |
Be | 1.05 | 8.098176 | 9.012182 | 2743.00 | 96 | 1848.0000 | 0.0 | 1.57 | 9.322700 | 1.825 | 18.21114 |
B | 0.85 | 7.297767 | 10.811000 | 4273.00 | 84 | 2460.0000 | 26.7 | 2.04 | 8.298020 | 1.026 | 25.15480 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Pa | 1.80 | 24.961161 | 231.035860 | 4273.00 | 200 | 15370.0000 | NaN | 1.50 | 5.890000 | NaN | NaN |
U | 1.75 | 20.748847 | 238.028910 | 4200.00 | 196 | 19050.0000 | NaN | 1.38 | 6.194100 | 0.116 | 10.60000 |
Np | 1.75 | 19.244839 | 237.000000 | 4273.00 | 190 | 20450.0000 | NaN | 1.36 | 6.265700 | NaN | NaN |
Pu | 1.75 | 20.447164 | 244.000000 | 3503.00 | 187 | 19816.0000 | NaN | 1.28 | 6.026000 | NaN | 11.20000 |
Am | 1.75 | 29.518685 | 243.000000 | 2284.00 | 180 | 13670.0000 | NaN | 1.30 | 5.973800 | NaN | NaN |
95 rows × 11 columns
type(element_data["AtomicRadius"][1])
numpy.float64
comp = Composition('Al2O3')
print(comp.elements)
[Element Al, Element O]
print(comp.to_data_dict['unit_cell_composition'])
defaultdict(<class 'float'>, {'Al': 2.0, 'O': 3.0})
import numpy as np
count = 0
column_nans = {}
for column in element_data.columns:
    for element in element_data[column]:
        if np.isnan(element):
            count += 1
    column_nans[column] = count
    count = 0
column_nans
{'AtomicRadius': 7,
'AtomicVolume': 2,
'AtomicWeight': 0,
'BoilingT': 2,
'CovalentRadius': 0,
'Density': 2,
'ElectronAffinity': 9,
'Electronegativity': 4,
'FirstIonizationEnergy': 1,
'HeatCapacityMass': 10,
'SecondIonizationEnergy': 12}
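For reference, pandas can produce the same per-column NaN counts in a single call (a sketch):

```python
# Count missing values in each column directly.
element_data.isna().sum()
```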
# Set of commands to compute the mean value for each column, replace the NaNs by the mean and output the updated data set
for i in element_data.columns:
    col_mean = element_data[i].mean(skipna=True)
    element_data[i] = element_data[i].fillna(col_mean)
element_data
| | AtomicRadius | AtomicVolume | AtomicWeight | BoilingT | CovalentRadius | Density | ElectronAffinity | Electronegativity | FirstIonizationEnergy | HeatCapacityMass | SecondIonizationEnergy |
|---|---|---|---|---|---|---|---|---|---|---|---|
H | 0.250000 | 18618.051940 | 1.007940 | 20.13 | 31 | 0.0899 | 72.800000 | 2.200000 | 13.598443 | 14.304000 | 18.947504 |
He | 1.500682 | 37236.035560 | 4.002602 | 4.07 | 28 | 0.1785 | 0.000000 | 1.747033 | 24.587387 | 5.130000 | 54.417760 |
Li | 1.450000 | 21.544058 | 6.941000 | 1615.00 | 128 | 535.0000 | 59.600000 | 0.980000 | 5.391719 | 3.582000 | 75.640000 |
Be | 1.050000 | 8.098176 | 9.012182 | 2743.00 | 96 | 1848.0000 | 0.000000 | 1.570000 | 9.322700 | 1.825000 | 18.211140 |
B | 0.850000 | 7.297767 | 10.811000 | 4273.00 | 84 | 2460.0000 | 26.700000 | 2.040000 | 8.298020 | 1.026000 | 25.154800 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Pa | 1.800000 | 24.961161 | 231.035860 | 4273.00 | 200 | 15370.0000 | 76.162209 | 1.500000 | 5.890000 | 0.635447 | 18.947504 |
U | 1.750000 | 20.748847 | 238.028910 | 4200.00 | 196 | 19050.0000 | 76.162209 | 1.380000 | 6.194100 | 0.116000 | 10.600000 |
Np | 1.750000 | 19.244839 | 237.000000 | 4273.00 | 190 | 20450.0000 | 76.162209 | 1.360000 | 6.265700 | 0.635447 | 18.947504 |
Pu | 1.750000 | 20.447164 | 244.000000 | 3503.00 | 187 | 19816.0000 | 76.162209 | 1.280000 | 6.026000 | 0.635447 | 11.200000 |
Am | 1.750000 | 29.518685 | 243.000000 | 2284.00 | 180 | 13670.0000 | 76.162209 | 1.300000 | 5.973800 | 0.635447 | 18.947504 |
95 rows × 11 columns
# Set of commands to compute and output the composition-averaged Atomic Radius for all materials
begin_time = datetime.datetime.now()
atomic_radius = pd.DataFrame()
filtered_data["composition_dict"] = [Composition(i).as_dict() for i in filtered_data.formula]
atomic_radius["formula"] = filtered_data.formula
atomic_radius["atomic_radius_avg"] = [np.average([element_data.loc[[el]]["AtomicRadius"][0] for el in comp], weights=[comp[el] for el in comp]) for comp in filtered_data.composition_dict]
print(datetime.datetime.now() - begin_time)
atomic_radius
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
0:00:00.211715
| | formula | atomic_radius_avg |
|---|---|---|
191 | BaBr2 | 1.483333 |
475 | SrBr2 | 1.433333 |
219 | YbBr2 | 1.350000 |
414 | AcBr3 | 1.350000 |
304 | CaBr2 | 1.366667 |
... | ... | ... |
222 | TeI2 | 1.400000 |
481 | Mg149Br | 1.497667 |
600 | Mg149I | 1.499333 |
512 | AuBr2 | 1.216667 |
544 | PBr | 1.075000 |
338 rows × 2 columns
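The `SettingWithCopyWarning` printed above appears because `filtered_data` is a slice of `orig_data`. A minimal way to avoid it (a sketch) is to take an explicit copy before adding new columns:

```python
# Work on an independent copy so that new columns do not touch the original DataFrame.
filtered_data = filtered_data.copy()
filtered_data["composition_dict"] = [Composition(f).as_dict() for f in filtered_data.formula]
```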
atomic_radius_2 = pd.DataFrame()
atomic_radius_2["formula"] = filtered_data.formula
atomic_radius_2["composition_dict"] = filtered_data.composition_dict
composition_averaged_list = []
for composition in atomic_radius_2["composition_dict"]:
    total_atoms_in_formula = 0
    element_radius_weighted = 0
    for element in composition:
        total_atoms_in_formula += composition[element]
        element_radius_weighted += element_data.loc[element]["AtomicRadius"] * composition[element]
    average = element_radius_weighted / total_atoms_in_formula
    composition_averaged_list.append(average)
atomic_radius_2["atomic_radius_avg"] = composition_averaged_list
atomic_radius_2
| | formula | composition_dict | atomic_radius_avg |
|---|---|---|---|
191 | BaBr2 | {'Ba': 1.0, 'Br': 2.0} | 1.483333 |
475 | SrBr2 | {'Sr': 1.0, 'Br': 2.0} | 1.433333 |
219 | YbBr2 | {'Yb': 1.0, 'Br': 2.0} | 1.350000 |
414 | AcBr3 | {'Ac': 1.0, 'Br': 3.0} | 1.350000 |
304 | CaBr2 | {'Ca': 1.0, 'Br': 2.0} | 1.366667 |
... | ... | ... | ... |
222 | TeI2 | {'Te': 1.0, 'I': 2.0} | 1.400000 |
481 | Mg149Br | {'Mg': 149.0, 'Br': 1.0} | 1.497667 |
600 | Mg149I | {'Mg': 149.0, 'I': 1.0} | 1.499333 |
512 | AuBr2 | {'Au': 1.0, 'Br': 2.0} | 1.216667 |
544 | PBr | {'P': 1.0, 'Br': 1.0} | 1.075000 |
338 rows × 3 columns
average_properties = pd.DataFrame()
average_properties["formula"] = filtered_data.formula
average_properties["composition_dict"] = filtered_data.composition_dict
for prop in element_data.columns:
    composition_averaged_list = []
    for composition in average_properties["composition_dict"]:
        total_atoms_in_formula = 0
        element_property_weighted = 0
        for element in composition:
            total_atoms_in_formula += composition[element]
            element_property_weighted += element_data.loc[element][prop] * composition[element]
        average = element_property_weighted / total_atoms_in_formula
        composition_averaged_list.append(average)
    average_properties[prop] = composition_averaged_list
average_properties
| | formula | composition_dict | AtomicRadius | AtomicVolume | AtomicWeight | BoilingT | CovalentRadius | Density | ElectronAffinity | Electronegativity | FirstIonizationEnergy | HeatCapacityMass | SecondIonizationEnergy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
191 | BaBr2 | {'Ba': 1.0, 'Br': 2.0} | 1.483333 | 50.008311 | 99.045000 | 935.666667 | 151.666667 | 3250.000000 | 221.050000 | 2.270000 | 9.613088 | 0.384000 | 17.728610 |
475 | SrBr2 | {'Sr': 1.0, 'Br': 2.0} | 1.433333 | 46.792927 | 82.476000 | 773.000000 | 145.000000 | 2956.666667 | 218.076667 | 2.290000 | 9.774150 | 0.418000 | 18.070700 |
219 | YbBr2 | {'Yb': 1.0, 'Br': 2.0} | 1.350000 | 42.931774 | 110.954000 | 711.000000 | 142.333333 | 4270.000000 | 233.066667 | 2.393333 | 9.960587 | 0.367667 | 18.452667 |
414 | AcBr3 | {'Ac': 1.0, 'Br': 3.0} | 1.350000 | 41.254141 | 116.678000 | 1117.250000 | 143.750000 | 4857.500000 | 262.490552 | 2.495000 | 10.152850 | 0.385500 | 19.130750 |
304 | CaBr2 | {'Ca': 1.0, 'Br': 2.0} | 1.366667 | 42.664279 | 66.628667 | 807.000000 | 138.666667 | 2596.666667 | 217.190000 | 2.306667 | 9.913587 | 0.531667 | 18.351240 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
222 | TeI2 | {'Te': 1.0, 'I': 2.0} | 1.400000 | 39.758135 | 127.136313 | 725.200000 | 138.666667 | 5373.333333 | 260.200000 | 2.473333 | 9.970707 | 0.210000 | 18.954200 |
481 | Mg149Br | {'Mg': 149.0, 'Br': 1.0} | 1.497667 | 23.350997 | 24.675660 | 1356.126667 | 140.860000 | 1747.213333 | 2.164000 | 1.321000 | 7.674019 | 1.019340 | 15.078975 |
600 | Mg149I | {'Mg': 149.0, 'I': 1.0} | 1.499333 | 23.351870 | 24.988996 | 1356.962000 | 140.986667 | 1759.346667 | 1.968000 | 1.319000 | 7.664935 | 1.017607 | 15.062577 |
512 | AuBr2 | {'Au': 1.0, 'Br': 2.0} | 1.216667 | 34.000905 | 118.924856 | 1264.333333 | 125.333333 | 8513.333333 | 290.666667 | 2.820000 | 10.951043 | 0.359000 | 21.127333 |
544 | PBr | {'P': 1.0, 'Br': 1.0} | 1.075000 | 35.370974 | 55.438881 | 442.750000 | 113.500000 | 2471.500000 | 198.300000 | 2.575000 | 11.150245 | 0.621500 | 20.680250 |
338 rows × 13 columns
max_properties = pd.DataFrame()
max_properties["formula"] = filtered_data.formula
max_properties["composition_dict"] = filtered_data.composition_dict
for prop in element_data.columns:
    composition_max_list = []
    for composition in max_properties["composition_dict"]:
        # reset for each material so we record the per-composition maximum
        max_property = 0
        for element in composition:
            if element_data.loc[element][prop] > max_property:
                max_property = element_data.loc[element][prop]
        composition_max_list.append(max_property)
    max_properties[prop] = composition_max_list
max_properties
| | formula | composition_dict | AtomicRadius | AtomicVolume | AtomicWeight | BoilingT | CovalentRadius | Density | ElectronAffinity | Electronegativity | FirstIonizationEnergy | HeatCapacityMass | SecondIonizationEnergy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
191 | BaBr2 | {'Ba': 1.0, 'Br': 2.0} | 2.15 | 64.969282 | 137.327 | 2143.0 | 215.0 | 3510.0 | 324.6 | 2.96 | 11.813800 | 0.474 | 21.591 |
475 | SrBr2 | {'Sr': 1.0, 'Br': 2.0} | 2.15 | 64.969282 | 137.327 | 2143.0 | 215.0 | 3510.0 | 324.6 | 2.96 | 11.813800 | 0.474 | 21.591 |
219 | YbBr2 | {'Yb': 1.0, 'Br': 2.0} | 2.15 | 64.969282 | 173.054 | 2143.0 | 215.0 | 6570.0 | 324.6 | 2.96 | 11.813800 | 0.474 | 21.591 |
414 | AcBr3 | {'Ac': 1.0, 'Br': 3.0} | 2.15 | 64.969282 | 227.000 | 3473.0 | 215.0 | 10070.0 | 324.6 | 2.96 | 11.813800 | 0.474 | 21.591 |
304 | CaBr2 | {'Ca': 1.0, 'Br': 2.0} | 2.15 | 64.969282 | 227.000 | 3473.0 | 215.0 | 10070.0 | 324.6 | 2.96 | 11.813800 | 0.647 | 21.591 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
222 | TeI2 | {'Te': 1.0, 'I': 2.0} | 2.60 | 18618.051940 | 244.000 | 5869.0 | 244.0 | 22590.0 | 324.6 | 2.96 | 13.598443 | 14.304 | 75.640 |
481 | Mg149Br | {'Mg': 149.0, 'Br': 1.0} | 2.60 | 18618.051940 | 244.000 | 5869.0 | 244.0 | 22590.0 | 324.6 | 2.96 | 13.598443 | 14.304 | 75.640 |
600 | Mg149I | {'Mg': 149.0, 'I': 1.0} | 2.60 | 18618.051940 | 244.000 | 5869.0 | 244.0 | 22590.0 | 324.6 | 2.96 | 13.598443 | 14.304 | 75.640 |
512 | AuBr2 | {'Au': 1.0, 'Br': 2.0} | 2.60 | 18618.051940 | 244.000 | 5869.0 | 244.0 | 22590.0 | 324.6 | 2.96 | 13.598443 | 14.304 | 75.640 |
544 | PBr | {'P': 1.0, 'Br': 1.0} | 2.60 | 18618.051940 | 244.000 | 5869.0 | 244.0 | 22590.0 | 324.6 | 2.96 | 13.598443 | 14.304 | 75.640 |
338 rows × 13 columns
min_properties = pd.DataFrame()
min_properties["formula"] = filtered_data.formula
min_properties["composition_dict"] = filtered_data.composition_dict
for prop in element_data.columns:
    composition_min_list = []
    for composition in min_properties["composition_dict"]:
        # reset for each material so we record the per-composition minimum
        min_property = inf
        for element in composition:
            if element_data.loc[element][prop] < min_property:
                min_property = element_data.loc[element][prop]
        composition_min_list.append(min_property)
    min_properties[prop] = composition_min_list
min_properties
| | formula | composition_dict | AtomicRadius | AtomicVolume | AtomicWeight | BoilingT | CovalentRadius | Density | ElectronAffinity | Electronegativity | FirstIonizationEnergy | HeatCapacityMass | SecondIonizationEnergy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
191 | BaBr2 | {'Ba': 1.0, 'Br': 2.0} | 1.15 | 42.527825 | 79.90400 | 332.00 | 120.0 | 3120.0000 | 13.95 | 0.89 | 5.211664 | 0.204 | 10.00383 |
475 | SrBr2 | {'Sr': 1.0, 'Br': 2.0} | 1.15 | 42.527825 | 79.90400 | 332.00 | 120.0 | 2630.0000 | 5.03 | 0.89 | 5.211664 | 0.204 | 10.00383 |
219 | YbBr2 | {'Yb': 1.0, 'Br': 2.0} | 1.15 | 42.527825 | 79.90400 | 332.00 | 120.0 | 2630.0000 | 5.03 | 0.89 | 5.211664 | 0.155 | 10.00383 |
414 | AcBr3 | {'Ac': 1.0, 'Br': 3.0} | 1.15 | 37.433086 | 79.90400 | 332.00 | 120.0 | 2630.0000 | 5.03 | 0.89 | 5.170000 | 0.120 | 10.00383 |
304 | CaBr2 | {'Ca': 1.0, 'Br': 2.0} | 1.15 | 37.433086 | 40.07800 | 332.00 | 120.0 | 1550.0000 | 2.37 | 0.89 | 5.170000 | 0.120 | 10.00383 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
222 | TeI2 | {'Te': 1.0, 'I': 2.0} | 0.25 | 7.297767 | 1.00794 | 20.13 | 31.0 | 0.0899 | 0.00 | 0.79 | 3.893905 | 0.116 | 10.00383 |
481 | Mg149Br | {'Mg': 149.0, 'Br': 1.0} | 0.25 | 7.297767 | 1.00794 | 20.13 | 31.0 | 0.0899 | 0.00 | 0.79 | 3.893905 | 0.116 | 10.00383 |
600 | Mg149I | {'Mg': 149.0, 'I': 1.0} | 0.25 | 7.297767 | 1.00794 | 20.13 | 31.0 | 0.0899 | 0.00 | 0.79 | 3.893905 | 0.116 | 10.00383 |
512 | AuBr2 | {'Au': 1.0, 'Br': 2.0} | 0.25 | 7.297767 | 1.00794 | 20.13 | 31.0 | 0.0899 | 0.00 | 0.79 | 3.893905 | 0.116 | 10.00383 |
544 | PBr | {'P': 1.0, 'Br': 1.0} | 0.25 | 7.297767 | 1.00794 | 20.13 | 31.0 | 0.0899 | 0.00 | 0.79 | 3.893905 | 0.116 | 10.00383 |
338 rows × 13 columns
average_properties = average_properties.drop("composition_dict", axis=1)
average_properties = average_properties.drop("formula", axis=1)
max_properties = max_properties.drop("composition_dict", axis=1)
max_properties = max_properties.drop("formula", axis=1)
min_properties = min_properties.drop("composition_dict", axis=1)
min_properties = min_properties.drop("formula", axis=1)
## Step 8: Combining Property Data

Next, we'll combine the composition-averaged, maximum, and minimum property DataFrames ('average_properties', 'max_properties', and 'min_properties') into a single design matrix DataFrame named 'design_matrix'.
design_matrix = pd.concat([average_properties, max_properties, min_properties], axis=1)
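Because the three blocks share identical column names, the concatenated design matrix contains duplicate labels. One option (a sketch) is to add a suffix to each block before concatenating:

```python
# Keep the column names unique, e.g. 'AtomicRadius_avg', 'AtomicRadius_max', 'AtomicRadius_min'.
design_matrix = pd.concat(
    [average_properties.add_suffix('_avg'),
     max_properties.add_suffix('_max'),
     min_properties.add_suffix('_min')],
    axis=1,
)
```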
## Step 9: Splitting Data for Machine Learning

To train and evaluate our machine learning model, we need to split the data into training and testing sets.
targets = filtered_data[["band_gap", "formation_energy_per_atom"]]
targets.index = filtered_data.formula
targets
| formula | band_gap | formation_energy_per_atom |
|---|---|---|
BaBr2 | 4.3731 | -2.227832 |
SrBr2 | 4.4697 | -2.156160 |
YbBr2 | 4.3169 | -2.109686 |
AcBr3 | 4.1045 | -2.098107 |
CaBr2 | 4.5304 | -2.052632 |
... | ... | ... |
TeI2 | 0.7144 | -0.036698 |
Mg149Br | 0.0021 | -0.022094 |
Mg149I | 0.0025 | -0.017586 |
AuBr2 | 0.0000 | -0.007624 |
PBr | 1.3954 | -0.007386 |
338 rows × 2 columns
train_X, test_X, train_y, test_y = train_test_split(design_matrix, targets, test_size=0.10, random_state=42)
## Step 10: Data Normalization

Normalization is an essential preprocessing step in machine learning. We'll normalize the training data by subtracting the mean and dividing by the standard deviation, then apply the same transformation to the test data.
# Set of commands to obtain the mean and standard deviation of train_X
train_mean = train_X.mean()
train_std = train_X.std()
# Set of commands to obtain the new normalized train and test data sets
norm_train_X = (train_X - train_mean) / (train_std)
norm_test_X = (test_X - train_mean) / (train_std)
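scikit-learn's `StandardScaler` does the same job; note that it uses the population standard deviation, so its output differs very slightly from the pandas version above (a sketch):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same transform to the test data.
scaler = StandardScaler().fit(train_X)
norm_train_X = pd.DataFrame(scaler.transform(train_X), columns=train_X.columns, index=train_X.index)
norm_test_X = pd.DataFrame(scaler.transform(test_X), columns=test_X.columns, index=test_X.index)
```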
## Step 11: Cross-Validated Linear Regression

We'll perform linear regression using scikit-learn and apply cross-validation to evaluate the model's performance.
# Run 5-fold cross-validation on the training set, pick the best-scoring fold's model, and apply it to the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
mlr = linear_model.LinearRegression()
# (The normalized features from Step 10 could be used instead of train_X; for ordinary least squares the predictions are essentially unchanged.)
cross_val = cross_validate(mlr, train_X, train_y["formation_energy_per_atom"], cv=kfold, return_estimator=True)
# cross_validate returns the fitted estimator from each fold; keep the one with the best validation score
cv_mlr = cross_val["estimator"][np.argmax(cross_val["test_score"])]
# Apply the selected model to the held-out test set (refitting on the test data would leak information)
yhat = cv_mlr.predict(test_X)
# Set of commands to obtain the RMSE and MAE
MAE = mean_absolute_error(test_y["formation_energy_per_atom"], yhat)
RMSE = mean_squared_error(test_y["formation_energy_per_atom"], yhat, squared=False)
print("The mean absolute error for this cross-validated linear model is " + str(MAE))
print("The root mean squared error for this cross-validated linear model is " + str(RMSE))
The mean absolute error for this cross-validated linear model is 0.028795650045063535
The root mean squared error for this cross-validated linear model is 0.03491365888584181
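`cross_val_predict`, imported in Step 1 but not used above, provides another convenient check: it returns out-of-fold predictions for the training set, so every sample is predicted by a model that never saw it during fitting (a sketch):

```python
# Out-of-fold predictions on the training set and their mean absolute error.
oof_pred = cross_val_predict(mlr, train_X, train_y["formation_energy_per_atom"], cv=kfold)
print("Out-of-fold MAE:", mean_absolute_error(train_y["formation_energy_per_atom"], oof_pred))
```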
## Step 12: Data Visualization - Pairplot

Finally, we'll use seaborn to create a pairplot to visualize the relationships between different element properties.
features = [c for c in element_data.columns]
grid = sns.pairplot(element_data[features])
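A pairplot over all 11 properties produces a very large grid; restricting it to a handful of columns keeps the figure readable (a sketch with an arbitrary subset):

```python
# Pairwise scatter plots for a small, arbitrary subset of the element properties.
subset = ['AtomicRadius', 'Electronegativity', 'FirstIonizationEnergy', 'Density']
sns.pairplot(element_data[subset])
plt.show()
```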
train_y
| formula | band_gap | formation_energy_per_atom |
|---|---|---|
PtI2 | 1.2280 | -0.165995 |
TiBr2 | 0.0000 | -1.044361 |
WI2 | 1.9586 | -0.208133 |
Mg3Br | 0.0000 | -0.118789 |
NpBr3 | 0.0000 | -1.358265 |
... | ... | ... |
CdI2 | 2.4223 | -0.592137 |
HfBr4 | 4.2902 | -1.394327 |
SmBr | 0.0000 | -1.149715 |
PdI2 | 0.9081 | -0.251194 |
ThI3 | 0.0000 | -1.212139 |
304 rows × 2 columns
## Conclusion

Congratulations! You have completed the lesson on exploring Materials Project data and predicting material properties. We covered data access, cleaning, manipulation, visualization, and machine learning using Python and various libraries.
Keep exploring and experimenting with different data analysis techniques and machine learning models to gain deeper insights into materials science and make accurate predictions for material properties.