Machine_learning_ipynb_2_

#Lesson: Exploring Materials Project Data and Predicting Material Properties ##Step 1: Importing Libraries In this lesson, we’ll explore materials project data and perform various data analysis tasks using Python. First, we need to import the necessary libraries to work with data, visualize it, and perform machine learning.

from mp_api.client import MPRester
import pandas as pd
from pymatgen.core import Composition
import datetime
from math import inf
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict, cross_validate, KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
import seaborn as sns

#from sklearn.model_selection import train_test_split
#from sklearn.model_selection import cross_val_predict, cross_validate, KFold
#from sklearn.metrics import r2_score
#from sklearn.cluster import KMeans
#from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
#from sklearn.linear_model import LogisticRegression
#from sklearn.cluster import DBSCAN

##Step 2: Accessing Materials Project Data The Materials Project is a database containing materials properties data. We’ll use the MPRester library to access materials project data related to ABO3 compounds. We’ll fetch properties like material_id, formula_pretty, nsites, band_gap, formation_energy_per_atom, theoretical, and energy_above_hull for all ABO3 compounds.

mpr = MPRester(api_key="PT7C2S6MEfO67wSwPtB3KlVvUu7ZI166")

ABO3 = mpr.summary.search(formula="**O3", fields=["material_id", "formula_pretty", "nsites", "band_gap", "formation_energy_per_atom", "theoretical", "energy_above_hull"])

Retrieving SummaryDoc documents:   0%|          | 0/2544 [00:00<?, ?it/s]

ABO3_dicts = [i.dict() for i in ABO3]

mp_database = pd.DataFrame(data=ABO3_dicts)

Step 3: Data Manipulation and Cleaning

Now that we have the materials project data in a DataFrame named ‘mp_database,’ let’s clean and manipulate the data to make it more suitable for analysis.

mp_database = mp_database.drop("fields_not_requested", axis=1)

len(mp_database)

grouped = mp_database.groupby('formula_pretty').size()

grouped.mean()

1.912781954887218

mp_database.nunique()

nsites                         18
formula_pretty               1330
material_id                  2544
formation_energy_per_atom    2544
energy_above_hull            2199
band_gap                     1297
theoretical                     2
dtype: int64

mp_database

	nsites	formula_pretty	material_id	formation_energy_per_atom	energy_above_hull	band_gap	theoretical
0	20	NaNbO3	mp-4681	-2.833432	0.012851	2.3684	False
1	10	MnZnO3	mp-754318	-1.814413	0.005794	1.3225	True
2	20	MnAlO3	mp-1368992	-2.573144	0.161167	0.5506	True
3	5	BaBiO3	mp-545783	-2.222888	0.021211	0.0000	False
4	20	MnTlO3	mp-770870	-1.505971	0.053737	0.0000	True
...	...	...	...	...	...	...	...
2539	5	BaCdO3	mp-1183292	-1.645883	0.202881	0.0000	True
2540	10	LaMnO3	mp-1205375	-2.968812	0.161878	0.0000	False
2541	5	TlSiO3	mp-1187529	-1.862867	0.534038	0.0000	True
2542	5	GdBeO3	mp-1184468	-2.874174	0.315310	0.0000	True
2543	5	InSiO3	mp-1184751	-2.064795	0.561967	0.0000	True

2544 rows × 7 columns

mp_database['theoretical'].value_counts()[False]

mp_database['formation_energy_per_atom_ev_mol'] = mp_database['formation_energy_per_atom'] * 96491.5666

mp_database

	nsites	formula_pretty	material_id	formation_energy_per_atom	energy_above_hull	band_gap	theoretical	formation_energy_per_atom_ev_mol
0	20	NaNbO3	mp-4681	-2.833432	0.012851	2.3684	False	-273402.322598
1	10	MnZnO3	mp-754318	-1.814413	0.005794	1.3225	True	-175075.572015
2	20	MnAlO3	mp-1368992	-2.573144	0.161167	0.5506	True	-248286.682798
3	5	BaBiO3	mp-545783	-2.222888	0.021211	0.0000	False	-214489.986143
4	20	MnTlO3	mp-770870	-1.505971	0.053737	0.0000	True	-145313.457382
...	...	...	...	...	...	...	...	...
2539	5	BaCdO3	mp-1183292	-1.645883	0.202881	0.0000	True	-158813.786775
2540	10	LaMnO3	mp-1205375	-2.968812	0.161878	0.0000	False	-286465.330840
2541	5	TlSiO3	mp-1187529	-1.862867	0.534038	0.0000	True	-179750.985874
2542	5	GdBeO3	mp-1184468	-2.874174	0.315310	0.0000	True	-277333.565860
2543	5	InSiO3	mp-1184751	-2.064795	0.561967	0.0000	True	-199235.298396

2544 rows × 8 columns

##Step 4: Classifying Materials Let’s classify the materials based on their stability and band gap properties. We’ll define two functions, ‘stability_category’ and ‘bandgap_category,’ to perform the classification.

# Assuming you already have a pandas DataFrame named 'mp_database' with columns 'energy_above_hull' and 'band_gap'

# Function to classify materials based on energy_above_hull
def stability_category(energy):
    if energy > 0.03:
        return 'unstable'
    else:
        return 'potentially_stable'

# Function to classify materials based on band_gap
def bandgap_category(band_gap):
    if band_gap == 0:
        return 'metal'
    elif 0 < band_gap <= 1:
        return 'small_bandgap'
    else:
        return 'large_bandgap'

# Apply the classification functions to the DataFrame
mp_database['stability'] = mp_database['energy_above_hull'].apply(stability_category)
mp_database['bandgap_type'] = mp_database['band_gap'].apply(bandgap_category)

# Create a pivot table to get the desired table structure
new_table = pd.pivot_table(mp_database, index='stability', columns='bandgap_type', aggfunc='size', fill_value=0)

# Reorder the columns to match the desired order
new_table = new_table[['metal', 'small_bandgap', 'large_bandgap']]

print(new_table)

bandgap_type        metal  small_bandgap  large_bandgap
stability                                              
potentially_stable    197            103            525
unstable             1042            212            465

mp_database

	nsites	formula_pretty	material_id	formation_energy_per_atom	energy_above_hull	band_gap	theoretical	formation_energy_per_atom_ev_mol	stability	bandgap_type
0	20	NaNbO3	mp-4681	-2.833432	0.012851	2.3684	False	-273402.322598	potentially_stable	large_bandgap
1	10	MnZnO3	mp-754318	-1.814413	0.005794	1.3225	True	-175075.572015	potentially_stable	large_bandgap
2	20	MnAlO3	mp-1368992	-2.573144	0.161167	0.5506	True	-248286.682798	unstable	small_bandgap
3	5	BaBiO3	mp-545783	-2.222888	0.021211	0.0000	False	-214489.986143	potentially_stable	metal
4	20	MnTlO3	mp-770870	-1.505971	0.053737	0.0000	True	-145313.457382	unstable	metal
...	...	...	...	...	...	...	...	...	...	...
2539	5	BaCdO3	mp-1183292	-1.645883	0.202881	0.0000	True	-158813.786775	unstable	metal
2540	10	LaMnO3	mp-1205375	-2.968812	0.161878	0.0000	False	-286465.330840	unstable	metal
2541	5	TlSiO3	mp-1187529	-1.862867	0.534038	0.0000	True	-179750.985874	unstable	metal
2542	5	GdBeO3	mp-1184468	-2.874174	0.315310	0.0000	True	-277333.565860	unstable	metal
2543	5	InSiO3	mp-1184751	-2.064795	0.561967	0.0000	True	-199235.298396	unstable	metal

2544 rows × 10 columns

##Step 5: Data Visualization - Formation Energy Distribution Visualizing data is crucial for understanding its distribution and characteristics. Let’s create a histogram to visualize the distribution of formation energies per atom.

import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have a pandas DataFrame named 'mp_database' with the 'formation_energies_per_atom' column

# Calculate the average and standard deviation
average_energy = mp_database['formation_energy_per_atom'].mean()
std_deviation = mp_database['formation_energy_per_atom'].std()

# Plot the distribution of the column
plt.figure(figsize=(10, 6))
plt.hist(mp_database['formation_energy_per_atom'], bins=30, edgecolor='black', alpha=0.7)

# Add labels and title to the plot
plt.xlabel('Formation Energies per Atom')
plt.ylabel('Frequency')
plt.title('Distribution of Formation Energies per Atom')

# Annotate the plot with average and standard deviation
plt.axvline(average_energy, color='red', linestyle='dashed', linewidth=2, label='Average')
plt.axvline(average_energy + std_deviation, color='orange', linestyle='dashed', linewidth=2, label='Standard Deviation')
plt.axvline(average_energy - std_deviation, color='orange', linestyle='dashed', linewidth=2)
plt.annotate(f'Average: {average_energy:.2f}', xy=(average_energy, 300), xytext=(average_energy + 0.5, 400),
             arrowprops=dict(facecolor='black', shrink=0.05), fontsize=12)
plt.annotate(f'Standard Deviation: {std_deviation:.2f}', xy=(average_energy + std_deviation, 250),
             xytext=(average_energy + std_deviation + 1, 350), arrowprops=dict(facecolor='black', shrink=0.05),
             fontsize=12)

# Show legend
plt.legend()

# Display the plot
plt.show()

png

##Step 6: Filtering and Preparing Data for Analysis Next, we’ll filter the data to keep only the most relevant materials. We’ll sort the data based on formation energy and remove duplicate formulas to keep only the materials with the lowest formation energy for each formula.

orig_data = pd.read_csv("data2022.csv", na_filter=False)

len(orig_data)

orig_data

	Unnamed: 0	task_id	formula	formation_energy_per_atom	e_above_hull	band_gap	has_bandstructure
0	297	mp-570140	AuBr	-0.125181	0.011579	1.9716	True
1	573	mp-1206190	ZnI6	0.387945	0.652470	0.1076	False
2	579	mp-1208424	TeBr4	-0.390060	0.000000	2.5202	False
3	738	mp-570480	TcBr4	-0.404404	0.000000	0.6025	True
4	880	mp-1064459	PBr	1.067789	1.327882	0.0000	False
...	...	...	...	...	...	...	...
676	118973	mp-974371	Rh3Br	0.670866	0.820678	0.0000	True
677	121600	mp-867231	TbI3	-1.382737	0.000000	2.0937	True
678	121602	mp-867247	TbBr3	-1.860767	0.000000	2.8570	True
679	121619	mp-867355	TeBr2	-0.186842	0.140200	0.6529	True
680	121834	mp-867893	SmI3	-1.399012	0.000000	2.0145	True

681 rows × 7 columns

import pandas as pd

# Assuming you have a pandas DataFrame named 'orig_data' with 'formation_energy_per_atom' and 'formula' columns

# Sort the DataFrame by 'formation_energy_per_atom' in ascending order
orig_data.sort_values(by='formation_energy_per_atom', ascending=True, inplace=True)

# Drop duplicates based on the 'formula' column and keep the first occurrence (smallest formation_energy_per_atom)
filtered_data = orig_data.drop_duplicates(subset='formula', keep='first')

# Display the resulting DataFrame
filtered_data

	Unnamed: 0	task_id	formula	formation_energy_per_atom	e_above_hull	band_gap	has_bandstructure
191	29313	mp-27456	BaBr2	-2.227832	0.000000	4.3731	True
475	80856	mp-567744	SrBr2	-2.156160	0.000000	4.4697	True
219	35022	mp-1071541	YbBr2	-2.109686	0.000000	4.3169	False
414	70294	mp-27972	AcBr3	-2.098107	0.000000	4.1045	True
304	51723	mp-22888	CaBr2	-2.052632	0.000000	4.5304	True
...	...	...	...	...	...	...	...
268	44109	mp-1184756	IrI	0.949140	1.047085	0.0000	False
601	105675	mp-1197830	CI	0.950707	0.950707	3.4057	False
541	93486	mp-975905	Mo3I	0.977325	1.122018	0.0000	True
517	88638	mp-1209830	PBr2	1.261139	1.607930	1.7322	False
587	102482	mp-974474	ReBr	1.306459	1.507749	0.0000	True

392 rows × 7 columns

filtered_data = filtered_data[filtered_data["formation_energy_per_atom"] < 0]

filtered_data

	Unnamed: 0	task_id	formula	formation_energy_per_atom	e_above_hull	band_gap	has_bandstructure
191	29313	mp-27456	BaBr2	-2.227832	0.000000	4.3731	True
475	80856	mp-567744	SrBr2	-2.156160	0.000000	4.4697	True
219	35022	mp-1071541	YbBr2	-2.109686	0.000000	4.3169	False
414	70294	mp-27972	AcBr3	-2.098107	0.000000	4.1045	True
304	51723	mp-22888	CaBr2	-2.052632	0.000000	4.5304	True
...	...	...	...	...	...	...	...
222	35973	mp-862773	TeI2	-0.036698	0.070018	0.7144	False
481	82174	mp-1185549	Mg149Br	-0.022094	0.000000	0.0021	False
600	105624	mp-1185590	Mg149I	-0.017586	0.000000	0.0025	False
512	87707	mp-1214775	AuBr2	-0.007624	0.143039	0.0000	False
544	94095	mp-1179985	PBr	-0.007386	0.252708	1.3954	False

338 rows × 7 columns

Step 7: Calculating Composition-Averaged Properties

We will now calculate the composition-averaged atomic radius for each material in ‘filtered_data’ using the element properties data available in ‘element_data’ DataFrame.

element_data = pd.read_csv("element_properties.csv", index_col=0)

element_data

	AtomicRadius	AtomicVolume	AtomicWeight	BoilingT	CovalentRadius	Density	ElectronAffinity	Electronegativity	FirstIonizationEnergy	HeatCapacityMass	SecondIonizationEnergy
H	0.25	18618.051940	1.007940	20.13	31	0.0899	72.8	2.20	13.598443	14.304	NaN
He	NaN	37236.035560	4.002602	4.07	28	0.1785	0.0	NaN	24.587387	5.130	54.41776
Li	1.45	21.544058	6.941000	1615.00	128	535.0000	59.6	0.98	5.391719	3.582	75.64000
Be	1.05	8.098176	9.012182	2743.00	96	1848.0000	0.0	1.57	9.322700	1.825	18.21114
B	0.85	7.297767	10.811000	4273.00	84	2460.0000	26.7	2.04	8.298020	1.026	25.15480
...	...	...	...	...	...	...	...	...	...	...	...
Pa	1.80	24.961161	231.035860	4273.00	200	15370.0000	NaN	1.50	5.890000	NaN	NaN
U	1.75	20.748847	238.028910	4200.00	196	19050.0000	NaN	1.38	6.194100	0.116	10.60000
Np	1.75	19.244839	237.000000	4273.00	190	20450.0000	NaN	1.36	6.265700	NaN	NaN
Pu	1.75	20.447164	244.000000	3503.00	187	19816.0000	NaN	1.28	6.026000	NaN	11.20000
Am	1.75	29.518685	243.000000	2284.00	180	13670.0000	NaN	1.30	5.973800	NaN	NaN

95 rows × 11 columns

type(element_data["AtomicRadius"][1])

numpy.float64

comp = Composition('Al2O3')

print(comp.elements)

[Element Al, Element O]

print(comp.to_data_dict['unit_cell_composition'])

defaultdict(<class 'float'>, {'Al': 2.0, 'O': 3.0})

import numpy as np

count = 0
column_nans = {}
for column in element_data.columns:
    for element in element_data[column]:
        if np.isnan(element):
            count += 1
    column_nans[column] = count
    count = 0

column_nans

{'AtomicRadius': 7,
 'AtomicVolume': 2,
 'AtomicWeight': 0,
 'BoilingT': 2,
 'CovalentRadius': 0,
 'Density': 2,
 'ElectronAffinity': 9,
 'Electronegativity': 4,
 'FirstIonizationEnergy': 1,
 'HeatCapacityMass': 10,
 'SecondIonizationEnergy': 12}

# Set of commands to compute the mean value for each column, replace the NaNs by the mean and output the updated data set
for i in element_data.columns:
    col_mean = element_data[i].mean(skipna=True)
    element_data[i] = element_data[i].fillna(col_mean)
element_data

	AtomicRadius	AtomicVolume	AtomicWeight	BoilingT	CovalentRadius	Density	ElectronAffinity	Electronegativity	FirstIonizationEnergy	HeatCapacityMass	SecondIonizationEnergy
H	0.250000	18618.051940	1.007940	20.13	31	0.0899	72.800000	2.200000	13.598443	14.304000	18.947504
He	1.500682	37236.035560	4.002602	4.07	28	0.1785	0.000000	1.747033	24.587387	5.130000	54.417760
Li	1.450000	21.544058	6.941000	1615.00	128	535.0000	59.600000	0.980000	5.391719	3.582000	75.640000
Be	1.050000	8.098176	9.012182	2743.00	96	1848.0000	0.000000	1.570000	9.322700	1.825000	18.211140
B	0.850000	7.297767	10.811000	4273.00	84	2460.0000	26.700000	2.040000	8.298020	1.026000	25.154800
...	...	...	...	...	...	...	...	...	...	...	...
Pa	1.800000	24.961161	231.035860	4273.00	200	15370.0000	76.162209	1.500000	5.890000	0.635447	18.947504
U	1.750000	20.748847	238.028910	4200.00	196	19050.0000	76.162209	1.380000	6.194100	0.116000	10.600000
Np	1.750000	19.244839	237.000000	4273.00	190	20450.0000	76.162209	1.360000	6.265700	0.635447	18.947504
Pu	1.750000	20.447164	244.000000	3503.00	187	19816.0000	76.162209	1.280000	6.026000	0.635447	11.200000
Am	1.750000	29.518685	243.000000	2284.00	180	13670.0000	76.162209	1.300000	5.973800	0.635447	18.947504

95 rows × 11 columns

# Set of commands to compute and output the composition-averaged Atomic Radius for all materials
begin_time = datetime.datetime.now()
atomic_radius = pd.DataFrame()
filtered_data["composition_dict"] = [Composition(i).as_dict() for i in filtered_data.formula]
atomic_radius["formula"] = filtered_data.formula
atomic_radius["atomic_radius_avg"] = [np.average([element_data.loc[[el]]["AtomicRadius"][0] for el in comp], weights=[comp[el] for el in comp]) for comp in filtered_data.composition_dict]
print(datetime.datetime.now() - begin_time)
atomic_radius

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

0:00:00.211715

	formula	atomic_radius_avg
191	BaBr2	1.483333
475	SrBr2	1.433333
219	YbBr2	1.350000
414	AcBr3	1.350000
304	CaBr2	1.366667
...	...	...
222	TeI2	1.400000
481	Mg149Br	1.497667
600	Mg149I	1.499333
512	AuBr2	1.216667
544	PBr	1.075000

338 rows × 2 columns

atomic_radius_2 = pd.DataFrame()
atomic_radius_2["formula"] = filtered_data.formula
atomic_radius_2["composition_dict"] = filtered_data.composition_dict
composition_averaged_list = []
for composition in atomic_radius_2["composition_dict"]:
    total_atoms_in_formula = 0
    element_radius_weighted = 0
    for element in composition:
        total_atoms_in_formula += composition[element]
        element_radius_weighted += element_data.loc[element]["AtomicRadius"] * composition[element]
    average = element_radius_weighted / total_atoms_in_formula
    composition_averaged_list.append(average)
atomic_radius_2["atomic_radius_avg"] = composition_averaged_list

atomic_radius_2

	formula	composition_dict	atomic_radius_avg
191	BaBr2	{'Ba': 1.0, 'Br': 2.0}	1.483333
475	SrBr2	{'Sr': 1.0, 'Br': 2.0}	1.433333
219	YbBr2	{'Yb': 1.0, 'Br': 2.0}	1.350000
414	AcBr3	{'Ac': 1.0, 'Br': 3.0}	1.350000
304	CaBr2	{'Ca': 1.0, 'Br': 2.0}	1.366667
...	...	...	...
222	TeI2	{'Te': 1.0, 'I': 2.0}	1.400000
481	Mg149Br	{'Mg': 149.0, 'Br': 1.0}	1.497667
600	Mg149I	{'Mg': 149.0, 'I': 1.0}	1.499333
512	AuBr2	{'Au': 1.0, 'Br': 2.0}	1.216667
544	PBr	{'P': 1.0, 'Br': 1.0}	1.075000

338 rows × 3 columns

average_properties = pd.DataFrame()
average_properties["formula"] = filtered_data.formula
average_properties["composition_dict"] = filtered_data.composition_dict
for prop in element_data.columns:
    composition_averaged_list = []
    for composition in average_properties["composition_dict"]:
        total_atoms_in_formula = 0
        element_property_weighted = 0
        for element in composition:
            total_atoms_in_formula += composition[element]
            element_property_weighted += element_data.loc[element][prop] * composition[element]
        average = element_property_weighted / total_atoms_in_formula
        composition_averaged_list.append(average)
    average_properties[prop] = composition_averaged_list

average_properties

	formula	composition_dict	AtomicRadius	AtomicVolume	AtomicWeight	BoilingT	CovalentRadius	Density	ElectronAffinity	Electronegativity	FirstIonizationEnergy	HeatCapacityMass	SecondIonizationEnergy
191	BaBr2	{'Ba': 1.0, 'Br': 2.0}	1.483333	50.008311	99.045000	935.666667	151.666667	3250.000000	221.050000	2.270000	9.613088	0.384000	17.728610
475	SrBr2	{'Sr': 1.0, 'Br': 2.0}	1.433333	46.792927	82.476000	773.000000	145.000000	2956.666667	218.076667	2.290000	9.774150	0.418000	18.070700
219	YbBr2	{'Yb': 1.0, 'Br': 2.0}	1.350000	42.931774	110.954000	711.000000	142.333333	4270.000000	233.066667	2.393333	9.960587	0.367667	18.452667
414	AcBr3	{'Ac': 1.0, 'Br': 3.0}	1.350000	41.254141	116.678000	1117.250000	143.750000	4857.500000	262.490552	2.495000	10.152850	0.385500	19.130750
304	CaBr2	{'Ca': 1.0, 'Br': 2.0}	1.366667	42.664279	66.628667	807.000000	138.666667	2596.666667	217.190000	2.306667	9.913587	0.531667	18.351240
...	...	...	...	...	...	...	...	...	...	...	...	...	...
222	TeI2	{'Te': 1.0, 'I': 2.0}	1.400000	39.758135	127.136313	725.200000	138.666667	5373.333333	260.200000	2.473333	9.970707	0.210000	18.954200
481	Mg149Br	{'Mg': 149.0, 'Br': 1.0}	1.497667	23.350997	24.675660	1356.126667	140.860000	1747.213333	2.164000	1.321000	7.674019	1.019340	15.078975
600	Mg149I	{'Mg': 149.0, 'I': 1.0}	1.499333	23.351870	24.988996	1356.962000	140.986667	1759.346667	1.968000	1.319000	7.664935	1.017607	15.062577
512	AuBr2	{'Au': 1.0, 'Br': 2.0}	1.216667	34.000905	118.924856	1264.333333	125.333333	8513.333333	290.666667	2.820000	10.951043	0.359000	21.127333
544	PBr	{'P': 1.0, 'Br': 1.0}	1.075000	35.370974	55.438881	442.750000	113.500000	2471.500000	198.300000	2.575000	11.150245	0.621500	20.680250

338 rows × 13 columns

max_properties = pd.DataFrame()
max_properties["formula"] = filtered_data.formula
max_properties["composition_dict"] = filtered_data.composition_dict
for prop in element_data.columns:
    composition_max_list = []
    max_property = 0
    for composition in max_properties["composition_dict"]:
        for element in composition:
            if element_data.loc[element][prop] > max_property:
                max_property = element_data.loc[element][prop]
        composition_max_list.append(max_property)
    max_properties[prop] = composition_max_list

max_properties

	formula	composition_dict	AtomicRadius	AtomicVolume	AtomicWeight	BoilingT	CovalentRadius	Density	ElectronAffinity	Electronegativity	FirstIonizationEnergy	HeatCapacityMass	SecondIonizationEnergy
191	BaBr2	{'Ba': 1.0, 'Br': 2.0}	2.15	64.969282	137.327	2143.0	215.0	3510.0	324.6	2.96	11.813800	0.474	21.591
475	SrBr2	{'Sr': 1.0, 'Br': 2.0}	2.15	64.969282	137.327	2143.0	215.0	3510.0	324.6	2.96	11.813800	0.474	21.591
219	YbBr2	{'Yb': 1.0, 'Br': 2.0}	2.15	64.969282	173.054	2143.0	215.0	6570.0	324.6	2.96	11.813800	0.474	21.591
414	AcBr3	{'Ac': 1.0, 'Br': 3.0}	2.15	64.969282	227.000	3473.0	215.0	10070.0	324.6	2.96	11.813800	0.474	21.591
304	CaBr2	{'Ca': 1.0, 'Br': 2.0}	2.15	64.969282	227.000	3473.0	215.0	10070.0	324.6	2.96	11.813800	0.647	21.591
...	...	...	...	...	...	...	...	...	...	...	...	...	...
222	TeI2	{'Te': 1.0, 'I': 2.0}	2.60	18618.051940	244.000	5869.0	244.0	22590.0	324.6	2.96	13.598443	14.304	75.640
481	Mg149Br	{'Mg': 149.0, 'Br': 1.0}	2.60	18618.051940	244.000	5869.0	244.0	22590.0	324.6	2.96	13.598443	14.304	75.640
600	Mg149I	{'Mg': 149.0, 'I': 1.0}	2.60	18618.051940	244.000	5869.0	244.0	22590.0	324.6	2.96	13.598443	14.304	75.640
512	AuBr2	{'Au': 1.0, 'Br': 2.0}	2.60	18618.051940	244.000	5869.0	244.0	22590.0	324.6	2.96	13.598443	14.304	75.640
544	PBr	{'P': 1.0, 'Br': 1.0}	2.60	18618.051940	244.000	5869.0	244.0	22590.0	324.6	2.96	13.598443	14.304	75.640

338 rows × 13 columns

min_properties = pd.DataFrame()
min_properties["formula"] = filtered_data.formula
min_properties["composition_dict"] = filtered_data.composition_dict
for prop in element_data.columns:
    composition_min_list = []
    min_property = inf
    for composition in min_properties["composition_dict"]:
        for element in composition:
            if element_data.loc[element][prop] < min_property:
                min_property = element_data.loc[element][prop]
        composition_min_list.append(min_property)
    min_properties[prop] = composition_min_list

min_properties

	formula	composition_dict	AtomicRadius	AtomicVolume	AtomicWeight	BoilingT	CovalentRadius	Density	ElectronAffinity	Electronegativity	FirstIonizationEnergy	HeatCapacityMass	SecondIonizationEnergy
191	BaBr2	{'Ba': 1.0, 'Br': 2.0}	1.15	42.527825	79.90400	332.00	120.0	3120.0000	13.95	0.89	5.211664	0.204	10.00383
475	SrBr2	{'Sr': 1.0, 'Br': 2.0}	1.15	42.527825	79.90400	332.00	120.0	2630.0000	5.03	0.89	5.211664	0.204	10.00383
219	YbBr2	{'Yb': 1.0, 'Br': 2.0}	1.15	42.527825	79.90400	332.00	120.0	2630.0000	5.03	0.89	5.211664	0.155	10.00383
414	AcBr3	{'Ac': 1.0, 'Br': 3.0}	1.15	37.433086	79.90400	332.00	120.0	2630.0000	5.03	0.89	5.170000	0.120	10.00383
304	CaBr2	{'Ca': 1.0, 'Br': 2.0}	1.15	37.433086	40.07800	332.00	120.0	1550.0000	2.37	0.89	5.170000	0.120	10.00383
...	...	...	...	...	...	...	...	...	...	...	...	...	...
222	TeI2	{'Te': 1.0, 'I': 2.0}	0.25	7.297767	1.00794	20.13	31.0	0.0899	0.00	0.79	3.893905	0.116	10.00383
481	Mg149Br	{'Mg': 149.0, 'Br': 1.0}	0.25	7.297767	1.00794	20.13	31.0	0.0899	0.00	0.79	3.893905	0.116	10.00383
600	Mg149I	{'Mg': 149.0, 'I': 1.0}	0.25	7.297767	1.00794	20.13	31.0	0.0899	0.00	0.79	3.893905	0.116	10.00383
512	AuBr2	{'Au': 1.0, 'Br': 2.0}	0.25	7.297767	1.00794	20.13	31.0	0.0899	0.00	0.79	3.893905	0.116	10.00383
544	PBr	{'P': 1.0, 'Br': 1.0}	0.25	7.297767	1.00794	20.13	31.0	0.0899	0.00	0.79	3.893905	0.116	10.00383

338 rows × 13 columns

average_properties = average_properties.drop("composition_dict", axis=1)
average_properties = average_properties.drop("formula", axis=1)
max_properties = max_properties.drop("composition_dict", axis=1)
max_properties = max_properties.drop("formula", axis=1)
min_properties = min_properties.drop("composition_dict", axis=1)
min_properties = min_properties.drop("formula", axis=1)

Step 8: Combining Property Data

Next, we’ll combine the composition-averaged, maximum, and minimum property DataFrames (‘average_properties’, ‘max_properties’, and ‘min_properties’) into a single design matrix DataFrame named ‘design_matrix’.

design_matrix = pd.concat([average_properties, max_properties, min_properties], axis=1)

Step 9: Splitting Data for Machine Learning

To train and evaluate our machine learning model, we need to split the data into training and testing sets.

targets = filtered_data[["band_gap", "formation_energy_per_atom"]]
targets.index = filtered_data.formula

targets

	band_gap	formation_energy_per_atom
formula
BaBr2	4.3731	-2.227832
SrBr2	4.4697	-2.156160
YbBr2	4.3169	-2.109686
AcBr3	4.1045	-2.098107
CaBr2	4.5304	-2.052632
...	...	...
TeI2	0.7144	-0.036698
Mg149Br	0.0021	-0.022094
Mg149I	0.0025	-0.017586
AuBr2	0.0000	-0.007624
PBr	1.3954	-0.007386

338 rows × 2 columns

train_X, test_X, train_y, test_y = train_test_split(design_matrix, targets, test_size=0.10, random_state=42)

Step 10: Data Normalization

Normalization is an essential preprocessing step in machine learning. We’ll normalize the training data by subtracting the mean and dividing by the standard deviation.

# Set of commands to obtain the mean and standard deviation of train_X
train_mean = train_X.mean()
train_std = train_X.std()

# Set of commands to obtain the new normalized train and test data sets
norm_train_X = (train_X - train_mean) / (train_std)
norm_test_X = (test_X - train_mean) / (train_std)

Step 11: Cross-Validated Linear Regression

We’ll perform linear regression using scikit-learn and apply cross-validation to evaluate the model’s performance.

# Set of commands to run the 5-fold cross-validation on the training set, pick the best-scoring model, and apply it to the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
mlr = linear_model.LinearRegression()
mlr.fit(train_X, train_y["formation_energy_per_atom"])
cross_val = cross_validate(mlr, train_X, train_y["formation_energy_per_atom"], cv=kfold, return_estimator=True)
cv_mlr = cross_val["estimator"][np.argmax(cross_val["test_score"])]
cv_mlr.fit(test_X, test_y["formation_energy_per_atom"])
yhat = cv_mlr.predict(test_X)

# Set of commands to obtain the RMSE and MAE
MAE = mean_absolute_error(test_y["formation_energy_per_atom"], yhat)
RMSE = mean_squared_error(test_y["formation_energy_per_atom"], yhat, squared=False)
print("The mean absolute error for this cross-validated linear model is " + str(MAE))
print("The root mean squared error for this cross-validated linear model is " + str(RMSE))

The mean absolute error for this cross-validated linear model is 0.028795650045063535
The root mean squared error for this cross-validated linear model is 0.03491365888584181

Step 12: Data Visualization - Pairplot

Finally, we’ll use seaborn to create a pairplot to visualize the relationships between different element properties.

features = [c for c in element_data.columns]
grid = sns.pairplot(element_data[features])

png

train_y

	band_gap	formation_energy_per_atom
formula
PtI2	1.2280	-0.165995
TiBr2	0.0000	-1.044361
WI2	1.9586	-0.208133
Mg3Br	0.0000	-0.118789
NpBr3	0.0000	-1.358265
...	...	...
CdI2	2.4223	-0.592137
HfBr4	4.2902	-1.394327
SmBr	0.0000	-1.149715
PdI2	0.9081	-0.251194
ThI3	0.0000	-1.212139

304 rows × 2 columns

Conclusion

Congratulations! You have completed the lesson on exploring materials project data and predicting material properties. We covered data access, cleaning, manipulation, visualization, and machine learning using Python and various libraries.

Keep exploring and experimenting with different data analysis techniques and machine learning models to gain deeper insights into materials science and make accurate predictions for material properties.