Today, we’re thrilled to announce the launch of LeMaterial, an open-source collaborative project led by Entalpic and Hugging Face. LeMaterial aims to simplify and accelerate materials research, making it easier to train ML models, discover novel materials and explore chemical spaces. ⚛️🤗
As a first step, we’re releasing a dataset called LeMat-Bulk, which unifies, cleans and standardizes the most prominent materials datasets, including Materials Project, Alexandria and OQMD, giving rise to a single harmonized data format with 6.7M entries and 7 materials properties.
LeMaterial is standing on the shoulders of giants, and we’re building upon incredible projects that have been instrumental in the development of this initiative: Optimade, Materials Project, Alexandria and OQMD, with more to come. Please credit them accordingly when using LeMaterial.
Why LeMaterial?
The world of materials science, at the intersection of quantum chemistry and machine learning, is brimming with opportunity: from brighter LEDs to electrochemical batteries, more efficient photovoltaic cells and recyclable plastics, the applications are limitless. By leveraging machine learning (ML) on large, structured datasets, researchers can perform high-throughput screening and testing of new materials at unprecedented scales, significantly accelerating the discovery cycle of novel compounds with desired properties. In this paradigm, data becomes the essential fuel powering ML models that can guide experiments, reduce costs, and unlock breakthroughs faster than ever before.
This field is fueled by comprehensive datasets such as Materials Project, Alexandria and OQMD, all open-source and under a CC-BY-4.0 license. However, these datasets vary in format, parameters, and scope, presenting the following challenges:
- Dataset integration issues (e.g. inconsistent formats or field definitions, incompatible calculations)
- Biases in dataset composition (e.g. Materials Project’s focus on oxides and battery materials)
- Limited scope (e.g. NOMAD’s focus on quantum chemistry calculations rather than material properties)
- Lack of clear connections or identifiers between similar materials across different databases
This fragmented landscape makes it difficult for researchers in AI4Science and materials informatics to leverage existing data effectively. Whether the application involves training foundational ML models, constructing accurate phase diagrams, identifying novel materials or exploring the chemical space effectively, there is no simple solution. While efforts like Optimade standardize structural data, they do not address discrepancies in material properties or biases in dataset scopes.
LeMaterial addresses these challenges by unifying and standardizing data from three major databases (Materials Project, Alexandria and OQMD) into a high-quality resource with consistent and systematic properties. The elemental composition treemap below highlights the value of this integration, showcasing how we increase the scope of existing datasets, like Materials Project, which are focused on specific material types, such as battery materials (Li, O, P) or oxides.
Materials Project and LeMat-BulkUnique treemap
Achieving a clean, unified and standardized dataset
LeMat-Bulk is more than a large-scale merged dataset with a permissive license (CC-BY-4.0). With its 6.7M entries with consistent properties, it represents a foundational step towards creating a curated and standardized open ecosystem for materials science, designed to simplify research workflows and improve data quality. Below is a more in-depth view of what it looks like. To interactively browse through our materials, check out the Materials Explorer space, built using MP Dash components.
| Release | Description & Value | Date |
|---|---|---|
| v.1.0 | | Dec. 10, 2024 |
| v.1.1 | | Q1 2025 |
| Future releases | | Q2 2025 |
We provide different datasets and subsets, enabling tailored workflows for researchers depending on their needs (consistent calculations, deduplicating materials, or comprehensive exploration):
- Compatibility: these subsets only provide calculations that are compatible to mix. They are available for three functionals today (PBE, PBESol and SCAN).
- Non-compatible: this subset provides all materials not included in the compatibility subsets.
- LeMat-BulkUnique: this dataset split provides de-duplicated materials using our structure fingerprint algorithm. It is available in three subsets, for the PBE, PBESol and SCAN functionals.

More details on the dataset can be found on 🤗Hugging Face.
Integrating a well-benchmarked materials fingerprint
Besides building this standardized dataset, one of the key contributions of LeMaterial is to propose a definition of a material fingerprint through a hashing function that assigns a unique identifier to each material.
Current approaches to identifying a material as novel relative to a database have predominantly relied on similarity metrics, which require a combinatorial effort to screen the existing database for novelty. To offer faster novelty detection in a dataset, Entalpic introduces a hashing method to compute the fingerprint of a material.
Above is a breakdown of the fingerprinting. We use a bonding algorithm (e.g. EconNN) on the crystal structure to extract a graph, on which we then run the Weisfeiler-Lehman algorithm to get a hash. This hash is combined with composition and space group information to create the material fingerprint.
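To make the pipeline above more concrete, here is a minimal, illustrative sketch in pure Python. It is not the LeMaterial implementation: the function names (`wl_graph_hash`, `material_fingerprint`), the fingerprint string layout and the toy bonding graphs are all assumptions made for this example. It also skips the bonding step entirely (the graph is given by hand), ignores periodic boundary conditions, and uses raw atom counts rather than a reduced formula.

```python
import hashlib
from collections import Counter


def _digest(text):
    """Short, stable hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def wl_graph_hash(labels, edges, iterations=3):
    """Weisfeiler-Lehman style hash of a labelled, undirected graph.

    labels: dict mapping node id -> initial label (here, an element symbol).
    edges:  iterable of (u, v) node id pairs (here, bonds).
    """
    neighbors = {node: [] for node in labels}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    current = dict(labels)
    # Record the multiset of labels at every round; node ids never enter the hash.
    history = [sorted(Counter(current.values()).items())]
    for _ in range(iterations):
        # Each node's new label summarizes its own label and its neighbors' labels.
        current = {
            node: _digest(
                current[node] + "|" + ",".join(sorted(current[n] for n in neighbors[node]))
            )
            for node in current
        }
        history.append(sorted(Counter(current.values()).items()))
    return _digest(repr(history))


def material_fingerprint(labels, edges, space_group_number):
    """Combine graph hash, space group and composition (hypothetical layout)."""
    counts = sorted(Counter(labels.values()).items())
    composition = "".join(f"{element}{count}" for element, count in counts)
    return f"{wl_graph_hash(labels, edges)}_{space_group_number}_{composition}"


# Toy example: the same alternating Na/Cl ring, listed with two different atom orderings.
labels_a = {0: "Na", 1: "Cl", 2: "Na", 3: "Cl"}
edges_a = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels_b = {0: "Cl", 1: "Cl", 2: "Na", 3: "Na"}
edges_b = [(2, 0), (0, 3), (3, 1), (1, 2)]
# Same atoms as A, but with different bonding (Na-Na and Cl-Cl bonds).
edges_c = [(0, 2), (2, 1), (1, 3), (3, 0)]

fp_a = material_fingerprint(labels_a, edges_a, 225)
fp_b = material_fingerprint(labels_b, edges_b, 225)
fp_c = material_fingerprint(labels_a, edges_c, 225)

print(fp_a == fp_b)  # same graph, different atom ordering
print(fp_a == fp_c)  # same composition, different bonding
```

In this sketch the first comparison is `True` and the second is `False`: the hash depends only on how labelled atoms are connected, not on the order in which they are listed.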
Our fingerprinting approach offers several advantages:
- Quickly identifying whether a material is novel or already catalogued.
- Ensuring the dataset is free from duplicates and inconsistencies.
- Allowing materials to be linked between datasets.
- Supporting more efficient calculations for thermodynamic properties, such as energy above the hull.
Below is a comparison of our hash function with Pymatgen’s StructureMatcher for finding all duplicates in a dataset. The experiment was run on two datasets with very different structures.
When using our method, almost all of the task time was spent calculating material hashes; the follow-up comparison step is negligible time-wise. When using StructureMatcher, the vast majority of the task time was spent comparing pairs of structures; building said structures is negligible time-wise.
| Dataset | Number of structures | Task time for the hash function (parallelized on 12 CPUs) | Task time for StructureMatcher (parallelized on 64 CPUs) |
|---|---|---|---|
| Carbon-24 | 10,153 | 100 seconds | 17 hours |
| MPTS-52 | 40,476 | 330 seconds | 4.9 hours |
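The gap in the table follows from how each approach scales. Once every entry has a fingerprint, duplicates can be found by grouping entries in a dictionary, which takes one pass over the data, whereas a purely pairwise matcher needs up to n(n-1)/2 comparisons. The sketch below illustrates this under stated assumptions: the helper names and the fingerprint strings are placeholders invented for the example, and the pairwise count is a worst-case upper bound, not a measurement of what StructureMatcher actually does.

```python
from collections import defaultdict


def find_duplicates(entries):
    """Group entry ids by fingerprint in a single pass.

    entries: iterable of (entry_id, fingerprint) pairs.
    Returns the groups of ids that share a fingerprint.
    """
    groups = defaultdict(list)
    for entry_id, fingerprint in entries:
        groups[fingerprint].append(entry_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def max_pairwise_comparisons(n):
    """Upper bound on the comparisons needed by an all-pairs matcher."""
    return n * (n - 1) // 2


# Placeholder fingerprints; only the entry ids are taken from this post.
entries = [
    ("agm002153972", "fingerprint-A"),
    ("entry-x", "fingerprint-B"),
    ("agm002153975", "fingerprint-A"),
]
duplicates = find_duplicates(entries)
print(duplicates)

# Worst-case pair counts for the two dataset sizes in the table above.
for name, size in [("Carbon-24", 10_153), ("MPTS-52", 40_476)]:
    print(name, max_pairwise_comparisons(size))
```

Matchers that first bucket structures by composition can do far fewer comparisons than this bound, which may be why the dataset with the widest variety of compositions is not necessarily the slowest.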
Moreover, we’re planning on releasing a set of well-curated benchmarks to evaluate the validity of our hashing function. For instance, we investigated:
- Whether distinct materials lead to different hashes, based on material identification tags across existing databases
- Whether adding small noise or applying symmetry operations to a material results in the same hash
- Whether materials sharing the same hash, across or within databases, are indeed the same material, using manual and DFT checks
- How fast and accurate our hash is compared to Pymatgen’s StructureMatcher on existing databases
🤗 Call to the community: our aim is not to position this fingerprint method as the only solution to de-duplicate materials databases and find novel materials, but rather to foster discussion around this question. One current limitation of this hashing technique is that it doesn’t cover disordered structures; we would like to push the community towards finding a consensus, while proposing a relatively simple and efficient fingerprint method in the meantime.
LeMaterial in Action: applications and impact
In the long run, LeMaterial aims to be a community-driven initiative that gathers large, curated datasets, machine learning models, handy toolkits, and more. It’s designed to be practical and flexible, enabling a wide range of applications, such as:
- Exploring extended phase diagrams (link to our phase diagram explorer, built thanks to various open-source tools from Materials Project), constructed with a broader dataset, to analyze chemical spaces in greater detail. Combining larger datasets means that we can provide a finer resolution of material stability in a given compositional space:
Experimental phase diagram of Ti, Nb, Sn from this research paper
LeMat-Bulk phase diagram for Sn, Ti, Nb, built thanks to Pymatgen and Crystal Toolkit (Materials Project tools)
- Comparing materials properties across databases and functionals: by providing researchers with data across DFT functionals, and by linking materials via our materials fingerprint algorithm, we are able to establish and connect materials properties calculated with different parameters. This offers researchers insight into how functionals might behave and differ across compositional space.
- Determining if a material is novel. Our hashing function allows researchers to quickly assess whether a material is unique or a duplicate, streamlining the discovery process and avoiding redundant calculations.
  - Example 1: Our fingerprint method identified the following Alexandria entries (agm002153972, agm002153975) as potentially being the same material, since they have the same hash. When we ran a relaxation on the higher energy entry, the material relaxed to the lower energy configuration.

Lower energy structure
Higher energy structure
  - Example 2: Applying our hash to a different dataset (AIRSS) that is commonly used in training generative models, we found the following materials with the same hash.

Unit cells of materials sharing the identical fingerprint
To an untrained eye, these look like very different materials. However, when we replicate the lattice, we quickly find that they are quite similar:

Supercells of materials sharing the identical fingerprint
It is important to note here that Pymatgen’s StructureMatcher identified these two unit cells as different materials, when they are in fact the same structure. Here, our hashing algorithm was able to identify them as the same.
- Training predictive ML models. We can also train machine learning interatomic potentials like EquiformerV2 on LeMat-Bulk. These models should benefit from its scale, its data quality and the removal of bias across compositional space, and it will be interesting to assess the benefits of this new dataset. An example of how to use LeMaterial with Fairchem can be found in Colab. We’re currently in the process of training an EquiformerV2 model using this dataset. Stay tuned 💫
Take-aways
As a community, we often place a lot of value on the quality of these large-scale open-source databases. However, the lack of standardization makes using multiple datasets a huge challenge. LeMaterial offers a solution that unifies and standardizes existing major data sources, and performs extra cleaning and validation on them. This new open-science project is designed to accelerate research, improve the quality of ML models, and make materials discovery more efficient and accessible.
We are only getting started (we know there are still flaws and improvements to be made) and would love to hear your feedback! So please reach out if you are interested in contributing to this open-source initiative. We are excited to continue expanding LeMaterial with new datasets, tools, and applications, alongside the community! ⚛️🤗
We extend our heartfelt thanks to Zachary Ulissi and Luis Barroso-Luque (Meta), and Matt McDermott (Newfound Materials, Inc.) for their valuable feedback on this initiative.
Citations
By downloading content from LeMaterial, you agree to accept the Creative Commons Attribution 4.0 license, which means that content may be copied, distributed, transmitted, and adapted without obtaining specific permission from LeMaterial, provided proper attribution is given to LeMaterial.
If you use LeMaterial as a resource in your research, please cite the citation section from our data card (paper to come).
CC-BY-4.0 (license used for Materials Project, Alexandria, OQMD) requires proper acknowledgement.
Thus, if you use materials data which include "mp-" in the immutable_id, please cite the Materials Project. If you use materials data which include "agm-" in the immutable_id, please cite Alexandria, PBE or Alexandria PBESol, SCAN. If you use materials data which include "oqmd-" in the immutable_id, please cite OQMD. Finally, if you make use of the Phase Diagram for visualization purposes, or the crystal viewer in the Materials Explorer, please acknowledge Crystal Toolkit.
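As a convenience, the rule above can be turned into a small helper that lists which upstream databases a given set of entries requires you to cite. This is only a sketch: the function name and the prefix table are assumptions, it matches on the start of the identifier, and it uses "agm" without a hyphen because the Alexandria entries quoted earlier in this post (agm002153972, agm002153975) are written that way. The "mp-1" and "oqmd-1" identifiers below are placeholders.

```python
# Assumed mapping from immutable_id prefix to the database to cite.
SOURCE_BY_PREFIX = {
    "mp-": "Materials Project",
    "agm": "Alexandria",
    "oqmd-": "OQMD",
}


def sources_to_cite(immutable_ids):
    """Return the sorted list of upstream databases behind the given ids."""
    sources = set()
    for immutable_id in immutable_ids:
        for prefix, source in SOURCE_BY_PREFIX.items():
            if immutable_id.startswith(prefix):
                sources.add(source)
                break
        else:
            # Unknown prefix: fail loudly rather than silently skip a citation.
            raise ValueError(f"Unrecognized immutable_id prefix: {immutable_id!r}")
    return sorted(sources)


used_ids = ["mp-1", "agm002153972", "agm002153975", "oqmd-1"]
print(sources_to_cite(used_ids))
```

The helper only covers the three databases; LeMaterial itself and, where relevant, Crystal Toolkit still need to be acknowledged separately.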
To learn more about LeMaterial and get involved:

