In February, Reddit announced a new content partnership with Google in which it would supply data to power Google's new generative-AI-based search features using Retrieval-Augmented Generation (RAG). That rollout didn't go as planned, and soon people were seeing recommendations like adding glue to pizza:
In the age of artificial intelligence, massive amounts of data fuel the growth and sophistication of machine learning models. But not all data is created equal; AI systems require high-quality data to produce high-quality outputs.
So, what makes data “high-quality,” and why is it crucial to prioritize data quality from the outset? Achieving data quality is not only a matter of accuracy or quantity; it requires a holistic, responsible approach woven throughout the entire AI development lifecycle. As data quality has garnered renewed attention, we explore what constitutes “high-quality” data, why prioritizing data quality from the outset is crucial, and how organizations can leverage AI for beneficial initiatives while mitigating risks to privacy, fairness, safety, and sustainability.
In this post, we first provide a high-level overview of the relevant concepts, followed by a more detailed discussion.
What’s Good, High-Quality Data?
Good data is not just accurate or plentiful; it is data fit for its intended purpose. Data quality should be evaluated based on the particular use cases it supports. For example, the training data for a heart disease prediction model must include detailed patient histories, current health status, and precise medication dosages, but for privacy reasons it should generally not include patients' phone numbers or addresses. The key is to match the data to the needs of the task at hand. From a policy standpoint, consistently advocating for a safety-by-design approach to responsible machine learning is crucial. This includes taking thoughtful steps at the data stage itself. Desirable aspects of data quality include (but are not limited to!):
- Relevance: The data should be directly applicable and meaningful to the particular problem the AI model is trying to solve. Irrelevant data can introduce noise, i.e., random errors or extraneous information that can obscure the underlying patterns and lead to poor performance or unintended consequences. Relevance is widely recognized as critical across work on data quality, as it provides control over what a system may or may not do and helps optimize statistical estimates.
- Comprehensiveness: The data should capture the full breadth and variety of the real-world scenarios the AI will encounter. Incomplete or narrow datasets can result in biases and blind spots. This is also known as “Completeness” in data quality work.
- Timeliness: Particularly for rapidly evolving domains, the data should be up-to-date and reflect the current state of affairs. Outdated information can render an AI system ineffective or even dangerous. This is also known as “Currentness” or “Freshness” in work on data quality.
- Mitigation of Biases: Collecting data brings with it biases in everything from the data sources to the collection protocols. Data selection work must therefore make every effort to avoid encoding unintended harmful biases, which can result in systems that exacerbate patterns of societal oppression, stereotypes, discrimination, and underrepresentation of marginalized groups.

While we have focused on a subset of data quality measures here, many more have been defined that are useful for machine learning datasets, such as traceability and consistency.
Why Data Quality?
Investing in data quality is fundamental for improving AI model performance. In an era where AI and machine learning are increasingly integrated into decision-making processes, ensuring data quality is not just beneficial but essential. Properly curated data allows AI systems to operate more effectively, accurately, and fairly. It supports the development of models that can handle diverse scenarios, promotes sustainable practices by optimizing resource usage, and upholds ethical standards by mitigating biases and enhancing transparency. Some key motivators of data quality:
- Enhanced Model Outcomes: High-quality data improves model performance by eliminating noise, correcting inaccuracies, and standardizing formats.
- Robustness and Generalization: Diverse, multi-source data prevents overfitting and ensures that models are robust across various real-world scenarios. Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in poor generalization.
- Efficiency: High-quality data results in more efficient, compact models that require fewer computational resources.
- Representation and Inclusivity: High-quality data should be representative and inclusive, which helps address biases, promote equity, and ensure the representation of diverse societal groups.
- Governance and Accountability: Practices such as transparency about data sources, preprocessing, and provenance enable effective AI governance and accountability.
- Scientific Reproducibility: High-quality data is crucial for open science, as it ensures the validity of findings and facilitates reproducibility and further research.
What’s the Process toward Data Quality?
The path toward high-quality datasets involves several key strategies. Meticulous data curation and preprocessing, such as deduplication, content filtering, and human feedback (e.g., through domain expertise and stakeholder input), are essential to maintain a dataset's relevance and accuracy for the task at hand. Participatory data collection and open community contributions enhance representation and inclusivity. Establishing a robust data governance framework with clear policies, standards, and accountability ensures consistent data management. Regular quality assessments using metrics like accuracy and completeness help identify and rectify issues. Thorough documentation, including dataset cards, improves usability, collaboration, and transparency. Lastly, while synthetic data can be helpful, it should be used alongside real-world data and validated rigorously to prevent biases and ensure model performance.
We dive deeper into these different aspects below.
Data Quality for Improving Model Performance
Investing in data quality is crucial for enhancing the performance of AI systems. Numerous studies have demonstrated that higher data quality directly correlates with improved model outcomes, as most recently seen in the Yi 1.5 model release. Achieving high data quality involves meticulous data cleaning and preprocessing to remove noise, correct inaccuracies, fill in missing values, and standardize formats. Incorporating diverse, multi-source data helps prevent overfitting and exposes models to a wide range of real-world scenarios.
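As a minimal illustration of the cleaning steps just described (deduplication, filling missing values, standardizing formats), here is a sketch using pandas on a small, made-up patient table; the column names and values are purely hypothetical:

```python
import pandas as pd

# Toy patient records: a duplicate row, a missing age, and messy strings.
df = pd.DataFrame({
    "age": [54, 61, 61, None, 47],
    "cholesterol": ["240", "198", "198", "310", "  275 "],
    "diagnosis": ["CAD", "none", "none", "CAD", "none"],
})

# Remove exact duplicate rows (noise from repeated ingestion).
df = df.drop_duplicates()

# Fill missing numeric values with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Standardize formats: strip whitespace and cast to a numeric dtype.
df["cholesterol"] = pd.to_numeric(df["cholesterol"].str.strip())

print(len(df))                  # 4 rows remain after deduplication
print(df["age"].isna().sum())   # 0 missing values remain
```

Real pipelines add many more steps (outlier handling, schema validation), but the pattern of dedupe, impute, and normalize is the same.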
The advantages of high-quality data extend beyond improved metrics. Cleaner, smaller datasets allow models to be more compact and parameter-efficient, requiring fewer computational resources and energy for training and inference.
Data Quality for Improving Representation
Another crucial aspect of data quality is representation. Models are often trained on data that over-represents dominant groups and perspectives, leading to skewed object representations, imbalanced occupational and location biases, or the consistent depiction of harmful stereotypes. Addressing this means including data from all groups in society and capturing a wide range of languages, especially in text data. Diverse representation helps mitigate cultural biases and improves model performance across different populations. An example of such a dataset is CIVICS.
Participatory approaches are key to achieving this. By involving a larger variety of stakeholders in the data creation process, we can ensure that the data used to train models is more inclusive. Initiatives like “Data is Better Together” encourage community contributions to datasets, enriching the diversity and quality of the data. Similarly, the Masakhane project focuses on creating datasets for African languages, such as evaluation datasets, which have been underrepresented in AI research. These efforts ensure that AI systems are more equitable and effective across different contexts and populations, ultimately fostering more inclusive technological development.
Data Quality for Governance and Accountability
Maintaining high data quality practices is crucial for enabling effective governance and accountability of AI systems. Transparency about data sources, licenses, and any preprocessing applied is essential. Developers should provide clear documentation of data provenance, including where the data originated, how it was collected, and any transformations it underwent.
This transparency empowers external audits and oversight, allowing for thorough examination and validation of the data used in AI models. Clear documentation and data traceability also help identify potential issues and implement mitigation strategies. This level of transparency is critical for building trust and facilitating responsible AI development, ensuring that AI systems operate ethically and responsibly.
Data Quality for Adaptability and Generalizability
Another critical aspect is ensuring that data reflects the diversity required for AI models to adapt and generalize across contexts. This involves capturing a wide range of languages, cultures, environments, and edge cases representative of the real world. Participatory data collection approaches involving impacted communities can enrich datasets and improve representation, ensuring robust and adaptable models.
Continuously evaluating model performance across different demographics is vital to identifying generalizability gaps. Achieving adaptable AI hinges on continuous data collection and curation processes that ingest real-world feedback loops. As new products are released or business landscapes shift, the training data should evolve in lockstep to reflect these changes. Developers should implement processes to detect data drift and drops in model performance relative to the current state, ensuring that AI models remain relevant and effective in changing environments.
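One simple way to flag the data drift mentioned above is to compare a feature's distribution at training time against its distribution in production. The sketch below uses the Population Stability Index (PSI), a common drift heuristic; the threshold and sample data are illustrative, and production systems would typically rely on dedicated monitoring tooling:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (e.g. training data) and new data.
    A common heuristic: PSI > 0.2 signals notable drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, i):
        # Assign each value to a bin; clip the top edge into the last bin.
        count = sum(
            1 for x in sample
            if min(int((x - lo) / width), bins - 1) == i
        )
        return max(count / len(sample), 1e-6)  # avoid log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

reference = [0.1 * i for i in range(100)]      # feature values at training time
drifted = [0.1 * i + 3.0 for i in range(100)]  # the same feature in production
print(population_stability_index(reference, reference))  # ~0.0: no drift
print(population_stability_index(reference, drifted))    # well above 0.2: drift
```

When PSI crosses the chosen threshold, that is a signal to refresh the training data or retrain the model.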
Data Quality for Scientific Reproducibility and Replicability
In the research realm, data quality has profound implications for the reproducibility and validity of findings. Poor-quality training data can undermine the integrity of experiments and lead to non-reproducible results. Stringent data quality practices, such as meticulous documentation of preprocessing steps and sharing of datasets, enable other researchers to scrutinize findings and build upon previous work.
Replicability, defined as the process of arriving at the same scientific findings using new data, is a little more nuanced. Sometimes, the non-replicability of a study can actually aid scientific progress by expanding research from a narrow applied field into broader areas. Regardless, replicability is difficult without proper documentation of data collection procedures and training methodology, and the current reproducibility and replicability crisis in AI can be significantly ameliorated by high-quality, well-documented data.
High-Quality Data Needs High-Quality Documentation
One of the crucial aspects of high-quality data, just as for code, is thorough documentation of the data. Proper documentation enables users to understand the content and context of the data, facilitating better decision-making and enhancing the transparency and reliability of AI models. One modern approach to data documentation is dataset cards, as offered on the Hugging Face Hub. There are various methods to document data, including data statements, datasheets, data nutrition labels, dataset cards, and dedicated research papers. These documentation methods often cover the sources and composition of the dataset, processing steps, descriptive statistics including demographics represented in the dataset, and the original purpose of the dataset (see work on the importance of data transparency for more details). Data documentation, such as dataset cards, can help with:
- Enhanced Usability: By providing a clear and comprehensive overview of the dataset, dataset cards make it easier for users to understand and utilize the data effectively.
- Improved Collaboration: Detailed documentation fosters better communication and collaboration, as everyone has a shared understanding of the data.
- Informed Decision-Making: With access to detailed information about the data, users can make more informed decisions regarding its application and suitability for various tasks.
- Transparency and Accountability: Thorough documentation promotes transparency and accountability in data management, building trust among users and stakeholders.
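As a sketch of what such documentation can look like, here is a minimal, hypothetical dataset card in the Hugging Face Hub's Markdown-with-YAML-front-matter style; every name, number, and value below is illustrative, not from a real dataset:

```markdown
---
license: cc-by-4.0
language:
- en
task_categories:
- text-classification
---

# Dataset Card for toy-clinical-notes (hypothetical)

## Dataset Description
Synthetic clinical notes for benchmarking triage classifiers.
Created 2024; not suitable for clinical use.

## Sources and Composition
10,000 notes generated from templated vignettes; no real patient data.

## Processing Steps
Deduplicated exact matches; filtered out notes under 20 tokens.

## Represented Demographics
Age buckets 18-90; gendered terms balanced 1:1 by template design.

## Intended Use and Limitations
Research on classifier robustness only; English only.
```

Even a short card like this answers the questions above: where the data came from, how it was processed, who is represented, and what it should (and should not) be used for.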
A Note on Synthetic Data
Synthetic data has emerged as a cost-efficient alternative to real-world data, providing a scalable solution for training and testing AI models without the expenses and privacy concerns associated with collecting and managing large volumes of real data, as done for example in Cosmopedia. This approach enables organizations to generate diverse datasets tailored to specific needs, accelerating development cycles and reducing costs. However, it is crucial to be aware of the potential downsides. Synthetic data can inadvertently introduce biases if the algorithms generating the data are themselves biased, resulting in skewed model outcomes. It is important to mark model output as generated content, e.g., by watermarking across modalities (overview). Moreover, over-reliance on synthetic data can lead to model collapse, where the model becomes overly tuned to synthetic data patterns. Therefore, while synthetic data is a powerful tool, it should be used judiciously, complemented by real-world data and robust validation processes to ensure model performance and fairness.
Data Quality Practices at Hugging Face
Ensuring high data quality is crucial for developing effective and reliable AI models. Here are some examples of data quality strategies that teams at Hugging Face have employed:
An important aspect of data quality is filtering and deduplication. For instance, in creating large, high-quality datasets like FineWeb-Edu, Hugging Face employs tools such as DataTrove. Filtering involves selecting only relevant and high-quality data, ensuring that the dataset is comprehensive without unnecessary noise. Deduplication removes redundant entries, which improves the efficiency and performance of AI models. This meticulous approach ensures that the dataset remains robust and relevant.
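The exact-match stage of deduplication can be sketched in a few lines by hashing normalized text; real pipelines such as DataTrove also perform fuzzy (e.g. MinHash-based) deduplication, which this toy example omits:

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Exact deduplication on normalized content, keeping first occurrences."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the  cat sat.", "A dog barked."]
print(deduplicate(corpus))  # ['The cat sat.', 'A dog barked.']
```

Hashing a normalized form rather than the raw string means near-identical boilerplate variants collapse to one entry, while the set of digests keeps memory usage bounded even for web-scale corpora.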
Responsible multimodal data creation is another key area where Hugging Face has set an example. The OBELICS dataset showcases several best practices in this regard. One significant practice is opt-out filtering, where images that have been opted out of redistribution or model training are removed using APIs like Spawning. This respects the rights and preferences of content creators. Moreover, deduplication ensures that images appear no more than ten times across the dataset, reducing redundancy and ensuring diverse representation. Content filtering is also essential; employing open-source classifiers to detect and exclude NSFW content, and filtering images based on their URLs, maintains the dataset's appropriateness and relevance.
Handling diverse data types is yet another strategy employed by Hugging Face. In creating The Stack V2, which covers a broad range of programming languages and frameworks, repositories and projects were carefully selected to ensure diversity and comprehensiveness. Quality checks, both automated and manual, confirm the syntactic correctness and functional relevance of the code in the dataset, maintaining its high quality; see, for example, the deduplication efforts in the BigCode project.
Gathering human feedback using data labeling tools (like Argilla) can have a significant impact on data quality, especially by including stakeholders in the data creation process. Examples include the improvement of the UltraFeedback dataset through human curation, resulting in Notus, an improved version of the Zephyr model, and the community efforts of the Data is Better Together initiative.
Beyond these specific practices, there are general strategies that can ensure data quality. Establishing a robust data governance framework is foundational. This framework should include policies, standards, and processes for data management, with clearly defined roles and responsibilities to ensure accountability and maintain high standards. Regular quality assessments are also vital. These assessments, which can use metrics like accuracy, completeness, consistency, and validity, help identify and address issues early. Tools such as data profiling and statistical analysis can be instrumental in this process.
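Quality-assessment metrics like completeness and validity can be computed with very little code. Here is a toy sketch over a hypothetical set of patient records (field names and the validity rule are illustrative):

```python
def completeness(records, field):
    """Fraction of records with a non-missing value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def validity(records, field, predicate):
    """Fraction of records whose value for `field` passes a validity check."""
    ok = sum(1 for r in records if predicate(r.get(field)))
    return ok / len(records)

patients = [
    {"age": 54, "dose_mg": 20},
    {"age": None, "dose_mg": 10},   # incomplete: missing age
    {"age": 47, "dose_mg": -5},     # invalid: negative dosage
    {"age": 61, "dose_mg": 40},
]
print(completeness(patients, "age"))                                     # 0.75
print(validity(patients, "dose_mg", lambda d: d is not None and d > 0))  # 0.75
```

Tracking such metrics over time, per field and per data source, is a lightweight form of the data profiling mentioned above.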
Are you working on data quality? Share your tools and methods on the Hugging Face Hub!
The most important part of Hugging Face is our community. If you're a researcher working on improving data quality in machine learning, especially within the context of open science, we want to support and showcase your work!
Thanks for reading! 🤗
~ Avijit and Lucie, on behalf of the Ethics & Society regulars
If you want to cite this blog post, please use the following (authors in alphabetical order):
@misc{hf_ethics_soc_blog_6,
author = {Avijit Ghosh and Lucie-Aimée Kaffee},
title = {Hugging Face Ethics and Society Newsletter 6: Building Better AI: The Importance of Data Quality},
booktitle = {Hugging Face Blog},
year = {2024},
url = {https://huggingface.co/blog/ethics-soc-6},
doi = {10.57967/hf/2610}
}
