Resources, tools, papers, and projects for ensuring data reliability and effectiveness across traditional data, LLM pretraining/fine-tuning data, multimodal data, and more.
- Introduction
- Traditional Data
- Large Language Model Data
- Multimodal Data
- Tabular Data
- Time Series Data
- Graph Data
- Data-Centric AI
Data quality is a critical aspect of any data-driven application or research. This repository collects resources related to data quality across different data types, including traditional data, large language model data (both pretraining and fine-tuning), multimodal data, and more.
This section covers data quality for traditional structured and unstructured data.
- Data Cleaning: Problems and Current Approaches - A comprehensive overview of data cleaning approaches. (2000)
- A Survey on Data Quality: Classifying Poor Data - A survey on data quality issues and classification. (2016)
- Great Expectations - A Python framework for validating, documenting, and profiling data. (2018)
- Deequ - A library built on top of Apache Spark for defining "unit tests for data". (2018)
- OpenRefine - A powerful tool for working with messy data, cleaning it, and transforming it. (2010)
- Pandas Profiling (now ydata-profiling) - Generates profile reports from pandas DataFrames. (2016)
- DataProfiler - A Python library for automated data profiling. (2021)
- PyDeequ - Python API for Deequ, enabling "unit tests for data". (2020)
- Evidently - An open-source ML monitoring framework for data drift detection. (2021)
- TensorFlow Data Validation (TFDV) - A library for exploring and validating ML data at scale. (2018)
- Deepchecks - A Python package for validating ML models and data. (2021)
- Provero - A vendor-neutral, declarative data quality engine. Define checks in YAML and run anywhere. (2026)
- DataScreenIQ - A hosted API for real-time data quality screening at the ingest boundary, returning PASS / WARN / BLOCK verdicts and detecting schema drift, null spikes, and type mismatches before data enters pipelines or warehouses. (2026)
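The "unit tests for data" idea behind Deequ and Great Expectations can be illustrated with a few stdlib-only checks. This is a minimal sketch of the concept, not either library's actual API; all function names and thresholds here are illustrative.

```python
# Minimal "unit tests for data" in the spirit of Deequ / Great Expectations.
# Rows are plain dicts; each check returns True (pass) or False (fail).

def check_completeness(rows, column, threshold=0.95):
    """Pass if at least `threshold` of rows have a non-null value."""
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows) >= threshold

def check_uniqueness(rows, column):
    """Pass if every non-null value in the column is distinct."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))

def check_range(rows, column, lo, hi):
    """Pass if every non-null value falls within [lo, hi]."""
    return all(lo <= r[column] <= hi
               for r in rows if r.get(column) is not None)

def run_suite(rows, checks):
    """Evaluate each named check and return a {name: passed} report."""
    return {name: fn(rows) for name, fn in checks.items()}
```

A suite is then just a dict of named lambdas, e.g. `run_suite(rows, {"id_unique": lambda r: check_uniqueness(r, "id")})`; real frameworks add scheduling, metric storage, and alerting on top of this core idea.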
This subsection covers methods and tools for assessing data readiness for AI applications.
- Data Readiness for AI: A 360-Degree Survey - A comprehensive survey examining metrics for evaluating data readiness for AI training across structured and unstructured datasets. (2024)
- Assessing Student Adoption of Generative Artificial Intelligence across Engineering Education - An empirical study on data quality considerations in educational AI applications. (2025)
- Data Readiness Assessment Framework - A framework for evaluating data quality and readiness for AI applications. (2024)
- AI Data Quality Metrics - Standardized metrics for assessing data quality in AI contexts. (2024)
This section covers data quality for large language model pretraining data.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling - A large-scale curated dataset for language model pretraining. (2021)
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets - An audit of the quality of web-crawled multilingual datasets. (2021)
- Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus - Documentation of the C4 dataset. (2021)
- Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models - REWire method for recycling and improving low-quality web documents through guided rewriting, addressing the "data wall" problem in LLM pretraining. (2025)
- Assessing the Role of Data Quality in Training Bilingual Language Models - A study revealing that unequal data quality is a major driver of performance degradation in bilingual settings, with a practical data filtering strategy for multilingual models. (2025)
- Dolma - A framework for curating and documenting large language model pretraining data. (2023)
- Text Data Cleaner - A tool for cleaning text data for language model pretraining. (2022)
- CCNet - Tools for downloading and filtering CommonCrawl data. (2020)
- Dingo - A comprehensive AI data quality evaluation tool supporting multiple data sources, types, and modalities. (2024)
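Pipelines like C4 and CCNet combine line-level heuristics with document-level deduplication. The sketch below shows the general shape of such filtering; the specific rules and boilerplate strings are illustrative assumptions, not the thresholds used by any pipeline above.

```python
# Illustrative web-text cleaning in the spirit of C4/CCNet-style filtering:
# drop low-quality lines, then drop exact-duplicate documents by hash.
import hashlib

def keep_line(line):
    """Line filter: drop very short lines, known boilerplate, and lines
    that do not end in terminal punctuation (a C4-style heuristic)."""
    words = line.split()
    if len(words) < 3:
        return False
    if line.strip().lower() in {"javascript required", "cookie policy"}:
        return False
    return line.strip().endswith((".", "!", "?", '"'))

def clean_document(text, seen_hashes):
    """Filter lines, then drop exact duplicates via a content hash.
    Returns the cleaned document, or None if it is empty or a duplicate."""
    kept = [l for l in text.splitlines() if keep_line(l)]
    if not kept:
        return None
    doc = "\n".join(kept)
    digest = hashlib.sha256(doc.encode()).hexdigest()
    if digest in seen_hashes:
        return None  # exact-duplicate document
    seen_hashes.add(digest)
    return doc
```

Production pipelines layer language identification, perplexity filtering, and fuzzy deduplication on top of heuristics like these.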
This section covers data quality for large language model fine-tuning data.
- Training language models to follow instructions with human feedback - The InstructGPT (RLHF) paper from OpenAI. (2022)
- Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP - A study on the importance of data quality over quantity. (2022)
- Data Quality for Machine Learning Tasks - A survey on data quality for machine learning. (2021)
- LMSYS Chatbot Arena - A platform for evaluating LLM responses. (2023)
- OpenAssistant - A project to create high-quality instruction-following data. (2022)
- Argilla - An open-source data curation platform for LLMs. (2021)
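Fine-tuning data curation often starts with simple hygiene rules before any human review. The sketch below is a hedged illustration of such rules (empty fields, too-short responses, echoed prompts, duplicate prompts); the specific checks and thresholds are assumptions, not the logic of any platform listed above.

```python
# Simple hygiene checks for instruction/response pairs.

def is_clean_pair(prompt, response, min_response_words=3):
    """Reject empty fields, trivially short responses, and echoes."""
    if not prompt.strip() or not response.strip():
        return False
    if len(response.split()) < min_response_words:
        return False
    # Drop pairs where the response merely repeats the prompt.
    return response.strip().lower() != prompt.strip().lower()

def dedupe_pairs(pairs):
    """Keep the first clean occurrence of each whitespace/case-normalized prompt."""
    seen, kept = set(), []
    for prompt, response in pairs:
        key = " ".join(prompt.lower().split())
        if key not in seen and is_clean_pair(prompt, response):
            seen.add(key)
            kept.append((prompt, response))
    return kept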
This section covers comprehensive data management approaches for LLMs, including data processing, storage, and serving.
- A Survey of LLM × DATA - A comprehensive survey on data-centric methods for large language models covering data processing, storage, and serving. (2025)
- Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval - A method for identifying and relabeling false negatives in training data to improve model performance. (2025)
- awesome-data-llm - Official repository of "LLM × DATA" survey paper with curated resources. (2025)
- CommonCrawl - A massive web crawl dataset covering diverse languages and domains. (2008)
- RedPajama - An open-source reproduction of the LLaMA training dataset. (2023)
- FineWeb - A large-scale, high-quality web dataset for language model training. (2024)
- Dokime - An open-source CLI toolkit for scoring, filtering, deduplicating, and diagnosing ML training data quality. (2026)
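Corpora such as RedPajama and FineWeb are typically deduplicated with MinHash-based near-duplicate detection. The sketch below shows the underlying idea with exact word-shingle Jaccard similarity instead of MinHash signatures; the shingle size and threshold are illustrative.

```python
# Near-duplicate detection via word n-gram shingles and Jaccard similarity,
# a simplified stand-in for MinHash-based corpus deduplication.

def shingles(text, n=3):
    """Set of word n-grams (shingles) for a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(docs, threshold=0.8):
    """Greedy pass: keep a doc only if it is not too similar to any kept doc."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

The greedy pass here is quadratic in the number of documents; at corpus scale, MinHash + locality-sensitive hashing makes the same comparison approximate but near-linear.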
This section focuses on cognition engineering and test-time scaling methods that improve data quality through enhanced reasoning and thinking processes.
- Generative AI Act II: Test Time Scaling Drives Cognition Engineering - A comprehensive survey on cognition engineering through test-time scaling and reinforcement learning. (2025)
- Unlocking Deep Thinking in Language Models: Cognition Engineering through Inference Time Scaling and Reinforcement Learning - A framework for developing AI thinking capabilities through test-time scaling paradigms. (2025)
- O1 Journey--Part 1 - A dataset for math reasoning with long chain-of-thought. (2024)
- Marco-o1 - Reasoning dataset synthesized from Qwen2-7B-Instruct. (2024)
- STILL-2 - Long-form thought data for math, code, science, and puzzle domains. (2024)
- OpenThoughts-114k - Large-scale dataset of reasoning trajectories distilled from DeepSeek R1. (2025)
- High-impact Sample Selection - Methods for prioritizing training samples based on learning impact measurement. (2025)
- Noise Reduction Filtering - Techniques for removing noisy web-extracted data to improve generalization. (2025)
- Length-Adaptive Training - Approaches for handling variable-length sequences in training data. (2024)
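One generic way to operationalize "high-impact" sample selection is to score each example by its current loss and keep the hardest fraction. This is a hedged sketch of that generic proxy, not the specific method of any entry above.

```python
# Loss-based sample prioritization: rank by per-example loss, keep the top.

def select_high_impact(examples, loss_fn, keep_fraction=0.5):
    """Rank examples by loss (descending) and keep the top fraction."""
    scored = sorted(examples, key=loss_fn, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]
```

In practice `loss_fn` would query the current model; more refined methods replace raw loss with influence or gradient-based impact estimates, since noisy or mislabeled examples also have high loss.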
This section covers data quality for multimodal data, including image-text pairs, video, and audio.
- LAION-5B: An open large-scale dataset for training next generation image-text models - A large-scale dataset of image-text pairs. (2022)
- DataComp: In search of the next generation of multimodal datasets - A benchmark for evaluating data curation strategies. (2023)
- CLIP-Benchmark - A benchmark for evaluating CLIP models. (2021)
- img2dataset - A tool for efficiently downloading and processing image-text datasets. (2021)
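LAION-style curation keeps an image-text pair only if the CLIP similarity of its embeddings clears a threshold. The sketch below assumes the embeddings are already computed by some model and shows only the cosine-similarity gate; the 0.28 default mirrors the threshold reported for LAION's English subset, but treat it as an assumption here.

```python
# Cosine-similarity gate for image-text pairs with precomputed embeddings.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def filter_pairs(pairs, threshold=0.28):
    """Keep captions of (image_emb, text_emb, caption) triples above the threshold."""
    return [caption for img, txt, caption in pairs
            if cosine(img, txt) >= threshold]
```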
This section covers data quality for tabular data.
- Automating Data Quality Validation for Dynamic Data Ingestion - A framework for automating data quality validation. (2019)
- A Survey on Data Quality for Machine Learning in Practice - A survey on data quality issues in machine learning. (2021)
- Pandas Profiling (now ydata-profiling) - A tool for generating profile reports from pandas DataFrames. (2016)
- DataProfiler - A Python library for data profiling and data quality validation. (2021)
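The core of a profiler like Pandas Profiling or DataProfiler is a per-column summary. This stdlib-only sketch computes null rate, distinct count, and a crude type guess; it illustrates the idea, not either tool's output format.

```python
# Minimal column profiler over a list of dicts sharing the same keys.

def profile(rows):
    """Return {column: {null_rate, distinct, inferred_type}} for each column."""
    report = {}
    for col in rows[0].keys():
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        types = {type(v).__name__ for v in non_null}
        report[col] = {
            "null_rate": 1 - len(non_null) / len(values),
            "distinct": len(set(non_null)),
            "inferred_type": types.pop() if len(types) == 1 else "mixed",
        }
    return report
```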
This section covers data quality for time series data.
- Cleaning Time Series Data: Current Status, Challenges, and Opportunities - A survey on cleaning time series data. (2022)
- Time Series Data Augmentation for Deep Learning: A Survey - A survey on time series data augmentation. (2020)
- Darts - A Python library for time series forecasting and anomaly detection. (2020)
- tslearn - A machine learning toolkit dedicated to time series data. (2017)
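Two cleaning steps that recur in the time-series survey literature are gap filling and local outlier flagging. This is a hedged stdlib sketch of both; the window size and deviation threshold are illustrative, and the interpolation assumes gaps are interior to the series.

```python
# Linear interpolation of missing points and rolling-median outlier flags.
from statistics import median

def interpolate_missing(series):
    """Fill interior None gaps by linear interpolation between known neighbors.
    Leading/trailing Nones are not handled in this sketch."""
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            lo = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            hi = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            frac = (i - lo) / (hi - lo)
            out[i] = out[lo] + frac * (out[hi] - out[lo])
    return out

def flag_outliers(series, window=5, threshold=10.0):
    """Flag points deviating from the local (windowed) median by > threshold."""
    flags, half = [], window // 2
    for i, v in enumerate(series):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        flags.append(abs(v - median(series[lo:hi])) > threshold)
    return flags
```

Libraries like Darts wrap far more robust versions of both steps (model-based imputation, seasonal decomposition, learned anomaly scores).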
This section covers data quality for graph data.
- A Survey on Graph Cleaning Methods for Noise and Errors in Graph Data - A survey on graph cleaning methods. (2022)
- Graph Data Quality: A Survey from the Database Perspective - A survey on graph data quality from a database perspective. (2022)
- DGL - A Python package for deep learning on graphs. (2018)
- NetworkX - A Python package for the creation, manipulation, and study of complex networks. (2008)
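Basic graph-cleaning steps discussed in the surveys above (removing self-loops, duplicate edges, and dangling edges) can be shown on a plain edge list without any graph library; this sketch treats edges as undirected for deduplication.

```python
# Edge-list cleaning: drop self-loops, dangling edges, and duplicates.

def clean_edges(edges, nodes):
    """edges: iterable of (u, v) pairs; nodes: set of valid node ids."""
    seen, cleaned = set(), []
    for u, v in edges:
        if u == v:
            continue  # self-loop
        if u not in nodes or v not in nodes:
            continue  # dangling edge to an unknown node
        key = frozenset((u, v))
        if key in seen:
            continue  # duplicate edge (in either direction)
        seen.add(key)
        cleaned.append((u, v))
    return cleaned
```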
This section focuses on data quality management for machine learning models, following the Data-Centric AI paradigm. It includes papers and resources related to data valuation, data selection, and benchmarks for evaluating data quality in ML pipelines.
- A Survey on Data Quality Dimensions and Tools for Machine Learning - A comprehensive survey reviewing 17 data quality tools for ML applications. GitHub resource. (2024)
- Data Quality Awareness: A Journey from Traditional Data Management to Data Science Systems - A comprehensive survey on data quality awareness across traditional data management and modern data science systems. (2024)
- A Survey on Data Selection for Language Models - A survey focusing on data selection techniques for language models. (2024)
- Advances, challenges and opportunities in creating data for trustworthy AI - A Nature Machine Intelligence paper discussing the challenges and opportunities in creating high-quality data for AI. (2022)
- Data-centric Artificial Intelligence: A Survey - A comprehensive survey on data-centric AI approaches. (2023)
- Data Management For Large Language Models: A Survey - A survey on data management techniques for large language models. (2023)
- Training Data Influence Analysis and Estimation: A Survey - A survey on methods for analyzing and estimating the influence of training data on model performance. (2022)
- Data Management for Machine Learning: A Survey - A TKDE survey on data management techniques for machine learning. (2022)
- Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges - An IJCAI paper on data valuation methods in machine learning. (2022)
- Explanation-Based Human Debugging of NLP Models: A Survey - A TACL survey on explanation-based debugging of NLP models. (2021)
- Data Shapley: Equitable Valuation of Data for Machine Learning - An ICML paper introducing the Data Shapley method for valuing training data. (2019)
- Efficient task-specific data valuation for nearest neighbor algorithms - A VLDB paper on efficient data valuation for nearest neighbor algorithms. (2019)
- Towards Efficient Data Valuation Based on the Shapley Value - An AISTATS paper on efficient data valuation using Shapley values. (2019)
- Understanding Black-box Predictions via Influence Functions - An ICML paper introducing influence functions for understanding model predictions. (2017)
- Data Cleansing for Models Trained with SGD - A NeurIPS paper on data cleansing for SGD-trained models. (2019)
- Modyn: Data-Centric Machine Learning Pipeline Orchestration - A SIGMOD paper on pipeline orchestration for data-centric machine learning. (2023)
- Data Selection via Optimal Control for Language Models - An ICLR paper on optimal control methods for data selection in language models. (2024)
- ADAM Optimization with Adaptive Batch Selection - An ICLR paper on adaptive batch selection for ADAM optimization. (2024)
- Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws - An ICLR paper on dynamic sample selection using scaling laws. (2024)
- Selection via proxy: Efficient data selection for deep learning - An ICLR paper on efficient data selection using proxy models. (2020)
- DataPerf: Benchmarks for Data-Centric AI Development - A NeurIPS paper introducing benchmarks for data-centric AI development. (2023)
- OpenDataVal: a Unified Benchmark for Data Valuation - A NeurIPS paper on a unified benchmark for data valuation. (2023)
- Improving multimodal datasets with image captioning - A NeurIPS paper on improving multimodal datasets with image captioning. (2023)
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias - A NeurIPS paper on using LLMs as training data generators. (2023)
- dcbench: A Benchmark for Data-Centric AI Systems - A DEEM paper introducing a benchmark for data-centric AI systems. (2022)
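The Data Shapley line of work above values each training point by its average marginal contribution to a utility function (e.g., validation accuracy) over random orderings. This is a hedged Monte Carlo sketch of that idea, following Ghorbani & Zou's permutation-sampling scheme without the truncation and convergence machinery of the full method.

```python
# Monte Carlo Data Shapley: average each point's marginal contribution
# to `utility` (any function of a set of training indices) over
# random permutations of the training points.
import random

def monte_carlo_shapley(n_points, utility, n_permutations=200, seed=0):
    rng = random.Random(seed)
    values = [0.0] * n_points
    for _ in range(n_permutations):
        perm = list(range(n_points))
        rng.shuffle(perm)
        prefix, prev = set(), utility(set())
        for idx in perm:
            prefix.add(idx)
            score = utility(prefix)
            values[idx] += score - prev  # marginal contribution of idx
            prev = score
    return [v / n_permutations for v in values]
```

For an additive utility the estimate is exact (each point's value equals its own contribution); for real utilities such as retraining-based validation accuracy, the estimate converges as permutations accumulate, which is why efficient approximations like KNN-Shapley matter.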