What is Datasheets for Datasets?

Datasheets for Datasets is a standardised documentation framework that records the provenance, composition, collection process, intended use, and known limitations of datasets used to train AI systems, enabling informed decisions about data quality and appropriateness.

What are Datasheets for Datasets?

Datasheets for Datasets is a documentation practice proposed in 2018 by Timnit Gebru and co-authors, several of them then at Microsoft Research, that creates standardised accompanying documents for datasets used in AI and machine learning. Just as electronic components come with datasheets describing their specifications and operating conditions, datasets should come with documentation explaining what data they contain, how they were collected, what they are intended for, and what their limitations are.

The concept addresses a fundamental problem in AI development: the quality and appropriateness of training data are among the most important factors determining whether an AI system works well and works fairly. Yet datasets are routinely shared and used with minimal documentation, leaving AI developers to guess at critical characteristics that affect model performance and fairness.

Why Datasheets for Datasets Matter

Data Quality and Provenance

AI models are only as good as the data they are trained on. Without clear documentation about how a dataset was collected, who is represented in it, and what biases it might contain, developers cannot make informed decisions about whether the data is appropriate for their use case. A dataset collected from one population may not represent another. Data collected during one time period may not reflect current conditions.

Datasheets provide the provenance information that enables these assessments. They answer critical questions: Where did this data come from? Who collected it and why? What populations does it represent? What are its known limitations?

Preventing Misuse

Datasets are frequently repurposed for applications their creators never intended. A dataset collected for academic research might be used to train a commercial product. A dataset representing one geographic region might be applied to another. Without documentation of intended use and known limitations, these misapplications happen routinely and can lead to AI systems that perform poorly or unfairly.

Regulatory Compliance

Data governance regulations across Southeast Asia require organisations to understand and document their data practices. Singapore's Personal Data Protection Act, Thailand's PDPA, and Indonesia's PDP Law all impose requirements around data collection, consent, and use. Datasheets help organisations demonstrate compliance by documenting the provenance and handling of their training data.

What a Datasheet Contains

Motivation

Why was the dataset created? Who created it? What task was it designed to support? Was it created for a specific research question, a commercial product, or general use? Understanding motivation helps assess whether the dataset is appropriate for your specific application.

Composition

What does the dataset contain? How many instances are there? What features are included? What data types are represented? Are there any known errors, noise, or redundancies? For datasets containing information about people, what demographic groups are represented and in what proportions?
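As a rough illustration, the question about demographic proportions can be answered with a few lines of Python. This is a minimal sketch, assuming the dataset is available as a list of dictionaries; the function name group_proportions, the field name region, and the sample records are hypothetical and not part of the datasheets framework.

```python
from collections import Counter


def group_proportions(records, field):
    """Return the share of records for each value of `field`.

    Records that lack the field are counted under "unknown" so that
    missing values show up in the datasheet instead of disappearing.
    """
    counts = Counter(rec.get(field, "unknown") for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: count / total for group, count in counts.items()}


# Hypothetical sample records; the last one has no region recorded.
sample = [
    {"id": 1, "region": "urban"},
    {"id": 2, "region": "urban"},
    {"id": 3, "region": "rural"},
    {"id": 4},
]

proportions = group_proportions(sample, "region")
for group, share in sorted(proportions.items()):
    print(f"{group}: {share:.0%}")
```

Reporting the "unknown" share alongside the named groups is a deliberate choice here: a large unknown share is itself a limitation worth recording in the Composition section.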

Collection Process

How was the data collected? Was it gathered through surveys, web scraping, sensor readings, user interactions, or some other mechanism? Who collected it? Over what time period? Were the subjects aware their data was being collected? Did they provide consent?

Preprocessing and Cleaning

What preprocessing, cleaning, or transformation was applied to the raw data? Were any instances removed? Were any features derived or imputed? Understanding these decisions is critical because preprocessing choices can introduce biases or remove important variation.
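One way to capture these decisions as they happen is to route each preprocessing step through a small logging helper. The sketch below is illustrative only: the PreprocessingLog class, its apply method, and the sample rows are assumptions made for this example, not part of any published datasheet tooling.

```python
from dataclasses import dataclass, field


@dataclass
class PreprocessingLog:
    """Hypothetical helper that records each preprocessing step for the datasheet."""

    steps: list = field(default_factory=list)

    def apply(self, rows, description, rationale, transform):
        """Run `transform` on `rows` and record what it did and why."""
        result = transform(rows)
        self.steps.append({
            "description": description,
            "rationale": rationale,
            "rows_before": len(rows),
            "rows_after": len(result),
        })
        return result


# Illustrative data: one record has no age value.
rows = [{"age": 34}, {"age": None}, {"age": 51}]

log = PreprocessingLog()
cleaned = log.apply(
    rows,
    description="Drop records with missing age",
    rationale="Age is a required model feature",
    transform=lambda rs: [r for r in rs if r["age"] is not None],
)

for step in log.steps:
    removed = step["rows_before"] - step["rows_after"]
    print(f"{step['description']} ({step['rationale']}): removed {removed} rows")
```

The recorded steps can then be copied into the Preprocessing and Cleaning section, including how many instances each decision removed.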

Uses

What has the dataset been used for previously? What are the recommended use cases? Are there tasks or applications the dataset should not be used for? Understanding the intended and actual uses helps new users assess whether their planned use is appropriate.

Distribution

How is the dataset distributed? Is it publicly available or restricted? Are there licensing terms or usage restrictions? Is there an ongoing maintenance plan, or is the dataset a static snapshot?

Maintenance

Who is responsible for maintaining the dataset? Will it be updated over time? Is there a process for reporting errors or issues? How can users contact the maintainers?

Creating Datasheets in Practice

Integrate into Your Data Pipeline

The most effective approach is to create datasheets as part of your standard data collection and preparation process. When data is collected, document its provenance. When preprocessing is applied, record the decisions and rationale. This is far easier than trying to reconstruct this information after the fact.

Use Templates

Develop a standard template that your organisation uses consistently across all datasets. This ensures completeness and makes it easier for teams to compare and evaluate different datasets. Several open-source templates are available, and the original paper that proposed the framework includes a set of questions for each section that can serve as a starting point.
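A template can also be expressed in code so that incomplete datasheets are easy to spot. The following is a minimal sketch, assuming one free-text field for each of the seven sections described earlier; the Datasheet class and its missing_sections method are hypothetical names chosen for this example, and a real template would likely break each section into individual questions.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Hypothetical template with one free-text field per datasheet section."""

    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self):
        """Return the names of sections that are still empty, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Illustrative draft with only two sections filled in.
draft = Datasheet(
    motivation="Collected to evaluate a customer support chatbot.",
    composition="12,000 anonymised chat transcripts in English and Malay.",
)

missing = draft.missing_sections()
print("Sections still to complete:", ", ".join(missing))
```

A check like this could run before a dataset is approved for training, so that gaps are raised with the data collectors while the context is still available.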

Involve Data Collectors

The people closest to the data collection process are best positioned to document provenance, methodology, and known issues. Ensure they are involved in creating the datasheet, not just the data scientists who use the data later.

Document Known Biases

Be explicit about known biases and limitations. If the dataset overrepresents certain demographics, geographic regions, or time periods, state this clearly. In Southeast Asia, this is particularly important given the region's demographic diversity. A dataset collected primarily from urban Singapore may not represent rural Indonesian populations.
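A simple way to make overrepresentation concrete is to compare each group's share of the dataset with its share of the population the AI system is meant to serve. The sketch below assumes you already have both sets of shares; the function name representation_gaps, the 10-percentage-point tolerance, and all figures are invented for illustration and are not real statistics.

```python
def representation_gaps(dataset_shares, reference_shares, tolerance=0.10):
    """Return groups whose dataset share differs from the reference share
    by more than `tolerance`.

    Positive values mean the group is overrepresented in the dataset,
    negative values mean it is underrepresented.
    """
    gaps = {}
    for group, expected in reference_shares.items():
        observed = dataset_shares.get(group, 0.0)
        difference = observed - expected
        if abs(difference) > tolerance:
            gaps[group] = round(difference, 4)
    return gaps


# Invented figures for illustration only.
dataset_shares = {"urban": 0.85, "rural": 0.15}
reference_shares = {"urban": 0.55, "rural": 0.45}

gaps = representation_gaps(dataset_shares, reference_shares)
for group, gap in gaps.items():
    direction = "overrepresented" if gap > 0 else "underrepresented"
    print(f"{group}: {direction} by {abs(gap):.0%}")
```

The tolerance is a judgment call that depends on the stakes of the application; whatever threshold is chosen, the datasheet should state both the gaps found and the reference population they were measured against.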

Review and Update

Datasheets should be reviewed and updated when datasets are modified, when new limitations are discovered, or when the dataset is used in a new context that reveals previously unknown characteristics.
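One lightweight way to notice that a dataset has changed since its datasheet was written is to store a content fingerprint in the datasheet and compare it later. This is a minimal sketch using SHA-256 from Python's standard hashlib module; the function names and the sample CSV content are hypothetical, and a real pipeline would read the bytes from the dataset files.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset contents."""
    return hashlib.sha256(data).hexdigest()


def datasheet_is_stale(recorded_fingerprint: str, current_data: bytes) -> bool:
    """Return True if the data no longer matches the fingerprint in the datasheet."""
    return fingerprint(current_data) != recorded_fingerprint


# Illustrative dataset contents; the fingerprint is recorded when the datasheet is written.
original = b"id,region\n1,urban\n2,rural\n"
recorded = fingerprint(original)

# Later, a row is appended to the dataset.
modified = original + b"3,urban\n"

print("Unchanged data needs review:", datasheet_is_stale(recorded, original))
print("Modified data needs review:", datasheet_is_stale(recorded, modified))
```

A fingerprint only detects that the data changed, not whether the change matters, so a mismatch should prompt a human review of the datasheet and not an automatic update.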

Datasheets for Datasets in Southeast Asia

The practice is particularly valuable in Southeast Asia for several reasons. The region's linguistic diversity means that text datasets may not represent all languages equally. Economic disparities between urban and rural areas can create representation biases in data collected through digital channels. Cultural differences across ASEAN countries mean that datasets reflecting one country's norms may not transfer well to another.

Singapore's emphasis on responsible AI governance and data documentation aligns well with the datasheets framework. The ASEAN Guide on AI Governance and Ethics encourages transparency in data practices, which datasheets directly support. As data privacy regulations across the region mature, documentation of data provenance and consent will become increasingly important for compliance.

For organisations building AI systems that serve diverse Southeast Asian populations, datasheets provide a practical tool for ensuring that training data is appropriate, representative, and used responsibly.

Why It Matters for Business

Datasheets for Datasets address a root cause of AI failures that many organisations overlook: inadequate understanding of training data. Many AI quality problems, including bias, poor performance on specific populations, and unexpected failures, originate in the training data. Without proper data documentation, these problems often stay invisible until they affect customers or trigger regulatory scrutiny.

For CEOs, data documentation reduces business risk. It ensures your teams can demonstrate that AI systems were built on appropriate, well-understood data, which is increasingly a regulatory expectation across Southeast Asia. For CTOs, datasheets improve engineering quality by making data characteristics explicit, enabling better decisions about model design and deployment.

The investment is proportional to the risk: simple datasets need simple datasheets, while complex or sensitive datasets warrant more thorough documentation. In a region as diverse as Southeast Asia, where AI systems often need to serve populations across multiple countries, languages, and cultures, understanding your training data is not optional; it is essential for building AI that works for everyone.

Key Considerations
  • Create datasheets as part of your standard data collection process, when critical context is freshest, rather than retroactively.
  • Document known biases and representation gaps explicitly, particularly regarding the demographic diversity of Southeast Asian populations.
  • Include information about consent and collection methodology to support compliance with data privacy regulations across ASEAN jurisdictions.
  • Use a consistent template across your organisation so that teams can easily compare and evaluate different datasets.
  • Record all preprocessing and cleaning decisions, as these choices can introduce biases that affect downstream model fairness.
  • Update datasheets when datasets are modified, when new limitations are discovered, or when datasets are applied to new use cases.
  • Ensure datasheets are accessible to both technical and non-technical stakeholders, as data quality decisions affect business outcomes.

Frequently Asked Questions

What is the difference between a datasheet and a data dictionary?

A data dictionary describes the technical structure of a dataset, defining field names, data types, allowed values, and relationships between tables. A datasheet is broader and more contextual. It covers provenance, collection methodology, intended use, known biases, ethical considerations, and limitations. A data dictionary tells you what each column contains. A datasheet tells you whether the dataset is appropriate for your intended use, who is represented in it, and what risks it might introduce into your AI system.

Do we need datasheets for third-party datasets?

Yes. When you use third-party data to train AI systems, you inherit whatever biases and limitations that data contains. If the third-party provider does not supply a datasheet, you should create your own based on whatever information is available and document the gaps in your knowledge. This is particularly important for regulatory compliance, as data privacy regulations in Southeast Asia generally place obligations on the organisation using personal data, even when that data was obtained from a third party.

How detailed should a datasheet be?

The level of detail should be proportional to the risk associated with the dataset and the AI systems it supports. A dataset used to train a product recommendation engine might need a concise datasheet covering basic provenance and composition. A dataset used to train a credit scoring model that affects people's financial access should have comprehensive documentation covering every aspect of collection, composition, known biases, and limitations. Start with a standard template and adjust the depth based on the stakes involved.
