Opening AI the Data-centric Way: The Case for Data Institutions in Democratising the Design & Development of ML Datasets
Master of Public Policy Thesis, 2024 · Hertie School of Governance · Advisor: Prof. Joanna Bryson
Download Thesis (PDF) · Download Poster (PDF)
This thesis examines the role of dedicated Data Institutions (DIs) in the design and development of machine learning datasets, emphasising their potential to democratise AI. The analysis critiques the current open AI approach and its limitations, proposing that as an institutionalised form of commons-based data governance, DIs offer a structured democratic approach to data-centric AI policy-making.
The research proposes that integrating DIs into the ML value chain would leverage the strategic significance of datasets to create public value and insert a decentralised, community-accountable governance layer into the otherwise opaque ML development process. It sets out an agenda for DIs' collective governance mechanisms and fiduciary responsibilities to improve current practice in dataset development around data quality, usage, sustainability, and embedding of good governance practices.
Key contributions:
This research developed as a response to perceived limitations in debates around open source and its relationship to AI systems. The premise: there are more effective, more responsible, and less publicised policy approaches to achieve the fundamental objectives motivating the 'open' vs 'closed' AI debate than are currently considered.
I operationalise democratisation of AI as the establishment of legitimate mechanisms along the value chain for supporting:
This account of AI's democratisation starts from the belief that these outcomes can be at least partially achieved at an upstream technology-governance level.
From a political-theory perspective, AI democratisation is not solely about the free deployment of AI without regard for social consequence. Rather, democratisation entails collective decision-making power over how AI is to be developed and deployed.
The 'narrow' democratisation of open AI conflicts with broader democratic ideals: unstructured access to AI systems could hinder societies from restricting those uses they deem undesirable. Free access ≠ democratic control.
From a political-economy perspective, the liberatory potential of open AI is bounded by near-term resource constraints (compute, data, and human capital). The ability of platform incumbents operating across parallel digital markets to cross-leverage scale and data risks the instrumentation and capture of open AI discourses by entrenched actors.
Strategic âopen washingâ along ML value chains allows incumbents to benefit from open-source contributions while maintaining proprietary advantages over the most valuable resources.
Current policy approaches are rooted a priori in evaluating openness through a risk-benefit calculus over downstream use cases. This framing:
Policy-makers are currently legislating model capabilities, post-release conformity, risk categorisation (EU AI Act), and access regimes: all downstream interventions. While gradient-release processes and diversified terms-of-service represent progress, viewing openness primarily through the lens of access and associated risks lacks analytical purchase.
I propose that integrative, upstream policy interventions also have a role in AI governance. Only after disaggregating AI systems and evaluating each component by the stakeholders it affects and the values it encodes can the requisite policy nuance be found.
Whereas compute, algorithms, and weights remain access issues, data governance (how data is generated, collected, licensed, processed, owned, stewarded, and distributed) presents a variety of decision dimensions and policy positions for democratic considerations.
As a developing and much-contested area of digital policy, there are a range of overlapping and often-competing naming conventions and governance claims made by proponents of Data Institutions.
| DI Type | Proponent | Key Feature |
|---|---|---|
| Data Trusts | Data Trusts Initiative | Legal fiduciary structure |
| Data Collaboratives | The GovLab | Cross-sector data sharing |
| Data Intermediaries | UK Government (National Data Strategy) | Market-oriented brokerage |
| Data Spaces | EU Commission | Federated data infrastructure |
| Data Altruism Organisations | EU Commission | Non-profit data donation |
This plurality of naming devices reflects gradations of legal distinctions and policy emphases within the concept of DIs writ large. This diversity matters at the implementation level (in terms of legal justification and domain-fit design), but it has less to say regarding DIs' common institutional logic.
Below the apparent institutional variety, two concerns are common to all DI types:
This institutional logic structures a decentralised, community-accountable layer of decision-making over resource allocation.
DIs offer policy-makers the ability to build on this core institutional template to incorporate a wide range of policy and technical functions. This flexibility means DIs can be deployed to meet diverse data governance demands across divergent local conditions: from health data collaboratives to indigenous language preservation to environmental monitoring.
This research seeks to widen the scope of possible intervention points into AI governance. ML datasets and their collection present a strategic policy surface within otherwise opaque ML value chains, and DIs present an ideal template for their governance.
Data availability constrains modelling efforts towards tasks on which inference can be run. Datasets are where values are encoded, biases are introduced, and representativeness is determined. Unlike compute or algorithms, datasets are:
ML-relevant use cases for DIs extend beyond their well-theorised ability to redistribute value across imbalanced digital markets. Establishing dedicated DIs for ML datasets creates a new governance layer offering two-fold public value:
DIs integrate diverse voices into foundational value-encoding and direction-setting processes at the heart of AI development. This includes:
DIs serve as institutional platforms for contextually-appropriate downstream service offerings:
| Service | Benefit to Data Producers | Benefit to ML Developers |
|---|---|---|
| Quality assurance | Validation of data integrity | Higher-quality training data |
| Semantic interoperability | Standardised metadata | Easier data integration |
| Consent management | Respected data rights | Legal clarity |
| Documentation | Community-articulated context | Better model alignment |
| Financial sustainability | Fair compensation | Sustainable data supply |
This dual-sided utility helps policy-makers gain broad-based stakeholder buy-in while ensuring the DI layer has the overhead and resources to sustain a data commons capable of supporting future AI innovations.
DIs move to close the paradox of openness that remains unaddressed by the existing open AI debate: how to make data accessible for beneficial AI development while maintaining meaningful governance and accountability.
The thesis derives nine high-level principles for practitioners implementing Data Institutions, structured around the ML data lifecycle.
Data is not a homogeneous concept: it comes in different forms (copyrighted, anonymised, public-statistical, administrative, personal, industrial, research, transactional, health, digital media, etc.) governed by different, often overlapping, legal, contextual and custodial frameworks.
Implication: DIs should inhabit a world of small models, customised to community values as appropriate.
DIs should serve and represent a defined constituency, usually the data-generating community and impacted stakeholders (data subjects, domain experts, etc.). The most explicit fiduciary template is the data trust, but a range of permutations exist within the DI universe.
Implication: The appropriate governance template will be contextually-dependent on the data type and community needs.
This extends the concept of ML 'tasks' and 'task-communities' from purely technical affairs to encompass societal and policy considerations. Data availability constrains modelling efforts towards tasks on which inference can be run.
Implication: DIs should be structured around community-identified problems to support purpose-built datasets tailored to real-world tasks.
Example: Te Hiku Media's voice-recording drive for Māori language preservation: actively acquiring data representative of the population and task at hand through community opt-ins.
Human data labelling presents the primary bottleneck to quality data, and is where most labour abuses occur. DIs should provide sites of accountability and coordination for guaranteeing investment in quality data work.
Implication: Involve domain experts, community members, and impacted stakeholders. Where appropriate, design sustainable financing and standards for fair working conditions.
Examples: Wikimedia's Enterprise API, Mozilla's Common Voice, GIZ's FAIR Forward programme.
DIs should invest meaningful resources into documentation to improve understanding of datasets' contextual validity: how data was created, collected, processed, and annotated.
Implication: This supports informed AI accountability discussions, comprehension of model capabilities/limitations, collaborative data refinement, and identification of biases. Data-generating communities can self-articulate relevant contextual knowledge and downstream specifications.
Tools: Datasheets for datasets, data nutrition labels, EU AI Act transparency obligations, structured checklists.
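As an illustration of what such documentation could look like in machine-readable form, the sketch below models a minimal datasheet record a DI might publish alongside each dataset release. The field names loosely echo common datasheet prompts (motivation, collection, annotation, limitations) but the class, its fields, and the example values are illustrative assumptions, not a standard schema.

```python
# Hypothetical sketch of a machine-readable "datasheet for datasets" record.
# Field names and example values are illustrative, not a standard schema.
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Datasheet:
    name: str
    motivation: str                # why was the dataset created?
    collection_process: str        # how was the data acquired?
    annotation_process: str        # who labelled it, under what conditions?
    known_limitations: list[str] = field(default_factory=list)
    steward: str = ""              # the accountable Data Institution

    def to_json(self) -> str:
        """Serialise the datasheet for publication alongside the data."""
        return json.dumps(asdict(self), indent=2)


sheet = Datasheet(
    name="community-voice-corpus",
    motivation="Speech recognition for an under-served language community",
    collection_process="Opt-in recordings gathered at community events",
    annotation_process="Paid transcription by trained community members",
    known_limitations=["Skewed towards adult speakers"],
    steward="Example Data Institution",
)
print(sheet.to_json())
```

Publishing such a record alongside every release gives downstream developers a stable, queryable place to check provenance and limitations before training.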
DIs should incorporate quality assurance, validation tools, and semantic functionality into a dataset-as-a-service offering.
What to identify:
Additional services: Linked data integration, enriched semantic metadata, federated learning support, smart contracts for data integrity and traceability.
DIs should follow both the CARE Principles for Indigenous Data Governance and the FAIR principles for data reusability, with appropriate access regimes.
Access regime options:
Implication: DIs might offer multiple differentiated access modes to a single dataset, lowering barriers for non-commercial users while commercial fees subsidise operations.
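A minimal sketch of how differentiated access modes over a single dataset might be resolved is given below. The tier names, requester categories, and fee figure are assumptions invented for illustration; the point is only that one dataset can carry several access regimes, with commercial fees subsidising free tiers.

```python
# Illustrative sketch of differentiated access regimes for one dataset.
# Tier names, requester categories, and the fee are assumed examples.
from enum import Enum


class AccessTier(Enum):
    OPEN = "open"               # public access, attribution-only terms
    RESEARCH = "research"       # gated access under non-commercial terms
    COMMERCIAL = "commercial"   # fee-bearing licence subsidising operations


def resolve_access(requester: str) -> tuple[AccessTier, float]:
    """Map a requesting party to an access tier and annual fee (EUR)."""
    tiers = {
        "public": (AccessTier.OPEN, 0.0),
        "academic": (AccessTier.RESEARCH, 0.0),
        "company": (AccessTier.COMMERCIAL, 10_000.0),
    }
    try:
        return tiers[requester]
    except KeyError:
        raise ValueError(f"unknown requester type: {requester}")


tier, fee = resolve_access("company")
print(tier.value, fee)
```

In practice the mapping would be set by the DI's governance body rather than hard-coded, but the structure shows how a single dataset can serve non-commercial users at no cost while commercial licensing funds operations.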
Providing the resources (infrastructure, compute, and human capital) to design, develop, and manage quality datasets requires sustainable funding.
Funding options:
Implication: Structure incentive mechanisms and support for data producers and stewards.
DIs (especially those handling sensitive or safety-enhancing data) should consider incorporating technical enforcement mechanisms:
Additional measures: Legal counsel for mission-aligned compliance, data sharing frameworks accounting for jurisdictional differences, mechanisms for handling deprecated datasets.
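One simple technical enforcement mechanism of the kind gestured at above is a content-hash manifest: at release time the DI publishes a digest per record, and downstream users can verify that data has not been altered or silently swapped after deprecation decisions. The sketch below uses standard SHA-256 hashing; record IDs and contents are illustrative.

```python
# Hedged sketch: a content-hash manifest for data integrity and traceability.
# Record IDs and byte contents are illustrative placeholders.
import hashlib


def fingerprint(record: bytes) -> str:
    """SHA-256 digest of a single data record."""
    return hashlib.sha256(record).hexdigest()


def build_manifest(records: dict[str, bytes]) -> dict[str, str]:
    """Map record IDs to digests at release time."""
    return {rid: fingerprint(blob) for rid, blob in records.items()}


def verify(records: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return IDs of records that no longer match the published manifest."""
    return [
        rid for rid, blob in records.items()
        if manifest.get(rid) != fingerprint(blob)
    ]


release = {"rec-001": b"hello", "rec-002": b"world"}
manifest = build_manifest(release)
release["rec-002"] = b"tampered"   # simulate post-release alteration
print(verify(release, manifest))
```

More elaborate versions of the same idea (signed manifests, on-chain anchoring, smart contracts) trade simplicity for stronger guarantees, but the verification logic stays the same.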
Moving beyond the constraining terms of the open AI debate, this research proposes three key shifts in how we think about AI governance:
Data governance (how data is generated, collected, licensed, processed, owned, stewarded, and distributed) presents a variety of decision dimensions for democratic considerations. Unlike compute or model weights, data governance offers meaningful points for community participation and value alignment.
This shifts focus from asking 'how open should AI be?' to 'who participates in decisions about AI's foundational resources?'
Upstream interventions have a legitimate role in AI governance alongside downstream risk regulation. Rather than only regulating outputs (model capabilities, access regimes, post-release conformity), policy-makers can shape AI development through the data layer.
This complementsârather than replacesâexisting regulatory approaches like the EU AI Act.
Commons-based approaches can address the paradox of openness that remains unaddressed by current AI policy debates: how to make resources accessible for beneficial development while maintaining meaningful governance and accountability.
DIs demonstrate that openness and governance are not mutually exclusiveâstructured commons governance can support both innovation and accountability.
This research is not intended as the final word on DIs and ML datasets. It does not make empirical claims over the precise nature of the relationship between commons-based data governance and ML development practice. Rather, it flags the under-appreciated synergy between institutional theory and challenges faced in ML dataset curation.
Areas for further investigation:
This thesis draws on multiple research traditions:
For implemented examples and case studies:
| Initiative | Focus |
|---|---|
| Data Trusts Initiative | Legal frameworks and pilot implementations |
| GovLab Data Collaboratives | Cross-sector data sharing for public good |
| Open Data Institute | Bottom-up data institutions and policy guidance |
| Te Hiku Media | Indigenous data sovereignty in practice |
| Mozilla Common Voice | Community-driven voice dataset |