Opening AI the Data-centric Way: The Case for Data Institutions in Democratising the Design & Development of ML Datasets
Master of Public Policy Thesis, 2024 · Hertie School of Governance · Advisor: Prof. Joanna Bryson
Download Thesis (PDF) · Download Poster (PDF)
This thesis examines the role of dedicated Data Institutions (DIs) in the design and development of machine learning datasets, emphasising their potential to democratise AI. The analysis critiques the current open AI approach and its limitations, proposing that as an institutionalised form of commons-based data governance, DIs offer a structured democratic approach to data-centric AI policy-making.
The research proposes that integrating DIs into the ML value chain would leverage the strategic significance of datasets to create public value and insert a decentralised, community-accountable governance layer into the otherwise opaque ML development process. It sets out an agenda for DIs' collective governance mechanisms and fiduciary responsibilities to improve current practice in dataset development around data quality, usage, sustainability, and embedding of good governance practices.
Key contributions:
This research developed as a response to perceived limitations in debates around open source and its relationship to AI systems. The premise: there are more effective, more responsible, and less publicised policy approaches to achieve the fundamental objectives motivating the 'open' vs 'closed' AI debate than are currently considered.
I operationalise democratisation of AI as the establishment of legitimate mechanisms along the value chain for supporting:
This account of AI's democratisation starts from the belief that these outcomes can be at least partially achieved at an upstream technology-governance level.
From a political-theory perspective, AI democratisation is not solely about the free deployment of AI without regard for social consequence. Rather, democratisation entails collective decision-making power over how AI is to be developed and deployed.
The 'narrow' democratisation of open AI conflicts with broader democratic ideals: unstructured access to AI systems could hinder societies from restricting those uses they deem undesirable. Free access ≠ democratic control.
From a political-economy perspective, the liberatory potential of open AI is bounded by near-term resource constraints (compute, data, and human capital). The ability of platform incumbents operating across parallel digital markets to cross-leverage scale and data risks the instrumentation and capture of open AI discourses by entrenched actors.
Strategic âopen washingâ along ML value chains allows incumbents to benefit from open-source contributions while maintaining proprietary advantages over the most valuable resources.
Current policy approaches are rooted a priori in evaluating openness through a risk-benefit calculus over downstream use cases. This framing:
Policy-makers are currently legislating model capabilities, post-release conformity, risk categorisation (EU AI Act), and access regimes: all downstream interventions. While gradient-release processes and diversified terms-of-service represent progress, viewing openness primarily through the lens of access and associated risks lacks analytical purchase.
I propose that integrative, upstream policy interventions also have a role in AI governance. Only after disaggregating AI systems and evaluating each component by the stakeholders it affects and the values it encodes can the requisite policy nuance be found.
Whereas compute, algorithms, and weights remain access issues, data governance (how data is generated, collected, licensed, processed, owned, stewarded, and distributed) presents a variety of decision dimensions and policy positions for democratic considerations.
As a developing and much-contested area of digital policy, there are a range of overlapping and often-competing naming conventions and governance claims made by proponents of Data Institutions.
| DI Type | Proponent | Key Feature |
|---|---|---|
| Data Trusts | Data Trusts Initiative | Legal fiduciary structure |
| Data Collaboratives | The GovLab | Cross-sector data sharing |
| Data Intermediaries | UK Government (National Data Strategy) | Market-oriented brokerage |
| Data Spaces | EU Commission | Federated data infrastructure |
| Data Altruism Organisations | EU Commission | Non-profit data donation |
This plurality of naming devices reflects gradations of legal distinctions and policy emphases within the concept of DIs writ large. This diversity matters at the implementation level (in terms of legal justification and domain-fit design), but it has less to say regarding DIs' common institutional logic.
Below the apparent institutional variety, two concerns are common to all DI types:
This institutional logic structures a decentralised, community-accountable layer of decision-making over resource allocation.
DIs offer policy-makers the ability to build on this core institutional template to incorporate a wide range of policy and technical functions. This flexibility means DIs can be deployed to meet diverse data governance demands across divergent local conditions: from health data collaboratives to indigenous language preservation to environmental monitoring.
This research seeks to widen the scope of possible intervention points into AI governance. ML datasets and their collection present a strategic policy surface within otherwise opaque ML value chains, and DIs present an ideal template for their governance.
Data availability constrains modelling efforts towards tasks on which inference can be run. Datasets are where values are encoded, biases are introduced, and representativeness is determined. Unlike compute or algorithms, datasets are:
ML-relevant use cases for DIs extend beyond their well-theorised ability to redistribute value across imbalanced digital markets. Establishing dedicated DIs for ML datasets creates a new governance layer offering two-fold public value:
DIs integrate diverse voices into foundational value-encoding and direction-setting processes at the heart of AI development. This includes:
DIs serve as institutional platforms for contextually-appropriate downstream service offerings:
| Service | Benefit to Data Producers | Benefit to ML Developers |
|---|---|---|
| Quality assurance | Validation of data integrity | Higher-quality training data |
| Semantic interoperability | Standardised metadata | Easier data integration |
| Consent management | Respected data rights | Legal clarity |
| Documentation | Community-articulated context | Better model alignment |
| Financial sustainability | Fair compensation | Sustainable data supply |
This dual-sided utility helps policy-makers gain broad-based stakeholder buy-in while ensuring the DI layer has the overhead and resources to sustain a data commons capable of supporting future AI innovations.
DIs move to close the paradox of openness that remains unaddressed by the existing open AI debate: how to make data accessible for beneficial AI development while maintaining meaningful governance and accountability.
The thesis derives nine high-level principles for practitioners implementing Data Institutions, structured around the ML data lifecycle.
Data is not a homogeneous concept: it comes in different forms (copyrighted, anonymised, public-statistical, administrative, personal, industrial, research, transactional, health, digital media, etc.) governed by different, often overlapping, legal, contextual and custodial frameworks.
Implication: DIs should inhabit a world of small models, customised to community values as appropriate.
DIs should serve and represent a defined constituency, usually the data-generating community and impacted stakeholders (data subjects, domain experts, etc.). The most explicit fiduciary template is the data trust, but a range of permutations exist within the DI universe.
Implication: The appropriate governance template will be contextually-dependent on the data type and community needs.
This extends the concept of ML 'tasks' and 'task-communities' from purely technical affairs to encompass societal and policy considerations. Data availability constrains modelling efforts towards tasks on which inference can be run.
Implication: DIs should be structured around community-identified problems to support purpose-built datasets tailored to real-world tasks.
Example: Te Hiku Media's voice-recording drive for Māori language preservation: actively acquiring data representative of the population and task at hand through community opt-ins.
Human data labelling presents the primary bottleneck to quality data, and is where most labour abuses occur. DIs should provide sites of accountability and coordination for guaranteeing investment in quality data work.
Implication: Involve domain experts, community members, and impacted stakeholders. Where appropriate, design sustainable financing and standards for fair working conditions.
Examples: Wikimedia's Enterprise API, Mozilla's Common Voice, GIZ's FAIR Forward programme.
DIs should invest meaningful resources into documentation to improve understanding of datasets' contextual validity: how data was created, collected, processed, and annotated.
Implication: This supports informed AI accountability discussions, comprehension of model capabilities/limitations, collaborative data refinement, and identification of biases. Data-generating communities can self-articulate relevant contextual knowledge and downstream specifications.
Tools: Datasheets for datasets, data nutrition labels, EU AI Act transparency obligations, structured checklists.
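As an illustration of what such documentation could look like in machine-readable form, the sketch below models a minimal datasheet record a DI might publish alongside each dataset release. The field names loosely echo common datasheet prompts (motivation, collection, annotation, limitations) but the class, its fields, and the example values are illustrative assumptions, not a standard schema.

```python
# Hypothetical sketch of a machine-readable "datasheet for datasets" record.
# Field names and example values are illustrative, not a standard schema.
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Datasheet:
    name: str
    motivation: str                # why was the dataset created?
    collection_process: str        # how was the data acquired?
    annotation_process: str        # who labelled it, under what conditions?
    known_limitations: list[str] = field(default_factory=list)
    steward: str = ""              # the accountable Data Institution

    def to_json(self) -> str:
        """Serialise the datasheet for publication alongside the data."""
        return json.dumps(asdict(self), indent=2)


sheet = Datasheet(
    name="community-voice-corpus",
    motivation="Speech recognition for an under-served language community",
    collection_process="Opt-in recordings gathered at community events",
    annotation_process="Paid transcription by trained community members",
    known_limitations=["Skewed towards adult speakers"],
    steward="Example Data Institution",
)
print(sheet.to_json())
```

Publishing such a record alongside every release gives downstream developers a stable, queryable place to check provenance and limitations before training.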
DIs should incorporate quality assurance, validation tools, and semantic functionality into a dataset-as-a-service offering.
What to identify:
Additional services: Linked data integration, enriched semantic metadata, federated learning support, smart contracts for data integrity and traceability.
DIs should follow both the CARE Principles for Indigenous Data Governance and the FAIR principles for data reusability, with appropriate access regimes.
Access regime options:
Implication: DIs might offer multiple differentiated access modes to a single dataset, lowering barriers for non-commercial users while commercial fees subsidise operations.
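A minimal sketch of how differentiated access modes over a single dataset might be resolved is given below. The tier names, requester categories, and fee figure are assumptions invented for illustration; the point is only that one dataset can carry several access regimes, with commercial fees subsidising free tiers.

```python
# Illustrative sketch of differentiated access regimes for one dataset.
# Tier names, requester categories, and the fee are assumed examples.
from enum import Enum


class AccessTier(Enum):
    OPEN = "open"               # public access, attribution-only terms
    RESEARCH = "research"       # gated access under non-commercial terms
    COMMERCIAL = "commercial"   # fee-bearing licence subsidising operations


def resolve_access(requester: str) -> tuple[AccessTier, float]:
    """Map a requesting party to an access tier and annual fee (EUR)."""
    tiers = {
        "public": (AccessTier.OPEN, 0.0),
        "academic": (AccessTier.RESEARCH, 0.0),
        "company": (AccessTier.COMMERCIAL, 10_000.0),
    }
    try:
        return tiers[requester]
    except KeyError:
        raise ValueError(f"unknown requester type: {requester}")


tier, fee = resolve_access("company")
print(tier.value, fee)
```

In practice the mapping would be set by the DI's governance body rather than hard-coded, but the structure shows how a single dataset can serve non-commercial users at no cost while commercial licensing funds operations.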
Providing the resources (infrastructure, compute, and human capital) to design, develop, and manage quality datasets requires sustainable funding.
Funding options:
Implication: Structure incentive mechanisms and support for data producers and stewards.
DIs (especially those handling sensitive or safety-enhancing data) should consider incorporating technical enforcement mechanisms:
Additional measures: Legal counsel for mission-aligned compliance, data sharing frameworks accounting for jurisdictional differences, mechanisms for handling deprecated datasets.
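One simple technical enforcement mechanism of the kind gestured at above is a content-hash manifest: at release time the DI publishes a digest per record, and downstream users can verify that data has not been altered or silently swapped after deprecation decisions. The sketch below uses standard SHA-256 hashing; record IDs and contents are illustrative.

```python
# Hedged sketch: a content-hash manifest for data integrity and traceability.
# Record IDs and byte contents are illustrative placeholders.
import hashlib


def fingerprint(record: bytes) -> str:
    """SHA-256 digest of a single data record."""
    return hashlib.sha256(record).hexdigest()


def build_manifest(records: dict[str, bytes]) -> dict[str, str]:
    """Map record IDs to digests at release time."""
    return {rid: fingerprint(blob) for rid, blob in records.items()}


def verify(records: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return IDs of records that no longer match the published manifest."""
    return [
        rid for rid, blob in records.items()
        if manifest.get(rid) != fingerprint(blob)
    ]


release = {"rec-001": b"hello", "rec-002": b"world"}
manifest = build_manifest(release)
release["rec-002"] = b"tampered"   # simulate post-release alteration
print(verify(release, manifest))
```

More elaborate versions of the same idea (signed manifests, on-chain anchoring, smart contracts) trade simplicity for stronger guarantees, but the verification logic stays the same.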
Moving beyond the constraining terms of the open AI debate, this research proposes three key shifts in how we think about AI governance:
Data governance (how data is generated, collected, licensed, processed, owned, stewarded, and distributed) presents a variety of decision dimensions for democratic considerations. Unlike compute or model weights, data governance offers meaningful points for community participation and value alignment.
This shifts focus from asking 'how open should AI be?' to 'who participates in decisions about AI's foundational resources?'
Upstream interventions have a legitimate role in AI governance alongside downstream risk regulation. Rather than only regulating outputs (model capabilities, access regimes, post-release conformity), policy-makers can shape AI development through the data layer.
This complementsârather than replacesâexisting regulatory approaches like the EU AI Act.
Commons-based approaches can address the paradox of openness that remains unaddressed by current AI policy debates: how to make resources accessible for beneficial development while maintaining meaningful governance and accountability.
DIs demonstrate that openness and governance are not mutually exclusiveâstructured commons governance can support both innovation and accountability.
This research is not intended as the final word on DIs and ML datasets. It does not make empirical claims over the precise nature of the relationship between commons-based data governance and ML development practice. Rather, it flags the under-appreciated synergy between institutional theory and challenges faced in ML dataset curation.
Areas for further investigation:
This thesis draws on multiple research traditions:
For implemented examples and case studies:
| Initiative | Focus |
|---|---|
| Data Trusts Initiative | Legal frameworks and pilot implementations |
| GovLab Data Collaboratives | Cross-sector data sharing for public good |
| Open Data Institute | Bottom-up data institutions and policy guidance |
| Te Hiku Media | Indigenous data sovereignty in practice |
| Mozilla Common Voice | Community-driven voice dataset |