Thoughts based on convo with Jonny
Problem
Data access, not programming skill, is the constraint on open source AI development in the UK. More people can build and adapt AI models than can access the right training data. The UK has high-quality data assets: NHS patient records, UKRI-funded research, one of the largest genomic databases in the world, an internationally dominant legal system. Most of these are practically inaccessible despite being technically open. The result is that the UK's data advantages are largely untapped, and the open source AI community builds on whatever happens to be accessible rather than on what would be most valuable.
There are three related problems here. First, data that is open in principle is not open in practice: it is scattered, unindexed, and in incompatible formats, and there is no initiative to fix this. Second, data that should be open is locked behind commercial publishers, even when it was created with public money. Third, the UK's most valuable data (NHS, legal, sensitive research) cannot realistically be centralised for legal, commercial, or governance reasons, so any strategy that depends on aggregation alone will hit a ceiling well before it reaches the highest-value use cases.
UK Scientific Data Platform
A discovery and access layer for data that is already technically open but practically inaccessible. This is primarily UKRI-funded research data and other publicly funded outputs, scattered across many repositories and research archives in incompatible formats with no unified point of access.
The UK becomes the host of an aggregated, searchable, interoperable platform. No new data is purchased. The platform does not create data; it makes existing open data actually usable. In effect, this is UK Biobank's access infrastructure, applied more broadly.
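To make "searchable and interoperable" concrete, here is a minimal sketch of what a shared metadata record and a naive search over it might look like. Everything in it is an assumption for illustration: the `DatasetRecord` fields, the field aliases tried in `normalise`, and the `search` function are hypothetical and do not describe any existing UKRI schema or repository API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """One entry in a hypothetical unified index (field names are assumptions)."""
    title: str
    repository: str               # where the data actually lives
    licence: str                  # normalised licence label, "unknown" if absent
    url: str
    keywords: tuple[str, ...] = ()


def normalise(raw: dict, repository: str) -> DatasetRecord:
    """Map one repository's metadata onto the shared schema.

    Repositories tend to name the same field differently, so a few
    illustrative aliases are tried for each one.
    """
    def first(*names: str, default: str = "") -> str:
        for name in names:
            if raw.get(name):
                return str(raw[name])
        return default

    keywords = raw.get("keywords") or raw.get("tags") or []
    return DatasetRecord(
        title=first("title", "name"),
        repository=repository,
        licence=first("licence", "license", "rights", default="unknown"),
        url=first("url", "landing_page"),
        keywords=tuple(k.lower() for k in keywords),
    )


def search(index: list[DatasetRecord], term: str) -> list[DatasetRecord]:
    """Case-insensitive match: substring of the title, or an exact keyword."""
    term = term.lower()
    return [r for r in index if term in r.title.lower() or term in r.keywords]
```

The point of the sketch is that most of the work sits in `normalise`: agreeing the shared fields and mapping each repository onto them, rather than in the search itself.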
You could feasibly do this in four steps:
- A discovery layer: a single, unified index of all UKRI-funded research data with standardised metadata, consistent licensing information, and a searchable interface. This is low-cost and largely a coordination problem rather than a technical one. UKRI already has open data mandates; what is missing is the infrastructure that makes these meaningfully useful.
- An access layer with standardised APIs, allowing AI developers to pull datasets programmatically rather than navigating individual institutional portals. This requires interoperability standards across institutions, which is harder but achievable.
- A federated execution layer for data that cannot be centralised. For a large share of the UK's most valuable data (NHS records, sensitive research cohorts, commercially constrained legal material) aggregation is not legally or politically available, and will not become available on any useful timescale. Federated learning lets AI models be trained across institutions without the underlying data ever leaving its host environment. The compute goes to the data rather than the data coming to a central platform. Mature open source tooling for this already exists, and it aligns with existing UK governance and GDPR constraints rather than fighting them. This positions the UK as a credible leader in trusted, privacy-preserving AI, rather than competing purely on dataset scale.
- Priority domain expansion starting with the sectors where the UK has the strongest comparative advantage and the clearest demand (health, legal, public services). Each domain probably has its own requirements and would have to be developed separately.
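The federated execution step is the least intuitive of the four, so here is a toy sketch of the core idea (federated averaging): each site trains on its own records and only the model parameters are shared and combined. This is an illustration under stated assumptions, not a production design. The function names, the two-parameter linear model, and the learning rate are all hypothetical, and it uses none of the existing open source tooling; a real deployment would also need secure aggregation, access control, and audit.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Gradient descent on one site's private (x, y) pairs.

    Only the updated weights and the record count are returned;
    the records themselves never leave this function.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in data) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(updates):
    """Combine site models, weighting each by its number of records."""
    total = sum(n for _, n in updates)
    w = sum(site_w[0] * n for site_w, n in updates) / total
    b = sum(site_w[1] * n for site_w, n in updates) / total
    return (w, b)


def train(sites, rounds=50):
    """Coordinator loop: send the model out, collect updates, average them."""
    model = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(model, data) for data in sites]
        model = federated_average(updates)
    return model


# Two hypothetical sites holding disjoint samples of y = 2x + 1.
site_a = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
site_b = [(1.5, 4.0), (2.0, 5.0)]
w, b = train([site_a, site_b])
print(f"learned w={w:.3f}, b={b:.3f}")
```

The coordinator only ever sees `(w, b)` pairs and record counts, which is the property that lets this pattern sit alongside existing governance rather than requiring data to be moved.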
Pretty confident in this because:
- UK Biobank is the proof of concept: one of the world's largest genomic and health databases, already operating with a sophisticated access governance model. It has attracted global research because the data quality and access infrastructure are excellent. The scientific data platform is essentially UK Biobank, but for the rest of publicly funded research.
- UKRI mandates already require open data from funded research. The gap is that data is in scattered, inaccessible forms. No new mandates are needed in terms of publishing/copyright.
- English is the dominant language of global research, which means a UK-hosted platform for English-language academic data has global relevance and attracts international contribution, increasing the network effect without the UK bearing the full cost.
- International researchers and companies deposit data in UK Biobank precisely because they trust UK data governance. This reputation makes a UK-hosted platform more attractive than alternatives.
Challenges