The Plan — Pseudopeople

Census Data

The United States Census Bureau has two equally important tasks. The first is to collect and report data about the country: its residents and their lives. The second is to protect the privacy of those residents while doing so. Traditionally, the Bureau has reconciled these two tasks by storing a central (highly protected) copy of the raw data and publishing only aggregated summaries that do not contain any individual data.

Over the last few years (and decades), increases in computing power have made it more and more clear that this may not be enough: it may be possible to recombine those aggregate, public data releases and reconstruct some of the private data behind them. In response, the Census Bureau has switched to more rigorous approaches to data anonymization, which work to prevent these kinds of attacks. The issue is that these approaches also make it harder for researchers who rely on census data releases to do their work and to recombine datasets for beneficial purposes.
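To make the reconstruction risk concrete, here is a minimal sketch in Python. It is a toy illustration only, not Census Bureau data and not any published attack: the three-person block, the published statistics (mean, median, oldest resident), the age cap `MAX_AGE`, and the function name `consistent_age_sets` are all assumptions invented for this example. The function lists every set of ages consistent with the published aggregates; the fewer candidates that remain, the closer the aggregates come to revealing the underlying records.

```python
from itertools import combinations_with_replacement
from statistics import median

MAX_AGE = 100  # assumed upper bound on age for this toy example


def consistent_age_sets(count, mean_age, median_age, oldest=None):
    """List every sorted tuple of ages that matches the published aggregates.

    All arguments are hypothetical published statistics for one small block.
    `mean_age` is assumed to be exact, so it is checked via the total.
    """
    matches = []
    # Each candidate is a non-decreasing tuple of `count` ages in 0..MAX_AGE.
    for ages in combinations_with_replacement(range(MAX_AGE + 1), count):
        if sum(ages) != mean_age * count:
            continue
        if median(ages) != median_age:
            continue
        if oldest is not None and ages[-1] != oldest:
            continue
        matches.append(ages)
    return matches


# Three residents, mean age 30, median age 38: several datasets still fit.
loose = consistent_age_sets(count=3, mean_age=30, median_age=38)
print(len(loose))

# Publishing one more statistic (the oldest resident is 44) narrows it further.
tight = consistent_age_sets(count=3, mean_age=30, median_age=38, oldest=44)
print(tight)
```

In this toy case the first call leaves 15 candidate datasets, and the second leaves exactly one: (8, 38, 44). Each extra published statistic shrinks the set of datasets that could have produced the release, which is why many overlapping aggregate tables can, taken together, pin down individual records.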

The way forward is to develop new processes for recombining data that do not risk revealing private information. But developing those techniques requires, of course, access to the confidential data, which is precisely what the Census Bureau is trying to avoid.

Simulated Data

The solution is simulated data: entirely artificial data that mimics confidential census data as closely as possible but is, crucially, not based on real people. With a sufficiently accurate simulated dataset, new approaches to recombining datasets can be developed and evaluated without requiring access to the real data. That’s where pseudopeople comes in.

Relational Governance

Any dataset involves, directly or indirectly, many players. There are:

The data subjects: the people affected by the dataset;
The data creators: the people developing the dataset; and
The data users: the third-party researchers or developers who hope to use a dataset.

When data causes harm to subjects, it can often be hard to ensure accountability. This is particularly the case when harm is caused by data users: creators often treat users as morally distant from themselves. The approach we want to take is different. It is built on bringing people together: on ensuring that subjects, creators, and users are in mutual relationships of accountability and responsibility, so that there is both a will and a way to address data subjects’ concerns about the reuse of this data.

To do this, we propose that data access be overseen by a committee of people representing data subjects, from civil society organizations to experts in particular domains. The committee will be structured so that when concerns arise, whether from its members or from anyone else, users (and developers) address them through an ongoing and accountable process. Further, we are designing mechanisms that encourage the formation of relationships between parties, allowing for stronger accountability and systems of feedback.