The Data — Pseudopeople

The pseudopeople software can generate several different datasets: not only a synthetic copy of the United States Decennial Census, but also Social Security data, tax data, and synthetic copies of various Census-run surveys. This lets researchers test approaches to linking datasets without requiring access to real (and sensitive) data.

The synthetic data is generated by taking existing public datasets and using them to simulate a “virtual” United States. This simulation expands the original data to fit the size and dynamics of the U.S. Census.

The resulting datasets have the same structure as the “real thing”, featuring names, addresses, ages and more, but without the same privacy concerns, since none of the people are real. You can read more about the range of datasets available, and what data each one contains, in the pseudopeople documentation.
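
To make the linking use case more concrete, here is a minimal sketch of the kind of approach a researcher might test on such data. It does not call pseudopeople itself: the records, the field names (`first_name`, `last_name`, `age`, `zipcode`), the weights and the threshold are all hypothetical, hand-written for illustration, and the scoring rule is just one simple choice among many.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written records for illustration only.
# They are NOT pseudopeople output; the field names are assumptions.
census = [
    {"id": "c1", "first_name": "Maria", "last_name": "Lopez", "age": 34, "zipcode": "98101"},
    {"id": "c2", "first_name": "John", "last_name": "Smith", "age": 58, "zipcode": "98052"},
]
tax = [
    {"id": "t1", "first_name": "Jon", "last_name": "Smith", "age": 58, "zipcode": "98052"},
    {"id": "t2", "first_name": "Maria", "last_name": "Lopes", "age": 35, "zipcode": "98101"},
    {"id": "t3", "first_name": "Wei", "last_name": "Chen", "age": 41, "zipcode": "98101"},
]


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(r1, r2):
    """Combine name, age and ZIP code agreement into one score (weights are arbitrary)."""
    name = (similarity(r1["first_name"], r2["first_name"])
            + similarity(r1["last_name"], r2["last_name"])) / 2
    age = 0.2 if abs(r1["age"] - r2["age"]) <= 1 else 0.0
    zipcode = 0.2 if r1["zipcode"] == r2["zipcode"] else 0.0
    return 0.6 * name + age + zipcode


def link(left, right, threshold=0.8):
    """Map each left record to its best-scoring right record, if the score clears the threshold."""
    links = {}
    for rec in left:
        best = max(right, key=lambda cand: score(rec, cand))
        if score(rec, best) >= threshold:
            links[rec["id"]] = best["id"]
    return links


links = link(census, tax)
print(links)
```

Because the two record sets disagree on spelling and age in small ways, exact matching would find no links here, while the fuzzy score still pairs each census record with its tax counterpart. With synthetic data, the true pairing is known, so a method like this can be evaluated without touching real records.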