Datafaker Overview

Datafaker provides a way to create a fake version of sensitive data.

Datafaker’s workflow permits Information Governance oversight by design.

Background

Conceptually, any faking of sensitive data achieves privacy through one or both of two mechanisms: Reducing the real data down to summaries, and obfuscating the data by adding noise.

After these privacy-preserving steps, reproducing a fake version of the full steps involves adding noise to the summary data to turn it back into data with the required structure and volume. We can call these stages Reduce, Obfuscate, Repopulate.

        block-beta
columns 1
Source["Sensitive Data"]
block:Arrow1
    ArrowNote1(["Remove Personally\nIdentifiable Data\n/ summarize / other removal"])
    Summary<["Reduce"]>(down)
end
block:Intermediate1
    ReducedNote(["This data is less sensitive"])
    Reduced["Reduced data / summaries"]
end
block:Arrow2
    ArrowNote2(["Adding noise"])
    Obfuscation<["Obfuscate"]>(down)
end
block:Intermediate2
    ObfuscatedNote(["This data is even\nless sensitive"])
    Obfuscated("Obfuscated\nReduced data")
end
block:Arrow3
    ArrowNote3(["Replacing removed data\nbased on summaries"])
    Repopulation<["Repopulate"]>(down)
end
Destination["Synthetic Data"]
classDef nobox fill:#fff,stroke-width:0px;
class Intermediate1 nobox
class Intermediate2 nobox
class Arrow1 nobox
class Arrow2 nobox
class Arrow3 nobox
classDef note fill:#18c;
class ArrowNote1 note
class ArrowNote2 note
class ArrowNote3 note
class ReducedNote note
class ObfuscatedNote note
    

In this diagram, Reduce and Obfuscate take place in a private environment such as a Trusted Research Environment or some machine on a private network from which the sensitive data can be read. Information Governance oversight happens at the end of this process to decide if this obfucated summary can be released out of the private network.

The Repopulate process happens outside of the private network and outside of the remit of Information Governance. This is important! The act of turning reduced obfucated data into a larger quantity of randomized data is not a privacy-sheilding process! By this time all the actual privacy-sheilding has already happened and the Information Governance has already been applied.

Reduction techniques

Here are a few techniques that can be used to preserve privacy through data reduction. Most Reduction techniques have a corresponding Repopulation technique to add the missing data back into the synthetic data.

Reduction Techniques

Reduction technique

Description

Repopulation technique

Anonymization

Removing Personally Identifiable Information (remembering what sort of information was removed and its format)

Generation of Names/Addresses etcetera

Removal of unnecessary data

Some data does not need to be supplied and so can be removed completely.

None.

Summarization

Extract the means, standard deviations and size of numeric data, the frequencies of categorical data, or other summary measures.

Pick the new data from a suitable random distribution.

Low number suppression

Removal of rare data, which could be used to infer the presence of certain individuals in the data set.

None.

Grouping

Gathering data into clusters and extracting summary data from each cluster.

Picking new data by picking a cluster then picking from the appropriate random distribution.

Each of these extracts less than the full information from the source, then most of them replace the information left behind with generated information (perhaps after an Obfuscation step).

Differential Privacy

One actual concrete method of faking data is to apply a process such as Differential Privacy to the real data without summarizing. In such a case Information Governance cannot be expected to examine the data for privacy breaches as there is the same amount of data to examine as in the real database, and breaches are likely to be more subtle than real names being produced; in this case Information Governance would have to understand and trust the obfucation process.

In the US, The National Institute for Science and Technology has some guidelines about the use of Differential Privacy but even that is at a loss to explain how to understand its main parameter:

Selecting privacy loss parameters, such as ε, is challenging, and we offer no specific guidelines on their selection.

So Differential Privacy can be useful, but it is very hard to quantify exactly how much good it is doing to preserve privacy.

Differential Privacy is an Obfuscation technique rather than a Reduction technique because the same amount of data is output as was input.

Datafaker’s Operation

Datafaker is based on the Alan Turing Institute’s SqlSynthGen tool. SqlSynthGen implements all three operations of Reduce, Obfuscate and Repopulate.

Datafaker builds on SqlSynthGen by automating the specification of the Reduce operation and making the dataflow clearer for Information Governance purposes (amongst other improvements). Datafaker still retains SqlSynthGen’s Obfuscation by Differential Privacy functionality, but this has not yet been folded into the automation and will not be described further here: we are relying on the Reduce operation to provide privacy shielding.

Datafaker has a number of phases of operation; below we describe how these map to the Reduce, Obfuscate and Repopulate stages described above.

        block-beta
columns 1
Source["Sensitive Data"]
block:Processing
    columns 4
    Arrow1<["Extract\nstructure"]>(down)
    Arrow2<["User guided\nconfiguration"]>(down)
    Arrow3<["Extract\nsummary"]>(down)
    Arrow4<["Extract\nvocabulary"]>(down)
    Orm["orm.yaml\nstructure\ndefinition"]
    Config["config.yaml\nconfiguration\nof summarization"]
    SrcStats["src-stats.yaml\nSummary\nstatistics"]
    Vocab["Vocabulary\ntable data"]
end
Arrow5<["Information Governance\npermits release"]>(down)
block:Release
    columns 1
    Arrow6<["Repopulation"]>(down)
    Destination["Synthetic Data"]
end
classDef nobox fill:#fff,stroke-width:0px;
class Processing nobox
classDef publicnet fill:#18c;
class Release publicnet
    

Datafaker make-tables phase

datafaker make-tables makes a file called orm.yaml that describes the structure of the source database. This is part of the Reduce phase, but this file is used in every other Datafaker phase. By describing the structure of the database, no private data is leaked. However it is not impossible that, in describing the structure of some commercial database, some commercially-sensitive information could be leaked. In such a case, the file is alterable by hand as long as the YAML structure is maintained.

Datafaker configuration phase

These commands are not really part of the Reduce phase, but allow the user to configure what the Reduce phase will entail (and hence also what the Repopulate phase will entail).

  • datafaker configure-tables makes a file called config.yaml that describes what needs to happen to each table.

  • datafaker configure-generators amends config.yaml with information on what happens to each column.

  • datafaker configure-missingness optionally amends config.yaml with simple summary missingness information,

Additional configuration can be applied by hand at this point to allow more sophisticated Repopulation to happen than the automated generation can manage.

config.yaml contains data that has been set by a user that has seen the contents of the database, but should not contain any sensitive data as Datafaker has not written any of the data (or summaries thereof) into the file.

config.yaml also contains human-readable descriptions of the summaries that will be produced, but these will be copied into the summary file produced in the following stage, next to the actual summary data.

Datafaker summary stats phase

datafaker make-stats makes a file called src-stats.yaml. This contains the summary data from the source database.

Information Governance should be focused on ensuring that private data is not leaking out in this file.

Obfuscation happens in this phase if it is configured; therefore this command represents most of the Reduce and all of the Obfuscate phase.

Once these files, orm.yaml, config.yaml and make-stats.yaml have been examined and approved for release, we can move to the next phase.

Datafaker make-vocab phase

datafaker make-vocab makes a whole set of .yaml or .yaml.gz files, each of which represents the entire contents of one table in the source database.

The idea is that some tables will never contain sensitive data, they might describe, for example, all the care centres that are referenced in the database, or all the set of all possible deseases that might be diagnosed (not who has such a diagnosis or even if anybody has). Such tables are referred to as Vocabulary tables.

Which tables are represented is configured by the configure-tables command, so the user must be careful not to configure any tables containing sensitive data as Vocabulary tables.

Releasing the intermediate data

These files – orm.yaml, config.yaml, the vocabulary files and especially the src-stats.yaml file – now need to have Information Governance processes applied to them as it is these files that can be extracted from the private network or Trusted Research Environment to allow the construction of the synthetic data in a less sensitive computing environment, if required.

The sensitive database is no longer required in Datafaker’s operation.

Datafaker Repopulate phase

  • datafaker create-tables creates the structure of the destination database to match (as much as is requested) the structure of the source database

  • datafaker create-generators creates Python code files that will actually generate the data (this phase might be removed in a future version of Datafaker)

  • datafaker create-data writes fake data into the destination database.

As these operations require no access to the sensitive data, this phase can be distributed to anybody that wants to make their own fake data in their own database should we want to allow that. This could allow them to make larger or smaller similar databases, or tweak the generated data with the results of their own simulations in order to test their own analyses.