# Advanced Example: OMOP Health Data
The OMOP common data model (CDM) is a widely used format for storing health data. Here we will show how datafaker can be configured to generate data for OMOP.
Before getting into the config itself, we need to discuss a few peculiarities of the OMOP CDM that have to be taken into account:
- Some versions of OMOP contain a circular foreign key, for instance between the `vocabulary`, `concept` and `domain` tables.
- There are several standardized vocabulary tables (`concept`, `concept_relationship`, etc.). These should be marked as such using `configure-tables`, and will be exported to `.yaml` files during the `make-vocab` step. However, some of these vocabulary tables may be too large to be practically written to `.yaml` files, and will need to be dealt with manually. You should also check the license agreement of each standardized vocabulary before sharing any of the `.yaml` files.
## Dealing with Circular Foreign Keys
Datafaker will warn if a schema has circular foreign keys, as these can indicate an error in the schema. (By a circular reference we mean that a table has a foreign key to itself, or to another table that has a foreign key back to the first table, or some longer loop.) Datafaker cannot cope with circular keys between generated tables: it needs to generate data in referenced tables before it can generate references to them, and a circular dependency implies that there is no sensible order in which to do this. However, datafaker can cope with circular foreign keys between vocabulary (or ignored) tables without user intervention.
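To see why no sensible generation order exists once there is a cycle, consider a toy topological sort over foreign-key dependencies. This is an illustration only, not part of datafaker; the cycle shown is the `vocabulary`/`concept`/`domain` one mentioned above, with simplified dependencies:

```python
from graphlib import TopologicalSorter, CycleError

# Map each table to the tables its foreign keys reference
# (simplified; real OMOP tables have more dependencies).
foreign_keys = {
    "location": set(),
    "person": {"location"},
    "vocabulary": {"concept"},
    "concept": {"domain"},
    "domain": {"vocabulary"},
}

try:
    # A valid generation order fills referenced tables first.
    print(list(TopologicalSorter(foreign_keys).static_order()))
except CycleError as err:
    # The loop means every table in it needs another to exist first.
    print("no valid order; cycle:", err.args[1])
```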
## Vocabulary Tables
The OMOP schema has many vocabulary tables. Using the `configure-tables` command, tables can be marked as vocabulary tables; this means that no generators are run for the destination table, which instead gets all its data copied directly from the source database’s table via a `.yaml` file. This mechanism can cope with tables of tens or hundreds of thousands of rows without too much trouble.
Datafaker can be made to cope with even more by passing the `--compress` flag to the `make-vocab` command, which produces compressed `.yaml.gz` files instead of plain-text `.yaml` files. More stability might be achieved by limiting `make-vocab` to one table at a time via the `--only TABLENAME` option. Even so, a `make-vocab` call can take many hours to complete.
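A `.yaml.gz` file produced this way can be inspected with standard Python tooling if you want to check what was exported. A minimal sketch, assuming PyYAML is installed; the filename is an assumption, and the structure of the loaded document is not shown here:

```python
import gzip

import yaml  # PyYAML

# Hypothetical filename; adjust to whatever make-vocab produced.
with gzip.open("concept_class.yaml.gz", "rt", encoding="utf-8") as f:
    doc = yaml.safe_load(f)

print(type(doc))
```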
If all of these mechanisms fail, the offending tables will need to be marked (using `configure-tables`) as `ignore`. Such tables can be downloaded manually via the database’s own software (for example `psql` for PostgreSQL), or you can simply leave them ignored. For example, if you do not have permission to publish the OMOP `concept` table, you might have to leave it ignored. If you do leave a concept table ignored, the default generator for every foreign key to it will be a plain integer generator, which is probably not what you want. Instead, use one of the choice generators for such foreign keys; the null-partitioned grouped generators will also work.
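The difference matters because a plain integer generator emits IDs that mean nothing in OMOP terms. A rough sketch of the idea behind a choice generator, with a hypothetical hard-coded pool of concept IDs (in datafaker the pool would come from the source data):

```python
import random

# A plain integer generator: the result is almost certainly not
# a real concept ID, so joins against any concept data will fail.
meaningless_fk = random.randint(1, 10**9)

# A choice generator instead draws from IDs actually observed in the
# source database. The pool below is hypothetical (4152194 and 8876
# are the concept IDs used as examples later on this page).
concept_id_pool = [4152194, 8876]
meaningful_fk = random.choice(concept_id_pool)

print(meaningless_fk, meaningful_fk)
```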
The standard OMOP vocabulary tables are as follows:
| Table name | Approximate size |
|---|---|
| `concept` | 7 million rows |
| `concept_ancestor` | 80 million rows |
| `concept_class` | small |
| `concept_relationship` | 50 million rows |
| `concept_synonym` | 2 million rows |
| `domain` | small |
| `drug_strength` | 3 million rows |
| `relationship` | small |
| `source_to_concept_map` | small |
| `vocabulary` | small |
You might also want to treat other tables, such as `care_site`, `cdm_source` and `location`, as vocabulary tables. Depending on the source database’s speed and the capabilities of the machine running datafaker, `concept`, `concept_ancestor`, `concept_relationship`, `concept_synonym` and `drug_strength` might need to be ignored and dealt with manually.
## Entity Attribute Values
We can get reasonable fidelity in most OMOP tables with the simple per-column generators suggested by the `propose` command in `configure-generators`. However, two tables are very different: `observation` and `measurement`. These tables have a `concept_id` column which indicates what some of the rest of the columns mean. Depending on the value in the `concept_id` column, a row will have a different pattern of nulls across the rest of the table, a different spread of values for numeric columns, and a different set of choices for other columns.
For example, in the `measurement` table, if `measurement_concept_id` is 4152194 (which references Systolic blood pressure in the `concept` table), then `unit_concept_id` is 8876 (which references the unit of pressure mmHg), `unit_source_value` is null and `value_as_number` has values somewhere near 160. This means that measurements of systolic blood pressure are numeric and around 160 mmHg. Other types of measurement will have their own ranges of values.
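To make the difficulty concrete, here is a toy sketch of the conditional structure a generator has to reproduce, using the systolic blood pressure example above (the spread of 20 mmHg is invented for illustration):

```python
import random

# One "recipe" per measurement concept. Only the systolic blood
# pressure entry is grounded in the example above; a real table
# would need an entry per measurement concept that occurs.
RECIPES = {
    4152194: {"unit_concept_id": 8876, "mean": 160.0, "sd": 20.0},
}

def fake_measurement_row():
    concept_id = random.choice(list(RECIPES))
    recipe = RECIPES[concept_id]
    return {
        "measurement_concept_id": concept_id,
        "unit_concept_id": recipe["unit_concept_id"],  # mmHg here
        "unit_source_value": None,  # null for this concept
        "value_as_number": random.gauss(recipe["mean"], recipe["sd"]),
    }

print(fake_measurement_row())
```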
Having all these different kinds of measurement in one table presents a challenge for generating fake data. Datafaker has “null-partitioned grouped” generators for this sort of difficulty. These are multi-column generators, and the amount of data captured in the `src-stats.yaml` file for this sort of generator is, sadly, prodigious.
“Null-partitioning” refers to dividing the table into different “partitions” based on which columns are null, then gathering data for each partition separately (and also gathering data on how big each partition is). Each partition will have multiple results, one for each combination of values in the non-numeric columns. Even if the source data has no examples of a row with a particular pattern of null columns, datafaker will still generate a query for that partition (which will return no results), just in case a later run of `make-stats` does find some matching data. Therefore the number of queries appearing in `src-stats.yaml` grows exponentially with the number of columns being generated. For example, with eight columns generated, 257 query results will appear in `src-stats.yaml`! That is a doubling for every column added, plus one for the sizes of the partitions.
Therefore, depending on your patience for generating such data and your information governance process’s capacity for looking through it, restraint might be called for when merging many columns!
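The arithmetic behind that 257: one query per null pattern of the generated columns, plus one for the partition sizes. A quick sketch:

```python
from itertools import product

def null_patterns(columns):
    # One partition per combination of "is this column null?".
    return list(product((False, True), repeat=len(columns)))

cols = ["value_as_number", "value_as_concept_id", "unit_concept_id"]
print(len(null_patterns(cols)) + 1)  # 2**3 partitions + 1 sizes query = 9

# With eight generated columns: 2**8 + 1 = 257 query results.
print(2 ** 8 + 1)
```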
But which columns should we merge for the `observation` and `measurement` tables? The essential columns for the `observation` table would seem to be `observation_concept_id`, `value_as_number`, `value_as_string`, `value_as_concept_id` and `unit_concept_id`. Note, however, that a column is essential to the generator output only if it needs to be present in the destination database at all. For example, if the destination database only needs to hold the minimal data required by the UK HRA’s Cohort Discovery Tool, then the `unit_concept_id` column will not be produced and so does not need to be in any generator.
Another group of columns could be useful to add, depending on what the source data looks like: `observation_type_concept_id`, `qualifier_concept_id`, `observation_source_value`, `observation_source_concept_id`, `unit_source_value` and `qualifier_source_value`. Check whether the source database puts meaningful values into these columns before adding them. If your `provider` table is a vocabulary table and does not contain too many different values, `provider_id` can also usefully be added.
`observation_date` and `observation_datetime` should not be added, because these generators do not know how to handle dates and so will just treat them as a set of choices on the same level as `observation_concept_id`. This would destroy any correlation in the data, so please don’t do it. (A later update of datafaker might allow dates and datetimes to be treated as numeric values, which would then piggyback onto the correlation and keep it intact.) Similarly, `observation_id` and `visit_occurrence_id` would destroy all correlation, and `person_id` would destroy all correlation except that of a single individual with multiple observations of the same type. Do not add these.
The `measurement` table is very similar. Essential columns are `measurement_concept_id`, `value_as_number`, `value_as_concept_id` and `unit_concept_id` (with the same caveat as for the `observation` table). Useful columns are `measurement_type_concept_id`, `operator_concept_id`, `range_low`, `range_high`, `measurement_source_value`, `measurement_source_concept_id`, `unit_source_value` and `value_source_value`. `provider_id` is possibly useful, but only if the `provider` table is a vocabulary table. The following columns should not be added: `measurement_id`, `person_id`, `measurement_date`, `measurement_datetime` and `visit_occurrence_id`.
You currently have a choice of four null-partitioned generators: any combination of sampled-and-suppressed (or not) with either a normal or a lognormal distribution. Normal and lognormal are simply different distributions, and different measurements will suit one or the other; sadly, we have to choose a single distribution for the whole table. Generally lognormal seems to work a little better: it will never produce negative values, which is usually good but will occasionally be bad.
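The practical difference is easy to see by sampling from each (parameters invented for illustration):

```python
import random

# A wide normal distribution can dip below zero...
normal = [random.gauss(160, 60) for _ in range(10_000)]
print(min(normal))  # quite likely negative

# ...whereas a lognormal distribution never does.
lognormal = [random.lognormvariate(5.0, 0.4) for _ in range(10_000)]
print(min(lognormal))  # always positive
```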
This will produce fairly faithful fake data. What is completely lacking is any correlation between the different rows of the table. For instance, diastolic and systolic blood pressure readings are taken at times, and have values, that are independent of each other; patients are given random drugs at random times, uncorrelated with their diagnoses or any other aspect of their medical record; and so on. To go further we would have to write a story generator in Python which can carry information over from one row to others.
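Such a story generator does not exist in datafaker yet. As a hypothetical sketch of the idea, here is a generator that carries one piece of state (a patient’s underlying blood pressure level) across rows, so that the systolic and diastolic readings it emits are correlated. All parameters, and the diastolic placeholder concept ID, are invented:

```python
import random

def blood_pressure_story(person_id: int):
    # State carried across rows: this patient's underlying BP level.
    level = random.gauss(0.0, 1.0)
    systolic = 160 + 20 * level + random.gauss(0.0, 5.0)
    diastolic = 90 + 12 * level + random.gauss(0.0, 4.0)
    # 4152194 is systolic BP (see above); 0 is a placeholder where
    # the real diastolic concept ID would go.
    for concept_id, value in ((4152194, systolic), (0, diastolic)):
        yield {
            "person_id": person_id,
            "measurement_concept_id": concept_id,
            "unit_concept_id": 8876,  # mmHg
            "value_as_number": round(value, 1),
        }

for row in blood_pressure_story(person_id=1):
    print(row)
```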