Content from What is OMOP?


Last updated on 2025-09-12 | Edit this page

Overview

Questions

  • What is OMOP?
  • What information would you expect to find in the person table?
  • What information would you expect to find in the condition_occurrence table?
  • How can you join these tables to aggregate information?

Objectives

  • Examine the diagram of the OMOP tables and the data specification
  • Interrogate the data in the tables
  • Join these tables to find the concept names

Setting up R


Getting started

Since we want to import the files called *.csv into our R environment, we need to be able to tell our computer where the file is. To do this, we will create a “Project” with RStudio that contains the data we want to work with. The “Projects” interface in RStudio not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it). The interface also (optionally) preserves custom settings and open files to make it easier to resume work after a break.

Install the required packages

You will need the dplyr, remotes and readr packages from CRAN (the official package repository). You will also need a package we have developed omopcept. This can be installed with:

remotes::install_github(“SAFEHR-data/omopcept”)

Create a new project

Introduction


OMOP is a format for recording Electronic Healthcare Records. It allows you to follow a patient journey through a hospital by linking every aspect to a standard vocabulary thus enabling easy sharing of data between hospitals, trusts and even countries.

OMOP CDM Diagram

A diagram showing the tables that occur in the OMOP-CDM , how they relate to each other and standard vocabularies.
The OMOP Common Data Model

OMOP CDM stands for the Observational Medical Outcomes Partnership Common Data Model. You don’t really need to remember what OMOP stands for. Remembering that CDM stands for Common Data Model can help you remember that it is a data standard that can be applied to different data sources to create data in a Common (same) format. The table diagram will look confusing to start with but you can use data in the OMOP CDM without needing to understand (or populate) all 37 tables.

Challenge

Test yourself

Look at the OMOP-CDM figure and answer the following questions:

  1. Which table is the key to all the other tables?
  2. Which table allows you to distinguish between different stays in hospital?
  1. The Person table

  2. The Visit_occurrence table

There are a handful of core tables and columns that contain key information about a patient’s journey in the hospital. These are 7 tables to get you started :

  • person uniquely identifies each person or patient, and some demographic information. This is the central table that all other tables relate to.
  • condition_occurrence records relating to a Person suggesting the presence of a medical condition.
  • drug_exposure records about exposure of a patient to a drug.
  • procedure_occurrence activities carried out by a healthcare provider on the patient with a diagnostic or therapeutic purpose.
  • measurement numerical or categorical values obtained through standardized examination of a Person or Person’s sample.
  • observation clinical facts about a Person obtained in the context of examination, questioning or a procedure.
  • visit_occurrence records of times where Persons engage with the healthcare system.

Why use OMOP?


A diagram showing that different sources of data, transformed to OMOP, can then be used by multiple analysis tools.
Why use the OMOP-CDM

Once a database has been converted to the OMOP CDM, evidence can be generated using standardized analytics tools. This means that different tools can also be shared and reused. So using OMOP can help make your research FAIR.

Some simple tables


Loading Data

Now that we are set up with an Rstudio project, we are sure that the data and scripts we are using are all in our working directory. The data files should be located in the directory data, inside the working directory. Now we can load the data into R, there are three data files person.csv, condition_occurrence.csv and drug_exposure.csv. Read each of these into a table with the same name.

Challenge

Read the Data

There are three data files person.csv, condition_occurrence.csv and drug_exposure.csv. Read each of these into a table with the same name.

NOTE: The data does have headers.

R

person <- read.csv(file = "data/person.csv")
condition_occurrence <- read.csv(file = "data/condition_occurrence.csv")
drug_exposure <- read.csv(file = "data/drug_exposure.csv")

When you have read in the data, take some time to explore it.

Adding concept names


You will have noticed that content of the tables are not terribly easy to understand. This is because everything in OMOP is viewed as a concept that allows it to be related to one or more standard vocabularies such as SNOMED, ICD-10, etc.

We have developed a package that makes it very easy to add concept names to the tables.

You will need the function omopcept::omop_join_name_all(). This will look up the concept_id in the main table of concepts and add a column for the name of the concept associated with that id.

Challenge

Who’s who?

By creating tables that also have the name of the concepts answer the following questions

  1. How old is the black gentleman?
  2. In which month was an unspecified fever prevalent in the hospital?
  3. What was the ethnicity of the patient not affected by this fever?
  4. Give a description of the patient who received Amoxicillin because they were wheezing?

R

library(omopcept)
person_named <- person |> omop_join_name_all()

OUTPUT

Warning: downloading a subset of omop vocab files, pre-processed.
If you want to make sure you have the vocabs you need, download from Athena, save locally & call `omop_vocabs_preprocess()`

OUTPUT

downloading concept file, may take a few minutes, this only needs to be repeated if the package is re-installed

R

condition_occurrence_named <- condition_occurrence |> omop_join_name_all()
drug_exposure_named <- drug_exposure |> omop_join_name_all()
  1. 25 (or 24 if he hasn’t had his birthday this year)
  2. July
  3. Don’t know - it hasn’t been specified
  4. A 53/54 white female

Joining and interrogating the tables


Using join

We established when looking at the diagram that the person table was the key to accessing all the other tables. In fact it is the person_id column that is the actual key that will allow us to join with other tables.

So we can join two of the tables together to get information about the different conditions suffered by each person.

I am going to use a left join because I want a record of every person and the conditions they may have.

R

library(dplyr)
person_condition <- 
  person_named |> 
  left_join(condition_occurrence_named, by = join_by(person_id) )

This produces a new table with all the column names from both tables and six rows.

Using count

Challenge

Challenge

Count the number of people with each condition

R

person_condition |> count(gender_concept_name, condition_concept_name)

OUTPUT

# A tibble: 5 × 3
  gender_concept_name condition_concept_name     n
  <chr>               <chr>                  <int>
1 FEMALE              Fever, unspecified         2
2 FEMALE              Nausea and vomiting        1
3 FEMALE              Wheezing                   1
4 MALE                Fever, unspecified         1
5 MALE                Nausea and vomiting        1

This produces a table:

A table showing the different conditions listed with the n.umber of males and females suffering from them
A table of the condition counts
Challenge

For the Confident

Using the “GiBleed” database work out the number of male and female patients with each condition_concept_id

R

cdm$person |> 
  left_join( cdm$condition_occurrence, by = join_by(person_id) ) |> 
  group_by(condition_concept_id, gender_concept_id) |>
  summarise(num_persons = n_distinct(person_id)) |>
  collect() |>
  ungroup() |> 
  omopcept::omop_join_name_all() |> 
  #remove some columns to make display clearer
  select(-condition_concept_id, -gender_concept_id) |> 
  arrange(condition_concept_name)  

OUTPUT

# A tibble: 158 × 3
   condition_concept_name    gender_concept_name num_persons
   <chr>                     <chr>                     <dbl>
 1 Acute allergic reaction   FEMALE                       59
 2 Acute allergic reaction   MALE                         57
 3 Acute bacterial sinusitis FEMALE                      418
 4 Acute bacterial sinusitis MALE                        368
 5 Acute bronchitis          MALE                       1243
 6 Acute bronchitis          FEMALE                     1300
 7 Acute cholecystitis       FEMALE                       29
 8 Acute cholecystitis       MALE                          6
 9 Acute viral pharyngitis   MALE                       1284
10 Acute viral pharyngitis   FEMALE                     1322
# ℹ 148 more rows

Solutions for working out Who’s who programmatically.

Challenge

How old is the black gentleman?

R

library(dplyr)

year_of_birth <- person_named %>%
  filter(grepl("black", race_concept_name, ignore.case = TRUE)) %>%
  select(year_of_birth)

person_age <- year_of_birth$year_of_birth[1]  
age = 2025 - person_age

print(age)

OUTPUT

[1] 25
Challenge

In which month was an unspecified fever prevalent in the hospital?

The lubridate package has a function

R

library(lubridate)

OUTPUT


Attaching package: 'lubridate'

OUTPUT

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

R

 month = month(ymd("2025-01-30"), label = TRUE)
 print(month)

OUTPUT

[1] Jan
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

which returns month = Jan

The dyplr package has a function that will allow you to add columns to a table

R

 new_person_table = mutate(person, age = 2025-year_of_birth)
 print(new_person_table)

OUTPUT

  person_id year_of_birth gender_concept_id race_concept_id age
1         1          1980              8532        46285833  45
2         2          1971              8532        46286810  54
3         3          2000              8507        46285836  25
4         4          2010              8507        37394011  15

which adds a column called age to the person table

R

library(dplyr)
library(lubridate)

fever_months <- condition_occurrence_named %>%
  # Filter for rows where the condition name contains "fever"
  filter(grepl("fever", condition_concept_name, ignore.case = TRUE)) %>%
  # Extract month from the condition_start_date
  mutate(month = month(condition_start_date, label = TRUE)) %>%
  # Select just the month column for the results
  select(month) %>%
  # Count occurrences by month
  group_by(month) %>%
  summarise(count = n()) %>%
  # Sort by count (descending)
  arrange(desc(count))

# This gives us a table with the months where people had a fever and how many people had the fever each month. The question tells us that there is only one month so we select that.
fever <- fever_months$month[1]

# View the results
print(fever)

OUTPUT

[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
Challenge

What was the ethnicity of the patient not affected by this fever?

R

library(dplyr)

people_without_condition <- person_named %>%
  # we want to join with the people who do not meet the condition
  anti_join(condition_occurrence_named %>%
    filter(grepl("fever", condition_concept_name, ignore.case = TRUE)),           
    by = "person_id") 
  
  ethnicity <- people_without_condition$race_concept_name 

# View results
print(ethnicity)

OUTPUT

[1] "Ethnicity not stated"
Challenge

Give a description of the patient who received Amoxicillin because they were wheezing?

R

library(dplyr)

# Query to find persons prescribed amoxicillin for wheezing
amoxicillin_for_wheezing <- person_named %>%
  # Join with condition_occurrence table 
  inner_join(condition_occurrence_named, by = "person_id") %>%
  # Filter for wheezing condition
  filter(grepl("wheez", condition_concept_name, ignore.case = TRUE)) %>%
  # Join with drug_exposure table to find medications
  inner_join(drug_exposure_named, by = "person_id") %>%
  # Filter for amoxicillin prescriptions
  filter(grepl("amoxicillin", drug_concept_name, ignore.case = TRUE)) 
  
  # again the question implies there is only one person
  age <- 2025-amoxicillin_for_wheezing$year_of_birth[1]
  gender <- amoxicillin_for_wheezing$gender_concept_name[1]
  ethnicity <- amoxicillin_for_wheezing$race_concept_name[1]

# View results
print(age)

OUTPUT

[1] 54

R

print(gender)

OUTPUT

[1] "FEMALE"

R

print(ethnicity)

OUTPUT

[1] "White: English or Welsh or Scottish or Northern Irish or British - England and Wales ethnic category 2011 census"
Key Points
  • Using a standard makes it much easier to share data
  • OMOP uses concepts to link dated to standard vocabularies
  • R can be used to join and interrogate data

Content from Why OMOP?


Last updated on 2025-09-17 | Edit this page

TODO not sure we need this episode, maybe some objectives could be copied to others

Overview

Questions

  • Why use OMOP?
  • Why not use spreadsheets?
  • What are the advantages of OMOP?
  • What are the disadvantages of OMOP?

Objectives

  • Examine the diagram of the OMOP tables and the data specification
  • Familiarise with the vocab schema
  • Join two or more tables together
  • Attempt to join data from spreadsheets with different structures
  • Describe the pros and cons on using OMOP vs raw data, why this is the way forward
  • Use Athena and other OHDSI tools for reference
  • Describe the full landscape of OMOP tools and the community

Introduction


TODO decide whether to include this earlier example, doesn’t currently work because we don’t have the data. Maybe not ?

Discussion

Challenge 1: Compare data from two separate OMOP data sets

Let’s read in an OMOP extract called extract_1 from the local files.

R

omop_dataset_file_location_1 <- here::here("extracts/uclh1")

extract_1 <- read_omop_dataset(omop_dataset_file_location_1)

A colleague at the hospital is familiar with the events that occurred in hospital to one of the patients in the dataset. This patient has been identified by the data team as the anonymised patient with person id 7.

R

extract_1$person |>
  filter(person_id==7) |>
  select(person_id, race_concept_id, gender_concept_id, year_of_birth) |>
  omopcept::omop_join_name_all() |>
  collect()
Checklist

Verify that the patient details match your colleague’s description.

Let’s take a sample of patients in this dataset, selecting those same columns.

R

extract_1_pt_sample <- extract_1$person |>
  slice_sample(n = 10) |>
  select(person_id, race_concept_id, gender_concept_id, year_of_birth) |>
  collect()

We’ve received another OMOP dataset from another site.

R

omop_dataset_file_location_2 <- here::here("extracts/other_site_1")

extract_2 <- read_omop_dataset(omop_dataset_file_location_2)

Let’s take a sample of patients from the second extract and bind them together.

Callout

Note that, because the structure of the data (table names, columns and data types) are set as standard by the OMOP specification, we are guaranteed to be able to bind these two datasets together without error. We can also re-apply the same code, only changing the reference to the new extract.

R

extract_2_pt_sample <- extract_2$person |>
  slice_sample(n = 10) |>
  select(person_id, race_concept_id, gender_concept_id, year_of_birth) |>
  collect()
  
 bind_rows(extract_1_pt_sample, extract_2_pt_sample)
Key Points

Content from concepts and conditions


Last updated on 2025-09-17 | Edit this page

Overview

Questions

  • What is an OMOP concept ?
  • Where are patient conditions stored in OMOP ?

Objectives

  • Understand that nearly everything in a hospital can be represented by an OMOP concept_id.
  • Know that OMOP data usually includes the OMOP concept table and other data from the vocabularies
  • Be able to look up concepts by their name
  • Know that patient conditions are stored in the condition_occurrence table

Introduction


Nearly everything in a hospital can be represented by an OMOP concept_id.

Any column within the OMOP CDM named *concept_id contains OMOP concept IDs. An OMOP concept_id is a unique integer identifier. Concept_ids are defined in the OMOP concept table where a corresponding name and other attributes are stored. OMOP contains concept_ids for other medical vocabularies such as SNOMED and LOINC, which OMOP terms as source vocabularies.

concept table columns Description
concept_id Unique identifier for the concept.
concept_name Name or description of the concept.
domain_id The domain to which the concept belongs (e.g. Condition, Drug).
vocabulary_id The vocabulary from which the concept originates (e.g. SNOMED, RxNorm).
concept_class_id Classification within the vocabulary (e.g. Clinical Finding, Ingredient).
standard_concept ‘S’ for standard concepts that can be included in network studies
concept_code Code used by the source vocabulary to identify the concept.
valid_start_date Date the concept became valid in OMOP.
valid_end_date Date the concept ceased to be valid.
invalid_reason Reason for invalidation, if applicable

Looking up OMOP concepts


OMOP concepts can be looked up in Athena an online tool provided by OHDSI.

TODO provide a challenge to look up a concept in Athena TODO add a challenge to look up the same concept in one of the synthea datasets TODO add briefly about concept relationships

TODO ? add about the condition_occurrence table to make the episode longer and give something for concept to refer to

Key Points
  • Understand that nearly everything in a hospital can be represented by an OMOP concept_id.
  • Know that OMOP data usually includes the OMOP concept table and other data from the vocabularies
  • Be able to look up concepts by their name
  • Know that patient conditions are stored in the condition_occurrence table

Content from measurements and observations


Last updated on 2025-09-17 | Edit this page

Overview

Questions

  • How to access measurements and observations ?

Objectives

  • know that measurements are mainly lab results and other records like pulse rate
  • know observations are other facts obtained through questioning or direct observation
  • understand concept ids identify the measure or observation, values are stored in value_as_number or value_as_concept_id
  • be able to join to the concept table to find a particular measurement or observation concept by name
  • understand that different clinical questions can be answered by querying by patient and/or visit, or summing across all records

Introduction


The OMOP measurement and observation tables contain information collected about a person. The difference between them is that measurement contains numerical or categorical values collected by a standardised process, whereas observation contains less standardised clinical facts. Measurements are predominately lab tests with a few exceptions, like blood pressure or function tests.

A person_id column means that there can be multiple records per person.

Columns are similar between measurement and observation.

Concepts and values


Data are stored as questions and answers. A question (e.g. Pulse rate) is defined by a concept_id and the answer is stored in a value column.

measurement_concept_id or observation_concept_id columns define what has been recorded here are some examples :

Example Measurement concepts Example Observation concepts
Respiratory rate Respiratory function
Pulse rate Wound dressing observable
Hemoglobin saturation with oxygen Mandatory breath rate
Body temperature Body position for blood pressure measurement
Diastolic blood pressure Alcohol intake - finding
Arterial oxygen saturation Tobacco smoking behavior - finding
Body weight Vomit appearance
Leukocytes [#/volume] in Blood State of consciousness and awareness

These value columns store values :

column name data type example concept_name
value_as_number numeric value 1.2 -
unit_concept_id units of the numeric value 9529 kilogram
value_as_concept_id categorical value 4328749 High
operator_concept_id optional operators 4172704 >

Where values are a concept_id, the name of that concept can be looked up in the concept table that is part of the OMOP vocabularies and included in most CDM instances. We show this below.

Looking at numeric measurement values


R


install.packages("MeasurementDiagnostics")
library(dplyr)

cdm <- MeasurementDiagnostics::mockMeasurementDiagnostics()

# first we can see that some concepts have a value_as_number
freq_numeric <- cdm$measurement |>
  filter(!is.na(value_as_number)) |> 
  count(measurement_concept_id, unit_concept_id) |> 
  # join concept names
  left_join(select(cdm$concept, concept_id, concept_name), by=join_by(measurement_concept_id==concept_id)) |> 
  rename(measurement_concept_name=concept_name) |> 
  left_join(select(cdm$concept, concept_id, concept_name), by=join_by(unit_concept_id==concept_id)) |> 
  rename(unit_concept_name=concept_name) |> 
  collect()
  
freq_numeric |> head(3)
  

Output of numeric values


TODO these data from MeasurementDiagnostics aren’t great but they are a start

  • Good that it is a complete CDM with the concept table for joining names.
  • Bad few values and unrealistic units

OUTPUT


  measurement_concept_id unit_concept_id     n measurement_concept_name                         unit_concept_name
                   <int>           <dbl> <dbl> <chr>                                            <chr>
1                3002069            9529    50 Alkaline phosphatase.bone/Alkaline phosphatase.… kilogram
2                3012056            9529    50 Uroporphyrin 3 isomer [Moles/volume] in Urine    kilogram
3                3026074            9529    50 Uroporphyrin 3 isomer [Moles/volume] in Stool    kilogram
Challenge

Challenge 1: Can you adapt the code above to output the frequency of categorical values ?

R

# other concepts have a value_as_concept_id
freq_categorical <- cdm$measurement |>
  filter(!is.na(value_as_concept_id)) |> 
  count(measurement_concept_id, value_as_concept_id) |> 
  # join concept names
  left_join(select(cdm$concept, concept_id, concept_name), by=join_by(measurement_concept_id==concept_id)) |>  
  rename(measurement_concept_name=concept_name) |> 
  left_join(select(cdm$concept, concept_id, concept_name), by=join_by(value_as_concept_id==concept_id)) |> 
  rename(value_as_concept_name=concept_name) |> 
  collect()
  
freq_categorical |> head(3)

OUTPUT

  measurement_concept_id value_as_concept_id     n measurement_concept_name                 value_as_concept_name
                   <int>               <dbl> <dbl> <chr>                                    <chr>
1                3001467             4328749    33 Alkaline phosphatase.bone [Enzymatic ac… High
2                3002069             4267416    33 Alkaline phosphatase.bone/Alkaline phos… Low
3                3011539             4267416    33 Uroporphyrin 3 isomer [Moles/volume] in… Low

When a measurement or observation was done


There are date and time columns indicating when a measurement or observation was performed.

A visit_occurrence_id specifies the visit that it occurred in.

TODO do we want to add more here with example data

Source data columns


There are columns that can store data from the source database that hasn’t been standardised to the same level. Usually it is not a good idea to use these in analyses because any code is unlikely to work on different data. The source columns can be used to check how the standardised data were arrived at.

Source columns in the measurement table :

measurement_source_value
measurement_source_concept_id
unit_source_value
value_source_value

Advanced : Linking measurement and observation tables to other tables e.g. specimen


There are measurement_event_id and observation_event_id columns that can link a record to the primary key in another table e.g. specimen_id. If these are used then meas_event_field_concept_id or obs_event_field_concept_id need to contain the concept_id corresponding to the linked table (in this case specimen).

For microbiology results observation_concept_id can store the genus & species of an organism & measurement_concept_id can store its growth.

TODO question in the above do they need to be linked by fact_relationship to deal with multiple organisms from the same specimen ? TODO do we want to add more here with example data

Key Points
  • know that measurements are mainly lab results and other records like pulse rate
  • know observations are other facts obtained through questioning or direct observation
  • understand concept ids identify the measure or observation, values are stored in value_as_number or value_as_concept_id
  • be able to join to the concept table to find a particular measurement or observation concept by name
  • understand that different clinical questions can be answered by querying by patient and/or visit, or summing across all records

Content from Visits


Last updated on 2025-09-17 | Edit this page

Overview

Questions

  • What are visits and how can they be used ?

Objectives

  • Know that a visit is a period of time and patients can have multiple visits
  • Understand that multiple measurements, conditions etc. can occur within and between visits
  • Understand that for some analyses you will want to look within visits and for other analyses to sum across visits
  • Know that visits are recorded in the visit_occurrence table
  • Know each visit is unique to a person
  • Understand that other tables link to visits
  • Understand how visits can be used to find co-occurrence of other events

Introduction


TODO not sure if this wants to be in a separate episode or a more general one about linking tables. It could always be renamed later.

The visit_occurrence table contains events where Persons engage with the healthcare system for a duration of time.

The main clinical tables condition_occurrence, measurement, observation and drug_exposure contain a visit_occurrence_id that links to this table.

visit_concept_id specifies the kind of visit that took place using standardised OMOP concepts. These include Inpatient visit, Emergency Room Visit and Outpatient Visit. Inpatient visits can last for longer than one day.

The visit_detail table can contain information about time periods shorter than the visit (for example transfer between wards) but we will not cover that further here.

Generating some example data


Firstly here is some code (from ChatGPT) to create some interesting example data. TODO this code is lengthy, we may later want to hide it or save the output as .Rdata in the repo

R

#person, visit_occurrence, measurement, drug_exposure, condition_occurrence, and concept tables
#realistic blood pressure trends over time
#conditions and drugs co-occurring within visits
#concept table for joining concept names

# Install and load required packages
#install.packages(c("dplyr", "tibble", "lubridate", "uuid"), dependencies = TRUE)

library(dplyr)
library(tibble)
library(lubridate)
library(uuid)
library(tidyr)

set.seed(123)

# 1. Create 100 synthetic patients
n_patients <- 100

person <- tibble(
  person_id = 1:n_patients,
  gender_concept_id = sample(c(8507, 8532), n_patients, replace = TRUE),  # Male / Female
  year_of_birth = sample(1940:2000, n_patients, replace = TRUE)
)

# 2. Create multiple visits per patient
visits_per_person <- sample(2:5, n_patients, replace = TRUE)

visit_occurrence <- tibble(
  person_id = rep(person$person_id, times = visits_per_person)
) |>
  mutate(
    visit_occurrence_id = row_number(),
    visit_start_date = as_date("2020-01-01") + sample(0:1000, n(), replace = TRUE),
    visit_concept_id = sample(c(9201, 9202, 9203), n(), replace = TRUE),  # Inpatient, ER, Outpatient
    #ER & outpatient just a day
    visit_end_date = if_else( visit_concept_id == 9201,
                              visit_start_date + sample(1:3, n(), replace = TRUE),
                              visit_start_date),
  )

# Baseline + slope per patient in one tibble
bp_params <- tibble(
  person_id = 1:n_patients,
  bp_slope = rnorm(n_patients, mean = 0, sd = 0.3),   # mmHg per hour
  systolic_base = rnorm(n_patients, mean = 120, sd = 10)
)

measurement <- visit_occurrence |>
  left_join(bp_params, by = "person_id") |>
  rowwise() |>
  mutate(
    n_hours = as.numeric(interval(visit_start_date, visit_end_date) / hours(1)),
    hours_seq = list(visit_start_date + hours(0:n_hours))
  ) |>
  ungroup() |>
  select(person_id, visit_occurrence_id, bp_slope,
         systolic_base, visit_start_date, hours_seq) |>
  unnest(hours_seq) |>
  mutate(
    hours_since_start = as.numeric(difftime(hours_seq, visit_start_date, units = "hours")),
    systolic = systolic_base +
               (hours_since_start * bp_slope) +   # per-hour slope
               rnorm(n(), 0, 5),                 # noise
    measurement_id = UUIDgenerate(n = n()),
    measurement_concept_id = 3004249,  # systolic
    unit_concept_id = 8510,            # mmHg
    measurement_date = hours_seq
  ) |>
  select(measurement_id, person_id, visit_occurrence_id,
         measurement_date, measurement_concept_id,
         systolic, unit_concept_id) |>
  rename(value_as_number = systolic)


# 6. Define condition-drug co-occurrence map
condition_drug_map <- tribble(
  ~condition_concept_id, ~condition_name,            ~drug_concept_id, ~drug_name,
  201826,                "Type 2 diabetes mellitus",  1124300,          "Metformin",
  320128,                "Essential hypertension",    1112807,          "Lisinopril",
  319835,                "Hyperlipidemia",            19019073,         "Atorvastatin"
)

# 7. Generate conditions per visit
condition_occurrence <- visit_occurrence |>
  rowwise() |>
  do({
    #n_conditions <- sample(1:2, 1)
    #change n_conditions to 1 so that co-occurrence example works
    n_conditions <- 1
    selected_conditions <- condition_drug_map |>
      slice_sample(n = n_conditions)

    tibble(
      condition_occurrence_id = UUIDgenerate(n_conditions),
      person_id = .$person_id,
      visit_occurrence_id = .$visit_occurrence_id,
      condition_concept_id = selected_conditions$condition_concept_id,
      condition_start_date = .$visit_start_date
    )
  }) |>
  bind_rows()

# 8. Generate drug exposures that match conditions
drug_exposure <- condition_occurrence |>
  left_join(condition_drug_map, by = "condition_concept_id") |>
  group_by(person_id, visit_occurrence_id, drug_concept_id) |>
  summarise(
    drug_exposure_id = UUIDgenerate(1),
    drug_exposure_start_date = min(condition_start_date),
    .groups = "drop"
  ) |>
  mutate(days_supply = sample(30:90, n(), replace = TRUE))

# 9. Build concept table (with all used concepts)
concept <- tribble(
  ~concept_id, ~concept_name,
  3004249, "Systolic Blood Pressure",
  3012888, "Diastolic Blood Pressure",
  3027114, "Glucose",
  3016502, "Creatinine",
  1124300, "Metformin",
  1112807, "Lisinopril",
  19019073, "Atorvastatin",
  201826, "Type 2 diabetes mellitus",
  320128, "Essential hypertension",
  319835, "Hyperlipidemia",
  8510, "mm[Hg]",
  8713, "mg/dL",
  8840, "mmol/L",
  9201, "Inpatient Visit",
  9202, "Emergency Room Visit",
  9203, "Outpatient Visit",
  8507, "Male",
  8532, "Female"
)

# 10. Combine into synthetic CDM object
cdm <- list(
  person = person,
  visit_occurrence = visit_occurrence,
  measurement = measurement,
  condition_occurrence = condition_occurrence,
  drug_exposure = drug_exposure,
  concept = concept
)

When do we need to consider visits ?


As we have seen we don’t need to consider visits to answer all questions. For example if we can count the number of patients with a particular condition without considering visits.

Using visits can help us with :

  1. different types of visits
  2. selecting data from an indivual or selected visits
  3. finding co-occurrence of events within a visit

We can query the visit_occurrence table to see what kinds of visits there are in the data.

R

library(dplyr)

cdm$visit_occurrence |> 
  count(visit_concept_id) |> 
  left_join(cdm$concept |> select(concept_id, concept_name), 
            by = c("visit_concept_id" = "concept_id"))

OUTPUT

# A tibble: 3 × 3
  visit_concept_id     n concept_name
             <dbl> <int> <chr>
1             9201   105 Inpatient Visit
2             9202   105 Emergency Room Visit
3             9203   125 Outpatient Visit    

R

cdm$visit_occurrence

OUTPUT

# A tibble: 335 × 5
   person_id visit_occurrence_id visit_start_date visit_concept_id
       <int>               <int> <date>                      <dbl>
 1         1                   1 2020-12-02                   9201
 2         1                   2 2022-03-07                   9203
 3         1                   3 2020-01-26                   9203
 4         1                   4 2021-06-22                   9201
 5         1                   5 2022-09-07                   9201
 6         2                   6 2021-06-02                   9203
 7         2                   7 2022-08-13                   9203
 8         2                   8 2022-01-26                   9202
 9         2                   9 2021-10-27                   9202
10         2                  10 2021-07-06                   9203
# ℹ 325 more rows
# ℹ 1 more variable: visit_end_date <date>

Can you use person_id in the visit_occurrence table to find patients with more than one visit ?

R

visit_counts <- cdm$visit_occurrence |>
  group_by(person_id) |>
  summarise(n_visits = n()) |>
  filter(n_visits > 1) |>
  collect()

visit_counts |> head(4)

OUTPUT

# A tibble: 4 × 2
  person_id n_visits
      <int>    <int>
1         1        5
2         2        5
3         3        3
4         4        4

Now we can choose one of the patients from the previous table and look at all of their visits.

R

example_person_id <- visit_counts$person_id[1]

patient_visits <- cdm$visit_occurrence |>
  filter(person_id == example_person_id) |>
  left_join(cdm$concept, by = c("visit_concept_id" = "concept_id")) |>
  select(
    visit_occurrence_id, visit_start_date, visit_end_date, concept_name
  ) |>
  arrange(visit_start_date) |>
  collect()

patient_visits

OUTPUT

# A tibble: 5 × 4
  visit_occurrence_id visit_start_date visit_end_date concept_name
                <int> <date>           <date>         <chr>
1                   3 2020-01-26       2020-01-26     Outpatient Visit
2                   1 2020-12-02       2020-12-04     Inpatient Visit
3                   4 2021-06-22       2021-06-24     Inpatient Visit
4                   2 2022-03-07       2022-03-07     Outpatient Visit
5                   5 2022-09-07       2022-09-10     Inpatient Visit 

Outpatient and Emergency Room Visits usually end on the same day, where Inpatient visits can last longer.

Here is how we can plot a measurement (in this case blood pressure) over time for a patient. You may remember from a previous lesson that numeric measurements are stored in the column value_as_number.

First we filter the data.

R

# Define concept IDs
systolic_id <- 3004249
selected_patient <- 1

bp <- cdm$measurement |> 
  filter(person_id == selected_patient) |> 
  filter(measurement_concept_id %in% c(systolic_id))

head(bp,3)

OUTPUT

# A tibble: 3 × 7
  measurement_id               person_id visit_occurrence_id measurement_date
  <chr>                            <int>               <int> <dttm>
1 729e5403-5326-44a1-a538-1a0…         1                   1 2020-12-02 00:00:00
2 80633537-2883-49d6-9d22-94a…         1                   1 2020-12-02 01:00:00
3 7fd089fc-a1f1-44b1-adb0-09e…         1                   1 2020-12-02 02:00:00
# ℹ 3 more variables: measurement_concept_id <dbl>, value_as_number <dbl>,
#   unit_concept_id <dbl>

Then we can plot using ggplot2.

R

library(ggplot2)

ggplot(bp, aes(x = measurement_date, 
               y = value_as_number)) +
  geom_point() +
  theme_minimal()
plot of blood pressure measurements for all visits
plot of blood pressure measurements for all visits

You should be able to see some dates with single measurements and some with a few measurements very close to each other. Each separate group is a single visit.

Challenge

Challenge

Can you modify the query and plot to show the type of visit ?

You can join to visit_occurrence to get visit_concept_id, and from that join to the concept table to get concept_name. You can add shape = to the aes() statement in the plot.

R

# Define concept IDs
systolic_id <- 3004249
selected_patient <- 1

bp2 <- cdm$measurement |> 
  filter(person_id == selected_patient) |> 
  filter(measurement_concept_id %in% c(systolic_id)) |> 
  left_join(cdm$visit_occurrence, by = c("visit_occurrence_id", "person_id")) |> 
  left_join(cdm$concept, by = join_by(visit_concept_id == concept_id)) |> 
  rename(visit_concept_name = concept_name)

ggplot(bp2, aes(x = measurement_date, y = value_as_number, shape = visit_concept_name)) +
  geom_point() +
  theme_minimal()
plot of blood pressure measurements for all visits
plot of blood pressure measurements for all visits

To indicate the type of visit we need to join to visit_occurrence to get the visit_concept_id and then join to the concept table to get a name for that concept.

Note that we join the visit_occurrence table to the measurement table using both ("visit_occurrence_id", "person_id"). If we didn’t include both, one of the columns would get duplicated and renamed making it difficult to select later. Se with person_id below.

R

  cdm$measurement |> 
     left_join(cdm$visit_occurrence, by = c("visit_occurrence_id")) |> 
     names()

OUTPUT

 [1] "measurement_id"         "person_id.x"            "visit_occurrence_id"
 [4] "measurement_date"       "measurement_concept_id" "value_as_number"
 [7] "unit_concept_id"        "person_id.y"            "visit_start_date"
[10] "visit_concept_id"       "visit_end_date"        
Challenge

Challenge

Can you plot the measurements for one of the Inpatient visits ?

You can filter bp created in the previous challenge by one of the visit_occurrence_id from the table earlier.

R

bp3 <- bp2 |> filter( visit_occurrence_id == 1 )

ggplot(bp3, aes(x = measurement_date, y = value_as_number, shape = visit_concept_name)) +
  geom_point() +
  theme_minimal()
plot of blood pressure measurements for one visit
plot of blood pressure measurements for one visit

You should be able to see hourly blood pressure measurements within a visit.

Here is an example where the visit can make a difference to an analysis. Imagine we want to look at which drugs are associated with which conditions for patients.

If we were to join tables by person_id as we have seen in previous exercises we could get conditions and drugs that were separated by many years. Instead we can include visit_occurrence_id in the join to get co-occurrence of conditions and drugs in the same visit.

R

co_occurrence_by_visit <- cdm$condition_occurrence |>
  # Join with drug_exposure on person and visit
  inner_join(cdm$drug_exposure, by = c("person_id", "visit_occurrence_id")) |>
  # Count how often each condition–drug pair co-occurs in a visit
  group_by(condition_concept_id, drug_concept_id) |>
  summarise(co_occurrences = n(), .groups = "drop") |>
  
  # Join to get condition name
  left_join(
    cdm$concept |> select(concept_id, concept_name),
    by = c("condition_concept_id" = "concept_id")
  ) |>
  rename(condition_name = concept_name) |>
  
  # Join to get drug name
  left_join(
    cdm$concept |> select(concept_id, concept_name),
    by = c("drug_concept_id" = "concept_id")
  ) |>
  rename(drug_name = concept_name) |>
  select(condition_name, drug_name, co_occurrences)
  
  co_occurrence_by_visit

OUTPUT

# A tibble: 3 × 3
  condition_name           drug_name    co_occurrences
  <chr>                    <chr>                 <int>
1 Type 2 diabetes mellitus Metformin               127
2 Hyperlipidemia           Atorvastatin             95
3 Essential hypertension   Lisinopril              113

Here we see a perfect co-occurrence between conditions & expected drugs for treating that condition (because we generated the example data that way). If instead you didn’t use the visit_occurrence_id you would see more unexpected associations occurring across widely separated visits for the same patient.

Challenge

Challenge

Can you repeat the query without using visit_occurrence_id to see what results you get ?

R

co_occurrence <- cdm$condition_occurrence |>
  # Join with drug_exposure on person
  inner_join(cdm$drug_exposure, by = c("person_id")) |>
  # Count how often each condition–drug pair co-occurs
  group_by(condition_concept_id, drug_concept_id) |>
  summarise(co_occurrences = n(), .groups = "drop") |>
  
  # Join to get condition name
  left_join(
    cdm$concept |> select(concept_id, concept_name),
    by = c("condition_concept_id" = "concept_id")
  ) |>
  rename(condition_name = concept_name) |>
  
  # Join to get drug name
  left_join(
    cdm$concept |> select(concept_id, concept_name),
    by = c("drug_concept_id" = "concept_id")
  ) |>
  rename(drug_name = concept_name) |>
  select(condition_name, drug_name, co_occurrences)

co_occurrence

OUTPUT

# A tibble: 9 × 3
  condition_name           drug_name    co_occurrences
  <chr>                    <chr>                 <int>
1 Type 2 diabetes mellitus Lisinopril              116
2 Type 2 diabetes mellitus Metformin               257
3 Type 2 diabetes mellitus Atorvastatin            103
4 Hyperlipidemia           Lisinopril               89
5 Hyperlipidemia           Metformin               103
6 Hyperlipidemia           Atorvastatin            179
7 Essential hypertension   Lisinopril              205
8 Essential hypertension   Metformin               116
9 Essential hypertension   Atorvastatin             89

Now we see that conditions co-occur with drugs unlikely to be used in their treatment because they came from visits further apart in time.

Key Points
  • Know that a visit is a period of time and patients can have multiple visits
  • Understand that multiple measurements, conditions etc. can occur within and between visits
  • Understand that for some analyses you will want to look within visits and for other analyses to sum across visits
  • Know that visits are recorded in the visit_occurrence table
  • Know each visit is unique to a person
  • Understand that other tables link to visits
  • Understand how visits can be used to find co-occurrence of other events

Content from Medications


Last updated on 2025-09-17 | Edit this page

Overview

Questions

  • Where are medications stored ?

Objectives

  • Know that exposure of a patient to medications is mainly stored in the drug_exposure table
  • Understand that drug concepts can be at different levels of granularity
  • Understand that source values are mapped to a standard vocabulary

Introduction


The OMOP drug_exposure table stores exposure of a patient to medications.

The main columns are :

column name content
drug_exposure_id unique identifier given that person can get multiple exposures per visit
person_id the patient
drug_concept_id standard drug identifier, can be at different levels of granularity
drug_exposure_start_date may also be an optional start_datetime
drug_exposure_end_date may also be an optional end_datetime
drug_type_concept_id where the record came from e.g. EHR administration record
drug_source_value in UCLH the NHS dm+d Dictionary of Medicines and Devices ID
drug_source_concept_id OMOP concept ID for the source value

Drug data can be very complicated, as can the process of converting from the source data to OMOP. You may not find what you expect depending on this and the quality of the source data.

Drug concepts


The standard OHDSI drug vocabularies are called RxNorm and RxNormExtension. RxNorm contains all drugs currently on the US market. RxNormExtension is maintained by the OHDSI community and contains all other drugs.

A particular concept_id can be at one of a number of different levels in a drug hierarchy. These are the main levels in RxNorm which are stored in concept_class_id in the concept table.

RxNorm concept_class_id Description
Ingredient A base active drug ingredient, without strength or dose form (e.g., Ibuprofen).
Clinical Drug Component A drug component with strength but no form (e.g., Ibuprofen 200 mg).
Clinical Drug A combination of an ingredient, strength, and dose form (e.g., Ibuprofen 200 mg Oral Tablet).

Drug mapping in the NHS


Drugs in the NHS are standardised to the NHS Dictionary of Medicines and Devices (dm+d). dm+d is included in OMOP so there are values of OMOP concept_id for each dm+d. However because dm+d is not a standard vocabulary in OMOP it is translate once more to get to a standard OMOP concept id in RxNorm or RxNormExtension that can be used in collaborative studies. If there is a drug_concept_id value of 0 and there are source codes this can be because that drug doesn’t map to a standard ID. Reminder that the source values are stored in these columns.

column name content
drug_source_value in UCLH the NHS dm+d Dictionary of Medicines and Devices ID
drug_source_concept_id OMOP concept ID for the source value

TODO do we want to include about classifying medications by type ? It is not straightforward. I did something in omopcept but probably too tricky to include here.

R

# TODO later use the examples below or similar to create exercises
# OR do we want to make our own example data, e.g. including dm+d
# explore some drug_exposure data

library(CDMConnector)
library(DBI)
library(duckdb)
library(dplyr)


dbName <- "GiBleed" #smaller than other datasets quick to download

requireEunomia(datasetName = dbName)
con <- dbConnect(duckdb(), eunomiaDir(datasetName = dbName))
cdm <- cdmFromCon(con = con, cdmSchema = "main", writeSchema = "main", cdmName = dbName)


#67k records
#cdm$drug_exposure |> tally()

drugs_and_concept <- cdm$drug_exposure |>
  left_join(cdm$concept, join_by(drug_concept_id==concept_id)) |> 
  collect()
  
drugs_and_concept |> 
  count(concept_class_id, sort=TRUE) |> 
  head(5)
  
#  concept_class_id        n
#1 Clinical Drug       36218
#2 CVX                 25710
#3 Ingredient           2764
#4 Branded Pack         1640
#5 Quant Clinical Drug  1019  

#note that CVX is an RxNorm id for vaccines

# top 5 clinical drugs
drugs_and_concept |> 
  filter(concept_class_id=="Clinical Drug") |> 
  count(drug_concept_id, concept_name, sort=TRUE) |> 
  head(5)  
  
#  drug_concept_id concept_name                                                   n
#1         1127433 Acetaminophen 325 MG Oral Tablet                            9365
#2        19059056 Aspirin 81 MG Oral Tablet                                   4380
#3         1713671 Amoxicillin 250 MG / Clavulanate 125 MG Oral Tablet         3851
#4         1127078 Acetaminophen 160 MG Oral Tablet                            2158
#5        40229134 Acetaminophen 21.7 MG/ML / Dextromethorphan Hydrobromide …  1993
 
  
# top 5 ingredients
# in this case probably drugs that couldn't be converted to a more granular level
drugs_and_concept |> 
  filter(concept_class_id=="Ingredient") |> 
  count(drug_concept_id, concept_name, sort=TRUE) |> 
  head(3)    

#  drug_concept_id concept_name     n
#1         1118084 celecoxib     1844
#2         1124300 Diclofenac     850
#3         1367571 heparin         35

# TODO decide if we want to include drug quantity. Would need different example data.
# no drug quantity information in GiBleed
drugs_and_concept |> count(quantity, sort=TRUE)
# but there are plenty of days_supply
# oh dear, drug_strength has no rows so can't do the query below

# If you do have quantity data it can be converted to standard units using the
# drug_strength table that is part of the vocabularies.
# This will depend on the dose form which can be obtained from the relationship table.
# where the relationship_id is ‘Has dose form’.
# probably don't want to go into further here
# e.g. Tablets can be looked up 
#"Acetaminophen 325 MG Oral Tablet" from above 
cdm$drug_strength |>  filter(drug_concept_id==1127433L) |> 
  #add name for the ingredient
  left_join(select(cdm$concept,concept_id,concept_name), by=join_by(ingredient_concept_id==concept_id)) |> 
  rename(ingredient_concept_name=concept_name) |> 
  #add name for the amount unit
  left_join(select(cdm$concept,concept_id,concept_name), by=join_by(amount_unit_concept_id==concept_id)) |> 
  rename(amount_unit_concept_name=concept_name) |>   
  select(drug_concept_id, ingredient_concept_name, amount_value, amount_unit_concept_name) |> 
  collect()

# to get all drugs containing a particular ingredient
# you can also use the drug_strength table
Key Points
  • Know that exposure of a patient to medications is mainly stored in the drug_exposure table
  • Understand that drug concepts can be at different levels of granularity
  • Understand that source values are mapped to a standard vocabulary

Content from Community provided software for working with OMOP data


Last updated on 2025-09-17 | Edit this page

Overview

Questions

  • What are some of the R tools that can work on OMOP CDM instances ?

Objectives

  • Brief outline of some R tools that will be useful for new OMOP users.

Introduction


There are a range of community provided R tools that can help you work with instances of OMOP data.

We are going to show you a brief summary of some that are likely to be of use to new users.

With each you may need to balance the need to learn some new syntax with the benefits of the extra functionality that the package provides.

TODO maybe add a table with links to packages & brief descriptions.

OmopSketch To summarise key information about an OMOP database. To provide a broad characterisation of the data and to allow users to evaluate whether they are suitable for particular research.

First we can install the package and its dependencies and connect to some mock data.

R

#without dependencies=TRUE it failed needing omock & VisOmopResults
#with dependencies it installed 91 packages in 1.8 minutes
install.packages("OmopSketch", dependencies=TRUE, quiet=TRUE)

library(dplyr)
library(OmopSketch)

# Connect to mock database
cdm <- mockOmopSketch()

summarise* and table* functions

The package has the following types of functions :

function start what it does
summarise* generate results objects.
table* convert results objects to tables for display.

Snapshot

Snapshot creates a broad summary of the database including person count, temporal extent and other metadata.

R

summariseOmopSnapshot(cdm) |>
  tableOmopSnapshot(type = "gt")
Estimate
Database name
mockOmopSketch
General
Snapshot date 2025-09-17
Person count 100
Vocabulary version v5.0 18-JAN-19
Observation period
N 100
Start date 1958-01-22
End date 2019-12-24
Cdm
Source name eunomia
Version 5.3
Holder name -
Release date -
Description -
Documentation reference -
Source type duckdb

Missing data

Summarise missing data in each column of one or many cdm tables.

R

missingData <- summariseMissingData(cdm, c("drug_exposure"))
tableMissingData(missingData)
Column name Estimate name
Database name
mockOmopSketch
drug_exposure
drug_exposure_id N missing data (%) 0 (0.00%)
N zeros (%) 0 (0.00%)
person_id N missing data (%) 0 (0.00%)
N zeros (%) 0 (0.00%)
drug_concept_id N missing data (%) 0 (0.00%)
N zeros (%) 0 (0.00%)
drug_exposure_start_date N missing data (%) 0 (0.00%)
drug_exposure_start_datetime N missing data (%) 21,600 (100.00%)
drug_exposure_end_date N missing data (%) 0 (0.00%)
drug_exposure_end_datetime N missing data (%) 21,600 (100.00%)
verbatim_end_date N missing data (%) 21,600 (100.00%)
drug_type_concept_id N missing data (%) 0 (0.00%)
N zeros (%) 0 (0.00%)
stop_reason N missing data (%) 21,600 (100.00%)
refills N missing data (%) 21,600 (100.00%)
quantity N missing data (%) 21,600 (100.00%)
days_supply N missing data (%) 21,600 (100.00%)
sig N missing data (%) 21,600 (100.00%)
route_concept_id N missing data (%) 21,600 (100.00%)
N zeros (%) 0 (0.00%)
lot_number N missing data (%) 21,600 (100.00%)
provider_id N missing data (%) 21,600 (100.00%)
N zeros (%) 0 (0.00%)
visit_occurrence_id N missing data (%) 0 (0.00%)
N zeros (%) 0 (0.00%)
visit_detail_id N missing data (%) 21,600 (100.00%)
N zeros (%) 0 (0.00%)
drug_source_value N missing data (%) 21,600 (100.00%)
drug_source_concept_id N missing data (%) 21,600 (100.00%)
N zeros (%) 0 (0.00%)
route_source_value N missing data (%) 21,600 (100.00%)
dose_unit_source_value N missing data (%) 21,600 (100.00%)

Clinical Records

Allows you to summarise omop tables from a cdm. By default it gives measures including records per person, how many concepts are standard and source vocabularies.

R

summariseClinicalRecords(cdm, "condition_occurrence") |>
  tableClinicalRecords(type = "gt")
Variable name Variable level Estimate name
Database name
mockOmopSketch
condition_occurrence
Number records - N 8,400
Number subjects - N (%) 100 (100.00%)
Records per person - Mean (SD) 84.00 (9.83)
Median [Q25 - Q75] 84 [77 - 91]
Range [min to max] [65 to 107]
In observation Yes N (%) 8,400 (100.00%)
Domain Condition N (%) 8,400 (100.00%)
Source vocabulary No matching concept N (%) 8,400 (100.00%)
Standard concept S N (%) 8,400 (100.00%)
Type concept id Unknown type concept: 1 N (%) 8,400 (100.00%)

You can also

  • apply to more than one table at a time
  • reduce the measures of records per person
  • reduce the number of rows by setting options to FALSE
  • stratify by sex and/or age
  • set a date range

R

summariseClinicalRecords(cdm, c("drug_exposure","measurement"),
                         recordsPerPerson = c("mean", "sd"),
                         inObservation = FALSE,
                         standardConcept = FALSE,
                         sourceVocabulary = FALSE,
                         domainId = FALSE,
                         typeConcept = FALSE,
                         sex = TRUE) |> 
                     tableClinicalRecords(type = "gt")
Variable name Variable level Estimate name
Database name
mockOmopSketch
drug_exposure; overall
Number records - N 21,600.00
Number subjects - N (%) 100 (100.00%)
Records per person - Mean (SD) 216.00 (14.47)
drug_exposure; Female
Number records - N 13,526.00
Number subjects - N (%) 63 (100.00%)
Records per person - Mean (SD) 214.70 (15.68)
drug_exposure; Male
Number records - N 8,074.00
Number subjects - N (%) 37 (100.00%)
Records per person - Mean (SD) 218.22 (12.01)
measurement; overall
Number records - N 5,900.00
Number subjects - N (%) 100 (100.00%)
Records per person - Mean (SD) 59.00 (7.85)
measurement; Female
Number records - N 3,742.00
Number subjects - N (%) 63 (100.00%)
Records per person - Mean (SD) 59.40 (8.37)
measurement; Male
Number records - N 2,158.00
Number subjects - N (%) 37 (100.00%)
Records per person - Mean (SD) 58.32 (6.93)

Record counts over time

You can plot the number of records over time for any cdm table also stratified by age and sex (so behind the scenes this will be joining clinical tables to the person table).

R

recordCount <- summariseRecordCount(cdm, 
                                    omopTableName =  "drug_exposure",
                                    interval = "years",
                                    sex = TRUE,
                                    ageGroup =  list("<40" = c(0,39), ">=40" = c(40, Inf)),
                                    dateRange = as.Date(c("2002-01-01", NA))) 

plotRecordCount(recordCount, facet = "sex", colour = "age_group")

Concept Id counts

You can get a summary of the numbers of concept ids in a cdm table. Unfortuntaely the summary doesn’t display here but you can copy and paste the code into your console to see it.

R

result <- summariseConceptIdCounts(cdm = cdm, omopTableName = "condition_occurrence")
tableConceptIdCounts(head(result,5), display = "standard", type = "datatable")
Key Points
  • Brief outline of some R tools that will be useful for new OMOP users.

Content from Placeholder Chapter 8


Last updated on 2025-09-12 | Edit this page

Overview

Questions

  • What

Objectives

  • 1

Introduction


This is a lesson created via The Carpentries Workbench. It is written in Pandoc-flavored Markdown for static files and R Markdown for dynamic files that can render code into output. Please refer to the Introduction to The Carpentries Workbench for full documentation.

What you need to know is that there are three sections required for a valid Carpentries lesson template:

  1. questions are displayed at the beginning of the episode to prime the learner for the content.
  2. objectives are the learning objectives for an episode displayed with the questions.
  3. keypoints are displayed at the end of the episode to reinforce the objectives.
Challenge

Challenge 1: Can you do it?

What is the output of this command?

R

paste("This", "new", "lesson", "looks", "good")

OUTPUT

[1] "This new lesson looks good"
Challenge

Challenge 2: how do you nest solutions within challenge blocks?

You can add a line with at least three colons and a solution tag.

Figures


You can also include figures generated from R Markdown:

R

pie(
  c(Sky = 78, "Sunny side of pyramid" = 17, "Shady side of pyramid" = 5), 
  init.angle = 315, 
  col = c("deepskyblue", "yellow", "yellow3"), 
  border = FALSE
)
pie chart illusion of a pyramid
Sun arise each and every morning

Or you can use standard markdown for static figures with the following syntax:

![optional caption that appears below the figure](figure url){alt='alt text for accessibility purposes'}

Blue Carpentries hex person logo with no text.
You belong in The Carpentries!
Callout

Callout sections can highlight information.

They are sometimes used to emphasise particularly important points but are also used in some lessons to present “asides”: content that is not central to the narrative of the lesson, e.g. by providing the answer to a commonly-asked question.

Math


One of our episodes contains \(\LaTeX\) equations when describing how to create dynamic reports with {knitr}, so we now use mathjax to describe this:

$\alpha = \dfrac{1}{(1 - \beta)^2}$ becomes: \(\alpha = \dfrac{1}{(1 - \beta)^2}\)

Cool, right?

Key Points
  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally