Parquet files

Last updated on 2026-02-08

Overview

Questions

  • What is a parquet file?

  • How to explore and open parquet files in R?

Objectives

  • Understand the structure of parquet files.

  • Learn how to read parquet files in R.

Introduction


In this episode, we will explore parquet files, a popular file format for storing large datasets efficiently. We will learn how to read parquet files in R and understand their structure.

Callout

For this episode we will be using a sample OMOP CDM database that is pre-loaded with data. This database is a simplified version of a real-world OMOP CDM database and is intended for educational purposes only.

(UCLH only) This will come in the same form as you would get data if you asked for a data extract via the SAFEHR platform (i.e. a set of parquet files).

As part of the setup for this course you were asked to download and install the sample database. If you have not done this yet, please refer to the setup instructions provided earlier in the course. From here on, we assume that the sample OMOP CDM database is available on your local machine at ../workshop/data/public/ and that the helper functions are in the folder ../workshop/code/parquet_dataset.

Parquet files


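Parquet is a columnar storage file format designed for use with big data processing frameworks. Rather than storing data row by row, as a CSV file does, it stores the values of each column together. This keeps files compact, because similar values compress well, and makes them fast to query, because a reader can load only the columns it needs. Parquet files are widely used in data warehousing and big data analytics applications.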
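To make "columnar" concrete, the sketch below writes a small made-up data frame to a temporary Parquet file and reads back only two of its columns. The table contents and variable names are invented for illustration and are not part of the workshop data; write_parquet() and the col_select argument of read_parquet() come from the arrow package.

```r
library(arrow)

# A small made-up table (not part of the workshop data)
toy <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  value = c(0.5, 1.5, 2.5)
)

# Write it to a temporary Parquet file
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)

# Ask for a subset of columns; the "name" column is not returned
id_value <- read_parquet(path, col_select = c("id", "value"))
names(id_value)
```

With a row-oriented format such as CSV, every row has to be parsed in full even when only a few columns are needed.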

Exploring Parquet files


We have provided a function that will allow you to browse the structure of the data in the same way as we did with the database in the previous episode. This code is available below or in the downloaded workshop/code/open_omop_dataset.R file. You can source this file to load the function into your R environment.

R

open_omop_dataset <- function(dir) {
  open_omop_schema <- function(path) {
    # list the table-level folders inside one schema folder
    list.dirs(path, recursive = FALSE) |>
      # name each element after its folder
      # (the last component of the path)
      purrr::set_names(~ basename(.)) |>
      # "lazy-open" the parquet files in each
      # table folder as an arrow Dataset
      purrr::map(arrow::open_dataset)
  }
  # list the top-level (schema) folders
  list.dirs(dir, recursive = FALSE) |>
    # name each element after its folder
    purrr::set_names(~ basename(.)) |>
    # open every table within each schema folder
    purrr::map(open_omop_schema)
}

CODING_NOTE: This function uses the arrow package to read the parquet files. arrow's open_dataset() function opens parquet files lazily, without loading the entire dataset into memory, which is particularly useful when working with large datasets. The function looks reasonably complex, but it is designed to work with any OMOP CDM dataset that is laid out in the same way as the one used in this course: a top-level folder for each schema, containing one folder per table. It opens the parquet files in every table folder under the specified directory and returns a nested list, so that each table can be reached by schema name and then table name. We leave it to you to explore the code and understand how it works.
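The folder layout the function relies on, and the lazy object that open_dataset() returns, can be seen on a toy example. This is a minimal sketch: the temporary directory, the table contents and the variable names are made up for illustration and only mimic the shape of the workshop data.

```r
library(arrow)

# Build a toy layout in a temporary directory:
#   <root>/<schema>/<table>/<file>.parquet
# (the values written here are made up)
root <- file.path(tempdir(), "toy_omop")
table_dir <- file.path(root, "public", "person")
dir.create(table_dir, recursive = TRUE, showWarnings = FALSE)
write_parquet(
  data.frame(person_id = 1:3, year_of_birth = c(1990L, 1985L, 2001L)),
  file.path(table_dir, "person.parquet")
)

# Top-level folders supply the outer list names,
# table folders supply the inner list names
schema_names <- list.dirs(root, recursive = FALSE, full.names = FALSE)
table_names <- list.dirs(file.path(root, "public"),
                         recursive = FALSE, full.names = FALSE)
schema_names
table_names

# open_dataset() on a table folder returns a lazy Dataset object,
# not a data frame
person_ds <- open_dataset(table_dir)
class(person_ds)
is.data.frame(person_ds)
```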

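Now we can use this function to open the sample OMOP CDM dataset and explore it in the same way as we did with the database in the previous episode. The data are located in the workshop/data/public/ directory. We pass the parent data directory to the function, so public becomes the first level of the returned list. The relative path below assumes that your working directory is the workshop folder.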

R

omop <- open_omop_dataset("./data/")

Explore the data using the following:

R

omop$public

OUTPUT

$concept
FileSystemDataset with 1 Parquet file
6 columns
concept_id: int32
concept_name: string
domain_id: string
vocabulary_id: string
standard_concept: string
concept_class_id: string

See $metadata for additional Schema metadata

$condition_occurrence
FileSystemDataset with 1 Parquet file
10 columns
condition_occurrence_id: int32
person_id: int32
condition_concept_id: int32
condition_start_date: string
condition_end_date: string
condition_type_concept_id: int32
condition_status_concept_id: int32
visit_occurrence_id: int32
condition_source_value: string
condition_source_concept_id: int32

See $metadata for additional Schema metadata

$drug_exposure
FileSystemDataset with 1 Parquet file
12 columns
drug_exposure_id: int32
person_id: int32
drug_concept_id: int32
drug_exposure_start_date: string
drug_exposure_start_datetime: string
drug_exposure_end_date: string
drug_exposure_end_datetime: string
drug_type_concept_id: int32
quantity: double
route_concept_id: int32
visit_occurrence_id: int32
drug_source_concept_id: int32

See $metadata for additional Schema metadata

$measurement
FileSystemDataset with 1 Parquet file
12 columns
measurement_id: int32
person_id: int32
measurement_concept_id: int32
measurement_date: string
measurement_datetime: string
operator_concept_id: int32
value_as_number: double
value_as_concept_id: int32
unit_concept_id: int32
range_low: int32
range_high: int32
visit_occurrence_id: int32

See $metadata for additional Schema metadata

$observation
FileSystemDataset with 1 Parquet file
9 columns
observation_id: int32
person_id: int32
observation_concept_id: int32
observation_date: string
observation_datetime: string
value_as_number: int32
value_as_string: string
value_as_concept_id: int32
visit_occurrence_id: int32

See $metadata for additional Schema metadata

$person
FileSystemDataset with 1 Parquet file
8 columns
person_id: int32
gender_concept_id: int32
year_of_birth: int32
month_of_birth: int32
day_of_birth: int32
race_concept_id: int32
gender_source_value: string
race_source_value: string

See $metadata for additional Schema metadata

$procedure_occurrence
FileSystemDataset with 1 Parquet file
7 columns
procedure_occurrence_id: int32
person_id: int32
procedure_concept_id: int32
procedure_date: string
procedure_datetime: string
procedure_type_concept_id: int32
visit_occurrence_id: int32

See $metadata for additional Schema metadata

$visit_occurrence
FileSystemDataset with 1 Parquet file
10 columns
visit_occurrence_id: int32
person_id: int32
visit_concept_id: int32
visit_start_date: string
visit_start_datetime: string
visit_end_date: string
visit_end_datetime: string
visit_type_concept_id: int32
discharged_to_concept_id: int32
preceding_visit_occurrence_id: int32

See $metadata for additional Schema metadata

This lists every table in the dataset together with the columns each one contains. It is obviously a much smaller dataset! You can also inspect an individual table, which shows its column names and the data type of each column.

R

omop$public$person

OUTPUT

FileSystemDataset with 1 Parquet file
8 columns
person_id: int32
gender_concept_id: int32
year_of_birth: int32
month_of_birth: int32
day_of_birth: int32
race_concept_id: int32
gender_source_value: string
race_source_value: string

See $metadata for additional Schema metadata
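Besides printing a dataset, you can query its structure programmatically. The sketch below uses a small made-up table as a stand-in for omop$public$person, so the folder name and values are assumptions for illustration only; names(), dim() and the $schema field are provided by arrow for Dataset objects.

```r
library(arrow)

# Toy table folder standing in for omop$public$person (values are made up)
table_dir <- file.path(tempdir(), "toy_person")
dir.create(table_dir, showWarnings = FALSE)
write_parquet(
  data.frame(person_id = 1:3, gender_source_value = c("M", "F", NA)),
  file.path(table_dir, "person.parquet")
)
person_ds <- open_dataset(table_dir)

names(person_ds)   # column names only
person_ds$schema   # column names with their Arrow data types
dim(person_ds)     # number of rows and number of columns
```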

To load a table into memory, we can use collect():

R

library(dplyr)
person <- omop$public$person |> collect()
person

OUTPUT

# A tibble: 8 × 8
  person_id gender_concept_id year_of_birth month_of_birth day_of_birth
      <int>             <int>         <int>          <int>        <int>
1      1111              8507          1993              6           15
2      1112              8532          1970              6           15
3      1113              8507          1983              6           15
4     34567              8532          2015              6           15
5     78901              8532          1989              6           15
6        31              8532          1987              0            0
7         2              8532          2008              0            0
8        58              8507          1985              0            0
# ℹ 3 more variables: race_concept_id <int>, gender_source_value <chr>,
#   race_source_value <chr>

CODING_NOTE: The collect() function is used to actually read the data from the parquet files into memory. This is necessary because the open_dataset() function creates a reference to the data rather than loading it into memory. By using collect(), we can work with the data as a regular data frame in R.
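Because the dataset is only a reference, dplyr verbs can be applied before collect() so that only the rows and columns of interest are read into memory. The following is a minimal sketch under stated assumptions: the toy table reuses the first four rows of the person output printed above, and the folder and variable names are invented for illustration.

```r
library(arrow)
library(dplyr)

# Toy stand-in for omop$public$person, using the first four rows printed above
table_dir <- file.path(tempdir(), "toy_person_lazy")
dir.create(table_dir, showWarnings = FALSE)
write_parquet(
  data.frame(
    person_id = c(1111L, 1112L, 1113L, 34567L),
    gender_concept_id = c(8507L, 8532L, 8507L, 8532L),
    year_of_birth = c(1993L, 1970L, 1983L, 2015L)
  ),
  file.path(table_dir, "person.parquet")
)
person_ds <- open_dataset(table_dir)

# Nothing is read yet: this only describes the query
query <- person_ds |>
  filter(year_of_birth < 1990) |>
  select(person_id, year_of_birth)

# collect() runs the query and returns only the matching rows
born_before_1990 <- collect(query)
born_before_1990
```

On a table this small the difference is negligible, but on large tables filtering before collect() can avoid loading most of the data.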

Alternatively, we can use read_parquet() from the arrow package to read a single parquet file directly.

R

library(arrow)
person <- read_parquet("./data/public/person/person.parquet")
person

OUTPUT

# A tibble: 8 × 8
  person_id gender_concept_id year_of_birth month_of_birth day_of_birth
      <int>             <int>         <int>          <int>        <int>
1      1111              8507          1993              6           15
2      1112              8532          1970              6           15
3      1113              8507          1983              6           15
4     34567              8532          2015              6           15
5     78901              8532          1989              6           15
6        31              8532          1987              0            0
7         2              8532          2008              0            0
8        58              8507          1985              0            0
# ℹ 3 more variables: race_concept_id <int>, gender_source_value <chr>,
#   race_source_value <chr>

CODING_NOTE: The read_parquet() function from the arrow package reads a single parquet file straight into R as a data frame. This is useful when we only want to work with one table and do not need to open the whole dataset.

Callout

As part of the privacy-preserving policies around health data, dates of birth are often de-identified so that only the year of birth is shown. This is why, in this dataset, the day and month of birth are set to either 15 and 6 or 0 and 0 for every individual.

You can also see that in some cases the gender_source_value and race_source_value columns are populated while in others they are not. This depends on the policy of the individual hospital. Note that this dataset combines data from a number of different sources into a single OMOP CDM database.
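The placeholder pattern in the dates of birth can be checked with a quick count. This sketch retypes the month and day values from the eight rows of the person table printed earlier, so that it runs on its own; with the real data you would start from the collected person table instead. The variable names are assumptions for illustration.

```r
library(dplyr)

# Month/day values copied from the eight rows of the person table printed earlier
person_dates <- data.frame(
  person_id = c(1111, 1112, 1113, 34567, 78901, 31, 2, 58),
  month_of_birth = c(6, 6, 6, 6, 6, 0, 0, 0),
  day_of_birth = c(15, 15, 15, 15, 15, 0, 0, 0)
)

# Count how many people share each month/day combination
date_counts <- person_dates |>
  count(month_of_birth, day_of_birth)
date_counts
```

Only two combinations appear: five people have month 6 and day 15, and three have month 0 and day 0.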

Challenge

Adapt the code we developed for the get_concept_name function in the previous episode to work with this parquet dataset.

R

library(arrow)
library(dplyr)
get_concept_name <- function(id) {
  omop$public$concept |>
    filter(concept_id == !!id) |>
    select(concept_name) |>
    collect()
}

CODING_NOTE: The get_concept_name() function is adapted to work with the parquet file dataset. It uses the filter() and select() functions from the dplyr package to query the concept table in the parquet dataset. The !! operator inserts the value of the id argument into the query, so it is treated as a value rather than as a column name. The collect() function is used to read the result into memory so that we can work with it as a regular data frame in R.

Now we can use this function to look up concept names by their concept_id.

R

get_concept_name(8507)

OUTPUT

# A tibble: 1 × 1
  concept_name
  <chr>
1 Male        

Answer: The concept_id 8507 corresponds to the concept “Male”.
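One possible extension, sketched here under stated assumptions, is to look up several concept ids in a single query using %in%. The helper get_concept_names() is hypothetical and not part of the workshop code. The toy concept table contains the 8507 "Male" row from the output above plus one made-up placeholder row that is not a real OMOP concept.

```r
library(arrow)
library(dplyr)

# Toy concept table. 8507 -> "Male" comes from the output above;
# the second row is a made-up placeholder, not a real OMOP concept.
concept_dir <- file.path(tempdir(), "toy_concept")
dir.create(concept_dir, showWarnings = FALSE)
write_parquet(
  data.frame(
    concept_id = c(8507L, 999999L),
    concept_name = c("Male", "Placeholder concept")
  ),
  file.path(concept_dir, "concept.parquet")
)
concept_ds <- open_dataset(concept_dir)

# Hypothetical helper: look up several concept ids in one query
get_concept_names <- function(ids, concept = concept_ds) {
  # match the int32 type of the concept_id column
  ids <- as.integer(ids)
  concept |>
    filter(concept_id %in% !!ids) |>
    select(concept_id, concept_name) |>
    collect()
}

concept_lookup <- get_concept_names(c(8507, 999999))
concept_lookup
```

With the workshop data, the default for the concept argument could be replaced by omop$public$concept.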

Key Points

  • Parquet files are a columnar storage file format optimized for big data processing.

  • The arrow package in R can be used to read and manipulate parquet files.