| Title: | Fast Conversion and Querying of Danish Registers with 'Parquet' |
|---|---|
| Description: | Converts large Danish register files ('sas7bdat') into 'Parquet' format with year-based 'Hive' partitioning and chunked reading for larger-than-memory files. Supports parallel conversion with a 'targets' pipeline and reading those registers into 'DuckDB' tables for faster querying and analyses. |
| Authors: | Signe Kirk Brødbæk [aut, cre] (ORCID: <https://orcid.org/0009-0000-2208-7088>), Luke Johnston [aut] (ORCID: <https://orcid.org/0000-0003-4169-2616>), Steno Diabetes Center Aarhus [cph], Aarhus University [cph] |
| Maintainer: | Signe Kirk Brødbæk <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.12.5 |
| Built: | 2026-06-02 14:55:33 UTC |
| Source: | https://github.com/dp-next/fastreg |
To handle larger-than-memory files, the SAS file is converted in chunks. convert() does not check for existing files in the output directory. Because output files are saved with UUIDs in their names, existing data is never overwritten, but converting the same file into the same directory again may duplicate the data.
convert(path, output_dir, chunk_size = 10000000L)
| Argument | Description |
|---|---|
| path | Path to a single SAS file. |
| output_dir | Directory to save the Parquet output to. Must not include the register name as this will be extracted from |
| chunk_size | Number of rows to read and convert at a time. |
A tibble with a conversion log about each written chunk.
sas_file <- fs::path_package("fastreg", "extdata", "test.sas7bdat")
convert(
  path = sas_file,
  output_dir = fs::path_temp("path/to/output/file")
)
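As a rough illustration of the chunked approach described above, the sketch below works out which rows each chunk would cover for a given chunk_size. It uses only base R. The helper chunk_ranges() is hypothetical and not part of fastreg, and fastreg's actual implementation may differ.

```r
# Hypothetical helper (not part of fastreg): split `n_rows` rows into
# consecutive chunks of at most `chunk_size` rows.
chunk_ranges <- function(n_rows, chunk_size = 10000000L) {
  stopifnot(n_rows >= 0, chunk_size > 0)
  # Number of rows to skip before each chunk starts.
  skip <- seq(0, max(n_rows - 1, 0), by = chunk_size)
  data.frame(
    chunk = seq_along(skip),
    skip = skip,
    # The last chunk may hold fewer rows than `chunk_size`.
    rows = pmin(chunk_size, n_rows - skip)
  )
}

# 25 rows in chunks of 10 give three chunks with 10, 10, and 5 rows.
chunk_ranges(n_rows = 25, chunk_size = 10)
```

Reading one chunk at a time keeps memory use bounded by chunk_size rather than by the size of the whole SAS file.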
Only lists Parquet files whose names match part-*.parquet. For datasets,
only Parquet files with year=YYYY in their path are considered.
The search covers the whole system for the project ID, so it can
be slow.
list_parquet_datasets()
list_parquet_files()
The path(s) to the Parquet datasets (as directories) or files.
list_parquet_datasets(): List all Parquet (Hive partitioned by year) datasets.
list_parquet_files(): List all Parquet files within a project.
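The two filters described above (the part-*.parquet file name and the year=YYYY path component) can be sketched in base R on a made-up set of paths. The paths and variable names are illustrative assumptions. Treating the directory above the year= partition as the dataset is also an assumption about what "datasets (as directories)" means, and none of this is fastreg's own code.

```r
# Made-up example paths; none of these exist on disk.
paths <- c(
  "rawdata/bef/year=1999/part-0a1b.parquet",
  "rawdata/bef/year=2000/part-2c3d.parquet",
  "rawdata/bef/summary.parquet",
  "workdata/lookup/part-4e5f.parquet"
)

# Files: keep names matching the glob part-*.parquet.
is_part_file <- grepl(utils::glob2rx("part-*.parquet"), basename(paths))
parquet_files <- paths[is_part_file]

# Datasets: keep only files with a year=YYYY directory in their path,
# then take the directory above the year partition (an assumption).
has_year <- grepl("year=[0-9]{4}", parquet_files)
parquet_datasets <- unique(dirname(dirname(parquet_files[has_year])))

parquet_files
parquet_datasets
```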
Lists all SAS register files (files with the extension .sas7bdat,
matched case-insensitively) in the specified directory and its subdirectories.
list_sas_files(path)
| Argument | Description |
|---|---|
| path | Directory to search. |
The path(s) to the found SAS file(s).
list_sas_files(fs::path_package("fastreg", "extdata"))
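To make the matching rule concrete, the following base R sketch performs a comparable recursive, case-insensitive search with list.files(). It shows the behaviour described above, not how list_sas_files() is implemented. The directory and file names are made up for the example.

```r
# Create a throwaway directory tree with mixed-case extensions.
root <- file.path(tempdir(), "sas-listing-sketch")
dir.create(file.path(root, "nested"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, c("bef1999.sas7bdat", "notes.txt")))
file.create(file.path(root, "nested", "LMDB2000.SAS7BDAT"))

# Recursive, case-insensitive match on the .sas7bdat extension.
found <- list.files(
  root,
  pattern = "\\.sas7bdat$",
  recursive = TRUE,
  ignore.case = TRUE,
  full.names = TRUE
)
basename(found)
```

Both the lower-case and the upper-case file are found, while notes.txt is skipped.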
Turns the log information returned by convert() into a pretty
table, showing relative input/output paths and row counts.
print_log_row_count(log)
| Argument | Description |
|---|---|
| log | A tibble returned by |
The log tibble, invisibly.
sas_file <- fs::path_package("fastreg", "extdata", "test.sas7bdat")
conversion_log <- convert(sas_file, output_dir = fs::path_temp("output"))
print_log_row_count(conversion_log)
These functions are useful when read_register() guesses the path incorrectly
or can't find the register.
read_parquet_dataset(path)
read_parquet_file(path)
| Argument | Description |
|---|---|
| path | Path to a directory containing the Parquet files, or a path to a single Parquet file. |
A DuckDB table.
read_parquet_dataset(): Reads a Parquet partitioned directory.
read_parquet_file(): Reads a single Parquet file.
This function uses the options fastreg.project_rawdata_dir and
fastreg.project_workdata_dir when they are set with options(). Otherwise, it
tries to guess the path from the project ID and the base directories
E:/<project-id>/rawdata/ and E:/<project-id>/workdata/. It only reads
Parquet datasets (those partitioned with the pattern year=). If
this function doesn't work, use read_parquet_dataset() or
read_parquet_file() instead.
read_register(name)
| Argument | Description |
|---|---|
| name | Name of the Parquet dataset (i.e., the register name). See a list of available datasets with |
A DuckDB table.
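As a hedged sketch of the options mentioned above, the snippet below sets both options for a session and restores the previous values afterwards. The directory values are placeholders that reuse the E:/<project-id>/ pattern from the description. Whether fastreg expects exactly this form is an assumption, and "bef" is only an example register name.

```r
# Both paths are placeholders; replace them with your project's directories.
old <- options(
  fastreg.project_rawdata_dir = "E:/<project-id>/rawdata/",
  fastreg.project_workdata_dir = "E:/<project-id>/workdata/"
)

# Confirm what is currently set.
getOption("fastreg.project_rawdata_dir")

# With fastreg installed and a converted register available, the register
# could then be read by name ("bef" is only an example name):
# bef <- fastreg::read_register("bef")

# Restore the previous option values when done.
options(old)
```

Setting the options in a project-level .Rprofile would avoid repeating this in every script.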
A helper function that simulates data using
osdc::simulate_registers(). It's used in vignettes and tests.
It simulates data for one or more registers and years.
simulate_registers_with_paths(
  registers,
  years = "",
  n = 1000,
  output_dir = fs::path_temp("E/rawdata/701010/")
)
| Argument | Description |
|---|---|
| registers | Name of one or more registers. Must be a register that |
| years | One or more years to save the simulated data under. The year is used as a suffix in the file name. For example, for register "bef" and year "1999", the file will be named |
| n | Number of rows of data to simulate per year. |
| output_dir | The root directory appended to the created SAS paths. By default, the output_dir is a temp path that mimics the paths on DST, |
A nested tibble with a column data containing the simulated data
and a column output_path containing the path where the SAS file should
be saved to. Pipe to purrr::pwalk(write_to_sas) or purrr::pmap(write_to_sas)
to write each simulated dataset to a SAS file.
sim_regs <- simulate_registers_with_paths(
  registers = c("bef", "lmdb"),
  years = c("1999", "2000"),
  n = 10
)
sim_regs
sim_regs |> purrr::pwalk(write_to_sas)
Copies a _targets.R template and a conversion log Quarto Markdown file to
the given directory.
use_template(path = ".", open = rlang::is_interactive())
| Argument | Description |
|---|---|
| path | Path to the directory where the targets pipeline and conversion log will be created. Defaults to the current directory. |
| open | Whether to open the file for editing. |
The path to the created _targets.R file, invisibly.
use_template(path = fs::path_temp(""))
A helper function that writes a data frame to a SAS file. It's used
mainly in fastreg's vignettes and tests. Pipe the output of
simulate_registers_with_paths() to purrr::pwalk() with this function as the
callback to write each simulated dataset to a SAS file.
write_to_sas(data, output_path)
| Argument | Description |
|---|---|
| data | A tibble containing the simulated data. |
| output_path | A string of the path to where the SAS file should be saved. |
Invisibly gives the path to the saved SAS file.