Notes for developers
Source:DEVELOPERS.md
Data pipeline
Users call get_*(), which reads tidy_data_sources.yml for the current version and URL, downloads the pre-built parquet if not already cached, and caches it to ~/.cache/R/nhstt/tidy/{dataset}.parquet. Subsequent calls return the cached file instantly.
Developers run build_tidy_data(dataset) (just build-data for all), which downloads raw source files, applies the tidy pipeline for every period, and writes a combined parquet to data-raw/. Parquets are published as GitHub Release assets and are directly usable from Python, Julia, or any language that reads parquet.
Available datasets
| Function | Frequency | First period | Last period | Periods | Version |
|---|---|---|---|---|---|
get_key_measures_annual() |
annual | 2017-18 | 2024-25 | 8 | 0.2.0 |
get_proms_annual() |
annual | 2019-20 | 2024-25 | 6 | 0.1.0 |
get_therapy_position_annual() |
annual | 2019-20 | 2024-25 | 6 | 0.1.0 |
get_activity_performance_monthly() |
monthly | 2023-05 | 2026-03 | 35 | 0.3.0 |
get_metadata_measures_annual() |
annual | 2024-25 | 2024-25 | 1 | 0.1.0 |
get_metadata_variables_annual() |
annual | 2024-25 | 2024-25 | 1 | 0.1.0 |
get_metadata_monthly() |
monthly | 2026-05 | 2026-05 | 1 | 0.1.0 |
get_metadata_providers() |
live | current | current | 1 | 0.1.0 |
Approximate build times
One period timed per dataset; extrapolated to full build. Re-run with just render-developers.
| Dataset | Timed period | Periods | Time for one (s) | Est. full build (min) |
|---|---|---|---|---|
key_measures_annual |
2024-25 | 8 | 61.4 | 8.2 |
proms_annual |
2024-25 | 6 | 116.5 | 11.6 |
therapy_position_annual |
2024-25 | 6 | 0.3 | 0.0 |
activity_performance_monthly |
2026-03 | 35 | 5.4 | 3.2 |
Publishing new data
Each dataset is released independently — updating monthly data does not affect annual dataset versions or caches.
Update raw config (e.g.
raw_monthly_data_config.yml) with new sourcesjust build-data— rebuilds all parquets and writes todata-raw/-
Create a GitHub Release using
just release(run after merging to main): Update
versionfor that dataset only ininst/config/tidy_data_sources.yml
Provider metadata is published the same way, but its parquet is built from the live ODS API snapshot:
Configuration files
All in inst/config/:
-
tidy_data_sources.yml— version per released dataset; GitHub Release URLs are derived from the dataset name and version -
raw_*_config.yml— raw source archives, URLs, and periods -
tidy_*_config.yml— tidy transformations (filters, derivations, columns)
After changing raw config column structure, run just update-schemas to update inst/schemas/.
Dev utilities
Internal functions in R/dev-utils.R (load with devtools::load_all()):
-
list_archives_periods()— all archives and available periods -
list_archive_files("annual_main", "2024-25")— CSV files inside an archive -
read_archive_file("annual_main", "2024-25", "main")— read a specific file -
compare_schemas("main")— column changes across periods -
build_tidy_data("dataset_name")— build combined parquet for one dataset