This function converts the OA (OpenAlex) snapshot data to Parquet format,
processing each .gz file individually. Existing output files are skipped,
allowing interrupted conversions to resume. On macOS, a .metadata_never_index
file is created in the output directory to prevent Spotlight from indexing
the parquet files.
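The Spotlight marker can be illustrated with a short sketch. The helper name add_spotlight_marker() and its sysname argument are hypothetical and are not part of the package; the sketch assumes that macOS is detected through Sys.info().

```r
# Hypothetical helper: place an empty .metadata_never_index file in a
# directory so that Spotlight skips it. Only acts on macOS ("Darwin").
add_spotlight_marker <- function(dir, sysname = Sys.info()[["sysname"]]) {
  if (!identical(sysname, "Darwin")) {
    return(invisible(FALSE))
  }
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  marker <- file.path(dir, ".metadata_never_index")
  if (!file.exists(marker)) {
    file.create(marker)
  }
  invisible(TRUE)
}

# Demo in a temporary directory; sysname is forced so the macOS branch runs
out_dir <- file.path(tempdir(), "parquet_demo")
add_spotlight_marker(out_dir, sysname = "Darwin")
```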
Arguments
- snapshot_dir
The directory path of the OA snapshot data. Default is
"Volumes/openalex/openalex-snapshot".
- parquet_dir
The directory path where the Parquet files will be saved. Default is
"Volumes/openalex/parquet".
- data_sets
A character vector specifying the data sets to process. Default is
NULL, which processes all data sets.
- sample_size
Number of .gz files to sample for unified schema inference. Higher values
give more accurate schemas but take longer. Default is 20. Use NULL or 0
to use all files.
- temp_directory
Location of the temporary directory for DuckDB. Passed to each worker's
DuckDB connection. Default is NULL (system default).
- memory_limit
DuckDB memory limit per worker (e.g., "8GB"). Default is NULL (DuckDB
default).
- workers
Number of parallel workers for file conversion via
future.apply::future_lapply(). Default is NULL (sequential processing).
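The sample_size semantics described above (NULL or 0 meaning "use all files") could look roughly like the following sketch. The helper sample_gz_files() is hypothetical, and random sampling is an assumption; the documentation does not say how the sample is drawn.

```r
# Hypothetical helper mirroring the documented sample_size semantics:
# NULL or 0 means "use every file"; otherwise draw at most sample_size files.
# Random selection is an assumption, not confirmed by the documentation.
sample_gz_files <- function(files, sample_size = 20) {
  if (is.null(sample_size) || sample_size == 0 ||
      sample_size >= length(files)) {
    return(files)
  }
  # Keep the original file order among the sampled files
  files[sort(sample.int(length(files), sample_size))]
}

files <- sprintf("part_%03d.gz", 0:99)
n_default <- length(sample_gz_files(files))               # default sample of 20
n_all <- length(sample_gz_files(files, sample_size = NULL))  # all 100 files
```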
Details
The conversion proceeds in two stages for each data set:
- Schema inference: A sample of .gz files is read using DuckDB's
read_json_auto() with union_by_name = true to infer a unified schema. This
ensures all output Parquet files have consistent column types.
- Per-file conversion: Each .gz file is converted individually to a .parquet
file. When workers > 1, files are processed in parallel using
future::multisession, with each worker creating its own DuckDB connection.
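As a rough illustration of the two stages, the sketch below builds the kind of DuckDB SQL each stage might issue. It only constructs query strings; running them would require a DuckDB connection (for example via DBI::dbGetQuery() and DBI::dbExecute()). The helper names sql_str(), schema_query() and convert_query() are hypothetical, and passing the inferred types to read_json() through its columns argument is an assumption about how the unified schema is applied, not the package's confirmed implementation.

```r
# Quote a string as a SQL literal (embedded single quotes are doubled).
sql_str <- function(x) paste0("'", gsub("'", "''", x, fixed = TRUE), "'")

# Stage 1 (assumed form): describe the unified schema of a sample of files.
schema_query <- function(sample_files) {
  sprintf(
    "DESCRIBE SELECT * FROM read_json_auto([%s], union_by_name = true)",
    paste(sql_str(sample_files), collapse = ", ")
  )
}

# Stage 2 (assumed form): convert one .gz file using fixed column types.
# `schema` is a named character vector: names are column names,
# values are DuckDB type names.
convert_query <- function(gz_file, parquet_file, schema) {
  cols <- paste(
    sprintf("%s: %s", sql_str(names(schema)), sql_str(schema)),
    collapse = ", "
  )
  sprintf(
    "COPY (SELECT * FROM read_json(%s, columns = {%s})) TO %s (FORMAT PARQUET)",
    sql_str(gz_file), cols, sql_str(parquet_file)
  )
}

q1 <- schema_query(c("a.gz", "b.gz"))
q2 <- convert_query("a.gz", "a.parquet",
                    c(id = "VARCHAR", cited_by_count = "BIGINT"))
```

Fixing the column types up front is what would keep every output file on the same schema, even when an individual .gz file lacks some fields.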
Already-converted files (those with a matching .parquet output) are
automatically skipped, so the function can resume after interruption.
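The resume behaviour amounts to filtering out inputs whose output already exists. A minimal sketch follows, assuming the output tree mirrors the snapshot layout with .gz replaced by .parquet; the helper pending_files() is hypothetical and not part of the package.

```r
# Hypothetical helper: return the .gz files that still lack a .parquet output.
# Assumes outputs mirror the input layout, with ".gz" replaced by ".parquet".
pending_files <- function(snapshot_dir, parquet_dir) {
  gz <- list.files(snapshot_dir, pattern = "\\.gz$", recursive = TRUE)
  out <- file.path(parquet_dir, sub("\\.gz$", ".parquet", gz))
  gz[!file.exists(out)]
}

# Demo with a throwaway directory tree
src <- file.path(tempdir(), "snap_demo", "works")
dst <- file.path(tempdir(), "pq_demo", "works")
dir.create(src, recursive = TRUE, showWarnings = FALSE)
dir.create(dst, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(src, c("part_000.gz", "part_001.gz")))
file.create(file.path(dst, "part_000.parquet"))  # pretend this one is done

# Only the file without a matching .parquet output remains to be converted
todo <- pending_files(dirname(src), dirname(dst))
```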
Examples
if (FALSE) { # \dontrun{
# Convert all data sets in the default snapshot directory
snapshot_to_parquet()
# Convert specific data sets with parallel processing
snapshot_to_parquet(
  snapshot_dir = "/path/to/snapshot",
  data_sets = c("authors", "works"),
  workers = 4,
  memory_limit = "8GB"
)
} # }