Convert OA snapshot to Parquet format — snapshot_to

This function converts the OA (OpenAlex) snapshot data to Parquet format, processing each .gz file individually. Existing output files are skipped, allowing interrupted conversions to resume. On macOS, a .metadata_never_index file is created in the output directory to prevent Spotlight from indexing the parquet files.

Usage

snapshot_to_parquet(
  snapshot_dir = file.path("", "Volumes", "openalex", "openalex-snapshot"),
  parquet_dir = file.path("", "Volumes", "openalex", "parquet"),
  data_sets = NULL,
  sample_size = 20,
  temp_directory = NULL,
  memory_limit = NULL,
  workers = NULL
)

Arguments

snapshot_dir: The directory path of the OA snapshot data. Default is "Volumes/openalex/openalex-snapshot".
parquet_dir: The directory path where the Parquet files will be saved. Default is "Volumes/openalex/parquet".
data_sets: A character vector specifying the data sets to process. Default is NULL, which processes all data sets.
sample_size: Number of .gz files to sample for unified schema inference. Higher values give more accurate schemas but take longer. Default is 20. Use NULL or 0 to use all files.
temp_directory: Location of the temporary directory for DuckDB. Passed to each worker's DuckDB connection. Default is NULL (system default).
memory_limit: DuckDB memory limit per worker (e.g., "8GB"). Default is NULL (DuckDB default).
workers: Number of parallel workers for file conversion via future.apply::future_lapply(). Default is NULL (sequential processing).

Details

The conversion proceeds in two stages for each data set:

Schema inference: A sample of .gz files is read using DuckDB's read_json_auto() with union_by_name = true to infer a unified schema. This ensures all output parquet files have consistent column types.
Per-file conversion: Each .gz file is converted individually to a .parquet file. When workers > 1, files are processed in parallel using future::multisession, with each worker creating its own DuckDB connection.

Already-converted files (those with a matching .parquet output) are automatically skipped, so the function can resume after interruption.

Examples

if (FALSE) { # \dontrun{
# Convert all data sets in the default snapshot directory
snapshot_to_parquet()

# Convert specific data sets with parallel processing
snapshot_to_parquet(
  snapshot_dir = "/path/to/snapshot",
  data_sets = c("authors", "works"),
  workers = 4,
  memory_limit = "8GB"
)
} # }