Command: convert

Convert OpenAlex snapshot .json.gz files into parquet while preserving relative structure.

Usage

openalex-snapshot convert \
  --root-dir /data \
  --dataset works \
  --profile balanced \
  --workers 4

Key behavior

  • 1 input .gz maps to 1 output .parquet (unless --split-size is set; see below)
  • Resume-safe output skipping
  • Verification is separate via verify_convert
  • Supports selected-file conversion via repeated --input-file
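The 1-to-1 mapping and resume-safe skipping above can be sketched as follows. This is an illustrative sketch, not the tool's actual code: the function names and the output-root layout are assumptions; only the .json.gz → .parquet rename and the "skip if the output already exists" rule come from the docs.

```python
from pathlib import Path

def output_path(root_dir: Path, input_gz: Path, out_root: Path) -> Path:
    """Map one snapshot .json.gz to one .parquet output, preserving the
    path relative to root_dir (hypothetical helper)."""
    rel = input_gz.relative_to(root_dir)
    # Swap the double suffix .json.gz for .parquet, keeping the stem.
    return out_root / rel.parent / (rel.name.removesuffix(".json.gz") + ".parquet")

def should_skip(out: Path) -> bool:
    """Resume-safe skipping: an existing output file is left untouched."""
    return out.exists()
```

Because skipping is keyed only on the output file's existence, re-running the same command after an interruption converts just the missing files.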

Profile / tuning

--profile controls the global DuckDB memory budget shared across all in-process worker connections. The budget is derived as 80% of usable RAM × the profile's memory fraction, clamped to the profile's min/max range (see the table below).

Profile    Workers cap   Memory fraction      Memory range
auto       (none)        65% of usable RAM    4 – 32 GiB
balanced   (none)        65% of usable RAM    4 – 32 GiB
safe       max 2         15% of usable RAM    1 – 8 GiB
fast       (none)        80% of usable RAM    8 – 48 GiB

auto is the default and behaves identically to balanced.

Fallback when RAM cannot be detected: safe=2 GiB, balanced/auto=6 GiB, fast=12 GiB.

Use --max-memory-mb to override the profile's memory calculation entirely. Worker counts set via --workers or config are respected, except that the safe profile caps them at 2.

Large-file handling

By default (--split-size 0) large files are processed directly by in-process DuckDB, which streams and spills to disk as needed within the global memory budget.

Set --split-size <SIZE> (e.g. 256mb, 512mb) to pre-split gz files larger than that threshold into chunks before conversion. Each chunk produces a numbered parquet file (e.g. part_0000_001.parquet, part_0000_002.parquet). Use this only if you observe OOM despite a generous profile, or when running on a machine with very limited RAM.
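A minimal sketch of the two pieces described above: parsing a --split-size value into bytes and deriving the numbered chunk output names. The accepted unit spellings and the exact zero-padding widths are assumptions inferred from the examples (256mb, part_0000_001.parquet); the real CLI may accept other forms.

```python
import re

def parse_size(size: str) -> int:
    """Parse a --split-size value like '256mb' or '512mb' into bytes
    (unit spellings assumed; a sketch, not the tool's parser)."""
    m = re.fullmatch(r"(\d+)\s*(kb|mb|gb)", size.strip().lower())
    if not m:
        raise ValueError(f"unrecognised size: {size!r}")
    n, unit = int(m.group(1)), m.group(2)
    return n * {"kb": 1024, "mb": 1024**2, "gb": 1024**3}[unit]

def chunk_name(file_index: int, chunk_index: int) -> str:
    """Numbered parquet name for one chunk of a pre-split .gz,
    matching the part_0000_001.parquet pattern shown above."""
    return f"part_{file_index:04d}_{chunk_index:03d}.parquet"
```

A file is only pre-split when its compressed size exceeds parse_size(--split-size); below the threshold it follows the normal 1-to-1 path.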