# Command: convert

Convert OpenAlex snapshot `.json.gz` files into Parquet while preserving the snapshot's relative directory structure.
## Usage

    openalex-snapshot convert \
      --root-dir /data \
      --dataset works \
      --profile balanced \
      --workers 4
## Key behavior

- 1 input `.gz` maps to 1 output `.parquet` (unless `--split-size` is set; see below)
- Resume-safe output skipping: already-converted files are skipped
- Verification is a separate step via `verify_convert`
- Supports selected-file conversion via repeated `--input-file`
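The 1-to-1 mapping and resume-safe skipping can be sketched as follows. This is an illustrative guess at the behavior, not the tool's actual code; the function name and the suffix-swap logic are assumptions.

```python
from pathlib import Path

def plan_outputs(root_dir: Path, out_dir: Path, inputs: list[Path]) -> list[tuple[Path, Path]]:
    """Map each input .json.gz to a .parquet path that preserves the relative
    structure under root_dir, skipping outputs that already exist
    (resume-safe behavior). Hypothetical sketch, not the tool's real code."""
    plan = []
    for src in inputs:
        rel = src.relative_to(root_dir)
        # e.g. works/updated_date=2024-01-01/part_000.json.gz -> .../part_000.parquet
        dst = out_dir / rel.with_suffix("").with_suffix(".parquet")
        if not dst.exists():  # resume-safe: skip already-converted outputs
            plan.append((src, dst))
    return plan
```

Passing a subset of paths to a planner like this mirrors what repeated `--input-file` does: only the named files are considered.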
## Profile / tuning

`--profile` controls the global DuckDB memory budget shared across all in-process worker
connections (derived as 80% of usable RAM × the profile's memory fraction, clamped to the profile's min/max range).
| Profile | Workers cap | Memory fraction | Memory range |
|---|---|---|---|
| `auto` | (none) | 65% of usable RAM | 4 – 32 GiB |
| `balanced` | (none) | 65% of usable RAM | 4 – 32 GiB |
| `safe` | max 2 | 15% of usable RAM | 1 – 8 GiB |
| `fast` | (none) | 80% of usable RAM | 8 – 48 GiB |
`auto` is the default and behaves identically to `balanced`.

Fallback budgets when RAM cannot be detected: `safe` = 2 GiB, `balanced`/`auto` = 6 GiB, `fast` = 12 GiB.

Use `--max-memory-mb` to override the profile memory calculation entirely.

Worker counts set via `--workers` or config are respected unless `safe` clamps them to its cap of 2.
## Large-file handling

By default (`--split-size 0`) large files are processed directly by in-process DuckDB, which
streams and spills to disk as needed within the global memory budget.

Set `--split-size <SIZE>` (e.g. `256mb`, `512mb`) to pre-split `.gz` files larger than that
threshold into chunks before conversion. Each chunk produces a numbered parquet file
(e.g. `part_0000_001.parquet`, `part_0000_002.parquet`). Use this only if you observe OOM
despite a generous profile, or when running on a machine with very limited RAM.
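The threshold values and chunk names can be illustrated with a small parser and name generator. Both the `kb`/`mb`/`gb` suffix handling and the `part_{file:04d}_{chunk:03d}` pattern are inferred from the examples above, not confirmed against the implementation.

```python
def parse_size(text: str) -> int:
    """Parse a human-readable size like '256mb' or '1gb' into bytes; '0'
    disables splitting. Suffix set is an assumption based on the examples."""
    units = {"kb": 1024, "mb": 1024**2, "gb": 1024**3}
    text = text.strip().lower()
    for suffix, mult in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * mult)
    return int(text)  # bare number: bytes

def chunk_name(file_index: int, chunk_index: int) -> str:
    """Numbered parquet name for one pre-split chunk, following the
    part_0000_001.parquet pattern shown above (inferred, not confirmed)."""
    return f"part_{file_index:04d}_{chunk_index:03d}.parquet"
```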