
Command: extract

Extract records by OpenAlex ID. IDs are read from a CSV file and looked up through per-dataset Parquet indexes.

The command auto-routes each ID to a dataset using:

  • entity prefixes (W, A, S, I, T, K, P, F, G, C)
  • taxonomy namespaces (countries/, continents/, languages/, domains/, fields/, subfields/, sdgs/, work-types/, source-types/, licenses/, institution-types/)
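The routing rule above can be sketched in a few lines of Python. This is only an illustration, not the command's actual implementation: the helper name classify_id is hypothetical, the sketch assumes entity IDs are a single prefix letter followed by digits, and it assumes IDs may optionally carry the https://openalex.org/ URL prefix. It reports only which prefix or namespace matched, since the mapping from prefix to dataset name is not stated here.

```python
import re

# Entity prefixes and taxonomy namespaces exactly as listed above.
ENTITY_PREFIXES = set("WASITKPFGC")
TAXONOMY_NAMESPACES = {
    "countries", "continents", "languages", "domains", "fields",
    "subfields", "sdgs", "work-types", "source-types", "licenses",
    "institution-types",
}

# Assumption: an entity ID is one letter followed by digits, e.g. W123.
_ENTITY_RE = re.compile(r"^([A-Za-z])\d+$")


def classify_id(raw_id):
    """Hypothetical helper: return ("entity", prefix),
    ("taxonomy", namespace), or None for an unknown ID."""
    ident = raw_id.strip()
    # Assumption: IDs may be given as full OpenAlex URLs.
    ident = re.sub(r"^https?://openalex\.org/", "", ident)

    if "/" in ident:
        namespace, _, value = ident.partition("/")
        if namespace in TAXONOMY_NAMESPACES and value:
            return ("taxonomy", namespace)
        return None

    match = _ENTITY_RE.match(ident)
    if match and match.group(1).upper() in ENTITY_PREFIXES:
        return ("entity", match.group(1).upper())
    return None
```

An ID for which this returns None corresponds to the unknown/unmapped case, which the command skips and reports.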

It writes one output file per resolved dataset as:

  • <output_base>_<dataset>.parquet

Unknown or unmapped IDs are skipped and reported.
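To predict where results will land, the naming pattern above can be reproduced as follows. This is a sketch under one assumption: that <output_base> is the --output value with a trailing .parquet suffix removed. The helper name dataset_output_path and the dataset names used in the usage note are illustrative, not taken from the tool.

```python
from pathlib import Path


def dataset_output_path(output, dataset):
    """Hypothetical helper: build <output_base>_<dataset>.parquet."""
    out = Path(output)
    # Assumption: <output_base> is --output without its .parquet suffix.
    base = out.with_suffix("") if out.suffix == ".parquet" else out
    return base.with_name(f"{base.name}_{dataset}.parquet")
```

Under that assumption, dataset_output_path("/Volumes/openalex/extract.parquet", "works") gives /Volumes/openalex/extract_works.parquet.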

Example

openalex-snapshot extract \
  --root-dir /Volumes/openalex \
  --ids /Volumes/openalex/ids.csv \
  --output /Volumes/openalex/extract.parquet \
  --profile balanced

Notes

  • Requires per-dataset index files (for example parquet/works_id_idx.parquet).
  • Build indexes first with openalex-snapshot index.
  • Use --dataset <name> to restrict extraction to one dataset.