This function creates a parquet index that maps each OpenAlex ID to its physical location (file and row number) in the parquet corpus. This enables fast random access to specific records without scanning entire partitions.

Usage

build_corpus_index(corpus_dir, memory_limit = NULL, workers = NULL)

Arguments

corpus_dir

Path to the parquet corpus directory.

memory_limit

DuckDB memory limit (e.g., "20GB"). Default is NULL.

workers

Number of parallel workers for Stage 1 indexing and DuckDB threads for Stage 2. Default is NULL (use all cores).

Value

Invisibly returns the path to the created index.

Details

The index file is created in the same directory as corpus_dir and must stay there for lookups to work. Because the index stores relative paths, it can be moved to any location as long as it is moved together with corpus_dir.

The function is memory-efficient and can handle 300M+ records by using a two-stage approach: Stage 1 indexes each parquet file individually (bounded memory per file), and Stage 2 combines the per-file results into a single parquet index file. This avoids loading the entire dataset at once. Stage 1 is parallelized using future.apply::future_lapply() and can resume if interrupted. On macOS, a .metadata_never_index file is created in the temporary directory to prevent Spotlight from indexing the parquet files while the index is being built.
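
The two-stage, resumable pattern described above can be illustrated with a minimal base R sketch. This is not the package's implementation: CSV files stand in for parquet files, RDS files stand in for the per-file partial indexes, and the names index_one_file(), toy_corpus and toy_index_parts are hypothetical.

```r
# Sketch only: CSV files stand in for parquet files, and base R stands in
# for DuckDB. All names below are hypothetical.
corpus_dir <- file.path(tempdir(), "toy_corpus")
tmp_dir <- file.path(tempdir(), "toy_index_parts")
dir.create(file.path(corpus_dir, "part=1"), recursive = TRUE, showWarnings = FALSE)
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(id = c("W1", "W2")),
          file.path(corpus_dir, "part=1", "a.csv"), row.names = FALSE)
write.csv(data.frame(id = "W3"),
          file.path(corpus_dir, "b.csv"), row.names = FALSE)

# Stage 1: index one file at a time, so memory is bounded by the largest file.
index_one_file <- function(file, corpus_dir, tmp_dir) {
  out <- file.path(tmp_dir, paste0(gsub("[/\\\\]", "_", file), ".rds"))
  if (file.exists(out)) return(out)  # resume: skip files that are already done
  ids <- read.csv(file.path(corpus_dir, file), stringsAsFactors = FALSE)$id
  part <- data.frame(
    id = ids,
    parquet_file = file,                    # path relative to corpus_dir
    file_row_number = seq_along(ids) - 1L,  # 0-indexed row within the file
    stringsAsFactors = FALSE
  )
  saveRDS(part, out)
  out
}

files <- list.files(corpus_dir, pattern = "\\.csv$", recursive = TRUE)
parts <- vapply(files, index_one_file, character(1),
                corpus_dir = corpus_dir, tmp_dir = tmp_dir)

# Stage 2: combine the per-file results into a single index.
index <- do.call(rbind, lapply(parts, readRDS))
rownames(index) <- NULL
index
```

Because each file is processed independently, the Stage 1 loop could be handed to future.apply::future_lapply() for parallel execution, and an interrupted run only redoes the files whose partial index is missing.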

The index contains the following columns:

id

The OpenAlex ID

id_block

Block number computed as floor(numeric_id / 10000)

parquet_file

Relative path to the parquet file in the corpus

file_row_number

Row number within the file (0-indexed)
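
As a small illustration of the id_block column, the sketch below computes floor(numeric_id / 10000) in base R. How the package extracts the numeric part of an ID is not documented here, so the helper numeric_id() is a hypothetical stand-in that assumes IDs of the form "W2741809807", optionally prefixed with "https://openalex.org/".

```r
# Hypothetical helper: strip everything up to and including the last letter,
# leaving only the trailing digits of the OpenAlex ID.
# as.numeric() is used because these IDs can exceed R's 32-bit integer range.
numeric_id <- function(id) {
  as.numeric(sub("^.*[A-Za-z]", "", id))
}

# Block number as described in the column list above.
id_block <- function(id) {
  floor(numeric_id(id) / 10000)
}

ids <- c("W2741809807", "https://openalex.org/W123456")
id_block(ids)  # blocks 274180 and 12
```

IDs that share a block fall into the same range of 10000 consecutive numeric IDs.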

Examples

if (FALSE) { # \dontrun{
# Build partitioned index for OpenAlex IDs (fast O(1) lookup)
build_corpus_index(
  corpus_dir = "/Volumes/openalex/parquet/works",
  memory_limit = "20GB"
)
} # }