Introduction
OpenAlex provides complete data snapshots that can be downloaded and processed locally. This approach offers several advantages over API queries:

- No rate limits: Process millions of records without API restrictions
- Reproducibility: Work with a fixed point-in-time dataset
- Speed: Local queries are significantly faster than API calls, especially for larger result sets
- Offline access: No internet connection required after download
This vignette guides you through:
- Setting up your environment
- Downloading the OpenAlex snapshot
- Converting to parquet format
- Building indexes for fast lookups
- Querying your local corpus
Prerequisites
Hardware Requirements
Working with the full OpenAlex snapshot requires significant resources:
| Resource | Minimum | Recommended |
|---|---|---|
| Disk space | 2.5 TB | 3+ TB |
| RAM | 16 GB | 32+ GB |
| CPU | 2 cores | 4+ cores |
The snapshot download is approximately 1.2 TB. The parquet conversion adds another 100-200 GB depending on compression settings.
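Before committing to a download location, it is worth confirming that the target volume actually has this much headroom. A quick check from the intended snapshot directory:

```shell
# Show free space on the volume holding the current directory;
# compare the "Avail" column against the requirements above
df -h .
```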
Software Requirements
- R (>= 4.1.2)
- AWS CLI: For downloading from S3
- GNU Make: For running the Makefile targets
- openalexPro: This package
Installing AWS CLI
# macOS (with Homebrew)
brew install awscli
# Ubuntu/Debian
sudo apt install awscli
# Windows (with winget)
winget install Amazon.AWSCLI

Verify installation:

aws --version

No AWS account is required - the OpenAlex bucket allows anonymous access.
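Because the bucket is public, you can list it anonymously before committing to the full download. The bucket name `s3://openalex` below is the one used by the official OpenAlex snapshot; adjust if your Makefile is configured differently:

```shell
# List the top level of the public OpenAlex bucket without credentials
aws s3 ls --no-sign-request "s3://openalex/"
```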
Quick Start
The easiest way to get started is using prepare_snapshot():
library(openalexPro)
# Prepare a directory for your snapshot
prepare_snapshot("/path/to/openalex-data")
This creates:

- Makefile: Contains all commands for snapshot management
- snapshot_guide.html: This documentation
Then navigate to the directory and use Make:
cd /path/to/openalex-data
# See available commands and current variable values
make help
# Download snapshot (WARNING: ~1.2 TB, takes hours)
make snapshot
# Convert to parquet format
make parquet
# Build search indexes
make parquet_index

Step-by-Step Guide
1. Prepare Your Directory
Choose a location with sufficient disk space:
library(openalexPro)
# Create and prepare the snapshot directory
snapshot_dir <- "/Volumes/external-drive/openalex"
prepare_snapshot(snapshot_dir)

2. Download the Snapshot
The OpenAlex snapshot is hosted on Amazon S3 and updated regularly. The download uses aws s3 sync which:
- Downloads only new or changed files
- Resumes interrupted downloads
- Removes files deleted from the source
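A sketch of the underlying command is below. The exact invocation lives in the generated Makefile; the flags shown here are an assumption based on the sync behaviour described above:

```shell
# Mirror the snapshot into a local directory:
#   --no-sign-request  anonymous access (no AWS account needed)
#   --delete           remove local files that vanished upstream
aws s3 sync "s3://openalex" "openalex-snapshot" --no-sign-request --delete
```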
cd /Volumes/external-drive/openalex
make snapshot

Expected time: 2-12 hours depending on your internet connection.
Tip: You can monitor progress in another terminal:
watch -n 60 'du -sh openalex-snapshot/*'

Snapshot Structure
After download, you’ll have:
openalex-snapshot/
├── RELEASE_NOTES.txt
├── data/
│ ├── works/
│ ├── authors/
│ ├── institutions/
│ ├── sources/
│ ├── publishers/
│ ├── funders/
│ ├── topics/
│ ├── fields/
│ ├── subfields/
│ └── domains/
└── ...
Each entity type contains gzipped JSON files partitioned by update date.
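Each line in those .gz files is a single JSON record, so you can peek at one entity straight from the compressed file without unpacking anything. The `updated_date=...` partition name below is illustrative; use a path that exists in your download:

```shell
# Print the first JSON record from one works partition
gunzip -c openalex-snapshot/data/works/updated_date=2024-01-01/part_000.gz | head -n 1
```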
3. Convert to Parquet Format
The JSON snapshot is not efficient for analytical queries. Converting to parquet format provides:
- Columnar storage: Read only the columns you need
- Compression: Smaller files on disk
- Fast filtering: Predicate pushdown for efficient queries
- Type safety: Proper data types for each field
make parquet MEMORY_LIMIT=20GB WORKERS=4 SAMPLE_SIZE=10000

This runs snapshot_to_parquet() which:

- Deletes the existing parquet directory first (depends on clean_parquet)
- Infers a unified schema by sampling up to SAMPLE_SIZE .gz files from each entity type
- Converts each .gz file individually to a .parquet file using DuckDB
- Supports parallel processing via WORKERS (number of parallel future workers)

NB: Set MEMORY_LIMIT (per-worker DuckDB limit), WORKERS (number of parallel workers), and SAMPLE_SIZE (files to sample for schema inference) to values that work for your system. Since make parquet unconditionally deletes the existing parquet directory, use snapshot_to_parquet() directly in R if you want resume behaviour.
Expected time: 2-8 hours depending on CPU and disk speed.
Parquet Structure
After conversion, each entity directory contains one .parquet file per input .gz file:
parquet/
├── works/
│ ├── part_000.parquet
│ ├── part_001.parquet
│ └── ...
├── authors/
├── institutions/
└── ...
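Since the conversion writes exactly one .parquet file per input .gz file, a quick way to sanity-check a finished run is to compare file counts for each entity type. The paths below assume the default directory names:

```shell
# Compare input and output file counts for one entity type;
# the two numbers should match after a complete conversion
find openalex-snapshot/data/works -name '*.gz' | wc -l
find parquet/works -name '*.parquet' | wc -l
```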
4. Build Search Indexes
For fast ID-based lookups, build indexes:
make parquet_index WORKERS=8

This creates a single index parquet file for each entity type, enabling fast lookups by OpenAlex ID:
parquet/
├── works/
│   ├── part_000.parquet
│   └── ...
├── works_id_idx.parquet
└── ...
The index maps each ID to its physical location (file and row number) in the corpus, so lookup_by_id() can retrieve specific records without scanning the entire corpus.
NB: Indexing is less memory-intensive than parquet conversion and benefits from a higher WORKERS setting. To rebuild indexes without touching the parquet files, use make clean_index followed by make parquet_index.
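If you have the DuckDB command-line client installed, you can inspect an index file directly. The comment's guess at the columns (an ID plus file/row location) is an assumption about the index layout; check your own output:

```shell
# Peek at the first few index rows (expect an ID column plus
# file/row location columns; exact names may differ)
duckdb -c "SELECT * FROM 'parquet/works_id_idx.parquet' LIMIT 5"
```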
Using Your Local Corpus
Looking Up Records by ID
library(openalexPro)
# Look up specific works by OpenAlex ID (returns data frame)
works <- lookup_by_id(
index_file = "/Volumes/external-drive/openalex/parquet/works_id_idx.parquet",
ids = c("W2741809807", "W2100837269")
)
# For millions of IDs, write directly to parquet instead of loading into memory
lookup_by_id(
index_file = "/Volumes/external-drive/openalex/parquet/works_id_idx.parquet",
ids = large_id_vector,
output = "filtered_works",
workers = 3
)

Direct DuckDB Queries
For complex analytical queries, use DuckDB directly:
library(duckdb)
library(DBI)
con <- dbConnect(duckdb())
# Register parquet files as a view
dbExecute(
con,
"
CREATE VIEW works AS
SELECT * FROM parquet_scan('/Volumes/external-drive/openalex/parquet/works/*.parquet')
"
)
# Run analytical queries
result <- dbGetQuery(
con,
"
SELECT publication_year, COUNT(*) as n_works
FROM works
WHERE publication_year >= 2020
GROUP BY publication_year
ORDER BY publication_year
"
)
dbDisconnect(con)

Managing Updates
OpenAlex releases new snapshots regularly. The Makefile supports update workflows via dedicated timestamp targets.
Checking for Updates
make snapshot_info

This shows the current state of the S3 bucket including the total size and file count.
make help

This shows the current values of all configuration variables, including RELEASE_DATE read from RELEASE_NOTES.txt.
Update Workflow
# 1. Archive the current snapshot with its release date
# (renames openalex-snapshot → openalex-snapshot-YYYY-MM-DD)
make snapshot_timestamp
# 2. Download the new snapshot
make snapshot
# 3. Archive the current parquet directory with the old release date
make parquet_timestamp
# 4. Re-convert to parquet (deletes parquet dir first, then converts)
make parquet SAMPLE_SIZE=10000
# 5. Rebuild indexes
make parquet_index

Note: snapshot_timestamp and parquet_timestamp read the release date from RELEASE_NOTES.txt in the snapshot directory. Run snapshot_timestamp before downloading the new snapshot, and parquet_timestamp before re-converting, so the old release date is still available.
Troubleshooting
Download Issues
Interrupted download: Simply run make snapshot again. The sync will resume where it left off.
Slow download: OpenAlex uses CloudFront CDN. Speed depends on your location and internet connection. Consider running overnight.
Disk space errors: Ensure you have at least 2.5 TB free before starting; the snapshot alone is roughly 1.2 TB, and the parquet conversion needs additional headroom.
Conversion Issues
Out of memory: Reduce memory usage by overriding variables on the command line:
make parquet MEMORY_LIMIT=4GB WORKERS=1

DuckDB temp file errors (IO Error: Could not read enough bytes from file ".tmp/..."): DuckDB spills temporary data to disk. By default it writes to .tmp/ in the current directory. If that location runs out of space or is on a slow filesystem, point it elsewhere:
make parquet TEMP_DIR=/tmp

DuckDB errors: Ensure you have the latest version of the duckdb R package.
Makefile Reference
| Target | Description |
|---|---|
| help | Show available targets and current variable values |
| all | Clean, download snapshot, and convert to parquet |
| snapshot_info | Display S3 bucket size and file count |
| snapshot_timestamp | Rename existing snapshot directory with its release date |
| snapshot | Download/sync snapshot from S3 |
| parquet_timestamp | Rename existing parquet directory with its release date |
| parquet | Delete parquet dir then convert snapshot to parquet |
| parquet_index | Build ID indexes for all datasets |
| clean_index | Remove index files (*_idx.parquet) from parquet directory |
| clean_parquet | Remove parquet directory (includes index files) |
| clean_snapshot | Remove snapshot directory |
| clean | Remove both snapshot and parquet directories |
Customizing the Makefile
Variables can be overridden on the command line or by editing the defaults at the top of the Makefile:
SNAPSHOTDIR=./openalex-snapshot # Where to download JSON
PARQUETDIR=./parquet # Where to write parquet
MEMORY_LIMIT=15GB # DuckDB memory limit per worker
WORKERS=3 # Number of parallel workers
SAMPLE_SIZE=100 # Files sampled for schema inference
TEMP_DIR=/tmp # DuckDB temporary directory (spill-to-disk)

Override on the command line without editing the file:
make parquet MEMORY_LIMIT=20GB WORKERS=4 SAMPLE_SIZE=10000 TEMP_DIR=/tmp