
The package is provided as is and the authors do not take any responsibility for any damages or losses arising from its use. The software is provided without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement. Use at your own risk.

The authors are not affiliated with OpenAlex in any way.

LLM Usage Disclosure

Code and documentation in this project have been generated with the assistance of the Codex LLM tools as well as Claude Code in Positron. All content and code are based on conceptualisation by the authors and have been thoroughly reviewed and edited by humans afterwards.

Introduction

This package builds on the package openalexR but provides a more scalable approach to retrieving works from OpenAlex. In contrast to openalexR, which does all processing and conversions in memory, openalexPro uses an on-disc approach in which the data is processed per page, i.e. per batch of records returned by each API call. Doing all processing in memory has advantages for smaller numbers of records, but memory constraints limit how many works can be retrieved. Even before that limit is reached, the frequent reallocation of memory slows processing down.

Quickstart

Installation

The latest “stable” version is available via r-universe:

install.packages('openalexPro', repos = c('https://rkrug.r-universe.dev', 'https://cloud.r-project.org'))

The “development” version can be installed from GitHub. This is generally not recommended unless you need bleeding-edge functionality and can deal with changing function definitions, or want to test new features:

remotes::install_github("rkrug/openalexPro", ref = "dev")

openalexPro can run without an API key, but OpenAlex applies much stricter limits in unauthenticated mode. For real workloads, set openalexPro.apikey in .Renviron or your current session.

# Recommended: set once in .Renviron
# openalexPro.apikey=your-api-key

# Or for the current session only:
Sys.setenv(openalexPro.apikey = "your-api-key")

# Check current key status
pro_validate_credentials()

# Inspect budget/usage (requires a valid key)
pro_rate_limit_status()

Live API Contract Tests (Optional)

The default test suite is cassette-based and deterministic.
To run online contract checks against the live OpenAlex API:

Sys.setenv(OPENALEXPRO_LIVE_TESTS = "true")
Sys.setenv(openalexPro.apikey = "your-api-key")
devtools::test(filter = "900-live")

Simplest Approach: pro_fetch()

For most use cases, pro_fetch() handles everything in one call:

library(openalexPro)

# Build query
url <- pro_query(
  entity = "works",
  search = "climate change",
  from_publication_date = "2023-01-01",
  type = "article",
  select = c("ids", "title", "publication_year", "cited_by_count")
)

# Download, transform, and convert to Parquet in one step
pro_fetch(
  query_url = url,
  project_folder = "my_climate_data",
  progress = TRUE
)

Your data is now ready in my_climate_data/parquet/.

Advanced Workflow (Individual Functions)

For more control over the pipeline, use the individual functions:

1. Define query (openalexPro::pro_query())

The query is defined using the function openalexPro::pro_query(). It follows the logic and arguments of openalexR::oa_query(). Going beyond openalexR::oa_query(), the filter names as well as the fields selected for retrieval are verified before the query is sent to OpenAlex.

The supported filter names and select fields can be listed via the package’s helper functions; see the reference documentation for the full lists.

The following defines a basic query:

query <- pro_query(
  entity = "works",
  search = "biodiversity AND conservation AND IPBES"
)

This returns a URL, which can be opened directly in a browser.

If, however, many identifiers are supplied (for example 100 DOIs), the query is split into chunks of at most chunk_limit identifiers each (default 50). In this case the function returns a list() with each element named Chunk_x and containing that chunk’s URL as a character vector.
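The chunking behaviour can be sketched in base R (a hypothetical illustration; the actual splitting happens inside pro_query(), and the DOIs below are mock values):

```r
# Split a vector of DOIs into chunks of at most `chunk_limit` entries,
# mirroring how pro_query() handles long identifier lists.
dois <- sprintf("10.1000/example.%03d", 1:100)  # 100 mock DOIs
chunk_limit <- 50

chunks <- split(dois, ceiling(seq_along(dois) / chunk_limit))
names(chunks) <- paste0("Chunk_", seq_along(chunks))

length(chunks)   # 2 chunks
lengths(chunks)  # 50 DOIs in each
```

Each chunk would then be turned into one OpenAlex filter URL.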

2. Retrieving records (openalexPro::pro_request())

openalexPro::pro_request(
  query_url = query,
  output = "json",
  verbose = TRUE
)

This retrieves the records and saves them into the folder specified by output. One important difference is whether the query is a single URL or a list: if it is a list, the future and future.apply packages are used to process the URLs in the list in parallel.
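Because list queries are parallelised via future, the parallel backend can be chosen with future::plan() before calling pro_request() (a configuration sketch, assuming the future package is installed; the backend and worker count are examples, not defaults of the package):

```r
# Choose a parallel backend before retrieving a chunked (list) query.
library(future)
plan(multisession, workers = 4)  # run up to 4 chunk downloads in parallel

openalexPro::pro_request(
  query_url = query,  # a list of chunk URLs from pro_query()
  output = "json",
  verbose = TRUE
)
```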

3. Processing json files (openalexPro::pro_request_jsonl())

This step prepares the downloaded JSON files for the final ingestion into a Parquet database:

openalex_jsonl_folder <- openalexPro::pro_request_jsonl(
  input_json = "json_files",
  output = "jsonl_files",
  verbose = TRUE
)

The resulting .jsonl files can be found in the folder specified by output.

4. Convert to parquet database (openalexPro::pro_request_jsonl_parquet())

Here the files are converted into a page-partitioned Parquet dataset, saved as individual Parquet files in the folder given by the output argument.

parquet <- "./parquet"
openalexPro::pro_request_jsonl_parquet(
  input_jsonl = "jsonl_files",
  output = parquet,
  verbose = TRUE
)

Convenience Function to Read the Retrieved Data (openalexPro::read_corpus())

The read_corpus() function reads the corpus either as an arrow Dataset object (return_data = FALSE; essentially metadata pointing to the on-disc dataset) or as a data.frame (return_data = TRUE), in which case the whole dataset is loaded into memory.

Design Principles

The retrieval of works and the initial processing / preparation can be split into these three steps:

In the first step (openalexPro::pro_request()), each page from the API call is saved as an individual JSON file, exactly as returned by the API. The number of retrievable records is effectively limited only by the free space on the drive where the JSON files are saved. As the complete responses, including metadata, are saved, one could stop here and use custom-made code to process the responses further, e.g. ingest them into a database.

In the second step (openalexPro::pro_request_jsonl()), the JSON files are processed on a per-file basis using the jq command-line JSON processor. In this step the abstract text is reconstructed, a citation string for each work is generated, and optionally a page field is added. The result is written as newline-delimited JSON (.jsonl), suitable for further processing with arrow or DuckDB.
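The abstract reconstruction can be illustrated in base R: OpenAlex returns abstracts as an inverted index mapping each word to the positions at which it occurs, and reconstruction simply inverts that mapping (a simplified sketch with a mock index; the actual processing in this step is done by jq):

```r
# A (shortened) mock abstract_inverted_index as returned by OpenAlex:
# each word maps to the positions at which it occurs in the abstract.
inv_index <- list(
  Climate = 0L,
  change  = c(1L, 4L),
  drives  = 2L,
  rapid   = 3L
)

# Repeat each word once per position, then order the words by position.
words     <- rep(names(inv_index), lengths(inv_index))
positions <- unlist(inv_index, use.names = FALSE)
abstract  <- paste(words[order(positions)], collapse = " ")
abstract  # "Climate change drives rapid change"
```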

The third (and final) step (openalexPro::pro_request_jsonl_parquet()) converts the .jsonl files into a Parquet database partitioned by page, using the duckdb package. As this processing is also done per page, the conversion is not limited by memory.

This approach results in a stable pipeline which works for the retrieval of small as well as very large corpora. As the processing is done per page (each containing at most 200 works), scaling should be roughly linear (in one application, more than 4 million works, i.e. over 20,000 pages, were retrieved without problems).

One point to take into consideration when retrieving huge corpora is the rate limiting applied by OpenAlex (see the OpenAlex API documentation for details). Use pro_rate_limit_status() to inspect your current daily budget, usage, and remaining allowance at any time.

The final format used by this package to save the retrieved data is Parquet, which is space-efficient and allows on-disc processing, so there is no need to load the complete data into memory (see the Apache Parquet documentation and the R package arrow for a detailed description of the format). For on-disc processing in R, the arrow package interfaces directly with dplyr, so a lot of processing can be done before the actual data is pulled into memory (see the arrow chapter in Hadley Wickham’s R for Data Science (2e)).
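The on-disc workflow can be sketched with a small self-contained example (assuming the arrow and dplyr packages are installed; the mock columns and the temporary path are illustrative, not the package’s actual schema):

```r
library(arrow)
library(dplyr)

# Write a small mock corpus as a Parquet dataset on disc.
dir <- file.path(tempdir(), "mock_parquet")
mock <- data.frame(
  title            = c("A", "B", "C"),
  publication_year = c(2021L, 2023L, 2024L),
  cited_by_count   = c(10L, 3L, 7L)
)
write_dataset(mock, dir)

# Open it lazily (no data loaded yet), filter on disc, then collect
# only the matching rows into memory.
recent <- open_dataset(dir) |>
  filter(publication_year >= 2023) |>
  select(title, cited_by_count) |>
  collect()

nrow(recent)  # 2
```

The same pattern applies to the Dataset object returned by read_corpus() with return_data = FALSE.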

Snowball Searches

Snowball search functionality has moved to the separate openalexSnowball package, which depends on openalexPro for the underlying pipeline.