Changelog
Source: NEWS.md
openalexPro 0.6.1
Bug Fixes
- Manually added the id field to the opt_select_names() output, as it is missing from the list returned by OpenAlex.
Changes
- Normalized api_key handling across API-calling functions: pro_request(), pro_fetch(), pro_count(), and pro_download_content() now accept api_key = NULL or api_key = "". In that case, requests are sent without an API key (subject to OpenAlex’s unauthenticated limits).
- Added explicit api_key type validation in API-calling functions. Accepted inputs are now limited to NULL or a length-1 character string.
- Updated pro_rate_limit_status() to handle api_key = NULL safely (informational message + FALSE return), and aligned documentation.
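The accepted api_key inputs listed above can be illustrated with a small standalone check. This is only a sketch: check_api_key() is a hypothetical helper name, not the package's internal validator, and the error message wording is an assumption.

```r
# Hypothetical helper illustrating the documented rule; not the package's
# actual implementation.
check_api_key <- function(api_key) {
  # NULL means "no key": the request would be sent unauthenticated.
  if (is.null(api_key)) {
    return(NULL)
  }
  # Anything other than a single character string is rejected.
  if (!is.character(api_key) || length(api_key) != 1L) {
    stop("`api_key` must be NULL or a length-1 character string.", call. = FALSE)
  }
  # An empty string is treated like NULL (unauthenticated mode).
  if (!nzchar(api_key)) {
    return(NULL)
  }
  api_key
}

check_api_key(NULL)      # no key, unauthenticated
check_api_key("")        # treated the same as NULL
check_api_key("my-key")  # passed through unchanged
```

Collapsing both NULL and "" to a single "no key" value keeps the downstream request code to one branch for unauthenticated mode.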
Testing and Tooling
- Added opt-in live API contract tests (tests/testthat/test-900-live_api_contracts.R) gated by OPENALEXPRO_LIVE_TESTS=true and a non-dummy openalexPro.apikey.
- Added inst/scripts/record_cassettes.R and recording safeguards to prevent accidental re-recording with invalid credentials.
- Reduced warning noise in test runs by cleaning up deprecated-search warning handling and removing unused cassette hooks.
openalexPro 0.6.0
New Features
- Added pro_rate_limit_status() to query the OpenAlex rate-limit endpoint (GET /rate-limit). Returns the full rate-limit JSON invisibly (daily budget, used, remaining, prepaid balance, per-endpoint costs, reset time). Prints a human-readable summary via message() when verbose = TRUE (the default). Returns FALSE for a missing or invalid API key, and NULL on a network error, so callers can distinguish auth problems from transient failures.
- pro_validate_credentials() refactored to use pro_rate_limit_status() internally instead of making a separate pro_count() request. Behaviour and return value are unchanged.
- Added pro_download_content() to download full-text PDFs (format = "pdf") or TEI XML (format = "grobid-xml") from the OpenAlex content endpoint (content.openalex.org). Accepts a vector of work IDs, supports parallel downloads via workers, and returns a data frame with per-file status ("ok" / "not_found" / "error"). Note: content downloads cost $0.01 per file.
- Added search.exact and search.semantic parameters to pro_query(), matching the new OpenAlex search API:
  - search.exact: searches without stemming or stop-word removal; supports boolean operators, quoted phrases, proximity (~N), and wildcards.
  - search.semantic: AI embedding-based search that matches by conceptual meaning rather than keywords (max 50 results, max 1 req/sec).
  - search: now documented to support the full boolean/phrase/wildcard syntax in addition to its existing stemmed matching.
- Exported infer_json_schema() for direct use. Infers a unified DuckDB columns clause from a set of JSON/NDJSON files via per-file DESCRIBE queries with type-widening and optional two-level disk caching (schema_cache_dir).
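The idea of merging per-file schemas with type-widening can be sketched in base R. Everything here is illustrative: widening_order, widen_type(), and unify_schemas() are hypothetical names, the widening order is an assumption rather than the rule infer_json_schema() actually applies, and the per-file schemas are made-up values standing in for DESCRIBE output.

```r
# Assumed widening order: a later entry can represent every value of an
# earlier one. This is NOT taken from the package source.
widening_order <- c("BOOLEAN", "BIGINT", "DOUBLE", "VARCHAR")

# Pick the wider of two types; unknown types fall back to VARCHAR.
widen_type <- function(a, b) {
  ranks <- match(c(a, b), widening_order)
  if (anyNA(ranks)) {
    return("VARCHAR")
  }
  widening_order[max(ranks)]
}

# Merge named character vectors (column name -> type), one per file.
unify_schemas <- function(schemas) {
  unified <- character(0)
  for (schema in schemas) {
    for (column in names(schema)) {
      unified[[column]] <- if (column %in% names(unified)) {
        widen_type(unified[[column]], schema[[column]])
      } else {
        schema[[column]]
      }
    }
  }
  unified
}

# Made-up per-file schemas.
schemas <- list(
  c(id = "VARCHAR", cited_by_count = "BIGINT"),
  c(id = "VARCHAR", cited_by_count = "DOUBLE", is_oa = "BOOLEAN")
)
unified <- unify_schemas(schemas)

# Render as a DuckDB-style columns struct literal.
columns_clause <- paste0(
  "{", paste0(names(unified), ": '", unified, "'", collapse = ", "), "}"
)
columns_clause
```

Processing one schema at a time keeps memory use flat regardless of how many files are sampled, which is the property the per-file approach is after.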
Internal Changes
- pro_rate_limit_status() and pro_download_content() now route their HTTP requests through the internal api_call() helper, unifying retry logic and error handling across all real API call sites. suppressMessages() is used to suppress api_call()’s internal logging so each function emits its own user-facing messages. pro_download_content() now also sends a User-Agent header (previously omitted).
Deprecations
- Filter arguments with a .search suffix (e.g. title_and_abstract.search = "...") are deprecated by the OpenAlex API. They still work but now emit a warning. Use the search parameter of pro_query() instead: pro_query(entity = "works", search = "your terms"). See https://developers.openalex.org/guides/searching for details.
Bug Fixes
- Fixed Windows path-normalization failures in snapshot_to_parquet(), build_corpus_index(), lookup_by_id(), and pro_request_jsonl_parquet(). On Windows, normalizePath() can return 8.3 short names (e.g. RUNNER~1) for tempdir()-derived paths while list.files() and DuckDB resolve to long names (runneradmin). Resume detection in snapshot_to_parquet() used %in% on paths with mixed separators (\ vs /), causing already-converted files to be reconverted. build_corpus_index() embedded snapshot_dir (with \) inside a DuckDB regexp_replace pattern, which never matched, so the full absolute path was stored in the index and later doubled by lookup_by_id(). pro_request_jsonl_parquet() used normalizePath string comparison to detect subdirectories, which always failed, placing every output file in a spurious query=<dirname> subdirectory.

  Fixes: (1) normalize separators to / with gsub("\\\\", "/", ...) on both sides of %in% comparisons; (2) compute relative paths in R using path-depth counting (strsplit(path, "/") then indexed extraction) rather than string-matching absolute paths, which is immune to 8.3 vs long-name differences; (3) pass the relative path as a SQL literal in build_corpus_index() instead of computing it inside DuckDB with a regex.
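Fixes (1) and (2) above can be sketched in base R. The helper names to_forward_slashes() and relative_by_depth() are hypothetical and the paths are invented for illustration; the package's internal code may differ.

```r
# Fix (1): normalize Windows separators so both sides of a comparison use "/".
to_forward_slashes <- function(path) {
  gsub("\\\\", "/", path)
}

# Fix (2): keep the components below the root by counting path components,
# never by matching the root's text. The root may therefore be spelled with
# an 8.3 short name while the file path uses the long name.
relative_by_depth <- function(path, root) {
  path_parts <- strsplit(to_forward_slashes(path), "/", fixed = TRUE)[[1]]
  root_parts <- strsplit(to_forward_slashes(root), "/", fixed = TRUE)[[1]]
  paste(path_parts[-seq_along(root_parts)], collapse = "/")
}

# Resume detection: compare after normalizing both sides.
done <- "C:\\data\\works\\part_000.parquet"
candidates <- c(
  "C:/data/works/part_000.parquet",
  "C:/data/works/part_001.parquet"
)
already_done <- to_forward_slashes(candidates) %in% to_forward_slashes(done)

# Relative path despite short-name vs long-name spelling of the same directory.
rel <- relative_by_depth(
  "C:\\Users\\runneradmin\\snap\\works\\part_000.gz",
  "C:/Users/RUNNER~1/snap"
)
```

Depth counting assumes the file really does live under the root; it trades a string check for that assumption, which holds when the paths come from listing the root directory itself.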
Changes
- Schema cache per-file CSVs renamed from %06d_<basename>.schema.csv to <update_date>_<part_name>.csv (e.g. 2024-01-15_part_001.csv), making each cache file directly traceable to its source .gz.
Breaking Changes
- Removed mailto parameter from all API functions (pro_request(), pro_fetch(), pro_count(), pro_validate_credentials()). OpenAlex no longer uses email addresses for polite-pool access.
- api_key handling was tightened in 0.6.0 for pro_request(), pro_fetch(), and pro_count(). Note: this was later relaxed again in development; current development allows api_key = NULL / "" and runs in unauthenticated mode.
- Simplified User-Agent string from openalexPro v[VERSION] (mailto:[EMAIL]) to openalexPro/[VERSION].
openalexPro 0.5.0
New Features
Snapshot Handling
- Added prepare_snapshot() function for setting up a directory with Makefile and documentation for managing OpenAlex snapshots.
- Added Makefile.snapshot in inst/ for automating snapshot download, conversion, and indexing. Includes targets for snapshot, parquet, parquet_index, and automatic renaming of existing data with release dates.
- Added snapshot_to_parquet() function for converting OpenAlex snapshot NDJSON files to Parquet format using DuckDB. Processes each .gz file individually with per-file resume support. Supports parallel processing via workers (using future_lapply()) and unified schema inference via sample_size.
- Added build_corpus_index() function for creating memory-efficient Parquet indexes for fast ID lookups. Handles 300M+ records by processing parquet files individually, with optional parallelization via workers and progress reporting via progressr. The index file is auto-named and placed alongside the corpus directory.
- Added lookup_by_id() function for fast record retrieval from a parquet corpus using pre-built indexes. Uses Arrow for index filtering with automatic ID normalization. Supports parallel reads via workers and streaming to parquet via output for millions of IDs without loading into memory.
- Added snapshot_filter_ids() function for filtering snapshot data by ID lists.
- Added id_block() helper function for computing ID block partitions.
Documentation
- Added snapshot.qmd vignette with a comprehensive guide on downloading, converting, and querying OpenAlex snapshots locally.
Changes
- Refactored snapshot_to_parquet() to process each .gz file individually instead of all at once. This reduces memory usage, enables per-file resume on interruption, and shows progress with ETA. The workers parameter now controls parallel future workers instead of DuckDB threads. Added sample_size parameter for schema inference.
- Extracted infer_json_schema() and convert_json_to_parquet() internal helpers, shared by both snapshot_to_parquet() and pro_request_jsonl_parquet().
- Refactored pro_request_jsonl_parquet() to per-file conversion with future_lapply() parallelization. Removes hive partitioning by page; subfolder structure is preserved directly. Added workers parameter. Removed progress parameter (replaced by progressr).
Bug Fixes
- Fixed vignette parse errors in pro_query.qmd (malformed code block closings).
- Fixed out-of-memory crash in snapshot_to_parquet() when sample_size exceeded the number of available files (e.g. sample_size = 10000 with 1981 works files). Schema inference now processes one file at a time instead of a single bulk DuckDB query.
- Fixed duplicate key "as" crash when converting the works dataset. abstract_inverted_index is now stored as VARCHAR (raw JSON string) rather than a STRUCT. DuckDB folds struct field names to lowercase, causing a collision between the valid JSON keys "as" and "As" in this field. Storing as VARCHAR avoids struct parsing entirely and preserves the data. Parse individual values with jsonlite::fromJSON() when needed.
- Fixed DuckDB temp file IO errors during snapshot_to_parquet() by exposing a TEMP_DIR variable in Makefile.snapshot (default /tmp).
Changes
- snapshot_to_parquet() schema inference now runs one DuckDB DESCRIBE per file instead of a single query across all sampled files. Results are cached in <parquet_ds>/.schema_cache/: per-file CSVs (<update_date>_<part_name>.csv) enable mid-run resume; a unified unified_schema.csv is loaded on subsequent runs to skip inference entirely. Delete unified_schema.csv to force re-inference.
Tests
- Added comprehensive tests for snapshot_to_parquet(), build_corpus_index(), and lookup_by_id().
- Added tests for schema caching, unified schema reuse, and works abstract_inverted_index VARCHAR round-trip.
openalexPro 0.4.1
- Standardised progress bar handling
- Changed default pages from 1,000 to 10,000
- Refactored pro_query and removed the multiple_ids argument using Claude, expanded tests, and added a vignette.
- Added creation of 00_completed in output directory of json, jsonl and parquet folders upon successful completion
- Changed API key and email handling. Removed oap_mail() and oap_apikey() and simplified handling of API key and email to only use the environment variables openalexPro.email and openalexPro.apikey
- Added unified schema inference to pro_request_jsonl_parquet() to prevent schema conflicts when reading combined Parquet datasets. New sample_size parameter controls schema inference sampling. This fixes “Unsupported cast from string to struct” errors when fields have different types across JSONL files (e.g., apc_paid being null in some files and a struct in others).
- Removed harmonize_parquet_schemata() as it is no longer needed with the new unified schema inference.
- Increased default number of pages to be read by request_json() from 1000 to 10000 to allow the initially planned 2,000,000 work download.
openalexPro 0.4.0
CI and coverage tweaks for CRAN readiness.
Split snowball functionality out into openalexSnowball.
openalexPro 0.3.1
- Added pro_fetch() with project_folder support for structured outputs.
- Added progress reporting and parallelization for pro_request_jsonl().
- Added sample_parquet_n() random sampling utilities with select support.
- Improved count_only output to return a data frame with an error column.
openalexPro 0.3.0
- Added count_only support for pro_request() and related helpers.
- Added DOI handling improvements and API call fixes.
openalexPro 0.2.0
- Introduced pro_query() as the package-native query builder with chunking.
- Added snowball search utilities and citation edge extraction workflow.
- Expanded conversion pipeline tests and VCR-based API fixtures.
- Added extract_doi() helpers and compatibility reporting artifacts.