Introduction
The package openalexPro2 is an extension of the package openalexR. The main difference is that openalexR does all processing in memory and returns the result as an R object, while openalexPro2 writes the results to disk. This makes it possible to process corpora that are too large to fit in memory.
Most of the processing is done using the duckdb package, which makes it possible to run complex queries on the corpora without loading the whole corpus into memory.
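To illustrate the idea of querying on disk, here is a minimal sketch using DBI and duckdb directly. It is not taken from openalexPro2 itself: the table `works`, its columns and the file name `works.parquet` are made up for the example. DuckDB scans the parquet file and returns only the aggregated result to R.

```r
library(DBI)

# Hypothetical mini corpus standing in for OpenAlex works (illustration only)
works <- data.frame(
  id = c("W1", "W2", "W3", "W4"),
  publication_year = c(2019L, 2020L, 2020L, 2021L),
  cited_by_count = c(10L, 3L, 7L, 0L)
)

# In-memory DuckDB instance, used here only as the query engine
con <- dbConnect(duckdb::duckdb())

# Write the demo data to a parquet file on disk
parquet_file <- normalizePath(
  file.path(tempdir(), "works.parquet"),
  winslash = "/", mustWork = FALSE
)
dbWriteTable(con, "works_tmp", works)
dbExecute(con, sprintf("COPY works_tmp TO '%s' (FORMAT PARQUET)", parquet_file))

# Query the parquet file directly; only the small summary comes back to R
per_year <- dbGetQuery(con, sprintf(
  "SELECT publication_year,
          COUNT(*) AS n_works,
          SUM(cited_by_count) AS citations
   FROM read_parquet('%s')
   GROUP BY publication_year
   ORDER BY publication_year",
  parquet_file
))

dbDisconnect(con, shutdown = TRUE)
print(per_year)
```

With a real corpus, the path in `read_parquet()` would point at the parquet files written by the package, and a glob pattern such as `'folder/*.parquet'` can be used to query many files at once.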
Workflow
The workflow when using openalexPro2 is essentially as follows:
- the function openalexPro2::pro_query() is used to build the query for the OpenAlex API
- the function openalexPro2::pro_request() is used to retrieve the results from the OpenAlex API and to store them in a folder as json files, in the format returned by OpenAlex
- the function openalexPro2::pro_request_jsonl() cleans and edits the json files and converts them into JSON Lines, aka NDJSON, a text format in which each line is a separate JSON object
- the function openalexPro2::pro_request_jsonl_parquet() is used to convert the json files to a parquet database
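The JSON Lines format mentioned in the steps above can be sketched with the jsonlite package. This is not how openalexPro2 implements the conversion; it only shows what the format looks like. The response page below is made up for the example (an object with a `meta` block and a `results` array), and the file names are hypothetical.

```r
library(jsonlite)

# Hypothetical API response page (illustration only): metadata plus records
page_json <- '{
  "meta": {"count": 2, "page": 1},
  "results": [
    {"id": "W1", "title": "First work", "publication_year": 2020},
    {"id": "W2", "title": "Second work", "publication_year": 2021}
  ]
}'
json_file <- file.path(tempdir(), "page_1.json")
writeLines(page_json, json_file)

# Read the page and keep only the records (simplified to a data frame)
page <- fromJSON(json_file)

# Write one JSON object per line (JSON Lines / NDJSON)
jsonl_file <- file.path(tempdir(), "page_1.jsonl")
con_out <- file(jsonl_file, open = "w")
stream_out(page$results, con_out, verbose = FALSE)
close(con_out)

# Each line is now a complete JSON object that can be parsed on its own
jsonl_lines <- readLines(jsonl_file)
first_record <- fromJSON(jsonl_lines[1])
print(jsonl_lines)
```

Because every line is a self-contained record, such files can be processed line by line or in chunks, which is what makes the format convenient for large corpora.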
TODO: Add plantuml graph
TODO: Add more info and examples