The function takes a directory of JSON files as written from a call to
pro_request(...) and is preparing the json files to be
processed further using DuckDB by converting them to jsonl files.
The subfolders in input_json are preserved in output, i.e.
results of a list of initial queries passed to pro_request() are maintained.
Usage
pro_request_jsonl(
input_json = NULL,
output = NULL,
add_columns = list(),
overwrite = FALSE,
verbose = TRUE,
delete_input = FALSE
)Arguments
- input_json
The directory of JSON files returned from
pro_request(..., json_dir = "FOLDER").- output
output directory for the jsonl files as created by calls to `jq_execute().
- add_columns
List of additional fields to be added to the output. They nave to be provided as a named list, e./g.
list(column_1 = "value_1", column_2 = 2). Only Scalar values are supported.- overwrite
Logical indicating whether to overwrite
output.- verbose
Logical indicating whether to show a verbose information. Defaults to
TRUE- delete_input
Determines if the
input_jsonshould be deleted afterwards. Defaults toFALSE.
Details
See jq_execute or the vignette("jq",
package = "openalexPro") for more information on the conversion of the
JSON files.
The folder/filename is converted to a value named page
As an example:
the subfolder in the
outputfolder is calledChunk_1the page othe json file represents is
2The resulting cvalus for
pagewill beChunk_1_2
The function uses DuckDB to read the JSON files and to create the Apache Parquet files. The function creates a DuckDB connection in memory and readsds the JSON files into DuckDB when needed. Then it creates a SQL query to convert the JSON files to Apache Parquet files and to copy the result to the specified directory.