Convert JSON files to Apache Parquet files
Source: R/pro_request_jsonl_parquet.R
The function takes a directory of JSONL files as written by a call to
pro_request_jsonl(...) and converts them to Apache Parquet files. Each
JSONL file is processed individually, so there is no limit on the number of records.
Usage
pro_request_jsonl_parquet(
  input_jsonl = NULL,
  output = NULL,
  overwrite = FALSE,
  verbose = TRUE,
  delete_input = FALSE
)
Arguments
- input_jsonl
The directory of JSON files returned from pro_request(..., json_dir = "FOLDER").
- output
Output directory for the Parquet dataset. Defaults to a temporary directory.
- overwrite
Logical indicating whether to overwrite output.
- verbose
Logical indicating whether to show verbose information. Defaults to TRUE.
- delete_input
Logical indicating whether input_jsonl should be deleted afterwards. Defaults to FALSE.
Details
The value page as created in pro_request_jsonl() is used for partitioning.
All JSONL files are combined into a single Apache Parquet dataset, but the
records from an individual file can be selected by filtering on page. As an example:
- the subfolder in the output folder is called Chunk_1
- the page the JSON file represents is 2
- the resulting value for page will be Chunk_1_2
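The naming rule in the example above can be sketched in base R. This is a minimal illustration only: make_page() is a hypothetical helper, not part of the package, and joining the two parts with an underscore is inferred from the Chunk_1_2 example rather than taken from the function's source.

```r
# Hypothetical helper (not part of the package): compose a page value
# from the subfolder name and the page number, joined by an underscore.
make_page <- function(subfolder, page_number) {
  paste(subfolder, page_number, sep = "_")
}

page <- make_page("Chunk_1", 2)

# Split a page value back into its parts at the last underscore.
subfolder   <- sub("_[^_]*$", "", page)
page_number <- as.integer(sub("^.*_", "", page))
```

Splitting at the last underscore matters because the subfolder name may itself contain underscores, as Chunk_1 does.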
The function uses DuckDB to read the JSON files and to write the Apache Parquet files. It creates an in-memory DuckDB connection and reads the JSON files into DuckDB as needed. It then builds a SQL query that converts the JSON files to Apache Parquet and copies the result to the specified directory.
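The general approach can be sketched with the DBI and duckdb packages. This is not the function's actual implementation: the file names (1.json, 2.json), the demo directories, the sql_str() helper and the page=<value> folder layout are all assumptions made for illustration. The sketch writes two small JSONL files, converts each one separately, and reads the result back filtered by page.

```r
library(DBI)

# Hypothetical demo layout: subfolder "Chunk_1" holding two one-record pages.
input_dir  <- file.path(tempdir(), "jsonl_demo")
output_dir <- file.path(tempdir(), "parquet_demo")
unlink(c(input_dir, output_dir), recursive = TRUE)
dir.create(file.path(input_dir, "Chunk_1"), recursive = TRUE)
writeLines('{"id": "W1", "title": "First"}',
           file.path(input_dir, "Chunk_1", "1.json"))
writeLines('{"id": "W2", "title": "Second"}',
           file.path(input_dir, "Chunk_1", "2.json"))

# Hypothetical helper: quote a value as a SQL string literal.
sql_str <- function(x) paste0("'", gsub("'", "''", x), "'")

# In-memory DuckDB connection.
con <- dbConnect(duckdb::duckdb())

# Convert each file on its own, so only one file is processed at a time.
files <- list.files(input_dir, recursive = TRUE, full.names = TRUE)
for (f in files) {
  # Page value: subfolder name plus page number, e.g. "Chunk_1_2".
  page <- paste(basename(dirname(f)),
                tools::file_path_sans_ext(basename(f)), sep = "_")
  # Assumed layout: one folder per page, named page=<value>.
  part_dir <- file.path(output_dir, paste0("page=", page))
  dir.create(part_dir, recursive = TRUE)
  dbExecute(con, sprintf(
    "COPY (SELECT * FROM read_json_auto(%s)) TO %s (FORMAT PARQUET)",
    sql_str(f), sql_str(file.path(part_dir, "data.parquet"))
  ))
}

# Read the whole dataset back and keep a single page.
res <- dbGetQuery(con, sprintf(
  "SELECT id, title, page
   FROM read_parquet(%s, hive_partitioning = true)
   WHERE page = 'Chunk_1_2'",
  sql_str(file.path(output_dir, "*", "*.parquet"))
))

dbDisconnect(con, shutdown = TRUE)
```

Because each file is converted by its own COPY statement, memory use depends on the largest single file rather than on the size of the whole dataset.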