Convert a directory of JSON files written by kagi_request() into an
Apache Parquet dataset. JSON files are processed one by one and written as
Hive-partitioned Parquet, partitioned by query.
Usage

kagi_request_parquet(
  input_json = NULL,
  output = NULL,
  add_columns = list(),
  overwrite = FALSE,
  append = FALSE,
  verbose = TRUE,
  delete_input = FALSE
)

Arguments
- input_json
  Directory containing JSON files from kagi_request().
- output
  Output directory for the parquet dataset. Defaults to a temporary directory.
- add_columns
  Additional columns to add to the output, provided as a named list, e.g.
  list(column_1 = "value_1", column_2 = 2). Only scalar values are supported.
- overwrite
  Logical indicating whether to overwrite output.
- append
  Logical indicating whether to append/update query partitions in an existing
  output directory without deleting untouched queries.
- verbose
  Logical indicating whether to print progress information. Defaults to TRUE.
- delete_input
  Logical indicating whether input_json should be deleted afterwards. Defaults
  to FALSE.
Details
The function uses DuckDB to read the JSON files and to create the
Apache Parquet files. It creates an in-memory DuckDB connection, reads each
JSON response, and writes endpoint-specific tabular data into the parquet
dataset. Files with data = null are skipped.
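The mechanism described above can be sketched with the DBI and duckdb R
packages. This is a minimal illustration under assumptions, not the
function's actual implementation; the file name, output directory, and the
query partition column are placeholders:

```r
library(DBI)

# In-memory DuckDB connection, as used by kagi_request_parquet()
con <- dbConnect(duckdb::duckdb())

# Read one JSON response; read_json_auto() infers the schema
# ('response_1.json' is a placeholder file name)
res <- dbGetQuery(con, "SELECT * FROM read_json_auto('response_1.json')")

# Skip files whose payload is empty (data = null in the JSON)
if (nrow(res) > 0 && !all(is.na(res$data))) {
  # Write Hive-partitioned parquet, one partition per query value
  dbExecute(con, "
    COPY (SELECT * FROM read_json_auto('response_1.json'))
    TO 'output_dir' (FORMAT PARQUET, PARTITION_BY (query))
  ")
}

dbDisconnect(con, shutdown = TRUE)
```

DuckDB's PARTITION_BY writes one subdirectory per distinct value of the
partition column (e.g. output_dir/query=.../), which is what makes the
dataset Hive-partitioned by query.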
Output parquet rows include an id column for traceability:

- Search: SEARCH_<hash> from normalized url when available.
- Enrich web: ENRICH_WEB_<hash> from normalized url when available.
- Enrich news: ENRICH_NEWS_<hash> from normalized url when available.
- Summarize: SUMMARIZE_<hash> from request metadata.
- FastGPT: FASTGPT_<hash> from request metadata.
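A hedged call sketch tying the arguments together; the directory paths and
the extra column are illustrative, not values the package prescribes:

```r
# Convert JSON responses collected by kagi_request() into parquet,
# tagging every row with a batch label via add_columns
kagi_request_parquet(
  input_json  = "responses/json",      # placeholder input directory
  output      = "responses/parquet",   # placeholder output directory
  add_columns = list(batch = "2024-06"),
  overwrite   = TRUE
)
```

With append = TRUE instead of overwrite = TRUE, existing query partitions in
the output directory would be updated while untouched queries are kept.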