User Corpus Workflow
Use this skill for corpus-building tasks aligned with vignettes/corpus-workflow.qmd.
Required Workflow Order
- Create `kagi_connection()`.
- Build one or more endpoint queries (typically `query_search()`).
- Run `kagi_fetch()` (or `kagi_request()` + `kagi_request_parquet()`).
- Run `download_content()`.
- Run `content_markdown()`.
- Run `markdown_abstract()`.
- Read with `read_corpus(abstracts = TRUE)` when needed.
Allowed Function Set
Selector Rules
- `endpoint = NULL` means process all supported endpoints.
- `query_name = NULL` means process all queries in the selected endpoint(s).
- Keep the file layout explicit: `<project>/<endpoint>/{json,parquet,content,markdown,abstract}`.
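The selector rules above can be sketched as calls to `download_content()` (the same `endpoint` / `query_name` pattern applies to `content_markdown()` and `markdown_abstract()`); the project and query names here are hypothetical.

```r
# Process every query under every supported endpoint
download_content(project_folder = "my_project", endpoint = NULL, query_name = NULL)

# Process all queries under a single endpoint
download_content(project_folder = "my_project", endpoint = "search", query_name = NULL)

# Process one named query under one endpoint
download_content(project_folder = "my_project", endpoint = "search", query_name = "bio_reports")
```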
Error Handling Rules
- Keep row-level failures as status/error outputs where supported.
- Use strict mode in CI-like runs; resilient mode for long batches.
- Do not invent fallback extraction behavior beyond package implementation.
References
Read and apply:

- references/workflow.md
- references/examples.md
Workflow
Corpus Workflow
- Build queries with endpoint constructors.
- Fetch to project folders (`kagi_fetch()`), or use `kagi_request()` + `kagi_request_parquet()` manually.
- Download source content (`download_content()`).
- Convert content to markdown (`content_markdown()`).
- Summarize markdown to abstract parquet (`markdown_abstract()`).
- Read datasets with optional abstract linking (`read_corpus(abstracts = TRUE)`).
Folder Contract
- `<project>/<endpoint>/json`
- `<project>/<endpoint>/parquet`
- `<project>/<endpoint>/content/query=<query>`
- `<project>/<endpoint>/markdown/query=<query>`
- `<project>/<endpoint>/abstract/query=<query>`
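A minimal base-R sketch of how the folder contract resolves to concrete paths; `"my_project"`, `"search"`, and `"bio_reports"` are hypothetical names, not package defaults.

```r
project  <- "my_project"
endpoint <- "search"
query    <- "bio_reports"

# Dataset-level folders (json, parquet) sit directly under the endpoint
dataset_dirs <- file.path(project, endpoint, c("json", "parquet"))

# Content, markdown, and abstract folders are partitioned per query
query_dirs <- file.path(project, endpoint,
                        c("content", "markdown", "abstract"),
                        paste0("query=", query))

all_dirs <- c(dataset_dirs, query_dirs)
print(all_dirs)
```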
Provider Guidance
- Prefer `summarize_with_openai()` for general text quality.
- Use `summarize_with_kagi()` when staying inside the Kagi API stack.
- Use conservative concurrency for OpenAI due to rate limits.
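A sketch of swapping summarizers via the `summarizer_fn` argument shown in the Examples section; `"my_project"` is a hypothetical project folder.

```r
# Stay inside the Kagi API stack
markdown_abstract(
  project_folder = "my_project",
  endpoint = "search",
  query_name = NULL,
  summarizer_fn = summarize_with_kagi
)

# Or use OpenAI, keeping concurrency conservative because of rate limits
markdown_abstract(
  project_folder = "my_project",
  endpoint = "search",
  query_name = NULL,
  summarizer_fn = summarize_with_openai,
  model = "gpt-4.1-mini",
  workers = 1
)
```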
Examples
Corpus Examples
```r
# Connect using an API key stored in the system keyring
conn <- kagi_connection(api_key = function() keyring::key_get("API_kagi"))

# Build named search queries
queries <- list(
  bio_reports = query_search("biodiversity annual report", expand = FALSE)[[1]],
  ecosystem_methods = query_search("ecosystem services valuation methods", expand = FALSE)[[1]]
)

# Fetch results into the project folder layout
kagi_fetch(
  connection = conn,
  query = queries,
  project_folder = "tests_complex",
  overwrite = TRUE
)

# Download source content for every query under the search endpoint
download_content(
  project_folder = "tests_complex",
  endpoint = "search",
  query_name = NULL,
  workers = 4
)

# Convert downloaded content to markdown
content_markdown(
  project_folder = "tests_complex",
  endpoint = "search",
  query_name = NULL,
  workers = 4
)

# Summarize markdown to abstract parquet; single worker for OpenAI rate limits
markdown_abstract(
  project_folder = "tests_complex",
  endpoint = "search",
  query_name = NULL,
  summarizer_fn = summarize_with_openai,
  model = "gpt-4.1-mini",
  workers = 1
)

# Read the parquet corpus with abstracts linked in
ds <- read_corpus(
  project_folder = "tests_complex",
  endpoint = "search",
  corpus = "parquet",
  abstracts = TRUE,
  silent = TRUE
)
```