Maintainer Corpus Pipeline
Use this skill for internals of the modular corpus pipeline.
Scope
Apply when changing any of:
- `download_content()`
- `content_markdown()`
- `markdown_abstract()`
- `summarize_with_openai()`
- `summarize_with_kagi()`
- `read_corpus(abstracts = TRUE)` linking behavior
Required Contracts
- Preserve folder layout by endpoint and query partition.
- Keep `id + query` as the abstract linking key.
- Keep abstract schema with lowercase `abstract`.
- Keep row-level status/error reporting for partial failures.
- Keep provider functions pluggable via function argument.
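As an illustration of the "pluggable via function argument" contract, here is a minimal sketch. It is written in Python only to show the shape of the idea; the names `Provider`, `fake_provider`, and `make_abstract` are hypothetical and are not part of the package.

```python
from typing import Callable

# Hypothetical provider signature: takes markdown text, returns an abstract.
Provider = Callable[[str], str]

def fake_provider(text: str) -> str:
    """Stand-in provider for illustration; returns the first sentence."""
    return text.split(".")[0].strip() + "."

def make_abstract(record: dict, provider: Provider) -> dict:
    """Build an abstract row; the provider is injected, not hard-coded."""
    return {
        "id": record["id"],
        "query": record["query"],                  # id + query stays the linking key
        "abstract": provider(record["markdown"]),  # lowercase field name
    }

row = make_abstract(
    {"id": "w1", "query": "q1", "markdown": "First sentence. Second sentence."},
    provider=fake_provider,
)
print(row)
```

Because the provider is only an argument, swapping one summarizer for another (or for a test double) should not require touching the row-building code.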
Retry and Concurrency Rules
- OpenAI provider should default to conservative request concurrency.
- Retry behavior must remain explicit and documented.
- Progress output should reflect file-level work, not only worker completion.
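The three rules above might look like the following sketch. It is Python for illustration only, and everything in it is an assumption: the names `call_with_retry` and `summarize_files`, the exponential backoff policy, and the default of two workers are not taken from the package.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); retry on any exception with exponential backoff.

    Attempts and delay are explicit arguments so the policy is visible
    at the call site (assumed policy, not the package's).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))

def summarize_files(files, summarize, max_workers=2, report=print):
    """Summarize files with a small worker pool, reporting once per file."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            f: pool.submit(call_with_retry, lambda f=f: summarize(f))
            for f in files
        }
        for i, (f, fut) in enumerate(futures.items(), start=1):
            results[f] = fut.result()
            report(f"[{i}/{len(files)}] {f}")  # file-level progress
    return results

# Demo: a call that fails twice, then succeeds; delays are recorded, not slept.
delays = []
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = call_with_retry(flaky, max_attempts=3, base_delay=0.5, sleep=delays.append)

messages = []
summaries = summarize_files(["a.md", "b.md"], str.upper, report=messages.append)
```

Injecting `sleep` and `report` keeps the retry schedule and the progress messages testable without real waiting or console output.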
Documentation Sync Rules
When behavior changes, update together:
- `README.md` pipeline section
- `vignettes/corpus-workflow.qmd`
- `PROJECT_DESIGN.md`
- this skill’s references
References
Read and apply:
- `references/contracts.md`
- `references/testing.md`
References
Contracts
Corpus Pipeline Contracts
File Layout
- `<project>/<endpoint>/parquet`
- `<project>/<endpoint>/content/query=<query>`
- `<project>/<endpoint>/markdown/query=<query>`
- `<project>/<endpoint>/abstract/query=<query>`
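A path helper that follows this layout could be sketched as below. This is an illustrative Python sketch; `stage_dir`, the `STAGES` tuple, and the example project, endpoint, and query names are hypothetical.

```python
from pathlib import PurePosixPath

# Stages that are partitioned by query, per the layout above.
STAGES = ("content", "markdown", "abstract")

def stage_dir(project, endpoint, stage, query=None):
    """Return the directory for one stage of one endpoint.

    `parquet` has no query partition; the other stages require one.
    """
    if stage == "parquet":
        return PurePosixPath(project, endpoint, "parquet")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if query is None:
        raise ValueError(f"stage {stage!r} requires a query partition")
    return PurePosixPath(project, endpoint, stage, f"query={query}")

print(stage_dir("proj", "works", "parquet"))
print(stage_dir("proj", "works", "abstract", "ai"))
```

Centralizing path construction in one helper is one way to keep every stage on the same endpoint and query partitioning.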
Data Contracts
- Join key: `id + query`
- Abstract field: `abstract` (lowercase)
- Multi-selector behavior:
  - `endpoint = NULL` expands across supported endpoints.
  - `query_name = NULL` expands across all queries.
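The data contracts above can be illustrated with a small sketch. It is Python for illustration only, with `None` standing in for `NULL`; the helpers `expand_selectors` and `link_abstracts`, the `available` mapping, and the left-join behavior for missing abstracts are assumptions, not the package's implementation.

```python
def expand_selectors(endpoint, query_name, available):
    """Expand None selectors into concrete (endpoint, query) pairs.

    `available` maps each supported endpoint to its known query names.
    """
    endpoints = list(available) if endpoint is None else [endpoint]
    pairs = []
    for ep in endpoints:
        queries = available[ep] if query_name is None else [query_name]
        pairs.extend((ep, q) for q in queries if q in available[ep])
    return pairs

def link_abstracts(records, abstracts):
    """Attach `abstract` to each record, keyed on (id, query).

    Records without a matching abstract get None (assumed left join).
    """
    by_key = {(a["id"], a["query"]): a["abstract"] for a in abstracts}
    return [{**r, "abstract": by_key.get((r["id"], r["query"]))} for r in records]

available = {"works": ["q1", "q2"], "authors": ["q1"]}
all_pairs = expand_selectors(None, None, available)

linked = link_abstracts(
    [{"id": "w1", "query": "q1"}, {"id": "w1", "query": "q2"}],
    [{"id": "w1", "query": "q1", "abstract": "A"}],
)
```

Keying on both fields matters because the same `id` can appear under several queries; joining on `id` alone would attach an abstract to the wrong partition.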
Failure Contracts
- Per-row failures should yield status/error outputs.
- Pipeline should avoid whole-run termination for single-record extraction/summarization failures unless strict mode is explicitly requested.
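One way to picture these failure contracts is the sketch below. It is Python for illustration; `process_rows`, the `strict` flag name, and the `status`/`error`/`result` column names are hypothetical stand-ins for whatever the package actually uses.

```python
def process_rows(rows, step, strict=False):
    """Apply `step` to each row, recording status and error per row.

    With strict=False a failing row is reported and the run continues;
    with strict=True the first failure is re-raised.
    """
    out = []
    for row in rows:
        try:
            out.append({**row, "result": step(row), "status": "ok", "error": None})
        except Exception as exc:
            if strict:
                raise
            out.append({**row, "result": None, "status": "error", "error": str(exc)})
    return out

def summarize(row):
    """Toy step: fails on empty text, otherwise returns an upper-cased copy."""
    if not row["text"]:
        raise ValueError("empty document")
    return row["text"].upper()

rows = [{"id": "w1", "text": "hello"}, {"id": "w2", "text": ""}]
report = process_rows(rows, summarize)
```

Every input row appears in the output, so a caller can filter on `status` afterwards instead of discovering that records were silently dropped.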
Testing
Corpus Pipeline Testing
- Validate selector expansion (`endpoint`/`query_name` NULL behavior).
- Validate file placement for `content`, `markdown`, and `abstract`.
- Validate `read_corpus(abstracts = TRUE)` lazy-link behavior by `id + query`.
- Validate schema expectations (`abstract` lowercase; no stale `Abstract`).
- Validate provider failures produce row-level status/error instead of silent drops.
- Validate progress messaging remains usable under parallel runs.