Stream a corpus dataset, embed in batches, and write Parquet files
Source: R/embed_corpus.R
Processes a Parquet dataset without loading it fully into memory. Reads Arrow
record batches, builds canonical text from title + abstract, calls the
configured embedding backend, and writes Parquet batch files.
Usage
embed_corpus(
  project_dir = NULL,
  backend = backend_config(),
  corpus_name = "corpus",
  batch_size = 5000,
  delete_existing = FALSE,
  text_preprocessor = clean_abstract_for_embedding,
  cleaner_args = list(),
  save_text = TRUE,
  label = corpus_name,
  dry_run = FALSE,
  verbose = TRUE
)

Arguments
- project_dir
  Project root directory. Must contain a Parquet dataset at
  project_dir/<corpus_name> with columns id, title, abstract.

- backend
  Backend configuration created with backend_config().

- corpus_name
  Folder name under project_dir containing the corpus Parquet dataset.
  Defaults to "corpus".

- batch_size
  Number of corpus rows per Arrow scan batch.

- delete_existing
  If TRUE, old embeddings for the target model are deleted before
  processing. If FALSE, unchanged rows are skipped using id + text_hash.

- text_preprocessor
  Function that prepares embedding text from a batch data frame and
  returns at least columns id, text, text_hash. Defaults to
  clean_abstract_for_embedding().

- cleaner_args
  Named list of additional arguments passed to text_preprocessor.

- save_text
  Logical; if TRUE (default), store the cleaned embedding text in output
  Parquet files as column text. If FALSE, only text_hash is stored.

- label
  Partition label written under
  project_dir/embeddings/model_id=<...>/label=<label>/batch=<n>/.
  Defaults to corpus_name.

- dry_run
  Logical; if TRUE, run preprocessing and unchanged-row filtering without
  requesting embeddings or writing output files. In this mode, a Parquet
  preview file is written to project_dir/<corpus_name>_dryrun.parquet.

- verbose
  Logical; print progress and summary messages.
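A minimal usage sketch, combining a dry run, a full run, and a custom preprocessor. The project directory, the `title_only` helper, and the use of `digest::digest()` for hashing are illustrative assumptions, not part of this package; only `embed_corpus()`, `backend_config()`, and the documented argument contract come from the page above.

```r
backend <- backend_config()

# Dry run: preprocess and filter unchanged rows, write the preview Parquet
# (project_dir/<corpus_name>_dryrun.parquet), but request no embeddings.
embed_corpus(
  project_dir = "~/projects/litreview",  # hypothetical project directory
  backend = backend,
  dry_run = TRUE
)

# Hypothetical custom preprocessor: must return at least id, text, text_hash.
title_only <- function(batch, ...) {
  text <- trimws(batch$title)
  data.frame(
    id = batch$id,
    text = text,
    text_hash = vapply(text, digest::digest, character(1), algo = "xxhash64")
  )
}

# Full run with the custom preprocessor, re-embedding everything for the
# target model and keeping the cleaned text in the output files.
embed_corpus(
  project_dir = "~/projects/litreview",
  backend = backend,
  text_preprocessor = title_only,
  delete_existing = TRUE,
  save_text = TRUE
)
```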