```mermaid
flowchart LR
  A[Large corpus] --> B[Deterministic preprocessing]
  B --> C[Backend-neutral embedding]
  C --> D[Reproducible storage]
  D --> E[Distance and classification]
  E --> F[Threshold calibration]
  F --> G[Operational decisions]
```
## Purpose
This vignette provides a full package-level overview of openalexVectorComp:
- design philosophy,
- core function groups,
- end-to-end workflow,
- storage layout,
- practical usage patterns,
- extension points.
It is intended as an orientation document to read before the more technical vignettes.
## Package Philosophy

The package follows five core principles:

1. **Pipeline-first orchestration**
   - Handle large corpora in batches with Arrow/Parquet.
2. **Backend-neutral embedding interface**
   - Use one config/dispatch API for HF, OpenAI, or TEI.
3. **Deterministic resume behavior**
   - Use `id` + `text_hash` to skip unchanged rows safely.
4. **Pluggable text preparation**
   - Let users inject custom cleaners while preserving contracts.
5. **Transparent scoring workflow**
   - Keep similarity, prototype distance, ridge scoring, and calibration separable and inspectable.
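The resume principle can be sketched in a few lines of base R. This is an illustration only, not the package's internal implementation: `rows_to_embed()` is a hypothetical helper, and plain data frames stand in for the Arrow datasets the pipeline actually uses.

```r
# Sketch of the deterministic skip rule: a row is re-embedded only if its
# (id, text_hash) pair is not already present in the stored embeddings.
rows_to_embed <- function(new_rows, stored_rows) {
  key_new    <- paste(new_rows$id, new_rows$text_hash, sep = "|")
  key_stored <- paste(stored_rows$id, stored_rows$text_hash, sep = "|")
  new_rows[!(key_new %in% key_stored), , drop = FALSE]
}

stored   <- data.frame(id = c("W1", "W2"), text_hash = c("aaa", "bbb"))
incoming <- data.frame(id = c("W1", "W2", "W3"),
                       text_hash = c("aaa", "ccc", "ddd"))  # W2's text changed
rows_to_embed(incoming, stored)$id
# W1 is skipped (unchanged); W2 (changed hash) and W3 (new id) are embedded
```

Because the key is the content hash rather than a timestamp, re-running the same pipeline on the same inputs is a no-op.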
## Function Groups
The package API is organized into these groups.
1) Embedding backend abstraction
These functions isolate provider-specific details and expose a stable interface.
2) Corpus embedding orchestration
This is the main batch pipeline driver for production embedding runs.
3) Text preparation
Default implementation for title/abstract cleaning and canonical text creation.
4) Similarity and distance
These functions quantify embedding-space relevance from geometric perspectives.
Internal helper note:

- `distances()` exists as a non-exported helper for joining distance datasets.
5) Supervised scoring and calibration
These functions produce calibrated decision-ready scores.
6) Embedding-space visualization
These functions support diagnostics and qualitative checks.
```mermaid
flowchart TB
  A[Backend Abstraction] --> B[embed_corpus]
  C[Text Preparation] --> B
  B --> D[Stored embeddings]
  D --> E[Similarity and distance]
  D --> F[Ridge scoring]
  E --> G[Threshold calibration]
  F --> G
  D --> H[PCA/UMAP diagnostics]
```
## Canonical Workflow

Typical workflow for one project:

1. Configure the backend (`backend_config()`).
2. Embed the corpus (`embed_corpus()`).
3. Compute a distance signal (`distance_reference_cosine()` and/or `distance_ridge()`), then score (e.g. `score_reference_cosine()`, `score_ridge()`).
4. Calibrate the operating threshold (`calibrate_threshold()`).
5. Validate with plots (`plot_embeddings_pca()` / `plot_embeddings_umap()`).
```mermaid
sequenceDiagram
  participant U as User
  participant BC as backend_config
  participant EC as embed_corpus
  participant ES as embedding_store
  participant DP as distance_reference_cosine
  participant SCP as score_reference_cosine
  participant DR as distance_ridge
  participant SR as score_ridge
  participant CT as calibrate_threshold
  U->>BC: backend_config(...)
  U->>EC: embed_corpus(project_dir, backend)
  EC->>ES: write embeddings parquet
  U->>DP: distance_reference_cosine(...)
  U->>SCP: score_reference_cosine(...)
  U->>DR: distance_ridge(...)
  U->>SR: score_ridge(...)
  U->>CT: calibrate_threshold(...)
```
## Data and Storage Model

### Corpus input

`embed_corpus()` expects:

- `project_dir/corpus` as an Arrow dataset
- columns: `id`, `title`, `abstract`

### Embeddings output

Embeddings are written under:

- `project_dir/embeddings/model_id=<model>/label=<label>/batch=*/embeddings-*.parquet`
- metadata in: `project_dir/embeddings/model_id=<model>/embed_model.yaml`
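Because the layout uses Hive-style `key=value` partitioning, the stored embeddings can be consumed lazily with the arrow package. A sketch, assuming `"my_project"` as the project directory (the exact embedding column names depend on your run):

```r
library(arrow)
library(dplyr)

# Open the partitioned embeddings lazily; model_id, label, and batch
# become virtual columns derived from the directory names.
emb <- open_dataset("my_project/embeddings")

emb |>
  filter(label == "corpus") |>
  head(5) |>
  collect()
```

Nothing is read into memory until `collect()`, which is what makes the batch pipeline workable for large corpora.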
### Dry-run output

When `dry_run = TRUE`, no embeddings are written. Instead, a preview file is produced:

- `project_dir/<corpus_name>_dryrun.parquet`
This file supports auditing preprocessing behavior (including custom cleaners).
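For example, a dry run followed by a quick inspection of the preview file might look like this sketch (it assumes a `backend` object from `backend_config()`, a corpus named `corpus`, and that the arrow package is installed):

```r
# Validate preprocessing without writing embeddings or spending API calls.
embed_corpus(
  project_dir = "my_project",
  backend = backend,
  label = "corpus",
  dry_run = TRUE
)

# Inspect what would have been embedded.
preview <- arrow::read_parquet("my_project/corpus_dryrun.parquet")
head(preview[, c("id", "text", "text_hash")])
```

Spot-checking `text` here is the cheapest way to catch a misbehaving custom cleaner.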
```mermaid
flowchart TB
  A[project_dir/corpus] --> B[embed_corpus]
  B --> C[model_id=.../embed_model.yaml]
  B --> D[model_id=.../label=.../batch=*/embeddings-*.parquet]
  B --> E[<corpus_name>_dryrun.parquet]
```
## Text-Preparation Contract

`embed_corpus()` accepts:

- `text_preprocessor` (function)
- `cleaner_args` (named list)

The preprocessor must return a data frame with:

- `id`
- `text`
- `text_hash`

Optional columns are preserved and can be persisted (e.g., quality flags).
### Why this contract matters
- ensures skip/resume determinism,
- enables provider-independent cleaning strategies,
- allows quality provenance without altering backend adapters.
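A minimal custom preprocessor satisfying this contract might look like the following sketch. Everything here is illustrative: `my_preprocessor` and `toy_hash` are hypothetical names, and the checksum is a toy stand-in, not the hashing scheme the package uses for `text_hash`.

```r
# Toy checksum for illustration only; a real preprocessor should use a
# proper hash so that skip/resume behavior stays deterministic.
toy_hash <- function(x) sprintf("%08x", sum(utf8ToInt(x)))

# Hypothetical cleaner: join title and abstract, normalize whitespace and
# case, and return the columns the contract requires.
my_preprocessor <- function(df, ...) {
  text <- paste(df$title, df$abstract, sep = ". ")
  text <- tolower(gsub("\\s+", " ", trimws(text)))
  data.frame(
    id = df$id,
    text = text,
    text_hash = vapply(text, toy_hash, character(1)),
    n_chars = nchar(text),  # optional column: preserved and persistable
    stringsAsFactors = FALSE
  )
}

# Passed via embed_corpus(..., text_preprocessor = my_preprocessor)
```

The `n_chars` column shows the "optional columns are preserved" rule in action: it rides along with the contract columns without touching any backend adapter.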
## Decision Model: Geometric + Supervised

The package supports two complementary relevance styles:

1. **Prototype distance**
   - pairwise cosine distances between reference and corpus label partitions.
2. **Reference-area distance + score** (`distance_ridge` + `score_ridge`)
   - Mahalanobis-style distance to the reference label area (`area_distance`)
   - optional conversion to `relevance_score`.

Use one or both, then calibrate a threshold for the target operating behavior.
```mermaid
flowchart LR
  A[Embeddings] --> B[Prototype distance]
  A --> C[Reference-area score]
  B --> D[Ranked candidates]
  C --> D
  D --> E[calibrate_threshold]
  E --> F[Precision/recall operating point]
```
## Minimal End-to-End Example

```r
library(openalexVectorComp)

backend <- backend_config(
  provider = "hf",
  model = "BAAI/bge-small-en-v1.5",
  max_batch_size = 64
)

model_dir <- embed_corpus(
  project_dir = "my_project",
  backend = backend,
  batch_size = 5000,
  delete_existing = FALSE,
  label = "corpus",
  save_text = TRUE
)

embed_corpus(
  project_dir = "my_project",
  backend = backend,
  batch_size = 5000,
  delete_existing = FALSE,
  label = "reference",
  save_text = TRUE
)

pairwise_dir <- distance_reference_cosine(
  project_dir = "my_project",
  embeddings_dir = basename(model_dir),
  corpus_label = "corpus",
  reference_label = "reference"
)

pairwise_score_dir <- score_reference_cosine(
  distance_parquet = pairwise_dir,
  method = "linear"
)

# Optional supervised scoring path
scores_dir <- distance_ridge(
  project_dir = "my_project",
  reference_label = "reference",
  corpus_label = "corpus"
)
scores_dir <- score_ridge(scores_dir)

best <- calibrate_threshold(
  scores_parquet = scores_dir,
  score_col = "relevance_score",
  labels_parquet = "labels.parquet"
)
```

## Operational Guidance
- Start with HF defaults for initial setup.
- Keep `save_text = TRUE` in alpha/review phases for auditability.
- Use `dry_run = TRUE` to validate custom cleaners before API spend.
- Prefer `mode = "balanced"` in `clean_abstract_for_embedding()` unless you have measured reasons to go stricter.
- Re-calibrate thresholds whenever the model, cleaner policy, or label set changes.
## Related Vignettes

- `backend-architecture`: provider/dispatch implementation details.
- `abstract-cleaning`: cleaning rules and examples in depth.
- `tei-server-operations`: local TEI operational handling.
- `simplestart`: quick-start usage path.
## Summary
openalexVectorComp is a composable embedding-and-scoring pipeline package:
- backend-neutral for embeddings,
- deterministic for resume and reproducibility,
- pluggable for text cleaning,
- explicit for distance and calibration decisions.
Use this overview as the map, then dive into the specialized vignettes for implementation-level details.