Purpose

This vignette documents the abstract-cleaning stage used before embedding in openalexVectorComp.

It explains:

  • what the default cleaner does,
  • why each cleaning step exists,
  • how different strictness modes change behavior,
  • how missing abstracts are handled,
  • how to provide your own custom cleaner.

The cleaner is central to embedding quality because it determines the final text that is sent to the embedding backend.

Where Cleaning Happens

embed_corpus() accepts a pluggable preprocessor:

  • text_preprocessor (function)
  • cleaner_args (named list passed to that function)

By default:

text_preprocessor = clean_abstract_for_embedding

So this call:

embed_corpus(project_dir = "my_project")

implicitly applies default abstract cleaning before embeddings are requested.

Cleaner Contract

clean_abstract_for_embedding(df, ...) expects a data frame with:

  • id
  • title
  • abstract

It returns a data frame containing at least:

  • id
  • text (final embedding input)
  • text_hash (xxhash64 hash of text)

By default it also returns provenance columns:

  • text_quality ("title_abstract" or "title_only")
  • abstract_raw_present (logical)
  • abstract_kept (logical)
  • discard_reason (character)
  • cleaning_mode (lenient/balanced/strict)

embed_corpus() validates this contract and fails early if violated.
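
The contract can be checked against any cleaner's output with a small helper. This is an illustrative sketch, not part of the package (`check_cleaner_contract` is a hypothetical name; the real validation lives inside embed_corpus()):

```r
# Sketch of a contract check for cleaner output (hypothetical helper,
# not part of openalexVectorComp).
check_cleaner_contract <- function(out) {
  stopifnot(is.data.frame(out))
  required <- c("id", "text", "text_hash")
  missing <- setdiff(required, names(out))
  if (length(missing) > 0) {
    stop("cleaner output is missing columns: ", paste(missing, collapse = ", "))
  }
  if (anyDuplicated(out$id) > 0) {
    stop("cleaner output has duplicated ids")
  }
  invisible(out)
}
```

Running it on a cleaner's return value before passing it to embed_corpus() surfaces contract violations early, with a clearer message than a downstream failure.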

Cleaning Pipeline (Step by Step)

The default cleaner applies a rule-based sequence:

  1. Normalize text:
    • collapse repeated whitespace
    • trim leading/trailing whitespace
    • convert NA to empty strings for robust handling
  2. Remove inline markup:
    • strips HTML-like tags (<...>) from abstracts
  3. Remove duplicated title prefix in abstract:
    • if abstract starts with the title, remove that repeated prefix
    • avoids over-weighting title words twice
  4. Detect low-quality abstract content:
    • placeholder phrases ("no abstract available", "n/a", etc.)
    • boilerplate-like text ("copyright", publisher fragments)
    • HTML/XML artifact hints
    • too-short abstract length
    • low alphabetic character ratio
  5. Apply policy for invalid/missing abstract:
    • keep_title_only (default): keep record with Title: ...
    • discard: drop row from embedding input
    • conditional: keep title-only only when title quality is acceptable
  6. Build final embedding input text:
    • with valid abstract:
      • Title: {title}\nAbstract: {abstract_clean}
    • fallback:
      • Title: {title}
  7. Hash final text:
    • text_hash = digest(text, algo = "xxhash64")
    • used by embed_corpus() resume logic (id + text_hash)
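
The first three steps can be sketched in base R. This is illustrative only; the package's internal implementation may differ in detail:

```r
# Step 1: normalize whitespace and NA handling.
normalize_text <- function(x) {
  x[is.na(x)] <- ""                  # NA -> empty string
  x <- gsub("\\s+", " ", x)          # collapse repeated whitespace
  trimws(x)                          # trim leading/trailing whitespace
}

# Step 2: strip simple HTML-like tags.
strip_tags <- function(x) gsub("<[^>]+>", " ", x)

# Step 3: remove a duplicated title prefix from the abstract.
drop_title_prefix <- function(title, abstract) {
  hit <- startsWith(tolower(abstract), tolower(title))
  abstract[hit] <- trimws(substring(abstract[hit], nchar(title[hit]) + 1))
  abstract
}

normalize_text("  Ocean   circulation\tdynamics ")
#> "Ocean circulation dynamics"
```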

Cleaning Flow Diagram

flowchart TD
  A[Input rows: id, title, abstract] --> B[Normalize whitespace and NA handling]
  B --> C[Strip simple HTML/XML tags]
  C --> D[Remove duplicated title prefix from abstract]
  D --> E[Compute quality signals]
  E --> E1[Placeholder phrase]
  E --> E2[Boilerplate/artifact hint]
  E --> E3[Length threshold]
  E --> E4[Alphabetic-ratio threshold]
  E1 --> F{Abstract valid?}
  E2 --> F
  E3 --> F
  E4 --> F
  F -->|yes| G[Build text: Title + Abstract]
  F -->|no| H{no_abstract_policy}
  H -->|keep_title_only| I[Build text: Title only]
  H -->|conditional| J[Title quality check]
  J -->|pass| I
  J -->|fail| K[Discard row]
  H -->|discard| K
  G --> L[Compute text_hash]
  I --> L
  K --> M[No embedding request]
  L --> N[Return id, text, text_hash + flags]

Why These Rules

The implementation deliberately avoids aggressive NLP preprocessing (stemming/stopword removal/full punctuation stripping), because modern embedding models generally perform better on natural text.

The default approach targets obvious noise while preserving semantics:

  • Keep recall by default (title-only fallback).
  • Protect precision by suppressing known junk patterns.
  • Preserve reproducibility via deterministic text_hash.
  • Enable auditing via provenance columns.

Mode Behavior (lenient, balanced, strict)

The mode argument controls how aggressive the quality thresholds are.

Conceptually:

  • lenient: minimal filtering, retain more abstracts
  • balanced (default): moderate filtering
  • strict: stronger filtering, discard more weak abstracts

In practice, mode adjusts defaults such as:

  • minimum abstract length
  • minimum alphabetic character ratio

You can override thresholds explicitly via:

  • min_chars
  • min_alpha_ratio
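
The alphabetic-ratio signal can be sketched in base R (illustrative only; the exact per-mode thresholds are internal to the package):

```r
# Sketch of the alphabetic-character-ratio quality signal.
alpha_ratio <- function(x) {
  n_alpha <- nchar(gsub("[^[:alpha:]]", "", x))  # count letters only
  n_total <- nchar(x)
  ifelse(n_total == 0, 0, n_alpha / n_total)
}

alpha_ratio("Ocean circulation dynamics")  # high: mostly letters
alpha_ratio("&#38; 12345 --- ###")         # low: mostly digits/symbols
```

An abstract whose ratio falls below min_alpha_ratio (e.g. markup debris or numeric tables) is treated as low quality and handed to the no_abstract_policy branch.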

Example: Default Cleaning

library(openalexVectorComp)

df <- data.frame(
  id = c("W1", "W2", "W3"),
  title = c(
    "Biodiversity and ecosystem resilience",
    "Ocean circulation dynamics",
    "Land-use transition analysis"
  ),
  abstract = c(
    "Biodiversity and ecosystem resilience are central to adaptation planning.",
    "No abstract available",
    "<p>Published by Example Press</p>"
  ),
  stringsAsFactors = FALSE
)

cleaned <- clean_abstract_for_embedding(df)
cleaned[, c("id", "text_quality", "abstract_kept", "discard_reason", "text")]

Expected interpretation:

  • W1: likely title_abstract
  • W2: placeholder -> fallback title_only
  • W3: boilerplate/artifact detected -> likely fallback to title_only (mode-dependent)

Example: Before/After on Individual Cases

The following example demonstrates three outcomes in one run:

  • kept as title_abstract
  • fallback to title_only
  • removed (discard policy)

examples <- data.frame(
  id = c("A1", "A2", "A3", "A4"),
  title = c(
    "Ecosystem service valuation",
    "Marine heatwave impacts",
    "Forest carbon accounting",
    "Urban biodiversity monitoring"
  ),
  abstract = c(
    # duplicated title prefix -> should be removed from abstract start
    "Ecosystem service valuation examines policy trade-offs and uncertainty.",
    # placeholder -> invalid abstract
    "No abstract available",
    # boilerplate/short artifact -> likely invalid
    "<p>Copyright 2025 Elsevier. All rights reserved.</p>",
    # clean abstract
    "We present a field protocol for repeat biodiversity monitoring in cities."
  ),
  stringsAsFactors = FALSE
)

# Keep title-only fallback for invalid abstracts
keep_case <- clean_abstract_for_embedding(
  examples,
  mode = "balanced",
  no_abstract_policy = "keep_title_only"
)
keep_case[, c("id", "text_quality", "abstract_kept", "discard_reason", "text")]

# Discard invalid abstracts entirely
discard_case <- clean_abstract_for_embedding(
  examples,
  mode = "balanced",
  no_abstract_policy = "discard"
)
discard_case[, c(
  "id",
  "text_quality",
  "abstract_kept",
  "discard_reason",
  "text"
)]

Interpretation:

  • In keep_case, A2/A3 are retained as title_only.
  • In discard_case, invalid rows are not returned at all.
  • A1 shows title-prefix cleanup before building final text.

Example: Change Missing-Abstract Policy

drop_missing <- clean_abstract_for_embedding(
  df,
  no_abstract_policy = "discard"
)
nrow(drop_missing)

With discard, rows with invalid/missing abstracts are removed from embedding input entirely.

Example: Use a Stricter Cleaning Configuration

strict_clean <- clean_abstract_for_embedding(
  df,
  mode = "strict",
  min_chars = 140,
  min_alpha_ratio = 0.70
)

This configuration is useful for high-precision workflows where noisy abstracts must be aggressively filtered.

Using Custom Patterns

You can override detection patterns for placeholders/boilerplate/artifacts:

custom_clean <- clean_abstract_for_embedding(
  df,
  placeholder_patterns = c("abstract unavailable", "^none$"),
  boilerplate_patterns = c("all rights reserved", "publisher notice"),
  html_patterns = c("<[^>]+>", "&nbsp;")
)

Integration with embed_corpus()

Default integration:

embed_corpus(
  project_dir = "my_project",
  cleaner_args = list(
    mode = "balanced",
    no_abstract_policy = "keep_title_only"
  )
)

Custom preprocessor integration:

my_preprocessor <- function(df, suffix = "") {
  text <- paste0("Title: ", df$title, " ", suffix)
  data.frame(
    id = as.character(df$id),
    text = text,
    text_hash = vapply(
      text,
      digest::digest,
      character(1),
      algo = "xxhash64",
      serialize = FALSE
    ),
    text_quality = "custom",
    stringsAsFactors = FALSE
  )
}

embed_corpus(
  project_dir = "my_project",
  text_preprocessor = my_preprocessor,
  cleaner_args = list(suffix = "[custom]")
)

Preprocessor Validation in embed_corpus()

embed_corpus() validates custom preprocessor output:

  • must return a data frame
  • must contain id, text, text_hash
  • no duplicated id
  • no ids outside the current input batch

Rows with empty/NA id or text are dropped before embedding.
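
These rules can be mirrored in a standalone sketch (hypothetical helper; the actual checks live inside embed_corpus()):

```r
# Sketch of the documented validation rules (hypothetical helper,
# not part of openalexVectorComp).
validate_preprocessor_output <- function(out, input_ids) {
  stopifnot(is.data.frame(out))                               # must be a data frame
  stopifnot(all(c("id", "text", "text_hash") %in% names(out)))  # required columns
  if (anyDuplicated(out$id) > 0) {
    stop("duplicated id in preprocessor output")
  }
  if (!all(out$id %in% input_ids)) {
    stop("id outside the current input batch")
  }
  # Drop rows with empty/NA id or text before embedding.
  keep <- !is.na(out$id) & nzchar(out$id) &
          !is.na(out$text) & nzchar(out$text)
  out[keep, , drop = FALSE]
}
```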

Output Provenance and Auditability

When the default cleaner's provenance flags are enabled, the embedding output Parquet file includes the cleaning provenance columns in addition to the embedding metadata (provider, model_id, created_at, text_hash, V1..Vd).

This makes it straightforward to:

  • analyze quality tiers (title_abstract vs title_only)
  • filter rows for downstream scoring/calibration
  • debug why some abstracts were not kept

For most large OpenAlex-style corpora, the recommended defaults are:

  • mode = "balanced"
  • no_abstract_policy = "keep_title_only"
  • keep provenance columns enabled (return_flags = TRUE)

Then tune stricter settings only if evaluation shows too much noise in downstream relevance scoring.

Summary

clean_abstract_for_embedding() provides a practical, auditable, and pluggable cleaning layer for embedding pipelines:

  • conservative semantic-preserving cleaning
  • configurable strictness and missing-abstract policy
  • deterministic hashing for resume
  • clear provenance for analysis and debugging