Clean title/abstract rows into embedding-ready text
Source: R/clean_abstract_for_embedding.R
clean_abstract_for_embedding.Rd

Applies lightweight rule-based cleaning to title/abstract rows and
returns embedding-ready text plus a deterministic text_hash.
Arguments
- df
Data frame with columns id, title, and abstract.
- mode
Cleaning intensity: "lenient", "balanced" (default), or "strict".
- no_abstract_policy
Policy when abstract is missing/invalid: "keep_title_only" (default), "discard", or "conditional".
- min_chars
Optional minimum abstract length in characters after cleaning. If NULL, mode-specific defaults are used.
- min_alpha_ratio
Optional minimum ratio of alphabetic characters in the cleaned abstract. If NULL, mode-specific defaults are used.
- placeholder_patterns
Optional regex vector for placeholder abstract detection.
- boilerplate_patterns
Optional regex vector for publisher boilerplate detection.
- html_patterns
Optional regex vector for HTML/XML artifact detection.
- return_flags
If TRUE, include provenance/quality columns.
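
To make the min_chars, min_alpha_ratio, and placeholder_patterns arguments more concrete, here is a minimal base R sketch of how such checks could be combined. It is not the package's implementation: the helper names alpha_ratio() and abstract_is_usable(), the threshold values 50 and 0.6, and the example placeholder regex are all assumptions chosen for illustration, since the mode-specific defaults are not stated above.

```r
# Hypothetical helper (not part of the package): share of alphabetic
# characters in each string; empty strings get a ratio of 0.
alpha_ratio <- function(x) {
  n_total <- nchar(x)
  n_alpha <- nchar(gsub("[^[:alpha:]]", "", x))
  ifelse(n_total > 0, n_alpha / n_total, 0)
}

# Hypothetical check (not part of the package): an abstract counts as usable
# if it is long enough, mostly alphabetic, and matches no placeholder pattern.
# The default values below are assumed, not the package's mode defaults.
abstract_is_usable <- function(abstract,
                               min_chars = 50,
                               min_alpha_ratio = 0.6,
                               placeholder_patterns = c("^no abstract( available)?\\.?$")) {
  # Treat missing abstracts as empty strings.
  abstract <- trimws(ifelse(is.na(abstract), "", abstract))

  # TRUE where any placeholder regex matches (case-insensitive).
  is_placeholder <- Reduce(
    `|`,
    lapply(placeholder_patterns, function(p) grepl(p, abstract, ignore.case = TRUE)),
    rep(FALSE, length(abstract))
  )

  nchar(abstract) >= min_chars &
    alpha_ratio(abstract) >= min_alpha_ratio &
    !is_placeholder
}

# Illustrative input: a real abstract, a placeholder, a missing value,
# and a long but non-alphabetic string.
abstracts <- c(
  "We study transformer embeddings for bibliographic deduplication across large corpora.",
  "No abstract available.",
  NA,
  "12345 67890 -- 12345 67890 -- 12345 67890 -- 12345 67890 -- 1234"
)
usable <- abstract_is_usable(abstracts)
```

Rows where such a check fails would then be handled according to no_abstract_policy, for example by keeping only the title under "keep_title_only".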