BM25 Indexing and Retrieval¶

Retrieval Objective¶

Moraine retrieval is designed around one principle: ranking should be cheap at query time because expensive corpus transformation has already been paid incrementally at ingest time. This is the operational equivalent of BM25S thinking, but implemented with ClickHouse tables and materialized views instead of in-memory sparse matrices. The consequence is a thin MCP process that computes scores and formats responses while delegating index freshness and persistence to the database layer. 004_search_index.sql:L28 004_search_index.sql:L100 main.rs:L337

The retrieval stack has three cooperating stages. Stage one is document projection from canonical events into search_documents. Stage two is sparse postings and corpus statistics maintenance. Stage three is query-time BM25 scoring with runtime filters in codex-mcp. Understanding this separation is essential for debugging: if search quality degrades, determine whether the issue is in projection, indexing, or ranking policy before tuning constants. 004_search_index.sql:L1 004_search_index.sql:L82 main.rs:L482

Index Construction in ClickHouse¶

search_documents is the document surface. It stores event UID, session metadata, class/type fields, payload JSON, and textual content plus computed doc_len. doc_len is materialized using regex token extraction over lowercased text. This materialization means average document length and BM25 denominator terms are cheap to compute later, with no repeated tokenization in the MCP process. 004_search_index.sql:L1 004_search_index.sql:L22

Documents are produced by one materialized view, mv_search_documents_from_events, sourced from events. The projection maps canonical event fields into retrieval columns, carries origin_event_id as compacted_parent_uid, and filters out rows whose text is only whitespace. This keeps corpus size aligned with retrieval utility and avoids ranking overhead on structural events without textual payloads. 004_search_index.sql:L42 004_search_index.sql:L49 004_search_index.sql:L68 004_search_index.sql:L69

search_postings is the sparse index. The MV explodes each document into tokens via arrayJoin(extractAll(...)), filters term lengths to 2..64, groups by (term, doc), and stores tf along with document and context metadata. This table is partitioned by hashed term buckets and ordered by (term, doc_id), making term-constrained scans efficient even as corpus grows. 004_search_index.sql:L97 004_search_index.sql:L100 004_search_index.sql:L129 004_search_index.sql:L133

Two stats views are defined over indexed data. search_term_stats computes per-term document counts from search_postings FINAL, while search_corpus_stats computes corpus-wide document count and summed doc length from search_documents FINAL. This keeps stats derivation aligned with the current index state without separate additive-maintenance tables. 004_search_index.sql:L151 004_search_index.sql:L156 004_search_index.sql:L161 004_search_index.sql:L167

Query Processing in `codex-mcp`¶

Search starts with query tokenization using [A-Za-z0-9_]+, lowercasing, and length limits. Term count is capped by config (max_query_terms). The service preserves token order and term frequency in the tokenizer output, but current SQL scoring path treats each unique token once through term-level postings and IDF maps. Query validation rejects empty or non-searchable inputs early. main.rs:L346 main.rs:L989 main.rs:L1000

Runtime bounds are applied next. Requested limit is clamped to max_results, min_should_match is clamped to [1, term_count], and default flags for tool-event inclusion and codex-mcp exclusion are applied from config. Optional session_id is validated against a strict safe-character regex before SQL generation, reducing injection and malformed-filter risks. main.rs:L353 main.rs:L359 main.rs:L366 main.rs:L373 main.rs:L995

The service fetches corpus totals and term DF values before building ranking SQL. Corpus stats are read from search_corpus_stats, with fallback aggregation from search_documents if stats are absent. DF values are read from search_term_stats, with fallback counts from search_postings for missing terms. These fallbacks make retrieval resilient during bootstrap and partial index repair. main.rs:L578 main.rs:L582 main.rs:L588 main.rs:L597

BM25 Formula and SQL Realization¶

IDF is computed per query term in-process using an Okapi BM25-style smoothing expression. For unseen terms (df=0), the service uses a high fallback IDF derived from corpus size; for seen terms it uses ln(1 + ((N - df + 0.5)/(df + 0.5))) with non-negative clamping. This keeps ranking numerically stable and prevents negative-term contributions from high-frequency terms. main.rs:L398 main.rs:L401 main.rs:L405

The SQL query embeds k1, b, avgdl, term array, and aligned IDF array in a WITH clause. For each posting row, term-specific IDF is selected with transform(term, q_terms, q_idf, 0.0), and BM25 contribution is computed as tf*(k1+1)/(tf + k1*(1-b+b*doc_len/avgdl)). Contributions are summed per document; matched_terms and score filters are applied in HAVING, then results are ordered by score and limited. main.rs:L532 main.rs:L551 main.rs:L554 main.rs:L564

Because ranking is executed over postings constrained by query terms, dominant cost is posting fanout, not corpus cardinality. This is the key performance behavior that enables real-time local search at scale within one node, provided token normalization keeps term selectivity reasonable. main.rs:L505 main.rs:L560

Retrieval Policy Filters¶

By default, retrieval excludes several operationally noisy payloads and prefers semantically meaningful event classes (message, reasoning, event_msg). When include_tool_events is false, additional payload-type exclusions remove lifecycle chatter such as task_started and turn_aborted. This default policy is tuned for agent consumption quality rather than maximal recall of low-signal events. main.rs:L513 main.rs:L515 main.rs:L517

A second policy filter optionally excludes codex-mcp self-reference. It removes rows whose payload mentions codex-mcp and rows with tool names search or open. This prevents retrieval loops where prior search/open traces dominate subsequent search results. The filter can be disabled when self-observation is intentionally desired. main.rs:L523 main.rs:L525 moraine.toml:L41

Session scoping is supported through exact session_id filtering in postings query conditions. In scoped mode, ranking is still BM25-based but corpus statistics remain global in current implementation. That means scores are comparable within scoped results for ranking order, but absolute values should not be interpreted as session-local calibrated relevance probabilities. main.rs:L507 main.rs:L572

`open` Tool and Context Reconstruction¶

open resolves one event UID to a session and event order using v_conversation_trace, then fetches an ordered context window around that order. This is intentionally separate from lexical ranking and relies on the trace view’s deterministic ordering semantics. If UID is not found, the tool returns found=false instead of an error payload. main.rs:L728 main.rs:L733 main.rs:L735

Returned rows include both concise fields (actor, class, payload type, text) and full payload/token JSON for deep inspection. In prose mode, context is rendered in deterministic order and partitioned into before/target/after blocks to improve agent readability while preserving event order metadata. main.rs:L745 main.rs:L755 main.rs:L905

Freshness and Rebuild Behavior¶

Steady-state freshness is push-driven: ingestor writes canonical rows, MVs update documents and postings, and MCP queries read latest committed state. No periodic full-corpus reindex is required for normal operation. This architecture is robust under continuous append workloads because index maintenance is amortized over ingest writes. 004_search_index.sql:L28 004_search_index.sql:L100 main.rs:L422

For schema changes or index corruption repair, bin/backfill-search-index truncates search tables and rehydrates documents from canonical event tables. Postings and stats repopulate through MVs after inserts. Operators should run this explicitly after tokenization/projection changes to avoid mixed-semantics corpora. backfill-search-index:L74 backfill-search-index:L79 backfill-search-index:L81

Query and Interaction Logging¶

Each search writes a search_query_log row with normalized terms, filter settings, response latency, result count, and BM25 metadata (docs, avgdl, k1, b). Ranked results are written to search_hit_log with per-hit rank, score, and contextual metadata. These writes can be synchronous or async, controlled by config. main.rs:L621 main.rs:L637 main.rs:L667 main.rs:L689

search_interaction_log is reserved for external feedback capture and is not currently auto-populated by MCP. Keeping this table in baseline schema is strategic: it allows later relevance-learning loops to ingest click/selection/annotation events without schema migration pressure. 004_search_index.sql:L220

Performance and Quality Tuning¶

High-impact knobs are k1, b, min_should_match, result limit, and inclusion filters. Raising min_should_match increases precision by requiring broader term overlap; lowering it increases recall for short queries or sparse terms. k1 and b should be tuned against corpus characteristics, but defaults (1.2, 0.75) are reasonable starting points for mixed conversational and tool text. moraine.toml:L46-47 moraine.toml:L49

If ranking quality looks noisy, first inspect corpus inputs, not formula constants. Common root causes are weak text_content extraction, inclusion of operational chatter, or stale/misaligned index tables after schema changes. Constants cannot recover information that never entered search_documents correctly. 004_search_index.sql:L53 normalize.rs:L651 backfill-search-index:L72

If latency regresses, inspect posting fanout and query term shape. Extremely broad terms and long query token lists increase candidate set size. The max-query-terms cap provides a hard guardrail; if you raise it, do so with observed workload data and not by default. main.rs:L346 main.rs:L1016 moraine.toml:L50

Known Limits¶

The current implementation uses simple regex tokenization with no stemming, lemmatization, phrase scoring, or field weighting. This keeps indexing fast and predictable for code-like and operational text, but leaves semantic recall on the table for morphology-heavy language domains. Advanced linguistic normalization can be layered later, but should only be introduced with explicit rebuild and relevance-evaluation plans. 004_search_index.sql:L22 004_search_index.sql:L129 main.rs:L989

Score interpretation should remain relative, not absolute. BM25 scores are useful for rank ordering within one query, but cross-query score comparisons are weak without calibration. Downstream agents should prefer rank and contextual text over raw score thresholds unless query distributions are controlled.