# Replay Search Latency Benchmark

## Purpose
Use `scripts/bench/replay_search_latency.py` to replay real, high-latency search requests captured in telemetry and measure current lookup latency against that baseline.

The benchmark is a self-contained uv script (it declares its own dependencies inline) and replays through the locally maturin-built `moraine-conversations` Python package (`ConversationClient.search_events_json`), so each run exercises the in-repo search implementation directly.
The script:

- selects the top `N` rows from `moraine.search_query_log` by `response_ms` in a time window,
- excludes prior benchmark replay rows (`source='benchmark-replay'`) by default,
- runs `maturin develop` for `bindings/python/moraine_conversations` (unless skipped),
- replays each query through `moraine_conversations.ConversationClient.search_events_json` with `source='benchmark-replay'`,
- runs warmup and measured repeats per query,
- reports per-query and aggregate latency stats,
- compares optimized replay results against an `oracle_exact` SQL strategy to detect ranking regressions,
- records timeout/error counts,
- optionally writes a JSON artifact.
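The selection step above can be sketched as a ClickHouse query builder. This is illustrative only: the table name, `response_ms` ordering, and `source` exclusion come from the description above, while the `ts` timestamp column and the exact projection are assumptions about the telemetry schema.

```python
# Hypothetical sketch of the workload-selection SQL described above.
# Column names `ts` and `query` are assumptions; the real script may differ.
def build_selection_sql(window_seconds: int, top_n: int,
                        include_benchmark_replays: bool = False) -> str:
    where = [f"ts >= now() - INTERVAL {window_seconds} SECOND"]
    if not include_benchmark_replays:
        # Default behavior: skip rows produced by earlier benchmark runs.
        where.append("source != 'benchmark-replay'")
    return (
        "SELECT query, result_limit, min_should_match, response_ms "
        "FROM moraine.search_query_log "
        f"WHERE {' AND '.join(where)} "
        f"ORDER BY response_ms DESC LIMIT {top_n}"
    )

sql = build_selection_sql(window_seconds=86400, top_n=20)
```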
## Example Usage
Default benchmark (top 20 in last 24h):
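With every tuning flag left at its default, the invocation reduces to the script and config path (the same form shown in Troubleshooting below):

```bash
uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml
```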
Custom window/sample and JSON output:
```bash
uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml \
  --window 7d \
  --top-n 40 \
  --warmup 1 \
  --repeats 5 \
  --timeout-seconds 20 \
  --output-json /tmp/replay-latency.json
```
Inspect selected workload without replay:
```bash
uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml \
  --window 24h \
  --top-n 20 \
  --dry-run
```
Run latency-only replay without oracle quality gates:
```bash
uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml \
  --no-oracle-quality-check
```
## CLI Flags
- `--config <path>`: Moraine config file used for ClickHouse connectivity.
- `--window <duration>`: telemetry lookback window. Supported suffixes: `s`, `m`, `h`, `d`, `w`.
- `--top-n <int>`: number of highest-latency rows to select.
- `--warmup <int>`: warmup runs per selected query.
- `--repeats <int>`: measured runs per selected query.
- `--timeout-seconds <int>`: timeout for each replayed search request.
- `--skip-maturin-develop`: skip the local binding rebuild before replay.
- `--include-benchmark-replays`: include `source='benchmark-replay'` rows in selection.
- `--query-variant-mode <none|subset_scramble>`: query variant expansion mode before replay.
- `--max-query-terms <int>`: max normalized terms used for variant generation.
- `--use-search-cache`: allow the `moraine-conversations` search result cache during replay (default: no cache).
- `--parse-json-response`: parse replay response JSON in the benchmark process during measured runs.
- `--oracle-quality-check` / `--no-oracle-quality-check`: enable or disable oracle quality validation (default: enabled).
- `--oracle-k <int>`: top-K for oracle quality metrics; `0` uses each query's replay `limit`.
- `--oracle-recall-at-k-threshold <0..1>`: Recall@K quality gate (default `1.0`).
- `--oracle-ndcg-at-k-threshold <0..1>`: NDCG@K quality gate (default `0.99`).
- `--oracle-min-stability-recall <0..1>`: minimum oracle-vs-oracle Recall@K stability for strict gating (default `0.95`).
- `--oracle-min-stability-ndcg <0..1>`: minimum oracle-vs-oracle NDCG@K stability for strict gating (default `0.98`).
- `--output-json <path>`: write machine-readable benchmark output.
- `--print-sql`: print the generated ClickHouse selection SQL.
- `--dry-run`: select and validate rows, but skip replay.
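The `--window` grammar above (an integer plus one of the suffixes `s`, `m`, `h`, `d`, `w`) can be parsed in a few lines. This is a hypothetical re-implementation for reference; the script's actual parser may differ in edge-case handling.

```python
import re

# Seconds per supported --window suffix.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def parse_window(text: str) -> int:
    """Return the window length in seconds, e.g. '24h' -> 86400."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if not match:
        raise ValueError(f"invalid window: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]
```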
Defaults: `window=24h`, `top_n=20`, `warmup=1`, `repeats=5`, `timeout_seconds=20`.
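These defaults imply a simple worst-case wall-clock bound: every warmup and measured run of every selected query could, at the limit, hit the timeout. A back-of-envelope check (not from the script itself):

```python
# Worst-case runtime if every run of every query times out.
# Real runs are normally far faster; this just bounds a stuck run.
def worst_case_seconds(top_n: int, warmup: int, repeats: int,
                       timeout_seconds: int) -> int:
    return top_n * (warmup + repeats) * timeout_seconds

worst_case_seconds(20, 1, 5, 20)  # 20 * 6 * 20 = 2400 s, i.e. 40 minutes
```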
## Output
Console output includes:
- selection metadata and the selected time range,
- per-query table with baseline vs replay (`p50`, `p95`, `min`, `max`, `delta_p50`),
- aggregate summary (`min`, `p50`, `p95`, `p99`, `max`, `avg`),
- oracle quality summary (Recall@K, NDCG@K, pass/regression/error counts),
- oracle stability summary (`unstable_case_count`) to isolate high-ingest drift windows,
- timeout/error totals.
JSON output includes:

- `meta` (timestamp, git SHA, config path, parameters),
- `selected_queries` (baseline rows + replay eligibility),
- `replay_results` (samples, per-query stats, failures),
- `aggregate` (overall stats + counts),
- `aggregate.quality` (oracle thresholds and quality summary stats),
- `failures` (timeout/error totals).
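The `--output-json` artifact is convenient for CI gating. Below is a hypothetical consumer: the top-level keys (`aggregate`, `failures`) follow the structure above, but the leaf field names (`p95`, `timeouts`, `errors`) are assumptions about the artifact's exact shape.

```python
import json  # used when loading the artifact, see the usage comment below

def check_report(report: dict, p95_budget_ms: float) -> list[str]:
    """Return a list of problems; an empty list means the run passes."""
    problems = []
    if report["aggregate"]["p95"] > p95_budget_ms:
        problems.append(f"aggregate p95 {report['aggregate']['p95']}ms over budget")
    if report["failures"]["timeouts"] or report["failures"]["errors"]:
        problems.append("replay timeouts/errors present")
    return problems

# Usage (assumed field names):
#   report = json.load(open("/tmp/replay-latency.json"))
#   problems = check_report(report, p95_budget_ms=500)
```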
Exit code behavior:

- `0`: all replay attempts succeeded.
- non-zero: fatal setup error, empty selection window, replay timeout/error failures, or oracle quality regressions/errors.
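The oracle quality gates compare the optimized replay's top-K event IDs against the `oracle_exact` top-K. A sketch of Recall@K and NDCG@K under binary relevance (membership in the oracle top-K), which is an assumption about how the script scores relevance:

```python
import math

def recall_at_k(replay_ids: list[str], oracle_ids: list[str], k: int) -> float:
    """Fraction of the oracle top-K that the replay's top-K recovered."""
    oracle_top = set(oracle_ids[:k])
    hits = sum(1 for eid in replay_ids[:k] if eid in oracle_top)
    return hits / len(oracle_top) if oracle_top else 1.0

def ndcg_at_k(replay_ids: list[str], oracle_ids: list[str], k: int) -> float:
    """Rank-sensitive agreement: hits near the top count for more."""
    oracle_top = set(oracle_ids[:k])
    dcg = sum(1 / math.log2(i + 2)
              for i, eid in enumerate(replay_ids[:k]) if eid in oracle_top)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(oracle_top))))
    return dcg / ideal if ideal else 1.0
```

With the default thresholds (Recall@K >= 1.0, NDCG@K >= 0.99), any missing oracle ID fails the recall gate, while NDCG tolerates only tiny reorderings.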
## Troubleshooting
- No rows selected:
    - increase `--window` (for example `7d`),
    - verify recent `search_query_log` writes are present,
    - if you intentionally want replay-generated rows, pass `--include-benchmark-replays`.
- Local binding build/import failure:
    - ensure `uv`, Python, Rust/Cargo, and a C toolchain are available,
    - run `uv run --script scripts/bench/replay_search_latency.py --config ...` so `maturin` is auto-installed,
    - if iterating rapidly and the package is already built in the environment, use `--skip-maturin-develop`.
- Replay timeouts:
    - raise `--timeout-seconds`,
    - reduce `--top-n`/`--repeats` for quicker local checks,
    - inspect ClickHouse logs for transient slowness.
- Oracle quality regressions:
    - inspect per-query `oracle_quality` output for missing/unexpected top-K event IDs,
    - verify whether the rank divergence is expected for intentional retrieval changes,
    - only relax quality thresholds after explicit search-quality validation.
- Invalid selected rows:
    - run with `--dry-run` to inspect skip reasons,
    - confirm telemetry fields (`result_limit`, `min_should_match`, flags) are valid.