Replay Search Latency Benchmark

Purpose

Use scripts/bench/replay_search_latency.py to replay real, high-latency search requests from telemetry and measure current lookup latency against that baseline.

The benchmark is a uv script that declares its own dependencies. It replays queries through the local, maturin-built moraine-conversations Python package (ConversationClient.search_events_json), so each run exercises the in-repo search implementation directly.

The script:

  • selects top N rows from moraine.search_query_log by response_ms in a time window,
  • excludes prior benchmark replay rows (source='benchmark-replay') by default,
  • runs maturin develop for bindings/python/moraine_conversations (unless skipped),
  • replays each query through moraine_conversations.ConversationClient.search_events_json with source='benchmark-replay',
  • runs warmup and measured repeats per query,
  • reports per-query and aggregate latency stats,
  • compares optimized replay results against an oracle_exact SQL strategy to detect ranking regressions,
  • records timeout/error counts,
  • optionally writes a JSON artifact.
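
The warmup and measured-repeat steps above can be sketched roughly as follows. This is an illustrative sketch, not the script's actual code: `measure`, `percentile`, and the `run_query` callable are hypothetical names, and the nearest-rank percentile method is an assumption (the script may interpolate differently).

```python
import math
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def measure(run_query: Callable[[], object], warmup: int = 1, repeats: int = 5) -> Dict[str, float]:
    """Run warmup calls untimed, then time `repeats` calls in milliseconds."""
    for _ in range(warmup):
        run_query()  # warmup runs execute the query but are not recorded

    samples_ms: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # Per-query stats matching the columns named in the console output.
    return {
        "min": min(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "max": max(samples_ms),
    }
```

In the real benchmark, `run_query` would wrap a call to ConversationClient.search_events_json with source='benchmark-replay'. With only a handful of repeats, p95 collapses to the slowest sample, which is why larger --repeats values give more meaningful tail numbers.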

Example Usage

Default benchmark (top 20 in last 24h):

uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml

Custom window/sample and JSON output:

uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml \
  --window 7d \
  --top-n 40 \
  --warmup 1 \
  --repeats 5 \
  --timeout-seconds 20 \
  --output-json /tmp/replay-latency.json

Inspect selected workload without replay:

uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml \
  --window 24h \
  --top-n 20 \
  --dry-run

Run latency-only replay without oracle quality gates:

uv run --script scripts/bench/replay_search_latency.py \
  --config config/moraine.toml \
  --no-oracle-quality-check

CLI Flags

  • --config <path>: Moraine config file used for ClickHouse connectivity.
  • --window <duration>: telemetry lookback window. Supported suffixes: s, m, h, d, w.
  • --top-n <int>: number of highest-latency rows to select.
  • --warmup <int>: warmup runs per selected query.
  • --repeats <int>: measured runs per selected query.
  • --timeout-seconds <int>: timeout for each replayed search request.
  • --skip-maturin-develop: skip local binding rebuild before replay.
  • --include-benchmark-replays: include source='benchmark-replay' rows in selection.
  • --query-variant-mode <none|subset_scramble>: query variant expansion mode before replay.
  • --max-query-terms <int>: max normalized terms used for variant generation.
  • --use-search-cache: allow moraine-conversations search result cache during replay (default no-cache).
  • --parse-json-response: parse replay response JSON in the benchmark process during measured runs.
  • --oracle-quality-check / --no-oracle-quality-check: enable or disable oracle quality validation (default enabled).
  • --oracle-k <int>: top-K for oracle quality metrics; 0 uses each query's replay limit.
  • --oracle-recall-at-k-threshold <0..1>: Recall@K quality gate (default 1.0).
  • --oracle-ndcg-at-k-threshold <0..1>: NDCG@K quality gate (default 0.99).
  • --oracle-min-stability-recall <0..1>: minimum oracle-vs-oracle Recall@K stability for strict gating (default 0.95).
  • --oracle-min-stability-ndcg <0..1>: minimum oracle-vs-oracle NDCG@K stability for strict gating (default 0.98).
  • --output-json <path>: write machine-readable benchmark output.
  • --print-sql: print generated ClickHouse selection SQL.
  • --dry-run: select and validate rows, but skip replay.

Defaults:

  • window=24h
  • top_n=20
  • warmup=1
  • repeats=5
  • timeout_seconds=20

Output

Console output includes:

  • selection metadata and selected time range,
  • per-query table with baseline vs replay (p50, p95, min, max, delta_p50),
  • aggregate summary (min, p50, p95, p99, max, avg),
  • oracle quality summary (Recall@K, NDCG@K, pass/regression/error counts),
  • oracle stability summary (unstable_case_count) to isolate high-ingest drift windows,
  • timeout/error totals.
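
For readers unfamiliar with the quality metrics in the summary, here is a minimal sketch of Recall@K and NDCG@K computed from two ranked lists of event IDs. The function names are hypothetical, and the relevance scheme (an oracle item at rank i out of K gets gain K - i) is an assumption made for illustration; the script's exact definition may differ. Event IDs are assumed to be unique within each list.

```python
import math
from typing import Sequence


def recall_at_k(oracle_ids: Sequence[str], replay_ids: Sequence[str], k: int) -> float:
    """Fraction of the oracle's top-K IDs that also appear in the replay's top-K."""
    expected = set(oracle_ids[:k])
    if not expected:
        return 1.0  # nothing to recall, treat as a trivial pass
    return len(expected.intersection(replay_ids[:k])) / len(expected)


def ndcg_at_k(oracle_ids: Sequence[str], replay_ids: Sequence[str], k: int) -> float:
    """NDCG@K with gains derived from oracle rank (assumed scheme)."""
    top = list(oracle_ids[:k])
    gains = {event_id: len(top) - rank for rank, event_id in enumerate(top)}

    def dcg(ids: Sequence[str]) -> float:
        return sum(
            gains.get(event_id, 0) / math.log2(pos + 2)
            for pos, event_id in enumerate(ids[:k])
        )

    ideal = dcg(top)
    return dcg(replay_ids) / ideal if ideal > 0 else 1.0


def passes_gate(recall: float, ndcg: float,
                recall_threshold: float = 1.0, ndcg_threshold: float = 0.99) -> bool:
    """Apply the documented default thresholds for the two quality gates."""
    return recall >= recall_threshold and ndcg >= ndcg_threshold
```

Under this scheme, a replay that returns the right IDs in a different order keeps Recall@K at 1.0 but lowers NDCG@K, so the two gates catch different kinds of regression.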

JSON output includes:

  • meta (timestamp, git SHA, config path, parameters),
  • selected_queries (baseline rows + replay eligibility),
  • replay_results (samples, per-query stats, failures),
  • aggregate (overall stats + counts),
  • aggregate.quality (oracle thresholds and quality summary stats),
  • failures (timeout/error totals).
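
A downstream check might load the artifact and confirm that the sections listed above are present before reading them. The sketch below assumes those names are top-level keys and that aggregate.quality is nested under aggregate; `load_artifact` and `quality_section` are hypothetical helpers, and no field names inside each section are assumed.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Sections listed in this document; assumed to be top-level JSON keys.
DOCUMENTED_SECTIONS = ("meta", "selected_queries", "replay_results", "aggregate", "failures")


def load_artifact(path: str) -> Dict[str, Any]:
    """Load a benchmark JSON artifact and verify the documented sections exist."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [name for name in DOCUMENTED_SECTIONS if name not in data]
    if missing:
        raise KeyError(f"artifact is missing sections: {missing}")
    return data


def quality_section(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return aggregate.quality, or an empty dict if the oracle check was disabled."""
    return data["aggregate"].get("quality", {})
```

For example, `quality_section(load_artifact("/tmp/replay-latency.json"))` would return the oracle summary from a run written with --output-json.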

Exit code behavior:

  • 0: all replay attempts succeeded and no oracle quality regressions or errors were reported.
  • non-zero: fatal setup error, empty selection window, replay timeout/error failures, or oracle quality regressions/errors.

Troubleshooting

  • No rows selected:
      • increase --window (for example 7d),
      • verify recent search_query_log writes are present,
      • if you intentionally want replay-generated rows, pass --include-benchmark-replays.
  • Local binding build/import failure:
      • ensure uv, Python, Rust/Cargo, and a C toolchain are available,
      • run uv run --script scripts/bench/replay_search_latency.py --config ... so maturin is auto-installed,
      • if iterating rapidly and the package is already built in the environment, use --skip-maturin-develop.
  • Replay timeouts:
      • raise --timeout-seconds,
      • reduce --top-n/--repeats for quicker local checks,
      • inspect ClickHouse logs for transient slowness.
  • Oracle quality regressions:
      • inspect per-query oracle_quality output for missing/unexpected top-K event IDs,
      • verify whether the rank divergence is expected for intentional retrieval changes,
      • only relax quality thresholds after explicit search-quality validation.
  • Invalid selected rows:
      • run --dry-run to inspect skip reasons,
      • confirm telemetry fields (result_limit, min_should_match, flags) are valid.