Search router: recall_grep via ripgrep crates, recall_resolve via nucleo #31

Open · opened 2026-05-23 20:55:24 +00:00 by coilysiren (Owner) · 0 comments

Originally filed by @coilysiren on 2026-05-17T20:50:34Z - https://github.com/coilysiren/repo-recall/issues/187

Problem

repo-recall ships one search surface today: tantivy-backed recall_search, indexed lexical match over the corpus. That's the right default but it doesn't cover two adjacent query shapes:

  • Regex content search across the live filesystem. "Find every closes #\d+ across my repo radius right now." tantivy can't answer regex; the index would have to be re-shaped per query.
  • Fuzzy identifier resolution. "The user typed 'aok' or a partial session UUID; which of these N known identifiers did they mean?" tantivy ranks by token relevance over the corpus, not by edit distance over a known candidate set.

These compose rather than cascade. Each answers a different question with a different input shape. Adding them gives agents two new query shapes without disturbing the existing tantivy surface.

Proposal

Two new MCP tools and matching HTTP endpoints:

  • recall_grep backed by the ripgrep crates (grep, grep-regex, grep-searcher, grep-printer, grep-matcher). Regex content search over the configured repo radius, in-process, no subprocess fork. Honors the same ignore walker as the rest of repo-recall.
  • recall_resolve backed by nucleo. Fuzzy match against already-loaded identifier sets: repo IDs, session UUIDs, file paths within indexed repos. Returns the top-N candidates with scores.
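
A minimal sketch of the "top-N candidates with scores" contract for recall_resolve. It does not use nucleo: the scorer is a plain case-insensitive subsequence match with small bonuses, swapped in so the example runs on std alone. The names `Candidate`, `subsequence_score`, and `resolve`, and the sample identifiers in the usage note, are hypothetical and not part of repo-recall.

```rust
/// One scored match. Higher scores are better. (Hypothetical type.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub identifier: String,
    pub score: u32,
}

/// Stand-in scorer, NOT nucleo's algorithm: greedy, case-insensitive
/// subsequence match. Each matched char scores 1, with +2 when it sits at
/// index 0 and +2 when it directly follows the previous matched char.
/// Returns None when `query` is not a subsequence of `candidate`.
fn subsequence_score(query: &str, candidate: &str) -> Option<u32> {
    let cand: Vec<char> = candidate.to_lowercase().chars().collect();
    let mut score = 0u32;
    let mut pos = 0usize; // next index in `cand` to examine
    let mut prev_match: Option<usize> = None;

    for qc in query.to_lowercase().chars() {
        // Leftmost occurrence of `qc` at or after `pos`; bail out if absent.
        let found = cand[pos..].iter().position(|&c| c == qc)? + pos;
        score += 1;
        if found == 0 {
            score += 2; // prefix bonus
        }
        if found > 0 && prev_match == Some(found - 1) {
            score += 2; // consecutive-character bonus
        }
        prev_match = Some(found);
        pos = found + 1;
    }
    Some(score)
}

/// Score every known identifier, drop non-matches, and return the best
/// `top_n`, ordered by score (descending) and then identifier (ascending).
pub fn resolve(query: &str, known: &[&str], top_n: usize) -> Vec<Candidate> {
    let mut hits: Vec<Candidate> = known
        .iter()
        .filter_map(|id| {
            subsequence_score(query, id).map(|score| Candidate {
                identifier: id.to_string(),
                score,
            })
        })
        .collect();
    hits.sort_by(|a, b| {
        b.score
            .cmp(&a.score)
            .then_with(|| a.identifier.cmp(&b.identifier))
    });
    hits.truncate(top_n);
    hits
}
```

With a made-up candidate set of "aokay-repo", "kao-tools", "a-ok-service", and "backoffice", `resolve("aok", ..., 5)` ranks "aokay-repo" ahead of "a-ok-service" and drops the other two, since "aok" is not a subsequence of either. A real implementation would presumably hand scoring to nucleo and keep only the candidate-set plumbing.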

The composition story is "tantivy is the default lexical search; ripgrep is the regex escape hatch; nucleo is the identifier resolver." Three endpoints, no router logic, agents pick by query shape.
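
One way to picture "agents pick by query shape" is as a closed set of request shapes, each owned by exactly one tool, with no fallback or blending between them. The sketch below is illustrative only: the `Query` enum and its field names are assumptions, and only the three tool names come from this issue.

```rust
/// Hypothetical request shapes, one per search surface.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Query {
    /// Indexed lexical match over the corpus (existing tantivy surface).
    Lexical { text: String },
    /// Regex content search over the live filesystem.
    Regex { pattern: String },
    /// Fuzzy match of a partial identifier against a known candidate set.
    Identifier { partial: String, top_n: usize },
}

impl Query {
    /// Each shape maps to exactly one tool. Nothing here inspects the
    /// query text to guess a shape; the caller has already chosen it.
    pub fn tool_name(&self) -> &'static str {
        match self {
            Query::Lexical { .. } => "recall_search",
            Query::Regex { .. } => "recall_grep",
            Query::Identifier { .. } => "recall_resolve",
        }
    }
}
```

The point of the sketch is that the mapping is one-to-one and chosen by the caller, which is what lets each endpoint stand alone.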

Why merged

Both are cheap to add (no language-specific work, no upstream coverage gaps), and together they round out the search surface in one increment rather than two. Splitting would create artificial sequencing between two independent endpoints.

Out of scope

  • Cross-tool result blending. Each endpoint stands alone.
  • Replacing tantivy for any existing query path.
  • Semantic / embedding-based search.

Open sub-questions

  • Sequencing: ship recall_grep first, then recall_resolve if identifier-resolution friction actually shows up. Or both at once. Lightly leaning ship-both since they're each small.
  • Whether nucleo's candidate set should include file paths from the (sibling-issue) structural-facts pass once that lands. Probably yes, but only after structural-facts is live.

Origin

Conversation 2026-05-17. Sibling issues: structural-facts pass, code-metrics (rust-code-analysis), per-source refresh rates.

coilysiren added P4 and removed P3 labels 2026-05-31 07:01:15 +00:00
Reference: coilyco-flight-deck/repo-recall#31