Search router: recall_grep via ripgrep crates, recall_resolve via nucleo #31

Closed

opened 2026-05-23 20:55:24 +00:00 by coilysiren · 1 comment

coilysiren commented

2026-05-23 20:55:24 +00:00

Owner

Originally filed by @coilysiren on 2026-05-17T20:50:34Z - https://github.com/coilysiren/repo-recall/issues/187

Problem

repo-recall ships one search surface today: tantivy-backed recall_search, indexed lexical match over the corpus. That's the right default but it doesn't cover two adjacent query shapes:

- Regex content search across the live filesystem. "Find every closes #\d+ across my repo radius right now." tantivy's regex queries match individual indexed terms, not raw file content, so the index would have to be re-shaped per query to answer this.
- Fuzzy identifier resolution. "The user typed 'aok' or a partial session UUID; which of these N known identifiers did they mean?" tantivy ranks by token relevance over the corpus, not by edit distance over a known candidate set.

These compose rather than cascade. Each answers a different question with a different input shape. Adding them gives agents two new query shapes without disturbing the existing tantivy surface.
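
To make the first query shape concrete, here is a minimal std-only sketch of what a line-oriented content search hands back. Everything in it is hypothetical: `Hit`, `find_closes_ref`, and `grep_files` are illustrative names, the "files" are in-memory strings rather than the live filesystem, and the hand-rolled matcher stands in for the single pattern `closes #\d+` only. A real version would use a proper regex engine.

```rust
/// One matching line, identified by file path and 1-based line number.
/// (Hypothetical result shape, not taken from repo-recall.)
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub path: String,
    pub line_number: usize,
    pub line: String,
}

/// Stand-in for the regex `closes #\d+`: returns the first
/// "closes #<digits>" substring in `line`, if any.
pub fn find_closes_ref(line: &str) -> Option<&str> {
    const PREFIX: &str = "closes #";
    let mut from = 0;
    while let Some(pos) = line[from..].find(PREFIX) {
        let start = from + pos;
        let digits_start = start + PREFIX.len();
        // Count the ASCII digits that directly follow the prefix.
        let digits_len = line[digits_start..]
            .bytes()
            .take_while(|b| b.is_ascii_digit())
            .count();
        if digits_len > 0 {
            return Some(&line[start..digits_start + digits_len]);
        }
        // No digits after this occurrence; keep scanning the rest of the line.
        from = digits_start;
    }
    None
}

/// Scan every (path, contents) pair and collect the lines that match.
pub fn grep_files(files: &[(&str, &str)]) -> Vec<Hit> {
    let mut hits = Vec::new();
    for (path, contents) in files {
        for (idx, line) in contents.lines().enumerate() {
            if find_closes_ref(line).is_some() {
                hits.push(Hit {
                    path: path.to_string(),
                    line_number: idx + 1,
                    line: line.to_string(),
                });
            }
        }
    }
    hits
}
```

The point of the sketch is the input and output shape: a pattern goes in, and (path, line number, line) triples come out of the current file contents, with no index involved.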

Proposal

Two new MCP tools and matching HTTP endpoints:

- recall_grep backed by the ripgrep crates (grep, grep-regex, grep-searcher, grep-printer, grep-matcher). Regex content search over the configured repo radius, in-process, no subprocess fork. Honors the same ignore walker as the rest of repo-recall.
- recall_resolve backed by nucleo. Fuzzy match against already-loaded identifier sets: repo IDs, session UUIDs, file paths within indexed repos. Returns the top-N candidates with scores.
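
As a rough illustration of the "top-N candidates with scores" contract for recall_resolve, here is a std-only sketch. It does not use nucleo and does not reproduce nucleo's scoring: the greedy subsequence matcher and its bonus values are assumptions made up for this example, as are the names `Resolved`, `subsequence_score`, and `resolve`.

```rust
/// One scored candidate (hypothetical result shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Resolved {
    pub candidate: String,
    pub score: u32,
}

/// Score `candidate` against `query` as a case-insensitive, greedy
/// left-to-right subsequence match. Returns None when the query's
/// characters do not all appear in order. The bonuses are illustrative.
pub fn subsequence_score(query: &str, candidate: &str) -> Option<u32> {
    let mut wanted = query.chars().map(|c| c.to_ascii_lowercase()).peekable();
    let mut score = 0u32;
    let mut prev_matched = false;
    for (i, c) in candidate.chars().map(|c| c.to_ascii_lowercase()).enumerate() {
        if wanted.peek() == Some(&c) {
            score += 1; // base point per matched character
            if prev_matched {
                score += 2; // bonus for consecutive matches
            }
            if i == 0 {
                score += 3; // bonus for matching the first character
            }
            prev_matched = true;
            wanted.next();
        } else {
            prev_matched = false;
        }
    }
    // Every query character must have been consumed.
    if wanted.peek().is_none() {
        Some(score)
    } else {
        None
    }
}

/// Return up to `top_n` matching candidates, best score first.
pub fn resolve(query: &str, candidates: &[&str], top_n: usize) -> Vec<Resolved> {
    let mut hits: Vec<Resolved> = candidates
        .iter()
        .filter_map(|c| {
            subsequence_score(query, c).map(|score| Resolved {
                candidate: (*c).to_string(),
                score,
            })
        })
        .collect();
    // Highest score first; ties broken by name so the order is deterministic.
    hits.sort_by(|a, b| b.score.cmp(&a.score).then_with(|| a.candidate.cmp(&b.candidate)));
    hits.truncate(top_n);
    hits
}
```

Calling `resolve("rr", &["docs-site", "web-router", "repo-recall"], 2)` ranks `repo-recall` ahead of `web-router` and drops `docs-site`, which has no `r` at all. The candidate set is passed in explicitly, which matches the "already-loaded identifier sets" framing: nothing here touches the corpus or the index.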

The composition story is "tantivy is the default lexical search; ripgrep is the regex escape hatch; nucleo is the identifier resolver." Three endpoints, no router logic, agents pick by query shape.
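
One way to picture "agents pick by query shape" is as three distinct request types, one per tool. The sketch below is an assumption for illustration only: `RecallRequest`, its field names, and the `tool_name` and `backend` helpers are hypothetical and not repo-recall's actual API. Only the tool names and backing libraries come from the text above.

```rust
/// The three query shapes, one per endpoint. The caller chooses a variant
/// when it builds the request; nothing inspects the query text to guess a tool.
#[derive(Debug, Clone, PartialEq)]
pub enum RecallRequest {
    /// Indexed lexical match over the corpus.
    Search { query: String },
    /// Regex content search over the live filesystem.
    Grep { pattern: String },
    /// Fuzzy match of a partial identifier against a known candidate set.
    Resolve { partial: String, top_n: usize },
}

impl RecallRequest {
    /// Name of the tool this request shape belongs to.
    pub fn tool_name(&self) -> &'static str {
        match self {
            RecallRequest::Search { .. } => "recall_search",
            RecallRequest::Grep { .. } => "recall_grep",
            RecallRequest::Resolve { .. } => "recall_resolve",
        }
    }

    /// Library named in the proposal as the backend for this tool.
    pub fn backend(&self) -> &'static str {
        match self {
            RecallRequest::Search { .. } => "tantivy",
            RecallRequest::Grep { .. } => "ripgrep crates",
            RecallRequest::Resolve { .. } => "nucleo",
        }
    }
}
```

Each variant carries a different input, which is the sense in which the tools compose rather than cascade: a regex pattern is not a degraded lexical query, and a partial identifier is not a degraded regex.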

Why merged

Both are cheap to add (no language-specific work, no upstream coverage gaps), and together they round out the search surface in one increment rather than two. Splitting would create artificial sequencing between two independent endpoints.

Out of scope

- Cross-tool result blending. Each endpoint stands alone.
- Replacing tantivy for any existing query path.
- Semantic / embedding-based search.

Open sub-questions

- Sequencing: ship recall_grep first, then recall_resolve if identifier-resolution friction actually shows up. Or both at once. Lightly leaning ship-both since they're each small.
- Whether nucleo's candidate set should include file paths from the (sibling-issue) structural-facts pass once that lands. Probably yes, but only after structural-facts is live.

Origin

Conversation 2026-05-17. Sibling issues: structural-facts pass, code-metrics (rust-code-analysis), per-source refresh rates.


coilysiren added the

label

2026-05-31 01:55:04 +00:00

coilysiren commented

2026-06-17 08:24:31 +00:00

Author

Owner

Backlog burndown 2026-06-17: closing low-priority (P3/P4) to bring the open count to a manageable level. Nothing lost — reopen if this resurfaces. Batch tag: burndown-2026-06.


coilysiren

2026-06-17 08:24:31 +00:00