Loop 2 layer 2.5: move snippet-frequency mining into repo-recall + Luca #1

New issue

Closed

opened 2026-05-23 20:55:36 +00:00 by coilysiren · 0 comments

coilysiren commented

2026-05-23 20:55:36 +00:00

Owner

Originally filed by @coilysiren on 2026-05-20T12:54:38Z - https://github.com/coilysiren/voice-flow-learning-loop/issues/17

🤖 Filed by Claude Code on Kai's behalf.

Problem

#16 landed a standalone Python script (scripts/snippet-frequency.py) that walks ~/.claude/projects/**/*.jsonl directly, applies filters, and counts word-n-grams. The script works, but it duplicates capability that already exists in Kai's stack:

repo-recall already indexes session JSONL across every project. The script re-walks raw files instead of querying the index.
Luca is the natural-language consumer layer over repo-recall data. The frequency-mining behavior should be a Luca tool / dispatch route, not a per-repo script.

Building it standalone fragments the substrate: every consumer that wants snippet candidates has to re-implement the JSONL walk, the harness-wrapper stripping, the subagent-prompt heuristics. Putting it in repo-recall (for the indexing primitive) + Luca (for the natural-language surface) reuses Kai's existing infrastructure and earns its place in the Luca-asker fleet.

This effectively collapses layers 2 and 3 of the original README plan: layer 2 was the standalone script, layer 3 was the Luca dispatch route wrapping that script. The right design is layer 2 = the repo-recall primitive + the Luca tool.

Scope

In repo-recall: add (or surface) a primitive for "give me user-role plain-string content across all sessions, with optional repo / cwd / time filters." This is the indexer's job. The harness-wrapper stripping and subagent-prompt filtering belong here so every downstream asker gets clean signal.
In Luca: add a snippet_candidates tool (or close-named) that consumes the repo-recall primitive, counts n-grams above a length floor, collapses substring overlaps, and returns ranked candidates. The natural-language surface is whatever Luca already does (dispatch route, MCP tool, ask phrasing).
Migrate this repo's runs: the next mining run uses Luca, not the standalone script. The script becomes either a reference implementation kept for archive, or is deleted once the Luca path is in place.

Done when

repo-recall exposes the indexing primitive (whatever shape - MCP tool, JSON query, dispatch route).
Luca exposes the snippet-candidate tool.
A 2026-05-20+1 mining run lands in docs/snippets-mining-runs/ produced via Luca, with shape comparable to the layer-2 run already committed.
scripts/snippet-frequency.py is either deleted or marked as a reference implementation in its own header.

Why now, before layer 3

The remaining layers (4-8) are all natural-language-shaped and operate on candidate lists - length-weighted ranking, session-aware weighting, phrase-shape extractors, trigger-name proposals via LLM, conflict dedup. Every one of those is a Luca tool that takes "the candidates" as input. If the candidate-generation step lives in a per-repo Python script, every later layer has to either reimplement it or shell out. Moving it into Luca up front means layers 4-8 are pure Luca compositions, not script wrappers.

Depends on: #16 (delivered the reference implementation that defines what the Luca tool needs to produce). Unblocks: layers 4-8 (renumbered as Luca-composition layers in a followup).

_Originally filed by @coilysiren on 2026-05-20T12:54:38Z - [https://github.com/coilysiren/voice-flow-learning-loop/issues/17](https://github.com/coilysiren/voice-flow-learning-loop/issues/17)_ > 🤖 Filed by Claude Code on Kai's behalf. **Problem** #16 landed a standalone Python script (`scripts/snippet-frequency.py`) that walks `~/.claude/projects/**/*.jsonl` directly, applies filters, and counts word-n-grams. The script works, but it duplicates capability that already exists in Kai's stack: - **repo-recall** already indexes session JSONL across every project. The script re-walks raw files instead of querying the index. - **Luca** is the natural-language consumer layer over repo-recall data. The frequency-mining behavior should be a Luca tool / dispatch route, not a per-repo script. Building it standalone fragments the substrate: every consumer that wants snippet candidates has to re-implement the JSONL walk, the harness-wrapper stripping, the subagent-prompt heuristics. Putting it in repo-recall (for the indexing primitive) + Luca (for the natural-language surface) reuses Kai's existing infrastructure and earns its place in the Luca-asker fleet. This effectively collapses layers 2 and 3 of the original README plan: layer 2 was the standalone script, layer 3 was the Luca dispatch route wrapping that script. The right design is layer 2 = the repo-recall primitive + the Luca tool. **Scope** 1. **In repo-recall:** add (or surface) a primitive for "give me user-role plain-string content across all sessions, with optional repo / cwd / time filters." This is the indexer's job. The harness-wrapper stripping and subagent-prompt filtering belong here so every downstream asker gets clean signal. 2. **In Luca:** add a `snippet_candidates` tool (or close-named) that consumes the repo-recall primitive, counts n-grams above a length floor, collapses substring overlaps, and returns ranked candidates. The natural-language surface is whatever Luca already does (dispatch route, MCP tool, ask phrasing). 3. **Migrate this repo's runs:** the next mining run uses Luca, not the standalone script. The script becomes either a reference implementation kept for archive, or is deleted once the Luca path is in place. **Done when** - repo-recall exposes the indexing primitive (whatever shape - MCP tool, JSON query, dispatch route). - Luca exposes the snippet-candidate tool. - A 2026-05-20+1 mining run lands in `docs/snippets-mining-runs/` produced via Luca, with shape comparable to the layer-2 run already committed. - `scripts/snippet-frequency.py` is either deleted or marked as a reference implementation in its own header. **Why now, before layer 3** The remaining layers (4-8) are all natural-language-shaped and operate on candidate lists - length-weighted ranking, session-aware weighting, phrase-shape extractors, trigger-name proposals via LLM, conflict dedup. Every one of those is a Luca tool that takes "the candidates" as input. If the candidate-generation step lives in a per-repo Python script, every later layer has to either reimplement it or shell out. Moving it into Luca up front means layers 4-8 are pure Luca compositions, not script wrappers. **Depends on:** #16 (delivered the reference implementation that defines what the Luca tool needs to produce). **Unblocks:** layers 4-8 (renumbered as Luca-composition layers in a followup).

coilysiren added the

label

2026-05-31 01:55:12 +00:00

coilysiren

2026-06-17 08:24:10 +00:00