Structural-facts ingest pass: tokei + hyperpolyglot + manifest detection #29

Closed

opened 2026-05-23 20:55:24 +00:00 by coilysiren · 1 comment

coilysiren commented

2026-05-23 20:55:24 +00:00

Owner

Originally filed by @coilysiren on 2026-05-17T20:50:34Z - https://github.com/coilysiren/repo-recall/issues/190

Problem

repo-recall today ingests git log, GitHub state, Claude session JSONL, coily audit, and doc file-health. It does not know what kind of code is in any of the repos it indexes. Every fresh Claude session re-derives "is this Python? where's the entry point? what's the directory shape?" from scratch via Glob/Grep/Read.

That gap matters when the agent is hopping across 30+ cloned repos and wants a snapshot of "what is this repo." The activity ranking ("hot files by churn") was useful when raw but is mostly noise in practice; structural baseline facts would be a strictly more useful first-pass signal.

Proposal

Add a structural-facts pass to repo-recall's ingest tier, file-change-driven. Cheapest possible scope:

- `ignore` for the gitignore-respecting walk.
- `tokei` as a library for LOC and file counts per language, with broad bundled language coverage.
- `hyperpolyglot` for per-file language detection where tokei's heuristic isn't sharp enough (vendored detection, `.h` C-vs-C++ disambiguation).
- Manifest detection by file existence: `pyproject.toml`, `Cargo.toml`, `package.json`, `go.mod`, `Gemfile`, etc.
- README + AGENTS first-paragraph extraction (the file_health ingest already reads these for size discipline; reuse).
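
The manifest-detection item above can be sketched with the standard library alone. This is a minimal illustration, not repo-recall's implementation: the `MANIFESTS` table, the ecosystem labels, and the `detect_manifests` name are all assumptions, and it only probes the repo root (nested manifests in a monorepo would need the walk from the first item).

```rust
use std::path::Path;

/// Manifest file names listed in the proposal, paired with an ecosystem label.
/// The labels are illustrative assumptions, not part of any existing schema.
const MANIFESTS: &[(&str, &str)] = &[
    ("pyproject.toml", "python"),
    ("Cargo.toml", "rust"),
    ("package.json", "node"),
    ("go.mod", "go"),
    ("Gemfile", "ruby"),
];

/// Returns the label of every manifest that exists directly under `repo_root`,
/// in the order of the `MANIFESTS` table. Existence check only; nothing is parsed.
fn detect_manifests(repo_root: &Path) -> Vec<&'static str> {
    MANIFESTS
        .iter()
        .filter(|(file, _)| repo_root.join(file).is_file())
        .map(|(_, label)| *label)
        .collect()
}

fn main() {
    // Probe the directory given as the first argument, or the current directory.
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    println!("{:?}", detect_manifests(Path::new(&root)));
}
```

Because the check is a plain `is_file` probe per candidate name, the cost per repo is a handful of stat calls, which fits the "cheapest possible scope" framing.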

Output: per-file structural facts joined into the existing schema. Per-repo aggregates derivable in the JOIN layer.
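
One way to picture "per-file facts, per-repo aggregates derived later" is the sketch below. The `FileFact` row and its field names are hypothetical and do not reflect repo-recall's actual schema; the aggregation stands in for what the JOIN layer would compute.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-file row; field names are assumptions for illustration.
#[derive(Debug, Clone)]
struct FileFact {
    repo: String,
    path: String,
    language: String,
    loc: u64,
    is_manifest: bool,
}

/// Small constructor to keep the examples short.
fn fact(repo: &str, path: &str, language: &str, loc: u64, is_manifest: bool) -> FileFact {
    FileFact {
        repo: repo.to_string(),
        path: path.to_string(),
        language: language.to_string(),
        loc,
        is_manifest,
    }
}

/// Lines of code per (repo, language), derived purely from per-file rows.
fn loc_by_repo_language(facts: &[FileFact]) -> BTreeMap<(String, String), u64> {
    let mut totals = BTreeMap::new();
    for f in facts {
        *totals
            .entry((f.repo.clone(), f.language.clone()))
            .or_insert(0) += f.loc;
    }
    totals
}

/// Paths of the rows flagged as manifests, in input order.
fn manifest_paths(facts: &[FileFact]) -> Vec<&str> {
    facts
        .iter()
        .filter(|f| f.is_manifest)
        .map(|f| f.path.as_str())
        .collect()
}

fn main() {
    let facts = vec![
        fact("demo", "Cargo.toml", "TOML", 12, true),
        fact("demo", "src/main.rs", "Rust", 40, false),
        fact("demo", "src/lib.rs", "Rust", 60, false),
    ];
    println!("{:?}", loc_by_repo_language(&facts));
    println!("{:?}", manifest_paths(&facts));
}
```

Storing only per-file rows keeps the ingest pass file-change-driven: a changed file rewrites one row, and the aggregates never need their own invalidation.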

Why this is the cheap structural pass and not tree-sitter / rust-code-analysis

Tree-sitter and code-metrics are a separate, heavier add tracked in a sibling issue. The bet here is that file-extension + manifest + README-first-paragraph is enough structural signal to make "rank files by (sessions-touched * manifest-or-entrypoint-bonus)" a strictly better hot-paths answer than churn alone. If it isn't, the next layer can be added without throwing this work away.
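
The ranking idea in the paragraph above, sessions-touched multiplied by a manifest-or-entrypoint bonus, could look roughly like this. The `FileSignal` struct, the `BONUS` value of 2, and the tie-break on path are assumptions chosen for the sketch; the issue does not specify a weight.

```rust
/// Hypothetical inputs for one file; names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
struct FileSignal {
    path: String,
    sessions_touched: u64,
    is_manifest_or_entrypoint: bool,
}

/// Assumed multiplier for manifests and entry points; the real weight is undecided.
const BONUS: u64 = 2;

/// score = sessions_touched * (BONUS if manifest or entry point, else 1)
fn score(f: &FileSignal) -> u64 {
    let bonus = if f.is_manifest_or_entrypoint { BONUS } else { 1 };
    f.sessions_touched * bonus
}

/// Sorts files by descending score, breaking ties by path for stable output.
fn rank(mut files: Vec<FileSignal>) -> Vec<FileSignal> {
    files.sort_by(|a, b| score(b).cmp(&score(a)).then_with(|| a.path.cmp(&b.path)));
    files
}

fn main() {
    let files = vec![
        FileSignal { path: "docs/notes.md".into(), sessions_touched: 5, is_manifest_or_entrypoint: false },
        FileSignal { path: "Cargo.toml".into(), sessions_touched: 3, is_manifest_or_entrypoint: true },
        FileSignal { path: "src/util.rs".into(), sessions_touched: 4, is_manifest_or_entrypoint: false },
    ];
    for f in rank(files) {
        println!("{:>3}  {}", score(&f), f.path);
    }
}
```

With these inputs the manifest outranks a file touched in more sessions (3 × 2 = 6 versus 5), which is the behavior the bet relies on: structural importance can lift a file above raw activity.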

Out of scope

- AST parsing of any kind.
- Per-function metrics.
- Symbol cross-reference.
- Tree-sitter or any per-language grammar.

Open sub-questions

- Whether to keep churn as a column at all once sessions-touched + structural facts are wired. Probably yes, since the cost is zero and the dashboard already renders it.
- Whether to expose a `recall_repo_structure` MCP tool or fold the facts into the existing `recall_repo` projection. Probably the latter, since the structural data is per-file and `recall_repo` already returns per-repo file lists.

Origin

Conversation 2026-05-17. Sibling issues: search-router (ripgrep + nucleo), code-metrics (rust-code-analysis), per-source refresh rates.


coilysiren referenced this issue

2026-05-23 20:55:30 +00:00

service install/uninstall: built-in cross-platform daemon setup #71

coilysiren added the

label

2026-05-31 01:55:04 +00:00

coilysiren commented

2026-06-17 08:24:32 +00:00

Author

Owner

Backlog burndown 2026-06-17: closing low-priority (P3/P4) to bring the open count to a manageable level. Nothing lost — reopen if this resurfaces. Batch tag: burndown-2026-06.

coilysiren

2026-06-17 08:24:32 +00:00