Structural-facts ingest pass: tokei + hyperpolyglot + manifest detection #29

Open · opened 2026-05-23 20:55:24 +00:00 by coilysiren (Owner) · 0 comments

Originally filed by @coilysiren on 2026-05-17T20:50:34Z - https://github.com/coilysiren/repo-recall/issues/190

Problem

repo-recall today ingests git log, GitHub state, Claude session JSONL, coily audit, and doc file-health. It does not know what kind of code is in any of the repos it indexes. Every fresh Claude session re-derives "is this Python? where's the entry point? what's the directory shape?" from scratch via Glob/Grep/Read.

That gap matters when the agent is hopping across 30+ cloned repos and wants a snapshot of "what is this repo." The activity ranking ("hot files by churn") looked useful as a raw signal but is mostly noise in practice; structural baseline facts would be a strictly more useful first-pass signal.

Proposal

Add a file-change-driven structural-facts pass to repo-recall's ingest tier. Cheapest possible scope:

  • The ignore crate for the gitignore-respecting walk.
  • tokei as a library for LOC and file counts per language; it bundles broad language coverage.
  • hyperpolyglot for per-file language detection where tokei's heuristic isn't sharp enough (vendored-code detection, .h C-vs-C++ disambiguation).
  • Manifest detection by file existence: pyproject.toml, Cargo.toml, package.json, go.mod, Gemfile, etc.
  • README + AGENTS first-paragraph extraction (file_health ingest already reads these for size discipline; reuse).
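
As a rough illustration of the two cheapest items in the list above (manifest detection by file existence and first-paragraph extraction), here is a minimal Python sketch. It is not the proposed implementation: the function names, the ecosystem labels in MANIFESTS, and the heading-skipping rule are assumptions made for the example, and the plain directory walk does not honor .gitignore the way the proposed ignore-based walk would.

```python
from pathlib import Path

# Hypothetical mapping from manifest filename to an ecosystem label.
# The issue names these five files and leaves the rest open ("etc.").
MANIFESTS = {
    "pyproject.toml": "python",
    "Cargo.toml": "rust",
    "package.json": "node",
    "go.mod": "go",
    "Gemfile": "ruby",
}


def detect_manifests(repo_root):
    """Return {relative_path: ecosystem} for known manifests under repo_root.

    Detection is purely by file name; file contents are never parsed.
    """
    root = Path(repo_root)
    found = {}
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue  # skip git internals; .gitignore itself is NOT honored here
        if path.is_file() and path.name in MANIFESTS:
            found[rel.as_posix()] = MANIFESTS[path.name]
    return found


def first_paragraph(text):
    """Return the first paragraph of a README-style document as one line.

    Leading blank lines and leading '#' heading lines are skipped
    (an assumption about what "first paragraph" should mean).
    """
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            if lines:
                break  # a blank line ends the paragraph
            continue
        if not lines and stripped.startswith("#"):
            continue  # skip headings that precede the paragraph
        lines.append(stripped)
    return " ".join(lines)
```

Both functions only look at names and leading text, which keeps the pass cheap enough to rerun whenever a file changes.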

Output: per-file structural facts joined into the existing schema. Per-repo aggregates derivable in the JOIN layer.
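
To make "per-repo aggregates derivable in the JOIN layer" concrete, the sketch below stores one row per file and derives per-language totals with a GROUP BY. The table name file_facts, its columns, and the sample rows are all hypothetical toy values chosen for illustration; repo-recall's actual schema is not described in this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical per-file facts table; repo-recall's real schema may differ.
conn.execute(
    """
    CREATE TABLE file_facts (
        repo        TEXT NOT NULL,
        path        TEXT NOT NULL,
        language    TEXT NOT NULL,
        loc         INTEGER NOT NULL,
        is_manifest INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (repo, path)
    )
    """
)

# Toy rows for illustration only, not measurements of any real repo.
conn.executemany(
    "INSERT INTO file_facts VALUES (?, ?, ?, ?, ?)",
    [
        ("repo-recall", "Cargo.toml", "TOML", 40, 1),
        ("repo-recall", "src/main.rs", "Rust", 300, 0),
        ("repo-recall", "src/ingest.rs", "Rust", 500, 0),
        ("repo-recall", "README.md", "Markdown", 80, 0),
    ],
)


def repo_summary(conn, repo):
    """Derive per-language (language, file_count, total_loc) rows for one repo."""
    return conn.execute(
        """
        SELECT language, COUNT(*) AS file_count, SUM(loc) AS total_loc
        FROM file_facts
        WHERE repo = ?
        GROUP BY language
        ORDER BY total_loc DESC, language
        """,
        (repo,),
    ).fetchall()
```

Because only per-file rows are stored, a changed file means rewriting a single row, and the per-repo view never goes stale.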

Why this is the cheap structural pass and not tree-sitter / rust-code-analysis

Tree-sitter and code-metrics are a separate, heavier addition tracked in a sibling issue. The bet here is that file-extension + manifest + README-first-paragraph is enough structural signal to make "rank files by (sessions-touched * manifest-or-entrypoint-bonus)" a strictly better hot-paths answer than churn alone. If it isn't, the next layer can be added without throwing this work away.
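
The ranking quoted above could look something like the following sketch. The issue does not specify the bonus value or how entry points are identified, so the FileSignal fields, the BONUS constant of 2.0, and the tie-break on path are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FileSignal:
    """Hypothetical per-file inputs to the ranking."""
    path: str
    sessions_touched: int
    is_manifest: bool = False
    is_entrypoint: bool = False


# Assumed multiplier for manifests and entry points; the issue gives no value.
BONUS = 2.0


def score(f):
    """sessions-touched * manifest-or-entrypoint-bonus (1.0 when neither applies)."""
    bonus = BONUS if (f.is_manifest or f.is_entrypoint) else 1.0
    return f.sessions_touched * bonus


def rank(files):
    """Sort files by descending score, breaking ties by path for stable output."""
    return sorted(files, key=lambda f: (-score(f), f.path))
```

With this shape, a manifest touched in three sessions (score 6.0) outranks an ordinary file touched in five (score 5.0), which is the kind of reordering the bet depends on.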

Out of scope

  • AST parsing of any kind.
  • Per-function metrics.
  • Symbol cross-reference.
  • Tree-sitter or any per-language grammar.

Open sub-questions

  • Whether to keep churn as a column at all once sessions-touched + structural facts are wired. Probably yes, since the cost is zero and the dashboard already renders it.
  • Whether to expose a recall_repo_structure MCP tool or fold the facts into the existing recall_repo projection. Probably the latter, since the structural data is per-file and recall_repo already returns per-repo file lists.

Origin

Conversation 2026-05-17. Sibling issues: search-router (ripgrep + nucleo), code-metrics (rust-code-analysis), per-source refresh rates.

coilysiren added P4 and removed P3 labels 2026-05-31 07:01:16 +00:00
Reference
coilyco-flight-deck/repo-recall#29