Per-source refresh rates with notify-driven filesystem ingest #32

Open
opened 2026-05-23 20:55:25 +00:00 by coilysiren · 0 comments
Owner

Originally filed by @coilysiren on 2026-05-17T20:50:34Z - https://github.com/coilysiren/repo-recall/issues/188

Problem

repo-recall today runs a single 150s refresh tick that re-scans every source on the same cadence. That treats all sources as equally fresh and equally expensive, and they are neither:

  • Filesystem-shaped sources (working tree, source files, Claude session JSONL) change continuously but each individual change is cheap to ingest incrementally.
  • Git log is effectively reactive to the filesystem (HEAD moves are file changes).
  • GitHub state is network-bound, rate-limited, and expensive. A 150s tick is too aggressive at a 30-repo radius.
  • coily audit JSONL is append-only; the right ingest pattern is "read from last offset," not "re-scan every tick."
  • Doc file health rarely changes.
  • Any future heavy passes (tree-sitter metrics, code analysis) need to be file-change-driven, not tick-driven.
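
A minimal sketch of the "read from last offset" pattern for an append-only JSONL file, using only the standard library. `TailCursor` and `read_new_lines` are hypothetical names, not existing repo-recall APIs, and the reset-on-shrink behavior is an assumption about how rotation or truncation might be handled.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical cursor for one append-only JSONL file.
#[derive(Debug, Default)]
pub struct TailCursor {
    /// Byte offset just past the last complete line already ingested.
    pub offset: u64,
}

impl TailCursor {
    /// Returns the complete lines appended since the previous call.
    /// A trailing line without a newline is left for the next call,
    /// so a half-written record is never ingested.
    pub fn read_new_lines(&mut self, path: &Path) -> io::Result<Vec<String>> {
        let mut file = File::open(path)?;
        let len = file.metadata()?.len();
        // Assumption: if the file shrank (rotation, truncation), start over.
        if len < self.offset {
            self.offset = 0;
        }
        file.seek(SeekFrom::Start(self.offset))?;
        let mut reader = BufReader::new(file);
        let mut lines = Vec::new();
        let mut buf = String::new();
        loop {
            buf.clear();
            let n = reader.read_line(&mut buf)?;
            if n == 0 || !buf.ends_with('\n') {
                break; // end of file, or a partial line still being written
            }
            self.offset += n as u64;
            let line = buf.trim_end_matches(|c| c == '\r' || c == '\n');
            lines.push(line.to_string());
        }
        Ok(lines)
    }
}
```

Each call costs time proportional to the bytes appended since the last call rather than to the file size, which is the property the bullet above is after.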

The single-tick model also forces a global scan-version, which is fine for current consumers but constrains downstream tooling (session-lattice) to a coarse change signal.

Proposal

Refactor the refresh tier so each source declares its own trigger shape:

  • Notify-driven - filesystem sources via the notify crate (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). Covers working tree, source files, Claude session JSONL, coily audit, doc files. Fall back to mtime-poll where notify is unreliable (network mounts, certain Windows configs).
  • Reactive - git log, triggered by a "HEAD or refs moved" filesystem signal under .git/.
  • Polled-fast - sources that are cheap and bounded (currently unclear which fall here once the others are notify-driven).
  • Polled-slow - GitHub state, slow tick (5 to 15 min) or webhook-driven if that ever stands up.
  • Append-only - coily audit JSONL, read from last offset on each notify event.
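
One way the trigger shapes above could be modeled, as a sketch only. The `Trigger` enum and `should_refresh` function are hypothetical names, and treating the mtime-poll fallback as an optional interval on the notify variant is an assumption, not something this issue decides. Polled-fast and polled-slow are collapsed into one variant that differs only in interval.

```rust
use std::time::Duration;

/// Hypothetical trigger shapes, mirroring the bullets in the proposal.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Trigger {
    /// Filesystem watch; optionally also polls where watching is unreliable.
    Notify { fallback_poll: Option<Duration> },
    /// Re-run when a watched path under `.git/` changes.
    Reactive,
    /// Fixed-interval poll (fast or slow, depending on `every`).
    Polled { every: Duration },
    /// Read from the last byte offset on each filesystem event.
    AppendOnly,
}

/// Decide whether a source should refresh now, given the time since its
/// last refresh and whether a relevant filesystem event has arrived.
pub fn should_refresh(trigger: &Trigger, since_last: Duration, fs_event: bool) -> bool {
    match trigger {
        Trigger::Notify { fallback_poll } => {
            fs_event || fallback_poll.map_or(false, |every| since_last >= every)
        }
        Trigger::Reactive | Trigger::AppendOnly => fs_event,
        Trigger::Polled { every } => since_last >= *every,
    }
}
```

With this shape, a GitHub source declared as `Trigger::Polled { every: Duration::from_secs(600) }` (an arbitrary value inside the 5 to 15 min range) ignores filesystem events entirely, while a git-log source declared as `Trigger::Reactive` never runs on a timer.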

Each source maintains its own version counter. The aggregate scan-version becomes a hash over per-source versions, preserving the existing ETag contract.
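
A sketch of how the aggregate could be derived from per-source counters. It assumes consumers only compare the scan-version for equality, which is a guess at the existing ETag contract. The hash is FNV-1a, chosen here only because it is short to write out and stable across builds; the function and source names are hypothetical.

```rust
use std::collections::BTreeMap;

const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Fold bytes into a running 64-bit FNV-1a hash.
fn fnv1a(hash: u64, bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(hash, |h, b| (h ^ u64::from(*b)).wrapping_mul(FNV_PRIME))
}

/// Hash (source name, version) pairs into one aggregate scan-version.
/// A BTreeMap iterates in key order, so the result does not depend on
/// the order in which sources were registered.
pub fn aggregate_scan_version(versions: &BTreeMap<String, u64>) -> u64 {
    let mut h = FNV_OFFSET;
    for (name, version) in versions {
        h = fnv1a(h, name.as_bytes());
        h = fnv1a(h, &[0]); // separator between name and version bytes
        h = fnv1a(h, &version.to_le_bytes());
    }
    h
}
```

Bumping any single source's counter changes the aggregate, so a consumer polling the aggregate still sees "something changed" without knowing which source moved.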

Why this is the load-bearing architectural improvement

Every other crate addition on this list (structural-facts pass, code-metrics, search-router) gets meaningfully better with per-source refresh:

  • Structural-facts becomes file-change-driven instead of tick-driven, dropping the refresh-tick budget cost to near-zero.
  • rust-code-analysis only reparses files that actually changed, which is the only way that addition is sustainable across a 30-repo radius.
  • Search indexes update incrementally instead of via full rebuild on each tick.

Without per-source refresh, every new source compounds the single-tick budget. With it, expensive ingest scales by change rate, not by source count.
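
To illustrate "scales by change rate, not by source count", here is a small sketch of a dirty-path set that filesystem events would feed and an expensive pass would drain. `DirtySet`, `refresh`, and the `analyze` callback are hypothetical; the callback stands in for whatever heavy pass is attached (for example a per-file metrics parse).

```rust
use std::collections::{BTreeSet, HashMap};
use std::path::{Path, PathBuf};

/// Hypothetical set of paths that changed since the last heavy pass.
#[derive(Default)]
pub struct DirtySet {
    pending: BTreeSet<PathBuf>,
}

impl DirtySet {
    /// Record a changed path; repeated events for one path coalesce.
    pub fn mark(&mut self, path: impl Into<PathBuf>) {
        self.pending.insert(path.into());
    }

    /// Take all pending paths, leaving the set empty.
    pub fn drain(&mut self) -> Vec<PathBuf> {
        std::mem::take(&mut self.pending).into_iter().collect()
    }
}

/// Re-run `analyze` only for dirty paths and update the cached results.
/// Returns how many files were reprocessed.
pub fn refresh<F>(dirty: &mut DirtySet, cache: &mut HashMap<PathBuf, usize>, mut analyze: F) -> usize
where
    F: FnMut(&Path) -> usize,
{
    let changed = dirty.drain();
    for path in &changed {
        let metric = analyze(path);
        cache.insert(path.clone(), metric);
    }
    changed.len()
}
```

A refresh with no pending events does no analysis work at all, regardless of how many files or repos are being tracked.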

Out of scope for v1

  • Webhook-driven GitHub ingest. Slow polling is fine; webhook is a future upgrade.
  • A subscription / push interface for downstream consumers (session-lattice). v1 keeps the polled scan-version ETag contract; push is a future upgrade.
  • Cross-source change correlation. Each source ticks independently.

Open sub-questions

  • Whether the aggregate scan-version stays a single integer (hashed over per-source versions) or becomes a vector. Single integer preserves the existing ETag contract; vector lets consumers subscribe to specific sources. Probably single for v1, vector if subscription becomes a real ask.
  • notify behavior on macOS for paths under ~/.claude/projects/ and ~/.coily/audit/. Worth testing early.
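
For the first sub-question, a sketch of what the vector form would give a consumer: comparing two snapshots tells it which sources moved, which a single hashed integer cannot. `VersionVector` and `changed_sources` are hypothetical names, and the source names in the usage note are placeholders.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-source version vector: source name -> version counter.
pub type VersionVector = BTreeMap<String, u64>;

/// Names of sources that are new in `next` or whose version differs
/// from `prev`, in sorted order.
pub fn changed_sources(prev: &VersionVector, next: &VersionVector) -> Vec<String> {
    next.iter()
        .filter(|(name, version)| prev.get(*name) != Some(*version))
        .map(|(name, _)| name.clone())
        .collect()
}
```

Given a previous snapshot `{git-log: 1, github: 4}` and a next snapshot `{audit: 1, git-log: 2, github: 4}`, this reports `audit` and `git-log` and leaves `github` out.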

Origin

Conversation 2026-05-17. The "single 150s tick is too coarse" framing came up while sizing the cost of adding tree-sitter and rust-code-analysis. Pairs with structural-facts, search-router, and code-metrics issues.

coilysiren added P4 and removed P3 labels 2026-05-31 07:01:15 +00:00
Reference
coilyco-flight-deck/repo-recall#32