# Site Discovery Runtime
Raiken discovery now runs as a stateful subsystem rather than a one-shot crawler command. It persists session metadata, queue snapshots, and runtime phase details so teams can operate discovery with pause/continue semantics and better observability.
## Runtime model
Discovery runtime tracks explicit phases: `idle`, `running`, `paused`, `completed`, and `error`.
For each active project runtime, Raiken tracks:
- current URL and depth
- counters for pages, links, and blockers
- blocker requirements and last error
- bounded event timeline for recent runtime events
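The tracked values above can be sketched as a runtime-state record. The field names and the bounded-buffer helper here are illustrative assumptions, not Raiken's actual internals:

```typescript
// Illustrative runtime-state shape mirroring the tracked values above.
type DiscoveryPhase = "idle" | "running" | "paused" | "completed" | "error";

interface DiscoveryRuntime {
  phase: DiscoveryPhase;
  currentUrl: string | null;         // URL being crawled right now
  currentDepth: number;              // depth of currentUrl from the start domain
  pagesVisited: number;
  linksDiscovered: number;
  blockersHit: number;
  blockerRequirement: string | null; // e.g. "auth" when paused on a blocker
  lastError: string | null;
  timeline: { at: number; event: string }[]; // bounded recent-event buffer
}

// Keep the timeline bounded so long runs don't grow memory without limit.
function pushEvent(rt: DiscoveryRuntime, event: string, max = 100): void {
  rt.timeline.push({ at: Date.now(), event });
  if (rt.timeline.length > max) rt.timeline.splice(0, rt.timeline.length - max);
}
```

Bounding the timeline is what makes it safe to keep in memory per active project runtime.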
## Startup flow
A discovery run now boots in this order:
1. Session bootstrap (`startNewSession` or `resumeSession`)
2. Start-domain validation
3. Auth storage-state load (when available)
4. Request queue open and optional queue rehydration
5. Crawler creation with pre-navigation auth-state application
6. Seed handling based on resume queue availability
This ordering ensures session context and auth state are ready before active crawl execution begins.
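The boot ordering can be sketched as one function over injected dependencies. Every interface below is a hypothetical stand-in, not Raiken's real API surface:

```typescript
// Sketch of the documented boot order; all dependency shapes are assumptions.
interface Session { id: string; resumed: boolean }

interface BootDeps {
  resumeSession: () => Promise<Session | null>;
  startNewSession: () => Promise<Session>;
  validateStartDomain: () => Promise<void>;
  loadAuthState: () => Promise<unknown | null>;
  openQueue: (rehydrate: boolean) => Promise<{ pending: number }>;
  createCrawler: (auth: unknown | null) => Promise<{ run: (seeds: string[]) => Promise<void> }>;
}

async function bootDiscovery(deps: BootDeps, seeds: string[]): Promise<void> {
  // 1. Session bootstrap: resume a prior session if one exists, else start fresh.
  const session = (await deps.resumeSession()) ?? (await deps.startNewSession());
  // 2. Fail fast on a bad start domain before any browser work begins.
  await deps.validateStartDomain();
  // 3. Load saved auth state when available.
  const auth = await deps.loadAuthState();
  // 4. Open the request queue; rehydrate only when resuming.
  const queue = await deps.openQueue(session.resumed);
  // 5. Create the crawler with the auth state applied pre-navigation.
  const crawler = await deps.createCrawler(auth);
  // 6. Seed only when the (possibly rehydrated) queue has nothing pending.
  await crawler.run(queue.pending > 0 ? [] : seeds);
}
```

Because each step awaits the previous one, session context and auth state are guaranteed to exist before the crawler runs.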
## Pause and continue behavior
When discovery pauses, Raiken persists:
- session ID and counters
- configured limits (`maxPages`, `maxDepth`)
- queue snapshot (`queueJson`) for unhandled entries
When discovery continues, Raiken restores persisted runtime values and resumes from checkpoint state where possible. Queue restoration is best-effort and local-process scoped (not exactly-once distributed execution).
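The best-effort nature of restoration can be made concrete with a checkpoint sketch. The field names mirror the documented values, but the exact persistence format is an assumption:

```typescript
// Hypothetical checkpoint shape for pause/continue.
interface PauseCheckpoint {
  sessionId: string;
  pagesVisited: number;
  linksDiscovered: number;
  maxPages: number;
  maxDepth: number;
  queueJson: string; // JSON-serialized unhandled queue entries
}

// Best-effort restore: a missing or corrupt snapshot degrades to an empty
// queue instead of failing the continue operation.
function restoreQueue(cp: PauseCheckpoint): string[] {
  try {
    const parsed: unknown = JSON.parse(cp.queueJson);
    return Array.isArray(parsed)
      ? parsed.filter((u): u is string => typeof u === "string")
      : [];
  } catch {
    return [];
  }
}
```

Degrading to an empty queue is the local-process, not-exactly-once trade-off described above: continuation always proceeds, but some pending entries may be lost.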
## Authentication lifecycle
Discovery can pause on auth blockers and request user assistance. After auth capture saves `.raiken/auth-state.json`, unresolved auth blockers are marked resolved, and discovery can continue with the auth state applied at browser-context level (cookies plus storage seeding). This prevents stale blocker state after a successful login capture.
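A minimal sketch of both halves of that lifecycle, under two assumptions: the saved file is in Playwright's `storageState` format, and blockers carry a `type`/`resolved` pair (both are illustrative, not Raiken's confirmed shapes):

```typescript
import { existsSync } from "node:fs";

const AUTH_STATE_PATH = ".raiken/auth-state.json";

// Passing storageState to Playwright's browser.newContext() seeds cookies
// and localStorage for every page in the context.
function contextOptions(path: string = AUTH_STATE_PATH): { storageState?: string } {
  return existsSync(path) ? { storageState: path } : {};
}

interface Blocker { type: string; resolved: boolean }

// After a successful login capture, flip unresolved auth blockers so stale
// blocker state cannot keep discovery paused.
function resolveAuthBlockers(blockers: Blocker[]): Blocker[] {
  return blockers.map((b) =>
    b.type === "auth" && !b.resolved ? { ...b, resolved: true } : b
  );
}
```

Applying state at context creation (rather than per page) is what lets previously blocked targets load as the logged-in user on continuation.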
## Link outcome classification
Discovered links are now classified by navigation outcome:
- `verified` for successful target navigation
- `broken` for failed target navigation
- `auth_required` for blocked targets (for example `401`/`403`)
This improves graph reliability and downstream test-generation quality.
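The classification above can be expressed as a small mapping; treating a null status as a hard navigation failure is an assumption of this sketch:

```typescript
type LinkOutcome = "verified" | "broken" | "auth_required";

// Illustrative mapping from a navigation result to a link outcome.
function classifyLinkOutcome(status: number | null): LinkOutcome {
  if (status === null) return "broken"; // navigation never produced a response
  if (status === 401 || status === 403) return "auth_required";
  return status >= 200 && status < 400 ? "verified" : "broken";
}
```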
## Selector synthesis for discovered links
Selector generation prioritizes stability:
- `data-testid` selectors when available
- concrete `href` selectors
- constrained text selectors
- generic anchor selectors as fallback
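The priority order above can be sketched as a single chooser function. The `AnchorInfo` shape and the Playwright-style `:has-text` selector are assumptions of this example:

```typescript
// Picks the most stable selector for a discovered anchor, in priority order.
interface AnchorInfo {
  testId?: string; // value of data-testid, if present
  href?: string;
  text?: string;
}

function selectorFor(a: AnchorInfo): string {
  if (a.testId) return `[data-testid="${a.testId}"]`;
  if (a.href) return `a[href="${a.href}"]`;
  // Constrain text matching to short anchor text to limit over-matching.
  if (a.text && a.text.trim().length > 0 && a.text.length <= 40) {
    return `a:has-text("${a.text.trim()}")`;
  }
  return "a"; // generic fallback
}
```

Earlier branches win, so a `data-testid` always beats an `href` even when both are present.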
## Snapshot persistence semantics
During discovery, Raiken persists structured page snapshots, not full raw HTML.
Current capture behavior:
- waits for `domcontentloaded`
- captures `page.locator("body").ariaSnapshot()`
- stores the payload in `discovered_pages.snapshot_json`
- stores page metadata (`url`, `title`, `depth`, `parent_url`, visit timestamps)
This keeps storage smaller and emphasizes semantic/navigation structure over raw markup completeness.
Example query for recent snapshot records:
```sql
SELECT url, title, depth, LENGTH(snapshot_json) AS snapshot_size
FROM discovered_pages
WHERE project_path = ?
ORDER BY discovered_at DESC
LIMIT 20;
```

## DB and API contract updates
- Discovery schema (v5) includes persisted `max_pages` and `max_depth` in `discovery_sessions`
- Lower-level DB consumers use `CodeGraphDB.getRawDatabase()` as the formal contract
- Site knowledge reads map raw rows into typed domain objects for session/page/link consistency
- Pending-link transitions use targeted lookups (for example `getPendingLinksTo(url)`) to mark outcomes after page success/failure
- Runtime orchestration endpoints are exposed through shared tRPC procedures: `startDiscovery`, `continueDiscovery`, `getDiscoveryRuntime`, `getDiscoveryTimeline`, `authAssist`, `clearDiscoveryData`
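A pending-link transition can be sketched against a raw database handle. The table and column names follow the doc's schema terms; the `RawDb` wrapper and `markOutcomesFor` are hypothetical, analogous to `getPendingLinksTo(url)`:

```typescript
// Illustrative pending-link transition after a page visit.
interface PendingLink { id: number; target_url: string }

interface RawDb {
  all: (sql: string, params: unknown[]) => PendingLink[];
  run: (sql: string, params: unknown[]) => void;
}

function markOutcomesFor(db: RawDb, url: string, outcome: string): number {
  // Targeted lookup: only links whose target is the just-visited URL.
  const pending = db.all(
    "SELECT id, target_url FROM discovered_links WHERE target_url = ? AND outcome IS NULL",
    [url]
  );
  for (const link of pending) {
    db.run("UPDATE discovered_links SET outcome = ? WHERE id = ?", [outcome, link.id]);
  }
  return pending.length;
}
```

Scoping the lookup to one target URL keeps the transition cheap even on large link tables.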
## Config and CLI precedence
Project-level defaults are defined in `raiken.config.json` under `discovery`.
Per-run CLI flags override config defaults for that invocation.
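Per-invocation precedence can be sketched as a merge. `DiscoveryConfig` mirrors the documented keys; the merge helper itself is an assumption, not Raiken's actual loader:

```typescript
// CLI flags override file defaults for the current invocation only.
interface DiscoveryConfig {
  maxPages: number;
  maxDepth: number;
  maxConcurrency: number;
  timeout: number;
  excludePatterns: string[];
  pauseOnAuth: boolean;
}

function effectiveConfig(
  fileDefaults: DiscoveryConfig,
  cliFlags: Partial<DiscoveryConfig>
): DiscoveryConfig {
  // Drop undefined flags so an unset CLI option never clobbers a default,
  // then spread so any defined flag wins for its key.
  const defined = Object.fromEntries(
    Object.entries(cliFlags).filter(([, v]) => v !== undefined)
  );
  return { ...fileDefaults, ...defined } as DiscoveryConfig;
}
```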
```json
{
  "discovery": {
    "maxPages": 100,
    "maxDepth": 4,
    "maxConcurrency": 3,
    "timeout": 30000,
    "excludePatterns": ["**/logout", "**/signout"],
    "pauseOnAuth": true
  }
}
```

## Limitations and trade-offs
- Queue restoration is best-effort and local snapshot based
- Runtime timeline is in-memory and process-local
- Anchor-centric traversal may miss purely client-triggered transitions
- Auth-state continuation depends on freshness/validity of saved state