DOCS

Forge, in six pages.

What's below is enough to get the daemon running and integrated with any host. Deeper references live in the Forge repo (linked from each section). Public docs are minimal by design — Forge is local-first, so the source is the documentation.

Quick start

Install the daemon, start it, point any supported agent at it. Two commands.

cargo install --git https://github.com/chaosmaximus/forge forge-daemon forge-cli
forge-daemon &

Concepts

Decision intelligence (timeline + graph + counterfactuals). Code intelligence (sidecar-first symbol map). 8-layer Manas memory underneath both.

Adapters

Forge runs as one daemon and talks to every host: Claude Code, Codex CLI, Cline, Gemini CLI, Cursor, Hermes. Each adapter ingests transcripts and surfaces context back.

CLI reference

forge-next health · doctor · recall · remember · code-search · blast-radius · project · sessions · extract · stats. Every call is project-scoped; pass --project to override the cwd auto-detection.

Plugin + skills

Skills and subagents that wrap Forge: forge:forge-think, forge:forge-feature, forge:forge-tdd, forge:forge-debug, forge:forge-review. Generic across hosts.

Architecture

Rust daemon (4 crates), SQLite + WAL, HTTP API on 8420, optional Studio web UI. Local-first, host-neutral, no cloud round-trip.
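
A quick way to confirm the daemon is up is to probe the port. The sketch below only checks that something accepts TCP connections on 8420; it assumes the daemon binds to localhost, and it does not call any Forge endpoint, since the HTTP routes are not listed here.

```python
import socket


def daemon_is_listening(host: str = "127.0.0.1", port: int = 8420,
                        timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Assumption: forge-daemon binds to localhost on the port named above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("forge-daemon reachable:", daemon_is_listening())
```

A True result only means the port is open; forge-next health remains the real check.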

More on GitHub.

The Forge daemon, plugin, skills, agents, and roadmap docs live in the public repo. Internal product/legal artifacts stay private — what's open is what runs.

github.com/chaosmaximus/forge

APPENDIX · TRAJECTORY EXPORT

Your trajectories are training data. Use them.

A Forge trajectory is structured exactly like a reinforcement learning episode — state, action, reward — which is the shape RL pipelines already expect. ForgeTrajectoryDataset (planned) is the export bridge.
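
To make the episode shape concrete, here is a minimal sketch of a trajectory as a list of (state, action, reward) steps. The class and field names are illustrative assumptions, not the ForgeTrajectoryDataset schema, which is still planned.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    state: str     # what the agent saw (hypothetical: prompt plus code context)
    action: str    # what it did (tool call, patch, reply)
    reward: float  # signal attributed to this step


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        """Undiscounted sum of step rewards."""
        return sum(s.reward for s in self.steps)

    def discounted_return(self, gamma: float = 0.99) -> float:
        """Standard RL return: r0 + gamma*r1 + gamma^2*r2 + ..."""
        g = 0.0
        for s in reversed(self.steps):
            g = s.reward + gamma * g
        return g
```

With a single terminal reward (for example, tests passing at the end), the discounted return is just that reward scaled by gamma raised to the number of preceding steps.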

Five training paths

Supervised fine-tuning (SFT)

What you get: Smaller, cheaper model that mimics your big model on YOUR codebase.

What you need: High-scored trajectories filtered to (state → action) pairs.
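
A sketch of that filter, assuming each exported trajectory is a dict with a numeric score and a list of steps holding state and action strings. Those keys and the 0.8 threshold are hypothetical; the prompt/completion output is one common SFT layout, not a Forge format.

```python
import json
from typing import Dict, Iterable, List


def to_sft_records(trajectories: Iterable[Dict], min_score: float = 0.8) -> List[Dict]:
    """Keep high-scored trajectories and flatten them to (state -> action) pairs."""
    records = []
    for traj in trajectories:
        if traj["score"] < min_score:
            continue  # drop low-scored trajectories entirely
        for step in traj["steps"]:
            records.append({"prompt": step["state"], "completion": step["action"]})
    return records


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```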

DPO / preference learning

What you get: Model that prefers your chosen path over rejected alternatives.

What you need: Paired (chosen, rejected) examples from Decision Graph branches.
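
One way to derive those pairs, assuming a Decision Graph node can be read as a context plus a list of branches, exactly one of which was chosen. The dict keys are hypothetical; the prompt/chosen/rejected output mirrors the layout most preference-learning tooling accepts.

```python
from typing import Dict, Iterable, List


def to_preference_pairs(decisions: Iterable[Dict]) -> List[Dict]:
    """Pair the chosen branch of each decision with every rejected sibling."""
    pairs = []
    for d in decisions:
        chosen = [b for b in d["branches"] if b["chosen"]]
        rejected = [b for b in d["branches"] if not b["chosen"]]
        if len(chosen) != 1:
            continue  # skip ambiguous nodes: no choice, or more than one
        for r in rejected:
            pairs.append({
                "prompt": d["context"],
                "chosen": chosen[0]["action"],
                "rejected": r["action"],
            })
    return pairs
```

A decision with one chosen branch and two alternatives yields two pairs; a decision with no alternatives yields none.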

GRPO

What you get: Policy-gradient training without a separate value (critic) model; when rewards come from tests or builds, no learned reward model is required either. DeepSeek-R1 style.

What you need: Groups of multi-attempt trajectories on the same task — exploration mode produces these naturally.
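
The reason groups matter: GRPO scores each attempt relative to its siblings on the same task. A minimal sketch of that group-relative advantage, using the population standard deviation (a choice made here for simplicity; implementations differ):

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each attempt's reward against its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every attempt got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every attempt in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is why varied attempts on the same task are the useful ones.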

Reward modeling

What you get: A judge model that can score new trajectories — useful for autonomous evals.

What you need: (trajectory, human-or-test-score) pairs.

Distillation

What you get: A Haiku-class model that approaches Opus quality on your tasks. Expected gains are on the order of 10× lower cost and 5× lower latency; measure on your own workload.

What you need: Opus trajectories on the codebase + SFT pipeline.

Reward signals

User accept / reject of agent suggestions
Test pass / fail after a change
Build success / failure
Time-to-resolution (faster = better)
Counterfactual win / loss vs alternatives considered
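
These signals usually need to be folded into one number per trajectory. The sketch below takes a weighted average over whichever signals were observed; the weights, the 600-second time budget, and the field names are all hypothetical and would need tuning.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Signals:
    user_accepted: Optional[bool] = None
    tests_passed: Optional[bool] = None
    build_succeeded: Optional[bool] = None
    seconds_to_resolution: Optional[float] = None
    counterfactual_win: Optional[bool] = None


# Hypothetical weights, not values shipped with Forge.
WEIGHTS = {"user": 3.0, "tests": 3.0, "build": 1.0, "speed": 1.0, "counterfactual": 2.0}


def _signed(flag: bool) -> float:
    return 1.0 if flag else -1.0


def scalar_reward(s: Signals, time_budget_s: float = 600.0) -> float:
    """Weighted mean of observed signals, each mapped to [-1, 1]."""
    terms: List[Tuple[float, float]] = []
    if s.user_accepted is not None:
        terms.append((WEIGHTS["user"], _signed(s.user_accepted)))
    if s.tests_passed is not None:
        terms.append((WEIGHTS["tests"], _signed(s.tests_passed)))
    if s.build_succeeded is not None:
        terms.append((WEIGHTS["build"], _signed(s.build_succeeded)))
    if s.counterfactual_win is not None:
        terms.append((WEIGHTS["counterfactual"], _signed(s.counterfactual_win)))
    if s.seconds_to_resolution is not None:
        # Faster is better: 0 s maps to +1, the full budget or more maps to -1.
        frac = min(max(s.seconds_to_resolution / time_budget_s, 0.0), 1.0)
        terms.append((WEIGHTS["speed"], 1.0 - 2.0 * frac))
    if not terms:
        return 0.0  # nothing observed, neutral reward
    total = sum(w for w, _ in terms)
    return sum(w * v for w, v in terms) / total
```

Averaging only over observed signals keeps a trajectory with no test run from being penalized as if the tests had failed.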

Ecosystem

Atropos (Nous Research) — RL environment framework. Forge exports a trajectory bundle Atropos replays for off-policy training.
Tinker (Thinking Machines) — Managed fine-tuning API. Forge dataset → Tinker JSONL → SFT/RL runs without ops.
OpenAI / Anthropic fine-tuning APIs — Same trajectory shape, hosted by the labs.
Local stacks — axolotl, TRL, unsloth, llama.cpp — for self-hosted SFT on a 4090 or H100.
GRPO frameworks — DeepSeek-R1 style — for getting strong reasoning out of smaller models.

Forge is the data layer; the training stack you pick is your call. For the meta plays — building better tooling around the model rather than retraining it — see tooling for LLMs.

APPENDIX · TOOLING FOR LLMS

Build better tooling for the next model.

Beyond training the model, Forge's data lets you build better tooling for models. Four meta plays the trajectory + decision graph make tractable.

Tool-routing models

Forge sees every tool call across every session. Mining this gives optimal tool sequences per (model, language, intent). When a Sonnet 4.6 agent on a Rust codebase asks 'find usages of X', the right path may be Serena.find_symbol → forge-next blast-radius → grep fallback. Codex on the same task does grep first. Train a small router that picks the right sequence. Bonus: existing LLM evals barely measure tool-call efficiency. Forge data → first proper tool-use leaderboard.
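
Before training a router, the mining step can be approximated with plain counting. This sketch assumes each session record carries model, language, intent, the ordered tool list, and a success flag (hypothetical keys), and picks the sequence with the best success rate per key, preferring shorter sequences on ties.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str]


def best_sequences(sessions: Iterable[Dict], min_samples: int = 2) -> Dict[Key, Tuple[str, ...]]:
    """Return the highest-success tool sequence per (model, language, intent)."""
    # key -> sequence -> [successes, total]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for s in sessions:
        key = (s["model"], s["language"], s["intent"])
        seq = tuple(s["tools"])
        stats[key][seq][1] += 1
        if s["success"]:
            stats[key][seq][0] += 1

    best = {}
    for key, seqs in stats.items():
        # Rank by success rate, then by shorter sequence; ignore rare sequences.
        ranked = [(ok / total, -len(seq), seq)
                  for seq, (ok, total) in seqs.items() if total >= min_samples]
        if ranked:
            best[key] = max(ranked)[2]
    return best
```

The resulting table doubles as a baseline: a learned router is only worth shipping if it beats this lookup.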

Context compressors

Forge tracks what context was injected vs what was actually cited in the agent's response. That's labeled data for what the agent actually used. Train a small model on it; given a candidate context, predict what the agent will cite, prune the rest. Massive token savings for downstream agents.
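
The labeling step can be sketched as follows, assuming Forge can report which injected chunk ids were cited in a response. The chunk fields (id, text, tokens) are hypothetical.

```python
from typing import Dict, Iterable, List


def label_chunks(injected: Iterable[Dict], cited_ids: Iterable[str]) -> List[Dict]:
    """Turn one response into binary training examples: 1 = cited, 0 = unused."""
    cited = set(cited_ids)
    return [
        {"chunk_id": c["id"], "text": c["text"], "tokens": c["tokens"],
         "label": int(c["id"] in cited)}
        for c in injected
    ]


def wasted_tokens(labeled: Iterable[Dict]) -> int:
    """Tokens spent on context the agent never cited."""
    return sum(x["tokens"] for x in labeled if x["label"] == 0)
```

The wasted-token count gives an upper bound on what a perfect compressor could save for that response.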

Self-improving prompts

When an agent failed (low PathScore, test failure), Forge has the prompt + the failure mode. Mine these for systemic prompt gaps. Output: a model that suggests "your system prompt is missing instructions about X" — feedback loop on the prompts themselves.
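
A first pass at finding systemic gaps is to count which failure modes recur under the same prompt. The sketch assumes each failure record has a prompt_id and a failure_mode label; both keys, and the labels themselves, are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def recurring_failure_modes(failures: Iterable[Dict],
                            min_count: int = 2) -> List[Tuple[str, str, int]]:
    """List (prompt_id, failure_mode, count) for modes seen at least min_count times."""
    counts = Counter((f["prompt_id"], f["failure_mode"]) for f in failures)
    # most_common() sorts by count, highest first.
    return [(pid, mode, n) for (pid, mode), n in counts.most_common() if n >= min_count]
```

One-off failures are filtered out; a mode that keeps recurring under one prompt is the candidate for a missing instruction.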

Eval / benchmark generation

Every successful trajectory in your Forge instance is a candidate eval problem. "Given this codebase state and this user request, the right output is this patch." A million Forge users → a million labeled eval cases that don't exist anywhere else. Potentially the largest unique training/eval corpus in the agent space.
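
Converting trajectories to eval cases could look like the sketch below. It assumes each trajectory exposes a commit identifier for the codebase state, the user request, the final patch, and a success flag; all four keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EvalCase:
    codebase_state: str  # e.g. a commit hash identifying the snapshot
    request: str         # the user request as given to the agent
    expected_patch: str  # the patch the successful trajectory produced


def to_eval_cases(trajectories: Iterable[Dict]) -> List[EvalCase]:
    """Keep successful trajectories with a patch, one case per (state, request)."""
    seen = set()
    cases = []
    for t in trajectories:
        if not t.get("success") or not t.get("patch"):
            continue
        key = (t["commit"], t["request"])
        if key in seen:
            continue  # first successful patch wins; later duplicates are dropped
        seen.add(key)
        cases.append(EvalCase(t["commit"], t["request"], t["patch"]))
    return cases
```

Deduplicating on (state, request) matters because retries of the same task would otherwise overweight it in the benchmark.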

Each play is downstream of the same substrate as trajectory export: structured decisions plus the code-graph snapshot they were made against. Train the model, or build the tooling around it. Same data, two directions.