Install · Usage · How it works
</div>

Try an idea, measure it, keep what works, discard what doesn't, repeat forever.
An extension for pi — an AI coding agent that runs in your terminal. pi-autoresearch gives pi the tools and workflow to run autonomous optimization loops: try an idea, benchmark it, keep improvements, revert regressions, repeat.
Inspired by karpathy/autoresearch. Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores.
pi install npm:pi-autoresearch
| | |
|---|---|
| Extension | Tools + live widget + /autoresearch dashboard |
| Skill | Gathers what to optimize, writes session files, starts the loop |
Extension tools:

| Tool | Description |
|------|-------------|
| init_experiment | One-time session config — name, metric, unit, direction |
| run_experiment | Runs any command, times wall-clock duration, captures output |
| log_experiment | Records result, auto-commits, updates widget and dashboard |
/autoresearch command

| Subcommand | Description |
|------------|-------------|
| /autoresearch <text> | Enter autoresearch mode. If autoresearch.md exists, resumes the loop with <text> as context. Otherwise, sets up a new session. |
| /autoresearch off | Leave autoresearch mode. Stops auto-resume and clears runtime state but keeps autoresearch.jsonl intact. |
| /autoresearch clear | Delete autoresearch.jsonl, reset all state, and turn autoresearch mode off. Use this for a clean start. |
| /autoresearch export | Open a live dashboard in your browser. Auto-updates as experiments run. |
Examples:
/autoresearch optimize unit test runtime, monitor correctness
/autoresearch model training, run 5 minutes of train.py and note the loss ratio as optimization target
/autoresearch export
/autoresearch off
/autoresearch clear
| Shortcut | Description |
|--------------|-------------|
| Ctrl+Shift+T | Toggle dashboard expand/collapse (inline widget ↔ full results table above the editor) |
| Ctrl+Shift+F | Open fullscreen scrollable dashboard overlay. Navigate with ↑/↓/j/k, PageUp/PageDown/u/d, g/G for top/bottom, Escape or q to close. |
To avoid conflicts with other pi extensions, override or disable these shortcuts in
<agent-dir>/extensions/pi-autoresearch.json. <agent-dir> is the active pi profile
config directory (usually ~/.pi/agent, or PI_CODING_AGENT_DIR when set):
{
"shortcuts": {
"toggleDashboard": "ctrl+shift+y",
"fullscreenDashboard": null
}
}
Use null to skip registering a shortcut. Omitted shortcuts keep their defaults.
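The override rules above can be sketched as a small merge. This is an illustration in Python, not the extension's actual code: `DEFAULT_SHORTCUTS` and `resolve_shortcuts` are hypothetical names, and the lowercase default bindings are assumed from the shortcut table.

```python
import json

# Assumed defaults, taken from the shortcut table; names are illustrative only.
DEFAULT_SHORTCUTS = {
    "toggleDashboard": "ctrl+shift+t",
    "fullscreenDashboard": "ctrl+shift+f",
}

def resolve_shortcuts(config_text: str) -> dict:
    """Return the shortcuts to register: overrides win, null disables, omitted keys keep defaults."""
    overrides = json.loads(config_text).get("shortcuts", {})
    resolved = {}
    for action, default_key in DEFAULT_SHORTCUTS.items():
        key = overrides.get(action, default_key)  # omitted key -> default binding
        if key is not None:                       # JSON null -> do not register
            resolved[action] = key
    return resolved

example = '{"shortcuts": {"toggleDashboard": "ctrl+shift+y", "fullscreenDashboard": null}}'
print(resolve_shortcuts(example))  # {'toggleDashboard': 'ctrl+shift+y'}
```

With the example config, only the toggle shortcut is registered, and it is rebound to ctrl+shift+y.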
🔬 autoresearch 12 runs 8 kept │ ★ total_µs: 15,200 (-12.3%) │ conf: 2.1×

Ctrl+Shift+T expands the widget into a full results table with columns for commit, metric, status, and description.

Ctrl+Shift+F opens a scrollable full-terminal dashboard that shows a live spinner with elapsed time for running experiments.

autoresearch-create asks a few questions (or infers from context) about your goal, command, metric, and files in scope, then writes two files (autoresearch.md and autoresearch.sh) and starts the loop immediately.
autoresearch-finalize turns a noisy autoresearch branch into clean, independent branches — one per logical change, each starting from the merge-base. Groups must not share files, so each branch can be reviewed and merged independently.
autoresearch-hooks (optional) helps author autoresearch.hooks/before.sh and autoresearch.hooks/after.sh for a session. It ships with ten reference scripts in skills/autoresearch-hooks/examples/ (external search, learnings journal, native notifications, anti-thrash, idea rotation, and more) — the skill handles the contract, you pick the inspiration. The core autoresearch loop has no hook awareness.
| File | Purpose |
|------|---------|
| autoresearch.md | Session document — objective, metrics, files in scope, what's been tried. A fresh agent can resume from this alone. |
| autoresearch.sh | Benchmark script — pre-checks, runs the workload, outputs METRIC name=number lines. |
| autoresearch.checks.sh | (optional) Backpressure checks — tests, types, lint. Runs after each passing benchmark. Failures block keep. |
| autoresearch.hooks/ | (optional) Executable scripts (before.sh, after.sh) that fire around iterations. Stdout is delivered to the agent as a steer message. |
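To make the `METRIC name=number` convention concrete, here is a minimal Python sketch of how such lines might be picked out of benchmark output. It assumes one metric per line with a literal `METRIC ` prefix and ignores everything else; `parse_metrics` is a hypothetical helper, not part of the extension.

```python
import re

# Assumed line shape: "METRIC <name>=<number>"; all other output is ignored.
METRIC_LINE = re.compile(r"^METRIC\s+([^\s=]+)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$")

def parse_metrics(output: str) -> dict:
    """Collect metric name -> value; a later line with the same name overwrites an earlier one."""
    metrics = {}
    for line in output.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics

sample = "running 42 tests\nMETRIC total_ms=38.9\nMETRIC peak_mb=512\n"
print(parse_metrics(sample))  # {'total_ms': 38.9, 'peak_mb': 512.0}
```

Anchoring on a fixed prefix lets the benchmark script print any other diagnostics freely without confusing the parser.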
pi install npm:pi-autoresearch
<details>
<summary>Manual install</summary>
cp -r extensions/pi-autoresearch ~/.pi/agent/extensions/
cp -r skills/autoresearch-create ~/.pi/agent/skills/
Then /reload in pi.
</details>
/skill:autoresearch-create
The agent asks about your goal, command, metric, and files in scope — or infers them from context. It then creates a branch, writes autoresearch.md and autoresearch.sh, runs the baseline, and starts looping immediately.
The agent runs autonomously: edit → commit → run_experiment → log_experiment → keep or revert → repeat. It keeps going until you interrupt it or the maxIterations limit is reached.
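The keep-or-revert step can be pictured as a comparison against the best metric so far. The Python sketch below is an assumption about that decision, not the agent's actual logic: `should_keep` is a hypothetical name, and the `"higher"` direction value is assumed as the counterpart of `"lower"`.

```python
def should_keep(metric: float, best: float, direction: str, checks_passed: bool = True) -> bool:
    """Keep a run only if checks passed and the metric strictly improved on the best so far."""
    if not checks_passed:
        return False              # failed checks block a keep
    if direction == "lower":
        return metric < best
    if direction == "higher":
        return metric > best
    raise ValueError(f"unknown direction: {direction!r}")

print(should_keep(38.9, 33.5, "lower"))  # False: worse than the best of 33.5, so revert
print(should_keep(31.0, 33.5, "lower"))  # True: a new best, so keep the commit
```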
Every result is appended to autoresearch.jsonl in your project — one line per run. This means:
autoresearch.md captures what's been tried so a fresh agent has full context

/skill:autoresearch-finalize
The agent reads autoresearch.jsonl, groups kept experiments into logical changesets, proposes the grouping for your approval, then creates independent branches from the merge-base. Each commit includes metric improvements in the message. Groups must not share files, so branches can be reviewed and merged independently.
- Ctrl+Shift+T: expand/collapse the full results table inline (config key: shortcuts.toggleDashboard)
- Ctrl+Shift+F: fullscreen scrollable dashboard overlay (config key: shortcuts.fullscreenDashboard)
- /autoresearch export: open a live browser dashboard with chart and share card
- Escape: interrupt anytime and ask for a summary

| Domain | Metric | Command |
|--------|--------|---------|
| Test speed | seconds ↓ | pnpm test |
| Bundle size | KB ↓ | pnpm build && du -sb dist |
| LLM training | val_bpb ↓ | uv run train.py |
| Build speed | seconds ↓ | pnpm build |
| Lighthouse | perf score ↑ | lighthouse http://localhost:3000 --output=json |
The extension is domain-agnostic infrastructure. The skill encodes domain knowledge. This separation means one extension serves unlimited domains.
┌──────────────────────┐     ┌──────────────────────────┐
│  Extension (global)  │     │  Skill (per-domain)      │
│                      │     │                          │
│  run_experiment      │◄────│  command: pnpm test      │
│  log_experiment      │     │  metric: seconds (lower) │
│  widget + dashboard  │     │  scope: vitest configs   │
│                      │     │  ideas: pool, parallel…  │
└──────────────────────┘     └──────────────────────────┘
Two files keep the session alive across restarts and context resets:
autoresearch.jsonl — append-only log of every run (metric, status, commit, description)
autoresearch.md — living document: objective, what's been tried, dead ends, key wins
A fresh agent with no memory can read these two files and continue exactly where the previous session left off.
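As a rough illustration of resuming from the log, the Python sketch below summarizes an autoresearch.jsonl-style file. The key names follow the fields listed above (metric, status, commit, description), but the exact JSON schema, the `"keep"` status value, and the sample entries are assumptions made up for this example.

```python
import json

def summarize_log(lines):
    """Summarize a JSONL run log: total runs, kept runs, and the last run's description."""
    runs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if "status" in entry:  # skip non-run entries such as {"type": "hook", ...}
            runs.append(entry)
    kept = [r for r in runs if r["status"] == "keep"]
    return {
        "runs": len(runs),
        "kept": len(kept),
        "last_description": runs[-1].get("description") if runs else None,
    }

# Made-up sample entries, for illustration only.
sample_log = [
    '{"metric": 40.7, "status": "keep", "commit": "a1b2c3d", "description": "baseline"}',
    '{"type": "hook", "event": "after"}',
    '{"metric": 42.1, "status": "discard", "commit": "d4e5f6a", "description": "try wider pool"}',
]
print(summarize_log(sample_log))
# {'runs': 2, 'kept': 1, 'last_description': 'try wider pool'}
```

Because the function takes any iterable of lines, an open file handle on autoresearch.jsonl can be passed in directly.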
Create autoresearch.config.json in your pi session directory to customize behavior:
{
"workingDir": "/path/to/project",
"maxIterations": 50
}
| Field | Type | Description |
|-------|------|-------------|
| workingDir | string | Override the directory for all autoresearch operations — file I/O, command execution, and git. Supports absolute or relative paths (resolved against the pi session cwd). The config file itself always stays in the session cwd. Fails if the directory doesn't exist. |
| maxIterations | number | Maximum experiments before auto-stopping. The agent is told to stop and won't run more experiments until a new segment is initialized. |
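The workingDir rules in the table can be sketched as follows. This is a hedged Python illustration, not the extension's code: `resolve_working_dir` is a hypothetical name, and falling back to the session cwd when workingDir is absent is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Optional

def resolve_working_dir(session_cwd: str, working_dir: Optional[str]) -> Path:
    """Resolve workingDir against the session cwd and fail if the directory does not exist."""
    base = Path(session_cwd)
    if working_dir is None:
        return base  # assumption: no override means the session cwd is used
    candidate = Path(working_dir)
    resolved = candidate if candidate.is_absolute() else base / candidate
    if not resolved.is_dir():
        raise FileNotFoundError(f"workingDir does not exist: {resolved}")
    return resolved

# Demo in a throwaway directory so the example has no side effects.
with tempfile.TemporaryDirectory() as session:
    (Path(session) / "packages" / "app").mkdir(parents=True)
    print(resolve_working_dir(session, "packages/app").name)  # app
```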
The loop is designed to run unattended across context limits. When pi's auto-compaction summarizes the older portion of the conversation, autoresearch detects the resulting idle and re-prompts the agent to re-read autoresearch.md, the tail of autoresearch.jsonl, autoresearch.ideas.md, and git log before continuing. All progress is persisted in those files, so the post-summary turn rehydrates from the source of truth instead of relying on whatever survived compaction. No tuning required — if pi's auto-compaction is enabled (the default), this just works.
After 3+ experiments in a session, pi-autoresearch computes a confidence score — how the best improvement compares to the session's noise floor. This helps distinguish real gains from benchmark jitter, especially on noisy signals like ML training, Lighthouse scores, or flaky benchmarks.
How it works:
- Confidence is |best_improvement| / MAD. A score of 2.0× means the best improvement is twice the noise floor.
- The score appears in the log_experiment output.
- It is persisted to autoresearch.jsonl on each result for post-hoc analysis.

| Confidence | Color | Meaning |
|-----------|-------|---------|
| ≥ 2.0× | 🟢 green | Improvement is likely real |
| 1.0–2.0× | 🟡 yellow | Above noise but marginal |
| < 1.0× | 🔴 red | Within noise; consider re-running to confirm |
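The ratio can be worked through in a few lines of Python. This sketch rests on several assumptions that the text does not confirm: MAD is read as the median absolute deviation of the session's metric values, the first run is treated as the baseline, and `confidence` and `band` are hypothetical names.

```python
from statistics import median

def confidence(metrics, direction="lower"):
    """Return |best improvement over baseline| / MAD, or None when it cannot be computed."""
    if len(metrics) < 3:
        return None                 # the score is only computed after 3+ experiments
    baseline = metrics[0]           # assumption: the first run is the baseline
    best = min(metrics) if direction == "lower" else max(metrics)
    center = median(metrics)
    mad = median(abs(m - center) for m in metrics)  # assumed noise-floor estimate
    if mad == 0:
        return None                 # no measurable spread, so the ratio is undefined
    return abs(best - baseline) / mad

def band(score):
    """Map a score to the color bands in the table above."""
    if score >= 2.0:
        return "green"
    if score >= 1.0:
        return "yellow"
    return "red"

runs = [100, 104, 98, 90, 102]      # made-up metric values, lower is better
score = confidence(runs)
print(score, band(score))           # 5.0 green
```

In this example the median is 100, the absolute deviations are 0, 4, 2, 10, 2 with a median of 2, and the best run improves on the baseline by 10, giving 10 / 2 = 5.0×.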
Create autoresearch.checks.sh to run correctness checks (tests, types, lint) after every passing benchmark. This ensures optimizations don't break things.
#!/bin/bash
set -euo pipefail
pnpm test --run
pnpm typecheck
How it works:
- If checks fail, the run is logged as checks_failed (same behavior as a crash: no commit, revert changes).
- The checks_failed status is shown separately in the dashboard so you can distinguish correctness failures from benchmark crashes.
- The checks timeout is configurable (checks_timeout_seconds in run_experiment).

Drop executable scripts in autoresearch.hooks/ to run code at iteration boundaries. Hooks are transparent to the agent: the agent calls tools and sees results; hooks run alongside without any agent-facing surface.
- autoresearch.hooks/before.sh: fires before every iteration (at /autoresearch activation and at the end of every log_experiment, after after.sh). Use for prospective work: fetch research, prime context for the next attempt.
- autoresearch.hooks/after.sh: fires at the end of every log_experiment. Use for retrospective work: annotate learnings, send notifications.

Contract:
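- Scripts must be executable (chmod +x). Preserved on revert like all autoresearch.* artefacts.
- Input arrives as JSON on stdin; parse it with jq.
- Each hook run appends a {"type":"hook",…} entry to autoresearch.jsonl for observability.

before.sh stdin (on fresh activation last_run is null):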
{
"event": "before",
"cwd": "/path/to/workdir",
"next_run": 6,
"last_run": {
"run": 5, "status": "discard", "metric": 42.1,
"description": "…",
"asi": { "hypothesis": "…", "next_focus": "…" }
},
"session": {
"metric_name": "total_ms", "metric_unit": "ms", "direction": "lower",
"baseline_metric": 40.7, "best_metric": 33.5,
"run_count": 5, "goal": "optimize sort speed"
}
}
after.sh stdin:
{
"event": "after",
"cwd": "/path/to/workdir",
"run_entry": {
"run": 6, "status": "discard", "metric": 38.9,
"description": "…",
"asi": { "hypothesis": "…", "learned": "…" }
},
"session": { "metric_name": "total_ms", "direction": "lower", "baseline_metric": 40.7, "best_metric": 33.5, "run_count": 6, "goal": "…" }
}
Agent signal. The agent writes description and asi.* fields in its log_experiment calls for its own future-self reasoning. The hook opportunistically mines whichever fields the agent naturally uses — asi.hypothesis, asi.next_focus, description, etc. There is no dedicated "hook input" field; the agent is unaware the hook exists.
Examples. Reference scripts for both stages live at skills/autoresearch-hooks/examples/ — external search, qmd document search, persistent learnings, native notifications, git tagging, anti-thrash, idea rotator, hypothesis reflection, context rotation. Copy one to your session's autoresearch.hooks/ directory, adapt, chmod +x.
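To show how a hook might mine the payload described above, here is a minimal Python sketch that turns a before payload into a steer message. It is not one of the shipped reference scripts: `build_steer` is a hypothetical name, the sample values are made up, and whether a Python script is acceptable as before.sh is an assumption, since the source only shows shell file names.

```python
import json

def build_steer(payload: dict) -> str:
    """Turn a before-hook payload into a short steer message; an empty string means stay silent."""
    last = payload.get("last_run")
    if last is None:                      # fresh activation: nothing to reflect on yet
        return ""
    unit = payload["session"].get("metric_unit", "")
    lines = [f"Run {last['run']} ended as {last['status']} at {last['metric']} {unit}".rstrip()]
    focus = (last.get("asi") or {}).get("next_focus")
    if focus:                             # reuse whatever the agent wrote for its future self
        lines.append(f"Planned focus for run {payload['next_run']}: {focus}")
    return "\n".join(lines)

# Made-up payload shaped like the before.sh stdin example.
payload = json.loads("""
{
  "event": "before", "next_run": 6,
  "last_run": {"run": 5, "status": "discard", "metric": 42.1,
               "asi": {"next_focus": "reduce allocations"}},
  "session": {"metric_name": "total_ms", "metric_unit": "ms", "direction": "lower"}
}
""")
print(build_steer(payload))
```

In an actual hook the payload would be read from stdin (for example with `json.load(sys.stdin)`) and the message printed to stdout, where it is delivered to the agent as a steer message.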
Autoresearch loops run autonomously and can burn through tokens. Two ways to cap spend:
maxIterations — cap experiments per session in autoresearch.config.json:
{
"maxIterations": 30
}
MIT