What if your carefully crafted AI prompts are silently failing 30% of the time and your clients notice before you do? That is exactly what happened to one AI workflow builder, and it led to a systematic method for automatically improving any AI skill to near-perfection.
The method is called autoresearch, originally developed by Andrej Karpathy (OpenAI founding member, former Director of AI at Tesla). While Karpathy designed it for machine learning code, the approach works for anything measurable and improvable, including the AI prompts and skills you use every day.
The Problem: Your AI Skills Are Secretly Underperforming
Most people cannot tell the difference between an AI workflow that performs well and one that just produces text. There are three types of silent failure:
- Gradual drift: The model slowly drifts toward safe, vague, template-like outputs. Each output seems acceptable on its own, but quality erodes imperceptibly.
- Survivorship bias: You only see outputs you use. Failed ones with wrong formats or missing elements go unreviewed.
- One-off fixes: You fix a specific output but not the underlying skill. The same error recurs.
What Is the Autoresearch Method?
The core concept is simple: let an AI agent run an optimization loop for you.
- The agent tries a small change to your prompt
- It tests the modified version and measures the results
- If the results improve, it keeps the change
- If they worsen, it reverts
- The loop repeats indefinitely
Think of perfecting a recipe. Change one ingredient, cook it ten times, see if it is better. After 50 rounds, your recipe works 9.5 times out of 10.
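The loop above can be sketched in a few lines of Python. This is a minimal hill-climbing sketch, not the actual implementation: `propose_change`, `run_skill`, and the checklist functions are hypothetical stand-ins you supply for your own workflow.

```python
def score(output, checklist):
    """Fraction of yes/no checklist questions the output passes."""
    return sum(check(output) for check in checklist) / len(checklist)

def autoresearch(prompt, propose_change, run_skill, checklist,
                 rounds=50, target=0.95):
    """Hill-climb on a prompt: keep changes that raise the score, revert the rest."""
    best = score(run_skill(prompt), checklist)  # baseline score
    for _ in range(rounds):
        candidate = propose_change(prompt)              # try a small change
        candidate_score = score(run_skill(candidate), checklist)
        if candidate_score > best:                      # improvement: keep it
            prompt, best = candidate, candidate_score
        # otherwise: revert, i.e. simply discard the candidate
        if best >= target:                              # stop at 95%+ consistency
            break
    return prompt, best
```

Because a losing candidate is never written back, the original prompt is preserved at every step, which is why the method cannot make things worse than your baseline.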
The Secret: Your Evaluation Checklist
The only thing you provide is a checklist that defines "good" as a set of simple yes/no questions:
| Checklist Item | Type |
|---|---|
| Does the headline contain a specific number or quantifiable result? | Yes/No |
| Does the opening sentence name a specific pain scenario? | Yes/No |
| Does the CTA clearly tell the user what happens after they click? | Yes/No |
| Is the copy free of buzzwords (revolutionary, cutting-edge, synergy)? | Yes/No |
Three to six questions is the sweet spot; more than ten causes the skill to teach to the test.
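A checklist like the one above can be expressed as plain data and scored mechanically. The keyword checks below are a deliberately simplistic stand-in for the judgment an evaluating agent would actually apply; the function names and buzzword list are illustrative assumptions, not part of any real tool.

```python
import re

# Banned buzzwords from the checklist item above.
BUZZWORDS = {"revolutionary", "cutting-edge", "synergy"}

def headline_has_number(headline: str) -> bool:
    """Does the headline contain a specific number or quantifiable result?"""
    return bool(re.search(r"\d", headline))

def copy_is_buzzword_free(copy: str) -> bool:
    """Is the copy free of banned buzzwords?"""
    lowered = copy.lower()
    return not any(word in lowered for word in BUZZWORDS)

def checklist_score(headline: str, copy: str) -> float:
    """Percentage of yes/no checks passed, as a 0.0-1.0 score."""
    checks = [headline_has_number(headline), copy_is_buzzword_free(copy)]
    return sum(checks) / len(checks)
```

Each check returns a plain True/False, so the overall score is just the pass rate, which is what makes the improve-or-revert decision in the loop unambiguous.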
Real Results: 56% to 92% on Autopilot
A landing page copy skill was improved with zero human intervention:
- Starting score: 56%
- Final score: 92%
- Iterations: 4 changes tested, 3 kept, 1 reverted
- Human intervention: Zero
The agent added explicit headline rules, created a banned buzzword list, included concrete examples of good copy, and correctly reverted a word limit change that harmed overall quality.
How to Set Up Autoresearch
- Download the skill and add it to your Claude Code skills folder
- Choose your most inconsistent skill to improve
- Define your checklist of what "good" looks like
- Run it and the agent establishes a baseline score
- Walk away while the agent loops through improvements
- It stops at 95%+ consistency or when you tell it to
Beyond AI: Where Else This Works
| Use Case | What You Measure | Example Result |
|---|---|---|
| Website performance | Page load time | 1,100ms to 67ms in 67 iterations |
| Cold outreach emails | Personalization, length, question ending | 50 variants automatically |
| Newsletter intros | Personal details, no clichés | Polished on autopilot |
| Any repeated prompt | Your custom checklist | Continuous improvement |
The Deeper Lesson
The biggest change is the shift from hope to knowledge. Before autoresearch, every delivery came with quiet uncertainty. After running it, you know exactly when your skill works, when it fails, and how to find the problem. You move from luck-based to system-based. That is the real value.
FAQ
What is the Karpathy autoresearch method?
An automated optimization loop that makes small changes to prompts or code, tests results against measurable criteria, and keeps improvements while reverting failures. Originally for ML code, it works for any measurable process.
Do I need coding skills?
No. The method has been packaged as a Claude Code skill you can download and run directly. Just define what good looks like.
How long does it take?
A typical run with 4-10 iterations takes 15-30 minutes. It runs autonomously and stops at 95%+ consistency.
Can it make things worse?
No. Every change is tested. If a modification reduces the score, it is automatically reverted. Your original is always preserved.
What is the ideal number of checklist items?
3-6 questions. Fewer than 3 gives insufficient signal. More than 10 causes teaching to the test.
Source: @MinLiBuilds on X, adapting @itsolelehmann's original article on the Karpathy autoresearch method.