SEAS
Self-Evolving Agent System
Solo research project. Designed the evolution framework, evaluation pipeline, and ran all experiments.
The Problem
Most LLM agents are hand-designed: someone writes a prompt, tests it manually, tweaks it, repeat. I wanted to see if genetic programming could automate this — evolve a population of agents where the good ones reproduce and pass down their strategies. This is a research experiment, not a production system.
How I Built It
Arena-based evaluation
Each agent in the population receives a coding task, generates a solution, and gets scored by a multi-layer judge: syntax check, test execution, output validation, and code quality. The layered approach was necessary because a single LLM judge was too noisy — it would give high scores to plausible-looking but incorrect code.
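The layered scoring can be sketched as follows. This is a minimal illustration, not the project's actual judge: the layer weights, the in-process `exec` (the real system runs tests in a subprocess), and the statement-count quality heuristic are all assumptions.

```python
import ast

def judge(solution_code: str, test_cases: list[tuple[str, object]]) -> float:
    """Score a candidate solution with stacked deterministic layers.

    Each layer contributes up to 0.25; a failure short-circuits the rest.
    Weights and heuristics here are illustrative, not the project's values.
    """
    score = 0.0

    # Layer 1: syntax check via AST parsing.
    try:
        tree = ast.parse(solution_code)
    except SyntaxError:
        return 0.0  # unparseable code fails every later layer
    score += 0.25

    # Layer 2: test execution -- run the code in an isolated namespace.
    # (The project runs this in a subprocess; exec keeps the sketch simple.)
    namespace: dict = {}
    try:
        exec(solution_code, namespace)
    except Exception:
        return score
    score += 0.25

    # Layer 3: output validation -- compare expected vs. actual results.
    passed = 0
    for expr, expected in test_cases:
        try:
            if eval(expr, namespace) == expected:
                passed += 1
        except Exception:
            pass
    score += 0.25 * (passed / len(test_cases)) if test_cases else 0.0

    # Layer 4: crude quality heuristic -- penalize very long solutions.
    n_statements = sum(isinstance(n, ast.stmt) for n in ast.walk(tree))
    score += 0.25 * min(1.0, 10 / max(n_statements, 1))
    return score
```

A correct, compact solution scores near 1.0; syntactically broken code scores 0.0; plausible-but-wrong code gets partial credit from the early layers but loses the output-validation share, which is exactly the failure mode a single LLM judge missed.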
Genetic operators
Top-performing agents reproduce using crossover (combine strategies from two parents) and mutation (randomly modify prompts, temperature settings, tool use patterns). The crossover operator is still pretty basic — it concatenates sections from each parent rather than doing anything structurally smart.
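The two operators can be sketched like this. The agent representation (a `prompt` string plus a `temperature`) and the mutation rates are illustrative assumptions; the concatenation-style crossover matches the naive behavior described above.

```python
import random

def crossover(parent_a: dict, parent_b: dict) -> dict:
    """Naive crossover: take prompt sections alternately from each parent
    and average the sampling temperature. No structural awareness."""
    sections_a = parent_a["prompt"].split("\n\n")
    sections_b = parent_b["prompt"].split("\n\n")
    child_sections = [
        (sections_a if i % 2 == 0 else sections_b)[i]
        for i in range(min(len(sections_a), len(sections_b)))
    ]
    return {
        "prompt": "\n\n".join(child_sections),
        "temperature": (parent_a["temperature"] + parent_b["temperature"]) / 2,
    }

def mutate(agent: dict, rng: random.Random) -> dict:
    """Perturb the temperature and occasionally drop a prompt section."""
    child = dict(agent)
    child["temperature"] = min(2.0, max(0.0, agent["temperature"] + rng.gauss(0, 0.1)))
    sections = child["prompt"].split("\n\n")
    if len(sections) > 1 and rng.random() < 0.3:  # assumed 30% drop rate
        sections.pop(rng.randrange(len(sections)))
    child["prompt"] = "\n\n".join(sections)
    return child
```

Because crossover splices whole sections blindly, a child can inherit contradictory instructions from its parents, which is the "nothing structurally smart" limitation noted above.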
Archive to prevent forgetting
Inspired by the Go-Explore algorithm, an archive stores the best agent found for each task category. Without this, the population would converge on one strategy and forget how to solve other task types. The archive acts as a safety net — if the population regresses on a task, the archived agent can be reintroduced.
Toolkit harvesting
When an agent discovers a useful code pattern, it gets extracted into a reusable toolkit that future generations can access. In practice this worked okay but not great — the harvested patterns were often too specific to generalize well.
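Harvesting can be approximated as pulling top-level function definitions out of a high-scoring solution. This is a simplified sketch; the real criteria for what counts as a "useful pattern" are fuzzier than "any function definition".

```python
import ast

def harvest(solution_code: str, toolkit: dict[str, str]) -> dict[str, str]:
    """Extract top-level function definitions from a winning solution into
    a named toolkit that future generations can inject into their context."""
    tree = ast.parse(solution_code)
    for node in tree.body:
        # Keep the first version of each named pattern; skip duplicates.
        if isinstance(node, ast.FunctionDef) and node.name not in toolkit:
            toolkit[node.name] = ast.unparse(node)
    return toolkit
```

The overfitting problem noted above shows up here directly: a harvested function is stored verbatim, so anything tuned to one task's input shape rarely transfers to another.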
Architecture
Population Manager
├── Agent Pool (N agents per generation)
├── Arena
│ └── Coding tasks + multi-layer evaluation
├── Judge (4 layers)
│ ├── Syntax Check (AST parsing)
│ ├── Test Execution (subprocess)
│ ├── Output Validation (expected vs actual)
│ └── Code Quality (heuristic scoring)
├── Genetic Operators
│ ├── Crossover (strategy combination)
│ └── Mutation (prompt/config perturbation)
├── Archive (Go-Explore inspired)
│ └── Best agent per task category
└── Toolkit Harvester
    └── Extract reusable patterns from top agents
Results
- 79% average fitness across 10 benchmark coding tasks
- 17+ full evolution runs completed with different configs
- Multi-layer judge reduced false-positive scoring compared with a single LLM judge
- Archive mechanism successfully prevented catastrophic forgetting
- Toolkit harvesting extracted reusable code patterns from top-performing agents
The biggest limitation is cost — running LLM inference for every agent in every generation is expensive, which caps both population size and generation count. The crossover operator is also naive — it concatenates sections from each parent rather than doing structurally aware recombination. This is a research prototype, not a production tool.