SEAS

Private

Self-Evolving Agent System

79%
avg fitness
17+
evolution runs
10
benchmark tasks
Code available on request

Solo research project. Designed the evolution framework, evaluation pipeline, and ran all experiments.

The Problem

Most LLM agents are hand-designed: someone writes a prompt, tests it manually, tweaks it, repeat. I wanted to see if genetic programming could automate this — evolve a population of agents where the good ones reproduce and pass down their strategies. This is a research experiment, not a production system.

How I Built It

Arena-based evaluation

Each agent in the population receives a coding task, generates a solution, and gets scored by a multi-layer judge: syntax check, test execution, output validation, and code quality. The layered approach was necessary because a single LLM judge was too noisy — it would give high scores to plausible-looking but incorrect code.
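The layered gating described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual judge: the function name `judge`, the score weights, and the quality heuristic are all hypothetical.

```python
import ast
import subprocess
import sys
import tempfile

def judge(code: str, test_code: str, expected: str) -> float:
    """Hypothetical 4-layer judge sketch: each layer gates the next.

    Score thresholds and weights here are illustrative placeholders."""
    # Layer 1: syntax check via AST parsing
    try:
        ast.parse(code)
    except SyntaxError:
        return 0.0
    # Layer 2: test execution in a subprocess
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return 0.25
    if result.returncode != 0:
        return 0.25  # parses, but tests fail
    # Layer 3: output validation (expected vs actual)
    if result.stdout.strip() != expected.strip():
        return 0.5
    # Layer 4: heuristic code quality (toy heuristic: penalize long solutions)
    quality = 1.0 if len(code.splitlines()) < 50 else 0.8
    return 0.5 + 0.5 * quality
```

Because each layer can only run if the previous one passed, an LLM-style quality score never gets the chance to reward code that doesn't even execute.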

Genetic operators

Top-performing agents reproduce using crossover (combine strategies from two parents) and mutation (randomly modify prompts, temperature settings, tool use patterns). The crossover operator is still pretty basic — it concatenates sections from each parent rather than doing anything structurally smart.
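A sketch of what these operators might look like, assuming an agent genome is a list of prompt sections plus a temperature setting (the genome layout and mutation rates here are hypothetical, not the project's actual representation):

```python
import random

def crossover(parent_a: dict, parent_b: dict) -> dict:
    """Naive section-level crossover: pick each prompt section from one parent."""
    sections = [random.choice([sa, sb])
                for sa, sb in zip(parent_a["prompt_sections"],
                                  parent_b["prompt_sections"])]
    return {"prompt_sections": sections,
            "temperature": random.choice([parent_a["temperature"],
                                          parent_b["temperature"]])}

def mutate(agent: dict, rate: float = 0.2) -> dict:
    """Perturb the temperature and occasionally reorder prompt sections."""
    child = {"prompt_sections": list(agent["prompt_sections"]),
             "temperature": agent["temperature"]}
    if random.random() < rate:
        # Gaussian jitter on temperature, clamped to a sane range
        child["temperature"] = min(1.5, max(0.0,
            child["temperature"] + random.gauss(0, 0.1)))
    if random.random() < rate and len(child["prompt_sections"]) > 1:
        # Swap two sections to vary instruction ordering
        i, j = random.sample(range(len(child["prompt_sections"])), 2)
        child["prompt_sections"][i], child["prompt_sections"][j] = \
            child["prompt_sections"][j], child["prompt_sections"][i]
    return child
```

The section-wise `random.choice` is exactly the kind of "concatenate pieces from each parent" recombination described above; a structurally-aware version would have to understand what each section does.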

Archive to prevent forgetting

Inspired by the Go-Explore algorithm, an archive stores the best agent found for each task category. Without this, the population would converge on one strategy and forget how to solve other task types. The archive acts as a safety net — if the population regresses on a task, the archived agent can be reintroduced.
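The archive logic is simple enough to show in full. This is a minimal sketch under the assumption that agents are scored per task category; the class and method names are mine, not the project's.

```python
class Archive:
    """Go-Explore-style archive: keep the best-known agent per task category."""

    def __init__(self):
        self.best = {}  # category -> (fitness, agent)

    def consider(self, category: str, agent, fitness: float) -> bool:
        """Store the agent if it beats the current champion for its category."""
        if category not in self.best or fitness > self.best[category][0]:
            self.best[category] = (fitness, agent)
            return True
        return False

    def reintroduce(self, category: str):
        """Safety net: return the archived champion, or None if none exists,
        so it can be re-injected when the population regresses."""
        entry = self.best.get(category)
        return entry[1] if entry else None
```

Because entries are only ever replaced by strictly better agents, the archive is monotone: the population can converge or regress however it likes without losing the best solution found for each category.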

Toolkit harvesting

When an agent discovers a useful code pattern, it gets extracted into a reusable toolkit that future generations can access. In practice this worked okay but not great — the harvested patterns were often too specific to generalize well.
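One plausible way to implement harvesting is to lift top-level function definitions out of winning solutions via the AST. This is a hypothetical sketch: the real criteria for what counts as "useful" (reuse frequency, generality) are not modeled here.

```python
import ast

def harvest_patterns(solution_code: str) -> list[str]:
    """Extract top-level function definitions from a winning solution
    as candidate reusable tools for future generations."""
    tree = ast.parse(solution_code)
    tools = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            # ast.unparse (Python 3.9+) turns the node back into source text
            tools.append(ast.unparse(node))
    return tools
```

A harvester this literal illustrates the limitation noted above: it copies functions verbatim, so anything tied to one task's variable names or assumptions comes along with it.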

Architecture

Population Manager
├── Agent Pool (N agents per generation)
├── Arena
│   └── Coding tasks + multi-layer evaluation
├── Judge (4 layers)
│   ├── Syntax Check (AST parsing)
│   ├── Test Execution (subprocess)
│   ├── Output Validation (expected vs actual)
│   └── Code Quality (heuristic scoring)
├── Genetic Operators
│   ├── Crossover (strategy combination)
│   └── Mutation (prompt/config perturbation)
├── Archive (Go-Explore inspired)
│   └── Best agent per task category
└── Toolkit Harvester
    └── Extract reusable patterns from top agents

Results

  • 79% average fitness across 10 benchmark coding tasks
  • 17+ full evolution runs completed with different configs
  • Multi-layer judge reduced false-positive scoring vs single LLM judge
  • Archive mechanism successfully prevented catastrophic forgetting
  • Toolkit harvesting extracts reusable code patterns from top-performing agents

The biggest limitation is cost: running LLM inference for every agent in every generation is expensive, which caps both population size and generation count. The crossover operator is also naive, concatenating sections from each parent rather than doing structurally aware recombination. This is a research prototype, not a production tool.

Tech Stack

Python · LLM APIs (GPT-4, Claude) · Genetic Programming · AST Manipulation · Subprocess Sandboxing · Multiprocessing