Project Overview

Automated De Novo Protein Design Optimization

De novo protein design has advanced rapidly through generative models: diffusion-based structure generators (Proteina-Complexa, RFdiffusion), structure predictors (AlphaFold2, RoseTTAFold3), conformational samplers (BioEmu), and inverse folding networks (ProteinMPNN, LigandMPNN). These models are powerful. But using them to produce high-quality binders for a specific target requires searching over dozens of inference-time hyperparameters: beam search width, checkpoint schedules, branching factors, scoring model choice, refinement strategies, and more. No single default configuration works across targets. The search is high-dimensional, target-dependent, and expensive.

We frame this as an inference-time optimization problem. The trained model weights are fixed. The agent's job is to find the best way to run the model for each target. LLMs can act as the brain that connects to and orchestrates protein design models, turning what was a manual trial-and-error process into a closed-loop optimization.

Research Question

Can an LLM agent optimize inference-time hyperparameters for protein generative models, improving binder quality iteration by iteration for each target?

Proteina-Complexa generation and inference-time optimization pipeline
Proteina-Complexa pipeline: flow-matching generation with beam search and inference-time scoring. The agent optimizes the hyperparameters at each stage. Figure from NVIDIA Proteina-Complexa.
[Figure: Agent loop around the Proteina-Complexa pipeline (model weights frozen). The LLM agent analyzes results and decides what to change next. Pipeline stages: Generate (flow-matching diffusion: nsteps, noise schedule, guidance_w, self_cond) → Beam Search (branch, score, prune: beam_width, n_branch, checkpoints, nsamples) → AF2 Score (AlphaFold2 evaluation: num_recycles, reward_weights) → Refine (sequence hallucination: n_iters, temperature, loss_weights). Closed-loop feedback into the next iteration: i_PAE ↓, pLDDT ↑. What the agent changes: v1 (shipped) tunes 4 hyperparameters (beam_width, n_branch, nsamples, checkpoints); v2 (transitional, superseded by v3) tree-searches over programs with the InferenceProgram schema; v3 (active) evolves ideas + programs (SCORE loop, first-class ResearchIdea, content-addressed).]

Four generations of the agent, laid out by what the LLM actually writes and whether it reasons about one protein or a whole protein set at once. v3 and v4 both evolve an inference program — the difference is scope.

[Figure: Binder Discovery, what the agent writes at each version. Per-protein optimization (v1–v3, one target per inference pass) gives way to protein-set scheduling (v4, N targets plus a budget). v1 Config tuner (shipped): LLM writes 4 Hydra scalars (beam_width, n_branch, nsamples, ckpts) as a Hydra override bundle for one Proteina call; 1 target → 1 score. v2 Program schema (superseded): LLM writes a structured program dataclass with stages, overrides, and module refs (no free Python code); 1 target → 1 score. v3 Inference program (active): LLM writes a ResearchIdea plus Python code (reranker.py, filter.py, reward_adapter.py) with SCORE idea-aware mutation; 1 target → n_successful. v4 Inference program (frontier): LLM writes a ResearchIdea plus orchestrator.py, which schedules N targets under a budget (100 steps / 2h) through run_single_target(ctx, target); N targets → unequal spend, with some targets skipped.]
v1 picks 4 scalar knobs. v2 mutates a structured program schema. v3 promotes the artifact to an inference program: a research idea plus generated Python modules (reranker / filter / reward adapter) that still run against one protein. v4 keeps the same inference-program frame but lifts the scope to a whole protein set. The LLM now writes an orchestrator.py that decides, within a step and wall-clock budget, how to spend compute across N targets (reorder, batch, skip, retry). All per-target inference still flows through the host-sanctioned run_single_target(ctx, target) chokepoint, so the LLM owns the policy while the host owns the subprocess boundary.
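As a rough illustration of the v4 idea, the sketch below schedules several targets under a step and wall-clock budget. Only run_single_target(ctx, target) is named in the text; the Budget class, the retry rule, and the fake scores are hypothetical stand-ins, not the project's orchestrator.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Hypothetical budget tracker: a step cap plus a wall-clock cap."""
    max_steps: int
    max_seconds: float
    steps_used: int = 0
    started: float = field(default_factory=time.monotonic)

    def exhausted(self) -> bool:
        elapsed = time.monotonic() - self.started
        return self.steps_used >= self.max_steps or elapsed >= self.max_seconds


def run_single_target(ctx, target):
    """Stand-in for the host-owned chokepoint; returns a fake n_successful."""
    return ctx["fake_scores"].get(target, 0)


def orchestrate(ctx, targets, budget, max_attempts=2):
    """Visit targets in order, retry targets with no success, stop when the budget is spent."""
    results = {t: 0 for t in targets}
    attempts = {t: 0 for t in targets}
    queue = list(targets)
    while queue and not budget.exhausted():
        target = queue.pop(0)
        budget.steps_used += 1
        attempts[target] += 1
        results[target] = max(results[target], run_single_target(ctx, target))
        if results[target] == 0 and attempts[target] < max_attempts:
            queue.append(target)  # retry later if budget remains
    skipped = [t for t in targets if attempts[t] == 0]
    return results, attempts, skipped
```

With a budget of two steps and three targets, the third target is never attempted and ends up in the skipped list, which is the "unequal spend" behaviour the figure describes.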

The agent minimizes i_PAE (binding quality) while maintaining high pLDDT (fold confidence). Example on the CD45 target (hardest, largest improvement):

[Figure: CD45, i_PAE and pLDDT over agent iterations 1–4. i_PAE (left axis, lower = better binding) falls from 0.241 to 0.173 (−28%); pLDDT (right axis, higher = better fold) rises from 0.88 to 0.95 (+8%).]

Motivation: Good Models Need Good Hyperparameters

We ran 50 experiments (5 beam search configs × 10 protein targets) with Proteina-Complexa. The results make the case clearly:

No universal best config

E_fine_selector wins 7/10, but D_early_brancher wins IFNAR2 and Claudin-1, and C_deep_exploiter wins PD-L1. The #1 config varies per target.

Selection criterion matters

4/10 proteins have different optimal configs depending on whether you pick top-1 or pool top-16 candidates.

Hard targets need tuning

BetV1 (CV=0.65): config choice changes reward by 50%. IFNAR2 (CV=0.03): any config works.
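The CV figures can be read as a coefficient of variation of reward across configs for one target. The source does not say exactly how CV was computed, so the sketch below assumes population standard deviation divided by the absolute mean, and the reward lists are made-up illustrative values, not sweep data.

```python
import statistics


def coefficient_of_variation(rewards):
    """CV = population std / |mean| over per-config rewards (assumed definition)."""
    mean = statistics.fmean(rewards)
    return statistics.pstdev(rewards) / abs(mean)


# Hypothetical per-config mean rewards for two targets (illustrative only).
config_insensitive = [-0.140, -0.138, -0.145, -0.139, -0.142]
config_sensitive = [-0.21, -0.45, -0.30, -0.80, -0.25]

cv_low = coefficient_of_variation(config_insensitive)
cv_high = coefficient_of_variation(config_sensitive)
```

A low CV means any config lands near the same reward; a high CV means the choice of config dominates the outcome.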

Configs explore different space

Different search strategies produce different sequences. Multi-config pooling beats single-config compute.

Motivation 2: The Full Inference-Time Control Surface

The challenge goes beyond tuning scalar hyperparameters. A generative pipeline like Proteina-Complexa exposes 80+ inference-time parameters organized across qualitatively different layers:

Search policy

Which algorithm to use (beam search, MCTS, FK-steering, best-of-n), how to branch, when to prune, at which checkpoints.

Diffusion rollout policy

How many steps, noise schedules, guidance weights, self-conditioning, ODE integration limits. These controls shape how the generative model samples.

Reward policy

Which scoring model (AF2, RF3, Boltz2), which metrics to optimize (i_PAE, pLDDT, i_con, radius of gyration), how to weight them.

Refinement & filtering policy

Whether to enable sequence hallucination, ProteinMPNN redesign, or LigandMPNN; self-consistency checks; structural filters.

Multi-stage orchestration

Composing cheap exploration → expensive refinement; conditional compute allocation; re-ranking under alternative criteria.

These are not parameters to grid-search. They are design decisions that require reasoning about the target, the failure mode, and the experimental goal. This is why LLM agents, rather than Bayesian optimization alone, are the right tool: they can reason about which knobs to turn, not just how far.

Why the Control Surface Is So Large: Branching Inside Flow Matching

Proteina-Complexa generates structures via conditional flow matching, a learned velocity field that denoises Gaussian noise into protein structure over T steps. The key architectural feature is partial simulation: the denoising loop can pause at intermediate checkpoints, return the state, and let search algorithms branch, evaluate, and prune within a single generation run:

step_checkpoints = [0, 100, 200, 300, 400]

Checkpoint 0 (pure noise):
  Initialize beam_width × nsamples noise samples

  ── Denoise steps 0 → 100 (partial simulation) ──

Checkpoint 100:
  BRANCH: duplicate each beam into n_branch copies
  LOOKAHEAD: roll out each copy to step 400 (completion)
  SCORE: evaluate i_PAE on completed structures
  SELECT: keep top beam_width candidates per sample

  ── Continue denoising 100 → 200 with selected beams ──

Checkpoint 200:
  BRANCH → LOOKAHEAD → SCORE → SELECT

  ...repeat at each checkpoint until step 400 (final)

Four search algorithms exploit this differently: beam search (branch, lookahead-score, keep top-k), best-of-N (independent full trajectories, pick best), FK-steering (branch + steer with reward signal + temperature), and MCTS (PUCT tree search over checkpoint branches).
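The checkpoint loop above can be sketched in a few lines. This is a toy under stated assumptions: the state is a single float, denoise and toy_score are hypothetical stand-ins for the flow-matching model and the i_PAE reward, and only the branch → lookahead → score → select control flow mirrors the description.

```python
import random

CHECKPOINTS = [0, 100, 200, 300, 400]  # step_checkpoints from the text


def denoise(x, start, end, rng):
    """Toy partial simulation: drift toward 0 with small constant jitter."""
    for _ in range(start, end):
        x = x * 0.995 + rng.gauss(0.0, 0.01)
    return x


def toy_score(x):
    """Stand-in for i_PAE on a completed structure (lower is better)."""
    return abs(x)


def beam_search(beam_width=4, n_branch=3, seed=0):
    rng = random.Random(seed)
    final = CHECKPOINTS[-1]
    # Checkpoint 0: start every beam from pure noise.
    beams = [rng.gauss(0.0, 1.0) for _ in range(beam_width)]
    for start, end in zip(CHECKPOINTS[:-1], CHECKPOINTS[1:]):
        beams = [denoise(x, start, end, rng) for x in beams]  # partial simulation
        if end == final:
            break
        # BRANCH: perturb each beam into n_branch copies.
        children = [x + rng.gauss(0.0, 0.05) for x in beams for _ in range(n_branch)]
        # LOOKAHEAD + SCORE: roll each copy to the final step and score the result.
        scored = [(toy_score(denoise(c, end, final, rng)), c) for c in children]
        # SELECT: keep the intermediate states whose lookahead scored best.
        scored.sort(key=lambda pair: pair[0])
        beams = [c for _, c in scored[:beam_width]]
    return sorted(toy_score(x) for x in beams)
```

Note that selection keeps the intermediate state at the checkpoint, not the lookahead result, so denoising resumes from the checkpoint with the surviving beams.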

This is what makes the optimization surface deeply coupled rather than flat:

  • Where to branch (checkpoint placement): correcting early structural decisions vs. late refinements
  • How to branch (n_branch, noise injection): diversity vs. exploitation of promising intermediates
  • How to score (reward weights, AF2 recycles): what the lookahead evaluations actually measure
  • How to denoise (ODE vs. SDE, schedule shape, guidance): the character of each trajectory segment
  • How to chain stages (cheap explore → conditional refine): compute allocation based on intermediate evidence

These decisions interact: checkpoint placement only matters given a particular denoising mode; reward weights only matter given a particular branching strategy. This is why we frame the problem as composing InferencePrograms rather than tuning flat hyperparameters. The decisions form a structured, multi-level control program over the generation process. (See Pipeline Details below for the full technical breakdown.)

Inference Programs, Not Configs

We formalize this as inference-time scaling: the LLM agent does not change the trained model backbone or weights. Instead, it learns to compose multi-stage InferencePrograms: executable plans that specify search policy, sampling controls, reward weighting, refinement strategy, and conditional transitions between stages. The agent evolves these programs through population-based search, treating each experiment as an evolvable artifact with lineage tracking and duplicate detection.

The Problem This Creates

A human researcher running Proteina-Complexa on a new drug target faces a combinatorial space: 4 search algorithms × checkpoint schedules × branching factors × reward compositions × refinement strategies × multi-stage sequencing. This is not a hyperparameter grid. It is a space of programs. Running all combinations for every target is infeasible. An LLM agent that understands the target–config interaction can adaptively allocate compute: scouting cheaply first, then investing in promising strategies, composing multi-stage programs, and transferring learnings across targets.

Pipeline Details: Proteina-Complexa
[Target PDB + Hotspots] → Flow Matching (400 steps) → Beam Search → AF2 Scoring → Candidate PDBs
Stage | What it does | Key hyperparameters
Generate | Flow-matching diffusion in partially latent space | nsteps, guidance_w, self_cond
Search | Beam search with branching at checkpoints | beam_width, n_branch, checkpoint schedule, nsamples
Score | AF2 multimer reward (i_PAE) | num_recycles, reward weights
Refine | Sequence hallucination (test-time optimization) | n_iters, loss weights, temperature

Hardware: 8× NVIDIA H100 80GB  |  Targets: 44 pre-configured (10 proteins + 4 ligand targets used in sweeps)  |  Scale: 26,560 samples from 50 protein binder experiments

Flow Matching: Technical Details

The generative model uses conditional flow matching in a product space of two data modes, each with independent schedules and noise injection:

Mode | What it represents | Dimension
bb_ca | Backbone Cα coordinates | [n_residues, 3]
local_latents | Compressed local structure (via autoencoder) | [n_residues, latent_dim]

At each step, the neural network predicts the clean structure from the current noisy state x_t. The update rule depends on the sampling mode:

Mode | Update rule | Character
vf | x ← x + v·dt | Pure ODE: deterministic, low variance
sc | x ← x + (v + g(t)·s)·dt + noise | SDE: stochastic, higher diversity
vf_ss | Score-scaled ODE with temperature control | Temperature-tuned deterministic
vf_tsr | SNR-based adaptive temperature scheduling | Advanced adaptive

Each mode has its own time schedule (how to partition [0,1] into steps: log, power, uniform, cosine), noise injection schedule g(t) (1/t, tangent, power-law variants), and step parameters (noise scale, score scale, ODE/SDE switching thresholds). Optional self-conditioning feeds back the previous step’s prediction, and classifier-free guidance or auto-guidance can steer generation toward the conditioning target.
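The two explicit update rules (vf and sc) can be written as single Euler-style steps. This is a minimal sketch, not the model's sampler: v and s stand for the predicted velocity and score, and the noise magnitude sqrt(2·g(t)·dt) is an assumption borrowed from the usual Langevin-style pairing, since the text only says "+ noise".

```python
import math
import random


def vf_step(x, v, dt):
    """'vf' mode: deterministic ODE step, x <- x + v*dt."""
    return [xi + vi * dt for xi, vi in zip(x, v)]


def sc_step(x, v, s, g_t, dt, rng, noise_scale=1.0):
    """'sc' mode: x <- x + (v + g(t)*s)*dt + noise.

    The noise standard deviation sqrt(2*g(t)*dt) is an assumed choice.
    """
    sigma = noise_scale * math.sqrt(2.0 * g_t * dt)
    return [
        xi + (vi + g_t * si) * dt + sigma * rng.gauss(0.0, 1.0)
        for xi, vi, si in zip(x, v, s)
    ]


rng = random.Random(0)
x0 = [0.0, 0.0, 0.0]
velocity = [1.0, 1.0, 1.0]
score = [2.0, 2.0, 2.0]

ode_next = vf_step(x0, velocity, dt=0.01)
sde_next = sc_step(x0, velocity, score, g_t=0.5, dt=0.01, rng=rng)
```

Setting noise_scale to 0 removes the stochastic term and leaves only the score-corrected drift, which is a convenient way to check the deterministic part of the SDE step.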

Evidence: Sweep Results and Per-Protein Optimality (50 experiments)

Beam Search Sweep Rankings

Rank | Config | Strategy | Mean Reward | Wins (top-1)
1 | E_fine_selector | 6 checkpoints, 3 branches | −0.2475 | 7/10
2 | A_balanced | Paper default | −0.2571 | 0/10
3 | C_deep_exploiter | 2 lengths, 16 beams | −0.2700 | 1/10
4 | D_early_brancher | 8 branches, 2 checkpoints | −0.2792 | 2/10
5 | B_length_explorer | 16 lengths, 2 beams | −0.3501 | 0/10
Key Insight

The gap between best and worst config (0.103 mean reward) is ~42% relative difference. For hard targets like BetV1, the gap is even larger. Choosing the right hyperparameters matters as much as choosing the right model.

Per-Protein Optimality

The optimal config depends on both the target and what you optimize for:

Protein | Top-1 Winner | Top-16 Pool Winner | Same?
CD45 | E_fine | A_balanced | NO
Claudin-1 | D_early | E_fine | NO
DerF7 | D_early | E_fine | NO
SpCas9 | E_fine | C_deep | NO
6 other proteins: same winner for both criteria

This is exactly the kind of problem LLM agents can solve: high-dimensional configuration spaces, target-dependent optima, and multi-objective trade-offs.

Model Landscape: 6 Models in the Agent Framework

The agent framework is designed to work across multiple protein and molecular models:

Model | Type | What it does | Status
Proteina-Complexa | Binder Design | Flow-matching de novo binder generation with beam search (protein + ligand binder modes) | Active: 50 protein + 20 ligand experiments
AlphaFold2 | Scoring | Structure prediction & binding quality assessment (i_PAE, pLDDT) for protein targets | Active: reward model for protein binders
RoseTTAFold3 | Scoring | Structure prediction for protein–ligand complexes (min_ipAE, ipTM) | Active: reward model for ligand binders
ProteinMPNN | Inverse Folding | Sequence redesign for designability evaluation (scRMSD) | Active: enabled in ligand binder evaluation
LigandMPNN | Ligand-Aware Design | Ligand-aware sequence redesign for small-molecule binders | Active: in ligand binder evaluation
BioEmu | Dynamics | Protein conformational ensembles, 1000s of structures/hour/GPU. Predicts folding stability (ΔG), captures domain motion & local unfolding | Planned: assess binder robustness to target flexibility
Why BioEmu Matters

BioEmu (Microsoft Research, Science 2025) generates conformational ensembles 10,000–100,000× faster than molecular dynamics. A designed binder must work not just against the crystal structure, but across the target's dynamic conformations. BioEmu can generate these ensembles in minutes, enabling the agent to test binder robustness as part of the design loop.

Detailed Reports: PD-L1 Demo · Sweep Results · Config Diversity
Models: Proteina-Complexa · BioEmu (Science 2025) · ProteinMPNN
Related: Tinker Explorer: RL agent for budget-constrained sequential decisions (parallel theme: adaptive compute allocation under step budgets)
Pipeline Validation: PD-L1 Binder Demo
Goal

Validate the Proteina-Complexa pipeline end-to-end: can we generate protein binder candidates and score them with AlphaFold2 on our hardware?

First run: PD-L1 target (Programmed Death-Ligand 1), 100 diffusion steps, 2 samples, best-of-n search with 1 replica. Deliberately minimal to test the pipeline, not the quality.

Results

Metric | Sample 0 (n=262) | Sample 1 (n=234)
total_reward | −0.831 | −0.881
i_PAE | 0.831 | 0.881
pLDDT | 0.212 | 0.239
RMSD | 11.25 Å | 45.21 Å

82.92 seconds total on 1× H100 (41.5s per sample). Pipeline functional.

Quality Thresholds (Demo vs Production)

Both samples fail production thresholds. Expected at 100 steps with no refinement:

Criterion | Demo Result | Threshold | Status
i_PAE × 31 ≤ 7.0 | 25.76 / 27.32 | ≤ 7.0 | FAIL
pLDDT ≥ 0.9 | 0.212 / 0.239 | ≥ 0.9 | FAIL
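The two thresholds in the table can be expressed as one check. The function name and argument names are illustrative; the scale factor 31 and the cutoffs 7.0 and 0.9 are the ones stated above, and the inputs are the demo values from the results table.

```python
def passes_production(i_pae, plddt, ipae_scale=31.0, ipae_max=7.0, plddt_min=0.9):
    """Return True only if both quality criteria from the table hold."""
    return i_pae * ipae_scale <= ipae_max and plddt >= plddt_min


# (i_PAE, pLDDT) for the two demo samples reported above.
demo = {"sample_0": (0.831, 0.212), "sample_1": (0.881, 0.239)}
status = {name: passes_production(i_pae, plddt) for name, (i_pae, plddt) in demo.items()}
```

Both demo samples fail both criteria, matching the FAIL entries in the table.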
Takeaway

Pipeline works. Quality requires production settings: 400 steps, more samples, beam search, and refinement. Full report →

Beam Search Configuration Sweep (5 configs × 10 proteins)
Question

How do different beam search strategies (varying checkpoint frequency, branching factor, and length sampling) affect binder quality across diverse protein targets?

5 Beam Search Configurations

All configs use 400 diffusion steps and produce 32 final PDBs per experiment. They differ in how they allocate compute during search:

Config | nsamples | beam_width | n_branch | Checkpoints | Strategy
A_balanced | 4 | 8 | 4 | [0, 100, 200, 300, 400] | Paper default
B_length_explorer | 16 | 2 | 4 | [0, 100, 200, 300, 400] | Max length diversity
C_deep_exploiter | 2 | 16 | 4 | [0, 100, 250, 400] | Deep per-length optimization
D_early_brancher | 4 | 8 | 8 | [0, 200, 400] | Wide early exploration
E_fine_selector | 4 | 8 | 3 | [0, 65, 130, 200, 270, 340, 400] | Frequent fine-grained pruning
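One way to read the table: in every row, nsamples × beam_width equals 32, which is consistent with the stated 32 final PDBs per experiment. Whether the pipeline derives the final count exactly this way is an assumption; the sketch below only checks the arithmetic on the values in the table.

```python
configs = {
    "A_balanced": {"nsamples": 4, "beam_width": 8, "n_branch": 4,
                   "checkpoints": [0, 100, 200, 300, 400]},
    "B_length_explorer": {"nsamples": 16, "beam_width": 2, "n_branch": 4,
                          "checkpoints": [0, 100, 200, 300, 400]},
    "C_deep_exploiter": {"nsamples": 2, "beam_width": 16, "n_branch": 4,
                         "checkpoints": [0, 100, 250, 400]},
    "D_early_brancher": {"nsamples": 4, "beam_width": 8, "n_branch": 8,
                         "checkpoints": [0, 200, 400]},
    "E_fine_selector": {"nsamples": 4, "beam_width": 8, "n_branch": 3,
                        "checkpoints": [0, 65, 130, 200, 270, 340, 400]},
}

# Final candidates per experiment, assuming finals = nsamples * beam_width.
finals = {name: c["nsamples"] * c["beam_width"] for name, c in configs.items()}

# Checkpoints after step 0, the count used in the sweep rankings
# ("6 checkpoints" for E_fine_selector, "2 checkpoints" for D_early_brancher).
checkpoints_after_start = {name: len(c["checkpoints"]) - 1 for name, c in configs.items()}
```

The configs therefore hold the number of final candidates fixed and trade off length diversity (nsamples) against per-length depth (beam_width) and pruning frequency (checkpoints).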

10 protein targets: PD-1, PD-L1, IFNAR2, CD45, Claudin-1, CrSAS-6, DerF7, BetV1, SpCas9, HER2

Scale: 50 experiments across 8 GPUs, ~93 min runtime. 26,560 total samples (1,600 finals).

Final Rankings (50 of 50 Complete)

Rank | Config | Mean Reward | Best i_PAE | Mean pLDDT
1 | E_fine_selector | −0.2475 | 0.137 | 0.076
2 | A_balanced | −0.2571 | 0.133 | 0.095
3 | C_deep_exploiter | −0.2700 | 0.141 | 0.093
4 | D_early_brancher | −0.2792 | 0.130 | 0.094
5 | B_length_explorer | −0.3501 | 0.136 | 0.112
Note

D_early_brancher initially appeared #1 (mean −0.2119) when only 4/10 proteins had completed, all easy targets. With full data, it drops to #4. Partial results can mislead.

Protein Difficulty Ranking

Difficulty | Protein | Best i_PAE | Best Config
Easy | IFNAR2 | 0.130 | D_early_brancher
Easy | PD-1 | 0.138 | E_fine_selector
Easy | PD-L1 | 0.150 | C_deep_exploiter
Easy | CrSAS-6 | 0.156 | E_fine_selector
Medium | CD45 | 0.159 | E_fine_selector
Medium | SpCas9 | 0.168 | E_fine_selector
Medium | DerF7 | 0.175 | E_fine_selector
Medium | Claudin-1 | 0.193 | D_early_brancher
Hard | BetV1 | 0.205 | E_fine_selector
Hard | HER2 | 0.222 | E_fine_selector

Full sweep report with figures →

Config Diversity: Per-Protein Optimality & Sequence Space
The Claim

The optimal beam search configuration is protein-dependent. The winner changes depending on both the target and the selection criterion (top-1 vs top-16 pool). Different configs explore different regions of sequence space.

Full Per-Protein Config Ranking (10 × 5)

Ranking each config 1–5 per protein reveals that no config is universally best or worst. The #1 position is held by 3 different configs across the 10 proteins:

Protein | #1 (Best) | #2 | #3 | #4 | #5 (Worst)
PD-1 | E_fine | A_bal | B_len | D_early | C_deep
PD-L1 | C_deep | E_fine | A_bal | D_early | B_len
IFNAR2 | D_early | A_bal | E_fine | B_len | C_deep
CD45 | E_fine | A_bal | C_deep | D_early | B_len
Claudin-1 | D_early | C_deep | A_bal | E_fine | B_len
CrSAS-6 | E_fine* | D_early* | B_len | A_bal | C_deep
DerF7 | D_early* | E_fine* | C_deep | B_len | A_bal
BetV1 | E_fine* | D_early* | B_len | A_bal | C_deep
SpCas9 | E_fine | C_deep | D_early | A_bal | B_len
HER2 | E_fine | A_bal | B_len | C_deep | D_early

* Near-ties (< 0.001 difference), effectively equivalent, winner depends on random seed.

Top-1 vs Top-16: The Winner Changes

When switching from “which config finds the single best binder?” to “which config produces the best pool of 16 candidates?”, the winner changes for 4 out of 10 proteins:

Protein | Top-1 Winner | Top-16 Pool Winner | Same?
CD45 | E_fine | A_balanced | NO
Claudin-1 | D_early | E_fine | NO
DerF7 | D_early | E_fine | NO
SpCas9 | E_fine | C_deep | NO
PD-1 | E_fine | E_fine | Yes
PD-L1 | C_deep | C_deep | Yes
IFNAR2 | D_early | D_early | Yes
CrSAS-6 | E_fine | E_fine | Yes
BetV1 | E_fine | E_fine | Yes
HER2 | E_fine | E_fine | Yes
Why this matters

If you plan to experimentally test 16 candidates, optimizing for top-1 gives you the wrong config for 40% of targets. The selection criterion must match the experimental plan.
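A small example shows how the two criteria can disagree. The i_PAE values and config names are hypothetical, and scoring a pool by the mean of its k best candidates is an assumption; the source does not state how the top-16 pool is scored.

```python
def top1(scores):
    """Best single candidate (lowest i_PAE)."""
    return min(scores)


def pool_mean(scores, k):
    """Mean i_PAE of the k best candidates (assumed pool criterion)."""
    best_k = sorted(scores)[:k]
    return sum(best_k) / len(best_k)


# Hypothetical i_PAE values per config (lower is better).
candidates = {
    "config_X": [0.12, 0.30, 0.31, 0.32],  # one standout, weak remainder
    "config_Y": [0.15, 0.16, 0.17, 0.18],  # no standout, uniformly good
}

winner_top1 = min(candidates, key=lambda name: top1(candidates[name]))
winner_pool = min(candidates, key=lambda name: pool_mean(candidates[name], k=4))
```

Here config_X wins on top-1 while config_Y wins on the pool, which is the same kind of flip reported for CD45, Claudin-1, DerF7, and SpCas9.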

Sequence Diversity

Beyond differing in binder quality, configs generate candidates that are distinct in both structure and sequence:

  • Length distributions differ: B_length_explorer samples 16 binder lengths (broadest); C_deep_exploiter samples only 2 (narrowest). Same target, very different structural coverage.
  • Inter-config sequence identity is lower than intra-config: Sequences from different configs share less identity than sequences within the same config. The search strategies explore different regions of the design landscape.
  • Amino acid composition varies subtly: Different configs show 2–5% differences in hydrophobic/polar/charged residue fractions, reflecting different design biases.
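The inter-config versus intra-config comparison can be sketched as mean pairwise identity. This assumes equal-length, already aligned sequences, which real binders of varying length would not satisfy without an alignment step; the sequences below are short made-up strings, not designs from the sweep.

```python
from itertools import combinations, product


def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_intra(seqs):
    """Mean identity over all pairs within one config."""
    pairs = list(combinations(seqs, 2))
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


def mean_inter(seqs_a, seqs_b):
    """Mean identity over all cross-config pairs."""
    pairs = list(product(seqs_a, seqs_b))
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy sequences for two configs.
config_a = ["MKLVAE", "MKLVAD", "MKLIAE"]
config_b = ["GSTPWY", "GSTPWF", "GSSPWY"]

intra_a = mean_intra(config_a)
inter_ab = mean_inter(config_a, config_b)
```

When inter_ab is well below intra_a, pooling the two configs adds sequences that neither config would have produced alone.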

Practical Implications

Your Goal | Recommendation
Test 1 candidate | Use E_fine_selector (wins 6–7/10 proteins)
Test 16 candidates | Check the per-protein winner (it varies)
Hard target (high CV) | Run 2–3 configs and pool. Config choice is critical
Easy target (low CV) | Any config works. Save compute
Maximize diversity | Multi-config pooling. Different configs = different sequences

Full config diversity report with all figures →

Agent v1

Agent Loop v1: Closed-Loop Hyperparameter Search

Question

Can an LLM agent autonomously improve binder quality by iteratively selecting configurations, running experiments, analyzing results, and adapting its strategy?

[Figure: v1 search. The LLM agent iterates over 4 hyperparameters per target: beam_width (8 or 16), n_branch (4 or 8), nsamples (4 or 8), checkpoints (standard or fine); 80+ other parameters are untouched by the v1 agent. Loop: LLM picks config (one variable at a time) → run pipeline (~5 min on H100) → score with AF2 (i_PAE, pLDDT) → analyze and decide next config. 21 iterations across 4 protein targets, 1,004 binder candidates.]

Optimization Trajectories

Each target shows a clear improvement trajectory. The agent achieves 10–28% improvement in best i_PAE from baseline:

Target | Runs | Start i_PAE | Best i_PAE | Improvement | Best Config
PD-1 | 4 | 0.166 | 0.133 | -20% | bw=8, nb=4, ns=8, fine ckpts
PD-L1 | 8 | 0.158 | 0.134 | -15% | bw=16, nb=4, ns=4, fine ckpts
IFNAR2 | 5 | 0.145 | 0.130 | -10% | bw=16, nb=8, ns=4, standard ckpts
CD45 | 4 | 0.241 | 0.173 | -28% | bw=16, nb=4, ns=4, standard ckpts
Per-Target Trajectory Detail: Run-by-Run Logs
PD-1 (4 runs)

Run 1: bw=8,nb=4,ns=4,standard → 0.166
Run 2: bw=8,nb=4,ns=4,fine → 0.138 (-17%)
Run 3: bw=8,nb=4,ns=8,fine → 0.133 (-20%)
Run 4: mcts,bw=8,nb=4,ns=8 → 0.138 (regressed)

PD-L1 (8 runs)

Run 1-3: bw=8,nb=4,ns=4 → 0.158 (3 duplicates)
Run 4: bw=8,nb=4,ns=8 → 0.156
Run 5: bw=16,nb=4,ns=4 → 0.165 (worse)
Run 6: bw=8,nb=4,fine → 0.150
Run 7: bw=16,nb=4,fine → 0.134 (-15%)
Run 8: bw=16,nb=8,fine → OOM

IFNAR2 (5 runs)

Run 1: bw=8,nb=4,ns=4 → 0.145
Run 2: bw=16,nb=4,ns=4 → 0.139
Run 3: bw=16,nb=8,ns=4 → 0.130 (-10%)
Run 4: mcts,bw=16,nb=8 → 0.146 (regressed)
Run 5: bw=16,nb=8,ns=8 → 0.133

CD45 (4 runs)

Run 1: bw=8,nb=4,ns=4 → 0.241
Run 2: bw=8,nb=4,fine → 0.180 (-25%)
Run 3: bw=8,nb=4,ns=8,fine → 0.180
Run 4: bw=16,nb=4,fine → 0.173 (-28%)

Key Findings from 21 Runs
Fine checkpoints help most

Switching from standard (5-point) to fine-grained (7-point) checkpoints produced the largest single improvement for PD-1 (-17%) and CD45 (-25%).

MCTS did not help

Tried on PD-1 and IFNAR2, both times regressed. Beam search with larger beam_width consistently outperformed.

Interaction effects matter

PD-L1's best config (fine ckpts + bw=16) combined two changes. The "change one variable at a time" instruction prevented discovering this earlier.

3 duplicate runs wasted 25%

The agent launched identical PD-L1 configs 3 times in one session: no deduplication, no history check. A BO method would never do this.
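A history check of the kind that was missing can be quite small. This is a hypothetical sketch, not the project's code: it keys each run on the target plus a hash of the canonicalized config, so key order in the config does not matter.

```python
import hashlib
import json


def config_key(config):
    """Stable short hash of a config dict (sorted keys, compact JSON)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class RunHistory:
    """Remembers (target, config) pairs that have already been launched."""

    def __init__(self):
        self._seen = set()

    def should_run(self, target, config):
        key = (target, config_key(config))
        if key in self._seen:
            return False  # identical run already launched
        self._seen.add(key)
        return True


history = RunHistory()
proposals = [
    {"beam_width": 8, "n_branch": 4, "nsamples": 4},
    {"nsamples": 4, "n_branch": 4, "beam_width": 8},  # same config, different key order
    {"beam_width": 8, "n_branch": 4, "nsamples": 4},
    {"beam_width": 8, "n_branch": 4, "nsamples": 8},
]
launched = [p for p in proposals if history.should_run("PD-L1", p)]
```

Of the four proposals, only two distinct configs are launched; the repeats are rejected before any GPU time is spent.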


v1 Results and Reflection

The agent achieves 10-28% improvement on every target, but the search strategy has clear limitations:

Narrow search, no surrogate

Same 4 knobs in the same order for every target. Never tried fk-steering, best-of-n, or nsteps<400. A GP-BO would likely match results in ~10 runs vs 21. No surrogate model, no transfer across targets.

Marginal binding quality

Best i_PAE (0.130–0.134) is marginal vs BindCraft validated threshold (<0.10). pLDDT is excellent (0.94–0.96). CD45 fails due to only 1 hotspot residue. ProteinMPNN is disabled. Enabling it is the single highest-impact change.

Wasted compute

GPU utilization ~35–40%. 25% compute wasted on 3 duplicate PD-L1 runs. No auto-retry on OOM, no file locking for concurrent agents. Total cost: ~16 GPU-hours ($47).

Full interactive dashboard with per-run details and Plotly visualizations →

Agent v2

v2 Plan: From Configs to Programs

Superseded

v2 was the transitional design that introduced tree search and InferenceProgram as the unit of evolution. The active track is now v3 below, which keeps v2's tree-search core but makes research ideas first-class alongside the programs themselves, explicitly following the SCORE decomposition.

v1 showed that the agent loop works and that how you spend compute matters more than how much. The question for v2: what if the agent could compose programs instead of picking configs?

1. Representation: Configs → Programs

What is the right abstraction for inference-time computation in generative protein design, and can an LLM reason about it?

The agent currently picks from 5 named presets differing in 4 knobs. But the real design space is 80+ parameters across 6 qualitatively different layers (search, sampling, reward, refinement, filtering, orchestration), and the interesting choices are conditional: "refine only if i_PAE < 0.25", "switch scoring model based on target type", "double beam width if early steps show high variance." These are not hyperparameters. They are programs.
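A conditional such as "refine only if i_PAE < 0.25" is what separates a program from a flat config. The sketch below is one hypothetical way to express it; Stage, run_if, and run_program are illustrative names and do not describe the project's InferenceProgram schema, and the stage runner is a fake that returns canned metrics.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Stage:
    """Hypothetical program stage: overrides plus an optional guard on prior metrics."""
    name: str
    overrides: dict
    run_if: Optional[Callable[[dict], bool]] = None  # None means always run


def run_program(stages, run_stage):
    """Run stages in order, skipping any stage whose guard rejects the metrics so far."""
    metrics = {}
    executed = []
    for stage in stages:
        if stage.run_if is not None and not stage.run_if(metrics):
            continue
        metrics.update(run_stage(stage))
        executed.append(stage.name)
    return executed, metrics


program = [
    Stage("scout", {"nsteps": 100, "beam_width": 4}),
    Stage("refine", {"nsteps": 400}, run_if=lambda m: m["i_pae"] < 0.25),
]

# Fake runners standing in for a real pipeline call.
weak_scout = lambda stage: {"i_pae": 0.30} if stage.name == "scout" else {"i_pae": 0.28}
good_scout = lambda stage: {"i_pae": 0.20} if stage.name == "scout" else {"i_pae": 0.15}

executed_weak, _ = run_program(program, weak_scout)
executed_good, metrics_good = run_program(program, good_scout)
```

With a weak scout result the refine stage is skipped and its compute is saved; with a promising scout result the program invests in refinement.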

2. Search: Hill-Climbing → Population Evolution

Can LLM-guided program evolution outperform Bayesian optimization in structured, high-dimensional spaces where the search objects are programs, not vectors?

The v1 agent uses the same 4 knobs in the same order for every target, with no memory across sessions. A population of programs with lineage tracking, crossover, and deduplication turns the agent from a sequential optimizer into an evolutionary researcher that builds on its own history.

3. Scaling Law: More Compute → Better Designs?

Is there an inference-time scaling law for protein design, and can an LLM agent find it?

The beam search sweep showed that E_fine_selector beats B_length_explorer by 42% with the same GPU budget. v2 tests a stronger claim: that an agent composing multi-stage programs (scout → invest → pool) can turn additional compute into monotonically better designs, the way test-time compute scaling works for LLM reasoning.


Experiment: Graphify for Codebase Understanding

A key bottleneck for the agent is understanding the Proteina-Complexa codebase well enough to reason about which parameters exist, how they interact, and what new configurations are possible. The codebase has thousands of lines of Hydra configs, model definitions, and pipeline scripts. Reading these files sequentially is slow and loses structural relationships.

We will test whether Graphify, a tool that converts code, documentation, and configs into a knowledge graph with clustered communities, can help. The idea:

Build the graph

Run Graphify on the Proteina-Complexa codebase: Hydra configs, model classes, pipeline scripts, scoring modules. Extract entities (parameters, functions, config groups) and relationships (parameter → affects → pipeline stage).

Query during search

When the agent decides which parameter to try next, it queries the knowledge graph instead of reading raw files. "What parameters affect beam search scoring?" returns a structured subgraph, not thousands of lines of code.

Measure the difference

Compare agent performance with and without the graphified knowledge base: does it discover more of the 80+ parameter space? Does it avoid duplicate runs? Does it find better configs in fewer iterations?

Hypothesis

The graphified knowledge base will help the agent discover parameters it currently ignores (noise schedules, guidance weights, sampling modes, refinement settings) and reason about parameter interactions that are invisible when reading configs sequentially.


Orthogonal Experiments Backlog

Items about program representation, deduplication, 80+ parameter surface, population-based search, and LLM-vs-BO benchmarking have moved to v3, which now owns them. What remains here is the orthogonal experiment backlog — work that is not about the agent loop itself but about the downstream pipeline, conformational robustness, scoring models, and codebase tooling.

Priority | Task | Status
P0 | Enable ProteinMPNN redesign + scRMSD filter in protein binder pipeline (identified as single highest-impact change in v1 reflection) | Pending
P0 | Complete ligand binder sweep: 18/20 experiments remaining (configs B-E for all 4 ligand targets) | In progress
P1 | Run Graphify on Proteina-Complexa codebase: build knowledge graph, evaluate whether it improves the mutator's ability to find uncovered parameters | Pending
P2 | Integrate BioEmu for conformational ensemble robustness testing | Pending
P2 | Cross-model orchestration: agent selects AF2 vs RF3 vs Boltz2 scoring per target | Pending
Agent v3

v3: From Programs to Ideas + Programs

v2 asked: what if the agent composes programs instead of picking configs? v3 asks a tighter question: what if the agent also evolves the scientific ideas behind those programs, and the search tree knows about both? v3 is an explicit implementation of the SCORE decomposition (Aygun et al. 2025), narrowed to Proteina-Complexa inference-time binder design.

v3 System Equation

AutoresearchV3 = ScorableTask + ResearchIdeaLibrary + LLM Mutator + Sandbox Executor + Tree Search + Fixed Evaluator

The Six Primitives

The v3 loop makes every SCORE primitive a first-class host object. Each expansion step touches all six:

[Figure: SCORE-style loop (host-owned vs. LLM-touched components). ScorableTask: fixed problem definition (target, data, baseline, metric). ResearchIdeaLibrary: first-class hypotheses (persisted, recombinable, with lineage). LLM Mutator: idea-aware expansion, produces ChildProposal(idea, spec, code). Sandbox Executor: materialized root, restricted imports, runs ExecutableInferenceProgram. Fixed Evaluator: n_successful (primary), plus i_pae, plddt, diversity, wall-time. Tree Search: PUCT / rank-score over (task, ideas, program, score), selects the next expansion. Joint search state: SearchState = (ScorableTask, ResearchIdea[], ExecutableInferenceProgram, ProgramEvaluation?). The tree searches over ideas + code, not code alone.]
v3 loop. Host owns every persisted object; the LLM only produces ChildProposals that must clear validation before hitting the sandbox. The dashed feedback edge is the SCORE outer loop: tree search picks which (idea, program) node to expand next.
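For readers unfamiliar with PUCT, the selection rule can be sketched as below. This is the standard PUCT form (value plus a prior-weighted exploration bonus); how v3 defines each node's value and prior, and how it combines PUCT with rank-score, is not stated in the text, so the node fields here are assumptions.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Standard PUCT: exploitation term q plus prior-weighted exploration bonus."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_child(children, c_puct=1.5):
    """Pick the child node with the highest PUCT score.

    parent_visits is approximated as the sum of child visits (a simplification).
    """
    parent_visits = sum(c["visits"] for c in children)
    return max(
        children,
        key=lambda c: puct_score(c["q"], c["prior"], parent_visits, c["visits"], c_puct),
    )


# Hypothetical (idea, program) nodes: one well explored, one never expanded.
children = [
    {"id": "explored", "q": 0.6, "prior": 0.5, "visits": 10},
    {"id": "fresh", "q": 0.4, "prior": 0.5, "visits": 0},
]

explore_pick = select_child(children)["id"]
greedy_pick = select_child(children, c_puct=0.0)["id"]
```

With the exploration constant on, the unvisited node is expanded first; with c_puct set to 0 the rule degenerates to greedy selection on q.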

Research Ideas as First-Class Objects

The most important conceptual shift from v2 is that a ResearchIdea is not code, not a program, and not a prompt fragment. It is a persisted, hashable, recombinable hypothesis object with its own lineage. Every program in the tree points back to one or more ideas, and mutations are typed by what they do to those ideas.

Idea schema
ResearchIdea:
  idea_id
  title
  summary
  hypothesis
  implementation_hints
  source_type  # user|llm|paper|recombination
  parent_idea_ids
  tags
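The schema above maps naturally onto a frozen dataclass. The field names are taken from the schema; the field types, the defaults, the hash-based idea_id, and the recombine helper are assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass
from typing import Literal

SourceType = Literal["user", "llm", "paper", "recombination"]


@dataclass(frozen=True)
class ResearchIdea:
    idea_id: str
    title: str
    summary: str
    hypothesis: str
    implementation_hints: tuple[str, ...] = ()
    source_type: SourceType = "llm"
    parent_idea_ids: tuple[str, ...] = ()
    tags: tuple[str, ...] = ()


def recombine(a, b, title, hypothesis):
    """Hypothetical helper: build a child idea that records both parents in its lineage."""
    digest = hashlib.sha256(f"{title}\n{hypothesis}".encode("utf-8")).hexdigest()
    return ResearchIdea(
        idea_id=digest[:12],
        title=title,
        summary=f"Recombination of {a.title!r} and {b.title!r}",
        hypothesis=hypothesis,
        implementation_hints=a.implementation_hints + b.implementation_hints,
        source_type="recombination",
        parent_idea_ids=(a.idea_id, b.idea_id),
        tags=tuple(sorted(set(a.tags) | set(b.tags))),
    )


wide_early = ResearchIdea("idea-a", "Diversify early, exploit late", "Wider early branching.",
                          "Early diversity improves final pools.", tags=("search",))
reranker = ResearchIdea("idea-b", "Secondary reranker", "AF2 confidence + compactness.",
                        "A second criterion reduces false positives.", tags=("reward",))
child = recombine(wide_early, reranker, "Wide early + rerank late",
                  "Early diversity pays off only with a stricter final reranker.")
```

Freezing the dataclass makes each idea hashable, which fits the description of ideas as persisted, hashable, recombinable objects.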
Idea examples (from Plan.md)
  • diversify early with wider branching, exploit late with aggressive reranking
  • use AF2 confidence + interface compactness as a secondary reranker
  • replace hard top-k beam selection with stochastic temperature-weighted selection
  • warm-start from high-confidence partial trajectories
Idea-aware mutation operators

Mutations are typed by their effect on the idea set, not just the code:

  • idea_preserving
  • idea_refining
  • idea_branching
  • idea_recombination
  • module_synthesis / module_edit
  • program_recombination
Why this matters

Without first-class ideas, an agent loop collapses into parent program → LLM prompt → child program. That misses the part of SCORE that makes it a research loop rather than a code-perturbation loop. Separating the idea from its implementation lets the tree reason about hypothesis diversity independently from code diversity, and lets two different programs that embody the same idea be recognized as such.


Content-Addressed Program Identity

Every persisted artifact in v3 has a content hash, and program identity folds the idea choice directly into the program:

spec_hash    = sha256(canonical(ProgramSpec \ {lineage_fields}))
bundle_hash  = sha256(canonical(CodeBundle))
program_hash = sha256(spec_hash + bundle_hash)
program_id   = program_hash[:12]

# selected_idea_ids ∈ spec_hash
# ⇒ picking different ideas yields a different program_id
# ⇒ the tree dedupes on (ideas + code), not code alone

This is what makes the joint search state above tractable: nodes are still program-centric for deduplication, but the expansion policy can reason about ideas separately because the idea choice is already baked into the identity.
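The hashing scheme above can be sketched with the standard library. The canonical form (sorted-key compact JSON) and the lineage field names are assumptions; the hash composition and the 12-character program_id follow the definitions given.

```python
import hashlib
import json

# Hypothetical lineage field names; the source only says lineage fields are excluded.
LINEAGE_FIELDS = {"parent_program_id", "created_at"}


def canonical(obj):
    """Canonical bytes: sorted keys, compact separators (assumed canonical form)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def program_id(spec, bundle):
    """program_id = sha256(spec_hash + bundle_hash)[:12], with lineage fields excluded."""
    spec_core = {k: v for k, v in spec.items() if k not in LINEAGE_FIELDS}
    spec_hash = sha256_hex(canonical(spec_core))
    bundle_hash = sha256_hex(canonical(bundle))
    program_hash = sha256_hex((spec_hash + bundle_hash).encode("utf-8"))
    return program_hash[:12]


bundle = {"reranker.py": "def rerank(xs):\n    return sorted(xs)\n"}
spec_a = {"selected_idea_ids": ["idea-a"], "beam_width": 8, "parent_program_id": "p1"}
spec_a_relineaged = {"selected_idea_ids": ["idea-a"], "beam_width": 8, "parent_program_id": "p2"}
spec_b = {"selected_idea_ids": ["idea-b"], "beam_width": 8, "parent_program_id": "p1"}

same_program = program_id(spec_a, bundle) == program_id(spec_a_relineaged, bundle)
different_ideas = program_id(spec_a, bundle) != program_id(spec_b, bundle)
```

Changing only a lineage field leaves the program_id unchanged, while changing selected_idea_ids produces a new identity, which is the deduplication behaviour described above.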


Implementation Status

v3 lives at Autoresearch_Denovo/autoresearch_v3/ as a standalone package with its own Plan.md, ImplementationSpec.md, and pyproject.toml. Snapshot as of 2026-04-14 (verified by reading each module):

Layer | Status | Detail
Artifacts & hashing | Done | ScorableTask, ResearchIdea, ProgramSpec, StageSpec, ModuleRef, CodeAsset, CodeBundle, ExecutableInferenceProgram, ProgramEvaluation: canonical JSON + content hashing, selected_idea_ids ∈ spec_hash
Persistence (RunStoreV3 + IdeaBank) | Done | atomic JSON writes for task / idea / asset / bundle / program / validation / evaluation / tree; check_duplicate gate
Host control surface | Done | SCHEMA_VERSION / VALIDATOR_VERSION / EVALUATOR_VERSION, allowed + banned import prefixes, patch whitelist, MAX_IDEAS_PER_PROGRAM
Validator (static) | Done | task + idea refs + program structure + AST parse + import allow/ban + patch format + patch whitelist + entry_symbol presence
Validator (runtime smoke) | Pending | Plan.md §9.2 step 8: tiny import + instantiate check before real-run
Compiler | Done | ProgramSpec → normalized Hydra overrides + resolved module manifest (built-in refs + generated_reranker/etc. resolved against bundle assets)
Sandbox Executor | Dry-run verified | prepare_execution + run_single_stage (dry & real code paths). Real-run subprocess launches proteinfoundation.generate with sourced env.sh, CUDA_VISIBLE_DEVICES, and sandboxed PYTHONPATH. No recorded v3 GPU runs yet.
Evaluator (fixed) | Done | n_successful from reward CSVs: i_pae < 0.18, plddt_log > 0.90, sample_type == "final"; sequence-diversity diagnostic
Mutator (deterministic + LLM) | Done | propose_idea_branch, propose_param_mutation, propose_program_recombination, plus full LLM path: build_prompt_request → build_llm_prompts → parse_llm_proposal (strict JSON contract) → ChildProposal
LLM backend | Done | V1LLMClientAdapter (reuses v1's multi-provider client from autoresearch_common.llm) + StaticJSONClient for deterministic tests
SearchDriver + CLI | Done | initialize_task · run_single_step · run_n_steps; python -m autoresearch_v3 {init|step|loop} with --model / --static-payload / --real-run flags
Tree state | Done | TreeNode / TreeState: add_node, record_evaluation, best_nodes sort by (rank_key, mean_value), persistence to tree.json
Tree selection policy (PUCT / UCB) | Minimal | driver currently walks child-of-best; no UCB expansion yet; port from v2 plan pending
End-to-end dry-run | 2026-04-13 | full artifact chain persisted in _tmp_store_check/; compiled manifest in _tmp_exec_dryrun/exec_cd74ed879378/ (see below)
Unit tests | 9 passing | tests/test_core.py: hash stability, validator, compiler ref resolution, mutator (branch / recombine / param), prompt builder, LLM proposal parse, end-to-end driver with dedup
Real GPU run of v3 on a target | Pending | dry-run entries exist in run_store/v3/ (5 tree nodes, 2 evaluated); first real GPU run blocked only by scheduling, not code
Multi-stage execution + warm-start | Pending | executor runs stage 0 only; Plan.md milestone 7
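
The fixed evaluator's success criterion in the table above can be illustrated with a short counting function. This is a sketch, not the v3 evaluator: the column names i_pae, plddt_log, and sample_type and the two thresholds come from the table, while the assumption that all three conditions must hold together, the CSV layout, and the function name are made up for the example.

```python
import csv
import io
from typing import Iterable, Mapping

I_PAE_MAX = 0.18  # success requires i_pae strictly below this
PLDDT_MIN = 0.90  # success requires plddt_log strictly above this


def count_successful(rows: Iterable[Mapping[str, str]]) -> int:
    """Count final samples that pass both confidence thresholds.

    Assumption: the three conditions are combined with a logical AND.
    """
    n_successful = 0
    for row in rows:
        if row["sample_type"] != "final":
            continue  # intermediate beam-search samples are ignored
        if float(row["i_pae"]) < I_PAE_MAX and float(row["plddt_log"]) > PLDDT_MIN:
            n_successful += 1
    return n_successful


# Made-up reward rows for illustration only; not results from any run.
DEMO_CSV = """sample_type,i_pae,plddt_log
final,0.12,0.93
final,0.25,0.95
intermediate,0.10,0.95
final,0.15,0.88
final,0.17,0.91
"""

demo_count = count_successful(csv.DictReader(io.StringIO(DEMO_CSV)))
```

On the demo rows only the first and last pass: the second fails on i_pae, the third is not a final sample, and the fourth fails on plddt_log.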

First end-to-end dry-run (2026-04-13)

The scratch directories capture one complete dry-run pass through the full loop: the host registers a ScorableTask for PD-L1 and seeds an IdeaBank; the mutator (via StaticJSONClient) emits a ChildProposal with a generated reranker asset; the validator passes the program; the compiler resolves the module manifest; the executor materializes it into a sandbox; and the evaluator and tree record the result. This exercises every host object in the six-primitives diagram above.

_tmp_exec_dryrun/exec_cd74ed879378/manifest.json
{
  "task_id": "task_demo",
  "target": "02_PDL1",
  "program_id": "cd74ed879378",
  "selected_idea_ids": ["idea_demo"],
  "stages": [{
    "stage_index": 0,
    "stage_name": "stage0",
    "hydra_overrides": [
      "++generation.args.nsteps=100",
      "++generation.task_name=02_PDL1",
      "++run_name=cd74ed879378"
    ],
    "module_manifest": [
      {"role": "search", "ref_kind": "builtin_search", "name": "beam-search"},
      {"role": "reranker", "ref_kind": "generated_reranker", "name": "simple",
       "asset_id": "reranker_0b0ee075e9b3",
       "module_path": "generated.rerankers.simple",
       "entry_symbol": "SimpleReranker"}
    ]
  }]
}

_tmp_store_check/evaluations/2a22d4b1a2b7.json
{
  "program_id": "2a22d4b1a2b7",
  "task_id": "task_demo",
  "primary_metric": "n_successful",
  "primary_score": 3.0,
  "diagnostics": {"rank_key": 3.5},
  "evaluator_version": 1
}
Host-owned artifacts from a full dry-run pass: the compiler resolved generated_reranker:simple to the asset reranker_0b0ee075e9b3 written into the sandbox, the Hydra overrides were augmented with task_name and run_name from the task, and a mock ProgramEvaluation was persisted. The only missing step before a real GPU run is scheduling.
What unlocks next

The entire loop is code-complete through SearchDriver.run_single_step(dry_run=True). The near-term next steps are: (1) run python -m autoresearch_v3 step --real-run --task-id demo --target 02_PDL1 on a single H100 for the first time, to land the first real-run entry under autoresearch_outputs/run_store/v3/; (2) replace TreeState.best_nodes ranking with a PUCT expansion policy so run_n_steps can select across the whole frontier instead of walking the last child; (3) wire up the validator's runtime smoke step so real-run programs cannot reach subprocess launch without a trivial import check.
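
Step (2) can be made concrete with a small PUCT-style selection sketch. Nothing here is taken from the v3 code or the v2 plan: the NodeStats fields, the c_puct default, the use of the frontier-wide visit total in place of a parent's visit count, and the assumption that mean_value has been normalized to a comparable scale are all illustrative choices.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NodeStats:
    """Hypothetical per-node statistics a selection policy would need."""

    node_id: str
    visits: int  # how many times this node has been expanded
    mean_value: float  # average evaluated score of the node's subtree
    prior: float  # prior preference for expanding this node


def puct_score(node: NodeStats, total_visits: int, c_puct: float = 1.5) -> float:
    """Exploitation term plus a prior-weighted exploration bonus."""
    # The +1 under the square root keeps the bonus non-zero before any visits.
    bonus = c_puct * node.prior * math.sqrt(total_visits + 1) / (1 + node.visits)
    return node.mean_value + bonus


def select_for_expansion(frontier: Sequence[NodeStats], c_puct: float = 1.5) -> NodeStats:
    """Pick the frontier node with the highest PUCT score."""
    if not frontier:
        raise ValueError("frontier is empty")
    total_visits = sum(node.visits for node in frontier)
    return max(frontier, key=lambda node: puct_score(node, total_visits, c_puct))


frontier = [
    NodeStats(node_id="well_explored", visits=10, mean_value=0.5, prior=0.5),
    NodeStats(node_id="unvisited", visits=0, mean_value=0.0, prior=0.5),
]
chosen = select_for_expansion(frontier)
greedy = select_for_expansion(frontier, c_puct=0.0)
```

With the default constant the unvisited node wins on its exploration bonus; setting c_puct to 0 reduces the policy to greedy best-first selection, which is closer to the current walk-the-best behavior.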


SCORE View: Live Search Tree

The SCORE view is an interactive HTML report generated from run_store/v3/tree.json. Each node is a candidate inference program; edges trace parent-child mutations. Clicking a node reveals its research ideas, generated code with syntax highlighting, parent-child diffs, pipeline stages, and execution diagnostics. Below is the current v3 tree topology (5 nodes across 3 independent seed roots, 2 evaluated).

[Figure: SCORE view, v3 search tree (run_store/v3/tree.json). gen 0: seed programs 279076, 84ec20, b69f86 (unscored). gen 1: d2a88e (scored: 0.0), b2aa33 (scored: 0.0).]
Legend: gray = unscored seed program; blue = scored (low); red = scored (high); star = best node.
Interactive SCORE view features:
  • Click node → drawer shows ideas, code bundle, syntax-highlighted source, parent→child diff, diagnostics
  • Zoom / pan → mousewheel zoom, drag to pan, +/−/fit buttons (scales to 100+ node trees)
  • Progress chart → cumulative best score over generation; clickable data points select the corresponding node
  • Auto-refresh → live server polls run_store every 60s, browser auto-reloads (analysis/serve.py)
v3 SCORE view search tree (current state: 5 nodes, 3 seed roots, 2 evaluated at score 0.0; dry-run only). The interactive HTML report is generated by analysis/v3/build_score_view.py with shared viz components in analysis/viz/. As real GPU runs populate the tree, nodes fill with color and the progress chart tracks improvement.