Project Overview

Autoresearch De Novo: LLM Agents for Protein Design

April 2026 — Ongoing

The Research Question

Can LLM agents guide efficient and effective de novo protein design for drug discovery — on top of various protein and small-molecule generative models?

Powerful protein generative models now exist — Proteina-Complexa for binder design, BioEmu for conformational ensembles, AlphaFold for structure prediction. But having a good model is not enough. These models have dozens of hyperparameters (search strategies, diffusion steps, branching factors, refinement settings) that interact with the target in complex ways. We show that no single configuration works best across all targets — and that the gap between good and bad hyperparameters can be larger than the gap between models.

This motivates a new role for LLM agents: not just as chatbots, but as autonomous research assistants that design experiments, sweep hyperparameters, analyze results, and adaptively choose the next experiment — closing the loop between model and target.


Motivation: Good Models Need Good Hyperparameters

We ran 50 experiments (5 beam search configs × 10 protein targets) with Proteina-Complexa. The results make the case clearly:

No universal best config

E_fine_selector wins 7/10 targets, but D_early_brancher takes IFNAR2 and Claudin-1, and C_deep_exploiter takes PD-L1. The #1 config varies per target.

Selection criterion matters

4/10 proteins have different optimal configs depending on whether you pick top-1 or pool top-16 candidates.

Hard targets need tuning

BetV1 (CV=0.65): config choice changes reward by 50%. IFNAR2 (CV=0.03): any config works.

Configs explore different space

Different search strategies produce different sequences. Multi-config pooling beats single-config compute.
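The CV numbers above suggest a simple triage heuristic: compute the spread of per-config rewards for a target and only invest in tuning when it is high. A minimal sketch (the exact CV definition used in our analysis, and the example numbers, are illustrative assumptions):

```python
import statistics

def config_sensitivity(rewards_by_config: dict[str, float]) -> float:
    """Coefficient of variation of per-config mean rewards for one target.

    High CV (BetV1 ~ 0.65): config choice matters, invest in tuning.
    Low CV (IFNAR2 ~ 0.03): any config works, save compute.
    """
    vals = list(rewards_by_config.values())
    mean = statistics.fmean(vals)
    # Rewards are negative (i_PAE-based), so normalize by |mean|.
    return statistics.stdev(vals) / abs(mean)
```

For example, `config_sensitivity({"A": -0.20, "B": -0.30})` is roughly 0.28, which would put a target in the "worth tuning" regime.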

Coming Soon: Small Molecules

The same hyperparameter sensitivity applies to small-molecule drug design. Diffusion-based molecular generators (e.g., DiffDock, TargetDiff) face analogous challenges: sampling strategies, noise schedules, and scoring functions interact with the binding pocket in target-dependent ways. We plan to extend this agent framework to ligand-based design, demonstrating that the need for adaptive hyperparameter search is a general principle across modalities.

The Problem This Creates

A human researcher running Proteina-Complexa on a new drug target faces a combinatorial hyperparameter space: 5+ search algorithms × checkpoint schedules × branching factors × refinement strategies. Running all combinations for every target is wasteful. An LLM agent that understands the target–config interaction could adaptively allocate compute — running cheap scouts first, then investing in the most promising strategy.
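To make the combinatorics concrete, here is a hypothetical grid; the option lists are illustrative stand-ins, not the actual Proteina-Complexa choices:

```python
from itertools import product

# Illustrative option lists; the real values live in the pipeline configs.
search_algos    = ["best_of_n", "beam", "diverse_beam", "mcts", "greedy"]
checkpoint_sets = [(0, 200, 400), (0, 100, 200, 300, 400), (0, 65, 130, 200, 270, 340, 400)]
branch_factors  = [3, 4, 8]
refinements     = [None, "sequence_hallucination"]

# Exhaustive sweep size for a single target:
grid = list(product(search_algos, checkpoint_sets, branch_factors, refinements))
print(len(grid))  # 5 * 3 * 3 * 2 = 90 full runs
```

Even this toy grid implies 90 full runs per target, which is exactly the waste an adaptive agent avoids.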


Current Pipeline: Proteina-Complexa

[Target PDB + Hotspots] → Flow Matching (400 steps) → Beam Search → AF2 Scoring → Candidate PDBs
| Stage | What it does | Key hyperparameters |
| --- | --- | --- |
| Generate | Flow-matching diffusion in partially latent space | nsteps, guidance_w, self_cond |
| Search | Beam search with branching at checkpoints | beam_width, n_branch, checkpoint schedule, nsamples |
| Score | AF2 multimer reward (i_PAE) | num_recycles, reward weights |
| Refine | Sequence hallucination (test-time optimization) | n_iters, loss weights, temperature |

Hardware: 8× NVIDIA H100 80GB  |  Targets: 44 pre-configured proteins  |  Scale: 26,560 samples from 50 experiments
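Concretely, one experiment can be described by a per-stage config like the following. Key names mirror the stage table above; the values (and the actual Proteina-Complexa config schema) are hypothetical, except the search settings, which match the A_balanced sweep config:

```python
# Hypothetical per-stage experiment config; keys mirror the stage table above.
experiment = {
    "generate": {"nsteps": 400, "guidance_w": 1.0, "self_cond": True},
    "search":   {"beam_width": 8, "n_branch": 4, "nsamples": 4,
                 "checkpoints": [0, 100, 200, 300, 400]},
    "score":    {"num_recycles": 3, "reward_weights": {"i_pae": 1.0}},
    "refine":   {"n_iters": 100, "temperature": 0.1},
}
```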


Evidence: Beam Search Sweep (50/50 Complete)

| Rank | Config | Strategy | Mean Reward | Wins (top-1) |
| --- | --- | --- | --- | --- |
| 1 | E_fine_selector | 6 checkpoints, 3 branches | −0.2475 | 7/10 |
| 2 | A_balanced | Paper default | −0.2571 | 0/10 |
| 3 | C_deep_exploiter | 2 lengths, 16 beams | −0.2700 | 1/10 |
| 4 | D_early_brancher | 8 branches, 2 checkpoints | −0.2792 | 2/10 |
| 5 | B_length_explorer | 16 lengths, 2 beams | −0.3501 | 0/10 |
Key Insight

The gap between best and worst config (0.103 mean reward) is substantial — equivalent to a ~42% relative difference. For hard targets like BetV1, the gap is even larger. Choosing the right hyperparameters matters as much as choosing the right model.


Central Result: Per-Protein Optimality

The optimal config depends on both the target and what you optimize for:

| Protein | Top-1 Winner | Top-16 Pool Winner | Same? |
| --- | --- | --- | --- |
| CD45 | E_fine | A_balanced | NO |
| Claudin-1 | D_early | E_fine | NO |
| DerF7 | D_early | E_fine | NO |
| SpCas9 | E_fine | C_deep | NO |
6 other proteins: same winner for both criteria

Implication for LLM Agents

This is exactly the kind of problem LLM agents excel at: high-dimensional configuration spaces, target-dependent optima, and multi-objective trade-offs. An agent can run cheap scouts, learn target–config patterns from prior experiments, and adaptively allocate compute — something a static pipeline cannot do.


Roadmap

Phase 1: Baseline & Evidence (Current)

  1. Pipeline validation — Proteina-Complexa end-to-end on PD-L1
  2. Beam search sweep — 5 configs × 10 proteins, 50 experiments
  3. Config diversity analysis — per-protein optimality, sequence diversity
  4. Refinement sweep — add sequence_hallucination to top configs
  5. Full pipeline evaluation — filter → evaluate → analyze (scRMSD, ProteinMPNN)

Phase 2: Agent-Guided Search

  1. Scout → Invest strategy — cheap 100-step scouts across configs, then full 400-step runs on winners
  2. Target-adaptive config selection — agent learns target–config mapping from prior sweeps
  3. Multi-config pooling — agent selects 2–3 complementary configs per target for maximum diversity
  4. Closed-loop refinement — agent analyzes AF2 scores, adjusts refinement parameters, re-runs
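The scout → invest strategy from step 1 can be sketched in a few lines. The `run_experiment` interface is hypothetical; rewards follow the pipeline's convention (negative, i_PAE-based, higher is better):

```python
def scout_then_invest(configs, run_experiment, budget_full=2):
    """Sketch of the Phase 2 scout -> invest loop (hypothetical interface).

    run_experiment(config, nsteps) is assumed to return a mean reward.
    """
    # 1. Cheap scouts: 100-step runs across all configs.
    scout_scores = {c: run_experiment(c, nsteps=100) for c in configs}
    # 2. Invest: full 400-step runs on the top `budget_full` scouts only.
    winners = sorted(scout_scores, key=scout_scores.get, reverse=True)[:budget_full]
    return {c: run_experiment(c, nsteps=400) for c in winners}
```

The open question this defers to step 2 is whether 100-step scout rankings are predictive of 400-step rankings, which the sweep data lets us test.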

Phase 3: Multi-Model Agent

  1. Integrate BioEmu — use conformational ensembles to assess binder robustness across target dynamics
  2. Small-molecule extension — adapt the agent framework for ligand-based drug design
  3. Cross-model orchestration — agent selects which model to use for which target/task

Model Landscape

The agent framework is designed to work across multiple protein and molecular models:

| Model | Type | What it does | Status |
| --- | --- | --- | --- |
| Proteina-Complexa | Binder Design | Flow-matching de novo binder generation with beam search | Active — 50 experiments complete |
| AlphaFold2 | Scoring | Structure prediction & binding quality assessment (i_PAE, pLDDT) | Active — reward model in pipeline |
| ProteinMPNN | Inverse Folding | Sequence redesign for designability evaluation | Available — in evaluate stage |
| BioEmu | Dynamics | Protein conformational ensembles — 1000s of structures/hour/GPU. Predicts folding stability (ΔG), captures domain motion & local unfolding | Planned — assess binder robustness to target flexibility |
| LigandMPNN | Ligand-Aware Design | Sequence design aware of small-molecule ligands | Available — in Complexa pipeline |
Why BioEmu Matters

BioEmu (Microsoft Research, Science 2025) generates conformational ensembles 10,000–100,000× faster than molecular dynamics. A designed binder must work not just against the crystal structure, but across the target's dynamic conformations. BioEmu can generate these ensembles in minutes, enabling the agent to test binder robustness as part of the design loop.


Timeline

| Date | Phase | Key Result |
| --- | --- | --- |
| Apr 2 | Environment setup | Proteina-Complexa + conda env + checkpoints |
| Apr 3 | Pipeline validation | PD-L1 demo: 2 samples in 83s |
| Apr 4–5 | Beam search sweep | 50/50 experiments. E_fine_selector wins 7/10 |
| Apr 5 | Config diversity analysis | 4/10 targets: winner changes by criterion |
| Next | Refinement sweep | sequence_hallucination on top configs |
| Next | Full pipeline evaluation | Designability (scRMSD, ProteinMPNN) |
| Future | Agent-guided search | Scout → invest, adaptive config selection |
| Future | BioEmu integration | Conformational robustness testing |
Day 1

Pipeline Validation: PD-L1 Binder Demo

April 3, 2026

Goal

Validate the Proteina-Complexa pipeline end-to-end: can we generate protein binder candidates and score them with AlphaFold2 on our hardware?

First run: PD-L1 target (Programmed Death-Ligand 1), 100 diffusion steps, 2 samples, best-of-n search with 1 replica. Deliberately minimal to test the pipeline, not the quality.

Results

| Metric | Sample 0 (n=262) | Sample 1 (n=234) |
| --- | --- | --- |
| total_reward | −0.831 | −0.881 |
| i_PAE | 0.831 | 0.881 |
| pLDDT | 0.212 | 0.239 |
| RMSD | 11.25 Å | 45.21 Å |

82.92 seconds total on 1× H100 (41.5s per sample). Pipeline functional.

Quality Thresholds (Demo vs Production)

Both samples fail production thresholds — expected at 100 steps with no refinement:

| Criterion | Demo Result | Threshold | Status |
| --- | --- | --- | --- |
| i_PAE × 31 | 25.76 / 27.32 | ≤ 7.0 | FAIL |
| pLDDT | 0.212 / 0.239 | ≥ 0.9 | FAIL |
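The production gate can be expressed as a one-line filter. (Whether the ×31 scaling is a fixed constant or target-dependent in the real pipeline is an assumption on our part; the thresholds are taken from the table.)

```python
def passes_production(i_pae: float, plddt: float) -> bool:
    """Quality gate from the thresholds table: i_PAE * 31 <= 7.0 and pLDDT >= 0.9.

    Note: the *31 scaling is assumed fixed here; the real pipeline may vary it.
    """
    return (i_pae * 31) <= 7.0 and plddt >= 0.9

# Both demo samples fail, as expected at 100 steps with no refinement:
print(passes_production(0.831, 0.212))  # False
print(passes_production(0.881, 0.239))  # False
```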
Takeaway

Pipeline works. Quality requires production settings: 400 steps, more samples, beam search, and refinement. Full report →

Day 2–3

Beam Search Configuration Sweep

April 4–5, 2026

Question

How do different beam search strategies — varying checkpoint frequency, branching factor, and length sampling — affect binder quality across diverse protein targets?

5 Beam Search Configurations

All configs use 400 diffusion steps and produce 32 final PDBs per experiment. They differ in how they allocate compute during search:

| Config | nsamples | beam_width | n_branch | Checkpoints | Strategy |
| --- | --- | --- | --- | --- | --- |
| A_balanced | 4 | 8 | 4 | [0, 100, 200, 300, 400] | Paper default |
| B_length_explorer | 16 | 2 | 4 | [0, 100, 200, 300, 400] | Max length diversity |
| C_deep_exploiter | 2 | 16 | 4 | [0, 100, 250, 400] | Deep per-length optimization |
| D_early_brancher | 4 | 8 | 8 | [0, 200, 400] | Wide early exploration |
| E_fine_selector | 4 | 8 | 3 | [0, 65, 130, 200, 270, 340, 400] | Frequent fine-grained pruning |
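One invariant worth checking: if finals = nsamples × beam_width (our reading, consistent with the 32 final PDBs stated above), then every config has the same final budget, so the rankings compare search strategy rather than raw output count. A quick sanity check:

```python
# Search settings from the sweep table; finals-per-experiment is assumed to
# be nsamples * beam_width, matching the stated 32 final PDBs.
configs = {
    "A_balanced":        dict(nsamples=4,  beam_width=8,  n_branch=4),
    "B_length_explorer": dict(nsamples=16, beam_width=2,  n_branch=4),
    "C_deep_exploiter":  dict(nsamples=2,  beam_width=16, n_branch=4),
    "D_early_brancher":  dict(nsamples=4,  beam_width=8,  n_branch=8),
    "E_fine_selector":   dict(nsamples=4,  beam_width=8,  n_branch=3),
}

for name, c in configs.items():
    assert c["nsamples"] * c["beam_width"] == 32, name  # matched final budget
```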

10 protein targets: PD-1, PD-L1, IFNAR2, CD45, Claudin-1, CrSAS-6, DerF7, BetV1, SpCas9, HER2

Scale: 50 experiments across 8 GPUs, ~93 min total runtime. 26,560 total samples (1,600 finals).

Final Rankings (50/50 Complete)

| Rank | Config | Mean Reward | Best i_PAE | Mean pLDDT |
| --- | --- | --- | --- | --- |
| 1 | E_fine_selector | −0.2475 | 0.137 | 0.076 |
| 2 | A_balanced | −0.2571 | 0.133 | 0.095 |
| 3 | C_deep_exploiter | −0.2700 | 0.141 | 0.093 |
| 4 | D_early_brancher | −0.2792 | 0.130 | 0.094 |
| 5 | B_length_explorer | −0.3501 | 0.136 | 0.112 |
Note

D_early_brancher initially appeared #1 (mean −0.2119) when only 4/10 proteins had completed — all easy targets. With full data, it drops to #4. Partial results can mislead.

Protein Difficulty Ranking

| Difficulty | Protein | Best i_PAE | Best Config |
| --- | --- | --- | --- |
| Easy | IFNAR2 | 0.130 | D_early_brancher |
| Easy | PD-1 | 0.138 | E_fine_selector |
| Easy | PD-L1 | 0.150 | C_deep_exploiter |
| Easy | CrSAS-6 | 0.156 | E_fine_selector |
| Medium | CD45 | 0.159 | E_fine_selector |
| Medium | SpCas9 | 0.168 | E_fine_selector |
| Medium | DerF7 | 0.175 | E_fine_selector |
| Medium | Claudin-1 | 0.193 | D_early_brancher |
| Hard | BetV1 | 0.205 | E_fine_selector |
| Hard | HER2 | 0.222 | E_fine_selector |

Full sweep report with figures →

Day 3

No Universal Best Config: Per-Protein Optimality & Sequence Diversity

April 5, 2026

The Claim

The optimal beam search configuration is protein-dependent. The winner changes depending on both the target and the selection criterion (top-1 vs top-16 pool). Different configs explore different regions of sequence space.

Per-Protein Config Ranking

Ranking each config 1–5 per protein reveals that no config is universally best or worst. The #1 position is held by 3 different configs across the 10 proteins:

| Protein | #1 (Best) | #2 | #3 | #4 | #5 (Worst) |
| --- | --- | --- | --- | --- | --- |
| PD-1 | E_fine | A_bal | B_len | D_early | C_deep |
| PD-L1 | C_deep | E_fine | A_bal | D_early | B_len |
| IFNAR2 | D_early | A_bal | E_fine | B_len | C_deep |
| CD45 | E_fine | A_bal | C_deep | D_early | B_len |
| Claudin-1 | D_early | C_deep | A_bal | E_fine | B_len |
| CrSAS-6 | E_fine* | D_early* | B_len | A_bal | C_deep |
| DerF7 | D_early* | E_fine* | C_deep | B_len | A_bal |
| BetV1 | E_fine* | D_early* | B_len | A_bal | C_deep |
| SpCas9 | E_fine | C_deep | D_early | A_bal | B_len |
| HER2 | E_fine | A_bal | B_len | C_deep | D_early |

* Near-ties (< 0.001 difference) — effectively equivalent, winner depends on random seed.


Top-1 vs Top-16: The Winner Changes

When switching from “which config finds the single best binder?” to “which config produces the best pool of 16 candidates?”, the winner changes for 4 out of 10 proteins:

| Protein | Top-1 Winner | Top-16 Pool Winner | Same? |
| --- | --- | --- | --- |
| CD45 | E_fine | A_balanced | NO |
| Claudin-1 | D_early | E_fine | NO |
| DerF7 | D_early | E_fine | NO |
| SpCas9 | E_fine | C_deep | NO |
| PD-1 | E_fine | E_fine | Yes |
| PD-L1 | C_deep | C_deep | Yes |
| IFNAR2 | D_early | D_early | Yes |
| CrSAS-6 | E_fine | E_fine | Yes |
| BetV1 | E_fine | E_fine | Yes |
| HER2 | E_fine | E_fine | Yes |
Why this matters

If you plan to experimentally test 16 candidates, optimizing for top-1 gives you the wrong config for 40% of targets. The selection criterion must match the experimental plan.
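The criterion dependence can be made precise: select the config that maximizes the mean reward of its top-k candidates. With illustrative numbers (not real sweep data), a config with one sharp outlier wins at k=1 while a consistently good config wins under pooling:

```python
def pick_config(rewards_by_config: dict[str, list[float]], k: int = 1) -> str:
    """Select the config whose top-k candidates have the highest mean reward."""
    def topk_mean(rewards: list[float]) -> float:
        top = sorted(rewards, reverse=True)[:k]
        return sum(top) / len(top)
    return max(rewards_by_config, key=lambda name: topk_mean(rewards_by_config[name]))

# Illustrative: one sharp outlier vs. a consistently good pool (not real data).
rewards = {
    "E_fine": [-0.10, -0.90, -0.90, -0.95],
    "A_bal":  [-0.20, -0.25, -0.30, -0.35],
}
print(pick_config(rewards, k=1))  # E_fine: best single candidate
print(pick_config(rewards, k=4))  # A_bal: best pool
```

This is the same flip seen on CD45, where E_fine wins top-1 but A_balanced wins the top-16 pool.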


Sequence Diversity: Configs Explore Different Space

Beyond differing in binder quality, the configs generate candidates that are distinct in both structure and sequence.

Practical Implications

| Your Goal | Recommendation |
| --- | --- |
| Test 1 candidate | Use E_fine_selector (wins 6–7/10 proteins) |
| Test 16 candidates | Check the per-protein winner — it varies |
| Hard target (high CV) | Run 2–3 configs and pool — config choice is critical |
| Easy target (low CV) | Any config works — save compute |
| Maximize diversity | Multi-config pooling — different configs = different sequences |

Full config diversity report with all figures →