Autoresearch De Novo: LLM Agents for Protein Design
April 2026 — Ongoing
Can LLM agents guide efficient and effective de novo protein design for drug discovery — on top of various protein and small-molecule generative models?
Powerful protein generative models now exist — Proteina-Complexa for binder design, BioEmu for conformational ensembles, AlphaFold for structure prediction. But having a good model is not enough. These models have dozens of hyperparameters (search strategies, diffusion steps, branching factors, refinement settings) that interact with the target in complex ways. We show that no single configuration works best across all targets — and that the gap between good and bad hyperparameters can be larger than the gap between models.
This motivates a new role for LLM agents: not just as chatbots, but as autonomous research assistants that design experiments, sweep hyperparameters, analyze results, and adaptively choose the next experiment — closing the loop between model and target.
Motivation: Good Models Need Good Hyperparameters
We ran 50 experiments (5 beam search configs × 10 protein targets) with Proteina-Complexa. The results make the case clearly:
- E_fine_selector wins 7/10 targets, but D_early_brancher wins IFNAR2 and C_deep_exploiter wins PD-L1. The #1 config varies per target.
- 4/10 proteins have a different optimal config depending on whether you pick the single best candidate (top-1) or pool the top 16.
- BetV1 (reward CV = 0.65): config choice changes reward by 50%. IFNAR2 (CV = 0.03): any config works. (CV = coefficient of variation of reward across configs.)
- Different search strategies produce different sequences, so multi-config pooling beats spending the same compute on a single config.
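A target's sensitivity to config choice can be summarized by the coefficient of variation of its per-config rewards. A minimal sketch — the reward lists below are illustrative, not values from the sweep:

```python
import statistics

def config_sensitivity(rewards):
    """CV = population std / |mean| of per-config mean rewards for one target.
    High CV -> config choice matters; low CV -> any config works."""
    mean = statistics.mean(rewards)
    return statistics.pstdev(rewards) / abs(mean)

# Hypothetical per-config mean rewards for two targets (illustrative only).
hard_target = [-0.20, -0.45, -0.30, -0.55, -0.25]    # spread out -> high CV
easy_target = [-0.25, -0.26, -0.24, -0.255, -0.245]  # tight -> low CV

assert config_sensitivity(hard_target) > config_sensitivity(easy_target)
```

An agent could use exactly this statistic after a cheap scout pass to decide whether a target deserves a multi-config sweep or a single run.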
The same hyperparameter sensitivity applies to small-molecule drug design. Diffusion-based molecular generators (e.g., DiffDock, TargetDiff) face analogous challenges: sampling strategies, noise schedules, and scoring functions interact with the binding pocket in target-dependent ways. We plan to extend this agent framework to ligand-based design, demonstrating that the need for adaptive hyperparameter search is a general principle across modalities.
A human researcher running Proteina-Complexa on a new drug target faces a combinatorial hyperparameter space: 5+ search algorithms × checkpoint schedules × branching factors × refinement strategies. Running all combinations for every target is wasteful. An LLM agent that understands the target–config interaction could adaptively allocate compute — running cheap scouts first, then investing in the most promising strategy.
Current Pipeline: Proteina-Complexa
[Target PDB + Hotspots] → Flow Matching (400 steps) → Beam Search → AF2 Scoring → Candidate PDBs
| Stage | What it does | Key hyperparameters |
|---|---|---|
| Generate | Flow-matching diffusion in partially latent space | nsteps, guidance_w, self_cond |
| Search | Beam search with branching at checkpoints | beam_width, n_branch, checkpoint schedule, nsamples |
| Score | AF2 multimer reward (i_PAE) | num_recycles, reward weights |
| Refine | Sequence hallucination (test-time optimization) | n_iters, loss weights, temperature |
Hardware: 8× NVIDIA H100 80GB | Targets: 44 pre-configured proteins | Scale: 26,560 samples from 50 experiments
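The search space the agent faces can be made concrete by flattening the stage table into one configuration object. A sketch — the field names and placeholder values below mirror the table but are assumptions, not the actual Proteina-Complexa config schema:

```python
# Hypothetical config mirroring the stage table above; names and values
# are illustrative, not the real Proteina-Complexa schema.
PIPELINE_CONFIG = {
    "generate": {"nsteps": 400, "guidance_w": 1.0, "self_cond": True},
    "search":   {"beam_width": 8, "n_branch": 4,
                 "checkpoints": [0, 100, 200, 300, 400], "nsamples": 4},
    "score":    {"num_recycles": 3, "reward_weights": {"i_PAE": 1.0}},
    "refine":   {"n_iters": 100, "loss_weights": {"plddt": 1.0},
                 "temperature": 0.1},
}

def flat_hyperparams(cfg):
    """Flatten stage -> parameter pairs to expose the full search space."""
    return {f"{stage}.{name}": value
            for stage, params in cfg.items()
            for name, value in params.items()}
```

Even this reduced view yields a dozen interacting knobs per run — the combinatorial space the Motivation section describes.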
Evidence: Beam Search Sweep (50/50 Complete)
| Rank | Config | Strategy | Mean Reward | Wins (top-1) |
|---|---|---|---|---|
| 1 | E_fine_selector | 6 checkpoints, 3 branches | −0.2475 | 7/10 |
| 2 | A_balanced | Paper default | −0.2571 | 0/10 |
| 3 | C_deep_exploiter | 2 lengths, 16 beams | −0.2700 | 1/10 |
| 4 | D_early_brancher | 8 branches, 2 checkpoints | −0.2792 | 2/10 |
| 5 | B_length_explorer | 16 lengths, 2 beams | −0.3501 | 0/10 |
The gap between best and worst config (0.103 mean reward) is substantial — equivalent to a ~42% relative difference. For hard targets like BetV1, the gap is even larger. Choosing the right hyperparameters matters as much as choosing the right model.
Central Result: Per-Protein Optimality
The optimal config depends on both the target and what you optimize for:
| Protein | Top-1 Winner | Top-16 Pool Winner | Same? |
|---|---|---|---|
| CD45 | E_fine | A_balanced | NO |
| Claudin-1 | D_early | E_fine | NO |
| DerF7 | D_early | E_fine | NO |
| SpCas9 | E_fine | C_deep | NO |
The other 6 proteins have the same winner under both criteria.
Implication for LLM Agents
This is exactly the kind of problem LLM agents excel at: high-dimensional configuration spaces, target-dependent optima, and multi-objective trade-offs. An agent can run cheap scouts, learn target–config patterns from prior experiments, and adaptively allocate compute — something a static pipeline cannot do.
Roadmap
Phase 1: Baseline & Evidence (Current)
- Pipeline validation — Proteina-Complexa end-to-end on PD-L1
- Beam search sweep — 5 configs × 10 proteins, 50 experiments
- Config diversity analysis — per-protein optimality, sequence diversity
- Refinement sweep — add sequence_hallucination to top configs
- Full pipeline evaluation — filter → evaluate → analyze (scRMSD, ProteinMPNN)
Phase 2: Agent-Guided Search
- Scout → Invest strategy — cheap 100-step scouts across configs, then full 400-step runs on winners
- Target-adaptive config selection — agent learns target–config mapping from prior sweeps
- Multi-config pooling — agent selects 2–3 complementary configs per target for maximum diversity
- Closed-loop refinement — agent analyzes AF2 scores, adjusts refinement parameters, re-runs
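The scout-then-invest strategy at the top of this phase can be sketched as a small control loop. `run_pipeline` is a hypothetical wrapper around the actual pipeline, and the budget split (scout all, invest in 2) is illustrative:

```python
def scout_then_invest(target, configs, run_pipeline, n_invest=2):
    """Run cheap 100-step scouts across all configs, then spend the
    remaining budget on full 400-step runs for the best performers.

    `run_pipeline(target, config, nsteps)` is a hypothetical hook assumed
    to return a mean reward for the run.
    """
    scout_scores = {name: run_pipeline(target, cfg, nsteps=100)
                    for name, cfg in configs.items()}
    # Higher mean reward is better (in the sweep, -0.2475 outranks -0.3501).
    winners = sorted(scout_scores, key=scout_scores.get, reverse=True)[:n_invest]
    return {name: run_pipeline(target, configs[name], nsteps=400)
            for name in winners}
```

The same skeleton extends naturally to closed-loop refinement: replace the second stage with a loop that inspects AF2 scores and adjusts refinement parameters before re-running.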
Phase 3: Multi-Model Agent
- Integrate BioEmu — use conformational ensembles to assess binder robustness across target dynamics
- Small-molecule extension — adapt the agent framework for ligand-based drug design
- Cross-model orchestration — agent selects which model to use for which target/task
Model Landscape
The agent framework is designed to work across multiple protein and molecular models:
| Model | Type | What it does | Status |
|---|---|---|---|
| Proteina-Complexa | Binder Design | Flow-matching de novo binder generation with beam search | Active — 50 experiments complete |
| AlphaFold2 | Scoring | Structure prediction & binding quality assessment (i_PAE, pLDDT) | Active — reward model in pipeline |
| ProteinMPNN | Inverse Folding | Sequence redesign for designability evaluation | Available — in evaluate stage |
| BioEmu | Dynamics | Protein conformational ensembles — 1000s of structures/hour/GPU. Predicts folding stability (ΔG), captures domain motion & local unfolding | Planned — assess binder robustness to target flexibility |
| LigandMPNN | Ligand-Aware Design | Sequence design aware of small-molecule ligands | Available — in Complexa pipeline |
BioEmu (Microsoft Research, Science 2025) generates conformational ensembles 10,000–100,000× faster than molecular dynamics. A designed binder must work not just against the crystal structure, but across the target's dynamic conformations. BioEmu can generate these ensembles in minutes, enabling the agent to test binder robustness as part of the design loop.
Timeline
| Date | Phase | Key Result |
|---|---|---|
| Apr 2 | Environment setup | Proteina-Complexa + conda env + checkpoints |
| Apr 3 | Pipeline validation | PD-L1 demo: 2 samples in 83s |
| Apr 4–5 | Beam search sweep | 50/50 experiments. E_fine_selector wins 7/10 |
| Apr 5 | Config diversity analysis | 4/10 targets: winner changes by criterion |
| Next | Refinement sweep | sequence_hallucination on top configs |
| Next | Full pipeline evaluation | Designability (scRMSD, ProteinMPNN) |
| Future | Agent-guided search | Scout → invest, adaptive config selection |
| Future | BioEmu integration | Conformational robustness testing |
Models: Proteina-Complexa · BioEmu (Science 2025) · ProteinMPNN
Pipeline Validation: PD-L1 Binder Demo
April 3, 2026
Validate the Proteina-Complexa pipeline end-to-end: can we generate protein binder candidates and score them with AlphaFold2 on our hardware?
First run: PD-L1 target (Programmed Death-Ligand 1), 100 diffusion steps, 2 samples, best-of-n search with 1 replica. Deliberately minimal to test the pipeline, not the quality.
Results
| Metric | Sample 0 (n=262) | Sample 1 (n=234) |
|---|---|---|
| total_reward | −0.831 | −0.881 |
| i_PAE | 0.831 | 0.881 |
| pLDDT | 0.212 | 0.239 |
| RMSD | 11.25 Å | 45.21 Å |
82.92 seconds total on 1× H100 (41.5s per sample). Pipeline functional.
Quality Thresholds (Demo vs Production)
Both samples fail production thresholds — expected at 100 steps with no refinement:
| Criterion | Demo Result | Threshold | Status |
|---|---|---|---|
| i_PAE × 31 ≤ 7.0 | 25.76 / 27.32 | ≤ 7.0 | FAIL |
| pLDDT ≥ 0.9 | 0.212 / 0.239 | ≥ 0.9 | FAIL |
Pipeline works. Reaching the thresholds requires production settings: 400 steps, more samples, beam search, and refinement.
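The pass/fail check in the table reduces to two predicates. A minimal sketch, applying the thresholds exactly as stated (the ×31 factor is taken directly from the table's criterion):

```python
def passes_production(i_pae, plddt, pae_cutoff=7.0, plddt_cutoff=0.9):
    """Production quality gate from the thresholds table:
    i_PAE * 31 <= 7.0 and pLDDT >= 0.9."""
    return (i_pae * 31 <= pae_cutoff) and (plddt >= plddt_cutoff)

# Demo samples from the PD-L1 run: both fail, as expected at 100 steps.
assert not passes_production(0.831, 0.212)  # sample 0
assert not passes_production(0.881, 0.239)  # sample 1
```

A production candidate would need i_PAE ≤ ~0.226 (7.0 / 31) together with pLDDT ≥ 0.9 to clear this gate.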
Beam Search Configuration Sweep
April 4–5, 2026
How do different beam search strategies — varying checkpoint frequency, branching factor, and length sampling — affect binder quality across diverse protein targets?
5 Beam Search Configurations
All configs use 400 diffusion steps and produce 32 final PDBs per experiment. They differ in how they allocate compute during search:
| Config | nsamples | beam_width | n_branch | Checkpoints | Strategy |
|---|---|---|---|---|---|
| A_balanced | 4 | 8 | 4 | [0, 100, 200, 300, 400] | Paper default |
| B_length_explorer | 16 | 2 | 4 | [0, 100, 200, 300, 400] | Max length diversity |
| C_deep_exploiter | 2 | 16 | 4 | [0, 100, 250, 400] | Deep per-length optimization |
| D_early_brancher | 4 | 8 | 8 | [0, 200, 400] | Wide early exploration |
| E_fine_selector | 4 | 8 | 3 | [0, 65, 130, 200, 270, 340, 400] | Frequent fine-grained pruning |
10 protein targets: PD-1, PD-L1, IFNAR2, CD45, Claudin-1, CrSAS-6, DerF7, BetV1, SpCas9, HER2
Scale: 50 experiments across 8 GPUs, ~93 min total runtime. 26,560 total samples (1,600 final PDBs).
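A useful invariant of the table above: every config holds the final-candidate count fixed while redistributing search compute, so the rankings compare strategy rather than output count. A sketch encoding the configs as plain dicts:

```python
BEAM_CONFIGS = {
    "A_balanced":        {"nsamples": 4,  "beam_width": 8,  "n_branch": 4,
                          "checkpoints": [0, 100, 200, 300, 400]},
    "B_length_explorer": {"nsamples": 16, "beam_width": 2,  "n_branch": 4,
                          "checkpoints": [0, 100, 200, 300, 400]},
    "C_deep_exploiter":  {"nsamples": 2,  "beam_width": 16, "n_branch": 4,
                          "checkpoints": [0, 100, 250, 400]},
    "D_early_brancher":  {"nsamples": 4,  "beam_width": 8,  "n_branch": 8,
                          "checkpoints": [0, 200, 400]},
    "E_fine_selector":   {"nsamples": 4,  "beam_width": 8,  "n_branch": 3,
                          "checkpoints": [0, 65, 130, 200, 270, 340, 400]},
}

# Same output budget everywhere: nsamples * beam_width = 32 final PDBs.
for cfg in BEAM_CONFIGS.values():
    assert cfg["nsamples"] * cfg["beam_width"] == 32
```

What varies is *where* compute goes: B spreads it over 16 lengths, C concentrates it into 16 beams on 2 lengths, and E prunes at 7 checkpoints instead of 3–5.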
Final Rankings (50/50 Complete)
| Rank | Config | Mean Reward | Best i_PAE | Mean pLDDT |
|---|---|---|---|---|
| 1 | E_fine_selector | −0.2475 | 0.137 | 0.076 |
| 2 | A_balanced | −0.2571 | 0.133 | 0.095 |
| 3 | C_deep_exploiter | −0.2700 | 0.141 | 0.093 |
| 4 | D_early_brancher | −0.2792 | 0.130 | 0.094 |
| 5 | B_length_explorer | −0.3501 | 0.136 | 0.112 |
D_early_brancher initially appeared #1 (mean −0.2119) when only 4/10 proteins had completed — all easy targets. With full data, it drops to #4. Partial results can mislead.
Protein Difficulty Ranking
| Difficulty | Protein | Best i_PAE | Best Config |
|---|---|---|---|
| Easy | IFNAR2 | 0.130 | D_early_brancher |
| Easy | PD-1 | 0.138 | E_fine_selector |
| Easy | PD-L1 | 0.150 | C_deep_exploiter |
| Easy | CrSAS-6 | 0.156 | E_fine_selector |
| Medium | CD45 | 0.159 | E_fine_selector |
| Medium | SpCas9 | 0.168 | E_fine_selector |
| Medium | DerF7 | 0.175 | E_fine_selector |
| Medium | Claudin-1 | 0.193 | D_early_brancher |
| Hard | BetV1 | 0.205 | E_fine_selector |
| Hard | HER2 | 0.222 | E_fine_selector |
No Universal Best Config: Per-Protein Optimality & Sequence Diversity
April 5, 2026
The optimal beam search configuration is protein-dependent. The winner changes depending on both the target and the selection criterion (top-1 vs top-16 pool). Different configs explore different regions of sequence space.
Per-Protein Config Ranking
Ranking each config 1–5 per protein reveals that no config is universally best or worst. The #1 position is held by 3 different configs across the 10 proteins:
| Protein | #1 (Best) | #2 | #3 | #4 | #5 (Worst) |
|---|---|---|---|---|---|
| PD-1 | E_fine | A_bal | B_len | D_early | C_deep |
| PD-L1 | C_deep | E_fine | A_bal | D_early | B_len |
| IFNAR2 | D_early | A_bal | E_fine | B_len | C_deep |
| CD45 | E_fine | A_bal | C_deep | D_early | B_len |
| Claudin-1 | D_early | C_deep | A_bal | E_fine | B_len |
| CrSAS-6 | E_fine* | D_early* | B_len | A_bal | C_deep |
| DerF7 | D_early* | E_fine* | C_deep | B_len | A_bal |
| BetV1 | E_fine* | D_early* | B_len | A_bal | C_deep |
| SpCas9 | E_fine | C_deep | D_early | A_bal | B_len |
| HER2 | E_fine | A_bal | B_len | C_deep | D_early |
* Near-ties (< 0.001 difference) — effectively equivalent, winner depends on random seed.
Top-1 vs Top-16: The Winner Changes
When switching from “which config finds the single best binder?” to “which config produces the best pool of 16 candidates?”, the winner changes for 4 out of 10 proteins:
| Protein | Top-1 Winner | Top-16 Pool Winner | Same? |
|---|---|---|---|
| CD45 | E_fine | A_balanced | NO |
| Claudin-1 | D_early | E_fine | NO |
| DerF7 | D_early | E_fine | NO |
| SpCas9 | E_fine | C_deep | NO |
| PD-1 | E_fine | E_fine | Yes |
| PD-L1 | C_deep | C_deep | Yes |
| IFNAR2 | D_early | D_early | Yes |
| CrSAS-6 | E_fine | E_fine | Yes |
| BetV1 | E_fine | E_fine | Yes |
| HER2 | E_fine | E_fine | Yes |
If you plan to experimentally test 16 candidates, optimizing for top-1 gives you the wrong config for 40% of targets. The selection criterion must match the experimental plan.
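The two criteria can be made precise: rank configs by their single best reward, or by the mean of their best k. A sketch with hypothetical per-config reward lists showing how the winner can flip:

```python
def rank_configs(rewards_by_config, k=1):
    """Rank configs by the mean of their top-k rewards (higher is better,
    matching the sweep's reward convention). k=1 asks 'which config finds
    the single best binder?'; k=16 asks 'which produces the best pool?'."""
    def top_k_mean(rewards):
        best = sorted(rewards, reverse=True)[:k]
        return sum(best) / len(best)
    return sorted(rewards_by_config,
                  key=lambda c: top_k_mean(rewards_by_config[c]),
                  reverse=True)

# Hypothetical target: config X has one standout hit but a weak tail,
# config Y is consistently decent -- the winner flips with k.
rewards = {
    "X": [-0.10, -0.50, -0.55, -0.60],
    "Y": [-0.20, -0.22, -0.24, -0.26],
}
assert rank_configs(rewards, k=1)[0] == "X"
assert rank_configs(rewards, k=4)[0] == "Y"
```

This is the CD45 / Claudin-1 / DerF7 / SpCas9 pattern in miniature: the selection criterion, not just the target, determines the winner.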
Sequence Diversity: Configs Explore Different Space
Beyond finding different quality binders, configs generate structurally and sequentially distinct candidates:
- Length distributions differ: B_length_explorer samples 16 binder lengths (broadest); C_deep_exploiter samples only 2 (narrowest). Same target, very different structural coverage.
- Inter-config sequence identity is lower than intra-config: Sequences from different configs share less identity than sequences within the same config. The search strategies explore different regions of the design landscape.
- Amino acid composition varies subtly: Different configs show 2–5% differences in hydrophobic/polar/charged residue fractions, reflecting different design biases.
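The second bullet's inter- vs intra-config comparison can be quantified with pairwise sequence identity. A minimal sketch for equal-length sequences — real binder sets vary in length and would need pairwise alignment first:

```python
from itertools import combinations

def seq_identity(a, b):
    """Fraction of matching positions (assumes equal-length sequences;
    variable-length binders would need alignment first)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def intra_config_identity(seqs):
    """Mean pairwise identity within one config's candidate set."""
    pairs = list(combinations(seqs, 2))
    return sum(seq_identity(a, b) for a, b in pairs) / len(pairs)

def inter_config_identity(seqs_a, seqs_b):
    """Mean identity across all pairs drawn from two different configs."""
    total = sum(seq_identity(a, b) for a in seqs_a for b in seqs_b)
    return total / (len(seqs_a) * len(seqs_b))
```

If mean inter-config identity falls below mean intra-config identity, the two search strategies are sampling different regions of sequence space — the pattern reported above.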
Practical Implications
| Your Goal | Recommendation |
|---|---|
| Test 1 candidate | Use E_fine_selector (wins 6–7/10 proteins) |
| Test 16 candidates | Check the per-protein winner — it varies |
| Hard target (high CV) | Run 2–3 configs and pool — config choice is critical |
| Easy target (low CV) | Any config works — save compute |
| Maximize diversity | Multi-config pooling — different configs = different sequences |