No Universal Best Config: Per-Protein Optimality & Sequence Diversity
Date: 2026-04-05
Data: 50/50 experiments (5 configs × 10 proteins, 1,600 finals)
Claim: Optimal search config is protein-dependent; configs explore different sequence space
Central Claim
The optimal beam search configuration depends on the protein target — both which config finds the
single best binder (top-1) and which produces the best candidate pool (top-16).
The winner changes for 4/10 proteins when switching selection criteria.
Different configs produce binders that are distinct in both structure and sequence.
1. Per-Protein Config Ranking
Configs ranked 1-5 per protein. Left: by best single sample (top-1). Right: by mean of best 16 (top-16 pool).
No config is universally best or worst.
Rank heatmap: green=#1 (best), red=#5 (worst). Rankings shift between top-1 and top-16 criteria.
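The two ranking criteria can be sketched in a few lines. This is a minimal illustration rather than the pipeline's actual code: the function name, the input layout (a list of final scores per config), and the assumption that lower scores are better are all hypothetical, and the numbers are made up.

```python
from statistics import mean


def rank_configs(scores_by_config, k=1, lower_is_better=True):
    """Rank configs by the mean of their k best scores (k=1 is top-1).

    scores_by_config: dict mapping config name -> list of final scores
    for one protein (hypothetical layout, not the report's data format).
    """
    summary = {}
    for config, scores in scores_by_config.items():
        # Keep the k best scores under the chosen direction.
        best_k = sorted(scores, reverse=not lower_is_better)[:k]
        summary[config] = mean(best_k)
    ordered = sorted(summary, key=summary.get, reverse=not lower_is_better)
    return {config: rank for rank, config in enumerate(ordered, start=1)}


# Made-up scores for a single protein, purely for illustration.
example = {
    "config_x": [0.10, 0.40, 0.45, 0.50],  # one excellent sample, weak pool
    "config_y": [0.15, 0.16, 0.17, 0.18],  # no standout, strong pool
}
top1_ranks = rank_configs(example, k=1)
pool_ranks = rank_configs(example, k=4)
print(top1_ranks, pool_ranks)
```

In this toy case config_x wins on top-1 while config_y wins on the pool mean, which is the kind of rank flip the heatmaps show.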
2. Top-1 vs Top-16: The Winner Changes
Red lines = winner changes between top-1 and top-16 selection. 4/10 proteins affected.
Protein   | Top-1 Winner | Top-16 Pool Winner | Same?
CD45      | E_fine       | A_bal              | No
Claudin1  | D_early      | E_fine             | No
DerF7     | D_early*     | E_fine             | No
SpCas9    | E_fine       | C_deep             | No
PD1       | E_fine       | E_fine             | Yes
PDL1      | C_deep       | C_deep             | Yes
IFNAR2    | D_early      | D_early            | Yes
CrSAS6    | E_fine*      | E_fine             | Yes
BetV1     | E_fine*      | E_fine             | Yes
HER2_AAV  | E_fine       | E_fine             | Yes
SpCas9: E_fine_selector finds the single best binder, but C_deep_exploiter produces the better top-16 pool by a narrow margin (0.192 vs 0.193); deep per-length optimization yields a tighter candidate set.
*Near ties: CrSAS6, DerF7, and BetV1 have a top-1 difference of <0.001 between D_early and E_fine.
These are effectively equivalent, and the winner depends on the random seed.
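One way to flag such near ties automatically is to report every config that falls within a tolerance of the winner. The sketch below assumes one summary score per config and a tolerance of 0.001 to match the footnote; the function name, config labels, and scores are hypothetical.

```python
def winner_with_ties(score_by_config, tol=0.001, lower_is_better=True):
    """Return (winner, near_ties) for one protein.

    near_ties lists the configs whose score differs from the winner's
    by less than tol, so a seed-dependent win can be marked with '*'.
    """
    pick = min if lower_is_better else max
    winner = pick(score_by_config, key=score_by_config.get)
    best = score_by_config[winner]
    near_ties = sorted(
        config
        for config, score in score_by_config.items()
        if config != winner and abs(score - best) < tol
    )
    return winner, near_ties


# Made-up top-1 scores for illustration only.
scores = {"config_x": 0.2100, "config_y": 0.2104, "config_z": 0.2500}
winner, near_ties = winner_with_ties(scores)
print(winner, near_ties)
```

A non-empty near_ties list signals that the reported winner should not be treated as a real preference.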
3. Config Sensitivity Varies by Protein
Coefficient of variation (CV) of i_PAE across configs. Higher = config choice matters more. Hard targets are most config-sensitive.
Pattern: Hard targets (BetV1 CV=0.648, CD45 CV=0.503) are highly config-dependent.
Easy targets (IFNAR2 CV=0.027) give similar results regardless of config.
Search strategy matters most when the target is hardest.
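For reference, the coefficient of variation is the standard deviation divided by the mean. The sketch below assumes one i_PAE summary value per config for a given protein and uses the population standard deviation; the report does not say which estimator or which per-config summary was used, so both are assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = population standard deviation / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m


# Made-up per-config values for two hypothetical proteins.
insensitive = [0.20, 0.20, 0.20, 0.20, 0.20]  # every config scores the same
sensitive = [0.10, 0.30, 0.10, 0.30, 0.20]    # results swing with config
print(coefficient_of_variation(insensitive))
print(coefficient_of_variation(sensitive))
```

A protein whose configs all score alike gets a CV near zero, while one whose results swing with the config gets a large CV.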
4. Configs Produce Different Sequences
4a. Length Distributions
Binder length distributions by config. B_length_explorer is broadest (16 lengths); C_deep_exploiter narrowest (2 lengths).
4b. Sequence Identity: Within vs Between Configs
Intra-config identity (colored) vs inter-config identity (black). Inter-config is lower = different sequence space.
AA composition by config. Different search strategies lead to subtly different residue preferences.
Configs explore different sequence space. Inter-config sequence identity is systematically lower than
intra-config identity. This means running multiple configs and pooling candidates yields a more diverse
set than running one config with equivalent compute.
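The intra- versus inter-config comparison can be sketched as follows. Because binders differ in length, a real analysis would typically compute identity from pairwise alignments; this sketch swaps in the similarity ratio from Python's difflib as a rough, alignment-free proxy. The function names and toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations, product
from statistics import mean


def identity(a, b):
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def intra_identity(seqs):
    """Mean pairwise similarity within one config (needs >= 2 sequences)."""
    return mean(identity(a, b) for a, b in combinations(seqs, 2))


def inter_identity(seqs_a, seqs_b):
    """Mean pairwise similarity between two configs' sequences."""
    return mean(identity(a, b) for a, b in product(seqs_a, seqs_b))


# Toy sequences, not real binders.
pool_x = ["MKTAYIAK", "MKTAYIAR"]
pool_y = ["GSHWEELL", "GSHWEELV"]
print(intra_identity(pool_x), inter_identity(pool_x, pool_y))
```

When the inter-config value is lower than both intra-config values, the two configs are sampling different regions of sequence space.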
Implications
Goal                   | Recommendation
Test 1 candidate       | Use E_fine_selector (top-1 winner for 6/10 proteins in the Section 2 table, and within 0.001 of the winner on DerF7)
Test 16 candidates     | Check the per-protein winner; the pool quality leader varies
Hard target (high CV)  | Run 2-3 configs and pool; config choice is critical
Easy target (low CV)   | Any config works; save compute
Maximize diversity     | Multi-config pooling; different configs yield different sequences
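Multi-config pooling can be sketched as merging each config's finals, removing duplicate sequences, and keeping the best k overall. This is an illustration under assumed conventions: finals are (sequence, score) pairs, lower scores are better, and all names and values are hypothetical.

```python
def pool_candidates(finals_by_config, k=16, lower_is_better=True):
    """Merge finals from several configs and return the k best unique sequences.

    finals_by_config: dict mapping config name -> list of (sequence, score).
    Returns a list of (sequence, score, source_config), best first.
    """
    best = {}
    for config, finals in finals_by_config.items():
        for seq, score in finals:
            current = best.get(seq)
            if current is None:
                improved = True
            elif lower_is_better:
                improved = score < current[0]
            else:
                improved = score > current[0]
            if improved:
                # Keep only the best-scoring copy of each sequence.
                best[seq] = (score, config)
    ranked = sorted(
        best.items(), key=lambda item: item[1][0], reverse=not lower_is_better
    )
    return [(seq, score, config) for seq, (score, config) in ranked[:k]]


# Toy finals for illustration only.
finals = {
    "config_x": [("AAA", 0.3), ("CCC", 0.1)],
    "config_y": [("AAA", 0.2), ("GGG", 0.4)],
}
pooled = pool_candidates(finals, k=2)
print(pooled)
```

Keeping the source config alongside each candidate makes it easy to check how much each config contributes to the pooled set.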
Appendix: Sweep Rankings (all 50 experiments)
Best i_PAE heatmap with gold winners per protein.
Config rankings: mean reward and win counts.