No Universal Best Config: Per-Protein Optimality & Sequence Diversity

Date: 2026-04-05  |  Data: 50/50 experiments (5 configs × 10 proteins, 1,600 finals)  |  Claim: Optimal search config is protein-dependent; configs explore different sequence space

Central Claim

The optimal beam search configuration depends on the protein target, both for which config finds the single best binder (top-1) and for which produces the best candidate pool (top-16). The winner changes for 4/10 proteins when switching selection criteria. Different configs also produce binders that differ in length, sequence identity, and amino-acid composition.

1. Per-Protein Config Ranking

Configs ranked 1-5 per protein. Left: by best single sample (top-1). Right: by mean of best 16 (top-16 pool). No config is universally best or worst.

Rank heatmap: green=#1 (best), red=#5 (worst). Rankings shift between top-1 and top-16 criteria.
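
The two ranking criteria described above can be sketched as follows. This is a minimal illustration, not the analysis code behind the heatmap: the function name, the input layout (a dict of per-sample scores per config), and the assumption that lower scores are better (as for an error metric such as i_PAE) are all hypothetical, and the numbers in the usage note are made up.

```python
from statistics import mean


def rank_configs(scores_by_config, k=1, lower_is_better=True):
    """Rank configs for one protein; rank 1 is best.

    scores_by_config: dict mapping config name -> list of per-sample scores
                      (hypothetical layout, not the report's data format).
    k: 1 scores each config by its single best sample (top-1);
       16 scores it by the mean of its best 16 samples (top-16 pool).
    lower_is_better: assumption; set to False for a reward-style metric.
    """
    summary = {}
    for config, scores in scores_by_config.items():
        # Best samples first, then average the best k of them.
        ordered = sorted(scores, reverse=not lower_is_better)
        summary[config] = mean(ordered[:k])
    ranked = sorted(summary, key=summary.get, reverse=not lower_is_better)
    return {config: rank for rank, config in enumerate(ranked, start=1)}
```

With toy scores such as `{"A_bal": [0.10, 0.40, 0.50], "E_fine": [0.20, 0.21, 0.22]}`, `A_bal` ranks first for `k=1` because of its single strong sample, while `E_fine` ranks first for `k=2` because its pool is more consistent. That is the kind of rank shift the heatmap shows between the two criteria.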

2. Top-1 vs Top-16: The Winner Changes

Red lines = winner changes between top-1 and top-16 selection. 4/10 proteins affected.
Protein   | Top-1 Winner | Top-16 Pool Winner | Same?
CD45      | E_fine       | A_bal              | No
Claudin1  | D_early      | E_fine             | No
DerF7     | D_early*     | E_fine             | No
SpCas9    | E_fine       | C_deep             | No
PD1       | E_fine       | E_fine             | Yes
PDL1      | C_deep       | C_deep             | Yes
IFNAR2    | D_early      | D_early            | Yes
CrSAS6    | E_fine*      | E_fine             | Yes
BetV1     | E_fine*      | E_fine             | Yes
HER2_AAV  | E_fine       | E_fine             | Yes

SpCas9: E_fine_selector finds the single best binder, but C_deep_exploiter produces the better top-16 pool (0.192 vs 0.193). The 0.001 margin is about as small as the near ties noted below, so this is only weak evidence that deep per-length optimization yields a tighter candidate set.

*Near ties: CrSAS6, DerF7, BetV1 have <0.001 difference between D_early and E_fine for top-1. These are effectively equivalent — the winner depends on random seed.
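
One way to make the near-tie caveat explicit is to report, alongside each winner, any config within the 0.001 tolerance mentioned above. The sketch below is an illustration under assumptions: the function names and input dicts are hypothetical, lower scores are assumed to be better, and the example values are invented rather than taken from the sweep.

```python
def winner_with_ties(summary, tol=0.001, lower_is_better=True):
    """Return (winner, near_tied) for one protein under one criterion.

    summary: dict config -> summary score (e.g. top-1 or top-16 mean).
    near_tied: other configs whose score is within tol of the winner's.
    """
    pick = min if lower_is_better else max
    winner = pick(summary, key=summary.get)
    best = summary[winner]
    near_tied = sorted(
        config for config, score in summary.items()
        if config != winner and abs(score - best) < tol
    )
    return winner, near_tied


def compare_criteria(top1, top16, tol=0.001, lower_is_better=True):
    """Compare the top-1 winner with the top-16 winner for one protein."""
    w1, ties1 = winner_with_ties(top1, tol, lower_is_better)
    w16, ties16 = winner_with_ties(top16, tol, lower_is_better)
    changed = w1 != w16
    # Treat a change as robust only if neither winner was a near tie
    # with the other under either criterion.
    robust = changed and w16 not in ties1 and w1 not in ties16
    return {"top1": w1, "top16": w16, "changed": changed, "robust": robust}
```

For a DerF7-like case, where D_early edges out E_fine on top-1 by less than the tolerance and E_fine wins top-16, `changed` is true but `robust` is false, which separates seed-level flips from genuine changes of winner.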

3. Config Sensitivity Varies by Protein

CV of i_PAE across configs. Higher = config choice matters more. Hard targets are most config-sensitive.
Pattern: Hard targets (BetV1 CV=0.648, CD45 CV=0.503) are highly config-dependent. Easy targets (IFNAR2 CV=0.027) give similar results regardless of config. Search strategy matters most when the target is hardest.
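
For reference, the coefficient of variation is the standard deviation divided by the mean. The sketch below assumes one i_PAE summary value per config for a given protein and uses the population standard deviation; the report does not state which per-config value or which standard-deviation convention was used, so both are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def config_cv(value_by_config):
    """Coefficient of variation of one protein's scores across configs.

    value_by_config: dict config -> summary score (e.g. best i_PAE).
    Uses the population standard deviation (an assumption); swap in
    statistics.stdev for the sample version.
    """
    values = list(value_by_config.values())
    center = mean(values)
    if center == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / center
```

A protein whose five configs all land near the same score gives a CV close to 0, as reported for IFNAR2; widely spread scores push the CV up, as reported for BetV1 and CD45.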

4. Configs Produce Different Sequences

4a. Length Distributions

Binder length distributions by config. B_length_explorer is broadest (16 lengths); C_deep_exploiter narrowest (2 lengths).
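
The "number of lengths" figure quoted in the caption can be obtained by counting distinct binder lengths per config. A minimal sketch, assuming the final sequences are available as a dict of strings per config (a hypothetical layout):

```python
from collections import Counter


def length_profile(sequences_by_config):
    """Summarize binder lengths per config.

    Returns dict config -> {"n_lengths": distinct length count,
                            "counts": {length: number of sequences}}.
    """
    profile = {}
    for config, sequences in sequences_by_config.items():
        counts = Counter(len(seq) for seq in sequences)
        profile[config] = {
            "n_lengths": len(counts),
            "counts": dict(sorted(counts.items())),
        }
    return profile
```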

4b. Sequence Identity: Within vs Between Configs

Intra-config identity (colored) vs inter-config identity (black). Inter-config is lower = different sequence space.
AA composition by config. Different search strategies lead to subtly different residue preferences.
Configs explore different sequence space. Inter-config sequence identity is systematically lower than intra-config identity. This means running multiple configs and pooling candidates yields a more diverse set than running one config with equivalent compute.
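
The intra- versus inter-config comparison can be sketched as below. The report does not say how sequence identity was computed, so this uses Python's `difflib.SequenceMatcher` as a rough stand-in for a proper pairwise alignment, normalized by the longer sequence; that choice, the function names, and the input layout are all assumptions made for illustration. Sequences are assumed to be non-empty.

```python
from difflib import SequenceMatcher
from itertools import combinations, product
from statistics import mean


def identity(a, b):
    """Approximate fractional identity: matched residues / longer length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))


def intra_inter_identity(sequences_by_config):
    """Return (intra, inter).

    intra: dict config -> mean identity over pairs within that config
           (configs with fewer than two sequences are skipped).
    inter: mean identity over all pairs drawn from two different configs;
           requires at least two configs.
    """
    intra = {
        config: mean([identity(a, b) for a, b in combinations(seqs, 2)])
        for config, seqs in sequences_by_config.items()
        if len(seqs) > 1
    }
    inter_values = [
        identity(a, b)
        for (_, seqs_1), (_, seqs_2) in combinations(sequences_by_config.items(), 2)
        for a, b in product(seqs_1, seqs_2)
    ]
    return intra, mean(inter_values)
```

If the inter-config mean is lower than every intra-config mean, pooling across configs adds sequences that a single config would be unlikely to produce, which is the pattern the figure reports.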

Implications

Goal                  | Recommendation
Test 1 candidate      | Use E_fine_selector (best or near-tied top-1 for 7/10 proteins)
Test 16 candidates    | Check the per-protein winner; the pool-quality leader varies
Hard target (high CV) | Run 2-3 configs and pool; config choice is critical
Easy target (low CV)  | Any config works; save compute
Maximize diversity    | Pool across configs; different configs yield different sequences

Appendix: Sweep Rankings (all 50 experiments)

Best i_PAE heatmap with gold winners per protein.
Config rankings: mean reward and win counts.

Generated by Analyst Team — 2026-04-05 — Full analysis (markdown)  |  Sweep report