No Universal Best Config: Per-Protein Optimality & Sequence Diversity

Date: 2026-04-05 | Data: 50/50 experiments (5 configs × 10 proteins, 1,600 finals) | Claim: Optimal search config is protein-dependent; configs explore different sequence space

Central Claim

The optimal beam search configuration depends on the protein target, both for which config finds the single best binder (top-1) and for which config produces the best candidate pool (top-16). The winner changes for 4/10 proteins when switching selection criteria. Different configs also produce binders that differ in length, sequence identity, and residue composition.

1. Per-Protein Config Ranking

Configs ranked 1-5 per protein. Left: by best single sample (top-1). Right: by mean of best 16 (top-16 pool). No config is universally best or worst.

Rank heatmap: green=#1 (best), red=#5 (worst). Rankings shift between top-1 and top-16 criteria.
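
The ranking behind the heatmap can be sketched as follows. This is a minimal illustration, not the analysis code: it assumes each config yields a list of per-sample scores for a protein and that lower scores are better (as is conventional for i_PAE). The function name `rank_configs`, the config names, and the scores are all hypothetical.

```python
from statistics import mean

def rank_configs(scores, k=1, lower_is_better=True):
    """Rank configs for one protein by the mean of each config's best k scores.

    scores: dict mapping config name -> list of per-sample scores.
    Returns a dict mapping config name -> rank (1 = best).
    k=1 corresponds to top-1 selection, k=16 to the top-16 pool.
    """
    def pool_score(values):
        # Keep the k best samples, then average them.
        best = sorted(values, reverse=not lower_is_better)[:k]
        return mean(best)

    ordered = sorted(
        scores,
        key=lambda config: pool_score(scores[config]),
        reverse=not lower_is_better,
    )
    return {config: rank for rank, config in enumerate(ordered, start=1)}

# Made-up scores: cfg_x has one excellent sample, cfg_y is consistently good.
toy_scores = {
    "cfg_x": [0.10, 0.40, 0.45],
    "cfg_y": [0.15, 0.16, 0.17],
}
top1_ranks = rank_configs(toy_scores, k=1)
top2_ranks = rank_configs(toy_scores, k=2)
```

In this toy case the top-1 winner is cfg_x while the pooled winner is cfg_y, which is the same kind of rank shift the heatmap shows between criteria.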

2. Top-1 vs Top-16: The Winner Changes

Red lines = winner changes between top-1 and top-16 selection. 4/10 proteins affected.

Protein	Top-1 Winner	Top-16 Pool Winner	Same?
CD45	E_fine	A_bal	NO
Claudin1	D_early	E_fine	NO
DerF7	D_early*	E_fine	NO
SpCas9	E_fine	C_deep	NO
PD1	E_fine	E_fine	Yes
PDL1	C_deep	C_deep	Yes
IFNAR2	D_early	D_early	Yes
CrSAS6	E_fine*	E_fine	Yes
BetV1	E_fine*	E_fine	Yes
HER2_AAV	E_fine	E_fine	Yes

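SpCas9: E_fine_selector finds the single best binder, but C_deep_exploiter produces the better top-16 pool (top-16 means of 0.192 vs 0.193, a margin of only 0.001). This is consistent with deep per-length optimization yielding a tighter candidate set, though the pool advantage is marginal.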

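*Near ties: CrSAS6, DerF7, and BetV1 have a <0.001 difference between D_early and E_fine for top-1. These are effectively equivalent, and the winner likely depends on the random seed. For DerF7 this also means the top-1 vs top-16 winner change falls within that noise, since E_fine wins the top-16 pool.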
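
The tallies in this section can be rechecked directly from the table. The sketch below transcribes the winner table by hand, with the asterisked entries marked as near ties. The variable names are illustrative, and the "robust" count rests on the assumption that a near-tie winner could flip under a different seed.

```python
from collections import Counter

# (protein, top-1 winner, top-16 pool winner, top-1 is a near tie?)
# Transcribed from the table above; asterisked entries are near ties.
WINNERS = [
    ("CD45",     "E_fine",  "A_bal",   False),
    ("Claudin1", "D_early", "E_fine",  False),
    ("DerF7",    "D_early", "E_fine",  True),
    ("SpCas9",   "E_fine",  "C_deep",  False),
    ("PD1",      "E_fine",  "E_fine",  False),
    ("PDL1",     "C_deep",  "C_deep",  False),
    ("IFNAR2",   "D_early", "D_early", False),
    ("CrSAS6",   "E_fine",  "E_fine",  True),
    ("BetV1",    "E_fine",  "E_fine",  True),
    ("HER2_AAV", "E_fine",  "E_fine",  False),
]

# Proteins whose winner differs between the two selection criteria.
changed = [p for p, top1, top16, _ in WINNERS if top1 != top16]

# Same, but ignoring changes where the top-1 winner was a near tie.
robust_changed = [
    p for p, top1, top16, near_tie in WINNERS if top1 != top16 and not near_tie
]

# Win counts per config under each criterion.
top1_wins = Counter(top1 for _, top1, _, _ in WINNERS)
top16_wins = Counter(top16 for _, _, top16, _ in WINNERS)
```

On the table as transcribed, `changed` contains four proteins and `robust_changed` contains three, DerF7 being the one excluded as a near tie.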

3. Config Sensitivity Varies by Protein

Coefficient of variation (CV) of i_PAE across configs. Higher = config choice matters more. Hard targets are most config-sensitive.

Pattern: Hard targets (BetV1 CV=0.648, CD45 CV=0.503) are highly config-dependent. Easy targets (IFNAR2 CV=0.027) give similar results regardless of config. Search strategy matters most when the target is hardest.
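
For reference, a coefficient of variation is the standard deviation divided by the mean. The sketch below assumes the CV is taken over one i_PAE value per config (for example each config's best score) and uses the population standard deviation. The report does not state either choice, so both are assumptions, and the numbers are made up.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Return population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m

# Made-up per-config scores for two hypothetical targets (5 configs each).
hard_target = [0.20, 0.60, 0.90, 0.30, 0.50]  # results spread widely
easy_target = [0.20, 0.21, 0.20, 0.19, 0.20]  # results nearly identical

cv_hard = coefficient_of_variation(hard_target)
cv_easy = coefficient_of_variation(easy_target)
```

A target whose configs give nearly identical scores gets a CV close to zero, while a target with widely scattered scores gets a large one.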

4. Configs Produce Different Sequences

4a. Length Distributions

Binder length distributions by config. B_length_explorer is broadest (16 lengths); C_deep_exploiter narrowest (2 lengths).

4b. Sequence Identity: Within vs Between Configs

Intra-config identity (colored) vs inter-config identity (black). Inter-config is lower = different sequence space.

AA composition by config. Different search strategies lead to subtly different residue preferences.

Configs explore different sequence space. Inter-config sequence identity is systematically lower than intra-config identity. This suggests that running multiple configs and pooling candidates should yield a more diverse set than running one config with equivalent compute.
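
The within vs between comparison can be illustrated with a small sketch. The report does not say how identity was computed. The code below swaps in the similarity ratio from Python's `difflib.SequenceMatcher` as a rough stand-in for alignment-based sequence identity, and the sequences and config names are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations, product
from statistics import mean

def identity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def intra_identity(seqs):
    """Mean pairwise identity within one config (needs at least 2 sequences)."""
    return mean(identity(a, b) for a, b in combinations(seqs, 2))

def inter_identity(seqs_a, seqs_b):
    """Mean identity over all cross-config sequence pairs."""
    return mean(identity(a, b) for a, b in product(seqs_a, seqs_b))

# Invented binder sequences for two hypothetical configs.
toy = {
    "cfg_x": ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR"],
    "cfg_y": ["GSSPEELLKW", "GSSPEDLLKW", "GSTPEELLKW"],
}

within_x = intra_identity(toy["cfg_x"])
within_y = intra_identity(toy["cfg_y"])
between = inter_identity(toy["cfg_x"], toy["cfg_y"])
```

When `between` is lower than both `within_x` and `within_y`, the two configs are sampling different regions of sequence space, which is the pattern described above.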

Implications

Goal	Recommendation
Test 1 candidate	Use E_fine_selector (best top-1 for 6/10 proteins in the Section 2 table, two of them near ties)
Test 16 candidates	Check per-protein winner — pool quality leader varies
Hard target (high CV)	Run 2-3 configs and pool — config choice is critical
Easy target (low CV)	Any config works; save compute
Maximize diversity	Multi-config pooling — different configs = different sequences
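
Multi-config pooling can be sketched as a merge, de-duplicate, and re-rank step. This is a minimal illustration under stated assumptions: each config returns (sequence, score) pairs, lower scores are better, and duplicates are exact sequence matches. The function name `pool_candidates` and all data are hypothetical.

```python
def pool_candidates(per_config, k=16, lower_is_better=True):
    """Merge candidates from several configs and keep the k best unique ones.

    per_config: dict mapping config name -> list of (sequence, score) pairs.
    Returns a list of (sequence, score, source config), best first.
    """
    best = {}  # sequence -> (score, config) for the best-scoring occurrence
    for config, candidates in per_config.items():
        for seq, score in candidates:
            prev = best.get(seq)
            if prev is None:
                better = True
            elif lower_is_better:
                better = score < prev[0]
            else:
                better = score > prev[0]
            if better:
                best[seq] = (score, config)

    ranked = sorted(
        best.items(), key=lambda item: item[1][0], reverse=not lower_is_better
    )
    return [(seq, score, config) for seq, (score, config) in ranked[:k]]

# Invented candidates; "AAA" is found by both configs with different scores.
toy_runs = {
    "cfg_x": [("AAA", 0.30), ("CCC", 0.10)],
    "cfg_y": [("AAA", 0.20), ("DDD", 0.25)],
}
pooled = pool_candidates(toy_runs, k=2)
```

Keeping the source config alongside each candidate makes it easy to see how much each config contributes to the final pool.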

Appendix: Sweep Rankings (all 50 experiments)

Best i_PAE heatmap with gold winners per protein.

Config rankings: mean reward and win counts.

Generated by Analyst Team, 2026-04-05