Series Overview

Protein LLM: Teaching Language Models to Understand Proteins

March 20, 2026

The Project

Can we teach a language model to effectively leverage protein structural embeddings — going beyond raw amino acid sequences to reason about structure, function, and evolutionary relationships — by bridging a protein encoder with an LLM?

This is a living summary of the Protein LLM project. Each section captures a phase of the work and links to the detailed deep dive below.


The Architecture

We connect two worlds: ESM-3, a protein foundation model that encodes structural and evolutionary knowledge into 1536-dimensional embeddings, and Qwen3-8B-Instruct, a general-purpose LLM. A learnable bridge (pooling + projector) translates between them. ESM-3 stays frozen; the LLM adapts via LoRA.

To isolate what each component contributes, we designed a four-way comparison:

| Approach | How protein enters the LLM | Trainable params |
| --- | --- | --- |
| Text | Raw sequence as <protein>MKTL...</protein> tokens | ~2M (LoRA only) |
| MLP | ESM-3 → AttentionPooling (32 tokens) → MLP projector | ~32.5M |
| Perceiver | ESM-3 → Perceiver Resampler (cross-attention) | ~31.4M |
| Flamingo | ESM-3 → Perceiver → Gated cross-attention at LLM layers | ~120-150M |

We also explored the full training pipeline: SSL (continued pre-training on biology literature) → SFT (instruction fine-tuning on protein tasks) → GRPO (reinforcement learning with verifiable rewards).


Scaling Data: 50K → 4.9 Million Samples

Our first 50K-sample run proved the architecture converges. But one data source wasn't enough. We assembled 4.89 million instruction pairs from six sources.

| Metric | 50K / Qwen3-4B | 4.89M / Qwen3-8B | Change |
| --- | --- | --- | --- |
| eval_loss | 3.64 | 0.361 | -90.1%* |
| BLEU (best) | — | 0.315 | first measurement |
| ROUGE-L (best) | — | 0.515 | first measurement |

*This comparison spans both a model change (4B→8B) and a data change (50K→4.89M). On the same 8B model, data scaling alone gave an 81% reduction (1.91→0.361).

Important Update

The 4.89M dataset was later refined to 1.82M samples after discovering protein overlap between train and evaluation splits. The eval_loss numbers above may be inflated by this leakage.


Reinforcement Learning: The Gradient Routing Problem

With a strong SFT checkpoint in hand, we applied GRPO using verifiable rewards. Then came the surprise — the rankings completely inverted:

| Model | SFT eval_loss | GRPO Reward (best) |
| --- | --- | --- |
| ESM3+MLP | 0.361 (better) | 0.780 (worse) |
| Text-only | 1.207 (worse) | 0.832 (better) |

The gradient routing problem: During GRPO, gradients flow predominantly to either the projector OR the LoRA adapters — not both effectively. SFT's dense per-token loss flows naturally through the entire graph; GRPO's sparse scalar reward creates competition.


What's Next

SSL: Domain Pre-Training

Continued pre-training on 50GB of biology literature using Qwen3.5-4B BASE with LoRA to inject biological domain knowledge before instruction tuning.

Perceiver & Flamingo at Scale

Both architectures are implemented but untested at 4.89M scale. Cross-attention (Perceiver) may exhibit different gradient routing during GRPO.

Resolving the Gradient Routing Problem

  1. Two-stage GRPO — freeze projector first, then unfreeze
  2. Gradient balancing — GradNorm or PCGrad
  3. Architectural solutions — Perceiver/Flamingo may route gradients differently

Timeline

| Date | Phase | Key Result |
| --- | --- | --- |
| Feb 2026 | Architecture & first training | ESM3+MLP converges; 50K SFT eval_loss=3.64 |
| Mar 7 | Data scaling | 4.89M samples; eval_loss 0.361 (-90.1%) |
| Mar 13 | GRPO reinforcement learning | Gradient routing problem; text 0.832 vs MLP 0.582 |
| Mar 16 | SSL pipeline & curated data | 50GB SSL corpus + 1.82M curated SFT dataset |
| TBD | Four-way comparison | Text vs MLP vs Perceiver vs Flamingo at scale |
| TBD | Gradient routing fix | Two-stage GRPO / gradient balancing |

Project Repository: Post_Training_Protein_LLM
Key References: ESM-3 · Mol-Instructions (ICLR 2024)
Part 1

Building a Protein-Understanding LLM: Trials, Errors, and Breakthroughs

February 25, 2026

The Question

What happens when you bridge protein language models with large language models? Can we build an LLM that leverages structural protein embeddings rather than merely parsing raw amino acid sequences?

The Beginning: Why This Project?

It started with a simple observation. Protein language models like ESM-3 have learned remarkably rich representations from millions of protein sequences — structural features, evolutionary constraints, functional signals — all encoded in high-dimensional embeddings. Meanwhile, LLMs excel at reasoning, following instructions, and generating coherent explanations. But they exist in separate worlds.

The question that kept nagging me: what if we could bridge these two worlds?

Not by training a protein model from scratch (prohibitively expensive), but by post-training — freezing the protein encoder, attaching a learnable bridge, and teaching an LLM to interpret protein embeddings alongside natural language.


Day 1: Architecture Decisions

The Core Design

We settled on a modular pipeline:

  1. Protein Encoder (frozen): ESM-3 small with 1.4B parameters producing 1536-dimensional embeddings
  2. Attention Pooling: Compress variable-length residue embeddings into 32 fixed tokens
  3. MLP Projector (trainable): Map 1536-dim protein space into 2560-dim LLM input space
  4. LLM: Qwen3-4B with LoRA on all linear layers
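
The pipeline above can be sketched in PyTorch. The dimensions (1536-dim ESM-3 embeddings, 32 pooled tokens, 2560-dim LLM input space) come from the post; the module internals here, learned-query attention pooling and a two-layer GELU MLP, are assumptions about the implementation rather than the project's exact code.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Compress variable-length residue embeddings into a fixed number of
    tokens via learned queries (an assumption; the exact pooling may differ)."""
    def __init__(self, dim=1536, num_tokens=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, residue_embeds):                # (B, L, 1536), L varies
        q = self.queries.unsqueeze(0).expand(residue_embeds.size(0), -1, -1)
        pooled, _ = self.attn(q, residue_embeds, residue_embeds)
        return pooled                                 # (B, 32, 1536)

class MLPProjector(nn.Module):
    """Map pooled ESM-3 tokens into the LLM's input embedding space."""
    def __init__(self, in_dim=1536, out_dim=2560, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)                            # (B, 32, 2560)

bridge = nn.Sequential(AttentionPooling(), MLPProjector())
protein_tokens = bridge(torch.randn(2, 350, 1536))    # e.g. two 350-residue proteins
print(protein_tokens.shape)                           # torch.Size([2, 32, 2560])
```

Only this bridge (plus the LoRA adapters) trains; the frozen ESM-3 encoder supplies the residue embeddings.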
Key Decision

Start with 32 pooled tokens. Too few and we lose protein information; too many and we blow up sequence length. 32 seemed like a reasonable middle ground — equivalent to roughly 100 words of text context.

The Three Encoding Approaches

Text-Based

Raw sequence as <protein>MKTL...</protein> — the simplest baseline. No ESM-3, just treating proteins as weird text.

ESM-3 + MLP

ESM-3 embeddings → Attention Pooling → MLP Projector. Our main approach.

ESM-3 + Perceiver

ESM-3 embeddings → Perceiver Resampler. ~29M parameters, with latent_dim decoupling for better information compression.

We later added a fourth approach — Flamingo-style gated cross-attention — bringing the comparison to four pathways.


Day 2: First Training Run — It Works!

The Mol-Instructions Dataset

We started with Mol-Instructions, a curated set of 505K protein instruction pairs:

| Task | Samples | Description |
| --- | --- | --- |
| Catalytic Activity | 53K | Predict enzyme reactions |
| Protein Function | 114K | Predict biological roles |
| General Description | 87K | Describe protein properties |
| Domain/Motif | 45K | Identify structural domains |
| Protein Design | 196K | Generate sequences |

First Results

| Run | Samples | Loss (Start → End) | Duration |
| --- | --- | --- | --- |
| 500-sample test | 500 | 17.35 → 4.08 | ~6 min |
| 10K baseline | 10,000 | 35.80 → 27.84 | ~96 min |
| 50K full | 50,000 | 34.25 → 14.77 | ~2.35 hr |

The training converged. On 8x H100 80GB GPUs, the 50K run used only ~39 GB memory with an effective batch size of 32.


Day 2-3: The Evaluation Reality Check

The vanilla Qwen3-4B set the floor: essentially random performance on our benchmarks.

After SFT, the numbers didn't improve much, and some got worse.

Painful Discovery #1

GO prediction degraded after SFT. The vanilla model at least guessed some correct terms. After SFT, the model predicted too few terms (or none at all). F1 went from 0.047 to 0.0.
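
This failure mode falls directly out of the metric: set-based F1 is zero whenever the prediction set is empty, no matter what the model knows internally. A minimal sketch (the `go_term_f1` helper is illustrative, not the project's evaluation code):

```python
def go_term_f1(predicted, gold):
    """Set-based F1 over GO term identifiers."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0                   # predicting nothing scores exactly 0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["GO:0005524", "GO:0004672", "GO:0006468"]
print(go_term_f1(["GO:0005524", "GO:0016301"], gold))  # ~0.4 (vanilla-style guessing)
print(go_term_f1([], gold))                            # 0.0 (the post-SFT collapse)
```

A vanilla model that scatters a few guesses gets partial credit; a fine-tuned model that outputs nothing scores exactly zero, which is how SFT can look worse despite better representations.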

The PPI Bias Problem

All models defaulted to predicting "No interaction" for every protein pair. High specificity (0.9-1.0), terrible recall (0.2 or 0.0). The silver lining: AUROC improved with scale (0.51 → 0.60 → 0.70), suggesting better internal representations.
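
The improving AUROC alongside degenerate generations is consistent because AUROC is threshold-free: it is computed from a continuous score per pair (for example, the model's probability of the "interaction" answer), not from the generated label. A rank-based sketch of the metric, in pure Python:

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random interacting pair outranks a random
    non-interacting pair; ties count half (equivalent to ROC AUC)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Scores can separate classes even when every argmax answer is negative:
print(auroc([0.42, 0.38], [0.35, 0.40]))  # 0.75
```

Here all scores sit below a 0.5 decision threshold, so every generated label is "No interaction", yet the ranking (and hence AUROC) is well above chance.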

What This Told Us

  1. The architecture works — training converges, memory is manageable
  2. SFT alone isn't enough — instruction-tuning doesn't automatically teach structured output
  3. Scale helps internal representations — the knowledge is there, it just doesn't surface in generation

Day 4: The Instruct Model Disaster

We launched an 8B training run overnight. Loss looked fine. But generation was garbage:

KQKQKQKQKQKQKQKQ...
SSSSSSSSSSSSSSSS...
(empty string)

After hours of debugging: the config had path: Qwen/Qwen3-8B — the base model, not Qwen/Qwen3-8B-Instruct-2507.

Why This Matters

Base models don't know how to stop. They never learned instruction-following, chat templates, or turn-taking. The LoRA adapter trained on the base model couldn't be transferred to Instruct. The entire overnight run was wasted.

Fix: All model configs now use -it suffix. A simple naming convention that would have saved 8 hours of H100 time.
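
A cheap guard at config-load time prevents this class of mistake. The marker list below is a heuristic assumption, not part of the project's tooling:

```python
def check_chat_model(model_path):
    """Refuse to start a run on a base checkpoint when an
    instruction-tuned model is required (heuristic, name-based check)."""
    markers = ("instruct", "-it", "chat")
    if not any(m in model_path.lower() for m in markers):
        raise ValueError(
            f"'{model_path}' looks like a BASE model; expected an "
            "instruction-tuned checkpoint (e.g. Qwen/Qwen3-8B-Instruct-2507)")
    return model_path

check_chat_model("Qwen/Qwen3-8B-Instruct-2507")  # passes
# check_chat_model("Qwen/Qwen3-8B")              # raises before any GPU time is spent
```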


Day 4-5: Protein Boundary Tokens

The ESM-3 embedding path used a single <|protein_embed|> placeholder. The LLM had no explicit signal for where the protein representation starts and ends.

Solution — structured boundary tokens:

<|protein_start|>  <|protein_embed|>  <|protein_end|>
     ID 151669         ID 151670          ID 151671
[..., text, START_embed, prot_1, prot_2, ..., prot_32, END_embed, text, ...]
         ↑ LLM learns    ↑ 32 ESM-3 pooled tokens      ↑ LLM learns
         what this means  (replaced at forward)         what this means

The start/end tokens remain as regular LLM embeddings. The middle tokens get replaced by ESM-3's pooled output. All three are masked from loss computation.
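
At forward time, the placeholder swap can be sketched as follows (batch size 1 for clarity; the `splice_protein_embeds` helper and masking details are assumptions based on the description above, not the project's exact code):

```python
import torch

PROTEIN_EMBED_ID = 151670  # <|protein_embed|> placeholder id from the post

def splice_protein_embeds(input_ids, text_embeds, protein_tokens):
    """Replace the <|protein_embed|> placeholder positions with pooled
    ESM-3 tokens; <|protein_start|>/<|protein_end|> keep their learned
    LLM embeddings."""
    mask = input_ids[0] == PROTEIN_EMBED_ID             # (T,) bool
    assert mask.sum().item() == protein_tokens.size(1)  # expect 32 slots
    embeds = text_embeds.clone()                        # (1, T, d_model)
    embeds[0, mask] = protein_tokens[0]                 # drop in ESM-3 tokens
    labels = input_ids.clone()
    labels[0, mask] = -100                              # exclude from the loss
    return embeds, labels
```

In the full model the start/end token positions are masked from the loss as well; only the replacement of the middle positions is shown here.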


Day 6: Scaling Up Data

Why 505K Wasn't Enough

Mol-Instructions covers five tasks well, but they all come from one paper's annotation pipeline. We needed different annotation styles, more tasks, and scale.

The Six Sources

| Source | Records | What It Brings |
| --- | --- | --- |
| Mol-Instructions | 299K | Curated instruction pairs (ICLR 2024) |
| Swiss-Prot | 1.08M | Gene prediction, organism prediction, function |
| ProteinLMDataset | 826K | Subunit, PTM, disease, tissue, induction |
| SwissProtCLAP | 511K | Rich paragraph-length descriptions |
| ProtDescribe | 1.76M | Naming, similarity, location, function |
| Protein2Text-QA | 52K | 44,915 unique questions — highest diversity |

Total: 4.52 million training records.

Sequence Length Distribution

[Figure: Distribution of protein sequence lengths across the combined 4.5M dataset. Median: 306 amino acids.]

| Statistic | Value |
| --- | --- |
| Median | 306 AA |
| Mean | 339 AA |
| 5th percentile | 89 AA |
| 95th percentile | 751 AA |
| Max | 1,000 AA |

Watch Out

  • Single-word outputs: 31% of ProteinLMDataset subunit records have outputs like just "Monomer."
  • Duplicates: 12-16% of records, introduced by template augmentation.
  • Cross-source overlap: 72.8% of sequences appear in more than one source — same proteins, different analytical contexts.
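
Overlap like this is straightforward to measure, and the same machinery later caught the train/eval leakage: hash each exact sequence and count how many hashes appear in more than one source. A sketch; real pipelines often also cluster near-duplicates (e.g. with MMseqs2), which exact hashing misses:

```python
import hashlib
from collections import defaultdict

def sequence_overlap(sources):
    """Fraction of unique sequences that appear in more than one source."""
    seen = defaultdict(set)
    for name, seqs in sources.items():
        for seq in seqs:
            key = hashlib.sha1(seq.encode()).hexdigest()
            seen[key].add(name)
    shared = sum(1 for srcs in seen.values() if len(srcs) > 1)
    return shared / len(seen)

sources = {                       # toy data, for illustration only
    "mol_instructions": ["MKTLLV", "AGHERT"],
    "swiss_prot":       ["MKTLLV", "PQWERY"],
}
print(sequence_overlap(sources))  # 1 of 3 unique sequences is shared
```

The same index supports leakage-safe splits: assign whole proteins (hash keys), not individual records, to train or eval.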


The RL Path Forward

SFT established the foundation, but we need task-specific rewards to push beyond "reasonable-sounding but wrong." We implemented GRPO with four verifiable reward functions — GO term F1, stability prediction accuracy, PPI interaction detection, and ESMFold structural quality assessment.


Lessons Learned

  1. Naming conventions save hours. The -it suffix on Instruct models would have prevented an overnight disaster.
  2. Evaluation comes first. We spent days on architecture before realizing our evaluation was incomplete.
  3. SFT is necessary but not sufficient. Task-specific rewards (RL) are needed for downstream metrics.
  4. Data diversity likely matters alongside scale. 300K from one source → linguistic tics. 4.5M from six sources should help.
  5. Proteins are hard. Longer than typical text, different statistical properties, domain-specific evaluation.
Part 2

Scaling to 4.9 Million Samples

March 7, 2026

The Big Picture

We scaled our protein-LLM training from 50K samples on Qwen3-4B to 4.9 million samples on Qwen3-8B across six sources. Combined, these changes reduced eval_loss from 3.64 to 0.361.

Important Update (March 2026)

The 4.89M combined dataset was later refined to 1.82M samples after discovering significant protein overlap between train and evaluation splits. All results below reflect the original 4.89M training and may not hold on the curated dataset.


The Scaling Experiment

| What Changed | Before | After |
| --- | --- | --- |
| Training samples | 50K (one source) | 4.89M (six sources) |
| Dataset diversity | 5 task types | 15+ task types |
| Annotation styles | Template-based only | Paragraphs, QA, structured |
| Total training steps | 2,595 | 28,941 |

Everything else stayed constant: Qwen3-8B-Instruct, ESM-3 small (frozen), MLP projector, LoRA on all linear layers, 8x H100 GPUs with FSDP.


The Results

Loss Dropped Off a Cliff

[Figure: Training loss over time, showing continued improvement through epoch 1. The combined dataset produces dramatically lower loss than single-source training.]

| Metric | Combined 4.9M (MLP) | Text-Only on 4.9M | Mol-Instr Only (505K) |
| --- | --- | --- | --- |
| eval_loss | 0.361 | 1.207 | 1.908 |
| token_avg_loss | 0.366 | 1.643 | 2.282 |
| MLP advantage | — | 70.1% lower | 81.1% lower |

[Figure: Eval loss at the epoch 1 checkpoint: 0.361, dramatically lower than mol-instructions-only (1.908).]

The Two Biggest Levers

| Date | Change | eval_loss | Improvement |
| --- | --- | --- | --- |
| Feb 20 | First run (Qwen3-4B, 50K) | 3.64 | baseline |
| Feb 23 | Scale model: 4B → 8B | 1.94 | 47% |
| Feb 27 | Text-only baseline (8B, 505K) | 2.42 | 33% |
| Mar 5 | MLP on full mol-instructions | 1.91 | 48% |
| Mar 7 | Combined 4.89M dataset | 0.361 | 90% |

First Generation Quality Numbers

| Metric | Best Value (step 9,250) |
| --- | --- |
| BLEU | 0.315 |
| ROUGE-L | 0.515 |

Scores declined slightly by the final checkpoint (step 9,750: BLEU 0.255, ROUGE-L 0.481), suggesting some overfitting.


Training Stability

Gradient Norms

[Figure: Gradient norm trajectories. The combined run stabilizes to a mean of 0.627 after initial spikes in the first 100 steps.]

Brief instability early on (steps 10-100), then rock-solid stability. Mean gradient norm: 0.627 — lower than both mol-instructions-only (1.67) and text baseline (1.48). More diverse data actually made optimization smoother.


Architectural Comparison

On 50K data, MLP and Perceiver achieve nearly identical performance: eval_loss 1.942 vs 1.952 (MLP wins by 0.5%).

| Approach | Projector Params | 50K eval_loss | 4.89M eval_loss |
| --- | --- | --- | --- |
| Text-only | — (LoRA only) | 2.415 | 1.207 |
| MLP | 30.5M | 1.942 | 0.361 |
| Perceiver | 29.4M | 1.952 | Not yet run |
| Flamingo | ~120-150M | Not yet run | Not yet run |

What This Means

Data Diversity Hypothesis: Supported

Combining six sources produced substantially lower loss than single-source training. Scaling data from diverse sources yielded larger gains than scaling model size.

72% of proteins appear in multiple sources but are described differently each time. Multiple annotation perspectives appear to help generalization, though disentangling diversity from scale requires further experiments.

ESM-3 Embeddings Show a Clear SFT Advantage

On 4.89M: eval_loss 0.361 (ESM-3+MLP) vs 1.207 (text-only). The structural information in ESM-3's 1536-dim embeddings appears difficult to recover from sequences alone.


The Takeaway

Data diversity and scale are among the most impactful levers for protein LLM SFT performance.

  • eval_loss: 1.91 → 0.361 on the same 8B model (an 81% reduction from data scaling alone)
  • BLEU: 0.315, ROUGE-L: 0.515 at best checkpoint
  • 33.7% through training (headroom remains)
Part 3

Teaching a Protein LLM with RL: The Gradient Routing Problem

March 13, 2026

The Question

When a multimodal protein LLM achieves strong supervised fine-tuning loss, why does it struggle with reinforcement learning — and what does gradient flow tell us about the answer?

Over the past month, I built a multimodal protein language model — ESM-3 encodes sequences into 1536-dimensional structural embeddings, passing through attention pooling (32 tokens) and an MLP projector into Qwen3-8B-Instruct. LoRA adapts the LLM; ESM-3 stays frozen. ~32.5M trainable parameters.

[Figure: Schematic overview of the four-pathway architecture.]


SFT Results: The MLP Advantage

| Metric | ESM3+MLP | Text-Only | MLP Advantage |
| --- | --- | --- | --- |
| Best eval_loss | 0.361 | 1.207 | 70.1% lower |
| Gradient norm (mean) | 0.627 | 2.365 | 3.8x lower |

[Figure: SFT eval loss curves. MLP converged smoothly; text-only showed degradation after step 8000.]


From Imitation to Reasoning: Enter GRPO

Group Relative Policy Optimization (GRPO): for each prompt, generate a group of completions (8-16). A reward function scores each, and the model updates toward better-rewarded outputs using policy gradient. Rewards are normalized within each group.

For structure quality, we designed a multi-component reward combining a format bonus, numerical (pLDDT) accuracy, category match, and quality alignment.

Training data: 5,878 protein sequences folded by ESMFold to produce ground-truth structural quality metrics. We ran 13 GRPO experiments over three days.
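
A sketch of what such a reward can look like. The component names (format bonus, numerical pLDDT accuracy, category match, quality alignment) appear in the component analysis later in this post; the `structure_reward` helper, its weights, and the regex parsing are illustrative assumptions:

```python
import re

def structure_reward(completion, actual_plddt, actual_category):
    """Score one completion against ESMFold-derived ground truth.
    Weights and parsing are assumptions, not the project's exact reward."""
    reward = 0.0
    num = re.search(r"pLDDT[:\s]+([\d.]+)", completion)
    if num:
        reward += 0.1                                  # format bonus
        pred = float(num.group(1))
        # numerical accuracy: linear credit within 25 pLDDT points
        reward += 0.25 * max(0.0, 1 - abs(pred - actual_plddt) / 25)
    cat = re.search(r"\b(very high|high|medium|low)\b", completion.lower())
    if cat and cat.group(1) == actual_category:
        reward += 0.25                                 # category match
    # a quality-alignment term would be added the same way
    return reward

r = structure_reward("pLDDT: 78.5 (high confidence)",
                     actual_plddt=79.7, actual_category="high")
```

A malformed completion earns no format bonus and no partial credit, which is why format compliance shows up so strongly in the reward tables below.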


The Reversal: Text Wins at RL

The text-only model (the worse SFT model) rapidly learned structure quality assessment: 0.832 mean reward, near-perfect format compliance, and remarkably calibrated pLDDT predictions (79.1 predicted vs 79.7 actual).

The MLP model showed 0.582 reward in its initial configuration: format compliance stuck at 65%, pLDDT underestimated (67.9 vs 79.7). A separate run reached 0.780, but text-only still led.

[Figure: GRPO reward curves showing text-only outperforming MLP. The model with the better SFT loss is the worse RL learner.]


Diagnosing the Problem: Follow the Gradients

| Model / Task | LoRA Gradients | Multimodal Gradients |
| --- | --- | --- |
| MLP, Structure GRPO | ~0 (frozen out) | 0.247 |
| MLP, ProteinLM GRPO | 0.826 | 0.0 (frozen out) |
| Text-only, Structure GRPO | 0.476-0.493 | N/A |

[Figure: Gradient norms revealing the routing problem: one subsystem captures the signal, starving the other.]

This is the gradient routing problem: gradients flow predominantly to one subsystem during GRPO, largely starving the other. The "winner" depends on the reward function and task characteristics.

During SFT, every token contributes to loss, distributing gradients through both projector and LoRA. During GRPO, the reward is a single scalar — creating a "winner-take-all" dynamic where the steepest local descent captures the signal.
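
The diagnosis itself is mechanical: after a GRPO backward pass, bucket per-parameter gradient norms by subsystem. A sketch (the parameter-name substrings are assumptions about the module naming):

```python
import torch

def grad_norms_by_subsystem(model):
    """Aggregate gradient norms for the multimodal bridge vs the LoRA
    adapters, keyed by substrings of parameter names."""
    norms = {"projector": 0.0, "lora": 0.0}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.norm().item()
        if "projector" in name or "pooling" in name:
            norms["projector"] += g
        elif "lora" in name:
            norms["lora"] += g
    return norms

# After loss.backward() on a GRPO step, a gradient-starved run looks like
# {'projector': 0.247, 'lora': ~0} (values from the table above).
```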


The Fix (Partial): Freeze and Focus

Freeze the multimodal projector completely and only update LoRA during GRPO. This eliminates the competition:

| Configuration | Mean Reward | Format Bonus | LoRA Grad | MM Grad |
| --- | --- | --- | --- | --- |
| Text-only (standard) | 0.832 | 0.099 | 0.476 | 0.0 |
| MLP best run | 0.780 | — | — | — |
| MLP frozen-MM | 0.774 | 0.095 | 0.071 | 0.0 |
| MLP frozen-MM + focal | 0.725 | 0.095 | 0.119 | 0.0 |
| MLP standard (gradient-starved) | 0.582 | 0.065 | ~0 | 0.247 |

Frozen-MM: reward jumps from 0.582 to 0.774. Format compliance from 65% to 95%. The gradient routing problem was the primary bottleneck.

But text-only still wins (0.832). Even with projector frozen, LoRA gradients (0.071-0.119) are smaller than text-only (0.476) — the 32-token compression creates an information bottleneck.
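
The frozen-MM intervention itself amounts to flipping `requires_grad` on the bridge before launching GRPO; a sketch under assumed module names:

```python
import torch.nn as nn

def freeze_multimodal(model: nn.Module):
    """Frozen-MM configuration: stop gradients into the pooling/projector
    so GRPO updates flow only through LoRA (name substrings are assumptions)."""
    frozen = 0
    for name, p in model.named_parameters():
        if "projector" in name or "pooling" in name:
            p.requires_grad_(False)
            frozen += 1
    return frozen  # tensor count, useful as a sanity-check log line
```

Two-stage GRPO, proposed below, would simply call this before stage one and re-enable the same parameters for stage two.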


Reward Breakdown: Component-Level Analysis

[Figure: Reward component breakdown. Text-only outperforms across all four components.]

Biggest gaps in numerical accuracy (text: 0.185, standard MLP: 0.086) and category match (text: 0.197, standard MLP: 0.104). Quality alignment closer (text: 0.351, standard MLP: 0.326).

pLDDT calibration: Actual mean 79.7. Text-only predicts 79.1 (remarkably close). Frozen-MM overestimates at 85.1. Standard MLP underestimates at 67.9. The projector embeddings contain strong signal but GRPO fails to calibrate it properly.

[Figure: Format compliance. Text-only reaches ~100% early; standard MLP is stuck at 74-76%; frozen-MM recovers to ~95%.]


ProteinLM Benchmark: A Negative Result

GRPO on 849 multiple-choice protein questions: clear negative result. Neither 3-epoch (reward 0.695) nor 10-epoch (reward 0.676) showed improvement. 37-42% of groups had zero reward standard deviation — identical completions within each group meant zero gradient signal. Nearly half the compute was wasted.

Interestingly, gradient routing was reversed: LoRA received healthy gradients (0.8) while multimodal gradients were exactly zero.


What This Means for Multimodal RL

In SFT, every token contributes to loss, and chain rule distributes gradients through both projector and LoRA. In GRPO, the sparse scalar reward creates competition — a "winner-take-all" dynamic resembling gradient conflicts in multi-task learning.

Scope caveat: Observed on two GRPO tasks with one architecture (MLP projector). Generalization to other multimodal RL settings remains untested.

What's Next

  1. Two-stage GRPO. Frozen projector first (format + basic reward via LoRA), then unfreeze for structural refinement.
  2. Gradient balancing. GradNorm or PCGrad to prevent winner-take-all dynamics.
  3. The Perceiver pathway. Cross-attention may produce different gradient routing behavior.

The Central Paradox

Our best SFT model (MLP, eval_loss 0.361) produces our worst GRPO learner (reward 0.582), while our weaker SFT model (text-only, eval_loss 1.207) produces our best GRPO learner (reward 0.832). This SFT→RL inversion is the key challenge for the next phase.


This analysis covers 4 SFT and 13 GRPO experiments run between March 6-13, 2026, on 8x H100 GPUs. All models use Qwen3-8B-Instruct-2507 with LoRA (r=8). The combined SFT dataset contains 4.89M instruction-response pairs from six protein data sources.