DISCO — Teaching AI to Invent Enzymes Nature Never Imagined

01 — The Challenge

Nature's Chemistry —
and Beyond

Evolution is the greatest chemist the world has ever known. Over billions of years, it has crafted enzymes of breathtaking precision, molecular machines that accelerate reactions by factors of millions, under the gentlest of conditions. But even evolution has its blind spots. The chemical reactions it has explored represent a remarkably narrow slice of what is possible. Vast swathes of synthetically valuable chemistry remain untouched by biology, not because they are impossible for enzymes, but because evolution simply never had a reason to go there.

The grand challenge of modern enzyme engineering is to venture into this uncharted territory: to build enzymes for reactions that nature has never attempted. Directed evolution, the Nobel Prize-winning approach[1] of iteratively mutating and screening proteins, can push enzymes toward new-to-nature chemistry. But every campaign needs a starting point: an initial protein with at least a flicker of the desired activity. For truly novel chemistry, finding one is laborious and challenging, and we are limited by what evolution has already sampled.

Deep learning has already transformed protein design. Models like RFdiffusion[2] and BindCraft[3] can design novel proteins that fold into desired shapes and bind to specific targets. But designing an enzyme, a protein that doesn't just bind a molecule but transforms it, is a fundamentally harder problem.

Current computational approaches face two critical bottlenecks. First, they require someone to pre-specify the exact arrangement of catalytic residues or reaction transition state structure, a so-called "theozyme", before the design process begins[4]. This demands deep mechanistic understanding of the reaction, which for new-to-nature chemistry is often unavailable. Second, existing methods treat sequence and structure as separate problems, solved sequentially: first generate a backbone, then fill in the amino acids. Because protein function emerges from the interplay of both, critical information is lost at the handoff.

The Central Question

How can we systematically design functional enzymes for chemical transformations that have no precedent in known biology, without pre-specifying a precise catalytic mechanism?

DISCO is the first model to solve both problems at once, jointly generating sequence and structure around any target molecule, with no pre-fixed theozyme required.

02 — The Model

DISCO: Designing Sequence
and Structure as One

DISCO, DIffusion for Sequence-structure CO-design, is a multimodal generative model that simultaneously generates both a protein's amino acid sequence and its three-dimensional atomic structure. This might sound like an incremental advance, but it represents a fundamental departure from how the field has worked.

Existing generative pipelines, including recent models like RFdiffusion2[5,6], RFdiffusion3[7], and BoltzGen[8], work in two separate stages: first generate a protein backbone, then use a separate inverse-folding network such as LigandMPNN[9] to predict what amino acid sequence would fold into that shape. This decoupled pipeline cannot use sequence-level signals to guide backbone design, or structural context to inform sequence choices.

DISCO eliminates this handoff entirely. It learns a joint distribution over discrete amino acid tokens and continuous 3D coordinates, denoising both simultaneously. The mathematical foundation is elegant: by independently sampling noise per modality during training, the model provably learns the joint reverse process using only unimodal losses, no special joint supervision required[10]. The coupling emerges from the architecture.
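The training recipe described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DISCO's implementation: the mask token `#`, the linear interpolation toward Gaussian noise, and the function names `noise_pair` and `unimodal_losses` are all hypothetical. The point it shows is that each modality draws its own noise level, and each loss looks at only one modality.

```python
import math
import random

MASK = "#"  # hypothetical mask token for the discrete (sequence) modality


def noise_pair(seq, coords, rng):
    """Corrupt sequence and structure with independently drawn noise levels."""
    t_seq = rng.random()     # noise level for the discrete modality
    t_struct = rng.random()  # drawn separately for the continuous modality
    # Masking corruption: each token is replaced by MASK with probability t_seq.
    noisy_seq = [MASK if rng.random() < t_seq else aa for aa in seq]
    # Gaussian corruption (assumed form): interpolate toward unit Gaussian noise.
    noisy_coords = [
        tuple((1.0 - t_struct) * x + t_struct * rng.gauss(0.0, 1.0) for x in xyz)
        for xyz in coords
    ]
    return t_seq, t_struct, noisy_seq, noisy_coords


def unimodal_losses(seq, coords, noisy_seq, pred_probs, pred_coords):
    """Cross-entropy on masked tokens and mean squared error on coordinates."""
    masked = [i for i, tok in enumerate(noisy_seq) if tok == MASK]
    # pred_probs[i] maps amino acid -> predicted probability at position i.
    ce = sum(-math.log(pred_probs[i][seq[i]]) for i in masked) / max(len(masked), 1)
    mse = sum(
        (p - x) ** 2 for pc, xc in zip(pred_coords, coords) for p, x in zip(pc, xc)
    ) / (3 * len(coords))
    return ce, mse
```

In a real model, `pred_probs` and `pred_coords` would both come from a single network that sees the noised sequence and the noised structure together; that shared input is where the coupling between modalities would arise.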

Core Techniques

Getting the model to generate sequences that actually fold into their designed structures requires keeping the two modalities aligned. Three inference-time innovations do this, each with a dramatic effect on co-designability:

Cross-Modal Recycling

Structure informs sequence, sequence informs structure

At each denoising step, the model conditions on four signals: its current predicted clean sequence and structure, plus the noised versions of both. A frozen protein language model and a structure encoder inject rich information bidirectionally, ensuring amino acid choices reflect emerging geometry while backbone predictions adapt to evolving sequence identity.

Self-Correcting Sequence

Revise early mistakes, temper overconfidence

Standard masked diffusion unmasks tokens one-by-one, irreversibly. DISCO enables sequence revision, revisiting and correcting earlier amino acid choices. An entropy-adaptive temperature mechanism smooths the amino acid distribution early in the trajectory when structural detail is still coarse, allowing confident commitments only once the backbone has crystallized.
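One way to picture these two ideas is the sketch below. It is an assumption-laden illustration, not DISCO's actual schedule: the entropy floor that decays linearly with trajectory progress, the 1.25 temperature step, the `min_prob` threshold, and the names `tempered_probs` and `revise` are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tempered_probs(logits, progress, max_temp=8.0):
    """Flatten overconfident predictions early; progress runs from 0 to 1."""
    # Hypothetical schedule: demand high entropy at the start, none at the end.
    floor = (1.0 - progress) * 0.5 * math.log(len(logits))
    temp = 1.0
    probs = softmax(logits, temp)
    # Raise the temperature until the distribution is smooth enough.
    while entropy(probs) < floor and temp < max_temp:
        temp *= 1.25
        probs = softmax(logits, temp)
    return probs, temp


def revise(tokens, probs_per_pos, alphabet, min_prob=0.05, mask="#"):
    """Re-mask committed tokens that the current prediction no longer supports."""
    out = []
    for tok, probs in zip(tokens, probs_per_pos):
        if tok != mask and probs[alphabet.index(tok)] < min_prob:
            out.append(mask)  # give this position another chance
        else:
            out.append(tok)
    return out
```

With these definitions, a sharply peaked prediction is smoothed when `progress` is near 0 and passed through unchanged when `progress` is 1, and any committed residue whose probability has collapsed is returned to the masked state for resampling.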

Noisy Guidance

Condition on noise to sharpen predictions

DISCO conditions each modality on a noisier version of the other, sharpening the model's conditional predictions and ensuring the two modalities stay aligned throughout the trajectory. The intuition: by seeing a slightly degraded version of the partner signal, the model learns to extract the essential structural or sequence information rather than overfitting to intermediate noise.

Ablation: Co-designability (fraction of sequences that refold within 2Å)

Full DISCO (+ noisy guidance): 0.88
DISCO without noisy guidance: 0.80
− Recycling: 0.62
− Entropy-adaptive temperature: 0.60
− Sequence correction (standard masked diffusion): 0.23
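The co-designability metric used in this ablation is a simple fraction, sketched below. The sketch assumes the designed and refolded structures have already been superposed, that "within 2Å" means strictly below the threshold, and that the helper names `rmsd` and `co_designability` are illustrative; the RMSD values in the usage lines are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def co_designability(rmsds, threshold=2.0):
    """Fraction of designs whose refolded structure lies within `threshold` Å."""
    if not rmsds:
        raise ValueError("need at least one design")
    return sum(r < threshold for r in rmsds) / len(rmsds)


# Illustrative values only: two of four designs refold within 2 Å.
example_rmsds = [0.8, 1.9, 2.4, 5.0]
print(co_designability(example_rmsds))  # 0.5
```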

Some other highlights of the model are as follows:

Conditions on arbitrary biomolecular contexts: small molecules, reactive intermediates, DNA, RNA
Co-folds ligands with the protein throughout generation
Trained on unfiltered PDB, with no "designability" selection bias and no synthetic data

Figure 2. DISCO's multimodal inference overview with arbitrary molecular conditioning.

Inference-Time Scaling

Because DISCO operates on both modalities, it unlocks inference-time steering using reward functions defined over both sequence and structure. Rather than generating thousands of candidates and filtering, DISCO uses Feynman-Kac Correctors[11] (FKC) — a principled mathematical framework that tilts the sampling distribution toward proteins with desired properties during generation itself. We derive two novel FKC methods:

FKC-Multimodal (FKC-MM) steers generation using reward functions defined simultaneously on both discrete sequences and continuous structures, something impossible in decoupled pipelines. When targeting increased disulfide bonds, FKC-MM produces 100-residue proteins with six disulfide bonds, a density found in only the top 0.2% of comparable training proteins. Guidance on structure alone, by contrast, does not achieve this.
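To make "a reward defined on both modalities" concrete, here is a hedged sketch of a disulfide-counting reward. It is not the reward used by DISCO: the 4.5 Å Cβ–Cβ cutoff, the greedy pairing, and the name `disulfide_reward` are assumptions chosen for illustration. It needs the sequence (which residues are cysteine) and the structure (which cysteines are close in space), so neither modality alone can score it.

```python
import math
from itertools import combinations


def disulfide_reward(sequence, cb_coords, max_cb_dist=4.5):
    """Count non-overlapping Cys-Cys pairs whose C-beta atoms are close in space."""
    cys = [i for i, aa in enumerate(sequence) if aa == "C"]
    # Candidate pairs sorted by distance, so the tightest pairs are claimed first.
    pairs = sorted(
        (math.dist(cb_coords[i], cb_coords[j]), i, j) for i, j in combinations(cys, 2)
    )
    used, count = set(), 0
    for dist, i, j in pairs:
        # Each cysteine may take part in at most one disulfide bond.
        if dist <= max_cb_dist and i not in used and j not in used:
            used.update((i, j))
            count += 1
    return count


# Illustrative geometry: residues 0 and 2 are 4 Å apart, everything else is far away.
coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (4.0, 0.0, 0.0), (20.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(disulfide_reward("CACAC", coords))  # 1
print(disulfide_reward("AAAAA", coords))  # 0: same geometry, but no cysteines
```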

FKC-Specificity Guidance (FKC-SG) solves a different problem: designing a protein that binds a target molecule while avoiding a structurally similar decoy. By sampling from a tilted distribution that encourages on-target likelihood while penalizing off-target likelihood, FKC-SG generates proteins with high binding-site separation, including cases where best-of-N filtering produces zero hits.
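The tilting idea behind specificity guidance can be sketched as importance-weighted resampling of candidate trajectories ("particles"). This shows only the reweighting ingredient under simplifying assumptions; the full Feynman-Kac Corrector derivation in [11] also modifies the sampling dynamics. The parameters `beta` and `penalty` and the function names are hypothetical, and the log-likelihood values in the usage lines are made up.

```python
import math
import random


def specificity_weights(logp_on, logp_off, beta=1.0, penalty=1.0):
    """Normalized weights proportional to exp(beta * (logp_on - penalty * logp_off))."""
    scores = [beta * (on - penalty * off) for on, off in zip(logp_on, logp_off)]
    top = max(scores)  # subtract the maximum for numerical stability
    raw = [math.exp(s - top) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


def resample(particles, weights, rng):
    """Multinomial resampling: high-weight particles tend to be duplicated."""
    return rng.choices(particles, weights=weights, k=len(particles))


# Illustrative values: design "B" likes the target and dislikes the decoy.
designs = ["A", "B", "C"]
weights = specificity_weights(logp_on=[-1.0, -1.0, -5.0], logp_off=[-1.0, -6.0, -5.0])
survivors = resample(designs, weights, random.Random(0))
```

Applied repeatedly during generation, this kind of reweighting concentrates compute on trajectories that separate target from decoy, which is the contrast with generating a fixed batch and filtering afterwards.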

03 — Computational Benchmarks

State-of-the-Art Across
Diverse Design Tasks

On unconditional monomer generation, ~90% of generated sequences refold to within 2Å of their designed backbones, while achieving the highest sequence and structural diversity and novelty among existing methods. On the Studio-179 benchmark, a new library of 179 natural and non-natural ligands spanning catalysis, pharmaceuticals, luminescence, and sensing, DISCO generates the most diverse, co-designable complexes for 178 of 179 targets, surpassing all baselines[7,8]. Among the compared methods, only DISCO natively supports co-folding with multiple ligands.

The generality extends beyond small molecules. DISCO successfully generates co-designable proteins predicted to bind sequence-specific DNA and RNA, outperforming existing models on macromolecular interfaces as well. From small-molecule cofactors to nucleic acids, no other generative model matches DISCO's breadth.

Figure 3. (A) DISCO outperforms existing methods in co-designability, novelty, and diversity when designing proteins conditioned on a wide range of biomolecular targets. (B) An in silico demonstration of how FKC-SG specificity guidance can design proteins that selectively bind one target over another, rather than relying on simple filtering.

But co-designability alone is not enough. DISCO's designs also capture the complex statistical properties of real proteins: natural amino-acid compositions, diverse secondary structures, favorable Ramachandran geometries, appropriate surface hydrophobicity and net charge, and high long-range contact order, indicating complex, well-connected topologies rather than trivial folds.

Novel Pockets That Understand Chemistry

The most distinctive capability of joint sequence-structure generation is that the pocket's chemistry and geometry co-adapt with the target's conformation. DISCO samples chemically valid target conformers that explore geometric diversity beyond the reference input. More crucially, DISCO's designs are chemically intelligent: binding-site lipophilicity correlates with ligand hydrophobicity, appropriate coordinating residues emerge for specific cofactors, and cavities form with the right geometry to avoid clashes. These pockets are also diverse — up to 80% motif diversity among the four closest residues — and novel, with the majority of designed pockets having no close match in AlphaFoldDB.

Figure 4. (A) DISCO designs novel active-site motifs. (B) An example of a novel binding site (purple) designed to bind Coenzyme Q1, compared to the closest motif in AlphaFoldDB (beige).

04 — Experimental Validation

New Enzymes for
New-to-Nature Chemistry

The ultimate test is the wet lab. DISCO was challenged with designing enzymes for carbene-transfer reactions, a class of transformations nature has not explored, valuable for constructing pharmaceuticals and complex molecules. This chemistry was first brought into biology by directed evolution of cytochrome P450 enzymes for cyclopropanation[12], and has since been expanded to boron–hydrogen[13] and carbon–hydrogen[14] bond insertion, among others. Carbene transfers proceed via formation of an iron-carbenoid intermediate, which then delivers the carbene fragment to a substrate through diverse pathways.

Rather than specifying a precise theozyme[15], DISCO was conditioned solely on DFT-computed geometries of the heme–carbene precursor intermediate, a deliberate simplification. We let the model explore catalytic solutions without being constrained by human assumptions about which residues are required or what the transition state looks like.

From ~20,000 generated sequence-structure pairs (one to two orders of magnitude fewer than recent pipelines[5,6,16]), 90 designs were selected through computational filtering and tested across four distinct reactions. Below, we highlight yield (fraction of substrate converted to product) and total turnover number (TTN, the number of substrate molecules each enzyme molecule converts before deactivation). Although the designs were not optimized for stereoselectivity, enantiomeric excess reached up to 35%, with enzymes favoring either enantiomer identified for three of the four reactions.
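For readers less familiar with these metrics, the sketch below spells out the standard definitions as arithmetic. The function names are illustrative and the input quantities are made-up numbers, not measurements from this work.

```python
def percent_yield(product_umol, substrate_umol):
    """Percentage of the starting substrate that was converted to product."""
    return 100.0 * product_umol / substrate_umol


def total_turnover_number(product_umol, enzyme_umol):
    """Moles of product formed per mole of enzyme before the enzyme deactivates."""
    return product_umol / enzyme_umol


def enantiomeric_excess(major_area, minor_area):
    """Percent e.e. from the relative amounts (e.g. peak areas) of two enantiomers."""
    return 100.0 * (major_area - minor_area) / (major_area + minor_area)


# Illustrative inputs only: 10 µmol substrate, 4.5 µmol product, 0.005 µmol enzyme.
print(percent_yield(4.5, 10.0))           # 45.0
print(total_turnover_number(4.5, 0.005))  # about 900
print(enantiomeric_excess(60.0, 40.0))    # 20.0
```

Yield and TTN measure different things: a reaction can reach high yield with a large amount of enzyme and still have a low TTN, which is why both are reported.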

Reaction 01

Alkene Cyclopropanation

Cyclopropanation of 4-methoxystyrene with ethyl diazoacetate (EDA) — a benchmark reaction for carbene chemistry that builds strained three-membered rings widely used in medicinal chemistry.

Styrene Cyclopropanation Reaction Scheme

TTN: 4,050
Yield: 72%
d.r.: 99:1

For this substrate, the top DISCO design has activity that surpasses both early evolved P450 enzymes, the original breakthrough in engineered cyclopropanation biocatalysis [12] (Science 2013, 339, 6117, 307), and the recently reported designed enzyme PNC2, which scaffolded a helix bundle around a porphyrin-based theozyme[15] (Science 2025, 388, 6747, 665).

vs. Prior Enzyme Engineering Approaches

DISCO: 4,050 TTN
PNC2 (theozyme): 630 TTN
P450_H2-5-F10 (evol.): 364 TTN

Reaction 02

B–H Insertion

Carbene insertion into an N-heterocyclic carbene–borane, forging a new carbon–boron bond. No natural organism is known to catalyze this transformation; it is entirely alien to biology.

TTN: 5,170
Yield: 98%

A single DISCO design, with no laboratory optimization, more than doubles the activity achieved by three rounds of directed evolution[13] (Nature 2017, 552, 7683, 132), and exceeds the starting point by 43×.

vs. Previous Directed Evolution Campaign

DISCO: 5,170 TTN
3 rounds evol.: 2,490 TTN
Rma cyt c: 120 TTN
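The fold-improvements quoted for this reaction follow directly from the three TTN values. A quick check, using only numbers stated in the text:

```python
# TTN values quoted in the text for B-H insertion.
disco_ttn = 5170
evolved_ttn = 2490   # after three rounds of directed evolution
starting_ttn = 120   # Rma cyt c starting point

print(round(disco_ttn / evolved_ttn, 2))   # 2.08, i.e. "more than doubles"
print(round(disco_ttn / starting_ttn, 1))  # 43.1, i.e. roughly 43x
```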

Reaction 03

C(sp³)–H Insertion

Selective alkylation of an unactivated C–H bond in 1-phenylpyrrolidine, among the most challenging transformations in organic chemistry, requiring exquisite control over site- and stereoselectivity.

TTN: 2,360
Yield: 42%

The previous engineering campaign[14] (Nature 2019, 565, 7737, 67) started at fewer than 20 TTN and required 14 rounds of directed evolution to reach 2,030 TTN. DISCO exceeds that endpoint in a single computational step.

One Computational Design vs. 14 Rounds of Lab Evolution

DISCO: 2,360 TTN
14 rounds evol.: 2,030 TTN
P450 WT: N.D.

Reaction 04

Spirocyclopropanation

Building a strained spirocyclic motif in a pharmaceutically relevant azaspiro[2.3]alkane scaffold, sterically and electronically demanding.

Initial activity was modest, but the designs proved highly evolvable: a single round of error-prone PCR mutagenesis of dCT-H11 produced a variety of improved variants, with mutations that increased activity several-fold and shifted stereoselectivity in both directions: some favored one enantiomer (+49% e.e.), while others inverted to the opposite (–35% e.e.). This reaction class was recently explored enzymatically[17] (JACS 2025, 147, 31, 27165).

05 — Novel Architectures

Active Sites That
Don't Exist in Nature

Perhaps the most interesting property of DISCO's designs is the novelty and diversity of their molecular architectures. When generated binding motifs are searched against the entirety of the AlphaFold Database, the majority have no close natural homologs. Over 90% of the motifs cluster into distinct groups. These are chemically plausible, new residue motifs, invented by the model to accommodate the target.

The closest structural match to dCT-H11, one of the top-performing designs, is a TetR-family transcription factor from the extremophile Haloarcula marismortui, a DNA-binding protein with no known catalytic activity and only 21% sequence identity to the DISCO-designed sequence. DISCO repurposed this non-enzymatic topology for carbene transfer. The catalytic residue geometry is novel: the closest motif in AlphaFoldDB[18,19] deviates by more than 7Å. Other designs are even more remote: dCT-F9 (TM-score 0.52, 5% identity) and dCT-G9 (TM-score 0.51, 9% identity) adopt folds with no corresponding motif identified anywhere in AlphaFoldDB.

Top DISCO Design — dCT-H11

DISCO-designed enzyme dCT-H11 — a novel carbene transferase with no natural homolog

Closest Natural Match

PDB 3CRJ

TetR-family transcription factor from Haloarcula marismortui, a Dead Sea extremophile. A DNA-binding protein with no known catalytic activity. Closest motif to the designed active site shows a significant deviation.

Motif RMSD: >7 Å
Seq. identity: 21%
TM-score: 0.81

Not a heme-binding protein. DISCO repurposed this non-enzymatic fold for carbene chemistry with a completely novel active-site geometry. Other top designs have lower seq. identity (<0.1) and pdbTM (~0.50), also with novel active sites.

None of the closest structural matches are naturally heme-binding proteins. DISCO has learned the underlying biochemical principles that enable heme binding and carbene transfer, and applies them to protein folds that evolution never associated with this chemistry. It does not remix known parts. It discovers fundamentally new solutions.

07 — Looking Forward

Toward Genetically Encodable
Arbitrary Chemistry

The four reactions explored here are a small sample from a vast universe of synthetically valuable transformations not found in nature. DISCO addresses a critical bottleneck: generating diverse, functional, evolvable enzymes from scratch, conditioned solely on the target chemistry — bypassing transition-state calculations, theozyme scaffolding[4], and large-scale screening.

Key Contributions

Multimodal co-design: Simultaneous sequence and structure generation through joint diffusion with cross-modal recycling.
No theozyme required: Enzymes designed from reaction intermediates alone — no pre-specified catalytic residues or precise theozymes, in contrast to physics-based approaches[15] and motif-scaffolding methods[4].
Arbitrary molecular conditioning: Co-folds with small molecules, cofactors, intermediates, and nucleic acids; most diverse, co-designable complexes on 178 of the 179 Studio-179 targets.
Principled inference-time steering: Feynman-Kac Correctors[11] for controllable, multimodal generation: reward tilting, not generate-and-filter.
Novel active-site geometries: Designed motifs have no close homologs across 200M+ structures in AlphaFoldDB.
High experimental activity: Top designs exceed extensively engineered variants from just 90 tested genes.
Evolvable scaffolds: One round of mutagenesis yields fourfold gains with divergent stereoselectivity.

The chemistry nature never explored is now within reach.

References

[1] Arnold, F. H. Innovation by Evolution: Bringing New Chemistry to Life (Nobel Lecture). Angew. Chem. Int. Ed. 58(41), 14420–14426 (2019). doi:10.1002/anie.201907729
[2] Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., ... & Baker, D. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023). doi:10.1038/s41586-023-06415-8
[3] Pacesa, M., Nickel, L., Schmidt, J., ... & Correia, B. E. One-shot design of functional protein binders with BindCraft. Nature (2025). doi:10.1038/s41586-025-09429-6
[4] Wang, J., Lisanza, S., Juergens, D., Tischer, D., Watson, J. L., ... & Baker, D. Scaffolding protein functional sites using deep learning. Science 377(6604), 387–394 (2022). doi:10.1126/science.abn2100
[5] Ahern, W., Yim, J., Tischer, D., ... Krishna, R. & Baker, D. Atom-level enzyme active site scaffolding using RFdiffusion2. Nature Methods (2025). doi:10.1038/s41592-025-02975-x
[6] Kim, D., ... Krishna, R. & Baker, D. Computational design of metallohydrolases. Nature (2025). doi:10.1038/s41586-025-09746-w
[7] Butcher, J., Krishna, R., Mitra, R., ... & Baker, D. De novo design of all-atom biomolecular interactions with RFdiffusion3. bioRxiv (2025). doi:10.1101/2025.09.18.676967
[8] Stark, H., Faltings, F., Choi, M., ... Barzilay, R. & Jaakkola, T. BoltzGen: Toward Universal Binder Design. bioRxiv (2025). doi:10.1101/2025.11.20.689494
[9] Dauparas, J., Lee, G. R., Pecoraro, R., An, L., Anishchenko, I., Glasscock, C. & Baker, D. Atomic context-conditioned protein sequence design using LigandMPNN. Nat. Methods 22, 717–723 (2025). doi:10.1038/s41592-025-02626-1
[10] Rojas, K., Zhu, Y., Zhu, S., Ye, F. X.-F. & Tao, M. Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces. ICML (2025). arXiv:2506.07903
[11] Skreta, M., Akhound-Sadegh, T., Ohanesian, V., Bondesan, R., Aspuru-Guzik, A., Doucet, A., Brekelmans, R., Tong, A. & Neklyudov, K. Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts. ICML (2025). arXiv:2503.02819
[12] Coelho, P. S., Brustad, E. M., Kannan, A. & Arnold, F. H. Olefin cyclopropanation via carbene transfer catalyzed by engineered cytochrome P450 enzymes. Science 339(6117), 307–310 (2013). doi:10.1126/science.1231434
[13] Kan, S. B. J., Huang, X., Gumulya, Y., Chen, K. & Arnold, F. H. Genetically programmed chiral organoborane synthesis. Nature 552(7683), 132–136 (2017). doi:10.1038/nature24996
[14] Zhang, R. K., Chen, K., Huang, X., Wohlschlager, L., Renata, H. & Arnold, F. H. Enzymatic assembly of carbon–carbon bonds via iron-catalysed sp³ C–H functionalization. Nature 565(7737), 67–72 (2019). doi:10.1038/s41586-018-0808-5
[15] Hou, K., Huang, W., Qi, M., ... & DeGrado, W. F. De novo design of porphyrin-containing proteins as efficient and stereoselective catalysts. Science 388(6747), 665–670 (2025). doi:10.1126/science.adt7268
[16] Braun, M., Tripp, A., Chakatok, M., ... & Oberdorfer, G. Computational enzyme design by catalytic motif scaffolding (Riff-Diff). Nature 649(8095), 237–245 (2026). doi:10.1038/s41586-025-09747-9
[17] Kennemur, J. L., Long, Y., Ko, C. J., Das, A. & Arnold, F. H. Enzymatic stereodivergent synthesis of azaspiro[2.y]alkanes. J. Am. Chem. Soc. 147(31), 27165–27171 (2025). doi:10.1021/jacs.5c07015
[18] Varadi, M., Bertoni, D., Magana, P., ... Steinegger, M., Hassabis, D. & Velankar, S. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 52(D1), D368–D375 (2024). doi:10.1093/nar/gkad1011
[19] Kim, H., Kim, R. S., Mirdita, M. & Steinegger, M. Structural motif search across the protein-universe with Folddisco. bioRxiv (2025). doi:10.1101/2025.07.06.663357
[20] Krishna, R., Wang, J., Ahern, W., ... & Baker, D. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384(6693), eadl2528 (2024). doi:10.1126/science.adl2528
