Teaching AI to Invent Enzymes Nature Never Imagined

DISCO is a multimodal generative model that simultaneously designs protein sequence and 3D structure from scratch — creating entirely new enzymes for chemical reactions that have no precedent in biology.

1 Caltech · 2 Mila · 3 Université de Montréal · 4 Université Paris-Saclay · 5 McGill University · 6 University of Cambridge · 7 University of Oxford · 8 Imperial College London · 9 Institut Courtois · 10 LawZero · 11 AITHYRA · 12 FutureHouse
* Corresponding authors  ·  † Equal contribution
Figure 1. DISCO simultaneously generates protein sequence and 3D structure around a molecular target. The sequence is progressively unmasked with self-correction while the backbone co-folds with the conditioning molecule.
01 — The Challenge

Nature's Chemistry and Beyond

Evolution is the greatest chemist the world has ever known. Over billions of years, it has crafted enzymes of breathtaking precision, molecular machines that accelerate reactions by factors of millions, under the gentlest of conditions. But even evolution has its blind spots. The chemical reactions it has explored represent a remarkably narrow slice of what is possible. Vast swathes of synthetically valuable chemistry remain untouched by biology, not because they are impossible for enzymes, but because evolution simply never had a reason to go there.

The grand challenge of modern enzyme engineering is to venture into this uncharted territory: to build enzymes for reactions that nature has never attempted. Directed evolution, the Nobel Prize-winning approach[1] of iteratively mutating and screening proteins, can push enzymes toward new-to-nature chemistry. But every campaign needs a starting point: an initial protein with at least a flicker of the desired activity. For truly novel chemistry, finding one is laborious and challenging, and we are limited by what evolution has already sampled.

Deep learning has already transformed protein design. Models like RFdiffusion[2] and BindCraft[3] can design novel proteins that fold into desired shapes and bind to specific targets. But designing an enzyme, a protein that doesn't just bind a molecule but transforms it, is a fundamentally harder problem.

Current computational approaches face two critical bottlenecks. First, they require someone to pre-specify the exact arrangement of catalytic residues or reaction transition state structure, a so-called "theozyme", before the design process begins[4]. This demands deep mechanistic understanding of the reaction, which for new-to-nature chemistry is often unavailable. Second, existing methods treat sequence and structure as separate problems, solved sequentially: first generate a backbone, then fill in the amino acids. Because protein function emerges from the interplay of both, critical information is lost at the handoff.

The Central Question

How can we systematically design functional enzymes for chemical transformations that have no precedent in known biology, without pre-specifying a precise catalytic mechanism?

DISCO is the first model to solve both problems at once, jointly generating sequence and structure around any target molecule, with no pre-fixed theozyme required.

02 — The Model

DISCO: Designing Sequence and Structure as One

DISCO, DIffusion for Sequence-structure CO-design, is a multimodal generative model that simultaneously generates both a protein's amino acid sequence and its three-dimensional atomic structure. This might sound like an incremental advance, but it represents a fundamental departure from how the field has worked.

Existing generative approaches, including recent models like RFdiffusion2[5,6], RFdiffusion3[7], and BoltzGen[8], produce designs for the wet lab in two separate stages: first generate a protein backbone, then use a separate inverse-folding network such as LigandMPNN[9] to predict what amino acid sequence would fold into that shape. This decoupled pipeline cannot use sequence-level signals to guide backbone design, or structural context to inform sequence choices.

DISCO eliminates this handoff entirely. It learns a joint distribution over discrete amino acid tokens and continuous 3D coordinates, denoising both simultaneously. The mathematical foundation is elegant: by independently sampling noise per modality during training, the model provably learns the joint reverse process using only unimodal losses, no special joint supervision required[10]. The coupling emerges from the architecture.
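To make the training recipe above concrete, here is a minimal Python sketch of per-modality noising, assuming a masking corruption for the sequence and Gaussian noise for the coordinates. The function names, the mask token, the linear noise schedule, and `sigma_max` are illustrative assumptions, not DISCO's actual implementation. The only points taken from the text are that the two noise levels are sampled independently and that each modality gets its own loss.

```python
import math
import random

MASK = "#"                          # hypothetical mask token
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def noise_sequence(seq, t_seq, rng):
    """Masking corruption: replace each residue by MASK with probability t_seq."""
    return [MASK if rng.random() < t_seq else aa for aa in seq]


def noise_structure(coords, t_struct, rng, sigma_max=10.0):
    """Gaussian corruption: add noise with std t_struct * sigma_max (assumed schedule)."""
    sigma = t_struct * sigma_max
    return [tuple(x + rng.gauss(0.0, sigma) for x in xyz) for xyz in coords]


def make_training_example(seq, coords, rng):
    """Draw one noise level per modality, independently of each other."""
    t_seq = rng.random()
    t_struct = rng.random()         # not tied to t_seq
    return {
        "t_seq": t_seq,
        "t_struct": t_struct,
        "noisy_seq": noise_sequence(seq, t_seq, rng),
        "noisy_coords": noise_structure(coords, t_struct, rng),
    }


def unimodal_losses(pred_probs, pred_coords, seq, coords, noisy_seq):
    """Cross-entropy on masked positions and mean squared error on coordinates."""
    masked = [i for i, tok in enumerate(noisy_seq) if tok == MASK]
    ce = sum(-math.log(pred_probs[i][seq[i]]) for i in masked) / max(len(masked), 1)
    mse = sum(
        sum((p - c) ** 2 for p, c in zip(pxyz, cxyz))
        for pxyz, cxyz in zip(pred_coords, coords)
    ) / len(coords)
    return ce, mse
```

In this sketch a network would receive `noisy_seq`, `noisy_coords`, and both noise levels, and would be trained on the sum of the two losses. Because `t_seq` and `t_struct` are independent, the network sees every combination of clean and noisy modalities during training.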

Core Techniques

Getting the model to generate sequences that actually fold into their designed structures requires keeping the two modalities aligned during sampling. DISCO relies on three inference-time innovations, each with a dramatic effect on co-designability:

Cross-Modal Recycling
Structure informs sequence, sequence informs structure
At each denoising step, the model conditions on four signals: its current predicted clean sequence and structure, plus the noised versions of both. A frozen protein language model and a structure encoder inject rich information bidirectionally, ensuring amino acid choices reflect emerging geometry while backbone predictions adapt to evolving sequence identity.
Self-Correcting Sequence
Revise early mistakes, temper overconfidence
Standard masked diffusion unmasks tokens one-by-one, irreversibly. DISCO enables sequence revision, revisiting and correcting earlier amino acid choices. An entropy-adaptive temperature mechanism smooths the amino acid distribution early in the trajectory when structural detail is still coarse, allowing confident commitments only once the backbone has crystallized.
Noisy Guidance
Condition on noise to sharpen predictions
DISCO conditions each modality on a noisier version of the other, sharpening the model's conditional predictions and ensuring the two modalities stay aligned throughout the trajectory. The intuition: by seeing a slightly degraded version of the partner signal, the model learns to extract the essential structural or sequence information rather than overfitting to intermediate noise.
Ablation: co-designability (fraction of sequences that refold within 2 Å)
  • Full DISCO (+ noisy guidance): 0.88
  • DISCO without noisy guidance: 0.80
  • − Recycling: 0.62
  • − Entropy-adaptive temperature: 0.60
  • − Sequence correction (standard masked diffusion): 0.23
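The entropy-adaptive temperature idea can be illustrated with a small Python sketch. The exact rule DISCO uses is not given on this page, so the formula in `adaptive_temperature` (and its `alpha` parameter) is a hypothetical stand-in: it raises the softmax temperature when a prediction is confident (low entropy) while the structure is still noisy, and falls back to temperature 1 once the structure is clean.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_temperature(logits, struct_noise, alpha=2.0):
    """Hypothetical rule: more smoothing for confident predictions on noisy structures.

    struct_noise is in [0, 1]: 1 means a very coarse backbone, 0 a fully denoised one.
    """
    h = entropy(softmax(logits))
    h_max = math.log(len(logits))
    confidence = 1.0 - h / h_max        # 0 for uniform, close to 1 for one-hot
    return 1.0 + alpha * struct_noise * confidence


def tempered_distribution(logits, struct_noise):
    """Amino acid distribution after entropy-adaptive smoothing."""
    return softmax(logits, adaptive_temperature(logits, struct_noise))
```

Called with the same logits, `tempered_distribution(logits, 1.0)` gives a flatter distribution than `tempered_distribution(logits, 0.0)`, which mirrors the behaviour described above: hedged choices early, confident commitments late.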

Some other highlights of the model are as follows:

  • Conditions on arbitrary biomolecular contexts: small molecules, reactive intermediates, DNA, RNA
  • Co-folds ligands with the protein throughout generation
  • Trained on unfiltered PDB, with no "designability" selection bias and no synthetic data
Figure 2. DISCO architecture and multimodal inference overview with arbitrary molecular conditioning.

Inference-Time Scaling

Because DISCO operates on both modalities, it unlocks inference-time steering using reward functions defined over both sequence and structure. Rather than generating thousands of candidates and filtering, DISCO uses Feynman-Kac Correctors[11] (FKC) — a principled mathematical framework that tilts the sampling distribution toward proteins with desired properties during generation itself. We derive two novel FKC methods:

FKC-Multimodal (FKC-MM) steers generation using reward functions defined simultaneously on both discrete sequences and continuous structures, something impossible in decoupled pipelines. When targeting increased disulfide bonds, FKC-MM produces 100-residue proteins with six disulfide bonds, a density found in only the top 0.2% of comparable training proteins. Guidance on structure alone, by contrast, does not achieve this.

FKC-Specificity Guidance (FKC-SG) solves a different problem: designing a protein that binds a target molecule while avoiding a structurally similar decoy. By sampling from a tilted distribution that encourages on-target likelihood while penalizing off-target likelihood, FKC-SG generates proteins with high binding-site separation, including cases where best-of-N filtering produces zero hits.
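As a rough illustration of reward tilting, the Python sketch below shows a multimodal reward and a plain importance-resampling step that favours candidates in proportion to exp(beta × reward). This is a simplified stand-in, not the FKC-MM or FKC-SG derivations from the paper, which act during the denoising trajectory itself. The Cα distance cutoff, the `beta` and `lam` parameters, and all function names are assumptions made for this example.

```python
import math
import random


def disulfide_reward(seq, ca_coords, cutoff=6.5):
    """Count Cys pairs (sequence) whose C-alpha atoms are within cutoff Å (structure).

    Each cysteine joins at most one pair; the cutoff is an illustrative assumption.
    """
    cys = [i for i, aa in enumerate(seq) if aa == "C"]
    used, bonds = set(), 0
    for a, i in enumerate(cys):
        if i in used:
            continue
        for j in cys[a + 1:]:
            if j not in used and math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                used.update((i, j))
                bonds += 1
                break
    return bonds


def specificity_reward(logp_on, logp_off, lam=1.0):
    """Reward on-target log-likelihood and penalize off-target log-likelihood."""
    return logp_on - lam * logp_off


def tilt_weights(rewards, beta):
    """Normalized weights proportional to exp(beta * reward)."""
    m = max(rewards)
    w = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(w)
    return [x / z for x in w]


def resample(particles, rewards, beta, rng):
    """Draw a new population, favouring high-reward particles."""
    return rng.choices(particles, weights=tilt_weights(rewards, beta), k=len(particles))
```

The disulfide reward needs both the sequence (which residues are cysteines) and the structure (which of them are close in space), which is why a backbone-only generator cannot evaluate it before a sequence has been assigned.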

03 — Computational Benchmarks

State-of-the-Art Across Diverse Design Tasks

On unconditional monomer generation, ~90% of generated sequences refold to within 2 Å of their designed backbones, while achieving the highest sequence and structural diversity and novelty among existing methods. On the Studio-179 benchmark, a new library of 179 natural and non-natural ligands spanning catalysis, pharmaceuticals, luminescence, and sensing, DISCO generates the most diverse, co-designable complexes for 178 of 179 targets, surpassing all baselines[7,8]. Note that only DISCO natively supports co-folding with multiple ligands.
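The co-designability number is a simple fraction, and a short Python sketch shows how it could be computed. It assumes each design comes with two backbones, the generated one and the one obtained by refolding the generated sequence with a structure predictor, and that the two are already superposed, so no alignment step is included. The function names and the strict `<` comparison are choices made for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation in Å between two superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    sq = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))


def co_designability(pairs, threshold=2.0):
    """Fraction of (designed, refolded) backbone pairs with RMSD below threshold Å."""
    hits = sum(1 for designed, refolded in pairs if rmsd(designed, refolded) < threshold)
    return hits / len(pairs)
```

A real evaluation would first superpose the refolded backbone onto the designed one (for example with the Kabsch algorithm) before measuring RMSD.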

The generality extends beyond small molecules. DISCO successfully generates co-designable proteins predicted to bind sequence-specific DNA and RNA, outperforming existing models on macromolecular interfaces as well. From small-molecule cofactors to nucleic acids, no other generative model matches DISCO's breadth.

Figure 3. Benchmarking and steering. (A) DISCO outperforms existing methods in co-designability, novelty, and diversity when designing proteins conditioned on a wide range of biomolecular targets. (B) An in silico demonstration of how FKC-SG specificity guidance can design proteins that selectively bind one target over another, rather than relying on simple filtering.

But co-designability alone is not enough. DISCO's designs also capture the complex statistical properties of real proteins: natural amino-acid compositions, diverse secondary structures, favorable Ramachandran geometries, appropriate surface hydrophobicity and net charge, and high long-range contact order, indicating complex, well-connected topologies rather than trivial folds.

Novel Pockets That Understand Chemistry

The most distinctive capability of joint sequence-structure generation is that the pocket's chemistry and geometry co-adapt with the target's conformation. DISCO samples chemically valid target conformers that explore geometric diversity beyond the reference input. More crucially, DISCO's designs are chemically intelligent: binding-site lipophilicity correlates with ligand hydrophobicity, appropriate coordinating residues emerge for specific cofactors, and cavities form with the right geometry to avoid clashes. These pockets are also diverse — up to 80% motif diversity among the four closest residues — and novel, with the majority of designed pockets having no close match in AlphaFoldDB.

Figure 4. Realistic features, responsive pockets, novel motifs. (A) DISCO designs novel active-site motifs. (B) An example of a novel binding site (purple) designed to bind Coenzyme Q1, compared to the closest motif in AlphaFoldDB (beige).
04 — Experimental Validation

New Enzymes for New-to-Nature Chemistry

The ultimate test is the wet lab. DISCO was challenged with designing enzymes for carbene-transfer reactions, a class of transformations nature has not explored, valuable for constructing pharmaceuticals and complex molecules. This chemistry was first brought into biology by directed evolution of cytochrome P450 enzymes for cyclopropanation[12], and has since been expanded to boron–hydrogen[13] and carbon–hydrogen[14] bond insertion, among others. Carbene transfers proceed via formation of an iron-carbenoid intermediate, which then delivers the carbene fragment to a substrate through diverse pathways.

Rather than specifying a precise theozyme[15], DISCO was conditioned solely on DFT-computed geometries of the heme–carbene precursor intermediate, a deliberate simplification. We let the model explore catalytic solutions without being constrained by human assumptions about which residues are required or what the transition state looks like.

From ~20,000 generated sequence-structure pairs (one to two orders of magnitude fewer than recent pipelines[5,6,16]), 90 designs were selected through computational filtering and tested across four distinct reactions. Below, we highlight yield (fraction of substrate converted to product) and total turnover number (TTN, the number of substrate molecules each enzyme molecule converts before deactivation). Although the designs were not optimized for stereoselectivity, enantiomeric excess reached up to 35%, with enzymes favoring either enantiomer identified for three of the four reactions.
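For readers less familiar with these metrics, the short Python sketch below spells out the standard definitions of yield, TTN, and enantiomeric excess. The amounts used in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from this work.

```python
def percent_yield(mol_product, mol_substrate):
    """Yield in percent: product formed relative to limiting substrate supplied."""
    return 100.0 * mol_product / mol_substrate


def total_turnover_number(mol_product, mol_enzyme):
    """TTN: moles of product formed per mole of enzyme before it deactivates."""
    return mol_product / mol_enzyme


def enantiomeric_excess(amount_a, amount_b):
    """Signed e.e. in percent; positive when enantiomer A dominates."""
    return 100.0 * (amount_a - amount_b) / (amount_a + amount_b)


# Hypothetical example: 10 µmol substrate, 0.002 µmol enzyme, 7.2 µmol product.
example_yield = percent_yield(7.2, 10.0)
example_ttn = total_turnover_number(7.2, 0.002)
# Hypothetical 67.5 : 32.5 mixture of the two enantiomers.
example_ee = enantiomeric_excess(67.5, 32.5)
```

Yield and TTN answer different questions: yield depends on how much substrate was supplied, while TTN normalizes by the amount of enzyme, so a reaction run at low catalyst loading can show a high TTN even at moderate yield.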

Reaction 01

Alkene Cyclopropanation

Cyclopropanation of 4-methoxystyrene with ethyl diazoacetate (EDA) — a benchmark reaction for carbene chemistry that builds strained three-membered rings widely used in medicinal chemistry.
Reaction scheme: styrene cyclopropanation
TTN: 4,050 · Yield: 72% · d.r.: 99:1

For this substrate, the top DISCO design has activity that surpasses both early evolved P450 enzymes, the original breakthrough in engineered cyclopropanation biocatalysis [12] (Science 2013, 339, 6117, 307), and the recently reported designed enzyme PNC2, which scaffolded a helix bundle around a porphyrin-based theozyme[15] (Science 2025, 388, 6747, 665).

vs. Prior Enzyme Engineering Approaches (TTN)
  • DISCO: 4,050
  • PNC2 (theozyme): 630
  • P450H2-5-F10 (evol.): 364
Reaction 02

B–H Insertion

Carbene insertion into an N-heterocyclic carbene–borane, forging a new carbon–boron bond. No organism is known to catalyze this transformation; it is entirely alien to biology.
Reaction scheme: B–H insertion
TTN: 5,170 · Yield: 98%

A single DISCO design, with no laboratory optimization, more than doubles the activity achieved by three rounds of directed evolution[13] (Nature 2017, 552, 7683, 132), and exceeds the starting point by 43×.

vs. Previous Directed Evolution Campaign (TTN)
  • DISCO: 5,170
  • 3 rounds evol.: 2,490
  • Rma cyt c: 120
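The fold improvements quoted for B–H insertion follow directly from the TTN values listed above. The few lines of Python below redo that arithmetic; the dictionary keys are labels chosen for this example.

```python
def fold_change(new_ttn, reference_ttn):
    """Ratio of two total turnover numbers."""
    return new_ttn / reference_ttn


# TTN values as reported on this page for B-H insertion.
ttn = {"DISCO": 5170, "3 rounds evol.": 2490, "Rma cyt c": 120}

vs_evolved = fold_change(ttn["DISCO"], ttn["3 rounds evol."])   # a little over 2
vs_start = fold_change(ttn["DISCO"], ttn["Rma cyt c"])          # roughly 43
```

Both ratios agree with the text: the DISCO design slightly more than doubles the evolved variant and exceeds the starting enzyme about 43-fold.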
Reaction 03

C(sp³)–H Insertion

Selective alkylation of an unactivated C–H bond in 1-phenylpyrrolidine, among the most challenging transformations in organic chemistry, requiring exquisite control over site- and stereoselectivity.
Reaction scheme: C–H insertion
TTN: 2,360 · Yield: 42%

The previous engineering campaign[14] (Nature 2019, 565, 7737, 67) started at fewer than 20 TTN and required 14 rounds of directed evolution to reach 2,030 TTN. DISCO exceeds that endpoint in a single computational step.

One Computational Design vs. 14 Rounds of Lab Evolution (TTN)
  • DISCO: 2,360
  • 14 rounds evol.: 2,030
  • P450 WT: N.D.
Reaction 04

Spirocyclopropanation

Building a strained spirocyclic motif in a pharmaceutically relevant azaspiro[2.3]alkane scaffold, sterically and electronically demanding.
Spirocyclopropanation Reaction Scheme

Initial activity was modest, but the designs proved highly evolvable: a single round of error-prone PCR mutagenesis of dCT-H11 produced a variety of improved variants, with mutations that both increased activity severalfold and shifted stereoselectivity in opposite directions, some favoring one enantiomer (+49% e.e.) and others inverting to the opposite (–35% e.e.). This reaction class was recently explored enzymatically[17] (JACS 2025, 147, 31, 27165).

05 — Novel Architectures

Active Sites That Don't Exist in Nature

Perhaps the most interesting property of DISCO's designs is the novelty and diversity of their molecular architectures. When generated binding motifs are searched against the entirety of the AlphaFold Database, the majority have no close natural homologs. Over 90% of the motifs cluster into distinct groups. These are chemically plausible, new residue motifs, invented by the model to accommodate the target.

The closest structural match to dCT-H11, one of the top-performing designs, is a TetR-family transcription factor from the extremophile Haloarcula marismortui, a DNA-binding protein with no known catalytic activity and only 21% sequence identity to the DISCO-designed sequence. DISCO repurposed this non-enzymatic topology for carbene transfer. The catalytic residue geometry is novel: the closest motif in AlphaFoldDB[18,19] deviates by more than 7 Å. Other designs are even more remote: dCT-F9 (TM-score 0.52, 5% identity) and dCT-G9 (TM-score 0.51, 9% identity) adopt folds with no corresponding motif identified anywhere in AlphaFoldDB.

Top DISCO Design — dCT-H11
DISCO-designed enzyme dCT-H11 — a novel carbene transferase with no natural homolog
Closest Natural Match
PDB 3CRJ
TetR-family transcription factor from Haloarcula marismortui, a Dead Sea extremophile. A DNA-binding protein with no known catalytic activity. The closest motif to the designed active site deviates by more than 7 Å.
Motif RMSD: >7 Å · Seq. identity: 21% · TM-score: 0.81
Not a heme-binding protein. DISCO repurposed this non-enzymatic fold for carbene chemistry with a completely novel active-site geometry. Other top designs have lower seq. identity (<0.1) and pdbTM (~0.50), also with novel active sites.

None of the closest structural matches are naturally heme-binding proteins. DISCO has learned the underlying biochemical principles that enable heme binding and carbene transfer, and applies them to protein folds that evolution never associated with this chemistry. It does not remix known parts. It discovers fundamentally new solutions.

06 — Evolvability

Designs That Evolution Can Build Upon

An enzyme's fitness landscape is sometimes as important as the design itself. DISCO's designs appear to occupy regions of sequence space with accessible uphill paths. When dCT-H11 was subjected to one round of error-prone PCR, screening ~700 mutants for the spirocyclopropanation reaction (Reaction 04) revealed ~35 significantly improved variants. The best achieved a fourfold activity increase, with mutations scattered across the protein, a long-range landscape characteristic of natural evolution. Some substitutions also inverted enantioselectivity, from +49% to –35% e.e., demonstrating a structured pocket that can be manipulated.

Key Insight

DISCO doesn't just design enzymes, it designs starting points for evolution.

07 — Looking Forward

Toward Genetically Encodable Arbitrary Chemistry

The four reactions explored here are a small sample from a vast universe of synthetically valuable transformations not found in nature. DISCO addresses a critical bottleneck: generating diverse, functional, evolvable enzymes from scratch, conditioned solely on the target chemistry — bypassing transition-state calculations, theozyme scaffolding[4], and large-scale screening.

Key Contributions

  • Multimodal co-design: Simultaneous sequence and structure generation through joint diffusion with cross-modal recycling.
  • No theozyme required: Enzymes designed from reaction intermediates alone — no pre-specified catalytic residues or precise theozymes, in contrast to physics-based approaches[15] and motif-scaffolding methods[4].
  • Arbitrary molecular conditioning: Co-folds with small molecules, cofactors, intermediates, and nucleic acids, producing the most diverse, co-designable complexes for 178 of 179 Studio-179 targets.
  • Principled inference-time steering: Feynman-Kac Correctors[11] for controllable, multimodal generation: reward tilting, not generate-and-filter.
  • Novel active-site geometries: Designed motifs have no close homologs across 200M+ structures in AlphaFoldDB.
  • High experimental activity: Top designs exceed extensively engineered variants from just 90 tested genes.
  • Evolvable scaffolds: One round of mutagenesis yields fourfold gains with divergent stereoselectivity.

The chemistry nature never explored is now within reach.

References

  1. Arnold, F. H. Innovation by Evolution: Bringing New Chemistry to Life (Nobel Lecture). Angew. Chem. Int. Ed. 58(41), 14420–14426 (2019). doi:10.1002/anie.201907729
  2. Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., ... & Baker, D. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023). doi:10.1038/s41586-023-06415-8
  3. Pacesa, M., Nickel, L., Schmidt, J., ... & Correia, B. E. One-shot design of functional protein binders with BindCraft. Nature (2025). doi:10.1038/s41586-025-09429-6
  4. Wang, J., Lisanza, S., Juergens, D., Tischer, D., Watson, J. L., ... & Baker, D. Scaffolding protein functional sites using deep learning. Science 377(6604), 387–394 (2022). doi:10.1126/science.abn2100
  5. Ahern, W., Yim, J., Tischer, D., ... Krishna, R. & Baker, D. Atom-level enzyme active site scaffolding using RFdiffusion2. Nature Methods (2025). doi:10.1038/s41592-025-02975-x
  6. Kim, D., ... Krishna, R. & Baker, D. Computational design of metallohydrolases. Nature (2025). doi:10.1038/s41586-025-09746-w
  7. Butcher, J., Krishna, R., Mitra, R., ... & Baker, D. De novo design of all-atom biomolecular interactions with RFdiffusion3. bioRxiv (2025). doi:10.1101/2025.09.18.676967
  8. Stark, H., Faltings, F., Choi, M., ... Barzilay, R. & Jaakkola, T. BoltzGen: Toward Universal Binder Design. bioRxiv (2025). doi:10.1101/2025.11.20.689494
  9. Dauparas, J., Lee, G. R., Pecoraro, R., An, L., Anishchenko, I., Glasscock, C. & Baker, D. Atomic context-conditioned protein sequence design using LigandMPNN. Nat. Methods 22, 717–723 (2025). doi:10.1038/s41592-025-02626-1
  10. Rojas, K., Zhu, Y., Zhu, S., Ye, F. X.-F. & Tao, M. Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces. ICML (2025). arXiv:2506.07903
  11. Skreta, M., Akhound-Sadegh, T., Ohanesian, V., Bondesan, R., Aspuru-Guzik, A., Doucet, A., Brekelmans, R., Tong, A. & Neklyudov, K. Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts. ICML (2025). arXiv:2503.02819
  12. Coelho, P. S., Brustad, E. M., Kannan, A. & Arnold, F. H. Olefin cyclopropanation via carbene transfer catalyzed by engineered cytochrome P450 enzymes. Science 339(6117), 307–310 (2013). doi:10.1126/science.1231434
  13. Kan, S. B. J., Huang, X., Gumulya, Y., Chen, K. & Arnold, F. H. Genetically programmed chiral organoborane synthesis. Nature 552(7683), 132–136 (2017). doi:10.1038/nature24996
  14. Zhang, R. K., Chen, K., Huang, X., Wohlschlager, L., Renata, H. & Arnold, F. H. Enzymatic assembly of carbon–carbon bonds via iron-catalysed sp3 C–H functionalization. Nature 565(7737), 67–72 (2019). doi:10.1038/s41586-018-0808-5
  15. Hou, K., Huang, W., Qi, M., ... & DeGrado, W. F. De novo design of porphyrin-containing proteins as efficient and stereoselective catalysts. Science 388(6747), 665–670 (2025). doi:10.1126/science.adt7268
  16. Braun, M., Tripp, A., Chakatok, M., ... & Oberdorfer, G. Computational enzyme design by catalytic motif scaffolding (Riff-Diff). Nature 649(8095), 237–245 (2026). doi:10.1038/s41586-025-09747-9
  17. Kennemur, J. L., Long, Y., Ko, C. J., Das, A. & Arnold, F. H. Enzymatic stereodivergent synthesis of azaspiro[2.y]alkanes. J. Am. Chem. Soc. 147(31), 27165–27171 (2025). doi:10.1021/jacs.5c07015
  18. Varadi, M., Bertoni, D., Magana, P., ... Steinegger, M., Hassabis, D. & Velankar, S. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 52(D1), D368–D375 (2024). doi:10.1093/nar/gkad1011
  19. Kim, H., Kim, R. S., Mirdita, M. & Steinegger, M. Structural motif search across the protein-universe with Folddisco. bioRxiv (2025). doi:10.1101/2025.07.06.663357
  20. Krishna, R., Wang, J., Ahern, W., ... & Baker, D. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384(6693), eadl2528 (2024). doi:10.1126/science.adl2528