DISCO is a multimodal generative model that simultaneously designs protein sequence and 3D structure from scratch — creating entirely new enzymes for chemical reactions that have no precedent in biology.
Evolution is the greatest chemist the world has ever known. Over billions of years, it has crafted enzymes of breathtaking precision, molecular machines that accelerate reactions by factors of millions, under the gentlest of conditions. But even evolution has its blind spots. The chemical reactions it has explored represent a remarkably narrow slice of what is possible. Vast swathes of synthetically valuable chemistry remain untouched by biology, not because they are impossible for enzymes, but because evolution simply never had a reason to go there.
The grand challenge of modern enzyme engineering is to venture into this uncharted territory: to build enzymes for reactions that nature has never attempted. Directed evolution, the Nobel Prize-winning approach[1] of iteratively mutating and screening proteins, can push enzymes toward new-to-nature chemistry. But every campaign needs a starting point: an initial protein with at least a flicker of the desired activity. For truly novel chemistry, finding one is laborious and challenging, and we are limited by what evolution has already sampled.
Deep learning has already transformed protein design. Models like RFdiffusion[2] and BindCraft[3] can design novel proteins that fold into desired shapes and bind to specific targets. But designing an enzyme, a protein that doesn't just bind a molecule but transforms it, is a fundamentally harder problem.
Current computational approaches face two critical bottlenecks. First, they require someone to pre-specify the exact arrangement of catalytic residues or reaction transition state structure, a so-called "theozyme", before the design process begins[4]. This demands deep mechanistic understanding of the reaction, which for new-to-nature chemistry is often unavailable. Second, existing methods treat sequence and structure as separate problems, solved sequentially: first generate a backbone, then fill in the amino acids. Because protein function emerges from the interplay of both, critical information is lost at the handoff.
How can we systematically design functional enzymes for chemical transformations that have no precedent in known biology, without pre-specifying a precise catalytic mechanism?
DISCO is the first model to solve both problems at once, jointly generating sequence and structure around any target molecule, with no pre-fixed theozyme required.
DISCO, DIffusion for Sequence-structure CO-design, is a multimodal generative model that simultaneously generates both a protein's amino acid sequence and its three-dimensional atomic structure. This might sound like an incremental advance, but it represents a fundamental departure from how the field has worked.
The wet-lab pipelines of existing generative approaches, including recent models like RFdiffusion2[5,6], RFdiffusion3[7], and BoltzGen[8], work in two separate stages: first generate a protein backbone, then use a separate inverse-folding network such as LigandMPNN[9] to predict what amino acid sequence would fold into that shape. This decoupled pipeline cannot use sequence-level signals to guide backbone design, or structural context to inform sequence choices.
DISCO eliminates this handoff entirely. It learns a joint distribution over discrete amino acid tokens and continuous 3D coordinates, denoising both simultaneously. The mathematical foundation is elegant: by independently sampling noise per modality during training, the model provably learns the joint reverse process using only unimodal losses, no special joint supervision required[10]. The coupling emerges from the architecture.
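The training idea above can be illustrated with a minimal sketch. It assumes a masking corruption for the sequence channel and Gaussian noise for the coordinate channel; both choices, along with the names `MASK`, `corrupt`, `unimodal_losses`, and `sigma_max`, are illustrative assumptions and not DISCO's actual implementation. The point it shows is that each modality gets its own independently drawn noise level and its own loss.

```python
import math
import random

MASK = "#"  # hypothetical mask token for the discrete sequence channel


def corrupt(sequence, coords, rng, sigma_max=10.0):
    """Noise each modality with its own independently drawn noise level."""
    t_seq = rng.random()  # noise level for the sequence channel
    t_xyz = rng.random()  # drawn independently for the structure channel
    # Discrete channel: mask each token with probability t_seq (assumed scheme).
    noisy_seq = [MASK if rng.random() < t_seq else aa for aa in sequence]
    # Continuous channel: add Gaussian noise scaled by t_xyz (assumed scheme).
    sigma = sigma_max * t_xyz
    noisy_xyz = [tuple(c + rng.gauss(0.0, sigma) for c in atom) for atom in coords]
    return noisy_seq, noisy_xyz, t_seq, t_xyz


def unimodal_losses(seq_probs, true_seq, noisy_seq, pred_xyz, true_xyz):
    """Cross-entropy on masked tokens and mean squared error on coordinates."""
    masked = [i for i, tok in enumerate(noisy_seq) if tok == MASK]
    ce = -sum(math.log(seq_probs[i][true_seq[i]]) for i in masked) / max(len(masked), 1)
    sq_err = sum((p - q) ** 2 for a, b in zip(pred_xyz, true_xyz) for p, q in zip(a, b))
    mse = sq_err / (3 * len(true_xyz))
    return ce, mse
```

In a real model, `seq_probs` and `pred_xyz` would come from a single network that sees both noisy inputs, which is where the coupling between the two modalities would arise.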
Getting the model to generate sequences that actually fold into their designed structures requires aligning the two modalities at inference time. Three inference innovations make this possible, each with a dramatic effect on co-designability:
Some other highlights of the model are as follows:
Because DISCO operates on both modalities, it unlocks inference-time steering using reward functions defined over both sequence and structure. Rather than generating thousands of candidates and filtering, DISCO uses Feynman-Kac Correctors[11] (FKC) — a principled mathematical framework that tilts the sampling distribution toward proteins with desired properties during generation itself. We derive two novel FKC methods:
FKC-Multimodal (FKC-MM) steers generation using reward functions defined simultaneously on both discrete sequences and continuous structures, something impossible in decoupled pipelines. When targeting increased disulfide bonds, FKC-MM produces 100-residue proteins with six disulfide bonds, a density found in only the top 0.2% of comparable training proteins. Guidance on structure alone, by contrast, does not work.
FKC-Specificity Guidance (FKC-SG) solves a different problem: designing a protein that binds a target molecule while avoiding a structurally similar decoy. By sampling from a tilted distribution that encourages on-target likelihood while penalizing off-target likelihood, FKC-SG generates proteins with high binding-site separation, including cases where best-of-N filtering produces zero hits.
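To make the two kinds of reward concrete, here is a simplified sketch. It is not the FKC derivation itself: the resampling step is plain importance resampling with weights proportional to exp(beta × reward), swapped in as a stand-in for the tilting that FKC performs during generation. The function names, the 2.5 Å sulfur-sulfur cutoff, the `penalty` weight, and `beta` are all assumptions made for illustration.

```python
import math
import random


def disulfide_reward(sequence, sg_coords, max_dist=2.5):
    """Count non-overlapping Cys-Cys pairs whose sulfur atoms lie within max_dist.

    Needs both modalities: the sequence says which residues are cysteines,
    the structure says whether their sulfurs are close enough to bond.
    sg_coords[i] is an assumed per-residue side-chain sulfur position.
    """
    cys = [i for i, aa in enumerate(sequence) if aa == "C"]
    pairs = sorted(
        (math.dist(sg_coords[i], sg_coords[j]), i, j)
        for k, i in enumerate(cys)
        for j in cys[k + 1:]
    )
    used, bonds = set(), 0
    for dist, i, j in pairs:  # greedily accept the closest pairs first
        if dist <= max_dist and i not in used and j not in used:
            used.update((i, j))
            bonds += 1
    return bonds


def specificity_reward(logp_on, logp_off, penalty=1.0):
    """Reward on-target log-likelihood and penalize off-target log-likelihood."""
    return logp_on - penalty * logp_off


def resample(particles, rewards, beta, rng):
    """Multinomial resampling with weights proportional to exp(beta * reward)."""
    top = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * (r - top)) for r in rewards]
    return rng.choices(particles, weights=weights, k=len(particles))
```

A structure-only reward could not evaluate `disulfide_reward`, because it would not know which residues are cysteines; that is the sense in which the reward is multimodal.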
On unconditional monomer generation, ~90% of generated sequences refold to within 2Å of their designed backbones, while achieving the highest sequence and structural diversity and novelty among existing methods. On the Studio-179 benchmark, a new library of 179 natural and non-natural ligands spanning catalysis, pharmaceuticals, luminescence, and sensing, DISCO generates the most diverse, co-designable complexes for 178 of 179 targets, surpassing all baselines[7,8]. Note that only DISCO natively supports co-folding with multiple ligands.
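The refolding criterion can be sketched as a simple calculation. The snippet below assumes the designed and refolded backbones have already been superposed (for example with a Kabsch alignment, not shown) and that a 2 Å cutoff defines a co-designable design; the function names are hypothetical and the exact atoms and alignment used in the benchmark are not specified here.

```python
import math


def rmsd(designed, refolded):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) atoms."""
    if len(designed) != len(refolded):
        raise ValueError("structures must have the same number of atoms")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(designed, refolded))
    return math.sqrt(total / len(designed))


def codesignable_fraction(pairs, threshold=2.0):
    """Fraction of (designed, refolded) pairs with RMSD at or below threshold angstroms."""
    hits = sum(1 for designed, refolded in pairs if rmsd(designed, refolded) <= threshold)
    return hits / len(pairs)
```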
The generality extends beyond small molecules. DISCO successfully generates co-designable proteins predicted to bind sequence-specific DNA and RNA, outperforming existing models on macromolecular interfaces as well. From small-molecule cofactors to nucleic acids, no other generative model matches DISCO's breadth.
But co-designability alone is not enough. DISCO's designs also capture the complex statistical properties of real proteins: natural amino-acid compositions, diverse secondary structures, favorable Ramachandran geometries, appropriate surface hydrophobicity and net charge, and high long-range contact order, indicating complex, well-connected topologies rather than trivial folds.
The most distinctive capability of joint sequence-structure generation is that the pocket's chemistry and geometry co-adapt with the target's conformation. DISCO samples chemically valid target conformers that explore geometric diversity beyond the reference input. More crucially, DISCO's designs are chemically intelligent: binding-site lipophilicity correlates with ligand hydrophobicity, appropriate coordinating residues emerge for specific cofactors, and cavities form with the right geometry to avoid clashes. These pockets are also diverse — up to 80% motif diversity among the four closest residues — and novel, with the majority of designed pockets having no close match in AlphaFoldDB.
The ultimate test is the wet lab. DISCO was challenged with designing enzymes for carbene-transfer reactions, a class of transformations nature has not explored, valuable for constructing pharmaceuticals and complex molecules. This chemistry was first brought into biology by directed evolution of cytochrome P450 enzymes for cyclopropanation[12], and has since been expanded to boron-hydrogen[13] and carbon–hydrogen[14] bond insertion, among others. Carbene transfers proceed via formation of an iron-carbenoid intermediate, which then delivers the carbene fragment to a substrate through diverse pathways.
Rather than specifying a precise theozyme[15], DISCO was conditioned solely on DFT-computed geometries of the heme–carbene precursor intermediate, a deliberate simplification. We let the model explore catalytic solutions without being constrained by human assumptions about which residues are required or what the transition state looks like.
From ~20,000 generated sequence-structure pairs, one to two orders of magnitude fewer than recent pipelines[5,6,16], 90 designs were selected through computational filtering and tested across four distinct reactions. Below, we highlight yield (fraction of substrate converted to product) and total turnover number (TTN, the number of substrate molecules each enzyme molecule converts before deactivation). Although the designs were not optimized for stereoselectivity, enantiomeric excess reached up to 35%, with enzymes favoring either enantiomer identified for three of the four reactions.
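For readers less familiar with these metrics, the standard definitions can be written as a short sketch. The function names and the choice of inputs (molar amounts, and amounts of each enantiomer) are illustrative assumptions; the sign convention for enantiomeric excess follows the signed values quoted in this article, where a negative value means the second enantiomer dominates.

```python
def percent_yield(product_mol, substrate_mol):
    """Percentage of the starting substrate that was converted to product."""
    return 100.0 * product_mol / substrate_mol


def total_turnover_number(product_mol, enzyme_mol):
    """Moles of product formed per mole of enzyme before the enzyme deactivates."""
    return product_mol / enzyme_mol


def enantiomeric_excess(first_enantiomer, second_enantiomer):
    """Signed e.e. in percent; positive when the first enantiomer is in excess."""
    total = first_enantiomer + second_enantiomer
    return 100.0 * (first_enantiomer - second_enantiomer) / total
```

For example, a product mixture of 74.5 parts of one enantiomer to 25.5 parts of the other corresponds to an e.e. of +49%.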
For this substrate, the top DISCO design surpasses the activity of both the early evolved P450 enzymes, the original breakthrough in engineered cyclopropanation biocatalysis[12] (Science 2013, 339, 6117, 307), and the recently reported designed enzyme PNC2, which scaffolded a helix bundle around a porphyrin-based theozyme[15] (Science 2025, 388, 6747, 665).
A single DISCO design, with no laboratory optimization, more than doubles the activity achieved by three rounds of directed evolution[13] (Nature 2017, 552, 7683, 132), and exceeds the starting point by 43×.
The previous engineering campaign[14] (Nature 2019, 565, 7737, 67) started at fewer than 20 TTN and required 14 rounds of directed evolution to reach 2,030 TTN. DISCO exceeds that endpoint in a single computational step.
Initial activity was modest, but the designs proved highly evolvable: a single round of error-prone PCR mutagenesis of dCT-H11 produced a variety of improved variants, with mutations that increased activity several-fold and shifted stereoselectivity in both directions, some favoring one enantiomer (+49% e.e.) and others inverting to the opposite (–35% e.e.). This reaction class was recently explored enzymatically[17] (JACS 2025, 147, 31, 27165).
Perhaps the most interesting property of DISCO's designs is the novelty and diversity of their molecular architectures. When generated binding motifs are searched against the entirety of the AlphaFold Database, the majority have no close natural homologs. Over 90% of the motifs cluster into distinct groups. These are chemically plausible, new residue motifs, invented by the model to accommodate the target.
The closest structural match to dCT-H11, one of the top-performing designs, is a TetR-family transcription factor from the extremophile Haloarcula marismortui, a DNA-binding protein with no known catalytic activity and only 21% sequence identity to the DISCO-designed sequence. DISCO repurposed this non-enzymatic topology for carbene transfer. The catalytic residue geometry is novel: the closest motif in AlphaFoldDB[18,19] deviates by more than 7Å. Other designs are even more remote: dCT-F9 (TM-score 0.52, 5% identity) and dCT-G9 (TM-score 0.51, 9% identity) adopt folds with no corresponding motif identified anywhere in AlphaFoldDB.
None of the closest structural matches are naturally heme-binding proteins. DISCO has learned the underlying biochemical principles that enable heme binding and carbene transfer, and applies them to protein folds that evolution never associated with this chemistry. It does not remix known parts. It discovers fundamentally new solutions.
An enzyme's fitness landscape is sometimes as important as the design itself. DISCO's designs appear to occupy regions of sequence space with accessible uphill paths. When dCT-H11 was subjected to one round of error-prone PCR, screening ~700 mutants for the spirocyclopropanation reaction (Reaction 04) revealed ~35 significantly improved variants. The best achieved a fourfold activity increase, with mutations scattered across the protein, a long-range landscape characteristic of natural evolution. Some substitutions also inverted enantioselectivity, from +49% to –35% e.e., demonstrating a structured pocket that can be manipulated.
DISCO doesn't just design enzymes; it designs starting points for evolution.
The four reactions explored here are a small sample from a vast universe of synthetically valuable transformations not found in nature. DISCO addresses a critical bottleneck: generating diverse, functional, evolvable enzymes from scratch, conditioned solely on the target chemistry — bypassing transition-state calculations, theozyme scaffolding[4], and large-scale screening.
The chemistry nature never explored is now within reach.