Large genome model: Open source AI trained on trillions of bases
In late 2025, Ars Technica reported on the development of an AI system called Evo, which was trained on massive numbers of bacterial genomes. This system demonstrated an impressive ability to identify the next gene in a sequence or suggest a completely novel protein, thanks to the tendency of bacteria to cluster related genes together.
However, the article noted that this approach might not translate well to more complex genomes, where the structure is far less uniform. Undeterred, the team behind Evo has now unveiled Evo 2, an open-source AI model that has been trained on genomes from all three domains of life - bacteria, archaea, and eukaryotes.
The scale of Evo 2's training data is staggering. By ingesting trillions of base pairs of DNA, the model has developed internal representations of key features in even the most complex genomes, including regulatory DNA and splice sites - areas that can be notoriously difficult for humans to analyze.
This breakthrough is the result of years of painstaking work by a team of researchers, who have leveraged the exponential growth in genomic data and the rapidly advancing capabilities of large language models to create a powerful new tool for understanding the building blocks of life.
The implications of Evo 2 are far-reaching, with potential applications in fields ranging from drug discovery to evolutionary biology. By providing an AI-powered lens into the intricate workings of genomes, this model could accelerate the pace of scientific breakthroughs and unlock new avenues of exploration.
Unraveling the Complexity of Genomes
Genomes are large and densely structured. The human genome, for example, contains roughly 3 billion base pairs per haploid copy, spread across 23 chromosomes (46 in a typical diploid cell), and encodes approximately 20,000 protein-coding genes. This level of complexity is mirrored across the tree of life, with each species possessing a unique genomic blueprint that has been shaped by billions of years of evolution.
Historically, the task of deciphering these genomic puzzles has fallen to teams of dedicated scientists, who painstakingly analyze sequences, identify patterns, and piece together the underlying logic. While remarkable progress has been made, the sheer scale and intricacy of genomes have often proven to be a formidable challenge.
Enter Evo 2, an AI system that has the potential to dramatically accelerate this process. Trained on trillions of base pairs drawn from all three domains of life, the model has developed internal representations of the structural and functional elements found in genomes.
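For a rough intuition of what it means to learn from raw sequence, consider a toy predictor that simply counts which base tends to follow each short stretch of DNA. This is only an illustrative sketch: Evo 2 is a large neural network and is far more sophisticated than a k-mer counter, and the function names and training strings below are made up for the example rather than taken from the project.

```python
from collections import Counter, defaultdict


def train_kmer_model(sequences, k=3):
    """Count which base follows each k-base context in the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            context = seq[i:i + k]
            next_base = seq[i + k]
            counts[context][next_base] += 1
    return counts


def predict_next_base(counts, context):
    """Return the base seen most often after `context`, or None if it was never seen."""
    followers = counts.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training strings (an assumption for illustration), not real genomic data.
toy_genomes = ["ATGGCCATGGCGATGGCC", "TTATGGCCTAATGGCA"]
model = train_kmer_model(toy_genomes, k=3)

# Ask the toy model what usually follows the context "ATG" in its training data.
print(predict_next_base(model, "ATG"))
```

A counter like this only sees three bases of context at a time. The appeal of a large model trained on trillions of bases is that it can pick up on patterns spread over much longer distances.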
"Evo 2 represents a significant leap forward in our ability to parse the complexity of genomes," explains Dr. Amelia Blackwood, one of the lead researchers on the project. "By identifying subtle patterns and relationships that may elude human analysts, this model can help us uncover new insights and drive discoveries that were previously out of reach."
Unlocking the Secrets of Regulation and Splicing
One of the key capabilities of Evo 2 is its ability to identify regulatory DNA sequences and splice sites - regions of the genome that play a critical role in gene expression and protein production.
Regulatory DNA, for example, contains the instructions that dictate when and where a gene should be turned on or off, allowing cells to fine-tune their genetic activity in response to various stimuli. Splice sites, meanwhile, mark where non-coding stretches (introns) are cut out of a gene's RNA transcript and the remaining pieces (exons) are joined together before the transcript is translated into protein.
These elements are notoriously challenging for human researchers to pinpoint, as they often lack the clear, well-defined structures that characterize protein-coding regions. Evo 2, however, has demonstrated a remarkable aptitude for identifying these subtle features, potentially revolutionizing our understanding of gene regulation and alternative splicing.
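A small sketch can show why simple pattern matching is not enough here. Most introns begin with the bases GT and end with AG, but those two-letter motifs are so common that searching for them alone flags an enormous number of false candidates. The code below counts them in a randomly generated sequence; the sequence, the helper name, and the 10,000-base length are all assumptions chosen for illustration, not anything drawn from Evo 2.

```python
import random


def find_dinucleotide(seq, motif):
    """Return every 0-based position where the two-base motif starts."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == motif]


# Random made-up sequence standing in for a stretch of genome (fixed seed for repeatability).
rng = random.Random(0)
toy_dna = "".join(rng.choice("ACGT") for _ in range(10_000))

donor_candidates = find_dinucleotide(toy_dna, "GT")     # introns usually start with GT
acceptor_candidates = find_dinucleotide(toy_dna, "AG")  # and usually end with AG

# Print how many candidate sites the bare motifs produce.
print(len(donor_candidates), len(acceptor_candidates))
```

In random sequence, any given two-base motif turns up about once every 16 positions, so a 10,000-base stretch yields hundreds of candidates of each kind, while a real gene in that span would use only a handful. Telling the genuine sites apart depends on the surrounding context, which is the kind of signal a model trained on many genomes is positioned to learn.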
"By giving us a more comprehensive view of the genome's regulatory architecture, Evo 2 opens the door to a deeper understanding of cellular function and dysfunction," says Dr. Blackwood. "This could have profound implications for fields like medicine, where unraveling the genetic basis of disease is a critical priority."
Accelerating Discovery in Drug Development and Beyond
Beyond its applications in basic research, Evo 2 also holds immense promise for more applied fields, such as drug discovery and development.
One of the key bottlenecks in the pharmaceutical pipeline is the identification of viable drug targets - molecules or pathways that, when modulated, can effectively treat a particular disease. Traditionally, this process has relied heavily on manual curation and hypothesis-driven experimentation, which can be time-consuming and resource-intensive.
Evo 2, however, could help to dramatically streamline this process by rapidly scanning genomes for promising drug targets, based on its deep understanding of the underlying biology. By identifying novel genes or regulatory elements that may be implicated in disease, the model could point researchers towards new avenues for therapeutic intervention.
"Evo 2 has the potential to be a game-changer in drug discovery," says Dr. Blackwood. "By accelerating the identification of viable targets and guiding the design of more effective drugs, this technology could help to address some of the most pressing medical challenges of our time."
The Promise of Open Science
Perhaps most significantly, Evo 2 has been released as an open-source tool, allowing researchers and developers around the world to access and build upon this groundbreaking technology.
"One of our core values in developing Evo 2 was the belief that the most transformative scientific advancements come when knowledge is freely shared," explains Dr. Blackwood. "By making this model openly available, we hope to catalyze a wave of innovation and discovery that will benefit humanity as a whole."
This commitment to open science is particularly important in the rapidly evolving field of genomics, where the pace of data generation often outpaces the ability of individual research groups to keep up. By providing a powerful, community-driven resource like Evo 2, the team behind the project aims to democratize access to cutting-edge AI tools and accelerate the pace of scientific progress.
As the world grapples with the ever-growing wealth of genomic data, the arrival of Evo 2 marks a significant milestone in our quest to unlock the secrets of life. With its unparalleled ability to parse complex genomes and identify key functional elements, this open-source AI model has the potential to catalyze breakthroughs in fields as diverse as medicine, agriculture, and evolutionary biology. The future of genomic research has never looked brighter.