Introducing the CodonFM Open Model for RNA Design and Evaluation



Open research is critical for driving innovation, and many breakthroughs in AI and science are achieved through open collaboration. In the field of digital biology research, NVIDIA Clara supports this open collaboration.

Clara is an open source family of models, tools, and recipes for biology, chemistry, and human health. It includes models for use cases such as small-molecule generative design, synthetic pathway prediction, ADMET property prediction, and protein structure-sequence co-design. To learn more about NVIDIA Clara models and tools for biology and chemistry, visit the NVIDIA-Digital-Bio GitHub repo.

This post introduces CodonFM, a new addition to the Clara open model family. CodonFM is a language model for biology focused on RNA. We describe how the model was designed and how it can be used for tasks such as variant effect prediction and mRNA design.

CodonFM: An open foundation model for RNA

Today, NVIDIA is announcing CodonFM, a new state-of-the-art RNA foundation model joining the Clara open model family. CodonFM processes RNA by reading it in codons, each comprising three nucleotides. This approach treats RNA triplets like words in a sentence rather than independent nucleotide letters. By analyzing RNA sequences in their natural syntax, the model learns the complex "grammar" of the genetic code. The result is a model that understands the complex, context-dependent patterns of codon usage bias across organisms.

Some of the most common language models for biology are protein language models, which independently model each amino acid residue in a protein sequence. These models overlook that the same amino acid can be encoded by different codons (synonymous variants) and that, during cellular protein synthesis, different synonymous codon variants result in different amounts of protein being produced.

By accounting for synonymous variants, CodonFM understands how these different RNA sequences that all encode the same amino acids can impact biological function. This allows it to predict properties such as mRNA stability, translation efficiency, and protein yield. It also enhances the performance of language models in predicting disease risk related to genetic mutations.

CodonFM is built on a BERT-style bidirectional encoder architecture, enabling the model to attend to the entire input RNA sequence. With a large context window of 2,046 codon tokens (6,138 ribonucleotides), the model identifies complex, long-range sequence patterns that have been refined over billions of years of evolution.
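To make the codon-as-word framing concrete, the short Python sketch below splits an in-frame coding sequence into codon tokens and checks it against the 2,046-codon context limit. It is purely illustrative and does not use the CodonFM tokenizer; the function name to_codons and the constant MAX_CODONS are invented for this example.

# Minimal illustration (not the CodonFM tokenizer): split an in-frame CDS into
# codon "words" and check it fits the model's 2,046-codon context window.
MAX_CODONS = 2046  # CodonFM context length in codon tokens (6,138 nucleotides)

def to_codons(cds: str) -> list[str]:
    """Split an in-frame coding sequence into 3-nucleotide codon tokens."""
    cds = cds.strip().upper()
    assert len(cds) % 3 == 0, "CDS length must be divisible by 3 (in-frame)"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

codons = to_codons("ATGGCTGCAAAAGAATTTTAA")
print(codons)                      # ['ATG', 'GCT', 'GCA', 'AAA', 'GAA', 'TTT', 'TAA']
print(len(codons) <= MAX_CODONS)   # True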

To learn this biological language, CodonFM was trained on a curated set of 131 million protein-coding sequences from 22,000 species, encompassing hundreds of billions of codon tokens drawn from the National Institutes of Health National Center for Biotechnology Information (NCBI) RefSeq database.

CodonFM is available in a range of model sizes (80M, 600M, and 1B parameters) and two pretraining methods. As the models increase in scale, they more accurately distinguish between synonymous codons that encode the same amino acid. This reduction in codon confusion (the frequency with which the model mispredicts one synonymous codon for another) reflects a deeper understanding of codon usage patterns and translation-relevant sequence context (Figure 1).

A heat map comparing how consistently Encodon models of different sizes distinguish synonymous codons for each amino acid.
Figure 1. Synonymous codon confusion analysis across the Encodon model scales, where a darker green color indicates a lower confusion score and a better understanding of codon usage patterns

Moreover, each pretraining method offers unique benefits for the resulting models (a simplified contrast of the two masking schemes is sketched after the list):

  • Random codon masking: Randomly masks codons within a sequence, regardless of how often they appear in the pretraining corpus. This trains CodonFM to predict missing codons from their surrounding context, helping the model learn the underlying grammar of the genetic code across a wide variety of coding regions.
  • Codon-weighted masking: Builds on random masking by selectively masking codons according to their usage bias, focusing on rare codon usage in certain sub-contexts of sequences. This enables the model to better capture patterns related to species-specific functional codon selection, rather than treating all codons equally.
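The sketch below contrasts the two masking schemes in simplified form: random masking samples codon positions uniformly, while codon-weighted masking biases sampling toward codons that are rare in a corpus. It is a conceptual illustration only, not the CodonFM training code; the function names and the toy corpus counts are invented.

# Illustrative contrast of the two masking schemes (not the CodonFM training code)
import random
from collections import Counter

rng = random.Random(0)  # fixed seed for reproducibility

def random_mask_positions(codons, mask_frac=0.15):
    """Pick codon positions to mask uniformly at random."""
    k = max(1, int(len(codons) * mask_frac))
    return sorted(rng.sample(range(len(codons)), k))

def weighted_mask_positions(codons, corpus_counts, mask_frac=0.15):
    """Pick positions with probability inversely proportional to corpus frequency."""
    weights = [1.0 / corpus_counts.get(c, 1) for c in codons]
    k = max(1, int(len(codons) * mask_frac))
    picked = set()
    while len(picked) < k:
        picked.add(rng.choices(range(len(codons)), weights=weights, k=1)[0])
    return sorted(picked)

codons = ["ATG", "CTG", "CTG", "CTA", "GCC", "GCG", "AAA", "TAA"]
corpus_counts = Counter({"CTG": 900, "CTA": 40, "GCC": 700, "GCG": 120,
                         "ATG": 500, "AAA": 600, "TAA": 300})
print("random:  ", random_mask_positions(codons))
print("weighted:", weighted_mask_positions(codons, corpus_counts))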

As the benchmarks in this post show, CodonFM demonstrates clear scaling laws. As model and dataset size increase, model accuracy improves across use cases such as synonymous and missense variant classification, mRNA translation efficiency, and protein abundance prediction.

Using CodonFM across biological tasks

CodonFM demonstrates broad applicability across both zero-shot and fine-tuned settings, enabling diverse molecular and clinical use cases. This section highlights CodonFM performance across various life sciences applications and provides code snippets for applying the model to each task.

Mutation effect size prediction

CodonFM models the coding sequence itself—capturing codon context, redundancy, and regulatory patterns—without explicitly relying on protein structure. This enables the fine-tuned 1B-parameter Encodon model, collaboratively developed by NVIDIA and Arc Institute, to achieve robust performance in detecting pathogenic missense mutations. It demonstrates high accuracy in distinguishing disease-associated amino acid substitutions from benign variants.

As Encodon models scale from 80M to 1B parameters and undergo fine-tuning, their ability to distinguish pathogenic from control variants (Mann-Whitney U test, two-sided) significantly improves, reflecting greater biological sensitivity and generalization.
Figure 2. Classification of de novo missense variants in Deciphering Developmental Disorders (DDD) case versus control cohorts

More importantly, CodonFM extends this capability to the much harder problem of interpreting synonymous variants. Synonymous mutations leave the protein sequence unchanged and have historically eluded prediction models. Encodon detects subtle shifts in codon usage and translation-level effects. It achieves best-in-class discrimination of pathogenic versus benign synonymous variants in ClinVar, demonstrating its unique ability to interpret even silent mutations.

Larger Encodon models (80M→1B) achieve stronger statistical separation (Mann-Whitney U test, two-sided) between pathogenic and matched benign synonymous variants than codon-level baselines, demonstrating superior biological resolution and generalization.
Figure 3. Classification of synonymous variants in ClinVar pathogenic versus benign datasets

The following code snippet demonstrates how to perform mutation scoring tasks using the pretrained Encodon models:

# Task: score the effect of a single synonymous/missense mutation at a given codon
# Output: log-likelihoods for ref/alt codons + LLR per variant
# More details can be found in the CodonFM source code:
# src/data/preprocess/mutation_pred.py and src/data/mutation_dataset.py

# 1) Configure model checkpoint
CKPT_PATH = "/path/to/NV-CodonFM-Encodon-1B-v1.ckpt"  # change to your .ckpt
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

enc = EncodonInference(model_path=CKPT_PATH, task_type=TaskTypes.MUTATION_PREDICTION, ...  
) # <- routes to predict_mutation
enc.configure_model()  # loads Tokenizer + EncodonPL + weights
enc.to(DEVICE)

# 2) Prepare a mutation example and construct a model batch
# Example CDS (coding DNA sequence). Length must be divisible by 3 and in-frame.
cds = ("ATGCCGGCGGTCAAGAAGGAGTTCCCGGGCCGCGAGGACCTGGCCCTGGCTCTGGCCACGTTCCACCCGACC") # <--- replace with your full coding sequence (no introns, 5'->3')

# Select a 0-based CODON index (not nucleotide index). 
codon_idx = 10  # codon 10 in the CDS

# Define the ref codon present at that position and the alternate codon to test
ref_codon = "CGC"
alt_codon = "CGA"  # e.g., a synonymous change

# --- Tokenize the full CDS into codon tokens using the tokenizer ---

tok = enc.tokenizer
context_length = 2048
cds = cds[:(context_length - 2) * 3]  # truncate the sequence to the context length (reserving 2 positions for special tokens)
cds = '' + cds + ''  # wrap the CDS with the tokenizer's special start and end token strings
# Encode the full sequence to input IDs. The tokenizer works at codon resolution.

input_ids = np.array(tok.encode(cds), dtype=np.int32)  
attention_mask = np.ones_like(input_ids, dtype=bool)

# Pad input_ids and attention_mask up to the context length
input_ids = np.pad(input_ids, (0, context_length - len(input_ids)), ...)
attention_mask = np.pad(attention_mask, (0, context_length - len(attention_mask)), ...)

mutation_token_idx = codon_idx + 1  # position in the tokenized sequence is shifted by 1 by the prepended special start token

Once the input_ids and attention mask are set, get the ref and alt token IDs, mask the input, and then build a batch. After the batch is made, run inference using the predict_mutation function.

ref_tok = tok.convert_tokens_to_ids(ref_codon)
alt_tok = tok.convert_tokens_to_ids(alt_codon)

input_ids[mutation_token_idx] = tok.mask_token_id # replace the mutated codon with mask token

batch = {
    MetadataFields.INPUT_IDS: torch.tensor(input_ids, dtype=torch.int32, device=DEVICE).unsqueeze(0),
    MetadataFields.ATTENTION_MASK: torch.tensor(attention_mask, dtype=torch.bool, device=DEVICE).unsqueeze(0),
......
}

# 3) Run inference and interpret LLRs
out = enc.predict_mutation(batch, ids=["example_variant"])

print("IDs:                ", out.ids[0])
print("log P(ref codon):   ", out.ref_likelihoods[0])
print("log P(alt codon):   ", out.alt_likelihoods[0])
print("LLR (ref - alt):    ", out.likelihood_ratios[0])

Under the hood, this script uses the Encodon masked language model to evaluate how a mutation alters codon probability within its sequence context. It masks the target codon, predicts likelihoods for the reference and alternate versions, and computes their log-likelihood ratio (LLR). A positive LLR indicates the mutation is less natural or potentially disruptive, while a negative LLR suggests it is tolerated or contextually favored.
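For readers who want the scoring step spelled out, the hedged sketch below shows how such an LLR can be computed from the logits of a generic masked language model at the masked codon position. It assumes a hypothetical logits tensor of shape [batch, seq_len, vocab_size]; the actual CodonFM logic lives in predict_mutation and the files referenced in the snippet above.

# Sketch of the LLR idea with a generic masked LM (not the CodonFM code path)
import torch
import torch.nn.functional as F

def llr_from_logits(logits, mutation_token_idx, ref_token_id, alt_token_id):
    """Compute log P(ref) - log P(alt) at a masked codon position.

    logits: [batch, seq_len, vocab_size] output of a masked LM for a batch
            where the codon at mutation_token_idx was replaced by the mask token.
    """
    log_probs = F.log_softmax(logits[0, mutation_token_idx], dim=-1)
    ref_ll = log_probs[ref_token_id].item()
    alt_ll = log_probs[alt_token_id].item()
    return ref_ll, alt_ll, ref_ll - alt_ll  # LLR > 0: alt codon less likely than ref

# Toy usage with random logits just to show the shapes involved
logits = torch.randn(1, 2048, 70)
print(llr_from_logits(logits, mutation_token_idx=11, ref_token_id=12, alt_token_id=13))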

mRNA therapeutic design

mRNA design is rapidly emerging as a major modality in modern therapeutics, enabling gene replacement, protein restoration, and the development of programmable biologics. A key challenge in this area is sequence optimization—even small peptides or proteins can be encoded by a vast number of synonymous mRNA sequences, each influencing expression, stability, and immunogenicity in different ways.

Subtle decisions in codon usage and sequence context can significantly affect translational outcomes. CodonFM delivers a best-in-class predictive framework for these applications, achieving state-of-the-art performance across diverse mRNA stability and expression benchmarks. This includes zero-shot prediction of protein abundance and translation efficiency, and provides a foundation for optimized mRNA design.

Each bar represents the cross-validated R² for translation efficiency regression across codon-aware models. The Encodon series (80M–1B parameters) shows progressively higher explanatory power compared to the SOTA Codon Model baseline, reflecting improved capture of codon co-occurrence patterns and translational context.
Figure 4. Modeling codon-level translation efficiency
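One way to use such a predictor for design is to enumerate synonymous variants of a coding sequence and rank them with a scoring function. The sketch below illustrates that workflow with a toy scorer standing in for a model-derived likelihood or translation-efficiency predictor; the SYNONYMS table covers only two amino acid families and the helper names are invented for this example, so it is not a CodonFM API.

# Illustrative sketch (assumed workflow, not a CodonFM API): enumerate synonymous
# codon substitutions at one position of a CDS, then rank candidates with any
# sequence scoring function (for example, a model-derived likelihood).
from typing import Callable

# Partial standard genetic code, enough for this example (Ala and Arg families).
SYNONYMS = {
    "GCT": ["GCT", "GCC", "GCA", "GCG"],               # Alanine
    "CGC": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"], # Arginine
}

def synonymous_variants(codons: list[str], idx: int) -> list[list[str]]:
    """Return all sequences that swap codon `idx` for a synonymous codon."""
    return [codons[:idx] + [alt] + codons[idx + 1:]
            for alt in SYNONYMS.get(codons[idx], [codons[idx]])]

def rank_designs(codons, idx, score_fn: Callable[[list[str]], float]):
    """Score each synonymous variant and return them best-first."""
    return sorted(synonymous_variants(codons, idx), key=score_fn, reverse=True)

# Toy scoring function standing in for a model-derived likelihood or TE predictor.
toy_score = lambda seq: sum(c.endswith(("C", "G")) for c in seq)

cds_codons = ["ATG", "GCT", "CGC", "TAA"]
for candidate in rank_designs(cds_codons, idx=1, score_fn=toy_score):
    print(candidate, toy_score(candidate))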

Fine-tuning with CodonFM

The CodonFM repository includes implementations of multiple fine-tuning strategies that enable users to fine-tune the pretrained model for their use case. These strategies include the following (a minimal sketch of the head-only pattern follows the list):

  • Low-Rank Adaptation (LoRA): Fine-tunes low-rank adapters added to each transformer layer of the pretrained model to reduce training cost and memory usage.
  • Head-Only Random: Trains a randomly initialized output head while the rest of the model is kept frozen.
  • Head-Only Pretrained: Trains a pretrained output head while the rest of the model is kept frozen.
  • Full: Fine-tunes all parameters of the model end-to-end.
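As a rough illustration of the head-only pattern, the PyTorch sketch below freezes a pretrained encoder and trains only a small task head, for example a regressor for translation efficiency. It is an assumed, generic setup rather than the CodonFM fine-tuning recipes; the HeadOnlyModel class and the toy Transformer encoder are stand-ins.

# Generic PyTorch sketch of head-only fine-tuning (assumed setup, not the CodonFM recipes)
import torch
import torch.nn as nn

class HeadOnlyModel(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():   # keep the backbone frozen
            p.requires_grad = False
        self.head = nn.Linear(hidden_dim, 1)  # randomly initialized output head

    def forward(self, x):
        with torch.no_grad():
            h = self.encoder(x)               # [batch, seq_len, hidden_dim]
        return self.head(h.mean(dim=1))       # pool codon embeddings, predict a scalar

# Only the head's parameters are passed to the optimizer.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
model = HeadOnlyModel(encoder, hidden_dim=64)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")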

Toward programmable biology

Just as language models have learned to reason and protein models to fold, CodonFM learns the principles that connect RNA codon usage to RNA behavior and protein expression. It transforms RNA from a passive carrier of genetic information into a programmable language, one that can be interpreted, optimized, and redesigned.

This capability to read and write the language of life is a cornerstone of the NVIDIA Virtual Cell initiative. Releasing open, powerful, and scalable models like CodonFM enables researchers and developers to build AI systems that not only understand biology but also actively shape it.

We invite you to join our collaborators from Arc Institute, Therna Biosciences, Greenstone Biosciences, Moonwalk Biosciences, and the Stanford RNA Medicine Program to explore and test CodonFM as part of this shared effort to advance biological intelligence.

Get started with CodonFM

CodonFM was trained using the same core infrastructure that powers other Clara open models, leveraging GPU-native acceleration through the NVIDIA cuDNN and NVIDIA cuBLAS libraries for optimized matrix operations during genomic tokenization. Input datasets were converted to memory-mapped files to enable fast, efficient data streaming, while NVIDIA NeMo Run served as the central training configuration and orchestration framework.

Optionally, Transformer Engine through NVIDIA BioNeMo Framework recipes can be used to speed up model training and fine-tuning by up to 3x with negligible accuracy loss, ensuring both scalability and computational efficiency.

Ready to get started with CodonFM?

Acknowledgments

We would like to acknowledge the following people for their support and contributions to this post: Sajad Darabi, Fan Cao, Mohsen Naghipourfar, Hani Goodarzi, Sara Rabhi, Yingfei Wang, William Greenleaf, Yang Zhang, Cory Ye, Jonathan Mitchell, Timur Rvachov, T.J. Chen, Daniel Burkhardt, and Neha Tadimeti.


