The Hidden Flaws in AI Protein Data

January 1, 2026

Why True Progress Requires Experimental Foundations and How Neomera is Built on Biological Reality

The field of AI-driven protein design is moving at breakneck speed. Models leveraging diffusion, flow matching, and multimodal architectures promise to revolutionize biology by generating intricate structures and exploring vast protein landscapes. These tools appear poised to transform drug discovery by enabling the rapid design of novel therapeutics.

However, a growing body of evidence reveals a fundamental flaw: the data underpinning these models is often unreliable. This leads to outputs that look impressive in a digital environment but fail in real-world biological applications.


The Problem of Data Misalignment in the AFDB

A recent NVIDIA paper, "Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design," exposes this issue. The study highlights how the AlphaFold Protein Structure Database (AFDB)—a cornerstone for training AI protein models—contains structures that are predicted rather than experimentally determined.

While the AFDB offers immense scale, its reliance on AlphaFold2 inferences means its structures do not always reflect how sequences naturally fold in reality. To validate this, the NVIDIA team conducted a revealing experiment. They took sequences from the AFDB and attempted to "refold" them using structure prediction models like ESMFold and AlphaFold2.

The results were alarming:

  • Only 26.6% of protein backbones could be reconstructed within a 2 Å RMSD threshold.
  • A mere 19.1% matched at the all-atom level.
  • Even using the strongest Multiple Sequence Alignments (MSAs) via ColabFold, recovery rates reached only ~65%.

As shown in the paper’s analysis, most AFDB structures are not recoverable from their own sequences. This suggests AI models are training on a "fake" protein universe, inheriting the biases and "hallucinations" of the models that created the data.


Consequences for Drug Discovery

In drug discovery, precision is non-negotiable. Relying on flawed data can be disastrous for several reasons.

  • Discovery Challenges Models trained on misaligned data generate "phantom" proteins. An inhibitor designed for a predicted protein pocket may fail because that pocket simply doesn't exist in nature.
  • Efficiency Bottlenecks When AI predictions are unreliable, lab teams waste months on proteins that cannot fold or remain stable. This forces researchers into tedious "post-hoc" redesign cycles, extending timelines significantly.
  • Escalating Costs Every failed candidate incurs massive expenses in computational power and wet-lab resources. In an industry where bringing a drug to market costs billions, basing decisions on shaky data leads to astronomical budget waste.

Neomera Grounded in Experimental Reality

At Neomera, we are charting a different path. We believe that true innovation in protein design requires moving beyond prediction-heavy databases. Instead of relying on computational inferences that may or may not exist in nature, our approach is built on a bedrock of physical truth.

Our proprietary database is generated 100% internally through wet-lab experimental validation. Every entry is a verified biological fact, ensuring that our AI models learn the actual laws of protein folding rather than the biases of a previous algorithm.

  • Enhanced Discovery By using experimentally confirmed structures, we identify accurate drug targets and minimize the risk of pursuing "phantom" proteins. We start with what is real, which makes the transition to clinical application much more predictable.
  • Streamlined Efficiency Our pipelines drastically reduce false positives. Because our training data is high-fidelity, our AI produces candidates with a much higher likelihood of success, allowing teams to focus on high-confidence leads from day one.
  • Unmatched Trust Wet-lab data remains the gold standard for therapeutic development. This translates to safer, more effective candidates that withstand the rigors of clinical trials and regulatory scrutiny.

It is of the utmost importance to remember that the use of artificial intelligence in drug discovery serves only one purpose: to accelerate the delivery of better treatments and help patients with the most effective therapeutics possible. The industry must not derail from this mission.

While building larger or more complex models is technically impressive, it is a means to an end, not the end itself. If our data is flawed, we drift further from the patient. By prioritizing experimental reality, Neomera ensures that our AI tools remain aligned with actual biology, fulfilling the true promise of technology in service of human health.


About Neomera BioLab

Neomera BioLab, built on over a decade of pioneering research by a Stanford scientist, fuses cutting-edge wet-lab experimental biology with proprietary AI algorithms to create massive, exclusive datasets that are 100% internally generated and validated through real-world experiments—free from reliance on external predictive databases.

These experimentally grounded datasets shatter barriers around historically undruggable targets, accelerating drug discovery by up to 80% and slashing drug discovery costs by up to 90%. At its heart, the company's groundbreaking first therapy aims to deliver a non-addictive solution for more than a billion chronic pain sufferers worldwide, tackling selectivity challenges in pain-signaling sodium channels that have long eluded the industry.