Ginkgo Datapoints Launches VCPI Initiative to Tackle AI Drug Discovery Data Challenges
For years, AI in drug discovery has been held back by a deceptively simple problem: poor-quality data. Vast quantities of sequencing data, pooled perturbation studies, and mixed-cell experiments created an illusion of progress, yet the predictive leap developers anticipated never arrived. The field generated noise instead of clarity and experimental drift instead of reproducibility. Datasets were optimized for scale rather than scientific integrity, and they lacked the precise, pharmacology-specific measurements needed to train reliable virtual cell models.
This is the context for Ginkgo Datapoints' launch of the Virtual Cell Pharmacology Initiative (VCPI). This project aims to deliver better data, not just more data—a resource purpose-built for AI models that predict how drug-like molecules affect real biological systems. As the official announcement states, VCPI will generate over 12 billion data points from profiling 100,000 compounds, establishing the first standardized pharmacology dataset designed for virtual cell modeling.
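For a sense of scale, the two headline figures can simply be divided out. The sketch below is back-of-envelope arithmetic only: the announcement does not say how the data points break down across genes, doses, replicates, or time points, and the constant and function names are illustrative.

```python
# Back-of-envelope arithmetic on the announced VCPI figures.
# Both numbers are targets quoted in the announcement; the
# per-compound breakdown is NOT specified there.
TOTAL_DATA_POINTS = 12_000_000_000  # "over 12 billion data points"
COMPOUNDS = 100_000                 # "100,000 compounds"


def points_per_compound(total_points: int, compounds: int) -> float:
    """Average number of data points attributable to each compound."""
    if compounds <= 0:
        raise ValueError("compounds must be positive")
    return total_points / compounds


if __name__ == "__main__":
    avg = points_per_compound(TOTAL_DATA_POINTS, COMPOUNDS)
    print(f"{avg:,.0f} data points per compound on average")
```

On the announced figures this works out to about 120,000 data points per compound, a lower bound given that the total is quoted as "over" 12 billion.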
Why “More Data” Failed
In introducing VCPI, Ginkgo uses a telling analogy: imagine throwing a handful of pills into a cage of mice, then trying to determine which mouse consumed which pill. Now scale that to a million mice in one giant cage. This illustrates the fundamental flaw in pooled single-cell pharmacology experiments. They produce massive datasets, but the experimental design obscures the clear link between a specific compound and its resulting biological effect.
The issue isn't a lack of technology, but a flawed experimental architecture. The belief that larger datasets automatically create better AI models has proven incorrect. Ginkgo's blog post labels this mindset a "data addiction," arguing that without well-structured, high-quality inputs, even the most advanced AI will learn incorrect patterns.
VCPI represents a decisive break from this approach. It prioritizes biological traceability, experimental rigor, and controlled structure—the elements AI truly needs to learn pharmacology—over sheer data volume.
How VCPI Rebuilds the Data Pipeline
Moving away from pooled assays, VCPI employs DRUG-seq, a high-throughput bulk RNA-sequencing method. Each compound is tested in an isolated, barcoded well, enabling treatment-specific response measurements with a far cleaner signal-to-noise ratio than pooled methods allow. According to the press release, Ginkgo's automated infrastructure can process over one hundred 384-well plates weekly, generating millions of high-fidelity RNA measurements at an industrial scale.
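The practical benefit of one compound per barcoded well is that attributing a measurement to a treatment becomes a plain lookup. The following is a minimal sketch of that idea, not Ginkgo's pipeline: the plate map, barcodes, compound IDs, and reads are all made-up toy values, and real DRUG-seq processing also involves alignment, UMI handling, and quality control that are omitted here.

```python
from collections import Counter, defaultdict

# Hypothetical plate map: well barcode -> treatment in that well.
# In an arrayed design each well holds exactly one treatment, so the
# lookup is unambiguous (unlike a pooled experiment).
PLATE_MAP = {
    "AAAC": "compound_A",
    "CCGT": "compound_B",
    "GGTA": "DMSO_control",
}


def demultiplex(reads, plate_map):
    """Group gene counts by treatment using each read's well barcode.

    `reads` is an iterable of (well_barcode, gene) pairs. Reads whose
    barcode is missing from the plate map are tallied as unassigned
    rather than guessed.
    """
    counts = defaultdict(Counter)
    unassigned = 0
    for barcode, gene in reads:
        treatment = plate_map.get(barcode)
        if treatment is None:
            unassigned += 1
            continue
        counts[treatment][gene] += 1
    return dict(counts), unassigned


def weekly_wells(plates_per_week=100, wells_per_plate=384):
    """Wells processed per week at the quoted plate throughput."""
    return plates_per_week * wells_per_plate


if __name__ == "__main__":
    toy_reads = [
        ("AAAC", "TP53"), ("AAAC", "TP53"), ("AAAC", "MYC"),
        ("CCGT", "MYC"), ("GGTA", "TP53"), ("TTTT", "MYC"),
    ]
    counts, unassigned = demultiplex(toy_reads, PLATE_MAP)
    for treatment, genes in counts.items():
        print(treatment, dict(genes))
    print("unassigned reads:", unassigned)
    print("wells per week:", weekly_wells())
```

At the quoted rate of one hundred 384-well plates per week, `weekly_wells()` gives 38,400 wells. How many of those wells carry controls or replicates rather than distinct compounds is not stated in the source.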
Equally critical is the introduction of V-Ref293, a newly engineered, standardized reference cell line. By providing a universal biological baseline (an "organic twin" to virtual cells), VCPI aims to remove the variability introduced when different labs work with mutated or genetically drifted versions of the same cell line. This addresses a major source of irreproducibility in pharmacogenomics and gives AI models a stable ground truth to train against.
The initiative is building a community-driven dataset with several key features:
- Open participation for researchers, pharmaceutical teams, and AI developers
- Free high-throughput RNA profiling for submitted compounds
- Options for contributors to embargo data or retain permanent proprietary access
- Monthly data releases guided by community voting
- Opportunities for model sharing, compound prioritization, and early-access "super-user" status
A Community-Built Model, Not a Data Dump
One of VCPI's most distinctive aspects is its launch prior to the dataset's completion. Rather than presenting a finished resource, Ginkgo is inviting the scientific community to help decide which compounds are most valuable and to collaborate in real time as the dataset expands.
This structure also reduces risk for participants. Early-stage biotechs can submit compounds and receive real pharmacology data without the high cost of dedicated screening. AI teams can help ensure the dataset includes the specific biological perturbations needed for model training. Academic labs can contribute while potentially retaining a 90-day exclusive data window.
This approach transforms data generation from a static product into a dynamic, participatory scientific process.
What This Means for the Future of Bio-AI
The implications of VCPI extend beyond Ginkgo or any single virtual cell project. For virtual cell models to gain scientific credibility, they must be trained on reproducible, treatment-specific data anchored to a stable biological reference. Without this foundation, AI will continue to hallucinate, mispredict, or overfit to experimental artifacts.
Initiatives like VCPI mark a shift in how the field views data. Experimental design is now recognized as being as important as model architecture. Reproducibility is reclaiming its place as a core requirement, not an optional ideal. Community-driven, open-infrastructure projects are increasingly seen as better positioned than closed proprietary datasets to accelerate innovation.
If virtual cells ever become reliable predictive tools—capable of ranking compounds, flagging toxicities, or illuminating biological pathways before wet-lab experiments begin—it will be because projects like VCPI created the structured, trustworthy data environment necessary for their development.
By prioritizing better data over simply more data, Ginkgo is reframing the foundations of AI-driven biology. VCPI doesn't just address the data crisis in drug discovery; it sets the stage for a new era where biological experiments and AI training pipelines co-evolve—openly and with clear purpose.