
Building foundation models for scientific interpretation.

Transforming scientific data into actionable discovery. Starting with mass spectrometry — the language of molecules.

01 Overview

What We Do

Aethron Labs is an independent research lab focused on developing large-scale machine learning systems for interpreting complex scientific data. Our work is centered on building foundational capabilities rather than narrow tools or application-specific models.

02 NexaMol

Progress Log

  • Dataset size: 579 GiB
  • Sustained throughput: ~18K/s
  • Pipeline complete: 100%
  • Corruption events: 0
Foundation & Data
  • Acquired large-scale MS/MS dataset (~579 GiB)
  • Built Rust + Python preprocessing pipeline
  • Achieved ~18K spectra/sec sustained on commodity CPUs
  • Processed 100% into verified, versioned shards (GeMS v1)
  • Strict train/test/validation splits — no leakage
  • Arrow-based training shards + HDF5 workflows
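
The "no leakage" requirement in the list above can be sketched as a deterministic, hash-based split assignment. This is a minimal illustration, not the pipeline's actual code: the grouping key (for example a compound identifier, so that every spectrum of one compound lands in the same split), the salt string, and the exact fractions are all assumptions.

```python
import hashlib

# Hypothetical split fractions, chosen to mirror the roughly 80/10/10 layout.
SPLITS = (("train", 0.80), ("test", 0.10), ("validation", 0.10))


def assign_split(group_key: str, salt: str = "gems-v1") -> str:
    """Map a grouping key to a split name, deterministically.

    Every record that shares `group_key` gets the same split, so related
    spectra cannot straddle train and test.
    """
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform fraction in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, fraction in SPLITS:
        cumulative += fraction
        if u < cumulative:
            return name
    return SPLITS[-1][0]  # guard against floating-point rounding
```

Because the assignment depends only on the key and the salt, re-running the pipeline reproduces the same splits without storing a lookup table.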
Model Architecture
  • Foundation model roadmap: V1-V3 (3B→5B→7B params)
  • Unified heterogeneous instrument data (ToF, quad)
  • Designed evaluation: neighbor-retrieval accuracy
  • Instrument-agnostic representation strategy
  • Cloud-native data flow: HuggingFace → VM → storage
  • Jupyter notebook demo: embeddings + nearest-neighbor
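
The page does not define neighbor-retrieval accuracy, so the following is one common reading, offered as a sketch: for each query embedding, find the k most similar reference embeddings by cosine similarity and count a hit if any of them carries the query's label. The function names and the use of plain Python lists are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_accuracy_at_k(query_vecs, query_labels, ref_vecs, ref_labels, k=1):
    """Fraction of queries whose top-k nearest references include their label."""
    if not query_vecs:
        raise ValueError("at least one query is required")
    hits = 0
    for query, label in zip(query_vecs, query_labels):
        # Rank reference indices from most to least similar to the query.
        ranked = sorted(
            range(len(ref_vecs)),
            key=lambda i: cosine(query, ref_vecs[i]),
            reverse=True,
        )
        if any(ref_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_vecs)
```

A brute-force ranking like this is fine for a notebook-scale demo; at corpus scale an approximate nearest-neighbor index would likely replace the inner sort.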
Commercial Execution
  • Identified CROs as primary commercial segment
  • Drafted GTM messaging and LOI templates
  • Conducted direct CRO outreach
  • Applied to BoostVC, Convergent, Artizen, Founders Inc
  • Wrote technical docs and 1-2 page proposals
  • Capital-efficient ignition → pre-seed execution plan
02.5 Milestone

GeMS v1 — Dataset Complete

MILESTONE ACHIEVED
GeMS v1 (General Mass Spectrometry Dataset v1) — ML-ready training corpus assembled. Week of March 10, 2026.
  • Total dataset size: 579.6 GiB
  • Total shards: 338
  • Spectra throughput: ~18K/s
  • Train: 270 shards (80% of corpus)
  • Test: 33 shards (10% of corpus)
  • Validation: 35 shards (10% of corpus)
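
As a quick check on the shard counts above, the per-split shares can be computed directly. The shares here are by shard count only, which assumes shards are of roughly equal size; the page does not say whether its percentages are by shard, by bytes, or by spectra.

```python
def split_shares(shard_counts):
    """Return each split's share of the total shard count."""
    total = sum(shard_counts.values())
    return {name: count / total for name, count in shard_counts.items()}


# Shard counts as reported for GeMS v1.
GEMS_V1_SHARDS = {"train": 270, "test": 33, "validation": 35}

for name, share in split_shares(GEMS_V1_SHARDS).items():
    print(f"{name}: {share:.1%}")
# train: 79.9%
# test: 9.8%
# validation: 10.4%
```

The counts sum to 338 and round to the stated 80/10/10 split.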
Figure: GeMS v1 shard distribution visualization (status: COMPLETE)
Pipeline Architecture
  • Rust + Python hybrid preprocessing
  • Arrow-format training shards
  • Strict no-leakage train/test/val splits
  • Versioned, reproducible shard generation
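
One way to make shard generation versioned and checkable is a manifest that records a SHA-256 digest per shard. The sketch below is an assumption about how that could look, not the project's actual format: the `*.arrow` file pattern, the manifest layout, and the function names are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards never sit fully in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def build_manifest(shard_dir, version, pattern="*.arrow"):
    """Record name, size, and digest for every shard, in sorted order."""
    shards = sorted(Path(shard_dir).glob(pattern))
    return {
        "version": version,
        "shards": [
            {"name": p.name, "bytes": p.stat().st_size, "sha256": sha256_of_file(p)}
            for p in shards
        ],
    }


def verify_manifest(shard_dir, manifest):
    """Return the names of shards that are missing or whose digest differs."""
    bad = []
    for entry in manifest["shards"]:
        path = Path(shard_dir) / entry["name"]
        if not path.is_file() or sha256_of_file(path) != entry["sha256"]:
            bad.append(entry["name"])
    return bad
```

An empty list from `verify_manifest` means every listed shard is present and matches its recorded digest.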
Data Quality
  • ~18K spectra/sec processing throughput
  • Instrument-agnostic normalization
  • HuggingFace → VM → storage pipeline
  • Verified checksums on all 338 shards
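
The page does not say what "instrument-agnostic normalization" involves. A common preprocessing step that removes instrument-dependent intensity scale is base-peak normalization, sketched here under that assumption; the function name and the 1% relative-intensity cutoff are illustrative choices, not values from the source.

```python
def normalize_spectrum(mz, intensity, min_relative_intensity=0.01):
    """Scale intensities so the base peak is 1.0, drop weak peaks, sort by m/z.

    Returns two parallel lists: m/z values and relative intensities.
    """
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")
    # Discard non-positive intensities before looking for the base peak.
    peaks = [(m, i) for m, i in zip(mz, intensity) if i > 0]
    if not peaks:
        return [], []
    base = max(i for _, i in peaks)
    kept = sorted(
        (m, i / base) for m, i in peaks if i / base >= min_relative_intensity
    )
    return [m for m, _ in kept], [i for _, i in kept]
```

After this step, spectra from detectors with very different absolute count ranges share a common 0 to 1 intensity scale.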
03 Context

The Problem

Across the life sciences and molecular research, data generation has dramatically outpaced our ability to interpret it. Core analytical technologies produce enormous volumes of rich, high-dimensional measurements, yet downstream understanding still depends on fragile heuristics, limited reference data, and manual analysis.

This gap constrains discovery, slows research, and limits what can be reliably inferred from experimental data.

04 Technical

Our Approach

We believe this is fundamentally a representation problem. Aethron Labs is building foundation models that learn directly from raw scientific data, capturing underlying structure in a way that generalizes across instruments, conditions, and experimental settings.

The goal is not to replace existing workflows, but to create a new computational substrate that makes scientific interpretation more scalable, reliable, and extensible.

05 Business

Market Opportunity

Pre-commercial stage — projections based on preliminary market research and industry analysis
Top-down Context
  • $200B: global pharma R&D annually
  • $90B: global CRO market annually
  • $50B+: addressable analytical services
Bottom-up Entry Wedge
  • 30-50%: reduction in manual interpretation time
  • 10-100+: instruments per large CRO
  • 1M+: spectra analyzed annually

Mid-to-large CROs typically operate 10s–100s of LC-MS/MS instruments processing millions of spectra per year, with teams of analysts whose time is the primary cost driver. This spend is recurring, operational, and directly tied to throughput and turnaround time.

Initial commercialization targets enterprise API licensing priced against analyst time and throughput. The target is roughly 200–500 CROs and pharma analytical groups globally, with the top 10–50 CROs by analytical volume as the likely early adopters. Initial contracts are plausibly in the $100K–$1M ARR range per customer, which supports a serviceable obtainable market of roughly $50–200M before broader expansion.
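
The page does not show how the $50–200M figure was derived. As a rough sanity check, multiplying the stated customer counts by the stated contract sizes gives the outer bounds below; treating the two ranges as independent is an assumption, and the function name is illustrative.

```python
def arr_bounds(customers, arr_per_customer):
    """Lowest and highest total ARR for (min, max) customers and contract sizes."""
    low = customers[0] * arr_per_customer[0]
    high = customers[1] * arr_per_customer[1]
    return low, high


CONTRACT_ARR = (100_000, 1_000_000)  # $100K to $1M per customer, as stated

# Full target segment: roughly 200 to 500 organizations.
low, high = arr_bounds((200, 500), CONTRACT_ARR)
print(f"segment: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
# segment: $20M to $500M

# Early adopters: top 10 to 50 CROs by analytical volume.
low, high = arr_bounds((10, 50), CONTRACT_ARR)
print(f"early adopters: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
# early adopters: $1M to $50M
```

The $50–200M estimate sits inside the $20M to $500M outer bounds, consistent with partial penetration of the segment at mid-range contract sizes.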

06 Strategy

Go-To-Market

Simple and Credible

The initial GTM is intentionally narrow and execution-driven. Aethron Labs targets CROs first — the organizations that feel the MS/MS bottleneck most acutely.

Turnaround time, analyst throughput, and defensibility of results directly determine their margins and competitiveness. The goal is not rapid scaling at first, but credible proof that this infrastructure works in real workflows.

  • 01 Direct CRO Outreach: identify high-pain workflows such as metabolite ID, impurity analysis, and dereplication.
  • 02 Scoped Pilots: run alongside existing tools, measured on time saved, coverage, and analyst effort.
  • 03 API Integration: embed into existing pipelines, with no UI disruption and no workflow replacement.
  • 04 Validated Conversion: convert pilots into paid API access or enterprise licensing.
07 Roadmap

What This Becomes

What begins as programmatic molecular search for LC-MS/MS expands as models and representations mature:

Phase 1: Molecular Interpretation Infrastructure
  • LC-MS/MS annotation and search
  • Metabolomics and impurity identification
  • Direct integration into CRO and pharma pipelines
Phase 2: Embedded Discovery Infrastructure
  • Drug discovery, DMPK, metabolomics, materials research
  • Standard interpretation layer — not a standalone tool
  • Cross-instrument generalization
Phase 3: Scientific Foundation Infrastructure
  • Reusable substrate for molecular and materials science
  • TAM expands to multiple tens of billions of dollars
  • Core scientific computing infrastructure
08 Team

Founder Profile

Allan
Founder
5 yrs
ML & Scientific Computing
3 yrs — Open-Source Research
  • Molecular science
  • Biomaterials
  • Quantum systems
  • Computational fluid dynamics
2 yrs — Industry
  • Startups
  • Large-scale production systems
  • Infrastructure engineering

This background spans the full stack required for this problem: scientific domain understanding, large-scale ML systems, and production engineering realities. Aethron Labs is structured to reflect this combination from day one.

09 Vision

Motivation

This effort is motivated by a rare convergence:

  • Scientific fields are generating orders of magnitude more data
  • Interpretation, not collection, remains the bottleneck
  • Modern ML can now operate at the scale and complexity required

The opportunity is not incremental optimization. It is to define a new category of scientific infrastructure that sits between raw experimental data and downstream discovery.

By starting with a concrete, economically grounded use case (CRO workflows) and expanding deliberately, Aethron Labs aims to accelerate scientific discovery, improve reproducibility, and create durable infrastructure with impact beyond a single domain.

This is a long-term bet on advancing science as a system, not just improving a workflow.

10 Connect

Get in Touch

If you work in scientific research, analytical chemistry, pharma, or scientific machine learning, and are interested in exchanging perspectives — I welcome the conversation.