Realm
Technology · February 2026

GPU-Native Machine Learning.
From Scratch.

A complete ML training and inference stack built from scratch in C++. 9.6M parameters. 77K tokens per second. 93% VRAM utilization. 48 GB training corpus. No Python. No PyTorch. No vendor lock-in.

GPU-Native Machine Learning

Realm now ships a complete machine learning platform — training and inference, end to end — written entirely in C++ with GPU-native compute via the Oa library. No Python runtime. No PyTorch. No framework dependencies. A 9.6M parameter byte-level language model trained on a 48 GB multi-domain corpus, generating coherent English at 77K tokens per second on a single laptop GPU.

Why Build From Scratch?

The standard ML stack — Python, PyTorch, vendor-specific graphs, JIT compilation — was designed for research flexibility, not production efficiency. A typical PyTorch training loop utilizes 30-50% of the GPU. The rest is lost to the Python interpreter, framework abstraction layers, memory management overhead, and synchronization barriers between CPU and GPU.

We eliminated the entire software stack above the GPU. The result: 100% GPU utilization, 93% VRAM usage, and a training pipeline that saturates the memory bus. No interpreter. No JIT. No graph tracing. No CPU/GPU synchronization points. Data moves to the GPU once and stays there until training is complete.

Byte-Level Architecture

Most language models begin with a tokenizer — a preprocessing step that maps raw text into a learned vocabulary of 32K-128K subword tokens. This introduces a dependency on training data distribution, adds latency, and creates a modality boundary. Text, code, audio, and images each require different tokenization strategies.

Our model operates directly on bytes. The vocabulary is 256. Always. Every modality is just bytes. No tokenizer, no preprocessing, no language bias. The architecture is a hybrid Jamba-style design — Mamba-3 selective state spaces with interleaved sparse attention — giving us O(1) inference cost per token instead of the O(n) that transformers require as context grows. The same architecture handles text, code, and will handle audio and image without architectural changes.
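The byte-level front end can be sketched in a few lines. This is an illustrative codec, not Realm's actual code: the "tokenizer" is simply a reinterpretation of the input as unsigned bytes, with the vocabulary fixed at 256 regardless of language or modality.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative byte-level codec. The vocabulary is every possible byte
// value, so no learned vocabulary and no fitting step is needed.
constexpr int kVocabSize = 256;

// Encoding is a reinterpretation: each byte of the input IS a token id.
std::vector<uint8_t> encode(const std::string& text) {
    return std::vector<uint8_t>(text.begin(), text.end());
}

// Decoding is the inverse reinterpretation back to raw bytes.
std::string decode(const std::vector<uint8_t>& ids) {
    return std::string(ids.begin(), ids.end());
}
```

Because encode/decode are lossless byte casts, any input (UTF-8 text, source code, or binary media) round-trips exactly, which is what removes the modality boundary the preceding paragraph describes.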

48 GB Multi-Domain Corpus

The training corpus spans 48 GB of raw bytes drawn from multiple domains: OpenWebText, BookCorpus, Wikipedia, StackExchange, GitHub code, TinyStories structured narratives, and Shakespeare. The entire dataset pipeline is automated: an offline Python download stage (the only Python anywhere in the system, and never on the training or inference path) pulls from HuggingFace, a preparation stage concatenates everything into a single binary file, and the training loop memory-maps that file directly into GPU-addressable space.

No preprocessing. No tokenization. No vocabulary fitting. The model reads raw bytes from the mmap'd file, with a 64-bit RNG sampling uniformly across all 48 GB. Because mmap maps pages lazily, the dataset opens in 0.02 ms regardless of size: whether the file is 1 MB or 100 GB, open latency is identical.
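A minimal sketch of this pattern, assuming POSIX mmap (struct and method names are illustrative, not Realm's API): the file is mapped once, a 64-bit RNG picks uniform window offsets, and no bytes are actually read until a training window touches them, which is why open time is independent of file size.

```cpp
#include <cstdint>
#include <fcntl.h>
#include <random>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Illustrative memory-mapped byte dataset with uniform window sampling.
struct ByteDataset {
    const uint8_t* data = nullptr;
    size_t size = 0;

    // Map the whole file read-only. Pages are faulted in lazily by the
    // kernel, so this returns in microseconds for any file size.
    bool open(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return false;
        struct stat st;
        if (fstat(fd, &st) != 0) { ::close(fd); return false; }
        size = static_cast<size_t>(st.st_size);
        void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the fd is closed
        if (p == MAP_FAILED) return false;
        data = static_cast<const uint8_t*>(p);
        return true;
    }

    // Uniformly sample the start of a contiguous window of seq_len bytes.
    const uint8_t* sample(std::mt19937_64& rng, size_t seq_len) const {
        std::uniform_int_distribution<size_t> dist(0, size - seq_len);
        return data + dist(rng);
    }
};
```

With uniform offsets over the full mapping, every byte of the corpus is reachable with equal probability, matching the "the model sees every byte" claim below.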

VRAM Auto-Tuning

GPU memory is the binding constraint in model training. Underutilize it and you leave throughput on the table; overcommit it and you crash with an out-of-memory error halfway through a training run. Traditional ML frameworks leave this to the user, who manually tunes batch sizes, sequence lengths, and gradient accumulation steps through trial and error.

Our trainer queries the GPU at startup, calculates exact memory requirements per token (model weights, optimizer states, forward activation saves, backward gradient workspace) and automatically selects the largest batch size and sequence length that fit. On the RTX 5090 Laptop GPU with 24 GB VRAM, the auto-tuner achieves 93.3% utilization with zero manual configuration; estimated vs. actual VRAM usage matches within 2%.
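The core of such an auto-tuner is simple arithmetic. The sketch below is illustrative, with placeholder byte costs, not Realm's actual memory model: split the footprint into a batch-independent fixed part (weights plus optimizer states) and a per-token part (activations plus gradient workspace), then solve for the largest batch that fits.

```cpp
#include <cstdint>

// Illustrative VRAM sizing model. Real per-token costs depend on layer
// widths, sequence layout, and the optimizer; these fields just carry
// whatever the trainer computes at startup.
struct MemoryModel {
    uint64_t fixed_bytes;      // weights + optimizer states (batch-independent)
    uint64_t bytes_per_token;  // activation saves + gradient workspace per token
};

// Largest batch size B such that
//   fixed_bytes + B * seq_len * bytes_per_token <= vram_bytes.
uint64_t max_batch(const MemoryModel& m, uint64_t seq_len, uint64_t vram_bytes) {
    if (vram_bytes <= m.fixed_bytes) return 0;  // model alone doesn't fit
    uint64_t budget = vram_bytes - m.fixed_bytes;
    return budget / (seq_len * m.bytes_per_token);
}
```

Because the formula is exact rather than heuristic, the predicted footprint can be compared against the allocator's actual usage after the first step, which is how an estimated-vs-actual gap like the 2% figure above can be verified.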

The Numbers

All measurements taken on a single NVIDIA RTX 5090 Laptop GPU (24 GB VRAM, SM 12.0). FP32 precision — no mixed-precision optimizations applied yet.

  • 77K tokens/second: Training throughput on a 9.6M parameter model with B=19 S=8192 (155K tokens per step). Double-buffered pinned memory transfer overlaps compute with the next batch load. The GPU never stalls.
  • 100% GPU utilization: Measured via GPU profiling, not theoretical. Fused compute kernels eliminate all framework overhead. No Python interpreter, no JIT compilation, no graph tracing.
  • 93% VRAM utilization: The auto-tuner fills 22.4 GB out of 24 GB available. Every megabyte is working — model weights, optimizer states, forward saves, gradient workspace. Estimated vs actual matches within 2%.
  • 48 GB training corpus: Seven datasets, one binary file, zero preprocessing. Memory-mapped with uniform 64-bit random sampling across the entire corpus. The model sees every byte.
  • Crash-resilient training: Automatic checkpoint saving and resume. A system crash at step 36,000 resumes from the last checkpoint with zero wasted compute. Loss curve continues exactly where it left off.
  • Interactive chat inference: Trained models run in a live interactive chat mode — type a prompt, get a response, keep the conversation going. Autoregressive generation runs entirely on-device with zero CPU involvement.

What It Generates

The model learns sentence structure, dialogue, character names, paragraph breaks, and story arcs entirely from the byte stream. No tokenizer taught it where words begin or end. No vocabulary told it what a sentence is. It figured out language from 256 possible values.

With a 48 GB multi-domain corpus, the model sees everything from technical documentation to narrative fiction to source code. Early results show rapid loss convergence — perplexity drops from 257 to 5.6 in the first 700 steps, with the model already producing coherent English prose.
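The starting perplexity is itself a sanity check on the byte-level design. Perplexity is exp(mean cross-entropy loss in nats), and a model guessing uniformly over 256 byte values has loss ln(256) ≈ 5.545 nats, i.e. perplexity 256, which is why training begins near 257 before converging. A one-line sketch of the conversion:

```cpp
#include <cmath>

// Perplexity is the exponential of the mean cross-entropy loss (in nats).
// For a byte-level model the uniform baseline is exp(ln 256) = 256.
double perplexity(double mean_loss_nats) {
    return std::exp(mean_loss_nats);
}
```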

What's Next

The 9.6M parameter model validates the architecture and the training infrastructure. The path forward is vertical scaling — the same fused compute kernels, the same zero-overhead pipeline, applied to 150M and 1B+ parameter models. The byte-level architecture means the same model handles text, code, audio, and images without tokenizer changes or modality-specific preprocessing.

Mixed-precision training (BF16) will roughly double throughput on the same hardware. Multi-GPU training will scale linearly across vendors. The auto-tuner already handles arbitrary GPU memory sizes — the same code runs on a laptop GPU or a datacenter accelerator.

Inference will be served via bidirectional gRPC with post-quantum TLS, the same API layer that powers the Realm blockchain. The same GPU-native philosophy that achieves 1.26 million Dilithium-3 verifications per second now powers end-to-end machine learning.

One stack. One language. One accelerator. From training to inference to settlement.
