Realm

GPU-native ML.
End to end.

OaEngine — GPU-native compute substrate for training and inference. Byte-level LLMs with parallel hypothesis branches, adaptive compute depth, and self-improving architecture. Full GPU pipeline end to end — no compute touches the CPU. Any accelerator. Zero external dependencies.

9.6M

Parameters Trained

77K tok/s

Throughput

100%

GPU Utilization

10%

VRAM Footprint

End-to-end on a single accelerator.

Truly GPU-native. Embedding, attention, normalization, loss, gradients, optimizer — every operation dispatches to the accelerator. Nothing computes on the CPU. Built on the Oa compute substrate.

01

Byte Embedding

Fixed vocabulary of 256. No tokenizer, no preprocessing, no language bias. Text, code, audio — every modality is bytes.

02

Parallel Branch States

N hypothesis branches run simultaneously. Each explores a different reasoning path. An internal scorer selects the winner. The model deliberates before answering.

03

Oa Compute

OaEngine dispatches the full training pipeline as fused compute kernels. Any GPU — NVIDIA, AMD, Intel. No vendor lock-in. No framework overhead. Truly GPU-native end to end.

04

Deploy & Resume

Self-contained .oam checkpoint with embedded compute kernels. Ship one file, run anywhere. Crash-resilient resume with zero wasted compute.

Every layer optimized for throughput.

Inline assembly memory operations. Topology-aware threading with spinlocks and work-stealing. Cross-vendor compute kernels. Post-quantum cryptography. Networking. The full Oa HPC stack powers the ML pipeline.

01

Truly GPU-Native

Full GPU pipeline end to end. No CPU fallbacks. No host-device ping-pong. Embedding, forward, backward, optimizer step, checkpointing — every operation dispatches to the accelerator. The CPU loads data and writes checkpoints. Everything else lives on device.

02

Constant-Time Inference

Fixed-size state vector compresses the full context history. The 100th token costs the same as the 100,000th. No KV cache growth, no memory scaling, no latency degradation.

03

Zero Framework Overhead

No interpreter, no JIT, no dynamic dispatch. OaEngine compiles the full training pipeline to a single binary with embedded compute kernels. Pure C++ — measured 80-160x faster than equivalent PyTorch.

04

Infinite Dataset Streaming

Zero-copy data pipeline with inline assembly memory operations. Datasets of any size stream directly to the accelerator without staging in system memory. 1 GB or 1 TB — the GPU never waits for data.

05

Config-Driven Scaling

The same codebase runs from 140K to 10B+ parameters. No architecture rewrites, no code changes. Validated at 9.6M with coherent text generation. Scaling to 8B with dedicated hardware.

06

Crash-Resilient Training

Metric-based checkpointing with automatic rotation and seamless resume. Training continues from the exact step after any interruption. Zero wasted compute, zero manual intervention.

ReaLLM Foundation Model

OaLlm

Byte-level language model built on OaEngine. Fixed vocabulary of 256 — no tokenizer, no vocabulary mismatch, no language bias. Text, code, images, audio — every modality is bytes.

Self-improving architecture with parallel hypothesis branches. N reasoning paths run simultaneously — an internal scorer selects the winner, losers immediately restart. Adaptive compute depth scales per token automatically. The model deliberates before answering. Validated at 9.6M parameters, scaling to 8B+.

Built in C++ on OaEngine. Runs on any GPU — NVIDIA, AMD, Intel. No vendor dependency. 100% GPU utilization at 10% VRAM footprint. Constant-time inference at any context length.

9.6M

Parameters (validated)

256 bytes

Vocabulary

100%

GPU Utilization

10%

VRAM Footprint

.oam — Ship the model, ship the compute.

A single .oam file contains everything needed to run a model — architecture config, weights, optimizer state, training progress, and the compiled compute kernels that power it. No external dependencies. No runtime compilation. No matching framework versions.

Drop the file on any machine with a GPU. The model runs. The kernels are embedded in the file. Nothing to install. Nothing to configure. One file is the entire deployment.

Built on the same substrate as the Oa library — the compute kernels in the .oam are identical to the ones used during training. What you trained is exactly what you deploy.

model.oam

.oam file layout
─────────────────────
Config Architecture, dimensions, vocab
Weights Named tensors (embed, layers, head)
Optimizer AdamW moments, hyperparams
Progress Step, bytes seen, best metric
Kernels Compiled compute kernels (model-selective)
One file. Load it. Run it. Any GPU.

1

File

0

Dependencies

Any

GPU

Model output at 9.6M parameters.

Trained on 1.8 GB of narrative data. 50,000 optimization steps. 35 minutes on a single GPU. Generation runs entirely on-device via OaEngine — zero CPU involvement, any GPU.

temp=0.3

> Once upon a time, there was a
< Once upon a time, there was a little girl named Lily.
She loved to play with her toys and her friends.
One day, Jack wanted to play with the ball.
He was very sad and scared.
He was very happy and hugged his mom.
They played with the ball and having fun.
One day, Timmy went to the park with his mom and dad.
interactive chat

temp=0.3

> Hi! My name is
< Hi! My name is Lily.
It is so pretty.
I want to see the bird and the birds are so many things to do."
The bird smiled and said, "Yes, we can still stay here
and the bug passed away.
When they got to the park, they saw a big box in the garden.

temp=0.7

> One day, a boy named Max
< One day, a boy named Max went to the store with his mom.
They played with the elephants and paper with her amazing
trail with his special world.
So his mom asked him to help her.
"Are you OK, you did not mean to break the dog.
They take off the bed and wanted to play with the rock."
The little boy was happy to share his favorite toy.

Platform benchmarks.

Metric              Realm           PyTorch          JAX              llama.cpp
────────────────────────────────────────────────────────────────────────────────
Training (9.6M)     77K tok/s       ~5-10K tok/s     ~8-15K tok/s     N/A
VRAM (9.6M)         2.6 GB (10%)    ~8+ GB           ~6+ GB           N/A
GPU Utilization     100%            30-50%           40-60%           N/A
Inference Scaling   O(1) constant   O(n) per token   O(n) per token   O(n) per token
Dependencies        0               ~150             ~80              0
Runtime             C++             Python           Python           C/C++
GPU Vendors         All             NVIDIA/AMD       NVIDIA/TPU       NVIDIA/CPU
Tokenizer           None (bytes)    Required         Required         Required
GPU infrastructure

Verified on hardware.

All metrics measured on high-end consumer GPU hardware (24 GB VRAM). FP32 precision — no mixed precision optimizations applied yet. Production deployment targets dedicated accelerator clusters.

77K tok/s

Throughput (9.6M)

800K tok/s

Throughput (140K)

0.75

Final Loss (TinyStories)

35 min

Time to Convergence

Interested in what we're building?

We're building toward managed LLM inference with post-quantum security on the OaEngine compute substrate. Looking for infrastructure partners and early adopters.

Get in Touch