
AI Infrastructure
GPU-native ML.
End to end.
OaEngine — GPU-native compute substrate for training and inference. Byte-level LLMs with parallel hypothesis branches, adaptive compute depth, and self-improving architecture. Full GPU pipeline end to end — nothing computes on the CPU. Any accelerator. Zero external dependencies.
9.6M
Parameters Trained
77K tok/s
Throughput
100%
GPU Utilization
10%
VRAM Footprint
Pipeline
End-to-end on a single accelerator.
Truly GPU-native. Embedding, attention, normalization, loss, gradients, optimizer — every operation dispatches to the accelerator. Nothing computes on the CPU. Built on the Oa compute substrate.

Byte Embedding
Fixed vocabulary of 256. No tokenizer, no preprocessing, no language bias. Text, code, audio — every modality is bytes.
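A minimal sketch of what "no tokenizer" means in practice, assuming a toy embedding width (the real table is model-sized): token IDs are just the raw bytes, so any modality maps directly to the 256-entry vocabulary.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// With a byte vocabulary, "tokenization" is just reading raw bytes:
// every string, binary blob, or audio buffer maps to IDs 0..255.
std::vector<uint8_t> to_tokens(const std::string& data) {
    return std::vector<uint8_t>(data.begin(), data.end());
}

constexpr int kVocab = 256;  // fixed byte vocabulary
constexpr int kDim   = 4;    // toy embedding width for illustration

// Embedding lookup: the byte value directly indexes a row of the table.
std::array<float, kDim> embed(
    const std::array<std::array<float, kDim>, kVocab>& table,
    uint8_t token) {
    return table[token];
}
```

The same two functions handle text, code, or audio bytes unchanged; there is no vocabulary to build or mismatch to debug.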

Parallel Branch States
N hypothesis branches run simultaneously. Each explores a different reasoning path. An internal scorer selects the winner. The model deliberates before answering.
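A minimal sketch of the winner-take-all selection step, with hypothetical `Branch` and score fields as stand-ins (the real scorer is learned and runs on-device): each branch carries a hypothesis state and a confidence, and the highest-scoring one is emitted.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for one reasoning branch: a hypothesis state
// plus the internal scorer's confidence in it.
struct Branch {
    std::vector<float> state;
    float score = 0.0f;
};

// Winner-take-all: pick the highest-scoring hypothesis among the N
// branches that ran in parallel.
const Branch& select_winner(const std::vector<Branch>& branches) {
    return *std::max_element(
        branches.begin(), branches.end(),
        [](const Branch& a, const Branch& b) { return a.score < b.score; });
}
```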

Oa Compute
OaEngine dispatches the full training pipeline as fused compute kernels. Any GPU — NVIDIA, AMD, Intel. No vendor lock-in. No framework overhead. Truly GPU-native end to end.

Deploy & Resume
Self-contained .oam checkpoint with embedded compute kernels. Ship one file, run anywhere. Crash-resilient resume with zero wasted compute.
Capabilities
Every layer optimized for throughput.
Inline assembly memory operations. Topology-aware threading with spinlocks and work-stealing. Cross-vendor compute kernels. Post-quantum cryptography. Networking. The full Oa HPC stack powers the ML pipeline.
Truly GPU-Native
Full GPU pipeline end to end. No CPU fallbacks. No host-device ping-pong. Embedding, forward, backward, optimizer step, checkpointing — every operation dispatches to the accelerator. The CPU loads data and writes checkpoints. Everything else lives on device.
Constant-Time Inference
Fixed-size state vector compresses the full context history. The 100th token costs the same as the 100,000th. No KV cache growth, no memory scaling, no latency degradation.
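A minimal sketch of why per-token cost stays constant, using a toy state width and a stand-in update rule (the real update is the model's learned forward pass): each byte is folded into a fixed-size state, so memory and compute per token never grow with context length.

```cpp
#include <array>
#include <cstdint>

constexpr int kStateDim = 8;  // toy size; the real state is model-sized
using State = std::array<float, kStateDim>;

// Fold one input byte into the fixed-size state. The decaying mix here
// is purely illustrative; the point is that the state's size is constant
// regardless of how many tokens have been consumed.
State step(State s, uint8_t byte) {
    for (int i = 0; i < kStateDim; ++i)
        s[i] = 0.9f * s[i] + 0.1f * static_cast<float>((byte + i) % 7);
    return s;
}
```

Contrast with a KV cache, where per-token attention cost and memory grow linearly with the number of tokens already seen.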
Zero Framework Overhead
No interpreter, no JIT, no dynamic dispatch. OaEngine compiles the full training pipeline to a single binary with embedded compute kernels. Pure C++ — measured 80-160x faster than an equivalent PyTorch pipeline.
Infinite Dataset Streaming
Zero-copy data pipeline with inline assembly memory operations. Datasets of any size stream directly to the accelerator without staging in system memory. 1 GB or 1 TB — the GPU never waits for data.
Config-Driven Scaling
The same codebase runs from 140K to 10B+ parameters. No architecture rewrites, no code changes. Validated at 9.6M with coherent text generation. Scaling to 8B with dedicated hardware.
Crash-Resilient Training
Metric-based checkpointing with automatic rotation and seamless resume. Training continues from the exact step after any interruption. Zero wasted compute, zero manual intervention.
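A minimal sketch of metric-based rotation, with hypothetical names (`Ckpt`, `Rotation`): keep the most recent K checkpoints, but never delete the best-metric one, so resume always has both a fresh and a best state to start from.

```cpp
#include <deque>
#include <limits>
#include <string>

// Hypothetical checkpoint record: file path plus the metric it achieved.
struct Ckpt {
    std::string path;
    float loss;
};

// Rotation policy: retain the last `keep` checkpoints; the best-loss
// checkpoint is always preserved even when it ages out of the window.
class Rotation {
    std::deque<Ckpt> recent_;
    Ckpt best_{"", std::numeric_limits<float>::max()};
    size_t keep_;
public:
    explicit Rotation(size_t keep) : keep_(keep) {}

    // Register a new checkpoint; returns the path that should be
    // deleted, or "" if nothing is evicted.
    std::string add(Ckpt c) {
        if (c.loss < best_.loss) best_ = c;
        recent_.push_back(c);
        if (recent_.size() <= keep_) return "";
        Ckpt old = recent_.front();
        recent_.pop_front();
        return old.path == best_.path ? "" : old.path;
    }
};
```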

Foundation Model
OaLlm
Byte-level language model built on OaEngine. Fixed vocabulary of 256 — no tokenizer, no vocabulary mismatch, no language bias. Text, code, images, audio — every modality is bytes.
Self-improving architecture with parallel hypothesis branches. N reasoning paths run simultaneously — an internal scorer selects the winner, losers immediately restart. Adaptive compute depth scales per token automatically. The model deliberates before answering. Validated at 9.6M parameters, scaling to 8B+.
Built in C++ on OaEngine. Runs on any GPU — NVIDIA, AMD, Intel. No vendor dependency. 100% GPU utilization at 10% VRAM footprint. Constant-time inference at any context length.
9.6M
Parameters (validated)
256 bytes
Vocabulary
100%
GPU Utilization
10%
VRAM Footprint
Model Format
.oam — Ship the model,
ship the compute.
A single .oam file contains everything needed to run a model — architecture config, weights, optimizer state, training progress, and the compiled compute kernels that power it. No external dependencies. No runtime compilation. No matching framework versions.
Drop the file on any machine with a GPU. The model runs. The kernels are embedded in the file. Nothing to install. Nothing to configure. One file is the entire deployment.
Built on the same substrate as the Oa library — the compute kernels in the .oam are identical to the ones used during training. What you trained is exactly what you deploy.
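A minimal sketch of the single-file idea, with an assumed header struct (the actual .oam binary layout is not specified here): config, weights, optimizer state, progress, and kernels travel together in one file, so the header only needs section sizes.

```cpp
#include <cstdint>
#include <cstring>

// Assumed header for illustration; a real format would also pin
// endianness and carry a version field.
struct OamHeader {
    char     magic[4];      // assumed tag, e.g. 'O','A','M','1'
    uint32_t config_bytes;  // architecture config (dims, vocab)
    uint64_t weight_bytes;  // named tensors (embed, layers, head)
    uint64_t opt_bytes;     // AdamW moments + hyperparams
    uint64_t step;          // training progress: optimizer step
    uint32_t kernel_bytes;  // embedded compiled compute kernels
};

// Plain-copy serialization, illustrative only: the sections named in
// the header follow it back-to-back in the same file.
void write_header(unsigned char* dst, const OamHeader& h) {
    std::memcpy(dst, &h, sizeof h);
}
OamHeader read_header(const unsigned char* src) {
    OamHeader h;
    std::memcpy(&h, src, sizeof h);
    return h;
}
```

Because the compiled kernels are one of the sections, loading the file is the whole deployment step; nothing external has to match a framework version.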
model.oam
.oam file layout
─────────────────────
Config     Architecture, dimensions, vocab
Weights    Named tensors (embed, layers, head)
Optimizer  AdamW moments, hyperparams
Progress   Step, bytes seen, best metric
Kernels    Compiled compute kernels (model-selective)
One file. Load it. Run it. Any GPU.
1
File
0
Dependencies
Any
GPU
Output
Model output at 9.6M parameters.
Trained on 1.8 GB of narrative data. 50,000 optimization steps. 35 minutes on a single GPU. Generation runs entirely on-device via OaEngine — zero CPU involvement, any GPU.
temp=0.3
> Once upon a time, there was a
< Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Jack wanted to play with the ball. He was very sad and scared. He was very happy and hugged his mom. They played with the ball and having fun. One day, Timmy went to the park with his mom and dad.
temp=0.3
> Hi! My name is
< Hi! My name is Lily. It is so pretty. I want to see the bird and the birds are so many things to do." The bird smiled and said, "Yes, we can still stay here and the bug passed away. When they got to the park, they saw a big box in the garden.
temp=0.7
> One day, a boy named Max
< One day, a boy named Max went to the store with his mom. They played with the elephants and paper with her amazing trail with his special world. So his mom asked him to help her. "Are you OK, you did not mean to break the dog. They take off the bed and wanted to play with the rock." The little boy was happy to share his favorite toy.
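A minimal sketch of the temperature settings shown above, using standard temperature-scaled softmax sampling over the 256-byte vocabulary (the function name and toy logits are illustrative): lower temperature sharpens the distribution toward the most likely byte, which is why temp=0.3 reads more conservative than temp=0.7.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Sample one byte from temperature-scaled logits. Subtracting the max
// logit before exponentiating keeps the softmax numerically stable.
uint8_t sample_byte(const std::vector<float>& logits, float temp,
                    std::mt19937& rng) {
    std::vector<float> weights(logits.size());
    float mx = *std::max_element(logits.begin(), logits.end());
    for (size_t i = 0; i < logits.size(); ++i)
        weights[i] = std::exp((logits[i] - mx) / temp);
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return static_cast<uint8_t>(dist(rng));
}
```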
Comparison
Platform benchmarks.
| Metric | OaEngine | PyTorch | JAX | llama.cpp |
|---|---|---|---|---|
| Training (9.6M) | 77K tok/s | ~5-10K | ~8-15K | N/A |
| VRAM (9.6M) | 2.6 GB (10%) | ~8+ GB | ~6+ GB | N/A |
| GPU Utilization | 100% | 30-50% | 40-60% | N/A |
| Inference Scaling | O(1) constant | O(n) per token | O(n) per token | O(n) per token |
| Dependencies | 0 | ~150 | ~80 | 0 |
| Runtime | C++ | Python | Python | C/C++ |
| GPU Vendors | All | NVIDIA/AMD | NVIDIA/TPU | NVIDIA/CPU |
| Tokenizer | None (bytes) | Required | Required | Required |

Performance
Verified on hardware.
All metrics measured on high-end consumer GPU hardware (24 GB VRAM). FP32 precision — no mixed-precision optimizations applied yet. Production deployment targets dedicated accelerator clusters.
77K tok/s
Throughput (9.6M)
800K tok/s
Throughput (140K)
0.75
Final Loss (TinyStories)
35 min
Time to Convergence
Early Access
Interested in what we're building?
We're building toward managed LLM inference with post-quantum security on the OaEngine compute substrate. Looking for infrastructure partners and early adopters.
Get in Touch