
AI Infrastructure
GPU-native ML.
End to end.
OaEngine — GPU-native compute substrate for training and inference. Byte-level LLMs with parallel hypothesis branches, adaptive compute depth, and self-improving architecture. Full GPU pipeline end to end — nothing computes on the CPU. Any accelerator. Zero external dependencies.
9.6M
Parameters Trained
77K tok/s
Throughput
100%
GPU Utilization
10%
VRAM Footprint
Pipeline
End-to-end on a single accelerator.
Truly GPU-native. Embedding, attention, normalization, loss, gradients, optimizer — every operation dispatches to the accelerator. Nothing computes on the CPU. Built on the Oa compute substrate.

Byte Embedding
Fixed vocabulary of 256. No tokenizer, no preprocessing, no language bias. Text, code, audio — every modality is bytes.
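A minimal sketch of what "no tokenizer" means in practice, assuming a toy embedding width (the real table is model-sized): token IDs are just the raw bytes, so any modality maps directly to the 256-entry vocabulary.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// With a byte vocabulary, "tokenization" is just reading raw bytes:
// every string, binary blob, or audio buffer maps to IDs 0..255.
std::vector<uint8_t> to_tokens(const std::string& data) {
    return std::vector<uint8_t>(data.begin(), data.end());
}

constexpr int kVocab = 256;  // fixed byte vocabulary
constexpr int kDim   = 4;    // toy embedding width for illustration

// Embedding lookup: the byte value directly indexes a row of the table.
std::array<float, kDim> embed(
    const std::array<std::array<float, kDim>, kVocab>& table,
    uint8_t token) {
    return table[token];
}
```

The same two functions handle text, code, or audio bytes unchanged; there is no vocabulary to build or mismatch to debug.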

Parallel Branch States
N hypothesis branches run simultaneously. Each explores a different reasoning path. An internal scorer selects the winner. The model deliberates before answering.
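A minimal sketch of the winner-take-all selection step, with hypothetical `Branch` and score fields as stand-ins (the real scorer is learned and runs on-device): each branch carries a hypothesis state and a confidence, and the highest-scoring one is emitted.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for one reasoning branch: a hypothesis state
// plus the internal scorer's confidence in it.
struct Branch {
    std::vector<float> state;
    float score = 0.0f;
};

// Winner-take-all: pick the highest-scoring hypothesis among the N
// branches that ran in parallel.
const Branch& select_winner(const std::vector<Branch>& branches) {
    return *std::max_element(
        branches.begin(), branches.end(),
        [](const Branch& a, const Branch& b) { return a.score < b.score; });
}
```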

Oa Compute
OaEngine dispatches the full training pipeline as fused compute kernels. Any GPU — NVIDIA, AMD, Intel. No vendor lock-in. No framework overhead. Truly GPU-native end to end.

Deploy & Resume
Self-contained .oam checkpoint with embedded compute kernels. Ship one file, run anywhere. Crash-resilient resume with zero wasted compute.
Capabilities
Every layer optimized for throughput.
Inline assembly memory operations. Topology-aware threading with spinlocks and work-stealing. Cross-vendor compute kernels. Post-quantum cryptography. Networking. The full Oa HPC stack powers the ML pipeline.
Truly GPU-Native
Full GPU pipeline end to end. No CPU fallbacks. No host-device ping-pong. Embedding, forward, backward, optimizer step, checkpointing — every operation dispatches to the accelerator. The CPU loads data and writes checkpoints. Everything else lives on device.
Constant-Time Inference
Fixed-size state vector compresses the full context history. The 100th token costs the same as the 100,000th. No KV cache growth, no memory scaling, no latency degradation.
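A minimal sketch of why per-token cost stays constant, using a toy state width and a stand-in update rule (the real update is the model's learned forward pass): each byte is folded into a fixed-size state, so memory and compute per token never grow with context length.

```cpp
#include <array>
#include <cstdint>

constexpr int kStateDim = 8;  // toy size; the real state is model-sized
using State = std::array<float, kStateDim>;

// Fold one input byte into the fixed-size state. The decaying mix here
// is purely illustrative; the point is that the state's size is constant
// regardless of how many tokens have been consumed.
State step(State s, uint8_t byte) {
    for (int i = 0; i < kStateDim; ++i)
        s[i] = 0.9f * s[i] + 0.1f * static_cast<float>((byte + i) % 7);
    return s;
}
```

Contrast with a KV cache, where per-token attention cost and memory grow linearly with the number of tokens already seen.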
Zero Framework Overhead
No interpreter, no JIT, no dynamic dispatch. OaEngine compiles the full training pipeline to a single binary with embedded compute kernels. Pure C++ — measured 80-160x faster than an equivalent PyTorch pipeline.
Infinite Dataset Streaming
Zero-copy data pipeline with inline assembly memory operations. Datasets of any size stream directly to the accelerator without staging in system memory. 1 GB or 1 TB — the GPU never waits for data.
Config-Driven Scaling
The same codebase runs from 140K to 10B+ parameters. No architecture rewrites, no code changes. Validated at 9.6M with coherent text generation. Scaling to 8B with dedicated hardware.
Crash-Resilient Training
Metric-based checkpointing with automatic rotation and seamless resume. Training continues from the exact step after any interruption. Zero wasted compute, zero manual intervention.
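A minimal sketch of metric-based rotation, with hypothetical names (`Ckpt`, `Rotation`): keep the most recent K checkpoints, but never delete the best-metric one, so resume always has both a fresh and a best state to start from.

```cpp
#include <deque>
#include <limits>
#include <string>

// Hypothetical checkpoint record: file path plus the metric it achieved.
struct Ckpt {
    std::string path;
    float loss;
};

// Rotation policy: retain the last `keep` checkpoints; the best-loss
// checkpoint is always preserved even when it ages out of the window.
class Rotation {
    std::deque<Ckpt> recent_;
    Ckpt best_{"", std::numeric_limits<float>::max()};
    size_t keep_;
public:
    explicit Rotation(size_t keep) : keep_(keep) {}

    // Register a new checkpoint; returns the path that should be
    // deleted, or "" if nothing is evicted.
    std::string add(Ckpt c) {
        if (c.loss < best_.loss) best_ = c;
        recent_.push_back(c);
        if (recent_.size() <= keep_) return "";
        Ckpt old = recent_.front();
        recent_.pop_front();
        return old.path == best_.path ? "" : old.path;
    }
};
```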

Foundation Model
OaLlm
Byte-level language model built on OaEngine. Fixed vocabulary of 256 — no tokenizer, no vocabulary mismatch, no language bias. Text, code, images, audio — every modality is bytes.
Self-improving architecture with parallel hypothesis branches. N reasoning paths run simultaneously — an internal scorer selects the winner, losers immediately restart. Adaptive compute depth scales per token automatically. The model deliberates before answering. Validated at 9.6M parameters, scaling to 8B+.
Built in C++ on OaEngine. Runs on any GPU — NVIDIA, AMD, Intel. No vendor dependency. 100% GPU utilization at 10% VRAM footprint. Constant-time inference at any context length.
9.6M
Parameters (validated)
256 bytes
Vocabulary
100%
GPU Utilization
10%
VRAM Footprint
Model Format
.oam — Ship the model,
ship the compute.
A single .oam file contains everything needed to run a model — architecture config, weights, optimizer state, training progress, and the compiled compute kernels that power it. No external dependencies. No runtime compilation. No matching framework versions.
Drop the file on any machine with a GPU. The model runs. The kernels are embedded in the file. Nothing to install. Nothing to configure. One file is the entire deployment.
Built on the same substrate as the Oa library — the compute kernels in the .oam are identical to the ones used during training. What you trained is exactly what you deploy.
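A minimal sketch of the single-file idea, with an assumed header struct (the actual .oam binary layout is not specified here): config, weights, optimizer state, progress, and kernels travel together in one file, so the header only needs section sizes.

```cpp
#include <cstdint>
#include <cstring>

// Assumed header for illustration; a real format would also pin
// endianness and carry a version field.
struct OamHeader {
    char     magic[4];      // assumed tag, e.g. 'O','A','M','1'
    uint32_t config_bytes;  // architecture config (dims, vocab)
    uint64_t weight_bytes;  // named tensors (embed, layers, head)
    uint64_t opt_bytes;     // AdamW moments + hyperparams
    uint64_t step;          // training progress: optimizer step
    uint32_t kernel_bytes;  // embedded compiled compute kernels
};

// Plain-copy serialization, illustrative only: the sections named in
// the header follow it back-to-back in the same file.
void write_header(unsigned char* dst, const OamHeader& h) {
    std::memcpy(dst, &h, sizeof h);
}
OamHeader read_header(const unsigned char* src) {
    OamHeader h;
    std::memcpy(&h, src, sizeof h);
    return h;
}
```

Because the compiled kernels are one of the sections, loading the file is the whole deployment step; nothing external has to match a framework version.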
model.oam
.oam file layout
─────────────────────
Config     Architecture, dimensions, vocab
Weights    Named tensors (embed, layers, head)
Optimizer  AdamW moments, hyperparams
Progress   Step, bytes seen, best metric
Kernels    Compiled compute kernels (model-selective)
One file. Load it. Run it. Any GPU.
1
File
0
Dependencies
Any
GPU
Output
Model output at 9.6M parameters.
Trained on 1.8 GB of narrative data. 50,000 optimization steps. 35 minutes on a single GPU. Generation runs entirely on-device via OaEngine — zero CPU involvement, any GPU.
temp=0.3
> Once upon a time, there was a
< Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Jack wanted to play with the ball. He was very sad and scared. He was very happy and hugged his mom. They played with the ball and having fun. One day, Timmy went to the park with his mom and dad.
temp=0.3
> Hi! My name is
< Hi! My name is Lily. It is so pretty. I want to see the bird and the birds are so many things to do." The bird smiled and said, "Yes, we can still stay here and the bug passed away. When they got to the park, they saw a big box in the garden.
temp=0.7
> One day, a boy named Max
< One day, a boy named Max went to the store with his mom. They played with the elephants and paper with her amazing trail with his special world. So his mom asked him to help her. "Are you OK, you did not mean to break the dog. They take off the bed and wanted to play with the rock." The little boy was happy to share his favorite toy.
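A minimal sketch of the temperature settings shown above, using standard temperature-scaled softmax sampling over the 256-byte vocabulary (the function name and toy logits are illustrative): lower temperature sharpens the distribution toward the most likely byte, which is why temp=0.3 reads more conservative than temp=0.7.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Sample one byte from temperature-scaled logits. Subtracting the max
// logit before exponentiating keeps the softmax numerically stable.
uint8_t sample_byte(const std::vector<float>& logits, float temp,
                    std::mt19937& rng) {
    std::vector<float> weights(logits.size());
    float mx = *std::max_element(logits.begin(), logits.end());
    for (size_t i = 0; i < logits.size(); ++i)
        weights[i] = std::exp((logits[i] - mx) / temp);
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return static_cast<uint8_t>(dist(rng));
}
```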
Comparison
Platform benchmarks.
| Metric | OaEngine | PyTorch | JAX | llama.cpp |
|---|---|---|---|---|
| Training (9.6M) | 77K tok/s | ~5-10K | ~8-15K | N/A |
| VRAM (9.6M) | 2.6 GB (10%) | ~8+ GB | ~6+ GB | N/A |
| GPU Utilization | 100% | 30-50% | 40-60% | N/A |
| Inference Scaling | O(1) constant | O(n) per token | O(n) per token | O(n) per token |
| Dependencies | 0 | ~150 | ~80 | 0 |
| Runtime | C++ | Python | Python | C/C++ |
| GPU Vendors | All | NVIDIA/AMD | NVIDIA/TPU | NVIDIA/CPU |
| Tokenizer | None (bytes) | Required | Required | Required |

Performance
Verified on hardware.
All metrics measured on high-end consumer GPU hardware (24 GB VRAM). FP32 precision — no mixed-precision optimizations applied yet. Production deployment targets dedicated accelerator clusters.
77K tok/s
Throughput (9.6M)
800K tok/s
Throughput (140K)
0.75
Final Loss (TinyStories)
35 min
Time to Convergence
Early Access
Interested in what we're building?
We're building toward managed LLM inference with post-quantum security on the OaEngine compute substrate. Looking for infrastructure partners and early adopters.
Get in Touch