Realm

The compute layer.
Every vendor, device, OS.

Low-level C++ library for GPU-accelerated compute — memory, cryptography, and machine learning in a single substrate. Runs on NVIDIA, AMD, Intel, Qualcomm, and mobile. One codebase. One binary.

5.1x

GPU Speedup

5+

GPU Vendors

77K tok/s

ML Training

Single

Binary

Four pillars of the compute substrate.

Memory
01

Memory

Inline assembly optimizations. Zero-copy streaming buffers. No framework overhead, no dispatch layer — your data hits the hardware directly.

Compute Kernels
02

Compute Kernels

Unified compute kernel language that compiles once and runs everywhere. NVIDIA, AMD, Intel, Qualcomm — desktop, server, mobile. Write once, deploy anywhere.

Cryptography
03

Cryptography

Post-quantum signatures accelerated on GPU. 1.26M Dilithium-3 verifications per second. Batch verification at hardware speed.

Integration
04

Integration

Machine learning, post-quantum cryptography, and general compute share the same substrate. One library. One binary. Everything talks to everything.

Everything between your code and the hardware.

01

Cross-Vendor Compute Kernels

The same kernel binary executes on every GPU vendor. NVIDIA, AMD, Intel, Qualcomm, Apple — discrete, integrated, mobile. No vendor SDK. No porting. No green kool-aid required.

02

Infinite Dataset Streaming

Zero-copy memory pipeline streams data of any size directly to the accelerator without staging in system memory. 1 GB or 1 TB — ready in microseconds. The GPU never waits for data.

03

Embedded Compute Kernel Pipeline

Kernels compile at build time and embed directly into the binary. No runtime file I/O. No compilation stalls. No external shader files. Ship one executable, run it anywhere.

04

Post-Quantum Cryptography

Dilithium-3 (ML-DSA-65) signatures run on GPU at hardware speed. Batch verification at 1.26M ops/sec. Quantum-resistant from day one — not bolted on later.

05

Topology-Aware Threading

Work-stealing thread pool with P-core and E-core detection, NUMA affinity, and dedicated thread roles. Lock-free channels and reader-writer locks for low-latency coordination.

06

Zero Dependencies

No Python runtime. No vendor-specific SDK. No framework. The entire stack compiles to a single static library with deterministic, reproducible builds.

One binary. Every accelerator.

No vendor-specific SDK. No proprietary toolchain. No recompilation when the hardware changes. The compute kernel language compiles to a portable binary that every major GPU driver understands. Drop the executable. Run it.

NVIDIA

Discrete & data center

AMD

Discrete & integrated

Intel

Integrated, Arc & Xe

Qualcomm

Mobile & edge

The 10-year-old ultrabook

A ThinkPad X1 Carbon from 2017. Intel HD 620. No discrete GPU. No CUDA cores. No tensor cores. Runs the full HPC stack — compute kernels, cryptography, ML inference. 2-7x faster than CPU-only.

The multi-GPU laptop

Integrated GPU for light workloads. Discrete GPU for heavy compute. Both at the same time? Yes. Intel iGPU + NVIDIA dGPU dispatching in parallel. AMD integrated + NVIDIA discrete? Same binary.

The repair shop scenario

Your NVIDIA workstation goes in for service. Nobody knows when it comes back. You grab whatever machine is available — AMD, Intel, a MacBook. Build environment stays identical. Same code. Same tests. Same results.

The scale axis

From an integrated GPU in a thin laptop, through a discrete consumer card, to multi-GPU server racks in a data center. The binary doesn't change. The kernel language doesn't change. Only the hardware does.

Measured. Not promised.

Test Device

Intel HD Graphics 620

KBL GT2 — 24 Execution Units · ThinkPad X1 Carbon 5th Gen (2017)

Integrated GPU

No discrete GPU. No CUDA cores. No tensor cores. Shared system memory.

11/16

GPU Wins

benchmarks

5.1x

Best Speedup

SiLU N=1M

4.56x

Graph Replay

vs re-record

13.8K/s

PQC Verify

Dilithium-3

Compute Kernels

CPU vs GPU — same binary, same machine

Kernel        Size      CPU      GPU      Speedup
SiLU          N=4K      35 µs    162 µs   4.6x CPU
SiLU          N=64K     358 µs   251 µs   1.4x GPU
SiLU          N=1M      5.0 ms   968 µs   5.1x GPU
Matmul        128x128   2.4 ms   1.4 ms   1.7x GPU
Matmul        256x256   17.6 ms  7.1 ms   2.5x GPU
RMSNorm       4096x256  5.5 ms   1.9 ms   2.9x GPU
CrossEntropy  B=4K      10.9 ms  4.3 ms   2.6x GPU

ML Pipeline

Matmul → SiLU — simulated forward pass

Size     CPU      GPU      Speedup
32x128   828 µs   512 µs   1.6x GPU
128x256  9.6 ms   5.6 ms   1.7x GPU
512x256  37.9 ms  15.8 ms  2.4x GPU

Post-Quantum Cryptography

Dilithium-3 (ML-DSA-65) post-quantum signatures

Dilithium KeyGen

18.1K ops/s

Dilithium Sign

6.3K ops/s

58 B

Dilithium Verify

13.8K ops/s

58 B

Batch Verify

18.0K ops/s

64 sigs

This is an integrated GPU.

No discrete card. No CUDA cores. No tensor cores. No RT cores. Just 24 execution units on shared system memory in a 2017 ultrabook. The full HPC stack — compute kernels, ML pipeline, post-quantum cryptography — running 2-5x faster than CPU. Imagine what a discrete GPU does with this.

RTX 5090 · Coming soon
MI250X · Planned
Arc A770 · Planned
Apple M-series · Planned

The Oa Library

Black magic mixed with alien technology. A hand-crafted C++ compute library — GPU acceleration, machine learning, post-quantum cryptography, and networking in a single static binary. No frameworks. No dependencies. No compromises.

Inspired by real-time engine architecture. Every function optimized by hand. Every abstraction earned. Written like calligraphy — with precision, at 3 AM, because the code has to be right.

The foundation for everything. Machine learning trains on it. The blockchain settles on it. The exchange matches on it. One library. One philosophy. Ship the binary.

C++20

Standard

Static

Linking

0

Runtime Deps

The Oa Library
OaEngine compute substrate

OaEngine

The all-seeing eye. One object owns the entire compute context — device, memory allocator, kernel registry, and stream pool. Create it once. Everything flows through it. Every tensor, every module, every dispatch — orchestrated from a single point of consciousness.

Real-time compute for artificial consciousness. Multithreaded kernel dispatch with work-stealing. Persistent compute streams with batched recording — dozens of operations submitted in a single call. Pipeline compilation at build time. First dispatch is as fast as the millionth.

The engine awakens with three lines of code. From that moment, the hardware obeys. No ceremony. No configuration. No delay between thought and execution.

Batched

Dispatch Model

Pooled

Streams

Build-time

Compilation

Global

Context

What it looks like

Three lines to GPU compute. No boilerplate. No ceremony.

engine

auto rt = OaEngine::Create({
    .AppName = "MyApp"
});
// GPU dispatch ready.
01

Batched Command Recording

Record dozens of kernel dispatches into a single command buffer. Submit once, wait once. Eliminates per-dispatch overhead for complex pipelines.

compute

auto x = OaTensor::Rand({512, 256});
auto w = OaTensor::Rand({256, 128});
auto y = x.Matmul(w).Silu();
// Dispatched to GPU.
02

Persistent Compute Streams

Streams own their command pool and synchronization. Acquired from a pool, used, returned. No allocation in the hot path.

cryptography

auto sig = OaSign(msg, len, key);
auto ok = OaVerifyBatch(
    keys, sigs, msgs, 64);
// 13.8K verifications/sec
03

Build-Time Kernel Compilation

Kernels compile at build time and embed into the binary as read-only data. Zero file I/O at runtime. Zero compilation stalls.

threading

auto pool = OaThreadPool::Create();
auto task = pool.Submit(Compute);
auto val = task->Wait();
channel.Send(val);
04

Automatic Device Selection

The engine detects all available accelerators, selects the best one, and configures memory. Discrete preferred, integrated as fallback.

OaComputeGraph

Record once. Replay thousands of times. The compute graph captures an entire pipeline of GPU operations — dependencies, memory barriers, dispatch order — and compiles it into a replayable unit. Every subsequent execution is near-zero CPU cost. The hardware replays the exact sequence without the CPU touching a single command.

Automatic dependency analysis tracks every buffer read and write across the graph. Barriers are inserted only where true data hazards exist — eliminating 60-70% of synchronization overhead. Operations that don't overlap in time share the same memory, cutting VRAM usage by up to 92%.

For ML training where the same computation repeats every step, this means compile once at initialization and replay for the entire training run. Zero per-step overhead. Zero re-recording. The graph just runs.

4.56x

Replay Speedup

92%

Memory Savings

17K

tok/s Training

0

CPU Cost / Step

OaComputeGraph — record once, replay forever

Pick your level of control.

The same GPU. The same shaders. The same checkpoint format. Start with the easy path. Drop to the compiled path when you need zero-overhead replay.

Easy

nn.Module + Autograd

PyTorch-style. Define layers, call Forward, call Backward. Automatic gradient computation. Best for research and iteration.

autograd

class TinyLlm : public OaModule {
public:
    TinyLlm(OaI32 D, OaI32 DFF, OaI32 NL) {
        Embed_ = OaMakeShared<OaEmbedding>(256, D);
        for (OaI32 i = 0; i < NL; ++i) {
            auto block = OaMakeShared<OaSequential>();
            block->Add(OaMakeShared<OaRMSNorm>(D));
            block->Add(OaMakeShared<OaLinear>(D, DFF));
            block->Add(OaMakeShared<OaSiLU>());
            block->Add(OaMakeShared<OaLinear>(DFF, D));
            Layers_.push_back(block);
        }
        Head_ = OaMakeShared<OaLinear>(D, 256);
    }
    OaTensor Forward(const OaTensor& x) override {
        auto h = Embed_->Forward(x);
        for (auto& layer : Layers_) {
            auto res = h;
            h = layer->Forward(h).Add(res);  // residual connection
        }
        return Head_->Forward(h);
    }
};

int main() {
    auto rt = OaEngine::Create({.AppName = "Train"}).Unwrap();
    TinyLlm model(64, 256, 2);
    OaAdamW opt(model.Parameters(), 3e-4f);
    for (OaI32 step = 0; step < 1000; ++step) {
        auto loss = OaCrossEntropyLoss(
            model.Forward(input), targets);
        loss.Backward();
        opt.Step();
        opt.ZeroGrad();
    }
    model.Save("model.oam");
}
Production

GraphBuilder + Compile

Declare the graph. Compile once. Replay thousands of times with zero CPU overhead. Memory aliasing and barrier optimization included.

graph

struct TinyLlm {
    OaI32 D = 64, DFF = 256, NL = 2;
    void Build(OaGraphBuilder& g) {
        auto x = g.Input("indices", {256},
                         OaScalarType::UInt32);
        x = g.Op("byte_embed",
                 {x, g.Weight("embed", {256, D})});
        for (OaI32 i = 0; i < NL; ++i) {
            auto res = x;
            auto s = std::to_string(i);
            x = g.RmsNorm(x, g.Weight("norm_" + s));
            x = g.Silu(g.Matmul(x,
                g.Weight("up_" + s, {D, DFF})));
            x = g.Add(g.Matmul(x,
                g.Weight("down_" + s, {DFF, D})), res);
        }
        auto logits = g.Matmul(x,
            g.Weight("head", {D, 256}));
        g.SetLoss(g.CrossEntropy(
            logits, g.Input("targets")));
    }
};

int main() {
    auto rt = OaEngine::Create({.AppName = "Train"}).Unwrap();
    TinyLlm model;
    OaGraphBuilder builder;
    model.Build(builder);
    auto compiled = builder.Compile(rt);
    for (OaI32 step = 0; step < 1000; ++step) {
        compiled.Upload("indices", data);
        compiled.ReplayForward(rt);
        compiled.ReplayBackward(rt);
        compiled.ReplayStep(rt, step, 3e-4f);
    }
    compiled.Save("model.oam");
}