
High Performance Compute
The compute layer.
Every vendor, device, OS.
Low-level C++ library for GPU-accelerated compute — memory, cryptography, and machine learning in a single substrate. Runs on NVIDIA, AMD, Intel, Qualcomm, and mobile. One codebase. One binary.
5.1x
GPU Speedup
5+
GPU Vendors
77K tok/s
ML Training
Single
Binary
Architecture
Four pillars of the compute substrate.

Memory
Inline assembly optimizations. Zero-copy streaming buffers. No framework overhead, no dispatch layer — your data hits the hardware directly.

Compute Kernels
Unified compute kernel language that compiles once and runs everywhere. NVIDIA, AMD, Intel, Qualcomm — desktop, server, mobile. Write once, deploy anywhere.

Cryptography
Post-quantum signatures accelerated on GPU. 1.26M Dilithium-3 verifications per second. Batch verification at hardware speed.

Integration
Machine learning, post-quantum cryptography, and general compute share the same substrate. One library. One binary. Everything talks to everything.
Capabilities
Everything between your code and the hardware.
Cross-Vendor Compute Kernels
The same kernel binary executes on every GPU vendor. NVIDIA, AMD, Intel, Qualcomm, Apple — discrete, integrated, mobile. No vendor SDK. No porting. No green kool-aid required.
Infinite Dataset Streaming
Zero-copy memory pipeline streams data of any size directly to the accelerator without staging in system memory. 1 GB or 1 TB — ready in microseconds. The GPU never waits for data.
Embedded Compute Kernel Pipeline
Kernels compile at build time and embed directly into the binary. No runtime file I/O. No compilation stalls. No external shader files. Ship one executable, run it anywhere.
Post-Quantum Cryptography
Dilithium-3 (ML-DSA-65) signatures run on GPU at hardware speed. Batch verification at 1.26M ops/sec. Quantum-resistant from day one — not bolted on later.
Topology-Aware Threading
Work-stealing thread pool with P-core and E-core detection, NUMA affinity, and dedicated thread roles. Lock-free channels and reader-writer locks for low-latency coordination.
Zero Dependencies
No Python runtime. No vendor-specific SDK. No framework. The entire stack compiles to a single static library with deterministic, reproducible builds.
Built on HPC
The foundation for everything we ship.
Compatibility
One binary. Every accelerator.
No vendor-specific SDK. No proprietary toolchain. No recompilation when the hardware changes. The compute kernel language compiles to a portable binary that every major GPU driver understands. Drop the executable. Run it.
Discrete & data center
Discrete & integrated
Integrated, Arc & Xe
Mobile & edge
The 10-year-old ultrabook
A ThinkPad X1 Carbon from 2017. Intel HD 620. No discrete GPU. No CUDA cores. No tensor cores. Runs the full HPC stack — compute kernels, cryptography, ML inference. 2-7x faster than CPU-only.
The multi-GPU laptop
Integrated GPU for light workloads. Discrete GPU for heavy compute. Both at the same time? Yes. Intel iGPU + NVIDIA dGPU dispatching in parallel. AMD integrated + NVIDIA discrete? Same binary.
The repair shop scenario
Your NVIDIA workstation goes in for service. Nobody knows when it comes back. You grab whatever machine is available — AMD, Intel, a MacBook. Build environment stays identical. Same code. Same tests. Same results.
The scale axis
From an integrated GPU in a thin laptop, through a discrete consumer card, to multi-GPU server racks in a data center. The binary doesn't change. The kernel language doesn't change. Only the hardware does.
Performance
Measured. Not promised.
Test Device
Intel HD Graphics 620
KBL GT2 — 24 Execution Units · ThinkPad X1 Carbon 5th Gen (2017)
Integrated GPU
No discrete GPU. No CUDA cores. No tensor cores. Shared system memory.
11/16
GPU Wins
benchmarks
5.1x
Best Speedup
SiLU N=1M
4.56x
Graph Replay
vs re-record
13.8K/s
PQC Verify
Dilithium-3
Compute Kernels
CPU vs GPU — same binary, same machine
| Kernel | Size | CPU | GPU | Speedup |
|---|---|---|---|---|
| SiLU | N=4K | 35 µs | 162 µs | 4.6x CPU |
| SiLU | N=64K | 358 µs | 251 µs | 1.4x GPU |
| SiLU | N=1M | 5.0 ms | 968 µs | 5.1x GPU |
| Matmul | 128x128 | 2.4 ms | 1.4 ms | 1.7x GPU |
| Matmul | 256x256 | 17.6 ms | 7.1 ms | 2.5x GPU |
| RMSNorm | 4096x256 | 5.5 ms | 1.9 ms | 2.9x GPU |
| CrossEntropy | B=4K | 10.9 ms | 4.3 ms | 2.6x GPU |
ML Pipeline
Matmul → SiLU — simulated forward pass
| Size | CPU | GPU | Speedup |
|---|---|---|---|
| 32x128 | 828 µs | 512 µs | 1.6x GPU |
| 128x256 | 9.6 ms | 5.6 ms | 1.7x GPU |
| 512x256 | 37.9 ms | 15.8 ms | 2.4x GPU |
Post-Quantum Cryptography
Dilithium-3 (ML-DSA-65) post-quantum signatures
Dilithium KeyGen
18.1K ops/s
Dilithium Sign
6.3K ops/s
58 B
Dilithium Verify
13.8K ops/s
58 B
Batch Verify
18.0K ops/s
64 sigs
This is an integrated GPU.
No discrete card. No CUDA cores. No tensor cores. No RT cores. Just 24 execution units on shared system memory in a 2017 ultrabook. The full HPC stack — compute kernels, ML pipeline, post-quantum cryptography — running 2-5x faster than CPU. Imagine what a discrete GPU does with this.
Foundation
The Oa Library
Black magic mixed with alien technology. A hand-crafted C++ compute library — GPU acceleration, machine learning, post-quantum cryptography, and networking in a single static binary. No frameworks. No dependencies. No compromises.
Inspired by real-time engine architecture. Every function optimized by hand. Every abstraction earned. Written like calligraphy — with precision, at 3 AM, because the code has to be right.
The foundation for everything. Machine learning trains on it. The blockchain settles on it. The exchange matches on it. One library. One philosophy. Ship the binary.
C++20
Standard
Static
Linking
0
Runtime Deps


Compute Engine
OaEngine
The all-seeing eye. One object owns the entire compute context — device, memory allocator, kernel registry, and stream pool. Create it once. Everything flows through it. Every tensor, every module, every dispatch — orchestrated from a single point of consciousness.
Real-time compute for artificial consciousness. Multithreaded kernel dispatch with work-stealing. Persistent compute streams with batched recording — dozens of operations submitted in a single call. Pipeline compilation at build time. First dispatch is as fast as the millionth.
The engine awakens with three lines of code. From that moment, the hardware obeys. No ceremony. No configuration. No delay between thought and execution.
Batched
Dispatch Model
Pooled
Streams
Build-time
Compilation
Global
Context
What it looks like
Three lines to GPU compute. No boilerplate. No ceremony.
engine
auto rt = OaEngine::Create({.AppName = "MyApp"}); // GPU dispatch ready.
Batched Command Recording
Record dozens of kernel dispatches into a single command buffer. Submit once, wait once. Eliminates per-dispatch overhead for complex pipelines.
compute
auto x = OaTensor::Rand({512, 256});
auto w = OaTensor::Rand({256, 128});
auto y = x.Matmul(w).Silu(); // Dispatched to GPU.
Persistent Compute Streams
Streams own their command pool and synchronization. Acquired from a pool, used, returned. No allocation in the hot path.
cryptography
auto sig = OaSign(msg, len, key);
auto ok = OaVerifyBatch(keys, sigs, msgs, 64);
// 13.8K verifications/sec
Build-Time Kernel Compilation
Kernels compile at build time and embed into the binary as read-only data. Zero file I/O at runtime. Zero compilation stalls.
threading
auto pool = OaThreadPool::Create();
auto task = pool.Submit(Compute);
auto val = task->Wait();
channel.Send(val);
Automatic Device Selection
The engine detects all available accelerators, selects the best one, and configures memory. Discrete preferred, integrated as fallback.
Compute Scheduler
OaComputeGraph
Record once. Replay thousands of times. The compute graph captures an entire pipeline of GPU operations — dependencies, memory barriers, dispatch order — and compiles it into a replayable unit. Every subsequent execution is near-zero CPU cost. The hardware replays the exact sequence without the CPU touching a single command.
Automatic dependency analysis tracks every buffer read and write across the graph. Barriers are inserted only where true data hazards exist — eliminating 60-70% of synchronization overhead. Operations that don't overlap in time share the same memory, cutting VRAM usage by up to 92%.
For ML training where the same computation repeats every step, this means compile once at initialization and replay for the entire training run. Zero per-step overhead. Zero re-recording. The graph just runs.
4.56x
Replay Speedup
92%
Memory Savings
17K
tok/s Training
0
CPU Cost / Step

Two Paths
Pick your level of control.
The same GPU. The same shaders. The same checkpoint format. Start with the easy path. Drop to the compiled path when you need zero-overhead replay.
nn.Module + Autograd
PyTorch-style. Define layers, call Forward, call Backward. Automatic gradient computation. Best for research and iteration.
autograd
class TinyLlm : public OaModule {
public:
  TinyLlm(OaI32 D, OaI32 DFF, OaI32 NL) {
    Embed_ = OaMakeShared<OaEmbedding>(256, D);
    for (OaI32 i = 0; i < NL; ++i) {
      auto block = OaMakeShared<OaSequential>();
      block->Add(OaMakeShared<OaRMSNorm>(D));
      block->Add(OaMakeShared<OaLinear>(D, DFF));
      block->Add(OaMakeShared<OaSiLU>());
      block->Add(OaMakeShared<OaLinear>(DFF, D));
      Layers_.push_back(block);
    }
    Head_ = OaMakeShared<OaLinear>(D, 256);
  }

  OaTensor Forward(const OaTensor& x) override {
    auto h = Embed_->Forward(x);
    for (auto& layer : Layers_) {
      auto res = h;
      h = layer->Forward(h).Add(res);  // residual connection
    }
    return Head_->Forward(h);
  }

private:
  OaSharedPtr<OaEmbedding> Embed_;
  std::vector<OaSharedPtr<OaSequential>> Layers_;
  OaSharedPtr<OaLinear> Head_;
};

int main() {
  auto rt = OaEngine::Create({.AppName = "Train"}).Unwrap();
  TinyLlm model(64, 256, 2);
  OaAdamW opt(model.Parameters(), 3e-4f);
  for (OaI32 step = 0; step < 1000; ++step) {
    // input/targets: the training batch tensors, prepared elsewhere.
    auto loss = OaCrossEntropyLoss(model.Forward(input), targets);
    loss.Backward();
    opt.Step();
    opt.ZeroGrad();
  }
  model.Save("model.oam");
}
GraphBuilder + Compile
Declare the graph. Compile once. Replay thousands of times with zero CPU overhead. Memory aliasing and barrier optimization included.
graph
struct TinyLlm {
  OaI32 D = 64, DFF = 256, NL = 2;

  void Build(OaGraphBuilder& g) {
    auto x = g.Input("indices", {256}, OaScalarType::UInt32);
    x = g.Op("byte_embed", {x, g.Weight("embed", {256, D})});
    for (OaI32 i = 0; i < NL; ++i) {
      auto res = x;
      auto s = std::to_string(i);
      x = g.RmsNorm(x, g.Weight("norm_" + s));
      x = g.Silu(g.Matmul(x, g.Weight("up_" + s, {D, DFF})));
      x = g.Add(g.Matmul(x, g.Weight("down_" + s, {DFF, D})), res);
    }
    auto logits = g.Matmul(x, g.Weight("head", {D, 256}));
    g.SetLoss(g.CrossEntropy(logits, g.Input("targets")));
  }
};

int main() {
  auto rt = OaEngine::Create({.AppName = "Train"}).Unwrap();
  TinyLlm model;
  OaGraphBuilder builder;
  model.Build(builder);
  auto compiled = builder.Compile(rt);
  for (OaI32 step = 0; step < 1000; ++step) {
    // data: the next batch of token indices, prepared elsewhere.
    compiled.Upload("indices", data);
    compiled.ReplayForward(rt);
    compiled.ReplayBackward(rt);
    compiled.ReplayStep(rt, step, 3e-4f);
  }
  compiled.Save("model.oam");
}

