Realm

The compute layer.
Every vendor, device, and OS.

Four pillars of the compute substrate.

01

Memory

Inline assembly optimizations. Zero-copy streaming buffers. No framework overhead, no dispatch layer — your data hits the hardware directly.

02

Compute Kernels

Unified compute kernel language that compiles once and runs everywhere. NVIDIA, AMD, Intel, Qualcomm — desktop, server, mobile. Write once, deploy anywhere.

03

Cryptography

Post-quantum signatures on GPU. On Intel HD 620: ~13.8K verifications/sec, ~18K ops/sec batch (64 sigs). Same stack scales with faster silicon.

04

Integration

Artificial intelligence, post-quantum cryptography, and general compute share the same substrate. One library. One binary. Everything talks to everything.

Everything between your code and the hardware.

01

Cross-Vendor Compute Kernels

The same kernel binary executes on every GPU vendor. NVIDIA, AMD, Intel, Qualcomm, Apple — discrete, integrated, mobile. No vendor SDK. No porting. No green kool-aid required.

02

Infinite Dataset Streaming

Zero-copy memory pipeline streams data of any size directly to the accelerator without staging in system memory. 1 GB or 1 TB — ready in microseconds. The GPU never waits for data.

03

Embedded Compute Kernel Pipeline

Kernels compile at build time and embed directly into the binary. No runtime file I/O. No compilation stalls. No external shader files. Ship one executable, run it anywhere.

04

Post-Quantum Cryptography

Dilithium-3 (ML-DSA-65) on Vulkan compute. Measured on HD 620: ~13.8K verify/s, batch ~18K ops/s (64 sigs). Quantum-resistant from day one — not bolted on later.

05

Topology-Aware Threading

Work-stealing thread pool with P-core and E-core detection, NUMA affinity, and dedicated thread roles. Lock-free channels and reader-writer locks for low-latency coordination.
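
The work-stealing discipline mentioned above can be sketched with a per-worker queue: the owning thread pushes and pops at the back, while idle threads steal from the front. This is a minimal illustration, not the library's implementation. `WorkQueue`, `PopLocal`, and `Steal` are hypothetical names, and a mutex stands in for whatever lock-free structure a production pool would use.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical per-worker task queue (illustration only).
// A mutex keeps the sketch simple; it is not lock-free.
template <typename Task>
class WorkQueue {
public:
    // Owner thread adds new work at the back.
    void Push(Task task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }

    // Owner thread takes the most recently pushed task (LIFO, cache-warm).
    std::optional<Task> PopLocal() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task task = std::move(tasks_.back());
        tasks_.pop_back();
        return task;
    }

    // Idle threads take the oldest task from the opposite end (FIFO).
    std::optional<Task> Steal() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return std::nullopt;
        Task task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

private:
    std::mutex mutex_;
    std::deque<Task> tasks_;
};
```

Taking from opposite ends keeps the owner and the thieves out of each other's way for most operations, which is the usual motivation for this layout.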

06

Zero Dependencies

No Python runtime. No vendor-specific SDK. No framework. The entire stack compiles to a single static library with deterministic, reproducible builds.

One binary. Every accelerator.

No vendor-specific SDK. No proprietary toolchain. No recompilation when the hardware changes. The compute kernel language compiles to a portable binary that every major GPU driver understands. Drop the executable. Run it.

AMD

Discrete & Integrated

Intel

Integrated, Arc & Xe

NVIDIA

Discrete & data center

Qualcomm

Mobile & edge

The 10-year-old ultrabook

A ThinkPad X1 Carbon from 2017. Intel HD 620. No discrete GPU. No CUDA cores. No tensor cores. Runs the full HPC stack — compute kernels, cryptography, ML inference. 2-7x faster than CPU-only.

The multi-GPU laptop

Integrated GPU for light workloads. Discrete GPU for heavy compute. Both at the same time? Yes. Intel iGPU + NVIDIA dGPU dispatching in parallel. AMD integrated + NVIDIA discrete? Same binary.

The repair shop scenario

Your NVIDIA workstation goes in for service. Nobody knows when it comes back. You grab whatever machine is available — AMD, Intel, a MacBook. Build environment stays identical. Same code. Same tests. Same results.

The scale axis

From an integrated GPU in a thin laptop, through a discrete consumer card, to multi-GPU server racks in a data center. The binary doesn't change. The kernel language doesn't change. Only the hardware does.

Measured. Not promised.

Same portable binary, Vulkan compute on the device under test. Integrated Intel, discrete NVIDIA or AMD, or multi-GPU. No second runtime.

6/7 GPU Wins (reference matrix rows)
5.1x Best Speedup (SiLU, N=1M)
4.56x Graph Replay (vs re-record)
13.8K/s PQC Verify (Dilithium-3)

Full benchmark tables — OA vs PyTorch vs CUDA

The Oa Library

Black magic mixed with alien technology. A hand-crafted C++ compute library — GPU acceleration, artificial intelligence, post-quantum cryptography, and networking in a single static binary. No frameworks. No dependencies. No compromises.

Inspired by real-time engine architecture. Every function optimized by hand. Every abstraction earned. Written like calligraphy — with precision, at 3 AM, because the code has to be right.

The foundation for everything. AI trains on it. The blockchain settles on it. The exchange matches on it. One library. One philosophy. Ship the binary.

C++20 Standard
Static Linking
0 Runtime Deps


OaEngine

The all-seeing eye. One object owns the entire compute context — device, memory allocator, kernel registry, and stream pool. Create it once. Everything flows through it. Every tensor, every module, every dispatch — orchestrated from a single point of consciousness.

Real-time compute for artificial consciousness. Multithreaded kernel dispatch with work-stealing. Persistent compute streams with batched recording — dozens of operations submitted in a single call. Pipeline compilation at build time. First dispatch is as fast as the millionth.

The engine awakens with three lines of code. From that moment, the hardware obeys. No ceremony. No configuration. No delay between thought and execution.

Batched Dispatch Model
Pooled Streams
Build-time Compilation
Global Context

What it looks like

Three lines to GPU compute. No boilerplate. No ceremony.

Engine

auto rt = OaEngine::Create({
    .AppName = "MyApp"
});
// GPU dispatch ready.
01

Batched Command Recording

Record dozens of kernel dispatches into a single command buffer. Submit once, wait once. Eliminates per-dispatch overhead for complex pipelines.
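
The record-then-submit pattern can be sketched without any GPU at all. In this minimal illustration, `BatchRecorder` is a hypothetical stand-in for a command buffer: recording only stores work, and a single `Submit` runs everything. None of these names come from the Oa library.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical recorder standing in for a GPU command buffer.
class BatchRecorder {
public:
    // Store a dispatch without executing it.
    void Record(std::function<void()> dispatch) {
        commands_.push_back(std::move(dispatch));
    }

    // Run every recorded dispatch as one submission, then clear the batch.
    // Returns the number of dispatches that ran.
    std::size_t Submit() {
        const std::size_t count = commands_.size();
        for (auto& command : commands_) command();
        commands_.clear();
        ++submitCount_;
        return count;
    }

    // Number of submissions so far (one per batch, not one per dispatch).
    std::size_t SubmitCount() const { return submitCount_; }

private:
    std::vector<std::function<void()>> commands_;
    std::size_t submitCount_ = 0;
};
```

The fixed cost of a submission is paid once per batch, so it is amortized over however many dispatches were recorded.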

Compute

auto x = OaTensor::Rand({512, 256});
auto w = OaTensor::Rand({256, 128});
auto y = x.Matmul(w).Silu();
// Dispatched to GPU.
02

Persistent Compute Streams

Streams own their command pool and synchronization. Acquired from a pool, used, returned. No allocation in the hot path.
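
One way to keep allocation out of the hot path is to create every stream up front and hand out pointers from a free list. The sketch below assumes a fixed pool size. `Stream` and `StreamPool` are hypothetical placeholders, not the library's types.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Placeholder for a stream that would own a command pool and a fence.
struct Stream {
    int id;
};

// Hypothetical fixed-size pool: all storage is reserved in the constructor.
class StreamPool {
public:
    explicit StreamPool(std::size_t count) {
        streams_.reserve(count);
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            streams_.push_back(Stream{static_cast<int>(i)});
        }
        // streams_ is fully built, so these pointers stay valid.
        for (auto& stream : streams_) free_.push_back(&stream);
    }

    // Returns nullptr when every stream is in use; never allocates.
    Stream* Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        Stream* stream = free_.back();
        free_.pop_back();
        return stream;
    }

    // Hands a stream back; capacity was reserved, so no reallocation occurs.
    void Release(Stream* stream) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(stream);
    }

private:
    std::mutex mutex_;
    std::vector<Stream> streams_;
    std::vector<Stream*> free_;
};
```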

Cryptography

auto sig = OaSign(msg, len, key);
auto ok = OaVerifyBatch(
    keys, sigs, msgs, 64);
// ~13.8K verifications/sec on Intel HD 620
03

Build-Time Kernel Compilation

Kernels compile at build time and embed into the binary as read-only data. Zero file I/O at runtime. Zero compilation stalls.
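
Embedding a kernel as read-only data can be as simple as a `constexpr` array that the build system generates from compiler output. The sketch below assumes SPIR-V as the portable binary format, which the page implies by naming Vulkan compute but does not state outright. The array holds only a five-word SPIR-V header as a placeholder, and `EmbeddedKernel`, `kKernels`, and `FindKernel` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Placeholder kernel: only the SPIR-V module header, not a complete module.
// A real build step would generate this array from the kernel compiler.
inline constexpr std::array<std::uint32_t, 5> kSiluKernelWords = {
    0x07230203u,  // SPIR-V magic number
    0x00010000u,  // version 1.0
    0u,           // generator id (placeholder)
    1u,           // id bound (placeholder)
    0u,           // reserved schema word
};

// Hypothetical registry entry pointing at embedded read-only data.
struct EmbeddedKernel {
    std::string_view name;
    const std::uint32_t* words;
    std::size_t wordCount;
};

inline constexpr EmbeddedKernel kKernels[] = {
    {"silu", kSiluKernelWords.data(), kSiluKernelWords.size()},
};

// Lookup by name; no file I/O and no runtime compilation involved.
constexpr const EmbeddedKernel* FindKernel(std::string_view name) {
    for (const auto& kernel : kKernels) {
        if (kernel.name == name) return &kernel;
    }
    return nullptr;
}

// The header check runs at compile time, so a corrupt blob fails the build.
static_assert(kSiluKernelWords[0] == 0x07230203u);
```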

Threading

auto pool = OaThreadPool::Create();
auto task = pool.Submit(Compute);
auto val = task->Wait();
channel.Send(val);
04

Automatic Device Selection

The engine detects all available accelerators, selects the best one, and configures memory. Discrete preferred, integrated as fallback.
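
The "discrete preferred, integrated as fallback" rule can be written as a small ranking function. This is a sketch under stated assumptions: `DeviceInfo`, `DeviceKind`, and `SelectBestDevice` are hypothetical, and breaking ties by local memory size is a guess rather than documented behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical device description (not the library's actual types).
enum class DeviceKind { Discrete, Integrated, Cpu };

struct DeviceInfo {
    std::string name;
    DeviceKind kind;
    std::uint64_t localMemoryBytes;
};

// Discrete ranks above integrated, which ranks above CPU fallback.
inline int KindRank(DeviceKind kind) {
    switch (kind) {
        case DeviceKind::Discrete:   return 2;
        case DeviceKind::Integrated: return 1;
        case DeviceKind::Cpu:        return 0;
    }
    return 0;
}

// Picks the highest-ranked device; ties go to the one with more local memory
// (an assumed tie-break). Returns nullopt when the list is empty.
inline std::optional<DeviceInfo> SelectBestDevice(
    const std::vector<DeviceInfo>& devices) {
    std::optional<DeviceInfo> best;
    for (const auto& device : devices) {
        const bool better =
            !best ||
            KindRank(device.kind) > KindRank(best->kind) ||
            (KindRank(device.kind) == KindRank(best->kind) &&
             device.localMemoryBytes > best->localMemoryBytes);
        if (better) best = device;
    }
    return best;
}
```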

OaComputeGraph

Record once. Replay thousands of times. The compute graph captures an entire pipeline of GPU operations — dependencies, memory barriers, dispatch order — and compiles it into a replayable unit. Every subsequent execution is near-zero CPU cost. The hardware replays the exact sequence without the CPU touching a single command.

Automatic dependency analysis tracks every buffer read and write across the graph. Barriers are inserted only where true data hazards exist — eliminating 60-70% of synchronization overhead. Operations that don't overlap in time share the same memory, cutting VRAM usage by up to 92%.
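
Hazard-driven barrier placement can be illustrated with read and write sets per operation. The sketch below inserts a barrier only when an operation touches a buffer with an unsynchronized earlier access (read-after-write, write-after-write, or write-after-read). It is a simplification: every barrier is treated as global, and `Op` and `PlanBarriers` are hypothetical names, not the library's API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical operation: buffer ids it reads and buffer ids it writes.
struct Op {
    std::vector<int> reads;
    std::vector<int> writes;
};

// Returns the indices of operations that need a barrier directly before them.
inline std::vector<std::size_t> PlanBarriers(const std::vector<Op>& ops) {
    std::set<int> pendingWrites;  // written since the last barrier
    std::set<int> pendingReads;   // read since the last barrier
    std::vector<std::size_t> barriers;

    for (std::size_t i = 0; i < ops.size(); ++i) {
        bool hazard = false;
        for (int buffer : ops[i].reads) {
            if (pendingWrites.count(buffer) > 0) hazard = true;  // RAW
        }
        for (int buffer : ops[i].writes) {
            if (pendingWrites.count(buffer) > 0) hazard = true;  // WAW
            if (pendingReads.count(buffer) > 0) hazard = true;   // WAR
        }
        if (hazard) {
            // Simplification: one global barrier clears all pending accesses.
            barriers.push_back(i);
            pendingWrites.clear();
            pendingReads.clear();
        }
        pendingReads.insert(ops[i].reads.begin(), ops[i].reads.end());
        pendingWrites.insert(ops[i].writes.begin(), ops[i].writes.end());
    }
    return barriers;
}
```

For five operations where the first two write independent buffers and the last one reads only already-synchronized data, this plan needs two barriers instead of the four a barrier-between-every-dispatch approach would use.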

For ML training, where the same computation repeats every step, this means compiling once at initialization and replaying for the entire training run. No re-recording. Near-zero per-step CPU overhead. The graph just runs.
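
The compile-once, replay-per-step shape of a training loop can be sketched on the CPU. `ReplayGraph` below is a hypothetical stand-in: it freezes its operation list on `Compile` and rejects further recording, so each step only replays. It does not model GPU command buffers or the library's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical replayable graph (illustration only).
class ReplayGraph {
public:
    // Recording is only allowed before Compile().
    void Record(std::function<void()> op) {
        if (compiled_) throw std::logic_error("graph is already compiled");
        ops_.push_back(std::move(op));
    }

    // Freeze the operation list; a real graph would build barriers here.
    void Compile() { compiled_ = true; }

    // Run the frozen sequence in order without re-recording anything.
    void Replay() const {
        if (!compiled_) throw std::logic_error("compile before replay");
        for (const auto& op : ops_) op();
    }

    std::size_t OpCount() const { return ops_.size(); }

private:
    std::vector<std::function<void()>> ops_;
    bool compiled_ = false;
};
```

A training loop would record the forward and backward passes once, call `Compile()`, then call `Replay()` every step.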

4.56x Replay Speedup
92% Memory Savings
17K tok/s Training
0 CPU Cost / Step
