
High Performance Compute
The compute layer.
Every vendor, device, OS.
Low-level C++ library for GPU-accelerated compute — memory, cryptography, and machine learning in a single substrate. Runs on NVIDIA, AMD, Intel, Qualcomm, and mobile. One codebase. One binary.
5.1x
GPU Speedup
5+
GPU Vendors
77K tok/s
ML Training
Single
Binary
Architecture
Four pillars of the compute substrate.

Memory
Inline assembly optimizations. Zero-copy streaming buffers. No framework overhead, no dispatch layer — your data hits the hardware directly.

Compute Kernels
Unified compute kernel language that compiles once and runs everywhere. NVIDIA, AMD, Intel, Qualcomm — desktop, server, mobile. Write once, deploy anywhere.

Cryptography
Post-quantum signatures accelerated on GPU. 1.26M Dilithium-3 verifications per second. Batch verification at hardware speed.

Integration
Machine learning, post-quantum cryptography, and general compute share the same substrate. One library. One binary. Everything talks to everything.
Capabilities
Everything between your code and the hardware.
Cross-Vendor Compute Kernels
The same kernel binary executes on every GPU vendor. NVIDIA, AMD, Intel, Qualcomm, Apple — discrete, integrated, mobile. No vendor SDK. No porting. No green kool-aid required.
Infinite Dataset Streaming
Zero-copy memory pipeline streams data of any size directly to the accelerator without staging in system memory. 1 GB or 1 TB — ready in microseconds. The GPU never waits for data.
Embedded Compute Kernel Pipeline
Kernels compile at build time and embed directly into the binary. No runtime file I/O. No compilation stalls. No external shader files. Ship one executable, run it anywhere.
Post-Quantum Cryptography
Dilithium-3 (ML-DSA-65) signatures run on GPU at hardware speed. Batch verification at 1.26M ops/sec. Quantum-resistant from day one — not bolted on later.
Topology-Aware Threading
Work-stealing thread pool with P-core and E-core detection, NUMA affinity, and dedicated thread roles. Lock-free channels and reader-writer locks for low-latency coordination.
Zero Dependencies
No Python runtime. No vendor-specific SDK. No framework. The entire stack compiles to a single static library with deterministic, reproducible builds.
Built on HPC
The foundation for everything we ship.
Compatibility
One binary. Every accelerator.
No vendor-specific SDK. No proprietary toolchain. No recompilation when the hardware changes. The compute kernel language compiles to a portable binary that every major GPU driver understands. Drop the executable. Run it.
Discrete & data center
Discrete & integrated
Integrated, Arc & Xe
Mobile & edge
The 10-year-old ultrabook
A ThinkPad X1 Carbon from 2017. Intel HD 620. No discrete GPU. No CUDA cores. No tensor cores. Runs the full HPC stack — compute kernels, cryptography, ML inference. 2-7x faster than CPU-only.
The multi-GPU laptop
Integrated GPU for light workloads. Discrete GPU for heavy compute. Both at the same time? Yes. Intel iGPU + NVIDIA dGPU dispatching in parallel. AMD integrated + NVIDIA discrete? Same binary.
The repair shop scenario
Your NVIDIA workstation goes in for service. Nobody knows when it comes back. You grab whatever machine is available — AMD, Intel, a MacBook. Build environment stays identical. Same code. Same tests. Same results.
The scale axis
From an integrated GPU in a thin laptop, through a discrete consumer card, to multi-GPU server racks in a data center. The binary doesn't change. The kernel language doesn't change. Only the hardware does.
Performance
Measured. Not promised.
Test Device
Intel HD Graphics 620
KBL GT2 — 24 Execution Units · ThinkPad X1 Carbon 5th Gen (2017)
Integrated GPU
No discrete GPU. No CUDA cores. No tensor cores. Shared system memory.
11/16
GPU Wins
benchmarks
5.1x
Best Speedup
SiLU N=1M
4.56x
Graph Replay
vs re-record
13.8K/s
PQC Verify
Dilithium-3
Compute Kernels
CPU vs GPU — same binary, same machine
| Kernel | Size | CPU | GPU | Speedup |
|---|---|---|---|---|
| SiLU | N=4K | 35 µs | 162 µs | 4.6x CPU |
| SiLU | N=64K | 358 µs | 251 µs | 1.4x GPU |
| SiLU | N=1M | 5.0 ms | 968 µs | 5.1x GPU |
| Matmul | 128x128 | 2.4 ms | 1.4 ms | 1.7x GPU |
| Matmul | 256x256 | 17.6 ms | 7.1 ms | 2.5x GPU |
| RMSNorm | 4096x256 | 5.5 ms | 1.9 ms | 2.9x GPU |
| CrossEntropy | B=4K | 10.9 ms | 4.3 ms | 2.6x GPU |
ML Pipeline
Matmul → SiLU — simulated forward pass
| Size | CPU | GPU | Speedup |
|---|---|---|---|
| 32x128 | 828 µs | 512 µs | 1.6x GPU |
| 128x256 | 9.6 ms | 5.6 ms | 1.7x GPU |
| 512x256 | 37.9 ms | 15.8 ms | 2.4x GPU |
Post-Quantum Cryptography
Dilithium-3 (ML-DSA-65) post-quantum signatures
Dilithium KeyGen
18.1K ops/s
Dilithium Sign
6.3K ops/s
58 B
Dilithium Verify
13.8K ops/s
58 B
Batch Verify
18.0K ops/s
64 sigs
This is an integrated GPU.
No discrete card. No CUDA cores. No tensor cores. No RT cores. Just 24 execution units on shared system memory in a 2017 ultrabook. The full HPC stack — compute kernels, ML pipeline, post-quantum cryptography — running 2-5x faster than CPU. Imagine what a discrete GPU does with this.
Foundation
The Oa Library
Black magic mixed with alien technology. A hand-crafted C++ compute library — GPU acceleration, machine learning, post-quantum cryptography, and networking in a single static binary. No frameworks. No dependencies. No compromises.
Inspired by real-time engine architecture. Every function optimized by hand. Every abstraction earned. Written like calligraphy — with precision, at 3 AM, because the code has to be right.
The foundation for everything. Machine learning trains on it. The blockchain settles on it. The exchange matches on it. One library. One philosophy. Ship the binary.
C++20
Standard
Static
Linking
0
Runtime Deps


Compute Engine
OaEngine
The all-seeing eye. One object owns the entire compute context — device, memory allocator, kernel registry, and stream pool. Create it once. Everything flows through it. Every tensor, every module, every dispatch — orchestrated from a single point of consciousness.
Real-time compute for artificial consciousness. Multithreaded kernel dispatch with work-stealing. Persistent compute streams with batched recording — dozens of operations submitted in a single call. Pipeline compilation at build time. First dispatch is as fast as the millionth.
The engine awakens with three lines of code. From that moment, the hardware obeys. No ceremony. No configuration. No delay between thought and execution.
Batched
Dispatch Model
Pooled
Streams
Build-time
Compilation
Global
Context
What it looks like
Three lines to GPU compute. No boilerplate. No ceremony.
engine
auto rt = OaEngine::Create({.AppName = "MyApp"}); // GPU dispatch ready.
Batched Command Recording
Record dozens of kernel dispatches into a single command buffer. Submit once, wait once. Eliminates per-dispatch overhead for complex pipelines.
compute
auto x = OaTensor::Rand({512, 256});
auto w = OaTensor::Rand({256, 128});
auto y = x.Matmul(w).Silu(); // Dispatched to GPU.
Persistent Compute Streams
Streams own their command pool and synchronization. Acquired from a pool, used, returned. No allocation in the hot path.
cryptography
auto sig = OaSign(msg, len, key);
auto ok = OaVerifyBatch(keys, sigs, msgs, 64);
// 13.8K verifications/sec
Build-Time Kernel Compilation
Kernels compile at build time and embed into the binary as read-only data. Zero file I/O at runtime. Zero compilation stalls.
threading
auto pool = OaThreadPool::Create();
auto task = pool.Submit(Compute);
auto val = task->Wait();
channel.Send(val);
Automatic Device Selection
The engine detects all available accelerators, selects the best one, and configures memory. Discrete preferred, integrated as fallback.
Compute Scheduler
OaComputeGraph
Record once. Replay thousands of times. The compute graph captures an entire pipeline of GPU operations — dependencies, memory barriers, dispatch order — and compiles it into a replayable unit. Every subsequent execution is near-zero CPU cost. The hardware replays the exact sequence without the CPU touching a single command.
Automatic dependency analysis tracks every buffer read and write across the graph. Barriers are inserted only where true data hazards exist — eliminating 60-70% of synchronization overhead. Operations that don't overlap in time share the same memory, cutting VRAM usage by up to 92%.
For ML training where the same computation repeats every step, this means compile once at initialization and replay for the entire training run. Zero per-step overhead. Zero re-recording. The graph just runs.
4.56x
Replay Speedup
92%
Memory Savings
17K
tok/s Training
0
CPU Cost / Step

Two Paths
Pick your level of control.
The same GPU. The same shaders. The same checkpoint format. Start with the easy path. Drop to the compiled path when you need zero-overhead replay.
nn.Module + Autograd
PyTorch-style. Define layers, call Forward, call Backward. Automatic gradient computation. Best for research and iteration.
autograd
class TinyLlm : public OaModule {
public:
  TinyLlm(OaI32 D, OaI32 DFF, OaI32 NL) {
    Embed_ = OaMakeShared<OaEmbedding>(256, D);
    for (OaI32 i = 0; i < NL; ++i) {
      auto block = OaMakeShared<OaSequential>();
      block->Add(OaMakeShared<OaRMSNorm>(D));
      block->Add(OaMakeShared<OaLinear>(D, DFF));
      block->Add(OaMakeShared<OaSiLU>());
      block->Add(OaMakeShared<OaLinear>(DFF, D));
      Layers_.push_back(block);
    }
    Head_ = OaMakeShared<OaLinear>(D, 256);
  }

  OaTensor Forward(const OaTensor& x) override {
    auto h = Embed_->Forward(x);
    for (auto& layer : Layers_) {
      auto res = h;
      h = layer->Forward(h).Add(res);  // residual connection
    }
    return Head_->Forward(h);
  }

private:
  OaSharedPtr<OaEmbedding> Embed_;
  std::vector<OaSharedPtr<OaSequential>> Layers_;
  OaSharedPtr<OaLinear> Head_;
};

int main() {
  auto rt = OaEngine::Create({.AppName = "Train"}).Unwrap();
  TinyLlm model(64, 256, 2);
  OaAdamW opt(model.Parameters(), 3e-4f);
  for (OaI32 step = 0; step < 1000; ++step) {
    // input/targets: the training batch tensors, prepared elsewhere.
    auto loss = OaCrossEntropyLoss(model.Forward(input), targets);
    loss.Backward();
    opt.Step();
    opt.ZeroGrad();
  }
  model.Save("model.oam");
}
GraphBuilder + Compile
Declare the graph. Compile once. Replay thousands of times with zero CPU overhead. Memory aliasing and barrier optimization included.
graph
struct TinyLlm {
  OaI32 D = 64, DFF = 256, NL = 2;

  void Build(OaGraphBuilder& g) {
    auto x = g.Input("indices", {256}, OaScalarType::UInt32);
    x = g.Op("byte_embed", {x, g.Weight("embed", {256, D})});
    for (OaI32 i = 0; i < NL; ++i) {
      auto res = x;
      auto s = std::to_string(i);
      x = g.RmsNorm(x, g.Weight("norm_" + s));
      x = g.Silu(g.Matmul(x, g.Weight("up_" + s, {D, DFF})));
      x = g.Add(g.Matmul(x, g.Weight("down_" + s, {DFF, D})), res);
    }
    auto logits = g.Matmul(x, g.Weight("head", {D, 256}));
    g.SetLoss(g.CrossEntropy(logits, g.Input("targets")));
  }
};

int main() {
  auto rt = OaEngine::Create({.AppName = "Train"}).Unwrap();
  TinyLlm model;
  OaGraphBuilder builder;
  model.Build(builder);
  auto compiled = builder.Compile(rt);
  for (OaI32 step = 0; step < 1000; ++step) {
    // data: the next batch of token indices, prepared elsewhere.
    compiled.Upload("indices", data);
    compiled.ReplayForward(rt);
    compiled.ReplayBackward(rt);
    compiled.ReplayStep(rt, step, 3e-4f);
  }
  compiled.Save("model.oam");
}

