Realm
Technology · March 2026

Breaking Free From
the Green Kool-Aid.

How we built a cross-vendor GPU compute library that runs on every accelerator, from a 2017 ultrabook to data center racks. No vendor lock-in. No proprietary SDK. One binary. Every device.

Breaking free from vendor lock-in

There's a running joke in high-performance computing: you don't choose a GPU vendor — the vendor chooses you. You pick a card, learn its SDK, write thousands of lines of vendor-specific code, and then one day the hardware goes to the repair shop. Nobody knows when it's coming back. Your entire development environment just evaporated. Sound familiar?

The Problem With Monoculture

The dominant GPU compute ecosystem has a name everyone knows and a color everyone recognizes. It's brilliant technology. It's also a proprietary walled garden with its own compiler, its own memory model, its own profiler, and its own way of thinking about parallelism. Write code for it and you're locked in. Your ML training pipeline, your crypto batch verification, your real-time inference server — all of it tied to one vendor's hardware.

We've been there. We drank the Kool-Aid. We wrote fused kernels in the vendor's language. We optimized for their specific memory hierarchy. The benchmarks were great. Then the laptop went in for repairs.

Nobody knew when it was coming back.

The Moment of Clarity

We had a ThinkPad X1 Carbon from 2017. Fifth generation. Intel HD Graphics 620. 24 execution units. Shared system memory. No discrete GPU. No tensor cores. No RT cores. The kind of machine most people would consider a paperweight for anything compute-related.

We thought: what if this ran our entire HPC stack? Not as a fallback. Not as a debugging target. As actual compute hardware. The same binary that runs ML inference on a data center GPU — what if it just... worked on this thing?

It does. SiLU activations run 5.1x faster on the integrated GPU than on the CPU. Matrix multiplications: 2.5x. RMSNorm: 2.9x. Post-quantum signature verification: 13,800 operations per second. On a 2017 ultrabook. On shared memory. Without a single vendor-specific line of code.

How It Works (Without Telling You How)

We built a unified compute kernel language. Write the kernel once. It compiles to a portable binary format that every major GPU driver understands. NVIDIA discrete and data center. AMD discrete and integrated. Intel integrated, Arc, and Xe. Qualcomm mobile and edge. The same binary. The same performance characteristics. No recompilation.

The kernels are embedded directly into the application at build time. No runtime file I/O. No shader compilation stalls. No external dependencies. Ship one executable, drop it on any machine with a GPU, run it. That's it.

We're not going to tell you exactly which open standard makes this possible. What they don't know, they can't copy. But we will tell you it's not proprietary, it's not vendor-specific, and it ships with every major GPU driver released in the last three years.

The Repair Shop Scenario

Your workstation with the beefy GPU goes in for service. Nobody at the repair center knows when it's coming back. “Two weeks, maybe three. Could be a month.” You've heard this before.

In the old world, this is a crisis. Your development environment is tied to that specific hardware. The vendor SDK, the profiler, the compiled kernels — all of it assumes one vendor's GPU. You grab a borrowed machine with a different GPU and... nothing works. Different SDK. Different compiler. Different memory model. You spend two days porting code instead of shipping product.

In our world: same code, same binary, same tests, same results. Intel integrated? Works. AMD discrete? Works. That random laptop someone left in the office? If it has any GPU made in the last decade, it works. The build environment doesn't change. The CI pipeline doesn't change. Nothing changes except which transistors execute the math.

Baby Results, Big Implications

We measured everything on the X1 Carbon. Not because it's impressive hardware — it's 2017 ultrabook silicon. We measured it there because if an integrated GPU with 24 execution units and shared memory can accelerate compute kernels 2-5x over the CPU, imagine what a modern discrete GPU does with the same code.

The answer is: a lot more. We've run the same binary on RTX-class hardware. 77,000 tokens per second ML training throughput. 1.26 million post-quantum signature verifications per second. The same code. The same library. Just faster transistors.

Ever wondered why a game runs fine on an integrated GPU but most ML frameworks demand a specific vendor's high-end hardware? It's not a hardware limitation. It's a software limitation. The frameworks were built for one vendor. We built for all of them.

What We Actually Ship

The Oa library. Hand-crafted C++. Single static binary. Zero runtime dependencies. GPU-native compute, machine learning, post-quantum cryptography, and networking — all in one library that compiles on any platform with a C++20 compiler and a GPU driver.

  • Memory: Inline assembly optimizations. Zero-copy streaming buffers. Your data hits the hardware without a framework in between.
  • Compute kernels: Unified kernel language. Compiles once, runs on every GPU vendor. Embedded into the binary at build time.
  • Cryptography: Post-quantum signatures (Dilithium-3) and hashing (SHAKE-256) accelerated on GPU. Batch verification at hardware speed.
  • Threading: Work-stealing thread pool with topology-aware scheduling. P-core and E-core detection. NUMA affinity. Lock-free channels.

Written at 3 AM

This isn't a product that fell out of a framework generator. Every function was written by hand. Every abstraction was earned through profiling, not assumed from a textbook. The code reads like calligraphy — because it was written like calligraphy. Slowly. Carefully. Usually around 3 AM when the world is quiet and the only thing that matters is whether the instruction pipeline stalls.

Inspired by the philosophy of real-time engine architecture — where every microsecond matters, where memory layout is art, where you don't get to blame the garbage collector because there isn't one.

The Oa library is the foundation. Machine learning trains on it. The blockchain settles on it. The exchange matches on it. One library. One compute engine. Every vendor. Every device.

Stop drinking the Kool-Aid. Ship the binary.
