
How CUDA Won by NOT being a Standard

Michael Søndergaard

Wed May 13 2026

The way we standardize accelerated computing is broken, and it has been broken for long enough that the workaround, single-vendor capture, is now the default. It is worth saying out loud why the consortium model failed, what replaced it inside NVIDIA's walls, and what a cross-vendor version of the replacement actually looks like.

Khronos was right about the problem

For thirty years, cross-vendor portability in graphics and parallel compute was delivered by consortia. A group of vendors would convene, draft a specification, ratify it, and then each member would ship a conforming implementation. OpenGL, Vulkan, OpenCL, SYCL are all variations on the same model. The Khronos Group is the canonical example, and to be clear: the diagnosis is right. If you want code to run across competing hardware, you need a neutral cross-vendor body. That part is not in dispute.

The mechanism, however, is a product of a different era.

The Khronos model works when the spec-ratification clock runs faster than the workload-evolution clock. Graphics APIs are a clean example. Triangles have not changed much in twenty years, so a spec ratified in 2016 still gets you most of what you need in 2026. The standardization clock ran ahead of the application clock, so the model held.

In accelerated computing, the clocks have inverted.

Design by committee on an AI clock

The relevant workload for accelerated computing today is not graphics. It is dense linear algebra wrapped in increasingly exotic scheduling primitives, evolving on a release cycle measured in months. Flash Attention v1 to v3 took roughly two years and introduced fundamentally new programming patterns: warp specialization, asynchronous tile movement, persistent kernels. None of these patterns existed when SYCL 2020 was being drafted. They are unlikely to fit cleanly into SYCL 2030 either, because by then the patterns will have moved again.
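
To make two of those patterns concrete, here is a minimal sketch of a persistent, warp-specialized kernel. It is illustrative only: the kernel name, the tile size, and the placeholder compute stage are inventions for this post, not Flash Attention's actual code or any particular library's API.

```cuda
// Sketch of a persistent kernel with warp specialization. Placeholder code,
// not Flash Attention.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int TILE = 256;

__global__ void persistent_warp_specialized(const float* in, float* out,
                                            int num_tiles) {
    __shared__ float tile[TILE];
    const int warp = threadIdx.x / 32;

    // Persistent kernel: each block stays resident and loops over many tiles,
    // instead of the classic one-block-per-tile launch.
    for (int t = blockIdx.x; t < num_tiles; t += gridDim.x) {
        if (warp == 0) {
            // Warp specialization: warp 0 acts as the producer, staging the
            // tile into shared memory while the remaining warps consume it.
            for (int i = threadIdx.x; i < TILE; i += 32)
                tile[i] = in[t * TILE + i];
        }
        __syncthreads();
        if (warp != 0) {
            // Placeholder "compute" stage: accumulate the tile into out[t].
            for (int i = threadIdx.x - 32; i < TILE; i += blockDim.x - 32)
                atomicAdd(&out[t], tile[i]);
        }
        __syncthreads();
    }
}

int main() {
    const int num_tiles = 1024, n = num_tiles * TILE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, num_tiles * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    for (int i = 0; i < num_tiles; ++i) out[i] = 0.0f;

    // Launch far fewer blocks than tiles; each block persists across tiles.
    persistent_warp_specialized<<<32, 128>>>(in, out, num_tiles);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect 256.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The point is not this particular code. The point is that producer/consumer scheduling inside a single resident kernel simply was not on any committee's radar when the current specs were drafted.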

A standards body cannot move at the pace of AI research. The structural reason is that a specification is, by construction, the intersection of what every vendor will agree to ship. Intersections move slowly because every member can veto. Meanwhile the workload moves at the pace of the fastest research team on Earth. In practice the mismatch is closer to 100x.

So in the field where standardization matters most, the historical mechanism is no longer fit for purpose.

CUDA already proved the alternative

CUDA is not a standard. It has no consortium, no ratified spec, and no member companies. It is whatever NVIDIA ships in the next release of the toolkit. The "spec" is the implementation. The reference is the binary.

This is supposed to be a weakness: vendor lock-in, a single point of failure. It is both of those things. But it is also why CUDA wins. The CUDA team can introduce a new memory model, a new tensor core instruction, a new asynchronous copy primitive, and it ships. There is no committee meeting. There is no compatibility working group. The implementation moves at the pace of NVIDIA's hardware roadmap, which happens to match the pace of the workload, because NVIDIA is also the workload's largest direct customer.
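
cuda::memcpy_async is one concrete instance of that pattern: an asynchronous copy primitive that appeared directly in NVIDIA's toolkit (libcu++, CUDA 11, compute capability 7.0 and up) with no ratified spec behind it. The sketch below is a toy use of it; the kernel and the data are invented for illustration.

```cuda
// Toy use of cuda::memcpy_async: asynchronously stage one tile per block into
// shared memory, then do a placeholder computation on it.
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda_runtime.h>
#include <cstdio>

namespace cg = cooperative_groups;

__global__ void double_tiles(const float* in, float* out, int n) {
    auto block = cg::this_thread_block();
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) init(&bar, block.size());
    block.sync();

    // Start an asynchronous, block-wide copy of one tile into shared memory;
    // on Ampere and newer this lowers to hardware cp.async instructions.
    cuda::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(tile), bar);
    bar.arrive_and_wait();  // wait for the copy before touching the tile

    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];  // placeholder compute
}

int main() {
    const int n = 1 << 20;  // assumed to be a multiple of the tile size
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    double_tiles<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expect 2.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Under the consortium model, a primitive like this would wait on a ratified equivalent. Under the implementation-defined model, it ships in the next toolkit release.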

Implementation-defined standardization is faster than specification-defined standardization. CUDA is the proof.

The catch is the obvious one: CUDA is one vendor's implementation. The standard is fast because it is captured. That is not an acceptable end state for the rest of the industry, and the rest of the industry knows it.

The for-profit Khronos analogue

Spectral exists to resolve this contradiction. We preserve the right diagnosis from Khronos, that accelerated computing needs a neutral cross-vendor body, and replace the mechanism. Instead of a consortium that publishes a specification and waits for member implementations, we are a focused, funded engineering team that ships the implementation itself, for every vendor that wants to participate.

SCALE is that implementation. It compiles unmodified CUDA source, the de facto programming model of the field, and produces native code for AMD and NVIDIA GPUs today, with additional architectures in flight. The "standard" is what SCALE compiles. The conformance test is whether your code runs, and runs fast. There is no document to argue about. There is a toolchain.
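
"Unmodified CUDA source" means exactly what it sounds like. The program below is ordinary CUDA with nothing SCALE-specific in it; it is the kind of source meant here, not a demonstration of SCALE's own command line.

```cuda
// An entirely standard CUDA program: a saxpy kernel plus a small driver.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The same file, untouched, is what the toolchain takes as input whether the target is an NVIDIA GPU or an AMD one.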

This makes Spectral structurally different from both a standards body and a chip vendor.

We do not build silicon, so we have no incentive to favour any architecture, and we do not run a consortium, so we are not bottlenecked on member agreement. We work directly with chip vendors, take feedback from the developers writing kernels in production, and ship the implementation that closes the gap on a timescale of weeks.

The for-profit structure is load-bearing here. A volunteer consortium can publish a PDF. It cannot maintain a production-grade compiler across multiple architectures at the pace AI demands. That requires a funded engineering team with hard deadlines and direct accountability to paying customers. Khronos identified the right role, the neutral cross-vendor body, and chose the wrong form. The form has to be a company.

Why agents make this urgent

There is one more reason the specification model is finished, and the full force of it has not landed yet.

Code is increasingly written by language models. An agent that produces a CUDA kernel produces it in seconds. If you want that kernel to run on a non-NVIDIA target, the path from agent-generated source to executing binary has to operate at the same wall-clock speed. A spec-based portability story (wait for the consortium to define an equivalent primitive, then for the vendor to ship it, then for a developer to port the code) does not work when the producer of the code is operating at machine pace.

Implementation-defined portability does work. If the kernel compiles, it runs. If a new primitive shows up in the wild, the team shipping the implementation adds it. There is no third actor in the loop.

The pace of agentic code generation is going to make committee-driven standardization look the way ISO 9001 looks to a software startup: a process from a different era.

The question is not whether implementation-defined standards win. They have already won, inside one vendor.

The question is who builds the cross-vendor version.

We are. SCALE compiles unmodified CUDA code on AMD and NVIDIA today, with more architectures in flight. Install SCALE now and try it out.