CUDA was always cross-platform

Part 3 of a series on ‘why Spectral exists’, ~10 minute read.

Part 2 argued that the compiler stack matters more in the agent era, not less, and that the substrate that agents write against should stay stable while the hardware varies. There's a hole in that argument I didn't address. It assumes CUDA can be the stable substrate even across hardware that looks fundamentally different from what it was designed for. A reasonable reader could grant everything in Parts 1 and 2 and still ask: sure, but why CUDA? Isn't it just NVIDIA's hardware description language with extra syntax?

That's the question this post answers. The claim: the CUDA programming model is more general than the hardware it grew up on, and this is not coincidental. Over the past two decades, NVIDIA hardware has changed significantly, yet the core of CUDA has not.

The wrong mental model

The mental model most people carry around is that CUDA is a thin layer over NVIDIA hardware. Threads map to CUDA cores, warps map to scheduler quanta, shared memory maps to a specific SRAM block on the streaming multiprocessor. The abstraction, in this view, is the hardware with a C++ syntax glued on top.

If that were true, porting CUDA to anything else would be a translation exercise. Find the AMD equivalent of each NVIDIA concept, substitute it in, and call it a day. The ceiling of the approach would be hardware that already looks like an NVIDIA GPU. Anything architecturally different — TPUs, systolic arrays, neuromorphic chips, whatever comes next — would be outside the model's reach by construction.

It isn't true. That's the wrong mental model, and almost everything interesting about CUDA's longevity follows from why it's wrong.

The right mental model

CUDA is a way of describing parallel computation. Specifically: a hierarchy of parallel work (grid, block, thread), an explicit hierarchy of memory locality (global, shared, register), and an execution model — SIMT — that lets the programmer reason about parallelism without micromanaging the hardware underneath.

None of those concepts are NVIDIA-specific. They're statements about how parallel computation decomposes. The grid-block-thread hierarchy is a way of expressing "this work is parallel across the whole problem, this work is parallel within a group that can share state, this work happens inside a single worker" — a structure that exists in any massively parallel computation, on any hardware that does massively parallel computation. The memory hierarchy is the same: every accelerator architecture has the property that some memory is close and fast and small, some is far and slow and large, and the programmer's life is easier if the model exposes that distinction explicitly rather than pretending it doesn't exist.
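To make the two hierarchies concrete, here is a minimal sketch in plain Python rather than CUDA. It runs everything sequentially, so it models only the structure, not the parallelism. The function name `block_sums` and the use of lists to stand in for global memory, shared memory, and registers are illustrative assumptions, not part of any CUDA API.

```python
def block_sums(x, block_dim):
    """Sum `x` in per-block chunks, mirroring grid -> block -> thread.

    Sequential emulation: the loops stand in for work that a GPU
    would run in parallel.
    """
    grid_dim = (len(x) + block_dim - 1) // block_dim  # enough blocks to cover x
    partial = [0.0] * grid_dim                        # "global memory" output

    for block_idx in range(grid_dim):                 # grid level: independent blocks
        shared = [0.0] * block_dim                    # "shared memory": visible to one block
        for thread_idx in range(block_dim):           # block level: threads that share state
            i = block_idx * block_dim + thread_idx    # this thread's global index
            value = x[i] if i < len(x) else 0.0       # "register": private to the thread
            shared[thread_idx] = value
        # A real kernel would need a barrier (__syncthreads()) at this point
        # before any thread reads what the others wrote.
        partial[block_idx] = sum(shared)

    return partial


# block_sums([1.0, 2.0, 3.0, 4.0, 5.0], block_dim=4) returns [10.0, 5.0]
```

Nothing in the sketch mentions a CUDA core, a scheduler, or an SRAM bank. It only says which work is independent, which work shares state, and where each value lives.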

The hardware is the implementation of the model. The model is separable from it. This is the same move LLVM IR made twenty years ago — define the abstraction at the right level, and the backend becomes a code generation problem rather than a translation problem. The reason LLVM works across architectures isn't that x86 and ARM are secretly the same. It's that the IR was designed at a level of abstraction where the differences become a matter of code generation, not a matter of conceptual mismatch.

CUDA's longevity comes from the same property. The model was, by design, pitched at a level of abstraction general enough to outlive the specific GPU it was born on.

AMD is the easy case

At the ISA level, the gap between AMD and NVIDIA GPUs is comparable to the gap between different generations of NVIDIA GPUs. CUDA was designed to abstract away the hardware differences within NVIDIA's own catalogue — and that machinery turns out to be exactly the machinery you need for cross-vendor portability:

Different warp sizes are handled by the built-in warpSize variable, with SCALE's compiler diagnostics there to catch mistakes.
Different shared memory sizes are handled by occupancy APIs, launch bounds, and device property queries. NVIDIA already ships devices with shared memory capacities ranging from 64 KB to 220 KB, so CUDA programs have to deal with this anyway.
Different occupancy characteristics are handled by the same occupancy APIs already required to get optimal results across NVIDIA's own devices.
Low-level microarchitectural details sit below the programming model, where the compiler backend handles them.

None of this is new machinery added for AMD support. It's CUDA being used the way it was always meant to be used. Programs that aren't using these features will benefit from starting to, on NVIDIA hardware too.
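The warp-size point can be sketched as follows, again as a sequential Python emulation rather than real CUDA. The helper names are hypothetical. The list comprehension imitates the pattern `val += __shfl_down_sync(mask, val, offset)`, in which a lane whose source lane is out of range gets its own value back.

```python
def shuffle_down_reduce(lanes, first_offset):
    """Tree-reduce one warp's values; lane 0 ends up with the result."""
    vals = list(lanes)          # one "register" value per lane
    n = len(vals)
    offset = first_offset
    while offset > 0:
        # Lane i adds the value held by lane i + offset. Out-of-range
        # lanes read their own value, as a shuffle-down would return.
        vals = [vals[i] + (vals[i + offset] if i + offset < n else vals[i])
                for i in range(n)]
        offset //= 2
    return vals[0]


def warp_sum_portable(lanes):
    # Starts from warpSize / 2, whatever the warp size happens to be.
    return shuffle_down_reduce(lanes, len(lanes) // 2)


def warp_sum_hardcoded(lanes):
    # Assumes a 32-lane warp by starting from a literal 16.
    return shuffle_down_reduce(lanes, 16)


# On 32 lanes both versions agree; on 64 lanes the hard-coded one
# silently sums only the first half of the warp.
# warp_sum_portable(range(64))  returns 2016
# warp_sum_hardcoded(range(64)) returns 496
```

The hard-coded version is the kind of mistake that stays invisible on 32-lane hardware and only surfaces on a 64-lane warp, as found on some AMD GPUs.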

Tensor cores are the exception. They aren't abstracted by the programming model at all, and they're radically different between vendors — and between NVIDIA generations. But even here, the big picture is the same: a warp or group of warps performs a matmul in registers. The layout of matrix elements across threads and the size of the matmul differ; the fundamental operation doesn't. This is the same situation CPU compilers faced when SIMD instruction sets started proliferating, and the resolution was the same one that has to happen here: build the compiler technology to model the operation and pattern-match naturally-written code onto it. Auto-vectorised SIMD CPU code is now routinely as good as or better than hand-written, and it's portable. The equivalent transform for GPU matmul accelerators is complex but tractable — fundamentally a pile of permutations to be cancelled out or fused into memory ops.

AMD is the proof of concept, not the proof of generality. If the model only ported to architecturally similar hardware, the skeptic would still be right that CUDA is a GPU-shaped abstraction. The interesting case is the one where the hardware doesn't look like a GPU at all.

The interesting case: GEMM on a systolic array

The workload that pays for everything in modern AI is matrix multiplication. So start there.

A well-written CUDA GEMM kernel does something specific. It tiles the output matrix into blocks, assigns each block to a thread block, and has the thread block cooperatively load tiles of the two input matrices into shared memory. Threads within the block then do a register-level accumulation, multiplying their share of the loaded tiles into a partial sum that lives in registers. When the block finishes accumulating across the K dimension, the result tile is written back to global memory. This is the CUTLASS decomposition, and it's the basic shape of every serious GEMM kernel in the wild.
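The decomposition just described can be sketched in plain Python. This is a sequential emulation under simplifying assumptions (square tiles, one thread per output element, zero-padding at the edges), not CUTLASS code, and `tiled_gemm` is a name made up for illustration.

```python
def tiled_gemm(A, B, tile=2):
    """C = A @ B using the block/tile structure of a CUDA GEMM kernel."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]

    for by in range(0, M, tile):          # grid: one thread block per output tile
        for bx in range(0, N, tile):
            # "Registers": one accumulator per thread in the block.
            acc = [[0.0] * tile for _ in range(tile)]

            for k0 in range(0, K, tile):  # march along the K dimension
                # Cooperative load of one A tile and one B tile into
                # "shared memory", zero-padded past the matrix edges.
                As = [[A[by + ty][k0 + k] if by + ty < M and k0 + k < K else 0.0
                       for k in range(tile)] for ty in range(tile)]
                Bs = [[B[k0 + k][bx + tx] if k0 + k < K and bx + tx < N else 0.0
                       for tx in range(tile)] for k in range(tile)]

                # Each thread (ty, tx) accumulates its share of the tiles.
                for ty in range(tile):
                    for tx in range(tile):
                        for k in range(tile):
                            acc[ty][tx] += As[ty][k] * Bs[k][tx]

            # Write the finished tile back to "global memory".
            for ty in range(tile):
                for tx in range(tile):
                    if by + ty < M and bx + tx < N:
                        C[by + ty][bx + tx] = acc[ty][tx]
    return C


# tiled_gemm([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
# returns [[58.0, 64.0], [139.0, 154.0]]
```

The three phases (load tiles, accumulate in registers, write back) are the whole structure. Everything else in a production kernel is tuning.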

Now look at how a systolic array does the same thing. A systolic array — the design that defines TPUs and most contemporary AI accelerators — is a fixed grid of multiply-accumulate units, wired together so that operands stream through the grid in a coordinated pattern. The A matrix streams in from one edge, the B matrix from another, and each processing element receives operands from its neighbours, multiplies them, accumulates into a local register, and passes the operands along to the next PE. The output accumulates in place. The entire matmul happens as a pipelined dance across the grid.
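A cycle-by-cycle sketch of that description, in plain Python. It assumes one PE per output element and a skewed feed, where row i of A and column j of B each enter the array i or j cycles late so that matching operands meet at the right PE. The function name and the skewing scheme are illustrative assumptions, not a model of any particular chip.

```python
def systolic_matmul(A, B):
    """C = A @ B on an emulated M x N grid of multiply-accumulate PEs."""
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0.0] * N for _ in range(M)]     # output accumulates in place, one per PE
    a_reg = [[0.0] * N for _ in range(M)]   # A operand currently held by each PE
    b_reg = [[0.0] * N for _ in range(M)]   # B operand currently held by each PE

    # The last operand pair reaches the far-corner PE at cycle K + M + N - 3.
    for t in range(K + M + N - 2):
        new_a = [[0.0] * N for _ in range(M)]
        new_b = [[0.0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:
                    k = t - i               # left edge feed: row i is delayed i cycles
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0.0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # operand arrives from the left
                if i == 0:
                    k = t - j               # top edge feed: column j is delayed j cycles
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0.0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # operand arrives from above
        a_reg, b_reg = new_a, new_b

        # Every PE multiplies what it holds and accumulates locally.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


# systolic_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
# returns [[58.0, 64.0], [139.0, 154.0]]
```

At cycle t, PE (i, j) holds A[i][t - i - j] and B[t - i - j][j], so it sees exactly the K operand pairs that a single thread's inner loop would visit in a GPU kernel, only spread across time instead of across a loop.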

These look like fundamentally different mental models. They aren't.

The CUDA block becomes the systolic tile. The shared memory load becomes the edge feed into the array. The thread-level multiply-accumulate becomes the PE-level multiply-accumulate. The K-dimension loop in the kernel becomes the pipeline depth of the systolic pass. The same logical decomposition — load a tile, compute on it, write the result — describes both the GPU kernel and the systolic array execution. The CUDA source is describing a structure of computation, and the systolic array is one possible schedule for that structure, where the GPU is another.

This isn't a clever observation about computation in the abstract. The mapping from a tiled matmul to a systolic schedule is well understood — XLA implicitly relies on it, and every TPU compiler in the world does some version of it. The catch is the direction. TPU compilers don't start from CUDA. They start from tile-level abstractions designed to lower cleanly onto systolic hardware — Pallas, Triton, etc. — because nobody has built the compiler IP to raise a low-level representation like CUDA up to something a systolic backend can target. The lowering pathway exists; the raising pathway doesn't. That's why TPUs and other exotic accelerators have ended up with completely separate programming models, even though the structure of the computations they run is the same as what a CUDA kernel describes.

So the question isn't whether the CUDA model can express systolic computation. It can. The question is whether the compiler can recover enough structure from CUDA source to schedule it onto hardware that doesn't look like a GPU. That's a much harder problem than lowering from a high-level tile representation — and it's the problem that has to be solved if the substrate is going to stay stable while the silicon varies.

The mapping itself is well understood. The novelty lies in the next step: the raising.

The general claim

If GEMM ports because the CUDA model describes structure rather than hardware, the same is true for other kernels — though the work the compiler has to do varies with the target. On SIMT-like hardware, the source already describes what the silicon does, and the raising stack is light. On something like a TPU, the compiler has to lift the source much further before it can schedule it onto the systolic execution model. Stencils, reductions, fused custom ops, attention kernels — the structural information is there in the CUDA source. The question is how much raising it takes to recover that structure for a given target, not whether the structure exists.

The compiler is what realizes this. The CUDA source expresses the structure of the computation at a level of abstraction high enough that the backend has room to be clever about how to execute it. This is the same property that lets LLVM turn serial C into vectorized SIMD: the source describes the computation, the compiler picks the implementation that fits the target. CUDA, properly understood, is in the same category. The model is a description of parallel structure, and the backend's job is to find the implementation that fits the silicon.

The implication matters. CUDA is not a candidate for the standard substrate of GPU programming. It's a candidate for the standard substrate of accelerated computing, full stop. The space of hardware the model can target is much larger than the space of hardware it was designed for, and the gap between those two spaces is the work.

NVIDIA itself has been extending the substrate in exactly this direction. CUDA 13.1, released at the end of 2025, introduced cuTile, a tile-based programming model that sits alongside SIMT, lets the developer describe the algorithm at a higher level, and leaves the partitioning onto threads to the compiler and runtime. The underlying representation, CUDA Tile IR, is MLIR-based, in the same compiler-infrastructure family that Pallas, Triton, and the TPU stacks use. CUDA programmers now have both modes of expression: SIMT when the workload wants explicit thread-level control, tile abstractions when the workload is more naturally expressed at the tile level. Same substrate, more ways to write into it. The argument of this post is that the CUDA model was always more general than the hardware it grew up on; cuTile is the model continuing to grow into that generality, on NVIDIA's own roadmap. Having the cake and eating it is the right outcome for a substrate that intends to outlast any specific generation of silicon.

The obvious objection is that other candidate substrates exist. OpenCL was designed for portability and has middling vendor support to show for it. SYCL is gaining traction in HPC but the corpus is small and the tooling is recent. Triton sits one level above CUDA and is excellent for kernel authoring, but it's an addition to the ecosystem rather than a replacement for the layer below. Mojo is new and unproven. The argument for CUDA isn't that it's uniquely well-designed as an abstraction — several of the alternatives are arguably cleaner in some areas. But in addition to serving NVIDIA’s hardware evolution well for 20 years, CUDA has the corpus, the muscle memory of every working GPU programmer, and the largest training distribution for any agent that will generate kernels in the future. The substrate question is partly technical and mostly path-dependent.

Why this matters for the thesis

Parts 1 and 2 made claims about what the substrate should be and who should provide it. This part makes the claim about why CUDA specifically deserves that role. The model is general, the corpus is large, the semantics are stable. If you were going to pick a programming model to be the standard across all accelerated computing — across GPUs from every vendor, across systolic arrays, across whatever the next generation of AI silicon turns out to look like — you would want one with exactly those properties. CUDA already has all three. No other candidate has them at the same scale.

The compiler is where the generality gets realized. Without a compiler that knows how to target each architecture, the model's generality is a theoretical property — interesting to think about, useless in production. With one, it's the basis for a future where the code people have already written, the kernels agents have already learned to generate, and the muscle memory accumulated across two decades of GPU programming all keep working as the hardware underneath changes shape.

Closing

CUDA is not a hardware description language. It's a programming model that happened to be born on one piece of hardware and has spent two decades being refined into something far more general than its origins suggest. The shape of AI hardware over the next decade is going to surprise people. New vendors, new architectures, new ideas about how silicon should be organized to run the workloads that matter. Some of those bets will pay off and some won't. We don't know which.

What we do know is that whichever hardware wins, the people building on top of it will want their existing code to keep working, their agents to keep generating code they can trust, and their substrate to stay stable across the change. That's the future SCALE is built to support — whichever vendor, whichever shape, whatever comes next. The model is general enough to get there. The work is making the compiler match.