Why hardware-agnostic isn't the same as lowest-common-denominator

Part 4 of a series on 'why Spectral exists', ~10 minute read.

Part 3 argued that the CUDA programming model is more general than the hardware it grew up on, and that the gap between the hardware it can target and the hardware it was designed for is the work. That argument has a cost attached to it, and a careful reader will already be reaching for the bill.

The objection goes like this. The moment you commit to running across hardware from more than one vendor, you have committed to expressing your program in terms of what those vendors have in common. Anything one architecture can do that the others can't is off the table, because the abstraction has to hold for all of them. So a cross-vendor system is a lowest-common-denominator system by construction, and lowest-common-denominator means slow. You can have portability or you can have peak performance. SCALE has chosen portability, and the performance tax is the price.

This is the most serious thing a technical investor or a competitor will say about what we do, and they are right about a whole category of systems. They are wrong about ours, and the reason they're wrong is the entire engineering bet. This post is about the difference.

Two things called portable

The word "portable" covers two mechanisms that have almost nothing in common, and the objection is true of one of them.

The first is portability by abstraction. You define a layer that sits above the hardware, you target the layer, and at runtime something maps the layer onto whatever machine you happen to be on. An interpreter does this. A runtime shim does this. A naive source-to-source translator does this. The layer can only express what every target supports, because it has no knowledge of the specific machine at the point where the program is written — that knowledge arrives too late to change the code. This is genuinely lowest-common-denominator, and it genuinely costs performance. The objection describes it perfectly.

The second is portability by compilation. You transform the program into a representation that preserves its structure, and then you generate native code for each target separately, with full knowledge of that target, ahead of time. The representation is shared; the code that comes out the other end is not. Nothing about the program is pinned to the intersection of what the targets have in common, because the decisions that matter are made per target, after the target is known.

These two things get filed under the same word, and the objection quietly assumes the first when the system in question is the second. SCALE is the second. Everything else in this post follows from that.
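A toy model makes the distinction concrete. The sketch below is purely illustrative: the target names and feature sets are invented, and nothing in it reflects SCALE's internals. It only contrasts which operations each kind of portability is allowed to use.

```python
# Hypothetical targets and feature sets, invented for illustration.
TARGETS = {
    "vendor_a": {"add", "mul", "fma", "wide_shuffle"},
    "vendor_b": {"add", "mul", "dot4"},
}

def usable_by_abstraction(targets):
    """Portability by abstraction: one program for all targets, so only
    operations that every target supports can appear in it."""
    return set.intersection(*targets.values())

def usable_by_compilation(targets):
    """Portability by compilation: code is generated per target, so each
    build may use that target's full feature set."""
    return {name: set(features) for name, features in targets.items()}

layer = usable_by_abstraction(TARGETS)
per_target = usable_by_compilation(TARGETS)

print("shared layer:", sorted(layer))
for name in sorted(per_target):
    # Features the per-target build can use that the shared layer cannot.
    print(name, "extra:", sorted(per_target[name] - layer))
```

In this model the shared layer is stuck with add and mul, while each per-target build keeps everything its own hardware offers. The objection is a statement about the first function; a compiler behaves like the second.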

The LLVM proof

The cleanest way to see that cross-vendor and native-quality aren't in tension is to look at a place where the argument has already been settled, by twenty years of production use, in a system nobody accuses of being slow.

C compiled through LLVM runs on x86, ARM, and RISC-V. It is competitive with hand-written assembly on all of them. Nobody describes LLVM as a lowest-common-denominator layer, even though it is, by definition, an abstraction over multiple instruction set architectures, which is exactly the property the objection claims must cost performance. The LLVM IR does not express the intersection of what x86 and ARM can do. It expresses the structure of the computation, and then each backend generates native code with full knowledge of its own instruction set, register file, cost model, and scheduling constraints. The differences between the architectures don't disappear; they move into code generation, where the backend that knows the target deals with them.

That is the whole trick, and it is not a new one. Cross-architecture and native-quality coexist when the abstraction sits at the right level and the target-specific decisions are made by a backend that knows the target. SCALE is the same category of system, one layer up. The CUDA source is the high-level description; the per-vendor backend is where the machine gets exploited. If you accept that LLVM doesn't make C slow, you have already accepted the mechanism that lets a compiler be cross-vendor without being lowest-common-denominator. The only open question is whether we've built the backends well, which is an engineering question, not a question about whether the approach can work at all.
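Here is a minimal sketch of that mechanism, assuming a toy expression IR and two hypothetical backends, one with a fused multiply-add instruction and one without. The class names, the `has_fma` flag, and the pseudo-instruction syntax are all invented for illustration. This is not how LLVM's instruction selection is implemented; it only shows one shared representation being lowered differently per target.

```python
from dataclasses import dataclass

# A tiny shared IR: it records the structure of the computation and
# says nothing about any particular machine.
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Mul:
    lhs: object
    rhs: object

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

def lower(expr, has_fma, out):
    """Append pseudo-instructions for expr to out and return the name
    of the value that holds the result."""
    if isinstance(expr, Var):
        return expr.name
    # Target-specific decision: fuse (a * b) + c when the target can.
    if isinstance(expr, Add) and has_fma and isinstance(expr.lhs, Mul):
        a = lower(expr.lhs.lhs, has_fma, out)
        b = lower(expr.lhs.rhs, has_fma, out)
        c = lower(expr.rhs, has_fma, out)
        dst = f"t{len(out)}"
        out.append(f"fma {dst}, {a}, {b}, {c}")
        return dst
    op = "mul" if isinstance(expr, Mul) else "add"
    left = lower(expr.lhs, has_fma, out)
    right = lower(expr.rhs, has_fma, out)
    dst = f"t{len(out)}"
    out.append(f"{op} {dst}, {left}, {right}")
    return dst

def compile_for(expr, has_fma):
    out = []
    lower(expr, has_fma, out)
    return out

ir = Add(Mul(Var("a"), Var("b")), Var("c"))  # a * b + c
with_fma = compile_for(ir, has_fma=True)
without_fma = compile_for(ir, has_fma=False)
print(with_fma)
print(without_fma)
```

The IR is identical in both calls. The backend with the fused instruction emits one operation and the other emits two, and neither is limited by what the other lacks.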

Where the objection is right

It is worth being precise about when "abstraction costs performance" is true, because it often is, and pretending otherwise would be the kind of marketing claim this series exists to avoid.

The tax is real whenever the target is decided at runtime rather than compile time. A dynamic translation layer that rewrites instructions on the fly pays for the translation and can't afford the expensive analysis that good code generation needs. A thin runtime shim that forwards calls to whatever device is present has no opportunity to specialise, because it doesn't know what it's running on until it's already running. Anything that targets the common subset by construction — because it was designed as a portability layer first and a performance system second — inherits the ceiling the objection describes.

What these have in common is the moment the target becomes known. If the program is committed to its final form before the machine is known, the machine can't inform the code, and the common-denominator ceiling is real. SCALE makes its decisions at compile time, per target, with the target fully in hand. The difference from a runtime shim is structural: what matters is when the vendor-specific knowledge gets to act on the code. For us, it acts before a single instruction is emitted.

What a compiler does that a translator can't

Here is the claim the objection gets exactly backwards. Vendor-specific optimisation is not erased by the cross-vendor model. The cross-vendor model is where the vendor-specific optimisation happens. A translator substitutes one vendor's concepts for another's and stops. A compiler re-derives the right code for each machine from scratch, and the machine it's deriving for is the one it specialises against.

Concretely, when SCALE targets AMD, the backend is not looking up the AMD spelling of an NVIDIA concept. It is reasoning about AMD's hardware:

Occupancy is computed against the actual target. The number of wavefronts a compute unit can keep in flight depends on register and LDS pressure, and the tradeoff curve is AMD's, not NVIDIA's. The backend allocates registers and schedules against AMD's curve, which is a different optimisation problem with a different answer, not a translated version of the NVIDIA one.
Shared memory and LDS decisions are made for the target's sizes and banking. What lives in fast on-chip memory versus what stays in registers is a target-specific call, and the banking behaviour that determines whether those accesses conflict is AMD's banking behaviour.
The wavefront width is part of the cost model, not a find-and-replace. A warp is 32 lanes and a wavefront is commonly 64 lanes. That changes how reductions are shaped, what divergence costs, and how memory accesses coalesce. The backend knows the width and generates code that's right for it.
Memory access patterns are laid out for the target's coalescing rules. What counts as a coalesced access — and therefore how the compiler should arrange loads and stores — is a property of the specific memory subsystem, and the backend lays out for the one it's compiling for.
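To make the occupancy and width points concrete, here is a rough sketch under heavy assumptions. The `Target` fields, the two targets, and every number in them are invented for illustration and are not real AMD or NVIDIA specifications; real occupancy rules have more constraints and differ by architecture generation. The point is only that the same kernel yields a different answer on each target.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    # All values used with this class below are invented, not vendor specs.
    name: str
    lanes_per_wave: int    # threads that execute in lockstep
    regs_per_lane: int     # register budget per lane, shared by resident waves
    local_mem_bytes: int   # on-chip scratch (shared memory / LDS) budget
    max_waves: int         # hard cap on resident waves

def resident_waves(t, regs_per_thread, local_bytes_per_group, threads_per_group):
    """Waves that fit at once, limited by registers, local memory and the cap.
    Assumes local_bytes_per_group is greater than zero."""
    waves_per_group = -(-threads_per_group // t.lanes_per_wave)  # ceiling division
    by_regs = t.regs_per_lane // regs_per_thread
    by_local = (t.local_mem_bytes // local_bytes_per_group) * waves_per_group
    fit = min(t.max_waves, by_regs, by_local)
    return (fit // waves_per_group) * waves_per_group  # whole groups only

def butterfly_offsets(lanes_per_wave):
    """Shuffle distances for a tree reduction across one wave."""
    offsets, step = [], lanes_per_wave // 2
    while step >= 1:
        offsets.append(step)
        step //= 2
    return offsets

narrow = Target("narrow32", lanes_per_wave=32, regs_per_lane=512,
                local_mem_bytes=65536, max_waves=16)
wide = Target("wide64", lanes_per_wave=64, regs_per_lane=256,
              local_mem_bytes=65536, max_waves=8)

# Same hypothetical kernel on both targets: 64 registers per thread,
# 16 KiB of local memory and 256 threads per group.
for t in (narrow, wide):
    print(t.name, resident_waves(t, 64, 16384, 256),
          butterfly_offsets(t.lanes_per_wave))
```

With these made-up budgets the narrow target keeps 8 waves resident and reduces in 5 shuffle steps, while the wide target keeps 4 waves resident and needs 6 steps. Neither result can be obtained by renaming the other.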

None of this is available to a translator, because a translator's unit of work is the concept, and these decisions live below the concept, in the code generation. The reason a compiler can match native code is that it is doing what the native programmer would do: making target-specific decisions with target-specific knowledge. The cross-vendor part is upstream of all of it.
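The banking and coalescing decisions can be sketched the same way. What follows is a deliberately simplified model, assuming word-interleaved banks and fixed-size memory segments; the function names and parameters are hypothetical, and real hardware rules are more detailed and vary between architectures. It shows why the answer depends on parameters that only the target can supply.

```python
from collections import Counter

def bank_conflict_degree(lanes, stride_words, banks):
    """Worst-case number of lanes in one wave that hit the same bank when
    lane i reads word i * stride_words. Assumes stride_words >= 1, so every
    lane reads a distinct address (no broadcast case)."""
    hits = Counter((lane * stride_words) % banks for lane in range(lanes))
    return max(hits.values())

def segments_touched(lanes, stride_bytes, segment_bytes):
    """Number of distinct memory segments one wave touches when lane i
    reads the address i * stride_bytes. Fewer segments means better
    coalescing in this simplified model."""
    return len({(lane * stride_bytes) // segment_bytes for lane in range(lanes)})

# Hypothetical parameters: 32 banks, 128-byte segments, 4-byte elements.
for lanes in (32, 64):
    print(lanes, "lanes:",
          "unit-stride conflicts", bank_conflict_degree(lanes, 1, 32),
          "stride-2 conflicts", bank_conflict_degree(lanes, 2, 32),
          "segments", segments_touched(lanes, 4, 128))
```

In this model the same access pattern is conflict-free for a 32-lane wave and two-way conflicted for a 64-lane wave, and the wider wave touches twice as many segments. A backend that knows the width, the bank count, and the segment size can lay out data accordingly; a concept-level substitution never sees these numbers.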

The evidence

This stops being an argument the moment there's a number, because the objection makes a falsifiable prediction. A lowest-common-denominator system cannot match a hand-targeted native port. That's what lowest-common-denominator means — it's leaving the vendor-specific performance on the table by construction. So if SCALE matches a native HIP port on a real workload, the system is not lowest-common-denominator. The parity is the proof.

It does match. On GROMACS, unmodified CUDA compiled through SCALE runs on AMD at performance on par with the native HIP port, and on the MI355X work with TensorWave the picture is the same — native-quality code out of source that was never written for the target. A common-denominator layer is incapable of reaching parity with code that was hand-targeted to the machine. Reaching it anyway is the empirical refutation of the objection, not a rhetorical one. The benchmark and the claim are the same fact stated two ways.

The harder version of the objection

The sophisticated version of the pushback concedes the parity and moves the line. Fine — you match the HIP port. But you'll never beat a kernel that an expert hand-tuned for one specific piece of silicon, so there's still a gap, and the gap is the tax.

Two answers. The first is that hand-tuned-per-kernel is not the baseline that real codebases run at. Almost nobody is hand-tuning every kernel for every target; they're shipping the port they could afford to write, and the comparison that matters in practice is against that, not against an idealised expert who tuned everything. Against the realistic baseline, the gap is frequently zero or negative.

The second is that closing the remaining gap against the idealised baseline is precisely what compiler technology has always done, given time. Auto-vectorised SIMD started well behind hand-written assembly and is now routinely as good as or better than it, and it's portable across instruction sets in a way hand-written assembly never is. The same arc applies here. As Part 3 put it for the tensor-core case, the hard transforms are fundamentally a pile of permutations to be cancelled or fused: complex, but tractable, and the kind of thing a compiler closes over time rather than a wall it hits. The gap is the work. It is not a ceiling, and a ceiling is what the objection needs it to be.
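As a small illustration of what cancelling or fusing permutations means, here is a sketch on plain index lists. It is a generic example of permutation algebra, not SCALE's implementation, and the helper names are invented.

```python
def apply(perm, data):
    """Gather convention: output[i] = data[perm[i]]."""
    return [data[p] for p in perm]

def compose(first, then):
    """Single permutation equivalent to applying `first`, then `then`."""
    return [first[t] for t in then]

def inverse(perm):
    """Permutation that undoes `perm`."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv

def is_identity(perm):
    return perm == list(range(len(perm)))

data = ["a", "b", "c", "d"]
p = [2, 0, 3, 1]   # some data-layout shuffle
q = [3, 2, 1, 0]   # a second shuffle applied afterwards

# Fusing: two data movements become one.
fused = compose(p, q)
two_step = apply(q, apply(p, data))
one_step = apply(fused, data)

# Cancelling: a shuffle followed by its inverse needs no data movement.
cancelled = compose(p, inverse(p))

print(two_step, one_step, is_identity(cancelled))
```

A compiler that can see both shuffles replaces them with the fused one, and when the composition is the identity it drops the data movement altogether. The hard part in practice is recognising these opportunities in real kernels, which is the work the paragraph above describes.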

Why this matters for the thesis

If the objection were right, the series would collapse. A substrate that taxes performance cannot be the standard for accelerated computing, because nobody standardises on the thing that makes their code slower. The entire history of this field is people abandoning portable-but-slow for fast-but-locked-in the instant the performance gap becomes load-bearing. Native-quality code generation isn't a nice property of SCALE on top of the argument in Parts 1 through 3. It's the precondition for any of that argument to hold.

Part 3 said the model is general. This post says the generality is free — that you don't pay for cross-vendor reach with performance, because the compiler makes the per-vendor decisions a native programmer would make, with the same information. The model being general and the performance being native are not two separate claims that happen to both be true. They're the same claim, and the compiler is the thing that makes both of them true at the same point in the pipeline.

Closing

Hardware-agnostic describes where the code runs. It does not describe how well it runs, and the assumption that the two are the same is an intuition imported from a different kind of system. The intuition is correct about runtimes, interpreters, and translators, where the target shows up too late to shape the code. It is wrong about compilers, where the target shows up exactly in time.

The lowest-common-denominator objection is the best argument against what we do, and it is an argument about a system we didn't build. The model is general, the performance is native, and the compiler is where both of those stop being a tension and start being the same sentence.

Still not convinced? Join our Discord. We love talking compilers, GPU programming, and having exactly these sorts of debates.