SCALE 1.7 is out

This month's SCALE release is a meaty one. It pushes on three fronts at once: fleet deployability for cloud providers, raw HPC performance, and the PyTorch coverage that AI shops need to take AMD seriously.

Head over to our install page to get the new release, and let us know what you find. Bug reports and feedback are welcome on our Discord.

Multi-architecture AMD builds

SCALE can now produce a single binary that targets multiple AMD GPU architectures at once, with the right code path picked at runtime - the same multi-ISA bundling HIP supports. One artifact, one validation pass, the whole AMD fleet.

This matters for anyone shipping a CUDA workload across a mixed AMD fleet:

For HPC enterprises and Independent Software Vendors, it collapses the per-SKU build matrix - one binary covers all their AMD targets.
For neoclouds and inference-as-a-service providers building their own runtime images, it does the same thing plus gives the scheduler full placement freedom across the fleet.
For CSPs renting bare GPU-hours, it makes their AMD catalogue materially easier for customers to actually adopt - especially the long-tail SKUs.

One last thing to note: SCALE provides first-class support for older AMD architectures no longer supported by ROCm. See our documentation for details. If you have any of that hardware in your fleet, SCALE keeps it earning.

Kokkos support

As you may know, Kokkos is the C++ performance-portability library behind a huge fraction of DOE simulation codes, hosted by the Linux Foundation and developed primarily by Sandia National Laboratories (SNL), Oak Ridge National Laboratory (ORNL), and the French Alternative Energies and Atomic Energy Commission (CEA). You write your computation once against Kokkos's abstractions, then pick a backend at compile time - CUDA for NVIDIA, HIP for AMD, and so on - and the same source runs on whichever hardware you targeted.
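As a rough illustration of that write-once model, here is a minimal sketch of a dot product written against Kokkos's `View`, `parallel_for` and `parallel_reduce` abstractions. The helper name `dot_of_constants` and the problem size are made up for this example. The backend is whichever one Kokkos was configured with at build time (for instance the CMake option `-DKokkos_ENABLE_CUDA=ON` for the CUDA backend); the source code itself does not name it.

```cuda
#include <Kokkos_Core.hpp>
#include <cstdio>

// Hypothetical helper for illustration: fills two vectors with constants and
// returns their dot product. Must be called between Kokkos::initialize and
// Kokkos::finalize.
double dot_of_constants(int n) {
  // Views are allocated in the default memory space of the configured backend.
  Kokkos::View<double*> x("x", n);
  Kokkos::View<double*> y("y", n);

  // Runs on the default execution space (GPU if a GPU backend is enabled).
  Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
    x(i) = 1.0;
    y(i) = 2.0;
  });

  // Reduction into a host scalar; this call blocks until the result is ready.
  double result = 0.0;
  Kokkos::parallel_reduce("dot", n, KOKKOS_LAMBDA(const int i, double& acc) {
    acc += x(i) * y(i);
  }, result);
  return result;
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    std::printf("dot = %.1f\n", dot_of_constants(n));
  }
  Kokkos::finalize();
  return 0;
}
```

Nothing in the function body mentions CUDA or HIP; switching hardware is a build-configuration change, not a source change.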

With SCALE 1.7, Kokkos's CUDA backend runs on AMD GPUs directly - no porting to the HIP backend, no second validation cycle. You keep the backend your code was written and validated for, and SCALE handles the AMD targeting underneath. Note that this is the initial support release and we are still ironing out bumps in collaboration with the Kokkos maintainers.

For national labs and HPC enterprises with existing Kokkos+CUDA codebases, this removes the largest single barrier to evaluating AMD hardware on real workloads.

Managed memory: over 100x faster on HPC workloads

Substantial improvements to managed memory put SCALE more than 100x faster on many HPC workloads - up to an order of magnitude ahead of HIP on those same applications. Our CTO Chris has published a small self-contained microbenchmark - a single CUDA file that builds under both HIP and SCALE - that isolates the cost of managed-memory page migration so you can reproduce the difference yourself. On AMD hardware SCALE relocates those pages dramatically faster than HIP: on an MI355, the first iteration drops from roughly 648,000 ms under HIP to about 6,200 ms under SCALE (yes, you read correctly: 10.8 minutes vs 6.2 seconds, a gap of roughly 100x). It's a deliberately simple test, and even on a purely unidirectional transfer the gap is stark; workloads that alternate between host and GPU use would feel it even more. Our documentation covers managed memory in more detail.

If you're running scientific code that leans heavily on unified/managed memory - CFD, molecular dynamics, climate, reservoir simulation - this feature alone makes this release worth picking up. We'll be publishing fresh benchmarks soon.
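To make the idea of "isolating page-migration cost" concrete, here is a minimal sketch of that kind of measurement. It is not Chris's microbenchmark; the names `increment`, `timed_pass` and `run_migration_demo` and the buffer size are assumptions chosen for illustration. The buffer is first touched on the host, so on systems that migrate managed pages on demand the first kernel pass pays for moving them to the GPU and the second pass typically does not.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                   cudaGetErrorString(err_));                        \
      std::exit(1);                                                  \
    }                                                                \
  } while (0)

// Grid-stride kernel that touches every element once.
__global__ void increment(float* data, size_t n) {
  size_t stride = (size_t)gridDim.x * blockDim.x;
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += stride)
    data[i] += 1.0f;
}

// Times one kernel pass over the buffer, in milliseconds.
static float timed_pass(float* data, size_t n) {
  cudaEvent_t start, stop;
  CHECK(cudaEventCreate(&start));
  CHECK(cudaEventCreate(&stop));
  CHECK(cudaEventRecord(start));
  increment<<<256, 256>>>(data, n);
  CHECK(cudaGetLastError());
  CHECK(cudaEventRecord(stop));
  CHECK(cudaEventSynchronize(stop));
  float ms = 0.0f;
  CHECK(cudaEventElapsedTime(&ms, start, stop));
  CHECK(cudaEventDestroy(start));
  CHECK(cudaEventDestroy(stop));
  return ms;
}

// Hypothetical driver: returns true if both passes updated the data.
bool run_migration_demo(size_t n) {
  float* data = nullptr;
  CHECK(cudaMallocManaged(&data, n * sizeof(float)));
  for (size_t i = 0; i < n; ++i) data[i] = 0.0f;  // first touch on the host

  float first = timed_pass(data, n);   // includes any host-to-GPU migration
  float second = timed_pass(data, n);  // pages are normally GPU-resident now
  CHECK(cudaDeviceSynchronize());
  std::printf("first pass: %.3f ms, second pass: %.3f ms\n", first, second);

  bool ok = (data[0] == 2.0f) && (data[n - 1] == 2.0f);
  CHECK(cudaFree(data));
  return ok;
}

int main() {
  return run_migration_demo(size_t{1} << 26) ? 0 : 1;  // 256 MiB of floats
}
```

The difference between the two printed times is a crude estimate of migration cost; absolute numbers depend heavily on the GPU, driver and toolchain.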

What PyTorch needs

Three big steps toward making PyTorch run cleanly on SCALE:

Full cuBLASLt coverage. The fused-GEMM hot path for modern transformer inference and training: GEMM+bias+GELU, and the rest of the fused-epilogue surface PyTorch leans on. On transformer workloads, this is where most of your GPU-hours get spent - a surface where hipBLASLt coverage is still maturing.
All the Graph API that PyTorch needs. CUDA Graphs are how PyTorch captures and replays sequences of GPU operations with low launch overhead - important for inference and many training workloads.
Most of the cuSOLVER that PyTorch needs. cuSOLVER is NVIDIA's dense linear algebra library (LU, QR, Cholesky, eigensolvers), and PyTorch leans on it for a fair amount of its linear algebra surface.

We're not done with PyTorch yet, but we're a meaningful chunk closer.
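For readers who have not used CUDA Graphs directly, the capture-and-replay pattern described above looks roughly like this. It is a minimal sketch using standard CUDA runtime calls, not PyTorch's implementation; the kernel `scale`, the function `run_graph_demo` and the sizes are made up for the example, and the source does not list which individual graph calls SCALE 1.7 covers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                   cudaGetErrorString(err_));                        \
      std::exit(1);                                                  \
    }                                                                \
  } while (0)

__global__ void scale(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// Hypothetical demo: captures four kernel launches into a graph, replays the
// graph three times, and returns the first element of the result.
float run_graph_demo() {
  const int n = 1 << 16;
  std::vector<float> host(n, 1.0f);
  float* d_x = nullptr;
  CHECK(cudaMalloc(&d_x, n * sizeof(float)));
  CHECK(cudaMemcpy(d_x, host.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice));

  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));

  // Capture: the launches are recorded into the graph, not executed.
  cudaGraph_t graph;
  CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
  for (int k = 0; k < 4; ++k)
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 2.0f, n);
  CHECK(cudaStreamEndCapture(stream, &graph));

  // Instantiate once, then replay with a single launch call per iteration.
  cudaGraphExec_t exec;
  CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));
  for (int r = 0; r < 3; ++r) CHECK(cudaGraphLaunch(exec, stream));
  CHECK(cudaStreamSynchronize(stream));

  CHECK(cudaMemcpy(host.data(), d_x, n * sizeof(float),
                   cudaMemcpyDeviceToHost));
  CHECK(cudaGraphExecDestroy(exec));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaStreamDestroy(stream));
  CHECK(cudaFree(d_x));
  return host[0];  // 1.0 doubled 4 times per replay, over 3 replays
}

int main() {
  std::printf("x[0] = %.1f\n", run_graph_demo());
  return 0;
}
```

Each replay submits all four kernels with one `cudaGraphLaunch` call, which is where the launch-overhead saving comes from.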

OpenGL interop

SCALE now supports CUDA's OpenGL interop APIs - cudaGraphicsGLRegisterBuffer, cudaGraphicsGLRegisterImage, and friends. These let CUDA kernels and OpenGL share GPU resources without round-tripping through CPU memory.

This unblocks a category of CUDA projects that combine compute and rendering in the same pipeline: scientific visualization, particle and fluid simulations, real-time graphics tools, and a long tail of HPC/visualization software that previously had no AMD path at all.

It's also groundwork for upcoming TMA (Tensor Memory Accelerator) support - the texture-descriptor and multi-dimensional addressing infrastructure that OpenGL interop relies on overlaps significantly with what TMA needs.

FFmpeg on AMD

FFmpeg now builds and runs on SCALE, which means its CUDA-accelerated filters work on AMD GPUs. Media transcoding, scaling, and filtering pipelines built around NVIDIA's CUDA filters now have a drop-in AMD path - relevant for CDNs, video platforms, and any GPU-accelerated media workload looking to diversify off a single vendor.

Other compiler optimisations

This release adds several new compiler optimisations across the board, applying to both NVIDIA and AMD targets. Worth singling out: shuffle and reduction improvements on NVIDIA targets, including automatic use of the redux instruction - something NVIDIA's own documentation tells programmers to apply by hand. SCALE does it for you on the recompile.

Individually none of the rest is dramatic, but on long-running HPC and inference workloads, the compounding is the point.
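To show what the redux point refers to, here is a minimal sketch contrasting a hand-written shuffle reduction with the `__reduce_add_sync` intrinsic, which maps to redux on NVIDIA GPUs of compute capability 8.0 and above. The function names are made up for the example, it assumes 32-lane warps, and it does not show how SCALE's compiler performs the rewrite; the source only says that it happens automatically.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                   cudaGetErrorString(err_));                        \
      std::exit(1);                                                  \
    }                                                                \
  } while (0)

// Classic tree reduction: five shuffle+add steps; lane 0 ends with the sum.
__device__ int warp_sum_shuffle(int v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_down_sync(0xffffffffu, v, offset);
  return v;
}

// Same sum via the warp-reduce intrinsic where available, with a fallback
// to the shuffle version on older architectures.
__device__ int warp_sum_redux(int v) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  return __reduce_add_sync(0xffffffffu, v);
#else
  return warp_sum_shuffle(v);
#endif
}

// Launched with exactly one 32-thread warp; each lane contributes its index.
__global__ void warp_sums(int* out) {
  int lane = threadIdx.x;
  int a = warp_sum_shuffle(lane);
  int b = warp_sum_redux(lane);
  if (lane == 0) {
    out[0] = a;
    out[1] = b;
  }
}

// Hypothetical driver: true if both variants give 0 + 1 + ... + 31 = 496.
bool run_warp_sum_demo() {
  int* d_out = nullptr;
  CHECK(cudaMalloc(&d_out, 2 * sizeof(int)));
  warp_sums<<<1, 32>>>(d_out);
  CHECK(cudaGetLastError());
  int h[2] = {0, 0};
  CHECK(cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost));
  CHECK(cudaFree(d_out));
  std::printf("shuffle: %d, redux: %d\n", h[0], h[1]);
  return h[0] == 496 && h[1] == 496;
}

int main() { return run_warp_sum_demo() ? 0 : 1; }
```

The appeal of having the compiler do this is that existing code written in the first style keeps its source unchanged and still gets the single-instruction form where the hardware supports it.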

Bug fixes

The usual mix of fixes, including one worth calling out: llm.c now works on SCALE. You can see it running in this video. Support for Alien, the CUDA-powered artificial life simulation program, has also been fixed, and it now works across all supported AMD GPUs via SCALE.