Unpacking SCALE

Matthew Ireland

Fri Aug 02 2024

It’s been just over two weeks since the first beta version of SCALE launched. We’re thrilled by the response we’ve received: it’s clear that the world is as excited as we are at the prospect of diversifying the GPGPU market!

SCALE enables CUDA code, previously restricted to NVIDIA GPUs, to be run on AMD GPUs. This is a massive step forward in the GPGPU programming world: it opens up a huge number of GPGPU projects to AMD customers who previously could not access them. Even more importantly, SCALE enables end-users to make a free choice of GPU hardware (for example, to opt for powerful but lower-cost GPUs from AMD). As more vendors enter the GPGPU market over the next decade, SCALE will provide a unified cross-vendor platform for software developers which will motivate GPU vendors to compete based on their hardware offering, rather than relying on software lock-in to achieve market dominance.

What is SCALE?

SCALE is a GPGPU programming toolkit that enables CUDA applications to be natively compiled for AMD GPUs. Before SCALE, programs written in the CUDA programming language could only be compiled for NVIDIA hardware, using tools provided by NVIDIA. Using SCALE, CUDA programs can be run, unmodified, on AMD GPUs. The great thing about SCALE’s approach is that it doesn’t add any overhead, allowing the performance of CUDA code on AMD GPUs to equal or exceed the performance of the same code on an NVIDIA GPU.
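To make the "unmodified" claim concrete, here is an ordinary CUDA program (a generic SAXPY sample, not taken from SCALE's documentation). The point is that exactly this kind of source, CUDA runtime calls and all, is what SCALE compiles natively for AMD hardware:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A standard CUDA SAXPY kernel: y = a*x + y.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0 = 2*1 + 2
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```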

What SCALE is not

SCALE is:

  1. Not a new language that has incompatibilities with CUDA.

  2. Not a “translation layer”. It doesn’t use NVIDIA runtime libraries in any way.

  3. Not based on NVIDIA tools in any way (it was built in a cleanroom environment).

  4. Not an emulator. There’s no inherent slowdown when using SCALE.

  5. Not based on HIP, or any other framework that’s incompatible with CUDA. We’ve written our own CUDA-compatible framework from the ground up.

SCALE is not affiliated with AMD, NVIDIA, or any other hardware vendor. SCALE has been built by Spectral Compute, a small UK company, whose staff are truly committed to diversifying and evolving the GPGPU market.

Why did we build SCALE?

CUDA code dominates the GPGPU landscape, and it’s clear why: CUDA is a great language! Millions of lines of CUDA code have been written in thousands of projects across applications such as deep learning, AI, bioinformatics, computational fluid dynamics, weather forecasting, and many more. Large numbers of organisations have invested years of engineering effort into building large CUDA codebases, and built up significant expertise in the CUDA language.

Before the release of SCALE, any project or organisation with a significant CUDA codebase that wished to support not just NVIDIA GPUs, but also AMD and other vendors, faced a significant problem. They had several options:

  1. Manually rewrite their entire codebase in a different language that can be compiled for both NVIDIA and AMD GPUs, for example using open standards such as SYCL. However, this is a huge undertaking that could take many person-years for a large codebase. Subtle bugs would likely remain after the translation effort is complete, and would take further time to resolve.

    Adopting an open standard would be a viable approach for new projects (in fact, we’re huge fans of SYCL!). However, the overhead of translating an existing large codebase to a new language is simply not viable for many organisations.

    There’s also the issue that vendors are unlikely ever to be incentivised to support open standards to the extent that they support their own ecosystems, so the use of API wrappers would remain necessary.

  2. Translate their codebase into a new language such as HIP, which comes with an automated tool for translating from CUDA. This can work well, but HIP is not intended to be a drop-in replacement for CUDA. Therefore, this approach can still incur significant overheads in terms of extensive manual tuning and bug fixing.

    The CUDA code that has been rewritten using HIP can also no longer be compiled using NVIDIA’s own tools, and so must rely on wrapper APIs. For this reason, many projects end up maintaining separate CUDA and HIP codebases.

  3. Use a JIT compiler such as ZLUDA. There’s a key advantage to this approach: since ZLUDA operates on PTX assembly code, an end-user can run a CUDA project on an AMD GPU without having access to the original source code, and without support from the project developers. While JIT compilation necessarily incurs a modest slowdown compared to a natively compiled program, the performance is likely to be sufficient in practice for most applications.

    However, code run using ZLUDA can never match the performance of natively compiled binaries, and opportunities for target-specific optimisation are limited. Furthermore, ZLUDA relies on DLL injection, which can be fragile and tends to interact badly with antivirus software. Careful attention must also be given to NVIDIA’s EULAs, which are unfavourable towards the use of translation layers.

    ZLUDA is not currently under active development. This is unfortunate, since it does fill a useful niche.

  4. Write their own CUDA emulator. However, this would be a huge undertaking and would incur a slowdown, defeating the entire purpose of GPGPU programming: performance.
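To illustrate why HIP (option 2) is not a drop-in replacement, compare the spelling of a single runtime call in each dialect. This is a minimal sketch: AMD's hipify tools handle far more than renaming, but the renaming alone is enough to fork the sources.

```cuda
#include <cuda_runtime.h>

// CUDA spelling of a simple allocation. The HIP translation of the same
// lines (roughly what hipify would emit) is shown in the comments: every
// cuda* identifier becomes hip*, so the translated file no longer compiles
// with NVIDIA's nvcc unless a compatibility wrapper maps the names back.
int main() {
    float* buf = nullptr;
    cudaError_t err = cudaMalloc(&buf, 1024 * sizeof(float));
    //  HIP:  hipError_t err = hipMalloc(&buf, 1024 * sizeof(float));
    if (err == cudaSuccess) cudaFree(buf);
    //  HIP:  if (err == hipSuccess) hipFree(buf);
    return 0;
}
```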

We set out to give any organisation with a CUDA codebase a real choice of hardware platform, without needing to rewrite any code, and without the migration introducing any new bugs into their software. In this way, SCALE was born: a toolkit that allows CUDA applications to be natively compiled for, and automatically optimised for, AMD GPUs.

SCALE is a unified cross-vendor platform that will constitute the future of GPGPU programming. CUDA programmers will be able to devote their time to adding new features to their codebases, rather than rewriting them for cross-vendor support. The market will be diversified: new vendors will be able to enter the market and existing CUDA code will just work on their platforms. Existing vendors will be forced to compete based on the value their hardware actually brings to end-users, rather than achieving dominance through software lock-in.

Is SCALE really 100% CUDA-compatible?

Yes. We even built support for inline PTX. But there’s some nuance: there are multiple dialects of CUDA, and we built SCALE without any reference to the NVIDIA implementations (as required by the NVIDIA licence agreements). Full compatibility therefore requires us to:

  1. Achieve bug-for-bug compatibility with the dialect of CUDA used, and

  2. Support all the undocumented features of the dialect of CUDA used, as well as the documented ones.

This might sound complicated. However, it’s actually quite straightforward: we offer two compilers (one for each of the two major CUDA dialects) and our automated test framework includes all the tests for a large number of third-party projects (in addition to a huge number of our own tests!). This enables us to understand what semantics people expect from CUDA code in the real world, and make sure that SCALE provides those semantics.
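As for inline PTX, here is a generic example of the feature in question: a standard, documented CUDA idiom (reading the %laneid special register via an asm block) that a fully compatible toolchain must accept and translate correctly for non-NVIDIA targets.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel using inline PTX: the asm block reads the %laneid special
// register. This is ordinary CUDA source, yet it embeds NVIDIA assembly,
// so supporting it requires translating PTX, not just CUDA C++.
__global__ void lane_ids(unsigned* out) {
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[threadIdx.x] = lane;
}

int main() {
    unsigned* out;
    cudaMallocManaged(&out, 32 * sizeof(unsigned));
    lane_ids<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    printf("lane of thread 5: %u\n", out[5]);  // 5 on NVIDIA hardware
    cudaFree(out);
    return 0;
}
```

Note that warp/wavefront widths differ across vendors, which is exactly the kind of semantic detail a compatible implementation has to get right.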

In a way, whenever someone writes a unit test in an open-source CUDA project, they’re also writing a unit test for SCALE. We benefit from a huge number of third-party unit tests that allow us to ensure the correctness of SCALE, without relying on any NVIDIA tools.

However, SCALE is still in beta status: there are currently some known defects in SCALE where we do not yet achieve perfect compatibility with CUDA. We have taken care to document these defects, and are working hard to get them fixed. There are also a number of missing CUDA APIs and features, which we’re continuing to add to SCALE.

And there’s really no slowdown when using SCALE?

There’s no inherent reason that code compiled with SCALE should be slower than code compiled with NVIDIA’s toolchain. On the contrary: SCALE has the potential to outperform NVIDIA’s toolset.

However, it is important to remember that AMD hardware is not NVIDIA hardware. There’ll be some things that AMD can do better, and some things that NVIDIA can do better. This is the great thing about SCALE facilitating cross-vendor support: it will drive forward performance of the underlying hardware as vendors are forced to compete based on the features they can offer in silicon.

Lots of CUDA code has been heavily optimised for NVIDIA GPUs (and, in fact, for specific NVIDIA targets). Therefore, it’s likely that some performance tuning will be necessary when migrating to AMD GPUs, just as tuning would be necessary when migrating to a newer model of NVIDIA GPU.

Code compiled using SCALE will work correctly on an AMD GPU straight away. Performance tuning is then an ongoing process that continues as new generations of hardware arrive, and adding a few #ifdefs to your code will always be cheaper than a complete rewrite.
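The kind of #ifdef meant here might look like the following. This is a hypothetical sketch: the macro name __SCALE_AMD_TARGET__ is invented for illustration (consult your toolchain's documentation for the real predefined symbols), and the right tile sizes depend entirely on your kernel and hardware.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-vendor tuning knob. The macro name below is invented
// for illustration; substitute whatever your compiler actually defines
// when targeting AMD hardware.
#if defined(__SCALE_AMD_TARGET__)
constexpr int TILE = 64;   // e.g. wavefront-sized tiles on an AMD target
#else
constexpr int TILE = 32;   // e.g. warp-sized tiles on NVIDIA targets
#endif

__global__ void tiled_copy(const float* in, float* out, int n) {
    int i = blockIdx.x * TILE + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    float *in, *out;
    cudaMallocManaged(&in, 1024 * sizeof(float));
    cudaMallocManaged(&out, 1024 * sizeof(float));
    tiled_copy<<<(1024 + TILE - 1) / TILE, TILE>>>(in, out, 1024);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```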

Further advantages of SCALE

There’s a further advantage to our independence from NVIDIA and other GPU vendors: we can add new features to CUDA, while still maintaining 100% compatibility with it. We’re passionate about enabling programming productivity through the use of features that are expected in a modern programming language, and using static analysis to find as many problems in code as possible at compile time.

We’ve already announced a few of the features that SCALE adds to CUDA (including the option to use exceptions instead of C-style return codes), and there are many more to come in the near future.
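For contrast, here is the C-style error handling that standard CUDA requires today. The CUDA_CHECK macro below is a common community idiom, not part of any official API; the comment sketches the idea behind an exception-based alternative without claiming SCALE's actual interface, which we haven't detailed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The classic C-style idiom: every API call returns a cudaError_t that
// the caller must remember to check. This wrapper macro is a widely used
// community convention, not an official CUDA API.
#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err_ = (call);                            \
        if (err_ != cudaSuccess) {                            \
            fprintf(stderr, "%s failed: %s\n", #call,         \
                    cudaGetErrorString(err_));                \
            return 1;                                         \
        }                                                     \
    } while (0)

int main() {
    float* buf = nullptr;
    CUDA_CHECK(cudaMalloc(&buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(buf));
    // With an exception-enabled dialect, the idea is that a failing call
    // would throw instead, eliminating the boilerplate above. The exact
    // opt-in mechanism and exception types are toolkit-specific and not
    // shown here.
    return 0;
}
```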

Conclusions

Even in its current beta status, SCALE facilitates the diversification of the GPGPU market by enabling CUDA code to be compiled for AMD GPUs. This enables AMD customers to use a huge number of projects that were previously inaccessible to them, and enables everyone to benefit from a choice of GPU platform based on the underlying performance of the hardware.

SCALE is not an emulator or a translation layer. The performance of code compiled with SCALE can therefore match, and even has the potential to exceed, that of code compiled using NVIDIA’s own toolchain. SCALE is a true cross-platform solution and maintains independence from NVIDIA, AMD, and other hardware vendors.

We see a future where organisations can speed up their codebases by picking the vendor that provides the best hardware features for them. We see a future where organisations can support their users regardless of which hardware platform their users opt for. We see a future where vendors focus on novel hardware developments, and new vendors can break into the GPGPU market. And we’re really excited about this future!