Write CUDA Code. Run Everywhere.
Your CUDA skills are now universal. SCALE compiles your unmodified applications to run natively on any accelerator, ending the nightmare of maintaining multiple codebases.
```shell
nvcc my_app.cu -o my_app_nvidia
nvcc my_app.cu -o my_app_portable
```

What is SCALE?
Decoupling Code from Silicon.
CPU developers don't rewrite their software for every new chip architecture—they simply recompile. SCALE brings this standard of portability to GPU computing.
It is a comprehensive toolkit—combining a cross-compiler, drop-in libraries, and language extensions—that acts as an agnostic interface between your code and the hardware.
With SCALE, you can take your existing HPC applications (starting with CUDA) and deploy them to your accelerated compute platform of choice.
Your CUDA codebase is the single source of truth with zero porting required.
Unlock new efficiency gains through advanced software optimization, not just hardware upgrades.
Access enhanced UX and language extensions built by HPC developers, for HPC developers.
Break vendor lock-in and choose hardware based on lower cost or higher performance.
True Compilation, Not Emulation
SCALE compiles CUDA source code directly to native machine instructions for GPUs, delivering native performance with no intrinsic overhead.
Why SCALE Over Other Solutions?
| | Our Approach | Auto Source-to-Source: HIPIFY | Alternative Languages: OpenCL |
|---|---|---|---|
| Codebase | Single CUDA codebase | Two+ codebases to maintain | Complete rewrite needed |
| Process | Direct compilation | Fragile source translation | New language, new ecosystem |
| Result | “Just make CUDA work” | A “compatibility tax” on developers | Abandons existing CUDA investment |
Native Performance on AMD Hardware
SCALE often outperforms existing solutions, and we're just getting started.
Rodinia Benchmarks: SCALE vs HIP
Speed-up over HIP on AMD Instinct MI300x
Fixing Common PTX Pitfalls
Inline PTX asm is common in CUDA programs, because it is the only way to access certain valuable features. However, NVIDIA's compiler provides virtually no validation for this part of the language. Since we have to parse it to compile it for AMD, we also provide proper warnings/errors, making this dark corner of the language much easier to work with.
Trivial Mistakes
Even trivial mistakes are a pain with NVCC:
Truncated Pointer
A common mistake is to pass a 64-bit C++ pointer into a PTX asm block through a 32-bit register constraint, silently truncating the address.
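A minimal sketch of this pitfall (hypothetical kernel, not taken from the SCALE docs): the `"r"` constraint maps to a 32-bit PTX register, so passing a pointer through it truncates the address, while the 64-bit `"l"` constraint is what pointers need.

```cuda
__global__ void writeAnswer(int *ptr) {
    // BUG: "r" is a 32-bit register constraint, so the 64-bit pointer
    // is truncated before the store. NVCC accepts this silently.
    asm volatile("st.global.u32 [%0], %1;" :: "r"(ptr), "r"(42) : "memory");

    // Fix: "l" is the 64-bit register constraint appropriate for pointers.
    asm volatile("st.global.u32 [%0], %1;" :: "l"(ptr), "r"(42) : "memory");
}
```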
Multiple Definitions
A function that declares a PTX variable but is inlined repeatedly will cause strange errors, because each inlined copy duplicates the variable declaration.
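A sketch of how this arises, using a hypothetical helper: the asm block declares a named PTX register, so once the function is inlined more than once into the same caller, the declaration is emitted twice in one function body.

```cuda
__device__ __forceinline__ int timesTwo(int x) {
    int out;
    // Declares a named PTX register inside the asm block. Each inlined
    // copy re-emits ".reg .u32 scratch;", so ptxas rejects the
    // duplicate declaration when the caller inlines this twice.
    asm(".reg .u32 scratch;\n\t"
        "add.u32 scratch, %1, %1;\n\t"
        "mov.u32 %0, scratch;"
        : "=r"(out) : "r"(x));
    return out;
}

__global__ void kernel(int *p) {
    // Two inlined copies -> duplicate definition of `scratch`.
    p[0] = timesTwo(p[0]);
    p[1] = timesTwo(p[1]);
}
```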
Missing Semicolon

Even a PTX statement missing its trailing semicolon produces wildly different diagnostics under each toolchain:

```cuda
__device__ int ptxAdd(int x, int y) {
    int out;
    // Note: the PTX statement below is missing its trailing semicolon.
    asm("add.u32 %0, %1, %2" : "=r"(out) : "r"(x), "r"(y));
    return out;
}
```

SCALE points straight at the problem:

```
error: missing semicolon in inline PTX
    4 |     asm("add.u32 %0, %1, %2" : "=r"(out) : "r"(x), "r"(y));
      |     ^
```

NVCC, by contrast, fails deep inside ptxas with an unrelated-looking message:

```
ptxas /tmp/tmpxft_001e4e3c_00000000-6_add.ptx, line 28; fatal : Parsing error near 'st': syntax error
ptxas fatal : Ptx assembly aborted due to errors
```
Compiler Feedback You'll Actually Love
Get clear, actionable diagnostics that help you pinpoint issues faster. If you've ever been stumped by a cryptic nvcc error, we're sorry and we feel you.
```cuda
#include <cstdio>

__global__ void hello() {
    printf("Hello, world\n");
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("CUDA Device: %s\n", prop.name);
    hello<<<1,1>>>();
    cudaDeviceSynchronize();
}
```
Compiling and running with SCALE on an AMD GPU yields clear, clang-style warnings:

```
deviceinfo.cu:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     cudaGetDeviceProperties(&prop, 0);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
deviceinfo.cu:14:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   14 |     cudaDeviceSynchronize();
      |     ^~~~~~~~~~~~~~~~~~~~~~~
2 warnings generated when compiling for gfx90a.
deviceinfo.cu:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     cudaGetDeviceProperties(&prop, 0);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
deviceinfo.cu:14:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   14 |     cudaDeviceSynchronize();
      |     ^~~~~~~~~~~~~~~~~~~~~~~
2 warnings generated when compiling for host.
CUDA Device: AMD Instinct MI210 - gfx90a (AMD) <amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack->
Hello, world
```

The same source built with NVCC on an NVIDIA GPU:

```
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
CUDA Device: NVIDIA GeForce RTX 3080 Ti
Hello, world
```
Free for non-commercial purposes.
Paid license for commercial use; available for design partnership and support.
Research & Non-Commercial
For non-commercial, educational, and research purposes on all client, workstation and data-center GPUs.
Commercial
Standard license for commercial deployment and use. Contact us for pricing.
Enterprise
Collaborate with our team on custom solutions, optimizations, dedicated support, and roadmap prioritization.
Frequently Asked Questions
If this section doesn't answer your question, please check the FAQ in the official documentation or reach out to us on social media.
SCALE is free for non-commercial use including research and academia. For commercial use, a license agreement is required. Read more here.
SCALE does not currently support PyTorch, but support is in development and estimated for release in early Q2 2026. For a detailed overview of the currently supported CUDA projects, see this table from our validation suite.
SCALE supports a wide range of both consumer and enterprise GPUs, and will support more in the future. For a detailed overview, see this section of the official SCALE documentation.
In many cases, yes, it does. Reducing compute costs can be a good reason to choose SCALE. For the latest performance benchmarks, see this section of our website.
SCALE is centered around CUDA: it lets you write your code once and run it everywhere, with zero rewrites. It is a drop-in replacement for nvcc. For a full explanation of SCALE's differentiators, see this section of our technical documentation.
By design, SCALE does not infringe NVIDIA’s EULAs or copyright. We think CUDA is amazing and we follow the guidelines set by NVIDIA. Check out this post for more information.
Unlock Your Potential with SCALE
Discover how our innovative solutions can transform your workflow and enhance your productivity today.
Running GPU-Optimised Monte Carlo (GOMC) on an AMD GPU using SCALE
A look at the experience of using SCALE to build and run GOMC, an existing CUDA-enabled project.
- AMD
- Developer Experience
- CUDA
Socials
We're also on other platforms. Connect with us everywhere else.
SCALE Community
Join us on our Discord server: Chat with the team, get help, and see what others are building.
Join the discussion

r/CUDAUnlocked
A community dedicated to running CUDA code on any GPU and accelerated platforms.
Join the subreddit

@SpectralCom
Follow us on X (formerly Twitter) for the latest updates, news, and insights from the SCALE team.
Follow us

Our Professional Hub

Follow our page for official company news, industry insights, and career opportunities at the forefront of hardware freedom.
Follow us