
Fixed: CUDA error: an illegal memory access was encountered

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This typically means:

  • Some operation on the GPU is invalid or unsafe, often in float16 under AMP.
  • It causes a crash at scaler.scale(loss).backward() or elsewhere in the backward pass.
  • The real error often happens earlier than where it’s reported.

Let’s go over a realistic, reproducible mock example in which a custom loss performs a GPU operation that is unsafe under AMP (float16), and the crash surfaces at scaler.scale(loss).backward().

After that, we’ll go through an example where:

  • You compute two losses: one works fine under AMP, the other doesn’t.
  • You combine them into total_loss = loss1 + loss2.
  • But using scaler.scale(total_loss).backward() crashes with: RuntimeError: CUDA error: an illegal memory access was encountered

Then we’ll fix it by isolating the unsafe loss.

After that, we’ll list which operations are unsafe in mixed precision (float16).


🧪 Mock Example (that causes this error)

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

# 🧨 Unsafe loss: uses in-place mask indexing, which is not AMP-safe
def unsafe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error[mask] = 0  # 🧨 This in-place write with a mask on half precision can crash
    return (error ** 2).mean()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()
    
    with autocast(dtype=torch.float16):
        output = model(x)
        loss = unsafe_loss_fn(output, target)  # 🧨 AMP can't handle this properly

    # 💥 This will crash with illegal memory access
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

πŸ› οΈ How to Fix It

✅ Option 1: Disable AMP just for the loss

with autocast(dtype=torch.float16):
    output = model(x)

# 👉 Compute the loss in full precision
with autocast(enabled=False):
    loss = unsafe_loss_fn(output.float(), target.float())

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

✅ Option 2: Fix the loss function to avoid in-place ops

def safe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error = error.masked_fill(mask, 0.0)  # out-of-place: no masked in-place write
    return (error ** 2).mean()

Then continue using AMP as normal.
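
For completeness, here is the same training loop as before with safe_loss_fn swapped in; everything else stays under autocast exactly as shown earlier:

for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        output = model(x)
        loss = safe_loss_fn(output, target)  # no masked in-place write

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()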


🧠 Debugging Tips

If you’re not sure where the error is coming from:

# Will give exact line where the crash occurs
CUDA_LAUNCH_BLOCKING=1 python train.py

This forces all CUDA operations to run synchronously, so the reported stack trace points at the call that actually failed.
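
If you prefer setting the flag from inside the script, you can export it from Python before CUDA is initialized (a minimal sketch of how train.py might start):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch  # CUDA work launched after this point runs synchronously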


| Cause | Fix |
| --- | --- |
| In-place ops with masks in AMP | Avoid the masked in-place write (masked_fill / torch.where) or move the loss to float32 |
| Indexing with bool tensors in AMP | Use float32 |
| Cross-device tensors | Make sure all tensors are on the same .cuda() device |
| Unstable ops in a custom loss | Run the loss under autocast(enabled=False) |
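
For the cross-device row above, a quick sanity check saves a lot of guessing; assert_same_device is just a hypothetical helper written for this sketch:

def assert_same_device(*tensors):
    # Collect every distinct device and fail loudly if there is more than one
    devices = {t.device for t in tensors}
    assert len(devices) == 1, f"Tensors live on multiple devices: {devices}"

assert_same_device(x, target, next(model.parameters()))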

💥 Mock Example: Crashes When Combining Two Losses in AMP

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

mse_loss = nn.MSELoss()

# 🧨 Problematic loss: causes illegal memory access in AMP
def unsafe_custom_loss(pred, target):
    diff = pred - target
    mask = (diff > 0.2)
    diff[mask] = 0  # ❌ in-place write using bool mask (bad in float16)
    return diff.abs().sum()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)  # ✅ safe in AMP
        loss2 = unsafe_custom_loss(out, target)  # ❌ crashes in AMP

        total_loss = loss1 + loss2

    # 💥 This will throw: CUDA error: illegal memory access
    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()

✅ Fix: Run the Unsafe Loss in float32 Only

for epoch in range(1):
    optimizer.zero_grad()

    # Safe forward pass with AMP
    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)

    # Unsafe loss computed outside AMP in float32
    with autocast(enabled=False):
        loss2 = unsafe_custom_loss(out.float(), target.float())

    # Combine in float32
    total_loss = loss1 + loss2

    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()

🧠 Explanation

| Component | AMP-safe? | Fix |
| --- | --- | --- |
| nn.MSELoss | ✅ | Safe inside autocast |
| diff[mask] = 0 | ❌ | Breaks in float16; use .float() and autocast(enabled=False) |
| total_loss = loss1 + loss2 | ✅ | Works as long as loss2 is float32 |

✅ Summary

When combining multiple loss terms:

  • Run each loss where it’s compatible:
    • Keep AMP-safe ones under autocast.
    • Move unsafe ones outside AMP with autocast(enabled=False) and .float().
  • This lets you retain the benefits of mixed precision while avoiding crashes.

Understanding which operations are unsafe in mixed precision (float16) helps you write stable, AMP-compatible code without hitting runtime errors like:

RuntimeError: CUDA error: an illegal memory access was encountered

🔥 Commonly Unsafe or Risky Operations in float16 (AMP)

These ops often break or behave poorly in autocast(float16):


🧨 1. In-place tensor updates using boolean masks

x[mask] = 0  # 🚫 not safe in float16 if `mask` is boolean
  • This is a common cause of illegal memory access under AMP.
  • ✅ Fix: avoid in-place ops, or move to a float32 context (see the sketch below).
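
A minimal out-of-place alternative, assuming the goal is simply to zero the masked entries:

# torch.where builds a new tensor instead of writing through the boolean mask
x = torch.where(mask, torch.zeros_like(x), x)

# ...or keep the in-place form but apply it to a float32 copy
x32 = x.float()
x32[mask] = 0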

🧨 2. Boolean indexing

selected = x[mask]  # 🚫 float16 + boolean masks = crash risk
  • AMP does not support all internal CUDA kernels for mixed precision + bool masking.
  • ✅ Fix: move to .float() before using boolean masks, as shown below.
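
A small sketch of that workaround:

selected = x.float()[mask]   # index a float32 copy instead of the float16 tensor
selected = selected.half()   # optionally cast back if the rest of the pipeline expects float16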

🧨 3. Advanced indexing or slicing with mismatched types

x[torch.nonzero(mask)]  # 🚫 risky under AMP when float16 data meets awkward index types
  • Especially problematic when combining float16 tensors with long/byte/boolean indices; one workaround is sketched below.
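
One way to keep the index types explicit is to build long indices and apply them to a float32 copy (a sketch, not the only option):

idx = torch.nonzero(mask, as_tuple=True)  # tuple of long index tensors
selected = x.float()[idx]                 # advanced indexing on a float32 copy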

🧨 4. Reductions on half precision tensors with large values

loss = (x ** 2).sum()  # 🚫 can overflow in float16
  • Float16 has very limited dynamic range.
  • ✅ Fix: convert to float32 before the reduction if large values are possible (see below).
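
For example, upcast just for the reduction:

loss = (x.float() ** 2).sum()  # accumulate in float32; float16 overflows above ~65504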

🧨 5. Perceptual losses / LPIPS / complex multi-stage functions

lpips_loss = lpips_model(pred, target)  # 🚫 often requires float32 internally
  • Many prebuilt models (e.g. LPIPS, SSIM, MONAI metrics) crash or give NaN in float16.
  • ✅ Fix: run these outside AMP using autocast(enabled=False), as in the sketch below.
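
A sketch of that pattern; lpips_model stands in for whatever pretrained perceptual metric you are using, with pred and target as its inputs:

# Keep the perceptual metric in full precision, outside autocast
with autocast(enabled=False):
    lpips_loss = lpips_model(pred.float(), target.float())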

🧨 6. Division with small values

x / y  # 🚫 can underflow in float16
  • Float16 can lose precision easily with small divisors.
  • ✅ Fix: use .float() if dividing by small numbers (sketch below).
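
A sketch assuming y is a non-negative quantity; eps is an arbitrary small constant you would tune for your data:

eps = 1e-6
result = x.float() / y.float().clamp_min(eps)  # divide in float32 and keep the divisor away from zero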

🧨 7. Custom CUDA kernels or extensions

  • If you’re using a compiled CUDA op (e.g., from a research repo) that doesn’t support float16, it will crash.
  • ✅ Fix: check whether the kernel supports mixed precision; otherwise run it outside AMP (see the sketch below).
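
A minimal sketch; my_custom_op is a placeholder for whichever compiled extension you are calling, with a and b as its inputs:

# Run the float32-only kernel outside autocast, then keep training as usual
with autocast(enabled=False):
    out = my_custom_op(a.float(), b.float())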

✅ General Rule of Thumb

| Operation Type | AMP-Safe (float16)? | Notes |
| --- | --- | --- |
| Standard convs, linear layers | ✅ Yes | Supported under AMP |
| ReLU, GELU, Softmax | ✅ Yes | Safe |
| nn.MSELoss, nn.CrossEntropyLoss | ✅ Yes | Fully supported |
| Masked operations, slicing | ❌ Risky | Use in float32 |
| Reductions (sum, mean) | ❌ Risky if large | Watch for overflow |
| Non-differentiable ops (e.g., .item(), print) | ✅ | Safe but ignored by backward |

✅ Safe Pattern

with autocast(dtype=torch.float16):
    # safe model forward + supported losses
    output = model(input)
    loss1 = mse_loss(output, target)

# unsafe or risky loss outside AMP
with autocast(enabled=False):
    loss2 = unsafe_custom_loss(output.float(), target.float())

total_loss = loss1 + loss2
scaler.scale(total_loss).backward()
