RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This typically means:
- Some operation on the GPU is invalid or unsafe, often in float16 under AMP.
- It causes crashes in scaler.scale(loss).backward() or during the backward pass.
- The real error often happens earlier than where it’s reported.
Let’s go over a realistic mock example where:
- Some operation on the GPU is invalid or unsafe, often in float16 under AMP.
- It causes a crash in scaler.scale(loss).backward() or elsewhere in the backward pass.
After that, we’ll go through an example where:
- You compute two losses: one works fine under AMP, the other doesn’t.
- You combine them into total_loss = loss1 + loss2.
- But calling scaler.scale(total_loss).backward() crashes with: RuntimeError: CUDA error: an illegal memory access was encountered
Then we’ll fix it by isolating the unsafe loss.
Finally, we’ll list which operations are unsafe in mixed precision (float16).
🧪 Mock Example (that causes this error)
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

# 🧨 Unsafe loss: uses in-place mask indexing, which is not AMP-safe
def unsafe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error[mask] = 0  # 🧨 This in-place write with a mask on half precision can crash
    return (error ** 2).mean()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        output = model(x)
        loss = unsafe_loss_fn(output, target)  # 🧨 AMP can't handle this properly

    # 💥 This will crash with illegal memory access
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
🛠️ How to Fix It
✅ Option 1: Disable AMP just for the loss
with autocast(dtype=torch.float16):
    output = model(x)

# 👉 Compute loss in full precision
with autocast(enabled=False):
    loss = unsafe_loss_fn(output.float(), target.float())

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
✅ Option 2: Fix the loss function to avoid in-place ops
def safe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error = error.masked_fill(mask, 0)  # out-of-place: no masked in-place write
    return (error ** 2).mean()
Then continue using AMP as normal.
🧠 Debugging Tips
If you’re not sure where the error is coming from:
# Will give exact line where the crash occurs
CUDA_LAUNCH_BLOCKING=1 python train.py
This forces all CUDA operations to run synchronously, so the stack trace points at the kernel that actually failed.
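If you can’t easily change how the script is launched, you can also set the variable at the top of your entry script. This is a minimal sketch; it only works if it runs before anything initializes CUDA:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before any CUDA context is created

import torch  # import torch (and run all CUDA work) only after the variable is set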
| Cause | Fix |
|---|---|
| In-place masked writes under AMP | Use an out-of-place op (e.g. masked_fill) or compute the loss in float32 |
| Indexing with bool tensors in AMP | Index a float32 copy of the tensor |
| Cross-device tensors | Make sure all tensors are on the same CUDA device (see the check below) |
| Unstable ops in custom loss | Run loss in autocast(enabled=False) |
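For the cross-device case, a quick sanity check before computing the loss can save a lot of debugging. A minimal sketch, reusing the output and target tensors from the example above:
devices = {"output": output.device, "target": target.device}
# All tensors involved in the loss should report the same device
assert len(set(devices.values())) == 1, f"Tensors live on different devices: {devices}"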
💥 Mock Example: Crashes When Combining Two Losses in AMP
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

mse_loss = nn.MSELoss()

# 🧨 Problematic loss: causes illegal memory access in AMP
def unsafe_custom_loss(pred, target):
    diff = pred - target
    mask = (diff > 0.2)
    diff[mask] = 0  # ❌ in-place write using bool mask (bad in float16)
    return diff.abs().sum()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)            # ✅ safe in AMP
        loss2 = unsafe_custom_loss(out, target)  # ❌ crashes in AMP
        total_loss = loss1 + loss2

    # 💥 This will throw: CUDA error: illegal memory access
    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()
✅ Fix: Run the Unsafe Loss in float32 Only
for epoch in range(1):
    optimizer.zero_grad()

    # Safe forward pass with AMP
    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)

    # Unsafe loss computed outside AMP in float32
    with autocast(enabled=False):
        loss2 = unsafe_custom_loss(out.float(), target.float())

    # Combine in float32
    total_loss = loss1 + loss2

    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()
🧠 Explanation
| Component | AMP-safe? | Fix |
|---|---|---|
| nn.MSELoss | ✅ | Safe inside autocast |
| diff[mask] = 0 | ❌ | Breaks in float16; use .float() and autocast(enabled=False) |
| total_loss = loss1 + loss2 | ✅ | Works as long as loss2 is float32 |
✅ Summary
When combining multiple loss terms:
- Run each loss where it’s compatible:
  - Keep AMP-safe ones under autocast.
  - Move unsafe ones outside AMP with autocast(enabled=False) and .float().
- This lets you retain the benefits of mixed precision while avoiding crashes.
Understanding which operations are unsafe in mixed precision (float16) helps you write stable, AMP-compatible code without hitting runtime errors like:
RuntimeError: CUDA error: an illegal memory access was encountered
🔥 Commonly Unsafe or Risky Operations in float16 (AMP)
These ops often break or behave poorly in autocast(float16):
🧨 1. In-place tensor updates using boolean masks
x[mask] = 0 # 🚫 not safe in float16 if `mask` is boolean
- This is a common cause of illegal memory access under AMP.
- ✅ Fix: avoid in-place ops, or move to float32 context.
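For example, an out-of-place alternative with the same semantics (a minimal sketch, using the x and mask from the line above):
x = x.masked_fill(mask, 0)                      # returns a new tensor, no in-place masked write
x = torch.where(mask, torch.zeros_like(x), x)   # equivalent out-of-place formulation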
🧨 2. Boolean indexing
selected = x[mask] # 🚫 float16 + boolean masks = crash risk
- AMP does not support all internal CUDA kernels for mixed precision + bool masking.
- ✅ Fix: convert to .float() before using boolean masks.
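For instance (a minimal sketch):
selected = x.float()[mask]   # index a float32 copy of the tensor
selected = selected.half()   # optional: cast back to float16 afterwards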
🧨 3. Advanced indexing or slicing with mismatched types
x[torch.nonzero(mask)] # 🚫 risky under AMP if mask or indexing is not float32
- Especially problematic when combining float16 tensors with long/byte/boolean indices.
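One way to keep the indexing well-defined is to use explicit long indices on a float32 copy. A sketch, assuming a 1-D mask:
idx = torch.nonzero(mask, as_tuple=False).squeeze(-1)   # explicit long index tensor
selected = x.float().index_select(0, idx)               # gather from a float32 copy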
🧨 4. Reductions on half precision tensors with large values
loss = (x ** 2).sum() # 🚫 can overflow in float16
- Float16 has very limited dynamic range.
- ✅ Fix: convert to float32 before the reduction if large values are possible.
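For example (a minimal sketch):
loss = (x.float() ** 2).sum()   # accumulate in float32 to avoid float16 overflow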
🧨 5. Perceptual losses / LPIPS / complex multi-stage functions
lpips_loss = lpips_model(pred, target) # 🚫 often requires float32 internally
- Many prebuilt models (e.g. LPIPS, SSIM, MONAI metrics) crash or give NaN in float16.
- ✅ Fix: run these outside AMP using autocast(enabled=False).
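For example (a sketch; lpips_model stands in for whatever perceptual metric you use):
with autocast(enabled=False):
    lpips_loss = lpips_model(pred.float(), target.float())  # run the metric in full precision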
🧨 6. Division with small values
x / y # 🚫 can underflow in float16
- Float16 can lose precision easily with small divisors.
- ✅ Fix: use .float() if dividing by small numbers.
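For example (a minimal sketch; the epsilon guard assumes a non-negative divisor and is an illustrative choice):
result = x.float() / y.float()                   # divide in float32
result = x.float() / y.float().clamp_min(1e-8)   # optionally guard against tiny divisors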
🧨 7. Custom CUDA kernels or extensions
- If you’re using a compiled CUDA op (e.g., from a research repo) and it doesn’t support float16, it will crash.
- ✅ Fix: check whether the kernel supports mixed precision; otherwise run it outside AMP.
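For example (a sketch; my_custom_op is a hypothetical compiled extension that only handles float32):
with autocast(enabled=False):
    out = my_custom_op(a.float(), b.float())  # feed the float16-unaware kernel float32 inputs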
✅ General Rule of Thumb
| Operation Type | AMP-Safe (float16)? | Notes |
|---|---|---|
| Standard convs, linear layers | ✅ Yes | Supported under AMP |
| ReLU, GELU, Softmax | ✅ Yes | Safe |
| nn.MSELoss, nn.CrossEntropyLoss | ✅ Yes | Fully supported |
| Masked operations, slicing | ❌ Risky | Use in float32 |
| Reductions (sum, mean) | ❌ Risky if large | Watch for overflow |
| Non-differentiable ops (e.g., .item(), print) | ✅ Yes | Safe, but ignored by backward |
✅ Safe Pattern
with autocast(dtype=torch.float16):
    # safe model forward + supported losses
    output = model(input)
    loss1 = mse_loss(output, target)

# unsafe or risky loss outside AMP
with autocast(enabled=False):
    loss2 = unsafe_custom_loss(output.float(), target.float())

total_loss = loss1 + loss2
scaler.scale(total_loss).backward()