RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This typically means:
- Some operation on the GPU is invalid or unsafe, often in float16 under AMP.
- It causes crashes in scaler.scale(loss).backward() or during backward pass.
- The real error often happens earlier than where it’s reported.
Let’s go over a realistic and mockable example where:
- Some operation on the GPU is invalid or unsafe under AMP in `float16`.
- It causes a crash in `scaler.scale(loss).backward()` or during the backward pass.
After that, we’ll go through an example where:
- You compute two losses: one works fine under AMP, the other doesn’t.
- You combine them into `total_loss = loss1 + loss2`.
- But calling `scaler.scale(total_loss).backward()` crashes with: RuntimeError: CUDA error: an illegal memory access was encountered
Then we’ll fix it by isolating the unsafe loss.
After that, we’ll list which operations are unsafe in mixed precision (float16).
🧪 Mock Example (that causes this error)
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

# 🧨 Unsafe loss: uses in-place mask indexing, which is not AMP-safe
def unsafe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error[mask] = 0  # 🧨 This in-place write with a mask on half precision can crash
    return (error ** 2).mean()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        output = model(x)
        loss = unsafe_loss_fn(output, target)  # 🧨 AMP can't handle this properly

    # 💥 This will crash with illegal memory access
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
🛠️ How to Fix It
✅ Option 1: Disable AMP just for the loss
with autocast(dtype=torch.float16):
    output = model(x)

# Compute loss in full precision
with autocast(enabled=False):
    loss = unsafe_loss_fn(output.float(), target.float())

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
✅ Option 2: Fix the loss function to avoid in-place ops
def safe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error = error.clone()  # work on a copy so the masked write below never touches the autocast output in place
    error[mask] = 0
    return (error ** 2).mean()
Then continue using AMP as normal.
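For completeness, here is a minimal sketch of the same training loop, unchanged from the mock example above except that it calls `safe_loss_fn`:

for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        output = model(x)
        loss = safe_loss_fn(output, target)  # safe variant from Option 2

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()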
🧠 Debugging Tips
If you’re not sure where the error is coming from:
# Will give exact line where the crash occurs
CUDA_LAUNCH_BLOCKING=1 python train.py
This forces all CUDA operations to run synchronously, so the reported stack trace points at the call that actually failed.
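If changing the launch command is inconvenient, the same flag can be set at the very top of the script instead (a minimal sketch; the variable must be set before anything creates a CUDA context):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before any CUDA work starts

import torch  # imported after the variable is set, so kernel launches run synchronously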
Cause | Fix |
---|---|
In-place ops with masks in AMP | Use `.clone()` or move the loss to float32 |
Indexing with bool tensors in AMP | Do the indexing in float32 |
Cross-device tensors | Make sure all tensors are on the same CUDA device (see the check below) |
Unstable ops in custom loss | Run the loss under `autocast(enabled=False)` |
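For the cross-device row, a small hypothetical helper (not part of the examples above) can catch mismatched tensors before the backward pass:

def assert_same_device(*tensors):
    # Fail fast with a readable message instead of a late, cryptic CUDA error
    devices = {t.device for t in tensors}
    assert len(devices) == 1, f"tensors live on multiple devices: {devices}"

assert_same_device(x, target, *model.parameters())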
💥 Mock Example: Crashes When Combining Two Losses in AMP
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

mse_loss = nn.MSELoss()

# 🧨 Problematic loss: causes illegal memory access in AMP
def unsafe_custom_loss(pred, target):
    diff = pred - target
    mask = (diff > 0.2)
    diff[mask] = 0  # ❌ in-place write using bool mask (bad in float16)
    return diff.abs().sum()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)            # ✅ safe in AMP
        loss2 = unsafe_custom_loss(out, target)  # ❌ crashes in AMP
        total_loss = loss1 + loss2

    # 💥 This will throw: CUDA error: illegal memory access
    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()
✅ Fix: Run the Unsafe Loss in float32 Only
for epoch in range(1):
    optimizer.zero_grad()

    # Safe forward pass with AMP
    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)

    # Unsafe loss computed outside AMP in float32
    with autocast(enabled=False):
        loss2 = unsafe_custom_loss(out.float(), target.float())

    # Combine in float32
    total_loss = loss1 + loss2

    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()
🧠 Explanation
Component | AMP-safe? | Fix |
---|---|---|
`nn.MSELoss` | ✅ | Safe inside autocast |
`diff[mask] = 0` | ❌ | Breaks in float16; use `.float()` and `autocast(enabled=False)` |
`total_loss = loss1 + loss2` | ✅ | Works as long as loss2 is float32 |
✅ Summary
When combining multiple loss terms:
- Run each loss where it’s compatible:
  - Keep AMP-safe ones under `autocast`.
  - Move unsafe ones outside AMP with `autocast(enabled=False)` and `.float()`.
- This lets you retain the benefits of mixed precision while avoiding crashes.
Great question: understanding which operations are unsafe in mixed precision (float16) helps you write stable, AMP-compatible code without hitting runtime errors like:
RuntimeError: CUDA error: an illegal memory access was encountered
💥 Commonly Unsafe or Risky Operations in float16 (AMP)
These ops often break or behave poorly under `autocast(dtype=torch.float16)`:
🧨 1. In-place tensor updates using boolean masks
x[mask] = 0  # 🚫 not safe in float16 if `mask` is boolean
- This is a common cause of illegal memory access under AMP.
- ✅ Fix: avoid in-place ops, or move this step to a float32 context (see the sketch below).
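One out-of-place alternative, assuming `x` and `mask` as in the one-liner above, is `torch.where`, which stays differentiable and avoids writing through a boolean mask:

x = torch.where(mask, torch.zeros_like(x), x)  # build a new tensor, no masked in-place write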
🧨 2. Boolean indexing
selected = x[mask]  # 🚫 float16 + boolean masks = crash risk
- AMP does not support all internal CUDA kernels for mixed precision + bool masking.
- ✅ Fix: call `.float()` before using boolean masks (see the sketch below).
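A minimal sketch of that fix, assuming the same `x` and `mask`:

selected = x.float()[mask]  # cast to float32 first, then apply the boolean mask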
🧨 3. Advanced indexing or slicing with mismatched types
x[torch.nonzero(mask)]  # 🚫 risky under AMP when x is a float16 tensor
- Especially problematic when combining float16 tensors with long/byte/boolean indices.
🧨 4. Reductions on half-precision tensors with large values
loss = (x ** 2).sum()  # 🚫 can overflow in float16
- Float16 has a very limited dynamic range (its largest finite value is 65504).
- ✅ Fix: convert to float32 before the reduction if large values are possible (see the sketch below).
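A minimal sketch of the float32 version of the reduction above:

loss = (x.float() ** 2).sum()  # square and accumulate in float32 to avoid float16 overflow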
🧨 5. Perceptual losses / LPIPS / complex multi-stage functions
lpips_loss = lpips_model(pred, target)  # 🚫 often requires float32 internally
- Many prebuilt models (e.g. LPIPS, SSIM, MONAI metrics) crash or give NaN in float16.
- ✅ Fix: run these outside AMP using `autocast(enabled=False)` (see the sketch below).
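A minimal sketch, reusing the hypothetical `lpips_model` from the one-liner above:

with autocast(enabled=False):
    lpips_loss = lpips_model(pred.float(), target.float())  # float32 inputs, outside AMP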
🧨 6. Division with small values
x / y  # 🚫 can underflow in float16
- Float16 can lose precision easily with small divisors.
- ✅ Fix: use `.float()` if dividing by small numbers (see the sketch below).
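A minimal sketch, with a hypothetical `eps` floor added on top of the `.float()` cast (assuming `y` is positive):

eps = 1e-8  # hypothetical floor to keep the divisor away from zero
ratio = x.float() / y.float().clamp_min(eps)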
🧨 7. Custom CUDA kernels or extensions
- If you’re using a compiled CUDA op (e.g. from a research repo) that doesn’t support `float16`, it will crash.
- ✅ Fix: check whether the kernel supports mixed precision; otherwise run it outside AMP.
✅ General Rule of Thumb
Operation Type | AMP-Safe (float16)? | Notes |
---|---|---|
Standard convs, linear layers | ✅ Yes | Supported under AMP |
ReLU, GELU, Softmax | ✅ Yes | Safe |
`nn.MSELoss`, `nn.CrossEntropyLoss` | ✅ Yes | Fully supported |
Masked operations, slicing | ⚠️ Risky | Do them in float32 |
Reductions (sum, mean) | ⚠️ Risky if values are large | Watch for overflow |
Non-differentiable ops (e.g. `.item()`, `print`) | ✅ | Safe but ignored by backward |
✅ Safe Pattern
with autocast(dtype=torch.float16):
    # safe model forward + supported losses
    output = model(input)
    loss1 = mse_loss(output, target)

# unsafe or risky loss outside AMP
with autocast(enabled=False):
    loss2 = unsafe_custom_loss(output.float(), target.float())

total_loss = loss1 + loss2
scaler.scale(total_loss).backward()