RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
This typically means:
- Some operation on the GPU is invalid or unsafe, often in float16 under AMP.
- It causes crashes in scaler.scale(loss).backward() or during the backward pass.
- The real error often happens earlier than where it’s reported.
Let’s go over a realistic mock example where:
- Some operation on the GPU is invalid or unsafe, often in float16 under AMP.
- It causes a crash in scaler.scale(loss).backward() or elsewhere in the backward pass.
After that, we’ll go through an example where:
- You compute two losses: one works fine under AMP, the other doesn’t.
- You combine them into total_loss = loss1 + loss2.
- But calling scaler.scale(total_loss).backward() crashes with: RuntimeError: CUDA error: an illegal memory access was encountered
Then we’ll fix it by isolating the unsafe loss.
Finally, we’ll list which operations are unsafe in mixed precision (float16).
🧪 Mock Example (that causes this error)
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

# 🧨 Unsafe loss: uses in-place mask indexing, which is not AMP-safe
def unsafe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error[mask] = 0  # 🧨 This in-place write with a mask on half precision can crash
    return (error ** 2).mean()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        output = model(x)
        loss = unsafe_loss_fn(output, target)  # 🧨 AMP can't handle this properly

    # 💥 This will crash with illegal memory access
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
🛠️ How to Fix It
✅ Option 1: Disable AMP just for the loss
with autocast(dtype=torch.float16):
    output = model(x)

# 👉 Compute loss in full precision
with autocast(enabled=False):
    loss = unsafe_loss_fn(output.float(), target.float())

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
✅ Option 2: Fix the loss function to avoid in-place ops
def safe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error = error.masked_fill(mask, 0)  # out-of-place: no masked in-place write
    return (error ** 2).mean()
Then continue using AMP as normal.
🧠 Debugging Tips
If you’re not sure where the error is coming from:
# Will give exact line where the crash occurs
CUDA_LAUNCH_BLOCKING=1 python train.py
This forces all CUDA operations to run synchronously, so the stack trace points at the kernel that actually failed.
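If you can’t easily change how the script is launched, you can also set the variable at the top of your entry script. This is a minimal sketch; it only works if it runs before anything initializes CUDA:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before any CUDA context is created

import torch  # import torch (and run all CUDA work) only after the variable is set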
| Cause | Fix |
|---|---|
| In-place masked writes under AMP | Use an out-of-place op (e.g. masked_fill) or compute the loss in float32 |
| Indexing with bool tensors in AMP | Index a float32 copy of the tensor |
| Cross-device tensors | Make sure all tensors are on the same CUDA device (see the check below) |
| Unstable ops in custom loss | Run loss in autocast(enabled=False) |
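For the cross-device case, a quick sanity check before computing the loss can save a lot of debugging. A minimal sketch, reusing the output and target tensors from the example above:
devices = {"output": output.device, "target": target.device}
# All tensors involved in the loss should report the same device
assert len(set(devices.values())) == 1, f"Tensors live on different devices: {devices}"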
💥 Mock Example: Crashes When Combining Two Losses in AMP
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

mse_loss = nn.MSELoss()

# 🧨 Problematic loss: causes illegal memory access in AMP
def unsafe_custom_loss(pred, target):
    diff = pred - target
    mask = (diff > 0.2)
    diff[mask] = 0  # ❌ in-place write using bool mask (bad in float16)
    return diff.abs().sum()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)            # ✅ safe in AMP
        loss2 = unsafe_custom_loss(out, target)  # ❌ crashes in AMP
        total_loss = loss1 + loss2

    # 💥 This will throw: CUDA error: illegal memory access
    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()
✅ Fix: Run the Unsafe Loss in float32 Only
for epoch in range(1):
    optimizer.zero_grad()

    # Safe forward pass with AMP
    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)

    # Unsafe loss computed outside AMP in float32
    with autocast(enabled=False):
        loss2 = unsafe_custom_loss(out.float(), target.float())

    # Combine in float32
    total_loss = loss1 + loss2

    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()
🧠 Explanation
| Component | AMP-safe? | Fix |
|---|---|---|
| nn.MSELoss | ✅ | Safe inside autocast |
| diff[mask] = 0 | ❌ | Breaks in float16; use .float() and autocast(enabled=False) |
| total_loss = loss1 + loss2 | ✅ | Works as long as loss2 is float32 |
✅ Summary
When combining multiple loss terms:
- Run each loss where it’s compatible:
  - Keep AMP-safe ones under autocast.
  - Move unsafe ones outside AMP with autocast(enabled=False) and .float().
- This lets you retain the benefits of mixed precision while avoiding crashes.
Understanding which operations are unsafe in mixed precision (float16) helps you write stable, AMP-compatible code without hitting runtime errors like:
RuntimeError: CUDA error: an illegal memory access was encountered
🔥 Commonly Unsafe or Risky Operations in float16 (AMP)
These ops often break or behave poorly in autocast(float16):
🧨 1. In-place tensor updates using boolean masks
x[mask] = 0 # 🚫 not safe in float16 if `mask` is boolean
- This is a common cause of illegal memory access under AMP.
- ✅ Fix: avoid in-place ops, or move to float32 context.
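For example, an out-of-place alternative with the same semantics (a minimal sketch, using the x and mask from the line above):
x = x.masked_fill(mask, 0)                      # returns a new tensor, no in-place masked write
x = torch.where(mask, torch.zeros_like(x), x)   # equivalent out-of-place formulation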
🧨 2. Boolean indexing
selected = x[mask] # 🚫 float16 + boolean masks = crash risk
- AMP does not support all internal CUDA kernels for mixed precision + bool masking.
- ✅ Fix: convert to .float() before using boolean masks.
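For instance (a minimal sketch):
selected = x.float()[mask]   # index a float32 copy of the tensor
selected = selected.half()   # optional: cast back to float16 afterwards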
🧨 3. Advanced indexing or slicing with mismatched types
x[torch.nonzero(mask)] # 🚫 risky under AMP if mask or indexing is not float32
- Especially problematic when combining float16 tensors with long/byte/boolean indices.
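One way to keep the indexing well-defined is to use explicit long indices on a float32 copy. A sketch, assuming a 1-D mask:
idx = torch.nonzero(mask, as_tuple=False).squeeze(-1)   # explicit long index tensor
selected = x.float().index_select(0, idx)               # gather from a float32 copy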
🧨 4. Reductions on half precision tensors with large values
loss = (x ** 2).sum() # 🚫 can overflow in float16
- Float16 has very limited dynamic range.
- ✅ Fix: convert to float32 before the reduction if large values are possible.
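For example (a minimal sketch):
loss = (x.float() ** 2).sum()   # accumulate in float32 to avoid float16 overflow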
🧨 5. Perceptual losses / LPIPS / complex multi-stage functions
lpips_loss = lpips_model(pred, target) # 🚫 often requires float32 internally
- Many prebuilt models (e.g. LPIPS, SSIM, MONAI metrics) crash or give NaN in float16.
- ✅ Fix: run these outside AMP using autocast(enabled=False).
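For example (a sketch; lpips_model stands in for whatever perceptual metric you use):
with autocast(enabled=False):
    lpips_loss = lpips_model(pred.float(), target.float())  # run the metric in full precision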
🧨 6. Division with small values
x / y # 🚫 can underflow in float16
- Float16 can lose precision easily with small divisors.
- ✅ Fix: use .float() if dividing by small numbers.
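For example (a minimal sketch; the epsilon guard assumes a non-negative divisor and is an illustrative choice):
result = x.float() / y.float()                   # divide in float32
result = x.float() / y.float().clamp_min(1e-8)   # optionally guard against tiny divisors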
🧨 7. Custom CUDA kernels or extensions
- If you’re using a compiled CUDA op (e.g., from a research repo) and it doesn’t support float16, it will crash.
- ✅ Fix: check whether the kernel supports mixed precision; otherwise run it outside AMP.
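For example (a sketch; my_custom_op is a hypothetical compiled extension that only handles float32):
with autocast(enabled=False):
    out = my_custom_op(a.float(), b.float())  # feed the float16-unaware kernel float32 inputs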
✅ General Rule of Thumb
| Operation Type | AMP-Safe (float16)? | Notes |
|---|---|---|
| Standard convs, linear layers | ✅ Yes | Supported under AMP |
| ReLU, GELU, Softmax | ✅ Yes | Safe |
| nn.MSELoss, nn.CrossEntropyLoss | ✅ Yes | Fully supported |
| Masked operations, slicing | ❌ Risky | Use in float32 |
| Reductions (sum, mean) | ❌ Risky if large | Watch for overflow |
| Non-differentiable ops (e.g., .item(), print) | ✅ Yes | Safe, but ignored by backward |
✅ Safe Pattern
with autocast(dtype=torch.float16):
    # safe model forward + supported losses
    output = model(input)
    loss1 = mse_loss(output, target)

# unsafe or risky loss outside AMP
with autocast(enabled=False):
    loss2 = unsafe_custom_loss(output.float(), target.float())

total_loss = loss1 + loss2
scaler.scale(total_loss).backward()