Fixed: CUDA error: an illegal memory access was encountered

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This typically means:

  • Some operation on the GPU is invalid or unsafe, often in float16 under AMP.
  • It causes crashes in scaler.scale(loss).backward() or during the backward pass.
  • The real error often happens earlier than where it’s reported.

Let’s start with a realistic mock example where an AMP-unsafe loss function crashes inside scaler.scale(loss).backward().

After that, we’ll go through an example where

  • You compute two losses: one works fine under AMP, the other doesn’t.
  • You combine them into total_loss = loss1 + loss2.
  • But using scaler.scale(total_loss).backward() crashes with: RuntimeError: CUDA error: an illegal memory access was encountered

Then we’ll fix it by isolating the unsafe loss.

After that, we’ll list which operations are unsafe in mixed precision (float16).


🧪 Mock Example (that causes this error)

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

# 🧨 Unsafe loss: uses in-place mask indexing, which is not AMP-safe
def unsafe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error[mask] = 0  # 🧨 This in-place write with a mask on half precision can crash
    return (error ** 2).mean()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()
    
    with autocast(dtype=torch.float16):
        output = model(x)
        loss = unsafe_loss_fn(output, target)  # 🧨 AMP can't handle this properly

    # 💥 This will crash with illegal memory access
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

🛠️ How to Fix It

✅ Option 1: Disable AMP just for the loss

with autocast(dtype=torch.float16):
    output = model(x)

# 👉 Compute loss in full precision
with autocast(enabled=False):
    loss = unsafe_loss_fn(output.float(), target.float())

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

✅ Option 2: Fix the loss function to avoid in-place ops

def safe_loss_fn(pred, target):
    error = pred - target
    mask = (error > 0.5)
    error = error.masked_fill(mask, 0)  # out-of-place: no masked in-place write
    return (error ** 2).mean()

Then continue using AMP as normal.
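
For example, a sketch of the same training loop with the rewritten loss, reusing the model, optimizer, and scaler from the mock example above:

for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        output = model(x)
        loss = safe_loss_fn(output, target)  # out-of-place masking, AMP-safe

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()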

🧠 Debugging Tips

If you’re not sure where the error is coming from:

# Will give exact line where the crash occurs
CUDA_LAUNCH_BLOCKING=1 python train.py

This forces all CUDA operations to run synchronously, so the stack trace points at the operation that actually failed.
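
If you prefer to set it from inside the script, a minimal sketch (the variable must be set before PyTorch initializes CUDA, so it goes at the very top of the file):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch  # initialize CUDA only after the environment variable is in place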


| Cause | Fix |
| --- | --- |
| In-place ops with masks in AMP | Avoid the in-place write (e.g., masked_fill) or compute the loss in float32 |
| Indexing with bool tensors in AMP | Use float32 |
| Cross-device tensors | Make sure all tensors are on the same .cuda() device |
| Unstable ops in a custom loss | Run the loss under autocast(enabled=False) |

💥 Mock Example: Crashes When Combining Two Losses in AMP

import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

x = torch.randn(4, 10, device='cuda')
target = torch.randn(4, 10, device='cuda')

mse_loss = nn.MSELoss()

# 🧨 Problematic loss: causes illegal memory access in AMP
def unsafe_custom_loss(pred, target):
    diff = pred - target
    mask = (diff > 0.2)
    diff[mask] = 0  # ❌ in-place write using bool mask (bad in float16)
    return diff.abs().sum()

# Training loop
for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)  # ✅ safe in AMP
        loss2 = unsafe_custom_loss(out, target)  # ❌ crashes in AMP

        total_loss = loss1 + loss2

    # 💥 This will throw: CUDA error: illegal memory access
    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()

✅ Fix: Run the Unsafe Loss in float32 Only

for epoch in range(1):
    optimizer.zero_grad()

    # Safe forward pass with AMP
    with autocast(dtype=torch.float16):
        out = model(x)
        loss1 = mse_loss(out, target)

    # Unsafe loss computed outside AMP in float32
    with autocast(enabled=False):
        loss2 = unsafe_custom_loss(out.float(), target.float())

    # Combine in float32
    total_loss = loss1 + loss2

    scaler.scale(total_loss).backward()
    scaler.step(optimizer)
    scaler.update()

🧠 Explanation

| Component | AMP-safe? | Fix |
| --- | --- | --- |
| nn.MSELoss | ✅ Yes | Safe inside autocast |
| diff[mask] = 0 | ❌ No | Breaks in float16; use .float() and autocast(enabled=False) |
| total_loss = loss1 + loss2 | ✅ Yes | Works as long as loss2 is float32 |

✅ Summary

When combining multiple loss terms:

  • Run each loss where it’s compatible:
    • Keep AMP-safe ones under autocast.
    • Move unsafe ones outside AMP with autocast(enabled=False) and .float().
  • This lets you retain the benefits of mixed precision while avoiding crashes.

Understanding which operations are unsafe in mixed precision (float16) helps you write stable AMP-compatible code without hitting runtime errors like:

RuntimeError: CUDA error: an illegal memory access was encountered

🔥 Commonly Unsafe or Risky Operations in float16 (AMP)

These ops often break or behave poorly in autocast(float16):


🧨 1. In-place tensor updates using boolean masks

x[mask] = 0  # 🚫 not safe in float16 if `mask` is boolean
  • This is a common cause of illegal memory access under AMP.
  • ✅ Fix: avoid in-place ops, or move to float32 context.
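
A minimal sketch of the out-of-place alternative, which produces the same result without writing through a boolean mask:

import torch

x = torch.randn(8, device='cuda', dtype=torch.float16)
mask = x > 0.5
x = torch.where(mask, torch.zeros_like(x), x)  # builds a new tensor, no masked in-place write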

🧨 2. Boolean indexing

selected = x[mask]  # 🚫 float16 + boolean masks = crash risk
  • AMP does not support all internal CUDA kernels for mixed precision + bool masking.
  • ✅ Fix: move to .float() before using boolean masks.
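
For example, a small sketch of the cast-first pattern:

x32 = x.float()        # work on a float32 copy
selected = x32[mask]   # boolean indexing in float32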

🧨 3. Advanced indexing or slicing with mismatched types

x[torch.nonzero(mask)]  # 🚫 risky under AMP if mask or indexing is not float32
  • Especially problematic when combining float16 tensors with long/byte/boolean indices.
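
A minimal sketch of the conservative combination: int64 (long) indices plus a float32 copy of the data:

idx = torch.nonzero(mask, as_tuple=True)  # tuple of long index tensors
picked = x.float()[idx]                   # advanced indexing on the float32 copy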

🧨 4. Reductions on half precision tensors with large values

loss = (x ** 2).sum()  # 🚫 can overflow in float16
  • Float16 has very limited dynamic range.
  • ✅ Fix: convert to float32 before reduction if large values possible.
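
A minimal sketch, doing the square and the reduction in float32 so intermediate values don't overflow float16's range (max ~65504):

loss = (x.float() ** 2).sum()  # cast before squaring and reducing
# or, if the squared values themselves fit in float16, accumulate only the sum in float32:
loss = (x ** 2).sum(dtype=torch.float32)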

🧨 5. Perceptual losses / LPIPS / complex multi-stage functions

lpips_loss = lpips_model(pred, target)  # 🚫 often requires float32 internally
  • Many prebuilt models (e.g. LPIPS, SSIM, MONAI metrics) crash or give NaN in float16.
  • ✅ Fix: run these outside AMP using autocast(enabled=False)
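
A sketch of that pattern, where lpips_model stands in for whatever perceptual-metric module you actually use (placeholder name, not a specific API):

with autocast(enabled=False):  # run the metric in full precision
    lpips_loss = lpips_model(pred.float(), target.float())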

🧨 6. Division with small values

x / y  # 🚫 can underflow in float16
  • Float16 can lose precision easily with small divisors.
  • ✅ Fix: use .float() if dividing small numbers.
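
A minimal sketch; the epsilon value is an arbitrary example, pick one that suits your data:

eps = 1e-6
result = x.float() / (y.float() + eps)  # divide in float32, guard against tiny divisors
result = result.half()                  # cast back only if the rest of the graph expects float16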

🧨 7. Custom CUDA kernels or extensions

  • If you’re using a compiled CUDA op (e.g., from a research repo), and it doesn’t support float16, it will crash.
  • ✅ Fix: check if the kernel supports mixed precision; otherwise run outside AMP.
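
A sketch of wrapping such an op, where my_custom_op is a placeholder for whatever compiled extension you are calling:

with autocast(enabled=False):                # the kernel sees plain float32 inputs
    out = my_custom_op(x.float(), y.float())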

✅ General Rule of Thumb

| Operation Type | AMP-Safe (float16)? | Notes |
| --- | --- | --- |
| Standard convs, linear layers | ✅ Yes | Supported under AMP |
| ReLU, GELU, Softmax | ✅ Yes | Safe |
| nn.MSELoss, nn.CrossEntropyLoss | ✅ Yes | Fully supported |
| Masked operations, slicing | ❌ Risky | Use in float32 |
| Reductions (sum, mean) | ❌ Risky if values are large | Watch for overflow |
| Non-differentiable ops (e.g., .item(), print) | ✅ Yes | Safe, but ignored by backward |

✅ Safe Pattern

with autocast(dtype=torch.float16):
    # safe model forward + supported losses
    output = model(input)
    loss1 = mse_loss(output, target)

# unsafe or risky loss outside AMP
with autocast(enabled=False):
    loss2 = unsafe_custom_loss(output.float(), target.float())

total_loss = loss1 + loss2
scaler.scale(total_loss).backward()
