CUDA Kernel Error with PyTorch - Mismatched CUDA Versions?

Hi,

I’m encountering an issue when running PyTorch on an NVIDIA A100 GPU using the “106a CUDA 12.3 (GPU)” software stack. Below is the test code I used:

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Device name: {torch.cuda.get_device_name(0)}")

import subprocess
try:
    nvidia_smi_output = subprocess.check_output(["nvidia-smi"]).decode("utf-8")
    print("\nnvidia-smi output:")
    print(nvidia_smi_output)
except subprocess.CalledProcessError:
    print("nvidia-smi command not available")

x = torch.rand(5, 3).to("cuda")
print(f"Tensor on GPU: {x}")

Here’s the output I got:

PyTorch version: 2.3.1
CUDA version: 12.3
Device name: NVIDIA A100-PCIE-40GB

nvidia-smi output:
Thu Dec 19 15:21:03 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   36C    P0             43W / 250W  |     422MiB / 40960MiB  |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

The error occurs at the last line:

---> 17 print(f"Tensor on GPU: {x}")

RuntimeError: CUDA error: no kernel image is available for execution on the device.

This error only happens when I’m assigned an A100 GPU. When I’m assigned a Tesla T4, the code works fine.

Has anyone had the same problem? Is this a problem with the PyTorch version?
Any help or suggestions would be greatly appreciated. Thanks in advance!

Dear @zhenliu ,

Thank you for reporting, we will investigate.

The CUDA version reported by the driver (12.4) is higher than the toolkit version PyTorch was built with (12.3), but that is fine; the driver is backwards compatible with older toolkits. It is interesting that you only see the error on the A100 but not on the T4.
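
As a quick sanity check, here is a minimal sketch (plain Python, nothing specific to the LCG stack) that prints the toolkit version the PyTorch wheel was built with next to the CUDA version advertised by the driver:

import re
import subprocess
import torch

# CUDA toolkit version the installed PyTorch wheel was built against, e.g. "12.3"
toolkit = torch.version.cuda

# CUDA version advertised by the driver, parsed from the nvidia-smi banner, e.g. "12.4"
smi = subprocess.check_output(["nvidia-smi"]).decode("utf-8")
match = re.search(r"CUDA Version:\s*([\d.]+)", smi)
driver = match.group(1) if match else "unknown"

print(f"PyTorch built with CUDA: {toolkit}")
print(f"Driver supports up to:   {driver}")
# A 12.4 driver can run binaries built with the 12.3 toolkit,
# so this mismatch by itself does not explain the error.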

We will get back to you once we have more information.

Best,

Enric

Dear @zhenliu ,

I was able to reproduce.

I reported the issue to the librarians who maintain the LCG releases:

https://its.cern.ch/jira/browse/SPI-2704

The issue seems to be that the updated version of PyTorch in LCG 106a was not compiled for the A100 architecture (compute capability 8.0, i.e. sm_80).
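
You can confirm this directly on the assigned node; here is a minimal sketch (standard torch.cuda calls, nothing LCG-specific) that compares the architectures the build ships kernels for with what the GPU requires:

import torch

# Compute capabilities the installed PyTorch build ships kernels for, e.g. ['sm_60', 'sm_70', 'sm_75']
print("Architectures in this build:", torch.cuda.get_arch_list())

# Compute capability of the assigned GPU: A100 -> (8, 0) i.e. sm_80, Tesla T4 -> (7, 5) i.e. sm_75
major, minor = torch.cuda.get_device_capability(0)
print(f"This GPU requires: sm_{major}{minor}")

# If sm_80 is missing from the build's list, any kernel launch on the A100 fails with
# "no kernel image is available for execution on the device", while the T4 (sm_75) still works.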

I suggest we follow the issue in the ticket above.

Best,

Enric

Dear Enric,
Thank you very much for looking into this and reporting the issue. I’ll follow the ticket for updates.

Best regards,
Zhen