
NVIDIA Driver Vulnerabilities: Deep Dive and Runtime Detection Strategies

NOV 2025 | By André Brandão, Stealthium Security Research Team

Executive Summary

The AI revolution is happening whether you're on board or not, which means something in your stack requires GPUs, probably something business critical. And when privilege-escalation and denial-of-service vulnerabilities surfaced in NVIDIA's GPU kernel modules this fall, most organizations learned about GPU attack surfaces the hard way: through CVE notices rather than telemetry.

Here's what actually broke, how attackers could have exploited it, and why waiting for patches isn't a strategy you want your organization to count on. Multiple privilege-escalation and denial-of-service vulnerabilities were uncovered in NVIDIA's Linux GPU kernel drivers, in a set of issues referred to as CUDA de Grâce. In this analysis, we detail the root causes and demonstrate how exploitation attempts can be detected at runtime through Stealthium's GPU observability platform, leveraging advanced telemetry and behavioral analytics.

The vulnerabilities were identified by Valentina Palmiotti and Sam Lovejoy in NVIDIA's open-source GPU kernel modules. These issues enable unprivileged local attackers to escalate privileges and cause denial of service. Although NVIDIA released fixes in the October 2025 driver update, the sophistication of these vulnerabilities underscores the importance of continuous runtime monitoring and detection, capabilities natively provided by Stealthium to protect AI and GPU-accelerated workloads in production environments.

CVE-2025-23282: Race Condition Leading to Privilege Escalation

Impact

Potential for privilege escalation from unprivileged user space to the kernel, breaking the isolation layers provided by container-based sandboxing. CVSS score: 7.0 (AV:L/AC:H/PR:L/UI:N/S:U/C:H/I:H/A:H).

A public demonstration of this vulnerability being exploited inside an Azure GPU VM environment has confirmed that it is reliably exploitable for local privilege escalation in real-world environments.

Root Cause Analysis

Multiple race conditions were addressed in the September 2025 driver release. The specific vulnerability under analysis was located in the handling of the NV_ESC_ATTACH_GPUS_TO_FD ioctl (command 212) in the nvidia_ioctl function in kernel-open/nvidia/nv.c.

In the vulnerable code path, a buffer is allocated via NV_KMALLOC and its pointer is stored in the file-handle context nvlfp->attached_gpus:

File: kernel-open/nvidia/nv.c
Commit: 87c0b1247370e42bd22bb487a683ec513a177b3b

2528         case NV_ESC_ATTACH_GPUS_TO_FD:
2529         {
....
2546
2547             NV_KMALLOC(nvlfp->attached_gpus, arg_size);

User-provided data is then copied into the newly allocated buffer:

2553             memcpy(nvlfp->attached_gpus, arg_copy, arg_size);
2554             nvlfp->num_attached_gpus = num_arg_gpus;

Each GPU ID in the buffer is mapped to a device reference. On failure (line 2563), the allocated buffer is freed (line 2571) and nvlfp->num_attached_gpus is cleared before returning:

2556             for (i = 0; i < nvlfp->num_attached_gpus; i++)
2557             {
2558                 if (nvlfp->attached_gpus[i] == 0)
2559                 {
2560                     continue;
2561                 }
2562
2563                 if (nvidia_dev_get(nvlfp->attached_gpus[i], sp))
2564                 {
2565                     while (i--)
2566                     {
2567                         if (nvlfp->attached_gpus[i] != 0)
2568                             nvidia_dev_put(nvlfp->attached_gpus[i], sp);
2569                     }
2570
2571                     NV_KFREE(nvlfp->attached_gpus, arg_size);
2572                     nvlfp->num_attached_gpus = 0;
2573
2574                     status = -EINVAL;
2575                     break;
2576                 }
2577             }

No synchronization primitives protect this code path. Because the ioctl may be invoked concurrently by multiple threads, nvlfp (and its members attached_gpus and num_attached_gpus) can be read and written concurrently. This allows interleavings in which one thread's allocation and pointer write are overwritten by another thread before deallocation occurs, producing memory leaks, use-after-free conditions, and a reliable double-free primitive.

Attack vectors

  • Memory-leak vector: When multiple threads concurrently invoke NV_ESC_ATTACH_GPUS_TO_FD, earlier allocations are overwritten by subsequent assignments to nvlfp->attached_gpus without being freed, which can lead to kernel memory exhaustion.

  • Use-after-free and double-free vector (race): A plausible interleaving:

    1. Thread A allocates a buffer at address A and copies the user data into it.
    2. Thread B allocates buffer B and overwrites nvlfp->attached_gpus with B.
    3. Thread A enters the error path and executes NV_KFREE(nvlfp->attached_gpus), attempting to free A but instead freeing B.
    4. Thread B then hits its own error path and frees nvlfp->attached_gpus again, freeing B twice.

An attacker who can control the userspace data and force predictable reuse of the freed memory regions can manipulate kernel memory and potentially achieve privilege escalation.

NVIDIA's Fix

NVIDIA addressed the issue in driver version 580.95.05 (released September 30, 2025 - Commit 2b43605 - October 2025 Security Bulletin) by introducing synchronization around the attached_gpus state, using the nvl->ldata_lock semaphore to serialize concurrent allocations and frees:

Commit: 2b436058a616676ec888ef3814d1db6b2220f2eb
@@ -2538,8 +2544,12 @@ nvidia_ioctl(
                 goto done;
             }

+            /* atomically check and alloc attached_gpus */
+            down(&nvl->ldata_lock);
+
             if (nvlfp->num_attached_gpus != 0)
             {
+                up(&nvl->ldata_lock);
                 status = -EINVAL;
                 goto done;
             }
@@ -2547,12 +2557,15 @@ nvidia_ioctl(
             NV_KMALLOC(nvlfp->attached_gpus, arg_size);
             if (nvlfp->attached_gpus == NULL)
             {
+                up(&nvl->ldata_lock);
                 status = -ENOMEM;
                 goto done;
             }
             memcpy(nvlfp->attached_gpus, arg_copy, arg_size);
             nvlfp->num_attached_gpus = num_arg_gpus;

+            up(&nvl->ldata_lock);
+
             for (i = 0; i < nvlfp->num_attached_gpus; i++)
             {
                 if (nvlfp->attached_gpus[i] == 0)
@@ -2568,9 +2581,14 @@ nvidia_ioctl(
                             nvidia_dev_put(nvlfp->attached_gpus[i], sp);
                     }

+                    /* atomically free attached_gpus */
+                    down(&nvl->ldata_lock);
+
                     NV_KFREE(nvlfp->attached_gpus, arg_size);
                     nvlfp->num_attached_gpus = 0;

+                    up(&nvl->ldata_lock);
+
                     status = -EINVAL;
                     break;
                 }

Stealthium Detection Strategy

Stealthium's GPU runtime observability platform detects exploitation attempts against CVE-2025-23282 at the lowest levels using layered telemetry and behavioral analytics. Here's how:

Ioctl Call Monitoring

Stealthium introspects ioctl calls on NVIDIA device files (for example, /dev/nvidiactl) by attaching eBPF probes to the kernel entry and exit points of the ioctl handler. For CVE-2025-23282, the following heuristic was implemented:

Detection heuristic: We monitor for rapid, overlapping invocations of NV_ESC_ATTACH_GPUS_TO_FD (ioctl 212) on the same underlying kernel file object, rather than just the integer file descriptor. Since file descriptors can be inherited or passed between processes (fork, dup, pidfd_getfd, and so on), tracking the kernel file object provides a reliable way to detect this race even when attackers coordinate across processes.

Typical NVIDIA GPU workloads do not invoke this ioctl pattern, and the behavior required to exploit this vulnerability is strongly indicative of malicious activity.

Telemetry Captured:

  • Kernel file object
  • Ioctl command and argument size
  • Timing/overlap information
  • Optional: PID/TID (for context, not for detection)

By correlating activity on the same kernel file object with overlapping timestamps, a high-confidence signal can be raised when multiple threads or processes attempt NV_ESC_ATTACH_GPUS_TO_FD concurrently. Even a single program that legitimately calls this ioctl is worth recording: if anomalous behaviour is observed afterwards, it can be traced back to this event.

CVE-2025-23332: Incorrect ZERO_SIZE_PTR Handling

Impact

This vulnerability allows an unprivileged user to trigger a denial of service in the NVIDIA kernel driver, leading to GPU driver crashes that can disrupt other applications, including workloads running inside containers. It has been assigned a CVSS score of 5.0 (AV:L/AC:L/PR:L/UI:R/S:U/C:N/I:N/A:H).

Root Cause Analysis

The bug originates in the nvidia_ioctl function within kernel-open/nvidia/nv.c, during the handling of the ioctl command NV_ESC_WAIT_OPEN_COMPLETE (218).

The function allocates memory to hold user-provided data (line 2438):

File: kernel-open/nvidia/nv.c
Commit: 87c0b1247370e42bd22bb487a683ec513a177b3b
2376 int
2377 nvidia_ioctl(
2378     struct inode *inode,
2379     struct file *file,
2380     unsigned int cmd,
2381     unsigned long i_arg)
2382 {
...
2405     arg_size = _IOC_SIZE(cmd);
2406     arg_cmd  = _IOC_NR(cmd);
...
2438     NV_KMALLOC(arg_copy, arg_size);
2439     if (arg_copy == NULL)
2440     {
2441         nv_printf(NV_DBG_ERRORS, "NVRM: failed to allocate ioctl memory\n");
2442         status = -ENOMEM;
2443         goto done_early;
2444     }

User data is then copied into the allocated buffer:

2446     if (NV_COPY_FROM_USER(arg_copy, arg_ptr, arg_size))
2447     {
2448         nv_printf(NV_DBG_ERRORS, "NVRM: failed to copy in ioctl data!\n");
2449         status = -EFAULT;
2450         goto done_early;
2451     }

Finally, when processing NV_ESC_WAIT_OPEN_COMPLETE, it writes the member variables open_rc and adapter_status into the freshly allocated memory:

2457     if (arg_cmd == NV_ESC_WAIT_OPEN_COMPLETE)
2458     {
2459         nv_ioctl_wait_open_complete_t *params = arg_copy;
2460
2461         params->rc = nvlfp->open_rc;
2462         params->adapterStatus = nvlfp->adapter_status;
2463         goto done_early;
2464     }

At first glance, nothing appears wrong in the code above. However, there's a subtle kernel behaviour at play:

When kmalloc() is called with a size of zero, it does not return NULL; it instead returns a special pointer called ZERO_SIZE_PTR (defined as address 0x10). Because the driver only checks for NULL (if (arg_copy == NULL)), it fails to detect this invalid allocation and continues execution. The subsequent copy does nothing since the size is zero, but the real issue appears at line 2461, where the code writes through the pointer (params->rc).

At that point, it is dereferencing ZERO_SIZE_PTR (0x10), which triggers a page fault and crashes the NVIDIA kernel driver.

NVIDIA's fix

NVIDIA addressed the issue in driver version 580.95.05 (released September 30, 2025 - Commit 2b43605 - October 2025 Security Bulletin) by adding explicit size validation for the NV_ESC_WAIT_OPEN_COMPLETE ioctl. The new guard rejects improperly sized requests, including zero-size requests that previously produced a ZERO_SIZE_PTR dereference:

Commit: 2b436058a616676ec888ef3814d1db6b2220f2eb
@@ -2458,6 +2458,12 @@ nvidia_ioctl(
     {
         nv_ioctl_wait_open_complete_t *params = arg_copy;

+        if (arg_size != sizeof(nv_ioctl_wait_open_complete_t))
+        {
+            status = -EINVAL;
+            goto done_early;
+        }
+

This simple check prevents the code path from dereferencing ZERO_SIZE_PTR when kmalloc(0) is used, and ensures that only properly formed ioctl requests are processed.

Stealthium's Detection Strategy

Stealthium's observability stack was designed to catch both the exploit attempts that target this class of bug and the resulting impact when they succeed. Below are the pragmatic detection layers we apply for CVE-2025-23332.

Ioctl Parameter Validation

Watch ioctl invocations for NV_ESC_WAIT_OPEN_COMPLETE (ioctl 218) and flag any calls whose arg_size does not match the expected payload size (i.e., sizeof(nv_ioctl_wait_open_complete_t), typically 16 bytes). Zero-size requests are trivial to spot and are highly anomalous for this ioctl.

Telemetry Captured:

  • Ioctl command number and argument size
  • Calling process details (PID, UID, executable path, command line)
  • NVIDIA driver version

A lightweight eBPF probe on ioctl entry is sufficient to collect these fields with minimal overhead. When we observe an incorrectly sized call to ioctl NV_ESC_WAIT_OPEN_COMPLETE from an untrusted binary or from a process on an unpatched host, we can raise a high-confidence alert.

Conclusion

The vulnerabilities discovered by security researchers in NVIDIA's GPU kernel drivers demonstrate the expanding attack surface of GPU-accelerated computing infrastructure. As AI workloads become increasingly critical to business operations, the security implications of GPU driver vulnerabilities grow correspondingly severe.

Traditional security approaches focused solely on patching are insufficient given:

  • The lag between vulnerability discovery and patch deployment
  • The complexity of GPU driver ecosystems across multiple branches
  • The sophistication of modern exploitation techniques
  • The multi-tenant nature of cloud GPU environments

Stealthium's comprehensive GPU observability platform addresses these challenges by providing:

  • Real-time detection of exploitation attempts against known vulnerabilities
  • Behavioural anomaly detection capable of identifying zero-day attacks
  • Deep visibility across the entire NVIDIA software stack (driver, CUDA, frameworks)
  • Production-safe deployment with minimal performance impact

By correlating low-level GPU telemetry with high-level workload context, Stealthium transforms raw GPU metrics into actionable security intelligence, enabling organisations to defend their AI infrastructure against both known and emerging threats.

GPU security may be new territory for your organization, but you cannot wait for your most innovative digital assets to be compromised while you get up to speed. Stealthium gives you the confidence you need to run workloads on GPUs. Get in touch today.

Stealthium - GPU-Powered Security Intelligence