

How to fix CVE-2024-0132 in Nvidia AI

Get the breakdown on the critical CVE-2024-0132 affecting the NVIDIA Container Toolkit and GPU Operator

Yair Divinsky | September 26, 2024

A critical flaw, identified as CVE-2024-0132, has been discovered in the NVIDIA Container Toolkit and GPU Operator, posing a significant threat to environments running AI workloads.

The researchers report that more than 35% of cloud environments are affected, exposing their containers that use NVIDIA GPUs.

Here’s everything you need to know:

 

TL;DR

Affected products: NVIDIA Container Toolkit (all versions up to and including v1.16.1); NVIDIA GPU Operator (all versions up to and including 24.6.1)

Product category: AI vulnerability

Severity: Critical

Type: Time-of-check to time-of-use (TOCTOU), resulting in container escape to gain full access to the underlying host system

Impact: Code execution, denial of service, privilege escalation, information disclosure, data tampering

PoC: No

Exploit in the wild: No current evidence

CISA Catalog: No

Remediation action: Update NVIDIA Container Toolkit to version 1.16.2 immediately, particularly on hosts that may execute untrusted container images

MITRE advisory: Read more

What is CVE-2024-0132?

Wiz researchers Shir Tamari, Ronen Shustin, and Andres Riancho discovered a vulnerability with a critical CVSS score of 9.0. CVE-2024-0132 is rooted in the widely used NVIDIA Container Toolkit, which is integral to allowing GPU access for AI applications deployed in containers. The issue affects both cloud-based and on-prem AI systems that rely on a vulnerable version of the toolkit for GPU functionality.

Exploitation of the vulnerability could allow an attacker with control over a container image to break out of the container and take full control of the underlying host machine. This represents a major security concern, as it opens the potential for unauthorized access to critical data and systems. 

NVIDIA responded by issuing a security advisory and releasing a patched update to fix the flaw. In their report, the Wiz researchers note that the NVIDIA team cooperated with the investigation and acted promptly throughout the disclosure and remediation process. “Their transparency and collaborative efforts were essential in resolving this issue,” they say.

 

The NVIDIA Container Toolkit

The NVIDIA Container Toolkit has emerged as the industry leader for integrating GPUs with containers, streamlining GPU usage in containerized setups. As the demand for AI applications and container technology has surged in recent years, the toolkit’s adoption has grown significantly.

In shared computing environments, GPUs are often shared so that multiple workloads, and sometimes different users, can utilize the same GPU resource.

To achieve direct GPU access from within containers, NVIDIA developed a suite of drivers and utilities that are installed on the host system and integrate with the container runtime. The toolkit comes pre-packaged in numerous AI platforms and virtual machine images (such as AMIs), as it’s a critical component for running AI workloads.

The NVIDIA GPU Operator, a Kubernetes operator, is widely used in GPU-powered Kubernetes setups. It has significantly extended the reach of the NVIDIA Container Toolkit, making it prevalent in containerized GPU workloads across a wide range of organizations.

 

 

Does CVE-2024-0132 affect me?

The vulnerability affects the NVIDIA Container Toolkit and the NVIDIA GPU Operator. Specifically, the affected versions are:

  • NVIDIA Container Toolkit: All versions up to and including v1.16.1  
  • NVIDIA GPU Operator: All versions up to and including 24.6.1  
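As a rough illustration of the version ranges above, the following Python sketch checks whether a given version string falls in the affected range. The product keys and function names are hypothetical, and it assumes plain three-part versions such as "v1.16.1" or "24.6.1"; distribution package suffixes and release candidates are not handled, and the CDI exception discussed later in the article is not taken into account.

```python
import re

# Last affected releases, as listed above. Anything newer is treated as patched.
# The dictionary keys are illustrative labels, not official product identifiers.
LAST_AFFECTED = {
    "container-toolkit": (1, 16, 1),
    "gpu-operator": (24, 6, 1),
}


def parse_version(text):
    """Turn 'v1.16.1' or '24.6.1' into a tuple of ints such as (1, 16, 1)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(product, version):
    """Return True if the version is at or below the last affected release."""
    return parse_version(version) <= LAST_AFFECTED[product]


if __name__ == "__main__":
    for product, version in [
        ("container-toolkit", "v1.16.1"),
        ("container-toolkit", "1.16.2"),
        ("gpu-operator", "24.6.1"),
        ("gpu-operator", "v24.6.2"),
    ]:
        status = "affected" if is_affected(product, version) else "patched"
        print(f"{product} {version}: {status}")
```

Versions are compared as integer tuples rather than strings, so that 1.9.0 correctly sorts below 1.16.1.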

In single-tenant environments, if a user unintentionally downloads a harmful container image from an untrusted source, for example as the result of a phishing attack, the attacker could compromise that user’s machine entirely.

In orchestrated or shared environments such as Kubernetes (K8s), however, the stakes are even higher. An attacker with permission to deploy a container could escape from it, gaining access to sensitive data and secrets from other applications running on the same node or even across the entire cluster, potentially compromising the whole environment.

The orchestrated scenario is particularly concerning for AI service providers that allow customers to run GPU-enabled containers.

This is also where the vulnerability becomes far more dangerous. An attacker could deploy a malicious container, escape from it, and use the host’s secrets to infiltrate the cloud service’s control systems.

This would enable them to access sensitive resources, including other customers’ source code, data, and secrets within the same service. 

 

Has CVE-2024-0132 been actively exploited in the wild?

No signs of active exploitation of this vulnerability have been detected as of this publication. The exploitation of the flaw would consist of three main stages. 

As a first step, the attacker would craft a malicious image: a specially engineered container image designed to exploit CVE-2024-0132.

The second stage is gaining full filesystem access. Once the malicious image runs on the targeted system, whether the attacker got it there directly (through services that utilize shared GPU resources) or indirectly (via a supply chain compromise or social engineering), the vulnerability can be exploited to mount the host’s entire filesystem.

This would grant full read access to the host, exposing the system’s infrastructure and potentially allowing access to other customers’ sensitive information. 

Finally, step three is taking over the host. With access to the host’s filesystem, the attacker can interact with container runtime Unix sockets such as docker.sock or containerd.sock, allowing them to execute arbitrary commands on the host with root privileges. This grants complete control of the machine.

While the vulnerability initially provides only read access, a threat actor can leverage a Linux behavior where Unix sockets remain writable even when mounted with read-only permissions. This opens the door to a full system takeover. 

 

How to fix CVE-2024-0132

Organizations impacted by this vulnerability should update to the latest versions of the Container Toolkit (v1.16.2) and the NVIDIA GPU Operator (v24.6.2).  

The urgency of addressing this vulnerability depends on your system’s architecture and how much trust you place in the container images you run.

Environments that use third-party container images or AI models, whether internally or as a service, face a higher level of risk, as the flaw can be exploited via a malicious image. Note that the vulnerability does not impact use cases where Container Device Interface (CDI) is used.

To mitigate the issue, it is strongly advised to apply patches to container hosts running older, vulnerable versions of the Container Toolkit.

Focus should be on hosts that are more likely to run containers from untrusted images. To further streamline patching efforts, you can use runtime validation to target systems where the toolkit is actively in use. 
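The prioritization described here can be sketched as a simple scoring function. This is only an illustrative Python sketch: the Host fields, the weights, and the inventory format are all assumptions rather than part of any NVIDIA or Wiz tooling, and treating CDI-only hosts as not impacted follows the article's statement above rather than an independent check. Public internet exposure is deliberately not one of the inputs.

```python
from dataclasses import dataclass

# First patched Container Toolkit release, per the guidance above.
PATCHED_TOOLKIT = (1, 16, 2)


@dataclass
class Host:
    """Hypothetical inventory record; the field names are illustrative."""
    name: str
    toolkit_version: tuple       # e.g. (1, 16, 1)
    toolkit_in_use: bool         # runtime validation saw the toolkit in use
    runs_untrusted_images: bool  # third-party or user-supplied images
    cdi_only: bool               # GPU access is configured exclusively via CDI


def patch_priority(host):
    """Return a score where higher means patch sooner; 0 means not urgent."""
    if host.toolkit_version >= PATCHED_TOOLKIT or host.cdi_only:
        return 0
    score = 1  # vulnerable version installed
    if host.toolkit_in_use:
        score += 1
    if host.runs_untrusted_images:
        score += 2  # assumed weight: untrusted images matter most
    return score


def rank_hosts(hosts):
    """Sort hosts so the most urgent ones come first."""
    return sorted(hosts, key=patch_priority, reverse=True)


if __name__ == "__main__":
    inventory = [
        Host("dev-02", (1, 16, 1), False, False, False),
        Host("train-01", (1, 16, 1), True, True, False),
        Host("patched-03", (1, 16, 2), True, True, False),
    ]
    for host in rank_hosts(inventory):
        print(host.name, patch_priority(host))
```

The weights are arbitrary; the point is only that image trust and actual runtime use drive the ordering, while the installed version gates whether a host is scored at all.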

Public internet exposure is not a relevant factor for prioritizing this vulnerability. The container host doesn’t need to be externally accessible to be compromised by a malicious image.

Potential entry points could include social engineering attacks on developers, supply chain risks like an attacker with access to a container image repository, or misconfigured environments that allow users to upload arbitrary images. 

 

Building mature pipelines for running AI models with control over source and integrity

As in their previous investigations of AI service providers such as Hugging Face, SAP AI Core, Replicate and others, the Wiz researchers observed a common practice: these providers often run AI models and training processes as containers within shared computing environments, where multiple customers’ applications utilize the same GPU.

Their next step was to check whether the shared GPU device could expose one customer’s AI models, prompts, or datasets to others using the same hardware, launching an exploration into NVIDIA’s kernel modules, SDK, and runtime tools. 

Examining the NVIDIA Container Toolkit, they uncovered a large attack surface for container breakout vulnerabilities, potentially allowing an attacker to escape their container and access data belonging to other users sharing the same GPU.

This finding shifted their focus away from GPU-specific research and towards a deeper investigation into the support tools NVIDIA offers to its users.

  

Further reading

Each new vulnerability is a reminder of where we stand and what we need to do better. Check out the following resources to help you maintain cyber hygiene and stay ahead of the threat actors: 

  1. Q1 2024 Vulnerability Watch
  2. The MITRE ATT&CK framework: Getting started
  3. The true impact of exploitable vulnerabilities for 2024
  4. Vulnerability disclosure policy (and how to get it right)
  5. How to properly tackle zero-day threats
