GPU and other Resources
Requests and Limits
As described in the Kubernetes documentation, the platform uses two kinds of resource "limits". One is called Requests and provides a guaranteed amount of a resource; the other is called Limits and specifies a never-exceed limit for the resource. The user could specify Requests only, which enables guarantees without a limit (this is forbidden on our platform), or Limits only, in which case Requests and Limits are set equal. All resources are allocated from a single computing node, which means they cannot be larger than the capacity of a single node; e.g., requesting 1000 CPUs will never be satisfied. Resources between Requests and Limits are not guaranteed. The Kubernetes scheduler considers only Requests when placing Pods onto nodes, so a node can come under disk or memory pressure caused by overbooking these kinds of resources. In this case, a Pod can be evicted from the node. For CPU resources between Requests and Limits, a slowdown can happen if the node has no free CPUs left.
The general YAML fragment looks like this:
resources:
  requests:
    # resources
  limits:
    # resources
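For example, specifying Limits only, as in the following fragment (the values are illustrative), results in Requests being set equal to Limits:
resources:
  limits:
    cpu: 1
    memory: 1Gi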
CPU
The CPU resource can be requested in whole CPU units, such as 2 for two CPUs, or in milli units, such as 100m for 0.1 CPU. See the complete example below.
Memory
The memory resource can be requested in bytes or in binary multiples, e.g., 1000Mi or 1Gi for roughly 1GB of memory. This resource comprises shared memory (see below) and all memory-backed emptyDir volumes. The example below consumes up to 1GB of the memory resource; if the application itself requires, e.g., 2GB of memory, the user needs to request 3GB of the memory resource in total.
volumes:
  - name: ramdisk
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
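For instance, if the application itself needs about 2GB of memory on top of the 1GB ramdisk above, the memory resource should be set to roughly 3GB (a sketch; the exact value depends on the application):
resources:
  limits:
    memory: 3Gi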
Shared Memory
By default, each container runs with a 64MB limit on shared memory. This limit is enough for many cases, but some GPU or GUI applications need more. In such a case, the SHM size needs to be increased. This cannot be done in the resources section; the only possibility is to mount an additional memory volume using the following YAML fragment, which increases the SHM size to 1GB:
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
volumeMounts:
  - name: dshm
    mountPath: /dev/shm
The name of the volume is not important and can be any valid name; the mountPath must be /dev/shm.
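As a sketch of where these fragments belong in a full Pod definition, volumes goes under the Pod spec and volumeMounts under the container (the Pod and container names below are only illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  volumes:
    - name: dshm               # any valid name
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
  containers:
    - name: test
      image: ubuntu
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm  # must be /dev/shm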
GPU
The GPU resource can be requested in two distinct ways. The user can request GPU(s) exclusively using nvidia.com/gpu: x, where x is the number of requested GPUs. The user can also request only a fraction of a GPU using nvidia.com/mig-1g.10gb: x or nvidia.com/mig-2g.20gb: x, where x is the number of such GPU parts. The nvidia.com/mig-1g.10gb resource requests a GPU part with 10GB of memory, while nvidia.com/mig-2g.20gb requests a GPU part with 20GB of memory. More information about GPU fractions (MIG) can be found here.
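As a sketch, a resources fragment requesting one 20GB MIG slice instead of a whole GPU (the CPU and memory values are illustrative) could look like this:
resources:
  limits:
    cpu: 1
    memory: 4Gi
    nvidia.com/mig-2g.20gb: 1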
Requesting Specific GPU Type
It is possible to request a specific GPU type when deploying a Pod. This can be done using node selector labels. Each node equipped with a GPU has a label nvidia.com/gpu.product, which can currently have the values NVIDIA-A10, NVIDIA-A100-80GB-PCIe, NVIDIA-A40, NVIDIA-H100-PCIe, and NVIDIA-L4.
For example, to directly request the NVIDIA A40 GPU type, use the following Pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  nodeSelector:
    nvidia.com/gpu.product: 'NVIDIA-A40'
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test
      image: ubuntu
      command:
        - "sleep"
        - "infinity"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
This is just an example Pod definition, kept short for clarity. In a real use case, a Deployment or Job should be used; their inner spec is exactly the same as the spec above.
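For illustration, a minimal sketch of the same spec wrapped in a Job; the metadata name and the restartPolicy are additions required by the Job kind, everything else is taken from the Pod example above:
apiVersion: batch/v1
kind: Job
metadata:
  name: test                 # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never   # a Job's Pod template must not use the default Always
      nodeSelector:
        nvidia.com/gpu.product: 'NVIDIA-A40'
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: test
          image: ubuntu
          command:
            - "sleep"
            - "infinity"
          resources:
            limits:
              cpu: 1
              memory: 1Gi
              nvidia.com/gpu: 1
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL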
Note that requesting a specific GPU type reduces the chances that the Pod will be scheduled: all GPUs of that type may already be in use (e.g., there is only one NVIDIA L4 GPU).
Requesting GPU Properties
It is also possible to request GPU properties, such as GPU memory. This is similar to requesting a specific GPU type, but a different node label is used: nvidia.com/gpu.memory. However, in this case a more complicated affinity description must be used, because a node selector can only match an exact value; it cannot compare values as less-than or greater-than. The following example will allocate any GPU type with more than 80000MB of GPU memory:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.memory
                operator: Gt
                values:
                  - '80000'
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test
      image: ubuntu
      command:
        - "sleep"
        - "infinity"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
For more about operators such as Gt, see the operators description. Although the amount of GPU memory is a number, it must be given as a string, i.e., '80000' and not 80000.
Storage
The user should also specify the required ephemeral-storage resource. Units are the same as for the memory resource. This resource limits the local storage consumed by the running container and all local files it creates, including temporary files in directories such as /tmp or /var/tmp and all emptyDir volumes that are not backed by memory.
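A sketch of a resources fragment that also reserves local storage; the 10Gi value is only an illustrative assumption:
resources:
  requests:
    cpu: 1
    memory: 1Gi
    ephemeral-storage: 10Gi
  limits:
    cpu: 1
    memory: 1Gi
    ephemeral-storage: 10Gi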
Full Resource Example
The following example requests 1 GPU, 2 CPUs, and 4GB of memory guaranteed, with a hard limit of 3 CPUs and 6GB of memory.
resources:
  requests:
    cpu: 2
    memory: 4Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 3
    memory: 6Gi
    nvidia.com/gpu: 1
The following example requests 10GB of GPU memory (one nvidia.com/mig-1g.10gb slice), 0.5 CPU, and 4GB of memory, both guaranteed and as a hard limit.
resources:
  requests:
    cpu: 500m
    memory: 4Gi
    nvidia.com/mig-1g.10gb: 1
  limits:
    cpu: 500m
    memory: 4Gi
    nvidia.com/mig-1g.10gb: 1