How to Check GPU Access in SLURM Clusters

To determine the number of GPUs your account can access in a SLURM-managed cluster, follow these steps:

Check Account and Partition Access: Use the command sacctmgr show associations to view the associations your account has with clusters, partitions, and QoS. Look for GPU-specific partitions (e.g., gpu or gpu-guest).
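
For example, the following commands narrow the output down to your own associations and list each partition together with its GPU resources. This is a minimal sketch; the format= field names can vary slightly between Slurm versions.

   # Associations for the current user only
   sacctmgr show associations user=$USER format=Cluster,Account,User,Partition,QOS

   # Partitions and the generic resources (GRES) their nodes provide
   sinfo -o "%P %G %D"   # partition, GRES (e.g., gpu:4), node count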

Inspect Node Configuration: Run scontrol show nodes to see the GPU configuration of the nodes in the cluster. Look at the Gres= and CfgTRES= lines; an entry such as gres/gpu=X in CfgTRES (or gpu:X in Gres) means X GPUs are configured on that node.
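
A quick way to scan for GPU-equipped nodes without reading every node record (a sketch; the exact Gres and CfgTRES strings depend on how the site defines its GRES):

   # Keep only the node names and GPU-related lines
   scontrol show nodes | grep -Ei "NodeName=|Gres=|gres/gpu"

   # Node-oriented view: node, partition, and its GRES string
   sinfo -N -o "%N %P %G"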

Query Resource Limits: Use sacctmgr show qos to check the Quality of Service (QoS) limits that apply to your account. GPU limits show up as gres/gpu entries in fields such as MaxTRES.
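
For example (a sketch; sacctmgr accepts a format= list, but exact field names may differ slightly across Slurm versions):

   # Full QoS table; -p gives parseable, pipe-separated output
   sacctmgr -p show qos | less -S

   # Only the QoS rows that mention a GPU TRES limit
   sacctmgr -p show qos | grep -i "gres/gpu"

   # A narrower view of the per-job and per-user limits
   sacctmgr show qos format=Name,Priority,MaxTRESPerJob,MaxTRESPerUser,MaxWall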

Contact Cluster Admins: If you’re unsure about your GPU allocation, reach out to your cluster administrators. They can provide specific details about your account’s GPU access.


Output explanation

Let’s interpret an example output from sacctmgr show associations:

   Cluster    Account       User  Partition     Share   Priority GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin 
---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- ------------- 
     slurm       root                               1                                                                                                                                                             normal                         
     slurm       root       root                    1                                                                                                                                                             normal                         
slurm_clu+       root                               1                                                                                                                                                             normal                         
slurm_clu+       root       root                    1                                                                                                                                                             normal  

This table is the output of sacctmgr show associations (Step 1). Each row is an association that links a cluster, an account, and optionally a user to a set of limits and a QoS. In this example the root account exists on two clusters (slurm and a second one whose name is truncated to slurm_clu+), is not tied to any particular partition, has no GrpTRES or MaxTRES limits set, and is attached only to the normal QoS. Since no GPU limit appears at the association level, any effective limits come from the QoS itself, which you can inspect with sacctmgr show qos (Step 3).

Explanation of the QoS fields reported by sacctmgr show qos:

  1. Name: The name of the QoS (e.g., normal, or site-specific names such as dgx2q-qos and defq-qos).
  2. Priority: Priority of jobs submitted under the QoS. A priority of 0 indicates no special prioritization.
  3. GraceTime: The time a job is allowed to keep running after being selected for preemption (00:00:00 means no grace time).
  4. Preempt / PreemptExemptTime / PreemptMode: Job preemption settings (empty if preemption is not configured).
  5. Flags: Special flags that change how the QoS is enforced (e.g., DenyOnLimit); often empty.
  6. UsageThres / UsageFactor: Utilization threshold and scaling factor applied to resource usage (a UsageFactor of 1.000000 is the default).
  7. GrpTRES: Group-level limits on Trackable Resources (TRES) such as gres/gpu, cpu, or mem, applied across all jobs under the QoS.
  8. GrpTRESMins, GrpTRESRunMin, GrpJobs: Limits on accumulated resource-minutes and on the number of running jobs.
  9. MaxTRES: Maximum trackable resources per job (e.g., a value of 4 on QoS such as dgx2q-qos and defq-qos most likely refers to 4 GPUs, i.e. gres/gpu=4; see the illustration after this list).
  10. MaxWall: Maximum wall time for jobs under this QoS (empty means no QoS-level limit).
  11. MaxJobsPU / MaxSubmitPU: Limits on the number of jobs a user can run or have submitted at the same time (empty means unlimited).
  12. MinTRES: Minimum resources a job must request under this QoS (empty means no minimum).
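
As a concrete illustration, a QoS listing in which dgx2q-qos and defq-qos cap jobs at 4 GPUs might look roughly like this in parseable form. The QoS names and values here are hypothetical examples, not output from the cluster above:

   $ sacctmgr -p show qos format=Name,Priority,MaxTRESPerJob
   Name|Priority|MaxTRES|
   normal|0||
   dgx2q-qos|0|gres/gpu=4|
   defq-qos|0|gres/gpu=4|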

Key Observations:

  • The normal QoS has no MaxTRES value, so the QoS itself imposes no per-job GPU cap; whether you can actually get GPUs under it depends on the partitions and nodes your association gives you access to.
  • QoS entries such as dgx2q-qos and defq-qos with a MaxTRES of 4 most likely limit each job submitted under them to 4 GPUs.
  • No other strict resource-usage or job limits are explicitly defined in this example; a quick empirical check is sketched below.
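
Finally, the most direct check is to request a GPU and see whether Slurm accepts the job. A minimal sketch, assuming a partition named gpu; substitute the partition, QoS, and account names from your own output:

   # Request one GPU for five minutes and print what the allocated node sees
   srun --partition=gpu --gres=gpu:1 --time=00:05:00 nvidia-smi

   # If your site requires an explicit QoS and account (names here are placeholders)
   srun --partition=gpu --qos=normal --account=myaccount --gres=gpu:1 --time=00:05:00 nvidia-smi

If the job runs and nvidia-smi lists a GPU, you have access; if it stays pending, the reason column in squeue will usually point to the QoS or GRES limit you have hit.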

