To determine the number of GPUs your account can access in a SLURM-managed cluster, follow these steps:
1. Check Account and Partition Access: Use `sacctmgr show associations` to view your account's associations with partitions. Look for GPU-specific partitions (e.g., `gpu` or `gpu-guest`); a combined command sketch for steps 1-3 follows this list.
2. Inspect Node Configuration: Run `scontrol show nodes` to see the GPU configuration of each node. Look for entries such as `Gres=gpu:X` or `gres/gpu=X` inside `CfgTRES`, where `X` is the number of GPUs configured on that node.
3. Query Resource Limits: Use `sacctmgr show qos` to check the Quality of Service (QoS) limits that apply to your account; these may include GPU limits such as `MaxTRES`.
4. Contact Cluster Admins: If you're still unsure about your GPU allocation, reach out to your cluster administrators. They can confirm the specific details of your account's GPU access.
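As a minimal sketch of steps 1-3, the commands below should run as-is on most SLURM installations; `$USER` expands to your own login name, and the `grep` filter is only a convenience for spotting the GPU-related lines in the node listing.

```bash
# 1. Which accounts, partitions, and QoS your user is associated with
sacctmgr show associations user=$USER

# 2. GPU configuration of the nodes: look for Gres=gpu:X and gres/gpu=X entries
scontrol show nodes | grep -E 'NodeName|Gres=|CfgTRES'

# 3. QoS limits: check the MaxTRES and GrpTRES columns for gres/gpu values
sacctmgr show qos
```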
Output explanation
Let’s interpret an example output from `sacctmgr show associations`:
```
Cluster Account User Partition Share Priority GrpJobs GrpTRES GrpSubmit GrpWall GrpTRESMins MaxJobs MaxTRES MaxTRESPerNode MaxSubmit MaxWall MaxTRESMins QOS Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
slurm root 1 normal
slurm root root 1 normal
slurm_clu+ root 1 normal
slurm_clu+ root root 1 normal
```
This output shows resource-allocation details in a SLURM-managed cluster. The table above is association output: each row ties a cluster and account (and optionally a user) to a fairshare value (`1`) and a QoS (`normal`), with all TRES-limit columns left empty. The field descriptions below apply to the companion `sacctmgr show qos` output from step 3, which is where per-job GPU limits such as `MaxTRES` usually appear:
Explanation of Fields:
- Name: The name of the QoS configuration (e.g., `normal`, `dgx2q-qos`, and `defq-qos`).
- Priority: Priority of jobs submitted under the QoS. A priority of `0` indicates no special prioritization.
- GraceTime: The grace period a job is given to finish once it has been selected for preemption, if preemption is enabled (here it's `00:00:00`, meaning no grace time).
- Preempt / PreemptExemptTime / PreemptMode: Job preemption settings (none specified in the table).
- Flags: Special flags applied to the QoS (e.g., `cluster` in all rows indicates resources are managed at the cluster level).
- UsageThres / UsageFactor: Utilization thresholds and scaling factors for resource usage (defaults are shown here).
- GrpTRES: Group-level Trackable Resources (TRES), such as GPU, CPU, or memory allocations.
- GrpTRESMins, GrpTRESRunMin, GrpJobs: Limits on resource usage over time or on concurrently running jobs.
- MaxTRES: Maximum trackable resources a single job may request (e.g., `4` in `dgx2q-qos` and `defq-qos`, which likely refers to GPUs).
- MaxWall: Maximum wall time for jobs under this QoS (not specified here).
- MaxJobsPU / MaxSubmitPU: Limits on the number of jobs a user can run or submit simultaneously (not specified here).
- MinTRES: Minimum resources required for jobs under this QoS (not specified).
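If the full QoS table is hard to read, one shortcut is to switch to parsable output and filter for GPU TRES entries. The `gres/gpu` pattern below is an assumption about how your site names its GPU resources; adjust it if your cluster uses typed GPUs (e.g., `gres/gpu:a100`).

```bash
# Pipe-delimited QoS listing, keeping only lines that mention GPU TRES
sacctmgr --parsable2 show qos | grep -i 'gres/gpu'

# Same idea for your associations: GPU limits set at the account level
sacctmgr --parsable2 show associations user=$USER | grep -i 'gres/gpu'
```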
Key Observations:
- The `normal` QoS doesn't have a `MaxTRES` value specified, so it imposes no explicit per-job GPU limit; what you can actually use under it is governed by your partition and association settings.
- The `dgx2q-qos` and `defq-qos` configurations both have a `MaxTRES` of `4`, which likely indicates a limit of 4 GPUs per job submitted under these QoS settings.
- No other strict resource-usage or job limits are explicitly defined in this table.
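Putting this together, here is a small test job that requests GPUs within the assumed 4-GPU per-job limit. The partition name `dgx2q` is hypothetical (list real partitions with `sinfo`); the QoS name is taken from the example above.

```bash
#!/bin/bash
#SBATCH --partition=dgx2q     # hypothetical partition name; check sinfo for real ones
#SBATCH --qos=dgx2q-qos       # QoS from the example above
#SBATCH --gres=gpu:2          # 2 GPUs, within the assumed per-job MaxTRES of 4
#SBATCH --time=00:10:00

nvidia-smi                    # print the GPUs actually allocated to this job
```

Submit it with `sbatch` and check the job's output file. If the request exceeded the QoS limit, SLURM would reject it or leave it pending with a QoS-related reason visible in `squeue`.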