How to Check GPU Access in SLURM Clusters

To determine the number of GPUs your account can access in a SLURM-managed cluster, follow these steps:

Check Account and Partition Access: Use the command sacctmgr show associations to view your account’s associations with partitions. Look for GPU-specific partitions (e.g., gpu or gpu-guest).
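
For example, you can narrow the listing to your own user and the columns that matter for GPUs (a sketch; the exact format field names can vary between SLURM versions):

# Show only your associations and the limit-related columns
sacctmgr show associations user=$USER format=Cluster,Account,Partition,QOS,GrpTRES,MaxTRESPerNode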

Inspect Node Configuration: Run scontrol show nodes to see the GPU configuration of nodes in the cluster. Look for lines like CfgTRES=gres/gpu:X, where X indicates the number of GPUs available per node.
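
To avoid scrolling through the full listing, you can filter for the GPU-related lines, or ask sinfo for a per-partition summary of generic resources (a sketch using standard scontrol and sinfo options):

# Show node names together with their configured GPUs
scontrol show nodes | grep -E 'NodeName|Gres=|CfgTRES'
# Per-partition summary: partition, node list, and GRES
sinfo -o "%P %N %G"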

Query Resource Limits: Use sacctmgr show qos to check Quality of Service (QoS) limits for your account. This may include GPU limits.
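
For example, to show only the limit columns where GPU caps usually appear (a sketch; field names such as MaxTRES can differ slightly across SLURM versions):

# A per-job GPU cap typically shows up in MaxTRES as gres/gpu=N
sacctmgr show qos format=Name,Priority,MaxTRES,GrpTRES,MaxWall,MaxJobsPU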

Contact Cluster Admins: If you’re unsure about your GPU allocation, reach out to your cluster administrators. They can provide specific details about your account’s GPU access.


Output explanation

Let’s interpret an example output from sacctmgr show associations:

   Cluster    Account       User  Partition     Share   Priority GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin 
---------- ---------- ---------- ---------- --------- ---------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- ------------- 
     slurm       root                               1                                                                                                                                                             normal                         
     slurm       root       root                    1                                                                                                                                                             normal                         
slurm_clu+       root                               1                                                                                                                                                             normal                         
slurm_clu+       root       root                    1                                                                                                                                                             normal  

This table is the output of sacctmgr show associations: each row ties a cluster, account, and (optionally) user to a partition, along with its share, priority, resource limits, and the QoS assigned to that association. The GPU caps themselves are usually defined on the QoS, so the fields below are explained as they appear in sacctmgr show qos output.

Explanation of Fields:

  1. Name: The name of the QoS configuration (e.g., normal, dgx2q-qos, and defq-qos).
  2. Priority: Priority of jobs submitted under the QoS. A priority of 0 indicates no special prioritization.
  3. GraceTime: The time a job is given to finish after it has been selected for preemption, if preemption is enabled (here it’s 00:00:00, meaning no grace period).
  4. Preempt / PreemptExemptTime / PreemptMode: Related to job preemption settings (none specified in the table).
  5. Flags: Special flags applied to the QoS (e.g., cluster in all rows indicates resources are managed at the cluster level).
  6. UsageThres / UsageFactor: Utilization thresholds and scaling factors for resource usage (defaults are shown here).
  7. GrpTRES: Group-level Trackable Resources (TRES), such as GPUs, CPUs, or memory allocations.
  8. GrpTRESMins, GrpTRESRunMin, GrpJobs: Limits on resource usage over time or for running jobs.
  9. MaxTRES: Maximum trackable resources a single job may use (e.g., 4 in dgx2q-qos and defq-qos, which likely refers to GPUs).
  10. MaxWall: Maximum wall time for jobs under this QoS (not specified here).
  11. MaxJobsPU / MaxSubmitPU: Limits on the number of jobs a user can run or submit simultaneously (not specified here).
  12. MinTRES: Minimum resources required for jobs under this QoS (not specified).
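
Because sacctmgr truncates wide columns (note the slurm_clu+ entries in the table above), parsable output is often easier to read when many fields are involved; a minimal sketch:

# Pipe-delimited output avoids column truncation
sacctmgr -P show associations user=$USER
sacctmgr -P show qos format=Name,MaxTRES,GrpTRES,MaxWall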

Key Observations:

  • The normal QoS has no MaxTRES value, so it imposes no GPU cap of its own; whether GPUs are reachable under it depends on the partition and association settings.
  • The dgx2q-qos and defq-qos configurations both have a MaxTRES of 4, which likely means jobs submitted under these QoS settings can use at most 4 GPUs each (see the example request after this list).
  • No other strict resource usage or job limits are explicitly defined in this table.
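
As a concrete example, a job submitted under one of those QoS settings could request up to 4 GPUs. The sketch below assumes a partition named dgx2q exists and that nvidia-smi is installed on its nodes; adjust both names to what your cluster actually reports:

# Request 4 GPUs, staying within the MaxTRES limit of 4
sbatch --partition=dgx2q --qos=dgx2q-qos --gres=gpu:4 --wrap="nvidia-smi"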

“Access denied” when trying to run sinfo

If you’re seeing “access denied” when trying to run a SLURM command like sinfo, it likely means you don’t have the necessary permissions to query the SLURM system or that SLURM isn’t set up correctly for your user. Here’s how to troubleshoot it:


✅ 1. Check SLURM installation

Run:

which sinfo
  • If this returns nothing, SLURM tools may not be in your PATH.
  • Fix: load the SLURM module (if your site uses environment modules): module load slurm
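
A quick way to check both at once (a sketch; the module name slurm is an assumption and varies by site):

# See whether a SLURM module exists, load it, then verify sinfo is found
module avail slurm 2>&1 | head
module load slurm
which sinfo && sinfo --version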

✅ 2. Check your user permissions and group membership

Some clusters restrict SLURM tools to specific user groups or roles. If you’re a new user:

  • Ask your cluster admin to confirm that:
    • Your user account is added to the proper group (e.g., slurm, hpcusers)
    • You have read access to SLURM configuration files (e.g., slurm.conf)
    • There are no node- or partition-level access restrictions

✅ 3. Try with scontrol show partition

Sometimes sinfo is restricted while scontrol remains accessible; scontrol show partition also gives more detailed output:

scontrol show partition
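
If that works, look for partition-level access restrictions in the output (a sketch):

# Show each partition's name and its AllowGroups/AllowAccounts/AllowQos settings
scontrol show partition | grep -Ei 'PartitionName|Allow'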

✅ 4. Check node access with squeue

Check whether you can run:

squeue -u $USER

If this works, SLURM is functioning, but access to other commands may be restricted.
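
If squeue works, a short test job is the most direct way to confirm you can actually be scheduled onto a GPU. A minimal sketch, assuming a partition named gpu exists and nvidia-smi is installed on its nodes:

srun -p gpu --gres=gpu:1 -t 00:05:00 nvidia-smi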


❗ If nothing works:

Ask your cluster administrator or support team:

  • Whether your user account has been fully set up
  • Whether SLURM commands are restricted to specific users or groups
  • If partitions/nodes are restricted to specific projects or labs
