# Slurm

We use [Slurm](https://slurm.schedmd.com/overview.html) as a workload manager to schedule jobs onto compute resources. Slurm ensures that each user gets a fair share of the limited compute resources and that users do not interfere with each other, e.g. when running benchmarks.

*Important: You can only access a node via SSH when you have a Slurm allocation on that node.*

Other resources:

- [Slurm Tutorial]()

## Basics

## IMGW special commands

There are currently a few extra commands available on the Jet Cluster to facilitate usage of the nodes.

Tools:

- `jobinfo`
- `jobinfo_remaining`
- `nodeinfo`
- `queueinfo`
- `watchjob`

```bash
# Get information on your job
jobinfo
# or use a JOBID
jobinfo 123456
#
jobinfo_remaining
```

## jobs

## MPI

## status and reason codes

The `squeue` command reports an active job's status using state and reason codes. *__Job state codes__* describe a job's current state in the queue (e.g. pending, completed). *__Job reason codes__* describe why the job is in its current state.

The following tables outline the job state and reason codes you are most likely to encounter when using `squeue` to check on your jobs.

### Job State Codes

| Status        | Code  | Explanation                                                            |
| ------------- | :---: | ---------------------------------------------------------------------- |
| COMPLETED     | `CD`  | The job has completed successfully.                                    |
| COMPLETING    | `CG`  | The job is finishing but some processes are still active.              |
| FAILED        | `F`   | The job terminated with a non-zero exit code and failed to execute.    |
| PENDING       | `PD`  | The job is waiting for resource allocation. It will eventually run.    |
| PREEMPTED     | `PR`  | The job was terminated because of preemption by another job.           |
| RUNNING       | `R`   | The job is currently allocated to a node and is running.               |
| SUSPENDED     | `S`   | A running job has been stopped with its cores released to other jobs.  |
| STOPPED       | `ST`  | A running job has been stopped with its cores retained.                |

A full list of these job state codes can be found in [Slurm's documentation](https://slurm.schedmd.com/squeue.html#lbAG).

### Job Reason Codes

| Reason Code              | Explanation                                                                                 |
| ------------------------ | ------------------------------------------------------------------------------------------- |
| `Priority`               | One or more higher-priority jobs are queued ahead of yours. Your job will eventually run.   |
| `Dependency`             | This job is waiting for a job it depends on to complete and will run afterwards.            |
| `Resources`              | The job is waiting for resources to become available and will eventually run.               |
| `InvalidAccount`         | The job's account is invalid. Cancel the job and rerun with the correct account.            |
| `InvalidQOS`             | The job's QoS is invalid. Cancel the job and rerun with the correct QoS.                    |
| `QOSGrpCpuLimit`         | All CPUs assigned to your job's specified QoS are in use; job will run eventually.          |
| `QOSGrpMaxJobsLimit`     | Maximum number of jobs for your job's QoS has been reached; job will run eventually.        |
| `QOSGrpNodeLimit`        | All nodes assigned to your job's specified QoS are in use; job will run eventually.         |
| `PartitionCpuLimit`      | All CPUs assigned to your job's specified partition are in use; job will run eventually.    |
| `PartitionMaxJobsLimit`  | Maximum number of jobs for your job's partition has been reached; job will run eventually.  |
| `PartitionNodeLimit`     | All nodes assigned to your job's specified partition are in use; job will run eventually.   |
| `AssociationCpuLimit`    | All CPUs assigned to your job's specified association are in use; job will run eventually.  |
| `AssociationMaxJobsLimit`| Maximum number of jobs for your job's association has been reached; job will run eventually.|
| `AssociationNodeLimit`   | All nodes assigned to your job's specified association are in use; job will run eventually. |

A full list of these job reason codes can be found [in Slurm's documentation](https://slurm.schedmd.com/squeue.html#lbAF).

# Get information on your jobs

```sh title='Job details'
# get all your jobs since a given start date
sacct --start=YYYY-MM-DD -u $USER -o start,jobid,jobidraw,jobname,partition,maxvmsize,elapsed,state,exitcode
# get more information on one job
sacct -j [jobid]
```

```sh title='Job efficiency'
# get a job's efficiency report
seff [jobid]
# example showing only 3% memory and 41% cpu efficiency!
seff 2614735
Job ID: 2614735
Cluster: cluster
User/Group: /vscusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 30
CPU Utilized: 01:00:33
CPU Efficiency: 41.05% of 02:27:30 core-walltime
Job Wall-clock time: 00:04:55
Memory Utilized: 596.54 MB
Memory Efficiency: 2.91% of 20.00 GB
```

There is a helpful [script](seff-array.py) that can report job efficiency for job arrays too.

??? note "seff-array.py"

    ```py title="seff-array.py"
    --8<-- "seff-array.py"
    ```

One can use that to get more detailed information on a job array:

```sh title="Running job efficiency report array"
# usually one needs to install a few dependencies first.
```
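
Since `seff` prints plain `Key: value` lines, a report like the one shown earlier can be condensed with standard tools when checking several jobs. The following is only a minimal sketch and not part of the cluster tooling: the function name `seff_summary` is hypothetical, and it assumes the report contains `CPU Efficiency:` and `Memory Efficiency:` lines in the same format as the example output above. The sample input is copied from that example so the snippet runs without a Slurm installation.

```sh title='Condensing a seff report (sketch)'
# Hypothetical helper (not a Slurm or Jet Cluster command):
# reads a seff report on stdin and prints the two efficiency values.
seff_summary() {
    awk -F': ' '
        /^CPU Efficiency/    { split($2, a, "%"); cpu = a[1] }
        /^Memory Efficiency/ { split($2, a, "%"); mem = a[1] }
        END { printf "cpu=%s%% mem=%s%%\n", cpu, mem }
    '
}

# Demo on the sample report from this page (no Slurm needed).
seff_summary <<'EOF'
Job ID: 2614735
State: COMPLETED (exit code 0)
CPU Efficiency: 41.05% of 02:27:30 core-walltime
Memory Efficiency: 2.91% of 20.00 GB
EOF
```

On a node where `seff` is available, the same helper could be fed directly, e.g. `seff [jobid] | seff_summary`.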