# Slurm

We use [SLURM](https://slurm.schedmd.com/overview.html) as a workload manager to schedule jobs onto compute resources. SLURM ensures that each user gets a fair share of the limited compute resources and that users do not interfere with each other, e.g. when running benchmarks.

*Important: You can only access a node via SSH when you have a SLURM allocation of that node.*

Other resources:
- [Slurm Tutorial]()

## Basics
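
As a starting point, a batch job is usually described in a shell script whose `#SBATCH` comment lines carry the resource request. The following is a minimal sketch, not a Jet-specific template: the job name, time limit, and log file name are placeholders, and any partition or account options your cluster may require are left out.

```bash
#!/bin/bash
# placeholder name shown in the queue
#SBATCH --job-name=hello
# a single task
#SBATCH --ntasks=1
# wall-clock limit (HH:MM:SS)
#SBATCH --time=00:05:00
# %j is replaced by the job ID
#SBATCH --output=hello-%j.log

# SLURM_JOB_ID is set by Slurm inside a job; the fallback only matters
# when the script is run directly for a quick syntax check.
msg="Job ${SLURM_JOB_ID:-none} running on $(hostname)"
echo "$msg"
```

Saved as, say, `hello.sh` (a hypothetical file name), this would be submitted with `sbatch hello.sh` and monitored with `squeue -u $USER`.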



## IMGW special commands

There are currently a few extra commands that can be used on the Jet Cluster to facilitate usage of the nodes.

Tools:
- `jobinfo`
- `jobinfo_remaining`
- `nodeinfo`
- `queueinfo`
- `watchjob`


```bash
# Get information on your job
jobinfo
# or use a JOBID
jobinfo 123456
# 
jobinfo_remaining
```

## Jobs


## MPI 



## Status and reason codes

The `squeue` command reports an active job’s status using state and reason codes. *__Job state codes__* describe a job’s current state in the queue (e.g. pending, completed). *__Job reason codes__* describe why the job is in its current state.

The following tables outline a variety of job state and reason codes you may encounter when using `squeue` to check on your jobs.

### Job State Codes

| Status        | Code  | Explanation                                                            |
| ------------- | :---: | ---------------------------------------------------------------------- |
| COMPLETED     | `CD`  | The job has completed successfully.                                    |
| COMPLETING    | `CG`  | The job is finishing but some processes are still active.              |
| FAILED        | `F`   | The job terminated with a non-zero exit code and failed to execute.    |
| PENDING       | `PD`  | The job is waiting for resource allocation. It will eventually run.    |
| PREEMPTED     | `PR`  | The job was terminated because of preemption by another job.           |
| RUNNING       | `R`   | The job currently is allocated to a node and is running.               |
| SUSPENDED     | `S`   | A running job has been stopped with its cores released to other jobs.  |
| STOPPED       | `ST`  | A running job has been stopped with its cores retained.                |

A full list of job state codes can be found in [Slurm’s documentation](https://slurm.schedmd.com/squeue.html#lbAG).
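
To get a quick overview of how your jobs are spread over these states, the compact state code can be printed per job and tallied. This is a sketch only: `count_states` is a hypothetical helper name, and the sample input stands in for real `squeue` output so the snippet runs without Slurm.

```bash
# Tally state codes read from stdin, one code per line.
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# On the cluster (assumed usage; -h drops the header, %t is the compact state code):
#   squeue -h -u "$USER" -o "%t" | count_states

# Sample input so the sketch can be tried anywhere:
printf 'R\nPD\nPD\nCG\n' | count_states
```

For the sample input this prints one `CODE=count` pair per line, e.g. `PD=2` for the two pending jobs.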


### Job Reason Codes

| Reason Code               | Explanation                                                                                  |
| ------------------------- | -------------------------------------------------------------------------------------------- |
| `Priority`                | One or more higher priority jobs are in the queue. Your job will eventually run.             |
| `Dependency`              | This job is waiting for a dependent job to complete and will run afterwards.                 |
| `Resources`               | The job is waiting for resources to become available and will eventually run.                |
| `InvalidAccount`          | The job’s account is invalid. Cancel the job and rerun with the correct account.             |
| `InvalidQOS`              | The job’s QoS is invalid. Cancel the job and rerun with the correct QoS.                     |
| `QOSGrpCpuLimit`          | All CPUs assigned to your job’s specified QoS are in use; job will run eventually.           |
| `QOSGrpMaxJobsLimit`      | Maximum number of jobs for your job’s QoS has been reached; job will run eventually.         |
| `QOSGrpNodeLimit`         | All nodes assigned to your job’s specified QoS are in use; job will run eventually.          |
| `PartitionCpuLimit`       | All CPUs assigned to your job’s specified partition are in use; job will run eventually.     |
| `PartitionMaxJobsLimit`   | Maximum number of jobs for your job’s partition has been reached; job will run eventually.   |
| `PartitionNodeLimit`      | All nodes assigned to your job’s specified partition are in use; job will run eventually.    |
| `AssociationCpuLimit`     | All CPUs assigned to your job’s specified association are in use; job will run eventually.   |
| `AssociationMaxJobsLimit` | Maximum number of jobs for your job’s association has been reached; job will run eventually. |
| `AssociationNodeLimit`    | All nodes assigned to your job’s specified association are in use; job will run eventually.  |

A full list of job reason codes can be found in [Slurm’s documentation](https://slurm.schedmd.com/squeue.html#lbAF).
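
Most of the reasons above resolve on their own; only the invalid account and invalid QoS cases need the job to be cancelled and resubmitted. A small filter can pick those out. This is a sketch under assumptions: `needs_action` is a hypothetical helper, it expects `JOBID REASON` pairs on stdin, and the sample lines replace real `squeue` output.

```bash
# Print the IDs of pending jobs that will not start without user action.
# The reason is compared case-insensitively.
needs_action() {
    awk '{ r = tolower($2) } r == "invalidaccount" || r == "invalidqos" { print $1 }'
}

# On the cluster (assumed usage; %i is the job ID, %r the reason):
#   squeue -h -u "$USER" -t PENDING -o "%i %r" | needs_action

# Sample input so the sketch can be tried anywhere:
printf '101 Priority\n102 InvalidAccount\n103 Resources\n' | needs_action
```

Jobs listed by this filter can then be removed with `scancel <jobid>` and submitted again with corrected options.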


## Get information on your jobs

```sh title='Job details'
# get all your jobs since a given start date
sacct --start=YYYY-MM-DD -u $USER -o start,jobid,jobidraw,jobname,partition,maxvmsize,elapsed,state,exitcode
# get more information on one job
sacct -j [jobid]
```
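
When the accounting list gets long, it can help to show only the jobs that did not end cleanly. The sketch below assumes pipe-separated `JobID|State|ExitCode` lines as produced by `sacct` with `-P`; `problem_jobs` is a hypothetical helper name, and the sample lines replace real accounting data.

```bash
# Keep only jobs that are neither finished successfully nor still queued/running.
problem_jobs() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING"'
}

# On the cluster (assumed usage; -n drops the header, -P gives '|' separators,
# -X lists only the job allocation, not every step):
#   sacct -n -P -X -u "$USER" --start=YYYY-MM-DD -o jobid,state,exitcode | problem_jobs

# Sample input so the sketch can be tried anywhere:
printf '100|COMPLETED|0:0\n101|FAILED|1:0\n102|TIMEOUT|0:0\n103|RUNNING|0:0\n' | problem_jobs
```

For the sample input only the `FAILED` and `TIMEOUT` lines remain.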

```sh title='Job efficiency'
# get a job's efficiency report
seff [jobid]
# example showing only 3% memory and 41% CPU efficiency!
seff 2614735
Job ID: 2614735
Cluster: cluster
User/Group: /vscusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 30
CPU Utilized: 01:00:33
CPU Efficiency: 41.05% of 02:27:30 core-walltime
Job Wall-clock time: 00:04:55
Memory Utilized: 596.54 MB
Memory Efficiency: 2.91% of 20.00 GB
```
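
The CPU efficiency in this report is the CPU time actually used divided by the core-walltime, i.e. the wall-clock time multiplied by the number of allocated cores (here 30 × 00:04:55 = 02:27:30). The arithmetic can be reproduced with a short sketch; `to_seconds` and `cpu_efficiency` are hypothetical helper names, and times with a day field are not handled.

```bash
# Convert HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Arguments: CPU time used, wall-clock time, total number of cores.
# Prints the efficiency in percent with two decimals.
cpu_efficiency() {
    awk -v u="$(to_seconds "$1")" -v w="$(to_seconds "$2")" -v c="$3" \
        'BEGIN { printf "%.2f\n", 100 * u / (w * c) }'
}

# Values taken from the seff report above.
cpu_efficiency 01:00:33 00:04:55 30   # prints 41.05
```

A low value like this suggests that fewer cores would have been enough for the job.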

There is a helpful [script](seff-array.py) that can report job efficiency for job arrays too.

??? note "seff-array.py"

    ``` sh title="seff-array.py"
    --8<-- "seff-array.py"
    ```

One can use that to get more detailed information on a job array:

```sh title="Running job efficiency report array"
# usually one needs to install a few dependencies first.

```