Intermittent SLURM Controller Issue Impacting Some Jobs
We’re actively troubleshooting an intermittent issue with the Amarel cluster’s SLURM controller (the core job scheduling and management system). At first, it wasn’t clear whether this issue was affecting jobs, but we now have evidence that some users’ jobs have been impacted.
The SLURM controller exited (stopped) and was restarted at the following times (ET):
Wednesday at 06:26, 17:15, 17:42, and 22:53; Thursday at 00:47 and 08:07.
Based on what we’ve seen, new or existing jobs could be interrupted if the controller momentarily stops again at a time when those jobs need to communicate with it. However, since the majority of jobs are not encountering issues, I recommend running jobs as usual and restarting any that are interrupted.
Of course, we’ll provide updates as we get more information.
Please let me know if you have questions.
Director, Research Support