Re: Caliburn and ELF clusters lost cooling
07/13/2021, 9:56 am
Caliburn and ELF Users,
Caliburn and ELF are still offline due to the data center cooling issue announced on July 6. The University’s facilities team is working to determine whether the data center where Caliburn is located can safely handle normal operations (the cluster generates a lot of heat, even when only a portion of it is in operation). Right now, we still don’t have any news or an estimated time when researchers may be able to access the system.
Please note that the JAN-JUN 2021 allocation period for Caliburn ended on June 30. The current outage interrupted users who were removing their data at the end of that allocation period. We therefore plan to restore access to Caliburn’s storage systems as soon as possible so that users can finish removing their data.
Unfortunately, there will not be another allocation period because support agreements for Caliburn’s and ELF’s infrastructure have ended, continuation funding is not available, and the hardware vendors have declined to further extend their support or warranty coverage. As a result, even if we can restore operation temporarily (for weeks or months), we can no longer guarantee the clusters’ availability to support research. We want researchers to make the most of any compute resources we have to offer, but ongoing use of Caliburn and ELF is “at your own risk” because the storage systems are not backed up, the infrastructure and software are no longer supported, and a hardware failure could end access to those systems at any time.
At this time, I strongly recommend exploring alternative compute and storage resources for your research. For Rutgers researchers, Amarel may be a good alternative. For both Rutgers and non-Rutgers researchers, the scale and diversity of XSEDE compute resources may be a good alternative. The support team at OARC can help with assessment of, and access to, all of those systems.
Of course, we will send updates when we have any useful news about Caliburn and ELF.
Please let me know if you have questions.
Galen
Galen Collier, PhD
Director, Research Support
Office of Advanced Research Computing (OARC)
07/06/2021, 10:56 pm
Caliburn and ELF Users,
Tonight at around 8:40 pm, the Caliburn and ELF clusters lost their cooling ability, and by 10 pm the compute nodes started shutting down due to high temperatures.
All running and pending jobs have been lost.
We will provide updates as we receive more information.
Ehud (Udi) Zelzion,
Senior Scientist Research Computing
Office of Advanced Research Computing (OARC)