The Amarel cluster and all associated systems are in normal production mode: no outages, all systems are running.
The next maintenance outage (normally 2-3 days) is expected to take place in August before the Fall term begins.
Anticipated Annual Maintenance Schedule:
- 1st week of January (just after Winter Break)
- 2nd week of May (between Spring and Summer terms)
- 3rd week of August (between Summer and Fall terms)
Our vision is to provide a better user environment in which updates are rolling, affecting one portion of the system at a time instead of the whole system. We will strive to schedule maintenance during less active periods so that classes and grants are minimally affected.
Summary of work done
- Hill Data Center facilities overhaul (see https://rutgersit.rutgers.edu/services-unavailable-for-technology-upgrades-on-june-1-2 for details)
- CICnet is in production. This means any node on any of the three Rutgers campuses can connect to any other node on any of the other campuses.
- Some filesystems have been changed (e.g., the former /scratch directory has been moved to /oldscratch). There are now campus-specific /scratch file systems using a new version of GPFS.
- There are now 4 different clusters, physically located on 3 different campuses. You will soon be able to choose which cluster and physical location to use for running your jobs. Federating these clusters, so that jobs can seamlessly migrate between available resources, is in the works.
- New equipment has been installed – including switches and compute nodes.
Is there any way we could have a guarantee (with the exception of unexpected emergencies, of course) on how often this will happen? Say, twice per year?
We are in the process of establishing our uptime goals, and maintenance is obviously part of that. We are trying to find a balance between rapid change (including rapid growth, a wide range of projects, numerous clusters, latest software, etc.) and stability. At the same time, we are trying to improve how the system works, to make it more redundant and more useful. In addition, we are a new team with heterogeneous skills trying to find common approaches while balancing a large number of demands. While there is nothing new or surprising in these statements, we are growing. One of our goals is to shorten the windows for downtime. We have discussed periodic downtime, which may be part of our future, and we are starting work on automating and speeding up how some of our work is done. We are working to bundle our changes better while needing less time to debug and tweak.
Is there any way only a portion of the cluster could go down at a time for maintenance, rather than the entire cluster?
During our January 2019 maintenance, we changed the networking between the clusters, altered and added to the way schedulers work with the clusters, and changed how storage works across the three campuses. These changes enabled jobs to flow across all the clusters, simplified use of our storage, and made future upgrades generally less invasive.
We apologize for the interruptions to your work. We do understand many of the frustrations with our systems, and we are working hard to ensure that next year is better than this year and that the year after is better still.