Current Status:

Amarel, Perceval, and Didact will be offline for maintenance January 6-10.
All Amarel resources, including those in Camden and Newark, will be offline.
Items to be addressed during the outage include:
  • Updating Lenovo Scalable Infrastructure firmware to match that of the storage systems
  • Updating firmware for the Lenovo Distributed Storage Solution for IBM Spectrum Scale (DSS-G) and the Lenovo GPFS Storage Server (GSS)
  • Rewriting all filesystems in the GPFS 5.0.x format to enable variable sub-block sizing
  • Adding GPFS updates to all compute node images
  • Implementing Spanning Tree Protocol (STP) across Amarel’s internal network
  • Making a range of network interface configuration changes on our non-storage infrastructure
  • Patching OS images and various service systems
  • Moving enclosures (i.e., racks, power, cooling, connectivity) to enable a future storage expansion
Please note:
(1) Submitted jobs with a run time that overlaps this maintenance window will be held in a “pending” state and will start once the maintenance is complete. So, for the next couple of weeks, it might be best to set run times that allow your jobs to finish before Jan 6 (a rough way to calculate this appears after these notes).
(2) The automated purging of files in /scratch that have not been accessed for 90 days will remain in effect through this maintenance period (a sketch for listing files that may be affected appears after these notes).
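
For note (1), here is a rough sketch of that calculation. It is only an illustration: the maintenance start date used below and the D-HH:MM:SS run-time format are assumptions rather than scheduler documentation, so substitute the actual outage date and your scheduler's own time-limit syntax.

    from datetime import datetime

    # Illustrative maintenance start; substitute the actual start of the
    # announced outage window.
    MAINTENANCE_START = datetime(2019, 1, 6, 0, 0)

    def max_safe_walltime(now=None):
        """Return the longest run time (a timedelta) that still ends before
        the maintenance window opens."""
        now = now or datetime.now()
        remaining = MAINTENANCE_START - now
        if remaining.total_seconds() <= 0:
            raise RuntimeError("The maintenance window has already started.")
        return remaining

    if __name__ == "__main__":
        limit = max_safe_walltime()
        hours, minutes = divmod(limit.seconds // 60, 60)
        # Print roughly in the D-HH:MM:SS form many schedulers accept,
        # e.g. a run-time limit of 3-12:45:00
        print(f"Longest safe run time: {limit.days}-{hours:02d}:{minutes:02d}:00")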
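
For note (2), if you want to see which of your files might be affected by the purge, the following sketch is one way to list them. It assumes a hypothetical per-user /scratch path and that "accessed" is keyed to each file's POSIX access time; adjust the path to wherever your files actually live.

    import os
    import time
    from pathlib import Path

    PURGE_DAYS = 90  # the purge threshold described in note (2)
    # Hypothetical per-user scratch path; adjust as needed.
    SCRATCH = Path("/scratch") / os.environ.get("USER", "")

    def stale_files(root, days=PURGE_DAYS):
        """Yield files under root whose last access time is older than `days` days."""
        cutoff = time.time() - days * 86400
        for path in root.rglob("*"):
            try:
                if path.is_file() and path.stat().st_atime < cutoff:
                    yield path
            except OSError:
                continue  # skip files that vanish or cannot be read

    if __name__ == "__main__":
        for f in stale_files(SCRATCH):
            print(f)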

Anticipated Annual Maintenance Schedule:

  • 1st week of January (just after Winter Break)
  • 2nd week of May (between Spring and Summer terms)
  • 3rd week of August (between Summer and Fall terms)

Our vision is to provide a better user environment in which updates are rolling, affecting one portion of the system at a time rather than the whole system. We will strive to schedule maintenance during less active periods so that classes and grant-funded work are minimally affected.

Past Maintenance:

Summary of work done

  • On July 10, we shifted to a new GPFS /scratch filesystem, and all nodes were updated with new GPFS software. Data stored in the old /scratch filesystem was moved to the new one.
  • Hill Data Center facilities overhaul (see https://rutgersit.rutgers.edu/services-unavailable-for-technology-upgrades-on-june-1-2 for details)
  • CICnet is in production. This means any node on any of the three Rutgers campuses can connect to any other node on any of the other campuses.
  • Some filesystems have been changed (e.g., the former /scratch directory has been moved to /oldscratch). There are now campus-specific /scratch filesystems using a new version of GPFS.
  • There are now 4 different clusters, physically located on 3 different campuses. You will soon be able to choose which cluster and physical location to use for running your jobs. Federating these clusters, so that jobs can seamlessly migrate between available resources, is in the works.
  • New equipment has been installed, including switches and compute nodes.

FAQ

Is there any way we could have a guarantee (with the exception of unexpected emergencies, of course) on how often this will happen? Say, twice per year?

We are in the process of establishing our uptime goals, and maintenance is obviously part of that. We are trying to find a balance between rapid change (including rapid growth, a wide range of projects, numerous clusters, the latest software, etc.) and providing stability. At the same time, we are trying to improve how the system works, to make it more redundant and more useful. Add to this that we are a new team with heterogeneous skills, trying to find common approaches while balancing a large number of demands. While there is nothing new or surprising in these statements, we are growing. One of our goals is to shorten the windows for downtime. We have discussed periodic downtime, which may be part of our future, and we are starting work on automating and speeding up how some of our work is done. We are also working to bundle our changes better while needing less time to debug and tweak.

Is there any way only a portion of the cluster could go down at a time for maintenance, rather than the entire cluster?

During our January 2019 maintenance, we changed the networking between the clusters, altered and added to the way the schedulers work with the clusters, and changed how storage works across the three campuses. This enables jobs to flow across all the clusters, simplifies use of our storage, and generally makes future upgrades less invasive.

Final Thought

We apologize for the interruptions to your work. We do understand many of the frustrations with our systems, and we are working hard to ensure that next year is better than this year and the year after is better still.