Our vision is to provide a better user environment, where updates are rolling updates that affect one portion of the system at a time instead of the whole system. We will strive to do maintenance during less active periods so that classes and grants are minimally affected.
Summary of work done
- CICnet is in production. This means any node on any of the three Rutgers campuses can connect to any other node on any of the other campuses.
- Some filesystems have been changed (e.g., the former /scratch directory has been moved to /oldscratch). There are now campus-specific /scratch file systems using a new version of GPFS.
- There are now 4 different clusters, physically located on 3 different campuses. You will soon be able to choose which cluster and physical location to use for running your jobs. Federating these clusters, so that jobs can seamlessly migrate between available resources, is in the works.
- New equipment has been installed – including switches and compute nodes.
Is there any way we could have a guarantee (with the exception of unexpected emergencies, of course) on how often this will happen? Say, once per year?
We are in the process of establishing our uptime goals, and maintenance is obviously part of that. We are trying to find a balance between rapid change (including rapid growth, a wide range of projects, numerous clusters, the latest software, etc.) and providing stability. At the same time, we are trying to improve how the system works, to make it more redundant and more useful. On top of this, we are a new team with heterogeneous skills trying to find common approaches while balancing a large number of demands. None of this is new or surprising; my point is that we are still growing. One of my goals is to shorten the windows for downtime. We have discussed periodic downtime, which may be part of our future, and we are starting work on automating and speeding up how some of our work is done. We are working to bundle our changes better while needing less time to debug and tweak.
Is there any way only a portion of the cluster could go down at a time for maintenance, rather than the entire cluster?
During this maintenance, we changed the networking between the clusters, altered and added to the way the schedulers work with the clusters, and changed how storage works across the three campuses. This will let jobs flow across all the clusters, simplify users’ use of our storage, and make future upgrades generally less invasive.
Why more than one cluster at a time? It seems this is not properly taking the users’ perspective into account.
The clusters on the three campuses share some infrastructure, and we are trying to connect the three, so the changes to the networking need to be done at the same time, as does the work on the scheduler and storage. We hope that in the future we can bring down one of the three sites at a time with minimal impact on jobs (i.e., your jobs will just continue to run, but they might run on nodes in Camden). These changes (and the next set) will give us a much wider base to work from. There is a lot of work involved in tying the three campuses together in a usable fashion.
We apologize for the interruptions to your work. We understand many of the frustrations with our systems, and we are working hard to ensure that next year is better than this year and that the year after is better still.