Greetings,
We’ve made substantial progress through our maintenance activities and have begun releasing jobs. A number of compute nodes still need to be brought online; however, all clusters have some resources available and are running jobs. We will continue working through these issues later today, after some sleep.
Major upgrade to DDN & new scratch storage
All data was migrated successfully to the new front ends, and additional disks have been added for the upcoming scratch storage. We experienced substantial delays due to unanticipated long-running processes required to join compute nodes to the new GPFS cluster; this work is still ongoing. Benchmarking suggests a slight performance improvement for those of you with project directories in GPFS.
New PACE router and firewall hardware & additional core network capacity
Successfully completed without incident.
Panasas scratch filesystem maintenance
Successfully completed without incident.
Migration of home directories
Successfully completed without incident.
Migration of /usr/local storage
Successfully completed without incident.
Begin transition away from diskless compute nodes
Migrated approximately 100 compute nodes. Some of these still have issues with GPFS, as noted above.
We’ve had some unexpected delays and challenges this time around. The short version is that we will need to extend our maintenance activities into tomorrow. We will release resources to you on a rolling basis as we bring compute nodes online.
The long version:
The storage system responsible for /usr/local and our virtual machine infrastructure experienced a hardware failure that cost us a significant amount of time. Some PACE staff have spent 40 of the last 48 hours on site working to recover it. We were already planning to transition /usr/local off of this storage and had alternate storage in place; likewise for the virtual machines, although our plan had been to live-migrate those after maintenance activities were complete. The good news is that there has been no data loss; the bad news is that we have had to accelerate the virtual machine migration, resulting in additional unplanned effort.
Also, the DDN work is taking far longer than expected. Part of this work required us to remove all nodes from the GPFS filesystem and add them back in again (roughly the sequence sketched below). Current estimates to bring everything back to full production range from an additional 12 to 24 hours. This means it will be between 10am and 10pm tomorrow before we have everything back up. As mentioned above, we will make things available as soon as we can. Pragmatically, that means clusters will initially be available at reduced capacity. Look for another post here when we start enabling logins again.
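For those curious about why the re-join takes so long: each compute node has to be dropped from and re-added to the GPFS cluster using the standard administration commands, and that cluster-wide work serializes poorly across hundreds of nodes. The sketch below is only illustrative, not our actual procedure; the node names and batch size are hypothetical, and a real re-join may also involve license and quorum adjustments.

```python
#!/usr/bin/env python3
"""Illustrative sketch of re-joining compute nodes to a GPFS cluster.

Not our exact procedure: node names, batch size, and error handling are
hypothetical. It simply strings together the standard GPFS administration
commands (mmshutdown, mmdelnode, mmaddnode, mmstartup) for a list of
nodes, a few at a time.
"""
import subprocess

# Hypothetical list of compute nodes to cycle through the cluster.
NODES = [f"compute-{i:03d}" for i in range(1, 101)]
BATCH_SIZE = 10  # re-join nodes in small batches to limit impact


def run(cmd):
    """Run a GPFS admin command and raise if it fails."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)


def rejoin(batch):
    """Remove a batch of nodes from the cluster, then add them back."""
    nodelist = ",".join(batch)
    run(["mmshutdown", "-N", nodelist])   # stop GPFS on these nodes
    run(["mmdelnode", "-N", nodelist])    # drop them from the cluster
    run(["mmaddnode", "-N", nodelist])    # re-add them as cluster members
    run(["mmstartup", "-N", nodelist])    # bring GPFS back up on them


if __name__ == "__main__":
    for start in range(0, len(NODES), BATCH_SIZE):
        rejoin(NODES[start:start + BATCH_SIZE])
```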
Our maintenance activities are now underway. All PACE clusters are down. Please watch this space for updates.
For details on the work to be completed, please see our previous posts here.
Greetings,
The PACE team is preparing for our quarterly maintenance, which will occur Tuesday, October 20 and Wednesday, October 21. We have a number of activities scheduled that should deliver improvements across the board.
** NO USERS WILL BE MIGRATED DURING THE MAINTENANCE PERIOD **
After the maintenance period, we will begin migrating users to the new scratch storage. This will be a lengthy process, with some user actions and coordination required. We will do our best to minimize the impact of the migration. We are targeting our January maintenance to retire the Panasas storage, as the service contracts expire at the end of December.