Our October 2016 maintenance period is now complete. We’ve upgraded compute nodes, login nodes and interactive (post-processing) nodes to the Red Hat Enterprise Linux 6.7 previously deployed on the TestFlight cluster. This included a large number of bugfix and security patches, a major step forward in the InfiniBand layer, and recompiled versions of various MPI libraries, scientific libraries and applications in /usr/local. Please do let us know (via email to pace-support@oit.gatech.edu) if you see issues with your jobs.
We have brought compute nodes online and released previously submitted jobs. As usual, a number of compute nodes still need to be brought back online, and we are actively working to make them available as soon as possible.
DDN/GPFS work
Hardware repairs to the project directory (~/data) system are complete. Minor repairs to the scratch system will be rescheduled for a future maintenance period. The issue is minor and should not disrupt performance or availability of the scratch system. No user actions are expected.
Networking
The OIT Network Engineering team upgraded the software running on many of our switches, including our firewalls, to match what is running elsewhere on campus. No user actions are expected.
Electrical work
These problems were a bit more extensive than originally anticipated. With some help from the OIT Operations team, we have an alternate solution in place, and will complete this work during a future maintenance period. No user actions are expected.
Bonus objectives
We were able to add capacity to the project directory system, and we now have our first single filesystem that’s greater than a petabyte, coming in at about 1.3PB. Maybe that’ll last us a couple of weeks. 😉 Disks for the scratch system have been installed, and we will add them into the scratch filesystem shortly. This can be done live without impact to running jobs.