At approximately 10:40 this morning, a top-of-rack network switch in the P31 rack of our data center failed. This caused a loss of network connectivity for approximately 44 compute nodes across a wide variety of queues (see below). No other compute nodes are affected. Jobs running on these nodes will likely have failed as a result. The OIT network team is swapping in a replacement at the moment, and PACE staff are working to restore service as quickly as possible.
If you have access to any of the queues below, please check on the status of your jobs and resubmit as needed. You can check which queues you have access to by using the ‘pace-whoami’ command.
We apologize for the inconvenience, and will work to bring these nodes back online as soon as possible. If you have additional questions, please email email@example.com.
We look to be back up at this point. The root cause seems to have been a problem with the subnet manager that controls the InfiniBand network. Since GPFS uses this network, the issue initially manifested as a storage problem. However, many MPI codes use this network as well and may have crashed.
Again, we apologize for the inconvenience. Please do check on your jobs if you use MPI.
Since about 2:30 this morning, we have been experiencing a GPFS problem. While all data is safe, all GPFS services are currently unavailable. This includes the scratch space and project directory (~/data) filesystems for many users. We are working on restoring service as quickly as possible and apologize for the inconvenience.
Some of our users will not be able to log in to PACE clusters after the maintenance day due to the ongoing scratch data migration process. Here are the details:
One of the tasks performed during this maintenance day (http://blog.pace.gatech.edu/?p=5943) was to deploy a new fileserver to serve the “scratch” storage.
The new system is now operational; however, the scratch data for some accounts is still being migrated. We have temporarily deactivated these accounts to prevent modifications to incomplete data sets. Jobs submitted by these users are also being held by the scheduler. Once the transfer is complete for a user, we will enable the account, release the user’s jobs, and send a notification email. Currently we have no means to estimate how long the migration will take, but we are doing all we can to make this process go as quickly as possible.
If you are one of these users and need any of your data urgently, please contact us at firstname.lastname@example.org. All of your data are intact and accessible by us. We will try to find a solution for delivering the data you need for your research.
Our July 2016 maintenance is now substantially complete. Again, we sincerely apologize for the unfortunate additional unplanned downtime.
As previously communicated, we’ve had an unexpected delay caused by the data migrations from the old scratch system to the newly acquired system. Some of these transfers are still in progress, with a limited number of users still remaining. We have temporarily disabled access for these users to prevent jobs running on incomplete scratch data. We are reaching out to the affected users individually with more details. These users will not be able to log in and their previously submitted jobs will not run until their scratch migration is complete. If you have not received a further notification from us and experience problems with logins or anything else, please do let us know as soon as possible by sending an email to email@example.com.
Scratch performance may be reduced as these migrations complete, and we are doing everything we can to finish these migrations as soon as possible.
We have brought compute nodes online and released previously submitted jobs. As usual, we have a number of compute nodes that still need to be brought back online, but we are actively working to make them available as soon as possible.
The new DDN SFA-7700 system is now operational and serving scratch storage for all users. We updated client software versions on all nodes. We have encountered an anomaly that reduces its internal redundancy but does not affect normal operation. We expect to be able to rectify this while in production.
Tasks complete as described
Network and local storage upgrades were implemented on schedulers as planned. Additional diskless nodes were converted to diskfull as planned.
Dear PACE users,
Despite our best efforts, the data copies for the PACE scratch space have not gone as quickly as we had projected. We have also encountered an anomaly with the new storage system, though we expect to be able to rectify this while in production. At this writing, we have many of the compute nodes online but cannot start jobs until the data copy is complete.
As with all our maintenance periods, there is always a remote possibility we will run over our estimated time. This is one of those times. Please accept our apology for this unavoidable delay. We can assure you all the data is intact and we are continuing to work to optimize the transfers to achieve a speedy return to service.
Our staff will continue to update you with our progress.
The PACE team
Quarterly maintenance is now underway. All clusters managed by PACE, including Tardis, are now offline. Please see our previous post for details.
Dear PACE users,
Quarterly maintenance is once again upon us. Starting at 6:00am TOMORROW MORNING, all resources managed by PACE will be taken offline. Maintenance will continue through Wednesday evening. Our activities are adhering to our originally published two-day schedule.
As a heads up, please make note of our Fall maintenance, which is now scheduled to begin at 6:00am on Thursday, October 13 and continue through Saturday, October 15. Please note that this is a three-day period, including weekend work. Further details to come as we get closer to October.
As previously communicated, our original plan to update various system software components in July has been deferred to a future maintenance period. We will be in touch in advance of the October maintenance with details on this, including where you can test your codes against the updated software (highly recommended!).
Our major activity this time around will be updates to our GPFS filesystems and DDN storage devices.
- We have acquired a new DDN SFA-7700, to which we will transition the scratch space. This will provide more consistent scratch performance, a path for future capacity and performance increases, and performance as good as or better than what we have now. Initially, the SFA-7700 will provide approximately 375TB of space. We will be increasing this to the 500TB we have currently as soon as additional disks can be procured. No user action will be required. We currently have approximately 220TB in use on scratch, so we do not expect this temporary decrease in available capacity to be an inconvenience.
- We have DDN engineers engaged to update firmware and software on our current SFA-12k. This will provide additional management and quality-of-service features, as well as the ability to transition to larger capacity drives. Additionally, we will reallocate the drives previously used for the scratch space to provide additional project space capacity and metadata performance. No user action will be required.
- To support the two above updates, we will also be upgrading the version of the GPFS client software (where installed) from version 3.5 to version 4.2. No user action will be required.
- Facilities electricians will be performing some electrical work in the data center that will require the power to be temporarily removed from a number of our racks. This work is to support some newly purchased equipment. No user action will be required.
- Additionally, as time permits, we will upgrade the network on some of our schedulers to 10-gigabit and add more local logging storage. This will not affect the Gryphon, NovaZohar or Tardis clusters. No user action will be required.
- Also as time permits, we will continue the transition away from diskless nodes. This mainly affects nodes that are 5-6 years old. No user action will be required.
All head nodes and support nodes in the VM farm are online.
Initial Post – 09:15am
Early this morning (2016/07/15, approximately 2:00am), we had a critical storage failure that caused our VM farm to declare all running head nodes as invalid. We’re looking into this seriously, as this is one of those “not supposed to happen” moments. In the meantime, the PACE team is working on getting these nodes back up and running for all users.
Welcome to our updated website! We’ve transitioned all of our content to a new website, available at pace.gatech.edu. Please be sure to check out the updated user support section, available via the front page link ‘Current User Support’. While we aim to keep our content as up-to-date as possible, if you notice anything that seems outdated, please let us know.
If you miss our old website or need content that isn’t present on our new website, please let us know – it’s temporarily available at prev.pace.gatech.edu.
As always, thanks for choosing PACE.