PACE: A Partnership for an Advanced Computing Environment

August 25, 2016

localized network outage has some nodes offline

Filed under: tech support — admin @ 4:21 pm

At approximately 10:40 this morning, a top-of-rack network switch in the P31 rack of our data center failed. This caused a loss of network connectivity for approximately 44 compute nodes across a wide variety of queues (see the list below). No other compute nodes are affected. Jobs running on these nodes will likely have failed as a result. The OIT network team is swapping in a replacement switch at the moment, and PACE staff are working to restore service as quickly as possible.

If you have access to any of the queues below, please check on the status of your jobs and resubmit as needed. You can check which queues you have access to with the ‘pace-whoami’ command.
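For example, a minimal check-and-resubmit sequence might look like the following sketch. The qstat and qsub invocations assume our usual Torque-style scheduler, and myjob.pbs is a placeholder for your own submission script:

    # list the queues you are allowed to submit to
    pace-whoami

    # check the current state of your jobs
    qstat -u $USER

    # resubmit a failed job to one of the affected queues, e.g. iw-shared-6
    # (replace myjob.pbs with your own submission script)
    qsub -q iw-shared-6 myjob.pbs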

We apologize for the inconvenience, and will work to bring these nodes back online as soon as possible.  If you have additional questions, please email pace-support@oit.gatech.edu.

aces
athena-intel
biocluster-6
bioforce-6
blue
chow
cochlea
dimer-6
dimerforce-6
granulous
hygene-6
hygeneforce-6
iw-shared-6
joe-6-intel
math-6
mathforce-6
orbit
prometforce-6
prometheus
sonar-6
sonarforce-6
starscream

August 1, 2016

resolved: storage problems this morning

Filed under: Uncategorized — admin @ 3:12 pm

We look to be back up at this point.  The root cause seems to have been a problem with the subnet manager that controls the InfiniBand network.  Since GPFS uses this network, the issue initially manifested as a storage problem.  However, many MPI codes use this network as well and may have crashed.

Again, we apologize for the inconvenience.  Please do check on your jobs if you use MPI.

storage problems this morning

Filed under: Uncategorized — admin @ 2:01 pm

Happy Monday!

Since about 2:30am this morning, we have been experiencing a GPFS problem and, while all data is safe, all GPFS services are currently unavailable.  This includes the scratch space and project directory (~/data) filesystems for many users.  We are working on restoring service as quickly as possible and apologize for the inconvenience.
