PACE A Partnership for an Advanced Computing Environment

June 27, 2020

[Resolved] Emergency Shutdown of all Compute Nodes, Schedulers, and Login Nodes in Rich Data Center

Filed under: Uncategorized — Semir Sarajlic @ 8:19 pm

[Update – June 28, 2020, 2:42pm]

We are following up with another update. Campus cooling is currently configured to support buildings as well as possible, but this is not “normal” operation. Facilities has indicated to us that we should be able to resume operation.

According to the most recent news from Atlanta Water, they have isolated the 36″ water main failure and are working on repairs, which may conclude late Wednesday at the earliest and Friday at the latest.

State of PACE: We have brought the compute nodes online along with the remaining services. As is often the case, a few nodes require specific manual action, and we will continue working to bring those straggling nodes back. We will contact the users whose jobs were terminated by yesterday’s emergency shutdown. We encourage all users to verify their recent jobs. Again, our storage system did not lose data.

Monitoring and Risk: OIT Operations staff will continue to monitor the temperature and cooling systems and will alert us to any major change. PACE will remain on standby in case we need to shut down services again because cooling cannot be maintained.

The Coda data center, which includes the TestFlight-Coda and Hive clusters, and our backup data facilities are not affected by this outage.

Thank you again for your patience while we address emergency operations.

[Update – June 27, 2020, 9:36pm]

Water pressure and cooling have been partially restored at the Rich data center. During this emergency shutdown, our storage did not experience data loss. At this time, we have partially restored services to the cluster login nodes, and we continue to work on restoring the gryphon login node. We have restored storage, schedulers, and data mover/Globus services.

For safety, we will keep the compute nodes offline overnight, and we aim to begin restoring the compute nodes on Sunday, June 28, along with any other services.

Thank you for your patience as we work through this incident.

 

[Original Note – June 27, 2020, 4:22pm]

Dear PACE Users,

There has been a water main break on a 36-inch transmission main at Ferst Dr NW and Hemphill Ave NW, causing a loss of water pressure to the campus chiller plants that provide cooling to the Rich and other data centers. GT Facilities is in the process of shutting down the chiller plants. The Operations team is monitoring the temperature in Rich and starting to deploy spot chillers.

This issue does not impact the CODA datacenter (hive and testflight-hive clusters).

We are initiating an emergency shutdown of Rich resources to prevent overheating. This will impact running jobs. We will keep storage systems online as long as possible, but may need to power them down as the situation requires.

Please save your work if possible, and refrain from submitting new jobs. We’ll keep you updated via email and the PACE blog as we continue to monitor developments.

June 26, 2020

[Resolved] Issue with InfiniBand Fabric and subnet managers

Filed under: Uncategorized — Semir Sarajlic @ 5:33 pm

Early today, the InfiniBand Fabric located in the Rich Datacenter (where most PACE resources are located) developed issues reaching the subnet managers. After on-site troubleshooting, the subnet manager was initialized. As of 11:30 AM local time, the InfiniBand Fabric is operational.

Some running jobs might have been affected during the outage period, and new jobs using MPI may also have encountered issues.

Please check your jobs for any potential issues. We deeply apologize for any inconvenience this may have caused.

June 23, 2020

DNS/DHCP maintenance

Filed under: Uncategorized — Michael Weiner @ 10:26 pm

OIT will be conducting scheduled maintenance on Thursday, June 25, 5:00 – 8:00 AM to patch gtipam and DNS/DHCP servers. Because the servers are redundant, the risk of any interruption to PACE is very low. If there is an interruption, you may be unable to connect to PACE or may lose your open connection to a login node, interactive job, VNC session, or Jupyter notebook. Running batch jobs should not be affected, even in the event of an interruption.
Please contact us at pace-support@oit.gatech.edu with any questions.

June 17, 2020

Emergency Network Maintenance Tomorrow (6/18)

Filed under: Uncategorized — Michael Weiner @ 9:19 pm
[Update 6/19/20 12:10 PM]

The network team is beginning additional emergency network maintenance immediately (at noon today), continuing through 7 PM this evening, to reverse changes from yesterday evening. It will have the same effect as yesterday’s outage, so you will likely lose your VPN connection and/or PACE connection at some point this afternoon during intermittent outages.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post]
The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tomorrow (Thursday) night, with targeted completion by 2 AM Friday morning. Although every effort is being made to avoid outages, this maintenance may cause two interruptions:
  • At some point during this maintenance, users may experience up to a 20-minute interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will likely lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working tomorrow evening. Note that this may also interrupt any connection you have made over the GT VPN to non-PACE locations. Connections to PACE from within the campus firewalls may also be interrupted, which means that resources outside of PACE required for PACE jobs, such as queries to some software licenses used on PACE, including MATLAB or COMSOL, may be interrupted.  Batch jobs already running on PACE should not be affected.
  • In addition, about midway through the maintenance, there will be a period of approximately 20-30 minutes where authentication will be unavailable. This will prevent any new connections to the VPN, to PACE, and to any cloud service that authenticates using GT credentials.  It is also possible for this interruption to cause new job starts to fail due to the loss of access to the authentication service.

We will alert you if there is any change of plans for this emergency maintenance.

Please contact us at pace-support@oit.gatech.edu with any questions.

June 11, 2020

Emergency Network Maintenance

Filed under: Uncategorized — Michael Weiner @ 6:21 pm
The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tonight, with targeted completion by midnight. At some point during this maintenance, users will experience a brief interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working this evening.
Note that this will also interrupt any connection you have made over the GT VPN to non-PACE locations.
Batch jobs running on PACE should not be affected, nor will connections from within the campus firewall.
We will alert you if there is any change of plans for this emergency maintenance.
Please contact us at pace-support@oit.gatech.edu with any questions.
