
Archive for category Uncategorized

[Resolved] Emergency Shutdown of all Compute Nodes, Schedulers, and Login Nodes in Rich Data Center

Posted by on Saturday, 27 June, 2020

[Update – June 28, 2020, 2:42pm]

We are following up with another update. The cooling on campus is currently set up to support buildings as well as possible, but this is not “normal” operation. Facilities has indicated to us that we should be able to resume operation.

According to the most recent news from Atlanta Water, they have isolated the 36″ water main failure and are working on repairs, which may conclude late Wednesday at the earliest and Friday at the latest.

State of PACE: We have brought compute nodes online along with the remaining services.  Frequently, there are a few nodes that require specific manual action.  We will continue to work on bringing back those straggling nodes.  We will contact the users whose jobs were terminated due to yesterday’s emergency shutdown.  We encourage all users to verify their recent jobs.  Again, our storage system did not lose data.

Monitoring and Risk: OIT Operations staff will continue to monitor the temperature and cooling systems and will alert us upon any major change. PACE will remain on standby in case we are unable to maintain cooling and need to shut down services again.

The Coda data center, which includes the TestFlight-Coda and Hive clusters, and our backup data facilities are not affected by this outage.

Thank you again for your patience while we address emergency operations.

[Update – June 27, 2020, 9:36pm]

Water pressure and cooling have been partially restored at the Rich data center. During this emergency shutdown, our storage did not experience data loss. At this time, we have partially restored services to cluster login nodes, and we continue to work on restoring the gryphon login node. We have restored storage, schedulers, and data mover/Globus services.

For safety, we will keep the compute nodes offline overnight, and we aim to begin restoring the compute nodes on Sunday, June 28, along with any other services.

Thank you for your patience as we work through this incident.

 

[Original Note – June 27, 2020, 4:22pm]

Dear PACE Users,

There has been a water main break on a 36-inch transmission main at Ferst Dr NW and Hemphill Ave NW, causing a loss of water pressure to the campus chiller plants that provide cooling to the Rich and other data centers. GT Facilities is in the process of shutting down the chiller plants. The Operations team is monitoring the temperature in Rich and is starting to deploy spot chillers.

This issue does not impact the Coda datacenter (Hive and testflight-coda clusters).

We are initiating an emergency shutdown of Rich resources to prevent overheating. This will impact running jobs. We will keep storage systems online as long as possible, but may need to power them down as the situation requires.

Please save your work if possible, and refrain from submitting new jobs. We’ll keep you updated via email and the PACE blog as we continue to monitor developments.

[Resolved] Issue with InfiniBand Fabric and subnet managers

Posted by on Friday, 26 June, 2020

Early today, the InfiniBand Fabric located in the Rich Datacenter (where most PACE resources are located) developed issues reaching the subnet managers. After on-site troubleshooting, the subnet manager was initialized. As of 11:30 AM local time, the InfiniBand Fabric is operational.

Some running jobs may have been affected during the outage, and new jobs using MPI may also have encountered issues.

Please check your jobs for potential issues. We deeply apologize for any inconvenience this may have caused.

DNS/DHCP maintenance

Posted by on Tuesday, 23 June, 2020

OIT will be conducting scheduled maintenance on Thursday, June 25, 5:00 – 8:00 AM to patch gtipam and DNS/DHCP servers. Due to redundant servers, the risk of any interruption to PACE is very low. If there is an interruption, you may be unable to connect to PACE or may lose your open connection to a login node, interactive job, VNC session, or Jupyter notebook. Running batch jobs should not be affected, even in the event of an interruption.
Please contact us at pace-support@oit.gatech.edu with any questions.

Emergency Network Maintenance Tomorrow (6/18)

Posted by on Wednesday, 17 June, 2020
[Update 6/19/20 12:10 PM]

The network team is beginning additional emergency network maintenance immediately (at noon today), continuing through 7 PM this evening, to reverse changes from yesterday evening. It will have the same effect as yesterday’s outage, so you will likely lose your VPN connection and/or PACE connection at some point this afternoon during intermittent outages.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post]
The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tomorrow (Thursday) night, with targeted completion by 2 AM Friday morning. Although every effort is being made to avoid outages, this maintenance may cause two interruptions:
  • At some point during this maintenance, users may experience up to a 20-minute interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will likely lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working tomorrow evening. Note that this may also interrupt any connection you have made over the GT VPN to non-PACE locations. Connections to PACE from within the campus firewalls may also be interrupted, which means that resources outside of PACE required for PACE jobs, such as queries to some software licenses used on PACE, including MATLAB or COMSOL, may be interrupted.  Batch jobs already running on PACE should not be affected.
  • In addition, about midway through the maintenance, there will be a period of approximately 20-30 minutes where authentication will be unavailable. This will prevent any new connections to the VPN, to PACE, and to any cloud service that authenticates using GT credentials.  It is also possible for this interruption to cause new job starts to fail due to the loss of access to the authentication service.

We will alert you if there is any change of plans for this emergency maintenance.

Please contact us at pace-support@oit.gatech.edu with any questions.

Emergency Network Maintenance

Posted by on Thursday, 11 June, 2020
The GT Network team will be performing emergency maintenance on the campus firewall beginning at 8 PM tonight, with targeted completion by midnight. At some point during this maintenance, users will experience a brief interruption that will cause any open VPN connection to close. If you are connected to PACE from off-campus via the GT VPN during the interruption, you will lose your connection, and any terminal session, interactive job, VNC session, or running Jupyter notebook will be interrupted. Please be prepared for such an interruption when working this evening.
Note that this will also interrupt any connection you have made over the GT VPN to non-PACE locations.
Batch jobs running on PACE should not be affected, nor will connections from within the campus firewall.
We will alert you if there is any change of plans for this emergency maintenance.
Please contact us at pace-support@oit.gatech.edu with any questions.

Georgia Power Micro Grid Testing (Week of June 8)

Posted by on Thursday, 21 May, 2020

[Update 6/15/20 12:45 PM]

Georgia Power will continue low-risk testing of the power supply to PACE’s Hive and testflight-coda clusters in the Coda data center this week.

In addition, Georgia Power is planning further testing in Coda at a later time, and we are working with them and other stakeholders to identify the best times and lowest-risk manner for completing this work.

[Update 6/12/20 6:45 PM]

Georgia Power will continue low-risk testing of the power supply to the Coda data center next week.

[Original Post]

During the week of June 8, Georgia Power will perform a series of bypass tests for the power that feeds the Coda data center, housing PACE’s Hive and testflight-coda clusters. This is a further step in establishing a Micro Grid power generation facility for Coda, after progress during the last maintenance period.
Georgia Power has classified all of these tests as low risk, and we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.
Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Home directory failures

Posted by on Monday, 18 May, 2020

[Update 5/18/20 4:25 PM]

Reliable access to home directories was restored early this afternoon. The cause was a DNS issue on the GT network: the DNS server used to connect to the home and utility storage devices was responding slowly but was not completely down, so requests did not fail over to the backup server. In concert with OIT, we have reordered the DNS servers, and access is restored. Please contact us at pace-support@oit.gatech.edu with any questions.
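The failover behavior described above can be illustrated with a minimal Python sketch. This is not the resolver logic used on the GT network; the server names, response times, and 5-second timeout are made-up values, chosen only to show why a server that answers slowly but within the timeout never triggers a switch to the backup, and why reordering the servers helps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DnsServer:
    """Hypothetical DNS server; response_seconds=None models a server that never answers."""
    name: str
    response_seconds: Optional[float]


def resolve(servers, timeout_seconds):
    """Ask servers in order; return (server_name, total_seconds_waited)."""
    waited = 0.0
    for server in servers:
        answered_in_time = (
            server.response_seconds is not None
            and server.response_seconds <= timeout_seconds
        )
        if not answered_in_time:
            # Only a missed timeout causes failover to the next server.
            waited += timeout_seconds
            continue
        return server.name, waited + server.response_seconds
    raise TimeoutError("no DNS server answered within the timeout")


slow_primary = DnsServer("primary", response_seconds=4.0)    # slow, but not down
healthy_backup = DnsServer("backup", response_seconds=0.05)

# The slow server still answers inside the timeout, so every lookup pays its delay.
before_reorder = resolve([slow_primary, healthy_backup], timeout_seconds=5.0)

# Reordering puts the healthy server first, so lookups are fast again.
after_reorder = resolve([healthy_backup, slow_primary], timeout_seconds=5.0)

# A server that is completely down does trigger failover.
dead_primary = resolve([DnsServer("primary", None), healthy_backup], timeout_seconds=5.0)
```

In this toy model, the "slow but alive" case is the worst of the three, because the backup is never consulted.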

If jobs failed due to the outage, please resubmit them to run again.

[ Issue began approximately 2 PM on 5/17/20 ]

We are experiencing an intermittent outage on PACE affecting home directories and certain other mounted utility directories. We are currently working to restore access. Thank you to those of you who reported the issue to us this afternoon. This intermittent mount failure can cause the following issues:

  • Home directories not loading on login nodes.
  • Login sessions starting with “bash” instead of “~” as the prompt and displaying warning messages
  • Batch or interactive jobs failing immediately after launch due to an inability to load files with an error message such as “no such file or directory”
  • “pace-check-queue” and other PACE utilities failing to report information as expected
  • Missing home directories on file transfer utilities (scp or sftp)

 

For jobs that have failed, please wait until after we have completed the repair and then resubmit your jobs.

We will provide updates as they become available. Thank you for your patience.

[Resolved] Emergency Switch Reboot

Posted by on Thursday, 7 May, 2020

[Resolved 5/15/20 9:30 PM]

Extensive repairs during our quarterly maintenance period resolved the remaining InfiniBand issues.

[Update 5/9/20 4:20 PM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have made additional adjustments since this morning, which have improved connectivity and reliability. Read/write access to GPFS (data and scratch) at the normal rate has been restored to nearly all nodes, and the few nodes with remaining difficulties have been offlined, so no new jobs will start on them, although jobs that were already running may hang. However, we continue to see intermittent issues with MPI jobs on active nodes, and we will continue to investigate next week. Please check any running jobs to see if they are producing output or hanging. If they are hanging, please cancel the job. Please resubmit any jobs that have failed, as most non-MPI and MPI jobs should work if resubmitted at this point. Keep in mind that any job with a walltime request that will not complete by 6 AM on Thursday will be held until after the scheduled maintenance period.
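One rough way to check whether a running job is still producing output is to look at how long ago its output file was last written. The Python sketch below is an illustration only, not a PACE utility: the function names and the 30-minute default threshold are assumptions, and a job that computes for a long time between writes will look idle without actually being hung.

```python
import os
import time


def idle_seconds(last_write_time, now):
    """Seconds since the last write, never negative (guards against clock skew)."""
    return max(0.0, now - last_write_time)


def looks_hung(last_write_time, now, max_idle_seconds):
    """True if nothing has been written for longer than the chosen threshold."""
    return idle_seconds(last_write_time, now) > max_idle_seconds


def output_looks_hung(path, max_idle_seconds=1800):
    """Apply the check to a real file using its modification time.

    The 1800-second default is an arbitrary assumption; pick a value that
    matches how often your own job normally writes output.
    """
    return looks_hung(os.path.getmtime(path), time.time(), max_idle_seconds)
```

For example, `output_looks_hung("my_job.out")` (a hypothetical file name) would flag an output file untouched for more than 30 minutes as a candidate for closer inspection before cancelling the job.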
Thank you for your patience during this emergency repair.

[Update 5/9/20 10:45 AM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have deployed the replacement switch, previously planned for the upcoming maintenance period, and it has been in place since approximately 11:45 PM Friday evening. We are continuing to troubleshoot access to GPFS (data and scratch) and MPI job functionality.
Users with long-running jobs who are most affected have been contacted directly with instructions for checking the progress of their jobs.
Thank you for your patience during this emergency repair.

[Update 5/8/20 11:00 AM]

Our team worked into the early hours of this morning to complete the emergency maintenance, but we have not yet completely resolved all issues. New jobs were released to run around 1:15 AM. We are continuing to isolate and fix errors in the InfiniBand network affecting read/write on GPFS storage (data and scratch) and possibly MPI jobs. Please contact us at pace-support@oit.gatech.edu about any running jobs where you encounter slow performance, which will help us in identifying specific nodes with issues.
Many affected jobs may run more slowly than normal. In order to mitigate loss of research due to these issues, we have administratively added 24 hours to the walltime request of any job currently running. Please note that this extension will not extend job completion times beyond 6:00 AM on Thursday, when our scheduled maintenance period begins. If you resubmit a job, please keep in mind that any job that will not complete by Thursday morning will be held until after scheduled maintenance is complete.
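As a rough illustration of the two rules above, the Python sketch below models the 24-hour extension capped at the start of maintenance, and the hold applied to resubmitted jobs. The function names are hypothetical and this is not the scheduler's actual logic; the only values taken from the announcements are the 24-hour extension and the maintenance start at 6:00 AM on Thursday, May 14, 2020.

```python
from datetime import datetime, timedelta

# From the announcements: maintenance begins Thursday, May 14, 2020 at 6:00 AM.
MAINTENANCE_START = datetime(2020, 5, 14, 6, 0)
EXTENSION = timedelta(hours=24)


def extended_deadline(current_deadline, maintenance_start=MAINTENANCE_START):
    """Add the 24-hour extension, but never past the start of maintenance.

    Modeling the cap with min() is an assumption about how the limit is applied.
    """
    return min(current_deadline + EXTENSION, maintenance_start)


def will_be_held(submit_time, walltime, maintenance_start=MAINTENANCE_START):
    """A resubmitted job is held if its requested walltime would cross into maintenance."""
    return submit_time + walltime > maintenance_start


# A job due to end Friday evening gains the full 24 hours.
full_extension = extended_deadline(datetime(2020, 5, 8, 18, 0))

# A job due to end Wednesday evening is capped at the maintenance start.
capped_extension = extended_deadline(datetime(2020, 5, 13, 20, 0))

# Resubmitting on Friday at noon: a 7-day request is held, a 2-day request is not.
held = will_be_held(datetime(2020, 5, 8, 12, 0), timedelta(days=7))
not_held = will_be_held(datetime(2020, 5, 8, 12, 0), timedelta(days=2))
```

When resubmitting, requesting a walltime short enough to finish before the maintenance start is what allows the job to run right away rather than be held.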
We apologize for the disruption, and we will continue to update you on the status of this repair.

[Update 5/7/20 4:05 PM]

We encountered a complication during the reboot, and our engineers are currently working to complete the repair. We will provide updates as they become available.

[Original message]

We have an emergency need to reboot an InfiniBand switch in the Rich datacenter today, as it is likely to fail shortly without intervention. We will conduct this reboot at 3 PM today, and we expect the outage to last approximately 15 minutes. Any jobs running at 3 PM today are likely to fail if they attempt to read/write files to/from data or scratch directories during the outage or if they are employing MPI. We have stopped all new jobs from beginning in order to reduce the number of affected jobs, and we will release them after the reboot. For any job that is already running, please check the output and resubmit if your job fails. Jobs that do not read/write in the data or scratch directories during the outage window should not be affected.
We have planned a long-term repair to this equipment during next week’s maintenance period, but this emergency reboot is necessary in the meantime.
PACE resources in the Coda datacenter, including Hive and testflight-coda, will not be impacted. CUI/ITAR resources in Rich are also unaffected.

[Complete] PACE Maintenance – May 14-16

Posted by on Friday, 1 May, 2020

[Update 5/15/20 9:30 PM]

We are pleased to announce that our May 2020 maintenance period has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible.
As usual, there are a small number of straggling nodes that will require additional intervention.

A summary of the changes and actions accomplished during this maintenance period:
– (Completed) [Hive/Testflight-Coda] Georgia Power began work to establish a Micro Grid power generation facility for Coda. Power has been restored.
– (Completed) [Hive] Default modules were changed to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the software base under the new hierarchy has been rebuilt, which may impact specific uses of compiled and MPI-compiled applications.
Users may be impacted. The old PACE software base can be accessed interactively and within scripts by running “module load pace/2019.08”.
PACE encourages users to migrate to the new defaults, as new software will continue to be added to the newest software base. Users may preserve current workflows by using the older pace module described above.

– (Completed) Performed upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– (Completed) Replaced other switches and hardware in the Rich datacenter.
– (Completed) Updated software modules in Hive.
– (Completed) Updated salt configuration management settings on all the production servers.
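For the Hive default-module change listed above, one way to see which PBS scripts depend on the defaults is to check whether a script loads a pace module explicitly. The Python sketch below is a hypothetical helper, not a PACE tool; the sample script contents (job name, resource request, and program name) are invented for illustration, and only the “module load pace/2019.08” command comes from the announcement.

```python
import re

# Matches an uncommented "module load ... pace/<version>" line in a job script.
PACE_MODULE_LINE = re.compile(
    r"^[ \t]*module[ \t]+load[ \t]+(?:\S+[ \t]+)*pace/(\S+)",
    re.MULTILINE,
)


def pinned_pace_version(script_text):
    """Return the pace version a script loads explicitly, or None if it relies on defaults."""
    match = PACE_MODULE_LINE.search(script_text)
    return match.group(1) if match else None


# Hypothetical script that preserves the old workflow by pinning the older module.
PINNED_SCRIPT = """#PBS -N example_job
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
module load pace/2019.08
./my_program
"""

# Hypothetical script that relies on the defaults and now picks up pace/2020.01.
DEFAULT_SCRIPT = """#PBS -N example_job
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
./my_program
"""

pinned = pinned_pace_version(PINNED_SCRIPT)
unpinned = pinned_pace_version(DEFAULT_SCRIPT)
```

Scripts for which the helper returns None are the ones that rely on the default modules and are worth testing against the new defaults.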

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
Thank you for your patience!

[Update 5/13/20 10:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM tomorrow and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

 

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda beginning on Thursday, after initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues and resultant delays occur that extend the outage for Hive & testflight-coda, users will be notified accordingly.

 

ITEMS REQUIRING USER ACTION:

– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.

Functionally, the default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the software base under the new hierarchy has been rebuilt, which may impact specific uses of compiled and MPI-compiled applications.

Users may be impacted. The old PACE software base can be accessed interactively and within scripts by running “module load pace/2019.08”.

PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

 

ITEMS NOT REQUIRING USER ACTION:

– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.

– Replace other switches and hardware in the Rich datacenter.

– Update software modules in Hive.

– Update salt configuration management settings on all the production servers.

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/11/20 8:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda beginning on Thursday, after initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues and resultant delays occur that extend the outage for Hive & testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, with the use of Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the software base under the new hierarchy has been rebuilt, which may impact specific uses of compiled and MPI-compiled applications.
Users may be impacted. The old PACE software base can be accessed interactively and within scripts by running “module load pace/2019.08”.
PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

ITEMS NOT REQUIRING USER ACTION:
– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– Replace other switches and hardware in the Rich datacenter.
– Update software modules in Hive.
– Update salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

[Original Post]

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda beginning on Thursday, after initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues and resultant delays occur that extend the outage for Hive & testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. A link with detailed documentation of this change and necessary action by users will be provided prior to the maintenance period.

ITEMS NOT REQUIRING USER ACTION:
– Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
– Replace other switches and hardware in the Rich datacenter.
– Update software modules in Hive.
– Update salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

OIT Network Services Team Firewall upgrades (5/5/2020)

Posted by on Monday, 27 April, 2020

PACE has been informed that the OIT Network Services Team is preparing for software upgrades on multiple firewall servers across the Georgia Institute of Technology Atlanta campus during three windows: 5/5/2020 20:00 – 23:59, 5/7/2020 20:00 – 23:59, and 5/8/2020 19:00 – 5/9/2020 02:00. While there are no direct impacts on the Rich and Coda datacenter networks, there is potential for interruptions in connections to license servers, which can lead to job failures. Applications that may be impacted include

  • Abaqus
  • Ansys
  • Comsol
  • Dymola
  • Matlab

and any other application that may have a license server not internal to PACE. Due to potential interruptions, please check any jobs scheduled to run during these periods. PACE apologizes for any impact on your research workflow that this may cause. 

The Network Team will report the status of the project via status.gatech.edu. Please check blog.pace.gatech.edu for updates.