PACE A Partnership for an Advanced Computing Environment

May 21, 2020

Georgia Power Micro Grid Testing (Week of June 8)

Filed under: Uncategorized — Michael Weiner @ 8:01 pm

[Update 7/14/20 4:00 PM]

Georgia Power will be conducting additional bypass tests for the Micro Grid power generation facility serving the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests carry a slightly higher risk of disruption than the tests conducted in June, but that risk has been substantially lowered by last month's additional testing.

As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.

Please contact us at pace-support@oit.gatech.edu with any questions.


[Update 6/15/20 12:45 PM]

Georgia Power will continue low-risk testing of the power supply to PACE’s Hive and testflight-coda clusters in the Coda data center this week.

In addition, Georgia Power is planning further testing in Coda at a later date, and we are working with them and other stakeholders to identify the best times and lowest-risk approach for completing this work.

[Update 6/12/20 6:45 PM]

Georgia Power will continue low-risk testing of the power supply to the Coda data center next week.

[Original Post]

During the week of June 8, Georgia Power will perform a series of bypass tests for the power that feeds the Coda data center, housing PACE’s Hive and testflight-coda clusters. This is a further step in establishing a Micro Grid power generation facility for Coda, after progress during the last maintenance period.
Georgia Power has classified all of these tests as low risk, and we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.
Please contact us at pace-support@oit.gatech.edu with any questions.

May 18, 2020

[Resolved] Home directory failures

Filed under: Uncategorized — Michael Weiner @ 12:23 pm

[Update 5/18/20 4:25 PM]

Reliable access to home directories was restored early this afternoon. The cause was a DNS issue on the GT network: the DNS server that provides connections to the home and utility storage devices was responding slowly but was not completely down, so resolution never failed over to the backup server. Working with OIT, we have reordered the DNS servers, and access is restored. Please contact us at pace-support@oit.gatech.edu with any questions.

If jobs failed due to the outage, please resubmit them to run again.

[ Issue began approximately 2 PM on 5/17/20 ]

We are experiencing an intermittent outage on PACE affecting home directories and certain other mounted utility directories. We are currently working to restore access. Thank you to those of you who reported the issue to us this afternoon. This intermittent mount failure can cause the following issues:

  • Home directories not loading on login nodes
  • Login sessions starting with “bash” instead of “~” as the prompt and displaying warning messages
  • Batch or interactive jobs failing immediately after launch, with an error message such as “no such file or directory”, because files cannot be loaded
  • “pace-check-queue” and other PACE utilities failing to report information as expected
  • Missing home directories in file transfer utilities (scp or sftp)
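If you are unsure whether your login session is affected, a quick check like the following can confirm whether your home directory is mounted and readable (a minimal sketch, not a PACE utility; the printed messages are illustrative):

```shell
# Sanity check: is the home directory present and listable?
# (illustrative helper, not an official PACE tool)
if [ -d "$HOME" ] && ls "$HOME" >/dev/null 2>&1; then
    echo "home directory OK"
else
    echo "home directory unavailable"
fi
```

During an intermittent mount failure, the check may succeed on one attempt and fail on the next, so a failed job alongside a passing check is still consistent with this outage.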


For jobs that have failed, please wait until after we have completed the repair and then resubmit your jobs.

We will provide updates as they become available. Thank you for your patience.

May 7, 2020

[Resolved] Emergency Switch Reboot

Filed under: Uncategorized — Michael Weiner @ 4:15 pm

[Resolved 5/15/20 9:30 PM]

Extensive repairs during our quarterly maintenance period resolved the remaining InfiniBand issues.

[Update 5/9/20 4:20 PM]

We are continuing to work to resolve remaining connectivity issues in the Rich datacenter. We have made additional adjustments since this morning, which have improved connectivity and reliability. Read/write access to GPFS (data and scratch) at the normal rate has been restored to nearly all nodes; the few nodes with remaining difficulties have been offlined, so no new jobs will start on them, although jobs already running there may hang. However, we continue to see intermittent issues with MPI jobs on active nodes, and we will continue to investigate next week. Please check any running jobs to see whether they are producing output or hanging; if a job is hanging, please cancel it. Please resubmit any jobs that have failed, as most non-MPI and MPI jobs should now succeed if resubmitted. Keep in mind that any job with a walltime request that cannot complete by 6 AM on Thursday will be held until after the scheduled maintenance period.
Thank you for your patience during this emergency repair.

[Update 5/9/20 10:45 AM]

We are continuing to work to resolve remaining issues with connectivity in the Rich datacenter. We have deployed the replacement switch, previously planned for the upcoming maintenance period, and it has been in place since approximately 11:45 PM Friday evening. We are continuing to troubleshoot access to GPFS (data and scratch) and MPI job functionality.
The most affected users, those with long-running jobs, have been contacted directly with instructions for checking the progress of their jobs.
Thank you for your patience during this emergency repair.

[Update 5/8/20 11:00 AM]

Our team worked into the early hours of this morning to complete the emergency maintenance, but we have not yet completely resolved all issues. New jobs were released to run around 1:15 AM. We are continuing to isolate and fix errors in the InfiniBand network affecting read/write on GPFS storage (data and scratch) and possibly MPI jobs. Please contact us at pace-support@oit.gatech.edu about any running jobs where you encounter slow performance, which will help us in identifying specific nodes with issues.
Many affected jobs may run more slowly than normal. In order to mitigate loss of research due to these issues, we have administratively added 24 hours to the walltime request of any job currently running. Please note that this extension will not extend job completion times beyond 6:00 AM on Thursday, when our scheduled maintenance period begins. If you resubmit a job, please keep in mind that any job that will not complete by Thursday morning will be held until after scheduled maintenance is complete.
We apologize for the disruption, and we will continue to update you on the status of this repair.

[Update 5/7/20 4:05 PM]

We encountered a complication during the reboot, and our engineers are currently working to complete the repair. We will provide updates as they become available.

[Original message]

We have an emergency need to reboot an InfiniBand switch in the Rich datacenter today, as it is likely to fail shortly without intervention. We will conduct this reboot at 3 PM today, and we expect the outage to last approximately 15 minutes. Any jobs running at 3 PM today are likely to fail if they attempt to read/write files to/from data or scratch directories during the outage or if they are employing MPI. We have stopped all new jobs from beginning in order to reduce the number of affected jobs, and we will release them after the reboot. For any job that is already running, please check the output and resubmit if your job fails. Jobs that do not read/write in the data or scratch directories during the outage window should not be affected.
We have planned a long-term repair to this equipment during next week’s maintenance period, but this emergency reboot is necessary in the meantime.
PACE resources in the Coda datacenter, including Hive and testflight-coda, will not be impacted. CUI/ITAR resources in Rich are also unaffected.

May 1, 2020

[Complete] PACE Maintenance – May 14-16

Filed under: Uncategorized — Michael Weiner @ 2:20 pm

[Update 5/15/20 9:30 PM]

We are pleased to announce that our May 2020 maintenance period has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible.
As usual, there are a small number of straggling nodes that will require additional intervention.

A summary of the changes and actions accomplished during this maintenance period:
– (Completed) [Hive/Testflight-Coda] Georgia Power began work to establish a Micro Grid power generation facility for Coda. Power has been restored.
– (Completed) [Hive] Default modules were changed to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, moving to Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the software base under the new hierarchy has been rebuilt, which may affect specific compiled and MPI-compiled applications.
Users may be impacted. The old PACE software basis can be accessed, interactively or within scripts, simply by running “module load pace/2019.08”.
PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

- (Completed) Performed upgrades and replacements on several InfiniBand switches in the Rich datacenter.
- (Completed) Replaced other switches and hardware in the Rich datacenter.
- (Completed) Updated software modules in Hive.
- (Completed) Updated Salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
Thank you for your patience!

[Update 5/13/20 10:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM tomorrow and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.


Georgia Power will begin work to establish a Micro Grid power generation facility for Coda on Thursday, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues arise that extend the outage for Hive and testflight-coda, users will be notified accordingly.


ITEMS REQUIRING USER ACTION:

– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.

Functionally, the default MPI and compiler are only patch updates, moving to Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the software base under the new hierarchy has been rebuilt, which may affect specific compiled and MPI-compiled applications.

Users may be impacted. The old PACE software basis can be accessed, interactively or within scripts, simply by running “module load pace/2019.08”.

PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.
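As an illustration of the action above, a PBS script that pins the previous software basis might look like the following sketch (the job name, resource request, and application name are placeholders; only the “module load pace/2019.08” line comes from the instructions above):

```shell
#PBS -N example-job          # placeholder job name
#PBS -l nodes=1:ppn=4        # placeholder resource request
#PBS -l walltime=01:00:00

# Select the older software basis instead of the new pace/2020.01 default
module load pace/2019.08

cd $PBS_O_WORKDIR
mpirun -np 4 ./my_mpi_app    # placeholder MPI application
```

Omitting the “module load pace/2019.08” line gives the new pace/2020.01 defaults, which is the configuration PACE recommends migrating to.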


ITEMS NOT REQUIRING USER ACTION:

- Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.

– Replace other switches and hardware in the Rich datacenter.

– Update software modules in Hive.

- Update Salt configuration management settings on all the production servers.


If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/11/20 8:30 AM]

We would like to remind you that we are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda on Thursday, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues arise that extend the outage for Hive and testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. This is described at http://docs.pace.gatech.edu/hive/software_guide/#meta-module-updates-2020.
Functionally, the default MPI and compiler are only patch updates, moving to Intel 19.0.5 (previously 19.0.3) and MVAPICH 2.3.2 (previously 2.3.1). However, the software base under the new hierarchy has been rebuilt, which may affect specific compiled and MPI-compiled applications.
Users may be impacted. The old PACE software basis can be accessed, interactively or within scripts, simply by running “module load pace/2019.08”.
PACE encourages users to migrate to using the new defaults, as new software will continue to be added to the newest software basis. Users may preserve current workflows by using the older pace module described above.

ITEMS NOT REQUIRING USER ACTION:
- Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
- Replace other switches and hardware in the Rich datacenter.
- Update software modules in Hive.
- Update Salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.


[Original Post]

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on May 14 and conclude at 11:59 PM on May 16. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

Georgia Power will begin work to establish a Micro Grid power generation facility for Coda on Thursday, following initial testing during the February maintenance period. This means that the research hall of the Coda datacenter, including the Hive and testflight-coda clusters, will be powered down for an expected 12-14 hours. Should any issues arise that extend the outage for Hive and testflight-coda, users will be notified accordingly.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:
– [Hive] Default modules will change to point to pace/2020.01 from pace/2019.08, which uses an updated MPI and compiler. Users employing default modules will need to update their PBS scripts to ensure their workflows will succeed. A link with detailed documentation of this change and necessary action by users will be provided prior to the maintenance period.

ITEMS NOT REQUIRING USER ACTION:
- Perform upgrades and replacements on several InfiniBand switches in the Rich datacenter.
- Replace other switches and hardware in the Rich datacenter.
- Update software modules in Hive.
- Update Salt configuration management settings on all the production servers.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
