PACE A Partnership for an Advanced Computing Environment

August 26, 2020

OIT’s Planned Network Maintenance

Filed under: Uncategorized — Semir Sarajlic @ 5:44 pm

[Update – September 8, 2020, 1:31PM] 

We are following up with an update on the schedule for OIT’s planned network maintenance.  OIT’s Network Engineering team will conduct two maintenance activities on the evening of Friday, September 11th, from 7:00pm to 4:00am. Blog entry: https://blog.pace.gatech.edu/?p=6889

[Update – August 28, 2020, 6:05PM] 

As of 4:46PM, the Office of Information Technology has postponed this evening’s data center firewall upgrade and network migration. The maintenance will be rescheduled in an effort to maintain access to systems that are critical to COVID-19 data collection and contact tracing. Please expect additional communications within the next two weeks regarding new dates.

 

[Original Note – August 26, 2020, 5:44PM] 

OIT’s Network Engineering team will conduct two maintenance activities on the evening of Friday, August 28th, starting at 8:00pm and anticipated to last 3 – 4 hours.  These activities will affect PACE’s connection to the outside Internet; the interruption is anticipated to last 30 minutes or less from its start, which may occur at any point during the maintenance window.

What’s about to happen:  On Friday, August 28th, from 8:00pm to 11:59pm, the Network Engineering team will upgrade the data center firewall appliances to the latest code recommended by Palo Alto, which addresses serious security vulnerabilities in earlier releases.  To reassure you, OIT’s network team has been operating with controls in place to mitigate these vulnerabilities, and this planned upgrade will further reduce our risk.  In the second maintenance activity, also on Friday, August 28th, from 8:00pm to 11:00pm, the Network Engineering team will swap service to a more capable Network Address Translation (NAT) appliance in the Coda datacenter, as the one currently in Coda is overloaded.  These activities will affect PACE’s connection to/from the Internet.

Who is impacted: PACE users may be unable to connect to PACE resources, or may lose their connection, during an interruption lasting 30 minutes or less that may begin at any point during the maintenance window.  We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this time frame, to avoid sudden interruptions due to a loss of connection to PACE resources.  Batch jobs that are running and queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or the Internet will be subject to interruptions during these maintenance activities.  These maintenance activities will not affect any of the PACE storage systems.

What PACE will do:  PACE will remain on standby during these activities to monitor the systems, conduct testing and report on any interruptions in service.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

August 21, 2020

[Resolved/Monitoring] Brief Network/InfiniBand Interruptions

Filed under: Uncategorized — Semir Sarajlic @ 4:58 pm

Dear Researchers,

As we continue to monitor our network closely after the recent network/InfiniBand issues, we want to alert you to a brief network glitch this afternoon that impacted the connection between GPFS and the compute nodes, as well as node-to-node communication.

What happened and what we did: At 12:45pm, we started to experience issues with the connection between our two main InfiniBand switches, to which GPFS and the compute nodes connect.  We observed various errors that we were able to diagnose quickly, and by 1:55pm we had resolved the issues after rebooting one of the main switches.

Who is impacted: During this brief network glitch, users may have experienced slow read/write and/or errors on GPFS directories from the compute nodes.  This may have impacted running MPI jobs.  We encourage users to check on their running jobs from earlier this afternoon, and resubmit any jobs that may have been interrupted.

What we will continue to do: We will continue to monitor the network and report as needed.    We appreciate your continued understanding and patience during these recent network interruptions. Please rest assured that we are doing everything we can to keep this network fabric operational.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

August 18, 2020

[Resolved/Monitoring] GPFS – network issues

Filed under: Uncategorized — Semir Sarajlic @ 6:41 pm
We began experiencing a network issue earlier today at approximately 2:00AM with the connection between our GPFS filesystem (data and scratch directories) and about one third of PACE’s compute nodes in the Rich datacenter. Affected nodes are on these racks, indicated by the second section of the node name (e.g., rich133-s40-20 or iw-s40-21 would be on rack s40):

b13, b14, b16, b17, c32, c34, c36, c38, g13, g14, g15, g16, g17, h31, h33, k35.

As a result of this network issue, users may have experienced slow read/write on GPFS directories from these nodes that may also have impacted MPI running jobs on these nodes. We finished making a repair late this afternoon, but the slowness could return, and we are continuing to monitor the system. Thank you to users who have been reporting the issue today via support tickets. Please continue to contact us if the slowness returns.
If your jobs have been running on the impacted nodes and are not producing output, please cancel and resubmit them.  To check what nodes your job is running on, please run the following command: qstat -u USER_NAME -n, replacing USER_NAME with your username.
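For example, assuming a hypothetical username gburdell3, the node list in the qstat output shows which racks your job is using (the node names below are illustrative):

qstat -u gburdell3 -n
# The node list in the output may look like: rich133-s40-20/0+rich133-s40-21/3
# The middle section of each node name (here, s40) is the rack; compare it against the affected racks listed above.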
If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

August 17, 2020

Georgia Power testing in Coda

Filed under: Uncategorized — Michael Weiner @ 10:28 am

Georgia Power will be conducting additional tests of the MicroGrid powering the Coda datacenter (Hive and testflight-coda) this week. Unlike the last round, this new set of tests is expected to pose a low risk of power interruption to compute nodes.

August 14, 2020

[Reopened] Network (Infiniband Subnet Manager) Issues in Rich

Filed under: Uncategorized — Michael Weiner @ 7:00 pm

[ Update 8/14/20 7:00 PM ]

After an additional nearly-48-hour outage in the Rich datacenter due to network/InfiniBand issues, we have brought PACE resources on the affected systems back up and released user jobs. We thank you for your patience and understanding during this unprecedented outage, as we understand the significant impact it has continued to have on your research throughout this week. Please note that PACE clusters in the Coda datacenter (Hive and testflight-coda) and CUI clusters in Rich have not been impacted.

While new jobs have not begun over the past two days, already-running jobs have continued. Please check the output of any jobs that are still running. If they are failing or not producing output, please cancel them and resubmit to run again. Some running user jobs were killed in the process of repairing the network, and those should also be resubmitted to the queue.
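As a minimal sketch of the cancel-and-resubmit workflow, assuming the Torque/PBS-style commands used with PACE schedulers (the username, job ID, and script name below are hypothetical):

qstat -u gburdell3 -n   # list your jobs and the nodes they are running on
qdel 1234567            # cancel a job that is failing or not producing output
qsub my_job.pbs         # resubmit the corresponding job script to the queue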

In addition to previously reported repairs, we removed a problematic spine module from a network switch this morning and further adjusted connections. This module appeared to be causing intermittent failures when under heavy load.

Currently, our network is running at reduced capacity. We have ordered a replacement spine module for the part that was removed. We have conducted extensive stress testing of the network and storage today, going far beyond the tests conducted earlier in the week, and the results indicate the system is healthy. We will continue to monitor the systems for any further network abnormalities.

Again, thank you for your patience and understanding this week while we addressed one of the most significant outages in the history of PACE.

Please contact us at pace-support@oit.gatech.edu with any questions or if you observe unexpected behavior on the cluster.

[ Update 8/13/20 8:30 PM ]

We continue to work on the network issues impacting the Rich datacenter.  We have partitioned the network and adjusted connections in an effort to isolate the problem. As mentioned this morning, we have ordered parts to address potentially problematic switches as we continue systematic troubleshooting. We continue to run tests on InfiniBand, and we are running an overnight stress test on the network to monitor for recurrence of errors. The schedulers remain paused to prevent further jobs from being launched on the cluster. We will follow up tomorrow with an update on the Rich cluster network.

Thank you for your continued patience and understanding during this outage.

[ Update 8/13/20 10:10 AM ]

Unfortunately, after the nearly-80-hour outage earlier this week, we must report another network outage.  We apologize for this inconvenience, as we understand its impact on your research. The network/InfiniBand issues in the Rich datacenter began recurring late yesterday evening, and we are aware of them. We are currently working to resolve them, and we have ordered replacements for the parts of the network switches that appear problematic. The issue was not detected by our deterministic testing methods and occurred only after restarting user production jobs caused very heavy network utilization. We will provide further updates once more information is available.  As before, you may experience slowness in accessing storage (home, project, and/or scratch) and/or issues with communication within MPI jobs.
We have paused all the schedulers for clusters in the Rich datacenter that are accessed via the following headnodes/login nodes: login-s, login-d, login7-d, novazohar, gryphon, and testflight-login. This pause prevents additional jobs from starting, but already-running jobs have not been stopped. However, there is a chance they will be killed as we continue working to resolve the network issues.
Please note that this network issue does not impact the Coda datacenter (Hive and testflight-coda) or CUI clusters in the Rich datacenter.
Thank you for your continued patience as we continue to work to resolve this issue.
Please contact us with any questions or concerns at pace-support@oit.gatech.edu.

[ Update 8/12/20 6:20 PM ]

After a nearly 80-hour outage in the Rich datacenter due to network/InfiniBand issues, we have been able to bring up the PACE compute nodes in Rich, and user jobs have begun to run again. We thank you for your patience during this period, and we understand the significant impact of this outage on your research this week.
For any user jobs that were killed due to restarts yesterday, please resubmit the jobs to the queue at this time. Please check the output of any recent jobs and resubmit any that did not succeed.
As noted yesterday evening, we have carefully brought nodes back into production in small groups to identify issues, and we have turned off nodes that we identified as having network difficulties. Our findings point to multiple hardware problems that caused InfiniBand connectivity problems between nodes. We addressed these issues, and we are no longer observing the errors after our extensive testing. We will continue to monitor the systems, but please contact us immediately at pace-support@oit.gatech.edu if you notice your job running slowly or failing to produce output.
Please note that we will continue to work on problematic nodes that are currently offline in order to restore compute access to all PACE users, and we will contact affected users as needed.
Again, thank you for your patience and understanding this week while we addressed one of the most impactful outages in the history of PACE.
Please contact us at pace-support@oit.gatech.edu with any questions.

[ Update 8/12/20 12:30 AM ]

We continue to work to bring PACE nodes back into production. After turning off all the compute nodes and reseating the faulty network connections we identified, we have been slowly bringing nodes back up to avoid overwhelming the network fabric, which has remained clean so far.  We are carefully testing each group to ensure full functionality, and we continue to identify problematic nodes and repair them where possible. At this time, the schedulers remain paused while we turn on and test nodes. We will provide additional updates as more progress is made.

[ Update 8/11/20 5:15 PM]

We continue to troubleshoot the network issues in the Rich datacenter. Unfortunately, our efforts to avoid disturbing running jobs have complicated the troubleshooting, which has not led to a resolution. At this time, we need to begin systematic rebooting of many nodes, which will kill some running user jobs. We will contact users with current running jobs directly to alert you to the effect on your jobs.

Our troubleshooting today has included reseating multiple spine modules in the main datacenter switch, adjusting uplinks between the two main switches to isolate problems, and rebooting switches and some nodes already.

We will continue to provide updates as more information becomes available. Thank you for your patience during this outage.

[ Update 8/10/20 11:35 PM ]

We have made several changes to create a more stable Infiniband network, including deploying an updated subnet manager, bypassing bad switch links, and repairing GPFS filesystem errors. However, we have not yet been able to uncover all issues the network is facing, so affected schedulers remain paused for now, to ensure that new jobs do not begin when they cannot produce results.

We will provide an update on Tuesday as more information becomes available. We greatly appreciate your patience as we continue to troubleshoot.

[ Update 8/10/20 6:20 PM ]

We are continuing to troubleshoot network issues in Rich. At this time, we are working to deploy an older backup subnet manager, and we will test the network again to determine if communication has been restored after that step.

The schedulers on the affected clusters remain paused, to ensure that new jobs do not begin when they cannot produce results.

We recognize that this outage has a significant impact on your research, and we are working to restore functionality in Rich as soon as possible. We will provide an update when more information becomes available.

[ Update 8/9/20 11:55 PM]

We have restarted PACE’s Subnet Manager in Rich, but some network slowness remains. We are continuing to troubleshoot the problem. At this time, we plan to leave the Rich schedulers paused overnight in order to ensure that the issue is fully resolved before additional jobs begin, so that they will be able to run successfully.
We will provide further updates on Monday.

[ Original Post]

At approximately noon today, we began experiencing issues with our primary InfiniBand Subnet Manager in the Rich datacenter.  PACE is investigating this issue.  We will provide an update when additional information or a resolution is available.  At this time, you may experience slowness in accessing storage (home, project, or scratch) or issues with communication within MPI jobs.

In order to minimize impact to jobs, we have paused all schedulers on the affected clusters (accessed via login-s, login-d, login7-d, novazohar, gryphon, and testflight-login headnodes). This will prevent additional jobs from starting, but jobs that are already running will not be stopped, although they may fail to produce results due to the network issues.

This issue does not impact the Coda data center (Hive & testflight-coda clusters) or CUI clusters in the Rich data center.

Please contact us with any questions or concerns at pace-support@oit.gatech.edu.

August 10, 2020

[Resolved] [testflight-coda] Lustre scratch outage

Filed under: Uncategorized — Michael Weiner @ 9:16 pm

[ Update 8/11/20 10:15 AM]

Lustre scratch has been repaired. We identified a broken ethernet port on a switch and moved to another port, restoring access.

[ Original Post ]

There is an outage affecting our Lustre scratch, which is currently used only in testflight-coda. We are working with the vendor to restore the system. Storage on all PACE production systems is unaffected.

You may continue your testing in testflight-coda to prepare for your Coda migration by using Lustre project storage, accessed via the “data” symbolic link in your testflight-coda home directory.
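As a minimal sketch of reaching that storage from a testflight-coda session (only the “data” symbolic link itself comes from the note above; the commands are illustrative):

ls -l ~/data     # "data" in your home directory is a symbolic link to your Lustre project storage
cd ~/data        # stage files and run your tests from here instead of the unavailable Lustre scratch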

We will provide an update when the Lustre scratch system is restored. Please contact us at pace-support@oit.gatech.edu with questions.
