PACE A Partnership for an Advanced Computing Environment

July 20, 2020

[Resolved] Georgia Power Micro Grid Testing (continued)

Filed under: Uncategorized — Michael Weiner @ 1:03 pm

[Update 7/22/20 1:00 PM]

Hive and testflight-coda systems were restored early this morning. Systems have returned to normal operation, and user jobs are running. If you were notified of a lost job, please resubmit it at this time.

Georgia Power does not plan to conduct any tests today. No additional information about the cause of yesterday’s outage is available at this time.

[Update 7/21/20 11:00 PM]

The power outage in Coda has been bypassed, and power is returning to the Coda research hall. However, because the cooling plant has been offline for so long, it will require about 2 hours to restart and stabilize before we can resume full operation. Due to the late hour, we will begin bringing systems back online in the morning and provide another update when we’re back to normal operation. Georgia Power will be researching the root cause of this outage in the morning, and we will share details if available.

[Update 7/21/20 3:15 PM]

Unfortunately, the planned testing of the Georgia Power Micro Grid this week has led to a loss of power in the Coda research hall, home to compute nodes for Hive & testflight-coda. Any running jobs on those clusters will have failed at this time. Access to login nodes and storage, housed in the Coda enterprise hall, is uninterrupted.

We are sorry for what we know is a significant interruption to your work.

We will follow up with users who had jobs running at the time of the power outage to provide more specific information.

At this time, teams are working to restore power to the system. We will provide an update when available.

 

[Update 7/14/20 4:00 PM]

Georgia Power will be conducting additional bypass tests for the Micro Grid power generation facility for the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests carry a slightly higher risk of disruption than the tests conducted in June, but the risk has been substantially lowered by additional testing last month.

As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.

Please contact us at pace-support@oit.gatech.edu with any questions.

Visit https://blog.pace.gatech.edu/?p=6778 for full details on this power testing.
