PACE A Partnership for an Advanced Computing Environment

October 23, 2024

Message concerning September-October 2024 Datacenter Outages

Filed under: Uncategorized — Eric Coulter @ 4:00 pm

Dear PACE Community,

Due to a highly unusual series of datacenter-related outages this fall, we would like to share details about the sequence of events and causes that have impacted the availability of PACE resources, and about the PACE team’s continued work to provide a stable research computing environment to the GT community. We fully understand the significant negative impact these outages have on the research community, including the inability to submit research papers and meet deadlines, as well as the loss of research time. While this does not make up for the full impact of these outages, we always work to ensure that no paid accounts are charged for computational jobs that fail due to outages, and we have temporarily doubled free-tier account credits for October 2024 in a small effort to alleviate the pain of lost time.

While many of the details here were communicated in the moment, a unified picture may help clear up certain misconceptions; unfortunately, prompt communication was required before a full understanding of the situation could be gained.

Background: The CODA datacenter is the sole hosting facility for PACE resources. The datacenter is owned and operated by Databank. PACE resources are spread across two datacenter areas: 

  • The Enterprise Hall (500kW power provisioned) which has N+1 redundant cooling, networking, and power (battery-based UPS + Generator), where PACE and OIT host critical infrastructure and storage systems. This enables us to maintain access to login nodes and storage during most system and service outages impacting the datacenter. 
  • The Research Hall (2MW), which was designed without redundant cooling and relies on a combination of flywheel UPS (<1 minute runtime) + Georgia Power Microgrid in the case of an electrical utility outage (https://research.gatech.edu/georgia-tech-celebrates-opening-new-energy-project-midtown-atlanta). This design choice allowed for significantly more research compute capacity and performance, as well as greatly reduced facilities and operational costs. The design and operational model included elements to minimize single points of failure and to support faster recovery times.

For the calendar year 2024, the following power and cooling datacenter outages have impacted PACE services: 

  • 9/3/2024: On August 27th, Databank identified a failed chilled water flow sensor on the High-Temp Chiller loop providing cooling to the research hall. Databank requested downtime before the next PACE maintenance period (January 2025) for emergency replacement.  
  • 9/8/2024: On September 8th, the High-Temp Chiller system providing cooling to the research hall failed due to the condenser pump variable-frequency drive (VFD) failing. Due to supply chain constraints, a unit was not available as part of the on-site inventory, and a different brand/model VFD had to be sourced and installed. During the repair, Databank identified that the VFD failure had damaged the condenser pump internal bearing. The condenser pump was replaced with the on-site spare.
  • 10/1/2024: On October 1st, the data center experienced a short loss of utility power, which impacted the High-Temperature Chiller system providing cooling to the research hall. The new condenser pump variable frequency drive was unable to properly auto-reset because of a previously unknown parameter. Note: during this incident, PACE only shut off idle nodes and prevented new jobs from being launched. No running jobs were impacted. 
  • 10/2/2024: On October 2nd, at approximately 11:33am, the datacenter experienced a rapid sequence of utility power losses (8 events in less than two minutes). The Research Hall electrical load was transferred to the UPS/Flywheels for backup power. However, the load could not be transferred back to the microgrid as intended because a network breaker in the electrical vault had tripped during the October 1st event. Only Georgia Power can reset this breaker. As a result, power was lost entirely to the Research Hall once the flywheels were depleted.

What are we doing to prevent these failures in the future?  

OIT, in partnership with the GT Real Estate Office, has engaged Databank to review outages over the past few years. Specifically, we are: 

  • Evaluating the 2017-2018 datacenter design requirements for the research hall, and how these requirements align with the needs for a reliable research computing infrastructure. 
  • Reviewing, evaluating, and improving operational procedures between DataBank, Georgia Power, and Georgia Tech. 
  • Reviewing and evaluating the list of critical spare parts maintained on-site by DataBank. 
  • Engaging stakeholders to review reliability and resilience requirements for research computing. 
  • Exploring potential options to improve the cooling and power redundancy of the research hall.
  • Analyzing the feasibility and pros and cons of hosting a small portion of the PACE computational capabilities in the high-availability enterprise side or leveraging cloud resources during outages. 

Long term, we plan to explore the use of additional datacenter locations to host research computing resources. 

Please feel free to reach out with any questions or concerns,

Didier Contis
Executive Director of Academic Technology, Innovation, Research Computing for the Office of Information Technology
