PACE: A Partnership for an Advanced Computing Environment

December 10, 2024

 Message about Storage Performance, Reliability, and Future Plans for Phoenix 

Filed under: Uncategorized — Eric Coulter @ 4:45 pm

Executive summary 

PACE recognizes that the increasing frequency of performance issues on the storage system is causing disruptions to your research on the Phoenix cluster. We are striving to mitigate the impact of these events while taking proactive measures to improve the reliability of our systems for the future. To this end, we are introducing new storage technology, and prioritizing migration of data from the existing project storage to the new system over the next year. We are currently working towards finalizing a seamless migration plan. Once the plan is ready, around late spring, we will follow up with detailed information regarding the timeline and any potential workflow impacts. Our goal is to minimize disruption and ensure that everyone is aligned on key milestones. There will be no changes to the unit price until the new system is fully implemented, data migration is complete, the existing system is reconfigured, and we have sufficient data usage metrics to determine any necessary price adjustments. We estimate this will be no earlier than the end of 2025. We believe this to be the fastest path towards a stable and effective storage solution that can cater to the varied storage needs of our user community. Please find more details below. 

The Phoenix cluster at PACE hosts two major storage systems: scratch and project. Scratch is the temporary file system for files used during job execution; it is cleaned up once a month by marking files older than 60 days for deletion. Project is the long-term file system; it holds 2.3 PB of data and about 1.8 billion files. Besides environmental factors such as chilled water outages, these two storage systems have been significant contributors to downtime and degraded performance on the Phoenix cluster. Based on our analysis, storage failure or severe degradation accounts for 47% of unplanned downtime so far in calendar year 2024. Addressing this concern has been our primary focus over the past six months. This message shares our progress and our plans for the next 12 months.
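For reference, the sketch below shows the kind of age check the monthly cleanup criterion implies. It is purely illustrative: the scratch path is a placeholder, and the assumption that the 60-day age is measured from a file's modification time is ours, not a statement of the actual cleanup policy.

```python
import os
import time

SCRATCH_DIR = os.path.expanduser("~/scratch")   # placeholder path; adjust to your scratch location
CUTOFF = time.time() - 60 * 24 * 3600           # 60 days ago, in seconds since the epoch

old_files = []
for root, _dirs, files in os.walk(SCRATCH_DIR):
    for name in files:
        path = os.path.join(root, name)
        try:
            # Assumption: age is judged by modification time (mtime).
            if os.path.getmtime(path) < CUTOFF:
                old_files.append(path)
        except OSError:
            pass  # a file may disappear while we walk the tree

print(f"{len(old_files)} files are older than 60 days and would be cleanup candidates")
```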

Scratch Space 

The scratch space is supported by a DDN 400NVX2 unit, which includes a mix of flash drives (NVMe) and spinning disks. The system was offlined during the August maintenance period for a major software upgrade performed by the vendor to improve its stability. Furthermore, software issues and instability on this unit have required us to turn off the use of flash drives for hot pools, which has impacted performance. To address this, a second software upgrade is planned for January 2025 during our scheduled maintenance period, at which point we will configure the flash space to use Progressive File Layout (PFL), keeping small files in flash and progressively increasing the number of stripes across all devices for improved performance.
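To illustrate how a PFL layout behaves, the sketch below models a hypothetical layout: small files live entirely in a single-stripe flash component, while larger files pick up additional, wider-striped components on spinning disk. The extent boundaries, pool names, and stripe counts are invented for illustration; the actual layout PACE deploys (via Lustre's `lfs setstripe -E ...`) may differ.

```python
# Hypothetical PFL layout: (component end offset in bytes, pool, stripe count).
# These values are illustrative only, not the layout PACE will configure.
LAYOUT = [
    (64 * 1024**2, "flash", 1),       # first 64 MiB: 1 stripe on the flash pool
    (1 * 1024**3, "spinning", 4),     # 64 MiB to 1 GiB: 4 stripes on spinning disk
    (float("inf"), "spinning", 16),   # beyond 1 GiB: 16 stripes on spinning disk
]

def components_used(file_size: int):
    """Return the (pool, stripe_count) components a file of this size would occupy."""
    used, start = [], 0
    for end, pool, stripes in LAYOUT:
        if file_size > start:          # the file extends into this component
            used.append((pool, stripes))
        start = end
    return used

print(components_used(10 * 1024**2))   # small file  -> [('flash', 1)]
print(components_used(5 * 1024**3))    # large file  -> all three components
```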

Project Space 

Project space is supported by a DDN18K system purchased in 2020. This unit primarily uses spinning disks and is nearing its end of life. Over the past two years, we have experienced an increasing number of issues related to software defects and disk failures. While the system has built-in redundancies to tolerate multiple disk failures, the disk rebuild process, coupled with a growing number of disk-intensive research jobs on Phoenix, is negatively impacting performance.

We have adopted a two-pronged strategy to address the project storage:  

  1. Perform software and limited hardware upgrades. During the August maintenance period, the hardware subsystem supporting the metadata functionality of the Lustre filesystem was replaced with a new dedicated unit. Due to its complexity, this operation required an additional day of work. The objectives of this upgrade were to a) improve the performance of the metadata functionality and b) provide an upgrade path for future software releases. During the January 2025 maintenance period, the vendor will perform a major software upgrade to standardize on Lustre 12.4 across all the storage appliances. This will increase the stability and performance of the project storage while simplifying its management.  
  2. Consult with other research computing sites (including the Texas Advanced Computing Center) to invest in a new storage system to replace or complement our DDN18K unit. 

Based on feedback from other research computing sites, we made an initial investment in an all-flash storage system from VAST Data. Because it is an all-flash system with a disaggregated architecture, we expect significantly better uptime and performance from this new system. In particular, the VAST system does not require downtime to perform major code upgrades. The vendor has installed the system, and we are in the process of bringing the unit into production to host data, initially in support of improvements to the DDN18K.  

Our plan over the next 12 months includes: 

  1. Migrate all data from the current DDN18K project space to the new VAST storage space to support these improvements. Due to the storage outages and performance issues affecting our user community, we want to clarify the following during the migration process over the next 12 months, or as long as necessary to complete the DDN18K improvements: 
  • All storage credit balances, including the free tier, will remain equivalent to those on the existing system. 
  • The unit rate of $5.67 per TB per month for the paid tier will remain the same for both the current project space and the new VAST Storage system. 
  2. Work with our vendor to reconfigure the DDN18K appliance to efficiently use all available SSD/NVMe drive space and recreate storage pools to remove performance bottlenecks during disk failure. This will require a complete reformatting of the storage space. 
  3. Gather metrics on storage efficiencies (e.g., capacity reduction after data de-duplication and compression) and operational efficiencies so we can more accurately calculate the rate for VAST Storage (see the illustrative sketch after this list). 
  4. Leverage the VAST Data analytics to help users archive older data to lower-cost storage options such as CEDAR.  
  5. Publish a comparative analysis of functionality, performance, and resiliency across the different storage services provided by PACE to help users decide which storage service(s) to use based on their type of research and data. 
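To make item 3 above concrete, here is a small, purely illustrative calculation of how data-reduction metrics could feed into a rate: the $5.67 figure is today's paid-tier rate, but the reduction ratio below is an invented placeholder, not a measured VAST number.

```python
# Illustrative only: how dedup/compression metrics relate a logical footprint
# to physical capacity and to today's rate. The reduction ratio is hypothetical.
CURRENT_RATE = 5.67        # $ per TB per month (current DDN18K paid tier)
logical_tb = 100           # data as users see it (example value)
reduction_ratio = 1.5      # hypothetical dedup + compression factor on VAST

physical_tb = logical_tb / reduction_ratio
monthly_charge = logical_tb * CURRENT_RATE

print(f"{logical_tb} TB logical would occupy about {physical_tb:.1f} TB physical")
print(f"Charge at today's rate: ${monthly_charge:,.2f} per month")
# The eventual VAST rate will be set only after real efficiency metrics are gathered.
```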

In the long term, we expect the new VAST Data storage system to be offered as a separate service with its own storage rate. This rate will likely be higher than the current $5.67 per TB per month for the DDN18K system. However, we cannot determine the final rate until we better understand how the new system manages our data and what efficiencies it achieves. 

Storage credits that have already been purchased, or are purchased during this transition, will retain their value on the DDN18K system (in terabyte-months). Alternatively, they can be converted to storage credits on the VAST system at a ratio to be determined once the transition is complete. At that point, users will have the option to stay on VAST, move back to the reconfigured DDN18K, or choose a different, potentially cheaper storage option. 
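As an illustration of how terabyte-month credits work, the sketch below converts a purchase into TB-months at the current rate and then applies a conversion ratio to VAST. The purchase amount and the conversion ratio are placeholders; the actual ratio has not been determined.

```python
RATE_DDN = 5.67                       # $ per TB-month on the DDN18K system
purchase_dollars = 567.00             # hypothetical purchase amount

credit_tb_months = purchase_dollars / RATE_DDN
print(f"${purchase_dollars:.2f} buys {credit_tb_months:.0f} TB-months on the DDN18K")

# The DDN-to-VAST conversion ratio is still to be determined; this value is a
# placeholder used only to show the shape of the conversion.
hypothetical_ratio = 0.8              # VAST TB-months per DDN TB-month (invented)
print(f"That would convert to {credit_tb_months * hypothetical_ratio:.0f} TB-months on VAST")
```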

Our goal is to enhance the reliability of the storage system in PACE while introducing new technologies to meet the diverse needs and budgets of the Georgia Tech research community. We aim to develop migration strategies that minimize disruptions to your workflows and offer more storage options tailored to your requirements. To this end, we are weighing bulk versus individual group migrations, and we will engage with research groups as necessary. We are committed to providing regular updates during this process. 

Thank you for your understanding and support during this project. If you have any questions or concerns, please feel free to reach out to us at pace-support@oit.gatech.edu. 

October 31, 2024

Firebird ASDL Outage

Filed under: Uncategorized — Grigori Yourganov @ 4:58 pm

On Oct 30, 2024, at 9:20 PM, there was a drive failure on the Firebird ASDL servers (on the ZFS pool dedicated to the ASDL project). The ASDL login nodes were offlined. Several jobs failed, and no new jobs were accepted since 10:09 AM on Oct 31. The NFS server was restarted and tested, and the ASDL nodes were back online at 12:38 PM on Oct 31.

New GPUs for Phoenix, V100s being Replaced 

Filed under: Uncategorized — Michael Weiner @ 9:53 am

[Additional Message 11/7/24]

As we prepare to remove 12 of the V100 servers from Phoenix next week ahead of the arrival of new GPU nodes in December, we would like to inform you of another set of new GPUs available on the cluster through the embers backfill QOS.

There are 8 nodes, each with 8 L40S GPUs (64 GPUs in total), that have been available exclusively on embers (due to the ownership of this equipment) since late September in the Phoenix RHEL9 environment.

Visit our Phoenix Slurm guide on GPU requests to learn how to request them. Be sure to include a request for the embers QOS when requesting L40S architecture, at least until the additional L40S nodes for general use become available in December on inferno. You must make the request from the RHEL9 environment. Access via Phoenix OnDemand is not yet available.

Please contact pace-support@oit.gatech.edu with any questions.

[Original Post 10/31/24]

We’re happy to announce that there will be 6 new H200 machines coming to Phoenix for general usage, with 8x NVIDIA H200 GPUs each, along with 2x L40S machines, each with 8x NVIDIA L40S GPUs. These will be available on the RHEL9 operating system on Phoenix, which is required to support the new hardware. 

12 of the existing V100 servers will be REMOVED from the Phoenix RHEL7 environment to make room for the new L40S hardware, as they have reached end-of-life for vendor support. The overall impact will be a large increase in both the number and power of GPUs available on Phoenix: 24 V100 GPUs will be replaced with 16 L40S and 48 H200 GPUs. 
 
This change will begin on Nov. 11th, when the V100 machines will be removed and we will begin installing the new servers, which we hope to release by December 6th.
 
The new machines will be available via both the Inferno QoS and Embers on RHEL9. Jobs using the new H200 machines will be charged at a rate of $0.673 per GPU Hour ($1.4571 for GTRI), matching the current H100 rate. The rate for the new L40S GPUs will be shared prior to their release, as we’re working through approvals. 
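For a rough sense of what the announced H200 rate means in practice, the sketch below estimates the charge for a hypothetical job; the job size is made up, and only the per-GPU-hour rates come from this post.

```python
RATE_H200 = 0.673         # $ per GPU-hour, campus rate from this announcement
RATE_H200_GTRI = 1.4571   # $ per GPU-hour, GTRI rate from this announcement

gpus, hours = 8, 12       # hypothetical job: one full 8-GPU node for 12 hours
gpu_hours = gpus * hours

print(f"{gpu_hours} GPU-hours at the campus rate: ${gpu_hours * RATE_H200:.2f}")
print(f"{gpu_hours} GPU-hours at the GTRI rate:   ${gpu_hours * RATE_H200_GTRI:.2f}")
```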

October 24, 2024

Phoenix Project storage Slowness

Filed under: Uncategorized — Eric Coulter @ 10:57 am

WHAT’S HAPPENING? 

Multiple hard disks failed in a single RAID pool making up the filesystem underlying Phoenix Project storage. As the arrays are being rebuilt to ensure continued resilience against disk failures, read/write performance on the device may be somewhat slower. 
 
In addition to this, as part of a mitigation for a previous storage issue on 9/30, we have temporarily re-configured our storage to rely fully on spinning disk rather than caching parts of files on solid-state drives, which will cause a general decrease in access speeds until we are able to transition back to the prior configuration.  

WHEN IS IT HAPPENING? 
The failed drives were replaced on Oct 23rd, and the pool rebuild will continue automatically. We will provide an update when the process is complete. 

WHY IS IT HAPPENING? 

Hard disk failures are a regular part of life; the devices we support are capable of weathering these without data loss. However, it is necessary to re-write striped data onto replacement disks, leading to a slight performance slowdown. In this case, 4 of the 64 disks failed in one of the several pools making up the coda1 filesystem. We have configured the system to avoid writing new files to that pool in the meantime. These particular disks were in service for over 5 years before failing.  
 
We also had to disable our use of the Lustre Progressive File Layout (PFL) option on this device, which splits files between solid-state and spinning disk to provide faster access, because the solid-state drive pool became completely full on 9/30, causing a temporary outage. We are working to migrate data from the solid-state pool to spinning disk, but this process takes time and depends on the underlying drive pools being fully rebuilt, among other things.  

WHO IS AFFECTED? 

Phoenix users may experience slower performance of Phoenix Project storage during the rebuild, and additionally until we are able to re-enable PFL. 

WHAT DO YOU NEED TO DO? 

Please bear with us and keep an eye out for updates. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 

For any questions, please contact PACE at pace-support@oit.gatech.edu.

October 23, 2024

Message concerning September-October 2024 Datacenter Outages

Filed under: Uncategorized — Eric Coulter @ 4:00 pm

Dear PACE Community,

Due to a highly unusual series of datacenter-related outages this Fall, we would like to share details about the sequence of events and causes which have impacted the availability of PACE resources, and the PACE team’s continued work to provide a stable research computing environment to the GT community. We fully understand the significant negative impact these outages have on the research community, including missed paper submissions and deadlines as well as lost research time. While this does not make up for the full impact of these outages, we always work to ensure that no paid accounts are charged for computational jobs that fail due to outages, and we have temporarily doubled free-tier account credits for October 2024 in a small effort to alleviate the pain of lost time. 

While many of the details here were communicated in the moment, a unified picture may help clear up certain misconceptions; unfortunately, prompt communication was sometimes required before a full understanding of the situation could be gained.   

Background: The CODA datacenter is the sole hosting facility for PACE resources. The datacenter is owned and operated by DataBank. PACE resources are spread across two datacenter areas: 

  • The Enterprise Hall (500kW power provisioned) which has N+1 redundant cooling, networking, and power (battery-based UPS + Generator), where PACE and OIT host critical infrastructure and storage systems. This enables us to maintain access to login nodes and storage during most system and service outages impacting the datacenter. 
  • The Research Hall (2MW), which was designed without redundant cooling and relies on a combination of flywheel UPS (<1 minute runtime) and the Georgia Power Microgrid in the case of an electrical utility outage (https://research.gatech.edu/georgia-tech-celebrates-opening-new-energy-project-midtown-atlanta). This design choice allowed for significantly more research compute capacity and performance, and greatly reduced facilities and operational costs. The design and operational model included elements to minimize single points of failure and to support faster recovery times. 

For the calendar year 2024, the following power and cooling datacenter outages have impacted PACE services: 

  • 9/3/2024: On August 27th, DataBank identified a failed chilled water flow sensor on the High-Temp Chiller loop providing cooling to the research hall. DataBank requested downtime before the next PACE maintenance period (January 2025) for an emergency replacement.  
  • 9/8/2024: On September 8th, the High-Temp Chiller system providing cooling to the research hall failed due to the failure of the condenser pump variable-frequency drive (VFD). Due to supply chain constraints, a replacement unit was not available in the on-site inventory, and a different brand/model VFD had to be sourced and installed. During the repair, DataBank identified that the VFD failure had damaged the condenser pump’s internal bearing. The condenser pump was replaced with the on-site spare. 
  • 10/1/2024: On October 1st, the data center experienced a short loss of utility power, which impacted the High-Temperature Chiller system providing cooling to the research hall. The new condenser pump variable frequency drive was unable to properly auto-reset because of a previously unknown parameter. Note: during this incident, PACE only shut off idle nodes and prevented new jobs from being launched. No running jobs were impacted. 
  • 10/2/2024: On October 2nd, at approximately 11:33am, the datacenter experienced a rapid sequence of utility power loss (8 events in less than two minutes). The Research Hall electrical load was transferred to the UPS/Flywheels for backup power. However, the load was unable to be transferred back to the microgrid as intended due to a network breaker that tripped in the electrical vault during the October 1st event. Only Georgia Power can reset this breaker. As a result, power was lost entirely to the Research Hall once the flywheels were depleted. 

What are we doing to prevent these failures in the future?  

OIT, in partnership with the GT Real Estate Office, has engaged Databank to review outages over the past few years. Specifically, we are: 

  • Evaluating the 2017-2018 datacenter design requirements for the research hall, and how these requirements align with the needs for a reliable research computing infrastructure. 
  • Reviewing, evaluating, and improving operational procedures between DataBank, Georgia Power, and Georgia Tech. 
  • Reviewing and evaluating the list of critical spare parts maintained on-site by DataBank. 
  • Engaging stakeholders to review reliability and resilience requirements for research computing. 
  • Exploring potential options to improve the cooling and power redundancy of the research hall. 
  • Analyzing the feasibility and pros and cons of hosting a small portion of the PACE computational capabilities in the high-availability enterprise side or leveraging cloud resources during outages. 

Long term, we plan to explore the use of additional datacenter locations to host research computing resources. 

Please feel free to reach out with any questions or concerns,

Didier Contis
Executive Director of Academic Technology, Innovation, Research Computing for the Office of Information Technology

October 2, 2024

Data Center Power Outage – 10/2/2024

Filed under: Uncategorized — Jeff Valdez @ 12:47 pm

[Update – 1/3/2024 – 12:43am]

The Buzzard cluster has been tested and confirmed functional, all nodes are back in service.

All PACE clusters are back in service, the impacts of the power outage have been remediated – this outage is over.

[Update – 10/3/2024 – 11:59am]

The Firebird cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Firebird are available for use. 
All Firebird nodes are back in service. 
 
Reimbursements will be provided for all paid jobs impacted by the power and cooling outages this week. 

 [Update – 10/3/2024 – 11:55am]

The Phoenix cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Phoenix are available for use. 
PACE continues to investigate 54 nodes that we were unable to power on remotely after the outage, including 19 NVIDIA V100 GPU nodes. 

Reimbursements will be provided for all paid jobs impacted by the power and cooling outages this week. We will provide the details for reimbursement of paid storage to affected users later this week.  

We are also doubling the amount of credits for ALL free-tier accounts on Phoenix for the month of October to offset the impacts of these outages. All Georgia Tech free-tier accounts (starting with gts-) will have a balance of $136 for the month of October; all GTRI free-tier accounts (starting with gtris-) will have a balance of $504.  

[Update – 10/3/2024 – 9:58am]

The Hive cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Hive are available for use.  
PACE continues to investigate 21 CPU nodes, 10 “nvme” nodes, and 4 “himem” nodes on Hive for errors and will return those to service as soon as possible.  
 
The PACE Team is continuing to test the Phoenix, Firebird, and Buzzard clusters, in that order of priority.

[Update – 10/3/2024 – 9:00am]

PACE and the OIT Datacenter teams have brought up the vast majority of machines making up the PACE clusters. Roughly 100 nodes remain in a state requiring manual intervention out of our 2,100 machines. The PACE team is working to confirm hardware readiness and beginning to carry out test procedures prior to releasing the clusters. Further updates will be provided as clusters become available for use.

The PACE team is prioritizing the Phoenix and Hive clusters, followed by Firebird and Buzzard. We hope to have the full suite of systems released by mid-afternoon.

[Update – 10/2/2024 – 5:01pm]

The ICE Cluster has been fully powered on, tested, and released for access in order to prioritize educational resources.

PACE and the OIT Datacenter teams are in the process of bringing up machines that make up the research clusters. Due to the sudden nature of the outage, the usual recovery mechanisms for rapid power-up are not available, which is considerably slowing recovery efforts compared to previous outages. The PACE and OIT Datacenter teams are continuing to check, manually reset, power on and subsequently test the hundreds of nodes that have been left in a bad state due to the nature of this power outage. Our tests have currently covered slightly over 1/5th of our 2,100 machines, and we expect to continue working to bring all machines online through the following day and will provide updates as we’re able to release clusters.

[Initial Post – 10/2/2024-12:55pm]

Dear PACE users,

A power outage (related to Georgia Power) impacted Tech Square including the CODA Datacenter. Due to a secondary failure of the UPS system, all PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) were impacted. Currently, most of the nodes on all clusters are powered off, and the schedulers on all clusters have been paused. The outage started at approximately 11:37 am this morning. At the moment, no new jobs can start, and a large number of jobs that were running when the outage started have been terminated. Access to login nodes and storage remains available due to backup power. We are actively monitoring the situation and will keep you updated on the progress of the restoration of services. 

Thank you for your patience,

– The PACE team 

October 1, 2024

DATA CENTER CHILLER FAILURE – 10/1/2024

Filed under: Uncategorized — Eric Coulter @ 9:18 am

[Update 10/1/24 02:06 PM]

Cooling was restored to the datacenter this morning and PACE has tested all nodes that were powered off. All working nodes on Phoenix and Hive have been returned to service. We continue to investigate a small number of nodes with issues, including 7 of the cpu-amd nodes on RHEL9, but otherwise all clusters are fully operational.  
 
Thank you for your patience during this partial outage!

[Update 10/1/24 09:18 AM]

Our data center hosting provider, DataBank, identified a cooling failure this morning around 8:42am. As temperatures were rising to dangerous levels, we’ve initiated a partial shutdown.  The Phoenix and Hive schedulers have been paused, and all idle compute nodes on Phoenix and Hive have been powered off. Running jobs are not currently impacted. We are continuing to monitor the situation and determine if additional measures are needed. ICE, Firebird, and Buzzard remain in production at this time. 

Access to login nodes and all storage systems remains available. Files can be accessed or retrieved via Globus, the OnDemand web interface, or the login nodes.  

We will continue to provide updates as the situation evolves, and are working closely with the vendor to restore functionality.  

For any questions, please contact PACE at pace-support@oit.gatech.edu.  

September 18, 2024

PACE Phoenix Storage Hotfix – Sept 24th, 2024

Filed under: Uncategorized — Eric Coulter @ 3:30 pm

WHAT’S HAPPENING? 

Due to a recent instance of lower performance in our Project storage system (coda1), we will be working with our storage vendor to apply updates to the underlying device on Tuesday, September 24th. This should not cause any outage, but may result in decreased performance for some operations during the patch deployment. Due to the non-zero risk of outage, we will be working hand-in-hand with the vendor during this operation, and will be monitoring performance closely. Please do let us know if you observe impact to any work during that time, and we will refund jobs accordingly.                        

WHEN IS IT HAPPENING? 
The update process will begin on Tuesday morning, Sept 24th, 2024. 
We will send an announcement when the update is complete. 

WHY IS IT HAPPENING? 

Patches to the storage devices underlying Phoenix Project storage (coda1) have been recommended by the device vendor to improve reliability and performance based on recently observed degraded performance of the metadata servers on our Lustre filesystem.  

WHO IS AFFECTED? 

Phoenix users *may* experience slower performance of Phoenix Project storage during the update, and there is a low risk of outage. 

WHAT DO YOU NEED TO DO? 

Please do let us know if you observe impact to any work using the Phoenix Project filesystem (coda1) during that time, and we will refund jobs accordingly. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 

For any questions, please contact PACE at pace-support@oit.gatech.edu.

September 8, 2024

PACE-Wide Emergency Shutdown – September 8, 2024

Filed under: Uncategorized — Grigori Yourganov @ 9:11 pm

[Update 9/11/24 2:51 PM]

Dear Hive community, 

The emergency maintenance on the Coda datacenter has been completed and the Hive cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that were held by the scheduler have been released. 

[Update 9/11/24 10:52 AM]

Dear Firebird users,

The emergency maintenance on the Coda datacenter has been completed and the Firebird cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that have been held by the scheduler have been released.

As a reminder:

RHEL7 Firebird nodes are accessible at the usual address login-<project>.pace.gatech.edu. RHEL9 Firebird nodes can be accessed via ssh at login-<project>-rh9.pace.gatech.edu for testing new software. The majority of our software stack has been rebuilt for the RHEL9 environment. We strongly encourage you to test your software on RHEL9, and please let us know if anything is missing! For more information, please see our Firebird RHEL9 documentation page.

Please take the time to test your software and workflows on the RHEL9 Firebird Environment (accessible via login-<project>-rh9.pace.gatech.edu) and let us know if anything is missing!

The next Maintenance Period will be January 13-16, 2025.

[Update 9/9/24 6:00 PM]

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. The datacenter provider, DataBank, has identified an alternate replacement part, which has been brought onsite and is in the process of being deployed and tested. At this time, we estimate that DataBank will have restored cooling to the Research Hall by close of business on Tuesday, September 10, 2024, at which point PACE will begin powering up, testing infrastructure, and bringing services back online. We plan to provide additional updates on the restoration of services by the evening of Wednesday, September 11, 2024.

Please visit https://status.gatech.edu for updates.

Access to head nodes and file systems is available.

[Update 9/9/24 9:00 AM]

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. While a time frame for resolution is currently unknown, we are actively working with the vendor, DataBank, to resolve the issue and restore service to the data center as soon as possible. We will provide updates as they are available. Please visit https://status.gatech.edu for updates. 

Access to login nodes and filesystems (via Globus, OpenOnDemand or direct connection to login nodes) is still available.

[Original Post 9/8/24]

WHAT’S HAPPENING?  

Due to an emergency with a cooling system at the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024. 

WHEN IS IT HAPPENING?  

Sunday, September 8, 2024, starting at 7:30 AM EDT.  

WHY IS IT HAPPENING?  

PACE was notified by the IOC that temperatures in the CODA building Research Hall were rising due to a failure of a water pump in the cooling system. An emergency shutdown had to be executed in order to protect equipment. The physical infrastructure provider for our datacenter is working on evaluating the situation.  

WHO IS AFFECTED?  

All PACE users. Any running jobs on ALL PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) had to be stopped at 7:30 AM. For Phoenix and Firebird, we will by default provide refunds for interrupted jobs on paid accounts only. Please let us know if this causes a significant loss of funds resulting in an inability to continue work on your free-tier Phoenix allocation!   

WHAT DO YOU NEED TO DO?  

Wait patiently; we will communicate as soon as the clusters are ready to resume work.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

For any questions, please contact PACE at pace-support@oit.gatech.edu.  

August 26, 2024

PACE-Wide Emergency Shutdown – Sept 3, 2024

Filed under: Uncategorized — Eric Coulter @ 3:36 pm

WHAT’S HAPPENING? 

It is necessary to shut down all PACE clusters next week to make repairs in the datacenter. 

The repair and cluster resumption will take up to 1 day to complete, will require shutting down all nodes in the research hall, and must be done in the next few days.  
 
This shutdown will NOT affect Globus access, login-node access, or access to any storage locations.  

WHEN IS IT HAPPENING? 

Tuesday, September 3rd, 2024, starting at 4 PM EDT. Compute nodes are expected to return to availability on the afternoon of Wednesday, September 4th.  

WHY IS IT HAPPENING? 

DataBank, the physical infrastructure provider for our datacenter, detected an issue over the weekend in which multiple cooling doors reported high-temperature alerts. They traced the issue to a high-temp chiller sensor, which was temporarily bypassed to silence the repeated alerts and needs to be replaced to avoid additional issues.  

This outage is necessary to prevent widespread catastrophic failure of the servers in the research hall.  

WHO IS AFFECTED? 

All PACE users. Any running jobs on ALL PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) will be stopped at 4 PM on the afternoon of September 3rd, 2024. For Phoenix and Firebird, we will by default provide refunds for interrupted jobs on paid accounts only. Please let us know if this causes a significant loss of funds resulting in an inability to continue work on your free-tier Phoenix allocation!  

WHAT DO YOU NEED TO DO? 

Wait patiently; we will communicate as soon as the clusters are ready to resume work. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 

For any questions, please contact PACE at pace-support@oit.gatech.edu.
