PACE: A Partnership for an Advanced Computing Environment

February 28, 2022

Battery replacement on Phoenix project & scratch storage will impact performance on Thursday, March 3

Filed under: Uncategorized — Michael Weiner @ 11:13 am

Summary: Battery replacement on Phoenix project & scratch storage will impact performance on Thursday, March 3.

What’s happening and what are we doing: Power supply units on the Phoenix Lustre storage device, which hosts project and scratch storage, need to be replaced. During the replacement, which will begin at approximately 10 AM on Thursday, March 3, storage will shift to write-through mode: writes will be committed directly to disk rather than held in battery-backed cache, which protects data during the swap but reduces performance. Once the UPS batteries in the new units are sufficiently charged, performance will return to normal.
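For those curious about the performance impact, the toy sketch below (a Python illustration, not PACE or Lustre code) contrasts write-through caching, where every write waits on the slow backing disk, with battery-backed write-back caching, where writes land in fast cache and are flushed to disk later.

```python
# Toy illustration (not PACE/Lustre code) of why write-through mode is slower:
# every write pays backing-disk latency, instead of landing in a fast cache
# that is flushed later under battery protection.
import time

class BackingStore:
    def write(self, block):
        time.sleep(0.001)  # simulate slow disk latency per write

class Cache:
    def __init__(self, store, write_through):
        self.store, self.write_through, self.dirty = store, write_through, []

    def write(self, block):
        if self.write_through:
            self.store.write(block)   # safe without battery power, but slow
        else:
            self.dirty.append(block)  # fast, but needs battery-backed protection

    def flush(self):                  # write-back flushes dirty blocks later
        while self.dirty:
            self.store.write(self.dirty.pop())

for mode in (True, False):
    cache = Cache(BackingStore(), write_through=mode)
    start = time.perf_counter()
    for block in range(100):
        cache.write(block)
    print(f"write_through={mode}: {time.perf_counter() - start:.3f}s for 100 writes")
```

Without working batteries, the controller cannot guarantee that cached writes would survive a power loss, so it falls back to the slower but safer mode until the new batteries are charged.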

How does this impact me: Phoenix project and scratch performance will be reduced until the fresh batteries have sufficiently charged, which should take several hours. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it.

What we will continue to do: PACE will monitor Phoenix storage throughout this procedure.

Thank you for your patience as we complete this replacement. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

February 17, 2022

[Resolved] Phoenix Lustre project & scratch storage degraded performance

Filed under: Uncategorized — Michael Weiner @ 10:30 am

[Update 2/23/2022 3:00 PM]

Phoenix Lustre project and scratch storage performance has recovered, and you may resume submitting jobs as normal. Thank you for your patience as we investigated the root cause.

This issue was caused by certain jobs performing heavy reads and writes on network storage. We thank the researchers we contacted for their cooperation in adjusting their jobs.

If your workflow requires extensive access to large files from multiple nodes, please contact us, and we will be happy to work with you on a workflow that may speed up your research while ensuring network stability. PACE will also continue to improve our systems and monitoring.

If your work requires generating temporary files during a run, especially if they are large and/or numerous, you may benefit from using local disk on Phoenix compute nodes. Writing intermediate files to local storage avoids network latency and can speed up your calculations while lessening load on the system. Most Phoenix nodes have at least 1 TB of local NVMe storage available, while our SAS nodes have at least 7 TB of local storage. At the end of your job, you can transfer only the relevant output files to network storage (project or scratch).
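As a minimal sketch of that pattern, the Python example below assumes the scheduler exports TMPDIR pointing at node-local disk and uses a hypothetical project-storage path; adjust both for your own jobs.

```python
# Minimal sketch: stage large intermediate files on node-local disk, then
# copy only the final outputs to network (project or scratch) storage.
# Assumptions: TMPDIR points at local NVMe/SAS disk (falling back to /tmp),
# and ~/data/my_project is a hypothetical project-storage path.
import os
import shutil
import tempfile

local_root = os.environ.get("TMPDIR", "/tmp")
project_dir = os.path.expanduser("~/data/my_project")  # hypothetical

with tempfile.TemporaryDirectory(dir=local_root) as workdir:
    # ... run the computation, writing intermediates into workdir ...
    result = os.path.join(workdir, "result.dat")
    with open(result, "w") as f:
        f.write("final output\n")  # placeholder for the real result

    os.makedirs(project_dir, exist_ok=True)
    shutil.copy2(result, project_dir)  # only the output crosses the network
# workdir and all intermediate files are deleted automatically on exit
```

The same pattern works in any language or batch script: compute against local disk, then move just the results you need to keep to project or scratch.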

We apologize for this inconvenience, and if you have any questions or concerns, please do not hesitate to reach out to us at pace-support@oit.gatech.edu.

[Update 2/22/2022 5:28pm]

We are following up with an update on the Phoenix storage (project and scratch) slowness that has persisted since our previous report. We have engaged our storage and fabric vendors as we work to address this issue. Based on our current assessment, we have identified potentially problematic server racks, which have been taken offline. The scheduler remains online, but Phoenix is operating at reduced capacity, and we ask users to refrain from submitting new jobs unless they are urgent. We will continue to provide daily updates as we work to resolve this issue.

Please accept our sincere apologies for this inconvenience, and if you have any questions or concerns, please do not hesitate to reach out to us at pace-support@oit.gatech.edu.


[Update 2/17/22 5:15 PM]

The PACE team and our storage vendor continue actively working to restore Lustre’s performance. We will provide updates as additional information becomes available. Please contact us at pace-support@oit.gatech.edu if you have any questions.

[Original Post 2/17/22 10:30 AM]

Summary: Phoenix Lustre project & scratch storage degraded performance

What’s happening and what are we doing: Phoenix project and scratch storage have been performing more slowly than normal since late yesterday afternoon. We have determined that the Phoenix Lustre device, hosting project and scratch storage, is experiencing errors and are working with our storage support vendor to restore performance.

How does this impact me: Researchers may experience slow performance using Phoenix project and scratch storage. This may include slowness in listing files in directories, reading files, or running jobs on Lustre storage. Home directories should not be impacted.

What we will continue to do: PACE is actively working, in coordination with our support vendor, to restore Lustre to full performance. We will update you as more information becomes available.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

February 10, 2022

[Complete PACE Maintenance Period – February 9 – 11, 2022] PACE Clusters Ready for Research!

Filed under: Uncategorized — Semir Sarajlic @ 2:38 pm

Dear PACE Users,

All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all user jobs that were held by the scheduler.

Due to complications with the RHEL 7.9 upgrade, 36% of Phoenix compute nodes remain under maintenance. We will work to return the cluster to full strength in the coming days. All node classes and queues have nodes available, and all storage is accessible.

Researchers who did not complete workflow testing in our Testflight environments on Phoenix and Hive, and Firebird users for whom a testing environment was not available, may experience errors related to the upgrade (see blog post). Please submit a support ticket to pace-support@oit.gatech.edu if you encounter any issues.

Our next maintenance period is tentatively scheduled to begin at 6:00 AM on Wednesday, May 11, 2022, and conclude by 11:59 PM on Friday, May 13, 2022. Additional maintenance periods are tentatively scheduled for August 10-12 and November 2-4.

The following tasks were part of this maintenance period:

ITEMS REQUIRING USER ACTION:

  • [Complete on most nodes][System] Upgrade the Phoenix, Hive, and Firebird clusters’ operating systems to RHEL 7.9

ITEMS NOT REQUIRING USER ACTION:

  • [Deferred][Datacenter] Databank to repair/replace the DCR, requiring that all PACE compute nodes be powered off
  • [Complete][Storage/Hive] Upgrade GPFS controller firmware
  • [Complete][Storage/Phoenix] Reintegrate storage previously borrowed for scratch into project storage
  • [Complete][Storage/Phoenix] Replace redundant storage controller and cables
  • [Complete][System] System configuration management updates
  • [Complete][Network] Upgrade IB switch firmware

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team


February 3, 2022

[Resolved] Coda Datacenter Cooling Issue

Filed under: Uncategorized — Aaron Jezghani @ 11:35 pm

[Update – 02/04/2022 10:24AM]

Dear PACE Researchers,

We are following up to inform you that all PACE clusters have resumed normal operations and are accepting new user jobs. After the cooling loop was restored last night, the datacenter’s operating temperatures returned to normal and have remained stable.

As previously mentioned, this outage should not have impacted any running jobs, as PACE powered off only idle compute nodes, so no user action is required. Thank you for your patience as we worked through this emergency outage in coordination with Databank. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team


[Original Post]

Dear PACE Researchers,

Due to a cooling issue in the Coda datacenter, we were asked to power off as many nodes as possible to control the temperature in the research hall. At this time, Databank has restored the cooling loop, and temperatures have stabilized. However, all PACE job schedulers will remain paused to help expedite the return to normal operating temperatures in the datacenter.

These events should have had no impact on running jobs, so no action is required at this time. We expect normal operation to resume in the morning. As always, if you have any questions, please contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team
