PACE: A Partnership for an Advanced Computing Environment

August 26, 2023

All PACE Clusters Down Due to Cooling Failure

Filed under: Uncategorized — Michael Weiner @ 9:40 pm

[Update 8/27/23 11:10 AM]

All PACE clusters have returned to service.

The datacenter cooling pump was replaced early this morning. After powering on compute nodes and testing, PACE resumed jobs on all clusters. On clusters which charge for use (Phoenix and Firebird), jobs that were cancelled yesterday evening when compute nodes were turned off will be refunded. Please submit new jobs to resume your work.
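
For reference, resubmitting a cancelled batch job from a login node typically looks like the following (my_job.sbatch is a placeholder for your own submission script):

    # Resubmit a cancelled batch job (my_job.sbatch is a placeholder script name)
    sbatch my_job.sbatch
    # Confirm that the new job is queued or running
    squeue -u $USER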

Thank you for your patience during this emergency repair.

[Original Post 8/26/23 9:40 PM]

Summary: A pump in the Coda Datacenter cooling system overheated on Saturday evening. All PACE compute nodes across all clusters (Phoenix, Hive, Firebird, Buzzard, and ICE) have been shut down until cooling is restored, stopping all compute jobs.

Details: Databank is reporting an issue with the high-temperature condenser pump in the Research Hall of the Coda Datacenter, which hosts PACE compute nodes. The Research Hall is being powered off so that Databank facilities staff can replace the pump.

Impact: All PACE compute nodes are unavailable. Running jobs have been cancelled, and no new jobs can start. Login nodes and storage systems remain available. Compute nodes will remain off until the cooling system is repaired.

August 6, 2023

Phoenix Scratch Storage Outage

Filed under: Uncategorized — Michael Weiner @ 1:33 pm

[Update 8/7/23 9:34 PM]

Access to Phoenix scratch began experiencing issues again at 10:19 PM last night (Sunday). We paused the scheduler and restarted the controller around 6 AM this morning (Monday).

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 10:19 PM Sunday and ended this morning at 9:24 AM Monday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 8/6/23 2:25 PM]

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 9:30 PM Saturday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 8/6/23 1:30 PM]

Summary: Phoenix scratch storage is currently unavailable, which may impact access to directories on other Phoenix storage systems. The Phoenix scheduler is paused, so no new jobs can start.

Details: A storage target controller on the Phoenix scratch system became unresponsive just before midnight on Saturday evening. The Phoenix scheduler crashed shortly before 7 AM Sunday morning due to the large number of failed attempts to reach scratch directories. PACE restarted the scheduler around 1 PM today (Sunday), restoring scheduler commands, while pausing it to prevent new jobs from starting.

Impact: The network scratch filesystem on Phoenix is inaccessible. Due to the symbolic link to scratch, an ls of Phoenix home directories may also hang. Access via Globus may also time out. Individual directories on the home storage device may be reachable if an ls of the main home directory is not performed. Scheduler commands, such as squeue, were not available this morning but have now been restored. As the scheduler is paused, any new jobs submitted will not start at this time. There is no impact to project storage.
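
As a temporary workaround while scratch is down, changing directly into a known subdirectory avoids traversing the scratch symbolic link that causes an ls of the home directory to hang. A minimal sketch (the directory name is illustrative):

    # Avoid listing the home directory itself, which may hang on the scratch symlink.
    # Instead, cd straight into a known subdirectory (name below is illustrative).
    cd ~/my_project_code
    ls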

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

July 13, 2023

Phoenix Project Storage & Login Node Outage

Filed under: Uncategorized — Michael Weiner @ 10:19 am

[Update 7/18/2023 4:00 PM]

Summary: Phoenix project storage performance is degraded as redundant drives rebuild. The process may continue for several more days. Scratch storage is not impacted, so tasks may proceed more quickly if run on the scratch filesystem.

Details: During and after the storage outage last week, several redundant drives on the Phoenix project storage filesystem failed. The system is rebuilding the redundant array across additional disks, which is expected to take several more days. Researchers may wish to copy necessary files to their scratch directories or to local disk and run jobs from there for faster performance. In addition, we continue working with our storage vendor to identify the cause of last week’s outage.
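
As a rough sketch of that workflow (the paths and script name below are placeholders), a researcher might stage inputs to scratch and submit from there:

    # Copy input files from (temporarily slower) project storage to scratch
    cp -r /path/to/project/my_inputs ~/scratch/my_inputs
    # Run the job from the scratch copy
    cd ~/scratch/my_inputs
    sbatch my_job.sbatch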

Impact: Phoenix project storage performance is degraded for both read & write, which may continue for several days. Home and scratch storage are not impacted. All data on project storage is accessible.

Thank you for your patience as the process continues. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 7/13/2023 2:57 PM]

Phoenix’s head nodes, which were unresponsive earlier this morning, have been rebooted and are available again without issue. We will continue to monitor the login nodes for any other issues.

Regarding the failed redundant drives, we have replaced the control cables and reseated a few hard drives. We have run tests to confirm that the storage is functioning correctly.

You should be able to start jobs on the scheduler without issue. We will refund any job that failed after 8:00 AM due to the outage.

[Update 7/13/2023 12:20 PM]

Failed redundant drives caused an object storage target to become unreachable. We are working to replace controller cables to restore access.

[Original Post 7/13/2023 10:20 AM]

Summary: The Phoenix project storage filesystem became unresponsive this morning. Researchers may be unable to access data in their project storage. We have paused the scheduler, preventing new jobs from starting, while we investigate. Multiple Phoenix login nodes have also become unresponsive, which may have prevented logins.

Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known. We have also rebooted several Phoenix login nodes that had become unresponsive to restore ssh access.

Impact: The project storage filesystem may not be reachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. Job scheduling is now paused, so jobs can be submitted, but no new jobs will start. Jobs that were already running will continue, though those on project storage may not progress. Some login attempts this morning may have hung.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

June 7, 2023

Phoenix filesystem intermittent slowness

Filed under: Uncategorized — Michael Weiner @ 4:00 pm

Summary: The Phoenix filesystem's response has been inconsistent starting today. We are seeing high utilization on all of the head nodes.

Details: File access is intermittently slow on home storage, project storage, and scratch. Executing even simple commands such as 'ls' on a head node can be slow to respond. Slowness in file access was first reported by a couple of users around 3 PM yesterday, and we have received more reports this afternoon. The PACE team is actively working to identify the root cause and resolve the issue as soon as possible.

Impact: Users may continue to experience intermittent slowness when using the head nodes, submitting jobs, compiling code, running interactive sessions, and reading or writing files.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions. We will continue to watch performance and follow up with another status message tomorrow morning.

[Update 6/8/2023]

Phoenix home, project, and scratch storage are all fully functional. Filesystem performance has been normal for the last 12 hours. We will continue to investigate the root cause and to monitor performance.

As of now, the utilization on our servers has stabilized. The issue has not impacted any jobs running or waiting in queue. Users can resume using Phoenix as usual.

For questions, please contact PACE at pace-support@oit.gatech.edu.

April 24, 2023

PACE Maintenance Period, May 9-11, 2023

Filed under: Uncategorized — Michael Weiner @ 7:12 am

[Update 5/11/23]

The Phoenix, Hive, Firebird, and Buzzard clusters are now ready for research. We have released all jobs that were held by the scheduler. 

The ICE instructional cluster remains under maintenance until tomorrow. Summer instructors will be notified when the upgraded ICE is ready for use.

The next maintenance period for all PACE clusters is August 8, 2023, at 6:00 AM through August 10, 2023, at 11:59 PM. An additional maintenance period for 2023 is tentatively scheduled for October 24-26, 2023 (note revised date).  

Status of activities:

  • [Complete][Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions); an example interactive-job request is shown after this list.
  • [In progress][ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Complete][Phoenix Storage] Phoenix scratch will be migrated to a new Lustre device, which will result in fully independent project & scratch filesystems. Researchers will find their scratch data remains accessible at the same path via symbolic link or directly via the same mount location.
  • [Complete][Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Complete][Datacenter] High-temperature loop pump maintenance
  • [Complete][Storage] Replace cables on Hive and Phoenix parallel filesystems
  • [Complete][Network] Upgrade ethernet switch code in Enterprise Hall
  • [Complete][Network] Configure virtual pair between ethernet switches in Research Hall
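
As an illustration of the interactive-job workflow mentioned in the login-node item above, a session can be requested with Slurm's salloc (the charge account and resource values below are placeholders; see the cluster-specific instructions for exact options):

    # Request a one-hour interactive session on a compute node rather than
    # running resource-intensive work on a login node
    # (gts-mypi is a placeholder charge account; adjust resources as needed)
    salloc -A gts-mypi -N1 --ntasks-per-node=4 --mem=8G -t1:00:00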

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Update 5/2/23]

This is a reminder that the next PACE Maintenance Period starts at 6:00 AM on Tuesday, 05/09/2023, and is tentatively scheduled to conclude by 11:59 PM on Thursday, 05/11/2023.

Maintenance on the ICE instructional cluster is expected to continue through Friday, 05/12/2023.

Updated planned activities:

WHAT IS HAPPENING?  

ITEMS NOT REQUIRING USER ACTION: 

  • [Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions).
  • [ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Phoenix Storage] Phoenix scratch will be migrated to a new Lustre device, which will result in fully independent project & scratch filesystems. Researchers will find their scratch data remains accessible at the same path via symbolic link or directly via the same mount location.
  • [Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Datacenter] High-temperature loop pump maintenance
  • [Storage] Replace cables on Hive and Phoenix parallel filesystems
  • [Network] Upgrade ethernet switch code in Enterprise Hall
  • [Network] Configure virtual pair between ethernet switches in Research Hall

[Original Announcement 4/24/23]

WHEN IS IT HAPPENING?
The next PACE Maintenance Period starts at 6:00 AM on Tuesday, 05/09/2023, and is tentatively scheduled to conclude by 11:59 PM on Thursday, 05/11/2023.

Maintenance on the ICE instructional cluster is expected to continue through Friday, 05/12/2023.

WHAT DO YOU NEED TO DO?
As usual, the scheduler will hold any job whose resource request would overlap the Maintenance Period until after maintenance is complete. During the Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

WHAT IS HAPPENING?  

ITEMS NOT REQUIRING USER ACTION: 

  • [Login nodes] Implement enforcement of usage limits on the login nodes, limiting each individual to 1 CPU, 4 GB memory, and 5000 open files. These limits should reduce the possibility of one individual’s processes causing a login node outage. Researchers are reminded to use interactive jobs for resource-intensive activities via OnDemand Interactive Shell or the command line (Phoenix, Hive, and Firebird instructions).
  • [ICE] The instructional cluster will be migrated to the Slurm scheduler; new Lustre-based scratch storage will be added; and home directories will be migrated. PACE-ICE and COC-ICE will be merged. Additional information will be available for instructors on ICE.
  • [Datacenter] Connect new cooling doors to power for datacenter expansion
  • [Datacenter] High-temperature loop pump maintenance
  • [Storage] Replace Input/Output Modules on two storage devices

WHY IS IT HAPPENING? 
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. Future maintenance dates may be found on our homepage.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

April 3, 2023

Phoenix Scratch Storage & Scheduler Outages

Filed under: Uncategorized — Michael Weiner @ 2:29 pm

[Update 4/3/23 5:30 PM]

Phoenix’s scratch storage & scheduler are again fully functional.

The scratch storage system was repaired by 3 PM. We rebooted one of the storage servers, with the redundant controllers taking over the load, and brought it back online to restore responsiveness.

The scheduler outage was caused by a number of communication timeouts, later exacerbated by jobs stuck on scratch storage. After processing the backlog, the scheduler began allowing jobs to start around 4:20 PM this afternoon, and we have been monitoring it since then. At this time, due to high utilization, the Phoenix CPU nodes are nearly completely occupied.
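
For researchers waiting in the busy queue, standard Slurm commands can show current node utilization and the scheduler's estimated start times for pending jobs, for example:

    # Summary of node states per partition (how busy the cluster is)
    sinfo -s
    # Your queued jobs, with the scheduler's estimated start times
    squeue -u $USER --start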

We will refund any job that failed after 10:30 AM today due to the outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

[Original Post 4/3/23 2:30 PM]

Summary: Scratch storage is currently inaccessible on Phoenix. In addition, jobs are not able to start. The login nodes experienced high load earlier today, rendering them unresponsive; this was resolved through a reboot.

Details: Phoenix is currently experiencing multiple issues, and the PACE team is investigating. The scratch storage system is inaccessible, as the Lustre service has been timing out since approximately 11:30 AM today. The scheduler has also been failing to launch jobs since about 10:30 AM today. Finally, we experienced high load on all four Phoenix login nodes around 1:00 PM today; the login nodes were restored through a reboot. All of these issues, including any potential root cause, are being investigated by the PACE team today.

Impact: Researchers on login nodes may have been disconnected during the reboots required to restore functionality. Scratch storage is unreachable at this time. Home and project storage are not impacted, and already-running jobs on these directories should continue. Those jobs running in scratch storage may not be working. New jobs are not launching and will remain in queue.

Thank you for your patience as we investigate these issues and restore Phoenix to full functionality. For questions, please contact PACE at pace-support@oit.gatech.edu.

March 16, 2023

Phoenix project storage outage

Filed under: Uncategorized — Michael Weiner @ 2:48 pm

[Updated 2023/03/17 3:30 PM]

Phoenix project storage is again available, and we have resumed the scheduler, allowing new jobs to begin. Queued jobs will begin as resources are available.

The storage issue arose when one metadata server rebooted shortly after 1:00 PM yesterday, and the high-availability configuration automatically switched to the secondary server, which became overloaded. After extensive investigation yesterday evening and today, in collaboration with our storage vendor, we identified and stopped a specific series of jobs that was heavily taxing storage and replaced several cables, fully restoring Phoenix project storage availability.

Jobs that were running as of 1:00 PM yesterday and have failed (or will fail) due to the project storage outage will be refunded to the charge account provided. Please resubmit these failed jobs to Slurm to continue your research.

Thank you for your patience as we repaired project storage. Please contact us with any questions.

[Updated 2023/03/16, 11:55PM ET]

We are still experiencing significant slowness of the filesystem. We will keep job scheduling paused overnight, and the PACE team will resume troubleshooting as early as possible in the morning.

[Updated 2023/03/16, 6:50PM ET]

Troubleshooting continues with the vendor's assistance. The filesystem is currently stable, but one of the metadata servers continues to carry an abnormal workload. We are working to resolve this issue to avoid additional filesystem failures.

[Original post 2023/03/16, 2:48PM ET]

Summary: Phoenix project storage is currently unavailable. The scheduler is paused, preventing any additional jobs from starting until the issue is resolved.

Details: A metadata server (MDS) for the Phoenix Lustre parallel filesystem serving project storage encountered errors and rebooted. The PACE team is investigating and working to restore project storage availability.

Impact: Project storage is slow or unreachable at this time. Home and scratch storage are not impacted, and already-running jobs on these directories should continue. Those jobs running in project storage may not be working. To avoid further job failures, we have paused the scheduler, so no new jobs will start on Phoenix, regardless of the storage used.

Thank you for your patience as we investigate this issue and restore Phoenix storage to full functionality.

For questions, please contact PACE at pace-support@oit.gatech.edu.

February 23, 2023

OIT Network Maintenance, Saturday, February 25

Filed under: Uncategorized — Michael Weiner @ 2:31 pm

WHAT’S HAPPENING?

OIT Network Services will be upgrading the Coda Data Center firewall appliances. This will briefly disrupt connections to PACE, impacting login sessions, interactive jobs, and Open OnDemand sessions. Details on the maintenance are available on the OIT status page.

WHEN IS IT HAPPENING?
Saturday, February 25, 2023, 6:00 AM – 12:00 noon

WHY IS IT HAPPENING?
Required maintenance

WHO IS AFFECTED?

Any researcher or student with an active connection to PACE clusters (Phoenix, Hive, Buzzard, PACE-ICE, and COC-ICE) may lose their connection briefly during the maintenance window. Firebird will not be impacted.

This impacts ssh sessions and interactive jobs. Running batch jobs will not be impacted. Open OnDemand sessions that are disrupted may be resumed via the web interface once the network is restored, provided their walltime has not expired.

WHAT DO YOU NEED TO DO?

No action is required.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

February 6, 2023

PACE Spending Deadlines for FY23

Filed under: Uncategorized — Michael Weiner @ 1:48 pm

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY23 on June 30, 2023, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by March 31, 2023. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2023 will be held for processing in July, in FY24. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2023.
    1. State funds (DE worktags) expiring on June 30, 2023, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2023, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

January 11, 2023

PyTorch Security Risk: Please Check & Update

Filed under: Uncategorized — Michael Weiner @ 9:19 am

WHAT’S HAPPENING?

Researchers who install their own copies of PyTorch may have downloaded a compromised package and should uninstall it immediately.

WHEN IS IT HAPPENING?

PyTorch-nightly builds installed between December 25 and 30, 2022, are impacted. Please uninstall immediately if you installed one of these builds.

WHY IS IT HAPPENING?

A malicious Triton dependency was added to the Python Package Index. See https://pytorch.org/blog/compromised-nightly-dependency/ for details.

WHO IS AFFECTED?

Researchers who installed PyTorch on PACE or other services and updated it with nightly packages between December 25 and 30 are affected. PACE has scanned all .conda and .local directories on our systems and has not identified any copies of the malicious Triton package.

Affected services: All PACE clusters

WHAT DO YOU NEED TO DO?

Please uninstall the compromised package immediately. Details are available at https://pytorch.org/blog/compromised-nightly-dependency/. In addition, please alert PACE at pace-support@oit.gatech.edu if you identify an installation on our systems.
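
As a quick check-and-remove sketch (these commands follow the general pip/conda workflow; defer to the linked PyTorch post for the authoritative steps):

    # Check whether the malicious dependency is present in your environments
    pip3 list 2>/dev/null | grep -i torchtriton
    conda list 2>/dev/null | grep -i torchtriton
    # If found, remove the affected nightly packages and clear the pip cache
    pip3 uninstall -y torch torchvision torchaudio torchtriton
    pip3 cache purge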

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions, or if you are unsure if you have installed the compromised package on PACE.
