PACE - A Partnership for an Advanced Computing Environment

March 16, 2023

Phoenix project storage outage

Filed under: Uncategorized — Michael Weiner @ 2:48 pm

[Updated 2023/03/17, 3:30 PM ET]

Phoenix project storage is again available, and we have resumed the scheduler, allowing new jobs to begin. Queued jobs will begin as resources are available.

The storage issue arose when one metadata server rebooted shortly after 1:00 PM yesterday, and the high-availability configuration automatically switched to the secondary server, which became overloaded. After extensive investigation yesterday evening and today, in collaboration with our storage vendor, we identified and stopped a specific series of jobs that was heavily taxing storage, and we also replaced several cables to fully restore Phoenix project storage availability.

Jobs that were running as of 1:00 PM yesterday and have failed, or will fail, due to the project storage outage will be refunded to the charge account provided. Please resubmit these failed jobs to Slurm to continue your research.
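If you are unsure which of your jobs were affected, a quick check and resubmission might look like the following; this is a minimal sketch, where the job script name is a placeholder:

    # List your jobs that failed after the outage began (1:00 PM on March 16).
    sacct -X -u $USER -S 2023-03-16T13:00 -E now \
          --state=FAILED,NODE_FAIL --format=JobID,JobName,State,Elapsed

    # Resubmit each affected job script; my_job.sbatch is a placeholder name.
    sbatch my_job.sbatch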

Thank you for your patience as we repaired project storage. Please contact us with any questions.

[Updated 2023/03/16, 11:55PM ET]

We are still experiencing significant filesystem slowness. Job scheduling will remain paused overnight, and the PACE team will resume troubleshooting as early as possible in the morning.

[Updated 2023/03/16, 6:50PM ET]

Troubleshooting continues with the vendor’s assistance. The file system is currently stable, but one of the metadata servers continues to experience an abnormally high workload. We are working to resolve this issue to avoid additional file system failures.

[Original post 2023/03/16, 2:48PM ET]

Summary: Phoenix project storage is currently unavailable. The scheduler is paused, preventing any additional jobs from starting until the issue is resolved.

Details: A metadata server (MDS) for the Phoenix Lustre parallel filesystem hosting project storage has encountered errors and rebooted. The PACE team is investigating and working to restore project storage availability.

Impact: Project storage is slow or unreachable at this time. Home and scratch storage are not impacted, and running jobs that use only those directories should continue. Jobs that use project storage may fail or stall. To avoid further job failures, we have paused the scheduler, so no new jobs will start on Phoenix, regardless of the storage used.
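To see the state of your own jobs while the scheduler is paused, a standard Slurm query works; a minimal sketch:

    # Show your running and pending jobs. Pending (PD) jobs will remain
    # queued and will not start until the scheduler is resumed.
    squeue -u $USER -t RUNNING,PENDING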

Thank you for your patience as we investigate this issue and restore Phoenix storage to full functionality.

For questions, please contact PACE at pace-support@oit.gatech.edu.

February 23, 2023

OIT Network Maintenance, Saturday, February 25

Filed under: Uncategorized — Michael Weiner @ 2:31 pm

WHAT’S HAPPENING?

OIT Network Services will be upgrading the Coda Data Center firewall appliances. This will briefly disrupt connections to PACE, impacting login sessions, interactive jobs, and Open OnDemand sessions. Details on the maintenance are available on the OIT status page.

WHEN IS IT HAPPENING?
Saturday, February 25, 2023, 6:00 AM – 12:00 noon

WHY IS IT HAPPENING?
Required maintenance

WHO IS AFFECTED?

Any researcher or student with an active connection to PACE clusters (Phoenix, Hive, Buzzard, PACE-ICE, and COC-ICE) may lose their connection briefly during the maintenance window. Firebird will not be impacted.

This impacts ssh sessions and interactive jobs. Running batch jobs will not be impacted. Open OnDemand sessions that are disrupted may be resumed via the web interface once the network is restored, provided their walltime has not expired.

WHAT DO YOU NEED TO DO?

No action is required.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

February 6, 2023

PACE Spending Deadlines for FY23

Filed under: Uncategorized — Michael Weiner @ 1:48 pm

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY23 on June 30, 2023, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by March 31, 2023. Please contact us if you anticipate any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2023 will be held for processing in July, in FY24. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2023.
    1. State funds (DE worktags) expiring on June 30, 2023, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2023, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

January 11, 2023

PyTorch Security Risk: Please Check & Update

Filed under: Uncategorized — Michael Weiner @ 9:19 am

WHAT’S HAPPENING?

Researchers who install their own copies of PyTorch may have downloaded a compromised package and should uninstall it immediately.

WHEN IS IT HAPPENING?

PyTorch-nightly installed or updated between December 25 and 30, 2022, is impacted. Please uninstall it immediately if you installed this version.

WHY IS IT HAPPENING?

A malicious Triton dependency was added to the Python Package Index. See https://pytorch.org/blog/compromised-nightly-dependency/ for details.

WHO IS AFFECTED?

Researchers who installed PyTorch on PACE or other services and updated it with nightly packages between December 25 and 30 are affected. PACE has scanned all .conda and .local directories on our systems and has not identified any copies of the malicious Triton package.

Affected services: All PACE clusters

WHAT DO YOU NEED TO DO?

Please uninstall the compromised package immediately. Details are available at https://pytorch.org/blog/compromised-nightly-dependency/. In addition, please alert PACE at pace-support@oit.gatech.edu if you identify an installation on our systems.
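For pip-based installations, the removal steps in the linked advisory amount to the following; run them inside the environment where you installed the nightly build:

    # Check whether the malicious dependency is present.
    pip3 show torchtriton

    # Uninstall the affected packages and purge cached copies,
    # as described in the linked PyTorch advisory.
    pip3 uninstall -y torch torchvision torchaudio torchtriton
    pip3 cache purge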

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions, or if you are unsure if you have installed the compromised package on PACE.

December 21, 2022

Storage-eas read-only during configuration change

Filed under: Uncategorized — Michael Weiner @ 11:08 am

[Update 1/9/23 10:58 AM]

The migration of storage-eas data to a new location is complete, and full read/write capability is available for all research groups on the device. Researchers may resume regular use of storage-eas, including writing new data to it.

Thank you for your patience as we completed these configuration changes to improve stability of storage-eas. Please email us at pace-support@oit.gatech.edu with any questions.

 

[Original Post 12/21/22 11:08 AM]

Summary: Researchers have reported multiple outages of the storage-eas server recently. To stabilize the storage, PACE will make configuration changes. The storage-eas server will become read-only at 3 PM today and will remain read-only until after the Winter Break, while the changes are being implemented. We will provide an update when write access is restored.

Details: PACE will remove the deduplication setting on storage-eas, which is causing performance and stability issues. Beginning this afternoon, the system will become read-only while all data is copied to a new location. After the copy is complete, we will enable access to the storage in the new location, with full read/write capabilities.

Impact: Researchers will not be able to write to storage-eas for up to two weeks. You may continue reading files from it on both PACE and external systems where it is mounted. While this move is in progress, PACE recommends that researchers copy any files needed for Phoenix jobs into their scratch directories and run jobs from there, so that writes land in scratch (see the sketch below). Scratch provides each researcher with 15 TB of temporary storage on the Lustre parallel filesystem. Files in scratch can be copied to non-PACE storage via Globus.
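For example, staging input data into scratch might look like the following; a minimal sketch in which both paths are placeholders for your actual storage-eas mount point and scratch directory:

    # Stage inputs from the (read-only) storage-eas mount into scratch;
    # both paths below are hypothetical.
    rsync -av /storage/eas/mylab/inputs/ ~/scratch/inputs/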

Thank you for your patience as we complete these configuration changes to improve stability of storage-eas. Please email us at pace-support@oit.gatech.edu with any questions.

December 2, 2022

Slow Storage on Phoenix

Filed under: Uncategorized — Michael Weiner @ 1:11 pm

[Update 12/5/22 10:45 AM]

Performance on Phoenix project & scratch storage has returned to normal. PACE continues to investigate the root cause of last week’s slowness, and we would like to thank the researchers we contacted with questions about their workflows. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 12/2/22 1:11 PM]

Summary: Researchers may experience slow performance on Phoenix project & scratch storage.

Details: Over the past three days, Phoenix has experienced intermittent slowness on the Lustre filesystem hosting project & scratch storage due to heavy utilization. PACE is investigating the source of the heavy load on the storage system.

Impact: Any jobs or commands that read or write on project or scratch storage may run more slowly than normal.

Thank you for your patience as we continue to investigate. Please contact us at pace-support@oit.gatech.edu with any questions.

 

November 29, 2022

Scratch Deletion Resumption on Phoenix & Hive

Filed under: Uncategorized — Michael Weiner @ 9:27 am

Monthly scratch deletion will resume on the Phoenix and Hive clusters in December, in accordance with PACE’s policy of deleting scratch files more than 60 days old. Scratch deletion has been suspended since May 2022 due to an issue with a software upgrade on Phoenix’s Lustre storage system, which was resolved during the November maintenance period. Researchers with data scheduled for deletion will receive warning emails on Tuesday, December 6, and Tuesday, December 13, and files will be deleted on Tuesday, December 20. If you receive an email notification next week, please review the files scheduled for deletion (see the sketch below) and contact PACE if you need additional time to relocate them.
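To preview which of your scratch files may be candidates for deletion, a simple search can help; a minimal sketch, assuming the policy keys on modification time and that scratch is available at ~/scratch:

    # List files in scratch not modified in the last 60 days.
    find ~/scratch -type f -mtime +60 -ls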

Scratch is intended to be temporary storage, and regular deletion of old files allows PACE to offer a large space at no cost to our researchers. Please keep in mind that scratch space is not backed up, and any important data for your research should be relocated to your research group’s project storage.

If you have any questions about scratch or any other storage location on PACE clusters, please contact PACE.

November 7, 2022

Action Required: Globus Certificate Authority Update

Filed under: Uncategorized — Michael Weiner @ 4:20 pm

Globus is updating the Certificate Authority (CA) used for its transfer service, and action is required to continue using existing Globus endpoints. PACE updated the Phoenix, Hive, and Vapor server endpoints during the recent maintenance period. To continue using Globus Connect Personal to transfer files to and from your own computers, please update your Globus client to version 3.2.0 by December 12, 2022; this update is required to keep transferring data between your local computer and PACE or other computing sites. Full details are available on the Globus website.

Please contact us at pace-support@oit.gatech.edu with any questions.

 

October 3, 2022

Firebird inaccessible

Filed under: Uncategorized — Michael Weiner @ 9:41 am

[Update 10/3/22 10:45 AM]

Access to Firebird and the PACE VPN has been restored, and all systems should be functioning normally. If you do not see the PACE VPN as an option in the GlobalProtect client, please disconnect from the GT VPN and reconnect for it to appear again.

Urgent maintenance on the GlobalProtect VPN device on Thursday night inadvertently led to the loss of PACE VPN access, which was restored this morning.

Please contact us at pace-support@oit.gatech.edu with questions, or if you are still unable to access Firebird.

 

[Original Message 10/3/22 9:40 AM]

Summary: The Firebird cluster and PACE VPN are currently inaccessible. OIT is working to restore access.

Details: The Firebird cluster was found to be inaccessible over the weekend. PACE is working with OIT colleagues to identify the cause and restore access.

Impact: Researchers are unable to connect to the PACE VPN or access the Firebird cluster.

Thank you for your patience as we work to restore access. Please contact us at pace-support@oit.gatech.edu with questions.

July 26, 2022

Hive scheduler outage

Filed under: Uncategorized — Michael Weiner @ 9:23 am

Summary: The Hive scheduler became non-responsive last evening and was restored at approximately 8:30 AM today.

Details: The Torque resource manager on the Hive scheduler stopped responding around 7:00 PM yesterday. The PACE team restored its function around 8:30 AM this morning and is continuing to monitor its status. The scheduler was fully functional for some time after yesterday afternoon’s system utility repair, and it is not clear whether the two issues are connected.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Hive Open OnDemand. Running jobs were not interrupted.

Thank you for your patience last night. Please contact us at pace-support@oit.gatech.edu with any questions.

