PACE A Partnership for an Advanced Computing Environment

March 16, 2024

PACE Clusters Unreachable

Filed under: Uncategorized — Michael Weiner @ 7:13 pm

[3/18/24 10:00 AM]

Full functionality of all PACE clusters has been restored, and the schedulers have resumed launching queued jobs. Please resubmit any jobs that may have failed over the weekend.

A migration of GT’s DNS services from BlueCat to Efficient IP on Saturday caused widespread outages of PACE and other campus services over the weekend. DNS records began to disappear at 5 PM on Saturday and were patched late Saturday night, with PACE login access returning on Sunday morning as the changes propagated.

All jobs running on Phoenix and Firebird between 5:30 PM on Saturday, March 16, and 9:00 AM on Monday, March 18, will be refunded.

Thank you for your patience as we recovered from the DNS outage.

[3/16/24 7:15 PM]

Summary: All PACE clusters (Phoenix, Hive, ICE, Firebird, and Buzzard) are currently unreachable due to a domain name resolution (DNS) issue.

Details: We are investigating a DNS issue that has left all PACE clusters unreachable. No further information is known at this time. We are pausing the scheduler on all clusters to prevent additional jobs from starting.

Impact: It will not be possible to access any PACE cluster via ssh or OnDemand at this time. Running jobs may be impacted on all clusters except Firebird. If you are already connected to a PACE cluster, scheduler and other commands may fail with address resolution errors on all clusters except Firebird.
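If you are already connected and want to check whether a failing command is caused by name resolution rather than something else, a minimal Python sketch along these lines may help. It is an illustration only, not a PACE-provided tool: the function name `can_resolve` and its `resolver` parameter are assumptions introduced here, and it relies only on the standard library's `socket.getaddrinfo`.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is a hypothetical hook (not part of any PACE tooling) that
    defaults to socket.getaddrinfo and can be swapped out for testing.
    """
    try:
        # The port is None because only name resolution matters here.
        results = resolver(hostname, None)
    except socket.gaierror:
        # getaddrinfo raises gaierror when the name cannot be resolved.
        return False
    return len(results) > 0
```

Calling `can_resolve` with the hostname you normally connect to returns `False` when the lookup itself fails, which points to a DNS problem rather than a scheduler or network fault.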

Thank you for your patience as we work to restore access to PACE clusters. Please contact us at pace-support@oit.gatech.edu with any questions. Please visit status.gatech.edu for updates.

March 15, 2024

PACE Spending Deadlines for FY24

Filed under: Uncategorized — Michael Weiner @ 1:17 pm

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY24 on June 30, 2024, we would like to alert you to several deadlines:

  1. Because purchase requests take time to process, we would like to receive all prepaid compute and lump-sum storage purchases exceeding $5,000 by April 19, 2024. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2024, will be held for processing in July, in FY25. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2024.
    1. State funds (DE worktags) expiring on June 30, 2024, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2024, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

March 14, 2024

Intermittent Scratch Access from Phoenix OnDemand File Browser

Filed under: Uncategorized — Michael Weiner @ 5:39 pm

Summary: Phoenix scratch storage may not be accessible from the OnDemand file browser. There is no impact to scratch access or performance from login nodes, running jobs (including those launched via OnDemand apps), or Globus. The Globus File Manager may serve as an alternative.

Details: Over the past several weeks, researchers and the PACE team have identified intermittent failures when accessing Phoenix scratch directories from the “Files” tab in Phoenix OnDemand. “Permission denied” or other error messages may be displayed. The PACE team is working to restore reliable access. The issue has been isolated to the way the OnDemand web server accesses scratch storage and therefore does not have a wider impact.

Researchers wishing to use a graphical web-based file browser to manage files in their Phoenix scratch directories are encouraged to use the File Manager in Globus, which has similar capabilities. It is not necessary to install the Globus Connect Personal client on a local computer if you only wish to manage files on Phoenix rather than transfer them. Visit KB0041890 for more information about using Globus. KB0042390 provides information about using the Globus File Manager.

Impact: Only the file browser in Phoenix OnDemand is affected. There is no impact to accessing scratch for jobs launched via the “Interactive Apps” or “IDEs” in OnDemand, which run on compute nodes. Similarly, access to scratch from login nodes, jobs on compute nodes, and Globus is normal. There is no performance impact.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with questions or concerns.

March 13, 2024

Firebird Firewall Update

Filed under: Uncategorized — Jeff Valdez @ 12:23 pm

Summary: The firewall protecting access to Firebird needs to be updated to avoid certificate expiration at the end of the month.

Details: The network team needs to update the code on the firewalls protecting access to Firebird. As the connections are switched over to the High Availability (HA) pair, users might experience disconnections. The upgrade is needed to avoid certificate expiration at the end of the month. It was not done during the last maintenance day because the production version of the code was released late, and it cannot wait until the next maintenance day.

The update will be completed during tomorrow’s network change window, Thursday, March 14, starting at 8 PM EDT, and finishing no later than 11:59 PM EDT. The upgrade itself will take about 30 minutes to complete within that time frame.

Impact: Access to Firebird head nodes will be impacted. Running batch jobs on the Slurm scheduler will continue without issues, but interactive jobs may be disrupted.

Thank you for your patience as we complete this update. Please contact us at pace-support@oit.gatech.edu with any questions.
