PACE: A Partnership for an Advanced Computing Environment

May 13, 2024

Firebird scheduler outage resolved

Filed under: Uncategorized — Michael Weiner @ 2:47 pm

Summary: A configuration issue with the Firebird scheduler caused Firebird jobs to fail over the weekend and this morning because storage was not accessible on compute nodes. The issue was resolved by 2:00 PM today.

Details: Changes to the Firebird scheduler configuration were made during last week’s maintenance period (May 7-9) in order to facilitate future updates to Firebird. A repair was made on Friday, after which jobs were running successfully. Over the weekend, a different issue occurred, and jobs were launched on compute nodes without the proper storage being mounted. We have fully reverted the Firebird configuration changes to their state prior to the maintenance period, and jobs should no longer face any errors.

Impact: Some jobs launched on Firebird over the last three days may have failed due to missing home and project storage on the compute nodes with messages like “no such file or directory” or an absent output file. Jobs attempted mid-day on Monday, May 13, may have been queued for an extended period while repairs were made to the scheduler configuration.

Thank you for your patience as we resolved this issue. Please contact us at pace-support@oit.gatech.edu with questions or if you continue to experience errors.

April 22, 2024

PACE Maintenance Period (May 07 – May 10, 2024) 

Filed under: Uncategorized — Eric Coulter @ 9:53 am

[Update 05/09/24 04:25 PM]

Dear PACE users,   

The maintenance on the Phoenix, Hive, Firebird, and OSG Buzzard clusters has been completed. The Phoenix, Hive, Firebird, and OSG Buzzard clusters are back in production and ready for research; all jobs that have been held by the scheduler have been released. 

The ICE cluster is still under maintenance due to the RHEL9 migration, but we expect it to be ready tomorrow. Instructors teaching summer courses will be notified when it is ready. 

The POSIX user group names on the Phoenix, Hive, Firebird, and OSG Buzzard clusters have been updated so that names will start with the “pace-” prefix. If your scripts or workflows rely on POSIX group names, they will need to be updated; otherwise, no action is required on your part. This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.

Just a reminder that the next Maintenance Period will be August 6-8, 2024.

Thank you for your patience! 

-The PACE Team 

[Update 05/07/24 06:00 AM]

PACE Maintenance Period starts now at 6:00 AM on Tuesday, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, 05/10/2024.

[Update 05/01/24 06:37 PM]

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, May 7th (05/07/2024), and is tentatively scheduled to conclude by 11:59 PM on Friday, May 10th (05/10/2024). An extra day is needed to accommodate physical work done by Databank in the Coda Data Center. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?

As usual, jobs whose requested walltime would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 
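The hold described above depends only on whether a job's requested walltime crosses the maintenance start. As a rough, illustrative sketch (GNU `date`; the timestamps and the `sbatch` invocation in the comment are examples, not real commands to copy verbatim), you can compute how much walltime is still available before the window opens:

```shell
# Hours available between an example submission time and the 6:00 AM
# May 7 maintenance start; a job must request a walltime shorter than
# this to run before the window (times below are illustrative).
maint_start=$(date -d "2024-05-07 06:00" +%s)
submit_time=$(date -d "2024-05-06 06:00" +%s)   # example submission time
hours_available=$(( (maint_start - submit_time) / 3600 ))
echo "$hours_available"   # 24
# e.g. `sbatch -t 23:00:00 job.sbatch` would start before the window;
# `sbatch -t 48:00:00 job.sbatch` would be held until after maintenance.
```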

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated.
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part.
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
    • NOTE: This item was originally planned for January but was delayed to avoid integration issues with IAM services, which have now been resolved.
  • [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know! 
    • The ICE login nodes will be updated to RHEL 9.3 as well, and this WILL create new ssh host-keys on ICE login nodes – so please expect a message that “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” when you (or your students) next access ICE after maintenance. 
  • [ICE] We will be retiring 8 of the RTX6000 GPU nodes on ICE to prepare for the addition of several new L40 nodes the week after the maintenance period. 
  • [software] Sync Gaussian and VASP on RHEL7 pace-apps.
  • [software] Sync any remaining RHEL9 pace-apps for the OS migration.
  • [Phoenix, ICE] Upgrade Nvidia drivers on all HGX/DGX servers.
  • [Hive] The scratch deleter will not run in May and June but will resume in July.
  • [Phoenix] The scratch deleter will not run in May but will resume in June.
  • [ICE] The scratch deleter will run for Spring semester deletion during the week of May 13.
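For the group-rename item above, a minimal shell sketch of how to make scripts robust to the change (the project group name shown is hypothetical; only the name gains the “pace-” prefix, the numeric GID is unchanged):

```shell
# The numeric GID is stable across the rename; only the name changes.
id -g     # numeric primary GID: unaffected by the rename
id -gn    # group name: will show the new "pace-" prefixed name
# A script that hard-coded a name like `chgrp myproject shared/`
# should become either:
#   chgrp "pace-myproject" shared/    # the updated name, or
#   chgrp "$(id -gn)" shared/         # resolved at runtime
```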

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: replace all components of cold loop water pump that had issues a couple of maintenance periods ago.  
  • [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9.
  • [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Unit).
  • [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0. 
  • [ICE] Consolidate all the ICE access entitlements into a single one, all-pace-ice-access.
  • [Hive] Upgrade Hive compute nodes to GPFS 5.1.
  • [Phoenix] Replace cables for the Phoenix storage server.
  • [Firebird] Patch Firebird storage server to 100GbE switch and reconfigure.
  • [Firebird, Hive] Deploy Slurm scheduler CLI+Feature bits on Firebird and Hive. 
  • [datacenter] Configure LDAP on the MANTA NetApp HPCNA SVM.

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may read this message on our blog.

Thank you,  

-The PACE Team 

[Update 04/22/24 09:53 AM]

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, May 7th (05/07/2024), and is tentatively scheduled to conclude by 11:59 PM on Friday, May 10th (05/10/2024). The additional day is needed to accommodate physical work carried out by Databank in the Coda datacenter. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?   

As usual, jobs whose requested walltime would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.  
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated. 
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part. 
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.  
    • NOTE: This item was originally planned for January, but was delayed to avoid integration issues with IAM services, which have now been resolved.
  • [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know! 
    • Note – This WILL create new ssh host-keys on ICE login nodes – so please expect a message that “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” when you (or your students) next access ICE after maintenance.
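To clear the host-key warning described above after the RHEL 9.3 migration, a hedged sketch (the login hostname is illustrative; substitute the host you actually connect to):

```shell
# Remove the stale host key recorded for the ICE login node so that the
# next connection stores the new RHEL 9.3 key. Hostname is illustrative.
mkdir -p "$HOME/.ssh"
touch "$HOME/.ssh/known_hosts"   # no-op if it already exists
ssh-keygen -R login-ice.pace.gatech.edu -f "$HOME/.ssh/known_hosts"
# ssh will prompt you to accept the new host key on your next login.
```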

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: replace all components of cold loop water pump that had issues a couple of maintenance periods ago.  
  • [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9 
  • [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Unit) 
  • [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0. 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

April 8, 2024

Phoenix A100 CPU:GPU Ratio Change

Filed under: Uncategorized — Michael Weiner @ 5:45 pm

On Phoenix, the default number of CPUs assigned to jobs requesting an Nvidia Tensor Core A100 GPU has recently changed. Now, jobs requesting one or more A100 GPUs will be assigned 8 cores per GPU by default, rather than 32 cores per GPU. You may still request up to 32 cores per GPU if you wish by using the --ntasks-per-node flag in your SBATCH script or salloc command to specify the number of CPUs per node your job requires. Any request with a CPU:GPU ratio of at most 32 will be honored.
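As an illustrative sketch of the override described above (the job name, account placeholder, and program are hypothetical; the GPU request follows the usual Slurm `--gres` form), a batch script restoring the old 32-cores-per-GPU ratio might look like:

```shell
#!/bin/bash
#SBATCH -J a100-job                # hypothetical job name
#SBATCH -A <account>               # your charge account
#SBATCH -N 1
#SBATCH --gres=gpu:A100:1          # one A100 GPU
#SBATCH --ntasks-per-node=32       # override the new default of 8 cores/GPU
#SBATCH -t 8:00:00
#SBATCH -q inferno                 # or "embers" for free backfill

srun ./my_gpu_program              # placeholder executable
```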

12 of our Phoenix A100 nodes host 2 GPUs and 64 CPUs (AMD Epyc 7513), supporting a CPU:GPU ratio up to 32, and can be allocated through both the inferno (default priority) and embers (free backfill) QOSs. We have recently added 1 more A100 node with 8 GPUs and 64 CPUs (AMD Epyc 7543), requiring this change to the default ratio. This new node is available only to jobs using the embers QOS due to the funding for its purchase.

Please visit our documentation to learn more about GPU requests and QOS or about compute resources on Phoenix and contact us with any questions about this change.

April 4, 2024

PACE clusters unreachable on the morning of April 4, 2024

Filed under: Uncategorized — Grigori Yourganov @ 10:54 am

The PACE clusters were not accepting new connections from 4 AM until 10 AM today (April 4, 2024). As part of the preparations to migrate the clusters to a new version of the operating system (Red Hat Enterprise Linux 9), a configuration management entry from the development environment was accidentally applied to production, which placed the /etc/nologin file on the head nodes and blocked new logins. This has been fixed, and additional controls are in place to prevent a recurrence. 

The jobs and the data transfers running during that period were not affected. The interactive sessions that started before the configuration change were not affected either. 

Currently, the clusters are back online, and the scheduler is accepting jobs. We sincerely apologize for this accidental disruption. 

March 16, 2024

PACE Clusters Unreachable

Filed under: Uncategorized — Michael Weiner @ 7:13 pm

[3/18/24 10:00 AM]

Full functionality of all PACE clusters has been restored, and the schedulers have resumed launching queued jobs. Please resubmit any jobs that may have failed over the weekend.

A migration of GT’s DNS services on Saturday from BlueCat to Efficient IP caused widespread outages over the weekend to PACE and other campus services. DNS records began to disappear at 5 PM on Saturday and were patched late Saturday night, with PACE login access reappearing on Sunday morning as changes propagated.

All jobs running on Phoenix and Firebird between 5:30 PM on Saturday, March 16, and 9:00 AM on Monday, March 18, will be refunded.

Thank you for your patience as we recovered from the DNS outage.

[3/16/24 7:15 PM]

Summary: All PACE clusters (Phoenix, Hive, ICE, Firebird, and Buzzard) are currently unreachable due to a domain name resolution (DNS) issue.

Details: We are investigating a DNS issue that has left all PACE clusters unreachable. No further information is known at this time. We are pausing the scheduler on all clusters to prevent additional jobs from starting.

Impact: It will not be possible to access any PACE cluster via ssh or OnDemand at this time. Running jobs may be impacted on all clusters except Firebird. If you are already connected to a PACE cluster, scheduler and other commands may fail with address resolution errors on all clusters except Firebird.

Thank you for your patience as we work to restore access to PACE clusters. Please contact us at pace-support@oit.gatech.edu with any questions. Please visit status.gatech.edu for updates.

March 15, 2024

PACE Spending Deadlines for FY24

Filed under: Uncategorized — Michael Weiner @ 1:17 pm

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY24 on June 30, 2024, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by April 19, 2024. Please contact us if you know there will be any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2024, will be held for processing in July, in FY25. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2024.
    1. State funds (DE worktags) expiring on June 30, 2024, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2024, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

March 14, 2024

Intermittent Scratch Access from Phoenix OnDemand File Browser

Filed under: Uncategorized — Michael Weiner @ 5:39 pm

Summary: Phoenix scratch storage may not be accessible from the OnDemand file browser. There is no impact to scratch access or performance from login nodes, running jobs (including those launched via OnDemand apps), or Globus. The Globus File Manager may serve as an alternative.

Details: Over the past several weeks, researchers and the PACE team have identified intermittent failures accessing Phoenix scratch directories from the “Files” tab in Phoenix OnDemand. “Permission denied” or other error messages may appear. The PACE team is working to restore reliable access. The issue has been isolated to the way the OnDemand web server accesses scratch storage and therefore has no wider impact.

Researchers wishing to use a graphical web-based file browser to manage files in their Phoenix scratch directories are encouraged to use the File Manager in Globus, which has similar capabilities. It is not necessary to install the Globus Connect Personal client on a local computer if you only wish to manage files on Phoenix rather than transfer them. Visit KB0041890 for more information about using Globus. KB0042390 provides information about using the Globus File Manager.

Impact: The impact is limited to the file browser in Phoenix OnDemand. There is no impact to scratch access for jobs launched via the “Interactive Apps” or “IDEs” in OnDemand, which run on compute nodes. Similarly, access to scratch from login nodes, jobs on compute nodes, and Globus is normal. There is no performance impact.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with questions or concerns.

March 13, 2024

Firebird Firewall Update

Filed under: Uncategorized — Jeff Valdez @ 12:23 pm

Summary: The firewall protecting access to Firebird needs to be updated to avoid certificate expiration at the end of the month. 

Details: The network team needs to update the code on the firewalls protecting access to Firebird. As the connections are switched over to the High Availability (HA) pair, users might experience disconnections. The upgrade is needed to avoid certificate expiration at the end of the month; it was not done during the last maintenance day due to delays in the release of the production version of the code and it cannot wait until the next maintenance day.

The update will be completed during tomorrow’s network change window, Thursday, March 14, starting at 8 PM EDT, and finishing no later than 11:59 PM EDT. The upgrade itself will take about 30 minutes to complete within that time frame.

Impact: Access to Firebird head nodes will be impacted. Running batch jobs on the Slurm scheduler will continue without issues, but interactive jobs may be disrupted.

Thank you for your patience as we complete this update. Please contact us at pace-support@oit.gatech.edu with any questions. 

February 19, 2024

Outage on Scratch Storage on the Phoenix Cluster

Filed under: Uncategorized — Michael Weiner @ 10:35 am

[Update 02/19/24 10:47 AM]

Summary: The Phoenix /storage/scratch1 file system is operational. The performance is stable. The scheduler has been un-paused, current jobs continue to run, and new jobs are being accepted. 

Details: The storage vendor provided us with a hot fix late Friday evening that was installed this morning on the Lustre appliance supporting /storage/scratch1. The performance test of the scratch file system after the upgrade was stable. We are releasing the cluster and the Slurm scheduler. The Open OnDemand services are back to normal. 

The cost of all jobs running between 6 PM on Wednesday, February 14, and 10 AM on Monday, February 19, will be refunded to the PIs’ accounts. 

During the weekend, an automatic process accidentally resumed the scheduler, and some jobs started to run. If you have a job that ran during the outage and used scratch, please consider re-running it from the beginning: if it was running before the hot fix was applied, some of its processes may have failed while trying to access the scratch file system. The cost of the jobs that were accidentally restarted during the outage will be refunded.  

Impact: The storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and Open OnDemand services are working as expected. In case you have any issues, please contact us at pace-support@oit.gatech.edu.   

Thank you for your patience! 

[Update 02/16/24 05:58 PM]

PACE has decided to leave the Slurm scheduler paused, and no jobs will be accepted over the weekend. We will allow jobs that are currently running to continue, but those utilizing scratch may fail.

While keeping job scheduling paused over the weekend was a difficult call, we want to ensure that the issues with scratch storage do not impact the integrity of other components on Phoenix. 

We are not confident that functionality can be restored without further input from the storage vendor. As part of continuing the diagnostic process, we expect we will have no other option but to reboot the scratch storage system on Monday morning. As a result, any jobs still running at that point that utilize scratch storage will likely fail. We have continued to provide diagnostic data that the vendor will analyze during the weekend. We plan to provide an update on the state of the scratch storage by next Monday (2/19) at noon.

We will refund all jobs that ran from the start of the outage on Wednesday evening 6:00 pm until performance is restored. 

Monthly deletion of old files in scratch, scheduled for Tuesday, February 20, has been canceled. All researchers who have received notifications for February will be given a one-month extension automatically. 

Finally, while you cannot schedule jobs, you may be able to log on to Phoenix to view or copy files. However, please be aware that you may see long delays with simple commands (ls/cd), creating new files/folders, and editing existing ones. We recommend avoiding file commands like “ls” on your home (~) or scratch (~/scratch) directories, as they may cause your command prompt to stall. 
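As a small illustration of the advice above, demonstrated on a throwaway directory rather than the live filesystem: a plain name-only listing never stats directory entries, so it cannot hang on a stuck mount behind a symlink, unlike `ls -l` or a color-enabled `ls`:

```shell
# Demo: listing names only does not touch symlink targets.
d=$(mktemp -d)
ln -s /nonexistent/scratch "$d/scratch"   # stand-in for the stuck mount
ls -1 --color=never "$d"                  # prints "scratch"; target untouched
readlink "$d/scratch"                     # shows the target without following it
rm -rf "$d"
```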

You may follow updates to this incident on the GT OIT Status page.  

We recognize the negative impact this storage disruption has on the research community, especially given that some of you may have research deadlines. Thank you for your patience as we continue working to fully restore scratch storage system performance. If you have additional concerns, please email ART Executive Director, Didier Contis, directly at didier.contis@gatech.edu

[Update 02/16/24 02:59 PM]

Unfortunately, the scratch storage on the Phoenix cluster remains unstable. You may see long delays with simple commands (ls/cd), creating new files/folders, and editing existing ones. Jobs currently running from scratch might be experiencing delays. We are continuing to work on resolving the issue and are in close communication with the storage vendor. The scheduler remains paused, and no new jobs are being accepted. We will provide an update on the state of the scratch storage by this evening. We sincerely apologize for the inconvenience this outage is causing. 

Thank you for your patience.

[Update 02/16/24 09:15 AM]

Summary: Phoenix /storage/scratch1 file system continues to have issues for some users. The recommended procedure is to fail over the storage services to the high availability pair and reboot the affected component. This will require pausing the Phoenix scheduler. 

Details: After analyzing the storage logs, the vendor recommended that the affected component be rebooted, moving all services and connections to the high-availability pair. While the device restarts, the Phoenix scheduler will be paused. Running jobs will see a momentary pause accessing the /storage/scratch1 file system while the connections are moved to the redundant device. Once the primary device is up and running and all errors have cleared, the services will be switched back, and job scheduling will resume. 

We will start this procedure at 10:00 AM EDT. Please wait for the all-clear message before starting additional jobs on the Phoenix cluster. 

Impact: Jobs on Phoenix will be paused during the appliance restart procedure; running jobs should continue with some delays while the connections are switched over. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions. 

[Update 02/15/24 04:56 PM]

Summary: Phoenix /storage/scratch1 file system is now stable for most users. A small number of users are still experiencing issues. 

Details: While we continue working with the vendor to get to the root cause of the issue, all diagnostic tests executed through the day have been successful. However, there is a small number of users who have running jobs from their scratch folder that continue to notice slowness accessing their files. 

Please inform us if you are seeing degraded performance on our file systems. As mentioned, we continue the efforts to find a permanent solution. 

Impact: Access to /storage/scratch1 is normal for the majority of users; please let us know if you are still experiencing issues by emailing us at pace-support@oit.gatech.edu. OnDemand-Phoenix and the scheduler are working fine. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions. 

[Update 02/15/24 11:07 AM]

Summary: Phoenix /storage/scratch1 file system has intermittent issues. Jobs running from the scratch storage might be stuck. 

Details: Around 5:00 PM yesterday (February 14, 2024), the Lustre filesystem hosting /storage/scratch1 on the Phoenix cluster became inaccessible. We restarted the services at 8 AM today (February 15, 2024), but some accessibility issues remain. The PACE team is investigating the cause, and the storage vendor has been contacted. This may cause delays and timeouts for interactive sessions and running jobs. 

Impact: Access to /storage/scratch1 might be interrupted for some users. Running ‘ls’ on Phoenix home directories may hang as it attempts to resolve the symbolic link to the scratch directory. OnDemand-Phoenix was also affected; as of this writing, it is stable, and we continue to monitor it. Jobs using /storage/scratch1 may be stuck. The output of the `pace-quota` command might hang as scratch utilization is checked and might show the incorrect balance. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

January 31, 2024

Phoenix Scheduler Outage

Filed under: Uncategorized — Michael Weiner @ 11:44 am

Summary: The Slurm scheduler on Phoenix is experiencing an intermittent outage.

Details: The scheduler is repeatedly freezing due to a problematic input. The PACE team has identified the likely cause and is attempting to restore functionality.

Impact: Commands like squeue and sinfo may report errors, and new jobs may not start on Phoenix. Already-running jobs are not impacted. Other clusters (Hive, ICE, Firebird, Buzzard) are not impacted.

Thank you for your patience as we work to restore Phoenix to full functionality. Please contact us at pace-support@oit.gatech.edu with any questions. You may track the status of this outage on the GT Status page.
