PACE A Partnership for an Advanced Computing Environment

December 23, 2021

Improvements to job accounting and queue wait times on PACE clusters

Filed under: Uncategorized — Semir Sarajlic @ 11:31 am
We would like to share two updates with you regarding improvements to job accounting and queue wait times on the Phoenix and Firebird clusters.
  • Due to an error, some users have seen the wrong account names listed in our pace-quota and pace-whoami utilities in recent months. We have corrected this, and all users can now use pace-quota to see available charge accounts and balances on Phoenix or Firebird. In addition, pace-quota now shows balances for all accounts, including multi-PI and school-owned accounts that previously displayed a zero balance, so researchers can always check what is available. Read our documentation for more details about the charge accounts available to you and what they mean. The pace-quota command is available on Phoenix, Hive, Firebird, and ICE. It provides user-specific details:
    • your storage usage on that cluster
    • your charge account information for that cluster (Phoenix and Firebird only)
  • Additionally, to improve utilization of our clusters and reduce wait times, we have enabled spillover between node classes. Waiting jobs can now run on underutilized, more capable nodes instead of the node class they requested. This requires no user action and incurs no additional charge. Spillover was enabled on GPU nodes in September and on CPU nodes last week, on both Phoenix and Firebird.
Please note that targeting a specific/more expensive node class to reduce wait time is no longer effective or necessary. Please request the resources required for your job. Your job will continue to be charged based on the rate for the resources it requests, even if it ends up being assigned to run on more expensive hardware.
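As a rough illustration of the charging rule described above, here is a minimal Python sketch. The node class names, the hourly rates, and the function name are all hypothetical and chosen only for the example; they are not actual PACE rates or part of any PACE tool.

```python
# Hypothetical hourly rates per node class (illustrative only, not PACE rates).
HYPOTHETICAL_RATES = {
    "cpu-small": 1.0,
    "cpu-large": 2.0,
}


def job_charge(requested_class, assigned_class, hours, rates=HYPOTHETICAL_RATES):
    """Return the charge for a job under the spillover rule.

    The charge depends only on the class the job requested. The class it
    was actually assigned to is validated but does not affect the cost.
    """
    if requested_class not in rates or assigned_class not in rates:
        raise ValueError("unknown node class")
    return rates[requested_class] * hours


# A job that requested the cheaper class but spilled over to a more
# capable node is charged the same as one that ran where it asked.
spilled = job_charge("cpu-small", "cpu-large", hours=10)
direct = job_charge("cpu-small", "cpu-small", hours=10)
print(spilled == direct)
```

Under these assumed rates, requesting the more expensive class would only raise the charge without shortening the wait, which is why requesting exactly what the job needs is the better choice.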
As always, please contact us if you have any questions.

December 21, 2021

PACE availability during the Holidays

Filed under: Uncategorized — Semir Sarajlic @ 3:00 pm

While leaving 2021 behind, we wanted to remind everyone that PACE clusters will continue to operate during the GT Institute Holiday. However, PACE staff will not be generally available for support. Please continue to report any problems or requests you may have to pace-support@oit.gatech.edu. We will receive those and get back to you as soon as possible after the holidays.

2021 was a pivotal year for PACE. We migrated all of our services to our new datacenter, changed our service model, and worked to better serve GT researchers and students. We could not have done any of this without your input, support, and patience. We are grateful for that and look forward to achieving more great things together in 2022.

Happy Holidays and a Happy New Year!

December 17, 2021

Headnode Violation Detector Updates

Filed under: Uncategorized — Michael Weiner @ 8:34 am

Running many or extended resource-intensive processes on the login nodes slows those nodes for all users and is a violation of PACE policy, as it prevents others from using the cluster. We would like to make you aware of recent improvements to our headnode violation detector.

PACE may stop processes that improperly occupy the headnode, in order to restore functionality for all members of our user community. Please use compute nodes for all computational work. If you need an interactive environment, please submit an interactive job. If you are uncertain about how to use the scheduler to work on compute nodes, please contact us for assistance. We are happy to help you with your workflows on the cluster.

If you run processes that overuse the headnode, we will send an email asking you to refrain from doing so. We have recently updated our violation detector to ensure that emails are sent to the proper user and to adjust the logic of the script to align it with policy.

Thank you for your efforts to ensure PACE clusters are an available resource for all.

December 15, 2021

Reboot on login-hive1 on Tuesday, December 21, at 10:00 AM

Filed under: Uncategorized — Michael Weiner @ 4:09 pm

Summary: Reboot on login-hive1 on Tuesday, December 21, at 10:00 AM

What’s happening and what are we doing: As part of our preparations for the RHEL7.9 testflight environment that will be available in January, PACE will reboot the login-hive1 headnode on Tuesday, December 21, at 10:00 AM. Hive has two headnodes, and the login-hive2 headnode will not be impacted. The load balancer that automatically routes new login-hive connections to either login-hive1 or login-hive2 has been adjusted to send all new connections to login-hive2 beginning the afternoon of December 15.

How does this impact me: If you are connected to login-hive1 at the time of the reboot, you will lose your connection to Hive, and any processes running on login-hive1 will be terminated. Interactive jobs that were submitted from login-hive1 and are still running will also be disrupted. Batch jobs will not be affected. Users connected to login-hive2 will not be impacted. Users who connected to Hive prior to Wednesday afternoon may be on login-hive1 and should complete their current work or log out and back in to Hive before Tuesday. Users who ssh to login-hive.pace.gatech.edu beginning this afternoon will all be assigned to login-hive2 and will not be impacted. If you specifically ssh to login-hive1.pace.gatech.edu, then you will still reach the node that is scheduled to be rebooted and should complete your session before next Tuesday.
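If you are unsure which headnode an existing session is on, a small check like the following may help. This is only a sketch: it assumes the headnode reports a hostname whose first label is login-hive1 (for example login-hive1.pace.gatech.edu), and the function name is hypothetical rather than a PACE utility.

```python
import socket
from typing import Optional


def on_node(node: str, hostname: Optional[str] = None) -> bool:
    """Return True if the short hostname (text before the first dot) equals `node`.

    When `hostname` is omitted, the hostname of the current machine is used.
    """
    if hostname is None:
        hostname = socket.gethostname()
    return hostname.split(".")[0] == node


if __name__ == "__main__":
    # Assumption: the node being rebooted identifies itself as "login-hive1".
    if on_node("login-hive1"):
        print("This session is on login-hive1; finish up or reconnect before the reboot.")
    else:
        print("This session is not on login-hive1.")
```

Running the `hostname` command in your session gives the same information if you prefer not to use Python.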

What we will continue to do: PACE will monitor the Hive headnodes and ensure that login-hive1 is fully functional after reboot before re-initiating the load balancer that distributes user logins between the two headnodes.
Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
