

Operating System Upgrade to RHEL7.9

Posted on Monday, 10 January, 2022

[Update 1/10/22 3:45 PM]

Testflight environments are now available so you can prepare for the upgrade of PACE’s Phoenix, Hive, and Firebird clusters from RHEL 7.6 to the Red Hat Enterprise Linux (RHEL) 7.9 operating system during the February 9-11 maintenance period. The required upgrade will improve the security of our clusters to comply with GT Cybersecurity policies.

All PACE researchers are strongly encouraged to test all workflows they regularly run on PACE. Please conduct your testing at your earliest convenience to avoid delays to your research. An OpenFabrics Enterprise Distribution (OFED) upgrade requires rebuilding our MPI software, which involves updates and modifications to our scientific software repository. PACE is providing updated modules for all of our Message Passing Interface (MPI) options.
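
As a concrete starting point, here is a minimal sketch of the kind of check you might run in the testflight environment. The compiler and MPI module versions below are illustrative placeholders; substitute whatever "module avail" lists there, along with your own source files and job script.

    # See which rebuilt MPI modules are offered in the testflight software repository
    module avail mvapich2

    # Load a compiler and an updated MPI module (versions are illustrative)
    module load gcc/10.3.0 mvapich2/2.3.6

    # Recompile a small MPI test program and submit a short test job
    mpicc -o hello_mpi hello_mpi.c
    qsub -l nodes=2:ppn=4,walltime=00:10:00 test_job.pbs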

For details of what to test and how to access our Testflight-Coda (Phoenix) and Testflight-Hive environments, please visit our RHEL7.9 upgrade documentation.  

Please let us know if you encounter any issues with the upgraded environment. Our weekly PACE Consulting Sessions are a great opportunity to work with PACE’s facilitation team on your testing and upgrade preparation. Visit the schedule of upcoming sessions to find the next opportunity.  

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Post 12/7/21 3:30 PM]

Summary: Operating System Upgrade to RHEL7.9

What’s happening and what are we doing: PACE will upgrade our Phoenix, Hive, and Firebird clusters to the Red Hat Enterprise Linux (RHEL) 7.9 operating system from RHEL 7.6 during the February 9-11 maintenance period. The upgrade timing of the ICE clusters will be announced later. The required upgrade will improve the security of our clusters to comply with GT Cybersecurity policies and will also update our software repository.

PACE will provide researchers with access to a “testflight” environment in advance of the upgrade, allowing you the opportunity to ensure your software works in the new environment. More details will follow at a later time, including how to access the testing environment for each research cluster.

How does this impact me:

  • An OpenFabrics Enterprise Distribution (OFED) upgrade requires rebuilding our MPI software. PACE is providing updated modules for all of our Message Passing Interface (MPI) options and testing their compatibility with all software PACE installs in our scientific software repository.
  • Researchers who built their own software may need to rebuild it in the new environment and are encouraged to use the testflight environment to do so. Researchers who have contributed to PACE Community applications (Tier 3) should test their software and upgrade it if necessary to ensure continued functionality.
  • Researchers who have installed their own MPI code independent of PACE’s MPI installations will need to rebuild it in the new environment (see the sketch after this list).
  • Due to the pending upgrade, software installation requests may be delayed in the coming months. Researchers are encouraged to submit a software request and discuss their specific needs with the research scientists on our software team. As our software team focuses on preparing the new environment and ensuring that existing software is compatible, requests for new software may take longer than usual to be fulfilled.
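
For researchers rebuilding their own MPI code, a minimal sketch of the rebuild in the testflight environment might look like the following; the module versions, directory, and build commands are placeholders for your own setup.

    # Start from a clean environment and load a PACE-provided MPI stack
    module purge
    module load gcc/10.3.0 mvapich2/2.3.6     # illustrative versions; check "module avail"

    # Rebuild your own code against the new MPI libraries
    cd ~/my_mpi_application                   # hypothetical path to your source tree
    make clean && make                        # or re-run your usual configure/cmake steps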

What we will continue to do: PACE will ensure that our scientific software repository is compatible with the new environment and will provide researchers with a testflight environment in advance of the migration, where you will be able to test the upgraded software or rebuild your own software. We will provide additional details as they become available.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Improvements to job accounting and queue wait times on PACE clusters

Posted on Thursday, 23 December, 2021
We would like to share two updates with you regarding improvements to job accounting and queue wait times on the Phoenix and Firebird clusters.
  • Due to an error, some users have seen the wrong account names listed in our pace-quota and pace-whoami utilities in recent months. We have corrected this, and all users can now use pace-quota to see available charge accounts and balances on Phoenix or Firebird. At the same time, a new improvement to our utility now makes balances visible for all accounts, including multi-PI or school-owned accounts that previously displayed a zero balance, so researchers can always check available balances. Read our documentation for more details about the charge accounts available to you and what they mean. The pace-quota command is available on Phoenix, Hive, Firebird, and ICE. It provides user-specific details:
    • your storage usage on that cluster
    • your charge account information for that cluster (Phoenix and Firebird only)
  • Additionally, in order to improve utilization of our clusters and reduce wait times, we have enabled spillover between node classes, allowing waiting jobs to run on underutilized, more capable nodes instead of the requested node class, with no user action required and at no additional charge. Spillover on GPU nodes was enabled in September, while CPU nodes gained the capability last week, on both Phoenix and Firebird.
Please note that targeting a specific/more expensive node class to reduce wait time is no longer effective or necessary. Please request the resources required for your job. Your job will continue to be charged based on the rate for the resources it requests, even if it ends up being assigned to run on more expensive hardware.
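
For reference, here is a minimal sketch of a Torque batch script that simply requests what the job needs; the job name, charge account, queue, and resource amounts are placeholders. If the requested node class is busy, spillover may place the job on a more capable node automatically, and the charge is still based on the requested resources.

    #!/bin/bash
    #PBS -N my_job                     # hypothetical job name
    #PBS -A GT-xx0000                  # placeholder charge account
    #PBS -q <queue-name>               # placeholder queue
    #PBS -l nodes=1:ppn=8              # request only the cores the job needs
    #PBS -l walltime=04:00:00          # and a realistic walltime
    cd $PBS_O_WORKDIR
    ./my_program                       # hypothetical executable
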
As always, please contact us if you have any questions.

PACE availability during the Holidays

Posted on Tuesday, 21 December, 2021

As we leave 2021 behind, we want to remind everyone that PACE clusters will continue to operate during the GT Institute Holiday. However, PACE staff will not be generally available for support. Please continue to report any problems or requests you may have to pace-support@oit.gatech.edu. We will receive those and get back to you as soon as possible after the holidays.

2021 was a pivotal year for PACE. We migrated all of our services to our new datacenter, changed our service model, and worked to better serve GT researchers and students. We could not have done any of this without your input, support, and patience. We are grateful for that and look forward to achieving more great things together in 2022.

Happy Holidays and Happy New Year!

Headnode Violation Detector Updates

Posted on Friday, 17 December, 2021

Running many or extended resource-intensive processes on the login nodes slows the node for all users and is a violation of PACE policy, as it prevents others from using the cluster. We would like to make you aware of recent improvements to our headnode violation detector.

PACE may stop processes that improperly occupy the headnode, in order to restore functionality for all members of our user community. Please use compute nodes for all computational work. If you need an interactive environment, please submit an interactive job. If you are uncertain about how to use the scheduler to work on compute nodes, please contact us for assistance. We are happy to help you with your workflows on the cluster.
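
As a reference, a minimal sketch of requesting an interactive session with Torque’s "qsub -I" follows; the queue name, charge account, and resource amounts are placeholders to adapt to your cluster.

    # Request a one-hour interactive session on a single compute node
    qsub -I -q <queue-name> -A <charge-account> -l nodes=1:ppn=4,walltime=01:00:00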

If you run processes that overuse the headnode, we will send an email asking you to refrain from doing so. We have recently updated our violation detector to ensure that emails are sent to the proper user and to adjust the logic of the script to align it with policy.

Thank you for your efforts to ensure PACE clusters are an available resource for all.

Reboot on login-hive1 on Tuesday, December 21, at 10:00 AM

Posted on Wednesday, 15 December, 2021

Summary: Reboot on login-hive1 on Tuesday, December 21, at 10:00 AM

What’s happening and what are we doing: As part of our preparations for the RHEL7.9 testflight environment that will be available in January, PACE will reboot the login-hive1 headnode on Tuesday, December 21, at 10:00 AM. Hive has two headnodes, and the login-hive2 headnode will not be impacted. The load balancer that automatically routes new login-hive connections to either login-hive1 or login-hive2 has been adjusted to send all new connections to login-hive2 beginning the afternoon of December 15.

How does this impact me: If you are connected to login-hive1 at the time of the reboot, you will lose your connection to Hive, and any processes running on login-hive1 will be terminated. Running interactive jobs submitted from login-hive1 will also be disrupted. Batch jobs will not be affected. Users connected to login-hive2 will not be impacted. Users who connected to Hive prior to Wednesday afternoon may be on login-hive1 and should complete their current work or log out and back in to Hive before Tuesday. Users who ssh to login-hive.pace.gatech.edu beginning this afternoon will all be assigned to login-hive2 and will not be impacted. If you specifically ssh to login-hive1.pace.gatech.edu, then you will still reach the node that is scheduled to be rebooted and should complete your session before next Tuesday.
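
To check which headnode your current session is on, and to reconnect through the load-balanced address, something like the following works; replace <gtusername> with your GT account name.

    # Show which Hive headnode you are currently logged into
    hostname

    # Reconnect through the load-balanced address, which now routes to login-hive2
    ssh <gtusername>@login-hive.pace.gatech.edu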

What we will continue to do: PACE will monitor the Hive headnodes and ensure that login-hive1 is fully functional after reboot before re-initiating the load balancer that distributes user logins between the two headnodes.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Join us today for GT’s Virtual ARC Symposium & Poster Session @ SC21, Wednesday (11/17), 6:00pm – 8:00pm

Posted on Wednesday, 17 November, 2021

This is a friendly reminder that the ARC Symposium and Poster Session is today from 6:00pm to 8:00pm (EST). Join us for this exciting virtual event, which will feature invited talks plus more than 20 poster presenters who will highlight GT’s efforts in research computing. Relax for the evening and engage with our community and guests; a number are joining from outside GT, including Microsoft, AMD, Columbia, and UCAR, to name a few. We hope you can join us.

Links to Join the Event:

To join the ARC Symposium invited talks session (6:00 – 7:00pm EST), please use the BlueJeans link below: https://primetime.bluejeans.com/a2m/live-event/jxzvgwub

To join the ARC Symposium poster session (7:00pm – 8:15pm EST), use the following link:
https://gtsc21.event.gatherly.io/

ARC Symposium Agenda:

5:45 PM EST – Floor Opens

6:00 PM EST – Opening Remarks and Welcome 

Prof. Srinivas Aluru, Executive Director of IDEaS

6:05 PM EST –  “Exploring the Cosmic Graveyard with LIGO and Advanced Research Computing”

Prof. Laura Cadonati, Associate Dean for Research, College of Sciences

6:25 PM EST – “Life after Moore’s Law: HPC is Dead, Long Live HPC!”

Prof. Rich Vuduc, Director of CRNCH

6:45 PM EST –  “PACE Update on Advanced Research Computing at Georgia Tech”

Pam Buffington, Interim Associate Director of Research Cyberinfrastructure, PACE/OIT and Director Faculty & External Engagement, Center for 21st Century University

7:00 PM EST – Poster Session Opens (more than 20 poster presenters!!)

8:15 PM EST – Event Closes

[Complete] Hive Project & Scratch Storage Cable Replacement

Posted on Tuesday, 9 November, 2021

[Update 11/12/21 11:30 AM]

The pool rebuilding has completed on Hive GPFS storage, and normal performance has returned.

[Update 11/10/21 11:30 AM]

The cable replacement has been completed without interruption to the storage system. Rebuilding of the pools is now in progress.

[Original Post 11/9/21 5:00 PM]

Summary: Hive project & scratch storage cable replacement, with a potential outage and temporarily decreased performance afterward

What’s happening and what are we doing: A cable connecting one enclosure of the Hive GPFS device, which hosts project (data) and scratch storage, to one of its controllers has failed and needs to be replaced. The replacement will begin around 10:00 AM tomorrow (Wednesday). After the replacement, pools will need to rebuild over the course of about a day.

How does this impact me: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar replacement in the past caused storage to become unavailable, so an outage remains possible. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.
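
If a job does stall because of a storage interruption, it can be cancelled and resubmitted with the standard Torque commands; the job ID and script name below are placeholders.

    qstat -u $USER        # find the ID of the stalled job
    qdel <jobid>          # cancel it
    qsub my_job.pbs       # resubmit once storage availability is restored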

What we will continue to do: PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Complete – PACE Maintenance Period: November 3 – 5, 2021] PACE Clusters Ready for Research!

Posted on Friday, 5 November, 2021

Dear PACE researchers,

Our scheduled maintenance has completed ahead of schedule! All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, February 9, 2022, and conclude by 11:59PM on Friday, February 11, 2022. We have also tentatively scheduled the remaining maintenance periods for 2022 for May 11-13, August 10-12, and November 2-4.

The following tasks were part of this maintenance period:

ITEMS REQUIRING USER ACTION:

  • [Complete] TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details are available on our blog.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
  • [Complete][System] Operating system patch installs
  • [Complete][Storage/Phoenix] Lustre controller firmware and other upgrades
  • [Complete][Storage/Phoenix] Lustre scratch upgrade and expansion
  • [Postponed][Storage] Hive GPFS storage upgrade
  • [Complete][System] System configuration management updates
  • [Complete][System] Updates to NVIDIA drivers and libraries
  • [Complete][System] Upgrade some PACE infrastructure nodes to RHEL 7.9
  • [Complete][System] Reorder group file
  • [Complete][Headnode/ICE] Configure c-group controls on COC-ICE and PACE-ICE headnodes
  • [Complete][Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
  • [Complete][Network] Update ethernet switch firmware
  • [Complete][Network] Update IP addresses of switches in BCDC

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu. You may read this message and prior updates related to this maintenance period on our blog.

Best,

-The PACE Team


TensorFlow update required due to identified security vulnerability

Posted on Wednesday, 20 October, 2021

Summary: TensorFlow update required due to identified security vulnerability

What’s happening and what are we doing: A security vulnerability was discovered in TensorFlow. PACE has installed the patched version 2.6.0 of TensorFlow in our software repository, and we will retire the older versions on November 3, 2021, during our maintenance period.

How does this impact me: Both researchers who use PACE’s TensorFlow installation and those who have installed their own are impacted.

The following PACE installations will be retired:

Modules: tensorflow-gpu/2.0.0 and tensorflow-gpu/2.2.0

Virtual envs under anaconda3/2020.02: pace-tensorflow-gpu-2.2.0 and pace-tensorflow-2.2.0

Please use the tensorflow-gpu/2.6.0 module instead of the older versions identified above. If you were previously using a PACE-provided virtual env inside the anaconda3 module, please use the separate new module instead. You can find more information about using PACE’s TensorFlow installation in our documentation. You will need to update your PBS scripts to load the new module, and you may need to update your Python code to ensure compatibility with the latest version of the package.
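
As a minimal sketch, an updated PBS script would load the new module in place of a retired one; the resource request and script name below are placeholders.

    #!/bin/bash
    #PBS -l nodes=1:ppn=4:gpus=1       # placeholder resource request
    #PBS -l walltime=02:00:00
    module load tensorflow-gpu/2.6.0   # replaces tensorflow-gpu/2.0.0 or 2.2.0
    cd $PBS_O_WORKDIR
    python my_training_script.py       # hypothetical script name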

If you have created your own conda environment on PACE and installed TensorFlow in it, please create a new virtual environment and install the necessary packages. You can use the tensorflow-gpu/2.6.0 virtual environment as a base if you would like, then install the other packages you need, as described in our documentation. In order to protect Georgia Tech’s cybersecurity, please discontinue use of any older environments running prior versions of TensorFlow on PACE.
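
One possible workflow for rebuilding a personal environment, assuming a standard conda setup, is sketched below; the environment name, package list, and the path to the PACE-provided environment are placeholders, and our documentation has the authoritative steps.

    # Create a fresh environment, optionally cloning the PACE tensorflow-gpu/2.6.0
    # environment as a base (<path-to-pace-tf-env> is a placeholder)
    module load anaconda3/2020.02
    conda create --name my-tf-2.6 --clone <path-to-pace-tf-env>
    conda activate my-tf-2.6
    conda install <other-packages-you-need>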

What we will continue to do: We are happy to assist researchers with the transition to the new version of TensorFlow. PACE will offer support to researchers upgrading TensorFlow at our upcoming consulting sessions. The next sessions are Thursday, October 28, 10:30-12:15, and Tuesday, November 2, 2:00-3:45. Visit our training page for the full schedule and BlueJeans links.

Thank you for your prompt attention to this security update, and please accept our sincere apology for any inconvenience that this may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Hive scheduler recurring outages

Posted on Friday, 15 October, 2021

[Update 11/5/21 3:15 PM]

During the November maintenance period, PACE separated Torque and Moab, the two components of the Hive scheduler. This two-server setup, mirroring the Phoenix scheduler arrangement, should improve stability of the Hive scheduler under heavy utilization. We will continue to monitor the Hive scheduler. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Update 10/15/21 5:15 PM]

The Hive scheduler is functioning at this time. The PACE team disabled several system utilities that may have contributed to earlier issues with the scheduler. We will continue to monitor the scheduler status and to work with our support vendor to improve stability of Hive’s scheduler. Please check this blog post for updates.

[Update 10/15/21 4:15 PM]

The Hive scheduler is again functional. The PACE team and our vendor are continuing our investigation in order to restore stability to the scheduler.

[Original Post 10/15/21 12:35 PM]

Summary: Hive scheduler recurring outages

What’s happening and what are we doing: The Hive scheduler has been experiencing intermittent outages over the past few weeks requiring frequent restarts. At this time, the PACE team is running a diagnostic utility and will restart the scheduler shortly. The PACE team is actively investigating the outages in coordination with our scheduler vendor to restore stability to Hive’s scheduler.

How does this impact me: Hive researchers may be unable to submit or check the status of jobs, and jobs may be unable to start. You may find that the “qsub” and “qstat” commands and/or the “showq” command are not responsive. Already-running jobs will continue.

What we will continue to do: PACE will continue working to restore functionality to the Hive scheduler and coordinating with our support vendor. We will provide updates on our blog, so please check here for current status.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.