PACE A Partnership for an Advanced Computing Environment

July 27, 2022

Hive Cluster Migration to Slurm Scheduler and Update to Software Stack

Filed under: Uncategorized — Semir Sarajlic @ 2:50 pm

Dear Hive researchers, 

The Hive cluster will migrate to the Slurm scheduler, with the first phase scheduled for the August 10-12 maintenance period! PACE has worked closely with the Hive PIs on the migration plan to minimize interruption to research. Slurm is a widely used scheduler on research computing clusters, so you may have encountered it elsewhere (if commands like ‘sbatch’ and ‘squeue’ sound familiar to you, then you’ve used Slurm!). Hive will be the first cluster in PACE’s transition from Torque/Moab to Slurm. We expect the new scheduler to provide improved stability and reliability, offering a better user experience. At the same time, we will be updating our software stack. We will offer extensive support to facilitate this migration.

The first phase will begin with the August maintenance period (August 10-12), during which 100 Hive compute nodes (of 484 total) will join our new “Hive-Slurm” cluster, while the rest remain in the existing Torque/Moab cluster. The 100 nodes will represent each existing queue/node type proportionally. Following the conclusion of maintenance, we strongly encourage all researchers to begin exploring the Slurm-based side of Hive and shifting over their workflows. As part of the phased approach, and to minimize interruption to research, researchers will continue to have access to the existing Hive (Torque/Moab) cluster until the final phase of the migration. Users will receive detailed communication on how to connect to the Hive-Slurm part of the cluster, along with other documentation and training.

The phased transition is planned in collaboration with the Hive Governance Committee, represented by the PIs on the NSF MRI grant that funds the cluster (Drs. Srinivas Aluru, Surya Kalidindi, C. David Sherrill, Richard Vuduc, and John H. Wise on behalf of Deirdre Shoemaker).  Following the migration of the first 100 nodes, the committee will review the status and consider the timing for migrating the remaining compute nodes to the ‘Hive-Slurm’ cluster.   

In addition to the scheduler migration, another significant change for researchers on Hive will be an update to the PACE Apps software stack. The Hive-Slurm cluster will feature a new set of provided applications, listed in our documentation. Please review this list of software we plan to offer on Hive post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you are currently using on Hive is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay to your research as the migration process concludes. We have reviewed batch job logs to determine which packages are in use and upgraded them to their latest versions. Researchers who install or write their own software will also need to recompile their applications against the new MPI and other libraries.

PACE will provide documentation, training sessions, and additional support (e.g., increased frequency of PACE consulting sessions) to aid you as you transition your workflows to Slurm. Prior to the launch, we will publish updated documentation as well as a guide for converting job scripts and commands from PBS to Slurm. We will also offer specialized virtual training sessions (PACE Slurm Orientation) on the use of Slurm on Hive. Additionally, we are increasing the frequency of our PACE consulting sessions during this migration phase, and you are invited to join them or to email us for support. The schedule for the PACE Slurm Orientation and consulting sessions will be communicated soon.

You will notice a few other changes to Hive in the new environment. There continues to be no charge to use Hive.  As part of this migration, we are introducing a new feature, in which each job will require a “tracking account” to be provided for reporting purposes. Researchers who use the Phoenix cluster will be familiar with this accounting feature; however, the tracking accounts on Hive will have neither balances nor limitations, as they’ll be used solely for cluster utilization metrics. We will provide additional details prior to the launch of Hive-Slurm. Also, we will restructure access to GPUs to increase utilization while continuing to support short jobs.   
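Since each job on Hive-Slurm will need a tracking account, a quick check before submission can catch scripts that omit one. The following is a minimal sketch, assuming the account is supplied inside the job script through Slurm's standard `--account` (or `-A`) option; the helper name `find_account` and the account name `hive-demo` are hypothetical, and an account passed on the `sbatch` command line instead would not be detected here.

```python
import re

# Matches an account given on an '#SBATCH' line in any of these forms:
#   #SBATCH --account=NAME
#   #SBATCH --account NAME
#   #SBATCH -A NAME
# The option may appear after other options on the same line.
ACCOUNT_RE = re.compile(
    r"^#SBATCH\b.*?\s(?:--account(?:=|\s+)|-A\s+)(\S+)",
    re.MULTILINE,
)


def find_account(script_text):
    """Return the account named in the script's #SBATCH lines, or None."""
    match = ACCOUNT_RE.search(script_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # 'hive-demo' is a placeholder, not a real Hive tracking account.
    example = "#!/bin/bash\n#SBATCH --account=hive-demo\n#SBATCH --time=01:00:00\n"
    account = find_account(example)
    if account is None:
        print("No tracking account found in the script")
    else:
        print("Tracking account:", account)
```

The actual account names and any required format will come from the PACE documentation, so treat this only as an illustration of the idea.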

We are excited to launch Slurm on PACE as we continue working to improve Georgia Tech’s research computing infrastructure, and we will be providing additional information and support in the coming weeks through documentation, support tickets, and live sessions. Please contact us with any questions or concerns about this transition.  

Best, 

-The PACE Team 

 

[08/08/22 update]

As you already know, the Hive cluster will be migrating to the Slurm scheduler, with the first phase scheduled for the August 10-12 maintenance period! This is a follow-up to our initial notification on 7/27/2022. PACE will provide the necessary documentation, orientation, and additional PACE consulting sessions to support a smooth transition of your workflows to Slurm.

 

Documentation – Our team is working on the documentation needed to guide you through the new Hive-Slurm environment and the conversion of submission scripts to Slurm. We have drafted: 1) login information, partitions, and tracking accounts for the new Hive-Slurm cluster; 2) guidelines on converting existing PBS scripts and commands to Slurm; and 3) details on using Slurm on Hive, with examples for writing new scripts. The links to the documentation will be provided soon!
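To give a feel for what script conversion involves, here is a minimal sketch (not PACE's official guide) that rewrites a few common `#PBS` directives into their usual `#SBATCH` counterparts. The function names are hypothetical, the mapping covers only a handful of options, and some choices are assumptions: for example, `ppn` is mapped to `--ntasks-per-node`, which suits MPI-style jobs but not every workload, and Hive-Slurm partition names may differ from the existing queue names. Anything the sketch does not recognize is left as a comment for manual review.

```python
import re

# Torque/PBS options that map one-to-one onto standard Slurm options.
SIMPLE_FLAGS = {
    "-N": "--job-name",
    "-q": "--partition",  # assumption: queue names may not match partition names
    "-A": "--account",
    "-o": "--output",
    "-e": "--error",
}


def convert_directive(line):
    """Translate one script line; returns a list of output lines."""
    stripped = line.strip()
    if not stripped.startswith("#PBS"):
        return [line]  # not a PBS directive: keep unchanged
    parts = stripped.split(None, 2)  # e.g. ['#PBS', '-l', 'nodes=2:ppn=4']
    if len(parts) < 3:
        return ["# TODO convert manually: " + stripped]
    flag, value = parts[1], parts[2]
    if flag in SIMPLE_FLAGS:
        return ["#SBATCH {}={}".format(SIMPLE_FLAGS[flag], value)]
    if flag == "-l":
        out = []
        for resource in value.split(","):
            nodes = re.fullmatch(r"nodes=(\d+)(?::ppn=(\d+))?", resource)
            if nodes:
                out.append("#SBATCH --nodes=" + nodes.group(1))
                if nodes.group(2):
                    out.append("#SBATCH --ntasks-per-node=" + nodes.group(2))
            elif resource.startswith("walltime="):
                out.append("#SBATCH --time=" + resource[len("walltime="):])
            else:
                out.append("# TODO convert manually: #PBS -l " + resource)
        return out
    return ["# TODO convert manually: " + stripped]


def convert_script(text):
    """Convert every line of a PBS job script and return the new script text."""
    return "\n".join(
        new_line
        for line in text.splitlines()
        for new_line in convert_directive(line)
    )
```

Calling `convert_script` on the text of an existing job script returns the translated text; the commands inside the script (module loads, `mpirun` invocations, environment variables such as `$PBS_O_WORKDIR`) are not touched and still need to be reviewed by hand.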

 

Orientation sessions – PACE will be hosting orientation sessions on migration to Slurm. They are open-attendance, and there is no registration required to attend these sessions. Find the details for the first two sessions here. 

When: Tuesday, Aug 16, 1-2 PM and Wednesday, Aug 24, 1-2 PM 

Zoom link: https://gatech.zoom.us/j/98876400947?pwd=and2Wlh0aEdLSHlwdmFQOFZ6UkVudz09 

What is discussed: Introduction to the new Hive-Slurm environment and Slurm usage on Hive, followed by Q&A for broad questions. The orientation will provide the information you need to get started on converting scripts. PACE will work with individuals and provide hands-on help during the consulting sessions later.

 

PACE Consulting sessions – PACE will be providing consulting sessions at a higher frequency to help researchers get onboarded in the new Hive-Slurm environment and to provide one-on-one help in converting their PBS scripts to Slurm. For the first month following the maintenance period, we will host consulting sessions twice a week rather than once. Starting Aug 18th, you can join us through the same link we currently use for consulting; find more details here.

When: Tuesdays, 2-3:45 PM and Thursdays, 10:30 AM-12:15 PM, repeats weekly.

Zoom link: https://gatech.zoom.us/j/99762843114?pwd=YjJrYVFQd05mdUQrTFpnM2NyU1hZUT09 

Purpose: In addition to raising any PACE-related queries or issues, you can join the session to get expert help with converting your scripts to Slurm on the new Hive-Slurm cluster.

 

Software Changes – The Slurm cluster will feature a new set of provided applications listed in our documentation. As a gentle reminder, please review this list of software we plan to offer on Hive post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you currently use on Hive is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay in your research as the migration process concludes. A couple of points to note:  

  1. Researchers installing or writing their own software will also need to recompile applications to reflect new MPI and other libraries.  
  2. The commands pace-jupyter-notebook and pace-vnc-job will be retired with the migration to Slurm. Instead, OnDemand will be available for Hive-Slurm (online after maintenance day) via the existing portal. Please use OnDemand to access Jupyter notebooks, VNC sessions, and more on Hive-Slurm via your browser. 

We are excited to launch Slurm on Hive as we continue working to improve Georgia Tech’s research computing infrastructure, and we will strive to provide all the support you need to make this transition with minimum interruption to your research. We will follow up with additional updates and timely reminders as needed. In the meantime, please contact us with any questions or concerns about this transition.

 
