PACE A Partnership for an Advanced Computing Environment

July 27, 2022

Hive Cluster Migration to Slurm Scheduler and Update to Software Stack

Filed under: Uncategorized — Semir Sarajlic @ 2:50 pm

Dear Hive researchers, 

The Hive cluster will be migrating to the Slurm scheduler with the first phase scheduled for the August 10-12 maintenance period! PACE has worked closely with the Hive PIs on the plan for the migration to ensure minimum interruption to research. Slurm is a widely-popular scheduler  on many research computing clusters, so you may have experienced it elsewhere (if commands like ‘sbatch’ and ‘squeue’ sound familiar to you, then you’ve used Slurm!). Hive will be the first cluster in PACE’s transition from Torque/Moab to Slurm. We expect the new scheduler to provide improved stability and reliability, offering a better user experience. At the same time, we will be updating our software stack. We will be offering extensive support to facilitate this migration.  

The first phase will begin with the August maintenance period (August 10-12), during which 100 Hive compute nodes (of 484 total) will join our new “Hive-Slurm” cluster, while the rest remain in the existing Torque/Moab cluster. The 100 nodes will represent each existing queue/node type proportionally. Following the conclusion of maintenance, we strongly encourage all researchers to begin exploring the Slurm-based side of Hive and shifting over their workflows.  Also, as part of the phased migration approach, researchers will continue to have access to the Hive (Moab/Torque) cluster that will last until the final phase of this migration, and this is to ensure minimum interruption to research.     Users will receive detailed communication on how to connect to the Hive-Slurm part of the cluster along with other documentation and training.  

The phased transition is planned in collaboration with the Hive Governance Committee, represented by the PIs on the NSF MRI grant that funds the cluster (Drs. Srinivas Aluru, Surya Kalidindi, C. David Sherrill, Richard Vuduc, and John H. Wise on behalf of Deirdre Shoemaker).  Following the migration of the first 100 nodes, the committee will review the status and consider the timing for migrating the remaining compute nodes to the ‘Hive-Slurm’ cluster.   

In addition to the scheduler migration, another significant change for researchers on Hive will be an update to the PACE Apps software stack. The Hive-Slurm cluster will feature a new set of provided applications, listed in our documentationPlease review this list of software we plan to offer on Hive post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you are currently using on Hive is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay to your research as the migration process concludes. We have reviewed batch job logs to determine packages in use and upgraded them to the latest version. Researchers installing or writing their own software will also need to recompile applications to reflect new MPI and other libraries.  

PACE will provide documentation, training sessions, and additional support (e.g., increased frequency of PACE consulting sessions) to aid you as you transition your workflows to Slurm. Prior to the launch, we will have updated documentation as well as a guide for converting job scripts from PBS to Slurm-based commands. We will also offer specialized training virtual sessions (PACE Slurm Orientation) on the use of Slurm on Hive. Additionally, we are increasing the frequency of our PACE consulting sessions during this migration phase, and you are invited to join PACE Consulting Sessions or to email us for support.  Schedule for PACE Slurm orientation and consulting sessions will be communicated soon.  

You will notice a few other changes to Hive in the new environment. There continues to be no charge to use Hive.  As part of this migration, we are introducing a new feature, in which each job will require a “tracking account” to be provided for reporting purposes. Researchers who use the Phoenix cluster will be familiar with this accounting feature; however, the tracking accounts on Hive will have neither balances nor limitations, as they’ll be used solely for cluster utilization metrics. We will provide additional details prior to the launch of Hive-Slurm. Also, we will restructure access to GPUs to increase utilization while continuing to support short jobs.   

We are excited to launch Slurm on PACE as we continue working to improve Georgia Tech’s research computing infrastructure, and we will be providing additional information and support in the coming weeks through documentation, support tickets, and live sessions. Please contact us with any questions or concerns about this transition.  

Best, 

-The PACE Team 

 

[08/08/22 update]

As you already know, the Hive cluster will be migrating to the Slurm scheduler with the first phase scheduled for the August 10-12 maintenance period! This is a follow up to our initial notification on 7/27/2022 in this regard. PACE will be providing all the necessary documentation, orientation, and additional PACE consulting sessions in support of a smooth transition of your workflows to Slurm. 

 

Documentation – Our team is working on the necessary documentation for guiding you through the new Hive-Slurm environment and conversion of the submission scripts to Slurm. We have drafted information for 1) login information, partitions, and tracking account for the new Hive-Slurm cluster. 2) Guidelines on converting existing PBS scripts and commands to Slurm. 3) Details on using Slurm on hive and examples for writing new scripts. The links for the documentation will be provided soon! 

 

Orientation sessions – PACE will be hosting orientation sessions on migration to Slurm. They are open-attendance, and there is no registration required to attend these sessions. Find the details for the first two sessions here. 

When: Tuesday, Aug 16, 1-2 PM and Wednesday, Aug 24, 1-2 PM 

Zoom link: https://gatech.zoom.us/j/98876400947?pwd=and2Wlh0aEdLSHlwdmFQOFZ6UkVudz09 

What is discussed: Introduction to the new Hive-Slurm environment and Slurm usage on Hive. Q&A for broad questions. The orientation would be providing the information to get you started on converting scripts. PACE will be working with individuals and provide hands-on help during the consulting sessions later. 

 

PACE Consulting sessions – PACE will be providing consulting sessions at a higher frequency to help researchers get onboarded in the new Hive-Slurm environment and provide one-on-one help in converting their PBS scripts to Slurm.  For the first month following the maintenance period, we will be hosting consulting sessions twice every week, rather than once. You can join us through the same link we use for consulting right now – find more details here starting from Aug 18th. 

When: Tuesdays, 2-3:45 and Thursdays, 10:30-12:15 AM, repeats weekly. 

Zoom link: https://gatech.zoom.us/j/99762843114?pwd=YjJrYVFQd05mdUQrTFpnM2NyU1hZUT09 

Purpose: In addition to any PACE related queries or issues, you could join the session to seek help from experts on converting your scripts to Slurm on the new Hive-Slurm cluster. 

 

Software Changes – The Slurm cluster will feature a new set of provided applications listed in our documentation. As a gentle reminder, please review this list of software we plan to offer on Hive post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you currently use on Hive is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay in your research as the migration process concludes. A couple of points to note:  

  1. Researchers installing or writing their own software will also need to recompile applications to reflect new MPI and other libraries.  
  2. The commands pace-jupyter-notebook and pace-vnc-job will be retired with the migration to Slurm. Instead, OnDemand will be available for Hive-Slurm (online after maintenance day) via the existing portal. Please use OnDemand to access Jupyter notebooks, VNC sessions, and more on Hive-Slurm via your browser. 

We are excited to launch Slurm on HIVE as we continue working to improve Georgia Tech’s research computing infrastructure and strive to provide all the support you need to facilitate this transition with minimum interruption to your research.  We will follow up with additional updates and timely reminders as needed. In the meantime, please contact us with any questions or concerns about this transition. 

 

July 18, 2022

PACE Maintenance Period (August 10 – 12, 2022)

Filed under: Uncategorized — Semir Sarajlic @ 11:33 am

[8/12/22 5:00 PM Update]

PACE continues work to deploy Hive-Slurm. Maintenance on Hive-Slurm only will be extended into next week, as we complete setting up the new environment. At this time, please use the existing (Moab/Torque) Hive, which was released earlier today. We will provide another update next week when the Slurm cluster is ready for research, along with details about how to access and use the new scheduler and updated software stack.

The Slurm Orientation session previously announced for Tuesday, August 16, will be rescheduled for a later time.

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[8/12/22 2:20 PM Update]

The Phoenix, existing Hive (Moab/Torque), Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning. We have released all jobs that were held by the scheduler.

We are continuing to work on launching the Hive-Slurm cluster, and we will provide another update to Hive researchers later today. Maintenance on the existing Hive (Moab/Torque) cluster has completed, and researchers can resume using it.

The next maintenance period for all PACE clusters is November 2, 2022, at 6:00 AM through November 4, 2022, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on January 31 – February 2, May 9-11, August 8-10, and October 31 – November 2.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • [Complete][Utilities] PACE will merge the functionality of pace-whoami into the pace-quota utility. Please use the pace-quota command to find out all relevant information about your account, including storage directories and usage, job charge or tracking accounts, and other relevant information. Running pace-whoami will now report the same output as pace-quota.
  • [In progress][Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Complete][Datacenter] Transformer repairs
  • [Complete][Datacenter] Cooling tower cleaning
  • [Complete][Scheduler] Accounting database maintenance
  • [Complete][Firebird][Network] Add VPC redundancy
  • [Complete][Phoenix][Storage] Replace redundant power supply on Lustre storage system

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[8/9/22 Update]

This is a reminder that our next PACE maintenance period is scheduled to begin tomorrow at 6:00 AM on Wednesday, August 10, and end at 11:59 PM on Friday, August 12. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • [Utilities] PACE will merge the functionality of pace-whoami into the pace-quota utility. Please use the pace-quota command to find out all relevant information about your account, including storage directories and usage, job charge or tracking accounts, and other relevant information. Running pace-whoami will now report the same output as pace-quota.
  • [Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details

ITEMS NOT REQUIRING USER ACTION:

  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Datacenter] Transformer repairs
  • [Datacenter] Cooling tower cleaning
  • [Scheduler] Accounting database maintenance
  • [Firebird][Network] Add VPC redundancy
  • [Phoenix][Storage] Replace redundant power supply on Lustre storage system

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[8/3/22 Update]

This is a reminder that our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, August 10, and end at 11:59 PM on Friday, August 12. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • [Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details

ITEMS NOT REQUIRING USER ACTION:

  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Datacenter] Transformer repairs
  • [Datacenter] Cooling tower cleaning
  • [Scheduler] Accounting database maintenance
  • [Firebird][Network] Add VPC redundancy
  • [Phoenix][Storage] Replace redundant power supply on Lustre storage system

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[7/27/22 Update]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, August 10, and end at 11:59 PM on Friday, August 12. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:
• [Hive] Slurm migration and software stack update first phase for Hive cluster – see recent announcement for details

ITEMS NOT REQUIRING USER ACTION:
• [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
• [Datacenter] Transformer repairs
• [Datacenter] Cooling tower cleaning
• [Scheduler] Accounting database maintenance
• [Firebird][Network] Add VPC redundancy

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[7/18/22 Early reminder]

Dear PACE Users,

This is friendly reminder that our next Maintenance period is scheduled to begin at 6:00AM on Wednesday, 08/10/2022, and it is tentatively scheduled to conclude by 11:59PM on Friday, 08/12/2022. As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this Maintenance Period, access to all the PACE managed computational and storage resources will be unavailable.

As we get closer to this maintenance, we will share further details on the tasks, which will be posted here.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

May 3, 2022

[Complete] PACE Maintenance Period: May 11 – 13, 2022

Filed under: Uncategorized — Semir Sarajlic @ 4:17 pm

[Update 5/16/22 9:20 AM]

All PACE clusters, including Phoenix, are now ready for research and learning. We have restored stability of the Phoenix Lustre storage system and released jobs on Phoenix.

Thank you for your patience as we worked to restore Lustre project & scratch storage on the Phoenix cluster. In working with our support vendor, we identified a scanning tool that was causing instability on the scratch filesystem and impacting the entire storage system. This has been disabled pending further investigation.

Due to the complications, we will not proceed with monthly deletions of old files on the Phoenix & Hive scratch filesystems tomorrow. Although only Phoenix was impacted, we will also delay Hive to avoid confusion. Files for which researchers were notified this month will not be deleted at this time, and you will receive another notification prior to any future deletion. Researchers are still encouraged to delete unneeded scratch files to preserve space on the system.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Complete][Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade cuda and Nvidia drivers on Hive to match other clusters with cuda 11.5
  • [Complete][System] Remove unused nouveau graphics kernel from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 5/13/22 3:25 PM]

The PACE team and our support vendor’s engineers continue working to restore functionality of the Phoenix Lustre filesystem following the upgrade. Testing and remediation will continue today and through the weekend. At this time, we hope to be able to open Phoenix for research on Monday. We appreciate your patience as our maintenance period is extended. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 5/13/22 2:00 PM]

PACE maintenance continues on Phoenix, while the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning.

Phoenix remains under maintenance, as complications arose following the upgrade of Lustre project and scratch storage. PACE and our storage vendor are working to resolve the issue at this time. We will update you when Phoenix is ready for research.

Jobs on the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters have been released.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [In progress][Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade cuda and Nvidia drivers on Hive to match other clusters with cuda 11.5
  • [Complete][System] Remove unused nouveau graphics kernel from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

 

 

[Detailed announcement 5/3/22]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, May 11, and end at 11:59 PM on Friday, May 13. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Phoenix][Storage] multiple upgrades to Lustre project and scratch storage
  • [Hive][Storage] replace cable connecting GPFS project and scratch storage
  • [Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Network] Add redundant 100GbE switch to storage servers, increasing capacity
  • [System] Install operating system patches
  • [System] Update operating system on administrative servers
  • [Network] Move BCDC DNS appliance to new IP address
  • [Hive][System] Upgrade cuda and Nvidia drivers on Hive to match other clusters with cuda 11.5
  • [System] Remove unused nouveau graphics kernel from GPU nodes
  • [Network] Set static IP addresses on schedulers to improve reliability
  • [Datacenter] Cooling loop maintenance
  • [Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Early announcement]

Dear PACE Users,

This is friendly reminder that our next Maintenance period is scheduled to begin at 6:00AM on Wednesday, 05/11/2022, and it is tentatively scheduled to conclude by 11:59PM on Friday, 05/13/2022. As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During the Maintenance Period, access to all the PACE managed computational and storage resources will be unavailable.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

April 25, 2022

Hive Gateway Resource Now Available to Campus Champions

Filed under: Uncategorized — Semir Sarajlic @ 2:17 pm

Dear Campus Champion Community,

We are pleased to announce the official release of the Hive Gateway at Georgia Tech’s Partnership for an Advanced Computing Environment (PACE) to the Campus Champion community. The Hive gateway is powered by Apache Airavata, and provides access to a portion of the Hive cluster at GT that is an NSF MRI funded supercomputer that delivers nearly 1 Linpack petaflops of computing power.  For more hardware details see the following link: https://docs.pace.gatech.edu/hive/resources/.

The Hive Gateway is available to *any* XSEDE researcher via federated login (i.e., CILogon), and has a variety of applications available including Abinit, Psi4, NAMD, a python environment with Tensorflow and Keras installed, among others.

Hive Gateway is accessible via https://gateway.hive.pace.gatech.edu

Our user guide is available at: https://docs.pace.gatech.edu/hiveGateway/gettingStarted/ and contains details on the process of getting access.  Briefly, to get access to the Hive gateway, go to “Log In” on the site, select XSEDE credentials via CILogon, which should allow you to log into the gateway and generate a request to our team to approve your gateway access and enable job submissions on the resource.

Please feel free to stop by the Hive gateway site, try it out, and/or direct your researchers to it.

Cheers!

– The PACE Team

April 19, 2022

Launch of Open OnDemand Portal for PACE’s Phoenix and Hive Clusters

Filed under: Uncategorized — Semir Sarajlic @ 9:50 am

Dear PACE Researchers, 

We are pleased to announce the official release of the Open OnDemand (OOD) portal for PACE’s Phoenix and Hive clusters! OOD portal allows you to access PACE compute resources through your browser, and OOD provides a seamless interface for several different interactive applications, including Jupyter, Matlab, and a general interactive desktop environment. Each PACE cluster has its own portal, allowing access to all your data as usual with the Web interface. 

In-depth documentation on OOD at PACE is available at https://docs.pace.gatech.edu/ood/guide, and links to the portal for each PACE cluster are listed below: 

Please note that you will need to be on the GT VPN in order to access the OOD portals.

Thursday’s PACE clusters orientation will feature a demo using OOD. To register for upcoming PACE clusters orientation, visit https://b.gatech.edu/3w6ifqO.  

Please direct any questions about Open OnDemand to our ticketing system via email to pace-support@oit.gatech.edu or by filling out a help request form.  

Cheers! 

– The PACE Team 

February 10, 2022

[Complete PACE Maintenance Period – February 9 – 11, 2022] PACE Clusters Ready for Research!

Filed under: Uncategorized — Semir Sarajlic @ 2:38 pm

Dear PACE Users,

All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all user jobs that were held by the scheduler.

Due to complications with the RHEL7.9 upgrade, 36% of Phoenix compute nodes remain under maintenance. We will work to return the cluster to full strength in the coming days. All node classes and queues have nodes available, and all storage is accessible.

Researchers who did not complete workflow testing on our Testflight environments on Phoenix and Hive, and Firebird users for whom a testing environment was not available, could experience errors related to the upgrade (see blog post). Please submit a support ticket to pace-support@oit.gatech.edu for assistance if you encounter any issues.

Our next maintenance period is tentatively scheduled to begin at 6:00 A on Wednesday, May 11, 2022, and conclude by 11:59 PM on Friday, May 13, 2022. Additional maintenance periods are tentatively scheduled for August 10-12 and November 2-4.

The following tasks were part of this maintenance period:

ITEMS REQUIRING USER ACTION:

  • [Complete on most nodes][System] Phoenix, Hive and Firebird clusters’ operating system will be upgraded to RHEL7.9.

ITEMS NOT REQUIRING USER ACTION:

  • [Deferred][Datacenter] Databank will repair/replace the DCR, requiring that all PACE compute nodes be powered off.
  • [Complete][Storage/Hive] Upgrade GPFS controller firmware
  • [Complete][Storage/Phoenix] Reintegrate storage previously borrowed for scratch into project storage
  • [Complete][Storage/Phoenix] Replace redundant storage controller and cables
  • [Complete][System] System configuration management updates
  • [Complete][Network] Upgrade IB switch firmware

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

December 23, 2021

Improvements to job accounting and queue wait times on PACE clusters

Filed under: Uncategorized — Semir Sarajlic @ 11:31 am
We would like to share two updates with you regarding improvements to job accounting and queue wait times on the Phoenix and Firebird clusters.
  • Due to an error, some users have seen the wrong account names listed in our pace-quota and pace-whoami utilities in recent months. We have corrected this, and all users can now use pace-quota to see available charge accounts and balances on Phoenix or Firebird. At the same time, a new improvement to our utility now makes balances visible for all accounts, including multi-PI or school-owned accounts that previously displayed a zero balance, so researchers can always check available balances. Read our documentation for more details about the charge accounts available to you and what they mean. The pace-quota command is available on Phoenix, Hive, Firebird, and ICE. It provides user-specific details:
    • your storage usage on that cluster
    • your charge account information for that cluster (Phoenix and Firebird only)
  • Additionally, in order to improve utilization of our clusters and reduce wait times, we have enabled spillover between node classes, allowing waiting jobs to run on underutilized, more capable nodes rather than those requested, requiring no user action, at no additional charge. Spillover on GPU nodes was enabled in September, while CPU nodes gained the capability last week, on both Phoenix and Firebird.
Please note that targeting a specific/more expensive node class to reduce wait time is no longer effective or necessary. Please request the resources required for your job. Your job will continue to be charged based on the rate for the resources it requests, even if it ends up being assigned to run on more expensive hardware.
As always, please contact us if you have any questions.

December 21, 2021

PACE availability during the Holidays

Filed under: Uncategorized — Semir Sarajlic @ 3:00 pm

While leaving 2021 behind, we wanted to remind everyone that PACE clusters will continue to operate during the GT Institute Holiday. However, PACE staff will not be generally available for support. Please continue to report any problems or requests you may have to pace-support@oit.gatech.edu. We will receive those and get back to you as soon as possible after the holidays.

2021 was a pivotal year for PACE. We migrated all of our services to our new datacenter, changed our service model, and working to better serve GT researchers and students. We could not have done any of these without your input, support and patience. We are grateful for that and look forward to achieving more great things together in 2022.

Happy Holidays and a New Year!

November 17, 2021

Join us today for GT’s Virtual ARC Symposium & Poster Session @ SC21 that’s on Wednesday (11/17) 6:00pm – 8:00pm

Filed under: Uncategorized — Semir Sarajlic @ 11:01 am

This is a friendly reminder that the ARC Symposium and Poster Session is today from 6:00pm – 8:00pm (EST).  Join us for this exciting virtual event that will feature invited talks plus more than 20! poster presenters whom will highlight GT’s efforts in research computing, so relax for the evening and engage with our community and guests as we have a number joining from outside GT that includes Microsoft, AMD, Columbia, UCAR, to name a few…   Hope you can join us.

Links to Join the Event:

To join the ARC Symposium invited talks session (6:00 – 7:00pm EST), please use the BlueJeans link below: https://primetime.bluejeans.com/a2m/live-event/jxzvgwub

To join the ARC Symposium poster session (7:00pm – 8:15pm EST), use the following link:
https://gtsc21.event.gatherly.io/

ARC Symposium Agenda:

5:45 PM EST – Floor Opens

6:00 PM EST – Opening Remarks and Welcome 

Prof. Srinivas Aluru, Executive Director of IDEaS

6:05 PM EST –  “Exploring the Cosmic Graveyard with LIGO and Advanced Research Computing”

Prof. Laura Cadonati, Associate Dean for Research, College of Sciences

6:25 PM EST – “Life after Moore’s Law: HPC is Dead, Long Live HPC!”

Prof. Rich Vuduc, Director of CRNCH

6:45 PM EST –  “PACE Update on Advanced Research Computing at Georgia Tech”

Pam Buffington, Interim Associate Director of Research Cyberinfrastructure, PACE/OIT and Director Faculty & External Engagement, Center for 21st Century University

7:00PM EST – Poster Session Opens (more than 20 poster presenters!!)

8:15PM EST – Event Closes

November 5, 2021

[Complete – PACE Maintenance Period: November 3 – 5, 2021] PACE Clusters Ready for Research!

Filed under: Uncategorized — Semir Sarajlic @ 5:06 pm

Dear PACE researchers,

Our scheduled maintenance has completed ahead of schedule! All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all users jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, February 9, 2022, and conclude by 11:59PM on Friday, February 11, 2022. We have also tentatively scheduled the remaining maintenance periods for 2022 for May 11-13, August 10-12, and November 2-4.

The following tasks were part of this maintenance period:

ITEMS REQUIRING USER ACTION:

  • [Complete] TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details are available on our blog.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
  • [Complete][System] Operating system patch installs
  • [Complete][Storage/Phoenix] Lustre controller firmware and other upgrades
  • [Complete][Storage/Phoenix] Lustre scratch upgrade and expansion
  • [Postponed][Storage] Hive GPFS storage upgrade
  • [Complete][System] System configuration management updates
  • [Complete][System] Updates to NVIDIA drivers and libraries
  • [Complete][System] Upgrade some PACE infrastructure nodes to RHEL 7.9
  • [Complete][System] Reorder group file
  • [Complete][Headnode/ICE] Configure c-group controls on COC-ICE and PACE-ICE headnodes
  • [Complete][Scheduler/Hive] separate Torque & Moab servers to improve scheduler reliability
  • [Complete][Network] update ethernet switch firmware
  • [Complete][Network] update IP addresses of switches in BCDC

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu. You may read this message and prior updates related to this maintenance period on our blog.

Best,

-The PACE Team

 

Older Posts »

Powered by WordPress