PACE: A Partnership for an Advanced Computing Environment

January 20, 2023

PACE Maintenance Period (January 31 – February 7, 2023)

Filed under: Uncategorized — Jeff Valdez @ 8:45 am

[Updated 2023/02/03, 4:33 PM EST]

Dear Phoenix Users, 

The Phoenix cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows run on the Slurm-based cluster. Please contact us if you need additional help shifting your workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. We will host a Slurm Orientation Session (for users new to Slurm) on Friday, February 17, at 11 AM. 
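
For most users the switch is a matter of translating the batch script header and submission commands. The sketch below pairs a hypothetical Torque/Moab header with its Slurm equivalent; the job, account, and queue names are placeholders, so substitute the values from your own charge account and the PACE Slurm documentation.

    # Torque/Moab header (placeholder names):
    #PBS -N myjob
    #PBS -A myaccount
    #PBS -q myqueue
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=04:00:00

    # Equivalent Slurm header on the Slurm-based cluster
    # (-q sets a Slurm QOS; check the PACE documentation for whether a QOS or partition applies):
    #SBATCH -J myjob
    #SBATCH -A myaccount
    #SBATCH -q myqueue
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=8
    #SBATCH -t 04:00:00

    # Submit and monitor with Slurm commands instead of qsub/qstat:
    #   sbatch myjob.sbatch
    #   squeue -u $USER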

The transfer of remaining funds on Phoenix Moab/Torque to Slurm is ongoing and is expected to be completed next week. January statements will reflect the correct balance when they are sent out. 

The next maintenance period for all PACE clusters runs from May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional 2023 maintenance periods are tentatively scheduled for August 8-10 and October 31-November 2. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Complete] [Phoenix] Slurm migration for the sixth and final phase of the Phoenix cluster (about 123 additional nodes, for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows run on the Slurm-based cluster. 
  • [Complete][Phoenix] The new Phoenix login servers have new SSH host keys, which may trigger a security warning when you connect. If you see this warning, remove the stale entry from your local SSH known_hosts cache to clear the message (see the example following this list). 
  • [Complete] [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 
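
If OpenSSH reports a changed host key when you connect to the new login servers, the stale entry can be removed from your known_hosts file so the new key is accepted on the next connection. A minimal sketch, assuming a hostname of login-phoenix.pace.gatech.edu (substitute the login hostname you actually use):

    # Remove the cached host key for the Phoenix login node (hostname is an example)
    ssh-keygen -R login-phoenix.pace.gatech.edu

    # Reconnect and accept the new host key when prompted
    ssh your-gt-username@login-phoenix.pace.gatech.edu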

ITEMS NOT REQUIRING USER ACTION: 

  • [Complete] [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files were updated 
  • [Complete] [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Complete] [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [Complete] [Network] Code upgrade to PACE departmental Palo Alto 
  • [Complete] [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Complete] [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Complete] [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Complete] [Storage] Update sysctl parameters on ZFS servers 
  • [Complete] [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Complete] [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/02, 4:20 PM EST]

Dear Hive Users, 

The Hive cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters runs from May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional 2023 maintenance periods are tentatively scheduled for August 8-10 and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide updates as more information becomes available. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Thank You,
– The PACE Team

[Updated 2023/02/02, 4:05 PM EST]

Dear Firebird Users, 

The Firebird cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters runs from May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional 2023 maintenance periods are tentatively scheduled for August 8-10 and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide updates as more information becomes available. 

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/01, 4:05 PM EST]

Dear Buzzard Users,

The Buzzard cluster is now ready for research and learning. We have released all jobs that were held by the scheduler. 

The next maintenance period for all PACE clusters runs from May 9, 2023, at 6:00 AM through May 11, 2023, at 11:59 PM. Additional 2023 maintenance periods are tentatively scheduled for August 8-10 and October 31-November 2. 

We are still working on maintenance for the Phoenix cluster and will provide updates as more information becomes available. 

Status of activities: 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu 

Thank You,
– The PACE Team

[Updated 2023/02/01, 4:00 PM EST]

The PACE-ICE and COC-ICE instructional clusters are ready for learning. As usual, we have released all user jobs that were held by the scheduler. You may resume using PACE-ICE and COC-ICE at this time. PACE’s research clusters remain under maintenance as planned.

[Updated 2023/01/31, 6:00AM EST]

WHEN IS IT HAPPENING?
Maintenance Period starts now at 6 AM EST on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023. 

The Phoenix project file system changes are estimated to take seven days to complete. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish in under seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window.

Torque/Moab will no longer be available to Phoenix users starting now. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 
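
In practice the change is a one-for-one command substitution; existing images and definition files keep working. A minimal sketch, assuming an image named mycontainer.sif and a definition file mycontainer.def (both placeholders):

    # Before (Singularity):
    singularity exec mycontainer.sif python3 myscript.py

    # After (Apptainer), same arguments:
    apptainer exec mycontainer.sif python3 myscript.py

    # Building an image works the same way:
    apptainer build mycontainer.sif mycontainer.def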

WHAT IS HAPPENING?  
PACE Maintenance Period starts now and will run until it is complete. Phoenix downtime could last until Tuesday, 02/07/2023 or beyond. 

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and provides foundational work for additional storage options and capacity for researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are posted in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

[Updated 2023/01/27, 2:06PM EST]

WHEN IS IT HAPPENING?
Reminder that the next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish in under seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.
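
If you would like a job to run before the maintenance window rather than be held, request a wall time short enough for the job to finish before 6:00 AM on 01/31. A hedged sketch with placeholder script and account names:

    # Submitted a day or more ahead of maintenance, a 24-hour request can complete
    # before the window opens and will not be held (names are placeholders):
    sbatch -t 24:00:00 -A myaccount myjob.sbatch

    # A request whose wall time overlaps the maintenance window will stay pending
    # until the clusters are released.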

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and provides foundational work for additional storage options and capacity for researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are posted in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

[Updated 2023/01/20, 8:45AM EST]

WHEN IS IT HAPPENING?
The next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish in under seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this extended Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users using Singularity on the command line need to use Apptainer commands moving forward. 

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (about 119 additional nodes for a final total of about 1319). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and provides foundational work for additional storage options and capacity for researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are posted in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

January 17, 2023

Upcoming Maintenance Period Extension Required January 31-February 7 (estimated)

Filed under: Uncategorized — Jeff Valdez @ 11:51 am

WHAT IS HAPPENING? 
PACE is updating the group ID (GID) of every group and file in our storage infrastructure to remove conflicts with the IDs assigned campus-wide by OIT. The expected time per cluster varies greatly due to the size of the related storage. During the maintenance period, PACE will release clusters as soon as each is complete.  

WHEN IS IT HAPPENING? 
Maintenance period starts on Tuesday, January 31, 2023. The changes to the Phoenix project file system are estimated to take seven days to complete. Thus, the maintenance period will be extended from the typical three days to seven days for the Phoenix cluster. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier. PACE will release them as soon as maintenance and migration work are complete. Researchers can expect those clusters to be ready earlier.   

WHY IS IT HAPPENING? 
This is a critical step for us to be able to make new storage available to campus users, removing the group ID conflicts we currently have with the Georgia Tech Enterprise Directory (GTED). This will allow us to provide campus- and PACE-mountable storage to our researchers and provide a foundation for additional self-service capabilities. This change will also allow us to evolve the PACE user management tools and processes. We understand that the short-term impact of this outage is problematic, but as we increase storage utilization, the problem will only get worse if left unaddressed. We expect the long-term impact of this update to be low, since ownership, group names, and permissions will remain unchanged. 
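
Because only the numeric group IDs change, a quick way to confirm that nothing user-visible moved is to compare the group name and the numeric ID on a few of your files after the maintenance. A minimal sketch (the path is a placeholder):

    ls -l  ~/scratch/myfile     # group shown by name; should be unchanged
    ls -ln ~/scratch/myfile     # group shown by numeric GID; this value may differ
    id                          # lists your groups and their (new) numeric GIDs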

WHO IS AFFECTED? 
All users across all PACE’s clusters.  

WHAT DO YOU NEED TO DO? 
Please plan accordingly for an extended Maintenance Period for the Phoenix cluster starting Tuesday, January 31, 2023. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know and we can collaborate on possible alternatives.  

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. If escalation is required, please also email me, Pam Buffington (pam@gatech.edu), directly.

Best, 
– The PACE Team 
– Pam Buffington – PACE Director 

Phoenix Cluster Migration to Slurm Scheduler – Phase 5

Filed under: Uncategorized — Jeff Valdez @ 6:00 am

[Updated 2023/01/17, 4:02PM EST]

Dear Phoenix researchers,  

The fifth phase of Phoenix Slurm Migration has been successfully completed – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. We have successfully migrated about 1200 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster. 

As a reminder, the final phase of the migration is scheduled to complete later this month, during which the remaining 119 nodes will join Phoenix-Slurm:

  • Phase 6: January 31, 2023 (PACE Maintenance Period) – remaining 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support the smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled for this Friday, January 20th @ 11am-12pm via Zoom. Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. 

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best,
– The PACE Team

[Updated 2023/01/17, 6:00AM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Today – Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
We will follow up with additional updates and reminders as needed. In the meantime, please email us if you have any questions or concerns about the migration.

Best,
– The PACE Team

[Updated 2023/01/13, 1:05PM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. So far, we have successfully migrated about 1100 nodes (out of about 1319 total). For this fifth phase of the migration, 100 additional nodes will join the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHAT DO YOU NEED TO DO? 
As recommended at the beginning of the migration, we strongly encourage all researchers to continue shifting their workflows to the Slurm side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will provide a PACE Slurm Orientation on Friday, January 20th @ 11am-12pm via Zoom. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm.

Best,
– The PACE Team

January 12, 2023

Phoenix Project & Scratch Storage Cables Replacement

Filed under: Uncategorized — Marian Zvada @ 6:02 pm

WHAT’S HAPPENING?
Two cables on the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one connecting to controller 0 and one connecting to controller 1. The cables will be replaced one at a time, and the work is expected to take about 4 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18th, 2023 starting 1PM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
Potential storage access outage and subsequent temporary decreased performance to all users.

WHAT DO YOU NEED TO DO?
Since a redundant controller remains active while work is done on one cable at a time, there should not be an outage during the cable replacement. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored (see the example below).
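
If a running job does stall because storage becomes temporarily unavailable, it can be cancelled and resubmitted once access is restored. A sketch with a placeholder job ID and script name (Slurm commands shown; the Torque equivalents are qstat, qdel, and qsub):

    squeue -u $USER          # find the ID of the stalled job
    scancel 1234567          # cancel it (1234567 is a placeholder job ID)
    sbatch myjob.sbatch      # resubmit after storage access is restored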

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

Filed under: Uncategorized — Marian Zvada @ 6:02 pm

WHAT’S HAPPENING?
Two cables connecting one of the two controllers of the Hive Lustre device need to be replaced. The cables will be replaced one at a time, and the work is expected to take about 2 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18th, 2023 starting 10AM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
Potential storage access outage and subsequent temporary decreased performance to all users.

WHAT DO YOU NEED TO DO?
Since a redundant controller remains active while work is done on one cable at a time, there should not be an outage during the cable replacement. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

January 11, 2023

PyTorch Security Risk: Please Check & Update

Filed under: Uncategorized — Michael Weiner @ 9:19 am

WHAT’S HAPPENING?

Researchers who install their own copies of PyTorch may have downloaded a compromised package and should uninstall it immediately.

WHEN IS IT HAPPENING?

PyTorch nightly builds for December 25-30, 2022, are affected. Please uninstall immediately if you installed one of these versions.

WHY IS IT HAPPENING?

A malicious Triton dependency was added to the Python Package Index. See https://pytorch.org/blog/compromised-nightly-dependency/ for details.

WHO IS AFFECTED?

Researchers who installed PyTorch nightly packages on PACE or other services between December 25 and 30 are affected. PACE has scanned all .conda and .local directories on our systems and has not identified any copies of the malicious Triton package.

Affected services: All PACE clusters

WHAT DO YOU NEED TO DO?

Please uninstall the compromised package immediately. Details are available at https://pytorch.org/blog/compromised-nightly-dependency/. In addition, please alert PACE at pace-support@oit.gatech.edu if you identify an installation of the compromised package on our systems.
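
A sketch of the check-and-removal steps, consistent with the linked advisory; the exact package list may differ depending on what you installed:

    # Check whether the affected nightly dependency is present in your environment
    pip3 show torchtriton

    # If it is installed, remove the nightly packages and the compromised dependency,
    # then clear pip's cache so the bad wheel is not reused
    pip3 uninstall -y torch torchvision torchaudio torchtriton
    pip3 cache purge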

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions, or if you are unsure if you have installed the compromised package on PACE.

January 4, 2023

Phoenix Cluster Migration to Slurm Scheduler – Phase 4

Filed under: Uncategorized — Jeff Valdez @ 1:50 pm

[Update 2023/01/04, 2:18PM EST]

Dear Phoenix researchers,

The fourth phase of migration has been successfully completed – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. We have successfully migrated 1100 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster.

As a reminder, the final phases of the migration are scheduled to continue in January 2023, during which the remaining 219 nodes will join Phoenix-Slurm: 

  • Phase 5: January 17, 2023 – 100 nodes  
  • Phase 6: January 31, 2023 (PACE Maintenance Period) – about 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support the smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled for Friday, January 20th @ 11am-12pm via Zoom. Torque/Moab will no longer be available to Phoenix users on January 31st, at 6 AM ET.

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best, 

-The PACE Team

[Update 2023/01/04, 6:00AM EST]

Dear Phoenix researchers, 

Just a reminder that the fourth phase of the migration will start today, January 4th, during which 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline now (6am ET), and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

We will follow up with additional updates and reminders as needed. In the meantime, please email us if you have any questions or concerns about the migration. 

Best, 

– The PACE Team 

[Update 2023/01/03, 5:26PM EST]

Dear Phoenix researchers,

We have successfully migrated about 1000 nodes (out of about 1319 total) from Phoenix to the Phoenix-Slurm cluster. As a reminder, the fourth phase is scheduled starting tomorrow, January 4th, during which 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline tomorrow morning (January 4th) at 6am ET, and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

As recommended at the beginning of this migration, we strongly encourage all researchers to begin shifting over their workflows to the Slurm-based side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will provide a PACE Slurm Orientation on Friday, January 20th @ 11am-12pm via Zoom. 

Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm. 

Best, 

– The PACE Team
