PACE Maintenance Period (January 31 – February 7, 2023)

Posted on Friday, 20 January, 2023

[Updated 2023/01/27, 2:06PM EST]

WHEN IS IT HAPPENING?
Reminder that the next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, the scheduler will hold any job whose resource request would overlap the Maintenance Period and will release it once maintenance concludes. During this extended Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, PACE-ICE, CoC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/
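
For illustration only, a minimal Torque/Moab batch script and a roughly equivalent Slurm script are sketched below; the job name, resource values, and file names are placeholders, and the PACE documentation linked above remains the authoritative conversion guide.

    #!/bin/bash
    # Torque/Moab version (submit with: qsub myjob.pbs)
    #PBS -N myjob
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=2:00:00
    cd $PBS_O_WORKDIR
    ./my_program

    #!/bin/bash
    # Slurm version (submit with: sbatch myjob.sbatch)
    #SBATCH -J myjob
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --time=2:00:00
    cd $SLURM_SUBMIT_DIR
    ./my_program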

Users who run Singularity on the command line will need to use the equivalent Apptainer commands moving forward.
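
As a simple illustration (the image and command below are placeholders), existing Singularity invocations generally carry over by replacing the executable name:

    # before
    singularity exec mycontainer.sif python3 myscript.py
    singularity pull docker://ubuntu:22.04

    # after
    apptainer exec mycontainer.sif python3 myscript.py
    apptainer pull docker://ubuntu:22.04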

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (123 additional nodes for a final total of about 1323). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches to 60% of installed memory (this item and the next are illustrated in the sketch after this list) 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance
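
As a rough sketch of the two storage tuning items flagged above (PACE's actual procedure and values may differ), capping the ZFS ARC and raising the NFS thread count on a Linux storage server typically looks like:

    # Cap the ZFS ARC cache at 60% of installed memory (value is in bytes)
    MEM_BYTES=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
    echo $(( MEM_BYTES * 60 / 100 )) | sudo tee /sys/module/zfs/parameters/zfs_arc_max

    # Run 4 NFS server threads per CPU core
    sudo rpc.nfsd $(( $(nproc) * 4 ))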

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and lays the foundational work for providing additional storage options and capacity to researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

[Updated 2023/01/20, 8:45AM EST]

WHEN IS IT HAPPENING?
The next Maintenance Period starts at 6:00AM on Tuesday, 01/31/2023, and is tentatively scheduled to conclude by 11:59PM on Tuesday, 02/07/2023.

The Phoenix project file system changes are estimated to take seven days to complete.  The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier than seven days. PACE will release them as soon as maintenance and migration work is complete.

WHAT DO YOU NEED TO DO?
As usual, the scheduler will hold any job whose resource request would overlap the Maintenance Period and will release it once maintenance concludes. During this extended Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, PACE-ICE, CoC-ICE, and Buzzard. Phoenix is expected to take the full week, while the other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know, and we can collaborate on possible alternatives. Please plan accordingly for the projected downtime.

Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation, PACE Consulting Sessions, and PACE Slurm Orientation Sessions to support the smooth transition of your workflows to Slurm. These can be found at: https://pace.gatech.edu/

Users who run Singularity on the command line will need to use the equivalent Apptainer commands moving forward.

WHAT IS HAPPENING?  
The next PACE Maintenance Period starts 01/31/2023 at 6am and will run until complete. Phoenix downtime could last until Feb 7 or beyond.

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for the sixth and final phase of Phoenix cluster (about 119 additional nodes for a final total of about 1319). Phoenix users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. 
  • [Software] Singularity -> Apptainer Migration for PACE-apps, OOD. Users using Singularity on the command line need to use Apptainer commands moving forward. 

ITEMS NOT REQUIRING USER ACTION: 

  • [ALL CLUSTERS] Update GID on all file systems. Over 1.7 Billion files will be updated 
  • [Phoenix] Re-image last Phoenix login node; re-enable load balancer 
  • [Phoenix] Migrate Remaining Phoenix-Moab Funds to Phoenix-Slurm 
  • [ICE] Update cgroups limits on ICE head nodes 
  • [Network] Code upgrade to PACE departmental Palo Alto 
  • [Network] Upgrade ethernet switch firmware to 9.3.10 (research hall) 
  • [Hive][Storage] Replace 40G cables on storage-hive 
  • [Storage] Reduce the amount of memory available for ZFS caches, to 60% of installed memory 
  • [Storage] Update the number of NFS threads to 4 times the number of cores 
  • [Storage] Update sysctl parameters on ZFS servers 
  • [Datacenter] Georgia Power: Microgrid tests and reconfiguration 
  • [Datacenter] Databank: High Temp Chiller & Tower Maintenance

WHY IS IT HAPPENING? 
The extended maintenance period is required to remove GIDs that conflict with those assigned campus-wide, allowing the expansion of research storage across campus. It is a required component of a strategic initiative and lays the foundational work for providing additional storage options and capacity to researchers. The additional items are part of our regularly scheduled Maintenance Periods, which are announced in advance at https://pace.gatech.edu/. Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?
All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank You,
– The PACE Team

Upcoming Maintenance Period Extension Required January 31-February 7 (estimated)

Posted on Tuesday, 17 January, 2023

WHAT IS HAPPENING? 
PACE is updating the group ID (GID) of every group and file in our storage infrastructure to remove conflicts with the GIDs assigned campus-wide by OIT. The expected time per cluster varies greatly with the size of the associated storage. During the maintenance period, PACE will release each cluster as soon as its work is complete.
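
As a purely illustrative sketch of what such a remapping involves (the group IDs, group name, and path below are hypothetical, and PACE's actual tooling may differ):

    # Relabel files carrying an old, conflicting GID with the new campus-assigned GID
    find /path/to/filesystem -gid 12345 -print0 | xargs -0 chgrp -h 67890

    # ...and update the group entry itself (the name stays the same; only the numeric ID changes)
    sudo groupmod -g 67890 research_group_name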

WHEN IS IT HAPPENING? 
Maintenance period starts on Tuesday, January 31, 2023. The changes to the Phoenix project file system are estimated to take seven days to complete. Thus, the maintenance period will be extended from the typical three days to seven days for the Phoenix cluster. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) are anticipated to finish earlier. PACE will release them as soon as maintenance and migration work are complete. Researchers can expect those clusters to be ready earlier.   

WHY IS IT HAPPENING? 
This is a critical step toward making new storage available to campus users, removing the group ID conflicts we currently have with the Georgia Tech Enterprise Directory (GTED). It will allow us to provide campus- and PACE-mountable storage to our researchers and provide a foundation for additional self-service capabilities. This change will also allow us to evolve the PACE user management tools and processes. We understand that the short-term impact of this outage is problematic, but as storage utilization increases, the problem will only get worse if left unaddressed. We expect the long-term impact of this update to be low, since ownership, group names, and permissions will remain unchanged.

WHO IS AFFECTED? 
All users across all PACE clusters.

WHAT DO YOU NEED TO DO? 
Please plan accordingly for an extended Maintenance Period for the Phoenix cluster starting Tuesday, January 31, 2023. The other PACE clusters (Hive, Firebird, CoC-ICE, PACE-ICE, and Buzzard) will be released as storage updates are completed during the maintenance window. If you have a critical deadline that this will impact, please let us know and we can collaborate on possible alternatives.  

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. If escalation is required, please also email me, Pam Buffington (pam@gatech.edu), directly.

Best, 
– The PACE Team 
– Pam Buffington – PACE Director 

Phoenix Cluster Migration to Slurm Scheduler – Phase 5

Posted on Tuesday, 17 January, 2023

[Updated 2023/01/17, 4:02PM EST]

Dear Phoenix researchers,  

The fifth phase of Phoenix Slurm Migration has been successfully completed – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. We have successfully migrated about 1200 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster. 

As a reminder, the final phase of the migration is scheduled to complete later this month, during which the remaining 119 nodes will join Phoenix-Slurm:

  • Phase 6: January 31, 2023 (PACE Maintenance Period) – remaining 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support the smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled this Friday, January 20th @ 11am-12pm via Zoom. Torque/Moab will no longer be available to Phoenix users starting January 31st, at 6 AM ET. 

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best,
– The PACE Team

[Updated 2023/01/17, 6:00AM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Today – Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
We will follow up with additional updates and reminders as needed. Please email us if you have any questions or concerns about the migration.

Best,
– The PACE Team

[Updated 2023/01/13, 1:05PM EST]

WHAT IS HAPPENING? 
For Phase 5 of Phoenix Cluster Slurm migration, 100 nodes will be taken offline on the Phoenix cluster and migrated to the Phoenix-Slurm cluster. 

WHEN IS IT HAPPENING? 
Tuesday, January 17th at 6am ET 

WHY IS IT HAPPENING? 
This is part of the ongoing migration from Phoenix to the Phoenix-Slurm cluster. So far, we have successfully migrated about 1100 nodes (out of about 1319 total). For this fifth phase of the migration, 100 additional nodes will join the Phoenix-Slurm cluster. We do not expect there to be any impact to other nodes or jobs on the Phoenix or Phoenix-Slurm clusters. 

WHO IS AFFECTED? 
All Phoenix cluster users. 

WHAT DO YOU NEED TO DO? 
As recommended at the beginning of the migration, we strongly encourage all researchers to continue shifting their workflows to the Slurm side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will provide a PACE Slurm Orientation on Friday, January 20th @ 11am-12pm via Zoom. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm.

Best,
– The PACE Team

Phoenix Project & Scratch Storage Cables Replacement

Posted on Thursday, 12 January, 2023

WHAT’S HAPPENING?
Two cables on the Phoenix Lustre device, which hosts project and scratch storage, need to be replaced: one connecting to controller 0 and one connecting to controller 1. The cables will be replaced one at a time, and the work is expected to take about 4 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18th, 2023 starting 1PM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users; a storage access outage and a subsequent temporary decrease in performance are possible.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains active while work is done on one cable at a time, there should not be an outage during the cable replacement. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
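
For example (the job ID and script names are placeholders), a stalled job can be cancelled and resubmitted under whichever scheduler it was using:

    # Torque/Moab
    qdel 123456
    qsub myjob.pbs

    # Slurm
    scancel 123456
    sbatch myjob.sbatch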

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

Posted on Thursday, 12 January, 2023

WHAT’S HAPPENING?
Two cables connecting one of the two controllers of the Hive Lustre device need to be replaced. The cables will be replaced one at a time, and the work is expected to take about 2 hours.

WHEN IS IT HAPPENING?
Wednesday, January 18th, 2023 starting 10AM ET.

WHY IS IT HAPPENING?
Required maintenance.

WHO IS AFFECTED?
All users; a storage access outage and a subsequent temporary decrease in performance are possible.

WHAT DO YOU NEED TO DO?
Because a redundant controller remains active while work is done on one cable at a time, there should not be an outage during the cable replacement. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?
For questions, please contact PACE at pace-support@oit.gatech.edu.

PyTorch Security Risk: Please Check & Update

Posted on Wednesday, 11 January, 2023

WHAT’S HAPPENING?

Researchers who install their own copies of PyTorch may have downloaded a compromised package and should uninstall it immediately.

WHEN IS IT HAPPENING?

PyTorch-nightly builds from December 25-30, 2022, are impacted. Please uninstall them immediately if you have installed one of these versions.

WHY IS IT HAPPENING?

A malicious Triton dependency was added to the Python Package Index. See https://pytorch.org/blog/compromised-nightly-dependency/ for details.

WHO IS AFFECTED?

Researchers who installed PyTorch on PACE or other services and updated it with nightly packages between December 25 and 30. PACE has scanned all .conda and .local directories on our systems and has not identified any copies of the compromised Triton package.

Affected services: All PACE clusters

WHAT DO YOU NEED TO DO?

Please uninstall the compromised package immediately. Details are available at https://pytorch.org/blog/compromised-nightly-dependency/. In addition, please alert PACE at pace-support@oit.gatech.edu to let us know that you have identified an installation on our systems.
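
The linked PyTorch advisory has the authoritative removal steps; as a sketch, checking and cleaning a personal pip environment might look like the following (adjust for conda or virtual environments as needed):

    # Check whether the affected nightly packages are installed
    pip3 list 2>/dev/null | grep -Ei 'torch|triton'

    # If so, remove them and clear the pip cache (per the PyTorch advisory)
    pip3 uninstall -y torch torchvision torchaudio torchtriton
    pip3 cache purge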

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions, or if you are unsure if you have installed the compromised package on PACE.

Phoenix Cluster Migration to Slurm Scheduler – Phase 4

Posted on Wednesday, 4 January, 2023

[Update 2023/01/04, 2:18PM EST]

Dear Phoenix researchers,

The fourth phase of migration has been successfully completed – all nodes are back online, with 100 more compute nodes joining the Phoenix-Slurm cluster. We have successfully migrated 1100 nodes (out of about 1319) from Phoenix to the Phoenix-Slurm cluster.

As a reminder, the final phases of the migration are scheduled to continue in January 2023, during which the remaining 219 nodes will join Phoenix-Slurm: 

  • Phase 5: January 17, 2023 – 100 nodes  
  • Phase 6: January 31, 2023 (PACE Maintenance Period) – about 119 nodes 

We strongly encourage all researchers to shift their workflows to the Slurm-based cluster. PACE provides documentation and consulting sessions to support the smooth transition of your workflows to Slurm. Our next PACE Slurm orientation is scheduled for Friday, January 20th @ 11am-12pm via Zoom. Torque/Moab will no longer be available to Phoenix users on January 31st, at 6 AM ET.

PACE will be following up with additional updates and reminders in the upcoming weeks. In the meantime, please contact us with any questions or concerns about this transition. 

Best, 

-The PACE Team

[Update 2023/01/04, 6:00AM EST]

Dear Phoenix researchers, 

Just a reminder that the fourth phase of the migration will start today, January 4th, during which 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline now (6am ET), and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

We will follow up with additional updates and reminders as needed. In the meantime, please email us if you have any questions or concerns about the migration. 

Best, 

– The PACE Team 

[Update 2023/01/03, 5:26PM EST]

Dear Phoenix researchers,

We have successfully migrated about 1000 nodes (out of about 1319 total) from Phoenix to the Phoenix-Slurm cluster. As a reminder, the fourth phase is scheduled starting tomorrow, January 4th, during which 100 additional nodes will join the Phoenix-Slurm cluster. 

The 100 nodes will be taken offline tomorrow morning (January 4th) at 6am ET, and we do not expect there to be any impact to other nodes or jobs on Phoenix-Slurm. 

As recommended at the beginning of this migration, we strongly encourage all researchers to begin shifting over their workflows to the Slurm-based side of Phoenix to take advantage of the improved features and queue wait times. We provide helpful information for the migration in our documentation and will provide a PACE Slurm Orientation on Friday, January 20th @ 11am-12pm via Zoom. 

Please email us or join a PACE Consulting Session if you have any questions or need assistance migrating your workflows to Slurm. 

Best, 

– The PACE Team

Storage-eas read-only during configuration change

Posted on Wednesday, 21 December, 2022

[Update 1/9/23 10:58 AM]

The migration of storage-eas data to a new location is complete, and full read/write capability is available for all research groups on the device. Researchers may resume regular use of storage-eas, including writing new data to it.

Thank you for your patience as we completed these configuration changes to improve stability of storage-eas. Please email us at pace-support@oit.gatech.edu with any questions.

 

[Original Post 12/21/22 11:08 AM]

Summary: Researchers have reported multiple outages of the storage-eas server recently. To stabilize the storage, PACE will make configuration changes. The storage-eas server will become read-only at 3 PM today and will remain read-only until after the Winter Break, while the changes are being implemented. We will provide an update when write access is restored.

Details: PACE will remove the deduplication setting on storage-eas, which is causing performance and stability issues. Beginning this afternoon, the system will become read-only while all data is copied to a new location. After the copy is complete, we will enable access to the storage in the new location, with full read/write capabilities.

Impact: Researchers will not be able to write to storage-eas for up to two weeks. You may continue reading files from it on both PACE and external systems where it is mounted. While this move is in progress, PACE recommends that researchers copy any files that need to be used in Phoenix jobs into their scratch directories, then work from there to write during a job. Scratch provides each researcher with 15 TB of temporary storage on the Lustre parallel filesystem. Files in scratch can be copied to non-PACE storage via Globus.
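
For example (the paths below are placeholders; substitute the actual storage-eas mount point and your own directories), staging input files into scratch before submitting a job might look like:

    # Copy read-only input data from storage-eas into your scratch space
    cp -r /path/to/storage-eas/my_dataset ~/scratch/my_dataset

    # Point your job at the scratch copy and write outputs there as well
    cd ~/scratch/my_dataset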

Thank you for your patience as we complete these configuration changes to improve stability of storage-eas. Please email us at pace-support@oit.gatech.edu with any questions.

Phoenix Project & Scratch Storage Cables Replacement for Redundant Controller

Posted on Thursday, 8 December, 2022
[Update 2022/12/08, 5:52PM EST]
Work on the cable replacement on the redundant storage controller has been completed, and the associated systems connecting to the storage were restored to normal. We were able to replace 2 cables on the controller without interruption to service.

[Update 2022/12/05, 9:00AM EST]
Summary: Phoenix project & scratch storage cable replacement on the redundant controller, with a potential outage and a subsequent temporary decrease in performance

Details: A cable connecting enclosures of the Phoenix Lustre device, which hosts project and scratch storage, to the redundant controller needs to be replaced, beginning around 10AM Wednesday, December 8th, 2022. The cable replacement is expected to take about 3-4 hours. After the replacement, pools will need to be rebuilt over the course of about a day.

Impact: Because we are replacing a cable on the redundant controller while maintaining the main controller, there should not be an outage during the cable replacement. However, a similar replacement has previously caused storage to become unavailable, so an outage is possible. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance may be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Slow Storage on Phoenix

Posted on Friday, 2 December, 2022

[Update 12/5/22 10:45 AM]

Performance on Phoenix project & scratch storage has returned to normal. PACE continues to investigate the root cause of last week’s slowness, and we would like to thank those researchers we have contacted with questions about your workflows. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 12/2/22 1:11 PM]

Summary: Researchers may experience slow performance on Phoenix project & scratch storage.

Details: Over the past three days, Phoenix has experienced intermittent slowness on the Lustre filesystem hosting project & scratch storage due to heavy utilization. PACE is investigating the source of the heavy load on the storage system.

Impact: Any jobs or commands that read or write on project or scratch storage may run more slowly than normal.

Thank you for your patience as we continue to investigate. Please contact us at pace-support@oit.gatech.edu with any questions.