PACE A Partnership for an Advanced Computing Environment

January 20, 2021

[Completed – PACE Clusters Ready for Research] PACE Maintenance – February 3 – 5, 2021

Filed under: Uncategorized — Semir Sarajlic @ 1:23 pm

[Update — February 5, 2021, 2:14pm]

Dear PACE Users,

Our scheduled maintenance has completed on time. All Coda and Rich datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Here is an update on the tasks, including one that may require user action:

ITEMS THAT MAY REQUIRE USER ACTION:

  • [COMPLETE] [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)

While updating login-hive[1-2], their SSH server keys changed. As a result, users may see a warning that the host key has changed or is not correct. If this happens, please remove the entries in your local .ssh/known_hosts that reference login-hive, login-hive1, or login-hive2, then try connecting again.
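For users who prefer to script the cleanup, here is a minimal Python sketch of the filtering step. It is an illustration rather than an official PACE tool: the function name `drop_host_entries` and the substring match on "login-hive" are assumptions made for this example. It only handles plain-text entries; if your known_hosts file stores hashed hostnames (lines beginning with `|1|`), the sketch cannot match them, and OpenSSH's `ssh-keygen -R <hostname>` is the more reliable route.

```python
def drop_host_entries(known_hosts_text, needle="login-hive"):
    """Return known_hosts content without entries whose host field mentions `needle`.

    Assumption: entries are stored in plain text (not hashed), so a simple
    substring match on the host field is enough to find them.
    """
    kept = []
    for line in known_hosts_text.splitlines():
        fields = line.split()
        # Keep blank lines and comments untouched.
        if not fields or fields[0].startswith("#"):
            kept.append(line)
            continue
        # An optional marker (@cert-authority or @revoked) precedes the host field.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        # The host field is a comma-separated list of names and addresses.
        if any(needle in host for host in host_field.split(",")):
            continue  # drop the stale entry
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    # Hypothetical sample content; the key material is a placeholder.
    sample = (
        "login-hive1.example.edu ssh-ed25519 AAAA\n"
        "example.org ssh-rsa BBBB\n"
    )
    print(drop_host_entries(sample), end="")
```

To apply this to a real file, read ~/.ssh/known_hosts, keep a backup copy, and write the returned text back. The next SSH connection will then prompt you to accept the new host key.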

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Apply updates to all compute nodes
  • [COMPLETE] [Compute] Reboot all compute nodes running Lustre clients
  • [COMPLETE] [Network] Enable subnet managers for Hive
  • [COMPLETE] [Network] Reboot the main Coda InfiniBand HDR switch
  • [COMPLETE] [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [COMPLETE] [Software] Upgrade Intel license server
  • [COMPLETE] [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [COMPLETE] [Storage] Upgrade Lustre clients
  • [COMPLETE] [Storage] Upgrade controller and exascaler for the storage appliances: SFA200NV, SFA18KE
  • [COMPLETE] [Coda Data Center] Georgia Power will install a Power Quality Monitor

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!

The PACE Team

[Update — February 2, 2021, 3:25pm]

This is a friendly reminder that our maintenance will begin tomorrow at 6:00 AM and conclude on Friday, February 5th, 2021. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. Please note that during the maintenance period, users will not have access to Coda and Rich datacenter resources.

We have added one additional activity to this maintenance.  Here is a current list:

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Apply updates to all compute nodes
  • [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)
  • [Compute] Reboot all compute nodes running Lustre clients
  • [Network] Enable subnet managers for Hive
  • [Network] Reboot the main Coda InfiniBand HDR switch
  • [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [Software] Upgrade Intel license server
  • [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [Storage] Upgrade Lustre clients
  • [Storage] Upgrade controller and exascaler for the storage appliances: SFA200NV, SFA18KE
  • [Coda Data Center] Georgia Power will install a Power Quality Monitor

This maintenance is planned to last through Friday to allow Georgia Power to install a Power Quality Monitor, which is required to get the microgrid fully operational. Because Databank is performing work on the cooling systems, we agreed to do this activity on Friday. No power outage is expected. Once Georgia Power completes the installation, we will open the clusters to users.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

[Original Note – January 20, 2021, 1:23pm]

Dear PACE Users,

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on February 3rd, 2021 and conclude at 11:59 PM on February 5th, 2021. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.  Please note, during the maintenance period, users will not have access to Coda and Rich datacenter resources.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Apply updates to all compute nodes
  • [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)
  • [Compute] Reboot all compute nodes running Lustre clients
  • [Network] Enable subnet managers for Hive
  • [Network] Reboot the main Coda InfiniBand HDR switch
  • [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [Software] Upgrade Intel license server
  • [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [Storage] Upgrade Lustre clients
  • [Storage] Upgrade controller and exascaler for the storage appliances: SFA200NV, SFA18KE

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

January 14, 2021

[Complete] Emergency Firewall Upgrade – starting today (01/14) at 8:00pm

Filed under: Uncategorized — Semir Sarajlic @ 6:43 pm

Dear PACE users,

OIT will conduct an emergency firewall upgrade this evening, 01/14/2021, from 8:00pm to 10:00pm. This upgrade is expected to impact VPN access; as a result, connections in and out of PACE (e.g., interactive sessions, file transfers) may be interrupted during that period.

Who is impacted: During the emergency firewall upgrade window, PACE users may be unable to connect to PACE resources or may briefly lose their connection. We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this window, since a lost connection would interrupt them. Batch jobs that are running or queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or Internet access will be subject to interruptions during this maintenance activity. This maintenance activity will not affect any of the PACE storage systems.

What PACE will do:  PACE will remain on standby during this emergency firewall upgrade to monitor the systems, and report on any interruptions in service.   Up-to-date progress will be provided on Georgia Tech’s Status page, http://status.gatech.edu.

Thank you for your attention to this matter, and if you have any questions, please direct them to pace-support@oit.gatech.edu.

Best,
The PACE Team

 

 

January 4, 2021

OIT’s Scheduled Network Maintenance

Filed under: Uncategorized — Semir Sarajlic @ 3:49 pm

[Update – January 5, 2021 11:30am]

Dear PACE users,

The routers that were upgraded late last night had a problem with OSPF, which caused missing routes and prevented connections to PACE systems. Users who tried to connect to PACE resources late last night would have received errors such as “no route to host” when attempting to ssh to headnodes. Network Engineering downgraded the firmware to the original version, and connectivity was restored within the scheduled maintenance window. PACE completed testing by 2:19am this morning and confirmed that PACE services are operational.

The Network Engineering team has engaged the vendor to identify the root cause of the issue, given that the firmware had been tested on the exact same hardware prior to last night's deployment without any issues. Once the root cause is identified and resolved, another upgrade will be scheduled and communicated accordingly.

Thank you for your attention to this matter, and if you have any questions, please direct them to pace-support@oit.gatech.edu.

Best,
The PACE Team

 

[Original Post – January 4, 2021 3:49pm]

Dear PACE users,

OIT’s Network Engineering Team will be conducting maintenance activities from 7:00pm this evening, 01/04/2021, through 2:00am (01/05/2021). Data center routers and firewalls will get firmware upgrades. All devices have redundancy, and devices will be upgraded one at a time. No service disruptions are expected. However, it is possible that connections in and out of PACE (e.g., interactive sessions, file transfers) may be interrupted during that period.

Who is impacted: During the maintenance window, we do not expect service disruptions at PACE; however, PACE users may be unable to connect to PACE resources or may briefly lose their connection. We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this window, since a lost connection would interrupt them. Batch jobs that are running or queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or Internet access will be subject to interruptions during this maintenance activity. This maintenance activity will not affect any of the PACE storage systems.

What PACE will do: PACE will remain on standby during these maintenance activities to monitor the systems, conduct testing, and report on any interruptions in service.

Thank you for your attention to this matter, and if you have any questions, please direct them to pace-support@oit.gatech.edu.

Best,
The PACE Team
