PACE A Partnership for an Advanced Computing Environment

September 23, 2021

Hive Project & Scratch Storage Battery Replacement

Filed under: Uncategorized — Michael Weiner @ 12:33 pm

[Update 9/23/21 3:15 PM]

The replacement batteries have reached a sufficient charge, and Hive GPFS performance has been restored. Thank you for your patience during this maintenance.

[Original Post 9/23/21 12:30 PM]

Summary: Battery replacement on Hive project & scratch storage will impact performance today.
What’s happening and what are we doing: UPS batteries on the Hive GPFS storage device, holding project (data) and scratch storage, need to be replaced. During the replacement, which will begin shortly this afternoon, storage will shift to write-through mode, and performance will be impacted. Once the new batteries are sufficiently charged, performance will return to normal.
How does this impact me: Hive project and scratch performance will be impacted until the fresh batteries have sufficiently charged, which should take approximately 3 hours. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.
What we will continue to do: PACE will monitor Hive GPFS storage throughout this procedure.
Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

September 22, 2021

Hive and Phoenix Scheduler Configuration Change

Filed under: Uncategorized — Semir Sarajlic @ 5:04 pm

Dear PACE Researchers, 

We would like to announce an upcoming change to the scheduler configuration on the Phoenix and Hive clusters at 9:00 AM on Thursday, September 23rd. This change should improve the scheduler performance given the large number of jobs executed by our users. 

What will PACE be doing: PACE will reduce the retention time for job-specific logs from 24 hours to 6 hours after job completion.  Reducing the amount of job information the scheduler needs to process regularly should provide a more stable and faster job submission environment. Additionally, the downtime associated with scheduler restarts should improve, as job ingestion time will be reduced accordingly.  

Who does this message impact: Moving forward, users will be unable to query a job with qstat more than 6 hours after it completes. In addition to the scheduler job STDOUT/STDERR files, job statistics for completed jobs on Phoenix and Hive can be queried at https://pbstools-coda.pace.gatech.edu. 
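
As a rough illustration of the new retention window, the sketch below checks whether a completed job would still be expected to appear in qstat. It assumes only the 6-hour figure from this announcement; the helper name `qstat_record_expected` is hypothetical, and the code does not contact the scheduler.

```python
from datetime import datetime, timedelta

# Retention window from the announcement: job-specific logs are kept
# for 6 hours after job completion.
RETENTION = timedelta(hours=6)


def qstat_record_expected(completed_at: datetime, now: datetime) -> bool:
    """Return True if `now` is within the retention window after completion.

    Hypothetical helper for illustration only; it performs date arithmetic
    and never queries the scheduler.
    """
    elapsed = now - completed_at
    return timedelta(0) <= elapsed <= RETENTION


finished = datetime(2021, 9, 23, 10, 0)
print(qstat_record_expected(finished, datetime(2021, 9, 23, 15, 0)))  # 5 hours after completion
print(qstat_record_expected(finished, datetime(2021, 9, 23, 17, 0)))  # 7 hours after completion
```

For jobs outside the window, the STDOUT/STDERR files and the job statistics site mentioned above remain the places to look.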

What PACE will continue to do: We will monitor the clusters for issues during and after the configuration change to assess any immediate impacts from the update. We will continue to assess the scheduler health to ensure a stable job submission environment. 

As always, please contact us at pace-support@oit.gatech.edu with any questions or concerns regarding this change. 

Best Regards, 
The PACE Team

September 13, 2021

[Complete] PACE Maintenance Period (November 3 – 5, 2021)

Filed under: Uncategorized — Semir Sarajlic @ 10:46 am

[Complete 11/5/21 3:15 PM]

Our scheduled maintenance has completed ahead of schedule! All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00 AM on Wednesday, February 9, 2022, and conclude by 11:59 PM on Friday, February 11, 2022. We have also tentatively scheduled the remaining maintenance periods for 2022 for May 11-13, August 10-12, and November 2-4.

The following tasks were part of this maintenance period:
ITEMS REQUIRING USER ACTION:
• [Complete] TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details are available on our blog.

ITEMS NOT REQUIRING USER ACTION:
• [Complete][Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
• [Complete][System] Operating system patch installs
• [Complete][Storage/Phoenix] Lustre controller firmware and other upgrades
• [Complete][Storage/Phoenix] Lustre scratch upgrade and expansion
• [Postponed][Storage] Hive GPFS storage upgrade
• [Complete][System] System configuration management updates
• [Complete][System] Updates to NVIDIA drivers and libraries
• [Complete][System] Upgrade some PACE infrastructure nodes to RHEL 7.9
• [Complete][System] Reorder group file
• [Complete][Headnode/ICE] Configure c-group controls on COC-ICE and PACE-ICE headnodes
• [Complete][Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
• [Complete][Network] Update Ethernet switch firmware
• [Complete][Network] Update IP addresses of switches in BCDC

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Update 11/1/21 2:00 PM]

C-group controls will be configured on the login nodes for both COC-ICE and PACE-ICE during the maintenance period this week. This should help mitigate overuse of the login node by students running heavy computations, which has slowed the node for others.

Please use compute nodes for all computational work and avoid resource-intensive processes on the login nodes. Students who need an interactive environment are requested to submit an interactive job. Students who are uncertain about how to use ICE schedulers to work on compute nodes should contact their course’s instructor or TA, who can help with workflows on the cluster. PACE will stop processes that overuse the login nodes in order to restore functionality for all students.

Thank you for your efforts to ensure ICE clusters are an available resource for all students in participating courses.

[Reminder 10/26/21 4:30 PM]

Additional details and instructions for the TensorFlow upgrade are available in another blog post.

[Full announcement 10/20/21 10:30 AM]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, November 3, and end at 11:59 PM on Friday, November 5. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
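
To estimate whether a job is likely to be held, you can compare its requested duration against the start of the maintenance period. The sketch below is only an approximation under stated assumptions: the function name `would_be_held` is hypothetical, it assumes the job could start at the given time, and it treats a job ending exactly at 6:00 AM as not held. The scheduler's actual decision may differ.

```python
from datetime import datetime, timedelta

# Start of the maintenance period stated in this announcement.
MAINTENANCE_START = datetime(2021, 11, 3, 6, 0)


def would_be_held(earliest_start: datetime, walltime: timedelta) -> bool:
    """Return True if the requested walltime would extend into maintenance.

    Hypothetical helper: assumes the job starts at `earliest_start` and
    runs for its full requested walltime.
    """
    return earliest_start + walltime > MAINTENANCE_START


submit_time = datetime(2021, 11, 2, 6, 0)  # 24 hours before maintenance begins
print(would_be_held(submit_time, timedelta(hours=12)))  # ends well before maintenance
print(would_be_held(submit_time, timedelta(hours=48)))  # would run into maintenance
```

Requesting a shorter duration, where your workload allows it, is one way to let a job finish before the maintenance period begins.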

Please see below for a tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details and instructions will follow in a separate message.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
  • [System] Operating system patch installs
  • [Storage/Phoenix] Lustre controller firmware and other upgrades
  • [Storage/Phoenix] Lustre scratch upgrade and expansion
  • [System] System configuration management updates
  • [System] Updates to NVIDIA drivers and libraries
  • [System] Upgrade some PACE infrastructure nodes to RHEL 7.9
  • [System] Reorder group file
  • [Headnode/COC-ICE] Configure c-group controls on COC-ICE headnode
  • [Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
  • [Network] Update Ethernet switch firmware
  • [Network] Update IP addresses of switches in BCDC

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Early announcement]

Dear PACE Users,

This is a friendly reminder that our next Maintenance Period is tentatively scheduled to begin at 6:00 AM on Wednesday, 11/03/2021, and to conclude by 11:59 PM on Friday, 11/05/2021. As usual, the scheduler will hold jobs whose resource requests would have them running during the Maintenance Period until after it ends. During the Maintenance Period, all PACE-managed computational and storage resources will be unavailable.

As we get closer to the Maintenance Period, we will communicate the list of activities to be completed and update this blog post.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

September 10, 2021

Globus maintenance downtime on September 18

Filed under: Uncategorized — Michael Weiner @ 3:17 pm
Summary: Globus maintenance downtime on September 18
What’s happening and what are we doing: Globus will be undergoing maintenance worldwide on September 18, beginning at 11:00 AM and expected to last for up to 30 minutes, to complete database upgrades. Details are available on the Globus website.
How does this impact me: You will not be able to access Globus or start a transfer during this time. Any transfers in progress will be paused and will automatically resume upon completion of maintenance. This affects all Globus services, including endpoints at PACE on our Phoenix and Hive clusters, plus others you may use at other computing sites.
If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

September 1, 2021

[Complete] PACE is transitioning from current ticketing system FootPrints to ServiceNow

Filed under: Uncategorized — Semir Sarajlic @ 5:17 pm

[Update – September 3]

Dear PACE Users,

PACE has successfully transitioned to ServiceNow, and we have begun receiving user tickets as expected in ServiceNow.

As previously mentioned, you may continue to email pace-support@oit.gatech.edu to reach PACE support. For your reference, the following three direct links to ServiceNow forms may be used going forward to request help, request new software for the PACE Apps software repository, and request access to the ICE cluster.

The PACE team will continue to work on the remaining support requests that are in the FootPrints system. Thank you all for your attention and patience through this transition.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu 

Best, 

The PACE Team 

 

[Original Message – September 1]

Dear PACE Users,  

We are reaching out to inform you that PACE is transitioning from our current ticketing system FootPrints to ServiceNow. 

What’s happening and what we are doing: The PACE team is transitioning from its current ticketing system, FootPrints, to ServiceNow. Starting September 3, all new PACE support requests will be processed in ServiceNow. PACE will continue to work on any existing support requests that are in FootPrints. As part of this transition, we have created two new request forms that replace our existing Software Request Form and PACE ICE Instructional Cluster Request Form.

How does this impact me: The transition should be seamless for users in most cases. The exception is the links to our software and ICE request forms, which are changing. On Friday, September 3rd, the PACE support email address, pace-support@oit.gatech.edu, will begin redirecting users’ emails and requests to ServiceNow, and the new software and ICE request form links will be available on our website. Please use those new forms if you would like to request new software for the PACE Apps software repository or if you are a course instructor interested in using PACE-ICE for your students. Users who previously submitted tickets directly via FootPrints may use ServiceNow at https://services.gatech.edu (navigate to “Technology” and then the “PACE” tile) and submit their requests using the available forms.

The following direct links to ServiceNow forms will be live and available to users on September 3: 

What we will continue to do:   We will continue to work on the existing tickets that are in FootPrints, and you may check the status of this transition on this blog post.   

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu 

Best, 

The PACE Team 
