
Archive for category Uncategorized

PACE Maintenance Period (November 3 – 5, 2021)

Posted on Monday, 13 September, 2021

Dear PACE Users,

This is a friendly reminder that our next Maintenance Period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and to conclude by 11:59PM on Friday, 11/05/2021. As usual, the scheduler will hold jobs with resource requests that would be running during the Maintenance Period until after the Maintenance Period. During the Maintenance Period, all PACE-managed computational and storage resources will be unavailable.
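As a rough illustration of the hold rule described above, the following Python sketch checks whether a job's requested walltime would overlap the tentative maintenance window. The function name `would_be_held` and the overlap test are assumptions made for illustration; the scheduler's actual logic may differ.

```python
from datetime import datetime, timedelta

# Tentative maintenance window taken from this announcement.
MAINT_START = datetime(2021, 11, 3, 6, 0)
MAINT_END = datetime(2021, 11, 5, 23, 59)


def would_be_held(earliest_start: datetime, walltime: timedelta) -> bool:
    """Hypothetical check: True if a job that starts at earliest_start and
    runs for its full requested walltime would overlap the window."""
    if earliest_start >= MAINT_END:
        # The job would start after maintenance ends, so nothing overlaps.
        return False
    return earliest_start + walltime > MAINT_START
```

Under these assumptions, a job that could start at noon on 11/02/2021 with a 24-hour walltime request would be held, while the same job with a 12-hour request could finish before the window opens.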

As we get closer to the Maintenance Period, we will communicate the list of activities to be completed and update this blog post.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

Globus maintenance downtime on September 18

Posted on Friday, 10 September, 2021

Summary: Globus maintenance downtime on September 18

What’s happening and what are we doing: Globus will be undergoing maintenance worldwide on September 18, beginning at 11:00 AM and expected to last for up to 30 minutes, to complete database upgrades. Details are available on the Globus website.

How does this impact me: You will not be able to access Globus or start a transfer during this time. Any transfers in progress will be paused and will automatically resume upon completion of maintenance. This affects all Globus services, including endpoints at PACE on our Phoenix and Hive clusters, plus others you may use at other computing sites.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Complete] PACE is transitioning from current ticketing system FootPrints to ServiceNow

Posted on Wednesday, 1 September, 2021

[Update – September 3]

Dear PACE Users,

PACE has successfully transitioned to ServiceNow, and we have begun receiving user tickets as expected in ServiceNow.

As previously mentioned, you may continue to email pace-support@oit.gatech.edu to reach PACE support. For your reference, the three links listed below go directly to the ServiceNow forms that you may use going forward to request help, request new software for the PACE Apps software repository, and request access to the ICE cluster.

The PACE team will continue to work on the remaining support requests that are in the FootPrints system. Thank you all for your attention and patience through this transition.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu 

Best, 

The PACE Team 

 

[Original Message – September 1]

Dear PACE Users,  

We are reaching out to inform you that PACE is transitioning from our current ticketing system, FootPrints, to ServiceNow.

What’s happening and what we are doing: Starting September 3, all new PACE support requests will be processed in ServiceNow. PACE will continue to work on any existing support requests that are in FootPrints. As part of this transition, we have created two new request forms that replace our existing Software Request Form and PACE ICE Instructional Cluster Request Form.

How does this impact me: For most users, the transition is seamless, except that the links to our software and ICE request forms are changing. On Friday, September 3, the PACE support email address, pace-support@oit.gatech.edu, will redirect users’ emails/requests to ServiceNow, and the new software and ICE request form links will be available on our website. Please use those new forms if you would like to request new software for the PACE Apps software repository or if you are a course instructor interested in using PACE-ICE for your students. Users who previously submitted ticket requests directly via FootPrints may use ServiceNow at https://services.gatech.edu (navigate to “Technology” and then the “PACE” tile) and submit their request from the available forms.

The following direct links to ServiceNow forms will be live and available to users on September 3: 

What we will continue to do:   We will continue to work on the existing tickets that are in FootPrints, and you may check the status of this transition on this blog post.   

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu 

Best, 

The PACE Team 

Email Relay Reconfiguration that’s Impacting PACE Utilities

Posted on Friday, 27 August, 2021

Dear PACE Users,

We are reaching out to inform you that on Monday, August 30, PACE will begin reconfiguring its utilities that send messages to users. As a result, the “from” address on these messages will change to no-reply@pace.gatech.edu. These changes are required for us to comply with the Institute’s email notification requirements. We want to bring this to your attention so that you are aware of the new email address from which you will receive messages from PACE.

What’s happening and what we are doing: PACE will be making changes to utilities that send out messages to users, which will change the email address listed in the “from” field. PACE will begin updating its utilities on Monday, August 30, and the work will continue through the coming weeks. More specifically, the following utilities will be reconfigured:

  • [Complete] Scheduler (all clusters): Emails from the scheduler with job status information will change from moabadmin@<scheduler>.pace.gatech.edu to being from no-reply@pace.gatech.edu.
  • PACE Support script (all clusters): The pace-support script is currently disabled. Previously, the script sent its message to the ticketing system as though the message came from you, so that the source of the ticket was identified properly. The script will instead send from no-reply@pace.gatech.edu and embed your email address to set the source of the ticket. This should be transparent to you.
  • [Complete] PI and Department CSR Monthly statements for Phoenix and Firebird clusters: These will change from having a pace-support@oit.gatech.edu from address to being from no-reply@pace.gatech.edu, with a reply-to of pace-support@oit.gatech.edu.
  • Security/system information (all clusters): Security violation messages and general system mail will be sent from no-reply@pace.gatech.edu, including mail sent using the mail commands. System mail will be redirected to your email account as identified in GT systems. As a result, you may receive mail messages that were previously left on the system in an undeliverable state.
  • Head node violation messages (all clusters): The from address for these messages will change from pace-support@oit.gatech.edu to no-reply@pace.gatech.edu, with a reply-to of pace-support@oit.gatech.edu.
  • Scratch storage deleter messages (Phoenix & Hive): The from address for these messages will change from pace-support@oit.gatech.edu to no-reply@pace.gatech.edu, with a reply-to of pace-support@oit.gatech.edu.
  • Reconfigure PACE servers to send via GT outgoing mail servers (all clusters): This will increase the likelihood of email messages being delivered and also not being identified as spam. This should be transparent to you, but adds email headers for signatures and changes the server that will deliver the email.
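To illustrate the from/reply-to arrangement described in the list above, here is a minimal Python sketch that builds a message with those two headers using the standard library. The helper name `build_notice` and the sample recipient are hypothetical; this is not the code PACE utilities use.

```python
from email.message import EmailMessage

FROM_ADDR = "no-reply@pace.gatech.edu"
REPLY_TO_ADDR = "pace-support@oit.gatech.edu"


def build_notice(recipient: str, subject: str, body: str) -> EmailMessage:
    """Hypothetical helper: build a notice whose replies go to PACE support
    even though the message itself is sent from the no-reply address."""
    msg = EmailMessage()
    msg["From"] = FROM_ADDR
    msg["Reply-To"] = REPLY_TO_ADDR
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

With headers set this way, most mail clients address a reply to the Reply-To value rather than to the no-reply sender.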

How does this impact me: All messages that you receive from PACE utilities will be addressed from no-reply@pace.gatech.edu. If you have created inbox rules for prior messages coming from PACE, please update them with this new address, no-reply@pace.gatech.edu.

What we will continue to do: In the coming weeks, PACE will work on implementing the changes listed above. You may check the status of each of the changes on this blog post.
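If you filter mail with a script rather than a mail-client rule, the match might look like the following Python sketch. The function name `is_pace_utility_mail` is hypothetical, and the check assumes you only want to match the new sender address.

```python
from email.utils import parseaddr

NEW_SENDER = "no-reply@pace.gatech.edu"


def is_pace_utility_mail(from_header: str) -> bool:
    """Hypothetical filter: True if the From header carries the new address.
    parseaddr strips any display name, so only the address is compared."""
    _, address = parseaddr(from_header)
    return address.lower() == NEW_SENDER
```

Comparing the parsed address, rather than the raw header text, keeps the rule working if a display name is added in front of the address.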

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu

Best,

The PACE Team

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period: August 11-13, 2021

Posted on Friday, 13 August, 2021

Dear PACE Users,

Our scheduled maintenance has completed ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/05/2021.

Here is an update on the tasks performed during this maintenance period.

ITEMS REQUIRING USER ACTION:

  • None.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete] [Datacenter] Databank will replace components of one of the transformers feeding the room, which requires a complete power-off of the research hall that includes the PACE-managed clusters.
  • [Complete] [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Complete] [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Complete] [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [Complete] [System/Security] Operating system patch installs
  • [Complete] [System/Security] Endpoint Protection Updates
  • [Complete] [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [Complete] [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [Complete] [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] Hive scheduler outage

Posted on Friday, 23 July, 2021

[Update 4:40 PM 7/23/21]

After continued investigation, cleaning up the scheduler logs, and rebooting the scheduler node, we have restored the Hive scheduler to full functionality. Jobs that have been submitted and queued are now running, and there was no interruption to running jobs. New jobs submitted at this time should start as space becomes available, as usual. Thank you for your patience as we investigated this situation.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 1:35 PM 7/23/21]

The Hive scheduler has been experiencing intermittent outages over the last few days while under heavy load, and jobs have been unable to start for nearly all of today (Friday). You may find that jobs you have submitted to Hive remain queued and do not start. We are actively working to restore functionality and will update you as more information becomes available. Thank you for your patience as we investigate this situation.

Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix storage issue

Posted on Thursday, 22 July, 2021

[Update 2:05 PM 7/22/21]

The controller reboot is complete, and we believe no disruption occurred in access to Phoenix storage. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 12:55 PM 7/22/21]

In coordination with our support vendor, we are working to resolve an issue with a Phoenix Lustre metadata controller, which supports both project and scratch storage. At 1:30 PM today, we will reboot one of the controllers. We do not expect any impact to users, as the other controller is running without error at this time. Should there be any unexpected impact, we will work to restore full functionality as quickly as possible. We will provide an update when this work is complete.

Please contact us at pace-support@oit.gatech.edu with any questions.

Hive Project Storage Quota Update

Posted on Thursday, 22 July, 2021

In coordination with the Hive PIs, PACE has updated our quota policies for project storage on Hive, in order to facilitate easier access to available storage capacity for our users. For project storage, accessed via the “data” symbolic link in your home directory, block quotas are now shared by an entire research group, rather than being set at the user level. All users in a single PI’s storage allocation have access to the entire quota, which brings Hive in line with Phoenix’s quota arrangement. Most research groups have 50 TB of project storage on Hive, with the exception of those specifically provided with a higher allocation in the NSF grant funding the cluster. Each user maintains a limit of 2 million files within their research group’s project storage.
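The quota arrangement above can be sketched in Python as follows: block usage is summed across the whole research group and compared with the shared quota, while the file limit is checked per user. The function name, the input layout, and the use of decimal terabytes are assumptions for illustration only; this is not how the “pace-quota” command is implemented.

```python
TB = 10**12  # assumption: decimal terabytes

GROUP_BLOCK_QUOTA = 50 * TB   # typical group allocation stated above
USER_FILE_LIMIT = 2_000_000   # per-user file limit stated above


def check_project_storage(usage_by_user: dict[str, tuple[int, int]]) -> list[str]:
    """Hypothetical check. usage_by_user maps a username to a pair of
    (bytes_used, file_count) within the group's project storage."""
    problems = []
    # Block usage counts against one quota shared by the whole group.
    total_bytes = sum(used for used, _ in usage_by_user.values())
    if total_bytes > GROUP_BLOCK_QUOTA:
        problems.append("group block quota exceeded")
    # The file limit still applies to each user individually.
    for user, (_, file_count) in sorted(usage_by_user.items()):
        if file_count > USER_FILE_LIMIT:
            problems.append(f"{user}: file limit exceeded")
    return problems
```

Under this model, one user can consume most of the group's block quota without breaking any per-user rule, which is the practical effect of sharing the quota across the group.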

You can review your storage usage on Hive by running the updated “pace-quota” command on any Hive node. Quotas for home (5 GB per user) and scratch (7 TB per user) directories are unchanged. Please visit our documentation for more details about Hive storage.

Please contact us at pace-support@oit.gatech.edu with any questions about using Hive.

[RESOLVED] Phoenix Scratch Outage

Posted on Monday, 19 July, 2021

Starting around 4 PM Sunday, the Phoenix scratch filesystem became non-responsive, causing issues with access to files and directories stored in ~/scratch. Functionality was restored Monday morning, and at this time, all systems are performing as expected. If you were running jobs that utilized scratch storage during this outage, they may have been negatively impacted; please reach out to pace-support@oit.gatech.edu with the job IDs of any such jobs.

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period: August 11-13, 2021

Posted on Tuesday, 13 July, 2021

[Update – 08/13/2021 – 10:00AM]

Dear PACE Users,

Our scheduled maintenance has completed ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/05/2021.

Here is an update on the tasks performed during this maintenance period.

ITEMS REQUIRING USER ACTION:

  • None.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete] [Datacenter] Databank will replace components of one of the transformers feeding the room, which requires a complete power-off of the research hall that includes the PACE-managed clusters.
  • [Complete] [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Complete] [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Complete] [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [Complete] [System/Security] Operating system patch installs
  • [Complete] [System/Security] Endpoint Protection Updates
  • [Complete] [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [Complete] [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [Complete] [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[Original Message – 07/13/2021 – 4:15PM, updated on August 4, 2021 with the list of tasks]

Dear PACE Users,

This is another friendly reminder that our next Maintenance Period is scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and is tentatively scheduled to conclude by 11:59PM on Friday, 08/13/2021. Please note that, as usual, the scheduler will hold jobs with resource requests that would be running during the Maintenance Period until after the Maintenance Period. During the Maintenance Period, all PACE-managed computational and storage resources will be unavailable.

Please see the list of activities to be completed:

ITEMS REQUIRING USER ACTION:

  • Currently, none.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will replace components of one of the transformers feeding the room, which requires a complete power-off of the research hall that includes the PACE-managed clusters.
  • [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [System/Security] Operating system patch installs
  • [System/Security] Endpoint Protection Updates
  • [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team