Author Archive

Announcing the PACE OSG Orientation Class

Posted on Thursday, 7 October, 2021

Dear PACE Researchers, 

PACE is pleased to announce the launch of the PACE Open Science Grid (OSG) Orientation class, which introduces Georgia Tech’s research community to OSG and the distributed high-throughput computing resources that are available via OSG Connect. Join us for this virtual orientation to learn about OSG and how it may benefit your research.

The session dates and the registration form are listed below:

Dates and times: October 15, 10:30am – 12:15pm
                 November 11, 1:30pm – 3:15pm

Registration: https://b.gatech.edu/3Bi4Yie

This class is based in part on the work supported by the NSF CC* award 1925541: “Integrating Georgia Tech into the Open Science Grid for Multi-Messenger Astrophysics”. With this award, PACE, in collaboration with the Center for Relativistic Astrophysics, added CPU/GPU/Storage to the existing OSG capacity, as well as the first regional StashCache service, which benefits all OSG institutions in the Southeast region, not just Georgia Tech.

This orientation is the first step in PACE’s longer-term plans to support OSG initiatives on campus. Please be on the lookout for more exciting announcements from our team in the very near future.

We look forward to you joining us for the OSG orientation. 

Best,

The PACE Team

Hive and Phoenix Scheduler Configuration Change

Posted on Wednesday, 22 September, 2021

Dear PACE Researchers, 

We would like to announce a change to the scheduler configuration on the Phoenix and Hive clusters, which will take effect at 9:00 AM on Thursday, September 23rd. This change should improve scheduler performance given the large number of jobs executed by our users.

What will PACE be doing: PACE will reduce the retention time for job-specific logs from 24 hours to 6 hours after job completion. Reducing the amount of job information the scheduler needs to process regularly should provide a more stable and faster job submission environment. Additionally, the downtime associated with scheduler restarts should decrease, as job ingestion time will be reduced accordingly.

Who does this message impact: Going forward, qstat will no longer return information for a job more than 6 hours after it completes. In addition to the scheduler job STDOUT/STDERR files, job statistics for completed jobs on Phoenix and Hive can be queried at https://pbstools-coda.pace.gatech.edu.
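As a rough illustration of the new retention rule, the sketch below suggests where to look for information about a finished job. It is not a PACE tool: the helper name `where_to_query` is hypothetical, and it assumes the 6-hour window is measured exactly from the job's completion time. Only the 6-hour figure and the pbstools URL come from this announcement.

```python
from datetime import datetime, timedelta

# Retention window stated in this announcement (previously 24 hours).
RETENTION = timedelta(hours=6)
# Web interface named in this announcement for completed-job statistics.
PBSTOOLS_URL = "https://pbstools-coda.pace.gatech.edu"


def where_to_query(completed_at: datetime, now: datetime) -> str:
    """Hypothetical helper: suggest where to look up a completed job.

    Assumes the scheduler drops the job record once more than
    RETENTION has elapsed since the job completed.
    """
    if now - completed_at <= RETENTION:
        return "qstat"
    return PBSTOOLS_URL


finished = datetime(2021, 9, 23, 10, 0)
print(where_to_query(finished, datetime(2021, 9, 23, 15, 0)))  # 5 hours later
print(where_to_query(finished, datetime(2021, 9, 23, 17, 0)))  # 7 hours later
```

Under these assumptions, the first call falls inside the retention window and the second falls outside it, so the second lookup would go to the web interface instead of qstat.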

What PACE will continue to do: We will monitor the clusters for issues during and after the configuration change to assess any immediate impacts from the update. We will continue to assess the scheduler health to ensure a stable job submission environment. 

As always, please contact us at pace-support@oit.gatech.edu with any questions or concerns regarding this change. 

Best Regards, 
The PACE Team

PACE Maintenance Period (November 3 – 5, 2021)

Posted on Monday, 13 September, 2021

[Full announcement 10/20/21 10:30 AM]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, November 3, and end at 11:59 PM on Friday, November 5. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
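To estimate whether a job is likely to be held, you can compare its requested duration against the start of the maintenance window. The sketch below is only an approximation of that reasoning, not the scheduler's actual logic: the function name `would_be_held` is hypothetical, and it assumes the job would start the moment it is submitted.

```python
from datetime import datetime, timedelta

# Maintenance window from this announcement (cluster local time assumed).
MAINTENANCE_START = datetime(2021, 11, 3, 6, 0)
MAINTENANCE_END = datetime(2021, 11, 5, 23, 59)


def would_be_held(submit_time: datetime, walltime: timedelta) -> bool:
    """Hypothetical check: could this job still be running at 6:00 AM
    on November 3?

    Simplifying assumption: the job starts as soon as it is submitted.
    """
    if submit_time >= MAINTENANCE_END:
        return False  # submitted after maintenance; nothing to hold
    return submit_time + walltime > MAINTENANCE_START


noon_before = datetime(2021, 11, 2, 12, 0)
print(would_be_held(noon_before, timedelta(hours=12)))  # would end at midnight
print(would_be_held(noon_before, timedelta(hours=24)))  # would end at noon on Nov 3
```

In this sketch, a 12-hour request submitted at noon on November 2 finishes before the window opens, while a 24-hour request does not and would wait until maintenance is complete. Requesting a shorter duration, where your work allows it, is one way to run before the window.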

Please see below for a tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details and instructions will follow in a separate message.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
  • [System] Operating system patch installs
  • [Storage/Phoenix] Lustre controller firmware and other upgrades
  • [Storage/Phoenix] Lustre scratch upgrade and expansion
  • [System] System configuration management updates
  • [System] Updates to NVIDIA drivers and libraries
  • [System] Upgrade some PACE infrastructure nodes to RHEL 7.9
  • [System] Reorder group file
  • [Headnode/COC-ICE] Configure cgroup controls on COC-ICE headnode
  • [Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
  • [Network] Update Ethernet switch firmware
  • [Network] Update IP addresses of switches in BCDC

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

 

[Early announcement]

Dear PACE Users,

This is a friendly reminder that our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and to conclude by 11:59PM on Friday, 11/05/2021. As usual, the scheduler will hold jobs whose resource requests would have them running during the maintenance period until the maintenance period is over. During the maintenance period, access to all PACE-managed computational and storage resources will be unavailable.

As we get closer to the Maintenance Period, we will communicate the list of activities to be completed and update this blog post.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Complete] PACE is transitioning from current ticketing system FootPrints to ServiceNow

Posted on Wednesday, 1 September, 2021

[Update – September 3]

Dear PACE Users,

PACE has successfully transitioned to ServiceNow, and we have begun receiving user tickets as expected in ServiceNow.

As previously mentioned, you may continue to use the pace-support@oit.gatech.edu email address to reach PACE support. For your reference, the three links listed below are direct links to the ServiceNow forms that you may use going forward to request help, request new software for the PACE Apps software repository, and request access to the ICE cluster.

The PACE team will continue to work on the remaining support requests that are in the FootPrints system. Thank you all for your attention and patience through this transition.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu 

Best, 

The PACE Team 

 

[Original Message – September 1]

Dear PACE Users,  

We are reaching out to inform you that PACE is transitioning from our current ticketing system FootPrints to ServiceNow. 

What’s happening and what we are doing: The PACE team is transitioning from our current ticketing system, FootPrints, to ServiceNow. Starting September 3, all new PACE support requests will be processed in ServiceNow. PACE will continue to work on any existing support requests that are in FootPrints. As part of this transition, we have created two new request forms that replace our existing Software Request Form and PACE ICE Instructional Cluster Request Form.

How does this impact me: For most users the transition will be seamless; the exception is that the links to our software and ICE request forms are changing. On Friday, September 3rd, the PACE support email address, pace-support@oit.gatech.edu, will begin redirecting users’ emails/requests to ServiceNow, and the new software and ICE request form links will be available on our website. Please use the new forms if you would like to request new software for the PACE Apps software repository or if you are a course instructor interested in using PACE-ICE for your students. Users who previously submitted tickets directly via FootPrints may use ServiceNow at https://services.gatech.edu (navigate to “Technology” and then the “PACE” tile) and submit their request using the available forms.

The following direct links to ServiceNow forms will be live and available to users on September 3: 

What we will continue to do:   We will continue to work on the existing tickets that are in FootPrints, and you may check the status of this transition on this blog post.   

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu 

Best, 

The PACE Team 

Email Relay Reconfiguration that’s Impacting PACE Utilities

Posted on Friday, 27 August, 2021

Dear PACE Users,

We are reaching out to inform you that on Monday, August 30, PACE will begin reconfiguring its utilities that send messages to users. As a result, the “from” address on these messages will change to no-reply@pace.gatech.edu. These changes are required for us to comply with the Institute’s email notification requirements. We want to bring this to your attention so that you are aware of the new address from which you will receive PACE messages.

What’s happening and what we are doing: PACE will be making changes to utilities that send messages to users, which will change the email address listed in the “from” field. PACE will begin updating its utilities on Monday, August 30, and the work will continue through the coming weeks. More specifically, the following utilities will be reconfigured:

  • [Complete] Scheduler (all clusters): Emails from the scheduler with job status information will change from moabadmin@<scheduler>.pace.gatech.edu to being from no-reply@pace.gatech.edu.
  • PACE Support script (all clusters): The pace-support script is currently disabled. Previously, the script sent its message to the ticketing system as though it came from you, so that the source of the ticket was identified properly. Going forward, it will send from no-reply@pace.gatech.edu and embed your email address to set the source of the ticket. This should be transparent to you.
  • [Complete] PI and Department CSR Monthly statements for Phoenix and Firebird clusters: These will change from having a pace-support@oit.gatech.edu from address to being from no-reply@pace.gatech.edu, with a reply-to of pace-support@oit.gatech.edu.
  • Security/system information (all clusters): Security violations and general system mail will be sent from no-reply@pace.gatech.edu. This includes mail sent using the mail commands. System mail will be redirected to your email account as identified in GT systems. As a result, you may receive mail messages that were previously left on the system in an undeliverable state.
  • Head node violation messages (all clusters): The “from” address for these messages will change from pace-support@oit.gatech.edu to no-reply@pace.gatech.edu, with a reply-to of pace-support@oit.gatech.edu.
  • Scratch storage deleter messages (Phoenix & Hive): The “from” address for these messages will change from pace-support@oit.gatech.edu to no-reply@pace.gatech.edu, with a reply-to of pace-support@oit.gatech.edu.
  • Reconfigure PACE servers to send via GT outgoing mail servers (all clusters): This will increase the likelihood of email messages being delivered and also not being identified as spam. This should be transparent to you, but adds email headers for signatures and changes the server that will deliver the email.
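
For readers unfamiliar with how a “from” address and a reply-to can differ, the sketch below builds a message with the two headers described in the list above, using Python's standard `email` library. It is an illustration only and does not reflect how PACE's utilities are actually implemented; the function name `build_notice` and the recipient address are hypothetical.

```python
from email.message import EmailMessage

# Addresses taken from this announcement.
FROM_ADDRESS = "no-reply@pace.gatech.edu"
REPLY_TO_ADDRESS = "pace-support@oit.gatech.edu"


def build_notice(recipient: str, subject: str, body: str) -> EmailMessage:
    """Hypothetical helper: build a notification whose replies go to
    PACE support even though it is sent from the no-reply address."""
    msg = EmailMessage()
    msg["From"] = FROM_ADDRESS
    msg["Reply-To"] = REPLY_TO_ADDRESS
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


# Placeholder recipient and text, for illustration only.
notice = build_notice("researcher@example.com", "Head node violation", "Details here.")
print(notice["From"])
print(notice["Reply-To"])
```

A mail client that honors the Reply-To header will address replies to pace-support@oit.gatech.edu, while inbox rules that match on the sender need to match no-reply@pace.gatech.edu.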

How does this impact me: All messages that you receive from PACE utilities will be addressed from no-reply@pace.gatech.edu. If you have created inbox rules for prior messages coming from PACE, please update them with this new address, no-reply@pace.gatech.edu.

What we will continue to do: In the coming weeks, PACE will work on implementing the changes listed above. You may check the status of each of the changes on this blog post.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu

Best,

The PACE Team

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period: August 11-13, 2021

Posted on Friday, 13 August, 2021

Dear PACE Users,

Our scheduled maintenance has completed ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/05/2021.

Here is an update on the tasks performed during this maintenance period.

ITEMS REQUIRING USER ACTION:

  • None.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete] [Datacenter] Databank will need to replace components of one of the transformers feeding the room that will require a complete power off for the research hall that includes the PACE managed clusters.
  • [Complete] [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Complete] [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Complete] [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [Complete] [System/Security] Operating system patch installs
  • [Complete] [System/Security] Endpoint Protection Updates
  • [Complete] [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [Complete] [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [Complete] [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period: August 11-13, 2021

Posted on Tuesday, 13 July, 2021

[Update – 08/13/2021 – 10:00AM]

Dear PACE Users,

Our scheduled maintenance has completed ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/05/2021.

Here is an update on the tasks performed during this maintenance period.

ITEMS REQUIRING USER ACTION:

  • None.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete] [Datacenter] Databank will need to replace components of one of the transformers feeding the room that will require a complete power off for the research hall that includes the PACE managed clusters.
  • [Complete] [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Complete] [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Complete] [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [Complete] [System/Security] Operating system patch installs
  • [Complete] [System/Security] Endpoint Protection Updates
  • [Complete] [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [Complete] [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [Complete] [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[Original Message – 07/13/2021 – 4:15PM that was updated on August 4, 2021 with list of tasks] 

Dear PACE Users,

This is another friendly reminder that our next maintenance period is scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and is tentatively scheduled to conclude by 11:59PM on Friday, 08/13/2021. Please note that, as usual, the scheduler will hold jobs whose resource requests would have them running during the maintenance period until the maintenance period is over. During the maintenance period, access to all PACE-managed computational and storage resources will be unavailable.

Please see the list of activities to be completed:

ITEMS REQUIRING USER ACTION:

  • Currently, none.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will need to replace components of one of the transformers feeding the room that will require a complete power off for the research hall that includes the PACE managed clusters.
  • [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [System/Security] Operating system patch installs
  • [System/Security] Endpoint Protection Updates
  • [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] OIT’s Data Warehouse Service Outage

Posted on Monday, 12 July, 2021

[Update – July 13, 2021] 

OIT restored operation of the Data Warehouse service on July 12 at 11:22AM. Shortly after, PACE restored functionality to our database and our administrative services. OIT has continued to monitor the Data Warehouse service. At this time, all PACE user-facing utilities, such as pace-check-queue, pace-quota, and pace-whoami, are operational.

Please accept our sincere apology for any inconvenience that this temporary limitation may have caused you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Message – July 12, 2021]

Dear PACE Users,

We are reaching out to inform you that on Saturday at about 10:00am, there was an outage of OIT’s Enterprise Data Warehouse service, which PACE relies on to host our database instance. Our database instance subsequently went down at 11:07am. The impact to PACE from this service outage is mainly limited to the administrative side, with some impact to user-facing utilities such as pace-check-queue; however, there is no impact to users’ jobs or their ability to submit jobs.

What’s happening and what we are doing: Currently, OIT is investigating the outage impacting the Data Warehouse service that occurred on Saturday, and this outage is tracked at OIT’s status page. PACE is monitoring this development closely.

How does this impact me: This Data Warehouse service outage leaves user-facing utilities such as pace-check-queue, pace-quota, and pace-whoami partially or fully nonfunctional. In addition, until the Data Warehouse service is restored, PACE will be unable to create new user and PI account requests.

What we will continue to do: The PACE team will continue to monitor this development, and we will report as needed.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

pace-support.sh is disabled on PACE Clusters — please email pace-support directly for inquiries

Posted on Tuesday, 29 June, 2021

Dear PACE Users,

It has come to our attention that we are not receiving support requests generated by the pace-support.sh script, which allows submission of support tickets directly from PACE clusters. Our investigation is ongoing.

At this time, please email us at pace-support@oit.gatech.edu from a non-PACE system for all support requests, to ensure that we receive your message.

From our initial investigation, it appears that this outage began at some point in May. We apologize for any lost messages since then. If you have been trying to reach us via the pace-support script, please email us instead. You should receive an automated acknowledgement email from Service Desk when your request is successfully processed.

Please contact us at pace-support@oit.gatech.edu with questions.

The PACE Team

[Urgent] Hive Cluster Storage Controller Cable Replacement – Performance Impact

Posted on Friday, 25 June, 2021

[Update – 06/25 11:40PM]

The storage controller cable on the Hive cluster was replaced this evening, and the controller was brought back online. Unfortunately, after the repairs, GPFS storage mounts became unavailable, which interrupted users’ running jobs this evening. We paused the scheduler briefly while we restarted the GPFS services across the cluster. The storage mounts have been restored, and the scheduler has been resumed.

Users’ jobs that were running or queued between about 7:00pm and 10:30pm today (6/25/2021) may have been interrupted. We recommend that users check on their jobs and resubmit them as needed. Please accept our sincerest apologies for this inconvenience.

We will continue to monitor the services and update as needed.  If you have any questions, please contact us at pace-support@oit.gatech.edu.

[Original Message – 06/25 5:12PM]

Dear Hive Users,

We are reaching out to inform you that one of our storage controllers for the Hive cluster has a bad cable that needs to be replaced to ensure optimal performance and data integrity. We have the cable on hand and are in the process of replacing it this evening, Friday 06/25/2021. This work will briefly impact storage performance, which users may experience as storage slowness, because we are routing all traffic to a secondary controller during this operation.

What’s happening and what we are doing: More specifically, PACE has observed a high failure rate among the disks in one of the enclosures attached to the storage controller with the bad cable. As a precaution, we will be shutting down the controller with the bad cable to unfail the disks and to ensure data integrity of the system. We will work on replacing the cable this evening, during which time the controller will be shut down. During this work, all storage traffic will be routed to a secondary controller that is fully operational. Given the anticipated load on the secondary controller, we expect users to experience performance degradation.

How does this impact me: With only one storage controller in operation, users may experience storage slowness. In the highly unlikely event of a problem with the secondary controller, storage could become unavailable, which would impact all users’ running jobs; however, we do not anticipate any storage outage during this operation.

What we will continue to do: The PACE team will work on the cable replacement, restore the storage to optimal operation, and update the community as needed.

Please accept our sincere apology for any inconvenience that this may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team