PACE: A Partnership for an Advanced Computing Environment

July 23, 2021

[Resolved] Hive scheduler outage

Filed under: Uncategorized — Michael Weiner @ 1:35 pm

[Update 4:40 PM 7/23/21]

After continued investigation, cleaning up the scheduler logs, and rebooting the scheduler node, we have restored the Hive scheduler to full functionality. Jobs that have been submitted and queued are now running, and there was no interruption to running jobs. New jobs submitted at this time should start as space becomes available, as usual. Thank you for your patience as we investigated this situation.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 1:35 PM 7/23/21]

The Hive scheduler has been experiencing intermittent outages over the last few days while under heavy load, and jobs have been unable to start for nearly all of today (Friday). You may find that jobs you have submitted to Hive remain queued and do not start. We are actively working to restore functionality and will update you as more information becomes available. Thank you for your patience as we investigate this situation.

Please contact us at pace-support@oit.gatech.edu with any questions.

July 22, 2021

Phoenix storage issue

Filed under: Uncategorized — Michael Weiner @ 12:54 pm

[Update 2:05 PM 7/22/21]

The controller reboot is complete, and we believe no disruption occurred in access to Phoenix storage. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 12:55 PM 7/22/21]

In coordination with our support vendor, we are working to resolve an issue with a Phoenix Lustre metadata controller, which supports both project and scratch storage.

At 1:30 PM today, we will reboot one of the controllers. We do not expect any impact to users, as the other controller is running without error at this time. Should there be any unexpected impact, we will work to restore full functionality as quickly as possible. We will provide an update when this work is complete.

Please contact us at pace-support@oit.gatech.edu with any questions.

Hive Project Storage Quota Update

Filed under: Uncategorized — Michael Weiner @ 9:57 am

In coordination with the Hive PIs, PACE has updated our quota policies for project storage on Hive, in order to facilitate easier access to available storage capacity for our users. For project storage, accessed via the “data” symbolic link in your home directory, block quotas are now shared by an entire research group, rather than being set at the user level. All users in a single PI’s storage allocation have access to the entire quota, which brings Hive in line with Phoenix’s quota arrangement. Most research groups have 50 TB of project storage on Hive, with the exception of those specifically provided with a higher allocation in the NSF grant funding the cluster. Each user maintains a limit of 2 million files within their research group’s project storage.

You can review your storage usage on Hive by running the updated “pace-quota” command on any Hive node. Quotas for home (5 GB per user) and scratch (7 TB per user) directories are unchanged. Please visit our documentation for more details about Hive storage.
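As a rough illustration of how the shared block quota and the per-user file limit combine, the sketch below checks a hypothetical research group against the figures stated above (50 TB per group, 2 million files per user). The usernames, usage numbers, and the check_group function are invented for illustration and are not output from pace-quota, which remains the way to see your actual usage.

```python
# Illustrative only: the limits come from the announcement above; the usage
# figures and all names are hypothetical, not output from pace-quota.
GROUP_BLOCK_QUOTA_TB = 50        # typical shared project quota per research group
USER_FILE_LIMIT = 2_000_000      # per-user file limit within project storage


def check_group(usage):
    """usage maps username -> (terabytes used, number of files)."""
    total_tb = sum(tb for tb, _ in usage.values())
    over_files = sorted(
        user for user, (_, files) in usage.items() if files > USER_FILE_LIMIT
    )
    return {
        "total_tb": total_tb,
        "remaining_tb": GROUP_BLOCK_QUOTA_TB - total_tb,
        "block_quota_exceeded": total_tb > GROUP_BLOCK_QUOTA_TB,
        "users_over_file_limit": over_files,
    }


example = {"user_a": (30.0, 150_000), "user_b": (12.5, 2_100_000)}
report = check_group(example)
print(report)
```

In this made-up case the group as a whole is within its shared block quota, while one user has exceeded the individual file limit.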

Please contact us at pace-support@oit.gatech.edu with any questions about using Hive.

July 19, 2021

[RESOLVED] Phoenix Scratch Outage

Filed under: Uncategorized — Aaron Jezghani @ 11:12 am

Starting around 4 PM Sunday, the Phoenix scratch filesystem became non-responsive, causing issues with access to files and directories stored in ~/scratch. Functionality was restored Monday morning, and at this time, all systems are performing as expected. If you were running jobs that used scratch storage during this outage, they may have been affected; please send the IDs of any such jobs to pace-support@oit.gatech.edu.

July 13, 2021

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period: August 11-13, 2021

Filed under: Uncategorized — Semir Sarajlic @ 4:15 pm

[Update – 08/13/2021 – 10:00AM]

Dear PACE Users,

Our scheduled maintenance has completed ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 11/03/2021, and to conclude by 11:59PM on Friday, 11/05/2021.

Here is an update on the tasks performed during this maintenance period.

ITEMS REQUIRING USER ACTION:

  • None.

ITEMS NOT REQUIRING USER ACTION:

  • [Complete] [Datacenter] Databank will need to replace components of one of the transformers feeding the room, which will require a complete power-off of the research hall that includes the PACE-managed clusters.
  • [Complete] [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Complete] [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Complete] [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [Complete] [System/Security] Operating system patch installs
  • [Complete] [System/Security] Endpoint Protection Updates
  • [Complete] [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [Complete] [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [Complete] [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Original Message – 07/13/2021 – 4:15PM, updated on August 4, 2021 with the list of tasks]

Dear PACE Users,

This is another friendly reminder that our next Maintenance Period is scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and is tentatively scheduled to conclude by 11:59PM on Friday, 08/13/2021. Please note that, as usual, the scheduler will hold jobs whose resource requests would have them running during the Maintenance Period until after the Maintenance Period ends. During the Maintenance Period, all PACE-managed computational and storage resources will be unavailable.
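To make the hold rule concrete, here is a minimal sketch of the overlap check, assuming a job could start the moment it is submitted and that only its requested walltime matters. The would_be_held function is hypothetical and simplifies whatever logic the scheduler actually applies.

```python
from datetime import datetime, timedelta

# Start of the Maintenance Period, taken from the announcement above.
MAINTENANCE_START = datetime(2021, 8, 11, 6, 0)


def would_be_held(submit_time, walltime):
    """Assumes the job could start immediately; returns True if its requested
    walltime would extend past the start of the Maintenance Period."""
    return submit_time + walltime > MAINTENANCE_START


# Both hypothetical jobs are submitted at noon the day before maintenance.
short_job = would_be_held(datetime(2021, 8, 10, 12, 0), timedelta(hours=12))
long_job = would_be_held(datetime(2021, 8, 10, 12, 0), timedelta(hours=24))
print(short_job, long_job)
```

Under these assumptions, the 12-hour request would finish before maintenance begins, while the 24-hour request would overlap it and be held.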

Please see the list of activities to be completed:

ITEMS REQUIRING USER ACTION:

  • Currently, none.

ITEMS NOT REQUIRING USER ACTION:

  • [Datacenter] Databank will need to replace components of one of the transformers feeding the room, which will require a complete power-off of the research hall that includes the PACE-managed clusters.
  • [Storage] Upgrade controller for the storage appliances: SFA200NV, SFA18KE
  • [Storage] Replace a miniSAS cable on Hive storage appliance: SFA14KXE
  • [Storage] Replace a failed hard drive on a pre-production OSG cluster
  • [System/Security] Operating system patch installs
  • [System/Security] Endpoint Protection Updates
  • [Benchmarks] Conduct IO500 and HPCG benchmarks for Hive and Phoenix clusters
  • [System] Update NVIDIA drivers and add NVIDIA-specific libraries
  • [System] Reboot scheduler nodes

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

July 12, 2021

[Resolved] OIT’s Data Warehouse Service Outage

Filed under: Uncategorized — Semir Sarajlic @ 10:13 am

[Update – July 13, 2021] 

OIT restored operation of the Data Warehouse service on July 12 at 11:22AM. Shortly after, PACE restored functionality to our database and our administrative services. OIT has continued to monitor the Data Warehouse service. At this time, all PACE user-facing utilities, such as pace-check-queue, pace-quota, and pace-whoami, are operational.

Please accept our sincere apology for any inconvenience that this temporary limitation may have caused you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Message – July 12, 2021]

Dear PACE Users,

We are reaching out to inform you that on Saturday at about 10:00am, OIT’s Enterprise Data Warehouse service experienced an outage. PACE relies on this service to host our database instance, which subsequently went down at 11:07am. The impact on PACE is mainly limited to the administrative side, with some impact on user-facing utilities such as pace-check-queue; however, there is no impact on users’ jobs or their ability to submit jobs.

What’s happening and what we are doing: Currently, OIT is investigating the outage impacting the Data Warehouse service that occurred on Saturday, and this outage is tracked on OIT’s status page. PACE is monitoring this development closely.

How does this impact me: Because of this Data Warehouse service outage, user-facing utilities such as pace-check-queue, pace-quota, and pace-whoami are partially or fully nonfunctional. In addition, until the Data Warehouse service is restored, PACE will be unable to process new user and PI account requests.

What we will continue to do: The PACE team will continue to monitor this development, and we will report as needed.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

July 9, 2021

OIT NetApp upgrade

Filed under: Uncategorized — Michael Weiner @ 8:59 am

A low-risk upgrade is planned for Georgia Tech OIT’s NetApp storage appliances, beginning Saturday, July 10, at 6:00 AM. We do not expect any impact on PACE systems from this upgrade.

OIT’s NetApp appliance is in use on PACE’s Phoenix, PACE-ICE, and COC-ICE clusters. It hosts home directories as well as pace-apps, our software module repository. Should there be an unexpected disruption, users may face issues with logins, access to home directories, and loading or using PACE-supported software modules. We will provide updates in the unlikely event of a disruption this weekend.

Please contact us at pace-support@oit.gatech.edu with any questions.
