PACE A Partnership for an Advanced Computing Environment

September 18, 2020

[Resolved] Emergency Storage maintenance (GPFS/pace2) in Rich datacenter

Filed under: Uncategorized — Semir Sarajlic @ 11:27 am

[Update – 3:02pm] 

We are following up to inform you that our emergency maintenance work on the GPFS pace2 storage in the Rich datacenter was completed successfully, and at approximately 2:10pm we released the jobs on the Shared and Dedicated clusters in Rich. Please note that the GPFS pace2 file system will be slightly slower for a time while it concurrently rebuilds 7 drives. During this maintenance, we did not lose any user data, and we did not interrupt any running user jobs.

What PACE will do: PACE will continue to monitor the storage and report as needed. Thank you for your attention and patience during this brief emergency storage maintenance.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu

Thank you,
The PACE Team

 

[Original – 11:25am]

PACE will be conducting emergency maintenance work on storage in the Rich datacenter today at 1:00pm. As a precaution, we paused the schedulers at 10:15am today, so no new user jobs will start. Currently running jobs will remain running but may be subject to interruption during our emergency maintenance.

What’s about to happen: Today, starting at 1:00pm, the PACE team will need to conduct an emergency maintenance activity on our GPFS storage in the Rich datacenter, which will involve reseating the primary IO module. The storage impacted is the /data directory on GPFS pace2, which is used on the Shared and Dedicated clusters.

Who is impacted: As of 10:15am, all PACE users are unable to submit and run new jobs because the schedulers in the Rich datacenter have been paused. Currently running user jobs may be subject to interruption during the maintenance activity. If jobs are interrupted, the PACE team will follow up with the impacted users to notify them.

This emergency maintenance activity does not impact the Coda datacenter, which includes the Hive and TestFlight-Coda clusters. Also, as of 11:30am we have released the jobs for Gryphon and Novazohar that were briefly paused this morning as we assessed the situation. The Gryphon and Novazohar clusters will not be impacted by the emergency maintenance scheduled for 1:00pm.

For updates, you may refer to our blog post, which we will update as further information becomes available.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Thank you for your attention and our apologies for this inconvenience.

The PACE Team
