PACE – A Partnership for an Advanced Computing Environment

February 20, 2020

[Resolved] Scratch filesystem issue

Filed under: Uncategorized — Michael Weiner @ 9:31 pm

[Update 2/20/20 4:40 PM]

Use of the scratch filesystem is restored. It appears that the automated migration task did run but could not keep up with the rate of scratch usage. We will monitor scratch for recurrence of this issue.

Please check any running jobs for errors and resubmit if necessary.
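
For users who would like a quick way to scan for such errors, here is a minimal sketch (in Python) that looks through job error logs for messages typical of a full filesystem. The log directory and file pattern are assumptions, not PACE conventions; adjust them to wherever your scheduler writes job stderr files.

    #!/usr/bin/env python3
    """Rough scan of job error logs for scratch write failures.

    Assumptions: job stderr files match "*.e*" and live under LOG_DIR;
    both are placeholders, not PACE conventions.
    """
    import glob
    import os

    LOG_DIR = os.path.expanduser("~/jobs")  # hypothetical log location
    ERROR_MARKERS = ("No space left on device", "Disk quota exceeded")

    for log in glob.glob(os.path.join(LOG_DIR, "*.e*")):
        with open(log, errors="replace") as fh:
            text = fh.read()
        if any(marker in text for marker in ERROR_MARKERS):
            print(f"{log}: possible scratch write failure; consider resubmitting")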


[Original message 2/20/20 4:30 PM]

Shortly before 4 PM, we noticed that PACE’s mounted GPFS scratch filesystem (~/scratch) is experiencing an issue that is preventing users from writing to their scratch directories. Any running jobs that write to scratch may experience failures due to write errors.

The scratch filesystem writes first to SSDs, and an automated task migrates data to another location when those drives near capacity. This task did not run as expected, causing users to receive errors indicating that scratch was full. We have started the migration manually and will update this post when scratch is available again.
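
For readers curious about the mechanism, the sketch below illustrates the general idea of a threshold-driven migration task: once the fast tier passes a high-water mark, the oldest files are moved to a capacity tier. The paths, threshold, and policy here are illustrative assumptions, not PACE's actual implementation.

    #!/usr/bin/env python3
    """Illustrative threshold-driven migration from a fast tier to a capacity tier.

    Conceptual sketch only: paths, threshold, and policy are assumptions,
    not PACE's production migration task.
    """
    import os
    import shutil

    FAST_TIER = "/mnt/ssd-scratch"       # hypothetical SSD tier
    CAPACITY_TIER = "/mnt/bulk-scratch"  # hypothetical capacity tier
    HIGH_WATER = 0.90                    # begin migrating above 90% full

    def usage_fraction(path: str) -> float:
        st = os.statvfs(path)
        return 1.0 - st.f_bavail / st.f_blocks

    def files_oldest_first(path: str):
        """Yield regular files under `path`, least recently accessed first."""
        entries = []
        for root, _dirs, names in os.walk(path):
            for name in names:
                full = os.path.join(root, name)
                entries.append((os.stat(full).st_atime, full))
        for _atime, full in sorted(entries):
            yield full

    for src in files_oldest_first(FAST_TIER):
        if usage_fraction(FAST_TIER) < HIGH_WATER:
            break  # fast tier has drained below the high-water mark
        dst = os.path.join(CAPACITY_TIER, os.path.relpath(src, FAST_TIER))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.move(src, dst)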

We apologize for this disruption. Please contact us at pace-support@oit.gatech.edu with any concerns.

February 19, 2020

[Resolved] Rich InfiniBand Switch Power Failure

Filed under: Uncategorized — Aaron Jezghani @ 3:10 pm

This morning, we discovered a power failure in an InfiniBand switch in the Rich Datacenter that resulted in GPFS mount failures on a number of compute resources. Power was restored at 9:10am, and connectivity across the switch has been confirmed. However, prior to the fix, jobs may have experienced problems (including failure to produce results or exiting with errors) due to GPFS access time-outs. Please review the status of any jobs run recently by checking the output/error logs or, for jobs still running, the timestamps of output files for any discrepancies. If an issue appears (e.g., previously successful code exceeded its wallclock limit with no output, or file creation occurred much later than the job started), please resubmit the job.
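
One low-tech way to review those timestamps is to list the modification times of a job's output files and compare them against the job's start time. A minimal sketch follows; the output directory is a placeholder you would replace, and this is not a PACE-provided tool.

    #!/usr/bin/env python3
    """List modification times of a job's output files for manual review.

    Assumption: OUTPUT_DIR is a placeholder for the job's output location;
    compare the printed times against the job's start time.
    """
    from datetime import datetime
    from pathlib import Path

    OUTPUT_DIR = Path("~/scratch/myjob").expanduser()  # hypothetical output directory

    for path in sorted(OUTPUT_DIR.rglob("*")):
        if path.is_file():
            mtime = datetime.fromtimestamp(path.stat().st_mtime)
            print(f"{mtime:%Y-%m-%d %H:%M:%S}  {path}")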

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

February 14, 2020

RESOLVED [Hive and Testflight-CODA Clusters] Connectivity Issue to All CODA Resources

Filed under: Uncategorized — Aaron Jezghani @ 4:15 pm

RESOLVED [1:44 PM]:

The network engineers report that they have fixed the issue and are continuing to monitor it, although the cause remains unknown. Jobs appear to have continued uninterrupted on the Hive and Testflight-CODA clusters, but we encourage users to verify their jobs.
https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5e46cb01fa0e5304bc04ecb5
Any residual issues should be reported to pace-support@oit.gatech.edu. Thank you.

UPDATE [11:33 AM]:

Georgia Tech IT is aware of the situation and is investigating as well.

Original Message:

Around 11:00 AM, we noticed that we could not connect to any resources housed in CODA, including the Hive and Testflight-CODA clusters. The source of the problem is being investigated; in the meantime, access to these resources will be disrupted. In theory, jobs already running on these clusters should continue to run. Further details will be provided as they become available.

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu. Thank you.

February 13, 2020

[COMPLETED] PACE Maintenance – February 27-29

Filed under: Uncategorized — Aaron Jezghani @ 9:06 pm

[COMPLETED – 6:51 PM 2/28/2020]

We are pleased to announce that our February 2020 maintenance period (https://blog.pace.gatech.edu/?p=6676) has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible. 

As usual, there are a small number of straggling nodes that will require additional intervention.  

A summary of the changes and actions accomplished during this maintenance period is as follows:

  • (Completed) RHEL7 clusters received critical patches
  • (Completed) Updates were made to PACE databases and configurations
  • (Deferred) [Hive, Testflight-CODA clusters] Power down of the entire research hall for Georgia Power reconnections
  • (Completed) [Hive cluster] Replaced failed InfiniBand leaf on EDR switch
  • (Completed) [Hive cluster] InfiniBand subnet managers were reconfigured for better redundancy
  • (In Progress) [Hive and Testflight-CODA clusters] Lmod, the environment module system, is being updated to a newer version
  • (Completed) [Hive cluster] Ran OSU Benchmark test on idle resources
  • (Completed) [GPFS file system] Applied latest maintenance releases and firmware updates
  • (In Progress) [Lustre file system] Applying latest maintenance releases and firmware updates

Thank you for your patience!

[UPDATE – 8:52 AM 2/27/2020]

The PACE maintenance period is underway. For the duration of maintenance, users will be unable to access PACE resources. Once the maintenance activities are complete, we will notify users of the availability of the cluster.

Also, we have been told by Georgia Power that they expect their work may take up to 72 hours to complete; as such, the maintenance outage for the CODA research hall (Hive and Testflight-CODA clusters) will extend until 6:00 AM Monday morning. We will provide updates as they are available.

[Original Message]

We are preparing for PACE’s next maintenance period on February 27-29, 2020. The maintenance period is planned for three days, starting on Thursday, February 27, and ending Saturday, February 29. However, Georgia Power will begin work on Thursday to establish a Micro Grid power generation facility, and while that work should complete within 48 hours, any delays may extend the maintenance outage for the Hive and Testflight-CODA clusters through Sunday instead; PACE clusters in Rich will not be impacted by any delays in Georgia Power’s work. Should any issues and resulting delays occur, users will be notified accordingly.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off (see the sketch after the list below for the basic idea). These jobs will be released as soon as the maintenance activities are complete. We are still finalizing planned activities for the maintenance period. Here is the current list:

ITEM REQUIRING USER ACTION:

  • None

ITEMS NOT REQUIRING USER ACTION:

  • RHEL7 clusters will receive critical patches.
  • Updates will be made to PACE databases and configurations.
  • [Hive, Testflight-CODA clusters] Power down all of the research hall for Georgia Power reconnections
  • [Hive cluster] Replace failed InfiniBand leaf on EDR switch
  • [Hive cluster] InfiniBand subnet managers will be reconfigured for better redundancy
  • [Hive and Testflight-CODA clusters] Lmod, the environment module system, will be updated to a newer version
  • [Hive cluster] Run OSU Benchmark test on idle resources
  • [GPFS and Lustre file systems] Apply latest maintenance releases and firmware updates
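
To illustrate why long jobs are held, as mentioned above: a job can only start before maintenance if its requested walltime finishes before the outage begins. The arithmetic is simple; the maintenance start time and walltime below are examples only, and the real decision is made by the scheduler.

    #!/usr/bin/env python3
    """Check whether a requested walltime fits before a maintenance window.

    Illustrative only: the maintenance start time and walltime are examples;
    the scheduler performs the real check when deciding to hold a job.
    """
    from datetime import datetime, timedelta

    MAINTENANCE_START = datetime(2020, 2, 27, 6, 0)  # assumed start of the outage
    requested_walltime = timedelta(hours=96)         # e.g. a 4-day job submitted now

    if datetime.now() + requested_walltime > MAINTENANCE_START:
        print("Job would overlap the maintenance window; it will be held until after maintenance.")
    else:
        print("Job fits before maintenance and can start now.")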


If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

February 10, 2020

[Resolved] Shared Scheduler for Shared Clusters is Down

Filed under: Uncategorized — Semir Sarajlic @ 7:41 pm

Functionality has been restored to the shared cluster as of 3:00pm, and jobs are being ingested and run as normal.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

[Original Note – Feb 10, 2020 @ 2:40pm] The Shared Scheduler has gone down; this came to our attention at around 2:00pm. The PACE team is investigating the issue and will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on the Shared Clusters.
