PACE A Partnership for an Advanced Computing Environment

July 23, 2019

[Complete] PACE Quarterly Maintenance – August 8-10

Filed under: Uncategorized — Semir Sarajlic @ 8:03 pm

[August 9, 2019 Update] Our August 2019 maintenance (https://blog.pace.gatech.edu/?p=6511) is complete one day ahead of schedule! We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available.

[August 2, 2019 Update]

ITEMS REQUIRING NO USER ACTION:

  • Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. The VAPOR network will not be affected.
  • Additional space will be configured for the license server.
  • OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • The submit filter for jobs on the RHEL 6 clusters will be modified to allow proper formatting of commands. This filter is already in place on RHEL 7 clusters.
  • DNS appliances will be upgraded; no downtime is expected because of their redundant configuration.

Please send questions or comments to pace-support@oit.gatech.edu.

[July 23, 2019] We are preparing for our next maintenance period, which is planned to last three days. It will start on Thursday, August 8, 2019, and run through Saturday, August 10, 2019.

As usual, jobs with walltimes long enough to run into the maintenance period will be held by the scheduler to ensure that no jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

In general, we will be upgrading all of the RHEL 7 production nodes to the latest 7.6 kernel, updating connections to and from the PACE routers, and adding disk capacity to our license server. While we are still finalizing the task list and details, none of these tasks is expected to require any user action.

July 12, 2019

[Resolved] Campus-Wide Intermittent Network Outage Impacting PACE

Filed under: Uncategorized — Semir Sarajlic @ 9:19 pm

Today at around 1:55pm, OIT reported campus-wide intermittent network slowness after one of the DNS servers went down, causing trouble with authentication, GRS, and more. OIT resolved this issue as of 4:12pm, and we have recovered the storage that exports home directories, which was affected by this issue. The storage problem made home directories temporarily unavailable, with symptoms such as hanging codes, commands, and login attempts.

We believe that most jobs resumed operation after the issue was resolved, but we cannot be sure. Please check whether any of your jobs have crashed, and report any issues to pace-support@oit.gatech.edu.

For details on the OIT issue reported, please visit their link 

Thank you for your attention to this, and apologies for the inconvenience.

July 11, 2019

[Resolved] Dedicated Scheduler – Job Submissions Paused

Filed under: Uncategorized — Semir Sarajlic @ 2:27 pm

[Update – July 11, 2019 – 2:45pm] The dedicated scheduler is back online and operational. We have corrected the incorrect associations between nodes and queues, which resulted from a faulty configuration, and we have taken measures to correct our automated procedure to prevent such an incident in the future. We have removed the pause on job submission, so you may now resume submitting your jobs. Please check on any jobs that you submitted since 3:30pm yesterday (7/10/2019), as many of those jobs were terminated.

Again, apologies for the inconvenience this has caused.

[Original Post – July 11, 2019 – 10:38am] Today, at approximately 10:10am, we paused job submissions to queues that are managed by the dedicated scheduler. Research teams will not be able to submit new jobs to the following queues: kennedy-lab, granulous, atlas-dufek, chow, athena-debug, cochlea, atlas-6, njord-6, atlantis, jabberwocky6, megatron, acceptance, hadoop, aces, drive, complexity, corso, blue, monkeys-k33, athena-6, core, ase1-debug-6, microbio-1, radius, medprint-6, monkeys_gpu, pampa-6, monkeys, keeneland, athena-intel, atlas-intel, apurimac-bg-6, staml, ofed-test, semap-6, martini, skade, tmlhpc-6, atlas-debug, wohler, rozell, mps, prv-5-6, aryabhata-6, hadean-gpu, epictetus, neutrons-6, davenporter, atlas, athena-8core, uranus-6, hadean, ase1-6, atlas-simon, enterprise, pampa-debug-6, skadi

We have taken this action to resolve an issue that we have been experiencing since the evening of July 10, in which jobs were erroneously terminated after failing to reach their appropriate nodes. Pausing job submission also prevents any new jobs from being terminated. We are working to resolve this issue as quickly as possible; in the meantime, we ask that you refrain from submitting jobs to the queues listed above. We will follow up with an update as we work through this issue. Thank you for your attention to this, and we are sorry for the inconvenience.
