PACE A Partnership for an Advanced Computing Environment

January 28, 2019

[Resolved] Networking (InfiniBand) problems

Filed under: Uncategorized — Semir Sarajlic @ 4:01 pm

[Resolved, January 28] One of our main Mellanox IB switches partially went down on Sunday morning, leaving a large number of compute nodes without access to the IB interconnect. Our system engineers resolved the problem at about 9:41am, and the IB switch is back online. As far as we know, the following queues were impacted: athena-intel, atlantis, atlas-6-sunge, atlas-intel, force-6, joe-intel, joe-test, novazohar, pace-devel, swarm, and zohar. We advise that you review your jobs from this weekend, as well as any current jobs, because this incident may have interrupted them. If your jobs failed with MPI errors or because files could not be written to /scratch/ or /data/[Your_Files], please resubmit them.
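If you have many job output files to review, a small script can help narrow down which ones to look at first. The following is only a minimal sketch, not an official PACE tool: the function name `find_suspect_logs`, the `*.out` file pattern, and the list of error strings are all assumptions chosen for illustration, and you should adjust them to match the output your own jobs actually produce.

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only (an assumption, not an exhaustive list):
# typical MPI failure markers and common I/O error messages.
SUSPECT_PATTERNS = re.compile(
    r"MPI_ABORT|MPI_ERR|Input/output error|Stale file handle",
    re.IGNORECASE,
)


def find_suspect_logs(log_dir, glob="*.out"):
    """Return {file path: [matching lines]} for files containing a suspect pattern."""
    hits = {}
    for path in sorted(Path(log_dir).glob(glob)):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as fh:
            matches = [line.rstrip("\n") for line in fh if SUSPECT_PATTERNS.search(line)]
        if matches:
            hits[str(path)] = matches
    return hits


if __name__ == "__main__":
    # Usage: python find_suspect_logs.py /path/to/job/output/directory
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, lines in find_suspect_logs(directory).items():
        print(path)
        for line in lines[:3]:  # show at most three matching lines per file
            print("    " + line)
```

A file flagged by this script is only a candidate for resubmission; please check the job itself before resubmitting it.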

We will continue to monitor this switch and post updates as needed. If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you, and sorry for this inconvenience.

January 18, 2019

PACE quarterly maintenance – (Feb 15-16, 2019)

Filed under: Uncategorized — Semir Sarajlic @ 11:48 pm

[Update – 02/11/2019] Our updated quarterly scheduled maintenance task list will include the following:

Compute

  • (no user action needed) Vendor will replace defective components on groups of servers

Network

  • (no user action needed) Ethernet network reconfiguration

Storage

  • (no user action needed) GPFS / DDN enclosure reset
  • (no user action needed) NAS maintenance and reconfiguration

Other

  • (no user action needed) PACE VMware reconfiguration to remove out-of-support hosts

 

[Original Post – 01/18/2019] We are preparing for a short maintenance period starting on February 15, 2019. Unlike our regular maintenance, which starts on a Thursday and takes three days, this one will start on a Friday and take only two days.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

In general, we’ll perform maintenance on the GPFS storage, migrate some virtual machines to new servers, perform hardware changes on one of the clusters, and finalize the migration of “/usr/local”, which is a network-attached mount point on all machines, to a more reliable storage pool.

We are still finalizing the task list and details, but none of these tasks is expected to require any user action.

We’ll update this post as we have more details.

 

 

January 3, 2019

Changes to mount points (no user impact expected)

Filed under: Uncategorized — Semir Sarajlic @ 10:24 pm

The results of the investigation into the system failures that temporarily rendered the scientific repository unresponsive (https://blog.pace.gatech.edu/?p=6390) call for some additional maintenance. To facilitate this maintenance, we will change the mount point for /usr/local, which is network mounted and identical on all compute nodes.

Our tests indicate that this swap can be performed live, without impacting running jobs. It’s also completely transparent to users; you don’t need to change or do anything as a result.

In the unlikely event of job crashes that you suspect are caused by this operation, please contact pace-support@oit.gatech.edu and we’ll be happy to assist.

Thank you,
PACE Team
