PACE A Partnership for an Advanced Computing Environment

April 30, 2018

PACE quarterly maintenance – (May 10-12, 2018)

Filed under: Uncategorized — Semir Sarajlic @ 9:38 pm

The next PACE maintenance will start on 5/10 (Thu) and may take up to 3 days to complete, as scheduled.

As usual, jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the systems that day. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give such a job enough time to complete successfully, you can reduce its walltime so that it finishes before 6am on 5/10 and resubmit it.
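As a rough illustration of the walltime arithmetic, the sketch below computes the longest walltime that still ends before the 6am start on 5/10. The function name `max_walltime` and the HH:MM:SS output format are assumptions made for this example, not part of any PACE tool. It also assumes the job starts running at the given time, so any time spent waiting in the queue needs to be subtracted.

```python
from datetime import datetime

# Maintenance start taken from this announcement: 6am on 5/10/2018 (local time).
MAINTENANCE_START = datetime(2018, 5, 10, 6, 0)


def max_walltime(job_start, maintenance_start=MAINTENANCE_START):
    """Return the longest walltime (HH:MM:SS) that ends by maintenance_start.

    job_start is when the job is expected to begin running (an assumption;
    queue wait time is not modeled). Returns None if no time is left.
    """
    remaining = int((maintenance_start - job_start).total_seconds())
    if remaining <= 0:
        return None
    hours, rest = divmod(remaining, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

For instance, a job expected to start at 6:30pm on 5/8 would have 35 hours and 30 minutes available, so a requested walltime at or below that value would not be held.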

We will follow up with a more detailed announcement with a list of planned maintenance tasks with their impact on users, if any. If you miss that email, you can still find all of the maintenance day related information in this post, which will be actively updated with the details and progress.

List of Planned Tasks

 

Schedulers

 

  • Job-specific temporary directories (may require user action): We have been receiving reports of nodes going offline because files left over from jobs fill up their local disks. To address this issue, we will start using a scheduler feature that creates job-specific temporary directories, which are automatically deleted after the job is complete. To support this, we created a “/scratch” folder on all nodes. Please note that this is different from the scratch directory in your home (note the difference between ‘~/scratch’ and ‘/scratch’). On nodes that have a separate (larger) HD or SSD (e.g. biocluster, dimer, etc), /scratch is located on that drive to offer more space.

Without needing any specific user action, the scheduler will create a temporary directory uniquely named after the job under /scratch. For example:

/scratch/324105.shared-sched.pace.gatech.edu

The scheduler will also set the $TMPDIR environment variable (which is normally ‘/tmp’) to point to this path.

You can make use of $TMPDIR in your scripts. For example, if you have been manually creating temporary directories under /tmp, e.g. ‘/tmp/mydir123’, please use “$TMPDIR/mydir123” from now on to ensure that this directory will be deleted after the job is complete.
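For jobs that run Python code, the same idea can be sketched as follows. This is only a minimal example: the helper name `job_scratch_dir` is hypothetical, and the fallback to /tmp when $TMPDIR is unset (for example, when testing outside of a job) is an assumption made for illustration.

```python
import os


def job_scratch_dir(name):
    """Create a working directory under $TMPDIR and return its path.

    Inside a job, $TMPDIR is expected to point to the job-specific
    directory under /scratch, so anything created here is removed with it.
    Falls back to /tmp if $TMPDIR is not set (assumption for this sketch).
    """
    base = os.environ.get("TMPDIR", "/tmp")
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path
```

Calling `job_scratch_dir("mydir123")` inside a job would then return a path under the job-specific directory instead of ‘/tmp/mydir123’. Python’s standard `tempfile` module also consults $TMPDIR when choosing its default location, so code that relies on `tempfile.mkdtemp()` should pick up the new directory without changes.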

  • ICE (instructional cluster) scheduler migration to a different server (may require user action): We’ll move the scheduler server we use for the ICE queues to a new machine that’s better suited for this service. This change will be transparent to users, and there will be no changes in the way jobs are submitted. However, jobs that are waiting in the queue will need to be resubmitted, and we’ll contact the affected users separately about that. If you are not a student using ICE clusters, you will not be affected by this task in any way.

 

Systems Maintenance

 

  • ASDL cluster (requires no user action): We’ll replace some failed CMOS batteries on several compute nodes, replace a failed CPU and add more memory on the file server.
  • Replace PDUs on Rich133 H37 Rack (requires no user action): We’ll replace PDUs on this rack. The rack holds nodes from a single dedicated cluster, so we expect no impact on other PACE users or clusters even if something goes wrong.
  • LIGO cluster rack replacement (requires no user action): We’ll replace the LIGO cluster rack with a new one with new power supplies.

 

Storage

 

  • GPFS filesystem client updates on all of the PACE compute nodes and servers (requires no user action): The new version is tested, but please contact pace-support@oit.gatech.edu if you notice any missing mounts, failing data operations or slowness issues after the maintenance day.
  • Run routine system checks on GPFS filesystems (requires no user action): As usual, we’ll run some file integrity checks to find and fix filesystem issues, if any. Some of these checks take a long time and may continue to run after the maintenance day, with a minimal impact on performance.

 

Network

 

  • IB network card firmware upgrades (requires no user action): The new version is tested, but please contact pace-support@oit.gatech.edu if you notice failing data operations or crashing MPI jobs after the maintenance day.
  • Enable 10GbE on physical headnodes (requires no user action): Physical headnodes (e.g. login-s, login-d, coc-ice, etc) will be reconfigured to use the 10GbE interface for faster networking.
  • Several improvements to networking infrastructure (requires no user action): We’ll reconfigure some of the links, add uplinks and replace fabric modules on different components of the network to improve its reliability and performance.

 

 

 
