PACE: A Partnership for an Advanced Computing Environment

July 30, 2018

[COMPLETE] PACE quarterly maintenance – (Aug 9-11, 2018)

Filed under: Uncategorized — Semir Sarajlic @ 6:08 pm

Update (Aug 10, 2018, 8:00pm): Our Aug 2018 maintenance is complete, one day ahead of schedule. All tasks were completed as planned. We have brought the compute nodes back online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes that we will address over the coming days.

Please note the important changes regarding decommissioned login nodes, including the commonly used force-6 headnode.
Our next maintenance period is scheduled for Thursday, Nov 1 through Saturday, Nov 3, 2018.
Original message:

The next PACE maintenance will start on 8/9 (Thu) as scheduled and may take up to 3 days to complete.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no jobs are still running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give your job enough time to finish, you can resubmit it with a walltime that ends before 6am on 8/9 so it can run before the maintenance window (see the example below).
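
For example, a held job can be resubmitted with a reduced walltime request on the command line, which overrides the value in the job script. This is only a sketch: the job ID, script name, and walltime below are placeholders, not PACE-specific settings.

  # Check how much walltime your queued job currently requests
  qstat -f 123456 | grep Resource_List.walltime

  # Resubmit with a walltime short enough to finish before 6am on 8/9
  # (myjob.pbs is a placeholder script name)
  qsub -l walltime=24:00:00 myjob.pbs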

Planned Tasks

Headnodes

  • (some user action needed) Most PACE headnodes (login nodes) are currently virtual machines (VMs), whose limited resources and sub-optimal storage performance are a frequent cause of the slowness users experience.

We are in the process of replacing these VMs with more capable physical servers. After the maintenance day, login attempts to the old VMs will be rejected with a message telling you which hostname to use instead. In addition, we are sending each user a customized email listing their old and new login nodes. Please remember to configure your SSH clients to use these new hostnames (see the sketch after this list).

In short, "login-s.pace.gatech.edu" will be used for all shared clusters and "login-d.pace.gatech.edu" for dedicated clusters. Once you log in, you will automatically be redirected to one of several physical nodes (e.g. login-s1, login-d2, …) depending on their current load.

There will be no changes to clusters that already have a dedicated (and physical) login node (e.g. gryphon, asdl, ligo, etc.).

  • (some user action needed) As some users have already noticed, you can no longer edit cron jobs (e.g. with crontab -e) on the headnodes. This is intentional: access to the new login nodes (login-d and login-s) is dynamically routed to different servers depending on their load, so you may not see the cron jobs you installed the next time you log in to one of these nodes. For this reason, only PACE admins can install cron jobs on behalf of users to ensure consistency (only login-d1 and login-s1 will be used for cron jobs). If you need to add or edit cron jobs, please contact pace-support@oit.gatech.edu. If you already have user cron jobs set up on one of the decommissioned VMs, they will be moved over to login-d1 or login-s1 during the maintenance so they continue to run.
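
As a rough sketch of the SSH client change mentioned above, connecting to the new shared-cluster login node could look like the following; "gburdell3" is a placeholder username, and the exact hostname for your clusters will be in the customized email:

  # Connect directly to the new shared-cluster login node
  ssh gburdell3@login-s.pace.gatech.edu

  # Or add an entry to ~/.ssh/config so a short alias keeps working:
  Host pace-s
      HostName login-s.pace.gatech.edu
      User gburdell3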

Storage

  • (no user action needed) Add a dedicated protocol node to the GPFS system to increase capacity and improve response time for systems that are not connected via InfiniBand. This system will gradually replace the IB gateway systems that are currently in operation.
  • (no user action needed) Replace batteries in the DDN/GPFS storage controllers

Network

  • (no user action needed) Upgrades to the DNS appliances in both PACE datacenters
  • (no user action needed) Add redundant storage links to specific clusters

Other

  • (no user action needed) Perform network upgrades
  • (no user action needed) Replace devices that are out of support

July 22, 2018

[Resolved] Shared scheduler problems

Filed under: Uncategorized — Semir Sarajlic @ 5:52 am
Update (07/22/2018, 2:30am): The scheduler is back in operation after we cleared a large number of jobs submitted by a user. We'll continue to monitor the system for similar problems and work with users to normalize their workflows.
The shared scheduler has been experiencing difficulties, which appear to be due to the large number of job arrays submitted recently. We don't know the exact cause yet, but we are aware of the problem and are currently working on a resolution.
Until this issue is resolved, commands like qsub and qstat will not work, and showq will return an incomplete list of jobs.
This problem only applies to job submission and monitoring; your running and queued jobs are otherwise safe.
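
Since large job arrays appear to have triggered this, one way to keep an array from flooding the scheduler, assuming the Torque/Moab stack in use supports array slot limits, is to cap how many sub-jobs are eligible to run at once. This is only a sketch; myarray.pbs and the sizes are placeholders:

  # Submit a 500-element job array, but let at most 20 sub-jobs
  # run concurrently (the %20 slot limit)
  qsub -t 0-499%20 myarray.pbs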

July 20, 2018

The PACE Scratch storage just got faster!

Filed under: Uncategorized — Semir Sarajlic @ 6:12 pm
We have made some improvements to the scratch file system, namely by adding SSD drives for faster metadata management and data storage. We are pleased to report that this strategic allocation of a relatively small number of SSDs yielded impressive performance improvements, more than doubling write and read speeds (according to our standard benchmarks).
This work, performed under the guidance of the vendor, didn’t require any downtime and no jobs were impacted.
We hope you’ll enjoy the increased performance for faster, better research!
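
If you would like to get a rough sense of the improvement yourself, a simple sequential write test in your scratch directory is one way to do it. This is only a sketch, not our benchmark methodology, and the ~/scratch path and file size are assumptions:

  # Write a 4 GB test file to scratch and report the throughput,
  # flushing to disk so the number is honest, then clean up
  dd if=/dev/zero of=~/scratch/ddtest.bin bs=1M count=4096 conv=fdatasync
  rm ~/scratch/ddtest.bin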

