Update (Aug 10, 2018, 8:00pm): Our Aug 2018 maintenance is complete, one day ahead of schedule. All of the tasks were completed as planned. We have brought the compute nodes back online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes that we will address over the coming days.
The next PACE maintenance will start on 8/9 (Thu) and may take up to 3 days to complete, as scheduled.
As usual, jobs with long walltimes will be held by the scheduler to ensure that no jobs are still running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give your jobs enough time to finish successfully, you can reduce the requested walltime so they complete before 6am on 8/9 and resubmit them; see the example below.
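For example, here is a minimal sketch assuming a Torque/Moab-style qsub submission as used on PACE clusters, with a hypothetical job script named myjob.pbs; the 12-hour value is only an illustration and should be replaced with whatever your job actually needs to finish before 6am on 8/9:

    # Option 1: shorten the walltime request inside the job script
    #PBS -l walltime=12:00:00

    # Option 2: override the walltime at submission time
    qsub -l walltime=12:00:00 myjob.pbs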
Planned Tasks
Headnodes
- (some user action needed) Most PACE headnodes (login nodes) are currently virtual machines (VMs) with sub-optimal storage performance, which is often the cause of the slowness users experience.
We are in the process of replacing these VMs with more capable physical servers. After the maintenance day, login attempts to the old VMs will be rejected with a message that tells you which hostname to use instead. In addition, we are in the process of sending each user a customized email with a list of old and new login nodes. Please don’t forget to configure your SSH clients to use the new hostnames.
In short, “login-s.pace.gatech.edu” will be used for all shared clusters and “login-d.pace.gatech.edu” for dedicated clusters. Once you log in, you will be redirected automatically to one of several physical nodes (e.g. login-s1, login-d2, …) depending on their current load; see the example SSH configuration at the end of this section.
There will be no changes to clusters that already come with a dedicated (and physical) login node (e.g. gryphon, asdl, ligo, etc.).
- (some user action needed) As some users have already noticed, users can no longer edit cron jobs (e.g. crontab -e) on the headnodes. This is on purpose, because access to the new login nodes (login-d and login-s) is dynamically routed to different servers depending on their load. This means you may not see the cron jobs you have installed the next time you log in to one of these nodes. For this reason, only PACE admins can install cron jobs on behalf of users, to ensure consistency (only login-d1 and login-s1 will be used for cron jobs). If you need to add (or edit) cron jobs, please contact pace-support@oit.gatech.edu. If you already have user cron jobs set up on one of the decommissioned VMs, they will be moved over to login-d1 or login-s1 during the maintenance so they will continue to run.
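For reference, below is a minimal ~/.ssh/config sketch for an OpenSSH client; “gtuser123” is a placeholder for your own GT account name, and the host aliases (pace-shared, pace-dedicated) are only examples you can rename as you like:

    Host pace-shared
        HostName login-s.pace.gatech.edu
        User gtuser123

    Host pace-dedicated
        HostName login-d.pace.gatech.edu
        User gtuser123

With this in place, “ssh pace-shared” (or “ssh pace-dedicated”) will connect you to the appropriate new login address, and you will then be routed to one of the physical nodes based on load.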
Storage
- (no user action needed) Add a dedicated protocol node to the GPFS system to increase capacity and improve response time for non-InfiniBand connected systems. This system will gradually replace the IB gateway systems that are currently in operation.
- (no user action needed) Replace batteries in the DDN/GPFS storage controllers
Network
- (no user action needed) Upgrade the DNS appliances in both PACE datacenters
- (no user action needed) Add redundant storage links to specific clusters
Other
- (no user action needed) Perform network upgrades
- (no user action needed) Replace devices that are out of support