
PACE maintenance – complete

Posted on Wednesday, 17 July, 2013

We’ve finished.  Feel free to log in and compute.  Previously submitted jobs are running in the queues.  As always, if you see odd issues, please send a note to pace-support@oit.gatech.edu.

We were able to complete our transition to the database-driven configuration, and apply the Panasas code upgrade.  Some of you will be seeing warning messages stemming from your utilization of the scratch space.  Please remember that this is a shared, and limited, resource.  The RHEL5 side of the FoRCE cluster was also retired, and reincorporated into the RHEL6 side.

We were able to complete some of the network redundancy work, but this took substantially longer than planned and we didn’t get as far as we would have liked.  We’ll complete the rest during future maintenance window(s).

We spent a lot of time today trying to address the storage problems, but time was just too short to fully implement the updates.  We were able to do some work to address the storage for the virtual machine infrastructure (you’ll notice this as the head/login nodes).  Over the coming days and weeks, we will work on a robust way to deploy these updates to our storage servers and come up with a more feasible implementation schedule.

Among the less time-consuming items, we increased the amount of memory the InfiniBand cards are able to allocate.  This should help those of you with codes that send very large messages.  We also increased the size of the /nv/pz2 filesystem; for those of you on the Athena cluster, that filesystem is now nearly 150TB.  We found some InfiniBand cards that had outdated firmware and brought those into line with what is in use elsewhere in PACE.  We also added a significant amount of capacity to one of our backup servers, added some redundant links to our InfiniBand fabric, and added some additional 10-gigabit ports for our growing server and storage infrastructure.

In all of this, we have been reminded that PACE has grown quite a lot over the last few years, from only a few thousand cores to upwards of 25,000.  As we’ve grown, it’s become more difficult to complete our maintenance in four days a year.  Part of our post-mortem discussions will be about ways to use our maintenance time more efficiently, and possibly about increasing the amount of scheduled downtime.  If you have thoughts along these lines, I’d really appreciate hearing from you.

Thanks,

Neil Bright
