
HPC Status Update – 4/16/2010

This entry was posted on Friday, 16 April, 2010.

Hi folks,

I’ve been a bit remiss in issuing updates recently, and I apologize for that.  We’ve made some good progress, passed some significant milestones, and had a few setbacks as well.

In an effort to improve our communication, we’ve created a blog for this project, available at http://blog.pace.gatech.edu.  Feel free to subscribe to the RSS feed if desired.  Previous status updates have been published on this site as well.  As we near the end of implementation, we’ll switch to a fixed, time-based schedule for updates, rather than communication at milestones.

Penguin has completed their portion of the project, although a couple of items of documentation were left incomplete and a few parts remain to be delivered.  All of these are non-critical.  The documentation pieces have been completed by the OIT/PACE team, and Penguin has reiterated their commitment to ship the remaining parts.  In exchange for GT accepting the purchase and issuing 100% payment, Penguin has agreed to supply an additional pair of nodes using the latest 12-core processors from AMD.  Each of these nodes has a total of 48 cores, 128GB of memory and QDR Infiniband.  These nodes will be added to the Community Cluster, and represent an approximately 10% increase in capacity for campus use.
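As a rough sanity check on those figures, two 48-core nodes amounting to roughly a 10% increase implies a Community Cluster baseline on the order of a thousand cores. This is our back-of-envelope inference from the numbers above, not an official count:

```python
# Figures stated in this update; the implied baseline is inferred, not official.
new_nodes = 2
cores_per_node = 48
added_cores = new_nodes * cores_per_node            # 2 * 48 = 96 cores added

stated_increase = 0.10                              # "approximately 10% increase"
implied_baseline = added_cores / stated_increase    # ~960 cores already in place

print(f"Added cores: {added_cores}")
print(f"Implied current Community Cluster size: ~{implied_baseline:.0f} cores")
```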

As mentioned in our previous update, our testing has uncovered a design flaw in one particular model of compute node. Most of these nodes are in the Community Cluster; other groups that have this model have been notified, and we are working to repair those as well. Our testing has determined that this issue does not affect other models in the federation. We’ve been working with Penguin and their supplier, SuperMicro, to develop, test, and implement a hardware fix, which involves removing a diode and replacing a capacitor and a resistor, all of which are affixed to the motherboard. A technician from SuperMicro has been onsite this week implementing these repairs, and we are in the process of retesting the repaired nodes.

Progress on the governance committee has been made, and we hope to have a version 1.0 of the resource allocation policy out soon.

We’ve also completed the hiring of our Systems Support Specialist III. Brian MacLeod will be joining the OIT/PACE team, effective immediately. Brian comes to us from the Architecture and Infrastructure team within OIT; as such, he is familiar with many of our methods and procedures, and we hope to get him up to speed quickly. The hiring of a Research Technologist II remains open.

We have also heard from a number of faculty (some of whom are on this list!) interested in additional purchases, and have been working through the specification, configuration, and ordering process to meet their purchasing deadlines. We also continue to gather information toward a container-based strategy for campus in the event that more space, power, or cooling for HPC is needed.

The following is a list of software we are considering for purchase. Are there others that should be on this list? Please send me recommendations for additional packages, and let me know whether any of these would be useful to you.

Currently in place:

- Moab workload scheduler: use existing site license
- RedHat Linux operating system: use existing site license
- Backups: combination of open source and OS-included software
- OS management & provisioning: in-house implementation used by OIT
- Intel & Portland Group compilers: purchased from Penguin

Under discussion:

- Jacket (GPU acceleration for Matlab): $10k/yr site license
- Matlab distributed toolkit:
  - 128 workers (CPUs): $16k initial, $3k annual
  - 256 workers (CPUs): $30k initial, $6k annual

In short, we’ve been making good progress toward completing this acquisition, but I’m afraid the schedule may be slipping a bit. Rather than the early-to-mid-April date previously planned, we’re now looking at late April or early May. We’ll send out another status update on Friday the 23rd.

Finally, one of the “important” remaining tasks is to name your clusters. If you could send me your choices at your earliest convenience, we’ll name your head nodes accordingly.
