Archive for April, 2010

HPC Status Update – 4/30/2010

Posted on Friday, 30 April, 2010

Much progress has been made, and we’ll likely be ready for our first users next week.  Please send a list of the users who should have access to your clusters.  If at all possible, please include their GT account names (e.g., gtg123x).

We are still in need of names for the individual clusters.  Please send me an alternate if you would prefer something other than our working names below:

  • M. Chou (Physics) – “Atlantis” cluster
  • S. Harvey, et al. (Biology) – “B1” cluster
  • S. Kumar (ME) – “K1” cluster
  • P. Laguna (Physics) – “L1” cluster
  • D. Sholl, et al. (ChBE) – “Joe” cluster
  • V. Yang (AE) – “Y1” cluster
  • M. Zhou (ME) – “Z1” cluster

In order to meet purchasing deadlines, we intend to proceed with the purchase of Jacket and Matlab Distributed toolkit next week.  This is your final chance to object!

I’ve indicated changes from last week in blue.

Base networking – 97% complete

  • 1 gig links to BCDC (for backups) – complete
  • 475 names and addresses defined in DNS & DHCP – complete
  • 360 ethernet ports configured – complete
  • dual 10GbE uplink from HPC backbone to campus backbone – complete
  • 10 GbE uplink from Inchworm to HPC backbone – complete
  • second 10GbE uplink Inchworm to HPC backbone – delayed (not critical)
  • 10 GbE links to BCDC (replace 1gig) – one link complete, second delayed in favor of new OIT NAS connectivity (not critical)

home & project directories – 95% complete

  • iSCSI targets – complete
  • configure dual 10GbE interfaces on storage servers – complete
  • configure ZFS filesystems on storage servers – complete
  • configure file servers to use PACE LDAP – (new task, complete)
  • provision user filesystems and set quotas – deferred until next week
  • configure backups for user filesystems – deferred until next week

scratch storage – 95% complete

  • Panasas setup and configuration – complete
  • Infiniband router setup and configuration – complete
  • basic host network configuration – complete
  • Panasas client software install & configuration – configuration script complete, need to deploy on nodes

server infrastructure – 99% complete

  • install OS and configure support servers – complete
  • install OS and configure head nodes – complete
  • name head nodes (we need your names!)
  • install and configuration of DNS & DHCP appliances – complete

compute nodes

  • support scripts – 90% complete
  • configure lights-out network consoles (new task) – 85% complete
  • creation of diskless system images (16 types) – 30% complete
  • 8 Community Cluster nodes online
  • bringup of ~275 compute nodes

Moab workload scheduler

  • creation and testing of prologue & epilogue scripts
  • initial configuration of scheduler queues

Software

  • GSL, GIT, ACML – installed
  • Intel Compiler Suite – installed
  • Portland Group PGI Server Complete – installed
  • mvapich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich (w/ gcc and intel permutations) – installed
  • ATLAS – in progress
  • lammps – in progress
  • Jacket & Matlab distributed toolkit – under discussion
  • GPU software

Penguin tasks

  • 17 of 50 Penguin blades out for capacitor/diode/resistor repair
  • Supermicro has identified a further resistor fix needed by all 50 blades
  • 50 of 50 Penguin blades in need of BIOS information fix

HPC Status Update – 4/23/2010

Posted on Friday, 23 April, 2010

Much progress has been made over the last week, and we’re on track for availability in early May.  We’ve heard back from a couple of you with names for your clusters.  If you haven’t replied with a desired name, please do so soon.  If you have comments one way or the other regarding the purchase of the Jacket or Matlab Distributed toolkit, please let me know.  We need to make purchasing decisions soon.

Base networking – 95% complete

  • 1 gig links to BCDC (for backups) – complete
  • 475 names and addresses defined in DNS & DHCP – complete
  • 360 ethernet ports configured – complete
  • dual 10GbE uplink from HPC backbone to campus backbone – complete
  • 10 GbE uplink from Inchworm to HPC backbone – complete
  • second 10GbE uplink Inchworm to HPC backbone – next week
  • 10 GbE links to BCDC (replace 1gig) – next week

home & project directories – 95% complete

  • iSCSI targets – complete
  • configure dual 10GbE interfaces on storage servers – complete
  • configure ZFS filesystems on storage servers – complete
  • provision user filesystems and set quotas – next week
  • configure backups for user filesystems – next week

scratch storage – 80% complete

  • Panasas setup and configuration – complete
  • Infiniband router setup and configuration – complete
  • basic host network configuration – complete
  • Panasas client software install & configuration – next week

server infrastructure

  • install OS and configure support servers
  • install OS and configure head nodes (we need your names!)
  • install and configuration of DNS & DHCP appliances – complete

compute nodes

  • support scripts – 90% complete
  • creation of diskless system images (16 types) – 20% complete
  • 8 Community Cluster nodes online
  • bringup of ~275 compute nodes

Moab workload scheduler

  • creation and testing of prologue & epilogue scripts
  • initial configuration of scheduler queues

Software

  • GSL, GIT, ACML – installed
  • Intel Compiler Suite – installed
  • Portland Group PGI Server Complete – installed
  • mvapich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich (w/ gcc and intel permutations) – installed
  • ATLAS – in progress
  • lammps – in progress
  • Jacket & Matlab distributed toolkit – under discussion
  • GPU software

Penguin tasks

  • 13 of 50 Penguin blades out for capacitor/diode/resistor repair
  • 50 of 50 Penguin blades in need of BIOS information fix

HPC Status Update – 4/16/2010

Posted on Friday, 16 April, 2010

Hi folks,

I’ve been a bit remiss in issuing updates recently, and I apologize for that.  We’ve made some good progress, passed some significant milestones, and had a few setbacks as well.

In an effort to improve our communication, we’ve created a blog for this project, available at http://blog.pace.gatech.edu.  Feel free to subscribe to the RSS feed if desired.  Previous status updates have been published on this site as well.  As we near the end of implementation, we’ll switch to a fixed, time-based schedule for updates, rather than communication at milestones.

Penguin has completed their portion of the project, although a couple of items of documentation were left incomplete and a few parts remain to be delivered.  All of these are non-critical.  The documentation pieces have been completed by the OIT/PACE team, and Penguin has reiterated their commitment to ship the remaining parts.  In exchange for GT accepting the purchase and issuing 100% payment, Penguin has agreed to supply an additional pair of nodes using the latest 12-core processors from AMD.  Each of these nodes has a total of 48 cores, 128GB of memory and QDR Infiniband.  These nodes will be added to the Community Cluster, and represent an approximately 10% increase in capacity for campus use.

As mentioned in our previous update, our testing has uncovered a design flaw in one particular model of compute node.  Mostly, these nodes are being used in the Community Cluster.  Other groups that have this model have been notified, and we are working to repair those as well.  Our testing has determined that this issue does not affect other models in the federation.  We’ve been working with Penguin, and their supplier SuperMicro, to develop, test and implement a hardware fix.  This fix involves removal of a diode, and a change of capacitor and resistor which are all affixed to the motherboard.  A technician from SuperMicro has been onsite this week implementing these repairs, and we are in the process of retesting the repaired nodes.

Progress on the governance committee has been made, and we hope to have a version 1.0 of the resource allocation policy out soon.

We’ve also completed the hiring of our Systems Support Specialist III.  Brian MacLeod will be joining the OIT/PACE team, effective immediately.  Brian comes to us from the Architecture and Infrastructure team within OIT.  As such, he is familiar with many of our methods and procedures, and we hope to get him up to speed quickly.  Hiring of a Research Technologist II remains.

We have also heard from a number of faculty (some of whom are on this list!) interested in additional purchases, and we have been working through the specification, configuration and ordering process in order to meet their purchasing deadlines.  We also continue to gather information towards a container-based strategy for campus in the event that more space/power/cooling for HPC is needed.

The following is a list of software we are considering for purchase.  Are there others that should be on this list?  Please let me know recommendations of other packages, and if these would be useful to you.

Currently in place:

  • Moab workload scheduler: use existing site license
  • RedHat Linux operating system: use existing site license
  • Backups: combination of open source and OS included software
  • OS management & provisioning: in-house implementation used by OIT
  • Intel & Portland Group compilers: purchased from Penguin

Under discussion:

  • Jacket (GPU acceleration for Matlab) – $10k/yr site license
  • Matlab distributed toolkit
      • 128 workers (CPUs) – $16k initial, $3k annual
      • 256 workers (CPUs) – $30k initial, $6k annual

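
For anyone weighing the two Matlab distributed toolkit tiers quoted above, here is a rough sketch of the arithmetic in Python.  It uses only the quoted initial and annual figures; the function names and the idea of counting a number of annual renewals are assumptions made for illustration, and the actual billing terms may differ.

```python
# Figures quoted in this update, in thousands of dollars: (initial, annual).
# Keys are the number of workers (CPUs) in each tier.
QUOTES = {128: (16, 3), 256: (30, 6)}


def total_cost_k(workers, renewals):
    """Total cost in $k: the initial fee plus the annual fee per renewal.

    Assumption: the annual fee is paid once per renewal and is not
    included in the initial fee. The real license terms may differ.
    """
    initial, annual = QUOTES[workers]
    return initial + annual * renewals


def cost_per_worker(workers, renewals):
    """Total cost in dollars divided by the number of workers."""
    return total_cost_k(workers, renewals) * 1000 / workers


if __name__ == "__main__":
    for workers in sorted(QUOTES):
        for renewals in (0, 3, 5):
            print(
                f"{workers} workers, {renewals} renewals: "
                f"${total_cost_k(workers, renewals)}k total, "
                f"${cost_per_worker(workers, renewals):.2f} per worker"
            )
```

Under these assumptions the 256-worker tier costs slightly less per worker, at roughly double the total outlay.
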
In short, we’ve been making good progress towards implementing this current acquisition, but I’m afraid we may be slipping a little bit.  Rather than the early/mid April previously planned, we’re looking more like late April or early May.  We’ll send out another status update on Friday the 23rd.

Finally, one of the “important” items to do is to name your clusters.  If you could send me your choices at your earliest convenience, we’ll name your head nodes accordingly.