
Archive for May, 2010

HPC Status Update 5/21/2010 – ready for research!

Posted on Friday, 21 May, 2010

I just sent out a bulk email to all of the users we have on our list as needing accounts.  We’re open for business!

I’m sure there will be issues to resolve as users get up and running over the next couple of weeks, and we’ll do our best to address those as quickly as possible.  The welcome message asks that these be directed to pace-support@oit.gatech.edu; I’d like to reiterate that request here, since routing issues through that address allows us to track and resolve them efficiently.  Please remind your students to do so.

We have a very simple queue structure in place at the moment (one queue per cluster).  I expect to work with you, or a designee, over time to adjust it as needed.
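For context, a one-queue-per-cluster layout like this is typically defined in the Torque/Moab stack via `qmgr`.  The sketch below is illustrative only — the queue name and node property are invented, not our actual configuration:

```shell
# Hypothetical sketch of a one-queue-per-cluster setup using Torque's qmgr.
# The queue name "clusterA" and its node property are invented for illustration.
qmgr -c "create queue clusterA queue_type=execution"
qmgr -c "set queue clusterA resources_default.neednodes = clusterA"
qmgr -c "set queue clusterA enabled = true"
qmgr -c "set queue clusterA started = true"

# Users would then submit jobs to their cluster's queue:
#   qsub -q clusterA myjob.pbs
```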

HPC Status Update 5/20/2010

Posted on Thursday, 20 May, 2010

Folks, we’re ready to go.  Accounts are created, and a very basic queue structure is set up – one queue for each cluster.  We’ll adjust those as needed.  We’ll send out welcome messages with login instructions in the morning.

HPC Status Update 5/19/2010

Posted on Wednesday, 19 May, 2010

Just a quick update for today.  We’ve identified some bugs in our user creation process.  This is the last step before we can start adding users.  Should be fixed shortly…

HPC Status Update 5/15/2010

Posted on Saturday, 15 May, 2010

Hi folks,

Just a quick update here – late last night we completed the bringup of a few of the dedicated clusters and the portions of the Community Cluster that are not affected by the problem with the BladeRunner platform.  We’re running some final “sanity check” tests on these and expect to open them up for use on Monday or Tuesday.  We’ll also complete the bringup of the remainder of the clusters Monday.  We have some behind-the-scenes work to do, but expect to be able to accomplish most of that without downtime or impact on research.

We think we have a resolution to the performance issues with the home and project directory storage and are verifying those changes.

HPC Status Update 5/7/2010

Posted on Saturday, 8 May, 2010

Greetings,

We had hoped to have some users on-line this week, but there are a few more things that need to be done before the dedicated clusters will be ready for full-bore research.  We anticipate being able to provide access by the end of next week to the individual clusters.

FYI, we have spent a significant amount of time working with Penguin and their supplier, SuperMicro, trying to resolve an issue that specifically impacts the BladeRunner II platform.  Penguin has taken further steps towards resolution requiring a hardware fix, and we are having to run another series of tests for verification.  In previous attempts, these nodes have taken up to a week to exhibit fatal failures.  We are now looking at availability of the Community Cluster sometime during the week of May 17-21, assuming the current round of testing is successful.  This also impacts some of the dedicated clusters, and we have been in touch with their owners.  We have escalated this issue to the highest levels within Penguin, and they have given us assurances that all efforts are being made to resolve this as quickly as possible.

We are also having some unanticipated performance issues with the home and project directory storage.  (The high-performance scratch is performing quite well.)  We have a few promising leads as well as a contingency strategy that can be quickly implemented.

The purchase of Jacket and the Matlab Distributed toolkit is underway.

We have names for all of the dedicated clusters now, as well as a number of users.  If you have not submitted your list of users yet, please do so.

Please let me know if you have any questions.  I’ve added some more detail below, and I’ll post another update as we have more information.

I’ve indicated changes from last week in red.

Base networking – 97% complete

  • 1 gig links to BCDC (for backups) – complete
  • 475 names and addresses defined in DNS & DHCP – complete
  • 360 ethernet ports configured – complete
  • dual 10GbE uplink from HPC backbone to campus backbone – complete
  • 10 GbE uplink from Inchworm to HPC backbone – complete
  • second 10GbE uplink from Inchworm to HPC backbone – delayed (not critical)
  • 10 GbE links to BCDC (replace 1gig) – one link complete, second delayed in favor of new OIT NAS connectivity (not critical)

home & project directories – 80% complete

  • iSCSI targets – complete
  • configure dual 10GbE interfaces on storage servers – complete
  • configure ZFS filesystems on storage servers – complete
  • configure file servers to use PACE LDAP – (new task, complete)
  • provision user filesystems and set quotas – complete
  • configure backups for user filesystems – complete
  • resolve NFS performance issues with Home directories, Project Directories, Infrastructure storage and backup server (high impact) – in progress
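The ZFS provisioning and quota steps above look roughly like the following on the storage servers; the pool, filesystem, and user names here are hypothetical, and actual export options will vary:

```shell
# Hypothetical ZFS provisioning sketch (pool and user names are invented).
# Create a per-user home filesystem, cap it with a quota, and share it over NFS.
zfs create pool0/home/jdoe
zfs set quota=5G pool0/home/jdoe
zfs set sharenfs=on pool0/home/jdoe
zfs get quota,used pool0/home/jdoe   # verify the quota took effect
```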

scratch storage – 100% complete

  • Panasas setup and configuration – complete
  • Infiniband router setup and configuration – complete
  • basic host network configuration – complete
  • Panasas client software install & configuration – complete

server infrastructure – 99% complete

  • install OS and configure support servers – complete
  • install OS and configure head nodes – complete
  • install and configuration of DNS & DHCP appliances – complete
  • rename head nodes

compute nodes

  • support scripts – 90% complete
  • configure lights-out network consoles (new task) – 85% complete
  • creation of diskless system images (16 types) – 50% complete
  • 0 Community Cluster nodes online
  • bringup of ~275 compute nodes

Moab workload scheduler

  • creation and testing of prologue & epilogue scripts
  • initial configuration of scheduler queues
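As a rough illustration of the prologue/epilogue work above: a Torque prologue script runs as root on the lead node before each job, and a nonzero exit aborts the job.  A minimal sketch, with a made-up scratch path:

```shell
#!/bin/sh
# Minimal Torque prologue sketch.  Torque passes the job id as $1 and the
# user name as $2.  The scratch path below is a hypothetical example.
JOBID="$1"
JOBUSER="$2"
SCRATCH="/scratch/$JOBUSER/$JOBID"   # hypothetical per-job scratch directory
mkdir -p "$SCRATCH" && chown "$JOBUSER" "$SCRATCH" || exit 1
exit 0   # a nonzero exit status would cause the scheduler to abort the job
```

An epilogue script is the mirror image: it runs after the job completes and would typically clean up the per-job scratch directory.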

Software

  • GSL, GIT, ACML – installed
  • Intel Compiler Suite – installed
  • Portland Group PGI Server Complete – installed
  • mvapich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich (w/ gcc and intel permutations) – installed
  • ATLAS – in progress
  • lammps – in progress
  • Jacket & Matlab distributed toolkit – requisitions are awaiting approval from Central Purchasing
  • GPU software
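Since the MPI stacks above are built in per-compiler permutations, a user typically selects one matching compiler/MPI pair and builds with the wrapper compiler.  The module names below are illustrative only — the actual names on the clusters may differ:

```shell
# Illustrative only: module names are hypothetical.
module load intel mvapich2          # pick one compiler/MPI permutation
mpicc -O2 -o hello hello_mpi.c      # the wrapper adds MPI include/link flags
```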

Penguin tasks

  • Supermicro has identified a further resistor fix (phase 3) needed by all 50 blades
  • 11 of 50 Penguin blades out for capacitor/diode/resistor repair (39 complete)
  • 11 of 50 Penguin blades in need of BIOS information fix (39 complete)