PACE A Partnership for an Advanced Computing Environment

March 30, 2012

New rhel6 shared/hybrid queues are ready!

Filed under: Inchworm deployment — Semir Sarajlic @ 9:20 pm

We are happy to announce the availability of shared/hybrid queues for all sharing rhel6 clusters. Please run “/opt/pace/bin/pace-whoami” to see which of these queues you have access to. We did our best to test and validate these queues, but some issues may still have been overlooked. Please contact us at pace-support@oit.gatech.edu if you notice any problems.

Here’s a list of these queues:

  • mathforce-6
  • critcelforce-6
  • apurimacforce-6
  • prometforce-6 (prometheusforce-6 was too long for the scheduler)
  • eceforce-6
  • cygnusforce-6
  • iw-shared-6
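As a side note on the shortened name above: the post does not say what the scheduler's actual limit is, so the following is only a minimal sketch that assumes a hypothetical 15-character maximum (a value that happens to be consistent with every name in the list). The constant `MAX_QUEUE_NAME_LEN` and the helper names are illustrative, not part of any PACE tooling.

```python
# Sketch: check queue names against an ASSUMED maximum length.
# The real scheduler limit is not stated in the post; 15 is hypothetical.
MAX_QUEUE_NAME_LEN = 15

# Queue names exactly as announced in the list above.
QUEUES = [
    "mathforce-6",
    "critcelforce-6",
    "apurimacforce-6",
    "prometforce-6",
    "eceforce-6",
    "cygnusforce-6",
    "iw-shared-6",
]


def fits(name, limit=MAX_QUEUE_NAME_LEN):
    """Return True if the queue name is within the assumed limit."""
    return len(name) <= limit


def too_long(names, limit=MAX_QUEUE_NAME_LEN):
    """Return the names that exceed the assumed limit, in input order."""
    return [name for name in names if not fits(name, limit)]


if __name__ == "__main__":
    # Include the original, unshortened name for comparison.
    for name in QUEUES + ["prometheusforce-6"]:
        status = "ok" if fits(name) else "too long"
        print(f"{name:<20} {len(name):>2}  {status}")
```

Under that assumption, every announced name fits, while the original “prometheusforce-6” (17 characters) would not.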

Happy computing!

 

May 21, 2010

HPC Status Update 5/21/2010 – ready for research!

Filed under: Inchworm deployment: Winter 2009 — admin @ 5:54 pm

I just sent out a bulk email to all of the users we have on our list as needing accounts.  We’re open for business!

I’m sure there will be issues to resolve getting users running over the next couple of weeks, and we’ll do our best to resolve those as quickly as possible.  The welcome message asks that these be directed to pace-support@oit.gatech.edu.  I’d like to reiterate that request here as doing so will allow us to track and resolve issues efficiently.  Please remind your students to do so.

We have a very simple queue structure in place at the moment (one queue per cluster).  I expect to work with you, or a designate, over time to adjust as needed.

May 20, 2010

HPC Status Update 5/20/2010

Filed under: Inchworm deployment: Winter 2009 — admin @ 9:50 pm

Folks, we’re ready to go.  Accounts are created, and a very basic queue structure is set up – one queue for each cluster.  We’ll adjust those as needed.  We’ll send out welcome messages with login instructions in the morning.

May 19, 2010

HPC Status Update 5/19/2010

Filed under: Inchworm deployment: Winter 2009 — admin @ 6:09 pm

Just a quick update for today.  We’ve identified some bugs in our user creation process.  This is the last step before we can start adding users.  Should be fixed shortly…

May 15, 2010

HPC Status Update 5/15/2010

Filed under: Inchworm deployment: Winter 2009 — admin @ 12:22 pm

Hi folks,

Just a quick update here – late last night we completed the bringup of a few of the dedicated clusters and the portions of the Community Cluster that are not affected by the problem with the BladeRunner platform.  We’re running some final “sanity check” tests on these and expect to open them up for use on Monday or Tuesday.  We’ll also complete the bringup of the remainder of the clusters Monday.  We have some behind-the-scenes work to do, but expect to be able to accomplish most of that without downtime or impact on research.

We think we have a resolution to the performance issues with the home and project directory storage and are verifying those changes.

May 8, 2010

HPC Status Update 5/7/2010

Filed under: Inchworm deployment: Winter 2009 — admin @ 2:27 am

Greetings,

We had hoped to have some users on-line this week, but there are a few more things that need to be done before the dedicated clusters will be ready for full-bore research.  We anticipate being able to provide access by the end of next week to the individual clusters.

FYI, we have spent a significant amount of time working with Penguin and their supplier, SuperMicro, trying to resolve an issue that specifically impacts the BladeRunner II platform.  Penguin has taken further steps towards resolution requiring a hardware fix, and we are having to run another series of tests for verification.  In previous attempts, these nodes have taken up to a week to exhibit fatal failures.  We are now looking at availability of the Community Cluster sometime during the week of May 17-21, assuming the current round of testing is successful.  This also impacts some of the dedicated clusters, and we have been in touch with their owners.  We have escalated this issue to the highest levels within Penguin, and they have given us assurances that all efforts are being made to resolve this as quickly as possible.

We are also having some unanticipated performance issues with the home and project directory storage.  (The high-performance scratch is performing quite well.)  We have a few promising leads as well as a contingency strategy that can be quickly implemented.

The purchase of Jacket and Matlab Distributed toolkit is underway.

We have names for all of the dedicated clusters now, as well as a number of users.  If you have not submitted your list of users yet, please do so.

Please let me know if you have any questions.  I’ve added some more detail in the blog post, and I’ll post another update as we have more information.

I’ve indicated changes from last week in red.

Base networking – 97% complete

  • 1 gig links to BCDC (for backups) – complete
  • 475 names and addresses defined in DNS & DHCP – complete
  • 360 ethernet ports configured – complete
  • dual 10GbE uplink from HPC backbone to campus backbone – complete
  • 10 GbE uplink from Inchworm to HPC backbone – complete
  • second 10GbE uplink Inchworm to HPC backbone – delayed (not critical)
  • 10 GbE links to BCDC (replace 1gig) – one link complete, second delayed in favor of new OIT NAS connectivity (not critical)

home & project directories – 80% complete

  • iSCSI targets – complete
  • configure dual 10GbE interfaces on storage servers – complete
  • configure ZFS filesystems on storage servers – complete
  • configure file servers to use PACE LDAP – (new task, complete)
  • provision user filesystems and set quotas – complete
  • configure backups for user filesystems – complete
  • resolve NFS performance issues with Home directories, Project Directories, Infrastructure storage and backup server (high impact)

scratch storage – 100% complete

  • Panasas setup and configuration – complete
  • Infiniband router setup and configuration – complete
  • basic host network configuration – complete
  • Panasas client software install & configuration – complete

server infrastructure – 99% complete

  • install OS and configure support servers – complete
  • install OS and configure head nodes – complete
  • install and configuration of DNS & DHCP appliances – complete
  • rename head nodes

compute nodes

  • support scripts – 90% complete
  • configure lights-out network consoles (new task) – 85% complete
  • creation of diskless system images (16 types) – 50% complete
  • 0 Community Cluster nodes online
  • bringup of ~275 compute nodes

Moab workload scheduler

  • creation and testing of prologue & epilogue scripts
  • initial configuration of scheduler queues

Software

  • GSL, GIT, ACML – installed
  • Intel Compiler Suite – installed
  • Portland Group PGI Server Complete – installed
  • mvapich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich (w/ gcc and intel permutations) – installed
  • ATLAS – in progress
  • lammps – in progress
  • Jacket & Matlab distributed toolkit – requisitions are awaiting approval from Central Purchasing
  • GPU software

Penguin tasks

  • Supermicro has identified a further resistor fix (phase 3) needed by all 50 blades
  • 11 of 50 Penguin blades out for capacitor/diode/resistor repair (39 complete)
  • 11 of 50 Penguin blades in need of BIOS information fix (39 complete)

April 30, 2010

HPC Status update – 4/30/2010

Filed under: Inchworm deployment: Winter 2009 — admin @ 10:40 pm

Much progress has been made, and we’ll likely be ready for our first users next week.  Please send along a list of the users for whom you would like us to enable access to your clusters.  If at all possible, please include their GT account names (e.g., gtg123x).

We are still in need of names for the individual clusters.  Please send me an alternate if you would prefer something other than our working names below:

  • M. Chou (Physics) – “Atlantis” cluster
  • S. Harvey, et al. (Biology) – “B1” cluster
  • S. Kumar (ME) – “K1” cluster
  • P. Laguna (Physics) – “L1” cluster
  • D. Sholl, et al. (ChBE) – “Joe” cluster
  • V. Yang (AE) – “Y1” cluster
  • M. Zhou (ME) – “Z1” cluster

In order to meet purchasing deadlines, we intend to proceed with the purchase of Jacket and Matlab Distributed toolkit next week.  This is your final chance to object!

I’ve indicated changes from last week in blue.

Base networking – 97% complete

  • 1 gig links to BCDC (for backups) – complete
  • 475 names and addresses defined in DNS & DHCP – complete
  • 360 ethernet ports configured – complete
  • dual 10GbE uplink from HPC backbone to campus backbone – complete
  • 10 GbE uplink from Inchworm to HPC backbone – complete
  • second 10GbE uplink Inchworm to HPC backbone – delayed (not critical)
  • 10 GbE links to BCDC (replace 1gig) – one link complete, second delayed in favor of new OIT NAS connectivity (not critical)

home & project directories – 95% complete

  • iSCSI targets – complete
  • configure dual 10GbE interfaces on storage servers – complete
  • configure ZFS filesystems on storage servers – complete
  • configure file servers to use PACE LDAP – (new task, complete)
  • provision user filesystems and set quotas – deferred until next week
  • configure backups for user filesystems – deferred until next week

scratch storage – 95% complete

  • Panasas setup and configuration – complete
  • Infiniband router setup and configuration – complete
  • basic host network configuration – complete
  • Panasas client software install & configuration – configuration script complete, need to deploy on nodes

server infrastructure – 99% complete

  • install OS and configure support servers – complete
  • install OS and configure head nodes – complete
  • name head nodes (we need your names!)
  • install and configuration of DNS & DHCP appliances – complete

compute nodes

  • support scripts – 90% complete
  • configure lights-out network consoles (new task) – 85% complete
  • creation of diskless system images (16 types) – 30% complete
  • 8 Community Cluster nodes online
  • bringup of ~275 compute nodes

Moab workload scheduler

  • creation and testing of prologue & epilogue scripts
  • initial configuration of scheduler queues

Software

  • GSL, GIT, ACML – installed
  • Intel Compiler Suite – installed
  • Portland Group PGI Server Complete – installed
  • mvapich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich (w/ gcc and intel permutations) – installed
  • ATLAS – in progress
  • lammps – in progress
  • Jacket & Matlab distributed toolkit – under discussion
  • GPU software

Penguin tasks

  • 17 of 50 Penguin blades out for capacitor/diode/resistor repair
  • Supermicro has identified a further resistor fix needed by all 50 blades
  • 50 of 50 Penguin blades in need of BIOS information fix

April 23, 2010

HPC Status Update – 4/23/2010

Filed under: Inchworm deployment: Winter 2009 — admin @ 10:42 pm

Much progress has been made over the last week, and we’re on track for availability in early May.  We’ve heard back from a couple of you with names for your clusters.  If you haven’t replied with a desired name, please do so soon.  If you have comments one way or the other regarding the purchase of the Jacket or Matlab Distributed toolkit, please let me know.  We need to make purchasing decisions soon.

Base networking – 95% complete

  • 1 gig links to BCDC (for backups) – complete
  • 475 names and addresses defined in DNS & DHCP – complete
  • 360 ethernet ports configured – complete
  • dual 10GbE uplink from HPC backbone to campus backbone – complete
  • 10 GbE uplink from Inchworm to HPC backbone – complete
  • second 10GbE uplink Inchworm to HPC backbone – next week
  • 10 GbE links to BCDC (replace 1gig) – next week

home & project directories – 95% complete

  • iSCSI targets – complete
  • configure dual 10GbE interfaces on storage servers – complete
  • configure ZFS filesystems on storage servers – complete
  • provision user filesystems and set quotas – next week
  • configure backups for user filesystems – next week

scratch storage – 80% complete

  • Panasas setup and configuration – complete
  • Infiniband router setup and configuration – complete
  • basic host network configuration – complete
  • Panasas client software install & configuration – next week

server infrastructure

  • install OS and configure support servers
  • install OS and configure head nodes (we need your names!)
  • install and configuration of DNS & DHCP appliances – complete

compute nodes

  • support scripts – 90% complete
  • creation of diskless system images (16 types) – 20% complete
  • 8 Community Cluster nodes online
  • bringup of ~275 compute nodes

Moab workload scheduler

  • creation and testing of prologue & epilogue scripts
  • initial configuration of scheduler queues

Software

  • GSL, GIT, ACML – installed
  • Intel Compiler Suite – installed
  • Portland Group PGI Server Complete – installed
  • mvapich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich2 (w/ gcc, intel and PGI permutations) – installed
  • mpich (w/ gcc and intel permutations) – installed
  • ATLAS – in progress
  • lammps – in progress
  • Jacket & Matlab distributed toolkit – under discussion
  • GPU software

Penguin tasks

  • 13 of 50 Penguin blades out for capacitor/diode/resistor repair
  • 50 of 50 Penguin blades in need of BIOS information fix

April 16, 2010

HPC Status Update – 4/16/2010

Filed under: Inchworm deployment: Winter 2009 — admin @ 12:16 pm

Hi folks,

I’ve been a bit remiss in issuing updates recently, and I apologize for that.  We’ve made some good progress, passed some significant milestones, and had a few setbacks as well.

In an effort to improve our communication, we’ve created a blog for this project, available at https://blog.pace.gatech.edu.  Feel free to subscribe to the RSS feed if desired.  Previous status updates have been published on this site as well.  As we near the end of implementation, we’ll switch to a fixed, time-based schedule for updates, rather than communication at milestones.

Penguin has completed their portion of the project, although a couple of items of documentation were left incomplete and a few parts remain to be delivered.  All of these are non-critical.  The documentation pieces have been completed by the OIT/PACE team, and Penguin has reiterated their commitment to ship the remaining parts.  In exchange for GT accepting the purchase and issuing 100% payment, Penguin has agreed to supply an additional pair of nodes using the latest 12-core processors from AMD.  Each of these nodes has a total of 48 cores, 128GB of memory and QDR Infiniband.  These nodes will be added to the Community Cluster, and represent an approximately 10% increase in capacity for campus use.

As mentioned in our previous update, our testing has uncovered a design flaw in one particular model of compute node.  These nodes are mostly used in the Community Cluster.  Other groups that have this model have been notified, and we are working to repair those as well.  Our testing has determined that this issue does not affect other models in the federation.  We’ve been working with Penguin, and their supplier SuperMicro, to develop, test and implement a hardware fix.  This fix involves removing a diode and changing a capacitor and a resistor, all of which are affixed to the motherboard.  A technician from SuperMicro has been onsite this week implementing these repairs, and we are in the process of retesting the repaired nodes.

Progress on the governance committee has been made, and we hope to have a version 1.0 of the resource allocation policy out soon.

We’ve also completed the hiring of our Systems Support Specialist III.  Brian MacLeod will be joining the OIT/PACE team, effective immediately.  Brian comes to us from the Architecture and Infrastructure team within OIT.  As such, he is familiar with many of our methods and procedures, and we hope to be able to get him up to speed quickly.  Hiring of a Research Technologist II remains.

We have also heard from a number of faculty (some of whom are on this list!) interested in additional purchases and have been working through the specification, configuration and ordering process in order to meet their purchasing deadlines.  We also continue to gather information towards a container-based strategy for campus in the event that more space/power/cooling for HPC is needed.

The following is a list of software we are considering for purchase.  Are there others that should be on this list?  Please let me know recommendations of other packages, and if these would be useful to you.

Currently in place:

  • Moab workload scheduler: use existing site license
  • RedHat Linux operating system: use existing site license
  • Backups: combination of open source and OS included software
  • OS management & provisioning: in-house implementation used by OIT
  • Intel & Portland Group compilers: purchased from Penguin

Under discussion:

  • Jacket (GPU acceleration for Matlab) – $10k/yr site license
  • Matlab distributed toolkit
      • 128 workers (CPUs) – $16k initial, $3k annual
      • 256 workers (CPUs) – $30k initial, $6k annual

In short, we’ve been making good progress towards implementing this current acquisition, but I’m afraid we may be slipping a little bit.  Rather than the early/mid April previously planned, we’re looking more like late April or early May.  We’ll send out another status update on Friday the 23rd.

Finally, one of the “important” items to do is to name your clusters.  If you could send me your choices at your earliest convenience, we’ll name your head nodes accordingly.

March 9, 2010

update on testing and next steps from Dr. Ron Hutchins

Filed under: Inchworm deployment: Winter 2009 — admin @ 9:41 am

Folks, here’s a quick update on progress with the new cluster installation in the Rich Building and some future planning.

Everything is progressing well with the implementation and initial testing of the new Federated HPC Cluster.  FYI, here is a very brief update regarding the approach to testing.  Penguin is currently conducting cluster performance and open source application tests that must run error free for several 24-hour periods.  Once these tests are completed and validated, GT will make the 2nd of 3 payments on the cluster.  Prior to making the 3rd and final payment, tests must be run on the Panasas high performance scratch storage and on applications that require a license server (Fluent, MayaKranc, Quantum ESPRESSO).

After the cluster is “turned over” to GT and appropriately configured, we have time allotted for your teams to test codes they developed on the “Interim Test Flight Cluster” that Neil mentioned in his last general update.  Please encourage your research teams to take advantage of the Interim Test Flight Cluster during the next few weeks to prepare for in-house application tests. GT processes have already uncovered one significant performance issue related to one set of user code (running VASP) that Penguin is in the process of fixing.  This highlights the importance of being as thorough as possible in our testing.

It is very exciting to be in the final stages of implementing this significant federation of research computing resources. I’m interested in getting any feedback you or your research staff have for us.

Also, several faculty have expressed interest in another round of procurements. Let us know if you have future needs and we’ll do our best to work with you on them. Our machine room upgrades are scheduled to be completed sometime in May/June and we are in the beginning phases of looking into a future temporary container-based strategy for campus if more space/power/cooling is needed.

A meeting of the faculty selected by the Deans to work on the governance structure for community assets is being scheduled now.  If you have any thoughts/questions about this process let me know.

-Ron
