Greetings,
We had hoped to have some users on-line this week, but there are a few more things that need to be done before the dedicated clusters will be ready for full-bore research. We anticipate being able to provide access to the individual clusters by the end of next week.
FYI, we have spent a significant amount of time working with Penguin and their supplier, Supermicro, trying to resolve an issue that specifically impacts the BladeRunner II platform. Penguin's latest step toward resolution requires a hardware fix, and we need to run another series of tests for verification. In previous attempts, these nodes have taken up to a week to exhibit fatal failures. Assuming the current round of testing is successful, we are now looking at availability of the Community Cluster sometime during the week of May 17-21. This also impacts some of the dedicated clusters, and we have been in touch with their owners. We have escalated this issue to the highest levels within Penguin, and they have given us assurances that all efforts are being made to resolve it as quickly as possible.
We are also seeing some unanticipated performance issues with the home and project directory storage. (The high-performance scratch storage is performing quite well.) We have a few promising leads, as well as a contingency strategy that can be implemented quickly.
The purchase of Jacket and Matlab Distributed toolkit is underway.
We have names for all of the dedicated clusters now, as well as a number of users. If you have not submitted your list of users yet, please do so.
Please let me know if you have any questions. I’ve added some more detail in the blog post, and I’ll post another update as we have more information.
I’ve indicated changes from last week in red.
Base networking – 97% complete
- 1 GbE links to BCDC (for backups) – complete
- 475 names and addresses defined in DNS & DHCP – complete (see the sanity-check sketch after this list)
- 360 Ethernet ports configured – complete
- dual 10 GbE uplink from HPC backbone to campus backbone – complete
- 10 GbE uplink from Inchworm to HPC backbone – complete
- second 10 GbE uplink from Inchworm to HPC backbone – delayed (not critical)
- 10 GbE links to BCDC (replacing 1 GbE) – one link complete, second delayed in favor of new OIT NAS connectivity (not critical)
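For the curious, verifying that many DNS entries is easy to script. Here's a sketch of the kind of check we run; hostnames.txt (a hypothetical file with one fully-qualified name per line) is compared forward and backward:

    #!/bin/sh
    # Sketch only: confirm each name's A record and its reverse (PTR)
    # record agree. Not our production tooling.
    while read -r name; do
        addr=$(dig +short "$name" A | head -n 1)
        if [ -z "$addr" ]; then
            echo "no A record: $name"
            continue
        fi
        back=$(dig +short -x "$addr")
        [ "${back%.}" = "$name" ] || echo "mismatch: $name -> $addr -> $back"
    done < hostnames.txt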
Home & project directories – 80% complete
- iSCSI targets – complete
- configure dual 10 GbE interfaces on storage servers – complete
- configure ZFS filesystems on storage servers – complete
- configure file servers to use PACE LDAP – complete (new task)
- provision user filesystems and set quotas – complete (see the sketch after this list)
- configure backups for user filesystems – complete
- resolve NFS performance issues with home directories, project directories, infrastructure storage, and the backup server (high impact)
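For those curious what the per-user provisioning step looks like, it amounts to a few ZFS commands per user, roughly like the following; the pool name, username, and quota are made up for illustration and are not our actual values:

    # One filesystem per user with its own quota, shared over NFS
    # (assumes a pool named "home" mounted at /home).
    zfs create home/someuser
    zfs set quota=10G home/someuser
    zfs set sharenfs=on home/someuser
    chown someuser /home/someuser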
Scratch storage – 100% complete
- Panasas setup and configuration – complete
- InfiniBand router setup and configuration – complete
- basic host network configuration – complete (see the checks sketched after this list)
- Panasas client software install & configuration – complete
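For context, the host-side work amounts to bringing up an IPoIB interface and checking connectivity toward the Panasas realm through the InfiniBand routers. A few sanity checks of the sort we run; the interface name and addresses here are hypothetical:

    # InfiniBand port should be Active at the expected rate
    ibstat | grep -E 'State|Rate'
    # IPoIB interface should hold its address on the scratch network
    ip addr show ib0
    # and the router on the Panasas side should answer
    ping -c 3 10.10.0.1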
Server infrastructure – 99% complete
- install OS and configure support servers – complete
- install OS and configure head nodes – complete
- install and configure DNS & DHCP appliances – complete
- rename head nodes
Compute nodes
- support scripts – 90% complete
- configure lights-out network consoles (new task) – 85% complete
- creation of diskless system images (16 types) – 50% complete (see the sketch after this list)
- 0 Community Cluster nodes online
- bring-up of ~275 compute nodes
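To give a flavor of the image and console work, here is a rough sketch of building one diskless image type on a RHEL-family system and opening a lights-out console. Paths, package choices, the BMC hostname, and credentials are all illustrative assumptions, and our actual build scripts differ in detail:

    # Populate a chroot for one image type, then pack it for network boot
    # (one possible approach; NFS-root is another).
    yum -y --installroot=/srv/images/compute-base groupinstall Base
    yum -y --installroot=/srv/images/compute-base install nfs-utils openssh-server
    mksquashfs /srv/images/compute-base /srv/tftp/compute-base.img -noappend

    # Lights-out serial console to a node's BMC via IPMI serial-over-LAN
    ipmitool -I lanplus -H node101-bmc -U admin -P secret sol activate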
Moab workload scheduler
- creation and testing of prologue & epilogue scripts (samples sketched after this list)
- initial configuration of scheduler queues
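To give a flavor of what these do, a minimal prologue/epilogue pair under Torque (which Moab drives) might look like the following; the per-job scratch path is an assumption. Torque passes the job id and job owner as the first two arguments and runs these as root from mom_priv on each node:

    #!/bin/sh
    # prologue: runs before each job; $1 = job id, $2 = job owner
    mkdir -p "/scratch/job/$1"
    chown "$2" "/scratch/job/$1"
    exit 0

    #!/bin/sh
    # epilogue: runs after each job; clean up the per-job directory
    rm -rf "/scratch/job/$1"
    exit 0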
Software
- GSL, Git, ACML – installed
- Intel Compiler Suite – installed
- Portland Group PGI Server Complete – installed
- mvapich2 (gcc, Intel, and PGI permutations) – installed
- mpich2 (gcc, Intel, and PGI permutations) – installed (see the build sketch after this list)
- mpich (gcc and Intel permutations) – installed
- ATLAS – in progress
- LAMMPS – in progress
- Jacket & Matlab distributed toolkit – requisitions are awaiting approval from Central Purchasing
- GPU software
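For the record, the compiler "permutations" above are simply separate builds, one per compiler, each installed under its own prefix. A sketch with mpich2, where the version number and install paths are made up:

    tar xzf mpich2-1.2.1.tar.gz
    cd mpich2-1.2.1
    # gcc build
    ./configure CC=gcc F77=gfortran --prefix=/usr/local/mpich2-1.2.1/gcc
    make && make install
    # Intel build from a clean tree
    make distclean
    ./configure CC=icc F77=ifort --prefix=/usr/local/mpich2-1.2.1/intel
    make && make install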
Penguin tasks
- Supermicro has identified a further resistor fix (phase 3) needed on all 50 blades
- 11 of 50 Penguin blades out for capacitor/diode/resistor repair (39 complete)
- 11 of 50 Penguin blades in need of a BIOS information fix (39 complete)