PACE A Partnership for an Advanced Computing Environment

November 14, 2011

Cygnus nodes back online

Filed under: tech support — Semir Sarajlic @ 4:09 pm

The storage problem has been fixed, and the nodes are available for use. Thanks for your patience.

November 10, 2011

Updated: Network troubles, redux (FIXED)

Filed under: tech support — admin @ 8:38 pm

We’ve got the switch back.  The outage looks to have caused our virtual machine farm to reboot, so connections to head nodes will have been dropped.

This also affected the network path between compute nodes and the file servers.  With a little luck, the NFS traffic should resume, but you may want to check on any running jobs to make sure.
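If you want a quick way to confirm that the directories a job depends on are answering again, something like the following may help. It is only a minimal sketch, not a PACE-provided tool: the function name `path_responds` and the 5-second timeout are assumptions made for illustration. It runs a `stat` in a background thread so that a hung NFS mount cannot block the check itself.

```python
import os
import threading

def path_responds(path, timeout=5.0):
    """Return True if os.stat(path) completes within `timeout` seconds.

    Hypothetical helper: a stalled NFS mount can block stat() indefinitely,
    so the call runs in a daemon thread that we simply stop waiting for.
    """
    result = {"ok": False}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            # A missing or unreachable path counts as "not responding".
            pass

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)  # give up waiting after `timeout` seconds
    return result["ok"]
```

Calling `path_responds` on your own project or scratch directory returns `False` if the path is missing or the mount has not answered within the timeout. A `True` result only means the mount answered; it says nothing about whether a given job survived the outage, so checking job output is still worthwhile.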

Word from the network team is that they were following published instructions from the switch vendor to integrate the two switches when the failure occurred.  We’ll be looking into this pretty intensely, as these switches are seeing a lot of deployments in other OIT functions.

Network troubles, redux – 11/10 3:00pm

Filed under: tech support — admin @ 8:00 pm

Hi folks,

While attempting to restore the network redundancy lost in the switch failure on 10/31, the Campus Network team has run into trouble connecting the new switch.  At this point, the core of our HPC network is non-functional.  Senior experts from the network team are working on restoring connectivity as soon as possible.

November 9, 2011

Full filesystems this morning

Filed under: tech support — admin @ 1:54 pm

This morning, we found the hp8, hp10, hp12, hp14, hp16, hp18, hp20, hp22, hp24, and hp26 filesystems full.  All of these filesystems reside on the same fileserver and share capacity.  The root cause was an oversight on our part – a lack of quota enforcement on one particular user’s home directory.  The proper 5GB home directory quotas have been reinstated and we are working with this user to move their data to their project directory.  We’ve managed to free up a little space for the moment, but it will take some time to move a couple of TB of data.  We’re also doing an audit to ensure that all appropriate storage quotas are in place.

 

This would have affected users on the following clusters:

  • Athena
  • BioCluster
  • Aryabhata
  • Atlantis
  • FoRCE
  • Optimius (not production yet)
  • ECE (not production yet)
  • Prometheus (not production yet)
  • Math (not production yet)
  • CEE (not production yet)
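For anyone curious what a usage check against the 5GB home directory limit could look like, here is a rough sketch. It is not the audit procedure we run, and it only reports usage rather than enforcing anything. The function names, the directory layout (one subdirectory per user under a common root), and the reading of 5GB as 5 × 1024³ bytes are all assumptions made for illustration.

```python
import os

# 5 GB home quota as stated in the post; binary units are an assumption.
QUOTA_BYTES = 5 * 1024**3

def dir_usage_bytes(root):
    """Sum apparent file sizes under root, skipping symlinks and unreadable entries."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    return total

def over_quota(home_root, quota=QUOTA_BYTES):
    """Return {directory name: bytes used} for each subdirectory exceeding quota."""
    report = {}
    for entry in os.scandir(home_root):
        if entry.is_dir(follow_symlinks=False):
            used = dir_usage_bytes(entry.path)
            if used > quota:
                report[entry.name] = used
    return report
```

Pointing `over_quota` at a home directory root returns the directories above the limit along with their sizes in bytes. Walking a large filesystem this way is slow, and apparent file size can differ from the blocks actually charged against a quota, so the filesystem's own quota reporting is the better source where it is available.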

November 8, 2011

PACE staffing, week of November 14

Filed under: News — admin @ 3:30 pm

Greetings all,

As I’m sure some of you are aware, next week is the annual Supercomputing ’11 conference in Seattle.  Many of the PACE staff will be attending, but Brian MacLeod and Andre McNeill have graciously agreed to hold the fort here.  The rest of us will be focused on conference activities but will have connectivity and can assist with urgent matters if needed.

November 2, 2011

Cygnus/Atlantis nodes back online

Filed under: tech support — Semir Sarajlic @ 5:23 pm

The disk array rebuild has completed. Some nodes were brought up during the rebuild to take jobs, but all nodes should now be online.
