Archive for October, 2011

Updated: network troubles this morning (FIXED)

Posted on Monday, 31 October, 2011

All head nodes and critical servers are back online (some required an emergency reboot).  The network link to PACE equipment in TSRB is restored as well.

We do not believe any jobs were lost.

All Inchworm clusters should be back to normal.

Please let us know via pace-support@oit.gatech.edu if you notice anything out of the ordinary at this point.

 

network troubles this morning – 0908

Posted on Monday, 31 October, 2011

It looks like we have a problem with a network switch this morning.  Fortunately, our resiliency improvements have mitigated some of the impact, but not all of it, since we haven’t yet extended those improvements down to the individual server level.  We’re working with the OIT network team to restore service as soon as possible.

UPDATED: Cygnus/Force: Second failure of new VM Storage (FIXED)

Posted on Friday, 21 October, 2011

————————————————————

UPDATE: At 8:45pm EDT, Force resumed normal function. The normal computing environment is now restored.

————————————————————

UPDATE: At 7:35pm EDT, Cygnus resumed normal function. Force is still under repair.

————————————————————

5:30pm:

Well folks, I hate to do this to you again, but it looks like I need
to take cygnus and force down again thanks to problems with the storage.

Again, I’ll down Cygnus & Force at 7pm EDT. Please begin the process
of saving your work.

At this point, I’m moving these back to the old storage system, which,
while slow (and it did impact the responsiveness of these machines), at
least stayed running without issues. The new machine showed no
issues in its prior use, so I admit to being a bit flummoxed as to
what is going on.

This downtime will be longer, as I need to scrub a few things clean
and make sure the VMs will be intact and usable.

I’ll let you know when things are back online. I don’t have good
estimates this time.

No scheduled compute jobs will be impacted.

The rest of the PACE team and I apologize for the continued
interruption in service, and we hope to rectify these issues within
a couple of hours.

Thanks for your patience.

bnm

UPDATE: Cygnus & FoRCE now back online.

Posted on Thursday, 20 October, 2011

The reboot immediately fixed issues with Cygnus.

FoRCE had a little extra work to be done, but it too is now online.

Thanks for the patience. Compute away!

Urgent: Cygnus & FoRCE head nodes reboot at 7pm due to Storage issues

Posted on Thursday, 20 October, 2011

Hey folks,

We suffered a temporary loss of connectivity to the backend storage
serving our VM farm earlier this afternoon. As such, several running
VMs moved their OS filesystems to a read-only state.

The filesystems on which your data is stored are fine, however.

Unfortunately, though, the head nodes for Cygnus and the FoRCE
clusters were affected, and judging by our previous experience with
this, we need to reboot these nodes soon. As such, we ask any
currently logged-in users to please save their data now and log out.

We are scheduling a reboot of these systems at 7:00pm EDT. A few
minutes after that, the nodes should be available and fully functional.

No jobs have been lost, nor will any be lost in this process.

We are sorry for the inconvenience, and plan to keep you up to date
with any further issues with these, as well as the rest of the machines.
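
For anyone curious how to spot the read-only condition described above on a Linux machine, here is a minimal sketch. It assumes the standard /proc/mounts layout (device, mount point, filesystem type, options). The function name and the sample text are hypothetical and only for illustration; this is not the check the PACE team actually ran.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose option list contains 'ro'.

    mounts_text is expected to look like the contents of /proc/mounts:
    device, mount point, filesystem type, comma-separated options, ...
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        if "ro" in options.split(","):
            found.append(mount_point)
    return found


# Hypothetical sample, not taken from any PACE machine.
sample = """\
/dev/sda1 / ext3 ro,relatime 0 0
/dev/sda2 /var ext3 rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
"""

print(read_only_mounts(sample))
```

On a live Linux system, the same function could be fed the text read from /proc/mounts instead of the sample string.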

We’re back up

Posted on Wednesday, 19 October, 2011

The maintenance day ran a bit longer than anticipated, but the clusters are now back in operation and processing jobs. As usual, please send any reports of trouble to pace-support@oit.gatech.edu.

Maintenance Day Has Begun (All Clusters are Down)

Posted on Tuesday, 18 October, 2011

As scheduled, the compute clusters have been brought down for maintenance activities.

Some of the work now progressing:

  • Network redundancy changes
  • Filesystem moves
  • Updates to critical systems
  • Change to directory services

We’ll let you know when we’re back and ready to compute.

Upcoming quarterly maintenance – 10/18/2011

Posted on Thursday, 13 October, 2011

Reminder, folks: the clusters will be down this coming Tuesday, October 18.

All of the currently running jobs will have completed by then, and the scheduler has been instructed to not start any new jobs that will not complete by then. Jobs that have been submitted, but wouldn’t complete by Tuesday morning are being held by the scheduler, and will be released as nodes become available after our maintenance activities.
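
To illustrate the hold-or-start rule described above, here is a minimal sketch of the decision a scheduler makes for each queued job. The 6:00am start time, the job names, and the `can_start` helper are assumptions made up for this example; the actual scheduler configuration is not shown here.

```python
from datetime import datetime, timedelta


def can_start(now, walltime, maintenance_start):
    """True if a job started at `now` with the requested walltime
    would finish before the maintenance window begins."""
    return now + walltime <= maintenance_start


# Assumed values for illustration only.
maintenance_start = datetime(2011, 10, 18, 6, 0)  # "Tuesday morning"; exact hour is a guess
now = datetime(2011, 10, 13, 12, 0)

# Hypothetical queued jobs and their requested walltimes.
jobs = {
    "short_job": timedelta(hours=48),
    "long_job": timedelta(days=7),
}

decisions = {
    name: "start" if can_start(now, walltime, maintenance_start) else "hold"
    for name, walltime in jobs.items()
}
print(decisions)
```

A job that fits before the window is allowed to start; anything that would run past it stays held until nodes are released after maintenance.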

Major items on the list this time around are:

  • swap over to redundant network switches for the core of the HPC network
  • Panasas software update to version 4.1
  • routine Solaris and RedHat patching to non-user facing infrastructure services
  • routine security patches to ssh everywhere
  • migration of infrastructure services to virtual machines
  • migration to new infrastructure-facing LDAP schema
  • reinstating storage quotas missed in our previous maintenance

Some further minor things we’ll take care of as well:

  • load testing on some infrastructure servers
  • migrate the /hp3 filesystem to a different fileserver, since we put it on the wrong one (no user impact expected)
  • OIT/Operations will be performing preventative maintenance on the UPS
  • OIT/Operations will be verifying some electrical circuit locations
  • update ganglia monitoring agents on all RHEL5 machines
  • reboot everything

 

Technical Computing with MATLAB at Emory University

Posted on Thursday, 13 October, 2011

MathWorks is running a seminar over at Emory.  Please see http://www.mathworks.com/company/events/seminars/seminar60145.html for the details.  This technical seminar will discuss:

  • Importing data from Excel into MATLAB
  • Using powerful graphics features to quickly and easily visualize and understand your data
  • Processing data using pre-written functions
  • Dynamically linking MATLAB to Excel
  • Sharing your results
  • Speed up MATLAB applications with Parallel Computing
  • Handling large datasets with Parallel Computing

 

PACE passes 10,000 CPU core milestone

Posted on Friday, 7 October, 2011

As we make progress with the installation of our latest round of faculty purchases, we have now passed a significant milestone – 10,000 CPU cores managed by the PACE team.  Woohoo!  Along with all of those processors, we are also managing over half a petabyte of storage across the faculty project space, infrastructure and backups, plus a couple hundred more terabytes on our high-performance filesystems.

Given the amount of spending we have in progress at the moment, it’s very likely that we will surpass the 15,000 CPU core mark by the end of calendar year 2011.  We’ll also likely get close to a petabyte of storage.

 

Wow.

 

I’d like to take a moment and thank all of you for placing your trust in us.  On behalf of myself and the entire PACE team, I look forward to helping you take Georgia Tech to even higher levels in the future.

 

THANK YOU!!