
Author Archive

Cluster Downtime December 19th for Scratch Space Cancelled

Posted on Wednesday, 12 December, 2012

We have been working very closely with Panasas regarding the necessity of emergency downtime for the cluster to address the difficulties with the high-speed scratch storage. At this time, they have located a significant problem in their code base that, they believe, is responsible for this and other issues. Unfortunately, the full product update will not be ready in time for the December 19th date, so we have cancelled this emergency downtime; all jobs running or scheduled will continue as expected.

We will update you with the latest summary information from Panasas when available. Thank you for your continued patience and cooperation with this issue.

– Paul Manno

Cluster Downtime December 19th for Scratch Space Issues

Posted on Friday, 7 December, 2012

As many of you have noticed, we have experienced disruptions and undesirable performance with our high-speed scratch space. We are continuing to work diligently with Panasas to discover the root cause of these faults and a repair for them.

As we work toward a final resolution of the product issues, we will need to schedule an additional cluster-wide downtime to apply a potential fix to the Panasas system. We are scheduling a short downtime (2 hours) for Wednesday, December 19th at 2pm ET. During this window, we expect to install a tested release of software.

We understand this is an inconvenience to all our users, but we feel it is important enough to the PACE community to warrant this disruption. If this particular date and duration falls at an especially difficult time, please contact us and we will do our best to negotiate a better date or time.

It is our hope this will provide a permanent solution to these near-daily disruptions.

– Paul Manno

Scratch storage issues: update

Posted on Tuesday, 11 September, 2012

Scratch storage status update:

We continue to work with Panasas on the difficulties with our high-speed scratch storage system. Since the last update, we have received and installed two PAS-11 test shelves and have successfully reproduced our problems on them under the current production software version. We then updated to their latest release and re-tested only to observe a similar problem with this new release as well.

We’re continuing to do what we can to encourage the company to find a solution but are also exploring alternative technologies. We apologize for the inconvenience and will continue to update you with our progress.

Scratch Storage and Scheduler Concerns

Posted on Monday, 20 August, 2012

The PACE team is urgently working on two ongoing critical issues with the clusters:

Scratch storage
We are aware of access, speed, and reliability issues with the high-speed scratch storage system. We are currently working with the vendor to define and implement a solution to these issues. At this time, we are told a new version of the storage system firmware, released just today, will likely resolve our issues. The PACE team is expecting the arrival of a test unit on which we can verify the vendor’s solution. Once we have verified it, we are considering an emergency maintenance window for the entire cluster in order to implement the fix. We appreciate your feedback on this approach, especially regarding the impact on your research. We will let you know, and work with you on scheduling, when a known solution is available.

Scheduler
We are presently preparing a new system to host the scheduler software. We expect the more powerful system will alleviate many of the difficulties you are experiencing with the scheduler, especially the delays in job scheduling and the time-outs when requesting information or submitting jobs. Once the new system is ready, we will have to suspend the scheduler for a few minutes while we transition services to it. We do not anticipate the loss of any running or queued jobs during this transition.

In both situations, we will provide you with notice well in advance of any potential interruption and work with you to minimize the impact on your research schedule.

– Paul Manno

We’re back up

Posted on Wednesday, 19 October, 2011

The maintenance day ran a bit longer than anticipated, but the clusters are now back in operation and processing jobs. As usual, please send any reports of trouble to pace-support@oit.gatech.edu.

Clusters Are Back!

Posted on Friday, 9 September, 2011

1530

After days of continuous struggle and troubleshooting, we are happy to tell you that the clusters are finally back in a running state. You can now start submitting your jobs. All of your data are safe; however, the jobs that were running during the incident were killed and will need to be restarted. We understand how this interruption must have adversely impacted your research and apologize for all the trouble. Please let us know (pace-support@oit.gatech.edu) if there is anything we can do to bring you up to speed once again.
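If it helps while restarting work, here is a minimal sketch of checking whether a job is still known to the scheduler and resubmitting it if not. It assumes a PBS/Torque-style scheduler providing the standard qstat and qsub commands; the job name and job-script path are hypothetical placeholders for your own.

    import subprocess

    JOB_NAME = "my_analysis"      # hypothetical: the name your job script sets
    JOB_SCRIPT = "my_job.pbs"     # hypothetical: path to your job script

    def job_known_to_scheduler(name):
        """Return True if a job with this name shows up in `qstat -f` output."""
        out = subprocess.check_output(["qstat", "-f"]).decode()
        return name in out

    if not job_known_to_scheduler(JOB_NAME):
        # The job was killed during the incident, so submit it again.
        subprocess.check_call(["qsub", JOB_SCRIPT])
        print("Resubmitted " + JOB_SCRIPT)
    else:
        print(JOB_NAME + " is already queued or running; nothing to do.")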

The brief technical explanation of what happened:
At the heart of the problem was a set of fiber optic cables that interacted to intermittently interrupt communications among the Panasas storage modules. When a module stopped communicating, the remaining modules would begin moving the services it handled to a backup location. During that move, one of the other modules (including the one accepting the new service) would send or receive some garbled information, causing the move already in progress to be restarted or an additional service to be relocated, depending upon which modules were involved. Interestingly, the cables themselves appear not to be bad; they instead interacted badly with the networking components. Thus, when cables were replaced or switch ports or the network switch itself were swapped, the problems would appear “fixed” for a short while, then return before a full recovery could be completed. The three vendors involved provided access to their top support and engineering resources, and none of them had seen this kind of behavior before. Our experience has been entered into their knowledge bases for future diagnostics.

Thank you once again for your understanding and patience!

Regards,
PACE Team

Cluster status 9 Sept

Posted on Friday, 9 September, 2011

0800

Our replacements continue to function properly and the overnight tests have all passed. We are starting the process of bringing the cluster back up; as indicated in yesterday evening’s update, this will take us some hours to accomplish, since we will test each step along the way to ensure the operation succeeds.

1050

So far, so good. It looks like we’re still on track to have the cluster ready by early afternoon today. We’ll post a notice here and on the email lists when it is ready.

Note: all jobs that were running should be assumed to have failed. Any job that was in the queue and had not yet started remains in the queue and will be started as soon as the cluster is ready.

Problems Continued (8 Sept)

Posted on Thursday, 8 September, 2011

0720

Our replacement Force-10 switch is expected to be delivered by FedEx before noon. While waiting, we have continued working with Panasas overnight (between short sleeps), providing them with updated information. We’ll continue to update this blog as we have further information.

1500

The replacement was received and installed, but the problems remain. Panasas and Penguin are both on-site and are working with remote support to restore service. We have escalated this to the highest levels at both corporations and believe they are doing their best at this time to help restore service. We will update this blog and send emails once we have some positive news.

2100

After replacement of the Force-10 switch, we continued to have network instability issues with the Panasas. After much sleuthing, it was discovered that there is an interaction between the cable type used to connect the Panasas devices to the Force-10 switch and the switch itself. This was difficult to diagnose due to the intermittent nature of the failures: some cable paths would work fine for a long period of time, then partially corrupt packets for a short period, then be fine once again. The corruption was long and severe enough to alert the Panasas software that there was a problem with one or more of the units, and it would attempt recovery. Part-way through the recovery process, the data path would be fine, but often another path would begin to fail in a similar mode. This is unusual behaviour for any cables and had not been seen before with this cable type. The interaction is now logged in both the Panasas and Force-10 knowledge archives.

Once the cables were replaced, there remained some significant problems with the file systems themselves. Again, there has been no loss or corruption of data; the issue was simply large volumes of information being moved automatically at too frequent an interval. Now that the Panasas realm has settled, it is a long task to re-certify the data partitions and ensure the data is correct. This process is ongoing and will continue overnight as it re-certifies the many TB of data.

If all goes according to plan, we will arrive in the morning to still-stable Panasas storage and will begin restoring cluster operation. We expect to have the cluster back in operation by early afternoon, or earlier if at all possible. Once we have successfully tested the cluster, we will restart all the scheduler services and announce it both here and via the mailing lists. Hopefully, we’ll have the good word in the morning.

Problems continued (7 Sept)

Posted on Wednesday, 7 September, 2011

0830

Panasas engineers are on their way to be on-site. We are expecting tracking information for the Force-10 switch replacement.

We re-configured the networking for the Panasas shelves, bypassing part of the Force-10 switch. The Panasas realm appears stable but remains unavailable via InfiniBand. We will continue to work with Panasas toward a resolution.

1430

The replacement Force-10 switch has not been received. Panasas and Penguin Computing (now both on site) are investigating. Alternate networking topologies are being explored with the campus backbone team and level-3 Panasas engineers. The PACE staff continues to work as quickly as possible to restore service for this critical resource. We are investigating ways to restore access to the files, though at temporarily reduced capacity, until a final solution is obtained. We are also investigating how we may bring access to your files to the head nodes so you can access or copy them as you need, though perhaps at a reduced speed.

These possibilities are being examined by PACE, OIT network, Panasas and Penguin Computing personnel.  We will continue to update you here with information as it becomes available.

1850

The replacement Force-10 switch has not been received and will not be on-site until tomorrow morning. We have a FedEx tracking number and will monitor its progress. Some NFS access to the Panasas has been restored by creating an alternate network topology. While not ideal, this will allow us to bring a few parts of the clusters back into operation. However, at this time, we are still unable to access the Panasas from most of the systems and head nodes. We continue to work as quickly as possible to restore service to this critical resource. We can confirm that, to date, no files have been lost or corrupted through any of these actions; all that remains is to restore access to them.

Panasas engineers remain on-site and engaged, and will continue working on the panfs access problem.

FoRCE headnode emergency restart

Posted on Monday, 1 August, 2011

As most of you have noticed over the past week, the FoRCE headnode has had difficulty keeping up with the high utilization demand and was often unresponsive. We have reconfigured this node to increase its memory and CPU power to address this issue. The reconfiguration and reboot required a short offline period, but none of the submitted jobs on the compute nodes were affected.

We would like to once again remind our users that the FoRCE headnode, like our other headnodes, is not intended for running computations. Head nodes are meant for editing, compiling, and submitting jobs, not for the computations themselves. Please limit your use of GUI sessions, such as browsers, Comsol, Matlab, etc., which put a lot of pressure on shared system resources; batch jobs on the compute nodes are the right place for that work (see the sketch below). We will continue to work to improve the responsiveness and function of the FoRCE headnode.
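As an illustration only, here is a minimal sketch of wrapping a compute-heavy MATLAB run in a batch job instead of running it on a headnode. It assumes a PBS/Torque-style qsub that accepts a job script on standard input; the job name, resource requests, and script path are hypothetical placeholders, not PACE-specific settings.

    import subprocess
    import textwrap

    # Hypothetical job script: the resource requests and the MATLAB command
    # are placeholders; adjust them to match your own allocation and workflow.
    job_script = textwrap.dedent("""\
        #PBS -N matlab_batch
        #PBS -l nodes=1:ppn=4
        #PBS -l walltime=02:00:00
        cd $PBS_O_WORKDIR
        matlab -nodisplay -nosplash -r "run('my_script.m'); exit"
    """)

    # Pipe the script to qsub so the heavy work runs on a compute node,
    # leaving the headnode free for editing, compiling, and job submission.
    subprocess.run(["qsub"], input=job_script.encode(), check=True)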

Thanks for your understanding!

-PACE support