The disk array rebuild has completed. Some nodes were brought online during the rebuild to help take jobs, but all nodes should now be online.
November 2, 2011
October 21, 2011
UPDATED: Cygnus/Force: Second failure of new VM Storage (FIXED)
————————————————————
UPDATE: At 8:45pm EDT, Force resumed normal function. The normal computing environment is now restored.
————————————————————
UPDATE: At 7:35pm EDT, Cygnus resumed normal function. Force is still under repair.
————————————————————
5:30pm:
Well folks, I hate to do this to you again, but it looks like I need
to take Cygnus and Force down again, thanks to problems with the storage.
Again, I'll take Cygnus & Force down at 7pm EDT. Please begin the process
of saving your work.
At this point, I'm moving these back to the old storage system, which,
while slow (and it did impact the responsiveness of these machines), at
least stayed running without issues. The new machine showed no issues in
its prior use, so I admit to being a bit flummoxed as to what is going on.
This downtime will be longer, as I need to scrub a few things clean and
make sure the VMs are intact and usable.
I’ll let you know when things are back online. I don’t have good
estimates this time.
No scheduled compute jobs will be impacted.
I, along with the rest of the PACE team, apologize for the continued
interruption in service, and we hope to rectify these issues within a
couple of hours.
Thanks for your patience.
bnm
October 20, 2011
Urgent: Cygnus & FoRCE head nodes reboot at 7pm due to storage issues
Hey folks,
We suffered a temporary loss of connectivity to the backend storage
serving our VM farm earlier this afternoon. As a result, several running
VMs remounted their OS filesystems read-only.
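If you want to check whether a node you're on was affected, here's a
minimal sketch (assuming a standard Linux /proc/mounts; the exact mount
names will vary by machine) that lists any filesystems currently
mounted read-only:

    #!/usr/bin/env python
    # Minimal sketch: flag any filesystems mounted read-only by
    # parsing /proc/mounts on a Linux node.
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype, options = line.split()[:4]
            # Mount options are comma-separated; "ro" means read-only.
            if "ro" in options.split(","):
                print("read-only: %s on %s (%s)" % (device, mountpoint, fstype))

A healthy node should show nothing beyond the usual read-only system
mounts, if any.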
The filesystems on which your data is stored are fine, however.
Unfortunately, the head nodes for the Cygnus and FoRCE clusters were
affected, and judging by our previous experience with this, we need to
reboot these nodes soon. As such, we ask any currently logged-in users
to please save their data now and log out.
We are scheduling a reboot of these systems at 7:00pm EDT. A few
minutes after that, the nodes should be available and fully functional.
No jobs have been lost, nor will any be lost, in this process.
We are sorry for the inconvenience, and plan to keep you up to date
with any further issues with these, as well as the rest of the machines.
August 31, 2011
Joe Cluster storage issues
Hey folks,
It looks like the project server for Joe started having hardware issues around 3:40pm on August 30. The affected component impaired the server's ability to reliably store and retrieve data from the storage array. As such, some data loss is possible, along with some problems with jobs.
Please check the status of your jobs if they were running on the Joe cluster between 3:39pm on August 30 and 12:00pm on August 31.
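If it helps, here's a minimal sketch for listing your jobs from the command line, assuming a PBS/Torque-style scheduler with the standard qstat command on your PATH (output formats vary by site):

    #!/usr/bin/env python
    # Minimal sketch: show the current user's jobs via the scheduler.
    # Assumes a PBS/Torque-style "qstat" is available on the PATH.
    import getpass
    import subprocess

    user = getpass.getuser()
    # "qstat -u <user>" limits the listing to that user's jobs.
    subprocess.call(["qstat", "-u", user])

Compare the job IDs and states against what you expect; anything that died during that window should be resubmitted.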
The project server has been brought back online and should be functioning normally. We will be keeping an eye on it so that, if the system sees this error again, we know about it immediately and can address it. We are also checking additional equipment on hand for similar issues.