The PACE team is urgently working on two ongoing critical issues with the clusters:
Scratch storage
We are aware of access, speed, and reliability issues with the high-speed scratch storage system, and we are working with the vendor to define and implement a solution. The vendor tells us that a new version of the storage system firmware, released today, will likely resolve these issues. The PACE team is expecting the arrival of a test unit on which we can verify the vendor’s solution. Once we have verified it, we are considering an emergency maintenance period for the entire cluster in order to implement the solution. We would appreciate your feedback on this approach, especially regarding its impact on your research. We will let you know once a solution is confirmed and will work with you on scheduling.
Scheduler
We are presently preparing a new system to host the scheduler software. We expect the more powerful system to alleviate many of the difficulties you are experiencing with the scheduler, especially delays in job scheduling and time-outs when requesting information or scheduling jobs. Once the new system is ready, we will have to suspend the scheduler for a few minutes while we transition services to it. We do not anticipate that this transition will cause the loss of any running or queued jobs.
In both situations, we will notify you well in advance of any potential interruption and work with you to minimize the impact on your research schedule.
– Paul Manno