Update: (3/10/2016, 5:00pm) Most issues are resolved, back to normal operation
The GPFS storage is back to normal performance and has been stable for several days. We will continue to explore additional steps with DDN to improve GPFS performance and will schedule any recommended upgrades for our maintenance window in April. Please continue to let us know if you observe any difficulties with this critical component of the PACE clusters.
What happened:
As with most significant storage issues, several difficulties compounded one another. We identified multiple factors contributing to the GPFS performance problems: uncommonly high user activity, a bad cable connection, a memory misconfiguration introduced on most systems when we added the new GPFS scratch file system, and a scheduler configuration change needed to make correct use of the new scratch space.
What was impacted:
Performance of all GPFS file systems suffered greatly during the event. Compute, login, and interactive nodes, as well as the scheduler servers, temporarily lost their mount points, which impacted some running jobs. There was no loss of user data or data integrity at any point.
What we did:
We contacted the storage vendor's support team and worked with them in several phone and screen-sharing sessions to isolate and correct each of the problems. We have added storage and node monitoring to detect the memory and file system conditions that contributed to this failure, and we have discussed operational and optimization steps with the users involved.
What is ongoing:
We continue to work with the vendor to resolve any remaining issues and will strive to further improve performance of the file system.
Update: (3/4/2016, 6:30pm) GPFS storage still stable, albeit with intermittent slowness
GPFS storage has been mostly stable. While not yet back to previous levels, the performance of the GPFS storage continued to improve today. We identified multiple factors contributing to the problem, including uncommonly high user activity. There are almost half a billion files in the system, and bandwidth usage has approached the design peak a few times, which is unprecedented. While it’s great that the system is utilized at those levels, the impact of problems inevitably gets amplified under high load. We continue to work with the vendor to resolve the remaining issues and really appreciate your patience with us during this long battle.
You can help us a great deal by avoiding large data operations (e.g. cp, scp, rsync) on the headnodes. The headnodes are low-capacity VMs that do not mount GPFS using native clients; instead, all of their traffic goes through a single NFS fileserver. The proper location for all data operations is the datamover node (iw-dm-4.pace.gatech.edu), which is a physical machine with fast network connections to all storage servers. Please limit your activity on the datamover machine strictly to data operations. We noticed several users running regular computations on this node and had to kill those processes.
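For example, a transfer routed through the datamover node rather than a headnode could look roughly like the following sketch; the username and paths are placeholders, not real locations:

    # Run from your local machine; point rsync at the datamover node,
    # not a headnode. The username and paths below are placeholders.
    rsync -av --progress username@iw-dm-4.pace.gatech.edu:~/scratch/my_run/ ./my_run/

    # Or log in to the datamover node directly for copies within PACE storage
    # (again, placeholder paths):
    ssh username@iw-dm-4.pace.gatech.edu
    cp -r ~/scratch/my_run ~/data/my_run_archive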
Update: (3/1/2016, 7:30pm) GPFS storage has stabilized and schedulers have resumed.
GPFS storage appears to have stabilized, with no data loss, and we have resumed scheduling so new jobs can start on the cluster.
It seems our system had outgrown an important storage configuration parameter (tokenMemLimit), which scales roughly with the number of open files, times the number of file systems, times the number of nodes across the whole storage system. The storage system gave no warning of the impending failure. We had observed and were investigating some symptoms which we, of course, now understand more clearly. We have asked the vendor to review the remaining parameters and recommend any additional changes.
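For reference, a cluster-wide GPFS parameter of this kind is typically inspected and adjusted from an administrative node along these lines; this is only an illustrative sketch, the value shown is a placeholder, and the exact syntax and restart requirements depend on the GPFS release and vendor guidance:

    # Illustrative sketch only -- requires GPFS administrative access.
    mmlsconfig tokenMemLimit          # show the current cluster-wide setting
    mmchconfig tokenMemLimit=2G       # placeholder value; follow vendor guidance before changing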
Update: (3/1/2016, 4:30pm) Schedulers paused, new jobs will not start
Unfortunately, we lost GPFS storage on the majority of compute nodes, potentially impacting running jobs that use this file storage system (most project directories and all scratch). To prevent further problems, we have temporarily paused the schedulers. Your submissions (qsub) will appear to hang until we resume scheduling.
What’s happening
Many users have noticed that GPFS storage has slowed down recently. In some cases, this causes commands (e.g. ‘ls’) to become unresponsive on the headnodes.
Who’s impacted
GPFS storage includes some project space (data), the new scratch, and Tardis-6 queue home directories.
How PACE is responding
We are taking the first of several steps to address this issue. Instead of taking an unplanned downtime, we plan to submit jobs that request entire nodes, so the fix can be applied while no other jobs are actively running on each node.
These jobs will be submitted by the “pcarey8” user and will run on all of the queues. You will continue to see these jobs until all nodes are fixed, which may take a long time, since it depends on when nodes that are already running long-walltime jobs become available. Once a node is acquired, however, the fix itself will not take long to apply.
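For reference, a job that claims an entire node (so the fix can run with nothing else on it) would be submitted with a script roughly like the sketch below; the job name, core count, walltime, queue, and script path are placeholders, not the actual values used by these jobs:

    #PBS -N node-fix                  # placeholder job name
    #PBS -l nodes=1:ppn=16            # request all cores so no other job shares the node (ppn is a placeholder)
    #PBS -l walltime=00:30:00         # placeholder walltime
    #PBS -q somequeue                 # placeholder queue name
    /path/to/apply_fix.sh             # placeholder path to the script that applies the fix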
How you can help
* Replace “$PBS_O_WORKDIR” with the actual path to your working directory in your submission (PBS) scripts (see the sketch after this list).
* Avoid concurrent data transfers and operations involving very large numbers of files.
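As a sketch of the first suggestion, the relevant change in a submission script looks roughly like this; the paths, resource request, and executable name are placeholders:

    #PBS -N my_job                                   # placeholder job name
    #PBS -l nodes=1:ppn=4,walltime=04:00:00          # placeholder resource request

    # Instead of relying on the scheduler to resolve the submission directory:
    #   cd $PBS_O_WORKDIR
    # spell out the full path explicitly (placeholder path shown):
    cd /path/to/my/working/directory
    ./run_my_analysis                                # placeholder executable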