Author Archive

GPFS Problem

Posted on Friday, 1 September, 2017

We are actively debugging a GPFS storage problem on our systems that has unfortunately taken many queues offline. We have not yet identified the cause or a fix, but will post an update as soon as possible.

We apologize for the inconvenience and are actively working on a solution.

Please Use iw-dm-4 for File Transfers

Posted on Tuesday, 29 August, 2017

We experienced a slowdown yesterday on all headnodes, caused by an unusually large number of user file operations run from the headnodes. All headnodes are virtual machines and connect to the GPFS filesystem through a network file system gateway. This gateway became overwhelmed by the user file operations, which in turn slowed down file operations on every headnode.

If you have heavy file operations (e.g., WinSCP, FileZilla, SCP), please perform them by logging directly into iw-dm-4.pace.gatech.edu instead of a headnode. Other file operations, such as tarring or zipping, are best performed on compute nodes by submitting interactive or batch jobs, or on iw-dm-4.
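As a rough illustration of the advice above, the following Python sketch assembles an scp command that sends a transfer through iw-dm-4 rather than a headnode. Only the host name comes from this post; the helper name `build_scp_command`, the user name `gburdell3`, and the paths are hypothetical placeholders, so substitute your own.

```python
# Minimal sketch: build an scp command that targets the data mover
# instead of a headnode. Names and paths below are illustrative assumptions.

DATA_MOVER = "iw-dm-4.pace.gatech.edu"  # host named in this post


def build_scp_command(username, local_path, remote_dir):
    """Return the argument list for a recursive scp upload via the data mover."""
    target = f"{username}@{DATA_MOVER}:{remote_dir}"
    # -r copies directories recursively; it is harmless for a single file.
    return ["scp", "-r", local_path, target]


if __name__ == "__main__":
    # Hypothetical user name and paths, for illustration only.
    cmd = build_scp_command("gburdell3", "results", "~/data/")
    # Print the command so it can be reviewed before running it by hand.
    print(" ".join(cmd))
```

The sketch only prints the command; to run it from Python you could pass the list to `subprocess.run`, which avoids shell quoting problems with unusual file names.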

We’re actively looking into alternatives to virtual machine headnodes, and will provide more detailed updates as we approach our upcoming scheduled maintenance in November (via blog.pace.gatech.edu). If you have any questions, please email us at pace-support@oit.gatech.edu.

Resolved: Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We’ve addressed some of the problems with the TrueNAS storage and CSH/TCSH should now be working again. As it turns out, this problem wasn’t actually related to the maintenance last week, and we will continue to work with the vendor regarding the cause.

Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We’ve observed a correlation between hanging processes on all PACE systems and csh/tcsh, and are continuing to investigate. For the time being, we ask that you please refrain from running any commands that rely on csh or tcsh. The problem appears to be related to the TrueNAS storage system, and we’re working with OIT and iXsystems to resolve it.

Storage (GPFS) Issue Update

Posted on Tuesday, 25 July, 2017

We saw a reduction in the GPFS filesystem problems over the past weekend, and are continuing to work actively with the vendor. We don’t have a complete solution yet, but compute nodes have been more stable on the GPFS filesystem. Thank you for your patience. We will keep you updated as the situation changes.

Storage (GPFS) Issue Update

Posted on Friday, 14 July, 2017

While the problem wasn’t very widespread and we have improved the reliability, we have not yet arrived at a full solution and are still actively working on the problem. We now believe the problem is due to the recent addition of many compute nodes, ultimately bringing us into the next tier of system-level tuning needed for the filesystem. Thank you for your patience – we will continue to provide updates as they become available.

Storage (GPFS) Issue

Posted on Wednesday, 12 July, 2017

We are experiencing intermittent problems with the GPFS storage system that hosts scratch and project directories (~/scratch and ~/data). We are working with the vendor to determine whether this failure is related to the cluster nodes that were recently brought online.

This issue may affect running jobs. We are actively working on the problem, apologize for the inconvenience, and will post an update as soon as possible.

Storage (GPFS) and datacenter problems resolved

Posted on Monday, 19 June, 2017

All node and GPFS filesystem issues caused by the power failure should have been resolved as of late Friday evening (June 16). If you are still experiencing problems, please let us know at pace-support@oit.gatech.edu.

Large Scale Problem

Posted on Wednesday, 7 June, 2017

Update (6/7/2017, 1:20pm): The network issues have been addressed and systems are back in normal operation. Please check your jobs and resubmit any that failed. If you continue to experience problems, or need our assistance with anything else, please contact us at pace-support@oit.gatech.edu. We are sorry for the inconvenience and thank you once again for your patience.

Original message: We are experiencing a large-scale network problem that is affecting multiple storage servers and the software repository, and may affect running jobs. We are actively working to resolve it and will provide updates as we have them. We appreciate your patience and understanding, and are committed to resolving the issue as soon as we possibly can.

College of Engineering (COE) license servers available starting 5:10 pm yesterday

Posted on Wednesday, 12 April, 2017

As of 5:10 pm on 11 April 2017, COE license servers are available again.

Multiple Georgia power outages have been affecting several license servers on campus. Every effort has been made to keep systems available. If your jobs report missing or unavailable licenses, please check http://licensewatcher.ecs.gatech.edu/ for the most up-to-date information.