
Author Archive

Resolved: Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We’ve addressed some of the problems with the TrueNAS storage, and CSH/TCSH should now be working again. As it turns out, this problem was not actually related to last week’s maintenance, and we will continue to work with the vendor to determine the cause.

Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We have observed a correlation between hanging processes on all PACE systems and csh/tcsh, and we are continuing to investigate. For the time being, please refrain from running commands that rely on csh or tcsh. The issue appears to be related to the TrueNAS storage system, and we are working with OIT and iXsystems to resolve it.
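A quick way to tell whether this advisory applies to your account is to check your login shell. The snippet below is a minimal, unofficial sketch in Python; the printed messages are illustrative only, not official guidance.

    # Minimal sketch: report whether your login shell is csh or tcsh,
    # so you know if this advisory applies to your account.
    import os
    import pwd

    login_shell = pwd.getpwuid(os.getuid()).pw_shell
    if login_shell.endswith(("csh", "tcsh")):
        print(f"Login shell is {login_shell}: avoid csh/tcsh commands for now.")
    else:
        print(f"Login shell is {login_shell}: not a csh/tcsh shell.")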

Storage (GPFS) Issue Update

Posted on Tuesday, 25 July, 2017

We have seen a reduction in GPFS filesystem problems over the past weekend and are continuing to work actively with the vendor. We do not have a complete solution yet, but we have observed greater stability on the compute nodes that use the GPFS filesystem. Thank you for your patience – we will continue to keep you updated as the situation changes.

Storage (GPFS) Issue Update

Posted on Friday, 14 July, 2017

While the problem was not very widespread and we have improved reliability, we have not yet arrived at a full solution and are still actively working on it. We now believe the problem stems from the recent addition of many compute nodes, which has pushed the filesystem into the next tier of system-level tuning. Thank you for your patience – we will continue to provide updates as they become available.

Storage (GPFS) Issue

Posted on Wednesday, 12 July, 2017

We are experiencing intermittent problems with the GPFS storage system that hosts the scratch and project directories (~/scratch and ~/data). We are exploring this failure with the vendor to determine whether it may be related to the cluster nodes that were recently brought online.

This issue may impact running jobs. We are actively working on the problem, apologize for the inconvenience, and will post updates as soon as possible.
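If you want a quick way to see whether the affected directories are responding from a login node without risking a hung terminal, something like the sketch below may help. The paths and the 10-second timeout are assumptions based on the directories named above, not an official diagnostic.

    # Minimal sketch: probe the GPFS-hosted directories with a timeout so a
    # hung mount does not block the shell indefinitely.
    import os
    import subprocess

    for path in (os.path.expanduser("~/scratch"), os.path.expanduser("~/data")):
        try:
            subprocess.run(["ls", path], timeout=10, check=True,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            print(f"{path}: responding")
        except subprocess.TimeoutExpired:
            print(f"{path}: no response within 10s (possible GPFS hang)")
        except subprocess.CalledProcessError:
            print(f"{path}: listing failed (mount may be unavailable)")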

Storage (GPFS) and datacenter problems resolved

Posted on Monday, 19 June, 2017

All node and GPFS filesystem issues caused by the power failure should be resolved as of late Friday evening (June 16). If you are still experiencing problems, please let us know at pace-support@oit.gatech.edu.

Large Scale Problem

Posted on Wednesday, 7 June, 2017

Update (6/7/2017, 1:20pm): The network issues have been addressed and systems are back in normal operation. Please check your jobs and resubmit failed jobs as needed; a brief sketch of one way to check job status appears after this post. If you continue to experience any problems, or need our assistance with anything else, please contact us at pace-support@oit.gatech.edu. We are sorry for the inconvenience and thank you once again for your patience.

Original message: We are experiencing a large-scale network problem impacting multiple storage servers and the software repository, with a potential impact on running jobs. We are actively working to resolve it and will provide updates as often as possible. We appreciate your patience and understanding, and are committed to resolving the issue as soon as we possibly can.
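For reference, the sketch below shows one way to review your jobs before resubmitting. It assumes a Torque/Moab-style scheduler (qstat/qsub); if your queue uses a different scheduler, substitute its equivalents. The job script path is purely illustrative.

    # Minimal sketch, assuming a Torque/Moab-style scheduler (qstat/qsub).
    # Lists your current jobs so you can spot any that failed and need resubmitting.
    import getpass
    import subprocess

    user = getpass.getuser()

    # Show your queued and running jobs.
    subprocess.run(["qstat", "-u", user], check=False)

    # To resubmit a failed job, submit its script again; the path below is
    # illustrative only.
    # subprocess.run(["qsub", "my_job_script.pbs"], check=True)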

College of Engineering (COE) license servers available starting 5:10 pm yesterday

Posted on Wednesday, 12 April, 2017

As of 5:10 pm on 11 April 2017, the COE license servers are available again.

Multiple Georgia Power outages are affecting license servers across campus. All efforts have been made to keep systems available. If your jobs report missing or unavailable licenses, please check http://licensewatcher.ecs.gatech.edu/ for the most up-to-date information.

College of Engineering license servers going dark at 3:35 pm

Posted on Tuesday, 11 April, 2017

College of Engineering (COE) license servers will go dark at 3:35 pm. Research and instruction will be impacted.

COE system engineers have stated that the UPS is running out of run time. Ansys, Comsol, Abaqus, Solidworks, and other software will go dark. Matlab, AutoCAD, and NX should remain up (they run in a different location).

UPS Power System Repair

Posted on Wednesday, 1 February, 2017

PACE and other systems in the Rich 133 computer room experienced a brief power event on the afternoon of Monday, January 30th. This event involved a significant failure of one of the three uninterruptible power supply (UPS) systems that supply the Rich computer room with stable, filtered power. The UPS switched over to bypass mode as designed, and one of the main power feeder transfer switches also failed. Stable power continued to the PACE systems, and all systems and network devices continued to operate without interruption.

Repair of the failed UPS is underway, but parts may not be available for up to two weeks. During this time, the UPS power system will remain in bypass mode, connecting many of the PACE systems to standard campus power. Our experience shows that campus power is usually clean enough for normal operation, so we are operating normally. Repair and re-testing of the UPS can take place without interrupting the existing power. We will announce this repair transition when we have additional information.

Should there be any significant campus power interruption during this interim period, we may lose power to some of the PACE systems. Rest assured that the PACE staff will do our best to recover all systems affected by such an event. We will keep you informed of the repair progress via the blog and announcement mailing lists.