
PACE clusters ready for research

Friday, May 12, 2017

Our May 2017 maintenance period is now complete, far ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and data available. As usual, there are some straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, August 10 through Saturday, August 12, 2017.

New operating system kernel

  • All compute, interactive, and head nodes have received the updated kernel. No user action needed.

DDN firmware updates

  • This update brought low level firmware on drives up to date per recommendation from DDN. No user action needed.

Networking

  • DNS/DHCP and firewall updates per vendor recommendation applied by OIT Network Engineering.
  • IP address reassignments for some clusters completed. No user action needed.

Electrical

  • Power distribution repairs completed by OIT Operations. No user action needed.

PACE quarterly maintenance – May 11, 2017

Monday, May 8, 2017

PACE clusters and systems will be taken offline at 6am this Thursday (May 11) through the end of Saturday (May 13). Jobs with long walltimes will be held by the scheduler to prevent them from being killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Systems

  • We will deploy a recompiled kernel that is identical to the current version except for a patch that addresses the Dirty COW vulnerability. Currently, we have a mitigation in place that prevents the use of debuggers and profilers (e.g., gdb, strace, Allinea DDT). After the deployment of the patched kernel, these tools will once again be available on all nodes. Please let us know if you continue to have problems debugging or profiling your codes after the maintenance day.
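Once the patched kernel is deployed, a quick sanity check along these lines can confirm that ptrace-based tools work again (a minimal sketch; strace availability depends on what is installed or loaded on your node):

```shell
# Confirm which kernel the node is running; after maintenance this should
# report the recompiled (patched) kernel release.
uname -r

# Trace a trivial command; if the mitigation were still active, strace
# would fail to attach.
if command -v strace >/dev/null 2>&1; then
    strace -o /dev/null true && echo "ptrace-based tracing works"
else
    echo "strace not installed on this node"
fi
```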

Storage

  • Firmware updates on all of the DDN GPFS storage (scratch and most of the project storage)

Network

  • Upgrades to DNS servers, as recommended and performed by OIT Network Engineering
  • Software upgrades to the PACE firewall appliance to address a known bug
  • New subnets and re-assignment of IP addresses for some of the clusters

Power

  • PDU fixes affecting 3 nodes in the c29 rack

The date for the next maintenance day is not certain yet, but we will announce it as soon as we have it.

College of Engineering (COE) license servers available starting 5:10 pm yesterday

Wednesday, April 12, 2017

As of 5:10 pm on 11 April 2017, COE license servers are available again.

Multiple power outages across Georgia are affecting several license servers on campus. All efforts have been made to keep systems available. If your jobs report missing or unavailable licenses, please check http://licensewatcher.ecs.gatech.edu/ for the most up-to-date information.

College of Engineering license servers going dark at 3:35 pm

Tuesday, April 11, 2017

College of Engineering (COE) license servers will go dark at 3:35pm. Research and instruction will be impacted.

COE system engineers report that UPS run time is running out. Ansys, Comsol, Abaqus, Solidworks, and other software will go dark. Matlab, Autocad, and NX should remain up, as they run in a different location.

Please test the new patched kernel on TestFlight nodes

Wednesday, March 1, 2017

As some of you are already aware, the Dirty COW exploit was a source of great concern for PACE. This exploit can allow a local user to gain elevated privileges. For more details, please see https://access.redhat.com/blogs/766093/posts/2757141.

In response, PACE applied a mitigation on all of the nodes. While this mitigation is effective in protecting the systems, it has the downside of causing debugging tools (e.g., strace, gdb, and DDT) to stop working. Unfortunately, none of the new (and patched) kernel versions made available by Red Hat support our InfiniBand network drivers (OFED), so we had to leave the mitigation in place for a while. This caused inconvenience, particularly for users who are actively developing codes and rely on these debuggers.

As a long term solution, we patched the source code of the kernel and recompiled it, without changing anything else. Our initial tests were successful, so we deployed it on three of the four online nodes in the testflight queue:

rich133-k43-34-l recompiled kernel
rich133-k43-34-r recompiled kernel
rich133-k43-35-l original kernel
rich133-k43-35-r recompiled kernel

Please test your codes on this queue. Our plan is to deploy the recompiled kernel to all PACE nodes, including headnodes and compute nodes, and we would like to make sure that your codes will continue to run without any difference after this deployment.

The deployment will be a rolling update; that is, we will opportunistically patch nodes, starting with idle ones. There will therefore be a mix of nodes with old and recompiled kernels in the same queues until the deployment is complete. For this reason, we strongly recommend testing multi-node parallel applications with a hostlist that includes the node still running the original kernel (rich133-k43-35-l), to verify your code's behavior with mixed hostlists.
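For example, a multi-node test job along these lines would place the original-kernel node and a recompiled-kernel node in the same hostlist (a sketch assuming Torque-style PBS directives; the node names and queue come from this post, but the ppn counts, walltime, and ./my_mpi_app are illustrative placeholders):

```shell
#PBS -N mixed-kernel-test
#PBS -q testflight
#PBS -l walltime=00:15:00
# Request one slot on the original-kernel node and one on a recompiled-kernel
# node, so the MPI hostlist mixes both kernels:
#PBS -l nodes=rich133-k43-35-l:ppn=1+rich133-k43-34-l:ppn=1

cd $PBS_O_WORKDIR
# Replace ./my_mpi_app with your own code; mpirun options may differ
# depending on which MPI stack you have loaded.
mpirun -hostfile $PBS_NODEFILE ./my_mpi_app
```

Keeping the requested walltime short, as noted below, leaves the testflight nodes free for other users.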

As always, please keep your testflight runs short to allow other users to test their own codes. Please report any problems to pace-support@oit.gatech.edu and we will be happy to help. Hopefully, this deployment will be completely transparent to most users, if not all.

UPS Power System Repair

Wednesday, February 1, 2017

PACE and other systems in the Rich 133 computer room experienced a brief power event on the afternoon of Monday, January 30th. This power event involved significant failure of one of the three uninterruptible power supply (UPS) systems that supply the Rich computer room with stable, filtered power. The UPS system switched over to bypass mode as designed and one of the main power feeder transfer switches also experienced a failure. Stable power continued to the PACE systems and all systems and network devices continued to operate without interruption.

Repair of the failed UPS is underway, but parts may not be available for up to two weeks. During this time, the UPS power system will remain in bypass mode, connecting many of the PACE systems to standard campus power. Our experience shows that campus power is usually clean enough for normal operation, so we continue to operate normally. Repair and re-testing of the UPS can take place without interrupting the existing power. We will announce this repair transition when we have additional information.

Should there be any significant campus power interruption during this interim time, we may lose power to some of the PACE systems. Rest assured the PACE staff will do our best to recover all systems affected by such an event. We will keep you informed via blog and announcement mailing lists of the repair progress.

PACE clusters ready for research

Friday, January 20, 2017

Our January 2017 maintenance period is now complete, far ahead of schedule.  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and data available.  Our next maintenance period is scheduled for Thursday May 11 through Saturday May 13, 2017.

Removal of obsolete /usr/local/packages

  • This set of old software has been made inaccessible. Loading the ‘oldrepo’ module will now produce an error, along with instructions for contacting PACE support for help migrating to the current software repository.

Infiniband switch swap

  • Replacement complete, no user action needed.

Readdressing network management

  • Work complete, no user action needed.

Upgrade of scheduler server for the NovaZohar cluster

  • Upgrade complete, no user action needed.  Further detail has been provided to the users of this cluster.


PACE quarterly maintenance – January 2017

Thursday, January 12, 2017

Dear PACE users,

It is again time for our quarterly maintenance. Starting at 6:00am Thursday, January 19, all resources managed by PACE will be taken offline. Maintenance is scheduled to continue through Saturday evening. Our next maintenance period is scheduled for Thursday May 11 through Saturday May 13, 2017.  We have a reduced scope this time around, as compared to our previous maintenance periods, with only one item visible to users.

Removal of obsolete /usr/local/packages
We will be removing (nearly) all content from /usr/local/packages. This set of software represents a repository two versions old, much of which is incompatible with the currently deployed operating system. We believe that this software is not currently in use – with one exception. We will continue to work with that user to accommodate their needs. Newer and/or compatible versions of all software being removed are available in the current repository.

Old modules, including the module used to access this old repository (oldrepo), will be removed. If you attempt to load these modules in your environment or in PBS scripts, you will get an error. Please contact pace-support@oit.gatech.edu if you need assistance finding replacement modules in the current repository.
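As a sketch of what to expect after the maintenance (assuming the standard Environment Modules `module` command available on PACE login nodes; the fallback branch only covers hosts without modules installed):

```shell
# Loading the retired module now fails, and the error text includes the
# PACE support contact; search the current repository for replacements.
if command -v module >/dev/null 2>&1; then
    module load oldrepo || echo "oldrepo has been removed; see the error text"
    module avail 2>&1 | head -n 20   # browse the current software repository
else
    echo "module command not found on this host"
fi
```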

Infiniband switch swap
We will replace a small infiniband switch used by infrastructure servers with one that has redundant power supplies. This was identified during the recent electrical maintenance by OIT. No user action is required.

Readdressing network management
With the assistance of the OIT Network Engineering team, we will move the management IP addresses for a number of network devices. This will make room for additional user-facing services. As these devices are not accessible to the user community, no user action is required.

Upgrade of scheduler server for the NovaZohar cluster
The scheduler server responsible for the NovaZohar cluster will be upgraded during the maintenance period. This will improve performance for scheduler-related tasks (submitting jobs, querying status, etc.). Previously submitted jobs will be retained and resumed at the conclusion of maintenance. No user action is expected.

Holiday support and PACE staffing

Friday, December 23, 2016

Greetings, and Happy Holidays!

Please note that all PACE clusters will continue to operate during the GT Institute Holiday. However, PACE staff will not be generally available for support. The Rich building will be closed, and the OIT Operations staff will also be away over the holiday, though they remain reachable via phone. If you become aware of a catastrophic, system-wide failure, please notify OIT Operations at (404) 894-4669. They will be able to get in touch with us.

On a much more somber note, Ms. Josephine Palencia, one of the PACE Research Scientists, will be leaving the team for a position in industry, effective January 4. This leaves PACE in a very difficult position with 4 vacant full time positions from a team of 11.5 FTEs. We will continue to do our best to keep things operational, however delays are unavoidable while we complete the respective hiring searches. Please direct interested parties to http://www.pace.gatech.edu/careers.

Power maintenance 12/19/2016 (Monday)

Friday, December 16, 2016

(No user action needed)

We have been informed GT Facilities will perform critical power maintenance beginning at 6am Monday 12/19/2016, in one of the PACE datacenters.

After careful investigation, we believe PACE systems have sufficient power redundancy to allow this work to be completed without downtime or failures. However, there is always a small risk that some jobs or services will be impacted. We will work closely with the OIT operations and facilities teams to help protect running jobs from failures. We will keep all PACE users informed of progress, and will notify you promptly should any failures occur.