Author Archive

Campus preparedness and Hurricane Irma

Posted by on Friday, 8 September, 2017

Greetings PACE community,

As Hurricane Irma makes its way along its projected path through Florida and into Georgia, I’d like to let you know what PACE is doing to prepare.

OIT Operations will be closely monitoring the path of the storm and any impacts it might have on the functionality of the computer rooms in the Rich Computer Center and our backup facility on Marietta Street. In the event that either of these facilities were to lose power, they will enact emergency procedures and respond as best they can.

What does this mean for PACE?

The room where we keep the compute nodes has only a few minutes of battery-protected power, enough to ride through momentary glitches but not a sustained outage. In the event of a power loss, compute nodes will power down, terminating whatever jobs are running. The rooms where we keep our servers, storage, and backups have additional generator power that can keep them running longer, but this too is a finite resource. In the event of an extended power loss, PACE will begin an orderly shutdown of servers and storage in order to reduce the chance of data corruption or loss.

The bottom line: our priority is to protect critical research data and enable a successful resumption of research once power is restored.

Where to get further updates?

Our primary communications channels remain our mailing list, pace-availability@lists.gatech.edu, and the PACE blog (http://blog.pace.gatech.edu). However, substantial portions of the IT infrastructure required for these to operate are also located in campus data centers. Additionally, OIT employs a cloud-based service to publish status updates. In the event that our blog is unreachable, please visit https://status.gatech.edu.

GPFS problem (resolved)

Posted by on Saturday, 2 September, 2017

This was much ado about nothing.  Running jobs continued to execute normally through this event, and no data was at risk.  What did happen is that jobs that could potentially have started were delayed.

A longer explanation –

We have monitoring agents that prevent jobs from starting if they detect a potential problem with the system.  The idea is to avoid starting a job when there’s a known reason it would crash.  During our last maintenance period, we brought a new DDN storage system online and configured these agents to watch it for issues.  The new system did develop an issue; the monitoring agents flagged it and marked nodes offline to new jobs.  However, since we have yet to put any production workloads on this new storage, no running jobs were affected.

At the moment, we’re pushing out a change to the monitoring agents to ignore the new storage.  As this finishes rolling out, compute nodes will come online and resume normal processing.  We’re also working with DDN to address the issue on the new storage system.
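The gating behavior described above can be sketched as a small health-check script. This is a hypothetical illustration only: the probe method and the offline action are assumptions, not PACE's actual agent.

```shell
#!/bin/sh
# Hypothetical sketch of a storage health check in the spirit of the
# monitoring agents described above; the probe path and offline action
# are assumptions, not the real implementation.
check_storage() {
    mnt="$1"
    probe="$mnt/.health_probe_$$"
    # Attempt a small write; a broken or unmounted filesystem fails here.
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "online"
    else
        # A real agent would mark the node offline to new jobs here,
        # e.g. `pbsnodes -o $(hostname)` on a Torque-based system.
        echo "offline"
    fi
}

check_storage /tmp   # a healthy filesystem prints "online"
```

Under this sketch, ignoring a particular filesystem (as we are now doing for the new DDN storage) amounts to dropping it from the list of mounts the agent probes.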

PACE clusters ready for research

Posted by on Friday, 12 May, 2017

Our May 2017 maintenance period is now complete, far ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and data available. As usual, there are some straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, August 10 through Saturday, August 12, 2017.

New operating system kernel

  • All compute, interactive, and head nodes have received the updated kernel. No user action needed.

DDN firmware updates

  • This update brought low level firmware on drives up to date per recommendation from DDN. No user action needed.

Networking

  • DNS/DHCP and firewall updates per vendor recommendation applied by OIT Network Engineering.
  • IP address reassignments for some clusters completed. No user action needed.

Electrical

  • Power distribution repairs completed by OIT Operations. No user action needed.

PACE clusters ready for research

Posted by on Friday, 20 January, 2017

Our January 2017 maintenance period is now complete, far ahead of schedule.  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and data available.  Our next maintenance period is scheduled for Thursday May 11 through Saturday May 13, 2017.

Removal of obsolete /usr/local/packages

  • This set of old software has been made inaccessible. Loading the ‘oldrepo’ module will now result in an error, along with instructions to contact PACE support for help moving to the current software repository.

Infiniband switch swap

  • Replacement complete, no user action needed.

Readdressing network management

  • Work complete, no user action needed.

Upgrade of scheduler server for the NovaZohar cluster

  • Upgrade complete, no user action needed.  Further detail has been provided to the users of this cluster.

 

PACE quarterly maintenance – January 2017

Posted by on Thursday, 12 January, 2017

Dear PACE users,

It is again time for our quarterly maintenance. Starting at 6:00am Thursday, January 19, all resources managed by PACE will be taken offline. Maintenance is scheduled to continue through Saturday evening. Our next maintenance period is scheduled for Thursday May 11 through Saturday May 13, 2017.  This time our scope is reduced compared to previous maintenance periods, with only one item visible to users.

Removal of obsolete /usr/local/packages
We will be removing (nearly) all content from /usr/local/packages. This set of software represents a repository two versions old, much of which is incompatible with the currently deployed operating system. We believe that this software is not currently in use – with one exception. We will continue to work with that user to accommodate their needs. Newer and/or compatible versions of all software being removed are available in the current repository.

Old modules, including the module used to access this old repository (oldrepo), will be removed. If you attempt to load these modules in your environment or in PBS scripts, you will get an error. Please contact pace-support@oit.gatech.edu if you need assistance finding replacement modules in the current repository.
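As an illustration, a PBS script can guard its module loads and surface a useful message instead of a bare failure. The module name below (myapp/2.1) is an invented example, not a real entry in the repository; check `module avail` for actual replacements.

```shell
#!/bin/sh
# Hypothetical fragment of a PBS script being updated away from the
# removed 'oldrepo' module; 'myapp' is an invented example name.

# Before (now produces an error):
#   module load oldrepo
#   module load myapp/1.0

# After: load the equivalent from the current repository, with a
# fallback message mirroring the error-plus-instructions behavior
# described above.
load_or_report() {
    if ! module load "$1" 2>/dev/null; then
        echo "module $1 unavailable; contact pace-support@oit.gatech.edu"
    fi
}
load_or_report myapp/2.1
```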

Infiniband switch swap
We will replace a small infiniband switch used by infrastructure servers with one that has redundant power supplies. This was identified during the recent electrical maintenance by OIT. No user action is required.

Readdressing network management
With the assistance of the OIT Network Engineering team, we will move the management IP addresses for a number of network devices. This will make room for additional user-facing services. As these devices are not accessible to the user community, no user action is required.

Upgrade of scheduler server for the NovaZohar cluster
The scheduler server responsible for the NovaZohar cluster will be upgraded during the maintenance period. This will improve performance for scheduler-related tasks (submitting jobs, querying status, etc.). Previously submitted jobs will be retained and resumed at the conclusion of maintenance. No user action is expected.

Holiday support and PACE staffing

Posted by on Friday, 23 December, 2016

Greetings, and Happy Holidays!

Please note that all PACE clusters will continue to operate during the GT Institute Holiday. However, PACE staff will not be generally available for support. The Rich building will be closed, and OIT Operations staff will not be on site over the holiday, though they remain reachable via phone. If you become aware of a catastrophic, system-wide failure, please notify OIT Operations at (404) 894-4669. They will be able to get in touch with us.

On a much more somber note, Ms. Josephine Palencia, one of the PACE Research Scientists, will be leaving the team for a position in industry, effective January 4. This leaves PACE in a very difficult position, with four vacant full-time positions on a team of 11.5 FTEs. We will continue to do our best to keep things operational; however, delays are unavoidable while we complete the respective hiring searches. Please direct interested parties to http://www.pace.gatech.edu/careers.

PACE clusters ready for research

Posted by on Saturday, 15 October, 2016

Our October 2016 maintenance period is now complete. We’ve upgraded compute nodes, login nodes, and interactive (post-processing) nodes to the RedHat Enterprise Linux 6.7 load previously deployed on the TestFlight cluster.  This included a large number of bugfix and security patches, a major step forward in the Infiniband layer, and recompiled versions of various MPI libraries, scientific libraries, and applications in /usr/local.  Please do let us know (via email to pace-support@oit.gatech.edu) if you see issues with your jobs.

We have brought compute nodes online and released previously submitted jobs. As usual, a number of compute nodes still need to be brought back online, but we are actively working to make them available as soon as possible.

DDN/GPFS work
Hardware repairs to the project directory (~/data) system are complete.  Minor repairs to the scratch system will be rescheduled for a future maintenance period.  The issue is minor and should not disrupt performance or availability of the scratch system.  No user action is expected.

Networking
The OIT Network Engineering team upgraded the software running on many of our switches to match that which is running elsewhere on campus.  This included our firewalls.  No user actions are expected.

Electrical work
These problems were a bit more extensive than originally anticipated.  With some help from the OIT Operations team, we have an alternate solution in place, and we will complete this work during a future maintenance period.  No user action is expected.

Bonus objectives
We were able to add capacity to the project directory system, and we now have our first single filesystem that’s greater than a petabyte, coming in at about 1.3PB.  Maybe that’ll last us a couple of weeks.  😉  Disks for the scratch system have been installed, and we will add them into the scratch filesystem shortly.  This can be done live, without impact to running jobs.

PACE quarterly maintenance – October 2016

Posted by on Monday, 10 October, 2016

Dear PACE users,

It is again time for our quarterly maintenance. Starting at 6:00am Thursday, October 13, all resources managed by PACE will be taken offline. Maintenance is scheduled to continue through Saturday evening. Our next maintenance period is scheduled for January 19-21, 2017.

As previously posted, our major activity this time around will be an upgrade of all compute nodes, head nodes and interactive nodes from RedHat 6.5 to RedHat 6.7. This is simply deploying the operating system and software repository we have been testing internally and have made available on the TestFlight cluster so the broader PACE community can perform testing as well. I’ll issue one more plea for additional testing to the user community and ask that you report experiences, either positive or negative, to pace-support@oit.gatech.edu.

DDN/GPFS work
We have some hardware repairs to perform on both DDN devices. Internal redundancy has prevented any availability or integrity problems so we’ll take advantage of the downtime for DDN engineers to make repairs. No user action is expected.

Networking
The OIT Network Engineering team will be upgrading the software running on a number of PACE network devices. These updates are for stability and have been running on similar devices in the campus network for some time. No user action is expected.

Electrical work
OIT Operations will be replacing some faulty rack-level power distribution in a couple of racks. No user action is expected.

Bonus objectives
If sufficient time remains in the maintenance period, we will add capacity to both the GPFS scratch and project systems. We are able to perform these activities during normal operations, so we will prioritize bringing things back into operation as soon as possible.

TestFlight cluster available with RHEL6.7

Posted by on Thursday, 22 September, 2016

The TestFlight cluster is now available with the updated RHEL6.7 load, as well as recompiled versions of some software in /usr/local. Please do log in to testflight-6.pace.gatech.edu and try submitting your jobs to the ‘testflight’ queue. If you have any problems, please send a note to pace-support@oit.gatech.edu.
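For anyone who hasn’t used the testflight queue before, a minimal Torque/PBS submission might look like the following sketch. The job name, resource requests, and application are placeholders to adapt to your own work.

```shell
#!/bin/sh
# Write out a minimal, hypothetical PBS script targeting the
# 'testflight' queue; resource values below are placeholders.
cat > testflight_job.pbs <<'EOF'
#PBS -N mytest
#PBS -q testflight
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00
cd $PBS_O_WORKDIR
./my_app        # your recompiled application goes here
EOF
# Submit from testflight-6.pace.gatech.edu with:
#   qsub testflight_job.pbs
```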

PACE quarterly maintenance – October 2016

Posted by on Tuesday, 13 September, 2016

Dear PACE users,

Quarterly maintenance is fast approaching. Starting at 6:00am on Thursday, October 13, all resources managed by PACE will be taken offline. Maintenance will continue through Saturday evening unless we are able to finish sooner.

Our major activity this maintenance period will be an operating system upgrade for all compute nodes, head nodes and interactive nodes. This update will take us from RedHat 6.5 to RedHat 6.7, and includes important security and bug fix updates to the operating system, a new Infiniband layer and some recompiled versions of existing /usr/local software. Some applications have shown increased performance as well.

*** IMPORTANT USER ACTION NEEDED ***

PACE staff have been testing this upgrade using various existing applications but we need your help to ensure a smooth rollout. As of today, we have begun applying these updates to our TestFlight cluster, which is available for all to use. We’ll send out a follow up communication when it is ready. PLEASE, PLEASE, PLEASE, use the next few weeks to try your codes on the TestFlight cluster and send feedback to pace-support@oit.gatech.edu. We would especially like to hear of any issues you may have, but reports of working applications would be helpful as well.

Our goal is to provide the smoothest possible conversion to the updated operating system, and we ask that you help us ensure an easy transition back into normal operation by availing yourself of the TestFlight cluster.