
Archive for category Uncategorized

College of Engineering license servers going dark at 3:35 pm

Posted on Tuesday, 11 April, 2017

College of Engineering (COE) license servers will go dark at 3:35 pm. Research and instruction will be impacted.

COE system engineers report that the UPS is running out of run time. Ansys, Comsol, Abaqus, Solidworks, and other software will go dark. Matlab, Autocad, and NX should remain available, as they run in a different location.

Please test the new patched kernel on TestFlight nodes

Posted on Wednesday, 1 March, 2017

As some of you are already aware, the Dirty COW exploit (CVE-2016-5195) was a source of great concern for PACE. This exploit can allow a local user to gain elevated privileges. For more details, please see https://access.redhat.com/blogs/766093/posts/2757141.

In response, PACE applied a mitigation on all of the nodes. While this mitigation is effective in protecting the systems, it has the downside of causing debugging tools (e.g. strace, gdb, and DDT) to stop working. Unfortunately, none of the new (and patched) kernel versions made available by Red Hat support our Infiniband network drivers (OFED), so we had to leave the mitigation in place for a while. This caused inconvenience, particularly for users who are actively developing code and rely on these debuggers.

As a long-term solution, we patched the kernel source code and recompiled it, without changing anything else. Our initial tests were successful, so we deployed it on three of the four online nodes in the testflight queue:

rich133-k43-34-l: recompiled kernel
rich133-k43-34-r: recompiled kernel
rich133-k43-35-l: original kernel
rich133-k43-35-r: recompiled kernel

We would like to ask you to test your codes on this queue. Our plan is to deploy this recompiled kernel to all PACE nodes, including headnodes and compute nodes. We would like to make sure that your codes will continue to run unchanged after this deployment.

The deployment will be a rolling update; that is, we will opportunistically patch nodes, starting with idle ones. Until the deployment is complete, queues will therefore contain a mix of nodes with old and recompiled kernels. For this reason, we strongly recommend testing multi-node parallel applications with hostlists that include the node still running the original kernel (rich133-k43-35-l), as sketched below, to verify how your code behaves on mixed hostlists.
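For illustration, here is a minimal Torque job script along those lines; the queue and node names come from this post, while the walltime, core counts, and application command are placeholders:

#!/bin/bash
#PBS -N kernel-mix-test
#PBS -q testflight
# Request one recompiled-kernel node plus the original-kernel node by name,
# so that the job spans both kernel versions:
#PBS -l nodes=rich133-k43-34-l:ppn=4+rich133-k43-35-l:ppn=4
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
# Record which kernel each node is actually running:
mpirun -hostfile $PBS_NODEFILE uname -r
# Then run a short instance of your own MPI code (placeholder name;
# hostfile flags vary by MPI stack):
mpirun -hostfile $PBS_NODEFILE ./my_mpi_test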

As always, please keep your testflight runs short to allow other users to test their own codes. Please report any problems to pace-support@oit.gatech.edu and we will be happy to help. Hopefully, this deployment will be completely transparent to most users, if not all.

UPS Power System Repair

Posted on Wednesday, 1 February, 2017

PACE and other systems in the Rich 133 computer room experienced a brief power event on the afternoon of Monday, January 30th. This event involved a significant failure of one of the three uninterruptible power supply (UPS) systems that supply the Rich computer room with stable, filtered power. The UPS switched over to bypass mode as designed, and one of the main power feeder transfer switches also failed. Stable power continued to the PACE systems, and all systems and network devices continued to operate without interruption.

Repair of the failed UPS is underway, but parts may not be available for up to two weeks. During this time, the UPS power system will remain in bypass mode, connecting many of the PACE systems to standard campus power. Our experience shows that campus power is usually clean enough for normal operation, so we continue to operate normally. Repair and re-testing of the UPS can take place without interrupting the existing power. We will announce this repair transition when we have additional information.

Should there be any significant campus power interruption during this interim period, we may lose power to some of the PACE systems. Rest assured that the PACE staff will do our best to recover all systems affected by such an event. We will keep you informed of the repair progress via the blog and announcement mailing lists.

PACE clusters ready for research

Posted on Friday, 20 January, 2017

Our January 2017 maintenance period is now complete, far ahead of schedule.  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and data are available.  Our next maintenance period is scheduled for Thursday, May 11 through Saturday, May 13, 2017.

Removal of obsolete /usr/local/packages

  • This set of old software has been made inaccessible. Loading the ‘oldrepo’ module now results in an error, with instructions to contact PACE support for help moving to the current software repository (see the example below).
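For example, a session would now look roughly like this (hypothetical; the exact wording of the message may vary):

module load oldrepo   # now fails with an error directing you to pace-support@oit.gatech.edu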

Infiniband switch swap

  • Replacement complete, no user action needed.

Readdressing network management

  • Work complete, no user action needed.

Upgrade of scheduler server for the NovaZohar cluster

  • Upgrade complete, no user action needed.  Further detail has been provided to the users of this cluster.


resolved: storage problems this morning

Posted on Monday, 1 August, 2016

We appear to be back up at this point.  The root cause seems to have been a problem with the subnet manager that controls the Infiniband network.  Since GPFS uses this network, the issue initially manifested as a storage problem.  However, many MPI codes use this network as well and may have crashed.

Again, we apologize for the inconvenience.  Please do check on your jobs if you use MPI.

storage problems this morning

Posted on Monday, 1 August, 2016

Happy Monday!

Since about 2:30am this morning, we have been experiencing a GPFS problem and, while all data is safe, all GPFS services are currently unavailable.  This includes the scratch space and project directory (~/data) filesystems for many users.  We are working to restore service as quickly as possible and apologize for the inconvenience.

Some PACE accounts temporarily disabled

Posted on Friday, 22 July, 2016

Some of our users will not be able to log in to PACE clusters after the maintenance day due to the ongoing scratch data migration. Here are the details:

One of the tasks performed during this maintenance day (http://blog.pace.gatech.edu/?p=5943) was to deploy a new fileserver to serve the “scratch” storage.

The new system is now operational; however, the scratch data for some accounts is still being migrated. We have temporarily deactivated these accounts to prevent modifications to incomplete data sets, and the scheduler is holding jobs submitted by these users. Once the transfer is complete for a user, we will enable the account, release the user’s jobs, and send a notification email. We currently have no way to estimate how long the migration will take, but we are doing all we can to make the process go as quickly as possible.

If you are one of these users and need any of your data urgently, please contact us at pace-support@oit.gatech.edu. All of your data are intact and accessible to us, and we will try to find a way to deliver the data you need for your research.

Head node availability

Posted on Friday, 15 July, 2016

UPDATE 10:30am

All head nodes and support nodes in the VM farm are online.

Initial Post – 09:15am

Early this morning (2016/07/15, approximately 2:00am), we had a critical storage failure that caused our VM farm to declare all running head nodes invalid.  We are looking into this seriously, as this is one of those “not supposed to happen” moments.  In the meantime, the PACE team is working on getting these nodes back up and running for all users.

New PACE website launched

Posted on Wednesday, 4 May, 2016

Welcome to our updated website! We’ve transitioned all of our content to a new website, available at pace.gatech.edu. Please be sure to check out the updated user support section, available via the front page link ‘Current User Support’. While we aim to keep content as up to date as possible, if you notice anything that seems outdated, please let us know.

If you miss our old website or need content that isn’t present on our new website, please let us know – it’s temporarily available at prev.pace.gatech.edu.

As always, thanks for choosing PACE.

PACE quarterly maintenance – April ’16

Posted on Wednesday, 13 April, 2016

Greetings!

The PACE team is once again preparing for maintenance activities that will occur starting at 6:00am Tuesday, April 19 and continuing through Wednesday, April 20.  We are planning several improvements that we hope will provide a much better PACE experience.

GPFS storage improvements

Removal of all /nv/gpfs-gateway-* mount points (user action recommended): In the past, we noticed performance and reliability problems when mounting GPFS natively on machines with slow network connections (including most headnodes, some compute nodes, and some system servers). To address this, we deployed a physical ‘gateway’ machine that mounts GPFS natively and serves its content via NFS to machines with slow connections (see http://blog.pace.gatech.edu/?p=5842).

We have been mounting this gateway on *all* of the machines using these locations:

/nv/gpfs-gateway-pace1
/nv/gpfs-gateway-scratch1
/nv/gpfs-gateway-menon1

Unfortunately, these mount points caused problems in the longer run, especially because a system variable (PBS_O_WORKDIR) was assigned these locations as the “working directory” for jobs, even on machines with fast network connections. As a result, a large fraction of data operations went through the gateway server instead of the GPFS server, causing significant slowness.

We partially addressed this problem by fixing the root cause of the unintended PBS_O_WORKDIR assignment, and also through user communication and education.
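For context, PBS_O_WORKDIR shows up in the common Torque job-script idiom below; whenever it resolved to a gateway path, all of the job’s file I/O was routed through the gateway (the application name is a placeholder):

# Typical pattern: run the job from the directory it was submitted from.
# If $PBS_O_WORKDIR pointed at /nv/gpfs-gateway-*, every read and write
# below went through the NFS gateway instead of native GPFS.
cd $PBS_O_WORKDIR
./my_solver input.dat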

On this maintenance day, we are removing these mount points completely. Instead, GPFS will always be mounted at:

/gpfs/pace1
/gpfs/scratch1
/gpfs/menon1

These paths apply regardless of how a particular node mounts GPFS (natively or via the gateways).

User action: We ask all of our users to please check your scripts to ensure that the old locations are not being used. Jobs that try to use these locations will fail after the maintenance day (including jobs that have already been submitted).
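One quick way to audit your scripts, assuming they live under your home directory, is a recursive search for the retired prefix:

# Find any scripts still referencing the retired gateway mount points:
grep -rn '/nv/gpfs-gateway-' ~
# Update each match to the corresponding new path, for example:
#   /nv/gpfs-gateway-scratch1/...  ->  /gpfs/scratch1/...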

A new GPFS gateway (no user action required): We increasingly rely on the GPFS filesystem for multiple storage needs, including scratch, the majority of project directories, and some home directories.  While the gateway provided some benefits, some users continued to report unresponsive or slow commands on headnodes due to a combination of high activity levels and limited NFS performance.
On this maintenance day, we plan to deploy a second gateway server to separate headnodes from other functions (compute nodes and backup processes). This will improve the responsiveness of headnodes, giving our users better interactivity; in other words, you should see much less slowness when running system commands such as “ls”.

GPFS server and client tuning (no user action required): We identified several configuration tuning parameters to improve the performance and reliability of GPFS, based on vendor recommendations and our own analysis. We plan to apply these configuration changes on this maintenance day as a fine-tuning step.

Decommissioning old Panasas scratch (no user action required)

When we switched to the new scratch space (GPFS) during the January maintenance, we kept the old (Panasas) system accessible as read-only. Some users received a link to their old data if their migration had not completed within the maintenance window. We are finally ready to pull the plug on this Panasas system. You should have no dependencies on it anymore, but please contact PACE support as soon as possible if you have any concerns or questions regarding its decommissioning.

Enabling debug mode (limited user visibility)

RHEL6, which has been used on all PACE systems for a long while, optionally comes with an implementation of the memory-allocation functions that performs additional heap error/consistency checks at runtime. We have had this functionality installed, but memory errors have been silently ignored per our configuration, which is not ideal. We plan to change the configuration to print diagnostics on stderr when an error is detected. Please note that you should not see any differences in the way your codes run; this only changes how memory errors are reported.  This behavior is controlled by the MALLOC_CHECK_ environment variable. A simple example is when a dynamically allocated array is freed twice (e.g. by calling ‘free’ twice in C). Here is how the same double free behaves under three different values of MALLOC_CHECK_:

MALLOC_CHECK_=0: (no output)
MALLOC_CHECK_=1: *** glibc detected *** ./malloc_check: free(): invalid pointer: 0x0000000000601010 ***
MALLOC_CHECK_=2: Aborted (core dumped)

We currently have this value set to “0” and will make “1” the new default, so that a description of the error is printed. If this change causes any problems for you, or you simply do not want any changes in your environment, you can set this variable back to “0” in your “~/.bashrc” to override the new default.
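For reference, here is a minimal reconstruction of a double-free demo in the spirit of the ./malloc_check example above; the original source is not shown in this post, so this is a sketch:

# Build a small double-free test, then run it under each MALLOC_CHECK_ setting.
cat > malloc_check.c <<'EOF'
#include <stdlib.h>

int main(void) {
    char *p = malloc(16);
    free(p);
    free(p);   /* double free: glibc's response depends on MALLOC_CHECK_ */
    return 0;
}
EOF
gcc -o malloc_check malloc_check.c

MALLOC_CHECK_=0 ./malloc_check   # errors silently ignored (current default)
MALLOC_CHECK_=1 ./malloc_check   # diagnostic printed to stderr (new default)
MALLOC_CHECK_=2 ./malloc_check   # glibc aborts the process
# To keep the old behavior, add to your ~/.bashrc:  export MALLOC_CHECK_=0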

Removal of compatibility links for migrated storage (some user action may be required)

We previously migrated some of the NFS project storage (namely pcee1 and pme[1-8]) to GPFS. When we did, we placed links in the old storage locations (starting with /nv/...) pointing to the new GPFS locations (starting with /gpfs/pace1/project/...) to protect active jobs from crashing. This was only a temporary measure to facilitate the transition.

As part of this maintenance day, we plan to remove these links completely. We have already contacted all users whose projects are in these locations and confirmed that their ~/data links were updated accordingly, so we expect no user impact. That said, if you are one of these users, please make sure that none of your scripts reference the old locations mentioned in our email; a quick check is sketched below.
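If you are unsure, a check such as the following shows where your ~/data link actually resolves; it should point under /gpfs/pace1/project/... rather than an old /nv/... location:

ls -ld ~/data        # show the symbolic link itself
readlink -f ~/data   # show the fully resolved target path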

Scheduler updates (no user action required)

A patched version of the resource manager (Torque) was deployed on the scheduler servers shortly after the January maintenance day. This patch addresses a bug in the administration functions only. While it is not critical for compute nodes, we will update all compute nodes to bring their version to parity with the scheduler for consistency. This update will not cause any visible differences for users.

Networking Improvements (no user action required)

Spring is here, and it’s time for some cleanup. We will remove unused cables from the datacenter and unused switches from the racks. We are also planning some recabling to take better advantage of existing switches and improve redundancy. We will continue to test and enable jumbo frames (where possible) to lower networking overhead. None of these tasks require user action.
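For readers curious what enabling jumbo frames involves (this is an administrative change; no user steps are needed), it amounts to raising the interface MTU from the standard 1500 bytes to 9000, roughly as follows (interface name is an example):

# Inspect the current MTU of an interface:
ip link show eth0 | grep -o 'mtu [0-9]*'
# Raise the MTU to 9000 to enable jumbo frames (requires root privileges
# and end-to-end support on the switches along the path):
sudo ip link set dev eth0 mtu 9000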

Diskless node transition (no user action required)

We will continue the transition away from diskless nodes that we started in October 2015.  This mainly affects nodes in the 5 to 6 year old range.  Apart from more predictable performance on these nodes, this should be a transparent change.

Security updates (no user action required)

We also plan to update some system packages and libraries to address known security vulnerabilities and bugs. There should be no user impact.