Author Archive

localized network outage has some nodes offline

Posted by on Thursday, 25 August, 2016

At approximately 10:40 this morning, a top-of-rack network switch in the P31 rack of our data center failed. This caused a loss of network connectivity for approximately 44 compute nodes across a wide variety of queues (see below). No other compute nodes are affected. Jobs running on these nodes will likely have failed as a result. The OIT network team is swapping in a replacement at the moment, and PACE staff are working to restore service as quickly as possible.

If you have access to any of the queues below, please check on the status of your jobs and resubmit as needed. You can check which queues you have access to by using the ‘pace-whoami’ command.

We apologize for the inconvenience, and will work to bring these nodes back online as soon as possible.  If you have additional questions, please email pace-support@oit.gatech.edu.

aces
athena-intel
biocluster-6
bioforce-6
blue
chow
cochlea
dimer-6
dimerforce-6
granulous
hygene-6
hygeneforce-6
iw-shared-6
joe-6-intel
math-6
mathforce-6
orbit
prometforce-6
prometheus
sonar-6
sonarforce-6
starscream

resolved: storage problems this morning

Posted by on Monday, 1 August, 2016

We look to be back up at this point.  The root cause seems to have been a problem with the subnet manager that controls the InfiniBand network.  Since GPFS uses this network, the issue initially manifested as a storage problem.  However, many MPI codes use this network as well and may have crashed.

Again, we apologize for the inconvenience.  Please do check on your jobs if you use MPI.

storage problems this morning

Posted by on Monday, 1 August, 2016

Happy Monday!

Since about 2:30 this morning, we have been experiencing a GPFS problem. While all data is safe, all GPFS services are currently unavailable.  This includes the scratch space and project directory (~/data) filesystems for many users.  We are working on restoring service as quickly as possible and apologize for the inconvenience.

PACE clusters ready for research

Posted by on Friday, 22 July, 2016

Our July 2016 maintenance is now substantially complete.  Again, we sincerely apologize for the unfortunate additional unplanned downtime.

As previously communicated, we’ve had an unexpected delay caused by the data migrations from the old scratch system to the newly acquired system. Some of these transfers are still in progress, with a limited number of users still remaining.  We have temporarily disabled access for these users to prevent jobs from running on incomplete scratch data. We are reaching out to the affected users individually with more details. These users will not be able to log in, and their previously submitted jobs will not run until their scratch migration is complete. If you have not received a further notification from us and experience problems with logins or anything else, please do let us know as soon as possible by sending an email to pace-support@oit.gatech.edu.

Scratch performance may be reduced as these migrations complete, and we are doing everything we can to finish these migrations as soon as possible.

We have brought compute nodes online and released previously submitted jobs. As usual, we have a number of compute nodes that still need to be brought back online, but we are actively working to make them available as soon as possible.

DDN/GPFS work

The new DDN SFA-7700 system is now operational and serving scratch storage for all users. We updated client software versions on all nodes.  We have encountered an anomaly that reduces its internal redundancy but does not affect normal operation.  We expect to be able to rectify this while in production.

Electrical work

Tasks complete as described

Bonus objectives

Network and local storage upgrades were implemented on schedulers as planned.  Additional diskless nodes were converted to diskful as planned.

EXTENDED: PACE quarterly maintenance – July ’16

Posted by on Thursday, 21 July, 2016

Dear PACE users,

Despite our best efforts, the data copies for the PACE scratch space have not gone as quickly as we had projected. We have also encountered an anomaly with the new storage system, though we expect to be able to rectify this while in production. At this writing, we have many of the compute nodes online but cannot start jobs until the data copy is complete.

As with all our maintenance periods, there is always a remote possibility we will run over our estimated time. This is one of those times. Please accept our apology for this unavoidable delay. We can assure you all the data is intact and we are continuing to work to optimize the transfers to achieve a speedy return to service.

Our staff will continue to update you with our progress.

Regards,
The PACE team

UNDERWAY: PACE quarterly maintenance – July ’16

Posted by on Tuesday, 19 July, 2016

Quarterly maintenance is now underway.  All clusters managed by PACE, including Tardis, are now offline.  Please see our previous post for details.

PACE quarterly maintenance – July ’16

Posted by on Monday, 18 July, 2016

Dear PACE users,

Quarterly maintenance is once again upon us.  Starting at 6:00am TOMORROW MORNING, all resources managed by PACE will be taken offline.  Maintenance will continue through Wednesday evening.  Our activities are adhering to our originally published two-day schedule.

As a heads up, please make note of our Fall maintenance, which is now scheduled to begin at 6:00am on Thursday, October 13 and continue through Saturday, October 15.  Please note that this is a three-day period, including weekend work.  Further details to come as we get closer to October.

As previously communicated, our original plan to update various system software components in July has been deferred to a future maintenance period.  We will be in touch in advance of the October maintenance with details on this, including where you can test your codes against the updated software (highly recommended!).

Our major activity this time around will be updates to our GPFS filesystems and DDN storage devices.

DDN/GPFS work

  • We have acquired a new DDN SFA-7700, to which we will transition the scratch space.  This will provide more consistent scratch performance, a path for future capacity and performance increases, and performance as good as or better than what we have now.  Initially, the SFA-7700 will provide approximately 375TB of space.  We will be increasing this to the 500TB we have currently as soon as additional disks can be procured.  No user action will be required.  We currently have approximately 220TB in use on scratch, so we do not expect this temporary decrease in available capacity to be an inconvenience.
  • We have DDN engineers engaged to update firmware and software on our current SFA-12k.  This will provide additional management and quality-of-service features, as well as the ability to transition to larger capacity drives.  Additionally, we will reallocate the drives previously used for the scratch space to provide additional project space capacity and metadata performance.  No user action will be required.
  • To support the two above updates, we will also be upgrading the version of the GPFS client software (where installed) from version 3.5 to version 4.2.  No user action will be required.
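
The capacity reasoning in the first item above can be checked with a little arithmetic. The following is a minimal sketch in Python using only the approximate figures quoted in this post (375TB initial, 500TB eventual, 220TB in use); the function name `scratch_headroom` is made up for illustration and is not a PACE tool.

```python
def scratch_headroom(capacity_tb, used_tb):
    """Return (free space in TB, fraction of capacity in use)."""
    free_tb = capacity_tb - used_tb
    utilization = used_tb / capacity_tb
    return free_tb, utilization


# Approximate figures quoted in the announcement.
USED_TB = 220
for label, capacity_tb in (("initial SFA-7700", 375), ("after expansion", 500)):
    free_tb, utilization = scratch_headroom(capacity_tb, USED_TB)
    print(f"{label}: {free_tb} TB free, {utilization:.1%} in use")
```

With these numbers, the initial 375TB configuration would still leave roughly 155TB free, which is consistent with the expectation that the temporary reduction should not be an inconvenience.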

Electrical work

  • Facilities electricians will be performing some electrical work in the data center that will require the power to be temporarily removed from a number of our racks.  This work is to support some newly purchased equipment.  No user action will be required.

Bonus objectives

  • Additionally, as time permits, we will upgrade the network on some of our schedulers to 10-gigabit, and add additional local logging storage.  This will not affect the Gryphon, NovaZohar or Tardis clusters.  No user action will be required.
  • Also as time permits, we will continue the transition away from diskless nodes.  This mainly affects nodes that are roughly 5-6 years old.  No user action will be required.

PACE clusters ready for research

Posted by on Thursday, 21 April, 2016

Our April maintenance window is now complete.  As usual, we have a number of compute nodes that still need to be brought back online; however, we are substantially online and processing jobs at this point.

We did run into an unanticipated maintenance item with the GPFS storage – no data has been lost.  As we’ve added disks to the DDN storage system, we’ve neglected to perform a required rebalancing operation to spread load amongst all the disks.  The rebalancing operation has been running over the majority of our maintenance window, but the task is large and progress has been much slower than expected.  We will continue to perform the rebalancing during off-peak times in order to mitigate the impact on storage performance as best we are able.

Removal of /nv/gpfs-gateway-* mount points

Task complete as described.  The system should no longer generate these paths.  If you have used these paths explicitly, your jobs will likely fail.  Please continue to use paths relative to your home directory for future compatibility.  (e.g. ~/data, ~/scratch, etc.)
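
If you are unsure whether any of your job scripts still hard-code the retired paths, a quick scan can help. The following is a minimal sketch in Python, not a PACE-provided tool: the function names and the `*.pbs` file pattern are assumptions for illustration, and the only prefixes it looks for are the ones named in this post (/nv/gpfs-gateway-* and /panfs).

```python
import re
from pathlib import Path

# Path prefixes this post describes as removed or no longer working.
DEPRECATED_PATTERN = re.compile(r"/nv/gpfs-gateway-[^\s'\"]*|/panfs[^\s'\"]*")


def find_deprecated_paths(text):
    """Return (line_number, matched_path) pairs for hard-coded deprecated paths."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in DEPRECATED_PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_directory(root, pattern="*.pbs"):
    """Scan files under root; the '*.pbs' glob is an assumed naming convention."""
    results = {}
    for script in Path(root).expanduser().rglob(pattern):
        hits = find_deprecated_paths(script.read_text(errors="replace"))
        if hits:
            results[str(script)] = hits
    return results
```

For example, `scan_directory("~/jobs")` (a hypothetical directory) would return a dictionary mapping each matching script to the line numbers and paths that need to be rewritten in terms of ~/data or ~/scratch.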

New GPFS gateway

Task complete as described

GPFS server and client tuning

Task complete as described

Decommission old Panasas scratch

Task complete as described.  Paths starting with /panfs no longer work.  Everybody should have been transitioned to the new scratch long ago, so we do not expect anybody to have issues here.

Enabling debug mode

Task complete as described.  You may see additional warning messages if your code is not well behaved with regard to memory utilization.  This is a hint that you may have a bug.

Removal of compatibility links for migrated storage 

Task complete as described.  Affected users (Prometheus and CEE clusters) were contacted before maintenance day.  No user impact is expected, but please send in a ticket if you think there is a problem.

Scheduler updates

Task complete as described

Networking Improvements

Task complete as described

Diskless node transition

Task complete as described

Security updates

Task complete as described

UNDERWAY: PACE quarterly maintenance – April ’16

Posted by on Tuesday, 19 April, 2016

Quarterly maintenance is now underway.  All clusters managed by PACE, including Tardis, are now offline.  Please see our previous post for details.

PACE clusters ready for research

Posted by on Thursday, 28 January, 2016

Our January maintenance window is now complete.  As usual, we have a number of compute nodes that still need to be brought back online; however, we are substantially online and processing jobs at this point.

Transition to new scratch storage
Of approximately 1,700 PACE users, fewer than 35 could not be migrated.  All users should have received an email as to their status.  Additionally, those users who were not migrated will have support tickets created on their behalf so we can track their migrations through completion.  We expect about 25 of those 35 users to complete within the next 72 hours.  The remaining 10 have data in excess of the allowable quota and will be handled on a case-by-case basis.

Scheduler update
The new schedulers are in place and processing jobs.

Server networking
Task is complete as described.

GPFS tuning
Task is complete as described.

Filesystem migration – /nv/pk1
Task is complete as described.

Read-Only /usr/local
Task is complete as described.

Diskless node transition
We upgraded approximately 65 diskless nodes with local operating system storage.