PACE A Partnership for an Advanced Computing Environment

May 24, 2018

Major Outage of GT network on Sunday, May 27

Filed under: Uncategorized — Semir Sarajlic @ 9:54 pm

The OIT Operations team informed us about a planned service outage on Sunday (5/27, 8am). Their detailed note is copied below.

This outage should not impact running jobs; however, you will not be able to log in to headnodes or use the VPN for the duration of the outage.

If you have ongoing data transfers (using SFTP, scp, rsync), they *will* be terminated. We strongly recommend waiting until the successful completion of this work before starting any large data transfers. Similarly, your active SSH connections will be interrupted, so please save your work and exit all sessions as soon as you can.
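If a large transfer truly cannot wait, one way to ride out an interruption is to wrap rsync in a retry loop and rely on its --partial option, which keeps partially transferred files so a retry resumes rather than restarting from scratch. The following is a minimal sketch, not a PACE-provided tool; the source and destination paths are placeholders:

    import subprocess
    import time

    # Placeholder paths; substitute your own source and destination.
    SRC = "/path/to/local/data/"
    DST = "username@headnode.example.edu:/path/to/remote/data/"

    def transfer_with_retries(max_attempts=10, wait_seconds=300):
        """Run rsync, retrying after interruptions.

        --partial keeps partially transferred files so a retry resumes
        instead of starting over; --archive preserves permissions and
        timestamps.
        """
        for attempt in range(1, max_attempts + 1):
            result = subprocess.run(
                ["rsync", "--archive", "--partial", "--compress", SRC, DST])
            if result.returncode == 0:
                print("Transfer complete.")
                return
            print("Attempt %d failed (rsync exit %d); retrying in %d s."
                  % (attempt, result.returncode, wait_seconds))
            time.sleep(wait_seconds)
        raise RuntimeError("Transfer did not complete after all retries.")

    if __name__ == "__main__":
        transfer_with_retries()

Even so, the safest option during this window remains simply waiting until the work is complete.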

The PACE team will stay in contact with the Operations team and will provide status updates in this blog post as needed: https://blog.pace.gatech.edu/?p=6259

More details:

There will be a major service disruption to Georgia Tech’s network due to a software upgrade to a core campus router, beginning on Sunday, May 27 at 8:00 a.m. Network traffic from on campus to off campus, and from off campus to on, will be affected. Some inter-campus traffic will remain up during the work, but most services will not be available.

While the software upgrade is expected to be complete by 9:00 a.m., with most connectivity restored, there may be outages with various centrally provided services. Therefore, a maintenance window is reserved from 8:00 a.m. until 6:00 p.m. The following services may be affected and therefore not available, including, but not limited to: CAS (login.gatech.edu), VPN, LAWN (GTwifi, eduroam, GTvisitor), Banner/Oscar, Touchnet/Epay, Buzzport, Email (delayed delivery of e-mail, but no e-mail lost), Passport, Canvas, Resnet network connectivity, Vlab, T-Square, DegreeWorks, and others.

Before services go down, questions can be sent to support@oit.gatech.edu or asked by phone at 404-894-7173. During the work, please visit status.gatech.edu for updates; our normal status update site, status.oit.gatech.edu, will not be available during this upgrade. After the work is completed, please report issues to the aforementioned e-mail address and phone number, or call OIT Operations at 404-894-4669 for urgent matters.

The maintenance consists of a software upgrade to a core campus router that came at the recommendation of the vendor, following an unexpected error condition that caused a brief network outage earlier this week. “We expect the network connectivity to be restored by noon, and functionality of affected campus services to be recovered by 6:00 p.m. on Sunday, May 27, though many services may become available sooner,” says Andrew Dietz, ITSM Manager, Sr., Office of Information Technology (OIT).

We apologize for the inconvenience this may cause and appreciate your understanding while we conduct this very important upgrade.

 

May 18, 2018

Storage (GPFS) slowness impacting pace1 and menon1 systems

Filed under: Uncategorized — Semir Sarajlic @ 7:30 pm

update (5/18/2018, 4:15pm): We’ve identified a large number of jobs overloading the storage and worked with their owners to delete them. This resulted in an immediate improvement in performance. Please let us know if you observe any of the slowness coming back over the weekend.

original post: PACE is aware of GPFS (storage) slowness impacting a large fraction of users on the pace1 and menon1 systems. We are actively working, with guidance from the vendor, to identify the root cause and resolve this issue as soon as possible.

This slowness is observed from all nodes mounting this storage, including headnodes, compute nodes and the datamover.

We believe that we’ve found the culprit, but more investigation is needed for verification. Please continue to report any slowness problems to us.
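If you would like to attach a rough measurement to a slowness report, timing a small write from the node where you see the problem gives us a useful data point. This is a minimal sketch, not an official diagnostic; the directory path is a placeholder for your own GPFS-mounted space:

    import os
    import time

    # Placeholder; point this at your own GPFS-mounted directory.
    TEST_DIR = "/gpfs/pace1/project/your-project"
    TEST_FILE = os.path.join(TEST_DIR, "io_probe.tmp")

    def time_small_write(size_bytes=1024 * 1024):
        """Time a 1 MiB write, flushed to the filesystem."""
        data = b"x" * size_bytes
        start = time.time()
        with open(TEST_FILE, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.time() - start
        os.remove(TEST_FILE)
        return elapsed

    if __name__ == "__main__":
        print("1 MiB write took %.2f s" % time_small_write())

Including the node name, the path tested, and a timing like this in your report helps us correlate your experience with what we see on the storage servers.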

May 11, 2018

PACE clusters ready for research

Filed under: Uncategorized — Semir Sarajlic @ 9:43 pm

Our May 2018 maintenance (https://blog.pace.gatech.edu/?p=6158) is complete ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, Aug 9 through Saturday, Aug 11, 2018.

Schedulers

Job-specific temporary directories (may require user action): Complete as planned. Please see the maintenance day announcement (https://blog.pace.gatech.edu/?p=6158) for how this impacts your jobs; a brief usage sketch follows this list.

ICE (instructional cluster) scheduler migration to a different server (may require user action): Complete as planned. Users should not notice any differences.
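As a brief illustration of the job-specific temporary directories item above: a job can keep its scratch files under the per-job path and copy back only what should persist. This sketch assumes the scheduler exports the per-job scratch path in the TMPDIR environment variable (see the maintenance announcement for the authoritative behavior); the file and directory names are placeholders:

    import os
    import shutil

    # Assumption: the scheduler exports the per-job scratch path as
    # TMPDIR; fall back to /tmp for interactive testing.
    scratch = os.environ.get("TMPDIR", "/tmp")
    work_file = os.path.join(scratch, "intermediate.dat")  # placeholder

    # Write intermediate results to job-local scratch space...
    with open(work_file, "w") as f:
        f.write("intermediate results\n")

    # ...then copy only what should persist back to project storage.
    dest_dir = os.path.expanduser("~/results")  # placeholder
    os.makedirs(dest_dir, exist_ok=True)
    shutil.copy(work_file, os.path.join(dest_dir, "intermediate.dat"))

Because job-specific directories are typically cleaned up when the job ends, anything left in scratch after the copy does not need manual removal.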

Systems Maintenance

ASDL cluster (requires no user action): Complete as planned. Bad CMOS batteries were replaced, and the fileserver received a replacement CPU. The memory problems turned out to be caused by the bad CPU and were resolved without changing any memory DIMMs.

Replace PDUs on Rich133 H37 Rack (requires no user action): Deferred at the request of the cluster owner.

LIGO cluster rack replacement (requires no user action): Complete as planned.

Storage

GPFS filesystem client updates on all of the PACE compute nodes and servers (requires no user action): Complete as planned and tested. Please report any missing storage mounts to pace-support; a quick check is sketched after this list.

Run routine system checks on GPFS filesystems (requires no user action): Complete as planned, no problems found!
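As the quick check referenced above, you can confirm from a headnode or compute node that the storage mounts you expect are actually present. This is a minimal sketch; the mount points listed are placeholders for the ones your group uses:

    import os

    # Placeholder mount points; substitute the paths your group uses.
    EXPECTED_MOUNTS = ["/gpfs/pace1", "/gpfs/pace2"]

    missing = [m for m in EXPECTED_MOUNTS if not os.path.ismount(m)]
    if missing:
        print("Missing storage mounts; please report to pace-support:")
        for m in missing:
            print("  " + m)
    else:
        print("All expected storage mounts are present.")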

Network

The IB network card firmware upgrades (requires no user action): Complete as planned.

Enable 10GbE on physical headnodes (requires no user action): Complete as planned.

Several improvements on networking infrastructure (requires no user action): Complete as planned.

 

May 3, 2018

[Resolved] Large Scale Storage Problems

Filed under: Uncategorized — Semir Sarajlic @ 7:46 pm

Current Status (5/3 4:30pm): Storage problems are resolved; all compute nodes are back online and accepting jobs. Please resubmit crashed jobs and contact pace-support@oit.gatech.edu if there is anything we can assist with.

update (5/3 4:15pm): We found that the storage failure was caused by a series of tasks we had been performing, with guidance from the vendor, in preparation for the maintenance day. These steps were considered safe, and no failures were expected. We are still investigating which step(s) led to this cascading failure.

update (5/3 4:00pm): All of the compute nodes will appear offline and will not accept jobs until this issue is resolved.

 

Original Message:

We received reports of failures of the main PACE storage (GPFS) around 3:30pm today (5/3, Thu), impacting jobs. We found that this issue affects all GPFS systems (pace1, pace2, menon1), with a large-scale impact PACE-wide.

We are actively working with the vendor to resolve this issue urgently and will continue to update this post as we find more about the root cause.

We are sorry for this inconvenience and thank you for your patience.

 

 
