PACE A Partnership for an Advanced Computing Environment

July 31, 2020

[Mitigated] Globus Access Restored

Filed under: Uncategorized — Michael Weiner @ 11:53 pm

PACE’s globus-internal server, which hosts the PACE Internal endpoint, experienced an outage beginning earlier this afternoon. We have redirected traffic to an alternate interface, and access to PACE storage via Globus is restored.

The PACE Internal endpoint provides access to the main PACE system in Rich, including home, project, and scratch storage, in addition to serving as the interface to PACE Archive storage. Hive is accessed via a separate Globus endpoint and was not affected.

As a reminder, you can find instructions on how to use Globus for file transfer to/from PACE at http://docs.pace.gatech.edu/storage/globus/. Please contact us at pace-support@oit.gatech.edu with any questions.

 

July 29, 2020

VPN Upgrades

Filed under: Uncategorized — Michael Weiner @ 4:35 pm

We would like to inform you of several upcoming updates to Georgia Tech’s VPNs, which you use to connect to PACE from off-campus locations.

The GlobalProtect VPN client will be updated on August 4, 8-10 PM. This will improve support for macOS 10.15.4+ (removing the Legacy System Extension message) and address other bugs. There will be an automatic update, but you may choose to test it early, as described at faq.oit.gatech.edu/content/how-do-i-get-started-globalprotect-campus-vpn#labportal.

The AnyConnect VPN client will also be upgraded. As with previous upgrades, your client will automatically download the new version the first time you attempt to connect after the update. You may choose to upgrade early by connecting your client to dev.vpn.gatech.edu, then returning to the normal address once the update is installed. The PACE VPN (used for CUI/ITAR clusters only) will be upgraded on August 4, 8-10 PM. The anyc VPN (used for most PACE resources and the rest of the GT campus) will be upgraded on August 11, 8-10 PM.

Please visit status.gatech.edu for further details on all pending updates to Georgia Tech’s VPN service.

July 28, 2020

[UPDATE] shared-scheduler Degraded Performance

Filed under: Uncategorized — Aaron Jezghani @ 9:06 pm

7/31/2020 UPDATE

Dear Researchers,

In addition to the previously announced maintenance day activities, we will be migrating the Torque component of shared-sched to a dedicated server to address the recent performance issues. This move should improve the scheduler’s response time to client queries such as qstat, and decrease job submission and start times when compute resources are available. While you do not need to do anything to prepare for this migration, we advise that you make note of any jobs queued at the start of maintenance just in case. As always, please direct any questions or concerns to pace-support@oit.gatech.edu. We thank you for your patience.
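One way to make note of queued jobs is to save the scheduler's job listing before maintenance begins. The following is a minimal sketch in Python, not an official PACE tool. It assumes the default six-column layout of Torque's qstat output (Job ID, Name, User, Time Use, S, Queue) and job names without spaces; the function name, sample job IDs, username, and queue name are hypothetical.

```python
from typing import List

# Hypothetical sample in the assumed default qstat layout.
SAMPLE_QSTAT = """\
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1001.shared-sched         align_reads      gburdell3       00:12:41 R examplequeue
1002.shared-sched         fit_model        gburdell3              0 Q examplequeue
1003.shared-sched         other_job        someoneelse            0 Q examplequeue
"""


def queued_job_ids(qstat_output: str, user: str) -> List[str]:
    """Return the IDs of the given user's jobs whose state column is 'Q'."""
    job_ids = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows have exactly six whitespace-separated fields in the
        # assumed layout; the header row splits into more and is skipped.
        if len(fields) != 6:
            continue
        job_id, _name, owner, _time_used, state, _queue = fields
        # The dashed separator row also has six fields, but its owner
        # column never matches a real username.
        if owner == user and state == "Q":
            job_ids.append(job_id)
    return job_ids


print(queued_job_ids(SAMPLE_QSTAT, "gburdell3"))
```

In practice, the text passed to queued_job_ids would be the captured output of qstat on a login node rather than the sample string, and the resulting list could be written to a file for comparison after maintenance.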

The PACE Team

 

7/29/2020 UPDATE

Dear Researchers,

At this time the scheduler is functional, although some commands may be slow to respond. We will continue investigating to ascertain the source of these problems, and will update accordingly. Thank you.

[ORIGINAL MESSAGE]

We are aware of a significant slowdown in the performance of the shared-scheduler since last week. Initial attempts to resolve the issue towards the end of the week appeared successful, but the problems have restarted and we are continuing our investigation along with scheduler support. We appreciate your patience as we work to restore full functionality to shared-scheduler.

The PACE Team

July 24, 2020

[Resolved]: PACE Maintenance Days 8/6/2020-8/8/2020

Filed under: Maintenance — Semir Sarajlic @ 7:10 pm

Dear PACE Users,

RESOLVED: PACE is now ready for research.

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on August 6th, 2020 and conclude at 11:59 PM on August 8th, 2020. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
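To estimate whether a job is likely to be held, you can compare its requested walltime against the time remaining before maintenance begins. The sketch below is illustrative only: it assumes a job is held when its start time plus requested walltime would run past the 6:00 AM start on August 6th, and that walltime is given as HH:MM:SS. The function names are hypothetical, and the scheduler's actual hold logic may differ.

```python
from datetime import datetime, timedelta

# Start of the maintenance period, taken from the announcement above.
MAINTENANCE_START = datetime(2020, 8, 6, 6, 0)


def parse_walltime(walltime: str) -> timedelta:
    """Convert an HH:MM:SS walltime string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def finishes_before_maintenance(start: datetime, walltime: str) -> bool:
    """Return True if a job starting at `start` would end by MAINTENANCE_START."""
    return start + parse_walltime(walltime) <= MAINTENANCE_START


# A job starting at noon on August 5th with an 18-hour walltime
# would end exactly at 6:00 AM on August 6th.
print(finishes_before_maintenance(datetime(2020, 8, 5, 12, 0), "18:00:00"))
```

Under these assumptions, requesting a shorter walltime is what allows a job to start before the maintenance period rather than being held until it ends.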

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:

– None currently.

ITEMS NOT REQUIRING USER ACTION:

– [Resolved] Coda Lustre Upgrade. This will start on Wednesday (08/05) and will impact testflight-coda only. A scheduler reservation was put in place to prevent any jobs from running past 6:00 AM on Wednesday, August 5.

– [Resolved] Install additional line cards for CS8500 Infiniband switch.

– [Resolved] Deploy PBSTools RPM on schedulers.

– [Resolved] Upgrade Hive Infiniband switch firmware to version 3.9.0914.

– [Resolved] Upgrade Coda Infiniband director switch firmware to version 3.9.0914.

– [Resolved] Move DNS appliance from Rich to Coda.

– [Resolved] Update coda-apps file system mounts to use qtrees from NetApp on all servers.

– [Deferred] Update NVIDIA GPU drivers in Coda to support the CUDA 11 SDK.

– [Resolved] Reboot of all nodes.

– [Resolved] Rebooted the subnet manager.

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

The PACE Team

July 20, 2020

[Resolved] Georgia Power Micro Grid Testing (continued)

Filed under: Uncategorized — Michael Weiner @ 1:03 pm

[Update 7/22/20 1:00 PM]

Hive and testflight-coda systems were restored early this morning. Systems have returned to normal operation, and user jobs are running. If you were notified of a lost job, please resubmit it at this time.

Georgia Power does not plan to conduct any tests today. No additional information about the cause of yesterday’s outage is available at this time.

[Update 7/21/20 11:00 PM]

The power outage in Coda has been bypassed, and power is returning to the Coda research hall. However, because the cooling plant has been offline for so long, it will require about 2 hours to restart and stabilize before we can resume full operation. Due to the late hour, we will begin bringing systems back online in the morning and provide another update when we’re back to normal operation. Georgia Power will be researching the root cause of this outage in the morning, and we will share details if available.

[Update 7/21/20 3:15 PM]

Unfortunately, the planned testing of the Georgia Power Micro Grid this week has led to a loss of power in the Coda research hall, home to compute nodes for Hive & testflight-coda. Any running jobs on those clusters will have failed at this time. Access to login nodes and storage, housed in the Coda enterprise hall, is uninterrupted.

We are sorry for what we know is a significant interruption to your work.

We will follow up with users who had jobs running at the time of the power outage to provide more specific information.

At this time, teams are working to restore power to the system. We will provide an update when available.

 

[Update 7/14/20 4:00 PM]

Georgia Power will be conducting additional bypass tests for the MicroGrid power generation facility for the Coda datacenter (Hive & testflight-coda clusters) during the week of July 20-24. These tests represent a slightly higher risk of disruption than the tests conducted in June, but the risk has been substantially lowered by additional testing last month.

As before, we do not expect any disruption to PACE compute resources. PACE’s storage and head nodes have UPS and generator backup power, but compute nodes do not. In the event of an unexpected complication during testing, compute nodes could lose power for a brief period, disrupting running jobs. Georgia Power, DataBank, OIT’s network team, and PACE will all have staff on standby during these tests to ensure a quick repair in the event of an unexpected outage.

Please contact us at pace-support@oit.gatech.edu with any questions.

Visit https://blog.pace.gatech.edu/?p=6778 for full details on this power testing.

July 14, 2020

[Resolved] PACE License Server Outage

Filed under: Uncategorized — Michael Weiner @ 7:49 pm

The PACE license server experienced an outage earlier this afternoon, which has since been resolved.

The following software licenses were not available on PACE during the outage: Intel compiler, Gurobi, Allinea, PGI. If you experienced difficulty accessing these services earlier today, please retry your job at this time.

The outage did not affect the College of Engineering license server, which hosts campus-wide licenses for some licensed software widely used on PACE, including MATLAB, COMSOL, Abaqus, and Ansys.

Please contact us at pace-support@oit.gatech.edu with any questions.
