PACE A Partnership for an Advanced Computing Environment

February 28, 2019

[Resolved] Storage problem impacting applications and login

Filed under: Uncategorized — Semir Sarajlic @ 8:23 pm

At about 2:30pm, during a routine storage server procedure, a service failed to start properly and caused a storage problem. We resolved the issue within 15 minutes. The incident made some applications and home directories temporarily unavailable. Symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation once the issue was resolved, but we cannot be sure. Please check your jobs for any crashes and report any problems you notice to pace-support@oit.gatech.edu.

Thank you for your attention, and apologies for this inconvenience.

February 18, 2019

[Resolved] Expected Network Interruptions Due to Campus Network Maintenance – Intermittent delays or disruption to major campus IT services

Filed under: Uncategorized — Semir Sarajlic @ 9:16 pm

[Original Post – February 18, 2019] On Sunday, Feb. 24, OIT will perform a series of data center upgrades and migrations. This service window includes intermittent delays or disruption to major campus IT services between 7 a.m. and 8 p.m. as well as occasional interruptions in wireless connectivity between 9 a.m. and 12 p.m.

During this service window, the intermittent interruptions mean that users may at times be unable to connect to PACE-managed resources or may be disconnected from their sessions. This can interrupt interactive jobs that rely on an active SSH connection to a given cluster. However, these upgrades will not impact running or queued batch jobs. OIT expects all service upgrades and migrations to conclude by 8 p.m., after which PACE users can resume their work as usual.

For additional information and details on the services that OIT will be upgrading and migrating, please refer to the status page link at https://status.gatech.edu

February 16, 2019

PACE clusters ready for research

Filed under: Uncategorized — Semir Sarajlic @ 11:37 pm
Our February 2019 maintenance (https://blog.pace.gatech.edu/?p=6419) was completed on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, a small number of straggling nodes remain, and we will address them over the coming days.

Please let us know of any problems you notice: pace-support@oit.gatech.edu

Compute
* (COMPLETE) Vendor will replace defective components on groups of servers

Network
* (COMPLETE) Ethernet network reconfiguration

Storage
* (COMPLETE) GPFS / DDN enclosure reset
* (COMPLETE) NAS maintenance and reconfiguration

Other
* (COMPLETE) PACE VMWare reconfiguration to remove out of support hosts
* (COMPLETE) Migration of Megatron cluster to RHEL7

February 2, 2019

[Resolved] Scheduler problem on RHEL7 Dedicated Clusters

Filed under: Uncategorized — Semir Sarajlic @ 3:56 am

[Resolved – February 1, 21:35] At about 5:20pm on February 1, the scheduler for the new RHEL7 dedicated clusters went down after encountering a segmentation fault. We have resolved the incident and brought the scheduler back online. Based on our assessment, this incident impacted two jobs. We advise that you review your jobs from today. Additionally, users who attempted to submit jobs between 5:20pm and 9:35pm may have experienced scheduler communication errors when running commands such as qstat and qsub.

We will continue to monitor the scheduler and will post updates if needed. If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you for your attention, and apologies for this inconvenience.

 
