
Changes to mount points (no user impact expected)

Thursday, January 3, 2019

The investigation that followed the system failures that temporarily rendered the scientific repository unresponsive (http://blog.pace.gatech.edu/?p=6390) indicates that some additional maintenance is required. To facilitate this maintenance, we will change the mount point for /usr/local, which is network mounted and identical on all compute nodes.

Our tests indicate that this swap can be performed live, without impacting running jobs. It’s also completely transparent to users; you don’t need to do anything differently as a result.
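
If you’d like to confirm on a node that the swap was indeed transparent, one way is to check where /usr/local is mounted from before and after the change. A minimal sketch, assuming a Linux node with /proc/mounts (Python):

```python
# Minimal sketch: report where /usr/local is mounted from by reading
# /proc/mounts (Linux). Device and server names are site-specific.
def mount_source(target="/usr/local"):
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype = line.split()[:3]
            if mountpoint == target:
                return f"{device} ({fstype})"
    return None  # target is not a distinct mount point on this node

if __name__ == "__main__":
    print(mount_source() or "/usr/local is not a separate mount here")
```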

In the unlikely event of job crashes that you suspect are caused by this operation, please contact pace-support@oit.gatech.edu and we’ll be happy to assist.

Thank you,
PACE Team

[Resolved] Widespread problems impacting all PACE machines

Tuesday, December 11, 2018

Update (12/21, 10:15am): A correction: the problems started this morning around 8:15am, not yesterday evening as previously communicated. The systems were back online at 8:45am.

Update (12/21, 9:15am): Another incident started last night, causing the same symptoms (hanging and unavailability of the scientific repository). OIT storage engineers restored the services on the redundant system (high availability pair), and the storage is available again. We continue to investigate the root cause of the recurring failures experienced over the past several weeks.

Update (12/12, 6:30pm): The services have been successfully migrated to the high availability pair, and the filesystems are once again accessible. We’ll continue to monitor the systems and take a close look at the errant components. Some of these problems may still recur, but we’ll be ready to address them should they happen.

Update (12/12, 5:30pm): Unfortunately, the problems seem to be coming back. We continue to work on this. Thank you for your patience.

Update (12/12, 11:30am): We identified the root cause as a configuration conflict between two devices and resolved the problem. All systems are back online and available for jobs.

Update (12/12, 10:00am): Our battle with the storage system continues. This filesystem is designed as a high availability service with redundant components to prevent exactly this situation, but unfortunately the second system failed to take over successfully. We are investigating the possibility that the network is the culprit. We continue to work diligently to bring the systems back online ASAP.

Update (12/11, 9:00pm): Problems continue; we are working on them with support from related OIT units.

Update (12/11, 7:30pm): We mitigated the issue, but intermittent problems may recur until the root cause is addressed. We continue to work on it.

Original message:

Dear PACE Users,

At around 3:45pm on Dec 11, the fileserver that serves the shared “/usr/local” filesystem on all PACE machines started experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications from the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.
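
This is also why diagnosing a hung mount is awkward: any process that touches it, including a simple ls, can block indefinitely. One hedged way to probe such a mount without hanging the probing script itself is to perform the stat() in a child process with a timeout. A minimal sketch, not a PACE tool:

```python
# Minimal sketch: probe a possibly-hung network mount from a child
# process with a timeout, so a blocked stat() cannot hang the caller.
import multiprocessing
import os

def _stat_target(path):
    os.stat(path)  # may block indefinitely if the fileserver is down

def mount_responsive(path="/usr/local", timeout=5.0):
    probe = multiprocessing.Process(target=_stat_target, args=(path,))
    probe.start()
    probe.join(timeout)
    if probe.is_alive():
        # Still blocked in stat(): assume the mount is hung. Note that a
        # process stuck in uninterruptible I/O may ignore SIGTERM.
        probe.terminate()
        probe.join(1.0)
        return False
    return probe.exitcode == 0  # 0 means stat() succeeded

if __name__ == "__main__":
    print("responsive" if mount_responsive() else "unresponsive or missing")
```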

We are working to resolve these problems ASAP and will keep you updated on this post.

Brief Interruption to VPN During Urgent VPN Service Maintenance

Wednesday, November 28, 2018

On November 29, 2018, from 10:00pm to 11:00pm (EST), OIT will be conducting maintenance on our VPN service. During this period, users connected to our clusters via the VPN (anyc.vpn.gatech.edu) will be disconnected and will need to reconnect to the VPN and then to the cluster. This maintenance will not impact running batch jobs, but it may impact interactive jobs running during this period. For additional details on the maintenance, please visit: https://status.gatech.edu/incidents/9ljkjx72462x

Thank you for your attention to this urgent maintenance that OIT is conducting.

[Resolved] CoC-ICE Cluster: Multi-node job problem

Wednesday, November 21, 2018

[Update – November 26, 2018] We’ve identified the issue and resolved the configuration error. Users are now able to submit multi-node jobs on the CoC-ICE cluster.

[Original Post – November 21, 2018]

We are investigating an issue in which users experience hanging jobs when they submit multi-node jobs on the CoC-ICE cluster. This issue does not impact users submitting single-node jobs, nor does it impact the PACE-ICE cluster.

Thank you for your patience, and we apologize for this inconvenience while we resolve this issue.

[Resolved] ICE Clusters – Intermittent account problems

Thursday, November 8, 2018

We received multiple reports about jobs crashing after being allocated nodes on the instructional clusters (CoC-ICE and PACE-ICE). We’ve determined that intermittent account problems are the cause of this issue, and we are working towards a solution.

Thank you for your patience, and we apologize for the inconvenience.

[RESOLVED] Scratch storage problems

Wednesday, November 7, 2018

We received multiple reports of jobs crashing due to insufficient scratch storage, even though physical usage of the scratch system is only at 41%.

We’ve identified the cause: the disk pools were unable to migrate data to other pools internally because a threshold-driven migration process was not restarted after maintenance day. We have now initiated this process and are migrating data to the appropriate pools, which should resolve the job crashes caused by insufficient scratch storage.
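
As a rough sketch of what such a usage-threshold check looks like (the path and the 90% threshold below are hypothetical examples, not PACE’s actual settings):

```python
# Minimal sketch of a usage-threshold check; the path and the 90%
# threshold are hypothetical examples, not PACE's actual settings.
import shutil

def over_threshold(path="/scratch", threshold=0.90):
    usage = shutil.disk_usage(path)      # named tuple: total, used, free
    fraction = usage.used / usage.total
    return fraction, fraction >= threshold

if __name__ == "__main__":
    fraction, critical = over_threshold()
    note = " -- above threshold, migration needed" if critical else ""
    print(f"scratch usage: {fraction:.0%}{note}")
```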

We will continue to monitor the scratch storage to ensure its operation is optimal.  If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you for your patience, and our apologies for the inconvenience.

PACE clusters ready for research

Saturday, November 3, 2018

Our November 2018 maintenance (http://blog.pace.gatech.edu/?p=6360) is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days, including some nodes that need their PCIe connectors replaced as a preventive measure.

Completed Tasks

Compute

  • Complete – (no user action needed) Replace power components in a rack in Rich 133
  • Complete – (no user action needed) Replace defective PCIe connectors on multiple servers
      • As a precaution, additional identified nodes will have their PCIe connectors replaced when parts are delivered. There will be no user action needed.

Network

  • Complete – (no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • Complete – (no user action needed) Change uplink connections from management switches

Storage

  • Complete – (no user action needed) Verify integrity of GPFS file systems
  • Complete – (no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • Complete – (no user action needed) Upgrade firmware on TruNAS storage systems

Other

  • Complete – (some user action needed) Replaced the PACE ICE schedulers with a physical server, to increase capacity and reliability. Some jobs on the PACE ICE cluster need to be re-submitted, and we have contacted the affected users individually.

[COMPLETE] PACE Quarterly Maintenance – November 1-3, 2018

Monday, October 29, 2018

[Update – November 3, 2018, 4:45pm] 

Our November 2018 maintenance is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days, including some nodes that need their PCIe connectors replaced as a preventive measure.

Please note that some jobs on the PACE ICE cluster need to be re-submitted; we have contacted the affected users individually.

[Update – November 2, 2018] 

Verification of the integrity of the GPFS file system is taking longer than initially estimated. As a result, this maintenance window will last the full three days, as scheduled, which will allow us to complete the verification of the file system and ensure the integrity of the data.

[Original Post – October 29, 2018]

Our next PACE maintenance day will start November 1 and run through November 3 as scheduled.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still allow your job to complete successfully before 6am on November 1, you can reduce the requested walltime and resubmit.
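
For instance, to work out the largest walltime a resubmitted job can request and still finish before the window opens at 6am on November 1, subtract the current time from the maintenance start. A minimal sketch:

```python
# Minimal sketch: largest walltime (hh:mm:ss) that still finishes before
# the maintenance window opens at 6am on November 1, 2018.
from datetime import datetime

MAINTENANCE_START = datetime(2018, 11, 1, 6, 0)

def max_walltime(now=None):
    now = now or datetime.now()
    seconds = max(int((MAINTENANCE_START - now).total_seconds()), 0)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

if __name__ == "__main__":
    # A walltime request (e.g., a PBS '-l walltime=...' directive) larger
    # than this value risks being held until after maintenance.
    print("max walltime before maintenance:", max_walltime())
```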

Planned Tasks

Compute

  • (no user action needed) Replace power components in a rack in Rich 133
  • (no user action needed) Replace defective PCIe connectors on multiple servers

Network

  • (no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • (no user action needed) Change uplink connections from management switches

Storage

  • (no user action needed) Verify integrity of GPFS file systems
  • (no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • (no user action needed) Upgrade firmware on TruNAS storage systems

Other

  • (no user action needed) Replace PACE ICE schedulers with a physical server, to increase capacity and reliability

[Resolved] Issues with Ansys and Abaqus License Server

Friday, October 26, 2018

[Update – October 29, 2018] Abaqus and Ansys license servers are restored.

[Original Post – October 26, 2018] On Thursday, October 25, multiple virtual servers experienced problems due to data corruption from a storage issue. OIT’s storage team is working to correct the matter, and the operations team is rebuilding the affected machines so they can be restored. This service interruption has taken down the license servers for Ansys and Abaqus, which has impacted PACE users’ Ansys and Abaqus jobs. If you had Ansys or Abaqus jobs submitted during this period, please check them and resubmit once the license servers are back online.

The Ansys license server has already been brought back online.
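
If you’d like to check for yourself whether a license server is reachable again before resubmitting, a simple TCP probe of the license port is usually enough. A minimal sketch; the hostname and port below are placeholders, and the real values are typically found in your LM_LICENSE_FILE setting (in port@host form):

```python
# Minimal sketch: TCP probe of a FlexLM-style license server. The host
# and port are placeholders; check your LM_LICENSE_FILE (port@host).
import socket

def license_server_up(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print(license_server_up("licenses.example.gatech.edu", 27000))
```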

For additional information regarding this incident, please follow the status page link at https://status.gatech.edu

[Resolved] Temporary Network Interruption

Monday, October 15, 2018

We experienced a failure in the primary InfiniBand subnet manager that may have impacted both running and starting jobs. The malfunction happened in such a way that the backup subnet manager (SM) did not notice that the primary was failing to operate normally. We disabled the primary SM, and the secondary SM took over as designed. The service outage lasted from 12:56pm to 1:07pm today, October 15, 2018. PACE staff will continue to investigate this failure mode and adjust our procedures to help prevent it in the future. As this brief network interruption may have impacted running and starting jobs, please check your jobs for any crashes and report any problems you notice to pace-support@oit.gatech.edu.
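
If you would rather script this check than scan scheduler output by hand, one hedged approach is to parse the full job listing and flag completed jobs with a nonzero exit status. A minimal sketch, assuming a Torque/PBS-style qstat whose -f output includes Job_Owner, job_state, and exit_status fields (names vary by scheduler and version):

```python
# Minimal sketch: flag completed jobs with a nonzero exit status.
# Assumes a Torque/PBS-style `qstat -f` whose records are separated by
# blank lines; field names can vary by scheduler and version.
import getpass
import subprocess

def crashed_jobs(user):
    out = subprocess.run(["qstat", "-f"], capture_output=True,
                         text=True, check=True).stdout
    crashed = []
    for record in out.split("\n\n"):
        fields = dict(line.strip().split(" = ", 1)
                      for line in record.splitlines() if " = " in line)
        if (fields.get("Job_Owner", "").startswith(user + "@")
                and fields.get("job_state") == "C"
                and fields.get("exit_status", "0") != "0"):
            crashed.append(record.splitlines()[0])  # the "Job Id: ..." line
    return crashed

if __name__ == "__main__":
    for job in crashed_jobs(getpass.getuser()):
        print("possible crash:", job)
```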