
ICE Clusters – Intermittent account problems

Thursday, November 8, 2018

We have received multiple reports of jobs crashing after allocation on the instructional clusters (COC-ICE and PACE-ICE). We have determined that intermittent account problems are the cause, and we are working toward a solution.

Thank you for your patience, and we apologize for the inconvenience.

 

[RESOLVED] Scratch storage problems

Wednesday, November 7, 2018

We received multiple reports of jobs crashing due to insufficient scratch storage, even though physical usage of the scratch file system is only at 41%.

We have identified the cause: the disk pools were unable to migrate data internally to other pools because a threshold process was not restarted after maintenance day. We have now started this process and are migrating data to the appropriate pools, which should resolve the job crashes caused by insufficient scratch storage.

We will continue to monitor the scratch storage to ensure it is operating optimally. If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you for your patience, and apologies for the inconvenience.

PACE clusters ready for research

Saturday, November 3, 2018

Our November 2018 maintenance (http://blog.pace.gatech.edu/?p=6360) is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days, including nodes that will need their PCIe connectors replaced as a preventive measure.

Completed Tasks

Compute

  • Complete (no user action needed) Replace power components in a rack in Rich 133
  • Complete (no user action needed) Replace defective PCIe connectors on multiple servers
      • As a precaution, additional identified nodes will have their PCIe connectors replaced when parts are delivered. There will be no user action needed.

Network

  • Complete (no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • Complete (no user action needed) Change uplink connections from management switches

Storage

  • Complete (no user action needed) Verify integrity of GPFS file systems
  • Complete (no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • Complete (no user action needed) Upgrade firmware on TruNAS storage systems

Other

  • Complete (some user action needed) Replace PACE ICE schedulers with a physical server, to increase capacity and reliability. Some jobs on the PACE ICE cluster need to be re-submitted, and we have contacted the affected users individually.

[COMPLETE] PACE Quarterly Maintenance – November 1-3, 2018

Monday, October 29, 2018

[Update – November 3, 2018, 4:45pm] 

Our November 2018 maintenance is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days, including nodes that will need their PCIe connectors replaced as a preventive measure.

Please note that some jobs on the PACE ICE cluster need to be re-submitted; we have contacted the affected users individually.

[Update – November 2, 2018] 

Verification of the integrity of the GPFS file system is taking longer than initially estimated. As a result, this maintenance period will last the full three days, as scheduled, which will allow us to complete the verification of the file system and ensure the integrity of the data.

[Original Post – October 29, 2018]

Our next PACE maintenance day will start November 1 and run through November 3 as scheduled.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give your job enough time to complete successfully, you can reduce its requested walltime so that it finishes before 6am on November 1 and resubmit it.
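
For example, reducing the requested walltime is a one-line change in the job script before resubmitting. This is only a rough sketch assuming a Torque/Moab-style job script; the job name, queue, resource request, and executable below are placeholders:

    #PBS -N myjob                  # placeholder job name
    #PBS -q myqueue                # placeholder queue; use the queue you normally submit to
    #PBS -l nodes=1:ppn=8          # placeholder resource request
    #PBS -l walltime=24:00:00      # choose a value short enough for the job to finish before 6am on November 1

    cd $PBS_O_WORKDIR
    ./myprogram                    # placeholder executable

Resubmitting is then a matter of running qsub on the updated script (e.g., qsub myjob.pbs).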

Planned Tasks

Compute

  • (no user action needed) Replace power components in a rack in Rich 133
  • (no user action needed) Replace defective PCIe connectors on multiple servers

Network

  • (no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • (no user action needed) Change uplink connections from management switches

Storage

  • (no user action needed) Verify integrity of GPFS file systems
  • (no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • (no user action needed) Upgrade firmware on TruNAS storage systems

Other

  • (no user action needed) Replace PACE ICE schedulers with a physical server, to increase capacity and reliability

[Resolved] Issues with Ansys and Abaqus License Server

Friday, October 26, 2018

[Update – October 29, 2018] Abaqus and Ansys license servers are restored.

[Original Post – October 26, 2018] On Thursday, October 25, multiple virtual servers experienced problems due to data corruption from a storage issue. OIT's storage team is working to correct the problem, and the operations team is rebuilding the affected machines so they can be restored. This service interruption has taken down the license servers for Ansys and Abaqus, which has impacted PACE users' Ansys and Abaqus jobs. If you had Ansys or Abaqus jobs submitted during this period, please check them and resubmit once the license servers are back online.

Currently, the Ansys license server has been brought back online.

For additional information regarding this incident, please follow the status page link at https://status.gatech.edu

 

[Resolved] Temporary Network Interruption

Monday, October 15, 2018

We experienced a failure in the primary InfiniBand subnet manager that may have impacted both running and starting jobs. The malfunction happened in such a way that the backup IB subnet manager (SM) did not notice that the primary was failing to operate normally. We disabled the primary SM, and the secondary SM took over as designed. The service outage lasted from 12:56pm to 1:07pm today, October 15, 2018. PACE staff will continue to investigate this failure mode and adjust our procedures to help prevent it in the future. As this brief network interruption may have impacted running and starting jobs, please check your jobs for crashes and report any problems you notice to pace-support@oit.gatech.edu.
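
If you are unsure whether a job was affected, a quick check is to list your jobs and inspect their output files. The commands below are a sketch assuming the Torque/Moab tools available on the PACE login nodes; the job name and job ID are placeholders:

    qstat -u $USER            # list your jobs and their current states
    checkjob 1234567          # show scheduler details for a single job (Moab); 1234567 is a placeholder job ID
    tail myjob.o1234567       # inspect the job's standard output file for error messages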

[Resolved] pace1 storage problems

Wednesday, October 3, 2018

[Update – October 5, 2018] We worked with our vendor to address the issue impacting a network shared disk (NSD) that drastically reduced the performance of the pace1 file system when it was stressed by a large number of I/O-intensive jobs. On Thursday, we had the NSD restored to normal operation, and our benchmarks indicate a successful resolution. As a precaution, we will continue to monitor the NSDs as user workloads return to normal.

[Original Post – October 3, 2018]

On Monday, October 1, we started to experience slowness on our parallel file system (pace1), which was associated with users' I/O-intensive jobs. We have engaged the users who were/are responsible for the load. During this process, the stress on our storage and network allowed us to identify a bug in a network shared disk that caches data to improve read/write speeds. We have successfully deployed a workaround, which has dramatically improved performance, and we are working with our vendor to fully resolve this issue.

The main symptom you may have experienced is slowness when navigating through your files. Your jobs should not have been impacted beyond slower file access, which may have resulted in longer execution times (i.e., walltime).

We will update you once the issue is fully resolved in collaboration with our vendor. If you have any questions, please don't hesitate to contact us at pace-support@oit.gatech.edu.

[RESOLVED] Temporary unavailability of home directories

Friday, September 28, 2018

The storage servers that export PACE home directories experienced a problem at around 9:10am on September 28. We identified and resolved the issue within 20 minutes.

This problem caused temporary unavailability of home directories. The symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation after the issue was resolved, but we cannot be certain. Please check your jobs for crashes and report any problems you notice to pace-support@oit.gatech.edu.

 

 

[RESOLVED] Temporary unavailability of home directories

Wednesday, September 19, 2018

At around 6:10pm on September 19, 2018, the storage servers that export PACE home directories and the software repository experienced a problem. We identified and resolved the issue within 15 minutes.

This problem caused temporary unavailability of home directories and applications. The symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation after the issue was resolved, but we cannot be certain. Please check your jobs for crashes and report any problems you notice to pace-support@oit.gatech.edu.

 

 

Testflight queue transition and unavailability

Wednesday, September 12, 2018

As you know, the testflight queue includes nodes that are reserved for testing systems and services planned for future deployment.

As part of our preparations for the transition to the next OS (RHEL7), we will take this queue offline, swap its nodes with newly purchased nodes (which better represent the modern systems currently in use), and finally deploy RHEL7 on the new nodes.

Once these preparations are complete, we’ll reach out to you and ask you to test your codes. Until then, testflight will not be available and submissions will be declined.

There are currently some jobs running on this queue. We will wait for them to complete rather than killing them, but we would like to once again emphasize that using testflight for production runs is against policy. This queue should only be used for testing purposes.

Please let us know if you have any questions.