
Author Archive

[Resolved] Storage problem impacting applications and login

Posted on Thursday, 28 February, 2019

At about 2:30pm, during a routine storage server procedure, we experienced a problem related to a service not starting properly. We resolved the issue within 15 minutes. The incident caused temporary unavailability of some applications and home directories; symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation once the issue was resolved, but we cannot verify this for every job. Please check your jobs for any that crashed, and report any problems you notice to pace-support@oit.gatech.edu.
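
If you are unsure how to check, one quick way is to query the scheduler for your jobs and inspect any that completed unexpectedly. A minimal sketch, assuming the Torque-style qstat command referenced elsewhere on this blog (the job ID is a placeholder):

    # List all of your jobs and their current state
    # (R = running, C = completed, Q = queued)
    qstat -u $USER

    # Inspect a specific job's record, including its exit status if the
    # scheduler still has it; replace 12345 with one of your job IDs
    qstat -f 12345 | grep -i exit_status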

Thank you for your attention, and apologies for this inconvenience.

[Resolved] Expected Network Interruptions Due to Campus Network Maintenance – Intermittent delays or disruption to major campus IT services

Posted on Monday, 18 February, 2019

[Original Post – February 18, 2019] On Sunday, Feb. 24, OIT will perform a series of data center upgrades and migrations. This service window includes intermittent delays or disruption to major campus IT services between 7 a.m. and 8 p.m. as well as occasional interruptions in wireless connectivity between 9 a.m. and 12 p.m.

During this service upgrade, intermittent interruptions may prevent users from connecting to PACE-managed resources or may disconnect active sessions, which can interrupt interactive jobs that rely on an active SSH connection to a given cluster. However, these upgrades will not impact running or queued batch jobs. OIT anticipates that all service upgrades and migrations will conclude by 8 p.m., after which PACE users should be able to resume their work as usual.

For additional information and details on the services that OIT will be upgrading and migrating, please refer to the status page at https://status.gatech.edu.

[Resolved] Scheduler problem on RHEL7 Dedicated Clusters

Posted on Saturday, 2 February, 2019

[Resolved – February 1, 21:35] At about 5:20pm on February 1, the scheduler for the new RHEL7 dedicated clusters went down after encountering a segmentation fault. We've resolved the incident and brought the scheduler back online. Based on our assessment, this incident impacted two jobs, but we still advise that you review your jobs from today. Additionally, users who attempted to submit jobs between 5:20pm and 9:35pm may have experienced scheduler communication errors when running commands such as qstat and qsub.
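
If a qsub or qstat command failed with a communication error during that window, the submission most likely never reached the queue and can simply be rerun. A hedged sketch (the script name is a placeholder):

    # Confirm the scheduler is responding again by listing its queues
    qstat -q

    # A submission that failed with a communication error was never queued,
    # so submit it again
    qsub myjob.pbs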

We will continue to monitor the scheduler and update if needed. If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you for your attention, and apologies for this inconvenience.

 

[Resolved] Networking (InfiniBand) problems

Posted on Monday, 28 January, 2019

[Resolved, January 28] One of our main Mellanox IB switches partially went down on Sunday morning, leaving a large number of compute nodes without access to the IB interconnect. Our system engineers resolved the matter at about 9:41am, and the IB switch is back online. As far as we know, the following queues were impacted: athena-intel, atlantis, atlas-6-sunge, atlas-intel, force-6, joe-intel, joe-test, novazohar, pace-devel, swarm, and zohar. We advise that you review your jobs from this weekend, as well as currently running jobs, since this incident may have interrupted them. If your jobs failed with MPI errors, or with errors writing files to /scratch/ or /data/[Your_Files], please resubmit them.
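
One way to spot affected jobs is to search your job output and error files for the failure patterns described above. A rough sketch, assuming your PBS output (*.o*) and error (*.e*) files are in the submission directory; the patterns and script name are illustrative:

    # List output/error files that mention MPI or I/O failures
    grep -il "mpi" *.o* *.e* 2>/dev/null
    grep -il "input/output error\|no such file" *.e* 2>/dev/null

    # Resubmit any job that failed
    qsub myjob.pbs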

We will continue to monitor this switch and update if needed.  If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you, and sorry for this inconvenience.

Brief Interruption to VPN During Urgent VPN Service Maintenance

Posted on Wednesday, 28 November, 2018

On November 29, 2018, from 10:00pm to 11:00pm (EST), OIT will be conducting maintenance on our VPN service. During this period, users who are connected to our clusters via the VPN (anyc.vpn.gatech.edu) will be disconnected and will need to reconnect to the VPN and then to the cluster. This maintenance will not impact any running batch jobs, but it may impact interactive jobs running during this period. For additional details on the maintenance taking place, please visit: https://status.gatech.edu/incidents/9ljkjx72462x
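
Reconnecting afterward is a two-step process: re-establish the VPN session, then SSH back into the cluster. A hedged sketch, assuming the Cisco AnyConnect command-line client is installed at its default Linux path; the login hostname below is a placeholder, so use the one you normally connect to:

    # 1. Reconnect to the VPN
    /opt/cisco/anyconnect/bin/vpn connect anyc.vpn.gatech.edu

    # 2. SSH back into your usual login node (placeholder hostname)
    ssh your-gt-account@login.pace.gatech.edu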

Thank you for your attention to this urgent maintenance that OIT is conducting.

[Resolved] CoC-ICE Cluster: Multi-node job problem

Posted on Wednesday, 21 November, 2018

[Update – November 26, 2018] We’ve identified the issue and resolved the configuration error.  Users are now able to submit multi-node jobs on the CoC-ICE cluster.

[Original Post – November 21, 2018]

We are investigating an issue in which users experience hanging jobs when they submit multi-node jobs on the CoC-ICE cluster. This issue does not impact users who submit single-node jobs, and it does not impact the PACE-ICE cluster.
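
For reference, a multi-node job is any submission that requests more than one node in the scheduler's resource list. A minimal sketch of such a request, assuming a Torque-style scheduler (the queue name, resource values, and program name are illustrative only):

    # Request two nodes with four processors per node
    #PBS -N multinode-test
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=00:10:00
    #PBS -q coc-ice

    cd $PBS_O_WORKDIR
    # Launch one MPI rank per allocated core; hosts come from $PBS_NODEFILE
    mpirun -np 8 -hostfile $PBS_NODEFILE ./my_mpi_program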

Thank you for your patience, and we apologize for this inconvenience while we resolve this issue.

[Resolved] ICE Clusters – Intermittent account problems

Posted on Thursday, 8 November, 2018

We received multiple reports of jobs crashing after being allocated nodes on the instructional clusters (COC-ICE and PACE-ICE). We've determined that intermittent account problems are the cause, and we are working toward a solution.

Thank you for your patience, and we apologize for the inconvenience.

 

[RESOLVED] Scratch storage problems

Posted on Wednesday, 7 November, 2018

We received multiple reports of jobs crashing due to insufficient scratch storage, even though physical usage is only at 41%.

We've identified the cause: disk pools were unable to migrate data internally to other pools because a threshold-based migration process was not restarted after maintenance day. We have now started this process and are migrating data to the appropriate pools, which should resolve the job crashes caused by insufficient scratch storage.

We will continue to monitor the scratch storage to ensure its operation is optimal.  If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you for your patience, and apologies for the inconvenience.

PACE clusters ready for research

Posted on Saturday, 3 November, 2018

Our November 2018 maintenance (http://blog.pace.gatech.edu/?p=6360) is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days, including nodes that will need PCIe connectors replaced as a preventative measure.

Completed Tasks

Compute

  • Complete – (no user action needed) Replace power components in a rack in Rich 133
  • Complete – (no user action needed) Replace defective PCIe connectors on multiple servers
      • As a precaution, additional identified nodes will have their PCIe connectors replaced when parts are delivered. No user action will be needed.

Network

  • Complete – (no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • Complete – (no user action needed) Change uplink connections from management switches

Storage

  • Complete – (no user action needed) Verify integrity of GPFS file systems
  • Complete – (no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • Complete – (no user action needed) Upgrade firmware on TruNAS storage systems

Other

  • Complete – (some user action needed) Replaced PACE ICE schedulers with a physical server to increase capacity and reliability. Some jobs on the PACE ICE cluster need to be re-submitted; we have contacted the affected users individually.

[COMPLETE] PACE Quarterly Maintenance – November 1-3, 2018

Posted on Monday, 29 October, 2018

[Update – November 3, 2018, 4:45pm] 

Our November 2018 maintenance is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days, including nodes that will need PCIe connectors replaced as a preventative measure.

Please note that some jobs on the PACE ICE cluster need to be re-submitted; we have contacted the affected users individually.

[Update – November 2, 2018] 

Verification of the integrity of the GPFS file system is taking longer than initially estimated. As a result, this maintenance will last the full three days as scheduled, which will allow us to complete the verification of the file system and ensure the highest integrity of the data.

[Original Post – October 29, 2018]

Our next PACE maintenance day will start November 1 and run through November 3 as scheduled.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give your jobs enough time to complete successfully before 6am on November 1, you can reduce the requested walltime and resubmit them.
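
For example, a minimal sketch of lowering the walltime request in a Torque-style job script and resubmitting (the values and file name are illustrative):

    # In your job script, reduce the requested walltime so the job can
    # finish before 6am on November 1, e.g. change
    #PBS -l walltime=120:00:00
    # to
    #PBS -l walltime=24:00:00

    # then resubmit
    qsub myjob.pbs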

Planned Tasks

Compute

  • (no user action needed) Replace power components in a rack in Rich 133
  • (no user action needed) Replace defective PCIe connectors on multiple servers

Network

  • (no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • (no user action needed) Change uplink connections from management switches

Storage

  • (no user action needed) Verify integrity of GPFS file systems
  • (no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • (no user action needed) Upgrade firmware on TruNAS storage systems

Other

  • (no user action needed) Replace PACE ICE schedulers with a physical server, to increase capacity and reliability