
Archive for category Uncategorized

[Resolved] PACE VM Migration – impacting various services

Posted on Wednesday, 6 March, 2019

[March 7, 2019 – 12:33pm] We have completed the migration of our virtual servers and restored access to the testflight and novazohar clusters. If you encounter any issues, please let us know at pace-support@oit.gatech.edu.

Tasks completed:

Complete – Migrate two license servers

Complete – Migrate testflight headnode

Complete – Migrate novazohar headnode

Complete – Migrate testflight scheduler

 

[March 6, 2019 – 10:44am] PACE will be migrating two license servers, the testflight headnode, the testflight scheduler, and the novazohar headnode. The migration will be very brief, taking about as long as a reboot of the systems. We are reserving 30 minutes for this service on Thursday, March 7 at 12:00pm. During this window, you may briefly be unable to connect to the affected login/headnodes (i.e., novazohar and testflight) or to submit jobs whose applications require a license. This service should not impact any running jobs.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Campus network experiencing intermittent network latency

Posted on Monday, 4 March, 2019

The Office of Information Technology (OIT) reported intermittent network latency impacting parts of the campus network. This would present as occasional slowness and timeouts when accessing PACE managed resources, and when accessing non-PACE license servers from PACE. New jobs may have failed while attempting to check out software licenses that are not managed by PACE. OIT has installed additional capacity and has isolated and neutralized part of the cause of the issue. The network is currently being monitored for any further traffic issues.

For details and updates to this incident, please refer to OIT’s status page detailing this incident.

If you have any questions, please don’t hesitate to contact pace-support@oit.gatech.edu.

 

[Complete] PACE staff is moving to Coda building

Posted on Monday, 4 March, 2019

[March 20, 2019] This is a friendly note to confirm that PACE staff has moved over to the Coda building. While we have moved out of the Rich Building, we continue to monitor the Rich data center as we have in the past. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Original Post – March 4, 2019] As you may already know, the PACE team will be moving to Coda during the weeks of March 11 and March 18. More specifically, our offices will be in transition on March 15 and March 18. Please note that this move is only for staff members and not the data center. The data center will continue to operate as usual, but our team’s responses may be delayed during this period, especially on March 15 and 18.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu

[Resolved] Storage problem impacting applications and login

Posted on Thursday, 28 February, 2019

At about 2:30pm, during a routine storage server procedure, we experienced a problem caused by a service not starting properly. We resolved the issue within 15 minutes. The incident caused temporary unavailability of some applications and home directories. Symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation after the issue was resolved, but we can’t be sure. Please check your jobs for any crashes and report any problems you notice to pace-support@oit.gatech.edu.

Thank you for your attention, and apologies for this inconvenience.

[Resolved] Expected Network Interruptions Due to Campus Network Maintenance – Intermittent delays or disruption to major campus IT services

Posted on Monday, 18 February, 2019

[Original Post – February 18, 2019] On Sunday, Feb. 24, OIT will perform a series of data center upgrades and migrations. This service window includes intermittent delays or disruption to major campus IT services between 7 a.m. and 8 p.m. as well as occasional interruptions in wireless connectivity between 9 a.m. and 12 p.m.

During this service upgrade, the intermittent service interruptions will result in periods when users may not be able to connect to PACE managed resources, or when they may be disconnected from their sessions. This may interrupt interactive jobs that rely on an active SSH connection to a given cluster. However, these upgrades will not impact running or queued batch jobs. OIT expects all service upgrades and migrations to conclude by 8 p.m., after which PACE users can resume their work as usual.

For additional information and details on the services that OIT will be upgrading and migrating, please refer to the status page link at https://status.gatech.edu

PACE clusters ready for research

Posted on Saturday, 16 February, 2019

Our February 2019 maintenance (http://blog.pace.gatech.edu/?p=6419) was completed on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes that we will address over the coming days.

Please report any problems you notice to pace-support@oit.gatech.edu.

Compute

* (COMPLETE) Vendor will replace defective components on groups of servers

Network

* (COMPLETE) Ethernet network reconfiguration

Storage

* (COMPLETE) GPFS / DDN enclosure reset
* (COMPLETE) NAS maintenance and reconfiguration

Other

* (COMPLETE) PACE VMWare reconfiguration to remove out of support hosts
* (COMPLETE) Migration of Megatron cluster to RHEL7

[Resolved] Scheduler problem on RHEL7 Dedicated Clusters

Posted on Saturday, 2 February, 2019

[Resolved – February 1, 21:35] At about 5:20pm on February 1, the scheduler for the new RHEL7 dedicated clusters went down after encountering a segmentation fault. We’ve resolved the incident and brought the scheduler back online. Based on our assessment, this incident impacted two jobs. We advise that you review your jobs from today. Additionally, users who attempted to submit jobs between 5:20pm and 9:35pm may have experienced scheduler communication errors when running qstat, qsub… commands.
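
To check whether any of your own submissions fell inside the affected period, a minimal Python sketch is given here. The window bounds come from the times reported in this post; the job names and submission times are hypothetical placeholders that you would replace with your own records.

```python
from datetime import datetime

# Outage window reported in this post (local time, February 1, 2019).
OUTAGE_START = datetime(2019, 2, 1, 17, 20)
OUTAGE_END = datetime(2019, 2, 1, 21, 35)


def in_outage_window(timestamp: datetime) -> bool:
    """Return True if the timestamp falls within the scheduler outage."""
    return OUTAGE_START <= timestamp <= OUTAGE_END


# Hypothetical submission times; replace with your own records.
submissions = {
    "job_a": datetime(2019, 2, 1, 16, 45),
    "job_b": datetime(2019, 2, 1, 18, 10),
    "job_c": datetime(2019, 2, 1, 22, 0),
}

# Names of submissions that should be reviewed and possibly resubmitted.
to_review = sorted(name for name, ts in submissions.items() if in_outage_window(ts))
print(to_review)
```

Any submission flagged this way is only a candidate for review; whether it actually failed depends on the scheduler's response at the time.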

We will continue to monitor the scheduler and update if needed. If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you for your attention, and apologies for this inconvenience.

 

[Resolved] Networking (InfiniBand) problems

Posted on Monday, 28 January, 2019

[Resolved, January 28] One of our main Mellanox IB switches partially went down on Sunday morning, leaving a large number of compute nodes without access to the IB interconnect. Our system engineers resolved the matter at about 9:41am, and the IB switch is back online. As far as we know, the following queues were impacted: athena-intel, atlantis, atlas-6-sunge, atlas-intel, force-6, joe-intel, joe-test, novazohar, pace-devel, swarm, and zohar. We advise that you review your jobs from this weekend, as well as any current jobs, as this incident may have interrupted them. If your jobs failed with MPI errors or with errors writing files to /scratch/ or /data/[Your_Files], please resubmit them.
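
If you keep a record of which queue each of your jobs ran in, a short Python sketch such as the following can narrow down which jobs to review. The queue names are the ones listed in this post; the job IDs and the `jobs_to_review` helper are hypothetical and only illustrate the filtering step.

```python
# Queues listed as impacted in this post.
IMPACTED_QUEUES = {
    "athena-intel", "atlantis", "atlas-6-sunge", "atlas-intel", "force-6",
    "joe-intel", "joe-test", "novazohar", "pace-devel", "swarm", "zohar",
}


def jobs_to_review(jobs):
    """Return the IDs of jobs that ran in an impacted queue, in input order."""
    return [job_id for job_id, queue in jobs if queue in IMPACTED_QUEUES]


# Hypothetical (job ID, queue) pairs; replace with your own records.
my_jobs = [("1001", "force-6"), ("1002", "some-other-queue"), ("1003", "zohar")]
print(jobs_to_review(my_jobs))
```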

We will continue to monitor this switch and update if needed.  If you experience any further issues, please contact pace-support@oit.gatech.edu.

Thank you, and sorry for this inconvenience.

PACE quarterly maintenance – (Feb 15-16, 2019)

Posted on Friday, 18 January, 2019

[Update – 02/11/2019] Our updated quarterly scheduled maintenance task list will include the following:

Compute

  • (no user action needed) Vendor will replace defective components on groups of servers

Network

  • (no user action needed) Ethernet network reconfiguration

Storage

  • (no user action needed) GPFS / DDN enclosure reset
  • (no user action needed) NAS maintenance and reconfiguration

Other

  • (no user action needed) PACE VMWare reconfiguration to remove out of support hosts

 

[Original Post – 01/18/2019] We are preparing for a short maintenance period starting on February 15, 2019. Unlike our regular schedule, which starts on Thursdays and takes three days, this maintenance will start on a Friday and take only two days.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
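
The principle behind holding jobs can be illustrated with a small Python sketch: a job is held when its requested walltime could carry it past the start of maintenance. This is not the scheduler's actual implementation. The February 15, 2019 date is from this post, while the 6:00am start time, the example dates, and the `would_be_held` helper are assumptions made only for illustration.

```python
from datetime import datetime, timedelta

# Maintenance begins on February 15, 2019 (from this post). The 6:00am start
# time is an assumption made only for this illustration.
MAINTENANCE_START = datetime(2019, 2, 15, 6, 0)


def would_be_held(earliest_start: datetime, walltime: timedelta) -> bool:
    """Return True if the job could still be running when maintenance begins."""
    return earliest_start + walltime > MAINTENANCE_START


# Hypothetical submission time the day before maintenance.
now = datetime(2019, 2, 14, 12, 0)
print(would_be_held(now, timedelta(hours=12)))  # would finish before maintenance
print(would_be_held(now, timedelta(hours=24)))  # would overlap maintenance
```

Under these assumptions, requesting a shorter walltime is what allows a job to run before the maintenance window instead of waiting for it to end.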

In general, we’ll perform maintenance on the GPFS storage, migrate some virtual machines to new servers, perform hardware changes on one of the clusters, and finalize the migration of “/usr/local”, which is a network-attached mount point on all machines, to a more reliable storage pool.

While we are still working on finalizing the task list and details, none of these tasks are expected to require any user actions.

We’ll update this post as we have more details.

 

 

Changes to mount points (no user impact expected)

Posted on Thursday, 3 January, 2019

Our investigation of the system failures that temporarily rendered the scientific repository unresponsive (http://blog.pace.gatech.edu/?p=6390) showed that some additional maintenance is required. To facilitate this maintenance, we will make a change to the mount point for /usr/local, which is network mounted and identical on all compute nodes.

Our tests indicate that this swap can be performed live, without impacting running jobs. It’s also completely transparent to users; you don’t need to change or do anything as a result.

In the unlikely event of job crashes that you suspect are caused by this operation, please contact pace-support@oit.gatech.edu and we’ll be happy to assist.

Thank you,
PACE Team