Archive for category Uncategorized

[Resolved] Campus-wide Intermittent Network Outage Impacting PACE

Posted on Friday, 12 July, 2019

Today at around 1:55pm, OIT reported campus-wide intermittent network slowness after one of the DNS servers went down, causing trouble with authentication, GRS, and more. OIT resolved this issue as of 4:12pm, and we have recovered the storage that exports home directories, which was affected by this issue. The storage problem made home directories temporarily unavailable, with symptoms such as hanging codes, commands, and login attempts.

We believe that most jobs resumed operation after the issue was resolved, but we cannot be sure. Please check whether any of your jobs crashed, and report any issues to pace-support@oit.gatech.edu.

For details on the OIT issue reported, please visit their link 

Thank you for your attention to this, and apologies for the inconvenience.

[Resolved] Dedicated Scheduler – Job Submissions Paused

Posted on Thursday, 11 July, 2019

[Update – July 11, 2019 – 2:45pm] The dedicated scheduler is back online and operational after we corrected the node-to-queue associations that had been broken by a faulty configuration. We have corrected our automated procedure to prevent such an incident in the future, and we have removed the pause on job submission. You may now resume submitting your jobs. Please check on any jobs that you submitted since 3:30pm yesterday (7/10/2019), as many of those jobs were terminated.

Again, apologies for the inconvenience this has caused.

[Original Post – July 11, 2019 – 10:38am] Today, at approximately 10:10am, we paused job submissions to queues that are managed by the dedicated scheduler. Research teams will not be able to submit new jobs to the following queues: kennedy-lab, granulous, atlas-dufek, chow, athena-debug, cochlea, atlas-6, njord-6, atlantis, jabberwocky6, megatron, acceptance, hadoop, aces, drive, complexity, corso, blue, monkeys-k33, athena-6, core, ase1-debug-6, microbio-1, radius, medprint-6, monkeys_gpu, pampa-6, monkeys, keeneland, athena-intel, atlas-intel, apurimac-bg-6, staml, ofed-test, semap-6, martini, skade, tmlhpc-6, atlas-debug, wohler, rozell, mps, prv-5-6, aryabhata-6, hadean-gpu, epictetus, neutrons-6, davenporter, atlas, athena-8core, uranus-6, hadean, ase1-6, atlas-simon, enterprise, pampa-debug-6, skadi

We took this action to resolve the issue we have experienced since the evening of July 10, in which jobs were erroneously terminated after failing to reach their appropriate nodes. Pausing job submission also prevents any new jobs from being terminated. We are working to resolve this issue as quickly as possible. In the meantime, we ask that you refrain from submitting jobs to the queues listed above. We will follow up with an update as we work through this issue. Thank you for your attention to this, and we are sorry for the inconvenience.
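To check quickly whether a queue you use is among those paused, the list above can be turned into a simple lookup. The following is a minimal Python sketch, not a PACE tool: the helper name `is_paused` is hypothetical, and the queue names are copied from the list in this post.

```python
# Queue names copied from the announcement above.
# This helper is an illustrative assumption, not part of any PACE tooling.
PAUSED_QUEUES_TEXT = """
kennedy-lab,granulous,atlas-dufek,chow,athena-debug,cochlea,atlas-6,njord-6,
atlantis,jabberwocky6,megatron,acceptance,hadoop,aces,drive,complexity,corso,
blue,monkeys-k33,athena-6,core,ase1-debug-6,microbio-1,radius,medprint-6,
monkeys_gpu,pampa-6,monkeys,keeneland,athena-intel,atlas-intel,apurimac-bg-6,
staml,ofed-test,semap-6,martini,skade,tmlhpc-6,atlas-debug,wohler,rozell,mps,
prv-5-6,aryabhata-6,hadean-gpu,epictetus,neutrons-6,davenporter,atlas,
athena-8core,uranus-6,hadean,ase1-6,atlas-simon,enterprise,pampa-debug-6,skadi
"""

# Split on commas and strip whitespace/newlines around each name.
PAUSED_QUEUES = frozenset(
    name.strip() for name in PAUSED_QUEUES_TEXT.split(",") if name.strip()
)


def is_paused(queue: str) -> bool:
    """Return True if the given queue name appears in the paused list."""
    return queue.strip() in PAUSED_QUEUES


if __name__ == "__main__":
    # "my-queue" is a made-up name used only to show the negative case.
    for queue in ("atlas-6", "my-queue"):
        status = "paused" if is_paused(queue) else "not in the paused list"
        print(f"{queue}: {status}")
```

The lookup is an exact, case-sensitive match, so `atlas` and `atlas-6` are treated as different queues, as they are in the list.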

Scheduled UPS Fan Replacement in Rich Data Center

Posted on Wednesday, 12 June, 2019

[June 12, 2019 – 4:45pm] The OIT operations team notified PACE of planned maintenance on Saturday, June 15, from 7:00AM – 2:00PM to replace a fan in one of the UPS units in the Rich Data Center. This work may require that UPS to run in maintenance bypass, which would temporarily leave that room without power backup. No outage is expected from this work; however, if a power outage does occur during the maintenance period, PACE clusters and the jobs running on them are at risk of being interrupted. Again, a power outage during this maintenance period is unlikely.

For further details about this planned OIT maintenance, please visit the following OIT link.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

[Resolved] Temporary Network Interruption

Posted on Tuesday, 28 May, 2019

On Friday, May 24, at 4pm, we experienced a partial failure of our primary InfiniBand (IB) subnet manager (SM) that may have impacted running and starting MPI jobs that use IP over IB. Our backup SM did not take over because the primary SM failed only partially. On Saturday, May 25, at 12:15pm, we switched to a new subnet manager and restored the network. This service outage lasted from Friday, May 24, 4:00pm – Saturday, May 25, 12:15pm. Since this network interruption may have impacted running jobs, please check whether any of your jobs crashed and report any problems to pace-support@oit.gatech.edu.
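If you keep your own record of when your jobs started and ended, one way to narrow down which jobs to inspect is to test each one against the outage window stated above. The following is a minimal Python sketch under that assumption; the function name `overlaps_outage` and the sample job times are hypothetical, and only the window boundaries come from this post.

```python
from datetime import datetime

# Outage window as stated in the post (local campus time, year 2019).
OUTAGE_START = datetime(2019, 5, 24, 16, 0)
OUTAGE_END = datetime(2019, 5, 25, 12, 15)


def overlaps_outage(job_start: datetime, job_end: datetime) -> bool:
    """Return True if the job's run interval intersects the outage window."""
    return job_start < OUTAGE_END and job_end > OUTAGE_START


if __name__ == "__main__":
    # Length of the outage window.
    print("Outage duration:", OUTAGE_END - OUTAGE_START)

    # Hypothetical job times, for illustration only.
    job_start = datetime(2019, 5, 24, 9, 30)
    job_end = datetime(2019, 5, 24, 18, 0)
    print("Job overlaps outage:", overlaps_outage(job_start, job_end))
```

A job that overlaps the window did not necessarily fail; the check only flags jobs worth a closer look, in particular MPI jobs that use IP over IB.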

PACE Ready for Research

Posted on Saturday, 18 May, 2019

Our May 2019 maintenance (https://blog.pace.gatech.edu/?p=6473) is complete one day ahead of schedule! We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. We are postponing the replacement of CMOS batteries on the servers due to a scheduling conflict with the vendor. As usual, there are a small number of straggling nodes that we will address over the coming days.

Compute

  • (Complete) Upgrade testflight cluster to RHEL 7.6
  • (Complete) Upgrade gemini-gpu and gemini-cpu clusters to RHEL7, which will require user action (only for gemini-cpu/gpu clusters' users)
  • (Complete) Switch nodes between chemx and gemini-cpu queues
  • (Postponed) Replace CMOS batteries on multiple servers

Network

  • (Complete) Replace a faulty InfiniBand switch, which affects a single rack with no impact to the complete fabric
  • (Complete) Migrate Rich to campus connections to 10Gbps

Storage

  • (Complete) Reboot ICE storage servers to correct issues with backup application
  • (Complete) Perform detailed performance analysis of the GPFS environment, in order to fine tune parameters to improve performance

Other

  • (Postponed) Updates to the submit filters in the schedulers
  • (Complete) Update salt master and minions

 

If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[Complete] PACE quarterly maintenance – May 16-18, 2019

Posted on Tuesday, 7 May, 2019

[Update – 05/09/2019] Our final quarterly maintenance schedule will include the following list of tasks:

Compute

  • (no user action needed) Replace CMOS batteries on multiple servers
  • (no user action needed) Upgrade testflight cluster to RHEL 7.6
  • (some user action needed) Upgrade gemini-gpu and gemini-cpu clusters to RHEL7, which will require user action (only for gemini-cpu/gpu clusters' users)
  • (no user action needed) Switch nodes between chemx and gemini-cpu queues

Network

  • (no user action needed) Replace a faulty InfiniBand switch, which affects a single rack with no impact to the complete fabric
  • (no user action needed) Migrate Rich to campus connections to 10Gbps

Storage

  • (no user action needed) Reboot ICE storage servers to correct issues with backup application
  • (no user action needed) Perform detailed performance analysis of the GPFS environment, in order to fine tune parameters to improve performance

Other

  • (no user action needed) Updates to the submit filters in the schedulers
  • (no user action needed) Update salt master and minions

 

[Original Post – May 7, 2019 – 12:32pm] We are preparing for our quarterly maintenance, which is planned for three days, starting on Thursday, May 16, and running through Saturday, May 18, 2019.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

In general, we will perform maintenance on the PACE network and migrate from 10Gbps to 40Gbps connections, analyze GPFS storage performance, upgrade the schedulers, replace CMOS batteries, and upgrade the testflight cluster to the latest RHEL 7 kernel, 3.10.0-957.12.1 (i.e., RHEL 7.6).

While we are still working on finalizing the task list and details, none of these tasks are expected to require any user actions.
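After the upgrade, one way to see whether a node is running at least the kernel release mentioned above (3.10.0-957.12.1) is to compare the numeric fields of its release string. The following is a minimal Python sketch, not a PACE procedure: the helper names are hypothetical, and the field-by-field numeric comparison is a simplification rather than the full RPM version-comparison rules.

```python
import platform
import re

# Kernel release named in the post for the testflight upgrade.
TARGET_RELEASE = "3.10.0-957.12.1"


def parse_kernel_release(release: str) -> tuple:
    """Extract the leading numeric fields of a kernel release string.

    For example, '3.10.0-957.12.1.el7.x86_64' yields (3, 10, 0, 957, 12, 1).
    """
    match = re.match(r"\d+(?:[.-]\d+)*", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


def is_at_least(release: str, minimum: str = TARGET_RELEASE) -> bool:
    """Return True if `release` is numerically at or above `minimum`."""
    return parse_kernel_release(release) >= parse_kernel_release(minimum)


if __name__ == "__main__":
    # platform.release() returns the running kernel release on Linux;
    # the comparison is only meaningful on a RHEL 7 family system.
    current = platform.release()
    try:
        print(current, "meets target:", is_at_least(current))
    except ValueError as error:
        print(error)
```

Running this on a login or compute node prints the current kernel release and whether it meets the target.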

Brief Interruption to PACE VPN During Service Maintenance

Posted on Friday, 3 May, 2019

[May 3, 2019 – 4:53pm] On May 7, 2019, from 8:00pm to 9:00pm (EST), GT IT will be conducting maintenance on our VPN service. During this period, users who are connected to our ITAR/ASDL/CUI clusters via the VPN (pace.vpn.gatech.edu) will be disconnected. This interruption will be brief (about 5 minutes), after which you may reconnect to the VPN and then to the cluster. This maintenance will not impact running batch jobs, but it may impact interactive jobs running during this period. For additional details on the maintenance taking place, please visit the following link.

Thank you for your attention to this maintenance that GT IT is conducting.

 

PACE Procurement Timeline Adjustments

Posted on Friday, 29 March, 2019

PACE Staff have completed our move to the CODA building and are settling in. We’ve also added a couple of new faces to the team; announcements will follow shortly.

As the year-end purchasing deadlines approach, we wanted to update the community on some changes to our procurement calendar. We’re doing our best to advocate for the research community and navigate some tough realities. We’ve nearly exhausted our space in the Rich Computer Center, and are very limited in our ability to deploy new equipment in that space. The CODA datacenter will be our new home (more on that below) but is not quite ready yet.

As such, we have cancelled the previously planned FY19-Phase3 and will need to shift some dates for our last order in FY19, FY19-Phase4. This shift results in FY19-Phase4 and FY20-Phase1 essentially being deployed concurrently around October of 2019. For this reason, we strongly encourage faculty to participate in FY20-Phase1 and reserve FY19-Phase4 for those who need to use funds expiring in FY19.

We will also adjust configurations and pricing for FY19-Phase4 and FY20-Phase1 based on upcoming processing technology and market conditions once that pricing is available to the public.

Finally, planning is in progress for PACE to migrate existing research cyberinfrastructure from the Rich data center to CODA, and all efforts will be made to minimize disruption to research efforts during this move. The execution phase will not begin until at least October 2019.

To view the published schedule online or for more information, visit http://pace.gatech.edu/participation or email pace-support@oit.gatech.edu.

Best Regards,

-PACE Team

[Resolved] PACE VM Migration – impacting various services

Posted on Wednesday, 6 March, 2019

[March 7, 2019 – 12:33pm] We completed migrating our virtual servers and restored access to the testflight and novazohar clusters. If you encounter any issues, please let us know at pace-support@oit.gatech.edu.

Tasks completed:

Complete – Migrate two license servers

Complete – Migrate testflight headnode

Complete – Migrate novazohar headnode

Complete – Migrate testflight scheduler

 

[March 6, 2019 – 10:44am] PACE will be migrating two license servers, the testflight headnode, the testflight scheduler, and the novazohar headnode. This migration will be very brief, taking only as long as rebooting the systems. We are reserving 30 minutes for this service on Thursday, March 7 at 12:00pm. The impact will be brief and will include an inability to connect to the designated login/headnodes (i.e., novazohar and testflight), as well as a possible inability to submit jobs whose applications require a license. This service should not impact any running jobs.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Campus network experiencing intermittent network latency

Posted on Monday, 4 March, 2019

The Office of Information Technology (OIT) reported intermittent network latency impacting parts of the campus network. This would present as occasional slowness and timeouts when accessing PACE-managed resources and when connecting from PACE to non-PACE license servers, etc. This may have caused new jobs to fail when attempting to check out software licenses that are not managed by PACE. OIT has installed additional capacity and has isolated and neutralized part of the cause of the issue, and is currently monitoring for any further network traffic issues.

For details and updates to this incident, please refer to OIT’s status page detailing this incident.

If you have any questions, please don’t hesitate to contact pace-support@oit.gatech.edu.