
Emergency Firewall Maintenance

Thursday, March 26, 2020

Dear Researchers,

The GT network team will undertake an emergency code upgrade on the departmental Palo Alto firewalls beginning at 8pm tonight. Because the firewalls operate as a high-availability pair, the upgrade should not cause major disruption to traffic to or from the PACE systems. The same upgrade has already been completed successfully on other firewalls of the same hardware and software versions, with no observed disruptions.

With that said, there is a possibility that connections to the PACE login servers may see a temporary interruption between 8pm and 11pm TONIGHT as the firewalls are upgraded. This should not impact any running jobs unless a job requests a license from a license server elsewhere on campus (e.g., abaqus) at the exact moment of the firewall changeover. Additionally, there is a possibility that users may experience interruptions during interactive sessions (e.g., edit sessions, screen, VNC jobs, Jupyter notebooks). Batch jobs already scheduled and/or running on the clusters should otherwise progress normally.
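If you suspect a license request failed during the changeover, one quick check is to query the license server directly. A minimal sketch, assuming a FlexNet-served product such as abaqus; the port@host below is a placeholder, not the actual campus license server:

    # Query a FlexNet (FlexLM) license server for seat availability.
    # 27000@license.example.gatech.edu is a placeholder; substitute the
    # actual port@host of the license server for your product.
    lmutil lmstat -a -c 27000@license.example.gatech.edu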

Please check the status and completion of any of your jobs that ran this evening for unexpected errors, and resubmit if you believe an interruption was the cause, for example as sketched below. We apologize in advance for any inconvenience this required emergency code upgrade may cause.
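A minimal sketch of that check, assuming the Torque-style qstat/qsub commands in use on PACE at the time; the job script and log names are placeholders:

    # List your recent jobs and their states (R = running, C = completed).
    qstat -u $USER

    # Inspect the scheduler's error log for a suspect job; Torque writes
    # <jobname>.e<jobid> in the submission directory.
    grep -i error myjob.e12345

    # If the failure coincided with the firewall window, resubmit.
    qsub myjob.pbs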

You may follow the status of this maintenance on GT’s status page (https://status.gatech.edu).

As always, if you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[RESOLVED] RHEL7 Dedicated Scheduler Down

Wednesday, March 25, 2020

[RESOLVED] We have restored functionality to the RHEL7 dedicated scheduler. Thank you for your patience.

[UPDATE] The RHEL7 dedicated scheduler, accessed via login7-d, is again down. We are actively working to resolve the issue at this time, and we will update you when the scheduler is restored. Please follow the same blog post (http://blog.pace.gatech.edu/?p=6715) for updates. If you have any questions, please contact pace-support@oit.gatech.edu.

[RESOLVED] We have rebooted the RHEL7 Dedicated scheduler, and functionality has been restored. Thank you for your patience.

[ORIGINAL MESSAGE] Roughly 30 minutes ago, we identified an issue with the scheduler for the dedicated RHEL7 clusters; this scheduler is responsible for all jobs submitted from the dedicated RHEL7 headnode, login7-d. All other schedulers are operating as expected. We are actively working to resolve the problem, but in the meantime you will be unable to submit new jobs or query the status of queued or running jobs.

If you have any questions, please contact pace-support@oit.gatech.edu.

PACE Operations Update — COVID-19

Thursday, March 12, 2020

[UPDATE – 03/19/2020]

Dear Researchers,

This is a brief update to our prior communication about the COVID-19 situation, which we are carefully monitoring. In light of the recent communication from the Office of the Executive Vice President for Research regarding the research ramp-down plan, please rest assured that PACE will continue normal operations of our resources. We will continue to provide support during this period.

Regarding PACE training classes, we have modified our classes to offer them virtually via BlueJeans, and this week we hosted our first two virtual classes, Linux 101 and Optimization 101. Please visit our training site to register for upcoming classes, and our Research Scientists will be in touch with instructions for accessing the classes virtually. Additionally, our consulting sessions will be offered virtually as scheduled; you may check our “Upcoming Events” section for the virtual coordinates of upcoming consulting sessions.

Also, as a point of clarification about the new campus VPN (GlobalProtect): this is a new service in an early deployment/testing phase, and it is NOT replacing the current campus VPN (i.e., Cisco AnyConnect). At this time, the two are operating in parallel, and you may use either VPN service to connect to PACE resources.

Overall, given the challenges that COVID-19 has presented, we want to reassure our community that we are here to support your computational research. Please do not hesitate to contact us at pace-support@oit.gatech.edu if you have any questions or concerns.

Warm regards, and stay safe.

The PACE Team

[UPDATE – 03/13/2020] As a brief update to yesterday’s message, the new VPN (GlobalProtect) is a new service that is still undergoing testing. It is intended to help with the anticipated increase in demand, but it is NOT replacing the current campus VPN (i.e., the Cisco AnyConnect you’ve been using). At this time, the two are operating in parallel, and you may use either VPN service to connect to PACE resources.

[Original Message – 03/12/2020]

Dear Researchers,

PACE is carefully monitoring developments with the COVID-19 situation including the recent message from President Cabrera announcing that GT is moving to online/distance instruction after spring break.  We want to reassure the community that PACE will continue normal operations.

Given the anticipated increase in demand on our VPN infrastructure, please follow the instructions on accessing OIT’s recently deployed Next Generation Campus VPN, which will help you access PACE resources.
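For reference, once connected to the VPN, you reach PACE as usual over SSH. A minimal sketch; both the username and the fully qualified hostname below are placeholders (the hostname follows the login7-d headnode naming used elsewhere on this blog and may differ for your cluster):

    # After connecting to the campus VPN, SSH to your usual PACE headnode
    # with your GT account name.
    ssh gtusername3@login7-d.pace.gatech.edu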

If you have any questions or concerns, you may reach us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[Resolved] Scratch filesystem issue

Thursday, February 20, 2020

[Update 2/20/20 4:40 PM]

Use of the scratch filesystem is restored. It appears that the automated migration task did run but could not keep up with the rate of scratch usage. We will monitor scratch for recurrence of this issue.

Please check any running jobs for errors and resubmit if necessary.

 

[Original message 2/20/20 4:30 PM]

Shortly before 4 PM, we noticed that PACE’s mounted GPFS scratch filesystem (~/scratch) was experiencing an issue preventing users from writing to their scratch directories. Any running jobs that write to scratch may experience failures due to write errors.

The scratch filesystem writes first to SSDs, and an automated task migrates data to another location when those drives near capacity. This task did not run as expected, causing users to receive errors indicating that scratch was full. We have manually started the migration and will update this blog post when scratch is again available.
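Once scratch is available again, a quick way to confirm it is writable before resubmitting jobs; a minimal sketch using the ~/scratch mount path mentioned above:

    # Confirm the scratch filesystem accepts writes again.
    touch ~/scratch/.write_test && rm ~/scratch/.write_test && echo "scratch OK"

    # Check remaining capacity on the filesystem backing ~/scratch.
    df -h ~/scratch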

We apologize for this disruption. Please contact us at pace-support@oit.gatech.edu with any concerns.

[Resolved] Rich InfiniBand Switch Power Failure

Wednesday, February 19, 2020

This morning, we discovered a power failure in an InfiniBand switch in the Rich Datacenter that resulted in GPFS mount failures on a number of compute resources. Power was restored at 9:10am, and connectivity across the switch has been confirmed. However, prior to the fix, jobs may have experienced problems (including failure to produce results or exiting with errors) due to GPFS access time-outs. Please review the status of any recently run jobs by checking the output/error logs or, for jobs still running, the timestamps of output files for any discrepancies. If an issue appears (e.g., previously successful code exceeded its wallclock limit without producing output, or files were created much later than the job’s start), please resubmit the job.
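For example, one way to spot such discrepancies; a minimal sketch, with file and directory names as placeholders (log names follow the usual PBS convention):

    # Compare output-file timestamps against the job's start time; files
    # created long after the job began may indicate a GPFS stall.
    ls -l --time-style=full-iso my_output_dir/

    # Scan scheduler error logs for I/O or timeout messages.
    grep -iE "stale|timeout|input/output error" myjob.e*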

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

RESOLVED [Hive and Testflight-CODA Clusters] Connectivity Issue to All CODA Resources

Friday, February 14, 2020

RESOLVED [1:44 PM]:

The network engineers report that they have fixed the issue and are continuing to monitor it, although the cause remains unknown. Jobs appear to have continued uninterrupted on the Hive and Testflight-CODA clusters, but we encourage users to verify.
https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5e46cb01fa0e5304bc04ecb5
Any residual issues should be reported to pace-support@oit.gatech.edu. Thank you.

UPDATE [11:33 AM]:

Georgia Tech IT is aware of the situation and is investigating as well.

Original Message:

Around 11:00 AM, we noticed that we could not connect to any resources housed in CODA, including the Hive and Testflight-CODA clusters. The source of the problem is being investigated; in the meantime, access to these resources will be affected. In theory, jobs on these clusters should continue to run. Further details will be provided as they become available.

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu. Thank you.

[COMPLETED] PACE Maintenance – February 27-29

Thursday, February 13, 2020

[COMPLETED – 6:51 PM 2/28/2020]

We are pleased to announce that our February 2020 maintenance period (https://blog.pace.gatech.edu/?p=6676) has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible. 

As usual, a small number of straggling nodes will require additional intervention.

A summary of the changes and actions accomplished during this maintenance period is as follows:

  • (Completed) RHEL7 clusters received critical patches
  • (Completed) Updates made to PACE databases and configurations
  • (Deferred) [Hive, Testflight-CODA clusters] Power down of the research hall for Georgia Power reconnections
  • (Completed) [Hive cluster] Replaced failed InfiniBand leaf on EDR switch
  • (Completed) [Hive cluster] Reconfigured InfiniBand subnet managers for better redundancy
  • (In Progress) [Hive and Testflight-CODA clusters] Update of Lmod, the environment module system, to a newer version
  • (Completed) [Hive cluster] Ran OSU Benchmark test on idle resources
  • (Completed) [GPFS file system] Applied latest maintenance releases and firmware updates
  • (In Progress) [Lustre file system] Applying latest maintenance releases and firmware updates

Thank you for your patience!

[UPDATE – 8:52 AM 2/27/2020]

The PACE maintenance period is underway. For the duration of maintenance, users will be unable to access PACE resources. Once the maintenance activities are complete, we will notify users of the availability of the cluster.

Also, we have been told by Georgia Power that they expect their work may take up to 72 hours to complete; as such, the maintenance outage for the CODA research hall (Hive and Testflight-CODA clusters) will extend until 6:00 AM Monday morning. We will provide updates as they are available.

[Original Message]

We are preparing for PACE’s next maintenance period on February 27-29, 2020. This maintenance period is planned for three days, starting on Thursday, February 27, and ending Saturday, February 29. However, Georgia Power will begin work to establish a Micro Grid power generation facility on Thursday; while that work should complete within 48 hours, any delays may extend the maintenance outage for the Hive and Testflight-CODA clusters through Sunday. PACE clusters in Rich will not be impacted by any delays in Georgia Power’s work. Should any issues and resultant delays occur, users will be notified accordingly.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off (see the walltime sketch after the list below). These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is the current list:

ITEM REQUIRING USER ACTION:

  • None

ITEMS NOT REQUIRING USER ACTION:

  • RHEL7 clusters will receive critical patches.
  • Updates will be made to PACE databases and configurations.
  • [Hive, Testflight-CODA clusters] Power down all of the research hall for Georgia Power reconnections
  • [Hive cluster] Replace failed InfiniBand leaf on EDR switch
  • [Hive cluster] InfiniBand subnet managers will be reconfigured for better redundancy
  • [Hive and Testflight-CODA clusters] Lmod, the environment module system, will be updated to a newer version
  • [Hive cluster] Run OSU Benchmark test on idle resources
  • [GPFS and Lustre file systems] Apply latest maintenance releases and firmware updates
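As referenced above, the scheduler holds only jobs that could still be running when systems power down; a job whose requested walltime ends before the window can proceed normally. A minimal sketch, assuming the Torque-style qsub syntax PACE clusters used at the time (script name and resource counts are placeholders):

    # A walltime request short enough to finish before the Feb 27
    # maintenance window lets the scheduler start the job normally.
    qsub -l walltime=48:00:00,nodes=1:ppn=4 myjob.pbs

    # A request long enough to overlap the window (e.g. walltime=120:00:00)
    # would be held and released after maintenance completes.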

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Resolved] Shared Scheduler for Shared Clusters is Down

Monday, February 10, 2020

Functionality has been restored to the shared cluster as of 3:00pm, and jobs are being ingested and run as normal.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

[Original Note – Feb 10, 2020 @ 2:40pm] The Shared Scheduler has gone down; this came to our attention at around 2:00pm. The PACE team is investigating the issue and will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on the Shared Clusters.

PACE Procurement Update and Schedule

Wednesday, January 29, 2020

Dear Colleagues,

As you are aware from our prior communications and the recent issue of our PACE Newsletter, the PACE team has been quite busy. We’ve deployed the Hive cluster, a state-of-the-art resource funded by the NSF; we continue to expand our team to provide an even higher level of service to our community; and we are preparing the CODA data center to receive research workloads migrated from the Rich data center. We will be following up with you on this last point very soon. Today, we are reaching out to inform you about the PACE purchasing schedule for the remainder of FY20 and to provide an update on how recent changes in procurement requirements have impacted our timelines, as I’m sure you have seen in your departments as well.

First, the situation with procurement: the sizable orders we place on behalf of the faculty have come under increased scrutiny. This added complexity has meant much more time devoted to compliance, and the flexibility that we once enjoyed is no longer achievable. More significantly, each order we place now requires a competitive bid process. As a result, our first order of the year, FY20-Phase1, has been considerably delayed and is still in the midst of a bid process. We have started a second order, FY20-Phase2, in parallel to address situations of urgent need and expiring funds, and we are preparing to begin the bid process for this order shortly. An important point to note is that purchases of PACE storage are not affected. Storage can be added as needed via request to pace-support@oit.gatech.edu.

Given the extended time required to process orders, we have time for only one more order before the year-end deadlines are upon us. We will accept letters of intent to participate in FY20-Phase3 from now through February 20, 2020, and we will need complete specifications, budgets, account numbers, etc. by February 27, 2020. Please see the schedule below for further milestones. This rapidly approaching deadline is necessary to give us sufficient time to process the order using FY20 funds. Due to the bidding process, we will have limited ability to make configuration changes after the “actionable requests” period, and by extension, limited ability to communicate precise costs in advance. We will continue to provide budgetary estimates, and final costs will be communicated after bids are awarded.

Please know that we are doing everything possible to advocate for the research community and navigate the best way through these difficulties.

 

  • February 20: Intent to participate in FY20-Phase3 due to pace-support@oit.gatech.edu
  • February 27: All details due to PACE (configuration, quantity, not-to-exceed budget, account number, financial contact, queue name)
  • April 22: Anticipated date to award bid
  • April 29: Anticipated date to finalize quote with selected vendor
  • May 6: Exact pricing communicated to faculty; all formal approvals received
  • May 8: Requisition entered into Workday
  • May 22: GT Procurement issues purchase order to vendor
  • July 31: Vendor completes hardware installation and handover to PACE
  • August 28: PACE completes acceptance testing; resources become ready for research

 

To view the published schedule or for more information, visit http://pace.gatech.edu/participation or email pace-support@oit.gatech.edu.

Going forward, the PACE Newsletter will be published quarterly at https://pace.gatech.edu/pace-newsletter.

Best Regards,

– The PACE Team

 

[Restored] GPFS Filesystem Issue

Tuesday, January 28, 2020

[Update 1/29/20 5:32 PM]

We are happy to report that our GPFS filesystem was restored to full functionality early this afternoon. Our CI team identified a failed switch as the source of problems on a group of nodes. We restored the switch, and we are investigating the deployment of improved backup systems to handle such cases in the future.

We apologize for the recent issues you have faced. As always, please send an email to pace-support@oit.gatech.edu with any concerns, so we can investigate.

 

 

[Original Post 1/28/20 12:46 PM]

We have been experiencing intermittent disruptions on our GPFS filesystem, especially on the mounted GPFS scratch (i.e., ~/scratch) filesystem, since yesterday. The PACE team is actively investigating the source of this issue, and we are working with our support vendor to restore the system to full functionality. A number of users have reported slow reads of files, hanging commands, and jobs that run more slowly than usual or do not appear to progress. We apologize for any interruptions you may be experiencing on PACE resources at this time, and we will alert you when the issue is resolved.