
[Resolved] Scratch filesystem issue

Posted on Thursday, 20 February, 2020

[Update 2/20/20 4:40 PM]

Use of the scratch filesystem has been restored. It appears that the automated migration task did run but could not keep up with the rate of scratch usage. We will monitor scratch for any recurrence of this issue.

Please check any running jobs for errors and resubmit if necessary.

 

[Original message 2/20/20 4:30 PM]

Shortly before 4 PM, we noticed that PACE’s GPFS scratch filesystem (mounted at ~/scratch) was experiencing an issue preventing users from writing to their scratch directories. Any running jobs that write to scratch may fail with write errors.

The scratch filesystem writes first to SSDs, and an automated task migrates data to another location when those drives near capacity. This task did not run as expected, so users received errors indicating that scratch was full. We have started the migration manually and will update this blog post when scratch is available again.
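As a rough illustration of the tiering described above, here is a minimal, hypothetical sketch of a watermark-triggered migration task in Python. The paths, threshold, and oldest-first policy are all assumptions for illustration; in practice, GPFS tiering is normally driven by the filesystem’s own policy engine rather than a script like this.

```python
import shutil
from pathlib import Path

# Hypothetical locations for the SSD tier and the capacity tier; the real
# PACE layout and migration policy are not described in this post.
SSD_TIER = Path("/gpfs/scratch-ssd")
CAPACITY_TIER = Path("/gpfs/scratch-capacity")
HIGH_WATERMARK = 0.80  # assumed: migrate once the SSD pool is 80% full


def ssd_usage_fraction() -> float:
    """Return the fraction of the SSD tier currently in use."""
    usage = shutil.disk_usage(SSD_TIER)
    return usage.used / usage.total


def migrate_oldest(n_files: int = 100) -> None:
    """Move the n least-recently-modified files to the capacity tier."""
    candidates = sorted(
        (p for p in SSD_TIER.rglob("*") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    for src in candidates[:n_files]:
        dest = CAPACITY_TIER / src.relative_to(SSD_TIER)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))


if __name__ == "__main__":
    # Intended to run periodically (e.g. from cron); the incident above is
    # consistent with such a task failing to keep up with incoming writes.
    while ssd_usage_fraction() > HIGH_WATERMARK:
        migrate_oldest()
```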

We apologize for this disruption. Please contact us at pace-support@oit.gatech.edu with any concerns.

[Resolved] Rich InfiniBand Switch Power Failure

Posted on Wednesday, 19 February, 2020

This morning, we discovered a power failure in an InfiniBand switch in the Rich datacenter that resulted in GPFS mount failures on a number of compute resources. Power was restored at 9:10am, and connectivity across the switch has been confirmed. However, jobs that ran before the fix may have experienced problems (including failure to produce results or exiting with errors) due to GPFS access time-outs. Please review the status of any recently run jobs by checking their output/error logs or, for jobs still running, the timestamps of their output files. If anything looks wrong (e.g., previously successful code exceeded its wallclock limit with no output, or files were created much later than the job started), please resubmit the job.
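For those with many jobs to review, a rough heuristic like the following sketch can help flag output files whose timestamps look suspicious relative to a job’s start time. It is most meaningful for jobs expected to produce output early; the threshold and paths are assumptions, and the job start time would come from your scheduler (e.g. qstat output).

```python
import sys
from datetime import datetime
from pathlib import Path


def flag_late_outputs(job_start: datetime, output_dir: str,
                      max_lag_hours: float = 1.0) -> None:
    """Print files in output_dir whose modification time lags far behind
    the job's start time, which may indicate stalled GPFS writes."""
    for path in Path(output_dir).iterdir():
        if not path.is_file():
            continue
        mtime = datetime.fromtimestamp(path.stat().st_mtime)
        lag_hours = (mtime - job_start).total_seconds() / 3600.0
        if lag_hours > max_lag_hours:
            print(f"check {path}: modified {lag_hours:.1f}h after job start")


if __name__ == "__main__":
    # Hypothetical usage:
    #   python flag_late_outputs.py "2020-02-19 06:00" ~/scratch/myjob
    start = datetime.strptime(sys.argv[1], "%Y-%m-%d %H:%M")
    flag_late_outputs(start, sys.argv[2])
```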

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

RESOLVED [Hive and Testflight-CODA Clusters] Connectivity Issue to All CODA Resources

Posted on Friday, 14 February, 2020

RESOLVED [1:44 PM]:

The network engineers report that they have fixed the issue and are continuing to monitor it, although the cause remains unknown. Jobs appear to have continued uninterrupted on the Hive and Testflight-CODA clusters, but we encourage users to verify their jobs.
https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5e46cb01fa0e5304bc04ecb5
Any residual issues should be reported to pace-support@oit.gatech.edu. Thank you.

UPDATE [11:33 AM]:

Georgia Tech IT is aware of the situation and is investigating as well.

Original Message:

Around 11:00 AM, we noticed that we could not connect to any resources housed in CODA, including the Hive and Testflight-CODA clusters. The source of the problem is being investigated; until it is resolved, access to these resources will be disrupted. In theory, jobs on these clusters should continue to run. Further details will be provided as they become available.

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu. Thank you.

PACE Maintenance – February 27-29

Posted on Thursday, 13 February, 2020

We are preparing for PACE’s next maintenance period, planned for three days starting on Thursday, February 27, and ending Saturday, February 29, 2020. Georgia Power will begin work on Thursday to establish a Micro Grid power generation facility; while that work should complete within 48 hours, any delays may extend the maintenance outage for the Hive and Testflight-CODA clusters through Sunday. PACE clusters in Rich will not be impacted by any delays in Georgia Power’s work. Should any issues and resultant delays occur, users will be notified accordingly.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off; these jobs will be released as soon as the maintenance activities are complete (a short sketch of this hold logic follows the lists below).

We are still finalizing planned activities for the maintenance period. Here is the current list:

ITEMS REQUIRING USER ACTION:

  • None

ITEMS NOT REQUIRING USER ACTION:

  • RHEL7 clusters will receive critical patches.
  • Updates will be made to PACE databases and configurations.
  • [Hive, Testflight-CODA clusters] Power down all of the research hall for Georgia Power reconnections
  • [Hive cluster] Replace failed InfiniBand leaf on EDR switch
  • [Hive cluster] InfiniBand subnet managers will be reconfigured for better redundancy
  • [Hive and Testflight-CODA clusters] Lmod, the environment module system, will be updated to a newer version
  • [Hive cluster] Run OSU Benchmark test on idle resources
  • [GPFS and Lustre file systems] Apply latest maintenance releases and firmware updates

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Resolved] Shared Scheduler for Shared Clusters is Down

Posted on Monday, 10 February, 2020

Functionality has been restored to the Shared Scheduler as of 3:00pm, and jobs are being ingested and run as normal.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

[Original Note – Feb 10, 2020 @ 2:40pm] The Shared Scheduler has gone down. This came to our attention at around 2:00pm. The PACE team is investigating the issue, and we will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on the shared clusters.

PACE Procurement Update and Schedule

Posted on Wednesday, 29 January, 2020

Dear Colleagues,

As you are aware from our prior communications and the recent issue of our PACE Newsletter, the PACE team has been quite busy. We’ve deployed the Hive cluster, a state-of-the-art resource funded by NSF; we continue to expand our team to provide an even higher level of service to our community; and we are preparing the CODA data center to receive research workloads migrated from the Rich data center. We will be following up with you on this last point very soon. Today, we are reaching out to inform you about the PACE purchasing schedule for the remainder of FY20 and to provide an update on how recent changes in procurement requirements have impacted our timelines, as I’m sure you have seen in your departments as well.

First, the situation with procurement. The sizable orders we place on behalf of the faculty have come under increased scrutiny. This added complexity has resulted in much more time devoted to compliance, and the flexibility that we once enjoyed is no longer achievable. More significantly, each order we place now requires a competitive bid process. As a result, our first order of the year, FY20-Phase1, has been considerably delayed and is still in the midst of a bid process. We have started a second order, FY20-Phase2, in parallel to address situations of urgent need and expiring funds, and we are preparing to begin the bid process for this order shortly. An important point to note is that purchases of PACE storage are not affected; storage can be added as needed via request to pace-support@oit.gatech.edu.

Given the extended time required to process orders, we have time for only one more order before the year-end deadlines are upon us. We will accept letters of intent to participate in FY20-Phase3 from now through February 20, 2020, and we will need complete specifications, budgets, account numbers, etc. by February 27, 2020. Please see the schedule below for further milestones. This rapidly approaching deadline is necessary for us to process the order in time to use FY20 funds. Due to the bidding process, we will have reduced ability to make configuration changes after the “actionable requests” period and, by extension, reduced ability to communicate precise costs in advance. We will continue to provide budgetary estimates, and final costs will be communicated after bids are awarded.

Please know that we are doing everything possible to advocate for the research community and to navigate the best way through these difficulties.

 

  • February 20: Intent to participate in FY20-Phase3 due to pace-support@oit.gatech.edu
  • February 27: All details due to PACE (configuration, quantity, not-to-exceed budget, account number, financial contact, queue name)
  • April 22: Anticipated date to award bid
  • April 29: Anticipated date to finalize quote with selected vendor
  • May 6: Exact pricing communicated to faculty, all formal approvals received
  • May 8: Requisition entered into Workday
  • May 22: GT-Procurement issues purchase order to vendor
  • July 31: Vendor completes hardware installation and handover to PACE
  • August 28: PACE completes acceptance testing, resources become ready for research

 

To view the published schedule or for more information, visit http://pace.gatech.edu/participation or email pace-support@oit.gatech.edu.

Going forward, the PACE Newsletter will be published quarterly at https://pace.gatech.edu/pace-newsletter.

Best Regards,

– The PACE Team

 

[Restored] GPFS Filesystem Issue

Posted on Tuesday, 28 January, 2020

[Update 1/29/20 5:32 PM]

We are happy to report that our GPFS filesystem was restored to full functionality early this afternoon. Our CI team identified a failed switch as the source of the problems on a group of nodes. We restored the switch, and we are investigating improved backup systems to handle such cases in the future.

We apologize for the recent issues you have faced. As always, please send an email to pace-support@oit.gatech.edu with any concerns, so we can investigate.

 

 

[Original Post 1/28/20 12:46 PM]

We have been experiencing intermittent disruptions on our GPFS filesystem, especially the scratch filesystem (~/scratch), since yesterday. The PACE team is actively investigating the source of this issue, and we are working with our support vendor to restore the system to full functionality. A number of users have reported slow file reads, hanging commands, and jobs that run more slowly than usual or do not appear to progress. We apologize for any interruptions you may be experiencing on PACE resources, and we will alert you when the issue is resolved.

[Resolved] Hive Cluster Scheduler Down

Posted on Monday, 27 January, 2020

The Hive scheduler was restored at around 2:20PM. The scheduler services had crashed; we restored them successfully and have put measures in place to prevent a similar recurrence in the future. Some user jobs may have been impacted during this scheduler outage. Please check your jobs, and if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Thank you again for your patience, and we apologize for the inconvenience.

[Original Note – January 27, 2020, 2:16PM] The Hive scheduler has gone down. This came to our attention at around 1:40pm. The PACE team is investigating the issue, and we will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on Hive.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

 

Globus authentication and endpoints

Posted on Wednesday, 15 January, 2020

We became aware this morning of an issue with Globus authentication to the “gatechpace#datamover” endpoint that many of you use to transfer files to/from PACE resources. We are working to repair this now; in the meantime, please use the “PACE Internal” endpoint instead. This endpoint provides access to the same filesystems you use with the datamover endpoint (plus PACE Archive storage, for those who have signed up for our archive service) and functions in exactly the same way when interacting with Globus. You may continue to use this newer endpoint even once we have datamover functioning again. For full instructions on using Globus with PACE, visit our Globus documentation page.
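For users who script their transfers, here is a hedged sketch using the Globus Python SDK (globus_sdk) to look up the “PACE Internal” endpoint and queue a transfer to it. The access token and source endpoint UUID are placeholders, and the SDK’s OAuth2 login flow is omitted; see our Globus documentation page for the supported instructions.

```python
import globus_sdk

# Placeholders: a real script would obtain an access token through the
# Globus OAuth2 login flow and supply your own endpoint's UUID.
TRANSFER_TOKEN = "REPLACE_WITH_TRANSFER_TOKEN"
SOURCE_ENDPOINT_ID = "REPLACE_WITH_SOURCE_UUID"  # e.g. your personal endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Find the "PACE Internal" endpoint UUID by display name.
pace_internal_id = None
for ep in tc.endpoint_search("PACE Internal"):
    if ep["display_name"] == "PACE Internal":
        pace_internal_id = ep["id"]
        break

# Queue a transfer into a PACE directory via the PACE Internal endpoint.
tdata = globus_sdk.TransferData(
    tc, SOURCE_ENDPOINT_ID, pace_internal_id, label="to PACE"
)
tdata.add_item("/local/data/results.tar", "/~/results.tar")
task = tc.submit_transfer(tdata)
print("Submitted transfer, task ID:", task["task_id"])
```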

Please keep in mind that Globus is the best way to transfer files to/from PACE resources. Contact us at pace-support@oit.gatech.edu if you have any questions about using Globus.

[Re-Scheduled] Advisory of Hive cluster outage 1/20/20

Posted on Thursday, 9 January, 2020

We are writing to inform you of an upcoming Hive cluster outage that we learned about yesterday. PACE has no control over this outage. As part of the design of the Coda data center, we are working with the Southern Company (Georgia Power) on the creation and operation of a Micro Grid power generation facility, a set of products to enable research into local generation of up to 2MW of off-grid power.

In order to connect this Micro Grid facility to the Coda data center’s power, Southern Company will need to shut down all power to the research hall in Coda. As a result, the Hive cluster will need to be shut down during this procedure, and we are placing a scheduler reservation to prevent any jobs from running during the shutdown. The shutdown is currently planned to begin at 8am on the Georgia Tech MLK Holiday of January 20th. GT checked whether this date could be rescheduled to give longer notice but was unable to change it. GT is working with the Southern Company to minimize the duration of the power outage, but the final duration is not yet known; it is currently expected to be at least 24 hours.

[Update] The planned outage of the CODA data center has been re-scheduled, so the Hive cluster will remain available until the next PACE maintenance period beginning February 27. The reservation has been removed, and jobs will run on January 20 as usual.

If you have any questions, please contact PACE Support at pace-support@oit.gatech.edu.