PACE A Partnership for an Advanced Computing Environment

January 29, 2020

PACE Procurement Update and Schedule

Filed under: Uncategorized — Semir Sarajlic @ 7:38 pm

Dear Colleagues,

As you are aware from our prior communications and the recent issue of our PACE Newsletter, the PACE team has been quite busy. We’ve deployed the Hive cluster, a state-of-the-art resource funded by NSF; we continue to expand our team to provide an even higher level of service to our community; and we are preparing the CODA data center to receive research workloads migrated from the Rich data center. We will be following up with you on this last point very soon. Today, we are reaching out to inform you about the PACE purchasing schedule for the remainder of FY20 and to provide an update on how the recent changes in procurement requirements have impacted our timelines, as I’m sure you have seen in your departments as well.

First, the situation with procurement. The sizable orders we place on behalf of the faculty have come under increased scrutiny. This added complexity has resulted in much more time devoted to compliance, and the flexibility that we once enjoyed is no longer achievable. More significantly, each order we place now requires a competitive bid process. As a result, our first order of the year, FY20-Phase1, has been considerably delayed and is still in the midst of a bid process. We have started a second order, FY20-Phase2, in parallel to address situations of urgent need and expiring funds. We are making preparations to begin the bid process for this order shortly. An important point to note is that purchases of PACE storage are not affected. Storage can be added as needed via request to pace-support@oit.gatech.edu.

Given the extended time that is required to process orders, we have time for only one more order before the year-end deadlines are upon us. We will accept letters of intent to participate in FY20-Phase3 from now through February 20, 2020. We will need complete specifications, budgets, account numbers, etc. by February 27, 2020. Please see the schedule below for further milestones. This rapidly approaching deadline is necessary for us to have sufficient time to process this order while FY20 funds can still be used. Due to the bidding process, we will have reduced ability to make configuration changes after the “actionable requests” period. By extension, we also have reduced ability to precisely communicate costs in advance. We will continue to provide budgetary estimates, and final costs will be communicated after bids are awarded.

Please know that we are doing everything possible to best advocate for the research community and navigate the best way through these difficulties.

 

February 20 Intent to participate in FY20-Phase3 due to pace-support@oit.gatech.edu
February 27 All details due to PACE (configuration, quantity, not-to-exceed budget, account number, financial contact, queue name)
April 22 Anticipated date to award bid
April 29 Anticipated date to finalize quote with selected vendor
May 6 Exact pricing communicated to faculty, all formal approvals received
May 8 Requisition entered into Workday
May 22 GT-Procurement issues purchase order to vendor
July 31 Vendor completes hardware installation and handover to PACE
August 28 PACE completes acceptance testing, resources become ready for research

 

To view the published schedule or for more information, visit http://pace.gatech.edu/participation or email pace-support@oit.gatech.edu

Going forward, the PACE Newsletter will be published quarterly at  https://pace.gatech.edu/pace-newsletter.

Best Regards,

– The PACE Team

 

January 28, 2020

[Restored] GPFS Filesystem Issue

Filed under: Uncategorized — Michael Weiner @ 5:45 pm

[Update 1/29/20 5:32 PM]

We are happy to report that our GPFS filesystem was restored to functionality early this afternoon. Our CI team was able to identify a failed switch as the source of problems on a group of nodes. We restored the switch, and we are investigating the deployment of improved backup systems to handle such cases in the future.

We apologize for the recent issues you have faced. As always, please send an email to pace-support@oit.gatech.edu with any concerns, so we can investigate.

 

 

[Original Post 1/28/20 12:46 PM]

We have been experiencing intermittent disruptions on our GPFS filesystem, especially on the mounted GPFS scratch (i.e., ~/scratch) filesystem, since yesterday. The PACE team is actively investigating the source of this issue, and we are working with our support vendor to restore the system to full functionality. A number of users have reported slow reads of files, hanging commands, and jobs that run more slowly than usual or do not appear to progress. We apologize for any interruptions you may be experiencing on PACE resources at this time, and we will alert you when the issue is resolved.

January 27, 2020

Hive Cluster Scheduler Down

Filed under: Uncategorized — Semir Sarajlic @ 7:11 pm

The Hive scheduler was restored at around 2:20 PM. The scheduler services had crashed; we restored them successfully and put measures in place to prevent a recurrence. Some user jobs may have been impacted during this scheduler outage. Please check your jobs, and if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Thank you again for your patience, and we apologize for the inconvenience.

[Original Note — January 27, 2020, 2:16 PM] The Hive scheduler has gone down. This came to our attention at around 1:40 PM. The PACE team is investigating the issue, and we will follow up with the details. During this period, you will not be able to submit jobs or monitor current jobs on Hive.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

 

January 15, 2020

Globus authentication and endpoints

Filed under: Uncategorized — Michael Weiner @ 4:17 pm

We became aware this morning of an issue with Globus authentication to the “gatechpace#datamover” endpoint that many of you use to transfer files to/from PACE resources. We are working to repair this right now; in the meantime, please use the “PACE Internal” endpoint instead. This endpoint provides access to the same filesystem that you use with the datamover endpoint (plus PACE Archive storage, for those who have signed up for our archive service). Going forward, you may continue to use this newer endpoint instead of the older datamover one, even after datamover is functioning again. For full instructions on using Globus with PACE, visit our Globus documentation page. PACE Internal functions in exactly the same way as gatechpace#datamover when interacting with Globus.

Please keep in mind that Globus is the best way to transfer files to/from PACE resources. Contact us at pace-support@oit.gatech.edu if you have any questions about using Globus.

January 9, 2020

[Re-Scheduled] Advisory of Hive cluster outage 1/20/20

Filed under: Uncategorized — Michael Weiner @ 1:05 pm

We are writing to inform you of the upcoming Hive cluster outage that we learned about yesterday. PACE has no control over this outage. As part of the design of the Coda data center, we are working with the Southern Company (Ga Power) in the creation and operation of a Micro Grid power generation facility. This is a set of products to enable research into local generation of up to 2MW of off-grid power.

In order to connect this facility/Micro grid to the Coda data center power, Southern Company will need to shut down all power to the research hall in Coda. As a result, the Hive cluster will need to be shut down during this procedure, and we are placing a scheduler reservation to prevent any jobs from running during the shutdown. This is currently planned to begin at 8am on the Georgia Tech MLK Holiday of January 20th. GT has checked to see if this date could be rescheduled to give a longer notice, but GT was unable to change the date. GT is working with the Southern Company to minimize the duration of this power outage, but a final outage time requirement is not known. It is currently expected to be at least 24 hours in length.

The planned outage of the CODA data center has been re-scheduled, and so the Hive cluster will be available until the next PACE maintenance period on February 27. The reservation has been removed, so work should proceed on January 20 as usual.

If you have any questions, please contact PACE Support at pace-support@oit.gatech.edu.

January 8, 2020

Rich Data Center UPS Maintenance

Filed under: Uncategorized — Aaron Jezghani @ 8:56 pm

The Rich data center uninterruptible power supply (UPS) will undergo maintenance to replace failed batteries on January 11, starting at 8:00 AM. Due to the power configuration, we do not expect any of the systems in Rich to lose power during this time. All PACE services should function normally.

Please contact pace-support@oit.gatech.edu if you need more details.

January 7, 2020

[Re-Scheduled] Hive Cluster — Policy Update

Filed under: Uncategorized — Semir Sarajlic @ 7:17 pm

Since the deployment of the Hive cluster this past fall, we have been pleased with the rapid growth of our user community on the cluster and with its rapidly increasing utilization. During this period, we have received user feedback that compels us to make changes that will further increase productivity for all users of Hive. Hive PIs have approved the changes listed below, which were deployed on January 9:

  1. Hive-gpu: The maximum walltime for jobs on hive-gpu will be decreased to 3 days from the current 5-day maximum, to address the longer job wait times that users have experienced on the hive-gpu queue.
  2. Hive-gpu: To ensure that GPUs do not sit idle, jobs will not be permitted to use a CPU:GPU ratio higher than 6:1 (i.e., 6 cores per GPU). Each hive-gpu node has 24 CPUs and 4 GPUs.
  3. Hive-nvme-sas: A new queue, hive-nvme-sas, will be created that combines and shares compute nodes between the hive-nvme and hive-sas queues.
  4. Hive-nvme-sas, hive-nvme, hive-sas: The maximum walltime for jobs on the hive-nvme, hive-sas, and hive-nvme-sas queues will be increased to 30 days from the current 5-day maximum.
  5. Hive-interact: A new interactive queue, hive-interact, will be created. This queue provides access to 32 Hive compute nodes (192 GB RAM and 24 cores). It is intended for quick access to resources for testing and development. The walltime limit will be 1 hour.
  6. Hive-priority: A new hive-priority queue will be created. This queue is reserved for researchers with time-sensitive research deadlines. For access to this queue, please communicate the appropriate dates/upcoming deadlines to the PACE team so that we can obtain the necessary approvals to provide you access to the priority queue. Please note that we may not be able to provide access to the priority queue for requests made less than 14 days in advance of the time when the resource is needed, because of jobs already running at the time of the request.
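
As a rough illustration of the limits listed above, the following Python sketch checks a planned job request against the stated walltime caps and the 6:1 CPU:GPU ratio on hive-gpu. The function name `check_request` and the lookup table are hypothetical and are not part of any PACE tool; the scheduler itself remains the authority on what is accepted. Queues whose limits are not stated in this post (for example hive-priority) are simply not checked.

```python
# Walltime caps (in hours) as stated in the policy list above.
# Queues not mentioned there are intentionally absent from this table.
MAX_WALLTIME_HOURS = {
    "hive-gpu": 3 * 24,
    "hive-nvme": 30 * 24,
    "hive-sas": 30 * 24,
    "hive-nvme-sas": 30 * 24,
    "hive-interact": 1,
}

# At most 6 cores per GPU on hive-gpu (nodes have 24 CPUs and 4 GPUs).
MAX_CORES_PER_GPU = 6


def check_request(queue, walltime_hours, cores=0, gpus=0):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    limit = MAX_WALLTIME_HOURS.get(queue)
    if limit is not None and walltime_hours > limit:
        problems.append(
            f"walltime of {walltime_hours} h exceeds the {limit} h limit on {queue}"
        )

    # Assumption: the ratio rule is read as cores <= 6 * gpus, so a hive-gpu
    # request with cores but no GPUs is also reported.
    if queue == "hive-gpu" and cores > MAX_CORES_PER_GPU * gpus:
        problems.append(
            f"{cores} cores with {gpus} GPU(s) exceeds the "
            f"{MAX_CORES_PER_GPU}:1 CPU:GPU ratio on hive-gpu"
        )

    return problems


if __name__ == "__main__":
    # A 5-day, 24-core, 2-GPU request breaks both hive-gpu rules.
    for problem in check_request("hive-gpu", walltime_hours=120, cores=24, gpus=2):
        print(problem)
```

A request for a full hive-gpu node (24 cores, 4 GPUs) for 72 hours or less passes both checks, since 24 cores across 4 GPUs is exactly the 6:1 ratio.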

Who is impacted:

  • All Hive users who use the hive-gpu, hive-nvme, and hive-sas queues
  • The newly created queues will benefit, and thereby impact, all Hive users.

User Action:

  • Users will need to update their PBS scripts to reflect the new walltime limits and the CPU:GPU ratio requirement on the hive-gpu queue.
  • These changes will not impact currently running jobs.

Additionally:

We would like to remind you of the upcoming Hive cluster outage due to the creation of a Micro Grid power generation facility. At 8 AM on Monday, January 20th (Georgia Tech holiday for MLK day), the Hive cluster will be shut down for an anticipated 24 hours. A reservation has been put in place on all Hive nodes during this period, and any user jobs submitted that would overlap with this outage will receive a warning indicating this detail and will be held in the queue until the work is complete. A similar warning will be generated for jobs overlapping with the upcoming cluster maintenance on February 27.

The planned outage of the CODA data center has been re-scheduled, and so the Hive cluster will be available until the next PACE maintenance period on February 27. The reservation has been removed, so work should proceed on January 20 as usual.

Our documentation has been updated to reflect these changes and queue additions, and can be found at http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please do not hesitate to contact us at pace-support@oit.gatech.edu.

January 3, 2020

Upcoming VPN updates

Filed under: Uncategorized — Michael Weiner @ 10:19 pm

We would like to let you know about upcoming upgrades to Georgia Tech’s VPNs. The VPN software will be updated by OIT to introduce a number of bug fixes and security improvements, including support for macOS 10.15 as well as Windows 10 ARM64-based devices. After the upgrade, your local VPN client will automatically download and install an update upon your next connection attempt. Please allow the software to update, then continue with your connection on the upgraded interface.

The main campus “anyc” VPN, which is used to access PACE from off-campus locations, will be upgraded on January 28. The “pace” VPN, which is used to access our ITAR/CUI clusters from any location, will be upgraded on January 21.

If you wish to try the new client sooner, you may do so by connecting to the dev.vpn.gatech.edu VPN, which will prompt a download of the upgraded client. Due to capacity limitations, please disconnect after the update and return to using your normal VPN service.

For ongoing updates, please visit the OIT status announcements for the pace VPN or the anyc VPN.

As always, please contact us at pace-support@oit.gatech.edu with any concerns.
