
PACE Maintenance Days 5/19/2021-5/21/2021

Wednesday, May 5, 2021

We are writing to notify you of our next maintenance period, which will begin at 6:00 AM on Wednesday, 5/19/2021, and is scheduled to conclude by 11:59 PM on Friday, 5/21/2021. Because the systems will be powered off, the scheduler will hold any jobs whose requested walltimes would overlap the maintenance period until it ends. Access to computational and storage resources will be unavailable for the duration.
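For illustration, here is a minimal sketch (using GNU date; the submission time and walltime are hypothetical) of the check the scheduler effectively performs: a job can only start if its requested walltime would finish before the maintenance window opens.

```shell
# Hypothetical example: would an 8-hour job submitted at 10:00 AM on
# 5/18 finish before the maintenance period starts at 6:00 AM on 5/19?
maint_start=$(date -d "2021-05-19 06:00" +%s)   # maintenance begins (epoch seconds)
submit_time=$(date -d "2021-05-18 10:00" +%s)   # example submission time
walltime_hours=8
job_end=$(( submit_time + walltime_hours * 3600 ))
if [ "$job_end" -lt "$maint_start" ]; then
  echo "job can start: finishes before maintenance"
else
  echo "job will be held until after maintenance"
fi
```

A 30-hour walltime requested at the same time would cross into the window, so the scheduler would hold that job until the maintenance period completes.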

For your reference, the following tasks are scheduled:

Items Requiring User Action:

  • None currently scheduled.

Items Not Requiring User Action:

  • [Network] Replace InfiniBand cables on login-hive1.
  • [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [Network] Update KVM/qemu hosts in CUI clusters.
  • [Archive] Removal of InfiniteIO from pace-archive.
  • [System] Remove /opt/pace directories everywhere.
  • [Hive] Add InfiniBand Top Map for the Hive Scheduler.
  • [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

PACE Advisory Committee Assembled

Thursday, April 29, 2021

Dear PACE research community, 

We are pleased to announce that the faculty-led PACE Advisory Committee was formed and convened on March 30, 2021. The PACE Advisory Committee is a joint effort between the EVPR and OIT to ensure that shared research computing services both meet faculty needs and are resourced in a sustainable way. The committee consists of a representative group of PACE staff and faculty members, encompassing a wide range of experience and expertise in the advanced computational and data capabilities provided by OIT’s research cyberinfrastructure. An important goal of the committee is to provide essential feedback that will help continuously improve this critical service. The committee will meet regularly and:

  1. Function as a communication channel between the broader research computing community and PACE.
  2. Serve as a sounding board for major changes to the PACE infrastructure.
  3. Maintain an Institute-level view of the shared resource.
  4. Help craft strategies that balance the value and benefits provided by the resources with a sustainable cost structure in the face of ever-increasing demand.

PACE Advisory Committee Members: 

  • Srinivas Aluru, IDEaS Director (ex-officio) 
  • Omar Asensio, Public Policy 
  • Dhruv Batra, Interactive Computing/ML@GT 
  • Mehmet Belgin, PACE 
  • Annalisa Bracco, Earth and Atmospheric Sciences 
  • Neil Bright, PACE 
  • Laura Cadonati, Physics 
  • Umit Catalyurek, Computational Science and Engineering 
  • Sudheer Chava, Scheller College of Business 
  • Yongtao Hu, Civil and Environmental Engineering 
  • Lew Lefton, EVPR/Math (ex-officio) 
  • Steven Liang, Mechanical Engineering/GTMI 
  • AJ Medford, Chemical and Biomolecular Engineering 
  • Joe Oefelein, Aerospace Engineering 
  • Annalise Paaby, Biological Sciences 
  • Tony Pan, IDEaS 
  • David Sherrill, Chemistry and Biochemistry 
  • Huan Tran, Materials Science and Engineering  

If you have any questions or comments, please direct them to the PACE Team <pace-support@oit.gatech.edu> and/or to Dr. Lew Lefton <lew.lefton@gatech.edu>.  

All the best, 

The PACE Team 

Parallel Computing with MATLAB and Scaling to HPC on PACE clusters at Georgia Tech

Wednesday, April 28, 2021

MathWorks and PACE are partnering to offer a two-part parallel computing virtual workshop, taught by a MathWorks engineer, to PACE users and other members of the Georgia Tech community.

During this self-paced, hands-on workshop, you will be introduced to parallel and GPU computing in MATLAB for speeding up your application and offloading computations.  By working through common scenarios and workflows, you will gain an understanding of the parallel constructs in MATLAB, their capabilities, and some of the issues that may arise when using them. You will also learn how to take advantage of PACE resources, which are available to all researchers at Georgia Tech (including a free tier available at no cost), to scale your MATLAB computations.

Register by noon on May 14 at https://gatech.co1.qualtrics.com/jfe/form/SV_cD7prAcGZRthKCO.

Highlights
  • Speeding up programs with parallel computing
  • Working with large data sets
  • GPU computing
  • Scaling to PACE clusters (Phoenix, Hive, ICE, or Firebird)

Agenda
This virtual workshop will be held in two parts:
Part I, Tuesday, May 18, 1-4 PM, will focus on speeding up MATLAB with Parallel Computing Toolbox.
Part II, Tuesday, May 25, 1-4 PM, will focus on running MATLAB parallel code on PACE clusters.

Who should attend?
PhD students, postdocs, and faculty at Georgia Tech who want to (Part I) use parallel and GPU computing in MATLAB, and (Part II) scale their computations to take advantage of PACE resources.

Requirements
  • Basic working knowledge of MATLAB
  • Access to the Georgia Tech VPN. You do NOT need to be a PACE user, and all participants will receive access to PACE-ICE for hands-on activities.

Please contact PACE at pace-support@oit.gatech.edu with any questions.

PACE Update: Compute and Storage Billing

Friday, April 23, 2021

Dear PACE research community,

During our extended grace period, nearly 1M user jobs from nearly 160 PI groups completed, consuming nearly 40M CPU hours on the Phoenix cluster. The average wait time in queue per job was less than 0.5 hours, confirming that the measures we put in place to ensure fair use of the Phoenix cluster have maintained an exceptional quality of service.

With the billing for both storage and compute usage in effect as of April 1st, we are following up to provide an update on a few important points.

Compute billing started April 1: 

Throughout March, we sent communications to all PIs in accordance with PACE’s new cost model, including the amount of compute credits allocated based on the compute equipment refreshed as part of the migration to the Coda data center and/or equipment recently purchased in FY20 Phases 1/2/3.

As part of our compute audit, PACE identified and fixed some discrepancies in the initially communicated information, which included resources purchased but not provisioned on time. We apologize for this oversight and encourage users to run the pace-quota command to verify the updated list of charge accounts. We will follow up with the impacted PIs/users in a separate communication.

Please note that most school-owned accounts, as well as those jointly purchased by multiple faculty members, will show a zero balance, but you can still run jobs with them. We are working to make the balances in those accounts visible to you.

As of April 1, all the jobs that run on Phoenix and/or Firebird clusters will be debited/charged to the provided charge account (e.g., GT-gburdell3, GT-gburdell3-CODA20), and a statement will be sent to PIs at the start of May.

This does NOT necessarily mean that you must immediately begin providing funding to use Phoenix. All faculty and their research groups have access to our free tier. Additionally, if you had access to paid resources in Rich, they have been refreshed with an equivalent prepaid account intended to last for 5 years. 

Project storage billing started on April 1: 

As announced, quotas on Phoenix project storage were applied on March 31 based on PI choices made as part of our storage audit. Users may run the pace-quota command to check their research group’s utilization and quota at any time. For further information about Phoenix storage, please see our documentation. April is the first month in which storage quotas incur charges for PIs who have chosen quotas above the 1 TB funded by the Institute.

Showback statements sent to PIs: 

Throughout March, we sent out “showback” statements for prior months’ usage on the Phoenix cluster, covering October 2020 through February 2021. We are in the process of sending the March 2021 showback statements, which will also include a storage report. These statements give PIs an opportunity to review their group’s usage and follow up with PACE as needed. Explanations of each of the metrics can be found in our documentation.

No charges were incurred for usage during the grace period, so the showback statements are solely for your information and to guide your usage plans going forward. 

User account audit completed: 

Users of ECE and Prometheus resources migrated in November 2020 did not have all their charge accounts provisioned during their groups’ migration. Since then, we have provided the impacted users with access to these additional accounts. We apologize for any inconvenience this may have caused. Also, as part of our preparation to start billing for computation, the PACE team sent PIs a notification on February 8 to review their job submission accounts and corresponding user lists. We appreciate PIs’ input throughout this process; if any changes have occurred in your group since then, or if you would like to add new users to your account(s), please don’t hesitate to send a request to pace-support@oit.gatech.edu. Users may run the pace-whoami command to see the list of charge accounts they may use.

Additionally, we have created a blog page for the frequently asked questions we have received from our community after the end of the extended grace period on March 31, which we would like to share with you at this time.

If you have any questions, concerns or comments about the Phoenix cluster or the new cost model, please direct them to pace-support@oit.gatech.edu.

Thank you,

The PACE Team

FAQ after the end of the grace period on the Phoenix cluster

Friday, April 23, 2021

The following are frequently asked questions we have received from our user community after the end of the extended grace period on March 31 under the new cost model:

Q: Where can I find an updated NSF style facilities and equipment document? 

A:  Please see our page at https://pace.gatech.edu/sample-nsf-application-boilerplate-describing-pace-hpc  

Q: I bought a cluster back in 2013; can I still access it? 

A: No. We have decommissioned all clusters in the Rich datacenter as part of the Rich-to-Coda datacenter migration plan. As communicated earlier, PIs who owned a cluster in the Rich datacenter received a detailed summary of their charge account(s) for the Phoenix cluster, including the amount of compute credits allocated based on the refreshed compute equipment. To see your list of available charge account(s) and their credit balance, run pace-quota on the Phoenix cluster. 

Q: I do not have funds to pay for usage of the Phoenix cluster at this time. Can I get access to Phoenix at no cost? 

A: As part of this transition, PACE has taken the opportunity to provide all Institute faculty with computational and data resources at a modest level. All academic and research faculty (“PIs”) participating in PACE are automatically granted a certain level of resources in addition to any funding they may bring. Each PI is provided 1 TB of project storage and 68 compute credits per month, equivalent to 10,000 CPU-hours on a 192GB compute node. These credits may be used toward any computational resources (e.g., GPUs, high-memory nodes) available within the Phoenix cluster. In addition, all PACE users have access to the preemptable backfill queue at no cost. 
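As a rough worked example (assuming only the stated ratio of 68 credits per 10,000 CPU-hours on a standard 192GB node; GPU and high-memory rates differ, and the job size here is hypothetical), a job's credit cost can be estimated from its core count and duration:

```shell
# Hypothetical job: 24 cores for 100 hours on standard 192GB nodes.
cores=24
hours=100
cpu_hours=$(( cores * hours ))   # 2400 CPU-hours
# 68 credits buy 10,000 CPU-hours, so credits = CPU-hours * 68 / 10000
credits=$(awk -v c="$cpu_hours" 'BEGIN { printf "%.2f", c * 68 / 10000 }')
echo "$cpu_hours CPU-hours ~= $credits credits"   # 2400 CPU-hours ~= 16.32 credits
```

At that rate, the monthly free-tier allotment covers a bit over four such jobs.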

Q: Do I need to immediately begin providing funding to use Phoenix beyond the free tier? 

A: Not necessarily. If you had access to paid resources in Rich, you now have access to a refreshed CODA20 account with an existing balance, as described to each faculty owner. The number of credits in that account is equivalent in computational power to 5 years of continuous use of your old cluster in the Rich datacenter. 

PACE Archive Storage Update and New Allocation Moratorium

Friday, April 2, 2021

Dear PACE Users,

We are reaching out to provide a status update on PACE’s Archive storage service and to inform you of a moratorium on new archive storage users and allocations, effective immediately. This moratorium on new archive storage deployments reduces potential negative impacts on transfers and backups from a large influx of new files.

What’s happening and what we are doing: The original PACE Archive storage is hosted on vendor hardware with limited support capacity, as the vendor has ceased operations. PACE has initiated a two-phase plan to transfer PACE Archive storage from the current hardware to a permanent storage solution. At this time, phase 1 is underway, and archive storage data is being replicated to a temporary storage solution. PACE aims to finish this phase’s archive system transfer and configuration by the May maintenance period (5/19/2021 – 5/21/2021). Phase 1 is a temporary solution while PACE explores a more cost-efficient permanent one; that will require a second migration of the data as part of phase 2 of the plan, and we will follow up with details accordingly.

How does this impact me: There is no service impact to current PACE archive storage users. With the moratorium in effect, requests for new archive storage users/allocations are delayed until after the maintenance period; new requests may be processed starting 05/22/2021.

What we will continue to do: The PACE team will continue to monitor the transfer of the data to the NetApp storage, and we will report as needed.

Please accept our sincere apology for any inconvenience this temporary limitation may cause. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] Network Connectivity Issues

Thursday, March 25, 2021

[Update – March 25, 2021 – 3:07pm] 

This is a follow-up to yesterday’s message about the campus network connectivity issues that impacted PACE. By 3:24pm yesterday, OIT’s network team had resolved the connectivity issues, and the status page linked earlier was updated accordingly. Subsequent analysis identified the cause as a spanning tree issue (a network loop) on a network spanned into the Coda data center from the Rich building. This specific failure scenario cascaded into core network equipment, causing widespread connectivity issues across campus. Resolving the problem on the affected network restored connectivity campus-wide, and OIT’s network team will investigate further to prevent future occurrences.

Since about 3:30pm yesterday, all PACE users should have been able to access PACE-managed resources without issues. There was no impact to running jobs unless they required external resources (outside of PACE). If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

 

[Original Message – March 24, 2021 2:48pm]

Dear PACE Users,

At around 2:30pm, OIT’s network team reported connectivity issues. This may impact users’ ability to connect to PACE managed resources at Coda, such as Phoenix, Hive, Firebird, PACE-ICE, CoC-ICE and Testflight-Coda. Currently, the source of the problem is being investigated, but at this time, there is no impact to running jobs unless they require external resources (i.e., from the Web). We will provide further information as it’s available.

Please refer to the OIT’s status page for the developments on this issue: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/605b8495e2838505358d3af3

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience.

Best,

The PACE Team

Phoenix Project Storage Quotas Begin March 31

Monday, March 15, 2021

[Update 3/31/21 10:45 AM]

As previously announced, we have applied quotas to Phoenix project storage today based on each faculty PI’s choice. You can run the pace-quota command to check your research group’s utilization and quota at any time. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).

You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/. April will be the first month where storage quotas incur charges for PIs choosing quotas beyond the 1 TB funded by the Institute.

If your research group exceeds your quota, you will not be able to write to your project storage, and jobs running in your project storage may fail. We are in the process of directly contacting all users in storage projects over quota today.

Please contact PACE Support with any questions about Phoenix project storage quotas. Faculty may also choose to contact their PACE Research Scientist liaison.

 

[Update 3/24/21 12:30 PM]

We’d like to remind you that storage quotas on Phoenix project storage will be set one week from today, on March 31.

You can run the pace-quota command to check your research group’s utilization at any time. Your PI/faculty sponsor is choosing the quota that will be set, based on your group’s storage needs.  You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.

After quotas are set on March 31, we will notify all users, and you will be able to see your quota via pace-quota.

Users and faculty should contact their PACE Research Scientist liaison or PACE Support with any questions about Phoenix project storage quotas.

 

[Original Post]

As part of completing the migration to Phoenix, we will set quotas on Phoenix project storage on March 31, ending the period of unlimited project storage. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).

PACE’s free tier offers 1 TB of Institute-funded project storage to GT faculty; faculty members must fund additional storage beginning in April. PACE has provided faculty members with existing Phoenix storage allocations (except those recently created) with information regarding their group’s storage needs. Users can contact their advisors if they have concerns about their allocation.

All users can run the “pace-quota” command on Phoenix to see their research group’s storage usage. Quotas will generally show as unlimited (zero) until March 31.

Please contact us at pace-support@oit.gatech.edu with any questions about Phoenix project storage.

Update on the New Cost Model: April 1, 2021 is the Start Date for Compute and Storage Billing

Wednesday, February 24, 2021

Dear PACE research community,

Since the start of 2021, nearly 500k user jobs from nearly 150 PI groups have completed, accounting for nearly 16M CPU hours on the Phoenix cluster, while maintaining an exceptional quality of service: the average wait time in queue per job was about 0.5 hours. The measures we implemented on December 14 (see blog post) to ensure fair use of the Phoenix cluster have been effective in enabling research groups to leverage the scalability of the new system while maintaining a high quality of service for the user community.

At this time, we want to share an update regarding the new cost model. We are moving the start date for compute and storage billing from March 1, 2021 to April 1, 2021. This means that users will not be charged for their usage of compute and storage resources until April 1, 2021. This grace period extension allows us to achieve the following:

  • Gain input from the faculty-led PACE advisory committee that is being organized.
  • Align the start of compute and storage billing for all services (including CUI).
  • Provide additional time for the research community to adapt to the Phoenix cluster and the new cost model.
  • Provide an opportunity to send “showback” statements for the prior months during March, giving PIs time to review these past statements and follow up with PACE with any questions prior to the start of billing on April 1, 2021.

If you have any questions, concerns or comments, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[Completed – PACE Clusters Ready for Research] PACE Maintenance – February 3 – 5, 2021

Wednesday, January 20, 2021

[Update — February 5, 2021, 2:14pm]

Dear PACE Users,

Our scheduled maintenance has completed on time. All Coda and Rich datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Here is an update on the tasks, including one that may require user action; please see below:

ITEMS THAT MAY REQUIRE USER ACTION:

  • [COMPLETE] [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)

While updating login-hive[1-2], their SSH server keys changed. As a result, users may get a message that the key is not correct. If this should happen, please clear the entries from your local .ssh/known_hosts that have any reference to login-hive, login-hive1 or login-hive2, then try again.
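One way to clear those entries is with ssh-keygen -R, which removes matching host keys and leaves a known_hosts.old backup. A minimal sketch (the fully-qualified hostnames are assumptions based on PACE's usual naming; adjust to match the names you actually connect with):

```shell
# Remove any cached host keys for the Hive login nodes. The next
# connection will then prompt you to accept the new server key.
kh="$HOME/.ssh/known_hosts"
if [ -f "$kh" ]; then
  for host in login-hive login-hive1 login-hive2 \
              login-hive1.pace.gatech.edu login-hive2.pace.gatech.edu; do
    ssh-keygen -R "$host" -f "$kh" >/dev/null 2>&1 || true
  done
fi
```

When you reconnect and are prompted about an unknown host, verify and accept the new key to repopulate known_hosts.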

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Apply updates to all compute nodes
  • [COMPLETE] [Compute] Reboot all compute nodes running Lustre clients
  • [COMPLETE] [Network] Enable subnet managers for Hive
  • [COMPLETE] [Network] Reboot the main Coda InfiniBand HDR switch
  • [COMPLETE] [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [COMPLETE] [Software] Upgrade Intel license server
  • [COMPLETE] [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [COMPLETE] [Storage] Upgrade Lustre clients
  • [COMPLETE] [Storage] Upgrade controller and exascaler for the storage appliances: SFA200NV, SFA18KE
  • [COMPLETE] [Coda Data Center] Georgia Power will install a Power Quality Monitor

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!

The PACE Team

[Update — February 2, 2021, 3:25pm]

This is a friendly reminder that our maintenance will begin tomorrow at 6:00 AM and conclude on Friday, February 5th, 2021. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. Please note, during the maintenance period, users will not have access to Coda and Rich datacenter resources.

We have added one additional activity to this maintenance.  Here is a current list:

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Apply updates to all compute nodes
  • [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)
  • [Compute] Reboot all compute nodes running Lustre clients
  • [Network] Enable subnet managers for Hive
  • [Network] Reboot the main Coda InfiniBand HDR  switch
  • [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [Software] Upgrade Intel license server
  • [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [Storage] Upgrade Lustre clients
  • [Storage] Upgrade controller and exascaler for the storage appliances: SFA200NV, SFA18KE
  • [Coda Data Center] Georgia Power will install a Power Quality Monitor

This maintenance is planned to last through Friday to allow Georgia Power to install a Power Quality Monitor, which is required to make the microgrid fully operational. Due to work being performed by Databank on the cooling systems, we agreed to do this activity on Friday. No power outage is expected. Once Georgia Power completes the installation, we will open the clusters to users.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

[Original Note – January 20, 2021, 1:23pm]

Dear PACE Users,

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on February 3rd, 2021 and conclude at 11:59 PM on February 5th, 2021. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.  Please note, during the maintenance period, users will not have access to Coda and Rich datacenter resources.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Apply updates to all compute nodes
  • [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)
  • [Compute] Reboot all compute nodes running Lustre clients
  • [Network] Enable subnet managers for Hive
  • [Network] Reboot the main Coda InfiniBand HDR switch
  • [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [Software] Upgrade Intel license server
  • [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [Storage] Upgrade Lustre clients
  • [Storage] Upgrade controller and exascaler for the storage appliances: SFA200NV, SFA18KE

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team