
Archive for category Uncategorized

[Resolved] Phoenix scratch outage

Posted by on Saturday, 12 June, 2021

[Update 6/12/21 6:30 PM]

Phoenix Lustre scratch has been restored. We paused the scheduler at 4:40 PM to prevent additional jobs from starting and resumed scheduling at 6:20 PM. As noted, please contact us with the job number for any job that began prior to 4:40 PM and was affected by the scratch outage, in order to receive a refund.

[Original post, 6/12/21 4:30 PM]

We are experiencing an outage on Phoenix’s Lustre scratch storage. Our team is currently investigating and has confirmed that this issue is related to the scratch mount and does not affect home or project storage. Users may be unable to list, read, or write files in their scratch directories.
If your running job has failed or is running without producing output as a result of this outage, please contact us at pace-support@oit.gatech.edu with the affected job number(s), and we will refund the value of the job(s) to your charge account. Please refrain from submitting additional jobs that use your networked Lustre scratch directory until the service is repaired, to avoid adding to the number of failed jobs.
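If you want to verify that the scratch mount is responding before resuming submissions, a minimal check from a login node might look like the sketch below (the ~/scratch path is assumed to be the usual symlink to your scratch directory; adjust to your own setup):

    # Probe the scratch mount with a timeout so a hung Lustre client
    # doesn't block your shell; the exit status reveals whether it responded.
    timeout 10 ls ~/scratch > /dev/null 2>&1 && echo "scratch responding" || echo "scratch unresponsive"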

[RESOLVED] Phoenix Scheduler is Down

Posted by on Thursday, 13 May, 2021

Update (5/13 2:00pm): We are happy to report that the Phoenix Scheduler is now online and accepting jobs.

We are sorry for the inconvenience this has caused. Please let us know if you continue to observe any problems (pace-support@oit.gatech.edu).
----
At around 10:30am this morning, we restarted the Phoenix scheduler to apply a new license file. The scheduler is having trouble coming back online, and we are actively troubleshooting this issue. So far, we know the issue is unrelated to the license; rather, some leftover job files may be causing it. We are working to revive the scheduler as soon as possible.

This issue doesn’t impact any running jobs or jobs submitted before the incident; only new job submissions will fail with an error.

We’ll update this post (http://blog.pace.gatech.edu/?p=7075) and send a follow-up message once the issue is resolved.

Thank you for your patience and sorry for this inconvenience.


OIT Scheduled Service for MATLAB – 05/07/2021, 10:00 AM – noon

Posted by on Thursday, 6 May, 2021

OIT will perform work on Georgia Tech’s MATLAB license server tomorrow morning, 05/07/2021, 10:00 AM – noon, which will impact any MATLAB jobs running on PACE at the time of the outage (as well as MATLAB use elsewhere on campus).

During the outage window, attempts to open new MATLAB instances in batch or interactive jobs will fail. In addition, we expect running MATLAB instances to stop working, although the jobs themselves will continue running.

PACE aims to identify affected jobs tomorrow morning and follow up with the impacted users.

We recommend that you avoid submitting additional MATLAB jobs to PACE that will not finish before 10 AM on Friday (May 7) and instead submit them after the work is complete.
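If you would like to enforce this at submission time, a sketch follows, assuming the PBS-style scheduler in use on PACE at the time (the script name and resource values are illustrative):

    # Cap the walltime so that, once started, the job cannot run past
    # the outage window; jobs still queued near 10 AM are better held
    # until the work is complete.
    qsub -l walltime=08:00:00 -l nodes=1:ppn=4 my_matlab_job.pbs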

OIT will be providing up-to-date progress on Georgia Tech’s Status page, http://status.gatech.edu. 

If you have any questions, please contact us at pace-support@oit.gatech.edu.


PACE Advisory Committee Assembled

Posted by on Thursday, 29 April, 2021

Dear PACE research community, 

We are pleased to announce that the faculty-led PACE Advisory Committee was formed and assembled on March 30, 2021. The PACE Advisory Committee is a joint effort between the EVPR and OIT to ensure that shared research computing services both meet faculty needs and are resourced in a sustainable way. The committee consists of a representative group of PACE staff and faculty members, encompassing a wide range of experience and expertise in the advanced computational and data capabilities provided by OIT’s research cyberinfrastructure. An important goal of the committee is to provide essential feedback that will help continuously improve this critical service. The committee will meet regularly and:

  1. Function as a communication channel between the broader research computing community and PACE.
  2. Serve as a sounding board for major changes to the PACE infrastructure.
  3. Maintain an Institute-level view of the shared resource.
  4. Help craft strategies that balance the value and benefits provided by the resources with a sustainable cost structure in the face of ever-increasing demand.

PACE Advisory Committee Members: 

  • Srinivas Aluru, IDEaS Director (ex-officio)
  • Omar Asensio, Public Policy 
  • Dhruv Batra, Interactive Computing/ML@GT 
  • Mehmet Belgin, PACE 
  • Annalisa Bracco, Earth and Atmospheric Sciences 
  • Neil Bright, PACE 
  • Laura Cadonati, Physics 
  • Umit Catalyurek, Computational Science and Engineering
  • Sudheer Chava, Scheller College of Business
  • Yongtao Hu, Civil and Environmental Engineering 
  • Lew Lefton, EVPR/Math (ex-officio) 
  • Steven Liang, Mechanical Engineering/GTMI 
  • AJ Medford, Chemical and Biomolecular Engineering 
  • Joe Oefelein, Aerospace Engineering 
  • Annalise Paaby, Biological Sciences 
  • Tony Pan, IDEaS 
  • David Sherrill, Chemistry and Biochemistry 
  • Huan Tran, Materials Science and Engineering  

If you have any questions or comments, please direct them to the PACE Team <pace-support@oit.gatech.edu> and/or to Dr. Lew Lefton <lew.lefton@gatech.edu>.  

All the best, 

The PACE Team 

Parallel Computing with MATLAB and Scaling to HPC on PACE clusters at Georgia Tech

Posted by on Wednesday, 28 April, 2021

MathWorks and PACE are partnering to offer a two-part parallel computing virtual workshop, taught by a MathWorks engineer, to PACE users and other members of the Georgia Tech community.

During this self-paced, hands-on workshop, you will be introduced to parallel and GPU computing in MATLAB for speeding up your application and offloading computations.  By working through common scenarios and workflows, you will gain an understanding of the parallel constructs in MATLAB, their capabilities, and some of the issues that may arise when using them. You will also learn how to take advantage of PACE resources, which are available to all researchers at Georgia Tech (including a free tier available at no cost), to scale your MATLAB computations.
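As a small taste of what the workshop covers, here is a minimal sketch of running a parallel loop in a non-interactive MATLAB session from the shell on a cluster node (the module name is an assumption about the local environment; adjust to match yours):

    # Load MATLAB, start a pool of 4 local workers, and distribute a
    # simple loop across them with parfor.
    module load matlab
    matlab -nodisplay -r "parpool(4); parfor i = 1:100, a(i) = i^2; end; disp(sum(a)); exit"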

Register by noon on May 14 at https://gatech.co1.qualtrics.com/jfe/form/SV_cD7prAcGZRthKCO.

Highlights
  • Speeding up programs with parallel computing
  • Working with large data sets
  • GPU computing
  • Scaling to PACE clusters (Phoenix, Hive, ICE, or Firebird)

Agenda
This virtual workshop will be held in two parts:
Part I, Tuesday, May 18, 1-4 PM, will focus on speeding up MATLAB with Parallel Computing Toolbox.
Part II, Tuesday, May 25, 1-4 PM, will focus on running MATLAB parallel code on PACE clusters.

Who should attend?
PhD students, postdocs, and faculty at Georgia Tech who want to (Part I) use parallel and GPU computing in MATLAB, and (Part II) scale their computations to take advantage of PACE resources.

Requirements
  • Basic working knowledge of MATLAB
  • Access to the Georgia Tech VPN. You do NOT need to be a PACE user, and all participants will receive access to PACE-ICE for hands-on activities.

Please contact PACE at pace-support@oit.gatech.edu with any questions.

PACE Update: Compute and Storage Billing

Posted by on Friday, 23 April, 2021

Dear PACE research community,

During our extended grace period, nearly 1M user jobs from roughly 160 PI groups completed, consuming nearly 40M CPU hours on the Phoenix cluster. The average wait time in queue per job was less than 0.5 hours, confirming the effectiveness of the measures put in place to ensure fair use of the Phoenix cluster while maintaining an exceptional quality of service.

With the billing for both storage and compute usage in effect as of April 1st, we are following up to provide an update on a few important points.

Compute billing started April 1: 

Throughout March, we sent communications to all PIs in accordance with PACE’s new cost model, including the amount of compute credits based on the compute equipment refreshed as part of the migration to the Coda data center and/or equipment recently purchased in the FY20 Phase 1/2/3 purchase(s).

As part of our compute audit, PACE has identified and fixed some discrepancies in the information initially communicated, including resources that were purchased but not provisioned on time. We apologize for this oversight and encourage users to run the pace-quota command to verify the updated list of charge accounts. We’ll follow up with the impacted PIs/users in a separate communication.
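For example, from a Phoenix login node (the output is specific to each group, so it is omitted here):

    # Show your charge accounts, credit balances, and storage
    # utilization/quota on Phoenix.
    pace-quota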

Please note that most school-owned accounts, as well as those jointly purchased by multiple faculty members, will show a zero balance, but you can still run jobs with them. We are working to make the balances in those accounts visible to you.

As of April 1, all jobs run on the Phoenix and/or Firebird clusters will be charged to the provided charge account (e.g., GT-gburdell3, GT-gburdell3-CODA20), and a statement will be sent to PIs at the start of May.
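As a sketch of how the charge account enters the workflow, assuming the PBS-style scheduler on Phoenix at the time (the account name is the example from this post; the script name is illustrative):

    # Submit a job and charge it to the GT-gburdell3 account.
    qsub -A GT-gburdell3 my_job.pbs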

This does NOT necessarily mean that you must immediately begin providing funding to use Phoenix. All faculty and their research groups have access to our free tier. Additionally, if you had access to paid resources in Rich, they have been refreshed with an equivalent prepaid account intended to last for 5 years. 

Project storage billing started on April 1: 

As announced, quotas on Phoenix project storage were applied on March 31 based on PI choices as part of our storage audit. Users may run the pace-quota command to check their research group’s utilization and quota at any time. For further information about Phoenix storage, please see our documentation. April is the first month in which storage quotas incur charges for PIs who have chosen quotas above the 1 TB funded by the Institute.

Showback statements sent to PIs: 

Throughout March, we sent out “showback” statements for the prior months’ usage on the Phoenix cluster, covering October 2020 through February 2021. We are in the process of sending the March 2021 showback statements, which will also include a storage report. Overall, these statements provided PIs with an opportunity to review their group’s usage and follow up with PACE as needed. Explanations for each of the metrics can be found in our documentation.

No charges were incurred for usage during the grace period, so the showback statements are solely for your information and to guide your usage plans going forward. 

User account audit completed: 

Users of ECE and Prometheus resources migrated in November 2020 did not have all their charge accounts provisioned during their groups’ migration. Since then, we have provided access to these additional accounts for the impacted users. We apologize for any inconvenience this may have caused. Also, as part of our preparation to start billing for computation, the PACE team sent a notification to PIs on Feb 8 to conduct a review of job submission accounts and their corresponding user lists. We appreciate PIs’ input throughout this process; if any changes have occurred in your group since then, or if you would like to add new user(s) to your account(s), please don’t hesitate to send a request to pace-support@oit.gatech.edu. Users may run the pace-whoami command to see a list of charge accounts they may use.
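For example (output omitted, as it is specific to each user):

    # List the charge accounts this user may submit jobs against.
    pace-whoami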

Additionally, we have created a blog page for the frequently asked questions we have received from our community after the end of the extended grace period on March 31, which we would like to share with you at this time.

If you have any questions, concerns or comments about the Phoenix cluster or the new cost model, please direct them to pace-support@oit.gatech.edu.

Thank you,

The PACE Team

FAQ after the end of the grace period on the Phoenix cluster

Posted by on Friday, 23 April, 2021

The following are frequently asked questions we have received from our user community after the end of the extended grace period on March 31, in accordance with the new cost model:

Q: Where can I find an updated NSF-style facilities and equipment document?

A: Please see our page at https://pace.gatech.edu/sample-nsf-application-boilerplate-describing-pace-hpc

Q: I had a cluster that I bought back in 2013. Can I still access it?

A: No. We have decommissioned all clusters in the Rich datacenter as part of the Rich-to-Coda datacenter migration plan. As part of our earlier communication to PIs, if a PI owned a cluster in the Rich datacenter, they received a detailed summary of their charge account(s) for the Phoenix cluster, including the amount of compute credits allocated to their account based on the compute equipment that was refreshed. To see your list of available charge account(s) and their credit balances, please run pace-quota on the Phoenix cluster.

Q: I do not have funds to pay for usage of the Phoenix cluster at this time. Can I get access to Phoenix at no cost?

A: As part of this transition, PACE has taken the opportunity to provide all Institute faculty with computational and data resources at a modest level. All academic and research faculty (“PIs”) participating in PACE are automatically granted a certain level of resources in addition to any funding they may bring. Each PI is provided 1 TB of project storage and a monthly allocation of 68 compute credits, equivalent to 10,000 CPU-hours on a 192GB compute node. These credits may be used toward any computational resources (e.g., GPUs, high-memory nodes) available within the Phoenix cluster. In addition, all PACE users have access to the preemptable backfill queue at no cost.
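For scale, 10,000 CPU-hours per month works out to roughly 10,000 ÷ (30 × 24) ≈ 14 cores running continuously for the entire month.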

Q: Do I need to immediately begin providing funding to use Phoenix beyond the free tier? 

A: Not necessarily. If you had access to paid resources in Rich, you now have access to a refreshed CODA20 account with an existing balance, as described to each faculty owner. The number of credits in that account is equivalent in computational power to 5 years of continuous use of your old cluster in the Rich datacenter.

PACE Archive Storage Update and New Allocation Moratorium

Posted by on Friday, 2 April, 2021

Dear PACE Users,

We are reaching out to provide a status update on PACE’s archive storage service and to inform you of a moratorium on new archive storage users and allocations, effective immediately. This moratorium on new archive storage deployments reduces potential negative impacts on transfer and backups from a large influx of new files.

What’s happening and what we are doing: The original PACE archive storage is hosted on vendor hardware with limited support capacity, as the vendor has ceased operations. PACE has initiated a two-phase plan to transfer PACE archive storage from the current hardware to a permanent storage solution. At this time, phase 1 is underway, and archive storage data is being replicated to a temporary storage solution. PACE aims to finish this phase’s archive system transfer and configuration by the May maintenance period (5/19/2021 – 5/21/2021). Phase 1 is a temporary solution while PACE explores a more cost-efficient option; that will require a second migration of the data to the permanent storage solution as part of phase 2 of the plan, and we will follow up with details accordingly.

How does this impact me: There is no service impact to current PACE archive storage users. With the moratorium in effect, new user/allocation requests for archive storage are delayed until after the maintenance period. New requests for archive storage may be processed starting 05/22/2021.

What we will continue to do: The PACE team will continue to monitor the transfer of the data to the NetApp storage, and we will report as needed.

Please accept our sincere apology for any inconvenience this temporary limitation may cause. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] Network Connectivity Issues

Posted by on Thursday, 25 March, 2021

[Update – March 25, 2021 – 3:07pm] 

This is a follow-up to yesterday’s message about the campus network connectivity issues that impacted PACE. By 3:24pm yesterday, OIT’s network team had resolved the connectivity issues, and the status page linked earlier was updated shortly afterward. Analysis of the incident, made available to us later, identified the cause as a network spanned into the Coda data center from the Rich building that experienced a spanning tree issue (a network loop). Under this specific failure scenario, the loop caused a cascade of issues with core network equipment, resulting in widespread connectivity problems across campus. OIT’s network team fixed the affected network, which resolved the other connectivity issues on campus, and will conduct further investigation to prevent future occurrences.

Since yesterday at about 3:30pm, all PACE users should have been able to access PACE-managed resources without issues. There was no impact to running jobs unless they required external resources (outside of PACE). If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.


[Original Message – March 24, 2021 2:48pm]

Dear PACE Users,

At around 2:30pm, OIT’s network team reported connectivity issues. This may impact users’ ability to connect to PACE-managed resources at Coda, such as Phoenix, Hive, Firebird, PACE-ICE, CoC-ICE, and Testflight-Coda. The source of the problem is currently being investigated, but at this time there is no impact to running jobs unless they require external resources (e.g., from the web). We will provide further information as it becomes available.

Please refer to OIT’s status page for developments on this issue: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/605b8495e2838505358d3af3

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience.

Best,

The PACE Team

Phoenix Project Storage Quotas Begin March 31

Posted by on Monday, 15 March, 2021

[Update 3/31/21 10:45 AM]

As previously announced, we have applied quotas to Phoenix project storage today based on each faculty PI’s choice. You can run the pace-quota command to check your research group’s utilization and quota at any time. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).

You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/. April will be the first month where storage quotas incur charges for PIs choosing quotas beyond the 1 TB funded by the Institute.

If your research group exceeds your quota, you will not be able to write to your project storage, and jobs running in your project storage may fail. We are in the process of directly contacting all users in storage projects over quota today.

Please contact PACE Support with any questions about Phoenix project storage quotas. Faculty may also choose to contact their PACE Research Scientist liaison.


[Update 3/24/21 12:30 PM]

We’d like to remind you that storage quotas on Phoenix project storage will be set one week from today, on March 31.

You can run the pace-quota command to check your research group’s utilization at any time. Your PI/faculty sponsor is choosing the quota that will be set, based on your group’s storage needs.  You can learn more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.

After quotas are set on March 31, we will notify all users, and you will be able to see your quota via pace-quota.

Users and faculty should contact their PACE Research Scientist liaison or PACE Support with any questions about Phoenix project storage quotas.


[Original Post]

As part of completing the migration to Phoenix, we will set quotas on Phoenix project storage on March 31, ending the period of unlimited project storage. Project storage quotas on Phoenix are set for a faculty member’s research group, not for individual users. Each user also has 10 GB of home storage and 15 TB of short-term scratch storage (not backed up).

PACE’s free tier offers 1 TB of Institute-funded project storage to GT faculty. Faculty members must fund additional storage beginning with the month of April. PACE has provided faculty members holding Phoenix storage allocations (except those recently created) with information regarding their group’s storage needs. Users can contact their advisors if they have concerns about their allocation.

All users can run the “pace-quota” command on Phoenix to see their research group’s storage usage. Quotas will generally show as unlimited (zero) until March 31.

Please contact us at pace-support@oit.gatech.edu with any questions about Phoenix project storage.