
Author Archive

PACE Maintenance Period (August 11-13, 2021)

Posted on Tuesday, 13 July, 2021

Dear PACE Users,

This is another friendly reminder that our next Maintenance Period is scheduled to begin at 6:00 AM on Wednesday, 08/11/2021, and is tentatively scheduled to conclude by 11:59 PM on Friday, 08/13/2021.  Please note that, as usual, jobs whose resource requests would have them running during the Maintenance Period will be held by the scheduler until after the Maintenance Period.  During the Maintenance Period, all PACE-managed computational and storage resources will be unavailable.
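
For reference, the sketch below shows a minimal batch script with an explicit walltime request, assuming the PBS/Torque-style scheduler PACE used at the time; the job name, resources, and workload command are illustrative placeholders.

    #!/bin/bash
    # Illustrative PBS/Torque directives only; adjust the resources for your workflow.
    # A job whose requested walltime would extend past 6:00 AM on 08/11/2021 will be
    # held by the scheduler until the Maintenance Period ends.
    #PBS -N pre_maintenance_job
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=24:00:00

    cd "$PBS_O_WORKDIR"
    ./my_analysis    # placeholder for your actual workload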

As we get closer to the Maintenance Period, we will communicate the list of activities to be completed.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] OIT’s Data Warehouse Service Outage

Posted on Monday, 12 July, 2021

[Update – July 13, 2021] 

OIT restored operation of the Data Warehouse service on July 12 at 11:22 AM.  Shortly after, PACE restored functionality to our database and administrative services.  OIT continues to monitor the Data Warehouse service.  At this time, all PACE user-facing utilities, such as pace-check-queue, pace-quota, and pace-whoami, are operational.
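
For users who would like to confirm the utilities are behaving normally again, a quick check from a PACE login node might look like the sketch below (output varies by user and cluster; the descriptions follow the usage given elsewhere in these announcements).

    # Run from a PACE login node; each command should now return without database errors.
    pace-whoami        # list the charge accounts you may use
    pace-quota         # check your group's storage utilization, quota, and account balances
    pace-check-queue   # check the current state of the cluster's queues (arguments may vary)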

Please accept our sincere apology for any inconvenience that this temporary limitation may have caused you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Message – July 12, 2021]

Dear PACE Users,

We are reaching out to inform you that on Saturday at about 10:00 AM, there was an outage of OIT’s Enterprise Data Warehouse service, which PACE relies on to host our database instance; that database instance subsequently went down at 11:07 AM.  The impact to PACE from this outage is mainly limited to the administrative side, with some impact to user-facing utilities such as pace-check-queue; there is no impact to users’ jobs or their ability to submit jobs.

What’s happening and what we are doing:  OIT is currently investigating the outage impacting the Data Warehouse service that occurred on Saturday; the outage is being tracked on OIT’s status page.  PACE is monitoring developments closely.

How does this impact me:  User-facing utilities such as pace-check-queue, pace-quota, and pace-whoami are partially or fully nonfunctional.  In addition, until the Data Warehouse service is restored, PACE will be unable to process new user and PI account requests.

What we will continue to do:  The PACE team will continue to monitor developments and report as needed.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

pace-support.sh is disabled on PACE Clusters — please email pace-support directly for inquiries

Posted on Tuesday, 29 June, 2021

Dear PACE Users,

It has come to our attention that we are not receiving support requests generated by the pace-support.sh script, which allows submission of support tickets directly from PACE clusters. Our investigation is ongoing.

At this time, please email us at pace-support@oit.gatech.edu from a non-PACE system for all support requests, to ensure that we receive your message.

From our initial investigation, it appears that this outage began at some point in May. We apologize for any lost messages since then. If you have been trying to reach us via the pace-support script, please email us instead. You should receive an automated acknowledgement email from Service Desk when your request is successfully processed.

Please contact us at pace-support@oit.gatech.edu with questions.

The PACE Team

[Urgent] Hive Cluster Storage Controller Cable Replacement – Performance Impact

Posted on Friday, 25 June, 2021

[Update – 06/25 11:40PM]

The storage controller cable on the Hive cluster was replaced this evening, and the controller was brought back online.  Unfortunately, after the repairs, GPFS storage mounts became unavailable, which interrupted users’ running jobs this evening.  We paused the scheduler briefly while we restarted the GPFS services across the cluster.  The storage mounts have been restored, and the scheduler has been resumed.

Users’ jobs that were running or queued between about 7:00 PM and 10:30 PM today (6/25/2021) may have been interrupted, and we recommend that users check on their jobs and resubmit them as needed.  Please accept our sincerest apology for this inconvenience.
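
As a rough guide, checking and resubmitting from a Hive login node might look like the sketch below, assuming the Torque/Moab tooling in use at the time; the job script name is a placeholder.

    # List your own jobs; a job that was running during the interruption and is no
    # longer listed (or exited early) may need to be resubmitted.
    qstat -u "$USER"

    # Resubmit an affected job by submitting its script again:
    qsub my_job.pbs    # placeholder script name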

We will continue to monitor the services and update as needed.  If you have any questions, please contact us at pace-support@oit.gatech.edu.

[Original Message – 06/25 5:12PM]

Dear Hive Users,

We are reaching out to inform you that one of the storage controllers for the Hive cluster has a bad cable that needs to be replaced to ensure optimal performance and data integrity.  We have the replacement cable on hand and are in the process of replacing it this evening, Friday, 06/25/2021.  This work will briefly impact storage performance, which users may experience as storage slowness while we route all traffic to a secondary controller during the operation.

What’s happening and what we are doing:  More specifically, PACE has observed a high failure rate among the disks in one of the enclosures attached to the storage controller with the bad cable.  As a precaution, we will shut down that controller to unfail the disks and to ensure the data integrity of the system.  We will replace the cable this evening, during which time the controller will be shut down.  During this work, all storage traffic will be routed to a secondary controller that is fully operational.  Given the anticipated load on the secondary controller, users may experience performance degradation.

How does this impact me:  With only one storage controller in operation, users may experience storage slowness.  In the highly unlikely event of a failure of the secondary controller, storage downtime would impact all users’ running jobs; however, we do not anticipate any storage outage during this operation.

What we will continue to do:  The PACE team will complete the cable replacement, restore the storage to optimal operation, and update the community as needed.

Please accept our sincere apology for any inconvenience this work may cause you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

OIT Scheduled Service for MATLAB – 05/07/2021, 10:00 AM – noon

Posted on Thursday, 6 May, 2021

OIT will perform work on Georgia Tech’s MATLAB license server tomorrow morning, 05/07/2021, 10:00 AM – noon, which will impact any MATLAB jobs running on PACE at the time of the outage (as well as elsewhere on campus).

During the outage window, attempts to open new MATLAB instances in batch or interactive jobs will fail. In addition, we expect that running MATLAB instances will stop working, although the jobs themselves will continue running.

PACE aims to identify affected jobs tomorrow morning and follow up with the impacted users.

We recommend that you avoid submitting additional MATLAB jobs to PACE that will not finish before 10 AM on Friday (May 7) and instead submit them after the work is complete.
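
One way to sanity-check a submission is sketched below, assuming GNU date is available on the login node; the walltime value is illustrative.

    # Would a job submitted now, with this many hours of walltime, still be
    # running when the MATLAB license work starts (Fri 05/07/2021, 10:00 AM)?
    WALLTIME_HOURS=12                          # illustrative walltime request
    cutoff=$(date -d "2021-05-07 10:00" +%s)
    end=$(( $(date +%s) + WALLTIME_HOURS * 3600 ))
    if [ "$end" -gt "$cutoff" ]; then
        echo "This MATLAB job could overlap the license work; consider submitting after it completes."
    fi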

OIT will be providing up-to-date progress on Georgia Tech’s status page, http://status.gatech.edu.

If you have any questions, please contact us at pace-support@oit.gatech.edu.

PACE Advisory Committee Assembled

Posted on Thursday, 29 April, 2021

Dear PACE research community, 

We are pleased to announce that the faculty-led PACE Advisory Committee was formed and assembled on March 30, 2021. The PACE Advisory Committee is a joint effort between the EVPR and OIT to ensure that shared research computing services both meet faculty needs and are resourced in a sustainable way.  The committee consists of a representative group of PACE staff and faculty members, encompassing a wide range of experience and expertise in the advanced computational and data capabilities provided by OIT’s research cyberinfrastructure.  An important goal of the committee is to provide essential feedback that will help continuously improve this critical service. The committee will meet regularly and:

  1. Function as a communication channel between the broader research computing community and PACE.
  2. Serve as a sounding board for major changes to the PACE infrastructure.
  3. Maintain an Institute-level view of the shared resource.
  4. Help craft strategies that balance the value and benefits provided by the resources with a sustainable cost structure in the face of ever-increasing demand.

PACE Advisory Committee Members: 

  • Srinivas Aluru, IDEaS Director (ex-officio) 
  • Omar Asensio, Public Policy 
  • Dhruv Batra, Interactive Computing/ML@GT 
  • Mehmet Belgin, PACE 
  • Annalisa Bracco, Earth and Atmospheric Sciences 
  • Neil Bright, PACE 
  • Laura Cadonati, Physics 
  • Umit Catalyurek, Computational Science and Engineering 
  • Sudheer Chava, Scheller College of Business 
  • Yongtao Hu, Civil and Environmental Engineering 
  • Lew Lefton, EVPR/Math (ex-officio) 
  • Steven Liang, Mechanical Engineering/GTMI 
  • AJ Medford, Chemical and Biomolecular Engineering 
  • Joe Oefelein, Aerospace Engineering 
  • Annalise Paaby, Biological Sciences 
  • Tony Pan, IDEaS 
  • David Sherrill, Chemistry and Biochemistry 
  • Huan Tran, Materials Science and Engineering  

If you have any questions or comments, please direct them to the PACE Team <pace-support@oit.gatech.edu> and/or to Dr. Lew Lefton <lew.lefton@gatech.edu>.  

All the best, 

The PACE Team 

PACE Update: Compute and Storage Billing

Posted on Friday, 23 April, 2021

Dear PACE research community,

During our extended grace period, nearly 1M user jobs from nearly 160 PI groups completed, consuming nearly 40M CPU-hours on the Phoenix cluster. The average wait time in the queue was less than 0.5 hours per job, confirming the effectiveness of the measures put in place to ensure fair use of the Phoenix cluster and to maintain an exceptional quality of service.

With the billing for both storage and compute usage in effect as of April 1st, we are following up to provide an update on a few important points.

Compute billing started April 1: 

Throughout March, we sent communications to all PIs in accordance with PACE’s new cost model, including the amount of compute credits based on the compute equipment refreshed as part of the migration to the Coda data center and/or equipment recently purchased in FY20 Phase 1/2/3.

As part of our compute audit, PACE has identified and fixed some discrepancies relative to the information we initially communicated, including resources that were purchased but not provisioned on time. We apologize for this oversight and encourage users to run the pace-quota command to verify the updated list of charge accounts. We will follow up with the impacted PIs/users in a separate communication.

Please note that most school-owned accounts, as well as those jointly purchased by multiple faculty members, will show a zero balance, but you can still run jobs with them. We are working to make the balances in those accounts visible to you.

As of April 1, all jobs that run on the Phoenix and/or Firebird clusters are charged to the provided charge account (e.g., GT-gburdell3, GT-gburdell3-CODA20), and a statement will be sent to PIs at the start of May.
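
For reference, the charge account is supplied at submission time; a minimal sketch using the PBS/Torque-style syntax in use at the time is shown below. The account name GT-gburdell3 is the example from the text, and the other values are illustrative.

    # Submit a job against a specific charge account with -A (illustrative values).
    qsub -A GT-gburdell3 -l nodes=1:ppn=8,walltime=04:00:00 my_job.pbs

    # To see which charge accounts you may use and their balances:
    pace-quota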

This does NOT necessarily mean that you must immediately begin providing funding to use Phoenix. All faculty and their research groups have access to our free tier. Additionally, if you had access to paid resources in Rich, they have been refreshed with an equivalent prepaid account intended to last for 5 years. 

Project storage billing started on April 1: 

As announced, quotas on Phoenix project storage were applied on March 31 based on PI choices made as part of our storage audit.  Users may run the pace-quota command to check their research group’s utilization and quota at any time.  For further information about Phoenix storage, please see our documentation.  April is the first month in which storage quotas incur charges for PIs who have chosen quotas above the 1 TB funded by the Institute.

Showback statements sent to PIs: 

Throughout March, we sent out “showback” statements for prior months’ usage on the Phoenix cluster, covering October 2020 through February 2021.  We are in the process of sending the March 2021 showback statements, which will also include a storage report.  These statements provide PIs with an opportunity to review their group’s usage and follow up with PACE as needed.  Explanations of each of the metrics can be found in our documentation.

No charges were incurred for usage during the grace period, so the showback statements are solely for your information and to guide your usage plans going forward. 

User account audit completed: 

Users of ECE and Prometheus resources migrated in November 2020 did not have all of their charge accounts provisioned during their groups’ migration.  Since then, we have provided access to these additional accounts for the impacted users; we apologize for any inconvenience this may have caused.  Also, as part of our preparation to start billing for computation, on February 8 the PACE team sent a notification to PIs asking them to review their job submission accounts and the corresponding user lists.  We appreciate PIs’ input throughout this process.  If any changes have occurred in your group since then, or if you would like to add new users to your accounts, please don’t hesitate to send a request to pace-support@oit.gatech.edu.  Users may run the pace-whoami command to see the list of charge accounts they may use.

Additionally, we have created a blog page for the frequently asked questions we have received from our community after the end of the extended grace period on March 31, which we would like to share with you at this time.

If you have any questions, concerns or comments about the Phoenix cluster or the new cost model, please direct them to pace-support@oit.gatech.edu.

Thank you,

The PACE Team

FAQ after the end of the grace period on the Phoenix cluster

Posted on Friday, 23 April, 2021

The following are frequently asked questions we have received from our user community after the end of the extended grace period on March 31, in accordance with the new cost model:

Q: Where can I find an updated NSF style facilities and equipment document? 

A:  Please see our page at https://pace.gatech.edu/sample-nsf-application-boilerplate-describing-pace-hpc  

Q: I had a cluster that I bought back in 2013; can I still access it? 

A: No.  We have decommissioned all clusters in the Rich datacenter as part of the Rich-to-Coda datacenter migration plan.  As noted in our earlier communication, each PI who owned a cluster in the Rich datacenter received a detailed summary of their charge account(s) for the Phoenix cluster, including the amount of compute credits allocated to their account based on the compute equipment that was refreshed.  To see your list of available charge account(s) and their credit balance, please run pace-quota on the Phoenix cluster. 

Q: I do not have funds to pay for usage of the Phoenix cluster at this time; can I get access to Phoenix at no cost? 

A: As part of this transition, PACE has taken the opportunity to provide all Institute faculty with computational and data resources at a modest level.  All academic and research faculty (“PIs”) participating in PACE are automatically granted a certain level of resources in addition to any funding they may bring.  Each PI is provided 1 TB of project storage and 68 compute credits per month, equivalent to 10,000 CPU-hours on a 192 GB compute node.  These credits may be used towards any computational resources (e.g., GPUs, high-memory nodes) available within the Phoenix cluster.  In addition, all PACE users have access to the preemptable backfill queue at no cost.
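
As a purely illustrative calculation based on the figures above (not an official rate card): 68 credits per month for 10,000 CPU-hours works out to roughly 0.0068 credits per CPU-hour on a 192 GB node, so a 24-core job running for 24 hours (576 CPU-hours) would consume about 3.9 credits; rates for other resource types (e.g., GPUs, high-memory nodes) may differ.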

Q: Do I need to immediately begin providing funding to use Phoenix beyond the free tier? 

A: Not necessarily. If you had access to paid resources in Rich, you now have access to a refreshed CODA20 account with an existing balance, as described to each faculty owner. The number of credits in that account is equivalent in computational power to 5 years of continuous use of your old cluster in the Rich datacenter.

PACE Archive Storage Update and New Allocation Moratorium

Posted on Friday, 2 April, 2021

Dear PACE Users,

We are reaching out to provide a status update on PACE’s Archive storage service and to inform you of a moratorium on new archive storage user creation and allocations, which we are instituting effective immediately.  This moratorium on new archive storage deployments reduces the potential negative impact on transfers and backups that a large influx of new files could cause.

What’s happening and what we are doing: The original PACE Archive storage is currently hosted on vendor hardware with limited support capacity, as the vendor has ceased operations.  PACE has initiated a two-phase plan to transfer PACE Archive storage from the current hardware to a permanent storage solution.  Phase 1 is underway: archive storage data is being replicated to a temporary storage solution, and PACE aims to finish this transfer and configuration by the May Maintenance Period (5/19/2021 – 5/21/2021).  Phase 1 is a temporary solution while PACE explores a more cost-efficient permanent solution; phase 2 will involve a second migration of the data to that permanent storage solution, and we will follow up with details accordingly.

How does this impact me:  There is no service impact to current PACE archive storage users.  With the moratorium in effect, new user and allocation requests for archive storage are delayed until after the maintenance period; new requests may be processed starting 05/22/2021.

What we will continue to do:  The PACE team will continue to monitor the transfer of the data to the NetApp storage and will report as needed.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you.  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] Network Connectivity Issues

Posted on Thursday, 25 March, 2021

[Update – March 25, 2021 – 3:07pm] 

This is a follow-up to yesterday’s message about the campus network connectivity issues that impacted PACE.  By 3:24 PM yesterday, OIT’s network team had resolved the connectivity issues, and the status page linked earlier was updated accordingly.  Analysis of the incident, which was made available to us later, identified the cause as a network spanned into the Coda data center from the Rich building that experienced a spanning tree issue (a network loop).  Under this specific failure scenario, the problem cascaded to core network equipment and caused widespread connectivity issues across campus.  OIT’s network team fixed the affected network, which resolved the remaining connectivity issues across campus, and will conduct further investigation to prevent future occurrences.

Since yesterday at about 3:30 PM, all PACE users should have been able to access PACE-managed resources without issue.  There was no impact to running jobs unless they required external resources (outside of PACE).  If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Message – March 24, 2021 2:48pm]

Dear PACE Users,

At around 2:30 PM, OIT’s network team reported connectivity issues. This may impact users’ ability to connect to PACE-managed resources in Coda, such as Phoenix, Hive, Firebird, PACE-ICE, CoC-ICE, and Testflight-Coda. The source of the problem is currently being investigated, but at this time there is no impact to running jobs unless they require external resources (e.g., from the web). We will provide further information as it becomes available.

Please refer to the OIT’s status page for the developments on this issue: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/605b8495e2838505358d3af3

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience.

Best,

The PACE Team