

PACE Archive Storage Update and New Allocation Moratorium

Posted on Friday, 2 April, 2021

Dear PACE Users,

We are reaching out to provide you with a status update on PACE’s Archive storage service and to inform you of the moratorium on new archive storage user creation and allocations that we are instituting effective immediately. This moratorium on new archive storage deployments reduces the risk of negative impacts on data transfer and backups from a potentially large influx of new files.

What’s happening and what we are doing: Currently, the original PACE Archive storage is hosted on vendor hardware with limited support capacity, as the vendor has ceased operations. PACE has initiated a two-phase plan to transfer PACE Archive storage from the current hardware to a permanent storage solution. At this time, phase 1 is underway, and archive storage data is being replicated to a temporary storage solution. PACE aims to finish this phase’s archive system transfer and configuration by the May maintenance period (5/19/2021 – 5/21/2021). Phase 1 is a temporary solution while PACE explores a more cost-efficient one; the permanent solution will require a second migration of the data, which will be part of phase 2 of the plan, and we will follow up with details accordingly.

How does this impact me: There is no service impact to current PACE archive storage users. With the moratorium in effect, new user/allocation requests for archive storage are delayed until after the maintenance period. New requests for archive storage may be processed starting 05/22/2021.

What we will continue to do: The PACE team will continue to monitor the transfer of the data to the NetApp storage, and we will report as needed.

Please accept our sincere apology for any inconvenience this temporary limitation may cause. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

[Resolved] Network Connectivity Issues

Posted on Thursday, 25 March, 2021

[Update – March 25, 2021 – 3:07pm] 

This is a follow-up to yesterday’s message about the campus network connectivity issues that impacted PACE. By 3:24pm yesterday, OIT’s network team had resolved the connectivity issues, and the status page linked earlier was updated promptly. Analysis of the incident, made available to us later, identified the cause as a network spanned into the Coda datacenter from the Rich building that experienced a spanning tree issue (a network loop). Under this specific failure scenario, the loop caused a cascade of problems with core network equipment, producing widespread connectivity issues across campus. OIT’s network team repaired the affected network, which resolved the other connectivity issues across campus, and will investigate further to prevent future occurrences.

Since about 3:30pm yesterday, all PACE users should have been able to access PACE-managed resources without issues. There was no impact to running jobs unless they required external resources (outside of PACE). If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

 

[Original Message – March 24, 2021 2:48pm]

Dear PACE Users,

At around 2:30pm, OIT’s network team reported connectivity issues. This may impact users’ ability to connect to PACE-managed resources in Coda, such as Phoenix, Hive, Firebird, PACE-ICE, CoC-ICE, and Testflight-Coda. The source of the problem is being investigated, but at this time, there is no impact to running jobs unless they require external resources (i.e., from the Web). We will provide further information as it becomes available.

Please refer to the OIT’s status page for the developments on this issue: https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/605b8495e2838505358d3af3

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience.

Best,

The PACE Team

Update on the New Cost Model: April 1, 2021 is the Start Date for Compute and Storage Billing

Posted on Wednesday, 24 February, 2021

Dear PACE research community,

Since the start of 2021, nearly 500k user jobs from nearly 150 PI groups have completed on the Phoenix cluster, accounting for nearly 16M CPU hours, while maintaining an exceptional quality of service: the average queue wait time per user job was about 0.5 hours. The measures we implemented on December 14 (see blog post) to ensure fair use of the Phoenix cluster have been effective in enabling research groups to leverage the scalability of Phoenix and the new system while maintaining a high quality of service for the user community.

At this time, we want to share an update on the new cost model. We are moving the start date for compute and storage billing from March 1, 2021 to April 1, 2021. This means that users will not be charged for their usage of compute and storage resources until April 1, 2021. This grace period extension allows us to achieve the following:

  • Gain input from the faculty-led PACE advisory committee that is being organized.
  • Align the start of compute and storage billing for all services (including CUI).
  • Provide additional time for the research community to adapt to the Phoenix cluster and the new cost model.
  • Provide an opportunity to send “showback” statements for the prior months during March, giving PIs time to review these past statements and follow up with PACE with any questions prior to the start of billing on April 1, 2021.

If you have any questions, concerns or comments, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team

 

[Completed – PACE Clusters Ready for Research] PACE Maintenance – February 3 – 5, 2021

Posted on Wednesday, 20 January, 2021

[Update — February 5, 2021, 2:14pm]

Dear PACE Users,

Our scheduled maintenance has completed on time. All Coda and Rich datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Here is an update on the tasks, including one that may require user action; please see below:

ITEMS THAT MAY REQUIRE USER ACTION:

  • [COMPLETE] [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)

While updating login-hive[1-2], their SSH server keys changed. As a result, users may see a warning that the host key is not correct. If this happens, please clear the entries in your local .ssh/known_hosts that reference login-hive, login-hive1, or login-hive2, then try again.
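A minimal sketch of clearing the stale entries (the fully qualified hostnames below are assumptions; match whatever forms appear in your known_hosts file):

    # Remove any cached host keys for the Hive login nodes
    ssh-keygen -R login-hive.pace.gatech.edu
    ssh-keygen -R login-hive1.pace.gatech.edu
    ssh-keygen -R login-hive2.pace.gatech.edu

    # Reconnect and accept the new host key when prompted
    ssh yourusername@login-hive.pace.gatech.edu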

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Apply updates to all compute nodes
  • [COMPLETE] [Compute] Reboot all compute nodes running Lustre clients
  • [COMPLETE] [Network] Enable subnet managers for Hive
  • [COMPLETE] [Network] Reboot the main Coda InfiniBand HDR switch
  • [COMPLETE] [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [COMPLETE] [Software] Upgrade Intel license server
  • [COMPLETE] [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [COMPLETE] [Storage] Upgrade Lustre clients
  • [COMPLETE] [Storage] Upgrade controller and EXAScaler software for the storage appliances: SFA200NV, SFA18KE
  • [COMPLETE] [Coda Data Center] Georgia Power will install a Power Quality Monitor

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!

The PACE Team

[Update — February 2, 2021, 3:25pm]

This is a friendly reminder that our maintenance period will begin tomorrow at 6:00 AM and conclude on Friday, February 5th, 2021. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. Please note that during the maintenance period, users will not have access to Coda and Rich datacenter resources.

We have added one additional activity to this maintenance.  Here is a current list:

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Apply updates to all compute nodes
  • [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)
  • [Compute] Reboot all compute nodes running Lustre clients
  • [Network] Enable subnet managers for Hive
  • [Network] Reboot the main Coda InfiniBand HDR  switch
  • [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [Software] Upgrade Intel license server
  • [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [Storage] Upgrade Lustre clients
  • [Storage] Upgrade controller and EXAScaler software for the storage appliances: SFA200NV, SFA18KE
  • [Coda Data Center] Georgia Power will install a Power Quality Monitor

This maintenance is planned to last through Friday to allow Georgia Power to install a Power Quality Monitor, which is required to get the microgrid fully operational. Due to work being performed by Databank on the cooling systems, we agreed to do this activity on Friday. No power outage is expected. Once Georgia Power completes the installation, we will open the clusters to users.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

[Original Note – January 20, 2021, 1:23pm]

Dear PACE Users,

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on February 3rd, 2021 and conclude at 11:59 PM on February 5th, 2021. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.  Please note, during the maintenance period, users will not have access to Coda and Rich datacenter resources.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Apply updates to all compute nodes
  • [Compute] Update provisioning resources for Hive cluster (login-hive[1-2], sched-hive, and globus-hive)
  • [Compute] Reboot all compute nodes running Lustre clients
  • [Network] Enable subnet managers for Hive
  • [Network] Reboot the main Coda InfiniBand HDR switch
  • [Network] Upgrade Cisco switches in the Coda datacenter to the latest supported code
  • [Software] Upgrade Intel license server
  • [Storage] Reconfigure Globus for the Phoenix cluster (i.e., globus-phoenix)
  • [Storage] Upgrade Lustre clients
  • [Storage] Upgrade controller and EXAScaler software for the storage appliances: SFA200NV, SFA18KE

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

[Complete] Emergency Firewall Upgrade – starting today (01/14) at 8:00pm

Posted on Thursday, 14 January, 2021

Dear PACE users,

OIT will conduct an emergency firewall upgrade this evening, 01/14/2021, from 8:00pm to 10:00pm. This upgrade is expected to impact VPN access; as a result, connections in and out of PACE (e.g., interactive sessions, file transfers) may be interrupted during that period.

Who is impacted: During the emergency firewall upgrade window, PACE users may be unable to connect to PACE resources and/or may briefly lose connections. We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this time frame, to avoid sudden interruptions if the connection to PACE resources is lost. Batch jobs that are running and queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or on the Internet will be subject to interruption during this maintenance activity. This maintenance activity will not affect any of the PACE storage systems.

What PACE will do: PACE will remain on standby during this emergency firewall upgrade to monitor the systems and report on any interruptions in service. Up-to-date progress will be provided on Georgia Tech’s status page, http://status.gatech.edu.

Thank you for your attention to this matter, and if you have any questions, please direct them to pace-support@oit.gatech.edu.

Best,
The PACE Team

 

 

OIT’s Scheduled Network Maintenance

Posted on Monday, 4 January, 2021

[Update – January 5, 2021 11:30am]

Dear PACE users,

The routers that were upgraded late last night had a problem with OSPF, which caused missing routes and prevented connections to the system. Users who tried to connect to PACE resources late last night would have received errors such as “no route to host” when attempting to ssh to headnodes. Network Engineering downgraded the firmware to the original version, and connectivity was restored within the scheduled maintenance window. PACE completed testing by 2:19am this morning and confirmed that PACE services are operational.

The Network Engineering team has engaged the vendor to identify the root cause of the issue, given that the firmware had been tested on the exact same hardware prior to last night’s deployment without any issues. Once the root cause is identified and resolved, another upgrade will be scheduled and communicated accordingly.

Thank you for your attention to this matter, and if you have any questions, please direct them to pace-support@oit.gatech.edu.

Best,
The PACE Team

 

[Original Post – January 4, 2021 3:49pm]

Dear PACE users,

OIT’s Network Engineering team will be conducting maintenance activities from 7:00pm this evening, 01/04/2021, through 2:00am (01/05/2021). Datacenter routers and firewalls will receive firmware upgrades. All devices have redundancy and will be upgraded one at a time, so no service disruptions are expected. However, it is possible that connections in and out of PACE (e.g., interactive sessions, file transfers) may be interrupted during that period.

Who is impacted: During the maintenance window, we do not expect service disruptions at PACE; however, users may be unable to connect to PACE resources and/or may briefly lose connections. We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this time frame, to avoid sudden interruptions if the connection to PACE resources is lost. Batch jobs that are running and queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or on the Internet will be subject to interruption during this maintenance activity. This maintenance activity will not affect any of the PACE storage systems.

What PACE will do: PACE will remain on standby during these maintenance activities to monitor the systems, conduct testing, and report on any interruptions in service.

Thank you for your attention to this matter, and if you have any questions, please direct them to pace-support@oit.gatech.edu.

Best,
The PACE Team

Update on the New Cost Model: March 1, 2021 is the Start Date for Compute and Storage Billing

Posted on Friday, 18 December, 2020

Dear PACE research community,

Last week, we completed the last batch of user migrations to the Phoenix cluster, which was a major milestone for our research community and PACE. We are grateful for all your support and understanding during this major undertaking of migrating our user community from Rich to Coda datacenter – thank you!

At this time, we want to share an update on the new cost model. We are moving the start date for compute and storage billing from the tentative January 1, 2021 to March 1, 2021. This means that users will not be charged for their usage of compute and storage resources until March 1, 2021. Given this grace period extension, we have implemented measures to ensure fair use of the Phoenix cluster, which were reported to the user community on December 14 and in this blog post. This grace period extension should provide ample time for the research community to adapt to the Phoenix cluster and the new cost model.

If you have any questions, concerns or comments about your recent migration to the Phoenix cluster, please direct them to pace-support@oit.gatech.edu.

Happy Holidays!

The PACE Team

Scheduler Policy Update for the Phoenix Cluster

Posted on Monday, 14 December, 2020

Dear Researchers,

Last week, we completed the migration of our third batch of users from the Rich to the Coda datacenter, a major milestone for our research community: we have now migrated nearly 2,000 active users from clusters in Rich to the Phoenix cluster in Coda. With the number of users on the already highly utilized Phoenix cluster more than doubled, combined with the grace period currently in effect for job accounting under the recently announced cost model, we have noticed a rapid increase in per-job wait times on the cluster. At this time, we are updating our scheduler policy to reduce per-job wait times, which should improve the overall quality of service. The changes listed below are data-driven and have been carefully chosen so as not to adversely impact research teams that submit large-scale jobs.

Effective today, the following changes have been made to the scheduler policy that affect the inferno and embers queues:

  • Reduced the concurrent-use limit for CPU usage per research group from 7,200 processors to 6,000 processors.
  • Reduced the concurrent-use limit for GPU usage per user from 220 GPUs to 32 GPUs.
  • Added a per-research-group concurrent CPU-hour capacity limit of 300,000 CPU hours, allowing each research group to concurrently run jobs totaling up to 300,000 CPU hours (i.e., requested processors * walltime).
  • Added a per-job CPU-time capacity limit of 264,960 CPU hours, which would allow, for example, a 2,208-core job to run for 5 days.

Jobs that violate these limits will be held in the queue until currently running jobs complete and the total number of utilized processors, GPUs, and/or the remaining CPU-time fall below the thresholds. A worked example of the per-job CPU-time cap appears below. We have updated our documentation to reflect these changes, which you may view here.
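As a worked example of the per-job cap, consider the 2,208-core case above (the qsub syntax is a sketch assuming the Moab/Torque tooling used on Phoenix, and the charge account name is illustrative):

    # 92 nodes x 24 processors/node = 2,208 cores
    # 2,208 cores x 120 hours (5 days) = 264,960 CPU-hours, exactly at the cap
    qsub -q inferno -A GT-gburdell3 -l nodes=92:ppn=24,walltime=120:00:00 my_job.pbs

Requesting a longer walltime at that core count, or more cores at that walltime, would push the product past 264,960 CPU hours, and the job would be held.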

Again, the changes listed above are taking effect today, December 14, 2020.   If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

All the best,

The PACE Team

 

[Resolved] Scheduler down for the Phoenix Cluster

Posted on Tuesday, 8 December, 2020

[Update – December 9, 2020 – 11:03am]

Thank you for your continued patience. We are reaching out to let you know that the scheduler issue was resolved late last night; new jobs from users that were held in the queue yesterday have resumed, and we have observed normal operation of the Phoenix scheduler through the night and this morning. The cause of the scheduler issue was the large, sudden influx of user jobs on December 7, as previously reported. The updated timeout parameter will be kept in place to prevent a similar occurrence in the future. Additionally, we are expanding our alert utilities to track additional scheduler metrics that warn us of similar stressors, so that we may proactively address and mitigate scheduler issues.

The Phoenix cluster is ready for research.  Thank you again for your patience as we worked to address this issue in coordination with the vendor.

 

[Update – December 8, 2020 – 6:31pm]

Thank you for your patience today as we have worked extensively with the vendor to address the scheduler outage on the Phoenix cluster. This is a brief update from today’s joint investigation.

What has PACE done: PACE, along with the vendor, Adaptive Computing, has been conducting an extensive investigation of the scheduler outage. The root cause of the incident is still under investigation; however, we have identified a multi-pronged event that started at 11:39pm on December 7 and was compounded by a rapid influx of nearly 30,000 user jobs, which led the scheduler to become unresponsive. Given this large influx of jobs, we have increased the timeout setting for the scheduler to allow Moab to process the backlog of submitted jobs. This is currently underway.

Who does this message impact: This impacts all users on the Phoenix cluster who have submitted jobs. During this incident, it is normal for users to see their jobs remain in the queue after submission while the scheduler works through the backlog of job submissions.

What PACE will continue to do: We will continue to monitor the scheduler as it’s processing the backlog of jobs and update as needed. This continues to be an active situation and we will update as further information is available.

Thank you again for your patience as we work diligently to address this issue.

The PACE Team

 

[Original Note – December 8, 2020 – 10:26am]

Dear PACE users,

PACE is investigating a scheduler issue that is impacting the Phoenix cluster. At this time, users are unable to run jobs, and jobs are held in the queue.

This is an active situation, and we will follow up with updates as they become available.

Thank you for your attention to this urgent message, and we apologize for this inconvenience.

FAQs after user migration to the Phoenix cluster in Coda

Posted on Tuesday, 17 November, 2020

Dear PACE research community,

After we completed our second wave of user migration last week, we received some common questions from users about the new cost model announced on September 29 and about the new cluster, Phoenix, in general. We address them below for the benefit of the community:

  • The Phoenix scheduler has been redesigned. Unlike previous PACE-managed clusters, there are only two queues on the Phoenix cluster: inferno and embers. To submit a job, you will need to specify a charge account (i.e., MAM account) that was/will be provided to you in the “welcome email” after migration to the Phoenix cluster in Coda. You may have access to multiple MAM accounts; for example, a PI and their user group may have access to an Institute-sponsored account (GT-gburdell3 – $68/mo), an account for a refreshed PI cluster (e.g., GT-gburdell3-CODA20 – $43,011.32), or an account for a recent FY20 purchase (e.g., GT-gburdell3-FY20Phase2 – $17,860.75). A sketch of a submission appears after this list. For further details on submitting jobs on the Phoenix cluster, please refer to the documentation at http://docs.pace.gatech.edu/phoenix_cluster/submit_jobs_phnx/ .
  • Access to departmental PACE resources (e.g., CoC, CEE, biology, …) is restructured based on departmental preferences. As with the rest of PACE, access is now managed at a group level, each group owned by a specific PI, although the distribution of available departmental credits may vary from one department to another.
  • We are in the process of providing PIs further details about their cluster(s) from the Rich datacenter that were refreshed and converted into credits/MAM accounts according to the new cost model. Additionally, PIs who participated in the FY20 purchases will receive further details about the conversion from purchased equipment to credits/MAM accounts.
  • As mentioned in our initial announcement on September 29, users will not be charged for their usage of compute resources until at least January 1, 2021. Until that time, all jobs that run on Phoenix are free as we work to migrate all users onto the cluster and give users time to get familiar with the new environment. Please note that your credit balance will decline during this period, but we will reset your total before we start billing.
  • All of your data has been migrated to Phoenix, but the structure has changed. The data is now in your project storage under a different directory name, and symbolic links to old locations are broken as a result. Please visit our documentation for information on locating your group’s shared directory and on recreating symbolic links, as documented at http://docs.pace.gatech.edu/phoenix_cluster/where_is_my_rich_data/ . For further details, please refer to the documentation at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/ .
  • The pace-vnc-job command is functional; however, you will need to set up VNC for the Phoenix cluster. To do so, remove the ~/.vnc directory, then run vncpasswd to set a new VNC password for the Phoenix cluster. After this, you will be able to submit pace-vnc-job with the additional MAM account that you will need to pass to the command (see the sketch after this list).
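A minimal sketch of the steps above, covering a batch submission charged to a MAM account, a symbolic-link repair, and the VNC reset (the account name and paths are illustrative placeholders, and the way pace-vnc-job accepts the account is an assumption; check the linked documentation for the exact syntax):

    # Submit a batch job to the inferno queue, charged to a MAM account
    qsub -q inferno -A GT-gburdell3 my_job_script.pbs

    # Recreate a symbolic link that broke during migration
    # (both paths here are hypothetical placeholders)
    ln -sfn /storage/coda1/p-gburdell3/0/shared_data ~/shared_data

    # Reset VNC before the first pace-vnc-job submission on Phoenix,
    # then launch a VNC job against the same charge account
    rm -rf ~/.vnc
    vncpasswd
    pace-vnc-job -A GT-gburdell3   # account flag assumed; see the docs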

If you have any questions, concerns or comments about your recent migration to Phoenix, upcoming migration or the new cost model, please direct them to pace-support@oit.gatech.edu.

Best,

The PACE Team