PACE A Partnership for an Advanced Computing Environment

September 30, 2020

[RESOLVED] Hive Scheduler Degraded Performance

Filed under: Uncategorized — Semir Sarajlic @ 7:21 pm

[UPDATE – 10/01/20 5:51pm]

We are following up to let you know that the Hive scheduler has been restored to operation, and users may submit new jobs.  We appreciate your patience as we conducted our investigation and resolved this matter.   We are providing a brief summary of our findings and actions taken to address this issue.

What happened and what we did: Yesterday, a user ran an aggressive script that spammed the scheduler with roughly 30,000 job submissions and extremely frequent client queries to both Moab and Torque. This resulted in a chain reaction in which the scheduler utilities were fully overwhelmed and produced log output hundreds of times larger than normal in both size and number of files.  Additionally, system utilities were stressed as they tried to keep up with backups and archival. Once PACE became aware of the issue, we terminated the user’s script and began working to clean up the scheduler environment.  Ultimately, we had to forcefully remove some of the egregious job logs associated with the user.   Other users’ jobs that had already been submitted to the scheduler prior to the incident operated normally; we did not observe abrupt job cancellations or interruptions during this situation.   Also, PACE has followed up with the user, and we are working with them to improve their workflow and prevent future issues such as this one.

What we continue to do:  As we blogged this morning at 10:02AM, the scheduler is accepting jobs and running.  We have observed some residual effects in system utilities that we have been addressing and monitoring throughout the day.   If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

As always, we appreciate your patience as we worked to address this situation.

 

[UPDATE – 10/01/20 10:02am]

Yesterday, a user ran an aggressive script that spammed the scheduler with roughly 30,000 job submissions and extremely frequent client queries to both Moab and Torque. This resulted in a chain reaction in which the scheduler utilities were fully overwhelmed and producing log files hundreds of times larger in both size and number of files than normal, followed by system utilities being stressed as they tried to keep up with backups and archival. Once we became aware of the issue, we terminated the user’s script and began working to clean up the scheduler environment, ultimately having to forcefully remove some of the egregious job logs associated with the user. At this point, the scheduler is accepting jobs and running, although there are still some residual effects in system utilities that we are addressing. As always, we appreciate your patience as we address this situation.

[ORIGINAL POST – 09/30/20 7:21pm]

At about 4:30pm, we began experiencing degraded performance with the Hive scheduler.  Currently, the scheduler is under significant load, and some users may notice their new job submissions hanging, as a couple of users have already reported to us.  PACE is investigating the issue, and we will post an update once the scheduler is restored to normal operation.

We apologize for the inconvenience this is causing.

 

 

[RESOLVED] PACE Maintenance – October 14-16, 2020

Filed under: Uncategorized — Semir Sarajlic @ 4:42 pm

[Update – October 19 – 5:30pm]

We are following up to inform you that our maintenance for TestFlight-Coda and Phoenix clusters has completed.  At this time, all Rich and Coda datacenter clusters are ready for research. We appreciate everyone’s patience as we worked through this partially extended maintenance day to address our activities in Coda datacenter.

At this time, we are updating you on the status of tasks:

ITEMS REQUIRING USER ACTION:

  • [COMPLETE] [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Applying a tuned profile to the Hive compute nodes
  • [COMPLETE] [Compute] Update NVIDIA GPU drivers on Coda to support the CUDA 11 SDK
  • [COMPLETE] [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [COMPLETE] [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [COMPLETE] [Network] Update Phoenix subnet managers to RHEL7.8
  • [COMPLETE] [Storage] Replace DDN 7700 storage controller 1
  • [COMPLETE] [Storage] Replace DDN SFA18KE storage enclosure 8
  • [COMPLETE] [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [COMPLETE] [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [COMPLETE] [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes
  • [COMPLETE] [Storage] Lustre Client Patches
  • [COMPLETE] [Storage] Lustre filesystem controller to be replaced
    • [COMPLETE – 10/19/2020] We conducted further testing of Lustre storage in coordination with our vendor.
  • [COMPLETE] [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive, COC-ICE and PACE-ICE clusters).

 

ITEMS REQUIRING USER ACTION:

As previously mentioned, with regard to the task of renaming primary groups, which may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.
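For users who do reference group names in scripts, the check described above can be sketched in Python. This is only an illustration and not a PACE-provided tool: the function name find_group_name_uses and the sample script text are hypothetical, and the old group name is the chem-burdell example from this post.

```python
import re
from typing import Iterable, List, Tuple


def find_group_name_uses(script_text: str,
                         old_group_names: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that mention any old group name.

    Names are matched as whole tokens, so "chem-burdell" does not match
    inside a longer name such as "chem-burdell2".
    """
    escaped = [re.escape(name) for name in old_group_names]
    if not escaped:
        return []
    pattern = re.compile(r"(?<![\w-])(?:%s)(?![\w-])" % "|".join(escaped))
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical job script that changes group ownership by name.
SAMPLE_SCRIPT = "#!/bin/bash\nchgrp chem-burdell results/\nchmod g+w results/\n"

for lineno, line in find_group_name_uses(SAMPLE_SCRIPT, ["chem-burdell"]):
    print(f"line {lineno}: {line}")
```

Any line reported this way is a candidate for updating to the new p- or d- group name; lines that use the numeric gid need no change, since the gid stays the same.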

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu

Best,
PACE Team

 

[UPDATE – October 16 – 6:31pm]

We are following up with an update on the PACE maintenance period.  As mentioned yesterday, our maintenance for the Rich datacenter has completed one day ahead of schedule, and we are partially complete with the CODA datacenter. All clusters in the Rich datacenter are ready for research. The Hive, COC-ICE and PACE-ICE clusters in the Coda datacenter are ready for research and instructional learning. We have released users’ jobs on the Hive, COC-ICE, and PACE-ICE clusters and on the Rich datacenter clusters. The Phoenix cluster in CODA will remain under maintenance through Monday, October 19, as scheduled. Also, we need to extend the maintenance for the Testflight-Coda cluster through Monday, October 19, to address the remaining pending task.

At this time, we are updating you on the status of tasks:

ITEMS REQUIRING USER ACTION:

  • [COMPLETE] [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Applying a tuned profile to the Hive compute nodes
  • [COMPLETE] [Compute] Update NVIDIA GPU drivers on Coda to support the CUDA 11 SDK
  • [COMPLETE] [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [COMPLETE] [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [COMPLETE] [Network] Update Phoenix subnet managers to RHEL7.8
  • [COMPLETE] [Storage] Replace DDN 7700 storage controller 1
  • [COMPLETE] [Storage] Replace DDN SFA18KE storage enclosure 8
  • [COMPLETE] [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [COMPLETE] [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [COMPLETE] [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes
  • [COMPLETE] [Storage] Lustre Client Patches
  • [COMPLETE] [Storage] Lustre filesystem controller to be replaced
  • [PENDING] [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive, COC-ICE and PACE-ICE clusters).

ITEMS REQUIRING USER ACTION:

As previously mentioned, with regard to the task of renaming primary groups, which may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.

We will follow up with further updates.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu

 

[UPDATE – October 15, 2020, 8:44pm]

Our maintenance period has completed for the Rich datacenter one day ahead of schedule, and we are partially complete for the CODA datacenter.   All clusters in the Rich datacenter are ready for research.  Only the Hive cluster in the Coda datacenter is ready for research.  We have released users’ jobs on the Hive cluster and the Rich datacenter clusters.

The remaining clusters in the CODA datacenter (Phoenix, Testflight-Coda, CoC-ICE, and PACE-ICE) will remain under maintenance for the rest of the maintenance period as we address the remaining tasks.

At this time, we are updating you on the status of tasks:

ITEMS REQUIRING USER ACTION:

  • [COMPLETE] [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [COMPLETE] [Compute] Applying a tuned profile to the Hive compute nodes
  • [COMPLETE] [Compute] Update NVIDIA GPU drivers on Coda to support the CUDA 11 SDK
  • [COMPLETE] [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [COMPLETE] [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [COMPLETE] [Network] Update Phoenix subnet managers to RHEL7.8
  • [COMPLETE] [Storage] Replace DDN 7700 storage controller 1
  • [COMPLETE] [Storage] Replace DDN SFA18KE storage enclosure 8
  • [COMPLETE] [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [COMPLETE] [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [COMPLETE] [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes
  • [PENDING] [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive cluster).
  • [PENDING] [Storage] Lustre Client Patches
  • [PENDING] [Storage] Lustre filesystem controller to be replaced

ITEMS REQUIRING USER ACTION:

As previously mentioned, with regard to the task of renaming primary groups, which may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.

We will follow up tomorrow regarding the remaining CODA datacenter tasks impacting Phoenix, CoC-ICE, PACE-ICE, and Testflight-CODA.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu

 

[Update – October 12, 1:07PM]

We are following up with a reminder that our scheduled maintenance period begins at 6:00AM on October 14th, 2020 and concludes at 11:59PM on October 16th, 2020.  Please note that our blog post (https://blog.pace.gatech.edu/?p=6905) contains an updated list of tasks for this upcoming maintenance period; for your reference, the updated list is provided below:

ITEMS REQUIRING USER ACTION:

  • [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Applying a tuned profile to the Hive compute nodes
  • [Compute] Update NVIDIA GPU drivers on Coda to support the CUDA 11 SDK
  • [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [Network] Update Phoenix subnet managers to RHEL7.8
  • [Storage] Replace DDN 7700 storage controller 1
  • [Storage] Replace DDN SFA18KE storage enclosure 8
  • [Data Management] Update globus-connect-server on globus-hive from version 4 to version 5.4.
  • [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [Coda Datacenter] Top500 test run across Coda datacenter resources (excludes Hive cluster).
  • [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes

As previously mentioned, with regard to the task of renaming primary groups, which may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

 

[Original – September 30, 4:42PM]

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on October 14th, 2020 and conclude at 11:59 PM on October 16th, 2020. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.  Please note, during the maintenance period, users will not have access to Rich and Coda datacenter resources.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:

  •  [some user action may be needed] Rename primary groups for future p- and d-

ITEMS NOT REQUIRING USER ACTION:

  • [Compute] Applying a tuned profile to the Hive compute nodes
  • [Network] Rebooting the Hive IB switch (atl1-1-01-014-3-cs7500)
  • [Network] Rebooting PACE IB switch (09-010-3-cs7520)
  • [Storage] Replace DDN 7700 storage controller 1
  • [Storage] Replace DDN SFA18KE storage enclosure 8
  • [Coda Datacenter] Databank: Hi-Temp Cooling Tower reboot
  • [Emergency readiness test] Test emergency power down scripts for CODA and Hive compute nodes

Regarding the task of renaming primary groups, which may require some user action, we will be adjusting the names of most users’ Linux primary groups to reflect a new standardized format as part of our preparation for the migration to Coda that’s starting in October. Most users will see the name of their primary group change from school-pisurname (e.g., chem-burdell) to p-piusername (e.g., p-gburdell3) or d-school (e.g., d-chem). This change will be reflected across all PACE systems, including Hive and CUI.  The “gid” (group id number) is not changing, so this will not affect any file permissions you have set.  Most users will not need to take action. However, if you manually change file permissions using the group name or use group names in your scripts, you may need to make an adjustment. You can always run the “id” command on yourself (“id gburdell3”) to see all of your groups. Not all primary groups will change name, so do not be concerned if yours is left unchanged.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

September 29, 2020

Update to GT’s Research Cyberinfrastructure Cost Model

Filed under: Uncategorized — Semir Sarajlic @ 4:16 pm

Please visit our Web page for up to date information and FAQs: https://pace.gatech.edu/update-gts-research-cyberinfrastructure-cost-model

Introduction and Summary

Over the past few months, a team from the EVPR, OIT, EVP-A&F, and GTRC has been working with Institute leadership to develop a more sustainable and flexible way to support research cyberinfrastructure. This new model is described in more detail below and will affect researchers who leverage PACE services. The model enjoys strong support, but it is not yet fully approved.  We are communicating at this stage because we wanted you to be aware of the upcoming changes and we welcome your feedback. Please submit comments to the PACE Team <pace-support@oit.gatech.edu> or to Lew Lefton <lew.lefton@gatech.edu>. We will also have some listening sessions in the coming weeks to hear more feedback.

In a nutshell, PACE will transition from a service that purchases nodes with equipment funds, to a service which operates as a Cost Center. This means that major research cyberinfrastructure (including compute and storage services) will be treated like other core facilities. This new model will begin as the transition to the new equipment in the CODA data center happens. We recognize that this represents a shift in how we think about research computing. But, as shown below, the data indicates that the long-term benefits are worth the change.  When researchers only pay for actual consumption – similar to commercial cloud offerings from AWS, Azure, and GCP – there are several advantages:

  • Researchers have more flexibility to leverage new hardware releases instead of being restricted to hardware purchased at a specific point in time.
  • The PACE team can use capacity and usage planning to make compute cycles available to faculty in days or weeks, as opposed to having to wait for months due to procurement bottlenecks.
  • We have secured an Indirect Cost Waiver on both PACE services and commercial cloud offerings for two years to allow us to collect data on the model and see how it is working.
  • Note that a similar consumption model has been used successfully at other institutions such as Univ. Washington and UCSD, and this approach is also being developed by key sponsors (e.g. NSF’s cloudbank.org).
  • A free tier that provides any PI the equivalent of 10,000 CPU-hours on a 192GB compute node and 1 TB of project storage at no cost.

Migration to new CODA data center

As previously communicated, PACE has developed a plan to migrate services from the Rich data center to CODA.  As with many things this year, COVID-19 brought significant complications and delays to this process.  However, our goals remain the same:

  • Reduce or eliminate negative impact to research schedules.
  • Reduce the long-term cost to the Institute.
  • Maximize the utilization of the leased space in Coda in order to make the most cost-effective use of this new resource.
  • Remove HPC equipment supported by PACE from Rich as quickly as possible.

The supporting infrastructure is now ready to go and the move of the first cohort of PIs is planned for early October 2020.  Due to these delays, the time we have to complete the move is more compressed than we would like.  Critical pieces of infrastructure in Rich are rapidly deteriorating, and we have a “hard stop” for licensing of critical components on 12/31/2020.  In light of this, we have adjusted the migration schedule such that the process will complete by the end of December.  Individual migration times will be communicated to PIs and CSRs shortly.

Current cost model & equipment refreshment

In partnership with GT Capital Finance, PACE has secured the necessary campus approvals and has secured commercial financing to purchase an entirely new set of hardware for use in CODA.  The repayment for this loan will come from existing PACE budget over the course of the next five years.  Following established precedent, we have developed an “equivalent or better” refreshment rubric using the SPECfp_rate benchmark, while accounting for memory, local storage, GPU, and other capabilities of current equipment that produces a mapping to current generation hardware. We acknowledge that there are a small number of edge cases where this rubric needed adjustment and we have worked with those PIs individually with EVP-R approval.

Issues with old cost model

The old cost model calls for faculty to spend equipment funds on compute nodes.  Additional supporting IT infrastructure and personnel were subsidized by the Institute via the PACE budget.  With increased procedures mandated in the procurement process and manufacturer lead times, the “time to science” has now increased to an untenable 6-8 months between a PI’s request for resources and the availability of those resources. This model also results in “silos” of equipment with some resources sitting idle while others have a backlog of work to process.  Additionally, this model locks PIs into specific configurations for a 5-year period, which prevents the timely leveraging of technology breakthroughs to advance research.  As research needs evolve, it is frequently difficult or impossible to adjust existing hardware to accommodate the new requirements.  A preferable approach would be one where the hardware adapts to the needs of scientific workflows.

The new cost model

The Institute has developed a new, more sustainable and flexible way to support research cyberinfrastructure. This cost model will be used going forward and has full support from the offices of the EVP-R and EVP-A&F, as well as GTRC and PACE.  The new model is based on actual consumption – similar to commercial cloud offerings from Amazon (AWS), Microsoft (Azure), and Google (GCP) – and has significant benefits to PIs as well as the Institute as a whole. In this “consumption model,” PIs only pay for what they actually use rather than some fixed capacity. This model also provides:

  • Costs calibrated to match old equipment model.  A critical goal of this consumption model is to keep bottom line costs to PIs equivalent to what they paid in the previous model.  For instance, a compute node with 192GB RAM purchased in the equipment model costs approximately $7,000 and provides approximately 200,000 CPU-hours per year over the course of its 5-year lifetime, or about 1,000,000 CPU-hours in total.  In the consumption model, 1,000,000 CPU-hours on the same hardware configuration will cost approximately $7,000.  A draft rate study has been submitted, which will formalize rates for PACE services and be reviewed annually.  A detailed breakdown of charges is provided in the appendix.
  • F&A overhead waiver.  Cost recovery will be managed via the PACE service center, with OIT Resource Management handling the accounting as they have in the past.  PACE has secured the necessary campus approvals to waive the F&A overhead on PACE services, as well as those from commercial cloud providers, for an initial period of two years.
  • Increased flexibility in the type of resources used, allowing researchers to tailor computing resources to fit their scientific workflows rather than being restricted to the specific quantity and configuration of compute nodes purchased.
  • Ability for rapid provisioning without the requirement to wait for a lengthy procurement period to complete.  We anticipate being able to reduce the time-to-science from 6-8 months to a period measured in days or weeks.
  • Insulation from failure of compute nodes.  Compute nodes in need of repair can be taken out of service and repaired, allowing jobs to proceed using other compute nodes in the pool rather than decreasing the capacity available to a particular user.
  • PACE will monitor the time jobs wait in the queue and procure additional equipment as needed to keep wait times reasonably low.
  • Based on recent usage trends, we project a lower overall cost to PIs versus the old equipment model.
  • A free tier of compute and storage will replace the proposal-based FoRCE allocation process.  Supplying the proposal will no longer be required.  PIs simply need to request access and they will be given a monthly allocation of compute time and 1TB of project storage at no cost.  This is available to all GT academic and research faculty, including GTRI.
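The cost calibration described in the first bullet above can be checked with a few lines of arithmetic. The sketch below uses only the approximate figures quoted in this post, plus the $0.0068 per CPU-hour draft rate that appears in the worked refresh example later in the post; the function and constant names are illustrative, and the final rates may differ.

```python
# Approximate figures quoted in this post (draft values, subject to change).
NODE_PRICE_USD = 7_000                 # one 192GB node in the equipment model
CPU_HOURS_PER_YEAR = 200_000           # approximate yearly output of that node
LIFETIME_YEARS = 5
DRAFT_RATE_USD_PER_CPU_HOUR = 0.0068   # draft rate from the refresh example


def lifetime_cpu_hours(per_year: int, years: int) -> int:
    """Total CPU-hours a node delivers over its lifetime."""
    return per_year * years


def equipment_cost_per_cpu_hour(price: float, per_year: int, years: int) -> float:
    """Effective price per CPU-hour when buying the node outright."""
    return price / lifetime_cpu_hours(per_year, years)


def consumption_cost(cpu_hours: int, rate: float) -> float:
    """Cost of the same CPU-hours under the consumption model."""
    return round(cpu_hours * rate, 2)


total_hours = lifetime_cpu_hours(CPU_HOURS_PER_YEAR, LIFETIME_YEARS)
print("lifetime CPU-hours:", total_hours)
print("equipment model, USD per CPU-hour:",
      equipment_cost_per_cpu_hour(NODE_PRICE_USD, CPU_HOURS_PER_YEAR, LIFETIME_YEARS))
print("consumption model, USD for the same hours:",
      consumption_cost(total_hours, DRAFT_RATE_USD_PER_CPU_HOUR))
```

Under these figures the equipment model works out to about $0.007 per CPU-hour, and the draft consumption rate prices the same 1,000,000 CPU-hours at about $6,800, which is consistent with the "approximately $7,000" stated above.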

Transition from equipment to credits

The transition from the Rich data center to CODA, and associated equipment refresh, provides an ideal opportunity to transition to the new consumption model.  In place of specific compute nodes, PIs will receive a corresponding number of credits which can be drawn down over a period of five years.  These credits will not be associated with any specific compute nodes, allowing for the use of different capabilities as research workflows require.  As part of this transition, no billing will occur until at least January 2021.  Specifically, the credits allocated via this refresh will not begin to be drawn down until this time.  During the time between migration and the start of billing, PACE will provide PIs a monthly “show back” invoice.  No funds will be transferred during this time, as this period is intended to give PIs a sense of their monthly utilization and be best able to make projections going forward.

PACE Services and proposed rates

Draft rates for PACE services are given as an appendix.  Please note that a possibility exists that these rates could change slightly upon final review and approval by Grants & Contracts.  In this event, adjustments to PI refresh allocations will be made as needed.

Consider the following example.  A PI currently has a 22-node cluster in Rich that was purchased in 2012.  The warranty expired three years ago, but the cluster continues to limp along, experiencing occasional hardware failures.  Each node has 64 cores and 256GB of memory, and the cluster has a total rating of 16,830 on the SPECfp benchmark.  According to the refresh rubric, the replacement resource is 13 nodes, each with 24 cores and 192GB memory.  (Note that this is twice the memory per core of the original system.)  Each new node provides 202,657 CPU-hours per year.  The PI will receive 13 nodes * 202,657 CPU-hours per year * 5 years * $0.0068 = $89,574.39 in credits.  The PI can utilize these credits to run on nodes with 192GB memory if desired, but also has the flexibility to run a mix of jobs that use 768GB nodes, or GPU nodes, or any other available configuration.
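The arithmetic in this example can be reproduced as follows. This is a minimal sketch using the numbers stated above; the function names are illustrative, and Decimal is used so that the dollar amount rounds to cents predictably.

```python
from decimal import Decimal, ROUND_HALF_UP


def refresh_credits(nodes: int, cpu_hours_per_node_year: int,
                    years: int, rate_usd: str) -> Decimal:
    """Credits in USD: nodes * CPU-hours per node-year * years * rate."""
    total_cpu_hours = nodes * cpu_hours_per_node_year * years
    amount = Decimal(total_cpu_hours) * Decimal(rate_usd)
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def memory_per_core_gb(memory_gb: int, cores: int) -> float:
    """Memory available per core on a single node."""
    return memory_gb / cores


# Values from the example: 13 replacement nodes over 5 years at the draft rate.
print("credits (USD):", refresh_credits(13, 202_657, 5, "0.0068"))
print("old GB per core:", memory_per_core_gb(256, 64))
print("new GB per core:", memory_per_core_gb(192, 24))
```

The total is 13,172,705 CPU-hours, which at $0.0068 per CPU-hour rounds to the $89,574.39 quoted above. The memory-per-core figures (4GB on the old nodes, 8GB on the new ones) confirm the doubling noted in the example.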

Operational Considerations

The interface to the queuing system has been simplified.  Behind the scenes, the scheduler hardware has been enhanced, software upgraded, and configuration changes applied that are all designed to increase scheduler performance and response time.  An additional software module, MAM (Moab Accounting Manager), has been procured to augment the Moab scheduler currently in use by PACE; it provides a robust means of consumption recording and billing for jobs.

Individual queues are largely replaced with billing codes known as MAM Accounts.  PIs will be issued multiple MAM Accounts for use by their students and collaborators.  It is important to note that the PI will be responsible for all usage charges incurred by their MAM Accounts.

Free tier MAM Account – Each PI will receive a MAM Account that contains the equivalent of 10,000 CPU-hours on a 192GB compute node.  The credits can be used to run jobs on any available compute node configuration.  Unused credits do not accumulate month-to-month.  The account will be reset to 10,000 every month.
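The monthly reset behavior can be illustrated with a small model. This is a sketch of the rule as described above, not of how MAM implements it: the class name is hypothetical, and refusing a charge that exceeds the remaining balance is an assumption, since this post does not say what happens when the free tier runs out mid-month.

```python
from dataclasses import dataclass

# Monthly free-tier allocation, in CPU-hour equivalents on a 192GB node.
FREE_TIER_MONTHLY_CREDITS = 10_000


@dataclass
class FreeTierAccount:
    balance: float = FREE_TIER_MONTHLY_CREDITS

    def charge(self, credits: float) -> bool:
        """Debit credits; return False (assumed behavior) if the balance is too low."""
        if credits < 0:
            raise ValueError("credits must be non-negative")
        if credits > self.balance:
            return False
        self.balance -= credits
        return True

    def start_new_month(self) -> None:
        # Unused credits do not roll over: the balance is reset, not topped up.
        self.balance = FREE_TIER_MONTHLY_CREDITS


account = FreeTierAccount()
account.charge(2_500)
print("balance after usage:", account.balance)
account.start_new_month()
print("balance after reset:", account.balance)
```

The point of the sketch is the reset: after a month with 7,500 credits left over, the next month starts at 10,000 again rather than 17,500.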

Refresh Credit MAM Account – Each PI that currently has resources in the Rich data center will receive a MAM Account with an amount of credits according to the refresh rubric.  The MAM Account will be retired when exhausted or after five years, whichever comes first.  If no additional sponsored resources are available to the researcher, they are still able to continue at the free tier.

New Project MAM Account – A new MAM Account will be issued for each new funding source.  This could be a startup package, sponsored research, departmental funds, etc.  MAM Accounts of this type will be associated with exactly one Worktag.  Worktags associated with a MAM Account cannot be changed, but any number of MAM Accounts can be issued.  If desired, a PI may set a cap on monthly utilization.  This can be helpful in preventing errant jobs from consuming an inordinate amount of resources.
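As a rough illustration of how a monthly cap guards against errant jobs, the check below compares a job's estimated cost against the remaining headroom. The function name and the decision to reject a job up front are assumptions made for this sketch; this post does not describe how the cap is enforced.

```python
from typing import Optional


def job_fits_monthly_cap(spent_this_month: float, job_cost: float,
                         monthly_cap: Optional[float]) -> bool:
    """Return True if running the job keeps the month's spending within the cap.

    A cap of None means the PI has not set a cap.
    """
    if monthly_cap is None:
        return True
    return spent_this_month + job_cost <= monthly_cap


# Hypothetical numbers: a 1,000-credit cap with 900 credits already used.
print(job_fits_monthly_cap(900.0, 50.0, 1_000.0))
print(job_fits_monthly_cap(900.0, 150.0, 1_000.0))
```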

Backfill queue – The iw-shared queue is replaced by a backfill queue.  Jobs submitted to the backfill queue do not debit MAM Account balances.  These jobs run at the lowest priority in the system and will be preempted (killed and rescheduled) if a non-backfill job is waiting to run.  During the first hour of execution, backfill jobs are exempted from being preempted.
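The preemption rule for backfill jobs can be written as a small predicate. This is a sketch of the policy as stated above, not scheduler configuration: the function name is hypothetical, and treating a job at exactly one hour of runtime as preemptible is an assumption.

```python
from datetime import timedelta

# Backfill jobs are exempt from preemption during their first hour.
BACKFILL_GRACE_PERIOD = timedelta(hours=1)


def can_preempt(is_backfill: bool, runtime: timedelta,
                non_backfill_job_waiting: bool) -> bool:
    """Decide whether a running job may be killed and rescheduled."""
    if not is_backfill:
        # This sketch only models preemption of backfill jobs.
        return False
    if not non_backfill_job_waiting:
        # Nothing with higher priority needs the resources.
        return False
    return runtime >= BACKFILL_GRACE_PERIOD


print(can_preempt(True, timedelta(minutes=30), True))
print(can_preempt(True, timedelta(hours=2), True))
```

In practice this means short backfill jobs (under an hour) run to completion once started, while longer ones should be written to tolerate being killed and rescheduled.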

A default queue is provided for all jobs not intended for backfill.  A MAM Account must be provided for each job.    As normal, resource requirements must also be specified for each job.  The scheduler will route jobs to the least costly compute node configuration that satisfies the specified requirements.  Thus, jobs may run on any node configuration by simply requesting the appropriate resources.
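The routing rule above (pick the least costly configuration that satisfies the request) can be sketched as follows. The node catalog here is hypothetical: only the 24-core 192GB node and its $0.0068 draft rate come from this post, while the other core counts and rates are made-up placeholders, and the real scheduler logic is not described here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NodeConfig:
    name: str
    cores: int
    memory_gb: int
    gpus: int
    rate_per_cpu_hour: float


# Illustrative catalog; every value except the first entry is a placeholder.
CONFIGS = [
    NodeConfig("cpu-192gb", 24, 192, 0, 0.0068),
    NodeConfig("cpu-768gb", 24, 768, 0, 0.0100),   # hypothetical rate
    NodeConfig("gpu", 24, 192, 1, 0.0300),         # hypothetical rate
]


def route_job(configs: Sequence[NodeConfig], cores: int,
              memory_gb: int, gpus: int = 0) -> Optional[NodeConfig]:
    """Return the cheapest configuration meeting the request, or None."""
    candidates = [
        c for c in configs
        if c.cores >= cores and c.memory_gb >= memory_gb and c.gpus >= gpus
    ]
    return min(candidates, key=lambda c: c.rate_per_cpu_hour, default=None)


for request in [(8, 64, 0), (8, 512, 0), (8, 64, 1)]:
    chosen = route_job(CONFIGS, *request)
    print(request, "->", chosen.name if chosen else None)
```

The practical consequence is that requesting more memory or a GPU than a job needs can route it to a more expensive configuration, so resource requests are worth sizing carefully.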

PACE will provide a command line tool to show MAM Account balances in real time.  Invoices will be sent monthly.  No funds will actually be transferred, and Refresh Credit MAM Accounts will not be debited until at least January 2021.  Resources consumed towards the end of the FY will be billed at the beginning of the next FY.  PACE will also develop a refund policy and procedure for crashed jobs.  The details are still being worked out, but the philosophy is that refunds will be issued for jobs that crash due to a system issue.  A higher threshold will need to be met if a job crashes for other reasons.

FAQ

I’m preparing a budget for a proposal.  How many CPU-hours do I need?  Engaging PACE in proposals is highly encouraged.  Our mission is to support the research community in their efforts, and part of that is helping to optimize workflows, suggest effective computational methods, plan projects with a cyberinfrastructure component, and assist in the preparation of proposal budgets.  If appropriate, we may also act as senior personnel or Co-PI.  We will also publish a calculator tool to convert from more familiar units such as compute nodes and CPU cores and GPUs to credits.

I use the Hive cluster.  Are we going to be charged now?  No.  Hive is 100% allocated according to the terms of its NSF award and exists outside of this consumption model.

What about data management plans requiring long-term retention?  Storage may be purchased in advance for a defined quantity and duration.  The Archive storage service would be an excellent choice for this use case. We encourage engaging PACE for any storage concerns.

Can I have the old equipment that is not moving to CODA?  No.  Equipment with useful lifetime remaining will be traded in to offset the cost of the new equipment and to fund expanded PACE services.  The rest will be disposed of via the regular surplus process.

I already have projects in various stages of funding that have budgeted for equipment.  What do I do?  This will be handled on a case-by-case basis in partnership with the office of the EVP-R, OSP, and PACE.  Among the possibilities would be to work with your program officer to reallocate the equipment budget.  A memo will be drafted by the EVP-R’s office and PACE that describes the change in cost model for use in these discussions.

How does this change the process used to budget for new faculty startup packages?  Historically, PACE purchases for startup packages have been one-time equipment purchases, budgeted for in a given fiscal year.  This new model spreads out the expenses over time.  We suggest creating a Worktag to track the startup commitment and then adding funds to that Worktag as appropriate based on the particulars of the startup package.  It is expected that this commitment will be satisfied over the course of a number of years.  This Worktag will be associated with a “New Project MAM Account” as described above.

What about equipment-only awards like DURIP, and NSF’s CC* and MRI?  We are committed to providing appropriate research cyberinfrastructure to advance Georgia Tech’s mission and we recognize that there will be rare exceptions to the new model. We have already made accommodations for sponsored funds which explicitly require equipment in the call for proposals, and we will continue to develop this hybrid aspect of the model on a case-by-case basis.  If you are pursuing an equipment-only award that needs the support of the PACE team, we encourage you to engage with the team early in the proposal process.

How do the internal cost center rates from GT compare to commercial cloud providers like AWS, Azure, etc.?  A true apples-to-apples comparison would require more project-specific details, including ingress and egress of large datasets, specific CPU/GPU requirements, etc.  However, based on published information, commercial cloud rates are 4 – 10 times higher than GT rates.  This is in part because GT doesn’t have to build in the profit margins that commercial entities require.  We should also note that commercial cloud providers have a more robust infrastructure and higher scalability, which may be worth the cost differential depending on the project.

Appendix

The following table lists draft rates for PACE services.  Please note that a possibility exists that these rates could change slightly upon final review and approval by Grants & Contracts.  In this event, adjustments to PI refresh allocations will be made as needed.

The calculated usage rate is the maximum rate GT is allowed to charge itself (e.g. GT PIs) for these services.  The internal rate is what will actually be charged to GT PIs.  The difference between the calculated usage rate and the internal rate reflects the subsidy provided by the Institute and is similar to the level of subsidy in the old cost model.  This subsidy includes data center costs, PACE staff salaries and fringe benefits, supporting IT infrastructure, software licenses, etc.  CRITICAL TO NOTE: the internal rate has been carefully selected to match the current cost of the corresponding compute node configuration.  The external rate is intended for GTRI and any future possible non-GT customers (e.g. a startup company spun out from GT).
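To make the subsidy concrete, the sketch below computes the difference between the calculated usage rate and the internal rate for one row of the table, along with that difference as a share of the calculated rate. The function name is hypothetical, and the percentage is simply arithmetic on the draft rates, not a figure published by PACE.

```python
from decimal import Decimal


def subsidy(calculated: str, internal: str) -> tuple[Decimal, Decimal]:
    """Return (subsidy in dollars per unit-hour, subsidy as a percentage of the calculated rate)."""
    calc = Decimal(calculated)
    amount = calc - Decimal(internal)
    share = (amount / calc * 100).quantize(Decimal("0.1"))
    return amount, share


# Draft rates for [GEN] cpu-192GB: calculated $0.0273, internal $0.0068 per CPU-hour.
amount, share = subsidy("0.0273", "0.0068")
print(amount, share)
```

For this row the difference is $0.0205 per CPU-hour, or roughly 75% of the calculated usage rate.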

Compute services are broken down into three major categories: General (GEN), CUI, and LIGO/OSG.  The general category comprises the vast majority of current PACE resources.  CUI (Controlled Unclassified Information) requires additional security measures and is for activities requiring adherence to NIST 800-171, such as ITAR and various export control regimes.  LIGO/OSG is the Open Science Grid model of computing, used primarily by high-throughput workloads.  Subcategories describe the capabilities of the compute node such as memory size (e.g. 192GB), availability of local disk (e.g. SAS), or GPUs (e.g. v100 for double precision workloads and RTX6000 for single precision workloads).  Note that compute nodes with GPUs use different units: they are measured in GPU-hours rather than CPU-hours.

Storage is charged based on the capabilities of the storage platform.  Project storage is intended for data sets that are in active use.  Scratch storage and home directories continue to be available and are provided at no cost.  Capacities of both will increase relative to what is provided in Rich.  LIGO/OSG uses a different model for storage and is not accessible to compute services other than LIGO/OSG.  Archival storage is intended for the long-term storage of data sets and is not directly accessible from PACE compute services.  It uses Globus as a user interface, allowing data sets to be easily transferred to Project storage.

Both CUI compute and storage have notable differences.  Due to security requirements, CUI compute must be scheduled per node rather than per CPU core.  Since current compute nodes contain 24 CPU cores, the effective rate for CUI compute is 24 times the per-CPU-hour rate listed in the table below.  PACE research scientists will be available to assist in transforming workflows to utilize as many CPU cores in a single job as possible.  Security requirements dictate a hybrid consumption/equipment model for CUI storage.  PIs will need to pay a monthly/annual fee per drive bay, plus the cost of the drives themselves.  To ensure integrity and availability of the storage, at least two drives and drive bays must be purchased.  For larger capacity requirements, more sophisticated RAID levels may be utilized, and PACE staff will provide assistance in determining the optimal solution.  PACE will generate a vendor quote for drives of the chosen capacity and appropriate configuration and issue a PO against a Worktag provided by the PI.  Additional drives and drive bays are required to provide backups.  PACE will provide the required funding to enable nightly backups per our usual operational procedures.
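The CUI arithmetic above can be worked through as follows. This is a sketch using the draft rates from the table; the function and constant names are hypothetical, the drive-bay fee excludes the cost of the drives themselves, and the backup bays that PACE funds are not counted.

```python
from decimal import Decimal

CORES_PER_NODE = 24                      # from the text: current nodes contain 24 CPU cores
CUI_BAY_RATE = Decimal("41.20")          # draft [Storage] CUI rate, dollars per drive bay per month
MIN_BAYS = 2                             # from the text: at least two drives and drive bays


def cui_node_hour_rate(cpu_hour_rate: str) -> Decimal:
    """Effective dollars per node-hour when a whole 24-core node must be scheduled."""
    return CORES_PER_NODE * Decimal(cpu_hour_rate)


def cui_bay_fee(bays: int, months: int) -> Decimal:
    """Drive-bay fee in dollars for the given number of bays and months (drives not included)."""
    if bays < MIN_BAYS:
        raise ValueError(f"at least {MIN_BAYS} drive bays are required")
    return bays * months * CUI_BAY_RATE


# [CUI] Server-192GB at the draft internal rate of $0.0068 per CPU-hour.
print(cui_node_hour_rate("0.0068"))  # dollars per node-hour
print(cui_bay_fee(2, 12))            # minimum configuration for one year
```

At the draft rates, a Server-192GB node works out to $0.1632 per node-hour, and the minimum two-bay configuration costs $82.40 per month, or $988.80 per year, before the drives are purchased.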

Finally, a new consulting service is forthcoming which will enable PACE staff to provide a higher level of service when required and allows easy budgeting for PIs wishing to include PACE staff in grant proposals.

 

| Service | Unit | Calculated Usage Rate | Internal | External |
|---|---|---|---|---|
| [GEN] cpu-192GB | CPUh | $0.0273 | $0.0068 | $0.0246 |
| [GEN] cpu-384GB | CPUh | $0.0255 | $0.0077 | $0.0276 |
| [GEN] cpu-384GB-SAS | CPUh | $0.0252 | $0.0091 | $0.0289 |
| [GEN] cpu-768GB | CPUh | $0.0254 | $0.0091 | $0.0297 |
| [GEN] cpu-768GB-SAS | CPUh | $0.0285 | $0.0119 | $0.0340 |
| [GEN] gpu-192GB-v100 | GPUh | $0.3669 | $0.2307 | $0.5161 |
| [GEN] gpu-384GB-RTX6000 | GPUh | $0.2693 | $0.1491 | $0.2777 |
| [GEN] gpu-384GB-v100 | GPUh | $0.4789 | $0.2409 | $0.5084 |
| [GEN] gpu-768GB-RTX6000 | GPUh | $0.1552 | $0.1552 | $0.2833 |
| [GEN] gpu-768GB-v100 | GPUh | $0.4567 | $0.2627 | $0.4945 |
| [CUI] Server-192GB | CPUh | $0.0570 | $0.0068 | $0.0570 |
| [CUI] Server-384GB | CPUh | $0.0636 | $0.0103 | $0.0641 |
| [CUI] Server-384GB-SAS | CPUh | $0.0797 | $0.0153 | $0.0818 |
| [CUI] Server-768GB | CPUh | $0.0651 | $0.0128 | $0.0651 |
| [LIGO/OSG] cpu-192GB | CPUh | $0.0469 | $0.0068 | $0.0469 |
| [LIGO/OSG] cpu-384GB | CPUh | $0.0353 | $0.0077 | $0.0414 |
| [LIGO/OSG] gpu-384GB-RTX6000 | GPUh | $0.2313 | $0.1639 | $0.3789 |
| [Storage] Project Storage | TB/Mo | $7.85 | $6.67 | $7.85 |
| [Storage] CUI | Drive Bay/Mo | $41.20 | $41.20 | $41.20 |
| [Storage] Hive | TB/Mo | $6.60 | $6.60 | $6.60 |
| [Storage] LIGO/OSG | TB/Mo | $2.61 | $2.61 | $2.61 |
| [Storage] Archival | TB/Mo | $4.89 | $3.33 | $4.89 |
| General Consulting | Hour | $98 | $98 | $98 |

September 18, 2020

[Resolved] Emergency Storage maintenance (GPFS/pace2) in Rich datacenter

Filed under: Uncategorized — Semir Sarajlic @ 11:27 am

[Update – 3:02pm] 

We are following up to inform you that our emergency maintenance work on GPFS pace2 storage in the Rich datacenter was completed successfully, and at approximately 2:10pm we released the jobs on the Shared and Dedicated clusters in Rich.   Please note that the GPFS pace2 file system will temporarily be slightly slower, as it is concurrently rebuilding 7 drives.  During this maintenance, we did not lose any user data, and we did not interrupt any user jobs that were running during this period.

What PACE will do:  PACE will continue to monitor the storage and report as needed.  Thank you for your attention and patience during this brief emergency storage maintenance.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Thank you,
The PACE Team

 

[Original – 11:25]

PACE will be conducting emergency maintenance work on storage in the Rich datacenter today at 1:00pm.  As a precaution, we have paused all user jobs as of 10:15am today.  Currently running jobs will remain running but may be subject to interruption during our emergency maintenance.

What’s about to happen: Today, starting at 1:00pm, the PACE team will need to conduct an emergency maintenance activity on our GPFS in the Rich datacenter, which will involve reseating the primary IO module.  The storage impacted is the /data directory on GPFS pace2 that users use on the Shared and Dedicated clusters.

Who is impacted: As of 10:15am, all PACE users are unable to submit and run new jobs, as the schedulers in the Rich datacenter have been paused.  Currently running user jobs may be subject to interruption during the maintenance activity.  If jobs get interrupted, the PACE team will follow up with the impacted users to notify them.

This emergency maintenance activity does not impact the Coda datacenter, which includes the Hive and TestFlight-Coda clusters.  Also, as of 11:30am we have released the jobs for Gryphon and Novazohar that were briefly paused this morning as we assessed the situation.  The Gryphon and Novazohar clusters will not be impacted by the 1:00pm scheduled emergency maintenance.

For updates, you may refer to our blog post, which we will update as further information becomes available.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Thank you for your attention and our apologies for this inconvenience.

The PACE Team

September 8, 2020

OIT’s Planned Network Maintenance

Filed under: Uncategorized — Semir Sarajlic @ 1:27 pm

We are following up with an update on the schedule for OIT’s planned network maintenance.  OIT’s Network Engineering team will be conducting two maintenance activities on the evening of Friday, September 11th, from 7:00pm to 4:00am.  These maintenance activities will interrupt PACE’s connection to the outside Internet.  The interruption is anticipated to last 30 minutes or less from the start of the activity, which may occur at any point during this updated and longer maintenance window.

What’s about to happen:  On Friday, September 11th, from 7:00pm to 4:00am (September 12th), the Network Engineering team will be upgrading the data center firewall appliances to the latest code recommended by Palo Alto, which addresses serious security vulnerabilities.  To reassure you, OIT’s network team has been operating with some controls in place to address these vulnerabilities, and this planned upgrade will further reduce our risk.  In a second maintenance activity, also starting on Friday, September 11th, the Network Engineering team will be swapping service to a more capable Network Address Translation (NAT) appliance in the Coda datacenter, as the one currently in Coda is being overloaded.  These activities will affect PACE’s connection to/from the Internet.

Who is impacted: PACE users may be unable to connect to PACE resources, or may lose an existing connection, for 30 minutes or less from the start of the activity, at any point during the maintenance window.  We encourage users to avoid running interactive jobs (e.g., VNC/X11) that rely on an active SSH connection to a PACE cluster during this time frame, to avoid sudden interruptions due to a loss of connection to the PACE resources.  Batch jobs that are running and queued in the PACE schedulers will operate normally; however, any jobs that require resources outside of PACE or on the Internet will be subject to interruptions during these maintenance activities.  These maintenance activities will not affect any of the PACE storage systems.

What PACE will do:  PACE will remain on standby during these activities to monitor the systems, conduct testing, and report on any interruptions in service.  For up-to-date progress, please check Georgia Tech’s status page, https://status.gatech.edu.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
