PACE A Partnership for an Advanced Computing Environment

December 18, 2020

Update on the New Cost Model: March 1, 2021 is the Start Date for Compute and Storage Billing

Filed under: Uncategorized — Semir Sarajlic @ 5:43 pm

Dear PACE research community,

Last week, we completed the last batch of user migrations to the Phoenix cluster, which was a major milestone for our research community and PACE. We are grateful for all your support and understanding during this major undertaking of migrating our user community from the Rich datacenter to the Coda datacenter – thank you!

At this time, we want to share an update regarding the new cost model. We are moving the start date for compute and storage billing from the tentative date of January 1, 2021 to March 1, 2021. This means that users will not be charged for their usage of compute and storage resources until March 1, 2021. Given this grace period extension, we have implemented measures to ensure fair use of the Phoenix cluster, which were announced to the user community on December 14 and in this blog post. This grace period extension should provide ample time for the research community to adapt to the Phoenix cluster and the new cost model.

If you have any questions, concerns or comments about your recent migration to the Phoenix cluster, please direct them to pace-support@oit.gatech.edu.

Happy Holidays!

The PACE Team

December 14, 2020

Scheduler Policy Update for the Phoenix Cluster

Filed under: Uncategorized — Semir Sarajlic @ 6:45 pm

Dear Researchers,

Last week, we completed the migration of our third batch of users from the Rich datacenter to the Coda datacenter, a major milestone for our research community, as we have now migrated nearly 2,000 active users from clusters in Rich to the Phoenix cluster in Coda. With the number of users on the already highly utilized Phoenix cluster more than doubled, coupled with the grace period currently in effect for job accounting under the recently announced cost model, we have seen a rapid increase in per-job wait times on the cluster. At this time, we are updating our scheduler policy to relieve this pressure on job wait times, which should improve the overall quality of service. The changes listed below are data-driven and have been carefully chosen so as not to adversely impact research teams that submit large-scale jobs.

Effective today, the following changes have been made to the scheduler policy that affect the inferno and embers queues:

  • Reduced the concurrent-use limit for CPU usage per research group from 7,200 processors to 6,000 processors.
  • Reduced the concurrent-use limit for GPU usage per user from 220 GPUs to 32 GPUs.
  • Added a per-research-group concurrent CPU-hour capacity limit of 300,000 CPU hours, allowing a research group to concurrently run jobs totaling up to 300,000 CPU hours (i.e., requested processors * walltime).
  • Added a per-job CPU-time capacity limit of 264,960 CPU hours, which would allow, for example, a 2,208-core job to run for 5 days (see the sketch below).
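
To make the arithmetic behind these limits concrete, here is a minimal Python sketch of how a job's CPU-hour cost (requested processors * walltime) compares against the new caps. This is an illustration only, not a PACE tool; the function names and the example group usage are hypothetical, and the limit values are those listed above.

# Minimal sketch (not a PACE tool): how the announced capacity limits relate
# to a single job request. The limits are the values listed above; the
# function names and the example group usage are hypothetical.

PER_JOB_CPU_HOUR_CAP = 264_960    # per-job CPU-time capacity (CPU hours)
GROUP_CPU_HOUR_CAP = 300_000      # concurrent CPU hours per research group
GROUP_PROCESSOR_CAP = 6_000       # concurrent processors per research group

def job_cpu_hours(processors, walltime_hours):
    """CPU-hour cost of one job: requested processors * walltime."""
    return processors * walltime_hours

# The example from the list above: a 2,208-core job with a 5-day walltime
# comes to exactly the per-job cap of 264,960 CPU hours.
assert job_cpu_hours(2_208, 5 * 24) == PER_JOB_CPU_HOUR_CAP

def fits_under_caps(processors, walltime_hours,
                    group_running_cpu_hours, group_running_processors):
    """Rough check of whether a new job would stay under the announced caps,
    given what the research group is already running. The scheduler's own
    accounting (e.g., using remaining rather than requested CPU-time of
    running jobs) may differ; this only illustrates the arithmetic."""
    hours = job_cpu_hours(processors, walltime_hours)
    return (hours <= PER_JOB_CPU_HOUR_CAP
            and group_running_cpu_hours + hours <= GROUP_CPU_HOUR_CAP
            and group_running_processors + processors <= GROUP_PROCESSOR_CAP)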

Jobs that violate these limits will be held in the queue until currently running jobs complete and the total number of utilized processors, GPUs, and/or the remaining CPU time falls below the thresholds. We have updated our documentation to reflect these changes, which you may view here.

Again, the changes listed above take effect today, December 14, 2020. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

All the best,

The PACE Team


December 8, 2020

[Resolved] Scheduler down for the Phoenix Cluster

Filed under: Uncategorized — Semir Sarajlic @ 10:26 am

[Update – December 9, 2020 – 11:03am]

Thank you for your continued patience. We are reaching out to let you know that the scheduler issue was resolved late last night, jobs from users that were held in the queue yesterday have resumed, and we have observed normal operation of the Phoenix scheduler through the night and this morning. The cause of the scheduler issue was the large, sudden influx of user jobs on December 7, as previously reported. The updated timeout parameter will be kept in place to prevent a similar occurrence in the future. Additionally, we are expanding our alert utilities to track additional scheduler metrics so that we are alerted to similar stressors and can proactively address or mitigate scheduler issues.

The Phoenix cluster is ready for research. Thank you again for your patience as we worked to address this issue in coordination with the vendor.


[Update – December 8, 2020 – 6:31pm]

Thank you for your patience today as we have worked extensively with the vendor to address the scheduler outage on the Phoenix cluster. This is a brief update from today’s joint investigation.

What has PACE done: PACE, along with the vendor, Adaptive Computing, has been conducting an extensive investigation of the scheduler outage. The root cause of the incident is still under investigation; however, we have identified a multi-pronged event that started at 11:39pm on December 7 and was compounded by a rapid influx of nearly 30,000 user jobs that led the scheduler to become unresponsive. Given this large influx of jobs, we have increased the timeout setting for the scheduler to allow Moab to process the backlog of submitted jobs. This is currently underway.

Who does this message impact: This impacts all users on the Phoenix cluster who have submitted jobs. During this incident, it is normal for users to see their jobs remain in the queue after submission while the scheduler works through the backlog of job submissions.

What PACE will continue to do: We will continue to monitor the scheduler as it processes the backlog of jobs and will provide updates as needed. This continues to be an active situation, and we will share further information as it becomes available.

Thank you again for your patience as we work diligently to address this issue.

The PACE Team


[Original Note – December 8, 2020 – 10:26am]

Dear PACE users,

PACE is investigating a scheduler issue that is impacting the Phoenix cluster. At this time, users are unable to run jobs, and submitted jobs are held in the queue.

This is an active situation, and we will follow up with updates as they become available.

Thank you for your attention to this urgent message, and we apologize for this inconvenience.

December 2, 2020

[RESOLVED] Intermittent unavailability of Phoenix login nodes

Filed under: Uncategorized — Semir Sarajlic @ 5:44 pm
Phoenix login nodes 1, 2 and 4 became unavailable today for short periods of time. We identified the issue as excessive user activity and rebooted the nodes, which are now available. We will reach out to the relevant users to prevent this from happening again.
