Dear Researchers,
Last week, we completed the migration of our third batch of users from the Rich datacenter to the Coda datacenter. This was a major milestone for our research community: nearly 2,000 active users moved from clusters in Rich to the Phoenix cluster in Coda. With the number of users on the already highly utilized Phoenix cluster more than doubled, and with the job-accounting grace period currently in effect under the recently announced cost model, we have seen a rapid increase in per-job wait times on the cluster. We are therefore updating our scheduler policy to reduce wait times and improve the overall quality of service. The changes listed below are data-driven and have been chosen carefully so as not to adversely impact research teams that submit large-scale jobs.
Effective today, the following changes have been made to the scheduler policy that affect the inferno and embers queues:
- Reduced the concurrent-use limit for CPU usage per research group from 7,200 processors to 6,000 processors.
- Reduced the concurrent-use limit for GPU usage per user from 220 GPUs to 32 GPUs.
- Added a per-research-group concurrent CPU-hour capacity limit of 300,000 CPU hours, which allows a research group's concurrently running jobs to total up to 300,000 CPU hours (i.e., requested processors * walltime).
- Added a per-job CPU-time capacity limit of 264,960 CPU hours, which would allow, for example, a 2,208-core job to run for 5 days.
Jobs that violate these limits will be held in the queue until currently running jobs complete and the number of utilized processors, the number of utilized GPUs, and/or the remaining CPU time falls below the thresholds. We have updated our documentation to reflect these changes, which you may view here.
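To help estimate whether a planned job fits within the new limits, the arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the scheduler's actual logic: the function and parameter names are hypothetical, the limits are assumed to be inclusive (consistent with the 2,208-core, 5-day example), and a group's CPU-hour usage is assumed to be the remaining CPU hours of its running jobs.

```python
# Limits quoted from the list above; the constant names are illustrative only.
GROUP_CPU_LIMIT = 6_000          # concurrent processors per research group
USER_GPU_LIMIT = 32              # concurrent GPUs per user
GROUP_CPU_HOUR_LIMIT = 300_000   # concurrent CPU hours per research group
JOB_CPU_HOUR_LIMIT = 264_960     # CPU hours per job


def cpu_hours(processors: int, walltime_hours: float) -> float:
    """CPU hours of a job: requested processors * walltime."""
    return processors * walltime_hours


def job_can_start(
    processors: int,
    walltime_hours: float,
    gpus: int = 0,
    group_cpus_in_use: int = 0,
    group_cpu_hours_in_use: float = 0.0,
    user_gpus_in_use: int = 0,
) -> bool:
    """Return True if the job would fit under all four limits right now.

    Assumption: a job that does not fit is held in the queue rather than
    rejected, so False here means "would wait", not "would fail".
    """
    job_hours = cpu_hours(processors, walltime_hours)
    if job_hours > JOB_CPU_HOUR_LIMIT:
        return False  # exceeds the per-job CPU-time capacity
    if group_cpus_in_use + processors > GROUP_CPU_LIMIT:
        return False  # group would exceed its concurrent processor limit
    if user_gpus_in_use + gpus > USER_GPU_LIMIT:
        return False  # user would exceed the concurrent GPU limit
    if group_cpu_hours_in_use + job_hours > GROUP_CPU_HOUR_LIMIT:
        return False  # group would exceed its concurrent CPU-hour capacity
    return True


if __name__ == "__main__":
    # The example from the list: 2,208 cores for 5 days (120 hours).
    print(cpu_hours(2208, 5 * 24))
    print(job_can_start(2208, 5 * 24))
```

For instance, under these assumptions a 1,000-core, 100-hour job (100,000 CPU hours) would wait if the group's running jobs already account for 250,000 remaining CPU hours, since the total would exceed 300,000.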
Again, the changes listed above are taking effect today, December 14, 2020. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
All the best,
The PACE Team