Dear Researchers,
Given the growth in the number and workload diversity of PACE cluster users, we must adjust our shared clusters' scheduler policy to offset the heavy load our schedulers experience from the rapid growth of our high-throughput computing (HTC) user community, whose members submit thousands of jobs at a time. As the figure below shows, our shared cluster is under its heaviest load from the sheer volume of HTC jobs it receives, which currently exceeds 100,000 jobs. Our current policy allows each user to submit a maximum of 3,000 jobs. We propose to reduce this maximum to 500 (i.e., 500 total queued and running jobs per user). This will substantially reduce the load on the scheduler and improve overall performance, providing a sustainable and improved user experience by largely preventing hanging commands, jobs, and errors when submitting or checking on a job.
More specifically, we are making the following changes:
- Max 500 total jobs (queued/running) per user
- Effective October 29, 2019
Who is impacted:
- If you log in to a login-s[X] headnode, this change will impact you.
- The impacted queues are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
- Out of nearly 3,000 users across nearly 300 PI research groups, we identified 84 users from 36 PI groups who will be impacted by this change.
We understand that this is a drastic decrease in the number of jobs a user may submit, but with our proposed improvements to your workflows, you will benefit from better scheduler performance and gain more computational time by submitting jobs more efficiently. To that end, we are providing custom consulting sessions to help researchers adapt their workflows to the new job-submission limit. We also have multiple solutions in place, for example job arrays, GNU parallel, and the HTC Launcher, that allow users to quickly adapt their workflows to this policy update. Currently, we have the following consulting sessions scheduled, with additional ones to be announced (check back here for updates).
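As one illustration of the job-array approach, many individual submissions of the same program over different inputs can often be collapsed into a single array job, which counts as far fewer scheduler objects than thousands of separate jobs. The sketch below is a minimal, hypothetical example assuming a PBS/Torque-style scheduler (the `#PBS -t` directive and `$PBS_ARRAYID` variable); the script name, resource requests, and `process_input` program are placeholders for your own workflow.

```
#PBS -N my_array_job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -t 1-1000

# One array submission replaces 1000 individual qsub calls.
# Each array element receives its own index in $PBS_ARRAYID,
# which selects the input file for that element.
cd $PBS_O_WORKDIR
./process_input input_${PBS_ARRAYID}.dat
```

Submitted once with `qsub my_array_job.pbs`, this runs 1,000 tasks while placing far less load on the scheduler than 1,000 separate submissions. Please check the consulting sessions or contact us for the array syntax and limits that apply on your specific queue.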
Upcoming Consulting sessions:
- September 10, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
- September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
- October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
- October 17, 10:00am – 11:00am, Scheller COB 224
- October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
Again, the changes listed above take effect on October 29, 2019. After that date, users will not be able to have more than 500 total jobs queued or running.
If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
Best,
The PACE Team