GT Home : : Campus Maps : : GT Directory

[Resolved] Shared Clusters Scheduler Down

This entry was posted by on Tuesday, 10 September, 2019 at

[Update – 9/10/2019 3:52 PM]

The shared scheduler has been restored to functionality. The issue stemmed from a large influx of jobs (>100,000) in less than 24 hours. As a reminder, the upcoming policy change on October 29, 2019, which limits the number of job submissions to 500 per user, is designed to mitigate this issue moving forward. If you feel your workflow may be impacted, please take the opportunity to read the documentation on parallel job solutions (job arraysGNU parallel and HTC Launcher) developed and put in place for users to quickly adapt their workflows accordingly. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to submit more than 500 jobs.

[Original]

The shared scheduler experienced an out-of-memory issue this morning at 7:44 AM, resulting in a hold on all jobs in the queues managed by this scheduler. This issue affects all users who submit jobs to PACE via the shared clusters headnodes (login-s). Currently you will experience hanging jobs when submitting a job. We ask that you refrain from submitting any new jobs until further notice while PACE team investigates the matter and restores functionality to the scheduler.

Currently the following queues are affected by this scheduler issue: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biocluster-6,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,cns-24c,cns-48c,cns-6-intel,cnsforce-6,critcel,critcel-prv,critcelforce-6,cygnus,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,gaanam,gaanam-h,gaanamforce,gpu-eval,habanero,habanero-gpu,hummus,hydra-gpu,hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,lfgroup,math-6,mathforce-6,mcg-net,mcg-net-gpu,mday-test,metis,micro-largedata,microcluster,optimus,optimusforce-6,prometforce-6,prometheus,pvh,pvhforce,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,try-6,trybuy

We apologize for this inconvenience, and we appreciate your attention and patience.

Comments are closed.