GT Home : : Campus Maps : : GT Directory

Author Archive

Hive Cluster Status 10/3/2019

Posted by on Thursday, 3 October, 2019

This morning we noticed that the Hive nodes were all marked offline. After a short investigation, it was discovered that there was an issue with configuration management, which we have since corrected. Ultimately, the impact from this should be negligible, as jobs that were running continued accordingly, while queued jobs were simply held until resources became available. After our correction, these jobs started as expected. Nonetheless, we want to ensure that the Hive cluster provides performance and reliability as expected, so if you find any problems in your workflow due to this minor hiccup, or for any problems you encounter moving forward, please do not hesitate to send an email to pace-support@oit.gatech.edu.

OIT Planned Maintenance

Posted by on Wednesday, 25 September, 2019

The OIT Network Services team will be performing a software upgrade on our campus Carrier-Grade NAT (CGN) appliances this week – see OIT Status for a full description. The affected subnet is the out of band management of the Hive/MRI servers; additionally, only internet-bound connections are being serviced. As such, no failures are expected for users of the Hive/MRI servers. Nonetheless, if you encounter connectivity issues to Hive resources, please do not hesitate to contact pace-support@oit.gatech.edu for assistance.

[Resolved] Shared Clusters Scheduler Down

Posted by on Tuesday, 10 September, 2019

[Update – 9/10/2019 3:52 PM]

The shared scheduler has been restored to functionality. The issue stemmed from a large influx of jobs (>100,000) in less than 24 hours. As a reminder, the upcoming policy change on October 29, 2019, which limits the number of job submissions to 500 per user, is designed to mitigate this issue moving forward. If you feel your workflow may be impacted, please take the opportunity to read the documentation on parallel job solutions (job arraysGNU parallel and HTC Launcher) developed and put in place for users to quickly adapt their workflows accordingly. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to submit more than 500 jobs.

[Original]

The shared scheduler experienced an out-of-memory issue this morning at 7:44 AM, resulting in a hold on all jobs in the queues managed by this scheduler. This issue affects all users who submit jobs to PACE via the shared clusters headnodes (login-s). Currently you will experience hanging jobs when submitting a job. We ask that you refrain from submitting any new jobs until further notice while PACE team investigates the matter and restores functionality to the scheduler.

Currently the following queues are affected by this scheduler issue: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biocluster-6,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,cns-24c,cns-48c,cns-6-intel,cnsforce-6,critcel,critcel-prv,critcelforce-6,cygnus,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,gaanam,gaanam-h,gaanamforce,gpu-eval,habanero,habanero-gpu,hummus,hydra-gpu,hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,lfgroup,math-6,mathforce-6,mcg-net,mcg-net-gpu,mday-test,metis,micro-largedata,microcluster,optimus,optimusforce-6,prometforce-6,prometheus,pvh,pvhforce,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,try-6,trybuy

We apologize for this inconvenience, and we appreciate your attention and patience.