

[Resolved] Emergency Shutdown of all Compute Nodes, Schedulers, and Login Nodes in Rich Data Center

Posted on Saturday, 27 June, 2020

[Update – June 28, 2020, 2:42pm]

We are following up with another update. The cooling on campus is currently configured to support buildings as well as possible, but it is not "normal" operation. Facilities has indicated that we should be able to resume operation.

According to the most recent news from Atlanta Water, they have isolated the 36″ water main failure and are working on the repairs, which may conclude late Wednesday at the earliest and Friday at the latest.

State of PACE: We have brought the compute nodes online along with the remaining services. As is typical, a few nodes require specific manual attention, and we will continue working to bring back those stragglers. We will contact the users whose jobs were terminated by yesterday's emergency shutdown, and we encourage all users to verify their recent jobs. Again, our storage system did not lose data.

Monitoring and Risk: OIT Operations staff will continue to monitor the temperature and cooling systems and will alert us to any major change. PACE will remain on standby to shut down services again should we be unable to maintain cooling.

The Coda data center, which houses the TestFlight-Coda and Hive clusters, and our backup data facilities are not affected by this outage.

Thank you again for your patience while we address emergency operations.

[Update – June 27, 2020, 9:36pm]

Water pressure and cooling have been partially restored at the Rich data center. During this emergency shutdown, our storage did not experience data loss. At this time, we have partially restored services to cluster login nodes, and we continue to work on restoring the gryphon login node. We have restored storage, schedulers, and data mover/Globus services.

For safety, we will keep the compute nodes offline overnight; we aim to begin restoring them, along with any remaining services, on Sunday, June 28.

Thank you for your patience as we work through this incident.


[Original Note – June 27, 2020, 4:22pm]

Dear PACE Users,

There has been a water main break on a 36-inch transmission main at Ferst Dr NW and Hemphill Ave NW, causing a loss of water pressure to the campus chiller plants that provide cooling to the Rich and other data centers. GT Facilities is in the process of shutting down the chiller plants. The Operations team is monitoring the temperature in Rich and beginning to deploy spot chillers.

This issue does not impact the CODA data center (Hive and TestFlight-Coda clusters).

We are initiating an emergency shutdown of Rich resources to prevent overheating. This will impact running jobs. We will keep storage systems online as long as possible, but may need to power them down as the situation requires.

Please save your work if possible, and refrain from submitting new jobs. We'll keep you updated via email and the PACE blog as we continue to monitor developments.

[RESOLVED] Rich data center storage problems (/usr/local) – Paused Jobs

Posted on Tuesday, 14 April, 2020

Dear PACE Users,

At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to full functionality. The problems addressed over the course of this fix include:

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue),  please resubmit the job. If you have any questions or concerns, please contact

Thank you.

[Original Message]

In addition to this morning's ongoing project/data and scratch storage problems, the fileserver that serves the shared "/usr/local" filesystem on all PACE machines in the Rich data center began experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which resides in "/usr/local/pacerepov1")
  • Crashes of newly started jobs that run applications from the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.

At this point, we have paused all schedulers for Rich-based resources.   With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved.   Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact

This storage problem and scheduler pause does not impact Coda data center’s Hive and TestFlight-Coda cluster.

We are working to resolve these problems ASAP and will keep you updated on this post.

Hive Cluster — Scheduler modifications/Policy Update

Posted on Friday, 3 April, 2020

Dear Hive Users,

The Hive cluster has been in production for over half a year, and we are pleased with the continued growth of its user community and its consistently high utilization. As the cluster more frequently nears 100% utilization, we have received feedback from users that compels us to make additional changes to ensure continued productivity for all Hive users. Hive PIs have approved the following changes, which will be deployed on April 10:

  1. Hive-gpu-short: We are creating a new GPU queue with a maximum walltime of 12 hours. This queue will consist of 2 nodes migrated from the hive-gpu queue. It will address the longer job wait times that some users have experienced on hive-gpu, and it will support users running short, interactive, or machine-learning jobs.
  2. Adjust dynamic priority: We will adjust the dynamic priority to account for PI groups in addition to individual users. This provides an equal and fair opportunity for each research team to access the cluster.
  3. Hive-interact: We will reduce the hive-interact queue from 32 nodes to 16 due to its low utilization.

Who is impacted:  All Hive users will be impacted by the adjustment to the dynamic priority.

User Action: Users who want to use the new hive-gpu-short queue will need to update their PBS scripts to specify the new queue; requested walltime may not exceed 12 hours.
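For illustration, a job script targeting the new queue might look like the sketch below. Only the queue name and 12-hour cap come from this announcement; the job name, resource request, and workload are hypothetical.

```shell
#!/bin/bash
# Hypothetical PBS script for the new hive-gpu-short queue.
# Queue name and walltime cap come from this announcement; all other values are illustrative.
#PBS -N short-gpu-job
#PBS -q hive-gpu-short           # the new queue (12-hour walltime cap)
#PBS -l nodes=1:ppn=6:gpus=1     # illustrative resource request
#PBS -l walltime=12:00:00        # must not exceed 12 hours

cd "$PBS_O_WORKDIR"
python train.py                  # placeholder workload
```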

Our documentation will be updated on April 10 to reflect these queue changes. If you have any questions, please don't hesitate to contact us at

The PACE Team

Emergency Firewall Maintenance

Posted on Thursday, 26 March, 2020

Dear Researchers,

The GT network team will undertake an emergency code upgrade on the departmental Palo Alto firewalls beginning at 8pm tonight. Because this is a high-availability pair of devices, the upgrade should not cause a major disruption to traffic to or from the PACE systems. The same upgrade has already been completed successfully on other firewall devices with the same hardware and software versions, and no disruptions were observed.

With that said, connections to the PACE login servers may see a temporary interruption between 8pm and 11pm TONIGHT as the firewalls are upgraded. This should not impact running jobs unless a request to a license server elsewhere on campus (e.g., abaqus) happens to coincide with the exact moment of the firewall changeover. Additionally, users may experience interruptions in their interactive sessions (e.g., edit sessions, screen, VNC jobs, Jupyter notebooks). Batch jobs already scheduled and/or running on the clusters should otherwise progress normally.

Please check the status and completion of your jobs that have run this evening for any unexpected errors and re-submit should you believe an interruption was the cause.  We apologize in advance for any inconvenience this required emergency code upgrade may cause.

You may follow the status of this maintenance at GT's status page.

As always, if you have any questions, please don’t hesitate to contact us at .


The PACE Team


PACE Operations Update — COVID-19

Posted on Thursday, 12 March, 2020

[UPDATE – 03/19/2020]

Dear Researchers,

This is a brief update to our prior communication about the COVID-19 situation, which we are carefully monitoring. In light of the recent communication from the Office of the Executive Vice President for Research regarding the research ramp-down plan, please rest assured that PACE will continue normal operations of our resources. We will continue to provide support during this period.

Regarding PACE training classes, we have modified our classes to offer them virtually via BlueJeans; this week we hosted our first two virtual classes, Linux 101 and Optimization 101. Please visit our training site for upcoming classes you may register for, and our Research Scientists will be in touch with instructions for accessing the classes virtually. Additionally, our consulting sessions will be offered virtually as scheduled. You may check our "Upcoming Events" section for the virtual coordinates of upcoming consulting sessions.

Also, as a clarification point about the new campus VPN (Global Protect), this is a new service that is in early deployment/testing phase, and the new VPN is NOT replacing the current campus VPN (i.e., Cisco’s AnyConnect). At this time, they are operating in parallel, and you may use either of the VPN services to connect to PACE resources.

Overall, given the challenges that COVID-19 has presented, we want to reassure our community that we are here to support your computational research; please do not hesitate to contact us at if you have any questions or concerns.

Warm regards, and stay safe.

The PACE Team

[UPDATE – 03/13/2020]. As a brief update to yesterday's message, the new VPN (GlobalProtect) is a new service that is still undergoing testing. It is intended to help with the anticipated increase in demand, but it is NOT replacing the current campus VPN (i.e., the Cisco AnyConnect you've been using). At this time, they are operating in parallel, and you may use either VPN service to connect to PACE resources.

[Original Message – 03/12/2020]

Dear Researchers,

PACE is carefully monitoring developments with the COVID-19 situation including the recent message from President Cabrera announcing that GT is moving to online/distance instruction after spring break.  We want to reassure the community that PACE will continue normal operations.

Given the anticipated increase in demand on our VPN infrastructure, please follow the instructions for accessing OIT's recently deployed Next Generation Campus VPN, which will help you access PACE resources.

If you have any questions or concerns, you may reach us via


The PACE Team


[Resolved] Shared Scheduler for Shared Clusters is Down

Posted on Monday, 10 February, 2020

Functionality has been restored to the shared cluster as of 3:00pm, and jobs are being ingested and run as normal.

As always, if you have any questions or concerns, please don’t hesitate to contact us at

We apologize for this inconvenience, and appreciate your patience and attention.

[Original Note – Feb 10, 2020 @ 2:40pm] The Shared Scheduler has gone down. This came to our attention at around 2:00pm. The PACE team is investigating the issue and will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on the Shared Clusters.

PACE Procurement Update and Schedule

Posted on Wednesday, 29 January, 2020

Dear Colleagues,

As you are aware from our prior communications and the recent issue of our PACE Newsletter, the PACE team has been quite busy. We've deployed the Hive cluster, a state-of-the-art resource funded by the NSF; we continue to expand our team to provide an even higher level of service to our community; and we are preparing the CODA data center to receive research workloads migrated from the Rich data center. We will follow up with you on this last point very soon. Today, we are reaching out to inform you about the PACE purchasing schedule for the remainder of FY20 and to provide an update on how recent changes in procurement requirements have impacted our timelines, as I'm sure you have seen in your departments as well.

First, the situation with procurement. The sizable orders we place on behalf of the faculty have come under increased scrutiny. This added complexity has resulted in much more time devoted to compliance, and the flexibility we once enjoyed is no longer achievable. More significantly, each order we place now requires a competitive bid process. As a result, our first order of the year, FY20-Phase1, has been considerably delayed and is still in the midst of a bid process. We have started a second order, FY20-Phase2, in parallel to address situations of urgent need and expiring funds, and we are preparing to begin the bid process for that order shortly. An important point to note is that purchases of PACE storage are not affected. Storage can be added as needed via request to

Given the extended time required to process orders, we have time for only one more order before the year-end deadlines are upon us. We will accept letters of intent to participate in FY20-Phase3 from now through February 20, 2020. We will need complete specifications, budgets, account numbers, etc. by February 27, 2020. Please see the schedule below for further milestones. This rapidly approaching deadline is necessary for us to process this order in time to use FY20 funds. Due to the bidding process, we will have reduced ability to make configuration changes after the "actionable requests" period, and by extension, reduced ability to communicate precise costs in advance. We will continue to provide budgetary estimates, and final costs will be communicated after bids are awarded.

Please know that we are doing everything possible to best advocate for the research community and navigate the best way through these difficulties.


  • February 20: Intent to participate in FY20-Phase3 due to
  • February 27: All details due to PACE (configuration, quantity, not-to-exceed budget, account number, financial contact, queue name)
  • April 22: Anticipated date to award bid
  • April 29: Anticipated date to finalize quote with selected vendor
  • May 6: Exact pricing communicated to faculty, all formal approvals received
  • May 8: Requisition entered into Workday
  • May 22: GT Procurement issues purchase order to vendor
  • July 31: Vendor completes hardware installation and handover to PACE
  • August 28: PACE completes acceptance testing, resources become ready for research


To view the published schedule or for more information, visit or email

Going forward, the PACE Newsletter will be published quarterly at

Best Regards,

– The PACE Team


Hive Cluster Scheduler Down

Posted on Monday, 27 January, 2020

The Hive scheduler was restored at around 2:20PM. The scheduler services had crashed; we restored them successfully and put measures in place to prevent a similar recurrence. Some user jobs may have been impacted during this scheduler outage. Please check your jobs, and if you have any questions or concerns, please don't hesitate to contact us at

Thank you again for your patience, and we apologize for the inconvenience.

[Original Note — January 27, 2020, 2:16PM] The Hive scheduler has gone down. This came to our attention at around 1:40pm. The PACE team is investigating the issue and will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on Hive.

If you have any questions or concerns, please don’t hesitate to contact us at

We apologize for this inconvenience, and appreciate your patience and attention.


[Re-Scheduled] Hive Cluster — Policy Update

Posted on Tuesday, 7 January, 2020

After the deployment of the Hive cluster this fall, we are pleased with the rapid growth of our user community and the cluster's rapidly increasing utilization. During this period, we have received user feedback that compels us to make changes to further increase productivity for all Hive users. Hive PIs have approved the following changes, which were deployed on January 9:

  1. Hive-gpu: The maximum walltime for jobs on hive-gpu will be decreased from the current 5-day maximum to 3 days, to address the longer job wait times that users have experienced on the hive-gpu queue.
  2. Hive-gpu: To ensure that GPUs do not sit idle, jobs will not be permitted to use a CPU:GPU ratio higher than 6:1 (i.e., 6 cores per GPU). Each hive-gpu node has 24 CPUs and 4 GPUs.
  3. Hive-nvme-sas: We will create a new queue, hive-nvme-sas, that combines and shares compute nodes between the hive-nvme and hive-sas queues.
  4. Hive-nvme-sas, hive-nvme, hive-sas: We will increase the maximum walltime for jobs on the hive-nvme, hive-sas, and hive-nvme-sas queues from the current 5-day maximum to 30 days.
  5. Hive-interact: A new interactive queue, hive-interact, will be created. It provides access to 32 Hive compute nodes (192 GB RAM and 24 cores each) for quick access to resources for testing and development. The walltime limit will be 1 hour.
  6. Hive-priority: A new hive-priority queue will be created, reserved for researchers with time-sensitive research deadlines. For access, please communicate the relevant dates/upcoming deadlines to the PACE team so we can obtain the necessary approvals to grant you access. Please note that we may not be able to accommodate requests made less than 14 days before the resource is needed, due to jobs already running at the time of the request.
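As an illustration, an interactive session on the new hive-interact queue could be requested with a command along these lines; the exact resource syntax is an assumption based on the queue description above.

```shell
# Hypothetical request for a 1-hour interactive session on hive-interact;
# the node/core counts are illustrative, within the queue's stated limits.
qsub -I -q hive-interact -l nodes=1:ppn=24,walltime=01:00:00
```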

Who is impacted:

  • All Hive users who use hive-gpu, hive-nvme and hive-sas queues
  • The new queues that are created will benefit, and thereby impact, all Hive users.

User Action:

  • Users will need to update their PBS scripts to reflect the new walltime limits and the CPU:GPU ratio requirement on the hive-gpu queue.
  • These changes will not impact currently running jobs.
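For example, a hive-gpu job script conforming to the new limits might look like this sketch; everything beyond the queue name, the walltime cap, and the 6:1 ratio is an illustrative assumption.

```shell
#!/bin/bash
# Hypothetical PBS script respecting the updated hive-gpu policy.
#PBS -N gpu-job
#PBS -q hive-gpu
#PBS -l nodes=1:ppn=12:gpus=2    # 12 cores for 2 GPUs keeps the ratio at the 6:1 maximum
#PBS -l walltime=72:00:00        # within the new 3-day maximum

cd "$PBS_O_WORKDIR"
./run_gpu_workload               # placeholder application
```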


We would like to remind you of the upcoming Hive cluster outage due to the creation of a Micro Grid power generation facility. At 8 AM on Monday, January 20th (a Georgia Tech holiday for MLK Day), the Hive cluster will be shut down for an anticipated 24 hours. A reservation has been put in place on all Hive nodes during this period; any submitted user jobs that would overlap with this outage will receive a warning to that effect and will be held until after the outage completes. A similar warning will be generated for jobs overlapping with the upcoming cluster maintenance on February 27.

The planned outage of the CODA data center has since been re-scheduled, so the Hive cluster will remain available until the next PACE maintenance period on February 27. The reservation has been removed, and work will proceed as usual on January 20.

Our documentation has been updated to reflect these changes and queue additions. If you have any questions, please do not hesitate to contact us at

[Reminder] Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 18 October, 2019

This is a friendly reminder that our updated policy impacting Shared Clusters at PACE will take effect on October 29, 2019.

On October 29, 2019, we are reducing the limit on the number of queued/running jobs per user to 500.

Who is impacted? All researchers connecting to PACE resources via a login-s[X] headnode are affected by this policy change (a list of impacted queues is provided below). We have identified all the researchers/users impacted by these changes and have contacted them on multiple occasions. We have also worked with a number of researchers from different PI groups during our consulting sessions to help them adapt their workflows to the new maximum-jobs-per-user limit. PACE provides and supports multiple solutions, such as job arrays, GNU Parallel, and Launcher, to help users quickly adapt their workflows to this policy update.
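As one illustration of the job-array approach, many independent tasks can be bundled under a single job ID so they count once against the 500-job limit. The script below is a hypothetical sketch; the task count, resource request, and program names are assumptions.

```shell
#!/bin/bash
# Hypothetical PBS job array: 2000 tasks submitted as ONE job,
# staying well under the 500-job per-user limit.
#PBS -N param-sweep
#PBS -l nodes=1:ppn=1,walltime=01:00:00
#PBS -t 1-2000%100               # 2000 array tasks, at most 100 running at once

cd "$PBS_O_WORKDIR"
./simulate --input "case_${PBS_ARRAYID}.dat"   # one task per input file
```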

  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy

Prior to this policy change taking effect on October 29, we have one more consulting session scheduled:

  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

For details about this policy change, please visit our blog post.

Again, the changes listed above will take effect on October 29, 2019. After October 29, users will not be able to have more than 500 jobs queued or running at once.

If you have any questions or concerns, please do not hesitate to contact us at