

[RESOLVED] Rich data center storage problems (/usr/local) – Paused Jobs

Posted by on Tuesday, 14 April, 2020

Dear PACE Users,

At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include:

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure of the primary. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive that was replaced as part of a CAB “standard change” of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, a license file was missing on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that, due to the failed storage access, jobs running during this outage may have failed. Please inspect the results of recently completed jobs to ensure correctness; if an unexplained failure occurred (e.g., a job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Thank you.

[Original Message]

In addition to this morning’s ongoing project/data and scratch storage problems, the fileserver that serves the shared “/usr/local” on all PACE machines located in the Rich Data Center has started experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashes of newly started jobs that run applications from the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it is very difficult to predict exactly how different applications will be impacted.

At this point, we have paused all schedulers for Rich-based resources.   With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved.   Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

This storage problem and scheduler pause do not impact the Coda data center’s Hive and TestFlight-Coda clusters.

We are working to resolve these problems ASAP and will keep you updated on this post.

Hive Cluster — Scheduler modifications/Policy Update

Posted by on Friday, 3 April, 2020

Dear Hive Users,

The Hive cluster has been in production for over half a year, and we are pleased with the continued growth of its user community and the cluster’s consistently high utilization. As the cluster has begun to near 100% utilization more frequently, we have received additional feedback from users that compels us to make further changes to ensure continued productivity for all users of Hive. Hive PIs have approved the following changes, which will be deployed on April 10:

  1. Hive-gpu-short: We are creating a new GPU queue with a maximum walltime of 12 hours. This queue will consist of 2 nodes migrated from the hive-gpu queue. This will address the longer job wait times that some users have experienced on the hive-gpu queue, and it will support users with short, interactive, or machine-learning jobs as they develop and grow on this cluster.
  2. Adjust dynamic priority: We will adjust the dynamic priority to reflect PI groups in addition to individual users. This will provide an equal and fair opportunity for each research team to access the cluster.
  3. Hive-interact: We will reduce the hive-interact queue from the current 32 nodes to 16 nodes due to its low utilization.

Who is impacted:  All Hive users will be impacted by the adjustment to the dynamic priority.

User Action:  Users who want to use the new hive-gpu-short queue will need to update their PBS scripts to specify the new queue, with a walltime not exceeding 12 hours.
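For example, a job script header for the new queue might look like the following sketch (the queue name and 12-hour limit come from this announcement; the job name, resource counts, and executable are illustrative):

```shell
#PBS -N short-gpu-job          # job name (illustrative)
#PBS -q hive-gpu-short         # the new queue announced above
#PBS -l nodes=1:ppn=6:gpus=1   # illustrative resource request
#PBS -l walltime=12:00:00      # must not exceed the 12-hour limit

cd $PBS_O_WORKDIR              # run from the submission directory
./my_gpu_application           # hypothetical executable
```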

Our documentation will be updated on April 10 to reflect these queue changes; you can access it at http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team

Emergency Firewall Maintenance

Posted by on Thursday, 26 March, 2020

Dear Researchers,

The GT network team will undertake an emergency code upgrade on the departmental Palo Alto firewalls beginning at 8pm tonight. Because this is a high-availability pair of devices, this upgrade should not cause a major disruption to any traffic to or from the PACE systems. The same upgrade has already been completed successfully on other firewall devices with the same hardware and software versions, where it caused no disruptions.

With that said, there is a possibility that connections to the PACE login servers may see a temporary interruption between 8pm and 11pm TONIGHT as the firewalls are upgraded. This should not impact any running jobs unless a request to a license server elsewhere on campus (e.g., abaqus) happens to coincide with the exact moment of the firewall changeover. Additionally, there is a possibility that users may experience interruptions during interactive sessions (e.g., edit sessions, screen, VNC jobs, Jupyter notebooks). Batch jobs that are already scheduled and/or running on the clusters should otherwise progress normally.

Please check the status and completion of jobs that ran this evening for any unexpected errors, and resubmit should you believe an interruption was the cause. We apologize in advance for any inconvenience this required emergency code upgrade may cause.

You may follow the status of this maintenance at GT’s status page.

As always, if you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

 

PACE Operations Update — COVID-19

Posted by on Thursday, 12 March, 2020

[UPDATE – 03/19/2020]

Dear Researchers,

This is a brief update on our prior communication about the COVID-19 situation, which we are carefully monitoring. In light of the recent communication from the Office of the Executive Vice President for Research regarding the research ramp-down plan, please rest assured that PACE will continue normal operations of our resources. We will continue to provide support during this period.

Regarding PACE training classes, we have modified our classes to offer them virtually via BlueJeans, and this week we hosted our first two virtual classes, Linux 101 and Optimization 101. Please visit our training site for upcoming classes you may register for; our Research Scientists will be in touch with instructions for accessing the classes virtually. Additionally, our consulting sessions will be offered virtually as scheduled. You may check our “Upcoming Events” section for the virtual coordinates of upcoming consulting sessions.

Also, as a point of clarification about the new campus VPN (GlobalProtect): this is a new service in an early deployment/testing phase, and it is NOT replacing the current campus VPN (i.e., Cisco’s AnyConnect). At this time, they are operating in parallel, and you may use either VPN service to connect to PACE resources.

Overall, given the challenges that COVID-19 has presented, we want to reassure our community that we are here for you to support your computational research, and please do not hesitate to contact us at pace-support@oit.gatech.edu if you have any questions or concerns.

Warm regards, and stay safe.

The PACE Team

[UPDATE – 03/13/2020]  As a brief update to yesterday’s message: the new VPN (GlobalProtect) is a new service that is still undergoing testing. It is intended to help with the anticipated increase in demand, but it is NOT replacing the current campus VPN (i.e., the Cisco AnyConnect you’ve been using). At this time, they are operating in parallel, and you may use either VPN service to connect to PACE resources.

[Original Message – 03/12/2020]

Dear Researchers,

PACE is carefully monitoring developments with the COVID-19 situation including the recent message from President Cabrera announcing that GT is moving to online/distance instruction after spring break.  We want to reassure the community that PACE will continue normal operations.

Given the anticipated increase in demand of our VPN infrastructure, please follow the instructions on accessing OIT’s recently deployed Next Generation Campus VPN that will help you access PACE resources.

If you have any questions or concerns, you may reach us via pace-support@oit.gatech.edu

Best,

The PACE Team

 

[Resolved] Shared Scheduler for Shared Clusters is Down

Posted by on Monday, 10 February, 2020

Functionality has been restored to the shared cluster as of 3:00pm, and jobs are being ingested and run as normal.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

[Original Note – Feb 10, 2020 @ 2:40pm] The Shared Scheduler has gone down; this came to our attention at around 2:00pm. The PACE team is investigating the issue and will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on the Shared Clusters.

PACE Procurement Update and Schedule

Posted by on Wednesday, 29 January, 2020

Dear Colleagues,

As you are aware from our prior communications and the recent issue of our PACE Newsletter, the PACE team has been quite busy. We’ve deployed the Hive cluster, a state-of-the-art resource funded by the NSF; we continue to expand our team to provide an even higher level of service to our community; and we are preparing the CODA data center to receive research workloads migrated from the Rich data center. We will follow up with you on this last point very soon. Today, we are reaching out to inform you about the PACE purchasing schedule for the remainder of FY20 and to provide an update on how recent changes in procurement requirements have impacted our timelines, as I’m sure you have seen in your departments as well.

First, the situation with procurement. The sizable orders we place on behalf of the faculty have come under increased scrutiny. This added complexity has resulted in much more time devoted to compliance, and the flexibility we once enjoyed is no longer achievable. More significantly, each order we place now requires a competitive bid process. As a result, our first order of the year, FY20-Phase1, has been considerably delayed and is still in the midst of a bid process. We have started a second order, FY20-Phase2, in parallel to address situations of urgent need and expiring funds, and we are preparing to begin the bid process for this order shortly. An important point to note is that purchases of PACE storage are not affected; storage can be added as needed via request to pace-support@oit.gatech.edu.

Given the extended time required to process orders, we have time for only one more order before the year-end deadlines are upon us. We will accept letters of intent to participate in FY20-Phase3 from now through February 20, 2020, and we will need complete specifications, budgets, account numbers, etc. by February 27, 2020. Please see the schedule below for further milestones. This rapidly approaching deadline is necessary for us to process this order in time to use FY20 funds. Due to the bidding process, we will have reduced ability to make configuration changes after the “actionable requests” period and, by extension, reduced ability to communicate precise costs in advance. We will continue to provide budgetary estimates, and final costs will be communicated after bids are awarded.

Please know that we are doing everything possible to best advocate for the research community and navigate the best way through these difficulties.

 

February 20: Intent to participate in FY20-Phase3 due to pace-support@oit.gatech.edu
February 27: All details due to PACE (configuration, quantity, not-to-exceed budget, account number, financial contact, queue name)
April 22: Anticipated date to award bid
April 29: Anticipated date to finalize quote with selected vendor
May 6: Exact pricing communicated to faculty; all formal approvals received
May 8: Requisition entered into Workday
May 22: GT Procurement issues purchase order to vendor
July 31: Vendor completes hardware installation and handover to PACE
August 28: PACE completes acceptance testing; resources become ready for research

 

To view the published schedule or for more information, visit http://pace.gatech.edu/participation or email pace-support@oit.gatech.edu

Going forward, the PACE Newsletter will be published quarterly at  https://pace.gatech.edu/pace-newsletter.

Best Regards,

– The PACE Team

 

Hive Cluster Scheduler Down

Posted by on Monday, 27 January, 2020

The Hive scheduler was restored at around 2:20PM. The scheduler services had crashed; we restored them successfully and have put measures in place to prevent a similar recurrence in the future. Some user jobs may have been impacted during this scheduler outage. Please check your jobs, and if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Thank you again for your patience, and we apologize for the inconvenience.

[Original Note — January 27, 2020, 2:16PM] The Hive scheduler has gone down; this came to our attention at around 1:40pm. The PACE team is investigating the issue and will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on Hive.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

 

[Re-Scheduled] Hive Cluster — Policy Update

Posted by on Tuesday, 7 January, 2020

After the deployment of the Hive cluster this Fall, we are pleased with the rapid growth of our user community on this cluster and its rapidly increasing utilization. During this period, we have received user feedback that compels us to make changes that will further increase productivity for all users of Hive. Hive PIs have approved the following changes, which were deployed on January 9:

  1. Hive-gpu: The maximum walltime for jobs on hive-gpu will be decreased from the current 5-day maximum to 3 days, to address the longer job wait times that users have experienced on the hive-gpu queue.
  2. Hive-gpu: To ensure that GPUs do not sit idle, jobs will not be permitted to use a CPU:GPU ratio higher than 6:1 (i.e., 6 cores per GPU). Each hive-gpu node has 24 CPUs and 4 GPUs.
  3. Hive-nvme-sas: A new queue, hive-nvme-sas, will be created that combines and shares compute nodes between the hive-nvme and hive-sas queues.
  4. Hive-nvme-sas, hive-nvme, hive-sas: The maximum walltime for jobs on the hive-nvme, hive-sas, and hive-nvme-sas queues will be increased from the current 5-day maximum to 30 days.
  5. Hive-interact: A new interactive queue, hive-interact, will be created. This queue provides access to 32 Hive compute nodes (192 GB RAM and 24 cores each) for quick access to resources for testing and development. The walltime limit will be 1 hour.
  6. Hive-priority: A new hive-priority queue will be created, reserved for researchers with time-sensitive research deadlines. For access to this queue, please communicate the relevant dates/upcoming deadlines to the PACE team so that we can obtain the necessary approvals to grant you access. Please note that, because of jobs already running at the time of the request, we may not be able to provide access to the priority queue for requests made less than 14 days before the resources are needed.

Who is impacted:

  • All Hive users who use hive-gpu, hive-nvme and hive-sas queues
  • The newly created queues will benefit, and thereby impact, all Hive users.

User Action:

  • Users will need to update their PBS scripts to reflect the new walltime limits and the CPU:GPU ratio requirement on the hive-gpu queue.
  • These changes will not impact currently running jobs.
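As an illustration, a hive-gpu job script header that satisfies both new limits might look like the following sketch (the queue name, 3-day walltime cap, and 6:1 CPU:GPU ratio come from this announcement; the job name and executable are hypothetical):

```shell
#PBS -N gpu-job                 # job name (illustrative)
#PBS -q hive-gpu                # queue affected by the new limits
#PBS -l nodes=1:ppn=6:gpus=1    # 6 cores per GPU, at the 6:1 maximum
#PBS -l walltime=72:00:00       # 3 days, within the new maximum

cd $PBS_O_WORKDIR               # run from the submission directory
./my_gpu_application            # hypothetical executable
```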

Additionally:

We would like to remind you of the upcoming Hive cluster outage due to the creation of a Micro Grid power generation facility. At 8 AM on Monday, January 20th (a Georgia Tech holiday for MLK Day), the Hive cluster will be shut down for an anticipated 24 hours. A reservation has been put in place on all Hive nodes during this period; any submitted user jobs that would overlap with this outage will receive a warning indicating this and will be held until after the outage is complete. A similar warning will be generated for jobs overlapping with the upcoming cluster maintenance on February 27.

The planned outage of the CODA data center has been re-scheduled, so the Hive cluster will remain available until the next PACE maintenance period on February 27. The reservation has been removed, and work should proceed as usual on January 20.

Our documentation has been updated to reflect these changes and queue additions, and can be found at http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Reminder] Policy Update to Shared Clusters’ Scheduler

Posted by on Friday, 18 October, 2019

This is a friendly reminder that our updated policy impacting Shared Clusters at PACE will take effect on October 29, 2019.

On October 29, 2019, we are reducing the limit of number of queued/running jobs per user to 500.

Who is impacted? All researchers connecting to PACE resources via the login-s[X].pace.gatech.edu headnodes are impacted by this policy change (we have also provided a list of impacted queues below). We have identified all the researchers/users who are impacted by these changes and have contacted them on multiple occasions. We have worked with a number of researchers from different PI groups during our consulting sessions to help them adapt their workflows to the new max-jobs-per-user limit. PACE provides and supports multiple solutions, such as job arrays, GNU Parallel, and Launcher, to help users quickly adapt their workflows to this policy update.

  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
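One common way to adapt a many-job workflow, as mentioned above, is a PBS job array, which groups many similar tasks under a single submission. A minimal sketch (script, task count, and file names are hypothetical):

```shell
#PBS -N array-example         # job name (illustrative)
#PBS -t 1-2000%50             # 2000 array tasks, at most 50 running at once

cd $PBS_O_WORKDIR             # run from the submission directory
# Each array task processes one input file, selected by its index
./process_sample input_${PBS_ARRAYID}.dat
```

This replaces 2000 individual `qsub` calls with one submission; the `%50` slot limit throttles how many tasks run concurrently.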

Prior to this policy change taking effect on October 29,  we have one more consulting session that’s scheduled on:

  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

For details about our policy change, please visit our blog post.

Again, the changes listed above will take effect on October 29, 2019. After October 29, users will not be able to have more than 500 jobs queued or running at once.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Preventative Maintenance for UPS units at Rich Data Center

Posted by on Friday, 11 October, 2019

OIT will be performing annual preventative maintenance on the UPS units for the Rich Data Center on Saturday, October 12, from 7:00AM to about 5:00PM. No outage is expected from this work; however, in the case of an outage, PACE clusters and the jobs running on them are at risk of being interrupted. Again, it is unlikely that we will have a power outage during this maintenance period.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.