
Author Archive

[Reminder] Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 18 October, 2019

This is a friendly reminder that our updated policy impacting Shared Clusters at PACE will take effect on October 29, 2019.

On October 29, 2019, we are reducing the limit on the number of queued/running jobs per user to 500.

Who is impacted? All researchers connecting to PACE resources via a login-s[X].pace.gatech.edu headnode are impacted by this policy change (a list of impacted queues is provided below). We have identified all the researchers/users who are impacted by these changes and have contacted them on multiple occasions. We have also worked with a number of researchers from different PI groups during our consulting sessions to help them adapt their workflows to the new per-user job limit. PACE provides and supports multiple solutions, such as job arrays, GNU parallel, and Launcher, that help users quickly adapt their workflows to this policy update (see the sketch following the queue list below).

  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
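As an illustration of one of these approaches (a minimal sketch, not an official PACE template; the module name, queue, program name, and input list are assumptions), a single batch job can use GNU parallel to work through many independent tasks, so that hundreds of small jobs become a single entry in the scheduler:

    #!/bin/bash
    #PBS -N parallel-tasks
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=4:00:00
    #PBS -q iw-shared-6

    cd $PBS_O_WORKDIR
    module load gnuparallel   # module name/version may differ on your cluster

    # Run one task per line of tasklist.txt, keeping at most 8 running at once.
    parallel -j 8 ./my_analysis {} < tasklist.txt

Here my_analysis is a placeholder for your own executable, and tasklist.txt contains one argument (for example, an input file name) per line.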

Prior to this policy change taking effect on October 29, we have one more consulting session scheduled:

  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

For details about our policy change, please visit our blog post.

Again, the changes listed above will take effect on October 29, 2019. After October 29, users will not be able to have more than 500 jobs queued or running at a time.
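If you are unsure whether your workflow will run into the new limit, you can count how many jobs you currently have queued or running before submitting more. Assuming a Torque/Moab-style scheduler (which the PBS job scripts used on PACE suggest), a command along these lines should work; adjust for your environment:

    # Count your currently queued and running jobs (Torque syntax).
    qselect -u $USER | wc -l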

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Preventative Maintenance for UPS units at Rich Data Center

Posted on Friday, 11 October, 2019

OIT will be performing annual preventative maintenance on the UPS units for the Rich Data Center on Saturday, October 12, from 7:00AM to about 5:00PM. No outage is expected from this work; however, in the event of an outage, PACE clusters and the jobs running on them are at risk of being interrupted. Again, it is unlikely that we will have a power outage during this maintenance period.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

The First LIGO Cluster for Georgia Tech is Ready for Research!

Posted on Friday, 30 August, 2019

Georgia Tech's Partnership for an Advanced Computing Environment (PACE) recently deployed a cluster to support the Nobel Prize-winning Laser Interferometer Gravitational-Wave Observatory (LIGO) project. This project observed the first gravitational waves from the merger of two black holes and, in doing so, confirmed Einstein's predictions according to his theory of general relativity. Beyond the addition of new computational resources, this pioneering work is the first step in integrating Georgia Tech into the Open Science Grid (OSG), a national computational grid that provides shared resources to run massive numbers of small computations.

PACE started working on building a LIGO resource at Georgia Tech shortly after the arrival of Dr. Laura Cadonati, professor of Physics in the Center for Relativistic Astrophysics (CRA), in 2015. The initial proof-of-concept infrastructure was able to accept test jobs from OSG. This initiative yielded great insight into the process of integrating into OSG, and the lessons learned were shared with other institutions such as Syracuse University, which subsequently deployed its own cluster. Based on this successful test, Cadonati procured a new cluster to run production-level LIGO workloads. In deploying this cluster, PACE partnered with a team of experts at the University of Chicago, led by Senior Scientist Robert Gardner, to adopt the latest advancements in the OSG system and software stack.

Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 30 August, 2019

Dear Researchers,

Given the growth in the number and workload diversity of PACE cluster users, we are compelled to adjust our shared clusters' scheduler policy to help offset the extensive load placed on our schedulers by the rapid growth of our high-throughput computing (HTC) user community, whose members submit thousands of jobs at a time. As the figure below shows, our shared cluster is under its heaviest load from the sheer volume of HTC jobs it receives, which currently exceeds 100,000 jobs. Our current policy allows a maximum of 3,000 jobs per user. We are proposing to reduce that maximum to 500 (i.e., 500 total queued/running jobs per user), which will substantially reduce the load on the scheduler, improve overall performance, and provide a sustainable and improved user experience by largely preventing hanging commands, jobs, and errors when submitting or checking on a job.

More specifically, we are making the following changes:

  • Max 500 total jobs (queued/running) per user
  • Effective October 29, 2019

Who is impacted:

  • If you are logging into a login-s[X] headnode, then this change will impact you.
  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
  • Out of our nearly 3,000 users in nearly 300 PI research groups, we identified 84 users from 36 PI groups who will be impacted by this change.

We understand that this is a drastic decrease in the number of jobs a user is able to submit, but with our proposed improvements to your workflows, you will benefit from improved scheduler performance and gain more computational time by submitting jobs more efficiently. With that said, we are providing custom consulting sessions to help researchers adapt their workflows to the new limit on maximum job submissions. Also, we have multiple solutions, for example job arrays, GNU parallel, and HTC Launcher, developed and put in place for users to quickly adapt their workflows given this policy update (see the job array sketch after the session list below). Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 10, 1:00pm – 2:45pm, Molecular Sciences & Engineering Room 1201A
  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 17, 10:00am – 11:00am, Scheller COB 224
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
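As an example of the job array approach mentioned above (a minimal sketch only; the program name, input naming scheme, queue, and resource requests are placeholders, and how array tasks count against the limit depends on PACE's scheduler configuration), a single submission can cover hundreds of related tasks:

    #!/bin/bash
    #PBS -N array-example
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=1:00:00
    #PBS -q iw-shared-6
    #PBS -t 1-500             # 500 array tasks in one submission (Torque syntax)

    cd $PBS_O_WORKDIR

    # Each array task selects its own input file based on its index.
    INPUT=input_${PBS_ARRAYID}.dat
    ./my_program "$INPUT" > output_${PBS_ARRAYID}.log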

Again, the changes listed above will take effect on October 29, 2019. After October 29, users will not be able to have more than 500 jobs queued or running at a time.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Best,

The PACE Team

New PACE Team Members and New Team Member Roles

Posted on Friday, 30 August, 2019

Dear Researchers,

PACE is pleased to announce new additions to the PACE team and to recognize team members who have started new roles at PACE.

In the spring, our Software and Collaboration Support team grew with the addition of Dr. Kevin Manalo. Kevin is a proud Georgia Tech graduate who cannot hold back his excitement about joining PACE, whose clusters he relied on heavily during his PhD research! Kevin comes to PACE as an HPC veteran with experience in HPC support and training from Johns Hopkins University and the state supercomputer centers in Ohio and Alabama.

Over the summer, our Outreach and Faculty Interaction team has grown by three new members, Drs. Aaron Jezghani, Michael Weiner, and Chris Blanton.  As you may have already noticed, they have all hit the ground running as they have been very active in responding to support inquiries and hosting multiple PACE classes and workshops.  To tell you a little bit about our Outreach team members:

Dr. Aaron Jezghani recently defended his PhD in Physics at the University of Kentucky. His research focused on nuclear physics and involved work at both Los Alamos and Oak Ridge National Labs. Throughout his multi-faceted dissertation work, Aaron developed detector readout electronics as well as techniques for acquiring, processing, and analyzing data from the detectors, which is no easy feat.

Dr. Michael Weiner received his undergraduate degree in physics from Yale University and his doctorate, also in physics, from Cornell University. He completed his doctoral research in computational biophysics in the laboratory of Gerald Feigenson, where he focused on Molecular Dynamics simulations of the biophysical chemistry of lipid bilayers as models of cell membranes.

Dr. Chris Blanton earned his Ph.D. from Syracuse University in Computational and Theoretical Chemistry. During his studies, he became deeply interested in computational research and HPC. After graduation, he joined the Pennsylvania State University's Institute for CyberScience. He has worked with some of the most exciting and innovative computational researchers, and he looks forward to sharing and applying his experiences with the Georgia Tech research community.

Also, over the summer, our Cyberinfrastructure team has added two members, and it’s our pleasure to reintroduce to you Trever Nightingale and Ken Suda.

Trever has returned to PACE in his position of Sr. Systems Support Engineer. Trever has a bachelor's degree from Amherst College and a master's degree from the University of Minnesota, and his 20 years of UNIX experience include work at high-performance and research computing centers such as the Naval Research Lab, NERSC, and the Centers for Disease Control (CDC).


Ken has been in IT professionally for almost 35 years and has filled most roles found in an IT organization. For the past couple of years, Ken has been a consultant and has run a game development company. As a consultant, Ken has been a generalist, filling whatever role the team or organization needed.

It is also our great pleasure to announce new roles for our team members Dan (Ann) Zhou, Andre McNeill, and Ruben Lara.

Dan (Ann) Zhou's new role is Research Technologist Storage Architect for PACE. Ann has been a PACE team member since August 2014 and has been an integral part of the PACE cyberinfrastructure team, contributing to the operation, backup, and management of the many PACE storage systems, among other responsibilities. Ann received her bachelor's degree in Electrical Engineering in China and her master's degree in Electrical and Computer Engineering at Tennessee Technological University. She enjoys cooking, running, eating, and traveling.

Andre McNeill's new role is Research Technologist Cloud Architect for PACE. Andre has been a member of PACE for nearly 10 years and continues to be a vital resource for both PACE staff and PACE customers in delivering a robust and reliable research computing environment, including computing, networking, and software systems. A graduate of Purdue University, Andre has many interests within PACE and many outside the workplace, including being a DJ.

Ruben Lara's new role is Systems Support Engineer Manager for PACE. Ruben has been a part of the PACE cyberinfrastructure team since February 2017. Ruben brings excellent managerial and organizational skills and is currently enrolled in MOR Leadership training. Ruben enjoys baseball, ultimate frisbee, rock climbing, and mountain biking. You can find him by the window at the southwest end of the 10th floor of the Coda building.

Please join us in welcoming our new team members and congratulating our recently promoted team members!

Best,
The PACE Team

Release of Updated PACE User Documentation

Posted on Thursday, 29 August, 2019

PACE is pleased to announce the release of our updated PACE User Documentation. While we updated much of our existing guides, we have also added new guides with detailed instructions for various tasks, along with examples such as PBS scripts for the various applications you may want to run as batch or interactive jobs. The documentation is built using GitHub, which helps us better maintain it, and this modular design will mitigate any interruption in service should we decide to upgrade or change the webhosting technology for our overall site.

Over the Fall semester, we will begin phasing out the older documentation from the many pages on the PACE website and redirecting them to the new documentation. As some of you may recall from attending our prior PACE Consulting Sessions and Cluster Orientation classes, your feedback about our documentation (which was in beta at the time) was invaluable, and you will find much of it incorporated in this release. We hope you find this documentation helpful, and if you have any questions or comments, please don't hesitate to let us know.

Again, you may access the new documentation from our home page or via the following links:

https://pace.gatech.edu/pace-user-documentation

or

https://docs.pace.gatech.edu

 

PACE Ready for Research

Posted on Friday, 9 August, 2019

Our August 2019 maintenance ( http://blog.pace.gatech.edu/?p=6511 ) is complete one day ahead of schedule!  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and your data are available.

As usual, there are a small number of straggling nodes that we will address over the coming days.

  • (Complete) Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. VAPOR network will not be affected.
  • (Complete) Additional space will be configured for license server.
  • (Complete) OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • (Complete) OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • (Complete) PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • (Complete) The submit filter for jobs on the RHEL 7 clusters will be modified to allow proper formatting of commands. This filter is not needed on RHEL 6 clusters.
  • (Complete) Upgrade DNS appliances; no downtime is expected due to redundant configuration.

[Complete] PACE Quarterly Maintenance – August 8-10

Posted on Tuesday, 23 July, 2019

[August 9, 2019 Update] Our August 2019 maintenance ( http://blog.pace.gatech.edu/?p=6511 ) is complete one day ahead of schedule!  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and your data are available.

[August 2, 2019 Update]

NO USER ACTION NEEDED ITEMS:

  • Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. VAPOR network will not be affected.
  • Additional space will be configured for license server.
  • OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • The submit filter for jobs on the RHEL 6 clusters will be modified to allow proper formatting of commands. This filter is already in place on RHEL 7 clusters.
  • Upgrade DNS appliances; no downtime is expected due to redundant configuration.

Please send questions and/or comments to pace-support@oit.gatech.edu

 

[July 23, 2019] We are preparing for a maintenance day on August 8 – 10, 2019. This maintenance period is planned for three days, starting on Thursday, August 8, and running through Saturday, August 10.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.  

In general, we will be upgrading all of the RHEL7 production nodes to the latest 7.6 kernel, updating connections to and from PACE routers, and adding disk capacity to our license server. While we are still finalizing the task list and details, none of these tasks is expected to require any user action.

[Resolved] Campus wide Intermittent network outage impacting PACE

Posted on Friday, 12 July, 2019

Today at around 1:55pm, OIT reported campus-wide intermittent network slowness after one of the DNS servers went down, causing trouble with authentication, GRS, and more. OIT resolved this issue as of 4:12pm, and we have recovered the storage that exports home directories, which was affected by the same issue. The storage problem caused temporary unavailability of home directories, with symptoms such as hanging code, commands, and login attempts.

We believe that most jobs have resumed operation after the issue was resolved, but we cannot be sure. Please check to see if you have any crashed jobs, and report any issues to pace-support@oit.gatech.edu

For details on the OIT issue reported, please visit their link 

Thank you for your attention to this, and apologies for the inconvenience.

[Resolved] Dedicated Scheduler – Job Submissions Paused

Posted on Thursday, 11 July, 2019

[Update – July 11, 2019 – 2:45pm] The dedicated scheduler is back online and operational after we corrected the node-to-queue associations that had resulted from a faulty configuration. We have taken measures to correct our automated procedure to prevent such an incident in the future. We have also removed the pause on job submission, so you may now resume submitting your jobs. Please check on jobs submitted since 3:30pm yesterday (7/10/2019), as many of those jobs were terminated.

Again, apologies for the inconvenience this has caused.

[Original Post – July 11, 2019 – 10:38am] Today, at approximately 10:10am, we paused job submissions to queues that are managed by the dedicated scheduler. Research teams will not be able to submit new jobs to the following queues: kennedy-lab,granulous,atlas-dufek,chow,athena-debug,cochlea,atlas-6,njord-6,atlantis,jabberwocky6,megatron,acceptance,hadoop,aces,drive,complexity,corso,blue,monkeys-k33,athena-6,core,ase1-debug-6,microbio-1,radius,medprint-6,monkeys_gpu,pampa-6,monkeys,keeneland,athena-intel,atlas-intel,apurimac-bg-6,staml,ofed-test,semap-6,martini,skade,tmlhpc-6,atlas-debug,wohler,rozell,mps,prv-5-6,aryabhata-6,hadean-gpu,epictetus,neutrons-6,davenporter,atlas,athena-8core,uranus-6,hadean,ase1-6,atlas-simon,enterprise,pampa-debug-6,skadi

This action was taken to resolve an issue we have been experiencing since the evening of July 10, in which jobs were erroneously terminated after not reaching their appropriate nodes. We are working to resolve this issue as quickly as possible. Pausing job submission will also prevent any new jobs from being terminated. While we work to resolve this issue, we ask that you refrain from submitting jobs to the queues listed above. We will follow up with an update as we work through this issue. Thank you for your attention, and we are sorry for the inconvenience.