

[Resolved] Shared Scheduler for Shared Clusters is Down

Posted on Monday, 10 February, 2020

Functionality has been restored to the shared cluster as of 3:00pm, and jobs are being ingested and run as normal.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

[Original Note – Feb 10, 2020 @ 2:40pm] The Shared Scheduler has gone down. This came to our attention at around 2:00pm. The PACE team is investigating the issue, and we will follow up with details. During this period, you will not be able to submit jobs or monitor current jobs on the Shared Clusters.

PACE Procurement Update and Schedule

Posted on Wednesday, 29 January, 2020

Dear Colleagues,

As you are aware from our prior communications and the recent issue of our PACE Newsletter, the PACE team has been quite busy.  We’ve deployed the Hive cluster, a state-of-the-art resource funded by NSF; we continue to expand our team to provide an even higher level of service to our community; and we are preparing the CODA data center to receive research workloads migrated from the Rich data center.  We will follow up with you on this last point very soon. Today, we are reaching out to inform you about the PACE purchasing schedule for the remainder of FY20 and to provide an update on how the recent changes in procurement requirements have impacted our timelines, as I’m sure you have seen in your departments as well.

First, the situation with procurement.  The sizable orders we place on behalf of the faculty have come under increased scrutiny.  This added complexity has resulted in much more time devoted to compliance, and the flexibility that we once enjoyed is no longer achievable.  More significantly, each order we place now requires a competitive bid process. As a result, our first order of the year, FY20-Phase1, has been considerably delayed and is still in the midst of a bid process.  We have started a second order, FY20-Phase2, in parallel to address situations of urgent need and expiring funds, and we are making preparations to begin the bid process for this order shortly. An important point to note is that purchases of PACE storage are not affected.  Storage can be added as needed via a request to pace-support@oit.gatech.edu.

Given the extended time required to process orders, we have time for only one more order before the year-end deadlines are upon us.  We will accept letters of intent to participate in FY20-Phase3 from now through February 20, 2020, and we will need complete specifications, budgets, account numbers, etc. by February 27, 2020.  Please see the schedule below for further milestones.  This rapidly approaching deadline is necessary for us to have sufficient time to process this order in time to use FY20 funds.  Due to the bidding process, we will have a reduced ability to make configuration changes after the “actionable requests” period closes and, by extension, a reduced ability to communicate precise costs in advance.  We will continue to provide budgetary estimates, and final costs will be communicated after bids are awarded.

Please know that we are doing everything possible to best advocate for the research community and navigate the best way through these difficulties.

 

  • February 20: Intent to participate in FY20-Phase3 due to pace-support@oit.gatech.edu
  • February 27: All details due to PACE (configuration, quantity, not-to-exceed budget, account number, financial contact, queue name)
  • April 22: Anticipated date to award bid
  • April 29: Anticipated date to finalize quote with selected vendor
  • May 6: Exact pricing communicated to faculty; all formal approvals received
  • May 8: Requisition entered into Workday
  • May 22: GT-Procurement issues purchase order to vendor
  • July 31: Vendor completes hardware installation and handover to PACE
  • August 28: PACE completes acceptance testing; resources become ready for research

 

To view the published schedule or for more information, visit http://pace.gatech.edu/participation or email pace-support@oit.gatech.edu.

Going forward, the PACE Newsletter will be published quarterly at  https://pace.gatech.edu/pace-newsletter.

Best Regards,

– The PACE Team

 

Hive Cluster Scheduler Down

Posted on Monday, 27 January, 2020

The Hive scheduler was restored at around 2:20PM.  The scheduler services had crashed; we were able to restore them successfully and have put measures in place to prevent a similar recurrence in the future.  Some user jobs may have been impacted during this scheduler outage.  Please check your jobs, and if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Thank you again for your patience, and we apologize for the inconvenience.

[Original Note — January 27, 2020, 2:16PM] The Hive scheduler has gone down.  This came to our attention at around 1:40pm.  The PACE team is investigating the issue, and we will follow up with details.  During this period, you will not be able to submit jobs or monitor current jobs on Hive.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

We apologize for this inconvenience, and appreciate your patience and attention.

 

[Re-Scheduled] Hive Cluster — Policy Update

Posted on Tuesday, 7 January, 2020

Since the deployment of the Hive cluster this Fall, we have been pleased with the rapid growth of our user community and the cluster’s rapidly increasing utilization. During this period, we have received user feedback that compels us to make changes that will further increase productivity for all users of Hive.  The Hive PIs have approved the following changes, which were deployed on January 9:

  1. Hive-gpu: The maximum walltime for jobs on hive-gpu is decreased from the current 5-day maximum to 3 days. This change addresses the longer job wait times that users have experienced on the hive-gpu queue.
  2. Hive-gpu: To ensure that GPUs do not sit idle, jobs will not be permitted to use a CPU:GPU ratio higher than 6:1 (i.e., 6 cores per GPU). Each hive-gpu node has 24 CPUs and 4 GPUs.
  3. Hive-nvme-sas: A new queue, hive-nvme-sas, combines and shares compute nodes between the hive-nvme and hive-sas queues.
  4. Hive-nvme-sas, hive-nvme, hive-sas: The maximum walltime for jobs on the hive-nvme, hive-sas, and hive-nvme-sas queues is increased from the current 5-day maximum to 30 days.
  5. Hive-interact: A new interactive queue, hive-interact, provides access to 32 Hive compute nodes (192 GB RAM and 24 cores each) for quick access to resources for testing and development. The walltime limit is 1 hour. An example request is sketched just after this list.
  6. Hive-priority: A new hive-priority queue is reserved for researchers with time-sensitive research deadlines.  For access to this queue, please communicate the relevant dates and upcoming deadlines to the PACE team so that we can obtain the necessary approvals to grant you access.  Please note that we may not be able to provide access for requests made less than 14 days before the resource is needed, due to jobs already running at the time of the request.
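For illustration, a one-hour interactive session on the new hive-interact queue could be requested with a Torque-style command along the following lines. This is a minimal sketch only; the exact resource string and any required account flags should be confirmed against the Hive documentation.

    # Request one node (24 cores) on hive-interact for the 1-hour maximum walltime
    qsub -I -q hive-interact -l nodes=1:ppn=24 -l walltime=01:00:00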

Who is impacted:

  • All Hive users who use the hive-gpu, hive-nvme, and hive-sas queues
  • The additional queues that are created will benefit, and thereby impact, all Hive users.

User Action:

  • Users will need to update their PBS scripts to reflect the new walltime limits and the CPU:GPU ratio requirement on the hive-gpu queue; a sketch of an updated script follows this list.
  • These changes will not impact currently running jobs.
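As a sketch of what an updated hive-gpu batch script might look like under the new limits: the job name, executable, and exact GPU resource syntax below are placeholders and assumptions, so please confirm the precise form in the Hive documentation.

    #PBS -N gpu-example              # hypothetical job name
    #PBS -q hive-gpu                 # queue affected by the new limits
    #PBS -l nodes=1:ppn=6:gpus=1     # 6 cores per GPU, the new maximum CPU:GPU ratio
    #PBS -l walltime=72:00:00        # 3 days, the new maximum walltime on hive-gpu

    cd $PBS_O_WORKDIR
    ./my_gpu_application             # placeholder for your actual executable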

Additionally:

We would like to remind you of the upcoming Hive cluster outage due to the creation of a Micro Grid power generation facility. At 8 AM on Monday, January 20th (a Georgia Tech holiday for MLK Day), the Hive cluster will be shut down for an anticipated 24 hours. A reservation has been put in place on all Hive nodes for this period; any submitted user jobs that would overlap with the outage will receive a warning indicating this and will remain queued until the work is complete. A similar warning will be generated for jobs overlapping with the upcoming cluster maintenance on February 27.

The planned outage of the CODA data center has been re-scheduled, and so the Hive cluster will be available until the next PACE maintenance period on February 27. The reservation has been removed, so work should proceed on January 20 as usual.

Our documentation has been updated to reflect these changes and queue additions, and can be found at http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[Reminder] Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 18 October, 2019

This is a friendly reminder that our updated policy impacting Shared Clusters at PACE will take effect on October 29, 2019.

On October 29, 2019, we are reducing the limit on the number of queued/running jobs per user to 500.

Who is impacted? All researchers connecting to PACE resources via the login-s[X].pace.gatech.edu headnodes are impacted by this policy change (a list of impacted queues is provided below).  We have identified all of the researchers/users who are impacted by these changes, and they have been contacted on multiple occasions.  We have also worked with a number of researchers from different PI groups during our consulting sessions to help them adapt their workflows to the new per-user job limit.  PACE provides and supports multiple solutions, such as job arrays, GNU parallel, and Launcher, to help users quickly adapt their workflows to this policy update; a GNU parallel sketch follows the queue list below.

  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
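As one illustration of the solutions mentioned above, GNU parallel can pack many short tasks into a single submitted job, so the scheduler sees one job rather than hundreds. The sketch below uses assumed names (the gnuparallel module and tasklist.txt, a file with one command per line); adjust these to your environment.

    #PBS -N packed-tasks             # hypothetical job name
    #PBS -l nodes=1:ppn=24
    #PBS -l walltime=04:00:00

    cd $PBS_O_WORKDIR
    module load gnuparallel          # module name may differ on PACE systems
    # Run one command per line of tasklist.txt, keeping 24 tasks running at a time
    parallel -j 24 --joblog tasks.log < tasklist.txt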

Prior to this policy change taking effect on October 29, we have one more consulting session scheduled:

  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

For details about our policy change, please visit our blog post.

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to have more than 500 jobs queued or running at a time.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Preventative Maintenance for UPS units at Rich Data Center

Posted on Friday, 11 October, 2019

OIT will be performing annual preventative maintenance on the UPS units for the Rich Data Center on Saturday, October 12, from 7:00AM to about 5:00PM.  No outage is expected from this work; however, in the case of an outage, PACE clusters and the jobs running on them are at risk of being interrupted.  Again, it is unlikely that we will have a power outage during this maintenance period.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

The First LIGO Cluster for Georgia Tech is Ready for Research!

Posted on Friday, 30 August, 2019

Georgia Tech’s Partnership for an Advanced Computing Environment (PACE) recently deployed a cluster to support the Nobel Prize-winning Laser Interferometer Gravitational-Wave Observatory (LIGO) project. This project observed the first gravitational waves from the merger of two black holes and, in doing so, confirmed Einstein’s predictions according to his theory of general relativity. Beyond the addition of new computational resources, this pioneering work is the first step in integrating Georgia Tech into the Open Science Grid (OSG), a national computational grid that provides shared resources to run massive numbers of small computations.

PACE started working on building a LIGO resource at Georgia Tech shortly after the arrival of Dr. Laura Cadonati, professor of Physics in the Center for Relativistic Astrophysics (CRA), in 2015. The initial proof-of-concept infrastructure was able to accept test jobs from OSG. This initiative yielded great insight into the process of integrating into OSG, and that insight was disseminated to other institutions such as Syracuse University, which was subsequently able to deploy its own cluster. Based on this successful test, Cadonati procured a new cluster to run production-level LIGO workloads. In deploying this cluster, PACE partnered with a team of experts at the University of Chicago led by Senior Scientist Robert Gardner to adopt the latest advancements in the OSG system and software stack.

Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 30 August, 2019

Dear Researchers,

Given the growth in the number and workload diversity of PACE cluster users, we are compelled to adjust our shared clusters’ scheduler policy to help offset the extensive load our schedulers are placed under as a result of the rapid growth of our high-throughput computing (HTC) user community, which submits thousands of jobs at a time.  Our shared cluster is under its heaviest load from the sheer volume of HTC jobs it receives, which at this time exceeds 100,000 jobs.  Our current policy allows each user a maximum of 3,000 jobs.  We are proposing to reduce that maximum to 500 (i.e., 500 total queued/running jobs per user), which will substantially reduce the load on the scheduler and improve its overall performance, providing a sustainable and improved user experience by largely preventing hanging commands, jobs, and errors when submitting or checking on a job.

More specifically, we are making the following changes:

  • Max 500 total jobs (queued/running) per user
  • Effective October 29, 2019

Who is impacted:

  • If you are logging into a login-s[X] headnode, then this change will impact you.
  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
  • We identified 84 users from 36 PI groups, out of our nearly 3,000 users from nearly 300 PI research groups, who will be impacted by this change.

We understand that this is a drastic decrease in the number of jobs a user is able to submit, but with our proposed improvements to your workflows, you will benefit from improved scheduler performance and gain more computational time by submitting jobs more efficiently.  With that said, we are providing custom consulting sessions to help researchers adapt their workflows to the new limit on submitted jobs. We also have multiple solutions, for example, job arrays, GNU parallel, and the HTC Launcher, developed and put in place so that users can quickly adapt their workflows to this policy update; a job array sketch follows below. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back here for updates).
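For example, a Torque-style job array lets a sweep of related tasks be submitted as one array job rather than hundreds of individually submitted jobs, which is far lighter on the scheduler. The sketch below uses hypothetical file and executable names; the array syntax and any site-specific limits should be confirmed in the PACE documentation.

    #PBS -N array-example            # hypothetical job name
    #PBS -t 1-200                    # 200 tasks submitted and tracked as a single array job
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=01:00:00

    cd $PBS_O_WORKDIR
    # Each task selects its own input file using the array index
    ./process_input input_${PBS_ARRAYID}.dat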

Upcoming Consulting sessions:

  • September 10, 1:00pm – 2:45pm, Molecular Sciences & Engineering Room 1201A
  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 17, 10:00am – 11:00am, Scheller COB 224
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to have more than 500 jobs queued or running at a time.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Best,

The PACE Team

New PACE Team Members and New Team Member Roles

Posted on Friday, 30 August, 2019

Dear Researchers,

PACE is pleased to announce our new additions to the PACE team and recognitions of our team members who started new roles at PACE.

In the spring, our Software and Collaboration Support team grew with the addition of Dr. Kevin Manalo.  Kevin is a proud Georgia Tech graduate who cannot hold back his excitement about joining PACE, whose clusters he relied on heavily during his PhD research!  Kevin comes to PACE as an HPC veteran with experience in HPC support and training from Johns Hopkins University and the state supercomputer centers of Ohio and Alabama.

Over the summer, our Outreach and Faculty Interaction team has grown by three new members, Drs. Aaron Jezghani, Michael Weiner, and Chris Blanton.  As you may have already noticed, they have all hit the ground running as they have been very active in responding to support inquiries and hosting multiple PACE classes and workshops.  To tell you a little bit about our Outreach team members:

Dr. Aaron Jezghani recently defended his PhD in Physics at the University of Kentucky.   His research focused on nuclear physics and  involved work at both Los Alamos and Oak Ridge National Labs. Throughout Aaron’s multi-faceted dissertation work, he focused on development of detector readout electronics as well as techniques in acquiring, processing and analyzing data from the detectors, which is not an easy feat.

Dr. Michael Weiner received his undergraduate degree in physics from Yale University and his doctorate, also in physics, from Cornell University. He completed his doctoral research in computational biophysics in the laboratory of Gerald Feigenson, where he focused on Molecular Dynamics simulations of the biophysical chemistry of lipid bilayers as models of cell membranes.

Dr. Chris Blanton earned his Ph.D. from Syracuse University in Computational and Theoretical Chemistry. During his studies, he became deeply interested in computational research and HPC. After graduation, he joined the Pennsylvania State University’s Institute for CyberScience. He has worked with some of the most exciting and innovative computational researchers, and he looks forward to sharing and applying his experiences with Georgia Tech research community.

Also, over the summer, our Cyberinfrastructure team has added two members, and it’s our pleasure to reintroduce to you Trever Nightingale and Ken Suda.

Trever has returned to PACE in his position of Sr. Systems Support Engineer.  Trever has a bachelor’s degree from Amherst College and a master’s degree from the University of Minnesota, and his 20 years of UNIX experience includes high performance and research computing centers such as the Naval Research Lab, NERSC, and the Centers for Disease Control (CDC).


Ken has been in IT professionally for almost 35 years and has filled most roles found in an IT organization.  For the past couple of years, Ken has been a consultant and has run a game development company.  As a consultant, Ken has been a generalist, filling whatever role the team or organization needed.

Now, it is our great pleasure to announce the new roles of our team members Dan (Ann) Zhou, Andre McNeill, and Ruben Lara.

Dan (Ann) Zhou’s new role is Research Technologist Storage Architect for PACE.  Ann has been a PACE team member since August 2014 and has been an integral part of the PACE cyberinfrastructure team; her responsibilities include the operation, backup, and management of the many PACE storage systems.  Ann received her bachelor’s degree in Electrical Engineering in China and her master’s degree in Electrical and Computer Engineering at Tennessee Technological University. She enjoys cooking, running, eating, and traveling.

Andre McNeill’s new role is Research Technologist Cloud Architect for PACE. Andre has been a member of PACE for nearly 10 years and continues to be a vital resource for both our PACE staff and our PACE customers, delivering a robust and reliable research computing environment spanning computing, networking, and software systems.  A graduate of Purdue University, Andre has many interests within PACE and many outside the workplace, including being a DJ.

Ruben Lara’s new role is Systems Support Engineer Manager for PACE. Ruben has been a part of the PACE cyberinfrastructure team since February 2017. He brings excellent managerial and organizational skills and is currently enrolled in MOR Leadership training.  Ruben enjoys baseball, ultimate frisbee, rock climbing, and mountain biking. You can find him by the window at the southwest end of the 10th floor of the Coda building.

Please join us in welcoming our new team members and congratulating our recently promoted team members!

Best,
The PACE Team

Release of Updated PACE User Documentation

Posted on Thursday, 29 August, 2019

PACE is pleased to announce the release of our updated PACE User Documentation. While we have updated much of our existing guides, we have also added new guides with detailed instructions for various tasks, along with examples such as PBS scripts for the applications you may want to submit as batch or interactive jobs.  The documentation itself is built using GitHub, which helps us maintain it better, and this modular design will mitigate any interruption in service should we decide to upgrade or change the web-hosting technology for our overall site.

Over the Fall semester, we will begin phasing out the older documentation from our many pages on the PACE website and redirecting them to the new documentation.  As some of you may recall from attending our prior PACE Consulting Sessions and Cluster Orientation classes, your feedback about our documentation (which was in beta at the time) was invaluable, and you will find much of it incorporated in this release.  We hope you find this documentation helpful, and if you have any questions or comments, please don’t hesitate to let us know.

Again, you may access the new documentation from our home page or via the links below:

https://pace.gatech.edu/pace-user-documentation

or

https://docs.pace.gatech.edu