PACE A Partnership for an Advanced Computing Environment

August 30, 2019

The First LIGO Cluster for Georgia Tech is Ready for Research!

Filed under: Uncategorized — Semir Sarajlic @ 9:25 pm

Tech’s Partnership for an Advanced Computing Environment (PACE) recently deployed a cluster to support the Nobel Prize-winning Laser Interferometer Gravitational-Wave Observatory (LIGO) project. This project observed the first gravitational waves from the merger of two black holes and, in doing so, confirmed predictions of Einstein’s theory of general relativity. Beyond adding new computational resources, this pioneering work is the first step in integrating Georgia Tech into the Open Science Grid (OSG), a national computational grid that provides shared resources to run massive numbers of small computations.

PACE started building a LIGO resource at Georgia Tech shortly after the arrival of Dr. Laura Cadonati, professor of Physics in the Center for Relativistic Astrophysics (CRA), in 2015. The initial proof-of-concept infrastructure was able to accept test jobs from OSG. This initiative yielded great insight into the process of integrating into OSG, and that insight was shared with other institutions such as Syracuse University, which was subsequently able to deploy its own cluster. Based on this successful test, Cadonati procured a new cluster to run production-level LIGO workloads. In deploying this cluster, PACE partnered with a team of experts at the University of Chicago, led by Senior Scientist Robert Gardner, to adopt the latest advancements in the OSG system and software stack.

Policy Update to Shared Clusters’ Scheduler

Filed under: Uncategorized — Semir Sarajlic @ 8:44 pm

Dear Researchers,

Given the growth in the number and workload diversity of PACE cluster users, we are compelled to adjust our shared clusters’ scheduler policy. The adjustment will help offset the extensive load placed on our schedulers by the rapid growth of our high throughput computing (HTC) user community, which submits thousands of jobs at a time. As you can see in the figure below, our shared cluster is under its heaviest load from the sheer volume of HTC jobs it receives from users, which at this time exceeds 100,000 jobs. Our current policy allows a user to submit a maximum of 3,000 jobs. We are proposing to reduce that maximum to 500 (i.e., 500 total jobs, queued and running combined, per user). This will substantially reduce the load on the scheduler and improve overall performance, providing a sustainable and improved user experience by largely preventing hanging commands, hanging jobs, and errors when you attempt to submit or check on a job.

More specifically, we are making the following changes:

  • Max 500 total jobs (queued/running) per user
  • Effective October 29, 2019

Who is impacted:

  • If you log in to a login-s[X] headnode, this change will impact you.
  • The following queues are impacted: apurimac-6, apurimacforce-6, b5-6, b5-prv-6, b5force-6, bench-gpu, benchmark, biobot, biobotforce-6, biocluster-6, biocluster-gpu, bioforce-6, biohimem-6, casper, cee, ceeforce, chem-bigdata, chemprot, chemx, chemxforce, chowforce-6, cns-24c, cns-48c, cns-6-intel, cnsforce-6, coc, coc-force, critcel, critcel-burnup, critcel-manalo, critcel-prv, critcel-tmp, critcelforce-6, cygnus, cygnus-6, cygnus-hp, cygnus-hp-lrg-6, cygnus-hp-small, cygnus-xl, cygnus24-6, cygnus28, cygnus64-6, cygnusforce-6, cygnuspa-6, davenprtforce-6, dimer-6, dimerforce-6, ece, eceforce-6, enveomics-6, faceoff, faceoffforce-6, flamel, flamelforce, force-6, force-gpu, force-single-6, gaanam, gaanam-h, gaanamforce, gemini, ggate-6, gpu-eval, gpu-recent, habanero, habanero-gpu, hummus, hydraforce, hygene-6, hygeneforce-6, isabella, isabella-prv, isblforce-6, iw-shared-6, jangmse, joe, joe-6, joe-bigshort, joe-fast, joe-test, joeforce, kastella, kastellaforce-6, kokihorg, lfgroup, math-6, math-magma, mathforce-6, mayorlab_force-6, mcg-net, mcg-net-gpu, mcg-net_old, mday-test, medprintfrc-6, megatron-elite, megatronforce-6, metis, micro-largedata, microcluster, nvidia-gpu, optimus, optimusforce-6, phi-shared, prometforce-6, prometheus, pvh, pvhforce, radiance, romberg, rombergforce, sagan, sonar-6, sonarforce-6, spartacus, spartacusfrc-6, spinons, spinonsforce, threshold, threshold-6, tmlhpc, try-6, trybuy
  • Of our nearly 3,000 users from nearly 300 PI research groups, we identified 84 users from 36 PI groups who will be impacted by this change.

We understand that this is a drastic decrease in the number of jobs a user is able to submit, but with our proposed improvements to your workflows, you will benefit from improved scheduler performance and gain more computational time by submitting jobs more efficiently. To that end, we are providing custom consulting sessions to help researchers adapt their workflows to the new limit on job submissions. We also have multiple solutions developed and in place, for example, job arrays, GNU parallel, and HTC Launcher, so that users can quickly adapt their workflows to this policy update. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 10, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 17, 10:00am – 11:00am, Scheller COB 224
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to submit more than 500 jobs.
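
As a rough illustration of how a large batch of independent tasks could fit under the 500-job limit, the sketch below groups tasks so that the number of submitted jobs never exceeds the limit. It is a minimal Python sketch, not a PACE tool: the `pack_tasks` helper and the `./simulate --seed` command are hypothetical names chosen for illustration, and the workload of 3,000 tasks is an assumed example.

```python
import math

MAX_JOBS = 500  # per-user limit on queued + running jobs stated in this post


def pack_tasks(tasks, max_jobs=MAX_JOBS):
    """Split tasks into at most max_jobs contiguous batches of near-equal size."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    tasks = list(tasks)
    if not tasks:
        return []
    # Smallest batch size that keeps the number of batches within max_jobs.
    per_job = math.ceil(len(tasks) / max_jobs)
    return [tasks[i:i + per_job] for i in range(0, len(tasks), per_job)]


# Hypothetical workload: 3,000 independent commands (names are illustrative only).
commands = [f"./simulate --seed {n}" for n in range(3000)]
batches = pack_tasks(commands)
print(len(batches), len(batches[0]))  # prints "500 6"
```

Each batch would then be submitted as one job that runs its commands in turn (or side by side within the job, for example with GNU parallel), so 3,000 tasks occupy 500 scheduler slots instead of 3,000.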

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Best,

The PACE Team

[Resolved] GPFS outage on Red Hat 7 queues

Filed under: Uncategorized — Michael Weiner @ 2:19 pm

An issue occurred around 3:30 AM on several queues running on the Red Hat 7 operating system, where a number of nodes failed to mount GPFS, our project (data) and scratch storage system. This caused the nodes to be offlined and unavailable for jobs. We repaired the affected nodes at approximately 9:30 AM today, and all queues should be functioning normally. Any jobs that were held should have begun. Please check your overnight jobs for errors.

The following queues were impacted:
atlas-he
ece-gpu
flamel-gpu
gaanam-gpu
gemini-cpu
gemini-gpu
megatron
ml_gpu
sake
skylake-test
starscream
swarm
swarm-gpu

Should you notice the problem recur, or if you have any other concerns, please contact us at pace-support@oit.gatech.edu, and we will be happy to help you. We apologize for the inconvenience this morning.

New PACE Team Members and New Team Member Roles

Filed under: Uncategorized — Semir Sarajlic @ 1:48 pm

Dear Researchers,

PACE is pleased to announce the new additions to the PACE team and to recognize team members who have started new roles at PACE.

In the spring, our Software and Collaboration Support team grew with the addition of Dr. Kevin Manalo. Kevin is a proud Georgia Tech graduate who cannot hold back his excitement about joining PACE, whose clusters he relied on heavily during his PhD research! Kevin comes to PACE as an HPC veteran with experience in HPC support and training from Johns Hopkins University and state supercomputer centers in Ohio and Alabama.

Over the summer, our Outreach and Faculty Interaction team has grown by three new members, Drs. Aaron Jezghani, Michael Weiner, and Chris Blanton.  As you may have already noticed, they have all hit the ground running as they have been very active in responding to support inquiries and hosting multiple PACE classes and workshops.  To tell you a little bit about our Outreach team members:

Dr. Aaron Jezghani recently defended his PhD in Physics at the University of Kentucky. His research focused on nuclear physics and involved work at both Los Alamos and Oak Ridge National Labs. Throughout his multi-faceted dissertation work, Aaron focused on the development of detector readout electronics as well as techniques for acquiring, processing, and analyzing data from the detectors, which is no easy feat.

Dr. Michael Weiner received his undergraduate degree in physics from Yale University and his doctorate, also in physics, from Cornell University. He completed his doctoral research in computational biophysics in the laboratory of Gerald Feigenson, where he focused on Molecular Dynamics simulations of the biophysical chemistry of lipid bilayers as models of cell membranes.

Dr. Chris Blanton earned his Ph.D. from Syracuse University in Computational and Theoretical Chemistry. During his studies, he became deeply interested in computational research and HPC. After graduation, he joined the Pennsylvania State University’s Institute for CyberScience. He has worked with some of the most exciting and innovative computational researchers, and he looks forward to sharing and applying his experiences with the Georgia Tech research community.

Also, over the summer, our Cyberinfrastructure team has added two members, and it’s our pleasure to reintroduce to you Trever Nightingale and Ken Suda.

Trever has returned to PACE in his position of Sr. Systems Support Engineer. Trever has a bachelor’s degree from Amherst College and a master’s degree from the University of Minnesota. His 20 years of UNIX experience include work at high performance and research computing centers, including the Naval Research Lab, NERSC, and the Centers for Disease Control (CDC).


Ken has been in IT professionally for almost 35 years and has filled most roles found in an IT organization. For the past couple of years, Ken has been a consultant and has run a game development company. As a consultant, Ken has been a generalist, filling whatever role the team or organization needed.

Now, with great pleasure, PACE announces the new roles for our team members Dan (Ann) Zhou, Andre McNeill, and Ruben Lara.

Dan (Ann) Zhou’s new role is Research Technologist Storage Architect for PACE. Ann has been a PACE team member since August 2014 and an integral part of the PACE cyberinfrastructure team, where her responsibilities include the operation and management of the many PACE storage systems and backups. Ann received her bachelor’s degree in Electrical Engineering in China and her master’s degree in Electrical and Computer Engineering at Tennessee Technological University. She enjoys cooking, running, eating, and traveling.

Andre McNeill’s new role is Research Technologist Cloud Architect for PACE. Andre has been a member of PACE for nearly 10 years and continues to be a vital resource for both PACE staff and PACE customers in delivering a robust and reliable research computing environment, including computing, networking, and software systems. A graduate of Purdue University, Andre has many interests within PACE and many outside the workplace, including being a DJ.

Ruben Lara’s new role is Systems Support Engineer Manager for PACE. Ruben has been a part of the PACE cyberinfrastructure team since February 2017. Ruben has many excellent managerial and organizational skills and is currently enrolled in MOR Leadership training. Ruben enjoys baseball, ultimate frisbee, rock climbing, and mountain biking. You can find him by the window at the southwest end of the 10th floor of the Coda building.

Please join us in welcoming our new team members and congratulating our recently promoted team members!

Best,
The PACE Team

August 29, 2019

Release of Updated PACE User Documentation

Filed under: Uncategorized — Semir Sarajlic @ 7:40 pm

PACE is pleased to announce the release of our updated PACE User Documentation. While we updated many of our existing guides, we have also added new guides with detailed instructions for various tasks, along with examples such as PBS scripts for various applications you may want to run as batch or interactive jobs. The documentation itself is built using GitHub, which helps us maintain it better, and this modular design will mitigate any interruption in service should we decide to upgrade or change the web hosting technology for our overall site.

Over the Fall semester, we will begin phasing out the older documentation from the many pages on the PACE website and redirecting those pages to our new documentation. As some of you may recall from attending our prior PACE Consulting Sessions and Clusters Orientation classes, your feedback about our documentation (which was in beta at the time) was invaluable, and you will find much of it incorporated in this release. We hope you find this documentation helpful, and if you have any questions or comments, please don’t hesitate to let us know.

Again, you may access the new documentation from our home page or from either of the following links:

https://pace.gatech.edu/pace-user-documentation

or

https://docs.pace.gatech.edu

 

August 9, 2019

PACE Ready for Research

Filed under: Uncategorized — Semir Sarajlic @ 11:55 pm

Our August 2019 maintenance ( https://blog.pace.gatech.edu/?p=6511 ) is complete one day ahead of schedule!  We have brought compute nodes online and released previously submitted jobs.  Login nodes are accessible and your data are available.

As usual, there are a small number of straggling nodes that we will address over the coming days.

  • (Complete) Network connections to PACE-RTR will be upgraded. Connectivity in and out of the Rich Data Center will be disrupted on Friday morning. VAPOR network will not be affected.
  • (Complete) Additional space will be configured for license server.
  • (Complete) OS and application patches will be applied to Red Hat Enterprise Linux (RHEL) 7 servers, effectively upgrading to RHEL 7.6.
  • (Complete) OS and application patches will be applied to testflight nodes, to begin testing new versions of kernel and libraries.
  • (Complete) PACE management scripts and utilities will be upgraded, to improve reliability and performance.
  • (Complete) The submit filter for jobs on the RHEL 7 clusters will be modified to allow proper formatting of commands. This filter is not needed on RHEL 6 clusters.
  • (Complete) Upgrade DNS appliances; no downtime is expected due to redundant configuration.

August 5, 2019

[Resolved] Campus-wide network outage impacting PACE

Filed under: Uncategorized — Michael Weiner @ 6:18 pm

A campus-wide DNS server failure occurred on the morning of Monday, August 5. OIT was able to resolve the issue at 10:06 AM, and all PACE services should now be working normally. The associated problem with storage made home directories temporarily unavailable, with symptoms such as hanging codes, commands, and login attempts.
We believe that most jobs have resumed operation after the issue was resolved, but we cannot be sure. Please check to see if you have any crashed jobs, and report any issues to pace-support@oit.gatech.edu.
For details on the DNS failure, please visit the OIT status update.

Thank you for your attention to this, and we apologize for the inconvenience.

August 1, 2019

Network outages across the GT campus

Filed under: Uncategorized — Craig Moseley @ 5:59 pm

On the morning of August 1, 2019, a distribution router in the Rich data center failed around 9:22 AM, producing network outages across the GT campus. This outage included the single sign-on server, which prevented login authentication to numerous systems across campus, including PACE. OIT has identified the issue, and connectivity was restored around 9:50 AM, but issues remain.

Logins to PACE should now be possible, though intermittent issues may remain. Running and queued jobs should be unaffected. Please contact us at pace-support@oit.gatech.edu if you have any questions or persisting issues. The login failures also affected our access to view user help requests, and we apologize for any delay in responding to requests this morning.

For details on the OIT issue, please visit the link below.

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d42eda56788b204bf9f11d4

We apologize for the inconvenience.
