Archive for category Uncategorized

Data center maintenance

Posted on Tuesday, 17 September, 2019

Beginning tomorrow morning (9/18/19), there will be urgent maintenance on the cooling system in the Rich data center, which houses all PACE clusters except for Hive. A temporary cooling unit has been installed, but should the secondary cooling unit fail, the room will begin to heat.  If that happens, portions of the data center will need to shut down until the temperature has returned to an acceptable level.  If the clusters are shut down, this will terminate any running jobs on the compute nodes.
Please follow updates on this maintenance and find a full list of affected services across campus at https://status.gatech.edu/pages/maintenance/5be9af0e5638b904c2030699/5d7fd6219012a0316b71ef83.
– PACE team

[Resolved] Shared Clusters Scheduler Down

Posted on Tuesday, 10 September, 2019

[Update – 9/10/2019 3:52 PM]

The shared scheduler has been restored to functionality. The issue stemmed from a large influx of jobs (>100,000) in less than 24 hours. As a reminder, the upcoming policy change on October 29, 2019, which limits the number of job submissions to 500 per user, is designed to mitigate this issue moving forward. If you feel your workflow may be impacted, please take the opportunity to read the documentation on parallel job solutions (job arrays, GNU parallel, and HTC Launcher) developed and put in place so that users can quickly adapt their workflows. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to submit more than 500 jobs.

[Original]

The shared scheduler experienced an out-of-memory issue this morning at 7:44 AM, resulting in a hold on all jobs in the queues managed by this scheduler. This issue affects all users who submit jobs to PACE via the shared clusters headnodes (login-s). Currently, job submissions will hang. We ask that you refrain from submitting any new jobs until further notice while the PACE team investigates the matter and restores functionality to the scheduler.

Currently the following queues are affected by this scheduler issue: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biocluster-6,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,cns-24c,cns-48c,cns-6-intel,cnsforce-6,critcel,critcel-prv,critcelforce-6,cygnus,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,gaanam,gaanam-h,gaanamforce,gpu-eval,habanero,habanero-gpu,hummus,hydra-gpu,hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,lfgroup,math-6,mathforce-6,mcg-net,mcg-net-gpu,mday-test,metis,micro-largedata,microcluster,optimus,optimusforce-6,prometforce-6,prometheus,pvh,pvhforce,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,try-6,trybuy

We apologize for this inconvenience, and we appreciate your attention and patience.

The Launcher Documentation Available

Posted on Tuesday, 10 September, 2019

The Launcher is a framework for running large collections of serial or multi-threaded applications as a single job on a batch-scheduled HPC system. The Launcher was developed at the Texas Advanced Computing Center (TACC) and has been deployed at multiple HPC centers throughout the world. The Launcher allows High-Throughput Computing users to take advantage of the benefits of scheduling larger single jobs and to better fit within the HPC environment.

To better serve our High-Throughput Computing users, we have adapted this software for use on the PACE systems.

Information on using Launcher on PACE is available at PACE Documentation.
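
To illustrate the idea of running many serial tasks inside one job, here is a minimal Python sketch. It is not the Launcher and does not use the Launcher's interface; the function name run_tasks, the worker count, and the example commands are all assumptions made for illustration. It simply runs a list of independent commands with a fixed number running at once, which is the general pattern a single larger job would follow on its allocated cores.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_tasks(commands, max_workers=4):
    """Run each argv list in `commands`, at most `max_workers` at a time.

    Returns the exit codes in the same order as `commands`.
    (Hypothetical helper for illustration; not part of the Launcher.)
    """
    def run_one(argv):
        # Each task is an independent process; its output is captured
        # so that tasks do not interleave text on the terminal.
        return subprocess.run(argv, capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(run_one, commands))


# Hypothetical task list: tiny Python one-liners standing in for serial
# applications; the last one exits with a non-zero code on purpose.
commands = [
    [sys.executable, "-c", "print('task 0')"],
    [sys.executable, "-c", "print('task 1')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
]
exit_codes = run_tasks(commands, max_workers=2)
print(exit_codes)
```

Collecting the exit codes in task order makes it easy to see which tasks need to be rerun. For actual use on PACE, please follow the Launcher instructions in the PACE Documentation rather than this sketch.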

COMSOL use at PACE

Posted on Monday, 9 September, 2019

As you may know, the College of Engineering is changing the licensing model for COMSOL on September 16, 2019, and will now restrict access for research use to named users who have purchased access through CoE. Use of COMSOL for research on PACE is licensed through CoE (regardless of your college affiliation). If you or your PI have not yet made arrangements with CoE, please contact Angelica Remolina in CoE IT (angie.remolina@coe.gatech.edu). You will not be able to run COMSOL on PACE without permission from CoE after September 16.

[Resolved] Campus Network Down

Posted on Wednesday, 4 September, 2019

[Update] September 5

OIT reports that the campus network is again fully functional.

[Update] September 4 4:28 PM

This is a brief update: OIT Network Services has identified the cause of the campus network issues. One of the enterprise routers for campus rebooted unexpectedly, which impacted our campus network. Since this event, the network has been stabilized. OIT continues to monitor the situation for any further issues. For the latest updates, please check the OIT status page.

You should now be able to access the PACE clusters without issues. If you continue to experience an issue, please try disconnecting and reconnecting to restore your connectivity.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu

[Original] September 4 2:30 PM

Our campus network is down.  OIT is investigating this incident, and you may check on the details from the link below:

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d6ff4f4daca6a0543918df2

This incident will prevent you from accessing the PACE resources, but your current jobs running on PACE should not be interrupted.

Please check the status link above for up-to-date details. If you have any questions, please send a note to pace-support@oit.gatech.edu. Please also note that we are impacted by the outage, so our responses to your email will be delayed.

Thank you for your patience.

The First LIGO Cluster for Georgia Tech is Ready for Research!

Posted on Friday, 30 August, 2019

Tech’s Partnership for an Advanced Computing Environment (PACE) recently deployed a cluster to support the Nobel Prize-winning Laser Interferometer Gravitational-Wave Observatory (LIGO) project. This project observed the first gravitational waves from the merger of two black holes, and, in doing so, confirmed Einstein’s predictions according to his theory of general relativity. Beyond the addition of new computational resources, this pioneering work is the first step in integrating Georgia Tech into the Open Science Grid (OSG), a national computational grid that provides shared resources to run massive numbers of small computations.

PACE started working on building a LIGO resource at Georgia Tech shortly after the arrival of Dr. Laura Cadonati, professor of Physics in the Center for Relativistic Astrophysics (CRA), in 2015. The initial proof-of-concept infrastructure was able to accept test jobs from OSG. This initiative yielded great insight into the process of integrating into OSG, and that insight was shared with other institutions such as Syracuse University, which was subsequently able to deploy its own cluster. Based on this successful test, Cadonati procured a new cluster to run production-level LIGO workloads. In deploying this cluster, PACE partnered with a team of experts at the University of Chicago, led by Senior Scientist Robert Gardner, to adopt the latest advancements in the OSG system and software stack.

Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 30 August, 2019

Dear Researchers,

Given the growth in the number and workload diversity of PACE cluster users, we are compelled to adjust our shared clusters’ scheduler policy. The change will help offset the extensive load placed on our schedulers by the rapid growth of our high-throughput computing (HTC) user community, which submits thousands of jobs at a time. As you can see in the figure below, our shared cluster is under its heaviest load from the sheer volume of HTC jobs it receives, which currently exceeds 100,000 jobs. Our current policy allows a user to submit a maximum of 3,000 jobs. We are proposing to reduce this maximum to 500 (i.e., 500 total jobs, queued and running combined, per user). This will substantially reduce the load on the scheduler and improve overall performance, providing a sustainable and improved user experience by largely preventing hanging commands, jobs, and errors when submitting or checking on a job.

More specifically, we are making the following changes:

  • Max 500 total jobs (queued/running) per user
  • Effective October 29, 2019

Who is impacted:

  • If you log in to a login-s[X] headnode, this change will impact you.
  • We identified 84 users from 36 PI groups out of our nearly 3000 users from nearly 300 PI research groups who will be impacted by this change.

We understand that this is a drastic decrease in the number of jobs a user is able to submit, but with our proposed improvements to your workflows, you will benefit from improved scheduler performance and gain more computational time by submitting jobs more efficiently. To that end, we are providing custom consulting sessions to help researchers adapt their workflows to the new job limit. We also have multiple solutions, for example job arrays, GNU parallel, and HTC Launcher, developed and put in place so that users can quickly adapt their workflows to this policy update. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 10, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to submit more than 500 jobs.
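
As a rough illustration of how a large workload can fit under the 500-job limit, the following Python sketch groups independent tasks into batches so that each batch can run as one job. This is only a sketch under stated assumptions: the function name pack_tasks, the example command lines, and the workload size are hypothetical, and how each batch is actually executed (for example as one job array element or one GNU parallel invocation) is left to the PACE documentation.

```python
import math

MAX_JOBS = 500  # per-user limit on queued + running jobs, from the policy above


def pack_tasks(tasks, max_jobs=MAX_JOBS):
    """Split `tasks` into at most `max_jobs` batches.

    Every batch has the same size except possibly the last, which may be
    shorter. Each batch is meant to be run by a single scheduler job.
    (Hypothetical helper for illustration only.)
    """
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    if not tasks:
        return []
    # Smallest batch size that keeps the number of batches within the limit.
    per_job = math.ceil(len(tasks) / max_jobs)
    return [tasks[i:i + per_job] for i in range(0, len(tasks), per_job)]


# Hypothetical workload: 3,000 independent command lines, i.e. the old
# per-user maximum, repacked to fit under the new limit.
tasks = [f"./simulate --seed {n}" for n in range(3000)]
batches = pack_tasks(tasks)
print(len(batches), "jobs,", len(batches[0]), "tasks per job")
# prints "500 jobs, 6 tasks per job"
```

The same 3,000 tasks that previously required 3,000 submissions become 500 jobs of 6 tasks each, so the scheduler tracks far fewer entries while the total amount of work is unchanged.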

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Best,

The PACE Team

[Resolved] GPFS outage on Red Hat 7 queues

Posted on Friday, 30 August, 2019

An issue occurred around 3:30 AM on several queues running on the Red Hat 7 operating system, where a number of nodes failed to mount GPFS, our project (data) and scratch storage system. As a result, the nodes were taken offline and were unavailable for jobs. We repaired the affected nodes at approximately 9:30 AM today, and all queues should be functioning normally. Any jobs that were held should have begun. Please check your overnight jobs for errors.

The following queues were impacted:
atlas-he
ece-gpu
flamel-gpu
gaanam-gpu
gemini-cpu
gemini-gpu
megatron
ml_gpu
sake
skylake-test
starscream
swarm
swarm-gpu

Should you notice the problem recur, or if you have any other concerns, please contact us at pace-support@oit.gatech.edu, and we will be happy to help you. We apologize for the inconvenience this morning.

New PACE Team Members and New Team Member Roles

Posted on Friday, 30 August, 2019

Dear Researchers,

PACE is pleased to announce the new additions to the PACE team and to recognize the team members who have started new roles at PACE.

In the spring, our Software and Collaboration Support team grew with the addition of Dr. Kevin Manalo. Kevin is a proud Georgia Tech graduate who cannot hold back his excitement about joining PACE, whose clusters he had heavily relied on during his PhD research! Kevin comes to PACE as an HPC veteran with experience in HPC support and training from Johns Hopkins University and state supercomputer centers in Ohio and Alabama.

Over the summer, our Outreach and Faculty Interaction team has grown by three new members, Drs. Aaron Jezghani, Michael Weiner, and Chris Blanton.  As you may have already noticed, they have all hit the ground running as they have been very active in responding to support inquiries and hosting multiple PACE classes and workshops.  To tell you a little bit about our Outreach team members:

Dr. Aaron Jezghani recently defended his PhD in Physics at the University of Kentucky.   His research focused on nuclear physics and  involved work at both Los Alamos and Oak Ridge National Labs. Throughout Aaron’s multi-faceted dissertation work, he focused on development of detector readout electronics as well as techniques in acquiring, processing and analyzing data from the detectors, which is not an easy feat.

Dr. Michael Weiner received his undergraduate degree in physics from Yale University and his doctorate, also in physics, from Cornell University. He completed his doctoral research in computational biophysics in the laboratory of Gerald Feigenson, where he focused on Molecular Dynamics simulations of the biophysical chemistry of lipid bilayers as models of cell membranes.

Dr. Chris Blanton earned his Ph.D. from Syracuse University in Computational and Theoretical Chemistry. During his studies, he became deeply interested in computational research and HPC. After graduation, he joined the Pennsylvania State University’s Institute for CyberScience. He has worked with some of the most exciting and innovative computational researchers, and he looks forward to sharing and applying his experiences with the Georgia Tech research community.

Also, over the summer, our Cyberinfrastructure team has added two members, and it’s our pleasure to reintroduce to you Trever Nightingale and Ken Suda.

Trever has returned to PACE in his position as Sr. Systems Support Engineer. Trever has a bachelor’s degree from Amherst College and a master’s degree from the University of Minnesota. His 20 years of UNIX experience include work at high performance and research computing centers such as the Naval Research Lab, NERSC, and the Centers for Disease Control (CDC).


Ken has been in IT professionally for almost 35 years and has filled most roles found in an IT organization. For the past couple of years, Ken has been a consultant and has run a game development company. As a consultant, Ken has been a generalist, filling whatever role the team or organization needed.

PACE is also pleased to announce the new roles for our team members Dan (Ann) Zhou, Andre McNeill, and Ruben Lara.

Dan (Ann) Zhou’s new role is Research Technologist Storage Architect for PACE. Ann has been a PACE team member since August 2014 and has been an integral part of the PACE cyberinfrastructure team, with responsibilities that include the operation, backup, and management of the many PACE storage systems. Ann received her bachelor’s degree in Electrical Engineering in China and her master’s degree in Electrical and Computer Engineering at Tennessee Technological University. She enjoys cooking, running, eating, and traveling. Her family has two kids and one husband.

Andre McNeill’s new role is Research Technologist Cloud Architect for PACE. Andre has been a member of PACE for nearly 10 years and continues to be a vital resource for both PACE staff and PACE customers in delivering a robust and reliable research computing environment, including computing, networking, and software systems. A graduate of Purdue University, Andre has many interests within PACE and many outside the workplace, including being a DJ.

Ruben Lara’s new role is Systems Support Engineer Manager for PACE. Ruben has been a part of the PACE cyberinfrastructure team since February 2017. Ruben has many excellent managerial and organizational skills and is currently enrolled in MOR Leadership training. Ruben enjoys baseball, ultimate frisbee, rock climbing, and mountain biking. You can find him by the window at the southwest end of the 10th floor of the Coda building.

Please join us in welcoming our new team members and congratulating our recently promoted team members!

Best,
The PACE Team

Release of Updated PACE User Documentation

Posted on Thursday, 29 August, 2019

PACE is pleased to announce the release of our updated PACE User Documentation. While we updated many of our existing guides, we have also added new guides with detailed instructions for various tasks, along with examples such as PBS scripts for applications you may want to run as batch or interactive jobs. The documentation is built using GitHub, which helps us maintain it more easily, and this modular design will mitigate any interruption in service should we decide to upgrade or change the web hosting technology for our overall site.

Over the Fall semester, we will begin phasing out the older documentation from our many pages on the PACE website and redirecting those pages to our new documentation. As some of you may recall from attending our prior PACE Consulting Sessions and Clusters Orientation classes, your feedback about our documentation (which was in beta at the time) was invaluable, and you will find much of it incorporated in this release. We hope you find this documentation helpful, and if you have any questions or comments, please don’t hesitate to let us know.

Again, you may access the new documentation from our home page or from the links below:

https://pace.gatech.edu/pace-user-documentation

or

https://docs.pace.gatech.edu