PACE A Partnership for an Advanced Computing Environment

September 25, 2019

OIT Planned Maintenance

Filed under: Maintenance — Aaron Jezghani @ 2:51 pm

The OIT Network Services team will be performing a software upgrade on our campus Carrier-Grade NAT (CGN) appliances this week (see OIT Status for a full description). The affected subnet is the out-of-band management subnet for the Hive/MRI servers, and only internet-bound connections are affected. As such, no failures are expected for users of the Hive/MRI servers. Nonetheless, if you encounter connectivity issues with Hive resources, please do not hesitate to contact pace-support@oit.gatech.edu for assistance.

September 24, 2019

Distributed MATLAB now available on PACE

Filed under: Uncategorized — Michael Weiner @ 1:36 pm

PACE is excited to announce that distributed MATLAB use is now available on PACE resources. Georgia Tech’s new license allows for unlimited scaling of MATLAB on clusters. This change means that users can now run parallelized MATLAB code across multiple nodes. For detailed instructions, please visit our distributed MATLAB documentation at docs.pace.gatech.edu/software/matlab-distributed/.

September 17, 2019

Data center maintenance

Filed under: Uncategorized — Michael Weiner @ 3:03 pm

Beginning tomorrow morning (9/18/19), there will be urgent maintenance on the cooling system in the Rich data center, which houses all PACE clusters except for Hive. A temporary cooling unit has been installed, but should the secondary cooling unit fail, the room will begin to heat.  If that happens, portions of the data center will need to shut down until the temperature has returned to an acceptable level.  If the clusters are shut down, this will terminate any running jobs on the compute nodes.
Please follow updates on this maintenance and find a full list of affected services across campus at https://status.gatech.edu/pages/maintenance/5be9af0e5638b904c2030699/5d7fd6219012a0316b71ef83.
– PACE team

September 10, 2019

[Resolved] Shared Clusters Scheduler Down

Filed under: Uncategorized — Aaron Jezghani @ 3:20 pm

[Update – 9/10/2019 3:52 PM]

The shared scheduler has been restored to functionality. The issue stemmed from a large influx of jobs (>100,000) in less than 24 hours. As a reminder, the upcoming policy change on October 29, 2019, which limits the number of job submissions to 500 per user, is designed to mitigate this issue moving forward. If you feel your workflow may be impacted, please take the opportunity to read the documentation on parallel job solutions (job arrays, GNU parallel, and HTC Launcher), which were developed to help users adapt their workflows quickly. We currently have the following consulting sessions scheduled, with additional ones to be added (check back here for updates).

Upcoming Consulting sessions:

  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019.  After October 29, users will not be able to submit more than 500 jobs.
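
For workflows made up of many small tasks, one way to stay under the 500-job limit is to group tasks so that each submitted job works through several of them. The Python sketch below only illustrates that grouping arithmetic. The names batch_tasks and MAX_JOBS are hypothetical and not part of any PACE tool; the job array, GNU parallel, and Launcher documentation remains the place to look for supported approaches.

```python
import math

# Per-user submission limit stated in the announcement above.
MAX_JOBS = 500


def batch_tasks(tasks, max_jobs=MAX_JOBS):
    """Split tasks into at most max_jobs contiguous batches of near-equal size."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    tasks = list(tasks)
    if not tasks:
        return []
    # Smallest batch size that keeps the number of batches within the limit.
    size = math.ceil(len(tasks) / max_jobs)
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]


if __name__ == "__main__":
    batches = batch_tasks(range(100_000))
    print(len(batches), "batches; first batch holds", len(batches[0]), "tasks")
```

With 100,000 tasks, the sketch produces 500 batches of 200 tasks each, so each batch could be handled by a single submitted job.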

[Original]

The shared scheduler experienced an out-of-memory issue this morning at 7:44 AM, resulting in a hold on all jobs in the queues managed by this scheduler. This issue affects all users who submit jobs to PACE via the shared clusters headnodes (login-s). New job submissions will currently hang. We ask that you refrain from submitting any new jobs until further notice while the PACE team investigates the matter and restores functionality to the scheduler.

Currently the following queues are affected by this scheduler issue: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biocluster-6,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,cns-24c,cns-48c,cns-6-intel,cnsforce-6,critcel,critcel-prv,critcelforce-6,cygnus,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,gaanam,gaanam-h,gaanamforce,gpu-eval,habanero,habanero-gpu,hummus,hydra-gpu,hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,lfgroup,math-6,mathforce-6,mcg-net,mcg-net-gpu,mday-test,metis,micro-largedata,microcluster,optimus,optimusforce-6,prometforce-6,prometheus,pvh,pvhforce,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,try-6,trybuy

We apologize for this inconvenience, and we appreciate your attention and patience.

The Launcher Documentation Available

Filed under: Uncategorized — Semir Sarajlic @ 2:57 pm

The Launcher is a framework for running large collections of serial or multi-threaded applications as a single job on a batch-scheduled HPC system. The Launcher was developed at the Texas Advanced Computing Center (TACC) and has been deployed at multiple HPC centers throughout the world. It allows High-Throughput Computing users to take advantage of the benefits of scheduling larger single jobs and to fit better within the HPC environment.

To better serve our High-Throughput Computing users, we have adapted this software for use on the PACE systems.

Information on using the Launcher on PACE is available in the PACE documentation.
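
To give a rough sense of the idea described above, where many independent serial commands run inside one job instead of being submitted one by one, here is a minimal Python sketch. It is not the Launcher and does not use its interface; run_task and run_collection are hypothetical names, and the worker pool here runs on a single machine only.

```python
import shlex
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_task(command_line):
    """Run one serial command (given as a shell-like string) and return its exit code."""
    return subprocess.run(shlex.split(command_line)).returncode


def run_collection(command_lines, workers=4):
    """Run all commands with at most `workers` running at once.

    Exit codes are returned in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, command_lines))


if __name__ == "__main__":
    # Eight tiny stand-in tasks; a real collection would list actual programs.
    python = shlex.quote(sys.executable)
    demo = [f'{python} -c "print({n} * {n})"' for n in range(8)]
    print(run_collection(demo, workers=4))
```

From the scheduler's point of view, the whole collection is one job, however many commands it contains. That is the property that makes this style of workflow a good fit for a batch-scheduled system.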

September 9, 2019

COMSOL use at PACE

Filed under: Uncategorized — Michael Weiner @ 9:17 pm

As you may know, the College of Engineering is changing the licensing model for COMSOL on September 16, 2019, and will now restrict access for research use to named users who have purchased access through CoE. Use of COMSOL for research on PACE is licensed through CoE (regardless of your college affiliation). If you or your PI have not yet made arrangements with CoE, please contact Angelica Remolina in CoE IT (angie.remolina@coe.gatech.edu). You will not be able to run COMSOL on PACE without permission from CoE after September 16.

September 4, 2019

[Resolved] Campus Network Down

Filed under: Uncategorized — Michael Weiner @ 8:46 pm

[Update] September 5

OIT reports that the campus network is again fully functional.

[Update] September 4 4:28 PM

This is a brief update. OIT Network Services has identified the cause of the campus network issues: one of the enterprise routers for campus rebooted unexpectedly, which impacted our campus network. Since this event, the network has been stabilized. OIT continues to monitor the situation for any further issues. For the latest updates, please check the OIT status page.

You should now be able to access the PACE cluster(s) without issues. If you continue to experience problems, please try disconnecting and reconnecting to restore your connectivity.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

[Original] September 4 2:30 PM

Our campus network is down. OIT is investigating this incident, and you can follow the details at the link below:

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d6ff4f4daca6a0543918df2

This incident will prevent you from accessing the PACE resources, but your current jobs running on PACE should not be interrupted.

Please check the status link above for up-to-date details. If you have any questions, please send us a note at pace-support@oit.gatech.edu. Please note that we are also affected by the outage, so our responses to your email will be delayed.

Thank you for your patience.
