

[COMPLETED] PACE Quarterly Maintenance – November 7-9

Posted on Saturday, 26 October, 2019

[Update 11/5/19]

We would like to remind you that PACE’s maintenance period begins tomorrow. This quarterly maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

These activities will be performed:
ITEM REQUIRING USER ACTION:
– Anaconda distributions began using a year.month versioning scheme late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This scheme is easier for users to track, so all PACE resources will adopt the same convention with the anaconda2/2019.10 and anaconda3/2019.10 modules. The default Anaconda module will now point to the latest YYYY.MM version, and the module files for “latest” will be removed to avoid ambiguity. The software installations that “latest” points to will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or just load the default without specifying a version – e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.
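For example, a job script that still loads the “latest” module can be updated with a one-line substitution. This is only a sketch: the script and file names are hypothetical, GNU sed is assumed, and you should substitute whichever pinned version you need.

```shell
# Create a sample job script that still references the removed module name.
cat > myjob.pbs <<'EOF'
#PBS -N example
module load anaconda3/latest
python analysis.py
EOF

# Point it at a pinned version instead (or drop the version entirely
# to pick up the new YYYY.MM default).
sed -i 's|anaconda3/latest|anaconda3/2019.10|' myjob.pbs

grep 'module load' myjob.pbs   # module load anaconda3/2019.10
```

Running the same `sed` over all of your job scripts (e.g., `sed -i ... *.pbs`) migrates a whole directory at once.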

ITEMS NOT REQUIRING USER ACTION:
– (Completed) Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy implemented last week (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– (Completed) PBSTools, which records user job submissions, will be upgraded.
– (Completed) Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– (Completed) [Hive cluster] Infiniband switch firmware will be upgraded.
– (Completed) [Hive cluster] Storage system firmware will be updated.
– (Completed) [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– (Completed) [Hive cluster] Lmod, the environment module system, will be updated to a newer version.
– (Completed) The athena-6 queue will be upgraded to RHEL7.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

 

[Update 11/1/19]

We would like to remind you that we are preparing for PACE’s next quarterly maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEM REQUIRING USER ACTION:

– Anaconda distributions began using a year.month versioning scheme late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This scheme is easier for users to track, so all PACE resources will adopt the same convention with the anaconda2/2019.10 and anaconda3/2019.10 modules. The default Anaconda module will now point to the latest YYYY.MM version, and the module files for “latest” will be removed to avoid ambiguity. The software installations that “latest” points to will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or just load the default without specifying a version – e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.

ITEMS NOT REQUIRING USER ACTION:

– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy being implemented on Tuesday (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.

– RHEL7 clusters will receive critical patches.

– Updates will be made to PACE databases and configurations.

– PBSTools, which records user job submissions, will be upgraded.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.

– [Hive cluster] Infiniband switch firmware will be upgraded.

– [Hive cluster] Storage system software will be updated.

– [Hive cluster] Subnet managers will be reconfigured for better redundancy.

– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

 

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

 

[Original post]

We are preparing for PACE’s next maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:
ITEM REQUIRING USER ACTION:
– Anaconda distributions began using a year.month versioning scheme late last year. This scheme is easier for users to track, so all PACE resources will adopt the same convention with the anaconda2/2019.10 and anaconda3/2019.10 modules. The default Anaconda module will now point to the latest YYYY.MM version, and the module files for “latest” will be removed to avoid ambiguity. The software installations that “latest” points to will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or just load the default without specifying a version – e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.

ITEMS NOT REQUIRING USER ACTION:
– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy being implemented on Tuesday (10/29/19) limiting simultaneous job submissions, will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– RHEL7 clusters will receive critical patches.
– Updates will be made to PACE databases and configurations.
– Firmware for DDN storage will be updated.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– [Hive cluster] Infiniband switch firmware will be upgraded.
– [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

[Reminder] Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 18 October, 2019

This is a friendly reminder that our updated policy impacting Shared Clusters at PACE will take effect on October 29, 2019.

On October 29, 2019, we are reducing the limit on the number of queued/running jobs per user to 500.

Who is impacted? All researchers connecting to PACE resources via the login-s[X].pace.gatech.edu headnodes are impacted by this policy change (a list of impacted queues is provided below). We have identified all researchers/users impacted by these changes and contacted them on multiple occasions. During our consulting sessions, we have worked with researchers from a number of PI groups to help them adapt their workflows to the new per-user job limit. PACE provides and supports multiple solutions, such as job arrays, GNU parallel, and Launcher, to help users quickly adapt their workflows to this policy update.

  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
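As an illustration of one of the supported solutions, a job array submits many related tasks as a single scheduler job rather than hundreds of separate submissions. The sketch below follows Torque/PBS conventions; the job name, queue, resource requests, and the `process_input` program are placeholders, so adapt them to your own workflow.

```shell
#PBS -N array-example
#PBS -q iw-shared-6
#PBS -l nodes=1:ppn=1,walltime=2:00:00
#PBS -t 1-100

# The scheduler runs this script once per array index; $PBS_ARRAYID
# selects which input file each sub-job processes.
cd "$PBS_O_WORKDIR"
./process_input "input_${PBS_ARRAYID}.dat"
```

Submitted once with `qsub`, this covers 100 tasks through a single submission, which the scheduler handles far more efficiently than 100 individual `qsub` calls.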

Before this policy change takes effect on October 29, we have one more consulting session scheduled:

  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

For details about our policy change, please visit our blog post.

Again, the changes listed above will take effect on October 29, 2019. After October 29, users will not be able to have more than 500 jobs queued or running at a time.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Preventative Maintenance for UPS units at Rich Data Center

Posted on Friday, 11 October, 2019

OIT will be performing annual preventive maintenance on the UPS units for the Rich Data Center on Saturday, October 12, from 7:00 AM to about 5:00 PM. No outage is expected from this work; however, in the case of an outage, PACE clusters and the jobs running on them are at risk of being interrupted. Again, it is unlikely that we will have a power outage during this maintenance period.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Hive Cluster Status 10/3/2019

Posted on Thursday, 3 October, 2019

This morning we noticed that the Hive nodes were all marked offline. A short investigation found an issue with configuration management, which we have since corrected. The impact should be negligible: jobs that were already running continued normally, while queued jobs were simply held until resources became available. After our correction, those jobs started as expected. Nonetheless, we want to ensure that the Hive cluster provides the performance and reliability you expect, so if you find any problems in your workflow due to this minor hiccup, or encounter any problems moving forward, please do not hesitate to send an email to pace-support@oit.gatech.edu.

Distributed MATLAB now available on PACE

Posted on Tuesday, 24 September, 2019

PACE is excited to announce that distributed MATLAB use is now available on PACE resources. Georgia Tech’s new license allows for unlimited scaling of MATLAB on clusters. This change means that users can now run parallelized MATLAB code across multiple nodes. For detailed instructions, please visit our distributed MATLAB documentation at docs.pace.gatech.edu/software/matlab-distributed/.

Data center maintenance

Posted on Tuesday, 17 September, 2019

Beginning tomorrow morning (9/18/19), there will be urgent maintenance on the cooling system in the Rich data center, which houses all PACE clusters except for Hive. A temporary cooling unit has been installed, but should the secondary cooling unit fail, the room will begin to heat.  If that happens, portions of the data center will need to shut down until the temperature has returned to an acceptable level.  If the clusters are shut down, this will terminate any running jobs on the compute nodes.
Please follow updates on this maintenance and find a full list of affected services across campus at https://status.gatech.edu/pages/maintenance/5be9af0e5638b904c2030699/5d7fd6219012a0316b71ef83.
– PACE team

[Resolved] Shared Clusters Scheduler Down

Posted on Tuesday, 10 September, 2019

[Update – 9/10/2019 3:52 PM]

The shared scheduler has been restored to functionality. The issue stemmed from a large influx of jobs (>100,000) in less than 24 hours. As a reminder, the upcoming policy change on October 29, 2019, which limits the number of job submissions to 500 per user, is designed to mitigate this issue moving forward. If you feel your workflow may be impacted, please take the opportunity to read the documentation on parallel job solutions (job arrays, GNU parallel, and HTC Launcher) developed and put in place to help users quickly adapt their workflows accordingly. Currently, we have the following consulting sessions scheduled, with additional ones to be provided (check back for updates here).

Upcoming Consulting sessions:

  • September 24, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 8, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A
  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

Again, the changes listed above will take effect on October 29, 2019. After October 29, users will not be able to have more than 500 jobs queued or running at a time.

[Original]

The shared scheduler experienced an out-of-memory issue this morning at 7:44 AM, resulting in a hold on all jobs in the queues managed by this scheduler. This issue affects all users who submit jobs to PACE via the shared-cluster headnodes (login-s). Job submissions will currently hang. We ask that you refrain from submitting any new jobs until further notice while the PACE team investigates the matter and restores functionality to the scheduler.

Currently the following queues are affected by this scheduler issue: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biocluster-6,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,cns-24c,cns-48c,cns-6-intel,cnsforce-6,critcel,critcel-prv,critcelforce-6,cygnus,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,gaanam,gaanam-h,gaanamforce,gpu-eval,habanero,habanero-gpu,hummus,hydra-gpu,hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,lfgroup,math-6,mathforce-6,mcg-net,mcg-net-gpu,mday-test,metis,micro-largedata,microcluster,optimus,optimusforce-6,prometforce-6,prometheus,pvh,pvhforce,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,try-6,trybuy

We apologize for this inconvenience, and we appreciate your attention and patience.

The Launcher Documentation Available

Posted on Tuesday, 10 September, 2019

The Launcher (link) is a framework for running large collections of serial or multi-threaded applications as a single job on a batch-scheduled HPC system. Developed at the Texas Advanced Computing Center (TACC), it has been deployed at HPC centers throughout the world. The Launcher lets high-throughput computing users bundle many small tasks into larger single jobs, which schedule more efficiently and fit better within the HPC environment.

To better serve our high-throughput computing users, we have adapted this software for use on PACE systems.

Information on using Launcher on PACE is available at PACE Documentation.
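In outline, a Launcher run takes a plain-text job file with one independent command per line and executes those lines in parallel inside a single batch job. The sketch below builds such a file; the `simulate` program is a placeholder, and the commented environment variables follow TACC's upstream Launcher conventions (`LAUNCHER_JOB_FILE`, `paramrun`) – consult the PACE documentation for the exact module name and settings on PACE systems.

```shell
# Build a job file: one independent command per line.
for i in $(seq 1 8); do
  echo "./simulate --seed $i > out_${i}.log"
done > commands.txt

# Inside a batch job, Launcher would then execute the lines in
# parallel, e.g. (TACC conventions; exact setup on PACE may differ):
#   export LAUNCHER_JOB_FILE=commands.txt
#   $LAUNCHER_DIR/paramrun

wc -l < commands.txt   # 8 tasks bundled into a single job
```

Because the eight tasks run under one scheduled job, this pattern also keeps high-throughput workloads well within the per-user job limits described above.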

COMSOL use at PACE

Posted on Monday, 9 September, 2019

As you may know, the College of Engineering is changing the licensing model for COMSOL on September 16, 2019, and will now restrict access for research use to named users who have purchased access through CoE. Use of COMSOL for research on PACE is licensed through CoE (regardless of your college affiliation). If you or your PI have not yet made arrangements with CoE, please contact Angelica Remolina in CoE IT (angie.remolina@coe.gatech.edu). You will not be able to run COMSOL on PACE without permission from CoE after September 16.

[Resolved] Campus Network Down

Posted on Wednesday, 4 September, 2019

[Update] September 5

OIT reports that the campus network is again fully functional.

[Update] September 4 4:28 PM

This is a brief update: OIT Network Services has identified the cause of the campus network issues. One of the enterprise routers for campus rebooted unexpectedly, which impacted the campus network. Since this event, the network has been stabilized. OIT continues to monitor the situation for any further issues. For the latest updates, please check the OIT status page.

As for the PACE cluster(s), you should be able to access them without issues. If you continue to experience an issue, please disconnect and reconnect to restore your connectivity.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

[Original] September 4 2:30 PM

Our campus network is down.  OIT is investigating this incident, and you may check on the details from the link below:

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d6ff4f4daca6a0543918df2

This incident will prevent you from accessing the PACE resources, but your current jobs running on PACE should not be interrupted.

Please check the status link above for up-to-date details. If you have any questions, please send us a note at pace-support@oit.gatech.edu. Also note that we are impacted by the outage as well, so our responses to your email will be delayed.

Thank you for your patience.