
Globus authentication and endpoints

Posted on Wednesday, 15 January, 2020

We became aware this morning of an issue with Globus authentication to the “gatechpace#datamover” endpoint that many of you use to transfer files to/from PACE resources. We are working to repair this now; in the meantime, please use the “PACE Internal” endpoint instead. This endpoint provides access to the same filesystems you reach through the datamover endpoint (plus PACE Archive storage, for those who have signed up for our archive service), and it behaves exactly the same way in Globus. Going forward, you may continue to use this newer endpoint even after datamover is functioning again. For full instructions on using Globus with PACE, visit our Globus documentation page.
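For those who prefer the Globus command-line interface over the web application, switching endpoints simply means addressing transfers to the new endpoint. A minimal sketch, assuming you have globus-cli installed and are logged in (the UUIDs and paths below are placeholders, not real values):

    # Look up the new endpoint to find its UUID (search term is illustrative)
    globus endpoint search "PACE Internal"

    # Transfer a directory from a source endpoint to PACE Internal
    # (replace both UUIDs and both paths with your own values)
    globus transfer --recursive --label "example-transfer" \
        SOURCE_ENDPOINT_UUID:/path/to/data \
        PACE_INTERNAL_UUID:/~/data

The web interface works equally well; the CLI is shown only as one possible workflow.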

Please keep in mind that Globus is the best way to transfer files to/from PACE resources. Contact us at pace-support@oit.gatech.edu if you have any questions about using Globus.

[Re-Scheduled] Advisory of Hive cluster outage 1/20/20

Posted on Thursday, 9 January, 2020

We are writing to inform you of the upcoming Hive cluster outage that we learned about yesterday. PACE has no control over this outage. As part of the design of the Coda data center, we are working with the Southern Company (Georgia Power) on the creation and operation of a Micro Grid power generation facility. This is a set of products that will enable research into local generation of up to 2MW of off-grid power.

In order to connect this Micro Grid facility to the Coda data center's power, Southern Company will need to shut down all power to the research hall in Coda. As a result, the Hive cluster will need to be shut down during this procedure, and we are placing a scheduler reservation to prevent any jobs from running during the shutdown. This is currently planned to begin at 8am on January 20th, the Georgia Tech MLK holiday. GT asked whether the date could be moved to give longer notice, but was unable to change it. GT is working with the Southern Company to minimize the duration of the power outage, but the final duration is not yet known; it is currently expected to be at least 24 hours.

Update: The planned outage of the Coda data center has been re-scheduled, so the Hive cluster will remain available until the next PACE maintenance period on February 27. The reservation has been removed, and work should proceed as usual on January 20.

If you have any questions, please contact PACE Support at pace-support@oit.gatech.edu.

Rich Data Center UPS Maintenance

Posted on Wednesday, 8 January, 2020

The Rich data center uninterruptible power supply (UPS) will undergo maintenance to replace failed batteries on January 11, starting at 8:00am. Due to the power configuration, no systems in Rich are expected to lose power during this time, and all PACE services should function normally.

Please contact pace-support@oit.gatech.edu if you need more details.

[Re-Scheduled] Hive Cluster — Policy Update

Posted on Tuesday, 7 January, 2020

Since the deployment of the Hive cluster this fall, we have been pleased with the rapid growth of our user community and the cluster's steadily increasing utilization. During this period, we have received user feedback that has prompted changes intended to further increase productivity for all Hive users. Hive PIs have approved the following changes, which were deployed on January 9:

  1. Hive-gpu: The maximum walltime for jobs on hive-gpu will be decreased from the current 5 days to 3 days, to address the longer job wait times that users have experienced on this queue.
  2. Hive-gpu: To ensure that GPUs do not sit idle, jobs will not be permitted to use a CPU:GPU ratio higher than 6:1 (i.e., 6 cores per GPU). Each hive-gpu node has 24 CPUs and 4 GPUs.
  3. Hive-nvme-sas: A new queue, hive-nvme-sas, will be created that combines and shares compute nodes between the hive-nvme and hive-sas queues.
  4. Hive-nvme-sas, hive-nvme, hive-sas: The maximum walltime for jobs on the hive-nvme, hive-sas, and hive-nvme-sas queues will be increased from the current 5 days to 30 days.
  5. Hive-interact: A new interactive queue, hive-interact, will be created. It provides access to 32 Hive compute nodes (24 cores and 192 GB RAM each) for quick access to resources for testing and development. The walltime limit will be 1 hour.
  6. Hive-priority: A new hive-priority queue will be created, reserved for researchers with time-sensitive research deadlines. For access to this queue, please communicate the relevant dates and upcoming deadlines to the PACE team so that we can obtain the necessary approvals. Please note that we may not be able to provide access for requests made less than 14 days before the resource is needed, due to jobs already running at the time of the request.

Who is impacted:

  • All Hive users who use the hive-gpu, hive-nvme, and hive-sas queues
  • The additional queues being created will benefit, and thereby impact, all Hive users.

User Action:

  • Users will need to update their PBS scripts to reflect the new walltime limits and the CPU:GPU ratio requirement on the hive-gpu queue (see the example script after this list).
  • These changes will not impact currently running jobs.
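For illustration, a minimal PBS script that fits within the new hive-gpu limits might look like the sketch below. The resource-request line uses standard Torque-style syntax and the application command is a placeholder; please check the Hive documentation for the exact form PACE expects.

    #PBS -N gpu-example                 # job name (placeholder)
    #PBS -q hive-gpu                    # queue affected by the new limits
    #PBS -l nodes=1:ppn=6:gpus=1        # 6 cores per GPU, within the new 6:1 ratio
    #PBS -l walltime=72:00:00           # 3 days, the new hive-gpu maximum

    cd $PBS_O_WORKDIR
    module load anaconda3               # load whatever software your job needs
    python train_model.py               # placeholder application command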

Additionally:

We would like to remind you of the upcoming Hive cluster outage due to the creation of a Micro Grid power generation facility. At 8 AM on Monday, January 20th (the Georgia Tech MLK holiday), the Hive cluster will be shut down for an anticipated 24 hours. A reservation has been put in place on all Hive nodes during this period; any submitted jobs that would overlap with this outage will receive a warning to that effect and will remain queued until the work is complete. A similar warning will be generated for jobs overlapping with the upcoming cluster maintenance on February 27.

Update: The planned outage of the Coda data center has been re-scheduled, so the Hive cluster will remain available until the next PACE maintenance period on February 27. The reservation has been removed, and work should proceed as usual on January 20.

Our documentation has been updated to reflect these changes and queue additions, and can be found at http://docs.pace.gatech.edu/hive/gettingStarted/. If you have any questions, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Upcoming VPN updates

Posted on Friday, 3 January, 2020

We would like to let you know about upcoming upgrades to Georgia Tech’s VPNs. The VPN software will be updated by OIT to introduce a number of bug fixes and security improvements, including support for macOS 10.15 as well as Windows 10 ARM64-based devices. After the upgrade, your local VPN client will automatically download and install an update upon your next connection attempt. Please allow the software to update, then continue with your connection through the upgraded client.

The main campus “anyc” VPN, which is used to access PACE from off-campus locations, will be upgraded on January 28. The “pace” VPN, which is used to access our ITAR/CUI clusters from any location, will be upgraded on January 21.

If you wish to try the new client sooner, you may do so by connecting to the dev.vpn.gatech.edu VPN, which will prompt a download of the upgraded client. Due to capacity limitations, please disconnect after the update and return to using your normal VPN service.

For ongoing updates, please visit the OIT status announcements for the pace VPN or the anyc VPN.

As always, please contact us at pace-support@oit.gatech.edu with any concerns.

New PACE utilities: pace-jupyter-notebook and pace-vnc-job now available!

Posted on Friday, 6 December, 2019

Good Afternoon Researchers!

We are pleased to announce two new tools to improve interactive job experiences on the PACE clusters: pace-jupyter-notebook and pace-vnc-job!

Jupyter Notebooks are invaluable interactive programming tools that consolidate source code, visualizations, and formatted documentation into a single interface. These notebooks run in a web browser, and Jupyter supports many languages by allowing users to switch between programming kernels such as Python, MATLAB, R, Julia, C, and Fortran, to name a few. In addition to providing an interactive environment for developing and debugging code, Jupyter Notebooks are an ideal tool for teaching and demonstrating code and results, which PACE has used in its recent workshops.

The new utility pace-jupyter-notebook provides an easy-to-run command for launching a Jupyter notebook from the following login nodes/clusters (login-s[X], login-d[x], login7-d[x], testflight-login, zohar, gryphon, login-hive[X], pace-ice, coc-ice…) and displaying it in the browser of your choice on your workstation or laptop. To launch Jupyter, simply log in to PACE and run the command pace-jupyter-notebook -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect to your Jupyter Notebook! Full documentation on the use of pace-jupyter-notebook, including options for changing job walltime, processors, memory, and more, can be found at http://docs.pace.gatech.edu/interactiveJobs/jupyterInt/. Please note that on busy queues, you may experience longer wait times to launch the notebook.
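For example, a session might be started as follows (the hostname, username, and queue name are placeholders; use the login node and queue appropriate for your account):

    # From your workstation, connect to a PACE login node
    ssh gburdell3@login-hive1.pace.gatech.edu

    # Launch a Jupyter Notebook job on the queue of your choice
    pace-jupyter-notebook -q hive-interact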

In addition, we are providing a similar utility for running software with graphical user interfaces (GUIs), such as MATLAB, Paraview, ANSYS, and many more, on PACE clusters. VNC sessions offer a more robust experience than traditional X11 forwarding: with a local VNC Viewer client, you can connect to a remote desktop on a compute node and interact with the software as if it were running on your local machine. Similar to the Jupyter Notebook utility, the new utility pace-vnc-job provides an easy-to-run command for launching a VNC session on a compute node and connecting your client to the session. To launch a VNC session, log in to PACE and run the command pace-vnc-job -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect your VNC Viewer to the remote session and start the software you wish to run. Full documentation on the use of pace-vnc-job, including options for changing job walltime, processors, memory, and more, can be found at http://docs.pace.gatech.edu/interactiveJobs/setupVNC_Session/. Again, please note that on busy queues, you may experience longer wait times to launch a VNC session.
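The workflow mirrors the Jupyter utility; a short sketch with a placeholder queue name:

    # Request a VNC session on a compute node in the chosen queue
    pace-vnc-job -q force-6

    # Follow the printed three-step prompt to point your local VNC Viewer at the
    # remote desktop, then start your GUI application (e.g., MATLAB) from there.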

Happy Interactive computing!

Best,
The PACE Team

[COMPLETED] PACE Quarterly Maintenance – November 7-9

Posted on Saturday, 26 October, 2019

[Update 11/5/19]

We would like to remind you that PACE’s maintenance period begins tomorrow. This quarterly maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

These activities will be performed:
ITEM REQUIRING USER ACTION:
– The Anaconda Distribution began using a year.month versioning scheme late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This is easier for all PACE users to track, so all PACE resources will now adopt the same convention with anaconda2/2019.10 and anaconda3/2019.10 modules. Anaconda defaults will now point to the latest YYYY.MM version. Accordingly, the anaconda module files for “latest” will be removed to avoid ambiguity; however, the software installations behind “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or simply load the default without specifying a version, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.
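For example, a job script or interactive session that currently loads a “latest” module could be updated as shown below (the exact old module name may differ on your cluster; the versioned name follows the new YYYY.MM convention):

    # Before maintenance (module name ending in "latest" will be removed):
    module load anaconda3/latest

    # After maintenance: pin a specific version ...
    module load anaconda3/2019.10

    # ... or load the unversioned default, which points to the newest YYYY.MM release
    module load anaconda3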

ITEMS NOT REQUIRING USER ACTION:
– (Completed) Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy implemented last week (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– (Completed) PBSTools, which records user job submissions, will be upgraded.
– (Completed) Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– (Completed) [Hive cluster] Infiniband switch firmware will be upgraded.
– (Completed) [Hive cluster] Storage system firmware will be updated.
– (Completed) [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– (Completed) [Hive cluster] Lmod, the environment module system, will be updated to a newer version.
– (Completed) The athena-6 queue will be upgraded to RHEL7.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

 

[Update 11/1/19]

We would like to remind you that we are preparing for PACE’s next quarterly maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEM REQUIRING USER ACTION:

– The Anaconda Distribution began using a year.month versioning scheme late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This is easier for all PACE users to track, so all PACE resources will now adopt the same convention with anaconda2/2019.10 and anaconda3/2019.10 modules. Anaconda defaults will now point to the latest YYYY.MM version. Accordingly, the anaconda module files for “latest” will be removed to avoid ambiguity; however, the software installations behind “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or simply load the default without specifying a version, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.

ITEMS NOT REQUIRING USER ACTION:

– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy being implemented on Tuesday (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.

– RHEL7 clusters will receive critical patches.

– Updates will be made to PACE databases and configurations.

– PBSTools, which records user job submissions, will be upgraded.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.

– [Hive cluster] Infiniband switch firmware will be upgraded.

– [Hive cluster] Storage system software will be updated.

– [Hive cluster] Subnet managers will be reconfigured for better redundancy.

– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

 

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

 

[Original post]

We are preparing for PACE’s next maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:
ITEM REQUIRING USER ACTION:
– The Anaconda Distribution began using a year.month versioning scheme late last year. This is easier for all PACE users to track, so all PACE resources will now adopt the same convention with anaconda2/2019.10 and anaconda3/2019.10 modules. Anaconda defaults will now point to the latest YYYY.MM version. Accordingly, the anaconda module files for “latest” will be removed to avoid ambiguity; however, the software installations behind “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should modify their commands to reference a specific version of Anaconda (or simply load the default without specifying a version, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system or updating your scripts.

ITEMS NOT REQUIRING USER ACTION:
– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy being implemented on Tuesday (10/29/19) limiting simultaneous job submissions, will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– RHEL7 clusters will receive critical patches.
– Updates will be made to PACE databases and configurations.
– Firmware for DDN storage will be updated.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– [Hive cluster] Infiniband switch firmware will be upgraded.
– [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

[Reminder] Policy Update to Shared Clusters’ Scheduler

Posted on Friday, 18 October, 2019

This is a friendly reminder that our updated policy impacting Shared Clusters at PACE will take effect on October 29, 2019.

On October 29, 2019, we are reducing the limit on the number of queued/running jobs per user to 500.

Who is impacted? All researchers connecting to PACE resources via the login-s[X].pace.gatech.edu headnodes are impacted by this policy change (a list of impacted queues is provided below). We have identified all researchers/users impacted by these changes and have contacted them on multiple occasions. We have also worked with a number of researchers from different PI groups during our consulting sessions to help them adapt their workflows to the new per-user job limit. PACE provides and supports multiple solutions, such as job arrays, GNU Parallel, and Launcher, to help users quickly adapt their workflows to this policy update (see the job array sketch after the queue list below).

  • List of queues impacted are as follows: apurimac-6,apurimacforce-6,b5-6,b5-prv-6,b5force-6,bench-gpu,benchmark,biobot,biobotforce-6,biocluster-6,biocluster-gpu,bioforce-6,biohimem-6,casper,cee,ceeforce,chem-bigdata,chemprot,chemx,chemxforce,chowforce-6,cns-24c,cns-48c,cns-6-intel,cnsforce-6,coc,coc-force,critcel,critcel-burnup,critcel-manalo,critcel-prv,critcel-tmp,critcelforce-6,cygnus,cygnus-6,cygnus-hp,cygnus-hp-lrg-6,cygnus-hp-small,cygnus-xl,cygnus24-6,cygnus28,cygnus64-6,cygnusforce-6,cygnuspa-6,davenprtforce-6,dimer-6,dimerforce-6,ece,eceforce-6,enveomics-6,faceoff,faceoffforce-6,flamel,flamelforce,force-6,force-gpu,force-single-6,gaanam,gaanam-h,gaanamforce,gemini,ggate-6,gpu-eval,gpu-recent,habanero,habanero-gpu,hummus,hydraforce,hygene-6,hygeneforce-6,isabella,isabella-prv,isblforce-6,iw-shared-6,jangmse,joe,joe-6,joe-bigshort,joe-fast,joe-test,joeforce,kastella,kastellaforce-6,kokihorg,lfgroup,math-6,math-magma,mathforce-6,mayorlab_force-6,mcg-net,mcg-net-gpu,mcg-net_old,mday-test,medprintfrc-6,megatron-elite,megatronforce-6,metis,micro-largedata,microcluster,nvidia-gpu,optimus,optimusforce-6,phi-shared,prometforce-6,prometheus,pvh,pvhforce,radiance,romberg,rombergforce,sagan,sonar-6,sonarforce-6,spartacus,spartacusfrc-6,spinons,spinonsforce,threshold,threshold-6,tmlhpc,try-6,trybuy
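As an example of consolidating many similar submissions, a Torque-style job array submits a whole parameter sweep with a single qsub. The script below is only a sketch with placeholder file and script names; how array tasks are counted toward the 500-job limit is a question for PACE support.

    #PBS -N param-sweep                 # one array submission replaces many individual jobs
    #PBS -t 1-200                       # 200 array tasks, indexed 1..200
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=02:00:00

    cd $PBS_O_WORKDIR
    # Each task selects its own input file using the array index
    python process.py --input case_${PBS_ARRAYID}.dat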

Prior to this policy change taking effect on October 29, we have one more consulting session scheduled:

  • October 22, 1:00pm – 2:45pm, Molecular Sciences and Engineering Room 1201A

For details about our policy change, please visit our blog post.

Again, the changes listed above will take effect on October 29, 2019. After October 29, users will not be able to have more than 500 jobs queued or running at a time.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Preventative Maintenance for UPS units at Rich Data Center

Posted on Friday, 11 October, 2019

OIT will be performing annual preventative maintenance on the UPS units for the Rich data center on Saturday, October 12, from 7:00AM to about 5:00PM. No outage is expected from this work; however, in the event of an outage, PACE clusters and the jobs running on them are at risk of being interrupted. Again, it is unlikely that we will have a power outage during this maintenance period.

If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Hive Cluster Status 10/3/2019

Posted on Thursday, 3 October, 2019

This morning we noticed that the Hive nodes were all marked offline. After a short investigation, we discovered an issue with configuration management, which we have since corrected. Ultimately, the impact should be negligible: jobs that were already running continued as normal, while queued jobs were simply held until resources became available again. After our correction, those jobs started as expected. Nonetheless, we want to ensure that the Hive cluster provides the performance and reliability you expect, so if you find any problems in your workflow due to this minor hiccup, or encounter any problems moving forward, please do not hesitate to send an email to pace-support@oit.gatech.edu.