

[UPDATE] shared-scheduler Degraded Performance

Posted on Tuesday, 28 July, 2020

7/31/2020 UPDATE

Dear Researchers,

In addition to the previously announced maintenance day activities, we will be migrating the Torque component of shared-sched to a dedicated server to address the recent performance issues. This move should improve the scheduler’s response time to client queries such as qstat, and decrease job submission and start times when compute resources are available. While you do not need to do anything to prepare for this migration, we advise that you make note of any jobs queued at the start of maintenance just in case. As always, please direct any questions or concerns to pace-support@oit.gatech.edu. We thank you for your patience.
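One lightweight way to record that information is to save a snapshot of your own queue before maintenance begins; a minimal sketch using the standard Torque client commands (the output filename is only an illustration):

    # record your queued and running jobs before the maintenance window starts
    qstat -u $USER > ~/jobs-before-maintenance-$(date +%Y%m%d).txt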

The PACE Team

 

7/29/2020 UPDATE

Dear Researchers,

At this time the scheduler is functional, although some commands may be slow to respond. We will continue investigating to ascertain the source of these problems, and will update accordingly. Thank you.

[ORIGINAL MESSAGE]

We are aware of a significant slowdown in the performance of the shared-scheduler since last week. Initial attempts to resolve the issue towards the end of the week appeared successful, but the problems have restarted and we are continuing our investigation along with scheduler support. We appreciate your patience as we work to restore full functionality to shared-scheduler.

The PACE Team

[RESOLVED] Rich Data/Project and Scratch Storage Slow Performance

Posted on Tuesday, 14 April, 2020

[RESOLVED]:
At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to full functionality. The problems addressed over the course of this fix included:

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TrueNAS hpctn1 lost access to drives because of a jostled SAS cable on a drive that had been replaced as part of a CAB “standard change” of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, a license file was missing on unit 1a of TrueNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.
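If you are unsure whether a particular job was affected, a quick spot-check along these lines may help (the job ID, file names, and script name below are placeholders):

    # inspect the end of the job's output and error files for errors or truncated results
    tail my_job.o1234567 my_job.e1234567
    # check whether output files were modified when you expected them to be
    ls -l results/
    # resubmit the job if it failed for an unexplained reason
    qsub job.pbs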

Thank you.

[UPDATE]:
The issues from this morning’s storage problems are still ongoing. At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[Original Post]:
We have identified slow performance in the Rich data/project and scratch storage volumes. Jobs utilizing these volumes may experience problems, so please verify results accordingly. We are actively working to resolve the issue.

PACE License Manager and Server Issues

Posted on Wednesday, 8 April, 2020

Overnight we experienced issues with several of our servers, including our license manager, the GTLib server, and the Testflight and Novazohar queues. We are actively addressing the problem and have restored functionality to the license manager and Novazohar. We are still working on Testflight and will provide updates as they become available. As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[RESOLVED] RHEL7 Dedicated Scheduler Down

Posted on Wednesday, 25 March, 2020

[RESOLVED] We have restored functionality to the RHEL7 dedicated scheduler. Thank you for your patience.

[UPDATE] The RHEL7 dedicated scheduler, accessed via login7-d, is again down. We are actively working to resolve the issue at this time, and we will update you when the scheduler is restored. Please follow the same blog post (http://blog.pace.gatech.edu/?p=6715) for updates. If you have any questions, please contact pace-support@oit.gatech.edu.

[RESOLVED] We have rebooted the RHEL7 Dedicated scheduler, and functionality has been restored. Thank you for your patience.

[ORIGINAL MESSAGE] Roughly 30 minutes ago we identified an issue with the scheduler for dedicated RHEL7 clusters; this scheduler is responsible for all jobs submitted from the dedicated RHEL7 headnode, login7-d. All other schedulers are operating as expected. We are actively working to resolve the problem, but in the meantime you will be unable to submit new jobs or query the status of queued or running jobs.

If you have any questions, please contact pace-support@oit.gatech.edu.

[Resolved] Rich InfiniBand Switch Power Failure

Posted on Wednesday, 19 February, 2020

This morning, we discovered a power failure in an InfiniBand switch in the Rich Datacenter that resulted in GPFS mount failures on a number of compute resources. Power was restored at 9:10am, and connectivity across the switch has been confirmed. However, jobs running prior to the fix may have experienced problems (including failure to produce results or exiting with errors) due to GPFS access time-outs. Please review the status of any jobs run recently by checking the output/error logs or, for jobs still running, the timestamps of output files for any discrepancies. If an issue appears (e.g. previously successful code exceeded its wallclock limit with no output, or file creation occurred much later than the start of the job), please resubmit the job.
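A couple of quick checks along these lines can help surface affected jobs (the directory, file patterns, and error strings are illustrative only):

    # compare output-file timestamps against when the job started
    ls -l --time-style=long-iso results/
    # scan recent PBS output/error files for common storage-access errors
    grep -iE 'stale file handle|input/output error' *.o* *.e*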

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

RESOLVED [Hive and Testflight-CODA Clusters] Connectivity Issue to All CODA Resources

Posted on Friday, 14 February, 2020

RESOLVED [1:44 PM]:

The network engineers report that they have fixed the issues and are continuing to monitor them, although the cause remains unknown. Jobs appear to have continued uninterrupted on the Hive and Testflight-CODA clusters, but we encourage users to verify.
https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5e46cb01fa0e5304bc04ecb5
Any residual issues should be reported to pace-support@oit.gatech.edu. Thank you.

UPDATE [11:33 AM]:

Georgia Tech IT is aware of the situation and is investigating as well.

Original Message:

Around 11:00 AM, we noticed that we could not connect to any resources housed in CODA, including the Hive and Testflight-CODA clusters. The source of the problem is being investigated, and access to these resources will be affected in the meantime. In theory, jobs on these clusters should continue to run. Further details will be provided as they become available.

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu. Thank you.

[COMPLETED] PACE Maintenance – February 27-29

Posted on Thursday, 13 February, 2020

[COMPLETED – 6:51 PM 2/28/2020]

We are pleased to announce that our February 2020 maintenance period (https://blog.pace.gatech.edu/?p=6676) has completed ahead of schedule. We have restored access to computational resources, and previously queued jobs will start as resources allow. The login nodes and storage systems are now accessible. 

As usual, there are a small number of straggling nodes that will require additional intervention.  

A summary of the changes and actions accomplished during this maintenance period is as follows:

  • (Completed) RHEL7 clusters received critical patches
  • (Completed) Updates were made to PACE databases and configurations
  • (Deferred) [Hive, Testflight-CODA clusters] Power down of the entire research hall for Georgia Power reconnections
  • (Completed) [Hive cluster] Replaced failed InfiniBand leaf on EDR switch
  • (Completed) [Hive cluster] InfiniBand subnet managers were reconfigured for better redundancy
  • (In Progress) [Hive and Testflight-CODA clusters] Lmod, the environment module system, is being updated to a newer version
  • (Completed) [Hive cluster] Ran OSU Benchmark test on idle resources
  • (Completed) [GPFS file system] Applied latest maintenance releases and firmware updates
  • (In Progress) [Lustre file system] Applying latest maintenance releases and firmware updates

Thank you for your patience!

[UPDATE – 8:52 AM 2/27/2020]

The PACE maintenance period is underway. For the duration of maintenance, users will be unable to access PACE resources. Once the maintenance activities are complete, we will notify users of the availability of the cluster.

Also, we have been told by Georgia Power that they expect their work may take up to 72 hours to complete; as such, the maintenance outage for the CODA research hall (Hive and Testflight-CODA clusters) will extend until 6:00 AM Monday morning. We will provide updates as they are available.

[Original Message]

We are preparing for PACE’s next maintenance period on February 27-29, 2020. The maintenance is planned for three days, starting on Thursday, February 27, and ending Saturday, February 29. However, Georgia Power will begin work to establish a Micro Grid power generation facility on Thursday; while that work should complete within 48 hours, any delays may extend the maintenance outage for the Hive and Testflight-CODA clusters through Sunday. PACE clusters in Rich will not be impacted by any delays in Georgia Power’s work. Should any issues and resulting delays occur, users will be notified accordingly. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off; these jobs will be released as soon as the maintenance activities are complete (see the walltime example following the lists below). We are still finalizing planned activities for the maintenance period. Here is the current list:

ITEM REQUIRING USER ACTION:

  • None

ITEMS NOT REQUIRING USER ACTION:

  • RHEL7 clusters will receive critical patches.
  • Updates will be made to PACE databases and configurations.
  • [Hive, Testflight-CODA clusters] Power down all of the research hall for Georgia Power reconnections
  • [Hive cluster] Replace failed InfiniBand leaf on EDR switch
  • [Hive cluster] InfiniBand subnet managers will be reconfigured for better redundancy
  • [Hive and Testflight-CODA clusters] Lmod, the environment module system, will be updated to a newer version
  • [Hive cluster] Run OSU Benchmark test on idle resources
  • [GPFS and Lustre file systems] Apply latest maintenance releases and firmware updates
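As noted above, jobs whose requested walltimes would overlap the maintenance window will be held until afterwards. If you would rather have a job run beforehand, one option is to request a walltime short enough for the job to finish before the Thursday morning shutdown; a minimal sketch using standard Torque commands (the script name, job ID, and 48-hour limit are placeholders):

    # request a walltime that ends before the maintenance period begins
    qsub -l walltime=48:00:00 job.pbs
    # confirm the walltime the scheduler recorded for the job
    qstat -f <jobid> | grep -i walltime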

 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Rich Data Center UPS Maintenance

Posted on Wednesday, 8 January, 2020

The Rich data center uninterruptible power supply (UPS) will undergo maintenance to replace failed batteries on 11 January, starting at 8:00am. Due to the power configuration, no systems in Rich are expected to lose power during this time, and all PACE services should function normally.

Please contact pace-support@oit.gatech.edu if you need more details.

New PACE utilities: pace-jupyter-notebook and pace-vnc-job now available!

Posted on Friday, 6 December, 2019

Good Afternoon Researchers!

We are pleased to announce two new tools to improve interactive job experiences on the PACE clusters: pace-jupyter-notebook and pace-vnc-job!

Jupyter Notebooks are invaluable interactive programming tools that consolidate source code, visualizations, and formatted documentation into a single interface. These notebooks run in a web browser, and Jupyter supports many languages by allowing users to switch between programming kernels such as Python, MATLAB, R, Julia, C, and Fortran, to name a few. In addition to providing an interactive environment for developing and debugging code, Jupyter Notebooks are an ideal tool for teaching and demonstrating code and results, which PACE has utilized for its recent workshops.

The new utility pace-jupyter-notebook provides an easy-to-run command for launching a Jupyter notebook from the following login nodes/clusters (login-s[X], login-d[x], login7-d[x], testflight-login, zohar, gryphon, login-hive[X], pace-ice, coc-ice…) and using it from the browser of your choice on your workstation or laptop. To launch Jupyter, simply log in to PACE and run the command pace-jupyter-notebook -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect to your Jupyter Notebook! Full documentation on the use of pace-jupyter-notebook, including available options such as job walltime, processors, and memory, can be found at http://docs.pace.gatech.edu/interactiveJobs/jupyterInt/. Please note that on busy queues, you may experience longer wait times to launch the notebook.
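As a concrete illustration, a typical session might look like the following (the hostname and queue name are hypothetical; use the login node and queue you normally have access to):

    # from your workstation or laptop, connect to a PACE login node
    ssh your-gt-username@login-hive1.pace.gatech.edu
    # request a Jupyter Notebook job in your queue
    pace-jupyter-notebook -q hive-interact
    # follow the three-step prompt, then open the printed address in your local browser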

In addition, we are providing a similar utility for running software with graphical user interfaces (GUIs), such as MATLAB, ParaView, ANSYS, and many more, on PACE clusters. VNC sessions offer a more robust experience than traditional X11 forwarding: with a local VNC Viewer client, you can connect to a remote desktop on a compute node and interact with the software as if it were running on your local machine. Similar to the Jupyter Notebook utility, the new utility pace-vnc-job provides an easy-to-run command for launching a VNC session on a compute node and connecting your client to it. To launch a VNC session, log in to PACE and run the command pace-vnc-job -q <QUEUENAME>, where <QUEUENAME> should be replaced with the queue in which you wish to run your job. Once the job starts, follow the three-step prompt to connect your VNC Viewer to the remote session and start up the software you wish to run. Full documentation on the use of pace-vnc-job, including available options such as job walltime, processors, and memory, can be found at http://docs.pace.gatech.edu/interactiveJobs/setupVNC_Session/. Again, please note that on busy queues, you may experience longer wait times to launch a VNC session.
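A comparable VNC session, again with placeholder hostname and queue name, and assuming a VNC Viewer (e.g. TigerVNC or RealVNC) is installed locally, might look like:

    # connect to a PACE login node from your workstation or laptop
    ssh your-gt-username@login7-d.pace.gatech.edu
    # request a VNC session job in your queue
    pace-vnc-job -q my-queue
    # follow the three-step prompt to point your local VNC Viewer at the session,
    # then launch the GUI application (for example, MATLAB) from the remote desktop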

Happy interactive computing!

Best,
The PACE Team

Hive Cluster Status 10/3/2019

Posted on Thursday, 3 October, 2019

This morning we noticed that the Hive nodes were all marked offline. After a short investigation, it was discovered that there was an issue with configuration management, which we have since corrected. Ultimately, the impact from this should be negligible, as jobs that were running continued accordingly, while queued jobs were simply held until resources became available. After our correction, these jobs started as expected. Nonetheless, we want to ensure that the Hive cluster provides performance and reliability as expected, so if you find any problems in your workflow due to this minor hiccup, or for any problems you encounter moving forward, please do not hesitate to send an email to pace-support@oit.gatech.edu.