
Author Archive

TensorFlow update required due to identified security vulnerability

Posted on Wednesday, 20 October, 2021

Summary: TensorFlow update required due to identified security vulnerability

What’s happening and what are we doing: A security vulnerability was discovered in TensorFlow. PACE has installed the patched version 2.6.0 of TensorFlow in our software repository, and we will retire the older versions on November 3, 2021, during our maintenance period.

How does this impact me: Both researchers who use PACE’s TensorFlow installation and those who have installed their own are impacted.

The following PACE installations will be retired:

Modules: tensorflow-gpu/2.0.0 and tensorflow-gpu/2.2.0

Virtual envs under anaconda3/2020.02: pace-tensorflow-gpu-2.2.0 and pace-tensorflow-2.2.0

Please use the tensorflow-gpu/2.6.0 module instead of the older versions identified above. If you were previously using a PACE-provided virtual env inside the anaconda3 module, please use the separate new module instead. You can find more information about using PACE’s TensorFlow installation in our documentation. You will need to update your PBS scripts to call the new module, and you may need to update Python code to ensure compatibility with the latest version of the package.

If you have created your own conda environment on PACE and installed TensorFlow in it, please create a new virtual environment and install the necessary packages. You can build this environment from the tensorflow-gpu/2.6.0 virtual environment as a base if you would like, then install other packages you need, as described in our documentation. In order to protect Georgia Tech’s cybersecurity, please discontinue use of any older environments running prior versions of TensorFlow on PACE.
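To help identify environments that still run an older release, the following is a minimal Python sketch (not a PACE-provided tool) that reports whether the TensorFlow found in the active environment is at least 2.6.0, the patched version named in this announcement. The function names are hypothetical, and the lookup assumes the package is registered under the distribution name `tensorflow` or `tensorflow-gpu`; an environment that installed it under another name would be reported as not found.

```python
from importlib import metadata

MINIMUM_VERSION = (2, 6, 0)  # patched release named in this announcement


def parse_version(version):
    """Turn a string such as '2.6.0' into a tuple of three ints.

    Pre-release suffixes (for example 'rc1') are ignored in this sketch.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_patched(version, minimum=MINIMUM_VERSION):
    """Return True if the version is at or above the minimum."""
    return parse_version(version) >= minimum


def installed_tensorflow_version():
    """Look up the installed TensorFlow version without importing it."""
    # Assumed distribution names; adjust if your environment differs.
    for name in ("tensorflow", "tensorflow-gpu"):
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


if __name__ == "__main__":
    found = installed_tensorflow_version()
    if found is None:
        print("TensorFlow was not found in this environment.")
    elif is_patched(found):
        print(f"TensorFlow {found} is at or above 2.6.0.")
    else:
        print(f"TensorFlow {found} predates 2.6.0; please rebuild this environment.")
```

Run the script with the Python interpreter of each conda environment you want to check. It requires Python 3.8 or later for `importlib.metadata`.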

What we will continue to do: We are happy to assist researchers with the transition to the new version of TensorFlow. PACE will offer support to researchers upgrading TensorFlow at our upcoming consulting sessions. The next sessions are Thursday, October 28, 10:30-12:15, and Tuesday, November 2, 2:00-3:45. Visit our training page for the full schedule and BlueJeans links.

Thank you for your prompt attention to this security update, and please accept our sincere apology for any inconvenience that this may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Hive scheduler recurring outages

Posted on Friday, 15 October, 2021

[Update 10/15/21 5:15 PM]

The Hive scheduler is functioning at this time. The PACE team disabled several system utilities that may have contributed to earlier issues with the scheduler. We will continue to monitor the scheduler status and to work with our support vendor to improve stability of Hive’s scheduler. Please check this blog post for updates.

[Update 10/15/21 4:15 PM]

The Hive scheduler is again functional. The PACE team and our vendor are continuing our investigation in order to restore stability to the scheduler.

[Original Post 10/15/21 12:35 PM]

Summary: Hive scheduler recurring outages

What’s happening and what are we doing: The Hive scheduler has been experiencing intermittent outages over the past few weeks, requiring frequent restarts. At this time, the PACE team is running a diagnostic utility and will restart the scheduler shortly. The PACE team is actively investigating the outages in coordination with our scheduler vendor to restore stability to Hive’s scheduler.

How does this impact me: Hive researchers may be unable to submit or check the status of jobs, and jobs may be unable to start. You may find that the “qsub” and “qstat” commands and/or the “showq” command are not responsive. Already-running jobs will continue.

What we will continue to do: PACE will continue working to restore functionality to the Hive scheduler and coordinating with our support vendor. We will provide updates on our blog, so please check here for current status.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.


Hive Project & Scratch Storage Battery Replacement

Posted on Thursday, 23 September, 2021

[Update 9/23/21 3:15 PM]

The replacement batteries have reached a sufficient charge, and Hive GPFS performance has been restored. Thank you for your patience during this maintenance.

[Original Post 9/23/21 12:30 PM]

Summary: Battery replacement on Hive project & scratch storage will impact performance today.

What’s happening and what are we doing: UPS batteries on the Hive GPFS storage device, holding project (data) and scratch storage, need to be replaced. During the replacement, which will begin shortly this afternoon, storage will shift to write-through mode, and performance will be impacted. Once the new batteries are sufficiently charged, performance will return to normal.

How does this impact me: Hive project and scratch performance will be impacted until the fresh batteries have sufficiently charged, which should take approximately 3 hours. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Hive GPFS storage throughout this procedure.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Globus maintenance downtime on September 18

Posted on Friday, 10 September, 2021

Summary: Globus maintenance downtime on September 18

What’s happening and what are we doing: Globus will be undergoing maintenance worldwide on September 18, beginning at 11:00 AM and expected to last for up to 30 minutes, to complete database upgrades. Details are available on the Globus website.

How does this impact me: You will not be able to access Globus or start a transfer during this time. Any transfers in progress will be paused and will automatically resume upon completion of maintenance. This affects all Globus services, including endpoints at PACE on our Phoenix and Hive clusters, plus others you may use at other computing sites.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Resolved] Hive scheduler outage

Posted on Friday, 23 July, 2021

[Update 4:40 PM 7/23/21]

After continued investigation, cleaning up the scheduler logs, and rebooting the scheduler node, we have restored the Hive scheduler to full functionality. Jobs that have been submitted and queued are now running, and there was no interruption to running jobs. New jobs submitted at this time should start as space becomes available, as usual. Thank you for your patience as we investigated this situation.

Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 1:35 PM 7/23/21]

The Hive scheduler has been experiencing intermittent outages over the last few days while under heavy load, and jobs have been unable to start for nearly all of today (Friday). You may find that jobs you have submitted to Hive remain queued and do not start. We are actively working to restore functionality and will update you as more information becomes available. Thank you for your patience as we investigate this situation.

Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix storage issue

Posted on Thursday, 22 July, 2021

[Update 2:05 PM 7/22/21]

The controller reboot is complete, and we believe no disruption occurred in access to Phoenix storage. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 12:55 PM 7/22/21]

In coordination with our support vendor, we are working to resolve an issue with a Phoenix Lustre metadata controller, which supports both project and scratch storage.

At 1:30 PM today, we will reboot one of the controllers. We do not expect any impact to users, as the other controller is running without error at this time. Should there be any unexpected impact, we will work to restore full functionality as quickly as possible. We will provide an update when this work is complete.

Please contact us at pace-support@oit.gatech.edu with any questions.

Hive Project Storage Quota Update

Posted on Thursday, 22 July, 2021

In coordination with the Hive PIs, PACE has updated our quota policies for project storage on Hive, in order to facilitate easier access to available storage capacity for our users. For project storage, accessed via the “data” symbolic link in your home directory, block quotas are now shared by an entire research group, rather than being set at the user level. All users in a single PI’s storage allocation have access to the entire quota, which brings Hive in line with Phoenix’s quota arrangement. Most research groups have 50 TB of project storage on Hive, with the exception of those specifically provided with a higher allocation in the NSF grant funding the cluster. Each user maintains a limit of 2 million files within their research group’s project storage.

You can review your storage usage on Hive by running the updated “pace-quota” command on any Hive node. Quotas for home (5 GB per user) and scratch (7 TB per user) directories are unchanged. Please visit our documentation for more details about Hive storage.

Please contact us at pace-support@oit.gatech.edu with any questions about using Hive.

OIT NetApp upgrade

Posted on Friday, 9 July, 2021

A low-risk upgrade is planned for Georgia Tech OIT’s NetApp storage appliances, beginning Saturday, July 10, at 6:00 AM. We do not expect any impact on PACE systems from this upgrade.

OIT’s NetApp appliance is in use on PACE’s Phoenix, PACE-ICE, and COC-ICE clusters. It hosts home directories as well as pace-apps, our software module repository. Should there be an unexpected disruption, users may face issues with logins, access to home directories, and loading or using PACE-supported software modules. We will provide updates in the unlikely event of a disruption this weekend.

Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scratch storage update

Posted on Tuesday, 22 June, 2021

We would like to remind you about scratch storage policy on Phoenix. Scratch is designed for temporary storage and is never backed up. Each week, files not modified for more than 60 days are automatically deleted from your scratch directory. As part of Phoenix’s start-up, this regular cleanup of scratch has now been implemented. Each week, users with files scheduled for deletion in the coming week receive a warning email with additional information. Those of you who used PACE prior to the migration to Phoenix or who use Hive are already familiar with this workflow.

Some of you will receive such an email this week. The first deletion of old scratch files in Phoenix will occur on July 7, covering files noted in these messages. We are extending the time beyond the normal one-week notification for this first round to give you time to adjust to this weekly process again.
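If you would like to preview which files might fall under the 60-day rule before the warning email arrives, the following is a minimal Python sketch that lists files whose last modification time is older than a given number of days. It is not the cleanup tool PACE runs: the function name is hypothetical, the `~/scratch` path is an assumed location for your scratch directory, and the actual cleanup may apply criteria beyond modification time. Treat the warning email as the authoritative list.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def find_stale_files(root, max_age_days=60, now=None):
    """Return sorted paths under root last modified more than max_age_days ago."""
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat inspects the entry itself rather than a symlink target.
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Assumed location of the scratch directory; adjust to your own path.
    scratch = os.path.expanduser("~/scratch")
    for path in find_stale_files(scratch):
        print(path)
```

Walking a large directory tree generates many metadata requests, so it is kinder to the shared filesystem to run this on a specific subdirectory rather than on all of scratch at once.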

Phoenix project storage is the intended location for your important research data. You can find out more about Phoenix storage at http://docs.pace.gatech.edu/phoenix_cluster/storage_phnx/.

Please contact us at pace-support@oit.gatech.edu with any questions about how to manage your data stored on Phoenix.

[Resolved] Hive scheduler outage

Posted on Friday, 18 June, 2021

The Hive scheduler experienced an outage this afternoon, as the resource and workload managers were unable to communicate. Our team traced the issue to a missing library file and corrected it, restoring functionality at approximately 5 PM today.

Jobs submitted this afternoon would not have been able to start until the repair was implemented. Already-running jobs should not have been affected.

Please contact us at pace-support@oit.gatech.edu with any questions.