
[COMPLETED] PACE Quarterly Maintenance – November 7-9

Posted on Saturday, 26 October, 2019

[Update 11/5/19]

We would like to remind you that PACE’s maintenance period begins this Thursday, November 7. This quarterly maintenance period is planned for three days, running through Saturday, November 9. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
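In practice, a job whose requested walltime would overlap the shutdown will wait in the held state until after maintenance, while a shorter request can still start. A rough sketch (the exact cutoff depends on when systems are powered off, and the script name and resource values here are only placeholders):

# Submitted on 11/5: a 72-hour walltime cannot finish before the 11/7 shutdown,
# so the scheduler will hold this job until maintenance is complete.
qsub -l walltime=72:00:00 -l nodes=1:ppn=4 my_job.pbs

# A request short enough to finish before the shutdown is free to start now.
qsub -l walltime=24:00:00 -l nodes=1:ppn=4 my_job.pbs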

These activities will be performed:
ITEM REQUIRING USER ACTION:
– The Anaconda Distribution adopted a year.month versioning scheme late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This convention is easier to track, so all PACE resources will adopt it with the anaconda2/2019.10 and anaconda3/2019.10 modules. The default Anaconda modules will now point to the latest YYYY.MM release, and the module files named “latest” will be removed to avoid ambiguity. The software installations behind “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should update their commands to reference a specific version of Anaconda (or simply load the default without specifying a version, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system, or help updating your scripts.
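For example, in a job script or interactive session, the change amounts to the following (anaconda3 is shown; the same applies to anaconda2):

# Before: relies on the “latest” module file, which will be removed
module load anaconda3/latest

# After: reference a specific version explicitly ...
module load anaconda3/2019.10

# ... or load the default, which now points to the newest YYYY.MM release
module load anaconda3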

ITEMS NOT REQUIRING USER ACTION:
– (Completed) Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy implemented last week (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– (Completed) PBSTools, which records user job submissions, will be upgraded.
– (Completed) Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– (Completed) [Hive cluster] Infiniband switch firmware will be upgraded.
– (Completed) [Hive cluster] Storage system firmware will be updated.
– (Completed) [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– (Completed) [Hive cluster] Lmod, the environment module system, will be updated to a newer version.
– (Completed) The athena-6 queue will be upgraded to RHEL7.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our Maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

 

[Update 11/1/19]

We would like to remind you that we are preparing for PACE’s next quarterly maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEM REQUIRING USER ACTION:

– The Anaconda Distribution adopted a year.month versioning scheme late last year (https://www.anaconda.com/anaconda-distribution-2018-12-released/). This convention is easier to track, so all PACE resources will adopt it with the anaconda2/2019.10 and anaconda3/2019.10 modules. The default Anaconda modules will now point to the latest YYYY.MM release, and the module files named “latest” will be removed to avoid ambiguity. The software installations behind “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should update their commands to reference a specific version of Anaconda (or simply load the default without specifying a version, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system, or help updating your scripts.

ITEMS NOT REQUIRING USER ACTION:

– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy implemented on Tuesday (10/29/19) limiting simultaneous job submissions (http://blog.pace.gatech.edu/?p=6550), will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.

– RHEL7 clusters will receive critical patches.

– Updates will be made to PACE databases and configurations.

– PBSTools, which records user job submissions, will be upgraded.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.

– [Hive cluster] Infiniband switch firmware will be upgraded.

– [Hive cluster] Storage system software will be updated.

– [Hive cluster] Subnet managers will be reconfigured for better redundancy.

– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

 

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu. You can follow our Maintenance blog post at http://blog.pace.gatech.edu/?p=6614.

 

[Original post]

We are preparing for PACE’s next maintenance days on November 7-9, 2019. This maintenance period is planned for three days and will start on Thursday, November 7, and go through Saturday, November 9.  As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:
ITEM REQUIRING USER ACTION:
– The Anaconda Distribution adopted a year.month versioning scheme late last year. This convention is easier to track, so all PACE resources will adopt it with the anaconda2/2019.10 and anaconda3/2019.10 modules. The default Anaconda modules will now point to the latest YYYY.MM release, and the module files named “latest” will be removed to avoid ambiguity. The software installations behind “latest” will be retained to preserve any critical user workflows. Users currently loading an Anaconda module ending in “latest” should update their commands to reference a specific version of Anaconda (or simply load the default without specifying a version, e.g., “module load anaconda3”). Please email PACE Support if you need help accessing older versions of Anaconda that are no longer available via the modules system, or help updating your scripts.

ITEMS NOT REQUIRING USER ACTION:
– Scheduler settings will be modified to improve the scheduler’s ability to handle large numbers of job submissions rapidly. These changes, along with the new policy being implemented on Tuesday (10/29/19) limiting simultaneous job submissions, will help stabilize the shared scheduler (accessed via login-s[X] headnodes) and make it more reliable. These scheduler settings are already implemented on the Hive cluster.
– RHEL7 clusters will receive critical patches.
– Updates will be made to PACE databases and configurations.
– Firmware for DDN storage will be updated.

– Upgrades to routers and network connections for PACE in Rich and Hive in Coda will be made in order to improve high-speed data transfer.
– [Hive cluster] Infiniband switch firmware will be upgraded.
– [Hive cluster] Subnet managers will be reconfigured for better redundancy.
– [Hive cluster] Lmod, the environment module system, will be updated to a newer version.

If you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Distributed MATLAB now available on PACE

Posted on Tuesday, 24 September, 2019

PACE is excited to announce that distributed MATLAB use is now available on PACE resources. Georgia Tech’s new license allows for unlimited scaling of MATLAB on clusters. This change means that users can now run parallelized MATLAB code across multiple nodes. For detailed instructions, please visit our distributed MATLAB documentation at docs.pace.gatech.edu/software/matlab-distributed/.
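As a rough sketch only (the queue name, module name, and resource counts below are placeholders, and the multi-node parallel pool itself is set up through the cluster profile steps in the documentation linked above), a submission script might look like:

#!/bin/bash
#PBS -N matlab-distributed
#PBS -l nodes=2:ppn=8
#PBS -l walltime=2:00:00
#PBS -q my-queue

cd $PBS_O_WORKDIR
# Module name and version are assumptions; check “module avail matlab” first
module load matlab
# The MATLAB script itself creates the multi-node pool via the cluster profile
# described in the distributed MATLAB documentation above.
matlab -nodisplay -nosplash -r "run('my_distributed_script.m'); exit"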

Data center maintenance

Posted on Tuesday, 17 September, 2019

Beginning tomorrow morning (9/18/19), there will be urgent maintenance on the cooling system in the Rich data center, which houses all PACE clusters except Hive. A temporary cooling unit has been installed, but should the secondary cooling unit fail, the room will begin to overheat. If that happens, portions of the data center will need to shut down until the temperature returns to an acceptable level. If the clusters are shut down, any running jobs on the compute nodes will be terminated.
Please follow updates on this maintenance and find a full list of affected services across campus at https://status.gatech.edu/pages/maintenance/5be9af0e5638b904c2030699/5d7fd6219012a0316b71ef83.
– PACE team

COMSOL use at PACE

Posted on Monday, 9 September, 2019

As you may know, the College of Engineering is changing the licensing model for COMSOL on September 16, 2019, and will restrict research access to named users who have purchased access through CoE. Use of COMSOL for research on PACE is licensed through CoE (regardless of your college affiliation). If you or your PI have not yet made arrangements with CoE, please contact Angelica Remolina in CoE IT (angie.remolina@coe.gatech.edu). You will not be able to run COMSOL on PACE without permission from CoE after September 16.

[Resolved] Campus Network Down

Posted on Wednesday, 4 September, 2019

[Update] September 5

OIT reports that the campus network is again fully functional.

[Update] September 4 4:28 PM

This is a brief update: OIT Network Services has identified the cause of the campus network issues. One of the enterprise routers for campus rebooted unexpectedly, which impacted the campus network. Since this event, the network has been stabilized, and OIT continues to monitor the situation for any further issues. For the latest updates, please check the OIT status page.

As for the PACE clusters, you should be able to access them without issue. If you continue to experience a problem, please disconnect and reconnect to restore your connectivity.

As always, if you have any questions or concerns, please don’t hesitate to contact us at pace-support@oit.gatech.edu

[Original] September 4 2:30 PM

Our campus network is down. OIT is investigating this incident; you may check the details at the link below:

https://status.gatech.edu/pages/incident/5be9af0e5638b904c2030699/5d6ff4f4daca6a0543918df2

This incident will prevent you from accessing the PACE resources, but your current jobs running on PACE should not be interrupted.

Please check the status link above for up-to-date details. If you have any questions, please send a note to pace-support@oit.gatech.edu. Please note that we are also impacted by the outage, so our responses to your email will be delayed.

Thank you for your patience.

[Resolved] GPFS outage on Red Hat 7 queues

Posted on Friday, 30 August, 2019

An issue occurred around 3:30 AM on several queues running on the Red Hat 7 operating system, where a number of nodes failed to mount GPFS, our project (data) and scratch storage system. This caused the nodes to be offlined and unavailable for jobs. We repaired the affected nodes at approximately 9:30 AM today, and all queues should be functioning normally. Any jobs that were held should have begun. Please check your overnight jobs for errors.
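If you would like to review what ran overnight, something along these lines can help (the job name and ID below are examples):

# List your jobs and their states (R = running, Q = queued, H = held, C = completed)
qstat -u $USER

# Inspect the output and error files PBS wrote for a finished job,
# e.g. a job named myjob with ID 123456:
less myjob.o123456
less myjob.e123456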

The following queues were impacted:
atlas-he
ece-gpu
flamel-gpu
gaanam-gpu
gemini-cpu
gemini-gpu
megatron
ml_gpu
sake
skylake-test
starscream
swarm
swarm-gpu

Should you notice the problem recur, or if you have any other concerns, please contact us at pace-support@oit.gatech.edu, and we will be happy to help you. We apologize for the inconvenience this morning.

[Resolved] Campus-wide network outage impacting PACE

Posted on Monday, 5 August, 2019

A campus-wide DNS server failure occurred on the morning of Monday, August 5. OIT was able to resolve the issue at 10:06 AM, and all PACE services should now be working normally. The associated problem with storage made home directories temporarily unavailable, with symptoms such as hanging code, commands, and login attempts.
We believe that most jobs have resumed operation after the issue was resolved, but we cannot be sure. Please check to see if you have any crashed jobs, and report any issues to pace-support@oit.gatech.edu.
For details on the DNS failure, please visit the OIT status update.

Thank you for your attention to this, and we apologize for the inconvenience.