PACE A Partnership for an Advanced Computing Environment

February 10, 2018

PACE clusters ready for research

Filed under: Uncategorized — Semir Sarajlic @ 3:47 am

Our February 2018 maintenance (https://blog.pace.gatech.edu/?p=6158) is complete ahead of schedule. We have brought the compute nodes online and released previously submitted jobs. Login nodes are accessible and data is available. As usual, there are some straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, May 10 through Saturday, May 12, 2018.

Storage
– Both the pace1 and pace2 GPFS systems now apply a limit of 2 Million files/directories per user. Please contact us if you have problems creating new files or updating existing ones, or if you see messages saying that your quota is exceeded.
– We performed several maintenance tasks for both pace1 and pace2 systems to improve reliability and performance. This included rebalancing data on the drives as recommended by the vendor.
– Temporary links pointing to storage migrated in the previous maintenance window (November 2017) are now removed. All direct references to the old paths will fail. We strongly recommend that Math and ECE users (whose repositories were relocated as part of the storage migration) run tests. Please let us know if you see ‘file not found’ type errors referencing the old paths starting with “/nv/…”
– Deletion of the old copies of bio-konstantinidis and bio-soojinyi is currently pending; we will start deletions sometime after the maintenance day.
– CNS users have been migrated to their new home and project directories.
Power
– We completed all power work as planned.
Rack/Node maintenance
– To rebalance power utilization, a few ASDL nodes were moved and renamed. Users of this cluster should not notice any differences other than the hostnames.
– VM servers received a memory bump, allowing for more capacity
Network
– Recabling and reconfiguration of IB network is complete
– All planned Ethernet network improvements are complete
As always, please contact us (pace-support@oit.gatech.edu) if you notice any problems.

 

February 5, 2018

PACE quarterly maintenance – (Feb 8-10, 2018)

Filed under: Uncategorized — Semir Sarajlic @ 11:47 pm

PACE maintenance activities are scheduled to start at 6am this Thursday (2/8) and may continue until Saturday (2/10). As usual, jobs with long walltimes are being held by the scheduler to prevent them from getting killed when we power off the systems. These jobs will be released as soon as the maintenance activities are complete.

Some of the planned improvements, new storage quotas in particular, require user action. Please read on for more details and action items.

Storage

* (Requires user action) The “2 Million files/directories per user” limitation on the GPFS system (as initially announced at https://blog.pace.gatech.edu/?p=6103) will take effect on both the pace1 and pace2 storage systems, which constitute almost all of the project space, with the exception of the ASDL cluster. We have been sending weekly reminders to users who exceed this limit since the November maintenance. If you have been receiving these notifications and haven’t reduced your usage yet, please contact pace-support urgently to prevent interruptions to your research.
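If you are unsure how close you are to the limit, one quick way to check is to count everything under your home and project trees. The short Python sketch below is only an illustration; the starting directory is an example and should be replaced with your own home or project path.

    #!/usr/bin/env python
    # Count files and directories under a tree to compare against the
    # "2 Million files/directories per user" limit on the GPFS systems.
    # The default starting point (your home directory) is only an example;
    # pass your project directory as the first argument instead.
    import os
    import sys

    start = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser("~")

    total = 0
    for dirpath, dirnames, filenames in os.walk(start):
        # Every directory and every file counts toward the per-user limit.
        total += len(dirnames) + len(filenames)

    print("%d files/directories under %s" % (total, start))

For example, running it as 'python count_entries.py ~/data' (the script name is hypothetical) gives a rough total for your project data link to compare against the limit.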

* (Requires user action) As a last step to conclude the storage migration performed during the November maintenance, PACE will remove the redirection links left at the old storage locations as a temporary precaution. The best way to tell whether your codes/scripts will be impacted is to test them on the testflight cluster, which doesn’t have these links, as described in https://blog.pace.gatech.edu/?p=6153 . If your codes/scripts work on testflight, they will continue to work on any other PACE cluster after the links are removed.
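If you would rather check your scripts directly, the minimal sketch below searches a directory for hard-coded references to the old locations. It assumes, following the earlier migration announcements, that the old paths begin with "/nv/"; the directory to scan is only an example.

    #!/usr/bin/env python
    # Flag scripts that still reference the old storage paths (assumed here
    # to begin with "/nv/"), which will stop resolving once the temporary
    # redirection links are removed. The directory to scan is an example.
    import os
    import sys

    scan_dir = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser("~")
    old_prefix = "/nv/"

    for dirpath, _, filenames in os.walk(scan_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path) as handle:
                    for lineno, line in enumerate(handle, 1):
                        if old_prefix in line:
                            print("%s:%d: %s" % (path, lineno, line.strip()))
            except (IOError, OSError, UnicodeDecodeError):
                # Skip binary or unreadable files.
                pass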

We have been working with the ECE and Math departments, which maintain their own software repositories, to ensure that existing software will continue to run in the new locations. We have been strongly encouraging users of these repositories to run tests on the testflight cluster to identify potential problems. If you haven’t had a chance to try your codes yet, please do so before the maintenance day and contact pace-support urgently if you notice any problems.

* (Requires user action) The two storage locations that had been migrated between two GPFS systems, namely bio-konstantinidis and bio-soojinyi, will be deleted from the old (pace1) location. If you need any data from the old location, please contact pace-support urgently to retrieve them before the maintenance day.

* (May require user action) We will complete the migration of CNS cluster users to their new home (hcns1) and project storage (phy-grigoriev). We will replace the symbolic links (e.g. ~/data) accordingly to make this migration as transparent to users as possible. If any of your codes/scripts include hard-coded references to the old locations, they will need to be updated with the new locations. We strongly recommend using the available symbolic links such as “~/data” rather than absolute paths such as “/gpfs/pace2/project/pf1” to ensure that your codes/scripts will not be impacted by future changes we may need to make.
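As an illustration of the difference, the sketch below shows where a link such as ~/data currently resolves and contrasts a path built from the link with one built from the absolute location; the "results/run01" subpath is purely hypothetical.

    #!/usr/bin/env python
    # Illustrate why building paths from the ~/data symbolic link is safer
    # than hard-coding the absolute GPFS location it currently points to.
    # The "results/run01" subdirectory is a made-up example.
    import os

    link = os.path.expanduser("~/data")   # stable entry point maintained by PACE
    target = os.path.realpath(link)       # wherever the link points today

    print("~/data currently resolves to: %s" % target)

    # Portable: keeps working after a storage migration, because the link is updated.
    portable = os.path.join(link, "results", "run01")

    # Fragile: breaks if the underlying storage moves to a new absolute location.
    fragile = os.path.join(target, "results", "run01")

    print("prefer : %s" % portable)
    print("avoid  : %s" % fragile)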

* (No user action needed) We will apply some maintenance (disk striping) to the pace1 GPFS system. We are also exploring the possibility of updating some components in pace2, but the final decision is waiting on the vendor’s recommendation. None of this work requires any user action.

Power Work

* (No user action needed) We will install new power distribution units (PDUs) and reconfigure some connections on some racks to achieve better power distribution and increase redundancy.

Rack/Node maintenance

* (No user action needed) We will physically move some of the ASDL nodes to a different rack. While this requires renaming those nodes, there will be no difference in the way users submit jobs via the scheduler. The one exception is the unlikely scenario of users explicitly requesting nodes by hostname in their PBS scripts.
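For users who want to double-check, the sketch below is one way to look for job scripts that pin specific hosts; the "#PBS -l nodes=<hostname>" pattern and the file extensions it inspects are assumptions for illustration, not a definitive audit.

    #!/usr/bin/env python
    # Look for PBS directives that request nodes by explicit hostname
    # (e.g. "#PBS -l nodes=<some-node-name>") rather than by count
    # (e.g. "#PBS -l nodes=2:ppn=8"). Only such scripts would notice the
    # ASDL node renaming. The pattern and file extensions are assumptions.
    import os
    import re
    import sys

    scan_dir = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser("~")
    # "-l nodes=" followed by something that does not start with a plain count.
    directive = re.compile(r"#PBS\s+-l\s+nodes=(?!\d+\b)(\S+)")

    for dirpath, _, filenames in os.walk(scan_dir):
        for name in filenames:
            if not name.endswith((".pbs", ".sh")):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path) as handle:
                    for lineno, line in enumerate(handle, 1):
                        match = directive.search(line)
                        if match:
                            print("%s:%d: requests %s" % (path, lineno, match.group(1)))
            except (IOError, OSError, UnicodeDecodeError):
                pass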

* (No user action needed) We will increase the memory capacity of the virtual machine servers, which host most of the headnodes, from 64GB to 256GB. The memory available per VM, however, will not change.

Network

* (No user action needed) We will do some recabling and reconfiguration on the Infiniband (IB) network to achieve more efficient connectivity, which will also allow us to retire an old switch.

* (No user action needed) We will install a new Ethernet switch and replace some others to optimize the network.

Instructional Cluster

The instructional cluster (a.k.a. PACE/COC ICE) will be offlined as a part of this maintenance. This is a brand-new resource that has not yet been officially made available to any classes, but we have noticed logins from some users. Please refrain from using these resources for any classes until we release them following a training session, which we will schedule next week.

 
