
[COMPLETE] PACE Quarterly Maintenance – November 1-3, 2018

This entry was posted on Monday, 29 October, 2018.

[Update – November 3, 2018, 4:45pm] 

Our November 2018 maintenance is complete, on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, a small number of straggling nodes remain, and we will address them over the coming days. These include nodes that need PCIe connectors replaced as a preventative measure.

Please note that some jobs on the PACE ICE cluster need to be resubmitted. We have contacted the affected users individually.

[Update – November 2, 2018] 

Verification of the integrity of the GPFS file system is taking longer than initially estimated. As a result, this maintenance day will last the full three days, as scheduled. This will allow us to complete the certification of the file system and ensure the highest integrity of the data.

[Original Post – October 29, 2018]

Our next PACE maintenance day will start November 1 and run through November 3 as scheduled.

As usual, the scheduler will hold jobs with long walltimes to ensure that no jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give a held job enough time to complete successfully, you can reduce its walltime so that it finishes before 6am on November 1 and resubmit it.
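To judge whether a job can still finish before the maintenance window, the arithmetic can be sketched roughly as follows. This is only an illustration: the function names, the 30-minute safety margin, and the `HH:MM:SS` walltime format are assumptions, not part of the PACE scheduler configuration, and all times are treated as local campus time.

```python
from datetime import datetime, timedelta

# Maintenance start taken from this post: 6am on November 1, 2018.
MAINTENANCE_START = datetime(2018, 11, 1, 6, 0)


def max_walltime(now, margin=timedelta(minutes=30)):
    """Longest walltime (HH:MM:SS) that ends before maintenance starts.

    `margin` is a hypothetical safety buffer. Returns None if no time is left.
    """
    total = int((MAINTENANCE_START - now - margin).total_seconds())
    if total <= 0:
        return None
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def fits_before_maintenance(now, walltime):
    """True if a job starting at `now` ends no later than maintenance start."""
    h, m, s = (int(part) for part in walltime.split(":"))
    end = now + timedelta(hours=h, minutes=m, seconds=s)
    return end <= MAINTENANCE_START


if __name__ == "__main__":
    submit_time = datetime(2018, 10, 31, 18, 0)  # example submission time
    print(max_walltime(submit_time))
    print(fits_before_maintenance(submit_time, "24:00:00"))
```

A job whose requested walltime does not fit will be held until maintenance is complete, so shortening the request only helps if the job can genuinely finish in the reduced time.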

Planned Tasks

Compute

  • (no user action needed) Replace power components in a rack in Rich 133
  • (no user action needed) Replace defective PCIe connectors on multiple servers

Network

  • (no user action needed) Stress test new InfiniBand subnet managers, to prepare for the move to Coda
  • (no user action needed) Change uplink connections from management switches

Storage

  • (no user action needed) Verify integrity of GPFS file systems
  • (no user action needed) Upgrade firmware on DDN / GPFS storage systems
  • (no user action needed) Upgrade firmware on TrueNAS storage systems

Other

  • (no user action needed) Replace PACE ICE schedulers with a physical server, to increase capacity and reliability
