PACE A Partnership for an Advanced Computing Environment

July 22, 2016

Some PACE accounts temporarily disabled

Filed under: Uncategorized — Semir Sarajlic @ 3:03 pm

Some of our users will not be able to log in to PACE clusters after the maintenance day due to the ongoing scratch data migration process. Here are the details:

One of the tasks performed during this maintenance day (https://blog.pace.gatech.edu/?p=5943) was to deploy a new fileserver to serve the “scratch” storage.

The new system is now operational; however, the scratch data for some accounts is still being migrated. We have temporarily deactivated these accounts to prevent modifications to incomplete data sets. Jobs submitted by these users are also being held by the scheduler. Once the transfer is complete for a user, we will enable the account, release the user’s jobs, and send a notification email. Currently we have no means to estimate how long the migration will take, but we are doing all we can to make this process go as quickly as possible.

If you are one of these users and need any of your data urgently, please contact us at pace-support@oit.gatech.edu. All of your data are intact and accessible by us. We will try to find a solution for delivering the data you need for your research.

PACE clusters ready for research

Filed under: tech support — admin @ 3:26 am

Our July 2016 maintenance is now substantially complete.  Again, we sincerely apologize for the unfortunate additional unplanned downtime.

As previously communicated, we’ve had an unexpected delay caused by the data migrations from the old scratch system to the newly acquired system. Some of these transfers are still in progress for a limited number of users.  We have temporarily disabled access for these users to prevent jobs from running on incomplete scratch data. We are reaching out to the affected users individually with more details. These users will not be able to log in, and their previously submitted jobs will not run, until their scratch migration is complete. If you have not received a further notification from us and experience problems with logins or anything else, please let us know as soon as possible by sending an email to pace-support@oit.gatech.edu.

Scratch performance may be reduced as these migrations complete, and we are doing everything we can to finish these migrations as soon as possible.

We have brought compute nodes online and released previously submitted jobs. As usual, we have a number of compute nodes that still need to be brought back online, but we are actively working to make them available as soon as possible.

DDN/GPFS work

The new DDN SFA-7700 system is now operational and serving scratch storage for all users. We updated client software versions on all nodes.  We have encountered an anomaly that reduces the system’s internal redundancy but does not affect normal operation.  We expect to be able to rectify this while in production.

Electrical work

Tasks completed as described.

Bonus objectives

Network and local storage upgrades were implemented on schedulers as planned.  Additional diskless nodes were converted to diskful as planned.

July 21, 2016

EXTENDED: PACE quarterly maintenance – July ’16

Filed under: tech support — admin @ 5:50 am

Dear PACE users,

Despite our best efforts, the data copies for the PACE scratch space have not gone as quickly as we had projected. We have also encountered an anomaly with the new storage system, though we expect to be able to rectify this while in production. At this writing, we have many of the compute nodes online but cannot start jobs until the data copy is complete.

As with all our maintenance periods, there is always a remote possibility we will run over our estimated time. This is one of those times. Please accept our apology for this unavoidable delay. We can assure you all the data is intact and we are continuing to work to optimize the transfers to achieve a speedy return to service.

Our staff will continue to update you with our progress.

Regards,
The PACE team

July 19, 2016

UNDERWAY: PACE quarterly maintenance – July ’16

Filed under: tech support — admin @ 10:08 am

Quarterly maintenance is now underway.  All clusters managed by PACE, including Tardis, are now offline.  Please see our previous post for details.

July 18, 2016

PACE quarterly maintenance – July ’16

Filed under: tech support — admin @ 11:01 pm

Dear PACE users,

Quarterly maintenance is once again upon us.  Starting at 6:00am TOMORROW MORNING, all resources managed by PACE will be taken offline.  Maintenance will continue through Wednesday evening.  Our activities are adhering to our originally published two-day schedule.

As a heads up, please make note of our Fall maintenance, which is now scheduled to begin at 6:00am on Thursday, October 13 and continue through Saturday, October 15.  Please note that this is a three-day period, including weekend work.  Further details to come as we get closer to October.

As previously communicated, our original plan to update various system software components in July has been deferred to a future maintenance period.  We will be in touch in advance of the October maintenance with details on this, including where you can test your codes against the updated software.  (highly recommended!)

Our major activity this time around will be updates to our GPFS filesystems and DDN storage devices.

DDN/GPFS work

  • We have acquired a new DDN SFA-7700, to which we will transition the scratch space.  This will provide more consistent scratch performance, a path for future capacity and performance increases, and performance as good as or better than what we have now.  Initially, the SFA-7700 will provide approximately 375TB of space.  We will be increasing this to the 500TB we have currently as soon as additional disks can be procured.  No user action will be required.  We currently have approximately 220TB in use on scratch, so we do not expect this temporary decrease in available capacity to be an inconvenience.
  • We have DDN engineers engaged to update firmware and software on our current SFA-12k.  This will provide additional management and quality-of-service features, as well as the ability to transition to larger capacity drives.  Additionally, we will reallocate the drives previously used for the scratch space to provide additional project space capacity and metadata performance.  No user action will be required.
  • To support the two above updates, we will also be upgrading the version of the GPFS client software (where installed) from version 3.5 to version 4.2.  No user action will be required.
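
For anyone who wants to check the scratch capacity figures above, here is a minimal sketch of the arithmetic. It assumes only the approximate numbers quoted in this post (375TB initially, 500TB once additional disks arrive, about 220TB in use); the variable and function names are illustrative and are not part of any PACE tool.

```python
# Approximate figures quoted in the post, in terabytes (assumed, not measured).
INITIAL_CAPACITY_TB = 375   # new SFA-7700 at deployment
TARGET_CAPACITY_TB = 500    # after additional disks are procured
IN_USE_TB = 220             # scratch space currently in use


def utilization(used_tb, capacity_tb):
    """Return the fraction of capacity that is in use."""
    return used_tb / capacity_tb


def headroom(used_tb, capacity_tb):
    """Return the free space remaining, in terabytes."""
    return capacity_tb - used_tb


initial_util = utilization(IN_USE_TB, INITIAL_CAPACITY_TB)
target_util = utilization(IN_USE_TB, TARGET_CAPACITY_TB)
initial_headroom = headroom(IN_USE_TB, INITIAL_CAPACITY_TB)

print(f"Utilization at {INITIAL_CAPACITY_TB}TB: {initial_util:.1%}")
print(f"Utilization at {TARGET_CAPACITY_TB}TB: {target_util:.1%}")
print(f"Free space at {INITIAL_CAPACITY_TB}TB: {initial_headroom}TB")
```

Under these assumed figures, current usage fits within the initial capacity with roughly 155TB to spare, which is why the temporary reduction is not expected to be an inconvenience.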

Electrical work

  • Facilities electricians will be performing some electrical work in the data center that will require the power to be temporarily removed from a number of our racks.  This work is to support some newly purchased equipment.  No user action will be required.

Bonus objectives

  • Additionally, as time permits, we will upgrade the network on some of our schedulers to 10-gigabit, and add additional local logging storage.  This will not affect the Gryphon, NovaZohar or Tardis clusters.  No user action will be required.
  • Also as time permits, we will continue the transition away from diskless nodes.  This mainly affects nodes that are five to six years old.  No user action will be required.

July 15, 2016

Head node availability

Filed under: Uncategorized — Semir Sarajlic @ 1:08 pm

UPDATE 10:30am

All head nodes and support nodes in the VM farm are online.

Initial Post – 09:15am

Early this morning (2016/07/15, approximately 2:00am), we had a critical storage failure that caused our VM farm to declare all running head nodes as invalid.  We’re looking into this seriously, as this is one of those “not supposed to happen” moments.  In the meantime, the PACE team is working on getting these nodes back up and running for all users.
