
Archive for category Uncategorized

Resolved: Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We’ve addressed some of the problems with the TrueNAS storage, and csh/tcsh should now be working again. As it turns out, this problem wasn’t actually related to last week’s maintenance, and we will continue to work with the vendor to determine the cause.

Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We’ve observed a correlation between hanging processes on all PACE systems and csh/tcsh, and are continuing to investigate. For the time being, we ask that you please refrain from running commands that rely on csh or tcsh. The issue appears to be related to the TrueNAS storage system, and we are working with OIT and iXsystems to resolve it.
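
If you are not sure whether this affects you, one quick check is to look at the login shell recorded for your account. The snippet below is a minimal, purely illustrative sketch (not a PACE-provided tool) that flags a csh/tcsh login shell; it can be run with any Python on a head node:

    import os
    import pwd

    # Login shell recorded for this account in the password database.
    login_shell = pwd.getpwuid(os.getuid()).pw_shell

    # Shell of the current session, if the environment exposes it.
    session_shell = os.environ.get("SHELL", "unknown")

    # endswith("csh") matches both /bin/csh and /bin/tcsh.
    if any(s.endswith("csh") for s in (login_shell, session_shell)):
        print("csh/tcsh detected -- please hold off on csh/tcsh commands for now.")
    else:
        print("No csh/tcsh detected for this session.")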

[RESOLVED] Post-maintenance Storage Issues

Posted on Saturday, 12 August, 2017

Update: GPFS storage is stable again. Several steps remain to complete this work, all of which can be done without a downtime. We may need to take some nodes temporarily offline, which will be done in coordination with you and without impacting running jobs.
-----

PACE systems started experiencing widespread problems with the GPFS storage shortly after jobs were released at the completion of the maintenance tasks. At first glance, these problems seem to be related to the InfiniBand network.

We would like to advise against submitting new jobs until these issues are fully resolved. We will continue to work on a resolution and keep you updated on the progress.

Thank you for your patience.

PACE systems experiencing problems

Posted on Wednesday, 9 August, 2017
Update: The immediate issue is now resolved. We continue to work with the vendor to learn more about what caused the problem. If you need assistance with failed jobs or unavailable headnodes, please contact pace-support@oit.gatech.edu.
We are experiencing a storage issue that’s impacting the majority of the home directories and some services, with potential impact on running jobs. We are currently in the process of assessing the overall impact and investigating the root cause. We’ll update our blog (http://blog.pace.gatech.edu/) as we learn more about the failures.
We are sorry for the inconvenience and thank you for your patience.
Update: Impacted filesystems:
ap1
ap2
ap3
ap4
ap5
ap6
hase1
hb51
hbiobot1
hcee2
hcfm1
hchpro1
hcoc1
hface1
hggate1
hhygene1
hjabberwocky1
hmart1
hmedprint1
hmeg1
hmicro1
hneutrons1
hnjord1
hp1
hp10
hp11
hp12
hp13
hp14
hp15
hp16
hp17
hp18
hp19
hp2
hp20
hp21
hp22
hp23
hp24
hp25
hp26
hp27
hp28
hp29
hp4
hp5
hp6
hp7
hp8
hp9
hpampa1
hsemap1
hska1
hsonar1
hthreshold1
html1

PACE quarterly maintenance (Aug 10-12, 2017)

Posted on Thursday, 3 August, 2017

Dear PACE users,

PACE clusters and systems will be taken offline at 6am on Thursday, Aug 10 through the end of Saturday (Aug 12). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.
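
As a rough illustration of how this hold works: a job is held when its requested walltime could carry it past the start of the maintenance window. The sketch below shows the idea only; the actual reservation logic is handled internally by the scheduler, and the dates and walltimes are example values.

    from datetime import datetime, timedelta

    # Maintenance window start from this announcement: 6am on Thursday, Aug 10, 2017.
    MAINTENANCE_START = datetime(2017, 8, 10, 6, 0)

    def would_be_held(start_time, requested_walltime):
        """True if a job starting at start_time could still be running at the window start."""
        return start_time + requested_walltime > MAINTENANCE_START

    # A 5-day job starting on Aug 7 would overlap the window, so it is held ...
    print(would_be_held(datetime(2017, 8, 7, 12, 0), timedelta(days=5)))    # True
    # ... while a 12-hour job starting at the same time finishes before the window.
    print(would_be_held(datetime(2017, 8, 7, 12, 0), timedelta(hours=12)))  # False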

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Storage (no user action needed)

We are working with the vendor to reconfigure our multiple GPFS filesystems for performance fine-tuning and better support for large scale deployments. Some of these configurations will be applied on the maintenance day because they require downtime.

Network (no user action needed)

The PACE InfiniBand (IB) network requires some configuration changes to handle the rapid growth in the number of PACE nodes. We have identified several configuration parameters that should reduce the occurrence of nodes losing their connectivity to GPFS (which relies on the IB network), an issue that has been causing intermittent job crashes.

Schedulers (postponed, no user action needed)

We have communicated our plans to upgrade the scheduler on several occasions in the past, but have skipped this task during past maintenance windows due to bugs that we had uncovered. Despite the vendor’s promising progress toward resolving these bugs, the fixes are not yet fully complete and tested. For this reason, we decided to once again postpone the upgrade and keep the current versions until we have a bug-free and well-tested release.

Scheduler-based monitoring and analysis (no user action needed)

PACE research scientists have started a collaboration with the Texas Advanced Computing Center (TACC) to develop a new tool that analyzes scheduler logs to gain insights into usage trends. For more details, please see https://doi.org/10.1145/3093338.3093351

This tool relies heavily on a widely used utility named ‘PBSTools’, developed by the Ohio Supercomputer Center. Our current installation of PBSTools is old, buggy, and very slow. We will upgrade this tool and its database on the maintenance day to ensure that no job information is lost during the transition.
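
As an illustration of the kind of analysis such a tool performs, the sketch below tallies per-user walltime from PBS/Torque accounting records of the sort PBSTools ingests. The sample record and server name are made up, and real records carry many more fields.

    from collections import defaultdict

    def parse_record(line):
        """Split a PBS accounting record: 'timestamp;type;job id;key=value key=value ...'."""
        timestamp, rec_type, job_id, attrs = line.strip().split(";", 3)
        fields = dict(kv.split("=", 1) for kv in attrs.split() if "=" in kv)
        return rec_type, job_id, fields

    def hours(hms):
        """Convert an 'HH:MM:SS' walltime string to hours."""
        h, m, s = (int(x) for x in hms.split(":"))
        return h + m / 60 + s / 3600

    # Hypothetical end-of-job ('E') record; field names follow the usual Torque format.
    sample = ("08/01/2017 10:15:00;E;12345.example-server;user=gtuser1 group=pace "
              "resources_used.walltime=02:30:00")

    usage = defaultdict(float)
    rec_type, job_id, fields = parse_record(sample)
    if rec_type == "E":  # only end-of-job records carry resource usage
        usage[fields["user"]] += hours(fields["resources_used.walltime"])

    print(dict(usage))  # {'gtuser1': 2.5}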

Power Work (no user action needed)

PACE and the OIT Operations team will perform some work in the datacenter to improve electrical capacity and balance. This includes moving and/or replacing some of the in-rack Power Distribution Units (PDUs).

Software Repository (no user action needed)

Our NIST SP800-171 conforming secure cluster has a specific need for a dedicated copy of the PACE software repository. We have previously completed an initial replication of the files and will simply point the nodes to the replica during the maintenance window.

Storage (GPFS) Issue Update

Posted on Tuesday, 25 July, 2017

We have seen a reduction in the GPFS filesystem problems over the past weekend, and are continuing to actively work with the vendor. We don’t have a complete solution yet, but have observed greater stability for compute nodes in the GPFS filesystem. Thank you for your patience – we will continue to keep you updated as much as possible as the situation changes.

Storage (GPFS) Issue Update

Posted on Friday, 14 July, 2017

While the problem wasn’t very widespread and we have improved the reliability, we have not yet arrived at a full solution and are still actively working on the problem. We now believe the problem is due to the recent addition of many compute nodes, ultimately bringing us into the next tier of system-level tuning needed for the filesystem. Thank you for your patience – we will continue to provide updates as they become available.

Storage (GPFS) Issue

Posted on Wednesday, 12 July, 2017

We are experiencing intermittent problems with the GPFS storage system that hosts the scratch and project directories (~/scratch and ~/data). At the moment, we are investigating with the vendor whether this failure may be related to the cluster nodes that were recently brought online.

This issue has potential impact on running jobs. We are actively working on the problem, apologize for the inconvenience, and will update as soon as possible.

Storage (GPFS) and datacenter problems resolved

Posted on Monday, 19 June, 2017

All node and GPFS filesystem issues resulting from the power failure should be resolved as of late Friday evening (June 16). If you are still experiencing problems, please let us know at pace-support@oit.gatech.edu.

PACE is experiencing storage (GPFS) problems

Posted on Friday, 16 June, 2017

We are experiencing intermittent problems with the GPFS storage system that hosts most of the project directories.

We are working with the vendor to investigate the ongoing issues. At this moment we don’t know whether they are related to yesterday’s power/cooling failures or not, but we will update the PACE community as we find out more.

This issue has potential impact on running jobs and we are sorry for this inconvenience.