[Resolved] All PACE nodes temporarily offline due to storage trouble

Posted on Saturday, 30 December, 2017

Update (12/31/2017, 10:15am): We have addressed the issue, and the majority of nodes have started running jobs again. As far as we can tell, this was caused by a network-related “event” internal to the system. We are working with the vendor to identify the exact root cause.

Original post: One of the primary storage systems (pace2) went offline today, potentially impacting running jobs that reference that system.

Our automated scripts offlined PACE nodes to prevent new jobs from starting. The nodes will be brought back online once the storage issues are addressed.

The PACE team is currently investigating the problem and will keep you updated.

We apologize for any delays caused by limited staff availability over the holidays.

Systematic offlining of PACE nodes to address storage slowness

Posted on Tuesday, 21 November, 2017

We identified a problem with the way some nodes are mounting our main (GPFS) storage server, causing slow storage performance. The fix requires restarting the storage services on affected nodes individually, when they are not running any jobs. For this reason, we started draining (offlining) all affected nodes and systematically bringing them back online as soon as their jobs are complete and the fix is applied.

Other than storage slowness, this issue does not impact running jobs, but you will notice offline nodes in your queues until we have fixed all affected nodes.

It’s safe to continue submitting jobs and there is no risk of data loss.

We are sorry for this inconvenience and thank you for your cooperation.

PACE clusters ready for research

Posted on Saturday, 4 November, 2017

Our November 2017 maintenance period is now complete, far ahead of schedule.  We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and data available. As usual, there are some straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, February 8 through Saturday, February 10, 2018.

Storage
– Nearly a petabyte of data was migrated to the new DDN/GPFS storage device. While this will provide a more performant, expandable, and supportable storage platform, it requires changes to path names. We have adjusted the symbolic links in home directories (e.g. ~/data) to point to the new locations; please continue to use these names wherever possible. To minimize disruption, we have also put a temporary redirection in place so that the old names will continue to work. We intend to remove this redirection during our next maintenance period, and will proactively identify and assist users who are still using the deprecated path names.

Schedulers
– The nvidia-gpu and gpu-recent queues have been consolidated into a new force-gpu queue.  Please use the new queue name going forward.  PACE staff will proactively identify and assist users using the deprecated queue names.
– The semap-6 queue has been moved to an alternate scheduler server.  No user action is required.
– The Joe cluster has been moved into the shared partition.  These users now have access to idle cycles in the shared partition, and offer the idle cycles of their cluster for use by others.

ITAR / NIST800-171 environment
– planned tasks are complete, no user action is required.

Power and Network
– planned tasks are complete, no user action is required.

PACE quarterly maintenance – (Nov 2-4, 2017)

Posted on Monday, 23 October, 2017

 

Dear PACE users,

PACE clusters and systems will be taken offline at 6am on Thursday, Nov 2 through the end of Saturday (Nov 4). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Some of the planned improvements, storage migrations in particular, require the attention of a large number of users. Please read on for more details and action items.

Storage (Requires user action)

PACE is retiring old NFS storage servers, which have been actively serving project directories for a large number of users. All of the data they contain will be consolidated onto a recently purchased GPFS storage system (pace2). GPFS is a high-performance parallel filesystem, which offers improved reliability (and in many cases performance) compared to NFS.

Important: PACE will also start enforcing a limit of 2 million files/directories on this GPFS system, regardless of their size. We have identified the users who currently exceed this limit and will contact them separately to prevent interruptions to research.
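For users who want a rough idea of where they stand relative to this limit, the following is a minimal sketch and not an official PACE tool. The function names `count_entries` and `over_limit` are hypothetical, and the count produced by `find` may differ from how the storage system itself accounts for files and directories.

```shell
# Hypothetical helpers (not part of any PACE tooling).

# count_entries DIR: print the number of files and directories below DIR.
# Note: names containing newlines would be counted more than once.
count_entries() {
    find "$1" -mindepth 1 | wc -l | tr -d ' '
}

# over_limit DIR [LIMIT]: succeed if DIR holds more than LIMIT entries.
# LIMIT defaults to the 2 million figure quoted in this post.
over_limit() {
    limit=${2:-2000000}
    [ "$(count_entries "$1")" -gt "$limit" ]
}
```

For example, `count_entries ~/data/` (the trailing slash makes `find` descend through the symbolic link) prints the number of entries under the linked project directory. Walking a very large directory tree can itself be slow on shared storage, so run it sparingly.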

Here’s a full list of storage locations that will be migrated to ‘pace2’:

pg1, pc5, pe11, pe14, pe15, pe3, pe5, pe9, pe10, pe12, pe4, pe6, pe8, pa1, pbi1, pcc1, pcee2, pmse1, psur1, pc4, pase1, pmart1, pchpro1, pska1, pbiobot1, pf2, pggate1, ptml1, pc6, py1, py2, pc2, pz2, pe1, pe7, pe13, pe2, pb2, pface1, pas1, pf1, pb1, hp3, pj1, pb3, pc1, pz1, ps1, pec1, pma1

In addition to these NFS shares, we will also migrate these two filesystems from our current GPFS system (pace1) to the new GPFS system (pace2), due to limited space availability:

bio-konstantinidis, bio-soojinyi

 

How can I tell if my project directory will be migrated?

Copy and run this command on the headnode:

find ~/data* -maxdepth 1 -type l -exec ls -ld {} \;

This command will return one or more lines similar to:

lrwxrwxrwx 1 root pace-admins 16 Jun 16 2015 /nv/hp16/username3/data -> /nv/pf2/username3
lrwxrwxrwx 1 root pace-admins 19 Jan 6 2017 /nv/hp16/username3/data2 -> /gpfs/pace1/project/pf1/username3

Please note the right-hand side of the arrow “->”. If the arrow points to a path starting with “/nv/…” followed by a storage name included in the list provided above, then your data will be migrated. In this example, the location linked as “data” will be migrated (/nv/pf2/username3) but “data2” will not (/gpfs/pace1/project/pf1/username3).

As an exception, all references to “bio-konstantinidis” and “bio-soojinyi” will be migrated, even though their paths start with “/gpfs” and not “/nv”. E.g.:

lrwxrwxrwx 1 root bio-konstantinidis 43 Oct 11 2015 /nv/hp1/username3/data3 -> /gpfs/pace1/project/bio-konstantinidis/username3
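The check described above can also be scripted. The following is a minimal sketch, assuming the rules exactly as stated in this post; the function name `will_migrate` and the variable `MIGRATED_NFS` are hypothetical and not part of any PACE tooling.

```shell
# Storage names copied from the migration list in this post.
MIGRATED_NFS="pg1 pc5 pe11 pe14 pe15 pe3 pe5 pe9 pe10 pe12 pe4 pe6 pe8 pa1 pbi1 pcc1 pcee2 pmse1 psur1 pc4 pase1 pmart1 pchpro1 pska1 pbiobot1 pf2 pggate1 ptml1 pc6 py1 py2 pc2 pz2 pe1 pe7 pe13 pe2 pb2 pface1 pas1 pf1 pb1 hp3 pj1 pb3 pc1 pz1 ps1 pec1 pma1"

# will_migrate TARGET: print "yes" if the link target TARGET falls under
# the migration rules described above, otherwise print "no".
will_migrate() {
    target=$1
    case "$target" in
        # Exception: these two pace1 GPFS filesystems are migrated as well.
        /gpfs/pace1/project/bio-konstantinidis/*|/gpfs/pace1/project/bio-soojinyi/*)
            echo yes
            return 0
            ;;
        /nv/*)
            # The component after /nv/ is the storage name, e.g. pf2.
            name=${target#/nv/}
            name=${name%%/*}
            for s in $MIGRATED_NFS; do
                if [ "$s" = "$name" ]; then
                    echo yes
                    return 0
                fi
            done
            ;;
    esac
    echo no
}
```

For example, `will_migrate "$(readlink ~/data)"` prints "yes" or "no" for the location linked as "data".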

What do I do if my project storage is being migrated?

No action is needed for users who have been using the symbolic link names to access the storage (e.g. data, data2, etc.), because PACE will replace these links to point to the new locations.

If you have been referencing your storage using its absolute path (e.g. /nv/pf2/username), which is not recommended, then you need to replace all mentions of “/nv” with “/gpfs/pace1/project” in your codes, scripts and job submissions. E.g., “/nv/pf2/username3” should be replaced with “/gpfs/pace1/project/pf2/username3”.

Users of bio-konstantinidis and bio-soojinyi should only need to replace “pace1” with “pace2”. E.g., “/gpfs/pace1/project/bio-konstantinidis/username3” should be replaced with “/gpfs/pace2/project/bio-konstantinidis/username3”.
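As an illustration, the two substitutions above could be expressed as a small shell function. This is only a sketch that applies the rules exactly as worded in this post; the name `rewrite_path` is hypothetical. It does not check whether a storage name is on the migration list, so it should only be given paths that are known to be migrating (for instance, home directories under /nv must not be passed to it).

```shell
# rewrite_path OLD_PATH: print the path after applying the substitutions
# described above. Hypothetical helper; assumes OLD_PATH refers to storage
# that is actually being migrated.
rewrite_path() {
    case "$1" in
        # bio-konstantinidis and bio-soojinyi: replace pace1 with pace2.
        /gpfs/pace1/project/bio-konstantinidis/*|/gpfs/pace1/project/bio-soojinyi/*)
            printf '%s\n' "/gpfs/pace2/${1#/gpfs/pace1/}"
            ;;
        # NFS project paths: replace the leading /nv with /gpfs/pace1/project.
        /nv/*)
            printf '%s\n' "/gpfs/pace1/project/${1#/nv/}"
            ;;
        # Anything else is printed unchanged.
        *)
            printf '%s\n' "$1"
            ;;
    esac
}
```

Printing the new path rather than editing files in place leaves the decision about what to change in each script with the user.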

NOTE: PACE strongly encourages all users to reference their project directories using their symbolic links (e.g. data, data2, …), rather than absolute paths, which are always subject to change. Doing so will minimize the user action needed when we make changes in the systems and configurations.

What if I don’t fix existing references to the old locations after my data are migrated?

The PACE team will replace existing directories with links pointing to their new locations to minimize user impact. This way, scripts/codes that point to the old paths can continue to run without any changes. However, this temporary failsafe measure will only be in place for approximately 3 more months (until the next maintenance day). We strongly encourage all users to check whether their data is being migrated, then fix their scripts/codes as needed within this 3-month grace period. Please contact the PACE team if you need any assistance with this process.

The PACE team will also be monitoring jobs during this time period, and will proactively reach out to users with jobs that are still using the old paths.

 

Schedulers (Requires some user action)

  • Consolidation of nvidia-gpu and gpu-recent queues on a new queue named “force-gpu”: This will require users of these queues to change the queue name to “force-gpu” in their submission scripts.
  • Clean up and improve PBSTools configurations and data
  • Migration of semap-6 queue to the dedicated-sched scheduler
  • [NEW] Migration of all joe queues to the shared-sched scheduler
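For the queue consolidation in the first item, the change to a submission script can be previewed with a short `sed` command. This is a sketch under the assumption that the queue is selected with a `#PBS -q` directive inside the script; the function name `update_queue` is hypothetical. Queue names passed on the `qsub` command line would need to be changed separately.

```shell
# update_queue SCRIPT: print SCRIPT with "#PBS -q nvidia-gpu" or
# "#PBS -q gpu-recent" directives rewritten to use force-gpu.
# The file itself is not modified; output goes to stdout.
update_queue() {
    sed -E 's/^(#PBS[[:space:]]+-q[[:space:]]+)(nvidia-gpu|gpu-recent)[[:space:]]*$/\1force-gpu/' "$1"
}
```

For example, `update_queue myjob.pbs > myjob.new.pbs` writes an updated copy that can be compared against the original before it is used.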

ASDL / ITAR cluster (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Redistribute power connections to additional circuit in the rack
  • Replace CMOS batteries on compute nodes
  • Replace the motherboard on the file server, to use all the available memory slots

Power and network (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Update power distribution units on 2 racks
  • Move compute nodes to balance power utilization
  • Replace old, out of support switches
  • Update DNS appliances in Rich 116, 133 and BDCD
  • Increase redundancy to Infiniband connections between Rich 116 and 133

[RESOLVED] Post-maintenance Storage Issues

Posted on Saturday, 12 August, 2017

Update: GPFS storage is stable again. There remain several steps we need to take to complete this work, which can be completed without downtime. We may need to take some nodes temporarily offline, which will be done in coordination with you and without impacting running jobs.
-----

PACE systems started experiencing widespread problems with the GPFS storage shortly after jobs were released following the completion of the maintenance tasks. At first glance, they seem to be related to the Infiniband network.

We would like to advise against submitting new jobs until these issues are fully resolved. We will continue to work on a resolution and keep you updated on the progress.

Thank you for your patience.

PACE systems experiencing problems

Posted on Wednesday, 9 August, 2017

Update: The immediate issue is now resolved. We continue to work with the vendor to find out more about what caused the problem. If you need assistance with failed jobs or unavailable headnodes, please contact pace-support@oit.gatech.edu.

We are experiencing a storage issue that’s impacting the majority of the home directories and some services, with potential impact on running jobs. We are currently in the process of assessing the overall impact and investigating the root cause. We’ll update our blog (http://blog.pace.gatech.edu/) as we learn more about the failures.

We are sorry for the inconvenience and thank you for your patience.

Update: Impacted filesystems:
ap1
ap2
ap3
ap4
ap5
ap6
hase1
hb51
hbiobot1
hcee2
hcfm1
hchpro1
hcoc1
hface1
hggate1
hhygene1
hjabberwocky1
hmart1
hmedprint1
hmeg1
hmicro1
hneutrons1
hnjord1
hp1
hp10
hp11
hp12
hp13
hp14
hp15
hp16
hp17
hp18
hp19
hp2
hp20
hp21
hp22
hp23
hp24
hp25
hp26
hp27
hp28
hp29
hp4
hp5
hp6
hp7
hp8
hp9
hpampa1
hsemap1
hska1
hsonar1
hthreshold1
html1

PACE quarterly maintenance – (Aug 10-12, 2017)

Posted on Thursday, 3 August, 2017

Dear PACE users,

PACE clusters and systems will be taken offline at 6am on Thursday, Aug 10 through the end of Saturday (Aug 12). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Storage (no user action needed)

We are working with the vendor to reconfigure our multiple GPFS filesystems for performance fine-tuning and better support for large scale deployments. Some of these configurations will be applied on the maintenance day because they require downtime.

Network (no user action needed)

The PACE infiniband (IB) network requires some configuration changes to handle the rapid growth in the number of PACE nodes. We have identified several configuration parameters that will potentially reduce the occurrence of nodes losing their connectivity to GPFS (which relies on the IB network), a problem that causes intermittent job crashes.

Schedulers (postponed, no user action needed)

We have communicated our plans to upgrade the scheduler on several occasions in the past, but skipped this task on previous maintenance days due to bugs that we had uncovered. Despite the vendor’s promising progress on resolving these bugs, they are not yet fully resolved and tested. For this reason, we decided to once again postpone the upgrade and keep the current versions until we have a bug-free and well-tested version.

Scheduler-based monitoring and analysis (no user action needed)

PACE research scientists started a collaboration with the Texas Advanced Computing Center (TACC) to develop a new tool that analyzes scheduler logs to gain insights into usage trends. For more details, please check https://doi.org/10.1145/3093338.3093351

This tool relies heavily on a widely used utility named ‘PBSTools’, which is developed by the Ohio Supercomputer Center. Our current installation of PBSTools is old, buggy and very slow. We will upgrade this tool and its database on the maintenance day to ensure that no job info is lost during the transition.

Power Work (no user action needed)

PACE and the OIT Operations team will perform some work in the datacenter to improve electrical capacity and balance. This includes moving and/or replacing some of the in-rack Power Distribution Units (PDUs).

Software Repository (no user action needed)

Our NIST SP800-171 conforming secure cluster has a specific need for a dedicated copy of the PACE software repository. We have previously completed an initial replication of the files and will simply point the nodes to the replica during the maintenance window.

PACE is experiencing storage (GPFS) problems

Posted on Friday, 16 June, 2017

We are experiencing intermittent problems with the GPFS storage system that hosts most of the project directories.

We are working with the vendor to investigate the ongoing issues. At this moment we don’t know whether they are related to yesterday’s power/cooling failures or not, but we will update the PACE community as we find out more.

This issue has potential impact on running jobs and we are sorry for this inconvenience.

PACE datacenter experienced a power/cooling failure

Posted on Friday, 16 June, 2017

What happened: We had a brief power failure in our datacenter, which took out cooling in racks running chilled water. This impacted about 160 nodes from various queues, with potential impact on running jobs.

Current Situation: Some cooling has been restored; however, we had to shut down the highest-temperature racks that were not cooling down (p41, k30, h43, c29, c42). In coordination with the Operations team, we are keeping a close eye on the remaining racks in the risk area as they continue to monitor temperatures in these racks.

We will start bringing the powered-down nodes back online once the cooling issue is fully resolved.

What can you do: Please resubmit failed jobs (if any) if you were using any of the queues listed below. As always, contact pace-support@oit.gatech.edu for any kind of assistance you may need.

Thank you for your patience, and we are sorry for the inconvenience.

Impacted Queues:

—————————
apurimac-6
apurimacforce-6
atlas-6
atlas-debug
b5force-6
biobot
biobotforce-6
bioforce-6
breakfix
cee
ceeforce
chemprot
chowforce-6
cnsforce-6
critcel
critcel-burnup
critcelforce-6
critcel-prv
cygnus
cygnus-6
cygnus64-6
cygnusforce-6
cygnus-hp
davenprtforce-6
dimerforce-6
ece
eceforce-6
enveomics-6
faceoff
faceoffforce-6
force-6
ggate-6
granulous
gryphon
gryphon-debug
gryphon-prio
gryphon-tmp
hygeneforce-6
isabella-prv
isblforce-6
iw-shared-6
martini
mathforce-6
mayorlab_force-6
mday-test
medprint-6
medprintfrc-6
megatron
megatronforce-6
microcluster
micro-largedata
monkeys_gpu
mps
njordforce-6
optimusforce-6
prometforce-6
prometheus
radiance
rombergforce
semap-6
skadi
sonarforce-6
spartacusfrc-6
threshold-6
try-6
uranus-6

Infiniband switch failure causing partial network and storage unavailability

Posted on Thursday, 25 May, 2017

We experienced an infiniband (IB) switch failure, which impacted several racks of nodes connected to this switch. This issue caused MPI job crashes and GPFS unavailability.

The switch is now back online and it’s safe to submit new jobs.

If you are using one or more of the queues listed below, please check your jobs and re-submit them if necessary. One indication of this issue is “Stale file handle” error messages that may appear in the job output or logs.
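One way to look for this indication is to search job output files for the error message. The sketch below assumes that your job output and log files are stored under a single directory; the function name `find_stale_jobs` is hypothetical.

```shell
# find_stale_jobs DIR: list files under DIR that contain the
# "Stale file handle" error message. Prints nothing (and returns a
# non-zero status) when no file matches.
find_stale_jobs() {
    grep -rlF "Stale file handle" "$1"
}
```

For example, `find_stale_jobs ~/myjobs` lists the output files from that (hypothetical) directory that mention the error, which can help decide which jobs to re-submit.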

Impacted Queues:
=============
athena-intel
atlantis
atlas-6-sunge
atlas-intel
joe-6-intel
test85
apurimacforce-6
b5force-6
bioforce-6
ceeforce
chemprot
cnsforce-6
critcelforce-6
cygnusforce-6
dimerforce-6
eceforce-6
faceoffforce-6
force-6
hygeneforce-6
isblforce-6
iw-shared-6
mathforce-6
mayorlab_force-6
medprint-6
nvidia-gpu
optimusforce-6
prometforce-6
rombergforce
sonarforce-6
spartacusfrc-6
try-6
testflight
novazohar