PACE quarterly maintenance – (Nov 2-4, 2017)

Posted on Monday, 23 October, 2017

 

Dear PACE users,

PACE clusters and systems will be taken offline at 6am on Thursday, Nov 2, through the end of Saturday (Nov 4). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Some of the planned improvements, storage migrations in particular, require the attention of a large number of users. Please read on for more details and action items.

Storage (Requires user action)

PACE is retiring old NFS storage servers, which have been actively serving project directories for a large number of users. All of the data they contain will be consolidated into a recently purchased GPFS storage system (pace2). GPFS is a high-performance parallel filesystem, which offers improved reliability (and in many cases performance) compared to NFS.

Important: PACE will also start enforcing a limit of 2 million files/directories on this GPFS system, regardless of their size. We have identified the users who are currently over this limit and will contact them separately to prevent interruptions to research.
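To get a rough idea of where you stand against the 2 million limit, you can count the entries under a project directory yourself. The sketch below is only an approximation: `count_entries` is a hypothetical helper name (not a PACE-provided tool), and we are assuming that every file, directory and link counts as one entry, which may differ from how the limit is actually enforced.

```shell
#!/bin/bash
# count_entries DIR: print the number of entries (files, directories, links)
# found below DIR. Hypothetical helper, not a PACE-provided tool.
count_entries() {
    # -H follows DIR itself if it is a symbolic link (such as ~/data).
    # -print0 emits one NUL per entry, so names containing newlines
    # are still counted exactly once.
    find -H "$1" -mindepth 1 -print0 | tr -dc '\0' | wc -c
}

# Example (run on the headnode; can take a long time on large trees):
#   n=$(count_entries ~/data)
#   [ "$n" -gt 2000000 ] && echo "over the limit: $n entries"
```

Walking a large directory tree puts load on the storage servers, so it is probably best to run this once rather than repeatedly.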

Here’s a full list of storage locations that will be migrated to ‘pace2’:

pg1, pc5, pe11, pe14, pe15, pe3, pe5, pe9, pe10, pe12, pe4, pe6, pe8, pa1, pbi1, pcc1, pcee2, pmse1, psur1, pc4, pase1, pmart1, pchpro1, pska1, pbiobot1, pf2, pggate1, ptml1, pc6, py1, py2, pc2, pz2, pe1, pe7, pe13, pe2, pb2, pface1, pas1, pf1, pb1, hp3, pj1, pb3, pc1, pz1, ps1, pec1, pma1

In addition to these NFS shares, we will also migrate these two filesystems from our current GPFS system (pace1) to the new GPFS system (pace2), due to limited space availability:

bio-konstantinidis, bio-soojinyi

 

How can I tell if my project directory will be migrated?

Copy and run this command on the headnode:

find ~/data* -maxdepth 1 -type l -exec ls -ld {} \;

This command will return one or more lines similar to:

lrwxrwxrwx 1 root pace-admins 16 Jun 16 2015 /nv/hp16/username3/data -> /nv/pf2/username3
lrwxrwxrwx 1 root pace-admins 19 Jan 6 2017 /nv/hp16/username3/data2 -> /gpfs/pace1/project/pf1/username3

Please note the right-hand side of the arrow “->”. If the arrow is pointing to a path starting with “/nv/…” and followed by a storage name included in the list provided above, then your data will be migrated. In this example, the location linked as “data” will be migrated (/nv/pf2/username3) but “data2” will not (/gpfs/pace1/project/pf1/username3).

As an exception, all references to “bio-konstantinidis” and “bio-soojinyi” will be migrated, even though their path starts with “/gpfs” and not “/nv”. E.g.:

lrwxrwxrwx 1 root bio-konstantinidis 43 Oct 11 2015 /nv/hp1/username3/data3 -> /gpfs/pace1/project/bio-konstantinidis/username3
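
If you would rather not compare the link targets against the list by eye, the rule above can be sketched as a small script. This is an illustration only: `will_migrate` and `check_links` are hypothetical helper names, the storage names are copied from the list in this announcement, and the check assumes your links are named data* in your home directory.

```shell
#!/bin/bash
# Storage names copied from the migration list in this announcement.
MIGRATED="pg1 pc5 pe11 pe14 pe15 pe3 pe5 pe9 pe10 pe12 pe4 pe6 pe8 pa1 pbi1
pcc1 pcee2 pmse1 psur1 pc4 pase1 pmart1 pchpro1 pska1 pbiobot1 pf2 pggate1
ptml1 pc6 py1 py2 pc2 pz2 pe1 pe7 pe13 pe2 pb2 pface1 pas1 pf1 pb1 hp3 pj1
pb3 pc1 pz1 ps1 pec1 pma1"

# will_migrate TARGET: exit 0 if TARGET matches the migration rule, else 1.
will_migrate() {
    local target=$1 name
    # Exception: these two pace1 filesystems move even though they are on /gpfs.
    case $target in
        /gpfs/pace1/project/bio-konstantinidis/*|/gpfs/pace1/project/bio-soojinyi/*)
            return 0 ;;
    esac
    # General rule: /nv/<storage name> for a name in the list.
    for name in $MIGRATED; do
        case $target in
            /nv/"$name"|/nv/"$name"/*) return 0 ;;
        esac
    done
    return 1
}

# check_links: report every data* symbolic link in the home directory.
check_links() {
    local link target
    for link in "$HOME"/data*; do
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        if will_migrate "$target"; then
            echo "$link -> $target: will be migrated"
        else
            echo "$link -> $target: will not be migrated"
        fi
    done
}
```

Calling `check_links` on the headnode prints one line per link. The find command given earlier remains the reference check if the two ever disagree.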

What do I do if my project storage is being migrated?

No action is needed for users who have been using the symbolic link names to access the storage (e.g. data, data2, etc.), because PACE will update these links to point to the new locations.

If you have been referencing your storage by absolute path (e.g. /nv/pf2/username3), which is not recommended, then you need to replace all mentions of “/nv” with “/gpfs/pace2/project” in your codes, scripts and job submissions. E.g., “/nv/pf2/username3” should be replaced with “/gpfs/pace2/project/pf2/username3”.

Users of bio-konstantinidis and bio-soojinyi should only need to replace “pace1” with “pace2”. E.g., “/gpfs/pace1/project/bio-konstantinidis/username3” should be replaced with “/gpfs/pace2/project/bio-konstantinidis/username3”.
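
For bio-konstantinidis and bio-soojinyi users with many scripts, the pace1 to pace2 substitution described above could be applied with sed. This is a sketch under assumptions, not an official tool: `rewrite_bio_paths` is a hypothetical helper name, and it only touches paths of the exact form /gpfs/pace1/project/<filesystem name>. Each edited file is kept as a .bak copy so you can review the changes.

```shell
#!/bin/bash
# rewrite_bio_paths FILE...: change pace1 to pace2 in bio-konstantinidis and
# bio-soojinyi project paths, saving the original of each file as FILE.bak.
# Hypothetical helper, not a PACE-provided tool.
rewrite_bio_paths() {
    sed -i.bak -E \
        's#/gpfs/pace1/project/(bio-konstantinidis|bio-soojinyi)#/gpfs/pace2/project/\1#g' \
        "$@"
}

# Example:
#   rewrite_bio_paths ~/scripts/*.pbs
#   diff ~/scripts/myjob.pbs.bak ~/scripts/myjob.pbs   # review the change
```

Other pace1 paths are left alone on purpose, since only these two filesystems are moving off pace1.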

NOTE: PACE strongly encourages all users to reference their project directories using their symbolic links (e.g. data, data2, …), rather than absolute paths, which are always subject to change. Doing so will minimize the user action needed when we make changes in the systems and configurations.

What if I don’t fix existing references to the old locations after my data are migrated?

The PACE team will replace existing directories with links pointing to their new location to minimize user impact. This way, scripts/codes that point to the old paths can continue to run without any changes. However, this temporary failsafe measure will only be in place for approximately 3 more months (until the next maintenance day). We strongly encourage all users to check whether their data is being migrated, then fix their scripts/codes as needed within this 3-month grace period. Please contact the PACE team if you need any assistance with this process.

The PACE team will also be monitoring jobs during this period, and will proactively reach out to users with jobs that are still using the old paths.
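
One way to find scripts that still mention an old location during the grace period is a recursive grep. The helper below is a hypothetical sketch (`find_old_refs` is not a PACE-provided tool): you pass it a directory to search and one or more old path prefixes, which you would take from your own link targets.

```shell
#!/bin/bash
# find_old_refs DIR PREFIX...: print file:line:text for every line under DIR
# that contains one of the given old path prefixes (matched as fixed strings).
# Hypothetical helper, not a PACE-provided tool.
find_old_refs() {
    local dir=$1 prefix
    shift
    for prefix in "$@"; do
        grep -rnF -- "$prefix" "$dir"
    done
    return 0   # having no matches is not an error here
}

# Example:
#   find_old_refs ~/scripts /nv/pf2/ /gpfs/pace1/project/bio-soojinyi/
```

Any line this prints is a candidate for switching to the symbolic link name (e.g. ~/data) instead of an absolute path.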

 

Schedulers (Requires some user action)

  • Consolidation of the nvidia-gpu and gpu-recent queues into a new queue named “force-gpu”: This will require users of these queues to change the queue name to “force-gpu” in their submission scripts.
  • Clean up and improve PBSTools configurations and data
  • Migration of semap-6 queue to the dedicated-sched scheduler
  • [NEW] Migration of all joe queues to the shared-sched scheduler
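
For the queue consolidation, the change in a submission script is typically a single directive. The sketch below assumes the queue is selected with a `#PBS -q` line in the script (if you pass `-q` on the qsub command line instead, change it there); `update_gpu_queue` is a hypothetical helper name, not a PACE-provided tool.

```shell
#!/bin/bash
# update_gpu_queue FILE...: rewrite "#PBS -q nvidia-gpu" and
# "#PBS -q gpu-recent" directives to "#PBS -q force-gpu",
# saving the original of each file as FILE.bak.
# Hypothetical helper, not a PACE-provided tool.
update_gpu_queue() {
    sed -i.bak -E \
        's/^(#PBS[[:space:]]+-q[[:space:]]+)(nvidia-gpu|gpu-recent)[[:space:]]*$/\1force-gpu/' \
        "$@"
}

# Example:
#   update_gpu_queue ~/scripts/*.pbs
```

Directives naming any other queue are left untouched.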

ASDL / ITAR cluster (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Redistribute power connections to an additional circuit in the rack
  • Replace CMOS batteries on compute nodes
  • Replace the motherboard on the file server, to use all the available memory slots

Power and network (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Update power distribution units on 2 racks
  • Move compute nodes to balance power utilization
  • Replace old, out-of-support switches
  • Update DNS appliances in Rich 116, 133 and BDCD
  • Increase redundancy of InfiniBand connections between Rich 116 and 133