Archive for category Uncategorized

[Resolved] All PACE nodes temporarily offline due to storage trouble

Posted on Saturday, 30 December, 2017

Update (12/31/2017, 10:15am): We have addressed the issue, and the majority of nodes have started running jobs again. As far as we can tell, this was caused by a network-related “event” that’s internal to the system. We are working with the vendor to identify the exact root cause.

Original post: One of the primary storage systems (pace2) went offline today, potentially impacting running jobs that reference that system.

Our automated scripts offlined PACE nodes to prevent new jobs from starting. The nodes will be brought back online once the storage issues are addressed.

The PACE team is currently investigating the problems, and we will keep you updated.

We apologize for any delays caused by limited staff availability over the holidays.

Systematic offlining of PACE nodes to address storage slowness

Posted on Tuesday, 21 November, 2017

We identified a problem with the way some nodes are mounting our main (GPFS) storage server, causing slow storage performance. The fix requires restarting the storage services on affected nodes individually, when they are not running any jobs. For this reason, we started draining (offlining) all affected nodes and systematically bringing them back online as soon as their jobs are complete and the fix is applied.

Apart from slower storage, this issue does not affect running jobs, but you will notice offline nodes in your queues until we have fixed all affected nodes.

It’s safe to continue submitting jobs and there is no risk of data loss.

We are sorry for this inconvenience and thank you for your cooperation.

PACE clusters ready for research

Posted on Saturday, 4 November, 2017

Our November 2017 maintenance period is now complete, far ahead of schedule.  We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and data available. As usual, there are some straggling nodes we will address over the coming days.

Our next maintenance period is scheduled for Thursday, February 8 through Saturday, February 10, 2018.

Storage
– Nearly a petabyte of data was migrated to the new DDN/GPFS storage device. While this provides a more performant, expandable, and supportable storage platform, it requires changes to path names. We have adjusted the symbolic links in home directories (e.g. ~/data) to point to the new locations; please continue to use these names wherever possible. To minimize disruption, we have also put a temporary redirection in place so that the old names continue to work. We intend to remove this redirection during our next maintenance period, and will proactively identify and assist users who are still using the deprecated path names.

Schedulers
– The nvidia-gpu and gpu-recent queues have been consolidated into a new force-gpu queue.  Please use the new queue name going forward.  PACE staff will proactively identify and assist users using the deprecated queue names.
– The semap-6 queue has been moved to an alternate scheduler server.  No user action is required.
– The Joe cluster has been moved into the shared partition. Its users now have access to idle cycles in the shared partition, and in turn offer their cluster’s idle cycles for use by others.

ITAR / NIST800-171 environment
– Planned tasks are complete; no user action is required.

Power and Network
– Planned tasks are complete; no user action is required.

PACE quarterly maintenance (Nov 2-4, 2017)

Posted on Monday, 23 October, 2017

Dear PACE users,

PACE clusters and systems will be taken offline from 6am on Thursday, Nov 2 through the end of Saturday, Nov 4. Jobs with long walltimes will be held by the scheduler to prevent them from being killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Some of the planned improvements, storage migrations in particular, require the attention of a large number of users. Please read on for more details and action items.

Storage (Requires user action)

PACE is retiring old NFS storage servers, which have been actively serving project directories for a large number of users. All of the data they contain will be consolidated onto a recently purchased GPFS storage system (pace2). GPFS is a high-performance parallel filesystem, which offers improved reliability (and in many cases performance) compared to NFS.

Important: PACE will also start enforcing a limit of 2 million files/directories on this GPFS system, regardless of file size. We have identified the users who currently exceed this limit and will contact them separately to prevent interruptions to research.
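
To get a rough idea of where a directory stands relative to this limit, its entries can be counted with standard tools. The sketch below is only an approximation: this post does not describe exactly how the limit is accounted (for example per user or per project), and the helper names count_entries and check_limit are made up for illustration.

```shell
#!/usr/bin/env bash
# Limit stated in this announcement (files plus directories).
FILE_LIMIT=2000000

# Hypothetical helper: count every entry below $1 (files, directories, links),
# not counting $1 itself. -H makes find follow $1 when it is a symbolic link
# such as ~/data. Names that contain newlines would be counted more than once.
count_entries() {
    find -H "$1" -mindepth 1 2>/dev/null | wc -l | tr -d '[:space:]'
}

# Hypothetical helper: report whether $1 is over the limit ($2, default above).
check_limit() {
    local dir=$1 limit=${2:-$FILE_LIMIT} n
    n=$(count_entries "$dir")
    if [ "$n" -gt "$limit" ]; then
        echo "OVER: $dir has $n entries (limit $limit)"
        return 1
    fi
    echo "OK: $dir has $n entries (limit $limit)"
}
```

For example, check_limit ~/data reports the count for the location behind the data link. Walking a very large directory tree can take a while.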

Here’s a full list of storage locations that will be migrated to ‘pace2’:

pg1, pc5, pe11, pe14, pe15, pe3, pe5, pe9, pe10, pe12, pe4, pe6, pe8, pa1, pbi1, pcc1, pcee2, pmse1, psur1, pc4, pase1, pmart1, pchpro1, pska1, pbiobot1, pf2, pggate1, ptml1, pc6, py1, py2, pc2, pz2, pe1, pe7, pe13, pe2, pb2, pface1, pas1, pf1, pb1, hp3, pj1, pb3, pc1, pz1, ps1, pec1, pma1

In addition to these NFS shares, we will also migrate these two filesystems from our current GPFS system (pace1) to the new GPFS system (pace2), due to limited space availability:

bio-konstantinidis, bio-soojinyi

How can I tell if my project directory will be migrated?

Copy and run this command on the headnode:

find ~/data* -maxdepth 1 -type l -exec ls -ld {} \;

This command will return one or more lines similar to:

lrwxrwxrwx 1 root pace-admins 16 Jun 16 2015 /nv/hp16/username3/data -> /nv/pf2/username3
lrwxrwxrwx 1 root pace-admins 19 Jan 6 2017 /nv/hp16/username3/data2 -> /gpfs/pace1/project/pf1/username3

Please note the right-hand side of the arrow “->”. If the arrow is pointing to a path starting with “/nv/…” and followed by a storage name included in the list provided above, then your data will be migrated. In this example, the location linked as “data” will be migrated (/nv/pf2/username3) but “data2” will not (/gpfs/pace1/project/pf1/username3).

As an exception, all references to “bio-konstantinidis” and “bio-soojinyi” will be migrated, even though their path starts with “/gpfs” and not “/nv”. E.g.:

lrwxrwxrwx 1 root bio-konstantinidis 43 Oct 11 2015 /nv/hp1/username3/data3 -> /gpfs/pace1/project/bio-konstantinidis/username3
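
The check described above can also be scripted. This is a minimal sketch, not an official PACE tool: the storage names are copied from the list in this post, and the helper names will_migrate and check_links are made up for illustration.

```shell
#!/usr/bin/env bash
# Storage names copied from the migration list in this post.
MIGRATED="pg1 pc5 pe11 pe14 pe15 pe3 pe5 pe9 pe10 pe12 pe4 pe6 pe8 pa1 pbi1 \
pcc1 pcee2 pmse1 psur1 pc4 pase1 pmart1 pchpro1 pska1 pbiobot1 pf2 pggate1 \
ptml1 pc6 py1 py2 pc2 pz2 pe1 pe7 pe13 pe2 pb2 pface1 pas1 pf1 pb1 hp3 pj1 \
pb3 pc1 pz1 ps1 pec1 pma1"

# Hypothetical helper: succeeds (exit 0) if a link target will be migrated.
will_migrate() {
    local target=$1 name fs
    # Exception: these two pace1 filesystems move even though they start with /gpfs.
    for fs in bio-konstantinidis bio-soojinyi; do
        case $target in
            "/gpfs/pace1/project/$fs"|"/gpfs/pace1/project/$fs"/*) return 0 ;;
        esac
    done
    case $target in
        /nv/*)
            name=${target#/nv/}   # drop the leading /nv/
            name=${name%%/*}      # keep only the storage name
            [ -n "$name" ] || return 1
            case " $MIGRATED " in
                *" $name "*) return 0 ;;
            esac ;;
    esac
    return 1
}

# Hypothetical helper: classify every data* symbolic link in a directory
# (default: your home directory).
check_links() {
    local dir=${1:-$HOME} link target
    for link in "$dir"/data*; do
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        if will_migrate "$target"; then
            echo "MIGRATE $link -> $target"
        else
            echo "KEEP    $link -> $target"
        fi
    done
}
```

Running check_links with no arguments prints one MIGRATE or KEEP line per link in your home directory.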

What do I do if my project storage is being migrated?

No action is needed for users who have been using the symbolic link names to access the storage (e.g. data, data2, etc.), because PACE will update these links to point to the new locations.

If you have been referencing your storage using its absolute path (e.g. /nv/pf2/username), which is not recommended, then you need to replace all mentions of “/nv” with “/gpfs/pace1/project” in your codes, scripts and job submissions. E.g., “/nv/pf2/username3” should be replaced with “/gpfs/pace1/project/pf2/username3”.

Users of bio-konstantinidis and bio-soojinyi should only need to replace “pace1” with “pace2”. E.g., “/gpfs/pace1/project/bio-konstantinidis/username3” should be replaced with “/gpfs/pace2/project/bio-konstantinidis/username3”.
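
For scripts that do contain absolute paths, the two replacements above could be applied with sed, as in the sketch below. The old and new prefixes are copied verbatim from this post, so please confirm them against your actual link targets before relying on the result. The helper names are made up for illustration, and only references that continue with a “/” after the storage name are matched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list lines that still mention an old /nv path
# for one storage name. Usage: find_old_refs <storage-name> <file>...
find_old_refs() {
    local name=$1
    shift
    grep -nH -- "/nv/${name}/" "$@"
}

# Hypothetical helper: rewrite /nv/<name>/ to /gpfs/pace1/project/<name>/
# in place, keeping a .bak copy of each file.
# Usage: rewrite_nv_refs <storage-name> <file>...
rewrite_nv_refs() {
    local name=$1
    shift
    sed -i.bak "s#/nv/${name}/#/gpfs/pace1/project/${name}/#g" "$@"
}

# Hypothetical helper: switch the two bio-* filesystems from pace1 to pace2,
# again keeping a .bak copy. Usage: rewrite_bio_refs <file>...
rewrite_bio_refs() {
    sed -i.bak -E \
        's#/gpfs/pace1/project/(bio-konstantinidis|bio-soojinyi)/#/gpfs/pace2/project/\1/#g' \
        "$@"
}
```

A cautious order is to run find_old_refs first, review the matches, and only then run the rewrite. Each rewrite overwrites any earlier .bak copy of the same file.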

NOTE: PACE strongly encourages all users to reference their project directories using their symbolic links (e.g. data, data2, …), rather than absolute paths, which are always subject to change. Doing so will minimize the user action needed when we make changes in the systems and configurations.

What if I don’t fix existing references to the old locations after my data are migrated?

The PACE team will replace the existing directories with links pointing to their new locations to minimize user impact. This way, scripts/codes that point to old paths can continue to run without any changes. However, this temporary failsafe measure will only be in place for approximately 3 months (until the next maintenance day). We strongly encourage all users to check whether their data is being migrated and to fix their scripts/codes as needed within this 3-month grace period. Please contact the PACE team if you need any assistance with this process.

The PACE team will also monitor jobs during this period and proactively reach out to users whose jobs are still using the old paths.

Schedulers (Requires some user action)

  • Consolidation of the nvidia-gpu and gpu-recent queues into a new queue named “force-gpu”: This will require users of these queues to change the queue name to “force-gpu” in their submission scripts.
  • Clean up and improve PBSTools configurations and data
  • Migration of semap-6 queue to the dedicated-sched scheduler
  • [NEW] Migration of all joe queues to the shared-sched scheduler
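
For the queue consolidation, the rename in submission scripts could be done with sed. This is a minimal sketch that assumes the queue is selected on its own “#PBS -q <queue>” directive line; a queue passed on the qsub command line is not covered. The helper name rename_gpu_queue is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the two retired queue names with force-gpu
# in "#PBS -q" directive lines, keeping a .bak copy of each script.
# Usage: rename_gpu_queue <script>...
rename_gpu_queue() {
    sed -i.bak -E \
        's/^(#PBS[[:space:]]+-q[[:space:]]+)(nvidia-gpu|gpu-recent)[[:space:]]*$/\1force-gpu/' \
        "$@"
}
```

For example, rename_gpu_queue ~/jobs/*.pbs (a hypothetical location) updates every matching script and leaves the originals next to them with a .bak suffix. Lines naming other queues are left untouched.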

ASDL / ITAR cluster (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Redistribute power connections to additional circuit in the rack
  • Replace CMOS batteries on compute nodes
  • Replace the motherboard on the file server, to use all the available memory slots

Power and network (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Update power distribution units on 2 racks
  • Move compute nodes to balance power utilization
  • Replace old, out-of-support switches
  • Update DNS appliances in Rich 116, 133 and BDCD
  • Increase redundancy of InfiniBand connections between Rich 116 and 133

Campus preparedness and Hurricane Irma

Posted on Friday, 8 September, 2017

Greetings PACE community,

As Hurricane Irma makes its way along the projected path through Florida and into Georgia, I’d like to let you know what PACE is doing to prepare.

OIT Operations will be closely monitoring the path of the storm and any impacts it might have on the functionality of computer rooms in the Rich Computer Center and our backup facility on Marietta Street. In the event that either of these facilities loses power, OIT Operations will enact emergency procedures and respond as best they can.

What does this mean for PACE?

The room where we keep the compute nodes has only a few minutes of battery-protected power. This is plenty to ride through momentary power glitches, but not a longer outage. In the event of a power loss, compute nodes will power down, terminating whatever jobs are running. The rooms where we keep our servers, storage and backups have additional generator power, which can keep them running longer. This too is a finite resource. In the event of power loss, PACE will begin an orderly shutdown of servers and storage in order to reduce the chance of data corruption or loss.

Bottom line is that our priority will be to protect the critical research data, and enable successful resumption of research once power is restored.

Where to get further updates?

Our primary communications channels remain our mailing list, pace-availability@lists.gatech.edu, and the PACE blog (http://blog.pace.gatech.edu). However, substantial portions of the IT infrastructure required for these to operate are also located in campus data centers. Additionally, OIT employs a cloud-based service to publish status updates. In the event that our blog is unreachable, please visit https://status.gatech.edu.

GPFS problem (resolved)

Posted on Saturday, 2 September, 2017

This was much ado about nothing.  Running jobs continued to execute normally through this event, and no data was at risk.  What did happen is that jobs that could potentially have started were delayed.

A longer explanation –

We have monitoring agents that prevent jobs from starting if they detect a potential problem with the system. The idea is to avoid starting a job if there’s a known reason that would cause a crash. During our last maintenance period, we brought a new DDN storage system online and configured these agents to watch it for issues. It did develop an issue; the monitoring agents flagged it and took nodes offline so that no new jobs would start. However, we have yet to put any production workloads on this new storage, so no running jobs were affected.

At the moment, we’re pushing out a change to the monitoring agents to ignore the new storage.  As this finishes rolling out, compute nodes will come online and resume normal processing.  We’re also working with DDN to address the issue on the new storage system.

GPFS Problem

Posted on Friday, 1 September, 2017

We are actively debugging a GPFS storage problem on our systems that unfortunately brought many queues offline. We do not yet fully know the cause or the solution, but will post an update as soon as possible.

We apologize for the inconvenience and are actively working on a solution.

Please Use iw-dm-4 for File Transfers

Posted on Tuesday, 29 August, 2017

We experienced a slowdown yesterday on all headnodes, caused by an unusually large number of user file operations on the headnodes. All headnodes are virtual machines and connect through a network file system gateway to the GPFS filesystem. This gateway became overwhelmed by the user file operations, which in turn slowed down all headnode file operations.

If you have heavy file operations (e.g. WinSCP, FileZilla, or SCP transfers), please perform them by logging directly into iw-dm-4.pace.gatech.edu instead of a headnode. Other file operations such as tarring/zipping are best performed on iw-dm-4 or on compute nodes, by submitting interactive or batch jobs.

We’re actively looking into alternatives to virtual machine headnodes, and will provide more detailed updates as we approach our upcoming scheduled maintenance in November (via blog.pace.gatech.edu). If you have any questions, please email us at pace-support@oit.gatech.edu.

Resolved: Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We’ve addressed some of the problems with the TrueNAS storage and CSH/TCSH should now be working again. As it turns out, this problem wasn’t actually related to the maintenance last week, and we will continue to work with the vendor regarding the cause.

Apparent CSH and TCSH Problems

Posted on Monday, 14 August, 2017

We’ve observed a correlation between hanging processes on all PACE systems and csh/tcsh, and are continuing to investigate. For the time being, we ask that you refrain from running any commands related to csh or tcsh. The problem appears to be related to the TrueNAS storage system, and we’re working with OIT and iXsystems to resolve it.