PACE A Partnership for an Advanced Computing Environment

July 18, 2012

maintenance day complete, ready for jobs

Filed under: tech support — admin @ 6:14 am

We are done with maintenance day; however, some automated nightly processes still need to run before jobs can flow again.  So, I’ve set an automated timer to release jobs at 4:30am today, a little over two hours from now.  The scheduler will accept new jobs now, but will not start executing them until 4:30am.
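
If you want to queue work up in the meantime, here is a minimal sketch, assuming the PBS-style tools (qsub/qstat) on the head nodes; ‘myjob.pbs’ is a placeholder for your own job script:

  # submit now; the job will sit in the queued ("Q") state until the
  # scheduler begins releasing jobs at 4:30am
  qsub myjob.pbs

  # check on it while you wait
  qstat -u $USER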

 

With the exception of the following two items, all of the tasks listed in our previous blog post have been accomplished.

  • firmware updates on the scratch servers were deferred per the strong recommendation of the vendor
  • an experimental software component of the scratch system was not tested due to the lack of a test plan from the vendor.

 

SSH host keys have changed on the following head nodes.  Please accept the new keys into your preferred SSH client (see the example after the list for OpenSSH users).

  • atlas-6
  • atlas-post5
  • atlas-post6
  • atlas-post7
  • atlas-post8
  • atlas-post9
  • atlas-post10
  • apurimac
  • biocluster-6
  • cee
  • critcel
  • cygnus-6
  • complexity
  • cns
  • ece
  • granulous
  • optimus
  • math
  • prometheus
  • uranus-6
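
If you use OpenSSH (Linux/Mac), one way to clear the cached old key is with ssh-keygen.  This is a minimal sketch, assuming the fully-qualified names end in .pace.gatech.edu; use whatever hostname your client actually connects with:

  # remove the stale cached host key for a head node
  ssh-keygen -R atlas-6.pace.gatech.edu

  # the next connection will prompt you to accept the new key
  ssh yourusername@atlas-6.pace.gatech.edu

Repeat the ssh-keygen step for each head node you use.  Graphical clients such as PuTTY will simply warn about the changed key and let you accept it on the next connection.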

July 17, 2012

10TB soft quota per user on scratch storage

Filed under: tech support — Semir Sarajlic @ 2:59 pm

One of the many benefits of using PACE clusters is the scratch storage, which provides a fast filesystem for I/O-bound jobs. The scratch system is designed for high speed rather than large storage capacity. So far, a weekly script that deletes all files older than 60 days has allowed us to sustain this service without the need for disk quotas. However, this situation started changing as the PACE clusters grew to a whopping ~750 active users, with ~300 of them added just since Feb 2011. Consequently, it became common for scratch utilization to reach 98%-100% on several volumes, which is alarming for the health of the entire system.

We are planning to address this issue with a 2-step transition plan for enabling file quotas. The first step will be applying 10TB “soft” quotas for all users for the next 3 months. A soft quota means that you will receive warning emails from the system if you exceed 10TB, but your writes will NOT be blocked. This will help you adjust your data usage and prepare for the second step: 10TB “hard” quotas that will block writes once the quota is exceeded.

Considering that the total scratch capacity is 260TB, a 10TB quota for 750 users is a very generous limit. Looking at current statistics, fewer than 10 users exceed this capacity. If you are one of these users (you can check using the command ‘du -hs ~/scratch’) and have concerns that the 10TB quota will adversely impact your research, please contact us (pace-support@oit.gatech.edu).
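
For reference, a quick way to check where you stand is shown below; this is just an illustrative sketch, assuming your scratch space is reachable at ~/scratch (the trailing slash matters if ~/scratch is a symlink):

  # total scratch usage for your account
  du -hs ~/scratch/

  # files older than 60 days -- candidates for early cleanup, since the
  # weekly 60-day purge will remove them anyway
  find ~/scratch/ -type f -mtime +60 -print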

July 10, 2012

REMINDER – upcoming maintenance day, 7/17

Filed under: tech support — admin @ 8:15 pm

The major activity for maintenance day is the RedHat 6.1 to RedHat 6.2 software update.  (Please test your codes!!)  This will affect a significant portion of our user base.  We’re also instituting soft quotas on the scratch space.  Please see the details below.

The following are running RedHat 5, and are NOT affected:

  • Athena
  • Atlantis

The following have already been upgraded to the new RedHat 6.2 stack.  We would appreciate reports on any problems you may have:

  • Monkeys
  • MPS
  • Isabella
  • Joe-6
  • Aryabhata-6

If I didn’t mention your cluster above, you are affected by this software update.  Please test using the ‘testflight’ queue; jobs are limited to 48 hours in this queue.  If you would like to recompile your software with the 6.2 stack, please log in to the ‘testflight-6.pace.gatech.edu’ head node.
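
For those who have not used the testflight queue before, here is a minimal job script sketch, assuming a PBS-style scheduler; the job name, resource request, and executable are placeholders to replace with your own:

  #!/bin/bash
  #PBS -N testflight-check          # placeholder job name
  #PBS -q testflight                # the test queue described above
  #PBS -l nodes=1:ppn=4             # adjust to match your code
  #PBS -l walltime=24:00:00         # anything up to the 48-hour queue limit

  cd $PBS_O_WORKDIR
  ./my_code input.dat               # placeholder for your executable

Submit it with ‘qsub’ from a head node and confirm that your code behaves the same under the 6.2 stack as it did under 6.1.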

Other activities we have planned are:

Relocating some project directory servers to an alternate data center on campus.  We have strong network connectivity, so this should not change the performance of these filesystems.  No user modifications are needed.

  • /nv/hp3 – Joe
  • /nv/pb1 – BioCluster
  • /nv/pb3 – Apurimac
  • /nv/pc1 – Cygnus
  • /nv/pc2 – Cygnus
  • /nv/pc3 – Cygnus
  • /nv/pec1 – ECE
  • /nv/pj1 – Joe
  • /nv/pma1 – Math
  • /nv/pme1 – Prometheus
  • /nv/pme2 – Prometheus
  • /nv/pme3 – Prometheus
  • /nv/pme4 – Prometheus
  • /nv/pme5 – Prometheus
  • /nv/pme6 – Prometheus
  • /nv/pme7 – Prometheus
  • /nv/pme8 – Prometheus
  • /nv/ps1 – Critcel
  • /nv/pz1 – Athena

Activities on the scratch space; no user changes are expected for any of these:

  • We need to balance some users on volumes v3, v4, v13 and v14.  This will involve moving users from one volume to another, but we will place links in the old locations.
  • Run a filesystem consistency check on the v14 volume.  This has the potential to take a significant amount of time.  Please watch the pace-availability email list (or this blog) for updates if this will take longer than expected.
  • firmware updates on the scratch servers to resolve some crash & failover events that we’ve been seeing.
  • institute soft quotas.  Users exceeding 10TB of usage on the scratch space will receive automated warning emails, but writes will be allowed to proceed.  Currently, this will affect 6 of 750+ users.  The 10TB space represents about 5% of a rather expensive shared 215TB resource, so please be cognizant of the impact on other users.

Retirement of old filesystems.  User data will be moved to alternate filesystems.  Affected filesystems are:

  • /nv/hp6
  • /nv/hp7

Performance upgrades (hardware RAID) for the NFSroot servers of the Athena cluster. Previous maintenance activities have already upgraded other clusters.

Moving some filesystems off of temporary homes and onto new servers.  Affected filesystems are:

  • /nv/pz2 – Athena
  • /nv/pb2 – Optimus

If time permits, we have a number of other “targets of opportunity”:

  • relocate some compute nodes and servers, removing retired systems
  • reworking a couple of Infiniband uplinks for the Uranus cluster
  • add resource tags to the scheduler so that users can better select compute node features/capabilities from their job scripts (see the sketch after this list)
  • relocate a DNS/DHCP server for geographic redundancy
  • fix system serial numbers in the BIOS for asset tracking
  • test a new Infiniband subnet manager to gather data for future maintenance day activities
  • rename some ‘twin nodes’ for naming consistency
  • apply BIOS updates to some compute nodes in the Optimus cluster to facilitate remote management
  • test an experimental software component of the scratch system.  Panasas engineers will be onsite to do this and revert before going back into production.  This will help gather data and validate a fix for some other issues we’ve been seeing.
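
To illustrate the resource tags item above: once tags are in place, selecting a node feature from a job script might look something like the following.  This is a hypothetical sketch assuming Torque-style node properties; the actual tag names will be announced when the feature is available.

  # request 2 nodes with 8 cores each, restricted to nodes carrying a
  # hypothetical "bigmem" tag
  #PBS -l nodes=2:ppn=8:bigmem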
