PACE: A Partnership for an Advanced Computing Environment

January 17, 2014

Scratch Quota Policies are Changing

Filed under: tech support — Semir Sarajlic @ 6:50 pm

We would like to give you a heads-up about some upcoming adjustments to the scratch space quotas.

Current policy is a 10TB soft quota and a 20TB hard quota. Given the space problems we've been having with scratch storage, we will be adjusting this to a 5TB soft quota and a 7TB hard quota. This change should only affect a small handful of users. Given the close proximity to this week's maintenance, we will be making this change at the end of January. This is an easy first step that we can take to start addressing the recent lack of space on scratch storage. We are looking at a broad spectrum of other policy and technical changes, including changing retention times, improving our detection of "old" files, and increasing capacity. If you have any suggestions for other adjustments to scratch policy, please feel free to let us know (pace-support@oit.gatech.edu).

Please remember that the scratch space is intended for transient data, not as a long-term place to keep things.
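If you would like to get a head start on cleaning up, a couple of standard Linux commands can show you what is occupying your scratch space. The ~/scratch path below is only an assumption; substitute the actual path of your scratch directory.

  # total usage of your scratch directory (path is an assumption; adjust as needed)
  du -sh ~/scratch

  # list files not modified in the last 60 days, with sizes
  find ~/scratch -type f -mtime +60 -exec ls -lh {} +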

January 16, 2014

January Maintenance is over

Filed under: tech support — Semir Sarajlic @ 1:09 am

January maintenance is complete, and the clusters have resumed accepting and running jobs. We accomplished all of our primary objectives, and even found time to address a few bonus items.

Most importantly, we completed updating the resource and scheduling managers (Torque and Moab) throughout the entire PACE realm. This upgrade should bring visible improvements in speed and reliability. Please note that the job submission process will show some differences after this update, so we strongly encourage you to read the transition guide here: http://www.pace.gatech.edu/job-submissionmanagement-transition-guide-jan-2014
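For those new to the Torque/Moab tool set, the sketch below shows a generic submission workflow. The script contents, queue name, and resource requests are placeholders only; the transition guide linked above remains the authoritative reference for any PACE-specific differences.

  #!/bin/bash
  # myjob.pbs -- a minimal, generic Torque script (placeholder values throughout)
  #PBS -N myjob                  # job name
  #PBS -q iw-shared-6            # example queue; use the queue appropriate for your cluster
  #PBS -l nodes=1:ppn=4          # one node, four processors per node
  #PBS -l walltime=01:00:00      # one hour wall-clock limit
  cd $PBS_O_WORKDIR              # run from the directory the job was submitted from
  ./my_program                   # placeholder executable

  qsub myjob.pbs                 # submit the job to Torque
  qstat -u $USER                 # Torque's view of your queued and running jobs
  checkjob <jobid>               # Moab's detailed view of a single job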

Also, please make sure to check the FAQ for common problems and their solutions by running the following command on your headnode:  jan2014-faq  (use the spacebar to advance through the pages).

We had a hardware failure in the DDN storage system, which caused an interruption in the planned Biocluster data transfer. We expect to receive the replacement parts and repair the system within a few days. This failure has not caused any data loss, and the system will remain up and running in the meantime (perhaps with some performance degradation). We have learned that the repairs will require a short downtime, and we will soon be in touch with the users of the Gryphon, Biocluster, and Skadi clusters (the current users of this system) to schedule this work.

Other accomplishments include:

– Optimus is now a shared cluster. All Optimus users now have access to optimusforce-6 and iw-shared-6.

– All of the Atlas nodes have been upgraded to RHEL6.

– Most of the Athena nodes have been upgraded to RHEL6.

– The old scheduler server (repace) has been replaced with an upgraded one (shared-sched). You may notice a difference in the generated job numbers and files.

– Some networking cable cleanup and improvements

– Gryphon has new scheduler and login servers, and the nodes previously used for these purposes have been returned to the computation pool.

– Project file space quotas, as previously agreed upon with the PIs, have been deployed to users who did not have quotas prior to maintenance. For users who were already over the new limit, quotas were adjusted to allow some headroom. To check your quotas, use "quota -s" (see the example below).
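For reference, a quota check looks roughly like the following; the exact output columns depend on the quota version installed on your headnode.

  quota -s
  # For each filesystem with quotas applied, this reports your current usage,
  # your soft quota ("quota"), and your hard limit ("limit") in human-readable
  # units, along with any grace period if you are over the soft quota.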

January 14, 2014

January Maintenance under way

Filed under: tech support — Semir Sarajlic @ 11:01 am

The January maintenance period has begun.
All clusters will be inaccessible until maintenance is over.

January 9, 2014

Reminder – January Maintenance

Filed under: tech support — admin @ 2:00 pm

Hi folks,

Just a reminder of our upcoming maintenance activities next week. Please see my previous blog post for details: https://blog.pace.gatech.edu/?p=5449

In addition to the items described in the previous post, we will also be fixing up quotas on home and project directories for some users who currently have no quotas applied. Per policy, all users should have a 5GB quota on their home directory. A preliminary look through our accounts indicates that only one or two users have no quota applied here and are over the 5GB limit. We will be in touch with those users shortly to address the issue.

Project directory quotas are sized at the discretion of the faculty. For those users without a quota on their project directory, we will apply a quota sized such that all users remain under it. After the maintenance, we will provide a report to faculty detailing the project directory usage of their users, and work with them to make any needed adjustments. Remember, the project directory quotas are simply intended to prevent accidental consumption of space that would negatively impact the work of other users of that storage.
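If you would like to see where you stand ahead of time, your current home directory usage can be checked with standard commands from any headnode; these are generic Linux tools rather than anything PACE-specific.

  # total size of your home directory, for comparison against the 5GB quota
  du -sh $HOME

  # largest top-level items (dotfiles excluded), sizes in KB, smallest to largest
  du -sk $HOME/* | sort -n | tail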

Related to the home and project quotas, I'd also like to give you a heads-up about some upcoming adjustments to the scratch space quotas. Current policy is a 10TB soft quota and a 20TB hard quota. Given the space problems we've been having with scratch storage, we will be adjusting this to a 5TB soft quota and a 7TB hard quota. This change should only affect a small handful of users. Given the close proximity to our maintenance next week, we will be making this change at the end of January, NOT next week. This is an easy first step that we can take to start addressing the recent lack of space on scratch storage. We are looking at a broad spectrum of other policy and technical changes, including changing retention times, improving our detection of "old" files, and increasing capacity. If you have any suggestions for other adjustments to scratch policy, please feel free to let me know. Please remember that the scratch space is intended for transient data, not as a long-term place to keep things.

Finally, we will also complete the upgrade of the remaining RHEL5 portions of the Atlas cluster to RHEL6. Likewise, we will continue the migration of the Athena cluster from RHEL5 to RHEL6, leaving only a few nodes on RHEL5.
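If you are unsure which release a node you use is currently running, you can check directly on that node; this is a generic RHEL check rather than a PACE-specific tool.

  # reports the installed Red Hat Enterprise Linux release on the current node
  cat /etc/redhat-release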

 

–Neil Bright
