PACE A Partnership for an Advanced Computing Environment

June 26, 2012

upcoming maintenance day, 7/17 – please test your codes

Filed under: tech support — admin @ 6:38 pm

It’s that time of the quarter again: all PACE-managed clusters will be taken offline for maintenance on Tuesday, July 17. Jobs that cannot complete before then will be held by the scheduler and released automatically once the clusters are up and running again, with no further action required on your end. If you find that a job does not start running, check its walltime to make sure the job would finish before that date.
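As a rough illustration of the walltime check, here is a minimal Python sketch that tests whether a job started at a given time would finish before maintenance begins. It rests on a few assumptions that are not stated in this post: maintenance is taken to start at midnight on July 17 (only the date is given), walltimes are written as colon-separated fields ([[DD:]HH:]MM:]SS), and the function names are made up for this example. The scheduler makes the actual decision, so treat this only as a sanity check.

```python
from datetime import datetime, timedelta

# Assumption: maintenance starts at 00:00 on July 17, 2012.
# The announcement gives only the date, not a time.
MAINTENANCE_START = datetime(2012, 7, 17, 0, 0, 0)


def parse_walltime(walltime: str) -> timedelta:
    """Convert a walltime such as '72:00:00' or '3:12:00:00' to a timedelta.

    Assumed format: up to four colon-separated integer fields, read from the
    right as seconds, minutes, hours, days.
    """
    parts = [int(p) for p in walltime.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"unrecognized walltime: {walltime!r}")
    # Left-pad with zeros so the fields are always days, hours, minutes, seconds.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def fits_before_maintenance(start: datetime, walltime: str,
                            maintenance: datetime = MAINTENANCE_START) -> bool:
    """Return True if a job starting at `start` would end by `maintenance`."""
    return start + parse_walltime(walltime) <= maintenance


if __name__ == "__main__":
    start = datetime(2012, 7, 15, 12, 0)
    for requested in ("24:00:00", "48:00:00"):
        verdict = "fits" if fits_before_maintenance(start, requested) else "would be held"
        print(f"walltime {requested} starting {start}: {verdict}")
```

If a job would be held under this check, requesting a shorter walltime (when the job can genuinely finish in that time) may let it start before maintenance day.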

With this maintenance, we are upgrading our RedHat 6 clusters to RedHat 6.2, which includes many bugfixes and performance improvements. This version is known to provide better software and hardware integration with our systems, particularly with the 64-core nodes we have been adding over the last year.

We are doing our best to test existing codes with the new RedHat 6.2 stack. In our experience, codes currently running on our RedHat 6 systems continue to run without problems. However, we strongly recommend that you test your critical codes on the new stack. For this purpose, we have rebuilt the “testflight” cluster to include RedHat 6.2 nodes, so all you need to do for testing is submit your RedHat 6 jobs to the “testflight” queue. If you would like to recompile your code, please log in to the testflight-6.pace.gatech.edu head node. Please keep the problem sizes small, since this cluster includes only about 14 nodes with 16 to 48 cores each, plus a single 64-core node. We have limited this queue to two jobs at a time per user. We hope the testflight cluster will be sufficient to test drive your codes, but if you have any concerns or notice any problems with the new stack, please let us know at pace-support@oit.gatech.edu.

We will also upgrade the software on the Panasas scratch storage. Under high load, we have observed many ‘failover’ events that briefly interrupt service and can incur performance penalties on running codes. The new version is expected to help address these issues.

We have new storage systems for Athena (/nv/pz2) and Optimus (/nv/pb2). During maintenance day, we will move these filesystems off temporary storage and onto their new servers.

More details will be forthcoming on other maintenance day activities, so please keep an eye on our blog at https://blog.pace.gatech.edu/

Thank you for your cooperation!

-PACE Team

June 11, 2012

Scheduler Problems

Filed under: tech support — Semir Sarajlic @ 5:14 pm

The job scheduler is currently under heavy load (heavier than any we have seen so far).

Any commands you run to query the scheduler (showq, qstat, msub, etc.) will probably fail because the scheduler can’t respond in time.

We are working feverishly to correct the problem.
