
Author Archive

Jobs Accidentally Killed

Posted by on Friday, 14 February, 2014

Dear users,

At least 1,600 queued and running jobs were accidentally killed last week by a member of the PACE team who was trying to clear out their own jobs. PACE team accounts have elevated rights to certain commands, and the person who deleted the jobs did not realize that the command they were using would apply to more than just their own jobs.

If you have access to the iw-shared-6 queue and were running jobs and/or had jobs queued earlier this week, this accident has likely affected you.

Our deepest apologies for the unexpected and premature job terminations. To prevent this from happening again, we are re-evaluating the need to grant elevated permissions to our regular accounts.

Thank you,
PACE team

Login Problems

Posted by on Thursday, 6 June, 2013

Due to a problem with the PANFS storage system, it is currently not possible for regular users to log into PACE, with the exception of RHEL-5 Atlas users. We are working to resolve the problem as quickly as possible.

TSRB Connectivity Restored

Posted by on Wednesday, 12 December, 2012

Network access to the RHEL-5 Joe cluster compute nodes has been restored.

The problem was caused by a UPS power disruption to a network switch in the building. In addition to recovering the switch and UPS, the backbone team added power redundancy by installing a second PDU for the switch and connecting it to a different UPS.

TSRB Connectivity Problem

Posted by on Wednesday, 12 December, 2012

All of the RHEL-5 Joe nodes are currently unavailable, due to an unspecified connectivity problem at TSRB. This problem does not impact any joe-6 nodes, or nodes from any other group.

Since connectivity between Joe and the rest of PACE is required for home, project, and scratch storage access, all of the jobs currently running on Joe will eventually get stuck in an I/O-wait state, but they should resume once connectivity is restored.

pace-stat

Posted by on Friday, 31 August, 2012

In response to many requests for insight into the status of your queues, we've developed a new tool called 'pace-stat' (/opt/pace/bin/pace-stat).

When you run pace-stat, it displays a summary of all available queues, reporting for each queue:

– The number of jobs you have running, and the total number of running jobs
– The number of jobs you have queued, and the total number of queued jobs
– The total number of cores that all of your running jobs are using
– The total number of cores that all of your queued jobs are requesting
– The current number of unallocated cores free on the queue
– The approximate amount of memory/core that your running jobs are using
– The approximate amount of memory/core that your queued jobs are requesting
– The approximate amount of memory/core currently free in the queue
– The current percentage of the queue that has been allocated (by all running jobs)
– The total number of nodes in the queue
– The maximum wall-time for the queue

Please use pace-stat to help determine resource availability, and where best to submit jobs.
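As a rough illustration of the core-count and allocation-percentage figures pace-stat reports, here is a minimal sketch of the underlying arithmetic. The numbers are hypothetical; pace-stat itself gathers the real values from the scheduler.

```python
# Hypothetical queue figures -- pace-stat collects the real values
# from the scheduler; these are made up for illustration only.
total_cores = 512          # total cores across all nodes in the queue
allocated_cores = 384      # cores held by all currently running jobs

# Unallocated cores free on the queue
free_cores = total_cores - allocated_cores

# Percentage of the queue allocated by all running jobs
percent_allocated = 100.0 * allocated_cores / total_cores

print(free_cores)          # 128
print(percent_allocated)   # 75.0
```

With these hypothetical inputs, the queue would show 128 free cores and 75% allocation.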