The next maintenance day (4/16, Tuesday) is just around the corner and we would like to remind you that all systems will be powered off for the entire day. You will not be able to access the headnodes, compute nodes or your data until the maintenance tasks are complete.
None of your jobs will be killed, because the job scheduler knows about the planned downtime and will not start any job that would still be running by then. You may want to check the walltimes of the jobs you plan to submit and adjust them so that they complete before the maintenance day, if possible. Submitting jobs with longer walltimes is still OK, but they will be held by the scheduler and released right after the maintenance day.
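If it helps, the walltime arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `max_walltime`, the 30-minute safety margin and the dates in the usage note are hypothetical, and the exact time at which the downtime begins is an assumption you need to supply yourself. The result uses the HH:MM:SS form commonly used for walltime requests.

```python
from datetime import datetime, timedelta


def max_walltime(now, maintenance_start, margin=timedelta(minutes=30)):
    """Return the longest walltime (HH:MM:SS) that still finishes
    `margin` before `maintenance_start`.

    All three arguments are assumptions supplied by the caller; the
    scheduler's own view of the downtime is what actually decides
    whether a job starts or is held.
    """
    remaining = maintenance_start - now - margin
    total_seconds = int(remaining.total_seconds())
    if total_seconds <= 0:
        raise ValueError("no time left before the maintenance window")
    # Split the remaining seconds into hours, minutes and seconds.
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

For example, with placeholder dates, `max_walltime(datetime(2000, 1, 1, 8, 0), datetime(2000, 1, 3, 0, 0))` returns `"39:30:00"`: 40 hours until the downtime, minus the 30-minute margin.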
We have many tasks to complete, and here’s a summary:
1) Job Resource Manager/Scheduler maintenance
Contrary to the initial plan, we decided NOT to upgrade the resource manager (torque) and job scheduler (moab) software yet. We have been testing the new versions of this software (with your help) and, unfortunately, identified significant bugs/problems along the way. Despite being old, the current versions are known to be robust, so we will maintain the status quo until we resolve all of the problems with the vendor.
2) Interactive login prevention mechanism
Ideally, compute nodes should not allow interactive logins unless the user has active jobs on the node. We noticed, however, that some users can ssh directly to compute nodes and start jobs. This may lead to resource conflicts and unfair use of the cluster. We identified the problem and will apply the fix on this maintenance day.
3) Continued RHEL6 migration
We are planning to convert all of the remaining Joe nodes to RHEL6 in this cycle. We will also convert 25% of the remaining RHEL5 FoRCE nodes. We are holding off on the migration for the Aryabhata and Atlas clusters at the request of those communities.
4) Hardware installation and configuration
We noticed that some of the nodes in the Granulous, Optimus and FoRCE clusters are still running diskless, although they have local disks. Some nodes are also not using the optimal choice for their /tmp. We will fix these problems.
We received (and tested) a replacement for the fileserver for the Apurimac project storage (pb3), since we have been experiencing problems there. We will install the new system and swap the disks. This is just a mechanical process and your data is safe. As an extra precaution, we have been taking incremental backups (in addition to the regular backups) of this storage since it first started showing signs of failure.
5) Software/Configurations
We will also patch/update/add software, including:
- Upgrade the node health checker scripts
- Deploy new database-based configuration makers (in dry-run mode for testing)
- Reconfigure licensing mechanism so different groups can use different sources for licenses
6) Electrical Work
We will also perform some electrical work to better accommodate the recent and future additions to the clusters. We will replace some problematic PDUs and redistribute the power among racks.
7) New storage from Data Direct Networks (DDN)
Last, but not least! In concert with a new participant, we have procured a new high performance storage system from DDN. In order to make use of this multi-gigabyte/sec performing monster, we are installing the GPFS filesystem. This is a commercial filesystem which PACE is funding. We will continue to operate the Panasas in parallel with DDN, and both storage systems can be used at the same time from any compute node. We are planning a new storage offering that allows users to purchase additional capacity on this system, so stay tuned.
As always, please contact us at pace-support@oit.gatech.edu with any questions/concerns you may have.
Thank you!
PACE Team