PACE: A Partnership for an Advanced Computing Environment

April 27, 2020

OIT Network Services Team Firewall upgrades (5/5/2020)

Filed under: Uncategorized — Semir Sarajlic @ 9:02 pm

PACE has been informed that the OIT Network Services Team is preparing software upgrades on multiple firewall servers across the Georgia Institute of Technology Atlanta campus during three windows: 5/5/2020 20:00 – 23:59, 5/7/2020 20:00 – 23:59, and 5/8/2020 19:00 – 5/9/2020 02:00. While there is no direct impact on the Rich and Coda datacenter networks, connections to license servers may be interrupted, which can lead to job failures. Applications that may be impacted include:

  • Abaqus
  • Ansys
  • Comsol
  • Dymola
  • Matlab

as well as any other application whose license server is hosted outside PACE. Given the potential for interruptions, please check any jobs scheduled to run during these windows. PACE apologizes for any impact this may cause on your research workflow.

The Network Services Team will report the status of the project via status.gatech.edu. Please check blog.pace.gatech.edu for updates.

April 18, 2020

[Resolved again] Rich scratch mount down

Filed under: Uncategorized — Michael Weiner @ 6:35 pm

[Update 4/19/20 7:15 AM]

In coordination with our support vendor, we restored access to all scratch volumes at approximately 11:30 PM last night. Users on the affected scratch volumes should check any jobs that ran yesterday and resubmit any that failed.
We are continuing to work with the vendor to determine the source of the issue and to make hardware changes that will improve the reliability of the scratch system in Rich going forward. Thank you for your patience yesterday. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.


[Update 4/18/20 8:00 PM]

We are experiencing ongoing issues with our scratch filesystem. Users on volumes 1, 2, and 6 of scratch are currently unable to access their scratch directories. Volumes 0, 3, 4, 5, 7, 8, and 9 are unaffected.
You can identify your scratch volume by running “ll” in your home directory and looking at the destination of the scratch symbolic link. The volume is the digit 0-9 just before the final slash and your username at the end of the path.
e.g. “scratch -> /gpfs/scratch1/8/gburdell3” means that George is in scratch volume 8.
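For instance, a minimal check from a login node (a sketch; “ll” is commonly an alias for “ls -l”, and readlink is standard on Linux systems):

    # Print the destination of the scratch symlink
    readlink ~/scratch
    # Example output: /gpfs/scratch1/8/gburdell3  -> volume 8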

We are currently working to repair access to scratch and will update you when that is complete. We apologize for the continued disruption.


[Update 4/18/20 5:15 PM]

We have restored access to the GPFS-mounted scratch filesystem in Rich, and compute nodes are again online and accepting jobs.
During a routine disk swap this morning, one of the dual controllers needed to be restarted, which caused an unexpected disruption; the system was automatically taken offline to preserve data integrity. We have since recovered and verified the filesystem, and nodes are back online. Users should check any jobs that were running earlier today, especially those accessing scratch, and resubmit any that failed.
A few nodes will need additional fixes and remain offline. These will be released individually as they are repaired.
Please note that systems in Coda (Hive and testflight-coda) were unaffected. CUI/ITAR clusters in Rich were also unaffected.
Again, we apologize for the disruption. Please contact us at pace-support@oit.gatech.edu with any remaining concerns.


[Original Post]

The GPFS-mounted scratch system (~/scratch) in Rich is currently down again. This means that you cannot access your scratch directory at this time, and jobs writing to scratch will fail.
Due to the loss of the scratch mount, most PACE nodes are now marked “down or offline” to prevent new jobs from starting and failing.
We are working to restore the mount and will update you when a repair is in place. We apologize for the disruption.

PACE systems in Coda (Hive and testflight-coda) are unaffected.

April 15, 2020

[Resolved] Scratch inaccessible on datamover node

Filed under: Uncategorized — Michael Weiner @ 6:00 pm

[Update]

This issue has been resolved. We still encourage users to take advantage of Globus for an improved data transfer experience.

[Original Post]

While the scratch filesystem is once again available on the login and compute nodes, it remains inaccessible on the datamover node (iw-dm-4), which many of you use to access your files via the scp or sftp protocols. Your data directories are still available there.

We always encourage you to use Globus instead of scp or sftp, and it is the best workaround at this time for moving files between scratch and non-PACE locations. For instructions on using Globus, please visit http://docs.pace.gatech.edu/storage/globus/. The datamover node may eventually be decommissioned, so now is a good time to begin using Globus if you have not already done so. Please contact us at pace-support@oit.gatech.edu if you have any questions. We apologize for the ongoing disruption.
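For those who prefer a terminal, a minimal sketch using the Globus CLI follows (this assumes the CLI is installed and you have a Globus account; the endpoint UUIDs and paths are placeholders, not real PACE identifiers):

    # One-time authentication with your Globus account
    globus login
    # Recursively transfer a scratch directory to another endpoint
    globus transfer SRC_ENDPOINT_UUID:/gpfs/scratch1/8/gburdell3/results \
        DST_ENDPOINT_UUID:/home/gburdell3/results --recursive

The web interface described in the documentation above accomplishes the same transfers without any installation.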

April 14, 2020

[RESOLVED] Rich data center storage problems (/usr/local) – Paused Jobs

Filed under: Uncategorized — Semir Sarajlic @ 8:23 pm

Dear PACE Users,

At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include:

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of recently completed jobs to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. One way to check for such terminations is sketched below. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.
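A minimal sketch (the message text matches Torque’s usual walltime-kill notice, and the .o/.e file naming is the scheduler default; the directory is a placeholder for wherever your job output lands):

    # List recent job output/error files that mention a walltime kill
    grep -l "walltime" ~/my_job_dir/*.o* ~/my_job_dir/*.e* 2>/dev/null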

Thank you.

[Original Message]

In addition to this morning’s ongoing project/data and scratch storage problems, the fileserver that serves the shared “/usr/local” filesystem on all PACE machines in the Rich Data Center began experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications in the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.

At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

This storage problem and scheduler pause do not impact the Coda data center’s Hive and TestFlight-Coda clusters.

We are working to resolve these problems ASAP and will keep you updated on this post.

[RESOLVED] Rich Data/Project and Scratch Storage Slow Performance

Filed under: Uncategorized — Aaron Jezghani @ 1:21 pm

[RESOLVED]:
At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to functionality. The problems addressed over the course of this fix include:

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives due to a jostled SAS cable on a drive replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that due to failed storage access, jobs running for the duration of this outage may have failed. Please inspect the results of jobs completed recently to ensure correctness; if an unexplained failure occurred (e.g. the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.

Thank you.

[UPDATE]:
The issues from this morning’s storage problems are still ongoing. At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, due to the slowness in accessing data and software repositories, jobs that were running may fail due to reaching wallclock limits or other errors. Updates will continue to be posted on the blog as they are available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[Original Post]:
We have identified slow performance in the Rich data/project and scratch storage volumes. Jobs utilizing these volumes may experience problems, so please verify results accordingly. We are actively working to resolve the issue.

April 8, 2020

PACE License Manager and Server Issues

Filed under: Uncategorized — Aaron Jezghani @ 1:56 pm

Overnight, we experienced issues with several of our servers, including our license manager, the GTLib server, and the Testflight and Novazohar queues. We are actively addressing the problem and have restored functionality to the license manager and Novazohar. We are still working on Testflight and will provide updates as they are available. As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

April 3, 2020

Hive Cluster — Scheduler modifications/Policy Update

Filed under: Uncategorized — Semir Sarajlic @ 10:00 pm

Dear Hive Users,

The Hive cluster has been in production for over half a year, and we are pleased with the continued growth of its user community and its consistently high utilization. As the cluster has begun to near 100% utilization more frequently, we have received feedback from users that compels us to make additional changes to ensure continued productivity for everyone on Hive. The Hive PIs have approved the following changes, which will be deployed on April 10:

  1. Hive-gpu-short: We are creating a new GPU queue with a maximum walltime of 12 hours. This queue will consist of 2 nodes migrated from the hive-gpu queue. It will address the longer wait times that some users experienced on hive-gpu and will support users with short, interactive, or machine-learning jobs as they develop and grow on this cluster.
  2. Adjust dynamic priority: We will adjust the dynamic priority to account for PI groups in addition to individual users. This will give each research team an equal and fair opportunity to access the cluster.
  3. Hive-interact: We will reduce the hive-interact queue from 32 nodes to 16, as its utilization has been low.

Who is impacted:  All Hive users will be affected by the adjustment to the dynamic priority.

User Action:  To use the new hive-gpu-short queue, update your PBS scripts to request the new queue with a walltime of no more than 12 hours; a sketch follows below.
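As a minimal sketch, a PBS script for the new queue might look like the following (the queue name and 12-hour cap come from this announcement; the job name, resource request, and program are illustrative):

    #PBS -N gpu-short-example        # job name (illustrative)
    #PBS -q hive-gpu-short           # the new queue
    #PBS -l nodes=1:ppn=4:gpus=1     # example resource request
    #PBS -l walltime=12:00:00        # must not exceed the 12-hour limit

    cd $PBS_O_WORKDIR
    ./my_program                     # placeholder for your executable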

Our documentation at http://docs.pace.gatech.edu/hive/gettingStarted/ will be updated on April 10 to reflect these queue changes. If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.

Best,
The PACE Team
