PACE A Partnership for an Advanced Computing Environment

August 18, 2020

[Resolved/Monitoring] GPFS – network issues

Filed under: Uncategorized — Semir Sarajlic @ 6:41 pm
At approximately 2:00 AM today, we began experiencing a network issue on the connection between our GPFS filesystem (data and scratch directories) and about one third of PACE’s compute nodes in the Rich datacenter. Affected nodes are on the racks listed below. The rack is indicated by the second section of the node name (e.g., rich133-s40-20 or iw-s40-21 would be on rack s40):

b13, b14, b16, b17, c32, c34, c36, c38, g13, g14, g15, g16, g17, h31, h33, k35.
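For anyone who wants to check a node name against this list programmatically, here is a minimal Python sketch. It assumes node names always follow the hyphen-separated pattern shown in the examples above. The helper names `rack_of` and `is_affected` are hypothetical and are not part of any PACE tooling.

```python
# Racks listed as affected in this post.
AFFECTED_RACKS = {
    "b13", "b14", "b16", "b17", "c32", "c34", "c36", "c38",
    "g13", "g14", "g15", "g16", "g17", "h31", "h33", "k35",
}


def rack_of(node_name: str) -> str:
    """Return the rack, i.e. the second hyphen-separated section of a node name."""
    parts = node_name.strip().lower().split("-")
    if len(parts) < 3:
        # Assumption: valid node names look like <prefix>-<rack>-<number>.
        raise ValueError(f"unexpected node name format: {node_name!r}")
    return parts[1]


def is_affected(node_name: str) -> bool:
    """True if the node sits on one of the racks listed in this post."""
    return rack_of(node_name) in AFFECTED_RACKS
```

With the examples from this post, `rack_of("rich133-s40-20")` gives `"s40"`, which is not on the list, so `is_affected` returns `False` for that node.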

As a result of this network issue, users may have experienced slow reads and writes on GPFS directories from these nodes, which may also have affected MPI jobs running on them. We completed a repair late this afternoon, but the slowness could return, so we are continuing to monitor the system. Thank you to the users who reported the issue today via support tickets. Please continue to contact us if the slowness returns.
If your jobs have been running on the impacted nodes and are not producing output, please cancel and resubmit them. To check which nodes your job is running on, run the following command: qstat -u USER_NAME -n, replacing USER_NAME with your username, e.g., “qstat -u.
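If you save the output of that command to a file, a short script can pull out the node names and group them by rack for comparison with the racks named in this post. The Python sketch below is illustrative only. It assumes node names match the pattern seen in this post (prefix, rack, number, separated by hyphens), and the function name `nodes_by_rack` is hypothetical. It makes no assumption about the rest of the qstat output layout and simply scans the text for tokens that look like node names.

```python
import re
from collections import defaultdict

# Assumption: node names look like rich133-s40-20 or iw-s40-21,
# i.e. <prefix>-<letter + digits rack>-<number>.
NODE_PATTERN = re.compile(r"\b([a-z0-9]+-([a-z]\d+)-\d+)\b")


def nodes_by_rack(qstat_text: str) -> dict:
    """Group every node-like token found in the text by its rack."""
    racks = defaultdict(set)
    for node, rack in NODE_PATTERN.findall(qstat_text.lower()):
        racks[rack].add(node)
    # Sort node names so the result is stable and easy to read.
    return {rack: sorted(nodes) for rack, nodes in racks.items()}


# Made-up sample text for illustration, not real scheduler output.
sample = "rich133-s40-20/0+rich133-s40-20/1+iw-s40-21/0"
print(nodes_by_rack(sample))
```

Any rack key in the result that also appears in the affected list points to a job worth checking.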
If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
