
[RESOLVED] Temporary unavailability of home directories

Wednesday, September 19, 2018

At around 6:10pm on Sep 19, 2018, the storage servers that export PACE home directories and the software repository experienced a problem. We identified and resolved the issue within 15 minutes of the event.

This problem caused temporary unavailability of home directories and applications. Symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation after the issue was resolved, but we cannot be certain. Please check your jobs for any crashes and report any problems you notice to pace-support@oit.gatech.edu
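If you would like to check on your jobs from a headnode, here is a minimal sketch using the qstat and showq commands mentioned elsewhere on this blog (the exact output depends on your scheduler and queue):

    # list your own jobs and their states (R = running, C = completed, Q = queued)
    qstat -u $USER

    # alternatively, ask the scheduler for your active and queued jobs
    showq -u $USER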


Testflight queue transition and unavailability

Wednesday, September 12, 2018

As you know, the testflight queue includes nodes that are reserved for testing the systems/services that are planned to be deployed in the future.

As part of our preparation for the transition to the next OS (RHEL7), we will take this queue offline, swap its nodes with newly purchased nodes (which better represent the systems currently in use), and finally deploy RHEL7 on these new nodes.

Once these preparations are complete, we’ll reach out to you and ask you to test your codes. Until then, testflight will not be available and submissions will be declined.

There are currently some jobs running on this queue. We will let these jobs complete rather than kill them, but we would like to emphasize once again that using testflight for production work is against policy. This queue should only be used for testing purposes.

Please let us know if you have any questions.

[Resolved] File locking issues causing hanging in codes and login troubles

Thursday, September 6, 2018

If you have been observing mysteriously hanging codes, or trouble logging in on headnodes, please read on!

We started receiving reports of hanging processes, mostly from GPU codes. In addition, users whose default shell is tcsh/csh had difficulty logging into nodes.

Upon further investigation, we found that a storage problem was affecting the file-locking mechanism on home directories (where most applications keep their configuration files, regardless of where they run).

This problem was very subtle, as it impacted only a small number of processes, and data operations otherwise appeared to be working well.

We addressed this issue this morning (9/6, 10am), and you should no longer see hanging codes. Please report any ongoing issues to pace-support@oit.gatech.edu.

[RESOLVED] Scratch storage problems

Tuesday, August 14, 2018
Update (08/15/2018): As suspected, internal data migrations were not happening automatically. We worked with the vendor to address the issue and it’s now once again safe to use the scratch storage. We’ll keep on monitoring the utilization just in case.
Original post:
We have received multiple reports of jobs crashing due to insufficient scratch storage, although physical usage is only at 38%.
We suspect that this issue is related to some disk pools that are not able to migrate data to other pools internally.
We are currently looking into this problem. In the meantime, we recommend not using the scratch space if possible, until we have a better understanding of the situation.
Thank you, and sorry for this inconvenience.

[COMPLETE] PACE quarterly maintenance – (Aug 9-11, 2018)

Monday, July 30, 2018

Update (Aug 10, 2018, 8:00pm): Our Aug 2018 maintenance is complete, one day ahead of schedule. All tasks were completed as planned. We have brought the compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes that we will address over the coming days.

Please note the important changes regarding decommissioned login nodes, including the commonly used force-6 headnode.
Our next maintenance period is scheduled for Thursday, Nov 1 through Saturday, Nov 3, 2018.
Original message:

The next PACE maintenance will start on 8/9 (Thu) and may take up to 3 days to complete, as scheduled.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs are running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. If a shorter walltime would still give your job enough time to complete successfully, you can reduce the requested walltime so that the job finishes before 6am on 8/9 and resubmit it.
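For example, here is a minimal sketch of a Torque/PBS job script with a reduced walltime; the job name, resource request, and application are hypothetical placeholders:

    #PBS -N myjob
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=12:00:00    # reduced so the job can finish before 6am on 8/9

    cd $PBS_O_WORKDIR
    ./my_application             # hypothetical executable

After shortening the walltime, resubmit the script with qsub (e.g., qsub myjob.pbs).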

Planned Tasks

Headnodes

  • (some user action needed) Most PACE headnodes (login nodes) are currently virtual machines (VMs) whose sub-optimal storage performance is often the cause of slow response times.

We are in the process of replacing these VMs with more capable physical servers. After the maintenance, login attempts to the old VMs will be rejected with a message telling you which hostname to use instead. In addition, we are sending each user a customized email listing their old and new login nodes. Please remember to configure your SSH clients to use the new hostnames.

In short, “login-s.pace.gatech.edu” will be used for all shared clusters and “login-d.pace.gatech.edu” for dedicated clusters. Once you log in, you will be redirected automatically to one of several physical nodes (e.g., login-s1, login-d2, …) depending on their current load. A sample SSH client configuration is shown after this list.

There will be no changes to clusters that already come with a dedicated (and physical) login node (e.g. gryphon, asdl, ligo, etc.).

  • (some user action needed) As some users have already noticed, user cron jobs can no longer be edited (e.g., crontab -e) on the headnodes. This is intentional: access to the new login nodes (login-d and login-s) is dynamically routed to different servers depending on their load, so cron jobs you install might not be visible the next time you log in to one of these nodes. For this reason, only PACE admins can install cron jobs on behalf of users to ensure consistency (only login-d1 and login-s1 will be used for cron jobs). If you need to add or edit cron jobs, please contact pace-support@oit.gatech.edu. If you already have user cron jobs set up on one of the decommissioned VMs, they will be moved to login-d1 or login-s1 during the maintenance so that they continue to run.
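As an illustration, here is a minimal ~/.ssh/config sketch for pointing an SSH client at the new hostnames; the alias names and username are hypothetical placeholders, and you should use the hostnames listed in your customized email:

    # ~/.ssh/config
    Host pace-shared
        HostName login-s.pace.gatech.edu
        User your_gt_username         # hypothetical placeholder

    Host pace-dedicated
        HostName login-d.pace.gatech.edu
        User your_gt_username         # hypothetical placeholder

You can then connect with, for example, “ssh pace-shared” and be routed to one of the physical login nodes automatically.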

Storage

  • (no user action needed) Add a dedicated protocol node to the GPFS system to increase capacity and improve response times for systems that are not connected via InfiniBand. This system will gradually replace the IB gateway systems that are currently in operation.
  • (no user action needed) Replace batteries in DDN/GPFS storage controllers

Network

  • (no user action needed) Upgrades to the DNS appliances in both PACE datacenters
  • (no user action needed) Add redundant storage links to specific clusters

Other

  • (no user action needed) Perform network upgrades
  • (no user action needed) Replace devices that are out of support

[Resolved] Shared scheduler problems

Sunday, July 22, 2018
Update (07/22/2018, 2:30am): The scheduler is back in operation after we cleared a large number of jobs submitted by a user. We’ll continue to monitor the system for similar problems and work with users to normalize their workflows.
The shared scheduler has been experiencing difficulties, which appear to be due to a large number of job arrays submitted recently. We do not know the exact cause yet, but we are aware of the problem and are currently working on a resolution.
Until this issue is resolved, commands like qsub and qstat will not work, and showq will return an incomplete list of jobs.
This problem only affects job submission and monitoring; your running and queued jobs are otherwise safe.

The PACE Scratch storage just got faster!

Friday, July 20, 2018
We have made some improvements to the scratch file system, namely by adding SSD drives for faster metadata management and data storage. We are pleased to report that this strategic allocation of a relatively small number of SSDs yielded impressive performance improvements, more than doubling write and read speeds (according to our standard benchmarks).
This work, performed under the guidance of the vendor, didn’t require any downtime and no jobs were impacted.
We hope you’ll enjoy the increased performance for faster, better research!


[Resolved] Datacenter cooling problem with potential impact on PACE systems

Friday, June 29, 2018

Update (06/29/2018, 3:30 pm): We’re happy to report that the issues with the cooling systems have been largely addressed without any visible impact on systems or running jobs. The schedulers have resumed and are allocating new jobs as they are submitted. There is more work to be done to resolve the issue fully, but it can be performed without any disruption to services. You may continue to use PACE systems as usual. If you notice any problems, please contact pace-support@oit.gatech.edu

For a related status update from OIT, please see: https://status.gatech.edu/incidents/0ykh9wwnw50j

Original post:

The operations team notified PACE of cooling problems that started around noon today, impacting the datacenter that houses the storage and virtual machine infrastructure. We immediately started monitoring temperatures, turned off some non-critical systems as a precaution, and paused the schedulers to prevent new jobs from starting. Submitted jobs will be held until the problem is sufficiently addressed.

Depending on how this issue develops, we may need to power down critical systems such as storage and virtual headnodes, but all critical systems are currently online.

We will continue to provide updates here on this blog and, as needed, via the pace-available email list.

Thank you!


Possible Water Service Outage May Impact PACE Clusters

Monday, June 4, 2018
You probably saw the announcement from the Georgia Tech Office of Emergency Management (copied below). Our knowledge of the matter is limited to this message, but as far as we can tell, a complete outage is unlikely, though still possible.

Impact on PACE Clusters:

In the event of a large-scale outage, the PACE datacenter cooling systems will stop working and we will need to urgently shut down all systems as an emergency step, including but not limited to compute nodes, login nodes, and storage systems. This will impact all running jobs and active sessions.
We’ll continue to keep you updated. Please check this blog for the most up-to-date information.
Thanks!

—————————————–

Original communication from Georgia Tech Office of Emergency Management:

To the campus community:

Out of an abundance of caution, Georgia Tech Emergency Management and Communications has taken steps to prepare the campus for the possibility of a water outage tonight in light of needed repairs to the City of Atlanta’s water lines.

The City of Atlanta’s Department of Watershed will repair a major water line beginning tonight between 11 p.m. and midnight. The repair is scheduled to be completed this week and should not negatively impact campus. If all goes according to plan, the campus will operate as usual.

In the event the repairs cause a significant loss of water pressure or loss of water service completely, the campus will be closed and personnel will be notified through the Georgia Tech Emergency Notifications System (GTENS).

If GTENS alerts are sent, essential personnel who are pre-identified by department leadership should report even if campus is closed. If the campus loses water, all non-essential activities will be canceled on campus.

Those with specialized research areas need to make arrangements tonight in the event there is a water failure. All lab work and experiments that can be delayed should be planned for later in the week or next week.

In the event of an outage, employees are asked to work with department leadership to work remotely. Employees who can work remotely should prepare before leaving work June 4 to work remotely for several days. Toilets won’t be operational, drinking water will not be available, and air conditioning will not be functioning in buildings on campus and throughout the city.

All who are housed on campus should fill bathtubs and other containers to have water on hand to manually flush toilets should there be a loss in pressure. Plans are underway to relocate campus residents to nearby campuses such as Emory University or Kennesaw State University in the event of a complete loss of water to the campus.

Parking and Transportation Services will continue on-campus transportation as long as the campus is open.

In the event of an outage, additional instructions and information on campus operations will be shared at gatech.edu.

Major Outage of GT network on Sunday, May 27

Thursday, May 24, 2018

The OIT Operations team has informed us about a service outage on Sunday (5/27, 8am). Their detailed note is copied below.

This outage should not impact running jobs; however, you will not be able to log in to headnodes or use the VPN for the duration of the outage.

If you have ongoing data transfers (using SFTP, scp, or rsync), they *will* be terminated. We strongly recommend waiting until this work has completed successfully before starting any large data transfers. Similarly, your active SSH connections will be interrupted, so please save your work and exit all sessions beforehand.
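If a long transfer does get interrupted by the outage, one option is to resume it afterwards with rsync; a minimal sketch, where the paths and hostname are hypothetical placeholders:

    # re-run the same transfer after the outage; files already copied are skipped,
    # and --partial keeps partially transferred files so they can be continued
    rsync -avz --partial --progress ./results/ your_gt_username@your-headnode.pace.gatech.edu:~/results/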

PACE team will be in contact with the Operations team and provide status updates in this blog post as needed: http://blog.pace.gatech.edu/?p=6259

More details:

There will be a major service disruption to Georgia Tech’s network due to a software upgrade to a core campus router beginning on Sunday, May 27 at 8:00 a.m. Overall, network traffic from on campus to off and off campus to on will also be affected. Some inter-campus traffic will remain up during the work, but most services will not be available.

While the software upgrade is expected to be complete by 9:00 a.m., and most connectivity restored, there may be outages with various centrally provided services. Therefore, a maintenance window is reserved from 8:00 a.m. until 6 p.m. The following services may be affected and therefore not available. These include, but are not limited to, CAS (login.gatech.edu), VPN, LAWN (GTwifi, eduroam, GTvisitor), Banner/Oscar, Touchnet/Epay, Buzzport, Email (delayed delivery of e-mail but no e-mail lost), Passport, Canvas, Resnet network connectivity, Vlab, T-Square, DegreeWorks, and others.

Before services go down, questions can be sent to support@oit.gatech.edu or via phone call at 404-894-7173. During the work, please visit status.gatech.edu for updates. Our normal status update site, status.oit.gatech.edu, will not be available during this upgrade. After the work is completed, please report issues to the aforementioned e-mail address and phone number, or call OIT Operations at 404-894-4669 for urgent matters.

The maintenance consists of a software upgrade to a core campus router that came at the recommendation of the vendor following an unexpected error condition that caused a brief network outage earlier this week. “We expect the network connectivity to be restored by noon, and functionality of affected campus services to be recovered by 6:00 PM on Sunday May 27, though many services may become available sooner,” says Andrew Dietz, ITSM Manager, Sr., Office of Information Technology (OIT).

We apologize for the inconvenience this may cause and appreciate your understanding while we conduct this very important upgrade.