

Improvements to job accounting and queue wait times on PACE clusters

Posted on Thursday, 23 December, 2021
We would like to share two updates with you regarding improvements to job accounting and queue wait times on the Phoenix and Firebird clusters.
  • Due to an error, some users have seen the wrong account names listed in our pace-quota and pace-whoami utilities in recent months. We have corrected this, and all users can now use pace-quota to see available charge accounts and balances on Phoenix or Firebird. We have also improved the utility so that balances are visible for all accounts, including multi-PI or school-owned accounts that previously displayed a zero balance, so researchers can always check their available balances. Read our documentation for more details about the charge accounts available to you and what they mean. The pace-quota command is available on Phoenix, Hive, Firebird, and ICE, and provides user-specific details (see the sketch after this list):
    • your storage usage on that cluster
    • your charge account information for that cluster (Phoenix and Firebird only)
  • Additionally, to improve utilization of our clusters and reduce wait times, we have enabled spillover between node classes, which allows waiting jobs to run on underutilized, more capable nodes than those they requested, with no user action required and at no additional charge. Spillover was enabled on GPU nodes in September and on CPU nodes last week, on both Phoenix and Firebird.
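For reference, here is a minimal sketch of checking your accounts and balances from a login node. The commands are the utilities named above; any output they produce is illustrative rather than an exact transcript:

    # On a Phoenix or Firebird login node: show storage usage and
    # charge-account balances
    pace-quota

    # Show your username and associated accounts
    pace-whoami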
Please note that targeting a specific or more expensive node class to reduce wait time is no longer effective or necessary; simply request the resources your job requires, as illustrated below. Your job will continue to be charged at the rate for the resources it requests, even if it ends up being assigned to run on more expensive hardware.
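As an illustration, the sketch below is a PBS-style job script that requests only the cores, memory, and walltime the job needs rather than targeting a node class. The account, queue, and executable names are hypothetical placeholders; substitute your own:

    #PBS -N my_job
    #PBS -A GT-example-account   # hypothetical charge account; see pace-quota
    #PBS -l nodes=1:ppn=4        # request only what the job needs
    #PBS -l mem=16gb
    #PBS -l walltime=8:00:00
    #PBS -q example-queue        # hypothetical queue name

    cd $PBS_O_WORKDIR
    ./my_application             # hypothetical executable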
As always, please contact us if you have any questions.

PACE availability during the Holidays

Posted on Tuesday, 21 December, 2021

As we leave 2021 behind, we want to remind everyone that PACE clusters will continue to operate during the GT Institute Holiday; however, PACE staff will not be generally available for support. Please continue to report any problems or requests to pace-support@oit.gatech.edu. We will receive those messages and get back to you as soon as possible after the holidays.

2021 was a pivotal year for PACE. We migrated all of our services to our new datacenter, changed our service model, and worked to better serve GT researchers and students. We could not have done any of this without your input, support, and patience. We are grateful for that and look forward to achieving more great things together in 2022.

Happy Holidays and a Happy New Year!

PACE’s centralized OSG service, powered by the new “Buzzard” cluster

Posted on Thursday, 14 October, 2021

We are happy to announce a new addition to PACE’s service portfolio to support Open Science Grid (OSG) efforts on campus and beyond. This service is kick-started by a brand-new cluster named “Buzzard”, funded by an NSF award* led by Dr. Mehmet Belgin and Semir Sarajlic of PACE, in collaboration with Drs. Laura Cadonati, Nepomuk Otte, and Ignacio Taboada of the Center for Relativistic Astrophysics (CRA).

Open Science Grid (OSG) is a unique consortium that provides shared infrastructure and services to unify access to supercomputing sites across the nation, making a vast array of High Throughput Computing (HTC) resources available to US-based researchers. OSG has been instrumental in ground-breaking scientific advancements, including but not limited to the Nobel Prize-winning gravitational wave research (LIGO).

Did you know that all GT researchers already qualify for OSG? This means you can join today and start running jobs on this vast resource at no cost. We highly encourage you to register for PACE’s next OSG orientation class, which will get you started with the basics of running on OSG. As an added resource, PACE offers documentation to get researchers started with OSG quickly.
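To give a flavor of HTC work, here is a minimal sketch of submitting a single-core job with HTCondor, the scheduler OSG is built around. The file names and script are hypothetical, and the orientation class and documentation above cover the PACE- and OSG-specific details:

    # Write a minimal HTCondor submit description (illustrative)
    cat > hello.sub <<'EOF'
    # Run one single-core job
    executable   = hello.sh
    output       = hello.out
    error        = hello.err
    log          = hello.log
    request_cpus = 1
    queue
    EOF

    condor_submit hello.sub   # submit the job to the pool
    condor_q                  # check its status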

In addition to training and documentation, PACE offers resource integration services. More specifically, GT faculty members now have the option to acquire new resources to expand Buzzard with their own OSG projects, similar to the High Performance Computing (HPC) services PACE successfully offered from 2009 until the introduction of the new cost model. As part of the NSF award, PACE has already started supporting several exceptional OSG projects, namely LIGO, IceCube, and CTA/VERITAS, and we look forward to supporting more OSG projects in the future!

If you are interested in the OSG service, please feel free to reach out to us (pace-support@oit.gatech.edu) and we’ll be happy to discuss how our new service can transform your research. 

Thank you! 


* This material is based upon work supported by the National Science Foundation under grant number 1925541. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. 

[RESOLVED] Phoenix Scheduler is Down

Posted on Thursday, 13 May, 2021

Update (5/13 2:00pm): We are happy to report that the Phoenix Scheduler is now online and accepting jobs.

We are sorry for the inconvenience this has caused; please let us know if you continue to observe any problems (pace-support@oit.gatech.edu).
----
At around 10:30am this morning, we restarted the Phoenix scheduler to apply a new license file. The scheduler is having trouble coming back online, and we are actively troubleshooting the issue. So far we know the issue is unrelated to the license; rather, some leftover job files may be causing it. We are working to revive the scheduler as soon as possible.

This issue doesn’t impact any running jobs or those submitted before the incident; only new job submissions will fail with an error.

We’ll update this post (http://blog.pace.gatech.edu/?p=7075) and send a follow up message once the issue is resolved.

Thank you for your patience and sorry for this inconvenience.


[RESOLVED] Intermittent unavailability of Phoenix login nodes

Posted on Wednesday, 2 December, 2020
Phoenix login nodes 1, 2, and 4 became unavailable for short periods of time today. We identified the cause as excessive user activity and rebooted the nodes, which are now available. We will reach out to the relevant users to prevent this from happening again.

PACE Procurement Timeline Adjustments

Posted on Friday, 29 March, 2019

PACE staff have completed our move to the CODA building and are settling in. We’ve also added a couple of new faces to the team; announcements will be forthcoming shortly.

As the year-end purchasing deadlines approach, we wanted to update the community on some changes to our procurement calendar. We’re doing our best to advocate for the research community and navigate some tough realities. We’ve nearly exhausted our space in the Rich Computer Center, and are very limited in our ability to deploy new equipment in that space. The CODA datacenter will be our new home (more on that below) but is not quite ready yet.

As such, we have cancelled the previously planned FY19-Phase3 and will need to shift some dates for our last order in FY19, FY19-Phase4. This shift results in FY19-Phase4 and FY20-Phase1 essentially being deployed concurrently around October of 2019. For this reason, we strongly encourage faculty to participate in FY20-Phase1 and reserve FY19-Phase4 for those who need to use funds expiring in FY19.

We will also adjust configurations and pricing for FY19-Phase4 and FY20-Phase1 based on upcoming processing technology and market conditions once that pricing is available to the public.

Finally, planning is in progress for PACE to migrate existing research cyberinfrastructure from the Rich data center to CODA, and all efforts will be made to minimize disruption to research efforts during this move. The execution phase will not begin until at least October 2019.

To view the published schedule online or for more information, visit http://pace.gatech.edu/participation or email pace-support@oit.gatech.edu.

Best Regards,

-PACE Team

PACE clusters ready for research

Posted on Saturday, 16 February, 2019
Our February 2019 maintenance (http://blog.pace.gatech.edu/?p=6419) is complete on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days.
Please let us know any problems you may notice: pace-support@oit.gatech.edu

Compute

* (COMPLETE) Vendor replaced defective components on groups of servers

Network

* (COMPLETE) Ethernet network reconfiguration

Storage

* (COMPLETE) GPFS / DDN enclosure reset
* (COMPLETE) NAS maintenance and reconfiguration

Other

* (COMPLETE) PACE VMWare reconfiguration to remove out-of-support hosts
* (COMPLETE) Migration of Megatron cluster to RHEL7

PACE quarterly maintenance – (Feb 15-16, 2019)

Posted on Friday, 18 January, 2019

[Update – 02/11/2019] Our updated quarterly scheduled maintenance task list will include the following:

Compute

  • (no user action needed) Vendor will replace defective components on groups of servers

Network

  • (no user action needed) Ethernet network reconfiguration

Storage

  • (no user action needed) GPFS / DDN enclosure reset
  • (no user action needed) NAS maintenance and reconfiguration

Other

  • (no user action needed) PACE VMWare reconfiguration to remove out of support hosts


[Original Post – 01/18/2019] We are preparing for a short maintenance day on February 15, 2019. Unlike our regular schedule, which starts on Thursdays and takes three days, this maintenance will start on a Friday and take only two days.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
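For example, a job submitted shortly before the maintenance window can still run if its requested walltime ends before the window opens; a minimal sketch with illustrative numbers (my_job.pbs is a hypothetical script):

    # Submitted on Feb 13: a 24-hour walltime finishes before the
    # Feb 15 start, so the scheduler can run the job instead of holding it
    qsub -l walltime=24:00:00 my_job.pbs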

In general, we’ll perform maintenance on the GPFS storage, migrate some virtual machines to new servers, perform hardware changes on one of the clusters, and finalize the migration of “/usr/local”, which is a network-attached mount point on all machines, to a more reliable storage pool.

While we are still working on finalizing the task list and details, none of these tasks are expected to require any user actions.

We’ll update this post as we have more details.


Changes to mount points (no user impact expected)

Posted on Thursday, 3 January, 2019

The investigation that followed the system failures that temporarily rendered the scientific repository unresponsive (http://blog.pace.gatech.edu/?p=6390) showed that some additional maintenance is required. To facilitate this maintenance, we will change the mount point for /usr/local, which is network-mounted and identical on all compute nodes.

Our tests indicate that this swap can be performed live, without impacting running jobs. It’s also completely transparent to users; you don’t need to change or do anything as a result.
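If you would like to confirm the change yourself, standard tools show which filesystem currently backs the path; a minimal sketch:

    # Show the device or fileserver behind /usr/local
    df -h /usr/local
    mount | grep ' /usr/local '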

In the unlikely event of job crashes that you suspect are caused by this operation, please contact pace-support@oit.gatech.edu and we’ll be happy to assist.

Thank you,
PACE Team

[Resolved] Widespread problems impacting all PACE machines

Posted on Tuesday, 11 December, 2018

Update (12/21, 10:15am): A correction: the problems started this morning around 8:15am, not yesterday evening as previously communicated. The systems were back online at 8:45am.

Update (12/21, 9:15am): Another incident began last night, causing the same symptoms (hanging and unavailability of the scientific repository). OIT storage engineers failed the services over to the redundant system (high-availability pair), and the storage is available again. We continue to investigate the root cause of the recurring failures experienced over the past several weeks.

Update (12/12, 6:30pm): The services have been successfully migrated to the high-availability pair, and the filesystems are once again accessible. We’ll continue to monitor the systems and take a close look at the errant components. Some of these problems may still recur, but we’ll be ready to address them if they do.

Update (12/12, 5:30pm): Unfortunately the problems seem to be coming back. We continue to work on this. Thank you for your patience.

Update (12/12, 11:30am): We identified the root cause as a configuration conflict between two devices and resolved the problem. All systems are back online and available for jobs.

Update (12/12, 10:00am): Our battle with the storage system continues. The filesystem is designed as a high-availability service with redundant components to prevent situations like this, but unfortunately the secondary system failed to take over successfully. We are investigating the possibility that the network is the culprit. We continue to work rigorously to bring the systems back online as soon as possible.

Update (12/11, 9:00pm): Problems continue; we are working on them with support from related OIT units.

Update (12/11, 7:30pm): We mitigated the issue, but the intermittent problems may continue to recur until the root cause is addressed. We continue to work on it.

Original message:

Dear PACE Users,

At around 3:45pm on Dec 11, the fileserver that serves the shared “/usr/local” on all PACE machines started experiencing problems. This issue is causing several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications from the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.
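If you are unsure whether your session is affected, a time-bounded directory listing is a quick way to tell whether the mount is hanging; a minimal sketch:

    # Returns quickly if /usr/local is healthy; gives up after 10 seconds
    # if the fileserver is hanging
    timeout 10 ls /usr/local/pacerepov1 > /dev/null \
      && echo "repository reachable" \
      || echo "repository unresponsive"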

We are working to resolve these problems ASAP and will keep you updated on this post.