Executive summary
PACE recognizes that the increasing frequency of performance issues on the storage system is disrupting your research on the Phoenix cluster. We are working to mitigate the impact of these events while taking proactive measures to improve the reliability of our systems going forward. To this end, we are introducing new storage technology and prioritizing the migration of data from the existing project storage to the new system over the next year. We are currently finalizing a migration plan designed to be as seamless as possible. Once the plan is ready, around late spring, we will follow up with detailed information on the timeline and any potential workflow impacts. Our goal is to minimize disruption and ensure that everyone is aligned on key milestones. There will be no changes to the unit price until the new system is fully implemented, data migration is complete, the existing system is reconfigured, and we have sufficient usage metrics to determine any necessary price adjustments. We estimate this will be no earlier than the end of 2025. We believe this is the fastest path toward a stable and effective storage solution that can serve the varied storage needs of our user community. Please find more details below.
The Phoenix cluster at PACE hosts two major storage systems: scratch and project. Scratch is the temporary file system for files used during job execution; it is cleaned up once a month, with files older than 60 days marked for deletion. Project is the long-term file system; it currently holds 2.3 PB of data across roughly 1.8 billion files. Aside from environmental factors such as chilled water outages, these two storage systems have been significant contributors to downtime and degraded performance on the Phoenix cluster. Based on our analysis, storage failure or severe degradation accounts for 47% of unplanned downtime so far in calendar year 2024. Addressing this concern has been our primary focus over the past six months. This message shares our progress and our plans for the next 12 months.
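For context on what the monthly cleanup means in practice, here is a minimal sketch of how a user could preview which of their files would currently fall under the 60-day rule. The scratch location and the use of modification time are illustrative assumptions, not a description of PACE's exact cleanup mechanism.

```python
import os
import time
from pathlib import Path

# Illustrative only: previews files whose modification time is older than 60 days.
# The scratch path is an assumption; substitute your actual scratch directory.
scratch = Path(os.environ.get("SCRATCH", str(Path.home() / "scratch")))
cutoff = time.time() - 60 * 24 * 3600  # 60 days, in seconds

stale = [p for p in scratch.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff]

print(f"{len(stale)} file(s) under {scratch} are older than 60 days")
for p in stale[:20]:  # print a small sample
    print(p)
```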
Scratch Space
The scratch space is supported by a DDN 400NVX2 unit, which includes a mix of flash (NVMe) drives and spinning disks. The system was taken offline during the August maintenance period for a major software upgrade performed by the vendor to improve its stability. In addition, software issues and instability on this unit have required us to disable the use of flash drives for hot pools, which has impacted performance. To address this, a second software upgrade is planned for the January 2025 maintenance period, at which point we will configure the flash space to use Progressive File Layout (PFL), keeping small files in flash and progressively increasing the number of stripes across all devices for improved performance.
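For readers unfamiliar with PFL, the sketch below illustrates the general idea: a file's layout is built from components, with early extents kept on a small number of flash stripes and later extents striped more widely across devices. The thresholds, stripe counts, and pool names are hypothetical placeholders, not the configuration PACE will actually deploy.

```python
# Conceptual sketch of Progressive File Layout (PFL). A file's layout is a list of
# components; each covers an extent of the file and defines its striping. All
# thresholds, stripe counts, and pool names below are hypothetical.
PFL_COMPONENTS = [
    # (extent end in bytes, stripe count, pool)
    (64 * 1024**2, 1, "flash"),   # first 64 MiB: one stripe on NVMe
    (1 * 1024**3, 4, "disk"),     # up to 1 GiB: four stripes on spinning disk
    (None, 16, "disk"),           # remainder: wide striping across many disks
]

def components_used(size_bytes: int):
    """Return the layout components a file of the given size would occupy."""
    used = []
    for end, stripes, pool in PFL_COMPONENTS:
        used.append({"extent_end": end, "stripes": stripes, "pool": pool})
        if end is not None and size_bytes <= end:
            break
    return used

if __name__ == "__main__":
    for size in (4 * 1024**2, 300 * 1024**2, 8 * 1024**3):  # 4 MiB, 300 MiB, 8 GiB
        print(f"{size:>12} bytes -> {components_used(size)}")
```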
Project Space
Project space is supported by a DDN18K system purchased in 2020. This unit primarily uses spinning disks and is nearing its end of life. Over the past two years, we have seen an increase in issues related to software defects and disk failures. While the system has built-in redundancies to tolerate multiple disk failures, the disk rebuild process, coupled with an increasing number of disk-intensive research jobs on Phoenix, is negatively impacting performance.
We have adopted a two-pronged strategy to address the project storage:
- Perform software and limited hardware upgrades. During the August maintenance period, the hardware subsystem supporting the metadata functionality of the Lustre filesystem was replaced with a new dedicated unit. Due to its complexity, this operation required an additional day of work. The objectives of this upgrade were to a) improve the performance of the metadata functionality and b) provide an upgrade path for future software releases. During the January 2025 maintenance period, the vendor will perform a major software upgrade to standardize on Lustre 12.4 across all storage appliances. This will increase the stability and performance of the project storage while simplifying its management.
- Consult with other research computing sites (including the Texas Advanced Computing Center) to invest in a new storage system to replace or complement our DDN18K unit.
Based on feedback from other research computing sites, we made an initial investment in an all-flash storage system from VAST Data. Because it is an all-flash system with a disaggregated architecture, we expect significantly better uptime and performance from this new system. In particular, the VAST system does not require downtime to perform major code upgrades. The vendor has installed the system, and we are in the process of bringing the unit into production to host data, initially in support of improvements to the DDN18K.
Our plan over the next 12 months includes:
- Migrate all data from the current DDN18K project space to the new VAST storage space to support these improvements. Given the storage outages and performance issues that have affected our user community, we want to clarify the following for the migration period over the next 12 months, or for as long as it takes to complete the DDN18K improvements:
- All storage credit balances, including the free tier, will remain equivalent to those on the existing system.
- The unit rate of $5.67 per TB per month for the paid tier will remain the same for both the current project space and the new VAST Storage system.
- Work with our vendor to reconfigure the DDN18K appliance to efficiently use all available SSD / NVMe drive space and recreate storage pools to remove performance bottlenecks during disk failure. This will require a complete reformatting of the storage space.
- Gather metrics on storage efficiencies (e.g., capacity reduction after data de-duplication and compression) and operational efficiencies so we can more accurately calculate the rate for VAST Storage; an illustrative sketch of how such metrics would feed into a rate estimate follows this list.
- Leverage the VAST Data analytics to help users archive older data to lower-cost storage options such as CEDAR.
- Publish comparative analysis about functionality, performance, and resiliency between the different storage services provided by PACE to help users make decisions on what storage service(s) to use based on their type of research and data.
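As a rough illustration of how the efficiency metrics mentioned above could feed into a future rate calculation, here is a minimal sketch. The data-reduction ratio and flash cost factor are placeholder assumptions, not measured values or a committed pricing formula.

```python
# Placeholder arithmetic only: how measured efficiencies could shape a future rate.
CURRENT_RATE = 5.67       # $/TB/month, current DDN18K paid tier (from this notice)
REDUCTION_RATIO = 1.8     # assumed capacity reduction from de-dup + compression (placeholder)
FLASH_COST_FACTOR = 2.5   # assumed raw cost of flash relative to spinning disk (placeholder)

# Effective cost per logical TB: raw flash premium offset by data reduction.
estimated_rate = CURRENT_RATE * FLASH_COST_FACTOR / REDUCTION_RATIO
print(f"Illustrative VAST rate: ${estimated_rate:.2f} per TB per month")
```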
In the long term, we expect the new VAST Data storage system to be offered as a separate service with its own storage rate. This rate will likely be higher than the current $5.67 per TB per month for the DDN18K system. However, we cannot determine the final rate until we better understand how the new system handles our data usage and what efficiencies it delivers.
Storage credits that have already been purchased, or are purchased during this transition, will retain their value on the DDN18K system (in terabyte-months). Alternatively, they can be converted to storage credits on the VAST system at a ratio to be determined once the transition is complete. At that point, users will have the option to stay on VAST, move back to the reconfigured DDN18K, or choose a different, potentially cheaper storage option.
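To make the conversion concrete once a ratio is published, the sketch below converts a terabyte-month balance at a purely hypothetical ratio; the actual ratio has not yet been determined.

```python
# Hypothetical credit conversion; the real ratio will be set after the transition.
def convert_credits(ddn_tb_months: float, vast_per_ddn_ratio: float) -> float:
    """Convert DDN18K terabyte-month credits into VAST terabyte-month credits."""
    return ddn_tb_months * vast_per_ddn_ratio

balance = 1200.0                      # example DDN18K balance, in TB-months
print(convert_credits(balance, 0.7))  # 0.7 is an illustrative ratio only -> 840.0
```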
Our goal is to enhance the reliability of storage in PACE while introducing new technologies that meet the diverse needs and budgets of the Georgia Tech research community. We aim to develop migration strategies that minimize disruption to your workflows and offer more storage options tailored to your requirements. To this end, we are weighing bulk migration against individual group migrations, and we will engage with research groups as needed. We are committed to providing regular updates throughout this process.
Thank you for your understanding and support during this project. If you have any questions or concerns, please feel free to reach out to us at pace-support@oit.gatech.edu.