PACE: A Partnership for an Advanced Computing Environment

December 21, 2012

TestFlight upgraded to new 6.3 stack

Filed under: tech support — Semir Sarajlic @ 9:55 pm

Here’s our present for the holidays: a new OS stack based on RHEL 6.3, which, according to our tests, delivers a performance boost across all CPU architectures. Please try your codes on TestFlight now to make sure we haven’t introduced new bugs in this stack, and report any problems you see to us.
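If you would like to give it a try, a minimal test job might look like the sketch below. The queue name “testflight” and the Torque-style directives are assumptions based on our usual setup, and the executable is of course a placeholder for one of your own codes.

#PBS -N rhel63-test
#PBS -q testflight
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00

cd $PBS_O_WORKDIR
#Load the modules your production runs normally use, then run a short test case
module load intel/12.1.4
./my_test_case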

December 20, 2012

Scheduled Quarterly Maintenance on 01/15/2013

Filed under: tech support — Semir Sarajlic @ 6:11 pm

The first quarterly maintenance of 2013 will take place on 01/15. All systems will be taken offline for the entire day. We hope that no jobs will need to be killed, since we have been placing holds on jobs that would still be running on that day. If you submitted jobs with long walltimes (extending past 01/15), you will notice that the scheduler is holding them to protect them from being killed.
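If you would like to confirm whether your own queued jobs are among those being held, the standard Torque/Moab tools can tell you; the job ID below is a placeholder:

#List your jobs and their states; held jobs show state "H"
$ qstat -u $USER
#Ask the scheduler why a specific job has not started (replace 12345 with your job ID)
$ checkjob 12345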

Here’s a summary of the tasks that we are planning to accomplish on the maintenance day.

* OS upgrade (6.2 to 6.3): We will upgrade the RHEL OS to version 6.3. This version offers better compatibility with our hardware, with potential performance benefits. We have been testing the existing software stack against this version to verify compatibility and do not expect any problems. We are upgrading the testflight nodes to 6.3 (they should be online very soon), so please submit test jobs to this queue to verify that your codes will continue to run on the new system.

* Scratch storage maintenance: As most of you already know, we have been working with Panasas to resolve the ongoing crashes. Panasas has identified the cause, and the fix will require a new release of their system software. We expect to deploy a tested version on this maintenance day.

Important: The new release will be tested on a separate storage system provided by Panasas, not on our production system. Therefore, we must be prepared for the possibility of unforeseen problems that are only triggered by production runs with actual usage patterns. In an effort to shield long-running jobs from such an event, we are placing another reservation that only allows jobs that will complete by 02/17/2013, while holding longer jobs. This way, should we need to declare an emergency downtime on that day, we will be able to do so with minimal impact. Jobs with more than 31 days of walltime will be held until February 17th, so please keep this in mind while setting walltimes for your jobs (see the walltime sketch after this list). This reservation is contingent upon the stability of the system, and we may remove it earlier than this date if we feel confident enough. We are sorry for this inconvenience.

* Conversion of more RHEL5 nodes to RHEL6: The majority of our users have already made the switch to RHEL6 systems. Therefore, we will migrate more of the FoRCE and Joe nodes to the corresponding RHEL6 queues. We are not getting rid of the RHEL5 queues entirely (just yet), but the number of nodes they contain will be significantly reduced. Please contact us if your jobs still depend on RHEL5, since this version will be deprecated in the near future.

* Deployment of new database-driven configuration builders (dry-run mode only): We are developing a new system to manage user accounts, queues, and many other system management tasks, with the goal of minimizing human error and maximizing efficiency. We will deploy a dry-run-only prototype of this system, which will run alongside the existing mechanisms. This will allow us to test and verify the new system against real usage scenarios to assist the development effort; it will not be used for actual management tasks.

* New license server: We will start using a new license server, since the system on the existing server is getting old. We will migrate the existing licenses to the new server on the maintenance day. We don’t expect any difficulties, but please contact us if you notice any problems with licenses.
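As a concrete illustration of the walltime consideration mentioned in the scratch storage item above, the directive below requests 20 days, which can complete before the 02/17/2013 reservation if the job starts shortly after the maintenance; a request too long to finish by 02/17 would be held instead. The Torque-style syntax is an assumption; adjust it to your usual submission method.

#Request 20 days (480 hours) of walltime; this can finish before 02/17/2013
#PBS -l walltime=480:00:00
#A request such as 840:00:00 (35 days) could not finish by 02/17 and would be held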

As always, please let us know if you have any concerns or questions at pace-support@oit.gatech.edu.

December 18, 2012

TestFlight in process of update

Filed under: Events,News,tech support — Semir Sarajlic @ 9:43 pm

We have temporarily stopped the queues for TestFlight to allow them to drain so that we may upgrade TF machines to a new stack based on RHEL 6.3. Once all machines have been upgraded, we will re-enable the queues for jobs to test the suitability of this new stack.

Should there be no major software issues, this stack will become the default OS for RHEL6-based clusters on the next maintenance day, scheduled for January 17, 2013.

Jobs failing to start due to scheduler problems (~10am this morning)

Filed under: Uncategorized — Semir Sarajlic @ 7:12 pm

We experienced scheduler-related problems this morning (around 10am), which caused jobs to terminate immediately after they were allocated to compute nodes. The system is back to normal; however, we are still investigating what caused the issue.

If you have jobs that are affected by this issue, please resubmit them. If you continue to have problems, please contact us as soon as possible.

We are really sorry for this inconvenience.

 

December 12, 2012

Cluster Downtime December 19th for Scratch Space Cancelled

Filed under: Uncategorized — pm35 @ 9:25 pm

We have been working very closely with Panasas regarding the necessity of an emergency downtime for the cluster to address the difficulties with the high-speed scratch storage. They have located a significant problem in their code base that they believe is responsible for this and other issues. Unfortunately, the full product update will not be ready in time for the December 19th date, so we have cancelled this emergency downtime; all running or scheduled jobs will continue as expected.

We will update you with the latest summary information from Panasas when available. Thank you for your continued patience and cooperation with this issue.

– Paul Manno

TSRB Connectivity Restored

Filed under: Uncategorized — rlara3 @ 2:47 pm

Network access to the RHEL-5 Joe cluster compute nodes has been restored.

The problem was caused by a UPS power disruption to a network switch in the building. In addition to recovering the switch and UPS, the backbone team added power redundancy to the switch by adding another PDU to the switch and connecting it to a different UPS.

New Software: VASP 5.3.2

Filed under: Uncategorized — Semir Sarajlic @ 12:02 pm

VASP 5.3.2 – Normal, Gamma, and Non-Collinear versions

Version 5.3.2 of VASP has been installed.
The newly installed versions have been checked against our existing tests; the results agree with the expected values to within a small numerical tolerance.
Please check this new version against your known correct results!

Using it

#First, load the required compiler 
$ module load intel/12.1.4
#Load all the necessary support modules
$ module load mvapich2/1.6 mkl/10.3 fftw/3.3
#Load the vasp module
$ module load vasp/5.3.2
#Run vasp
$ mpirun vasp
#Run the gamma-only version of vasp
$ mpirun vasp_gamma
#Run the noncollinear version of vasp
$ mpirun vasp_noncollinear

Compilation Notes

  • Only the Intel compiler generated MPI-enabled vasp binaries that correctly executed the test suite.
  • The “vasp” binary was compiled with these preprocessor flags: -DMPI -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=12000 -DMINLOOP=1 -DPGF90 -Davoidalloc -DNGZhalf -DMPI_BLOCK=8000
  • The “vasp_gamma” binary was compiled with these preprocessor flags: -DMPI -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=12000 -DMINLOOP=1 -DPGF90 -Davoidalloc -DNGZhalf -DwNGZhalf -DMPI_BLOCK=8000
  • The “vasp_noncollinear” binary was compiled with these preprocessor flags: -DMPI -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=12000 -DMINLOOP=1 -DPGF90 -Davoidalloc -DMPI_BLOCK=8000
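Putting the pieces above together, a minimal batch script for the standard MPI build might look like the sketch below; the node count and walltime are placeholders rather than recommendations, and the Torque-style directives are an assumption about your submission setup.

#PBS -N vasp-test
#PBS -l nodes=2:ppn=8
#PBS -l walltime=12:00:00

cd $PBS_O_WORKDIR
#Load the compiler, support libraries, and vasp itself
module load intel/12.1.4 mvapich2/1.6 mkl/10.3 fftw/3.3 vasp/5.3.2
#Substitute vasp_gamma or vasp_noncollinear as appropriate for your calculation
mpirun vasp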

TSRB Connectivity Problem

Filed under: Uncategorized — rlara3 @ 1:41 am

All of the RHEL-5 Joe nodes are currently unavailable, due to an unspecified connectivity problem at TSRB. This problem does not impact any joe-6 nodes, or nodes from any other group.

Since connectivity between Joe and the rest of PACE is required for home, project, and scratch storage access, all of the jobs currently running on Joe will eventually get stuck in an I/O-wait state, but they should resume once connectivity has been restored.
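If you are logged into a Joe node and want to see whether your own processes have reached this state, uninterruptible I/O wait shows up as state “D” in standard ps output; a quick check might look like the following (this is generic Linux, nothing PACE-specific):

#List your processes with their state; "D" indicates a process waiting on I/O
$ ps -u $USER -o pid,stat,comm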

December 7, 2012

Cluster Downtime December 19th for Scratch Space Issues

Filed under: Uncategorized — pm35 @ 7:16 pm

As many of you have noticed, we have experienced disruptions and undesirable performance with our high-speed scratch space. We are continuing to work diligently with Panasas to discover the root cause and repair for these faults.

As we work toward a final resolution of the product issues, we will need to schedule an additional cluster-wide downtime on the Panasas system to implement a potential fix. We are scheduling a short downtime (2 hours) for Wednesday, December 19th at 2pm ET. During this window, we expect to install a tested release of the software.

We understand this is an inconvenience to all our users, but we feel it is important enough to the PACE community to warrant the disruption. If this particular date and duration falls at an especially difficult time for you, please let us know and we will do our best to negotiate a better date or time.

It is our hope that this will provide a permanent solution to these near-daily disruptions.

– Paul Manno

New and Updated Software: BLAST, COMSOL, Mathematica, VASP

Filed under: Uncategorized — Semir Sarajlic @ 5:14 pm

All of the software detailed below is available through the “modules” system installed on all PACE-managed Redhat Enterprise 6 computers.
For basic usage instructions on PACE systems see the Using Software Modules page.
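As a quick refresher, the module commands used in the examples below are the standard environment-modules commands; the specific package names shown here are just illustrations:

#See which packages and versions are available
$ module avail
#Load a package (and the compiler it was built with)
$ module load gcc/4.6.2
#Show what is currently loaded in your environment
$ module list
#Unload everything and start clean
$ module purge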

NCBI BLAST 2.2.25 – Added multithreading in new GCC 4.6.2 version

The 2.2.25 version of BLAST that was compiled with GCC 4.4.5 has multithreading (i.e. multi-CPU execution) disabled.
A new version of BLAST with multithreading enabled has been compiled with the GCC 4.6.2 compiler.

Using it

#First, load the required compiler 
$ module load gcc/4.6.2
#Now load BLAST
$ module load ncbi_blast/2.2.25
#Setup the environment so that blast can find the database
$ export BLASTDB=/path/to/db
#Run a nucleotide-nucleotide search
$ blastn -query /path/to/query/file -db <db_name> -num_threads <number of CPUS allocated to job>
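When running the multithreaded build inside a batch job, the thread count should match the number of CPUs allocated to the job. Below is a minimal sketch assuming a Torque-style script; the query file, database name, and resource requests are placeholders.

#PBS -l nodes=1:ppn=8
#PBS -l walltime=4:00:00

cd $PBS_O_WORKDIR
module load gcc/4.6.2 ncbi_blast/2.2.25
export BLASTDB=/path/to/db
#One thread per allocated CPU: the node file lists one line per CPU slot
NP=$(wc -l < $PBS_NODEFILE)
blastn -query query.fasta -db my_db -num_threads $NP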

COMSOL 4.3a – Student and Research versions

COMSOL Multiphysics version 4.3a contains many new functions and additions to the COMSOL product suite. See the COMSOL Release Notes for information on new functionality in existing products and an overview of the new products.

Using it

#Load the research version of comsol 
$ module load comsol/4.3a-research
$ comsol ...
#Use the matlab livelink
$ module load matlab/r2011b
$ comsol -mlroot ${MATLAB}
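For non-interactive runs on compute nodes, COMSOL also offers a batch mode; the sketch below reflects the usual COMSOL command-line options, but please verify the exact flags against the COMSOL 4.3a documentation (the file names are placeholders):

#Load the research version of comsol
$ module load comsol/4.3a-research
#Run a model without the GUI
$ comsol batch -inputfile mymodel.mph -outputfile mymodel_solved.mph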

Mathematica 9.0

Mathematica 9 is a major update to the Mathematica software.

Using it

$ module load mathematica/9.0 
$ mathematica
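For batch use on compute nodes, the command-line kernel can run a script without opening the notebook front end; a minimal sketch (the script name is a placeholder, and the -script option is the standard Mathematica mechanism for this):

$ module load mathematica/9.0
#Run a Wolfram Language script non-interactively
$ math -script myscript.m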

VASP 5.2.12

The pre-calculated kernel for the vdW-DF functional has been installed into the same directory as the vasp binary.
This pre-calculated kernel is contained in the file “vdw_kernel.bindat”.

Using it

#First, load the vasp module (and all the prerequisites) 
$ module load intel/12.1.4 mvapich2/1.6 mkl/10.2 fftw/3.3 vasp/5.2.12
#Copy the kernel to where vasp expects (normally the working directory)
$ cp ${VDW_KERNEL} .
# Run vasp
$ mpirun vasp

