PACE A Partnership for an Advanced Computing Environment

September 19, 2011

XSEDE Scholars: includes travel to SC11, applications due Fri Sep 30

Filed under: Events — admin @ 3:57 pm

XSEDE Scholars Program

— Includes travel grants for the SC11 supercomputing conference in Seattle WA Nov 12 – 16 2011!

Applications due Fri Sep 30 2011

http://bit.ly/xsede_2011

 

An outstanding student opportunity in the computational sciences!

Supercomputers, data collections, new tools, digital services, increased productivity for thousands of scientists around the world.

Sound exciting? Apply to become an XSEDE Scholar today!

XSEDE is a five-year, $121 million project supported by the National Science Foundation that replaces and expands on the NSF’s TeraGrid project.

More than 10,000 scientists used the TeraGrid to complete thousands of research projects, at no cost to the scientists.

XSEDE continues that same sort of work — only in more detail, generating more knowledge, and improving our world in an even broader range of fields.

You can become involved in XSEDE, too, if selected as a Scholar.

XSEDE Scholars will:

  • Attend the SC11 (supercomputing) conference in Seattle, November 12-16, 2011, through travel grants.
  • Meet other XSEDE Scholars in special sessions at SC11.
  • Participate in at least four activities with other Scholars during the year (e.g., technical training, content-based and mentoring webinars).
  • Network with leaders in the XSEDE research community.
  • Learn about research, internships, and career opportunities.

The XSEDE Scholars Program is directed by Richard Tapia and managed by Alice Fisher, Rice University.

 

Apply now at:

http://bit.ly/xsede_2011

 

 

APPLICATION DEADLINE: Friday, September 30, 2011

Underrepresented minority undergraduate or graduate students interested in the computational sciences are especially

encouraged to apply for the year-long program.

 

Questions about the XSEDE Scholars?

Contact Alice Fisher: afisher@rice.edu

 

The Extreme Science and Engineering Discovery Environment (XSEDE) is supported by the National Science Foundation.

XSEDE is led by the University of Illinois’ National Center for Supercomputing Applications. The partnership includes: Carnegie Mellon University/University of Pittsburgh, Cornell University, Indiana University, JüSupercomputing Centre, National Center for Atmospheric Research, The Ohio State University, Purdue University, Rice University, Shodor Education Foundation, Southeastern Universities Research Association, University of California Berkeley, University of California San Diego, University of Chicago, University of Illinois at Urbana-Champaign, University of Tennessee Knoxville, University of Texas at Austin, and the University of Virginia.

 

Find out more about XSEDE at:

https://www.xsede.org/

September 14, 2011

NSF Presidential Awards for STEM Mentoring due Oct 5

Filed under: Events — admin @ 1:59 pm

Invitation for Nominations
Presidential Awards for Excellence in Science, Mathematics and Engineering Mentoring (PAESMEM) Nominations
due Wed Oct 5 2011
PAESMEM@nsf.gov

PAESMEM recognizes outstanding mentors and mentoring programs that enhance the participation of individuals who might not otherwise have considered or had access to opportunities in science, technology, engineering, and mathematics (STEM), including persons with disabilities, women, and minorities.

Awardees serve as exemplars to their colleagues and leaders in the national effort to develop the nation’s human resources in STEM.

Who is eligible?

Individuals who arc U.S. Citizens or Permanent Residents, and U.S. organizations and companies.

Mentors and mentoring programs from a variety of contexts — academia, corporate, government, non-profit — are all welcome.

What are the criteria?

At least five years of outstanding mentoring to a significant number of persons who might not otherwise have considered or had access to opportunities in STEM (including persons with disabilities, women, and minorities) who are either:

  • Students at the K-12, undergraduate or graduate education level,

or

  • Early career scientists, mathematicians, or engineers who have completed their degree in the past three years.

What is required?

A description and documentation of the mentoring methods and procedures of the individual or organizational nominee, and letters supporting the nomination (a maximum of 5).

What is the award?

$10,000 honorary award and an invitation to Washington D.C. for recognition events, meetings with policy leaders, and professional development workshops.

How do I nominate?

Anyone can nominate a person (including themselves) or organization.

When is the deadline?

Nominations are due by Wednesday October 5, 2011.

For more information or to submit a nomination, please visit:

http://www.nsf.gov/

or contact:

Division of Undergraduate Education
National Science Foundation
4201 Wilson Blvd., Suite 835, Arlington, VA 22230
PAESMEM@nsf.gov

September 13, 2011

Workshop on Advanced Computational Methods in Engineering and Environmental Science

Filed under: Events — admin @ 8:42 pm

WORKSHOP ON ADVANCED COMPUTATIONAL METHODS IN ENGINEERING AND ENVIRONMENTAL SCIENCE:  Sept 26-28, 2011

This workshop will focus on the use of advanced computational and programming methodologies in the development of (1) land-use, hydrological,  ocean, and/or air models that are used to address the effect of megacity development on the regional and the world-wide environment and (2) models to predict and assess the resistance of structures to blast damage.

Programming models to be discussed will include the use of global address space programming models such as Coarray Fortran, and Unified Parallel C, and the Message Passing Interface Library and partitioning libraries in large parallel applications.

Presenters will include V. Balaji, Geophysical Fluid Dynamics Laboratory-Princeton University, T. Clune, NASA-Goddard Institute of Space Studies, G. Karypis, University of Minnesota, R. Numrich, College of Staten Island, Uwe Kuster, HLRS.

Additional information is also available at:

http://www.csi.cuny.edu/cunyhpc/workshops.php

Advanced registration is required.

Parking for the workshop will be available on the campus of the College of Staten Island.

downloadable flyer

September 9, 2011

Clusters Are Back!

Filed under: tech support — pm35 @ 8:20 pm

1530

After days of continuous struggle and troubleshooting, we are happy to tell you that the clusters are finally back in a running state. You can now start submitting your jobs. All of your data have been safe, however the jobs that were running during the incident were killed and they need to be restarted. We understand how this interruption must have adversely impacted your research and apologize for all the trouble. Please let us (pace-support@oit.gatech.edu) know if there is anything we can do to bring you up to speed once again.

The brief technical explanation of what happened:
At the heart were a set of fiber optic cables that interacted to intermittently interrupt communications among the Panasas storage modules.  This would result in the remaining modules beginning to move the services handled by a non-communicating module to a backup location.  During the process of moving the service, one of the other modules (including the one accepting the new service) would either send or receive some garbled information causing the move now in process to be re-recovered or an additional service to be relocated, depending upon which modules were involved.  Interestingly, the cables themselves appear not to be bad but instead interacted badly with the networking components. Thus, when cables were replaced or switch ports or network switch itself were swapped, the problems would appear “fixed” for a short while then return before a full recovery could be completed. The three vendors involved provided access to their top support and engineering resources and these have never seen this kind of behavior. Our experience and adversity have been entered into their knowledge bases for future diagnostics.

Thank you once again for your understanding and patience!

Regards,
PACE Team

Cluster status 9 Sept

Filed under: tech support — pm35 @ 2:54 pm

0800

Our replacements continue to function properly and the overnight tests have all passed.  We are starting the process of bringing the cluster back up though, as indicated in yesterday evening’s update, this will take us some hours to accomplish as we will test each step along the way to ensure the success of operation.

1050

So far, so good.  Looks like we’re still on track to have the cluster ready early afternoon today.  We’ll post a notice here and in the email lists announcing when it is ready.

Note, all the jobs that were running should be assumed failed.  Any job that was in the queue and had not yet started remains in the queue and will be started as soon as the cluster is ready.

September 8, 2011

Problems Continued (8 Sept)

Filed under: tech support — pm35 @ 11:32 am

0720

Our replacement Force-10 switch is expected to be delivered before noon by FedEx.  While waiting, we have continued working with Panasas overnight (between short sleeps) providing them with updated information.   We’ll continue to update this blog as we have further information.

1500

The replacement was received, installed but the problems remain.  Panasas and Penguin are both on-site and are working to restore service with remote support.  We have escalated this to the highest levels at both corporations and believe they are doing their best at this time to help restore service.  We will update this blog and send emails once we have some positive news.

2100

After replacement of the Force-10 switch, we continued to have network instability issues with the Panasas.  After much sleuthing, it was discovered there is an interaction between the cable type used to connect the Panasas devices to the Force-10 switch and the switch itself.  This was a bit difficult to diagnose due to the intermittent nature of the failures where some cable paths would work fine for some long period of time then partially corrupt packets for some short period of time, then be fine once again.  The corruption time and significance was sufficient to alert the Panasas software there was a problem with one or more of the units and it would attempt recovery.  Part-way through the recover process, the data path would be fine but often another path would begin to fail in a similar mode.  Unusual behaviour for any cables at best and not seen before with this cable type.  This interaction is now logged in both the Panasas and Force-10 knowledge archives.

Once the cables were replaced, there remained some significant problems with the file systems themselves.  Again, there is no loss or corruption of data.  Just the volumes of information being moved automatically at too-frequent an interval.  After the Panasas realm was settled, it is a long task to re-certify the data partitions and insure the data is correct.  This process is ongoing now and will continue for some while overnight as it re-certifies the many TB of data.

If all goes according to plan, we will arrive in the morning to a still-stable Panasas storage and will begin the recovery of the cluster operation.  Expectation is we will have the cluster back in operation by early afternoon, earlier if at all possible.  Once we have successfully tested the cluster, we will restart all the scheduler services and announce both here and via the mailing lists. Hopefully, we’ll have the good word in the morning.

September 7, 2011

Problems continued (7 Sept)

Filed under: tech support — pm35 @ 6:52 pm

0830

Panasas engineers on their way to be on-site.  Expecting tracking information for Force-10 switch replacement.

We re-configured the networking for the Panasas shelves bypassing part of the Force-10 switch.  The Panasas realm appears stable but unavailable via the Infiniband.  We will continue to work with Panasas on this resolution.

1430

The replacement Force-10 switch has not been received. Panasas and PenguinComputing (now both on site) are investigating.  Alternate networking topologies are being explored with the campus backbone team and level 3 Panasas engineers.  The PACE staff continues to work as quickly as possible in an effort to restore service for this critical resource.  We are investigating some ways we can restore service access to the files but at a temporarily reduced capacity until a final solution may be obtained.  We are also investigating how we may bring access to your files to the head nodes so you may be able to access them or copy them as you need though, perhaps, at a reduced speed than normal.

These possibilities are being examined by PACE, OIT network, Panasas and Penguin Computing personnel.  We will continue to update you here with information as it becomes available.

1850

The replacement Force-10 switch has not been received and will not be on-site until tomorrow morning. We have a FedEx tracking number and will monitor it’s progress.  Some NFS-access to the Panasas has been restored by creating an alternate network topology.  While not ideal, this will allow us to bring a few parts of the clusters back into operation.  However, at this time, we are still unable to access the Panasas from most of the systems and head nodes. We continue to work as quickly as possible to restore service with this critical resource.  We can confirm at this time, no files have been lost or corrupted through all of the actions to date.  All that is left is to restore access to them.

Panasas engineers remain on-site and and engaged and will continue working on the panfs access problem.

September 6, 2011

Problems this morning (6 Sept)

Filed under: tech support — admin @ 1:38 pm

0936

We’re having some widespread problems this morning, seemingly related to the high-performance scratch space. This is impacting most PACE clusters, with the exception of Atlantis and the Legacy PACE Community Cluster. Watch this blog post for updates.

1044

Looks like some network issues internal to the Panasas..  Still working.

1202

We’ve been on the phone with Panasas support.  Issues still remain…

1739

Still having problems.  We’ve been escalated to level 3 technical support within Panasas.  Current theory is some sort of network interaction between the new Panasas and the Force10 10gige switch (provided by Panasas).  The “old” set of Panasas shelves aren’t having this problem, so there’s a pretty good chance that we don’t have any issues with upstream networking equipment.

2250

Problems remain and the storage remains unavailable.  We’ve been working with Panasas and Force-10 level 3 support to resolve a very unusual combination of problems.  We are now awaiting replacement parts from both Panasas and Force-10 in the morning.  The good news is Panasas believes we have not lost nor corrupted any data on the storage.  We’ll keep you up-to-date with where we are in resolution.  We apologize for this inconvenience.

Powered by WordPress