Advanced Computing Graduate Physics Program

Introduction

Penn has become one of the leading institutions in addressing new issues in data-intensive computing and the use of parallel and distributed computation techniques in data-intensive applications. Much of this is the result of the National Scalable Cluster Project (NSCP), which won national prizes in data mining at the annual SuperComputing Conference in 1995, 1996, and 1997, and most recently the High Performance Computing Award in 1999. NSCP brings together a number of experts in computing, large databases, high-speed networks, and the application of parallel programming techniques.

Hardware installed includes five IBM SP2 frames ("Deep Blue Processors" made famous in the chess match), several large disk arrays (parallel high-speed disks), a 26 Terabyte tape library for archival storage, and high-speed networking equipment. The parallel implementations can increase execution speeds by an order of magnitude or more for some applications. Penn is also one of the first universities to have access to the vBNS (very high-speed Backbone Network Service) and now the Abilene Internet2 network, which provides high-speed links to the supercomputing centers and other U.S. leaders in computational research. NSCP is part of the National Partnership for Advanced Computational Infrastructure (NPACI), led by the San Diego Supercomputer Center.

Research Areas which make use of NSCP

Data Mining in Particle Physics

In the search for the basic particles which determine the structure of the universe, particle detector experiments collect data samples of 10-100 Terabytes. Future experiments will likely increase this by a further factor of ten. Analysis of the data consists of statistical analysis of selected items, extensive analysis of correlations within records, and significant amounts of CPU-intensive Monte Carlo simulation and visualization. In the past several years, many techniques have been developed to apply "embarrassingly parallel" analysis, which uses multiple processors with each CPU processing a single record at a time, independently of the others. More extensive use of parallel programming is being investigated for mining these large samples, including the use of geographically dispersed processing elements and data. NSCP is actively involved in work with the Stanford Linear Accelerator Center on future designs for linear colliders in the TeV range. The data management and data mining techniques under development in this research are useful in several analogous problems outside of physics, for example market basket analysis and trend analysis.
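
The "embarrassingly parallel" pattern mentioned above can be illustrated with a minimal Python sketch. It is not NSCP code: the record layout, the `analyze_record` function, and the selection cut are all hypothetical. A thread pool is used only so the example runs anywhere; on a cluster each worker would be a separate CPU or node, and for CPU-bound work in Python a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_record(record):
    """Hypothetical per-event analysis: sum the energies in one record."""
    return sum(record["energies"])


def analyze_sample(records, workers=4):
    # Each record is handled independently, so no worker needs to
    # communicate with any other; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_record, records))


# Toy sample: eight events, each with three made-up energy deposits.
records = [{"event": i, "energies": [i, 2 * i, 3 * i]} for i in range(8)]
totals = analyze_sample(records)

# Selection step: keep the events whose summed energy passes a cut.
selected = [r["event"] for r, t in zip(records, totals) if t > 20]
```

Because no record depends on another, the work scales simply by adding workers. The harder problems named in the text (correlations across records, geographically dispersed data) are exactly the ones this pattern does not cover.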

Data Storage for the National Library of Medicine

NSCP is building a Next Generation Internet pilot of a system for storing and indexing mammography data from many hospitals. This system is a phase II National Library of Medicine project. Project elements include large-scale data collections, secure networks, metadata management, and computer-assisted diagnostics. We are particularly interested in content-based data mining and the use of data mining techniques on this very large collection.

Video Storage

Large video archives are being collected for use in distance education and for trials in content mining.

Philadelphia Neighborhood Information System

A project to collect, organize and mine large volumes of data from organizational units in the City of Philadelphia.

Census Data and Economic Data

Organization of Census data and data from the State of Pennsylvania, and applications for data mining in that space.

High Performance Communications

The focus on network-centric computing within the NSCP requires expertise in high performance networking. The dispersed applications being used within the NSCP provide an ideal suite for investigating performance issues and CPU and kernel limitations, and for developing network diagnostic tools. An extensive ATM/LAN infrastructure is used at Penn, and developments are underway for a parallel disk-to-disk wide area network using high-speed dense wavelength division multiplexing (DWDM).

Real-Time Functional Brain Imaging using MRI

Magnetic Resonance Imaging is a common technique for medical imaging which, in the case of the brain, can be used to view not only structure but also function. Parallel computation is being used in a pilot project which provides analysis of MRI images from Children's Hospital of Philadelphia. Applications include locating visual and aural centers in the brain and applying fuzzy clustering methods to understand functional connectivity in the resting brain.
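
To make the idea of fuzzy clustering concrete, here is a small sketch of fuzzy c-means, one common fuzzy clustering method. The text does not say which method the project uses, so the choice of algorithm, the function names, and the toy 2-D points are assumptions for illustration only. Unlike hard clustering, each point receives a degree of membership in every cluster.

```python
import math


def memberships_for(points, centers, m=2.0, eps=1e-12):
    """Membership of each point in each cluster; every row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clamp distances so a point sitting on a center does not divide by zero.
        dists = [max(math.dist(p, c), eps) for c in centers]
        rows.append(
            [1.0 / sum((d_k / d_j) ** exponent for d_j in dists) for d_k in dists]
        )
    return rows


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    centers = [list(c) for c in centers]
    dims = range(len(points[0]))
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        for k in range(len(centers)):
            # New center: mean of all points weighted by membership**m.
            weights = [row[k] ** m for row in u]
            total = sum(weights)
            centers[k] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total for d in dims
            ]
    return centers, memberships_for(points, centers, m)


# Toy data: two well-separated groups of points (not MRI data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, u = fuzzy_c_means(points, centers=[(0, 0), (10, 10)])
```

In an imaging setting the "points" would be per-voxel signals rather than 2-D coordinates, and the soft memberships would indicate how strongly each voxel belongs to each group.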

Imaging Using Low Power Lasers in Opaque Media (and Human Tissue)

Low power lasers can be used at several wavelengths to determine the properties of dense media such as colloids, suspensions, gels, etc. From the intensity and phase of the scattered light, the density profile of the scatterers can be determined. This technique, used in condensed matter physics, is being applied to medical imaging. When combined with standard mammography techniques, it is hoped that laser scattering will provide both improved tumor detection and potentially different tumor specificity.

Models of the Early Universe

Several groups in astrophysics are preparing to use the clusters both for compute-intensive parallel jobs and for the analysis of large collections of astrophysics data.

C++ for Managing Collections of Complex Data

The large data sets used for particle physics contain complex collections of data. Unlike the records in many databases, most items have associated variable-length fields. The relations between the data elements can be efficiently described in most cases with C++ classes and object techniques. Reading the data efficiently requires strategies which group together the objects used for query and selection, so that a query does not have to read complete records.
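
The grouping strategy described above can be sketched as follows. The project itself works in C++; Python is used here only to keep the example short, and the event layout, the `TAG` format, and the selection cuts are hypothetical. The small fixed-size quantities used for selection are packed into one compact block, and the variable-length records are decoded only for events that pass the cut.

```python
import json
import struct

# Hypothetical fixed-size "tag" per event: id, hit count, total energy.
TAG = struct.Struct("<IHf")


def build_stores(events):
    """Split events into a compact tag block and a store of full records."""
    tags = bytearray()
    full = {}
    for ev in events:
        energy = sum(hit["e"] for hit in ev["hits"])
        tags += TAG.pack(ev["id"], len(ev["hits"]), energy)
        full[ev["id"]] = json.dumps(ev).encode()  # variable-length payload
    return bytes(tags), full


def select(tags, min_hits, min_energy):
    """Scan only the tag block; no full record is decoded here."""
    return [
        ev_id
        for ev_id, n_hits, energy in TAG.iter_unpack(tags)
        if n_hits >= min_hits and energy >= min_energy
    ]


def load(full, ev_ids):
    """Decode complete records for the selected events only."""
    return [json.loads(full[i]) for i in ev_ids]


events = [
    {"id": 1, "hits": [{"e": 1.5}, {"e": 2.0}]},
    {"id": 2, "hits": [{"e": 0.5}]},
    {"id": 3, "hits": [{"e": 4.0}, {"e": 3.0}, {"e": 1.0}]},
]
tags, full = build_stores(events)
chosen = select(tags, min_hits=2, min_energy=3.0)
records = load(full, chosen)
```

The tag block costs a fixed `TAG.size` bytes per event regardless of how large the full record is, which is what makes the selection pass cheap on a large sample.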

Data Mining Tools Research

Joint work with the Kensington Group at Imperial College and the National Center for Data Mining at UIC.

Ptool for Managing Distributed Data

The persistent object store tools developed at UIC by R. Grossman's group are being used to populate large object stores of C++ objects derived from a sample of data from the CDF detector at Fermilab. The advantage of the object store is that it provides efficient management of data across many nodes in a geographically distributed system. We are investigating using the object store to reference legacy data through disk seek pointers.
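
The idea of referencing legacy data through seek pointers can be sketched as follows. This is not the API of the UIC tools: the length-prefixed file layout, the `LegacyRef` class, and the helper names are assumptions made for illustration, and an in-memory stream stands in for a file on disk. One pass over the legacy file records where each record starts, and a lightweight handle then reads a record only when it is asked for.

```python
import io
import struct

# Assumed legacy layout: a 4-byte little-endian length, then the payload.
LENGTH = struct.Struct("<I")


def write_legacy(stream, payloads):
    """Write length-prefixed records, standing in for an existing data file."""
    for payload in payloads:
        stream.write(LENGTH.pack(len(payload)))
        stream.write(payload)


def index_offsets(stream):
    """One pass over the legacy file, recording where each record starts."""
    offsets = []
    stream.seek(0)
    while True:
        start = stream.tell()
        header = stream.read(LENGTH.size)
        if len(header) < LENGTH.size:
            return offsets
        (size,) = LENGTH.unpack(header)
        offsets.append(start)
        stream.seek(size, io.SEEK_CUR)  # skip the payload without reading it


class LegacyRef:
    """Lightweight handle: holds a seek pointer, not the data itself."""

    def __init__(self, stream, offset):
        self.stream = stream
        self.offset = offset

    def load(self):
        self.stream.seek(self.offset)
        (size,) = LENGTH.unpack(self.stream.read(LENGTH.size))
        return self.stream.read(size)


legacy = io.BytesIO()
write_legacy(legacy, [b"event-A", b"event-BB", b"event-CCC"])
refs = [LegacyRef(legacy, off) for off in index_offsets(legacy)]
last = refs[2].load()
```

The legacy file is never rewritten. Only the small list of offsets would need to live in the object store, which is what makes the approach attractive for large existing data sets.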

BaBar Detector at SLAC

The BaBar detector at the Stanford Linear Accelerator Center will begin operation in 1999. NSCP clusters have been used to produce extensive Monte Carlo data sets to assist in the design of the detector and its software.