Advanced Computing Graduate Physics Program
Introduction
Penn has become one of the leading institutions in addressing new issues
in data intensive computing and the use of parallel and distributed
computation techniques in data intensive applications. Much of this
has been the result of the NSCP, which won
national prizes in data mining at the annual SuperComputing Conference in 1995, 1996, and 1997 and,
most recently, the High Performance Computing Award in 1999.
The National Scalable
Cluster Project (NSCP) brings together a number of experts in
computing, large databases, high speed networks, and the application
of parallel programming techniques. Hardware installed includes
five IBM SP2 frames ("Deep Blue Processors" made famous in the chess match), several large disk arrays
(parallel high speed disks), a 26 Terabyte tape library for archival storage,
and high speed networking equipment. The parallel implementations are
capable of increasing the execution speeds by an order of magnitude or more for some
applications. Penn is also one of the first universities to have access to the vBNS
(very high-speed Backbone Network Service) and now the Abilene Internet2 network,
which provides high-speed links to the supercomputing
centers and other U.S. leaders in computational research.
NSCP is part of the National Partnership
for Advanced Computational Infrastructure (NPACI), led by the San Diego Supercomputer Center.
Research Areas which make use of NSCP
Data Mining in Particle Physics
In the search for the basic particles which determine the structure of the
universe, particle detector experiments collect data which result in
samples of 10-100 Terabytes. Future experiments will likely increase
this by a further factor of ten. Analysis of the data consists of statistical
analysis of selected items, extensive analysis of correlations within
records, and significant amounts of CPU-intensive Monte Carlo simulation and
visualization. In the past several years, many techniques have been developed
to apply "embarrassingly parallel" analysis, which uses multiple
processors with each CPU processing one record independently of the others. More extensive
use of parallel programming is being investigated for mining these large
samples including the use of geographically dispersed processing
elements and data. NSCP is actively involved in work with the Stanford Linear
Accelerator Center on future designs for linear colliders in the TeV range.
The data management and data mining techniques under development in this research are useful in
several analogous problems outside of physics,
for example, market basket analysis, and
trend analysis.
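The "embarrassingly parallel" pattern mentioned above can be illustrated with a minimal Python sketch. It is not NSCP code: the record layout (a list of (E, px, py, pz) four-vectors per event) and the function names invariant_mass and analyze are assumptions chosen for illustration. The point is only that each record is processed without reference to any other, so the work can be split across processors with no communication between them.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def invariant_mass(record):
    """Invariant mass of one event record.

    A record is assumed to be a list of (E, px, py, pz) four-vectors;
    this layout is hypothetical, not a real detector format.
    """
    e = sum(p[0] for p in record)
    px = sum(p[1] for p in record)
    py = sum(p[2] for p in record)
    pz = sum(p[3] for p in record)
    m2 = e * e - (px * px + py * py + pz * pz)
    # Guard against small negative values caused by rounding.
    return math.sqrt(max(m2, 0.0))


def analyze(records, workers=1):
    """Apply invariant_mass to every record, optionally in parallel.

    Records are independent, so each worker process handles its own
    chunk of records and never talks to the others.
    """
    if workers <= 1:
        return [invariant_mass(r) for r in records]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(invariant_mass, records, chunksize=64))
```

Calling analyze(events, workers=8) from a script (inside an `if __name__ == "__main__":` guard) would spread the events over eight processes; the result is the same list as the serial call, only produced faster when the per-record work dominates.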
Data Storage for the National Library of Medicine
NSCP is building a Next Generation Internet pilot of a system for storing and indexing
mammography data from many hospitals. This system is a phase II National Library of Medicine
project. Project elements include large scale data collections, secure networks, metadata management,
and computer assisted diagnostics. We are particularly interested in content data mining and the use
of data mining techniques on this very large collection.
Video Storage
Large video archives are being collected for use in distance education and for trials in content mining.
Philadelphia Neighborhood Information System
A project to collect, organize and mine large volumes of data from organizational units in the City of Philadelphia.
Census Data and Economic Data
Organization of Census data and data from the State of Pennsylvania and applications for data mining in that space.
High Performance Communications
The focus on network centric computing within the NSCP requires
expertise in high performance networking. The dispersed applications
being used within the NSCP provide an ideal suite of applications for
investigating performance issues, cpu and kernel limitations, and
developing network diagnostic tools. An extensive ATM/LAN
infrastructure is used at Penn, and development is underway on a parallel disk-to-disk
wide area network using high speed dense wavelength division multiplexing (DWDM).
Real-Time Functional Brain Imaging using MRI
Magnetic Resonance Imaging is a common technique for medical
imaging which in the case of the brain can be used to view not only
structure but also function. Parallel computation is being used in a pilot
project which provides analysis of MRI images from the Children's
Hospital of Philadelphia.
Applications include locating the visual
and aural centers in the brain and applying fuzzy clustering methods to
understand functional connectivity in the resting brain.
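As an illustration of what a fuzzy clustering method does, here is a minimal Python sketch of fuzzy c-means, one widely used algorithm of this kind. The text does not say which fuzzy method the project uses, so the choice of fuzzy c-means is an assumption, as are the function names and the representation of each data point as a tuple of numbers. Unlike hard clustering, every point receives a degree of membership in every cluster, and the memberships of a point sum to one.

```python
import math


def memberships(points, centers, m=2.0, eps=1e-12):
    """Degree of membership of every point in every cluster."""
    exponent = 2.0 / (m - 1.0)
    table = []
    for x in points:
        # Clamp distances so a point lying exactly on a center
        # does not cause a division by zero.
        dists = [max(math.dist(x, c), eps) for c in centers]
        table.append(
            [1.0 / sum((dk / dj) ** exponent for dj in dists) for dk in dists]
        )
    return table


def update_centers(points, table, m=2.0):
    """Each center is the mean of all points weighted by membership**m."""
    dim = len(points[0])
    centers = []
    for k in range(len(table[0])):
        weights = [row[k] ** m for row in table]
        total = sum(weights)
        centers.append(tuple(
            sum(w * x[d] for w, x in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centers


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iterations):
        centers = update_centers(points, memberships(points, centers, m), m)
    return centers, memberships(points, centers, m)
```

The fuzziness parameter m controls how soft the assignment is: values close to 1 approach hard clustering, while larger values spread membership more evenly. A fixed iteration count is used here for simplicity; a production version would normally stop when the centers stop moving.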
Imaging using Low Power Lasers in Opaque Media (and Human Tissue)
Low power lasers can be used at several wavelengths to determine the
properties of dense media such as colloids, suspensions, gels, etc. From
the intensity and phase of the scattered light, the density profile of
the scatterers can be determined. This technique used in condensed matter
physics is being applied to medical imaging.
When combined with standard mammography techniques, it
is hoped that the laser scattering will provide both improved tumor detection
and potentially different tumor specificity.
Models of the Early Universe
Several groups in astrophysics are preparing to use the clusters both for
running compute intensive jobs in parallel and for the analysis of large
collections of astrophysics data.
C++ for Managing Collections of Complex Data
The large data sets used for particle physics contain complex collections
of data. Unlike the records in many databases, most items have associated
variable length fields. The relations between the data elements can be
efficiently described in most cases with C++ classes and object techniques.
Reading the data efficiently requires strategies which group together the
objects used for query and selection, to avoid reading complete records.
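The grouping strategy described above can be sketched as follows. The sketch is in Python for brevity rather than the C++ used in the project, and the class names EventTag and EventStore, the tag fields, and the (energy, charge) track layout are all hypothetical. The idea it shows is that small summary objects used for selection are kept together and scanned on every query, while the variable length bulk of each record is read only for the events that pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventTag:
    """Small fixed-size summary used for selection (hypothetical fields)."""
    n_tracks: int
    total_energy: float


class EventStore:
    def __init__(self):
        self._tags = []         # compact summaries, scanned on every query
        self._payloads = []     # variable length track lists, read on demand
        self.payload_reads = 0  # counts how many full records were fetched

    def add(self, tracks):
        """Store one event; tracks is a list of (energy, charge) tuples."""
        self._tags.append(EventTag(len(tracks), sum(e for e, _ in tracks)))
        self._payloads.append(list(tracks))

    def select(self, predicate):
        """Return indices of matching events by scanning tags only."""
        return [i for i, tag in enumerate(self._tags) if predicate(tag)]

    def load(self, index):
        """Fetch the full variable length record of one selected event."""
        self.payload_reads += 1
        return self._payloads[index]
```

A query such as store.select(lambda t: t.total_energy > 5.0) touches only the tags; the payload_reads counter stays at zero until load is called for a selected event. In a disk-based store the tags would sit in their own contiguous region so that a selection pass reads a small fraction of the total data.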
Data Mining Tools Research
In collaboration with the Kensington Group at Imperial College and the National Center for Data Mining at UIC.
Ptool for Managing Distributed Data
The persistent object store tools developed at UIC by R. Grossman's group
are being used
to populate large object stores of C++ Objects derived from a sample of data
from the CDF detector at Fermilab.
The advantage of the object store is that it provides efficient
management of data across many nodes in a geographically distributed system. We are
investigating using the object store to reference legacy data through disk
seek pointers.
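The idea of referencing legacy data through seek pointers can be sketched in a few lines of Python. This is not the Ptool interface: the length-prefixed record layout and the functions write_records and read_record are assumptions made for illustration. The store keeps only a byte offset per record, and a reader jumps directly to that offset in the legacy file instead of scanning or copying the data.

```python
import io
import struct

# Assumed record layout: 4-byte little-endian length, then the payload.
HEADER = struct.Struct("<I")


def write_records(stream, payloads):
    """Append length-prefixed records; return the seek pointer of each."""
    pointers = []
    for payload in payloads:
        pointers.append(stream.tell())
        stream.write(HEADER.pack(len(payload)))
        stream.write(payload)
    return pointers


def read_record(stream, pointer):
    """Jump straight to one record without reading the ones before it."""
    stream.seek(pointer)
    (length,) = HEADER.unpack(stream.read(HEADER.size))
    return stream.read(length)
```

With an in-memory io.BytesIO() standing in for the legacy file, write_records returns the list of offsets, and read_record(stream, pointers[i]) retrieves record i in a single seek. Objects in the store would hold such a pointer in place of the data itself.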
BaBar Detector at SLAC
The BaBar detector at the Stanford Linear Accelerator Center will
begin operation in 1999.
NSCP clusters have been used to produce extensive Monte Carlo data sets
to assist in the design of the detector and its software.