WHAT DID THE 2011 EARLY CAREER AWARD ALLOW YOU TO DO?
The goal of this Early Career research project was to develop scalable graph algorithms for problems motivated by DOE-relevant life science applications.
DNA and protein molecular sequence data from environmental microbial samples encode key pieces of information on which species inhabit these environments and how they function.
To find such information, we first needed to reduce the complexity of these data sets. We took a graph-theoretic approach where we first compared sequences with each other. Using those comparisons, we built a network (or graph) and used that network to mine for clusters or groups that represent the key biological building blocks (species groups or functional modules).
To deal with the vastness of these input sequence collections, we used parallel computing. This helped us build very large graph representations in short time and perform clustering on them.
More specifically, we developed two classes of parallel methods. One class used dense subgraph detection schemes (i.e., to identify parts of the network that have a high concentration of edges). The other class used community detection schemes that are more heavily used to discover friendship groups or “communities” in social networks.
However, as often is the case, addressing one research problem led to another. The community detection problem led to an array of other interesting problems. Examples of these problems included: how to identify concurrency while processing a graph by coloring the vertices; how to identify communities in a graph containing different types of vertices (or entities); and how to design effective parallel strategies for assembling a genome from its pieces (much like assembling a jigsaw puzzle).
The efforts led to a series of publications and open source software for performing parallel graph analytics at scale. At the same time, the project also led to a set of research collaborations and productive partnerships with domain scientists and high-performance computing experts at Pacific Northwest National Laboratory. It also led to the training of six PhD students.
The foundations laid in the project have allowed us to design scalable (parallel) algorithms for more complex graph problems including parallel influence maximization (with applications to infectious disease modeling), dynamic graph problems, and generating genome assemblies of complex genomes at scale.
Ananth Kalyanaraman is a Professor and Boeing Centennial Chair in Computer Science at the School of Electrical Engineering and Computer Science (EECS), Washington State University in Pullman. He serves as the Associate Director for the School of EECS, and he also holds a joint appointment with Pacific Northwest National Laboratory (PNNL) as a Senior Scientist. At WSU, he holds affiliate faculty positions at the Molecular Plant Sciences Graduate Program and the Paul G. Allen School for Global Animal Health.
SUPPORTING THE DOE SC MISSION:
The Early Career Research Program provides financial support that is foundational to early career investigators, enabling them to define and direct independent research in areas important to DOE missions. The development of outstanding scientists and research leaders is of paramount importance to the Department of Energy Office of Science. By investing in the next generation of researchers, the Office of Science champions lifelong careers in discovery science.
For more information, please go to the Early Career Research Program.
THE 2011 PROJECT ABSTRACT:
Efficient Graph Kernels for Extreme Scale Analysis of Environmental Community Data
The goal of this research is to develop novel parallel algorithms for graph‐theoretic analysis of biological data on next‐generation supercomputing platforms. Graph methods have an immense potential to transform the information space in bioinformatics and computational biology. The growing sizes of data repositories have ensured that such potential cannot be realized without a comprehensive embrace of high-performance computing.
The methods developed in this project will allow scientists to discover community structures hidden within very large graphs built out of high‐throughput biological data. The specific aims of the project are to (1) develop scalable graph‐theoretic parallel algorithms for clustering biological data; (2) build new inter‐database analytical capabilities using novel multi‐graph representations of biological data; and (3) apply graph‐theoretic algorithms on the clustering components of three large‐scale applications in metaproteomics and phylogenetics.
Additional profiles of the Early Career Research Program award recipients can be found at https://www.energy.gov/science/listings/early-career-program.
The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, please visit www.energy.gov/science.
Sandra Allen McLean is a Communications Specialist in the Office of Science, Office of Communications and Public Affairs [email protected].