BME 580.688/580.488, CS 600.688/600.488, Spring 2009

Projects

Title Description Contact
Develop an algorithm to assess disease liability of rare missense mutations found in cystic fibrosis patients.

Cystic fibrosis is an autosomal recessive Mendelian disease. This means that if you inherit a loss-of-function allele from each parent, you will have the disease. Loss-of-function mutations in the CFTR gene are sufficient to cause the disease. Full-blown CF is devastating. Predicted median age for survival is only 37. While genetic testing for common CF disease alleles is widespread, many people have rare mutations of uncertain significance and are perplexed by their test results. Garry Cutting from the Institute of Genetic Medicine (JHMI) is leading an initiative that combines clinicians, molecular biologists, and bioinformaticians to discover the disease relevance of all rare mutations in CFTR. From a computational point of view, possible predictors could be designed from multiple sequence alignments (protein or DNA) of the CFTR gene, a protein structure homology model (provided to us by Jack Riordan of UNC), and clinical data available to Garry.

A clever machine learning algorithm and combination of predictive features, combined with training data provided by Garry, could help a lot of people. This project will be done by Ting Li.
Karchin
Predicting functional missense mutations with a support vector machine and tree kernel that incorporates phylogenetic information.

Protein multiple sequence alignments (MSAs) play an important role in state-of-the-art computational predictions of the functional importance of missense mutations. However, by representing a family of related proteins in an MSA, we may be losing important information about the evolutionary history of each mutated position. We hypothesize that by representing each column in an MSA as a phylogenetic tree and developing a novel support vector machine kernel, we can substantially improve predictive accuracy. This project will be done by Yun-Ching Chen.
Karchin
Computational studies of BRCT domains.

The BRCT domain is a protein structural unit important in DNA repair, DNA damage response, and cell cycle regulation. It gets its name (BRCA1 C-terminal) because it was first identified by sequence analysis of the C-terminus of the hereditary breast cancer gene BRCA1. Functional studies revealed that this region was essential for the tumor suppressor function of BRCA1. The BRCT domain consists of ~90-100 amino acid residues arranged into a central beta-sheet surrounded by 2-3 alpha-helices. While single BRCT repeats have been found in proteins (e.g. XRCC1 and DNA ligase III), most BRCT repeats that have been detected occur in multiples. In a number of cases, tandem BRCT repeats have been shown to function as a unit and bind to phosphorylated protein partners at a distinct phosphopeptide binding site.

Two possible projects involving BRCT domains are:

Develop a predictive algorithm to assess whether a pair of adjacent BRCT domains functions as a unit. Possible predictors include the length of the linking region that connects them, the secondary structures that form in the linking region, and the existence of a phosphopeptide binding site formed when the two domains interact. This project will be done by Andy Wong.

Use sequence analysis to explore what the prototype of the BRCT domain was like. There are ~50 unique BRCT domains in 24 genes in the human genome. Are they more or less conserved than the remaining parts of the genes/proteins in which they are located? In the SMART database of protein domains, there are 4038 BRCT domains in 2650 proteins. They are found in ~53% of eukaryotes, ~29% of bacteria, and less than 1% of archaea.
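One way to approach the conservation question above is to compare per-column variability inside and outside a domain in a multiple sequence alignment. The sketch below is only an illustration: it uses Shannon entropy as the conservation score (one of many possible choices), the alignment is made-up toy data rather than real BRCT sequences, and the function names and domain column indices are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def mean_entropy(alignment, column_indices):
    """Mean column entropy over the given column indices of an alignment."""
    columns = list(zip(*alignment))
    scores = [column_entropy(columns[i]) for i in column_indices]
    return sum(scores) / len(scores)


# Toy alignment: four made-up sequences of equal length (not real BRCT data).
alignment = [
    "MKAVLLGE",
    "MKAVLIGD",
    "MRAVLVGE",
    "MKAVLAGK",
]

# Assumption for illustration: columns 2-4 are "the domain", the rest is flank.
domain_columns = [2, 3, 4]
flank_columns = [0, 1, 5, 6, 7]

domain_score = mean_entropy(alignment, domain_columns)
flank_score = mean_entropy(alignment, flank_columns)

# Lower entropy means less variation, i.e. stronger conservation.
print("domain mean entropy:", round(domain_score, 3))
print("flank mean entropy:", round(flank_score, 3))
```

In a real analysis the domain boundaries would come from an annotation source such as SMART, and a score that accounts for phylogeny or amino acid similarity may be more appropriate than plain entropy.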
Karchin and Monteiro
Model the distribution of somatic mutations in breast, colorectal, and pancreatic cancer and glioblastoma multiforme that impact exonic splicing enhancer (ESE) binding sites of SFRS1.

Jeremy Sanford, from U.C. Santa Cruz, has just published a paper in the journal Genome Research describing a canonical SFRS1 binding site and its relation to mutations in human inherited disease. He would like to do a similar analysis of SFRS1 binding sites and their relationship to human cancer. We have data from several large-scale sequencing studies of tumor genomes from the Vogelstein/Kinzler/Velculescu labs at JHMI that can be used in this analysis. This project will be done by Dewey Kim.
Karchin and Sanford
Mutations found in human phosphoinositide 3-kinase (PI3K) proteins.

Phosphoinositide 3-kinase proteins are lipid kinases that play an important role in intracellular signalling processes which control cell growth, proliferation and apoptosis, among many other pathways. Their hyperactivation has recently been discovered to be prevalent in many cancers, and they are an anti-cancer drug target of great interest. The COSMIC database of somatic mutations contains ~100 mutations in one PI3K isoform, PIK3CA, which have been shown to concentrate in hot spots, and there is growing interest in other PI3K isoforms.

Two possible projects involving PIK3CA mutations are:

Develop a specialized PI3K mutant database to integrate clinical, functional, and bioinformatically relevant information about each mutation, and possibly about interactions among mutations. This database will initially focus on the location and structural properties of PIK3CA mutations and use the code infrastructure developed for LS-SNP. Currently, there is one X-ray crystal structure of PIK3CA in the PDB (2rd0). The Sukumar lab at JHMI is doing high-throughput assays to measure the lipid kinase activity of each mutant PIK3CA isoform, to be followed by studies in cell culture and mouse models. We want to make all of this information easily accessible to cancer researchers, physicians and patients.

Develop a predictive algorithm to classify PIK3CA missense mutations as oncogenic (to a first-order approximation, these are the mutations that hyperactivate its lipid kinase activity) or neutral. The Karchin lab has a software program that can compute approximately 90 features of protein sequence, structure, and evolution that are possibly useful for such an algorithm. There is a small training set available of known oncogenic and neutral missense mutations, which we have extracted from published studies. Our collaborators in the Sukumar lab will soon be generating a much larger training set, but this project will likely require an algorithm that can handle a very small amount of labeled training data, perhaps taking advantage of a large body of unlabeled training data. This project will be done by Sancar Adali.
Karchin and Sukumar
Model a directed in vitro evolution experiment in Escherichia coli.

Manel Camps, from U.C. Santa Cruz, has developed a laboratory technology for artificial evolution in E. coli. In these engineered bugs, a reporter gene is replicated by an error-prone polymerase. This system generates mutations at a rate much higher than normally seen in nature and has yielded large libraries of random clones. While the system is tuned to make on the order of one mutation per gene, many instances of genes with more than one mutation can be found. The number of mutations per gene is roughly Poisson distributed, but the distance between mutations in the DNA sequence is not random, clustering around certain distances. This suggests that the distribution of mutations may be determined by an intrinsic property of the polymerase. The spectrum of double mutants is also different from that of the single mutants, suggesting that the double mutants are not independent events. Analysis and modeling of these trends could reveal new information about how DNA polymerase works. It could also have important evolutionary implications, as it would suggest a mechanism to explain the presence of multiple hits in a given gene even though most organisms have very low spontaneous mutation rates. This project will be done by David Simcha.
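A first look at data of this kind might compare the observed number of mutations per clone with Poisson expectations and tabulate the distances between adjacent mutations within a clone. The sketch below assumes each clone is represented as a list of mutated sequence positions; the clone data, function names, and data layout are invented for illustration and do not come from the Camps lab.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_counts(mutation_counts):
    """Estimate lambda as the sample mean and return expected clone counts per k."""
    n = len(mutation_counts)
    lam = sum(mutation_counts) / n
    max_k = max(mutation_counts)
    return lam, {k: n * poisson_pmf(k, lam) for k in range(max_k + 1)}


def adjacent_distances(positions_per_clone):
    """Distances between neighbouring mutations within each clone."""
    distances = []
    for positions in positions_per_clone:
        ordered = sorted(positions)
        distances.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return distances


# Toy data: each inner list holds the mutated positions found in one clone.
clones = [[], [120], [45, 57], [300], [], [10, 22, 410], [88], [], [150, 162], [7]]

counts = [len(c) for c in clones]
lam, expected = expected_counts(counts)
observed = Counter(counts)
distances = adjacent_distances(clones)

for k in sorted(expected):
    print(k, "observed:", observed[k], "expected:", round(expected[k], 2))
print("adjacent distances:", distances)
```

With real libraries, a goodness-of-fit test on the counts and a comparison of the distance histogram against distances obtained from randomly placed mutations would be natural next steps.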
Karchin and Camps
Design a biologically aware sequence analysis kernel to detect functionally important protein binding sites on pre-mRNA, mRNA or DNA.

This kernel can take into account any kind of information that you find to be biologically relevant, including but not limited to inter-species conservation, synteny, positional information on linear sequence, and biophysical properties. Extensive functionality for this kind of kernel development is built into the Shogun machine learning toolbox, which contains implementations of several modern string kernels we've already discussed in class, including Spectrum, Weighted Degree with shifts, and many others. Shogun also allows construction of kernel functions that are weighted linear combinations of other kernels, with an optimization method to find the best weights. This project will be done by Dongwon Lee.
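To make the idea of a spectrum kernel and a weighted combination of kernels concrete, here is a minimal pure-Python sketch. It does not use Shogun or reproduce its API; the function names are hypothetical, and the weights are fixed by hand rather than found by an optimization method.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Spectrum kernel: dot product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[kmer] for kmer, n in cx.items())


def combined_kernel(x, y, weights):
    """Weighted linear combination of spectrum kernels.

    weights maps a k-mer length k to a non-negative weight (chosen by hand here).
    """
    return sum(w * spectrum_kernel(x, y, k) for k, w in weights.items())


# Toy sequences, not real binding sites.
a, b = "GATTACA", "ATTAC"
print(spectrum_kernel(a, b, 2))               # shared 2-mers: AT, TT, TA, AC
print(spectrum_kernel(a, b, 3))               # shared 3-mers: ATT, TTA, TAC
print(combined_kernel(a, b, {2: 0.5, 3: 0.5}))
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, the combined function can be passed to any kernel method; a biologically aware kernel could add further terms for conservation or position in the same way.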
Karchin