Seminar on Applied Mathematics
Zoran Obradovic, Director of the Center for Information Science and Technology, Temple University
DATA MINING APPROACH TO FUNCTIONAL CHARACTERIZATION OF PROTEIN DISORDER
Abstract: About 10 years ago we developed a prediction-based method to show that thousands of proteins lack a fixed structure, that is, are disordered (or unfolded), under physiological conditions. This revolutionary discovery was followed by many efforts to develop improved protein disorder predictors, including three rounds of the community-wide biennial CASP competition (CASP5, CASP6 and CASP7). In this talk we will first briefly present our most recent predictor, which was rated the best model in the disorder category at the seventh Critical Assessment of Structure Prediction experiment (CASP7, Nov. 26-30, 2006).
Next, we will describe how we used this predictor to develop a data mining method that provided a leap forward in answering the challenging question of how protein disorder relates to protein function. In our approach, a statistical evaluation ranks the significance of correlations between predicted intrinsic disorder and the functions annotated in the Swiss-Prot database of about 200,000 sequences. The analysis takes into account protein sequence redundancy and the relationship between protein length and protein structure. Overall, out of the 710 Swiss-Prot functional keywords that were each associated with at least 20 proteins, 238 were found to be strongly positively correlated with predicted long intrinsically disordered regions, whereas 302 were strongly negatively correlated with such regions. The results of this large-scale data mining analysis agree well with smaller, extremely costly studies performed through manual curation by biomedical experts.
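To make the idea of ranking keyword-disorder correlations concrete, the following is a minimal sketch and not the procedure used in the reported study. It assumes each protein has already been labeled as containing (or not containing) a predicted long disordered region, and it compares the proteins carrying a keyword against random protein sets with a matching length profile. The record fields ("length", "disordered", "keywords"), the function name and the bin width are all hypothetical.

```python
import random
from collections import defaultdict


def keyword_disorder_pvalue(proteins, keyword, n_samples=1000, bin_width=100, seed=0):
    """Fraction of length-matched random protein sets that contain at least as
    many predicted-disordered proteins as the set annotated with `keyword`.

    `proteins` is a list of dicts with hypothetical fields:
      "length"     int, sequence length
      "disordered" bool, True if a long disordered region is predicted
      "keywords"   set of functional keyword strings
    """
    rng = random.Random(seed)

    # Group all proteins by length bin so random draws keep the length profile.
    bins = defaultdict(list)
    for p in proteins:
        bins[p["length"] // bin_width].append(p)

    tagged = [p for p in proteins if keyword in p["keywords"]]
    if not tagged:
        raise ValueError("no protein carries this keyword")
    observed = sum(p["disordered"] for p in tagged)

    at_least = 0
    for _ in range(n_samples):
        # For each tagged protein, draw one random protein of similar length.
        sample = [rng.choice(bins[p["length"] // bin_width]) for p in tagged]
        if sum(q["disordered"] for q in sample) >= observed:
            at_least += 1
    return at_least / n_samples
```

Under these assumptions, a value near 0 suggests that the keyword is positively associated with predicted disorder, and a value near 1 suggests a negative association. Sequence redundancy, which the analysis above also accounts for, is not handled in this sketch.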
Finally, we will discuss our machine learning approach to the problem of identifying biomedical publications that contain relevant experimental evidence of protein disorder in MEDLINE, a major biomedical repository that collects millions of articles from different journals. This learning task is made difficult by the richness of biomedical terminology, the diversity of ways in which experimental evidence is expressed, and the small number of annotated articles that can be used as labeled training examples. We will describe our novel substring construction algorithm, which derives attributes from semantically related terms with shared stems or morphemes. In this data-driven approach we used the Wilcoxon rank-sum test for feature selection and then developed support vector machine (SVM) and Naive Bayes classifiers. In addition to re-ranking retrieved literature by its relevance to protein disorder, we have also successfully applied the new algorithm to five post-translational modification datasets, where curators confirmed that the selected substrings are consistent with their much more costly manual annotation.
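As an illustration of the feature selection step only, here is a minimal sketch of scoring candidate substrings with a Wilcoxon rank-sum statistic. It is not the substring construction algorithm described above: the candidate substrings are supplied by hand, the statistic uses the normal approximation with average ranks for ties and no tie correction in the variance, and the function names are hypothetical. Both classes are assumed to be non-empty.

```python
import math


def rank_sum_z(pos, neg):
    """Wilcoxon rank-sum z-score for sample `pos` against sample `neg`."""
    n1, n2 = len(pos), len(neg)
    pooled = sorted([(v, 0) for v in pos] + [(v, 1) for v in neg])

    rank_sum_pos = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (rank_sum_pos - mean) / sd


def select_substrings(docs, labels, substrings, top_k):
    """Return the `top_k` substrings whose per-document counts best separate
    relevant (label 1) from irrelevant (label 0) documents."""
    scores = {}
    for s in substrings:
        counts = [d.lower().count(s) for d in docs]
        pos = [c for c, y in zip(counts, labels) if y == 1]
        neg = [c for c, y in zip(counts, labels) if y == 0]
        scores[s] = abs(rank_sum_z(pos, neg))
    return sorted(substrings, key=lambda s: scores[s], reverse=True)[:top_k]
```

A stem such as "unfold" matches "unfolded" and "unfolding" alike, which is the intuition behind using substrings rather than whole words. The selected substrings would then serve as input attributes for a classifier such as an SVM or Naive Bayes.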
The reported results were obtained through a collaboration with A.K. Dunker, B. Han, Z. Hu, K. Peng, P. Radivojac, V. Uversky, S. Vucetic, X. Xie, and C.H. Wu, and were published in 2006 and 2007 in BMC Bioinformatics 7(1):208, Bioinformatics 22(23):2876-82 and J. Proteome Research May;6(5):1882-1932.
The speaker will give one more lecture, on Friday the 24th at 12:00, at ETF in room 61.
DATA MINING SUPPORT FOR AEROSOL OPTICAL DEPTH RETRIEVAL AND ANALYSIS
Abstract: Aerosol Optical Depth (AOD) indicates the amount of depletion that a beam of radiation undergoes as it passes through the atmosphere. One of the biggest challenges in current climate research is to characterize and quantify the effect of AOD on the Earth's radiation budget. We will describe a novel data mining method for improving the accuracy of AOD prediction (so-called retrieval). The method trains neural networks that take advantage of high-resolution satellite observations and collocated high-quality ground-based measurements. Experimental results obtained using thousands of observations over the entire globe suggest that ensembles of neural networks are more accurate than the operational MODIS AOD retrieval algorithm. Our study of the differences between the neural networks and the MODIS algorithm over the continental United States also revealed information that can help improve the quality of the MODIS algorithm.
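The idea of an ensemble of neural networks can be sketched as follows. This is a toy illustration under stated assumptions, not the retrieval system described above: the abstract does not specify the network architecture or how ensemble members are combined, so the sketch assumes small one-hidden-layer networks, each trained on a bootstrap resample, whose outputs are averaged. The class and function names are hypothetical. In the real setting the inputs would be satellite-derived attributes and the targets would be ground-based AOD measurements.

```python
import math
import random


class TinyNet:
    """One-hidden-layer regression network trained by plain SGD."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        h = self._hidden(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def fit(self, X, y, epochs=200, lr=0.05):
        for _ in range(epochs):
            for x, t in zip(X, y):
                h = self._hidden(x)
                err = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2 - t
                for j, hj in enumerate(h):
                    # Backpropagate through tanh before updating w2[j].
                    grad_h = err * self.w2[j] * (1.0 - hj * hj)
                    self.w2[j] -= lr * err * hj
                    self.b1[j] -= lr * grad_h
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * grad_h * xi
                self.b2 -= lr * err
        return self


def train_ensemble(X, y, n_members=5, n_hidden=4, seed=0):
    """Train each member on its own bootstrap resample of (X, y)."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(len(X)) for _ in X]
        net = TinyNet(len(X[0]), n_hidden, rng)
        members.append(net.fit([X[i] for i in idx], [y[i] for i in idx]))
    return members


def ensemble_predict(members, x):
    """Ensemble output is the plain average of the member outputs."""
    return sum(m.predict(x) for m in members) / len(members)
```

Averaging members trained on different resamples tends to reduce the variance of any single network, which is one common motivation for using ensembles.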
The reported results were obtained through a collaboration with Bo Han, Zhanqing Li, Wen Mi, Vladan Radosavljevic and Slobodan Vucetic, and were funded by NSF research grant IIS-0612149.
Zoran Obradovic is Director of the Center for Information Science and Technology and a Professor of Computer and Information Sciences at Temple University. His research interests focus on developing data mining and statistical learning technology for efficient knowledge discovery in large databases. Funded by NSF, NIH, NIJ, DOE and industry, he has contributed to about 190 refereed articles on these and related topics and to several academic and commercial software systems. For more details see www.ist.temple.edu/~zoran.