- Daniela Witten, University of Washington
Statistics for Big Data: Challenges and Opportunities
In the past ten years, there has been a lot of hype about the potential for genomics to transform all aspects of medicine. But thus far, actual instances in which genomics is used to inform patient care have lagged pretty far behind the hype. This is the case for (at least) two reasons: the data are more complex than expected, and the statistical methods to analyze the data are still in their infancy. To illustrate these points, I'll present recent work on a penalized score test for quantifying conditional associations in high dimensions, as well as a new framework for predicting the deleteriousness of genetic variants.
Professor Witten's research interests span statistical machine learning, particularly the development of statistical machine learning techniques for problems in genomics. Her research is funded by the National Science Foundation, the Alfred P. Sloan Foundation, the National Institutes of Health, and the National Defense Science and Engineering Graduate Fellowship. She is the author of more than 50 papers in top statistics journals and has co-authored the textbook "An Introduction to Statistical Learning".
- Bill Corser, Institute for Health Policy, Michigan State University
Research Designs Using Large Data Sources: “Lessons Learned” from over a Billion Data Points
This presentation will summarize the results and lessons derived by the two presenters from over a combined decade of funded work using large data sets such as: a) regional/national health system data (e.g., Veterans Health Administration electronic health records), b) publicly accessible data (e.g., the Medical Expenditure Panel Survey (MEPS) from AHRQ), and c) statewide Michigan Medicaid Program Data Warehouse beneficiary claims, when examining healthcare delivery outcomes and phenomena. The MSU presenters (William Corser and Kathleen Oberst) will: a) review the major types of nationally representative, healthcare system, and statewide data set sources frequently available to researchers, and b) provide a framework (Corser, 2013) of “advantages and concerns” for work with such large data sets. Thirteen specific strategies and several anecdotes concerning large data set analytics will also be summarized, with a list of pertinent research citations provided to interested attendees.
- Bhramar Mukherjee, Biostatistics, University of Michigan
Where have all the environmental interactions gone?
Many complex diseases have a multi-factorial etiology, characterizing the dynamic interplay between an individual’s genetic profile and lifetime environmental exposure history. Even though most epidemiologists conceptually believe in the existence of gene-environment interaction (GEI), there have been only a limited number of replicable findings in terms of detecting GEI, as defined through the statistical interaction parameter. In this talk, I will first discuss some of the major statistical advances that have been made in this area in the last decade. I will then indicate some of the critical limitations of current design and analytic strategies that can potentially explain the limited success in identifying GEI. I will conclude with the eternal debate regarding statistical interaction versus biological interaction and how GEI studies can be relevant for designing targeted public health interventions.
- Yuehua Cui, Statistics and Probability, Michigan State University
Statistics in the post-genomic era: challenges and opportunities
Massive amounts of genetic/genomic data (e.g., next-generation sequencing data as well as different types of ‘omics’ data) are routinely generated nowadays, owing to radical breakthroughs in high-throughput technologies. These data are high-dimensional in nature, typically with the number of features much larger than the sample size, thus presenting daunting challenges for statistical modeling and computation. Analysis and interpretation of such data call for a large interdisciplinary effort, with many exciting opportunities for statisticians. In this short talk, I will briefly introduce some open problems, such as the study of gene-gene and gene-environment interactions and ‘omics’ data integration, and then discuss potential opportunities at the interface of statistical genetics/genomics.
- Sayan Chakraborty, Statistics and
Probability, Michigan State University
Latent Space Clustering in Social Network Analysis
Network models are widely used to represent relations between actors or nodes. Sometimes we see transitivity in network data, meaning that if two nodes both have ties with a third node, then they are more likely to be tied to each other. Sometimes the interest lies in finding the clusters in the network. Here we define a model where the adjacency structure is unobserved and the probability of a tie between two nodes depends on their latent positions in an unobserved Euclidean social space. To identify the clusters within the network, we develop a latent space clustering algorithm and propose a fully Bayesian estimation method that uses Reversible Jump MCMC to perform the posterior inference. Finally, we apply this algorithm to recover the network structure and the clustering configuration of US business firms.
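As an illustration of this kind of model, the following Python sketch simulates ties whose probability decays with latent Euclidean distance via a logistic link. The cluster centers, intercept alpha, and all other values are hypothetical, and the actual inference works in the opposite direction (positions are latent and estimated by RJMCMC); the forward simulation only shows why clustering and transitivity emerge from distances in a latent space.

```python
import numpy as np

def tie_probability(z_i, z_j, alpha=1.0):
    # P(tie) is a logistic function of the negative Euclidean distance
    # between the two latent positions (alpha is an intercept).
    d = np.linalg.norm(z_i - z_j)
    return 1.0 / (1.0 + np.exp(-(alpha - d)))

rng = np.random.default_rng(0)
# Two hypothetical latent clusters in a 2-D social space.
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = rng.integers(0, 2, size=20)
Z = centers[labels] + 0.5 * rng.standard_normal((20, 2))

# Simulate an adjacency matrix: within-cluster ties are more likely.
A = np.zeros((20, 20), dtype=int)
for i in range(20):
    for j in range(i + 1, 20):
        A[i, j] = A[j, i] = rng.random() < tie_probability(Z[i], Z[j])
```

Nodes sharing a cluster sit close together in the latent space, so they are tied with high probability, which is exactly the structure the clustering algorithm tries to recover from an observed adjacency matrix.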
- Sougata Chaudhuri, Department of
Statistics, University of Michigan
Perceptron-like Algorithms and Generalization Bounds for Learning to Rank
Learning to rank is a supervised learning problem where the output space is the space of rankings but the supervision space is the space of relevance scores. We make theoretical contributions to the learning to rank problem in both the online and batch settings. First, we propose a perceptron-like algorithm for learning a ranking function in an online setting. Our algorithm is an extension of the classic perceptron algorithm for the classification problem. Second, in the setting of batch learning, we introduce a sufficient condition for convex ranking surrogates to ensure a generalization bound that is independent of the number of objects per query. Our bound holds when linear ranking functions are used: a common practice in many learning to rank algorithms. En route to developing the online algorithm and generalization bound, we propose a novel family of listwise large margin ranking surrogates. Using the proposed family, we provide a guaranteed upper bound on the cumulative NDCG (or MAP) induced loss under the perceptron-like algorithm. We also show that the novel surrogates satisfy the generalization bound condition.
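The classic perceptron idea that the online algorithm extends can be sketched as follows. This toy Python version applies pairwise perceptron updates for a single hand-made query; the features, binary relevance labels, and learning rate are illustrative, and this is not the authors' listwise surrogate or their exact update rule.

```python
import numpy as np

def rank_perceptron_epoch(w, X, rel, lr=1.0):
    # Perceptron-style pass over one query: whenever a less relevant
    # document is scored at least as high as a more relevant one,
    # move w toward the difference of their feature vectors.
    for i in range(len(rel)):
        for j in range(len(rel)):
            if rel[i] > rel[j] and X[i] @ w <= X[j] @ w:
                w = w + lr * (X[i] - X[j])
    return w

# A tiny hand-made query: 4 documents, 2 features, binary relevance.
X = np.array([[2.0, 0.0], [1.5, 0.5], [-1.0, 0.2], [-2.0, -0.5]])
rel = np.array([1, 1, 0, 0])
w = np.zeros(2)
for _ in range(100):
    w = rank_perceptron_epoch(w, X, rel)
```

On separable data like this, the standard perceptron convergence argument applies to the pairwise difference vectors, so after a few passes the learned scores rank every relevant document above every irrelevant one.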
- Akshita Chawla, Statistics and Probability,
Michigan State University
Meta-analysis for rare events with application to medical data
In clinical trials and many other applications, meta-analysis is mainly conducted to summarize the effect size on primary endpoints or on key secondary endpoints. However, there are times when safety endpoints, such as the risk of complications in pancreatic surgery or the risk of myocardial infarction (MI), are the main point of interest in a drug study. As these types of safety endpoints are rare in nature, most studies report zero such incidences, and the general statistical framework of meta-analysis based on large-sample theory breaks down when combining the effect sizes. As a workaround, either trials with both arms having zero events are deleted or a 0.5 continuity correction is applied. In a randomized controlled trial (RCT) set-up, Cai, Parast and Ryan (Statist. Med. 2010, 29:2078-2089) proposed methods based on Poisson random effects models to draw inferences on relative risks for two arms with rare event data. In this work, we give a general framework showing how their RCT assumption can be relaxed so that the methods can be utilized for non-RCT studies to draw inferences for two treatments. We also develop two new approaches based on zero-inflated Poisson random effects models that are more appropriate for data with excessive zero counts. We illustrate our proposed models on two real-world data sets and conduct extensive simulations that corroborate the appropriateness of the proposed methods.
- Chunyu Chen, Animal Science, Michigan State University
Exploring extensions and properties of expectation-maximization methods for whole genome prediction
As the densities of SNP marker panels increase, computational efficiency becomes more important for whole genome prediction (WGP) such that algorithms other than MCMC might warrant greater consideration for heavy-tailed alternatives (e.g. BayesA) to normality. One popular alternative is based on the expectation maximization (EM) algorithm. We have previously extended BayesA to allow for correlated effects in anteBayesA using MCMC. We explore an EM implementation of anteBayesA (EM-anteBayesA) and compare it to a recently developed conventional EM-BayesA strategy. By both simulation and application to the heterogeneous stock mice dataset, we found that EM-anteBayesA had accuracies of genomic prediction comparable to its MCMC analogue. Furthermore, we demonstrate, contrary to conventional wisdom, that it is feasible to estimate key hyperparameters based on EM implementations of these hierarchical Bayesian models.
- Somak Dutta, Department of Statistics,
University of Chicago
Matrix-Free Conditional Simulations of Gaussian Markov Random Fields
We develop a matrix-free method for conditional simulation in hidden Gaussian Markov random field models. For regular arrays, we exploit the analytic structure of the precision matrix using the two-dimensional discrete cosine transformation and employ the Lanczos algorithm to solve systems of linear equations. As a key ingredient, we use an incomplete Cholesky factor of the sparse precision matrix as the preconditioner in the Lanczos algorithm, bringing the computational cost down to O(n log(n)), where n is the sample size. We further use the conditional simulation to 1) compute maximal simultaneous exceedance regions, 2) compute the marginal likelihood by Monte Carlo integration over missing observations, and 3) sample from the posterior distribution of the spatial field and other parameters in a Bayesian regime. We then demonstrate our method on real data on groundwater arsenic contamination in Bangladesh.
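The role of a preconditioned Krylov (Lanczos-type) solver can be illustrated with SciPy's conjugate gradient method. In this sketch a sparse incomplete LU factorization stands in for the incomplete Cholesky preconditioner, and the lattice precision matrix (second differences plus a small nugget) is an illustrative stand-in for the model's actual precision structure.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Precision matrix of a GMRF on a 30x30 lattice: a second-difference
# (Laplacian) structure plus a small nugget to make it positive definite.
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
Q = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))
     + 0.1 * sp.eye(m * m)).tocsc()

rng = np.random.default_rng(9)
b = rng.standard_normal(m * m)

# Incomplete LU factorization as a sparse preconditioner (a stand-in
# here for an incomplete Cholesky factor), then a conjugate gradient
# (Lanczos-type) solve of Q x = b without ever forming a dense matrix.
ilu = spla.spilu(Q, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(Q.shape, matvec=ilu.solve)
x, info = spla.cg(Q, b, M=M)
```

The solver only touches Q through sparse matrix-vector products, which is what makes the approach matrix-free and keeps the cost near-linear in the number of lattice sites.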
- Simone Graetzer, Communicative Sciences
and Disorders, Michigan State University
Age-related Changes in Speech Breathing
Speech changes with age, and these changes can occur in articulatory and velo-pharyngeal function, and in breathing function, e.g., due to a reduction of vital lung capacity. For six subjects – three females and three males – who were recorded over the course of 18 to 30 years for the females and 38 to 48 years for the males, the duration of the speech breath group (SBG) was measured in milliseconds by seven raters (2-3 per speaker; > 14,000 data points). Speaker age at the time of recording was calculated to two decimal places. A logarithmic transformation (base 10) was used for duration to reduce skew in the distribution. The continuous variable of age was categorized into tertiles per speaker gender group: 44.45-58.7 [tertile I], 58.97-68.97 [tertile II], and 68.97-81.05 [tertile III] years for females, and 50-68 [tertile I], 68-80 [tertile II], and 80-98 [tertile III] years for males. Using R version 3.1.0 and the packages 'lme4' and 'multcomp', a linear mixed-effects model was fitted for log-transformed SBG duration as a function of age (with subject and rater as random effects terms) with restricted maximum likelihood estimation. Other statistical procedures were also used, e.g., post-hoc Tukey multiple comparisons of means and chi-squared likelihood ratio tests for model comparison. For females, tertile II was associated with a higher estimate than both tertile I (p<0.0001) and tertile III (p<0.001). Tertile I was associated with a higher estimate than tertile III (p<0.05). For males, tertile I was associated with a higher estimate than tertiles II and III (p<0.0001), and tertile II was associated with a higher estimate than tertile III (p<0.0001). In sum, tertile III was associated with lower durations than tertile II for both females and males. A decrease in breath group duration in older age may be associated with a loss of rib cage flexibility, a reduction of vital capacity, and/or atrophy (e.g., bowing) of the vocal folds.
While these findings may show some bias due to non-random subject sampling, they indicate the value of examining speech breathing as a potential marker of physiological changes that can reduce quality of life in older individuals.
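The modeling step described above can be sketched in Python with statsmodels. The simulated data, effect sizes, and the use of only a subject random intercept are illustrative simplifications of the R/lme4 analysis, which also includes rater as a random effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for subj in range(6):                     # six speakers
    u = rng.normal(0.0, 0.1)              # subject random intercept
    for _ in range(200):
        tertile = rng.integers(1, 4)      # age tertile I, II or III
        # hypothetical effect: later tertiles -> shorter breath groups
        mu = 3.4 - 0.05 * (tertile - 1)
        rows.append({"subject": subj, "tertile": str(tertile),
                     "log_dur": mu + u + rng.normal(0.0, 0.1)})
df = pd.DataFrame(rows)

# REML fit of log duration on age tertile with a subject random
# intercept (the actual study also includes rater as a random effect).
fit = smf.mixedlm("log_dur ~ C(tertile)", df, groups="subject").fit(reml=True)
```

The fixed-effect coefficients for tertiles II and III play the role of the "estimates" compared in the abstract, while the grouping structure absorbs repeated measurements within speakers.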
- Tao He, Statistics and Probability,
Michigan State University
Testing high-dimensional nonparametric function with application to gene set analysis
High-dimensional data arise nowadays in a wide variety of areas, such as biology, imaging and climate. In this paper, we propose a test statistic for testing a high-dimensional nonparametric function in a reproducing kernel Hilbert space generated by a positive definite kernel. An interesting feature of the proposed method is that we do not need to assume any specific structure for the high-dimensional function. We study the asymptotic distribution of the test statistic under the null hypothesis and a series of local alternative hypotheses in a “large p, small n" setup. By choosing an appropriate kernel function, we show that the proposed test procedure has better power performance than tests developed for linear models.
- Britny Hildebrandt, Psychology,
Michigan State University
Examination of Strain Differences in Binge Eating Behaviors in Rats: Evidence for Genetic Differences
Binge eating is a significantly heritable phenotype that cuts across eating disorder diagnoses (e.g., bulimia nervosa, binge eating disorder) and is influenced by biological and genetic factors. Animal studies have distinct advantages for understanding the etiology of behavioral phenotypes like binge eating, as they are able to isolate biological processes while minimizing environmental sources of variation. To date, no study has examined rat strain differences for binge eating. Examining strain differences in animal models of binge eating could help identify genetic risk factors that differentiate strains and possibly contribute to the clinical phenotype in humans. The current study used the binge eating resistant/binge eating prone model to investigate potential strain differences in binge eating. Sprague-Dawley female (n = 30), Wistar female (n = 23), and Sprague-Dawley male (n = 30) rats were exposed to six palatable food feeding tests. ANOVAs and chi-square analyses were used to examine effects of strain on palatable food intake and binge eating status. Across strains, Wistar rats consumed significantly less palatable food than Sprague-Dawley female rats. Additionally, they were less likely to be defined as binge eating prone, similar to the Sprague-Dawley male rats, a known low-risk group for binge eating. Indeed, a significantly higher rate of binge eating proneness was observed in the Sprague-Dawley female rats as compared to other groups. Results suggest that there may be important genetic differences in rat strains for binge eating, and that Sprague-Dawley female rats may be a particularly vulnerable strain for this behavior. Future research should examine specific genetic and biological factors (e.g., differences in gene expression) that differentiate the strains and may translate into biological processes contributing to individual differences in binge eating in humans.
- Qinhua Huang, Epidemiology and Biostatistics,
Michigan State University
Relationship between AUC and Misclassification Rate of Binary Classifiers – A Simulation Study
The area under the ROC curve (AUC) and the misclassification rate are two important criteria used to measure the performance of classifiers. Usually one cannot attain the maximum AUC and the minimum misclassification rate under the same classifier. Studying the relationship between the AUC and the misclassification rate, Cortes and Mohri (2004) provided a formula for the exact expression of the expected value and the variance of the AUC for a given misclassification rate. In this study, we aim to study the misclassification rate for a given AUC by simulating a binary classifier using logistic regression. We further compare the expected AUC value given by Cortes and Mohri's formula with the simulated value of the AUC. Our results show that when the class proportion is not close to the boundary value 0 or 1, Cortes and Mohri's formula is accurate, but when the proportion is close to the boundary value 0 or 1, the formula's expected value of the AUC deviates substantially from the true AUC value. This indicates that the formula for the AUC at a given misclassification rate may not always be accurate and caution should be used. Our results also provide useful information on the quantiles of the misclassification rate for a given AUC, with the class proportion varying in [0,1]. Cortes, Corinna, and Mehryar Mohri. "AUC optimization vs. error rate minimization." Advances in Neural Information Processing Systems 16 (2004): 313-320.
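The two performance criteria can be computed side by side with a small simulation. This Python sketch scores a logistic model by the rank-based AUC and by the 0.5-threshold misclassification rate; the model slope and sample size are illustrative choices, not the settings of the study.

```python
import numpy as np

def auc(scores, y):
    # Rank-based AUC: probability that a random positive outscores a
    # random negative (ties counted as 1/2).
    pos, neg = scores[y == 1], scores[y == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

rng = np.random.default_rng(4)
n = 5000
x = rng.standard_normal(n)
p = 1.0 / (1.0 + np.exp(-2.0 * x))       # logistic model, slope 2
y = (rng.random(n) < p).astype(int)

scores = p                                # classifier score = model prob.
err = ((scores > 0.5).astype(int) != y).mean()   # misclassification rate
a = auc(scores, y)
```

Repeating such simulations over class proportions and model strengths is exactly how the empirical relationship between AUC and misclassification rate can be mapped out and compared against the closed-form expectation.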
- Abishek Kaul, Statistics and Probability,
Michigan State University
Corrected Quantile regression for High Dimensional Measurement Error Models
We study the asymptotic properties of the penalized version of the corrected quantile loss function introduced by Wang et al. (2012) for additive measurement error linear models. Penalizing the loss function enables simultaneous parameter estimation and variable selection, while also extending applicability to high-dimensional models, where the dimensionality can grow exponentially with the sample size. As is common for measurement error models, the objective function to be minimized is non-convex; however, we are able to approximate it by a convex function and thereby provide bounds on the statistical error associated with prediction and estimation, which hold with high probability.
- Yongfang Lu, Animal Science, Michigan State University
An Alternative Approach to Modeling Genetic Merit of Feed Efficiency in Dairy Cattle
Genetic improvement of feed efficiency (FE) in dairy cattle requires greater attention given increasingly important resource constraint issues. A commonly used measure of FE in dairy cattle is residual feed intake (RFI); however, the use of RFI may be limiting for a number of reasons, including differences in recording frequencies between various component traits of RFI and potential differences in genetic versus non-genetic relationships between traits. We propose an alternative multiple-trait modeling strategy that exploits the Cholesky decomposition (CD) to provide a potentially more robust measure of FE. We assessed both approaches by simulation as well as by application to 23,770 mid-lactation weekly records on 1,967 cows from a dairy feed efficiency consortium study involving 7 different research stations within the US. Although the CD model fared better than the RFI approach when simulated genetic and non-genetic associations between dry matter intake and component traits were substantially different from each other, there were no meaningful differences in predictive performance between the two models on application to the consortium data.
- Xiaochen Luo, Psychology, Michigan State University
A direct comparison of dimensional, categorical and hybrid models for the latent structure of eating pathology in females
The classification of eating disorders has been debated, with the current major diagnostic system (DSM-5) assuming a categorical rather than a dimensional nature for eating disorders. However, previous studies suffered from several issues, including relying on unexamined assumptions of a categorical nature in modeling, not being able to examine hybrid models with both dimensional and categorical features, or using only clinical samples and only categorical diagnostic indicators, which limited the ability to capture the full variation of eating pathology. The current study aimed to directly compare categorical, dimensional and hybrid models for disordered eating in a community twin sample of 2,389 female and male participants ages 9-30 years (67% female; mean age = 14.43, SD = 5.00). The Minnesota Eating Behaviors Scale (MEBS) was used to assess disordered eating symptoms. We first examined the factor structure of the MEBS and found that the four-factor structure established previously (i.e., scales for body dissatisfaction, weight preoccupation, binge eating, and compensatory behavior) fit the data well and was invariant to sex, age and pubertal status. We then fit a categorical model (i.e., a latent class model), a dimensional model (i.e., a latent trait model) and a hybrid model (i.e., a non-parametric factor model) to each of the factors. Results showed that a dimensional model fit the data best for binge eating, weight preoccupation, and compensatory behaviors. For body dissatisfaction, a four-class model, which categorized people who were dissatisfied with different parts of their bodies into four groups, fit the data best, while the dimensional model was the second-best model. We conclude that despite an a priori preference for categorical models in the current classification system, a dimensional framework should be included in conceptualizing eating pathology.
Our results also emphasize the importance of assessing and understanding symptoms of eating disorders along the entire spectrum rather than only focusing on a narrow part of this continuum.
- Ashwini Maurya, Statistics and Probability,
Michigan State University
A Well Conditioned and Sparse Estimate of Covariance and Inverse Covariance Matrix Using Joint Penalty
We develop a method for estimating a sparse and well-conditioned covariance matrix from a sample of vectors drawn from a subgaussian distribution in a high-dimensional setting. The estimator uses a squared loss function and a joint penalty: an L1 norm and a sum-of-squares penalty on the sample eigenvalues. The joint penalty plays two important roles: i) the L1 penalty on each entry of the covariance matrix reduces the effective number of parameters, so the estimate is sparse, and ii) the sum-of-squared-deviations penalty on the sample eigenvalues controls the over-dispersion in the eigenvalues of the sample covariance matrix. In contrast to other existing methods of covariance and inverse covariance matrix estimation, where the interest is in estimating a sparse matrix, the proposed method is flexible in estimating a matrix that is both sparse and well-conditioned simultaneously. We extend the method to inverse covariance matrix estimation, assuming the existence of the sample inverse covariance matrix. We show that the proposed estimators of the covariance and inverse covariance matrices are consistent in Frobenius and operator norm for subgaussian random vectors. The resulting optimization problem is convex and easy to solve. Extensive simulations suggest that the joint penalty estimates of the covariance and inverse covariance matrices are better than the graphical lasso, PDSCE and Ledoit-Wolf estimates for various choices of structured covariance matrices, sample sizes and numbers of variables. We use our proposed estimator for tumor tissue classification using gene expression data and compare its performance to other classification methods.
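A rough two-step analogue of the joint penalty can be sketched in Python: soft-thresholding for sparsity and eigenvalue shrinkage for conditioning. This is an illustrative heuristic, not the authors' joint convex estimator; in particular, because the eigenvalue step is applied after thresholding, exact zeros are only approximate in this sketch.

```python
import numpy as np

def jpen_like(S, lam=0.1, gamma=0.5):
    # Step 1 (sparsity): soft-threshold the off-diagonal entries.
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(T, np.diag(S))
    # Step 2 (conditioning): pull eigenvalues toward their mean,
    # then clip at a small floor to guarantee positive definiteness.
    vals, vecs = np.linalg.eigh(T)
    vals = np.maximum((1 - gamma) * vals + gamma * vals.mean(), 1e-8)
    return (vecs * vals) @ vecs.T

rng = np.random.default_rng(5)
p, n = 50, 40                            # p > n: singular sample covariance
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)
Sigma_hat = jpen_like(S)
```

When p exceeds n the sample covariance S is singular, so its condition number blows up; shrinking the eigenvalue spread is what restores invertibility and well-conditioning, while the thresholding targets sparsity of the entries.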
- Siddhartha Nandy, Statistics and
Probability, Michigan State University
Multi-resolution kriging for anisotropic Gaussian spatial process
We develop a multi-resolution model for kriging on two-dimensional irregularly spaced spatial fields that can account for anisotropy of the spatial process. Our model represents the field as a sum of basis functions multiplied by coefficients. The basis consists of multiple levels of radial basis functions with their centers (nodes) organized on a regular grid, the kernel being a compactly supported function. The spatial dependence is modeled through dependence among the coefficients of the basis functions. Here the coefficients are assumed to follow a second-order spatial autoregressive model, and variation in the autoregressive parameters can control the elliptical shape and tilt of an anisotropic covariance matrix. The flexibility over the parameter values of the neighborhood matrix gives us the freedom of modeling through an anisotropic covariance matrix. One of the most important features of this model is that it can be applied to statistical inference for large spatial datasets, because key matrices in the computations are sparse. This computational efficiency applies to likelihood evaluation, spatial prediction, and spatial inference.
- Shannon O’Connor, Psychology, Michigan State University
A Co-Twin Control Design Unraveling Selection Effects from Pure Socialization in the Association Between Body Conscious Peer Groups and Disordered Eating
Previous studies suggest strong associations between body-conscious peer groups and disordered eating. This association has been attributed to socialization effects (i.e., membership in peer groups leads to disordered eating); however, selection effects (i.e., selecting into peer groups based on genetic or environmental predispositions toward disordered eating) could contribute to or even account for these associations. The current study was the first to use a co-twin control design to disentangle genetic and shared environmental selection factors from pure socialization effects. Participants included 612 female twins (ages 8-14) drawn from the Michigan State University Twin Registry. To comprehensively examine the full range of eating pathology, several disordered eating attitudes and behaviors (e.g., body dissatisfaction, binge eating, loss of control over eating) were examined via self-report. Self-report questionnaires also were used to assess peer group emphasis on body weight and shape. Analyses examined peer group exposure for each twin individually, as well as within-pair differences separately for monozygotic (MZ) and dizygotic (DZ) twin pairs. Comparisons of within-individual versus within-twin pair effects allow for identification of socialization effects, genetic selection effects, or a combination of genetic and environmental selection effects. For the vast majority of disordered eating measures, a combination of genetic and environmental selection effects was present, as associations between body-conscious peer groups and disordered eating were found within individuals, but not within MZ or DZ twin pairs. These findings are the first to suggest that associations between peer groups and disordered eating may be due to girls with genetic or environmental predispositions toward disordered eating selecting into body-conscious peer groups.
- Pablo Reeb, Epidemiology and Biostatistics,
Michigan State University
Evaluation of dissimilarity measures for sample-based hierarchical clustering of RNA sequencing experiments
Hierarchical cluster analysis has been extensively adopted as a tool for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technological artifacts such as differences in sequencing depth. Euclidean and Pearson-correlation-based distances have been widely used in gene expression analysis for microarray data but may not be appropriate for RNA-seq data analysis. To account for RNA-seq data characteristics, regularization functions, data transformations and model-based dissimilarities have been proposed. The adequacy of such dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. In contrast, we propose simulating realistic conditions through the creation of plasmode datasets from two experimental datasets. On one hand, an experiment on highly inbred individuals, with two main sources of variation determined by the experimental design, allowed us to build plasmode datasets with a known number of differentially expressed transcripts. On the other hand, an experiment with RNA-seq profiles from outbred individuals allowed us to construct plasmode datasets by creating synthetic individuals with known proportions of original shared reads. We evaluated eight dissimilarity measures, including Euclidean and correlation-based approaches as well as a Poisson dissimilarity. To compare dendrograms, we used correlations of cophenetic matrices computed between and within dissimilarities, and we contrasted the obtained hierarchical structure against the configuration known a priori from the plasmode generation process.
Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization did not reliably represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity, a rank-correlation-based dissimilarity, or an appropriate data transformation resulted in dendrograms that resembled the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We report several measures that are satisfactory; the choice of a particular measure may depend on its availability in the software pipeline of preference.
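The cophenetic-correlation comparison described above can be sketched with SciPy. The toy gamma-Poisson counts, two-group structure, and the two candidate dissimilarities below are illustrative assumptions, not the plasmode datasets or the full set of eight measures evaluated in the study.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.stats import spearmanr

rng = np.random.default_rng(6)
# Toy RNA-seq-like counts: two groups of 5 samples, 200 genes,
# with the last 40 genes up-regulated threefold in group 2
# (gamma-Poisson mixing gives over-dispersed counts).
base = rng.gamma(2.0, 5.0, size=200)
fold = np.r_[np.ones(160), 3.0 * np.ones(40)]
counts = np.vstack([rng.poisson(base * fold ** g, size=(5, 200))
                    for g in (0, 1)])

# Two candidate dissimilarities between samples.
d_euc = pdist(np.log1p(counts))                  # Euclidean on log counts
rho = spearmanr(counts.T)[0]                     # 10x10 rank correlations
d_spr = squareform(1.0 - rho, checks=False)      # rank-correlation-based

# Cophenetic correlation: how faithfully each dendrogram preserves
# its input dissimilarities.
c_euc = cophenet(linkage(d_euc, "average"), d_euc)[0]
c_spr = cophenet(linkage(d_spr, "average"), d_spr)[0]
```

Comparing cophenetic matrices across dissimilarities, and comparing each dendrogram against the known group structure, is the same logic applied in the study at the scale of the plasmode datasets.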
- Lei Xu, Computer Science and Engineering,
Michigan State University
Data Cleaning in Long Time-series Plant Photosynthesis Phenotyping Data
The scale of biomedical phenotyping data is growing exponentially. However, the quality of phenotyping data is compromised by sources of noise that are difficult to remove at the data collection step. The ability to improve data quality is a key requirement for effective knowledge mining from phenotype data. We developed a coarse-to-refined model called Dynamic Filter to effectively identify abnormalities in plant phenotyping data. Unlike existing data cleaning methods, our model can identify both abnormalities and biological discoveries, which are difficult to separate on the basis of their distributions alone. Dynamic Filter employs an Expectation-Maximization process to dynamically adjust the kinetic model in coarse and refined regions to identify both abnormalities and biological outliers. First, residuals are generated at a coarse level and are subsequently modeled with a Gaussian mixture model; candidate abnormalities are defined as values falling outside a certain confidence interval of the major normal component. Second, we refine the results in a projected feature space using KNN. Third, we divide abnormal candidates by region and re-examine the residuals using local data. Finally, the local results are propagated to all regions for further performance improvement. Experimental results show that our algorithm can effectively identify most of the abnormalities in both real and synthetic datasets.
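The Gaussian-mixture step for flagging abnormal residuals can be sketched as follows. The smooth signal, injected spikes, number of mixture components, and the 3-standard-deviation rule are illustrative assumptions; this is only the first stage of the pipeline, not the full Dynamic Filter with its KNN refinement and regional propagation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# A smooth "kinetic" signal plus small noise, with injected spikes.
t = np.linspace(0.0, 10.0, 500)
signal = np.sin(t)
y = signal + 0.05 * rng.standard_normal(500)
outlier_idx = np.array([50, 200, 350])
y[outlier_idx] += 1.0                    # abnormal readings

# Residuals from the fitted model, described by a Gaussian mixture;
# the heaviest-weight component is taken as the "normal" one.
residuals = (y - signal).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(residuals)
k = int(np.argmax(gmm.weights_))
mu = gmm.means_[k, 0]
sd = float(np.sqrt(gmm.covariances_[k, 0, 0]))
flagged = np.where(np.abs(residuals[:, 0] - mu) > 3.0 * sd)[0]
```

The flagged points are only candidates at this stage; distinguishing true abnormalities from genuine biological outliers is what the subsequent refinement steps are for.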
- Lei Zhao, Epidemiology and Biostatistics,
Michigan State University
Observational Study Methods Application in Safety of Lumbar Puncture in Comatose Malawian Children
In hospitals in sub-Saharan Africa, presentation of a child in coma is a frequent occurrence. The differential diagnoses include cerebral malaria, bacterial meningitis, tuberculous meningitis, cryptococcal meningitis, viral encephalitis and drug intoxication. Distinguishing clinically between these diagnoses is problematic. Lumbar puncture (LP) is a critical diagnostic tool and may be the only way to distinguish between some of these pathologies. However, there is considerable controversy as to whether LPs are safe in children with decreased consciousness. Over 3,000 children have been admitted in coma to the pediatric research unit in Blantyre over the past 15 years, with detailed records kept on their progress and outcomes. In observational studies, propensity score estimation is a statistical technique that attempts to estimate the effect of a treatment by accounting for the covariates that predict receiving the treatment. In this study, several propensity-score-based methods help to determine whether LP is safe in these comatose children.
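Propensity score methods of this kind can be illustrated with inverse-probability weighting on simulated data. The covariates, the assignment model, and the null treatment effect below are entirely hypothetical, and the actual study may use matching or stratification on the propensity score rather than weighting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 4000
X = rng.standard_normal((n, 3))           # pre-treatment covariates
# Treatment (here, LP) assignment depends on covariates -> confounding.
p_treat = 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
t = (rng.random(n) < p_treat).astype(int)
# Outcome depends on the covariates but, by construction, NOT on
# the treatment, so the true treatment effect is zero.
p_out = 1.0 / (1.0 + np.exp(-0.8 * X[:, 0]))
y = (rng.random(n) < p_out).astype(int)

naive = y[t == 1].mean() - y[t == 0].mean()       # confounded estimate
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
w = t / ps + (1 - t) / (1 - ps)                   # inverse-prob. weights
ipw = (np.sum(w * t * y) / np.sum(w * t)
       - np.sum(w * (1 - t) * y) / np.sum(w * (1 - t)))
```

The naive difference in outcomes is biased because sicker children are both more likely to receive the procedure and more likely to have poor outcomes; weighting by the estimated propensity score removes much of that bias.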