This article presents a detailed survival analysis for chronic kidney disease (CKD). The analysis is based on electronic health record (EHR) data comprising almost two decades of clinical observations collected at New York-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers on Bayesian multiresolution hazard modeling, with the objective of capturing the changing hazard of CKD over time, adjusted for patient clinical covariates and kidney-related laboratory tests. Special attention is paid to statistical issues common to all EHR data, such as cohort definition, missing data and censoring, variable selection, and the potential for joint survival and longitudinal modeling, all of which are discussed both in general and within the EHR CKD context.
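As a rough illustration of the kind of piecewise hazard estimate that multiresolution models refine, here is a minimal sketch; the function name and the simple events-per-person-time estimator are our own illustration, not the authors' Bayesian model:

```python
import numpy as np

def piecewise_hazard(times, events, cut_points):
    """Estimate a piecewise-constant hazard: events / person-time per interval.

    times      : observed follow-up times (event or censoring)
    events     : 1 if the event (e.g., CKD onset) occurred, 0 if censored
    cut_points : interval boundaries, e.g. [0, 1, 2, 4] (years)
    """
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    hazards = []
    for lo, hi in zip(cut_points[:-1], cut_points[1:]):
        # person-time each subject contributes to the interval [lo, hi)
        exposure = np.clip(times - lo, 0.0, hi - lo)
        # events observed inside this interval
        d = np.sum(events[(times >= lo) & (times < hi)])
        pt = exposure.sum()
        hazards.append(d / pt if pt > 0 else 0.0)
    return hazards
```

A multiresolution prior would then smooth these interval-level hazards across a dyadic hierarchy of time splits rather than estimating each one independently.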

Large healthcare databases maintained by health plans have been widely used to conduct customized protocol-based epidemiological safety studies as well as targeted routine sequential monitoring of suspected adverse events for newly licensed vaccines. These databases also offer a rich data source to discover vaccine-related adverse events not known prior to licensure using data mining methods, but they remain relatively under-utilized for this purpose. Initial safety applications of data mining methods using ‘big healthcare data’ are promising, but stronger integration of database expertise, epidemiological design, and statistical analysis strategies is needed to better leverage the available information, reduce bias, and improve reporting transparency. We enumerate major methodological challenges in mining large healthcare databases for vaccine safety research, describe existing strategies that have been used to address these issues, and identify opportunities for methodological advancements that emphasize the importance of adapting techniques used in customized protocol-based vaccine safety assessments. Investment in such research methods and in the development of deeper collaborations between database safety experts and data mining methodologists has great potential to improve existing safety surveillance programs and further increase public confidence in the safety of newly licensed vaccines.

Recent advances in data mining have integrated kernel functions with Bayesian probabilistic analysis of Gaussian distributions. These machine-learning approaches can incorporate prior information with new data to calculate probabilistic rather than deterministic values for unknown parameters. This article extensively analyzes a specific Bayesian kernel model that uses a kernel function to calculate a posterior beta distribution that is conjugate to the prior beta distribution. Numerical testing of the beta kernel model on several benchmark datasets reveals that this model's accuracy is comparable with those of the support vector machine (SVM), relevance vector machine, naive Bayes, and logistic regression, and the model runs more quickly than all the other algorithms except for logistic regression. When one class occurs much more frequently than the other class, the beta kernel model often outperforms other strategies to handle imbalanced datasets, including under-sampling, over-sampling, and the Synthetic Minority Over-Sampling Technique. If data arrive sequentially over time, the beta kernel model easily and quickly updates the probability distribution, and this model is more accurate than an incremental SVM algorithm for online learning.
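One plausible reading of such a model can be sketched as follows; the RBF kernel, the function names, and the specific weighting scheme are assumptions for illustration, not necessarily the parameterization studied here. The idea is that kernel weights act as fractional successes and failures in a conjugate Beta update, so the prediction is the posterior mean and a sequential update is just an increment of the two Beta parameters:

```python
import numpy as np

def rbf(x, xi, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def beta_kernel_predict(x, X, y, a0=1.0, b0=1.0, gamma=1.0):
    """Hypothetical beta-kernel classifier: nearby labeled points contribute
    kernel-weighted pseudo-counts to a conjugate Beta posterior, and the
    posterior mean is returned as P(y = 1 | x)."""
    w = np.array([rbf(x, xi, gamma) for xi in X])
    a = a0 + np.sum(w * y)           # kernel-weighted successes
    b = b0 + np.sum(w * (1 - y))     # kernel-weighted failures
    return a / (a + b)               # posterior mean of Beta(a, b)
```

The sequential-update property claimed in the abstract is visible here: a new labeled point simply adds one more weighted pseudo-count to `a` or `b`, with no retraining.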

The Updating Sequential Probability Ratio Test (USPRT) developed by MaCurdy *et al.* (2009, Updating sequential probability ratio test for real-time surveillance of vaccine safety, unpublished working paper) has been used by the U.S. Food and Drug Administration for near real-time surveillance of the safety of the flu vaccine since 2008. This procedure was the first method developed to account for data delay in pharmacovigilance studies. However, the current implementation is based on the strong assumption that the clinical and reporting delays do not vary from previous years. When this assumption does not hold, size distortion of the USPRT procedure might result. The goal of this article is to numerically investigate the robustness of the detection probabilities of the USPRT method with respect to possible misspecification of the clinical and reporting delay distributions through extensive simulations. We find that if the delay distribution used in calibrating the critical bound is longer than the delay distribution in the data generating process, then there is a higher rate of false signaling, and vice versa. This is an inherent property of a real-time testing procedure. However, the distortion created by misspecifying the reporting delay distribution appears to be insignificant when compared to the overall power generated by an elevation of the adverse event rate. The size distortion is unevenly distributed across the interim tests, so the effect of misspecification of the delay distributions is more prominent in the median time-to-signal. In summary, although a misspecified delay distribution induces size distortion, we find that it does not erode the overall power.
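For readers unfamiliar with the underlying machinery, the classical Wald SPRT for Poisson counts can be sketched in a few lines; the USPRT itself adds the delay-adjusted updating discussed above, which this simplified stand-in omits:

```python
import math

def sprt_poisson(counts, lam0, lam1, alpha=0.05, beta=0.10):
    """Classical Wald SPRT for per-period Poisson counts.

    Tests H0: rate = lam0 against H1: rate = lam1 (> lam0), accumulating the
    log-likelihood ratio until it crosses the Wald bounds. Returns
    ('H1', t) on a safety signal at interim test t, ('H0', t) on acceptance,
    or ('continue', None) if the data run out first.
    """
    upper = math.log((1 - beta) / alpha)    # signal threshold
    lower = math.log(beta / (1 - alpha))    # acceptance threshold
    llr = 0.0
    for t, x in enumerate(counts, 1):
        # Poisson log-likelihood ratio contribution of one period's count x
        llr += x * math.log(lam1 / lam0) - (lam1 - lam0)
        if llr >= upper:
            return ("H1", t)
        if llr <= lower:
            return ("H0", t)
    return ("continue", None)
```

The size distortion studied in the article arises when the counts fed into such a test are delay-adjusted using the wrong delay distribution, shifting the accumulated log-likelihood ratio relative to the calibrated bounds.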

In several modern applications, ranging from genomics to neuroimaging, there is a need to compare measurements across different populations, such as those collected from samples of healthy and diseased individuals. The interest is in detecting a group effect, and typically many thousands or even millions of tests need to be performed simultaneously, as exemplified in genomics where single tests are applied for each gene across the genome. Traditional procedures, such as multivariate analysis of variance (MANOVA), are not suitable when dealing with nonvector-valued data structures such as functional or graph-structured observations. In this article, we discuss an existing distance-based MANOVA-like approach, the distance-based F (DBF) test, for detecting such differences. The null sampling distribution of the DBF test statistic depends on the distribution of the measurements and the chosen distance measure, and is generally unavailable in closed form. In practice, Monte Carlo permutation methods are deployed, which introduce errors in estimating small *p*-values and can inflate familywise type I error rates when too few permutations are used. In this work, we propose an approximate distribution for the DBF test, allowing inferences to be drawn without the need for costly permutations. This is achieved by approximating the permutation distribution that would be obtained by enumerating all permutations with the Pearson type III distribution, using moment matching. The use of the Pearson type III distribution is motivated via empirical observations with real data. We provide evidence with real and simulated data that the resulting approximate null distribution of the DBF test is flexible enough to work well with a range of distance measures.
Through extensive simulations involving different data types and distance measures, we provide evidence that the proposed methodology yields the same statistical power that would otherwise only be achievable if many millions of Monte Carlo permutations were performed.
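The moment-matching step can be sketched as follows. This illustrative version estimates the three moments from a small sample of permuted statistics; the approach described above derives the moments of the full permutation distribution, avoiding sampling altogether:

```python
import numpy as np
from scipy import stats

def pearson3_pvalue(obs_stat, perm_stats):
    """Approximate a permutation p-value by moment-matching a Pearson
    type III distribution (parameterized in scipy by skewness, location,
    and scale) to a sample of permuted test statistics."""
    perm_stats = np.asarray(perm_stats, float)
    mu = perm_stats.mean()                      # first moment
    sigma = perm_stats.std(ddof=1)              # second moment (sd)
    skew = stats.skew(perm_stats, bias=False)   # third standardized moment
    # Upper-tail probability under the fitted Pearson III distribution
    return stats.pearson3.sf(obs_stat, skew, loc=mu, scale=sigma)
```

Because the fitted distribution has a smooth analytic tail, small *p*-values can be resolved far below the 1/(number of permutations) floor of a raw Monte Carlo estimate.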

The proliferation of electronic health records, driven by advances in technology and legislative measures, is stimulating interest in the analysis of passively collected administrative and clinical data. Observational data present exciting challenges and opportunities to researchers interested in comparing the effectiveness of different treatment regimes and, as personalized medicine requires, estimating how effectiveness varies among subgroups. In this study, we provide new motivation for the *local control* approach to the analysis of large observational datasets in which patients are first clustered in pretreatment covariate space and treatment comparisons are made within subgroups of similar patients. The motivation for such an analysis is that the resulting local treatment effect estimates make inherently fair comparisons even when treatment cohorts suffer variation in balance (treatment choice fraction) across pretreatment covariate space. We use an example of Simpson's paradox to show that estimates of the overall average treatment effect, which marginalize over covariate space, can be misleading. Thus, we provide an alternative definition that uses a single, shared marginal distribution to define overall treatment comparisons that are inherently fair given the observed covariates. However, we also argue that overall treatment comparisons should no longer be the focus of comparative effectiveness research; the possibility that treatment effectiveness does vary across patient subpopulations must not be left unexplored. In the spirit of the now ubiquitous concept of personalized medicine, estimating heterogeneous treatment effects in clinically relevant subgroups will allow for, within the limits of the available data, fair treatment comparisons that are more relevant to individual patients.
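The paradox is easy to reproduce with the classic kidney-stone numbers (Charig et al., 1986), often used for exactly this purpose. Treatment A wins within each stone-size subgroup yet loses marginally, because A was preferentially given to the harder large-stone cases; standardizing both arms over a single shared subgroup distribution restores the fair comparison the abstract argues for:

```python
def rate(successes, total):
    """Success proportion for one (treatment, subgroup) cell."""
    return successes / total

# (successes, total) per treatment, within each stone-size subgroup
small = {"A": (81, 87), "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}

def overall(tx):
    """Marginal success rate, pooling subgroups (Simpson's-paradox-prone)."""
    s = small[tx][0] + large[tx][0]
    n = small[tx][1] + large[tx][1]
    return rate(s, n)

def standardized(tx):
    """Success rate standardized to the SHARED subgroup distribution:
    both treatments are weighted by the combined subgroup sizes."""
    n_small = small["A"][1] + small["B"][1]
    n_large = large["A"][1] + large["B"][1]
    p = n_small * rate(*small[tx]) + n_large * rate(*large[tx])
    return p / (n_small + n_large)
```

Here A beats B in both subgroups and in the standardized comparison, while the naive marginal comparison reverses the ordering.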

Safety of medical products is a major public health concern. We present a critical discussion of the currently used analytical tools for mining spontaneous reporting systems (SRS) to identify safety signals after use of medical products. We introduce a pattern discovery framework for the analysis of SRS. The terminology ‘pattern discovery’ is borrowed from the engineering and artificial intelligence literature and signifies that the basis of the proposed framework is the medical case, formalizing the cognitive paradigm known to clinicians who evaluate individual patients and individual case safety reports submitted to SRS. The fundamental contribution of this approach is a strong probabilistic component that may account for selection and other biases and facilitates rigorous modeling and inference. We discuss in some depth the concept of a signal in pharmacovigilance and connect it with the concept of a pattern; we illustrate this conceptual framework using the example of anaphylaxis. Finally, we propose a research agenda in statistics, informatics, and pharmacovigilance practices needed to advance the pattern discovery framework in both the short and long terms.

Given the recent interest in subgroup-level studies and personalized medicine, causal inference methods in health research have increasingly been developed to estimate interaction effects of measured confounders. In estimating interaction effects, the inverse propensity weighting (IPW) method has been widely advocated despite the immediate availability of competing methods such as G-computation. This paper compares the advocated IPW method, the G-computation method, and our new tree-based standardization method, which we call the Interaction effect Tree (IT). The IT procedure uses a likelihood-based decision rule to divide the sample into homogeneous subgroups within which G-computation can be applied. Our simulation studies indicate that the IT-based method, along with G-computation, works robustly, while the advocated IPW method requires caution in its weighting. We applied the IT-based method to assess the effect of being overweight or obese on coronary artery calcification (CAC) in the Chicago Healthy Aging Study cohort.
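The two estimators being compared can be sketched in a few lines. In this simplified version the homogeneous strata are given directly rather than grown by the IT procedure, and the propensity scores are assumed known:

```python
import numpy as np

def ipw_ate(y, t, ps):
    """Inverse propensity weighted average treatment effect
    (Horvitz-Thompson form); ps holds the propensity P(t=1 | covariates)."""
    y, t, ps = map(np.asarray, (y, t, ps))
    return float(np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps)))

def gcomp_ate(y, t, groups):
    """G-computation over given strata: average the within-stratum
    treated-minus-control mean differences, weighted by stratum size."""
    y, t, groups = map(np.asarray, (y, t, groups))
    effects, weights = [], []
    for g in np.unique(groups):
        m = groups == g
        effects.append(y[m & (t == 1)].mean() - y[m & (t == 0)].mean())
        weights.append(m.mean())
    return float(np.average(effects, weights=weights))
```

With correctly specified propensities and strata the two estimators agree; the abstract's caution concerns IPW's sensitivity when the weights are extreme or misspecified.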

The problem of max-kernel search arises everywhere: given a query point *q*, a set *S* of *N* reference objects, and some kernel *K*, find arg max over *r* in *S* of *K*(*q*, *r*). Max-kernel search is ubiquitous and appears in countless domains of science, thanks to the wide applicability of kernels. A few domains include image matching, information retrieval, bio-informatics, similarity search, and collaborative filtering (to name just a few). However, there is no generalized technique for efficiently solving max-kernel search. This paper presents a single-tree algorithm called *single-tree FastMKS* which returns the max-kernel solution for a single query point in provably O(log *N*) time (where *N* is the number of reference objects), and also a dual-tree algorithm (*dual-tree FastMKS*) which is useful for max-kernel search with many query points. If the set of query points is of size O(*N*), this algorithm returns a solution in provably O(*N*) time, which is significantly better than the linear scan solution; these bounds are dependent on the expansion constant of the data. These algorithms work for abstract objects, as they *do not* require explicit representation of the points in kernel space. Empirical results for a variety of datasets show up to five orders of magnitude speedup in some cases. In addition, we present approximate extensions of the FastMKS algorithms that can achieve further speedups.
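The linear-scan baseline that these tree-based algorithms improve upon is simply an exhaustive argmax over the kernel evaluations, costing one kernel call per reference object per query:

```python
def max_kernel_search(query, references, kernel):
    """O(N)-per-query linear scan for max-kernel search: evaluate the kernel
    against every reference object and keep the argmax. Works for abstract
    objects, since only kernel evaluations are needed."""
    best, best_val = None, float("-inf")
    for r in references:
        v = kernel(query, r)
        if v > best_val:
            best, best_val = r, v
    return best, best_val
```

FastMKS replaces this scan with tree traversals that prune reference subtrees whose kernel values can be bounded away from the current best, which is where the logarithmic per-query bound comes from.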

Transfer learning has benefited many real-world applications where labeled data are abundant in source domains but scarce in the target domain. As there are usually multiple relevant domains from which knowledge can be transferred, multiple source transfer learning (MSTL) has recently attracted much attention. However, we face two major challenges when applying MSTL. First, without knowledge about the difference between source and target domains, *negative transfer* occurs when knowledge is transferred from highly irrelevant sources. Second, the existence of *imbalanced class distributions*, where examples in one class dominate, can lead to improper judgment of the source domains' relevance to the target task. Since existing MSTL methods are usually designed to transfer from relevant sources with balanced distributions, they will fail in applications where these two challenges persist. In this article, we propose a novel two-phase framework to effectively transfer knowledge from multiple sources even when there exist irrelevant sources and imbalanced class distributions. First, an effective supervised local weight scheme is proposed to assign a proper weight to each source domain's classifier based on its ability to predict accurately on each local region of the target domain. The second phase then learns a classifier for the target domain by solving an optimization problem that concerns both training error minimization and consistency with the weighted predictions gained from the source domains. A theoretical analysis shows that as the number of source domains increases, the probability that the proposed approach has an error greater than a bound becomes exponentially small. We further extend the proposed approach to an online processing scenario to conduct transfer learning on continuously arriving data.
Extensive experiments on disease prediction, spam filtering, and intrusion detection datasets demonstrate that: (i) the proposed two-phase approach outperforms existing MSTL approaches thanks to its ability to tackle the negative transfer and imbalanced distribution challenges, and (ii) the proposed online approach achieves performance comparable to the offline scheme.
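Phase one can be caricatured as accuracy-based local weighting of source classifiers; the function names and the plain accuracy weights below are assumptions for illustration, not the paper's exact scheme:

```python
import numpy as np

def local_weights(source_clfs, X_region, y_region):
    """Weight each source classifier by its accuracy on a labeled local
    region of the target domain, so irrelevant sources get low weight there."""
    accs = np.array([np.mean(clf(X_region) == y_region) for clf in source_clfs])
    if accs.sum() == 0:
        return np.full(len(source_clfs), 1.0 / len(source_clfs))
    return accs / accs.sum()

def weighted_vote(source_clfs, weights, x):
    """Combine weighted binary (0/1) source predictions for one target point;
    phase two of the framework would instead fit a target classifier that
    stays consistent with these weighted predictions."""
    score = sum(w * clf(np.array([x]))[0] for clf, w in zip(source_clfs, weights))
    return int(score >= 0.5)
```

Because the weights are computed per local region rather than globally, a source that is only locally relevant can still contribute where it is accurate, which is how the scheme counters both negative transfer and class imbalance.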

Regression methods are commonly used to learn the mapping from a set of predictor variables to a continuous-valued target variable such that their prediction errors are minimized. However, minimizing the errors alone may not be sufficient for some applications, such as climate modeling, which require the overall predicted distribution to resemble the actual observed distribution. On the other hand, histogram equalization methods, such as quantile mapping, are often used in climate modeling to alter the distribution of input data to fit the distribution of observed data, but they provide no guarantee of accurate predictions. This paper presents a flexible regression framework known as *contour regression* that simultaneously minimizes the prediction error and removes biases in the predicted distribution. The framework is applicable to linear, nonlinear, and conditional quantile models and can utilize data from heterogeneous sources. We demonstrate the effectiveness of the framework in fitting the daily minimum and maximum temperatures as well as precipitation for 14 climate stations in Michigan. The framework showed marked improvement over standard regression methods in terms of minimizing distribution bias.
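One assumed form of such an objective, shown here only to make the trade-off concrete (it is not necessarily the paper's exact formulation), adds a quantile-mismatch penalty to the usual squared error:

```python
import numpy as np

def contour_loss(y_pred, y_obs, lam=1.0):
    """Sketch of a contour-regression style objective: pointwise squared
    prediction error plus a distribution-bias penalty comparing the sorted
    (quantile) values of predictions and observations."""
    y_pred = np.asarray(y_pred, float)
    y_obs = np.asarray(y_obs, float)
    mse = np.mean((y_pred - y_obs) ** 2)
    # quantile-mismatch term: zero iff the two empirical distributions match
    dist_bias = np.mean((np.sort(y_pred) - np.sort(y_obs)) ** 2)
    return float(mse + lam * dist_bias)
```

Ordinary least squares shrinks predictions toward the mean, deflating predicted variance; the second term penalizes exactly that deflation, which is the distribution bias the abstract refers to.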

Transfer learning, which aims to help learning tasks in a target domain by leveraging knowledge from auxiliary domains, has been demonstrated to be effective in different applications such as text mining, sentiment analysis, and so on. In addition, in many real-world applications, auxiliary data are described from multiple perspectives and usually carried by multiple sources. For example, to help classify videos on YouTube, which include three perspectives (image, voice, and subtitles), one may borrow data from Flickr, Last.FM, and Google News. Although any single instance in these domains covers only part of the views available on YouTube, the pieces of information they carry may compensate one another. If we can exploit these auxiliary domains in a collective manner and transfer the knowledge to the target domain, we can improve the target model from multiple perspectives. In this article, we consider this transfer learning problem as *Transfer Learning with Multiple Views and Multiple Sources*. As different sources may have different probability distributions, and different views may compensate or be inconsistent with each other, merging all data in a simplistic manner will not give an optimal result. Thus, we propose a novel algorithm to leverage knowledge from different views and sources collaboratively, letting different views from different sources complement each other through a co-training-style framework while revising the distribution differences among domains. We conduct empirical studies on several real-world datasets to show that the proposed approach can improve the classification accuracy by up to 8% against different kinds of state-of-the-art baselines.

Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on *triangles*, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider *wedge sampling*, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimating local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.
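The core wedge-sampling estimator for the global clustering coefficient can be sketched as follows; a wedge is a length-2 path, and the clustering coefficient is the fraction of wedges that are closed by a third edge:

```python
import random

def wedge_sampling_cc(adj, n_samples=10000, seed=0):
    """Estimate the global clustering coefficient by uniform wedge sampling.

    adj : dict mapping node -> set of neighbors (undirected graph).
    Pick a wedge center v with probability proportional to its wedge count
    d(d-1)/2, pick two distinct random neighbors, and check whether they
    are linked (i.e., the wedge is closed by a triangle edge).
    """
    rng = random.Random(seed)
    nodes = [v for v in adj if len(adj[v]) >= 2]
    wedges = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in nodes]
    closed = 0
    for _ in range(n_samples):
        v = rng.choices(nodes, weights=wedges)[0]
        a, b = rng.sample(sorted(adj[v]), 2)
        closed += b in adj[a]
    return closed / n_samples
```

Each sample touches only one node's neighbor list, so the cost per sample is independent of the number of edges, which is the source of the speedups reported above; standard binomial confidence intervals give the probabilistic error estimates.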

With the development of social media and social networks, user-generated content such as forums, blogs, and comments is not only getting richer, but also ubiquitously interconnected with many other objects and entities, forming a heterogeneous information network. Sentiment analysis on such data can no longer ignore the information network, since it carries a great deal of rich and valuable information, explicit or implicit, some of which can be observed while the rest cannot. However, most existing methods rely heavily on the observed user–user friendship or similarity between objects, and can only handle a subgraph associated with a single topic. None of them takes into account hidden and implicit dissimilarity, opposite opinions, and foe relationships. In this paper, we propose a novel information network-based framework which can infer hidden similarity and dissimilarity between users by exploring similar and opposite opinions, so as to improve post-level and user-level sentiment classification at the same time. More specifically, we develop a new *meta path*-based measure for inferring pseudo-friendship as well as dissimilarity between users, and propose a semi-supervised refining model by encoding similarity and dissimilarity from both user-level and post-level relations. We extensively evaluate the proposed approach and compare it with several state-of-the-art techniques on two real-world forum datasets. Experimental results show that our proposed model, with only 10.5% of samples labeled, can achieve better performance than a traditional supervised model trained on 61.7% of the data samples.