Tutorials Given

  1. Sample Selection Bias - Covariate Shift: Problems, Solutions, and Applications




Sample Selection Bias - Covariate Shift: Problems, Solutions, and Applications

by Wei Fan and Masashi Sugiyama, given at ICDM'08, Pisa, Italy, December 2008

Sample selection bias/covariate shift is a common problem encountered when applying data mining algorithms to many real-world applications. Traditionally, it is assumed that training and test data are sampled from the same probability distribution, the so-called "stationary or non-biased distribution assumption." However, this assumption is often violated in practice. Typical examples include marketing solicitation, fraud detection, drug testing, loan approval, school enrollment, and medical diagnosis. For these applications, the only labeled data available for training is a biased representation, in various ways, of the future data on which the inductive model will predict. Intuitively, some examples sampled frequently into the training data may actually be infrequent in the test data, and vice versa. When this happens, an inductive model constructed from a biased training set may not be as accurate on unbiased test data as it would have been had there been no selection bias in the training data. For example, there has been speculation that the most recent US subprime mortgage problem is due to a sample selection bias problem, in which the defaulting customers do not follow the same risk model as traditional mortgage customers. In this tutorial, we will use various examples to describe the problem, present various solutions, and end with a systematic approach to addressing a real-world problem.
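The effect described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not material from the tutorial itself: the target function `true_label`, the selection mechanism `selection_prob`, and the helper names are all hypothetical. The reweighting step uses inverse selection-probability weights, which is one standard correction and is shown purely as an example; the abstract does not say which solutions the tutorial covers. The sketch also assumes the selection probability is known, which is rarely the case in practice.

```python
import random


def true_label(x):
    # Hypothetical ground truth: a curved target that a straight line can only approximate.
    return x * x


def selection_prob(x):
    # Assumed selection mechanism: examples with small x are much more likely to be labeled.
    return 0.05 + 0.95 * (1.0 - x) ** 2


def fit_line(xs, ys, ws=None):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    if ws is None:
        ws = [1.0] * len(xs)
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    # Mean squared error of the line (a, b) on the given data.
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_experiment(seed=0, n=20000):
    rng = random.Random(seed)
    pool = [rng.random() for _ in range(n)]      # unbiased population of inputs
    test_x = [rng.random() for _ in range(n)]    # unbiased test set
    test_y = [true_label(x) for x in test_x]

    # Biased training set: each pool example is kept with probability selection_prob(x).
    biased_x = [x for x in pool if rng.random() < selection_prob(x)]
    # Unbiased training set of the same size, for a fair comparison.
    unbiased_x = rng.sample(pool, len(biased_x))

    biased_y = [true_label(x) for x in biased_x]
    unbiased_y = [true_label(x) for x in unbiased_x]
    # Inverse selection-probability weights (assumes the mechanism is known).
    weights = [1.0 / selection_prob(x) for x in biased_x]

    return {
        "unbiased": mse(fit_line(unbiased_x, unbiased_y), test_x, test_y),
        "biased": mse(fit_line(biased_x, biased_y), test_x, test_y),
        "reweighted": mse(fit_line(biased_x, biased_y, weights), test_x, test_y),
    }


if __name__ == "__main__":
    for name, err in run_experiment().items():
        print(f"{name:>10}: test MSE = {err:.4f}")
```

In this setup the line fitted to the biased sample concentrates on the region where small x dominates, so its error on the unbiased test set is higher than that of a line fitted to an unbiased sample of the same size. Reweighting each biased example by the inverse of its selection probability moves the fit back toward the unbiased one.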

The PowerPoint slides can be found here