5th International ICST Conference on Collaborative Computing: Networking, Applications, Worksharing

Research Article

VisGBT: Visually analyzing evolving datasets for adaptive learning

  • @INPROCEEDINGS{10.4108/ICST.COLLABORATECOM2009.8281,
        author={Keke Chen and Fengguang Tian},
        title={VisGBT: Visually analyzing evolving datasets for adaptive learning},
        proceedings={5th International ICST Conference on Collaborative Computing: Networking, Applications, Worksharing},
        proceedings_a={COLLABORATECOM},
        year={2009},
        month={12},
        keywords={Computer science, Costs, Data analysis, Data engineering, Data visualization, Machine learning, Machine learning algorithms, Multidimensional systems, Regression analysis, Training data},
        doi={10.4108/ICST.COLLABORATECOM2009.8281}
    }
    
  • Keke Chen
    Fengguang Tian
    Year: 2009
    VisGBT: Visually analyzing evolving datasets for adaptive learning
    COLLABORATECOM
    ICST
    DOI: 10.4108/ICST.COLLABORATECOM2009.8281
Keke Chen1,*, Fengguang Tian1,*
  • 1: Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA
*Contact email: keke.chen@wright.edu, tian.9@wright.edu

Abstract

Many machine learning problems, such as domain adaptation and learning drifting concepts from data streams, involve changes in both the feature distribution and the label distribution. Correctly detecting, identifying, and understanding changes in data distributions can help us select appropriate data samples or algorithms for learning models. However, because training datasets are often large and high-dimensional, they are difficult to analyze effectively. Furthermore, the joint distribution between features and labels makes the problem more difficult to handle. In this paper, we propose a visual analysis method (VisGBT) that combines the gradient-boosting-trees (GBT) modeling method, regression analysis, and multidimensional visualization to capture the mismatches between datasets and models. The GBT model consists of a series of trees with a predefined number of terminal (leaf) nodes per tree. These terminal nodes partition the high-dimensional space using a few of the most informative features to minimize the label prediction error. VisGBT maps various kinds of detailed model information to the terminal node matrix (TNM) and visualizes it with an appropriate design. With this visual analysis method, we can readily identify detailed differences between datasets with the help of a learned model. We illustrate the use of various visual patterns and, in particular, show how this method can help us analyze domain similarity for domain adaptation.
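
To make the idea of a terminal node matrix more concrete, the following is a minimal sketch and not the authors' implementation. It assumes squared-error boosting with two-leaf trees (stumps), and it assumes that each TNM cell stores the sample count and the mean residual of the samples falling into that terminal node; the abstract does not specify which statistics VisGBT actually maps. All function names and the toy data are hypothetical.

```python
from statistics import mean

def fit_stump(X, r):
    """Fit a 2-leaf regression tree to residuals r by exhaustive split search."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [ri for row, ri in zip(X, r) if row[j] <= t]
            right = [ri for row, ri in zip(X, r) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    _, j, t, ml, mr = best
    return j, t, (ml, mr)  # feature index, threshold, leaf values

def leaf_of(stump, row):
    """Index (0 or 1) of the terminal node that a sample falls into."""
    j, t, _ = stump
    return 0 if row[j] <= t else 1

def fit_gbt(X, y, n_trees=5, rate=0.5):
    """Gradient boosting for squared loss: each tree fits the current residuals."""
    base = mean(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, resid)
        trees.append(stump)
        pred = [pi + rate * stump[2][leaf_of(stump, row)] for pi, row in zip(pred, X)]
    return base, rate, trees

def predict(model, row):
    base, rate, trees = model
    return base + sum(rate * s[2][leaf_of(s, row)] for s in trees)

def terminal_node_matrix(model, X, y):
    """Rows are trees, columns are terminal nodes.
    Each cell is (sample count, mean residual of the full model) -- an assumed choice."""
    trees = model[2]
    resid = [yi - predict(model, row) for row, yi in zip(X, y)]
    tnm = []
    for s in trees:
        cells = []
        for leaf in (0, 1):
            rs = [ri for row, ri in zip(X, resid) if leaf_of(s, row) == leaf]
            cells.append((len(rs), mean(rs) if rs else 0.0))
        tnm.append(cells)
    return tnm

# Toy data: a source dataset and a second dataset whose labels drift where x0 > 1.0.
X = [[x / 10, (x * 7 % 10) / 10] for x in range(20)]
y = [2 * a + (1.0 if b > 0.5 else 0.0) for a, b in X]
y_new = [yi + (1.0 if row[0] > 1.0 else 0.0) for row, yi in zip(X, y)]

model = fit_gbt(X, y)
tnm_src = terminal_node_matrix(model, X, y)      # model vs. its own training data
tnm_new = terminal_node_matrix(model, X, y_new)  # model vs. the drifted data

for row_src, row_new in zip(tnm_src, tnm_new):
    print(row_src, row_new)
```

Comparing the two matrices cell by cell gives a rough, non-visual analogue of the mismatch that a TNM visualization is meant to expose: terminal nodes whose regions overlap the drifted part of the feature space should show a shifted mean residual on the second dataset.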