Technology

Technology & Services: KDE Tutorial: How it Works

Overview | Applications / Classification | How it Works | Supporting Software | Publications

The Knowledge Discovery Engine®
A Powerful, General, Classification Algorithm

How the KDE Works

The KDE finds a near-optimal feature set solution (the “model”) using three components:

  • a genetic algorithm that selects features
  • a self-organizing, adaptive pattern recognition algorithm that clusters data
  • a statistic that provides information on cluster homogeneity

To build a model, the algorithm will cycle through a sequence of events many times. Each of these cycles is called a generation. During each generation, the algorithm seeks to find a better model than the one it found in the previous generation. In order to build and test models in this way, KDE requires the available sample data to be divided into at least three distinct groups, although a fourth is also preferable:

  • Training data are used in an iterative process to create the cluster maps for assessment. These data are used in each cycle of model evolution.
  • Testing data are used to score each cluster map generated by the training data. The testing data are also used in each iterative cycle; and the final model evolves specifically to fit this data set.
  • Validation data are used only after a final model has evolved from the training/testing iteration to evaluate the final model performance. The validation data are used only once for each model. The results from this data set are used to select the single best model from a set of modeling runs.
  • Blinded validation data (optional, but strongly recommended) are used only once as an independent measure of performance of the single best model.

This strategy is most successful when there are large data sets available on which to train the models.

Modeling is performed in a massively parallel computation. It starts by defining the modeling conditions and selecting a large number of random, unique, combinations of features called chromosomes. For example, for a 3-feature map (N=3), the user may specify starting with 50,000 unique 3-feature chromosomes.

Figure 1
Figure 2
Figure 3

For each chromosome, the KDE extracts the corresponding 3 features from each data file in the training set, normalizes the data within the features (Figure 1), and projects the resulting data as a cluster map in 3 dimensions. This results, in this case, in 50,000 cluster maps.

Figure 1 shows how data is translated into a cluster map by KDE. The first part of this figure shows the data for chromosome XYZ, from two files, plotted to create two unique points in 3-dimensional space. As more data are added, the additional points may cluster around the original points or may seed additional clusters. The clustering is unsupervised; the number of clusters that form depends entirely on the nature of the data.

The homogeneity of each cluster map is then scored, using the testing data set, and ranked. The chromosomes are then subjected to a low rate of genetic recombination, in a manner that increases the chance that the “better” chromosomes (i.e., yielding the better cluster maps) are more likely to recombine with each other to produce even better chromosomes (Figure 2). A low rate of spontaneous mutation during this recombination also enables new features to enter the chromosomes.

Figure 2 describes the process of Genetic Recombination. Two 3-feature chromosomes, ABC and JKL, are allowed to recombine. The results are the parental chromosomes and two hybrid chromosomes. Recombination increases the likelihood that informative features on different chromosomes can reside on the same chromosome, creating a more informative combination of features. The cross-over point in recombination is selected randomly.

This new set of 50,000 chromosomes now goes through the cycle again, and the process is repeated, until a near optimal solution emerges (Figure 3). In the last cycle, the best chromosome, and its associated cluster map, is designated as “the model.” This final model is then tested against the validation data set to provide an independent measure of the model performance.

Figure 3 provides an overview of the KDE algorithm. Since the number of features required to obtain the best classification of data is not normally known in advance, a range will be examined to identify the optimum number of features. The final output includes the features and coordinates of the cluster map, as well as measures of accuracy, sensitivity, specificity, and their standard deviations. The KDE also measures robustness and monitors over fitting of data.


Overview | Applications / Classification | How it Works | Supporting Software | Publications

Back to top