News & EventsFebruary 19, 2008 October 2007 July 3, 2007 |
![]() Technology & Services: KDE™ Tutorial: How it WorksOverview | Applications / Classification | How it Works | Supporting Software | Publications The Knowledge Discovery Engine® How the KDE WorksThe KDE finds a near-optimal feature set solution (the “model”) using three components:
To build a model, the algorithm will cycle through a sequence of events many times. Each of these cycles is called a generation. During each generation, the algorithm seeks to find a better model than the one it found in the previous generation. In order to build and test models in this way, KDE requires the available sample data to be divided into at least three distinct groups, although a fourth is also preferable:
This strategy is most successful when there are large data sets available on which to train the models. Modeling is performed in a massively parallel computation. It starts by defining the modeling conditions and selecting a large number of random, unique, combinations of features called chromosomes. For example, for a 3-feature map (N=3), the user may specify starting with 50,000 unique 3-feature chromosomes. For each chromosome, the KDE extracts the corresponding 3 features from each data file in the training set, normalizes the data within the features (Figure 1), and projects the resulting data as a cluster map in 3 dimensions. This results, in this case, in 50,000 cluster maps. Figure 1 shows how data is translated into a cluster map by KDE. The first part of this figure shows the data for chromosome XYZ, from two files, plotted to create two unique points in 3-dimensional space. As more data are added, the additional points may cluster around the original points or may seed additional clusters. The clustering is unsupervised; the number of clusters that form depends entirely on the nature of the data. The homogeneity of each cluster map is then scored, using the testing data set, and ranked. The chromosomes are then subjected to a low rate of genetic recombination, in a manner that increases the chance that the “better” chromosomes (i.e., yielding the better cluster maps) are more likely to recombine with each other to produce even better chromosomes (Figure 2). A low rate of spontaneous mutation during this recombination also enables new features to enter the chromosomes. Figure 2 describes the process of Genetic Recombination. Two 3-feature chromosomes, ABC and JKL, are allowed to recombine. The results are the parental chromosomes and two hybrid chromosomes. Recombination increases the likelihood that informative features on different chromosomes can reside on the same chromosome, creating a more informative combination of features. The cross-over point in recombination is selected randomly. This new set of 50,000 chromosomes now goes through the cycle again, and the process is repeated, until a near optimal solution emerges (Figure 3). In the last cycle, the best chromosome, and its associated cluster map, is designated as “the model.” This final model is then tested against the validation data set to provide an independent measure of the model performance. Figure 3 provides an overview of the KDE algorithm. Since the number of features required to obtain the best classification of data is not normally known in advance, a range will be examined to identify the optimum number of features. The final output includes the features and coordinates of the cluster map, as well as measures of accuracy, sensitivity, specificity, and their standard deviations. The KDE also measures robustness and monitors over fitting of data. Overview | Applications / Classification | How it Works | Supporting Software | Publications
|