News & EventsFebruary 19, 2008 October 2007 July 3, 2007 |
![]() Technology & Services: KDE™ Tutorial: OverviewOverview | Applications / Classification | How it Works | Supporting Software | Publications The Knowledge Discovery Engine® OverviewData does not necessarily equal information. The world is awash in data, and decision makers in all disciplines -- financial market analysts, national security experts, scientists in every field -- are continually faced with the challenge of too much data, and limited tools for converting the data to information. At the core of this challenge is the fact that complex problems rarely lend themselves to simple answers, a single answer, or a single data point. The problem is well illustrated in biology, where attempts to diagnose cancer using a single biomarker, e.g., Prostate Specific Antigen (PSA) for prostate cancer, produce results that are insufficiently sensitive and specific. To improve classifications, biologists often combine multiple features from complex data streams, such as those generated by mass spectrometry. This creates the problem of finding the right combination of a few features within the data stream that can provide the classification required. As an example of this problem consider the challenge of identifying a 5-feature pattern that is diagnostic for a disease in a mass spectral profile of serum containing 100,000 m/z features. A naïve search of all possible 5-feature combinations requires searching through ~[(100,000)5]/5! = 8x1022 potential combinations. To put this number in perspective, there are only ~ 3.2x107 seconds in a year. To solve this single problem in a year would require searching ~ 2.5x1015 combinations per second. Even with such a generous timeframe, the task is beyond any existing computing facility. The problem is magnified in the more typical situation, where the optimum number of features is not known and the task is to scan all potential combinations of 3, 4, 5, 6 or more features to identify the best feature combination. Clearly an impossible task. There are a number of more rational ways to solve this challenge. The simplest and most widely used methods are based on linear transformation techniques using linear statistical methods such as regression analysis. An alternative approach is to simplify the complexity of the search by reducing the dimensionality of the problem using techniques such as principal component analysis. However, reducing dimensionality may lose valuable data. Thus, while such methods are helpful in relatively straightforward analyses, they are less effective when applied to complex problems. It is because there is often a direct relationship between the complexity of a system and its non-linearity that classic statistically based algorithms generally fail to produce the required classification. Other techniques such as those based on k-nearest neighbor and decision trees, move beyond simple linear combinations of features and can explore complex data space more effectively. However, these approaches are inherently limited because they explore the feature space in a ‘steepest descent’ manner which is prone to being caught in local minima, resulting in sub-optimal classification models. In more sophisticated approaches, a genetic algorithm can be used to minimize the effects of local minima. To address these difficulties, Correlogic has developed an efficient, non-linear, classification algorithm, the Knowledge Discovery Engine (KDE), which embraces the full dimensionality of the feature space. KDE technology combines a genetic algorithm with a fast adaptive pattern recognition algorithm to identify a near-optimal set of classification features. This approach allows a rapid scan through the feature space and very efficient evaluation of feature combinations, overcoming the fundamental barriers of an exhaustive search. As a result, it can identify classification models in just hours to days, depending on the size of the data streams, the number of features composing the patterns, and the number of objects to be classified. The algorithm also stands out in two important respects from other classification systems. First, it uses a feature-specific normalization scheme that bounds the variations among the specific features examined. This contrasts to most normalization schemes, which normalize across all features within a spectrum. The value of this approach is the ability to minimize the impact of large variations in detector responses, such as those found in mass spectrometers or microarrays. Second, the algorithm divides data into three distinct sets during the modeling process to avoid over fitting and provide a more robust assessment of potential solutions than methods that rely solely on a cross-validation strategy. The type of resulting classification depends upon the study and type of data stream that is analyzed. In many cases a binary classification is appropriate, such as classifying samples of “phenotype 1” from “phenotype 2.” However, more complex classification states such as ternary (three phenotypes), quaternary (four phenotypes) or higher are readily achievable [2,3]. Overview | Applications / Classification | How it Works | Supporting Software | Publications
|