Data mining is one of the most promising modern information technologies. The corporate world has learned to derive new value from data by utilizing various intelligent tools and algorithms designed for the automated discovery of non-trivial, useful, and previously unknown knowledge in raw data.
- Which factors influence the future variation of the price of some security shares?
- What characteristics of a potential customer of some service make him/her the most probable buyer?
These and numerous other business questions can be successfully addressed by data mining.
The majority of available data mining tools are based on a few well-established technologies for data analysis. Decision trees, neural networks, case-based reasoning, genetic algorithms, and of course linear regression represent the most familiar analysis techniques. Different knowledge discovery methods are best suited for different applications. Among the useful knowledge discovery tasks one can name dependency detection, numerical prediction, explicit relation modeling, and the derivation of classification rules.
Despite the usefulness of traditional data mining methods in various situations, we choose to concentrate here first on the problems that plague these methods. Then we discuss the solutions to these problems, which become available with the advent of SKAT – a next-generation data mining technology. We outline the reasons, foundations, and commercial implementations of this emerging approach.
Among the various tasks a data mining system is asked to perform, two questions are encountered most frequently:
- Which database fields influence the selected target field?
- Precisely how does the target field depend on other fields in the database?
While there are many successful methods designed to answer the first question, it is far more difficult to answer the second. Why is that? Detecting a dependency is relatively simple: if, across a number of cases with close values of all parameters except some parameter X, the target parameter Y varies considerably, then Y depends on X. For multi-dimensional dependencies the issue becomes less straightforward, but the basic idea for solving the problem remains the same. In contrast, the automated determination of an explicit form of the dependence between several variables is significantly more difficult. The solution to this problem cannot be based on similar simple-minded considerations.
Traditional methods for finding the precise form of a sought relation implement the search for an expression representing the dependence among possible expressions from some fixed class. This idea is exploited in many existing data mining applications. For example, one of the most straightforward and popular methods of search for simple numerical dependencies – linear regression – selects a solution out of a set of linear formulas approximating the sought dependence. Systems from another popular class of data mining algorithms – decision trees – search for classification rules represented as trees involving simple equalities and inequalities in the nodes connected by Boolean AND and OR operations.
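To make the fixed-class search concrete, here is a minimal single-variable sketch of linear regression (generic textbook least squares, not PolyAnalyst code): the method can only ever return a member of the family y = a*x + b, no matter what the data actually hold.

```python
# Ordinary least squares for y = a*x + b.
# The "search space" is just the family of linear formulas --
# the method can never return anything outside that class.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = my - a * mx          # intercept
    return a, b

# toy data lying exactly on the line y = 2x + 1
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

If the true dependence is nonlinear, this procedure still dutifully returns the best straight line – which is exactly the limitation discussed below.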
However, beyond the limits of the narrow classes of dependencies that these systems can find lies an endless sea of dependencies that cannot even be represented in the language these systems use. For example, assume you are using a decision tree system to analyze data governed by the following simple rule: “the most frequent buyers of Post cereal are homemakers whose age is smaller than the inverse square of their family income multiplied by a certain constant”. A traditional system has no means to discover such a rule. Only if one explicitly furnishes the system with the derived parameter “inverse square of the family income” can traditional systems find the stated rule. In other words, one has to guess an approximate form of the solution first – and then the machine does the rest of the job efficiently. While guessing the general form of the solution prior to automated modeling might be a challenging brain twister, it certainly does not make the life of a corporate data analyst much easier.
Yet another problem of traditional methods is most familiar in the context of Neural Networks. A trained neural network is a predicting “black box”. The user cannot look inside the trained network and figure out why and how it makes a certain prediction. There is no control over the solution found. However, real world tasks require a human analyst to be able to understand and control the discovered model. “This is a critical issue for business applications, because managers are reluctant to use a model if they don’t understand how it works,” stresses Prof. Raymond Burke, C.W.Kelley Chair of Business Administration at Indiana University.
There also exists the common problem of data overfitting. Again, the case is best illustrated with neural networks. To perform training efficiently, a network has to contain many intermediate neurons, which translates into a large number of free parameters in the system. Often there are so many free parameters that a network can fine-tune them to fit arbitrary, even randomly generated, data. Such an overfit solution possesses absolutely no predictive power and should be discarded. To avoid this problem, there exist methods built around the idea of cross-validation, in which a subset of the data is excluded from training and used only for testing the model. However, these methods are computationally very demanding.
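The cross-validation idea mentioned above can be sketched as a generic k-fold split (this is the textbook scheme, not the implementation of any particular tool): every record is held out of training exactly once and used only for testing.

```python
import random

# Generic k-fold split: partition record indices into k disjoint
# test folds; a model is trained k times, each time with one fold
# withheld from training and used only for evaluation.
def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(10, 5)
# Every record appears in exactly one test fold.
covered = sorted(i for fold in folds for i in fold)
```

The computational cost the article refers to comes from repeating the entire training procedure once per fold.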
Is there a way to overcome the stated fundamental problems? Until recently there wasn’t. Yet the latest advancements in the field of knowledge discovery made it possible to develop models of the most general nature and present them explicitly in the form of comprehensible mathematical relations. The best approach is to create a system that develops evolving models in terms of a language suitable for the representation of an arbitrary rule expressed by some algorithm.
New Technology Helps
This article describes a new technology for the automated development of models that can reproduce, in principle, any relationship present in data. Models discovered in this approach are presented in an easy-to-understand explicit symbolic form – hence the name Symbolic Knowledge Acquisition Technology (SKAT).
A system based on SKAT develops an evolving model from a set of elementary blocks sufficient to describe an arbitrarily complex algorithm hidden in data, instead of routinely searching for the best coefficients of a solution belonging to some predetermined class of functions. Each time a better model is found, the system determines the best regression parameters for that model. In the most general terms, this technology can be classified as a branch of Evolutionary Programming.
The SKAT technology is implemented as the Find Laws algorithm – one of the nine data exploration engines of the PolyAnalyst data mining tool from Megaputer Intelligence. The tool features a number of innovative algorithms for clustering, classification, dependency detection, and prediction, but we concentrate here only on the unique Find Laws algorithm. Since PolyAnalyst is the only system implementing the SKAT technology, the names SKAT and PolyAnalyst are used interchangeably in this article.
In brief, the new approach makes use of a very general underlying internal language. With this language the system consistently develops more and more complex hypotheses about the relationships hidden in data, and tests these hypotheses against the data to determine the most accurate and reliable model. You may notice that this sounds close to the way a human analyst would attack a problem. However, the tricky part is to make a machine follow this strategy independently and derive useful results automatically. In fact, there are so many potential problems and difficulties with this approach that until recently it seemed impossible to follow this path.
Under the Hood
The flowchart in Figure 1 shows how PolyAnalyst models are developed. This chart helps trace the cooperation of the basic modules that are involved in the real world implementation of the SKAT approach.
The game starts with the development of an underlying algorithmic language. In fact, such languages exist and are well known. For example, any computational algorithm can be expressed in a universal programming language such as C or PASCAL. Technically, it is beneficial to develop a special Symbolic Rule Language (SRL) for the PolyAnalyst operation. Still, any operator of SRL can be translated into a combination of the C language operators.
A far more difficult task is to create a module that develops increasingly more accurate hypotheses about the relationships in data out of the primitives – elementary constructions – of the underlying universal language. Potentially, this approach opens up a fantastic opportunity: imagine building a software system that automatically creates other programs modeling the dependence of the target variable on independent variables in a database. The model is not simply selected by calculating regression coefficients of a function from some predetermined class. Instead, the model is developed from scratch by building and testing hypotheses about the hidden relationship – much like a human analyst approaches the task. This system can scan raw data – and come up with a model describing the dependence between variables.
Let us consider whether this is indeed possible. First, note that we will only consider programs that take some parameters as their inputs and return an output value. Such programs are called functional programs and programming languages that allow only this type of programs are known as functional programming languages. A very important property of a functional programming language is that programs written in this language can be considered as consisting of simpler functional programs. This is true for all programs except a finite number of the most primitive programs expressing elementary constructions such as relational operators, Boolean operators, or, say, the calculation of Sine function. This property allows one to arrive at an arbitrarily complex program starting from a set of primitives and using a finite number of composition methods, which combine several programs into a single more complex one. Thus, we can say that a functional programming language is determined by defining three items: a set of possible data types, a set of functional primitives, and a set of allowed functional composition methods for building more complex programs. Having a sufficient set of data types, primitives, and composition methods, we can efficiently express any dependence or rule hidden in the data.
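A toy illustration of the three ingredients just listed – data types, a small set of functional primitives, and one composition method (function application) – might look as follows. This is an assumed minimal design for illustration, not the SRL of PolyAnalyst.

```python
import math

# A tiny functional language: a program is a nested tuple (op, args...),
# leaves are variable names or numeric constants. The only composition
# method is function application, yet arbitrarily complex programs can
# be built by nesting.
PRIMITIVES = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "sin": lambda a: math.sin(a),
}

def evaluate(expr, env):
    if isinstance(expr, str):           # variable leaf
        return env[expr]
    if isinstance(expr, (int, float)):  # constant leaf
        return expr
    op, *args = expr
    return PRIMITIVES[op](*(evaluate(a, env) for a in args))

# the program sin(x) * (y + 2), built purely from primitives
program = ("*", ("sin", "x"), ("+", "y", 2))
value = evaluate(program, {"x": 0.0, "y": 3.0})
```

Note that every subexpression of `program` is itself a valid functional program – exactly the decomposition property described above.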
Now we have a language that can express any knowledge representable as a mathematical algorithm. Theoretically, we could discover the knowledge hidden in data automatically if our data mining system successively built all possible syntactically correct programs and checked the validity of the rules they express.
There remains an immense problem, one that threatens to wipe out our nice construction. If we blindly implemented the described algorithm, the process of finding a suitable model would most probably take over a billion years, because the number of possible functional programs of a given complexity grows faster than exponentially with that complexity. Any more or less nontrivial model might become available only to our great-grandchildren.
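A rough count makes the explosion tangible. Assuming a toy language with 2 binary operators and 3 leaf symbols (an illustrative assumption, far smaller than any realistic SRL), the number of distinct expression trees by size can be tallied recursively:

```python
from functools import lru_cache

# count(n) = number of distinct expression trees with n binary-operator
# nodes, given OPS binary operators and LEAVES leaf symbols. Each tree
# splits its remaining nodes between a left and a right subtree.
OPS, LEAVES = 2, 3

@lru_cache(None)
def count(n):
    if n == 0:
        return LEAVES
    return OPS * sum(count(k) * count(n - 1 - k) for k in range(n))

sizes = [count(n) for n in range(6)]
# already nearly a million candidates at just five operator nodes
```

Even this crude lower bound reaches nearly a million programs at size five; a realistic primitive set makes blind enumeration hopeless.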
The described approach becomes practically useful only if we develop a reliable mechanism for navigating the space of all possible functional programs. PolyAnalyst implements several special modules whose only purpose is to provide efficient navigation through this space.
The hypothesis development process in PolyAnalyst features a healthy competition between a low-priority, completely general search in the space of all possible programs and the high-priority Generalizing Transformations defined below.
Low priority search
The low-priority complete search is guided by two PolyAnalyst supervisory modules: the Full Space Search Strategy module determines which new functional programs should be built first, while the Trivial and Equivalent Program Prevention module systematically slaughters the 99% of newly created programs that produce no new knowledge. The joint population reduction campaign of the two modules dramatically decreases the number of hypotheses built. However, this reduction alone is not sufficient: as it stands, even a simple model search would still take an unacceptably long time. We must supplement the described strategy with a very different model-building mechanism in order to bring the search under control.
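One cheap pruning device of the kind such a prevention module might employ – an assumption for illustration, not the actual PolyAnalyst mechanism – is canonicalization of commutative operators, which collapses equivalent programs such as x+y and y+x before more than one of them is ever evaluated against the data:

```python
# Canonicalize commutative operators so that equivalent programs
# map to the same key and only one representative is kept.
COMMUTATIVE = {"+", "*"}

def canonical(expr):
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonical(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)     # fixed argument order for +, *
    return (op, *args)

seen, unique = set(), []
for c in [("+", "x", "y"), ("+", "y", "x"), ("*", "x", "x")]:
    key = canonical(c)
    if key not in seen:         # duplicate under equivalence -> skip
        seen.add(key)
        unique.append(c)
```

Each such check is trivial, but applied during enumeration it removes a large fraction of the candidate population at negligible cost.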
High priority search
There is a way to circumvent the exponential growth in the number of programs that is inherent in the complete search. The trick is that instead of constantly reinventing the wheel by building all new programs from scratch, we thoroughly elaborate the best existing programs by modifying them slightly to increase the overall accuracy. Upon spotting the best current functional program created by the complete search mechanism, we start modifying this program slightly – and very quickly – in such a way that the accuracy of the prediction does not get worse.
The success of the SKAT approach is due to this new mechanism keeping the search process highly focused around the current best direction toward the sought solution. As the best direction shifts during the search process, the focus shifts correspondingly. This focusing mechanism bears the name of Generalizing Transformations (GT). It drives the high-priority search for the best model developed by PolyAnalyst. This robust search runs in the background of the low-priority complete search process.
To illustrate the GT mechanism, assume that we need to predict a future price change for some financial instrument from current and historical data on its open/close/high/low price and trade volume. Using the historical market data, the complete search has discovered that tomorrow’s price change is best described by the formula
close(tomorrow) – close(today) = A * (close(today)-close(yesterday)) * (volume(today)-volume(yesterday))
By adding to this formula a new term of the form B * F, where F is some other expression built from the available variables, we do not make our formula any worse: by assigning a zero value to the constant B we can make the new formula identical to the old one. At the same time, it often happens that by slightly changing B from zero we obtain a more accurate relation expressing an actual hidden dependence. This is a very simple example of a Generalizing Transformation.
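The zero-initialized extra term can be sketched in a few lines. The toy data and the particular quadratic extra term here are invented for illustration; the mechanism shown – start at B = 0 so accuracy cannot degrade, then refit B – is the generic idea.

```python
# One Generalizing Transformation on a toy model: extend the current
# model with an extra term B*f(x), starting at B = 0, then refit B.
def sse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data)

# toy data actually following y = 2*x + 0.5*x*x
data = [(x, 2 * x + 0.5 * x * x) for x in range(5)]

base = lambda x: 2 * x                       # current best model
extended = lambda x, B=0.0: 2 * x + B * x * x

# at B = 0 the extended model is identical to the base model
assert sse(lambda x: extended(x, 0.0), data) == sse(base, data)

# one-parameter least squares for B: B = sum(r*f) / sum(f*f)
residual = [y - base(x) for x, y in data]
f = [x * x for x, _ in data]
B = sum(r * fi for r, fi in zip(residual, f)) / sum(fi * fi for fi in f)
improved = sse(lambda x: extended(x, B), data)
```

Refitting the single new parameter is fast, which is why many GT attempts can be afforded for each program produced by the complete search.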
Thus, GTs provide an existing model (i.e. an existing functional program) with additional degrees of freedom, which give the model a better chance to express the sought dependence well. For any given functional program there exist many classes of possible GTs. A balanced combination of the high priority GTs with the low priority complete search ensures an acceptable computational time for PolyAnalyst. At the same time, we manage to retain the generality of rules and dependencies discovered by the system.
The discovered model must be checked against the data to determine its accuracy and to find the best set of parameters for that model. The Program Evaluation module is responsible for this task in PolyAnalyst. Its interchangeable implementation provides PolyAnalyst with the ability to solve tasks of different natures. For example, for the task of predicting the value of a continuous variable, the minimum sum of squares criterion is used to determine the accuracy of the developed model, while solving a classification problem requires the minimum number of misclassifications as the measure of model accuracy. Thus the interchangeable Program Evaluation module enables PolyAnalyst to tackle quite different tasks.
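The interchangeability can be sketched generically: the search engine sees only a `score(model, data)` callable, so swapping the criterion switches the same machinery between regression and classification. This is an illustrative interface, not the actual module.

```python
# Two evaluation criteria behind one interface.
def sum_of_squares(model, data):
    return sum((model(x) - y) ** 2 for x, y in data)

def misclassifications(model, data):
    return sum(1 for x, y in data if model(x) != y)

def best_model(models, data, score):
    # the search machinery is criterion-agnostic
    return min(models, key=lambda m: score(m, data))

reg_data = [(1, 2), (2, 4), (3, 6)]
models = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
chosen = best_model(models, reg_data, sum_of_squares)
```

Plugging `misclassifications` into the same `best_model` call would turn the identical search into a classifier selection.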
A special new mathematical algorithm has been developed to allow the Significance Testing module of PolyAnalyst to avoid overfitting. The implemented Target Randomization algorithm forces the system to solve, in parallel with the original problem, a number of “nonsense” problems in which the values of the target variable have been shuffled between different records. Only if the actual model provides much better accuracy than any of the “nonsense” models does it pass the statistical significance test and get accepted as the final solution. In addition to being reliable, this algorithm is not as resource-demanding as cross-validation-type algorithms. It also allows one to obtain meaningful results even on small amounts of available data.
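The Target Randomization idea can be sketched generically – fitting the same model class to shuffled-target copies of the data and comparing errors. The linear model class, trial count, and seed below are illustrative assumptions, not PolyAnalyst parameters.

```python
import random

# Fit error of a straight line -- the stand-in "model class" here.
def linear_fit_error(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

def passes_significance(xs, ys, trials=20, seed=1):
    real = linear_fit_error(xs, ys)
    rng = random.Random(seed)
    for _ in range(trials):
        shuffled = ys[:]
        rng.shuffle(shuffled)        # a "nonsense" problem
        if linear_fit_error(xs, shuffled) <= real:
            return False             # nonsense did as well: reject
    return True
```

Because each nonsense problem reuses the same fitting routine, the cost is a small constant factor over a single fit, rather than the repeated retraining of cross-validation.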
Finally, on the output the system presents the discovered relationships explicitly in the form of mathematical formulae and logical constructions. “Unlike neural network programs, PolyAnalyst displays a symbolic representation of the relationship between the independent and dependent variables. The software provides a unique and powerful set of tools for data mining applications,” concludes Prof. Burke. The explicit knowledge presentation is made possible by the new functional model development mechanism, which constitutes the foundation of the SKAT approach.
In summary, PolyAnalyst implements the mechanism for finding explicit relationships in data as a process of growth and competition between several genetic lines of functional programs. At any given moment the system focuses on several programs that solve the problem better than all other built programs. Applying GTs to these best solutions, we obtain their “successors”. The best successors are selected for the next generation of programs, and so on. This approach simulates the process of evolution occurring in nature; in PolyAnalyst this evolution and natural selection is realized in the world of computer programs.
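The growth-and-competition loop just summarized can be outlined schematically. The toy task, population size, and mutation operator below are all illustrative; only the shape of the loop – mutate the best, let parents and successors compete, keep the fittest – reflects the description above.

```python
import random

# Schematic evolution loop: keep the few best candidate models,
# derive successors by small modifications (the GT role), and select
# the best of each generation. Parents compete with successors, so
# the best score never gets worse.
def evolve(population, score, mutate, generations=30, keep=3, seed=0):
    rng = random.Random(seed)
    for _ in range(generations):
        successors = [mutate(p, rng) for p in population for _ in range(4)]
        population = sorted(population + successors, key=score)[:keep]
    return population[0]

# toy task: evolve a constant c toward the minimum of (c - 7)**2
score = lambda c: (c - 7) ** 2
mutate = lambda c, rng: c + rng.uniform(-1.0, 1.0)
best = evolve([0.0, 10.0, -5.0], score, mutate)
```

In PolyAnalyst the individuals are functional programs rather than constants, but the selection dynamics are of this kind.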
The innovative Symbolic Knowledge Acquisition Technology has been exploited successfully over the past few years by users of the PolyAnalyst data mining tool in different application fields – and numerous geographic locations. “PolyAnalyst provides quick and easy access for inexperienced users to powerful modeling tools. The user interface is intuitive and new users come up to speed very quickly,” notes James Farkas, a Senior Navigation Engineer for the Boeing Company. “The results returned back by PolyAnalyst are really excellent,” adds Bernd Classen, MD from the Ulm University Clinic for Anesthesiology, Germany.
The new technology efficiently solves several well-known and long-standing problems of automated knowledge discovery. First, a system based on the SKAT technology searches for the best model explaining the data in the most general space of possible algorithms. Second, it conveys the discovered knowledge to the user in an easy-to-understand explicit form. Finally, it provides a new computationally efficient solution to the problem of data overfitting.
PCAI Magazine, January issue, page 48