We live in the Age of Information. The importance of collecting data that reflect your business or scientific activities to achieve competitive advantage is widely recognised now. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range companies. However, the bottleneck of turning this data into your success is the difficulty of extracting knowledge about the system you study from the collected data.
- What goods should be promoted to this customer?
- What is the probability that a certain customer will respond to a planned promotion?
- Can one predict the most profitable securities to buy/sell during the next trading session?
- Will this customer default on a loan or pay back on schedule?
- What medical diagnosis should be assigned to this patient?
- How large the peak loads of a telephone or energy network are going to be?
- Why the facility suddenly starts to produce defective goods?
These are all the questions that can probably be answered if information hidden among megabytes of data in your database can be found explicitly and utilized. Modeling the investigated system, discovering relations that connect variables in a database are the subject of data mining.
Modern computer data mining systems self learn from the previous history of the investigated system, formulating and testing hypotheses about the rules which this system obeys. When concise and valuable knowledge about the system of interest had been discovered, it can and should be incorporated into some decision support system which helps the manager to make wise and informed business decisions.
Reasons for the growing popularity of Data Mining
Growing Data Volume
The main reason for necessity of automated computer systems for intelligent data analysis is the enormous volume of existing and newly appearing data that require processing. The amount of data accumulated each day by various business, scientific, and governmental organizations around the world is daunting. According to information from GTE research center, only scientific organizations store each day about 1 TB (terabyte!) of new information. And it is well known that academic world is by far not the leading supplier of new data. It becomes impossible for human analysts to cope with such overwhelming amounts of data.
Limitations of Human Analysis
Two other problems that surface when human analysts process data are the inadequacy of the human brain when searching for complex multifactorial dependencies in data, and the lack of objectiveness in such an analysis. A human expert is always a hostage of the previous experience of investigating other systems. Sometimes this helps, sometimes this hurts, but it is almost impossible to get rid of this fact.
Low Cost of Machine Learning
One additional benefit of using automated data mining systems is that this process has a much lower cost than hiring an army of highly trained (and payed) professional statisticians. While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.
Tasks Solved by Data Mining
A task of learning a pattern from examples and using the developed model to predict future values of the target variable.
- PolyAnalyst PolyNet Predictor, Find Laws, Memory Based Reasoning, and LR engines
A task of finding a function that maps records into one of several discrete classes.
- PolyAnalyst Memory Based Reasoning, Classify and Discriminate engines
Detection of relations
A task of searching for the most influential independent variables for a selected target variable.
- PolyAnalyst Find Dependencies engine
A task of finding explicit formulae describing dependencies between various variables.
- PolyAnalyst Find Laws engine
A task of identifying groups of records that are similar between themselves but different from the rest of the data. Often, the variables providing the best clustering should be identified as well.
- PolyAnalyst Cluster engine
Market Basket Analysis
Processing transactional data in order to find those groups of products that are sold together well. One also searches for directed association rules identifying the best product to be offered with a current selection of purchased products.
- PolyAnalyst Market Basket Analysis engine
A task of determining the most significant changes in some key measures of data from previous or expected values.
Different DM Technologies and Systems
It would be very instructive to discuss various existing approaches to data mining while stressing out the following three vital criteria:
- Control of the significance of obtained results
- Transparency of developed empirical models and their interpretability
- Degree of search process automatisation and ease-of-use
To build a bridge from more traditional methods for data analysis to data mining methods we start by discussing some more traditional approaches:
- Subject-oriented analytical systems
- Statistical packages
And then proceed to consider Data Mining methods per se:
- Neural Networks
- Evolutionary Programming
- Memory Based Reasoning
- Decision Trees
- Genetic Algorithms
- Nonlinear Regression Methods
Subject-oriented analytical systems
All such systems are, of course, application domain specific. As one of the most developed systems of that kind we consider systems for analysis of financial markets based on the method of technical analysis. Technical analysis represents a bundle of a few dozen different techniques for forecasting of prices dynamics and selecting the optimal structure of investment portfolio, based on various empirical models of the market behavior. These methods range from very simple, for example – trend subtraction, to those having elaborate mathematical bases, such as fractal theory or spectral analysis.
The underlying empirical model is implanted into such systems by hand from the outset rather than be derived by automatic learning from the previous history. Thus our first two consideration criteria: statistical significance of the derived models and their interpretability, cannot be applied to such systems.
However, such systems usually provide another advantage to the user. They operate in specific terms of the application field. These terms are clear to traders and financial analysts. Often such systems have special interfaces for loading financial data. There exists a great number of subject-oriented systems based on technical analysis.
- MetaStock (Equis International)
- SuperCharts (Omega Research)
- Candlestick Forecaster (IPTC)
- Wall Street Money (Market Arts)
While in majority of the latest versions of the well known statistical packages traditional statistical methods are supplemented by some elements of data mining, their main data analysis methods remain to be of the classical nature: correlation, regression, and factor analyses and other techniques of that kind. Such systems cannot determine the form of dependencies hidden in data and require that the user provides his/her own hypotheses that will be tested by the system.
One of the major drawbacks of such systems is that they do not allow the data analysis to be performed by the user who does not have a thorough training in statistics. An “of-the-shelf” user has to go through a special several month course to be able to use these systems more or less intelligently. Another disadvantage of statistical packages is that during the data exploration the user has to perform again and again a set of some elementary operations. Tools for automatisation of the process either non-present do not exist, or require programming in some internal language.
These features, in our opinion, make statistical packages too clumsy and inefficient for solving complex real-world problems.
- SAS (SAS Institute)
- SPSS (SPSS)
- Statgraphics (Statistical Graphics)
- Statistica (Statsoft)
This is a large class of diverse systems whose architecture to some extent imitates structure of live neural tissue built from separate neurons. One of the most widespread architectures, multilayered perceptron with backpropagation of errors, emulates the work of neurons incorporated in a hierarchical network, where the input of each neuron of the next layer (narrower) is connected with the outputs of all neurons of the previous (wider) layer. Analyzed data are treated as neuron excitation parameters and are fed to inputs of the first layer. These excitations of a lower layer neurons are propagated to the next layer neurons, being amplified or weakened according to weights (numerical coefficients) ascribed to corresponding intraneural connections. As a final result of this process, the single neuron, comprising the topmost neuron layer, acquires some value (excitation strength) which is considered to be a prediction – the reaction of the whole network to the processed data.
In order to make meaningful predictions a neural network first has to be trained on data describing previous situations for which both, input parameters and correct reactions to them are known. Training consists of selecting weights ascribed to intraneural connections that provide the maximal closeness of reactions produced by the network to the known correct reactions.
This approach proved to be effective in problems of image recognition. However, experience shows that it is not suited well for, say, financial or serious medical applications. There are several reasons for this difficulty. First, actual neural networks that are built for analyzing a complicated system, such as financial markets, have to be highly intricate systems themselves. They include dozens of neurons with a couple hundred connections between them. As a result, the number of degrees of freedom of the created forecasting model (these are weights of all connections between the network neurons) often becomes larger than the number of examples (separate data records) that had been used to train the network. This deprives the forecast of any sense and value: the network easily can be “trained” to explain an array of some randomly generated data. Our experience in applying neural networks to forecasting securities prices reveals that they could ideally explain all fluctuations in the data used to train the network but failed when applied to forecasting future prices.
The second drawback is a non-transparency of forecasting models represented by a trained neural network. This difficulty is also closely connected with the intricacy of the neural network structure: knowledge reflected in terms of weights of a couple hundred intraneural connections cannot be analyzed and interpreted by a human.
It should be noted that despite of these difficulties neural networks are actively used (with varying success) in different financial applications in the majority of developed countries.
- PolyAnalyst (Megaputer Intelligence)
- BrainMaker (CSS)
- NeuroShell (Ward Systems Group)
- OWL (Hyperlogic)
At present this is the youngest and evidently the most promising branch of data mining. The underlying idea of the method is that the system automatically formulates hypotheses about the dependence of the target variable on other variables in the form of programs expressed in an internal programming language. By utilizing a universal programming language the approach ensures that any dependence or algorithm can be expressed in this language in principle.
The process of production of internal programs (hypotheses) is organized as evolution in the world of all possible programs (this is somewhat reminiscent of the genetic algorithms method). When the system finds a hypothesis describing the observed dependence reasonably well, it starts to introduce various slight modifications to this program and selects the best daughter programs achieved through this process, those that improve the accuracy of the prediction the most. In this way the system develops a number of genetic lines of programs that compete with each other in the accuracy of expressing the searched dependence.
When the best program (hypothesis) with a desired accuracy is obtained, a special module of the system translates the discovered dependence from the internal language to an explicit form easily understood by the human: mathematical formulae, predicting tables, etc. This provides the user with a very important insight and control of the obtained dependence, and allows the graphical visualization of the results as well. Control of statistical significance of the obtained results is performed through a number of effective modern statistical methods, for example, a method of data randomization testing.
The only commercially available system based on this method, PolyAnalyst, shows the best results in various applications from finance and medicine.
Memory Based Reasoning (MBR)
The main idea underlying this method is very simple. To forecast a future situation, or to make a correct decision, such systems find the closest past analogs of the present situation and choose the same solution which was the right one in those past situations. That is why this method is also called the nearest neighbor method. CBR systems demonstrate rather good results in vastly diverse problems. On the other hand, their largest drawback is that these systems do not create any models or rules summarizing the previous experience at all. Their predictions are based on processing the whole body of available historic data, and thus it is impossible to tell which specific factors influenced the CBR system prediction.
- PolyAnalyst (Megaputer Intelligence)
- KATE tools (Acknosoft)
- Pattern Recognition Workbench (Unica)
This method can be applied for solution of classification tasks only. This limits applicability of the decision trees method in many fields. For example, in financial applications the most common problem is the task of predicting values of some numerical variable.
As a result of applying this method to a training set, a hierarchical structure of classifying rules of the type “IF…THEN…” is created. This structure has a form of a tree (similar to the species detector from botanics or zoology). In order to decide to which class an object or a situation should be assigned one has to answer questions located at the tree nodes, starting from the root. These questions are of the form “Is the value of variable A greater than x ?”. If the answer is yes, one follows the right branch of the tree to a node of the next level, if the answer is no – the left branch. Then a question in this node should be answered, and so on. Following this procedure one eventually comes to one of the final nodes (called leaves), where he/she finds a conclusion to which class the considered object should be assigned.
An advantage of this method is that this form of representation of rules is intuitive and easily understood by the human. However, the issue of determination of significance of a found rule becomes a very serious problem for the decision trees approach. The problem originates from the fact that there is a smaller number of records left at each next level of the classification tree that is being built. The tree is splitting data into a large number of small sets of specific cases. The larger is the number of different cases, the smaller is each next separate set of training examples, and there is less confidence that correct further classification can be performed. If the built tree is rather “bushy”: if it contains a large number of small branches, then such a tree does not provide a meaningful and statistically justified solution. Application to complex real world problems shows that in majority of systems utilizing the decision trees method this problem does not find a satisfactory solution so far.
We must remark that there exists many systems implementing the decision trees method.
- PolyAnalyst (Megaputer Intelligence)
- C5.0 (Rule Quest)
- Clementine (Integral Solutions)
- SIPINA (University of Lyon)
- IDIS (Information Discovery)
Strictly speaking, data analysis is not the main field of application of genetic algorithms. They should be viewed rather as a powerful technique for solution of various combinatorial or optimization problems. Nevertheless, genetic algorithms are among standard modern instruments for data mining, and thus they are included in the present overview.
The name of the method derives from its similarity to the process of natural selection. Let the problem be to find a solution to a problem that would be the most optimal from the point of view of a certain criterion. And let each solution be exhaustively described by some set of numerical or non-numerical parameters. For example, if the task is to select a fixed number of market parameters influencing the market performance the most, then the names of these parameters comprise such a descriptive set. One can think of this set as of a set of chromosomes determining qualities of an “organism” – a solution of the problem. Following this analogy, values of parameters determining a solution correspond to genes. A search for the optimal solution is similar then to the process of evolution of a population of organisms, where each organism is represented by a set of its chromosomes.
This evolution is driven by three mechanisms: first, selection of the strongest – those sets of chromosomes that characterize the most optimal solutions; second, cross-breeding – production of new organisms by mixing sets of chromosomes of parent sets of chromosomes; and third, mutations – accidental changes of genes in some organisms of the population. After a number of new generations built with the help of the described mechanisms one obtains a solution that cannot be improved any further. This solution is taken as a final one.
Genetic algorithms have two weak points. First, the very way of formulating the problem deprives one of any opportunity to estimate statistical significance of the obtained solution. Second, only a specialist can develop a criterion for the chromosome selection and formulate the problem effectively. Thus genetic algorithms should be considered at present more as an instrument for scientific research rather than as a tool for generic practical data analysis, for instance, in finance.
- PolyAnalyst (Megaputer Intelligence)
- GeneHunter (Ward Systems Group)
Nonlinear Regression Methods
These methods are based on searching for a dependence of the target variable on other variables in the form of function of some predetermined form. For example, in one of the most successful implementation of algorithms of this type, group attribute accounting method, a dependence is sought in the form of polynomials. Such methods must provide solutions with a larger statistical significance than neural networks do. An obtained formula, a polynomial, is more suitable for analysis and interpreting in principle (in reality it is usually still too complex for that). Thus this method has better chances of providing reliable solutions in such involved applications as financial markets or medical diagnostics.
- PolyAnalyst (Megaputer Intelligence)
- NeuroShell (Ward Systems Group)