Oracle Data Mining

by admin on December 9th, 2009

Supervised Data Mining

Supervised learning is also known as directed learning. The learning process is directed by a previously known dependent attribute or target. Directed data mining attempts to explain the behavior of the target as a function of a set of independent attributes or predictors.

Supervised learning generally results in predictive models. This is in contrast to unsupervised learning, where the goal is pattern detection.

The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model “learns” the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.

Supervised Learning: Testing

Separate data sets are required for building (training) and testing some predictive models. The build data (training data) and test data must have the same column structure. Typically, one large table or view is split into two data sets: one for building the model, and the other for testing the model.
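
As a rough illustration of the split described above, the following Python sketch shuffles one list of cases and divides it into build and test sets that share the same column structure. The function name `split_build_test`, the 30% test fraction, and the sample rows are assumptions made for this example; they are not part of Oracle Data Mining.

```python
import random


def split_build_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (build, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical cases: (customer_id, age, responded)
cases = [(i, 20 + i, i % 2) for i in range(10)]
build, test = split_build_test(cases)
```

Every case lands in exactly one of the two sets, and both sets keep the same three columns.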

The process of applying the model to test data helps to determine whether the model, built on one chosen sample, is generalizable to other data. In particular, it helps to detect overfitting, which can occur when the model fits the build data so closely that it has little predictive power on other data.
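
To make overfitting concrete, here is a deliberately extreme sketch: a "model" that simply memorizes every build case. It scores perfectly on the build data and poorly on test data it has never seen. The memorizing model and the tiny data sets are invented for illustration only.

```python
from collections import Counter


def train_memorizer(build):
    """Return a predict function that memorizes every build case exactly."""
    lookup = {x: y for x, y in build}
    # Unseen inputs fall back to the most common target in the build data.
    default = Counter(y for _, y in build).most_common(1)[0][0]
    return lambda x: lookup.get(x, default)


def accuracy(predict, cases):
    """Fraction of cases whose predicted target matches the known target."""
    return sum(predict(x) == y for x, y in cases) / len(cases)


# Hypothetical (predictor, target) pairs; the true pattern is target = 1 when x >= 5.
build_cases = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1)]
test_cases = [(0, 0), (5, 1), (9, 1), (10, 1)]

memorizer = train_memorizer(build_cases)
build_accuracy = accuracy(memorizer, build_cases)
test_accuracy = accuracy(memorizer, test_cases)
```

The gap between `build_accuracy` and `test_accuracy` is the signal that a separate test set is meant to expose.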

Supervised Learning: Scoring

Apply data, also called scoring data, is the actual population to which a model is applied. For example, you might build a model that identifies the characteristics of customers who frequently buy a certain product. To obtain a list of customers who shop at a certain store and are likely to buy a related product, you might apply the model to the customer data for that store. In this case, the store customer data is the scoring data.

Most supervised models can be applied to a population of interest. Scoring is the purpose of classification and regression, the principal supervised mining techniques.

Oracle Data Mining does not support the scoring operation for attribute importance, another supervised function. Models of this type are built on a population of interest to obtain information about that population; they cannot be applied to separate data. An attribute importance model returns and ranks the attributes that are most important in predicting a target value.
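
The idea of ranking attributes by how well they predict a target can be sketched in a few lines of Python. Note the assumption: this example ranks by information gain, a simpler stand-in chosen for illustration, whereas Oracle Data Mining uses the Minimum Description Length algorithm for attribute importance. The attribute names and rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of target values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy after grouping rows by one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def rank_attributes(rows, attributes, target):
    """Return (attribute, score) pairs, most important first."""
    scores = {a: information_gain(rows, a, target) for a in attributes}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical customer rows; "responded" is the target.
customers = [
    {"owns_home": "yes", "region": "north", "responded": "yes"},
    {"owns_home": "yes", "region": "south", "responded": "yes"},
    {"owns_home": "no", "region": "north", "responded": "no"},
    {"owns_home": "no", "region": "south", "responded": "no"},
]
ranking = rank_attributes(customers, ["region", "owns_home"], "responded")
```

Like an attribute importance model, the result describes the population it was computed on; there is nothing to apply to separate data.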

See Also:
Table 2-1, “Oracle Data Mining Supervised Functions” for more information

Unsupervised Data Mining

Unsupervised learning is non-directed. There is no distinction between dependent and independent attributes. There is no previously-known result to guide the algorithm in building the model.

Unsupervised learning can be used for descriptive purposes. It can also be used to make predictions.

Unsupervised Learning: Scoring

Although unsupervised data mining does not specify a target, most unsupervised models can be applied to a population of interest. For example, clustering models use descriptive data mining techniques, but they can be applied to classify cases according to their cluster assignments. Anomaly detection, although unsupervised, is commonly used to predict whether a data point is typical among a set of cases.
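
Scoring with a clustering model can be pictured as assigning each new case to its closest cluster. The sketch below assumes a distance-based model whose clusters are summarized by centroids; the centroid values, cluster names, and the use of unscaled Euclidean distance are all assumptions made for the example.

```python
import math


def assign_cluster(case, centroids):
    """Return the id of the centroid closest to case (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(case, centroids[cid]))


# Hypothetical centroids from an already-built model: (age, yearly_spend)
centroids = {
    "young_low_spend": (25.0, 200.0),
    "older_high_spend": (55.0, 900.0),
}

# Apply (scoring) data: new cases that were not used to build the model.
apply_data = [(30.0, 250.0), (60.0, 1000.0)]
assignments = [assign_cluster(case, centroids) for case in apply_data]
```

In practice the attributes would normally be put on a comparable scale first, so that one wide-ranging column does not dominate the distance.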

Oracle Data Mining supports the scoring operation for clustering and feature extraction, both unsupervised mining functions. Oracle Data Mining does not support the scoring operation for association rules, another unsupervised function. Association models are built on a population of interest to obtain information about that population; they cannot be applied to separate data. An association model returns rules that explain how items or events are associated with each other. The association rules are returned with statistics that can be used to rank them according to their probability.
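
Two statistics commonly attached to association rules are support and confidence, which the Apriori entry in Table 2-4 names as its thresholds. The sketch below computes both for a single rule over a handful of made-up baskets. It shows only the statistics, not the Apriori search itself, and the function names are chosen for this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    rule_items = set(antecedent) | set(consequent)
    return support(transactions, rule_items) / antecedent_support


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})          # rule: bread -> milk
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Here bread and milk appear together in half of the baskets, and two of the three baskets containing bread also contain milk.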

See Also:
Table 2-2, “Oracle Data Mining Unsupervised Functions”

Oracle Data Mining Functions

Oracle Data Mining supports the supervised data mining functions described in Table 2-1.

Table 2-1 Oracle Data Mining Supervised Functions

Function: Attribute Importance
Description: Identifies the attributes that are most important in predicting a target attribute
Sample Problem: Given customer response to an affinity card program, find the most significant predictors

Function: Classification
Description: Assigns items to discrete classes and predicts the class to which an item belongs
Sample Problem: Given demographic data about a set of customers, predict customer response to an affinity card program

Function: Regression
Description: Approximates and forecasts continuous values
Sample Problem: Given demographic and purchasing data about a set of customers, predict customers’ age

Oracle Data Mining supports the unsupervised functions described in Table 2-2.

Table 2-2 Oracle Data Mining Unsupervised Functions

Function: Anomaly Detection (implemented through one-class classification)
Description: Identifies items (outliers) that do not satisfy the characteristics of “normal” data
Sample Problem: Given demographic data about a set of customers, identify customer purchasing behavior that is significantly different from the norm

Function: Association Rules
Description: Finds items that tend to co-occur in the data and specifies the rules that govern their co-occurrence
Sample Problem: Find the items that tend to be purchased together and specify their relationship

Function: Clustering
Description: Finds natural groupings in the data
Sample Problem: Segment demographic data into clusters and rank the probability that an individual will belong to a given cluster

Function: Feature Extraction
Description: Creates new attributes (features) using linear combinations of the original attributes
Sample Problem: Given demographic data about a set of customers, group the attributes into general characteristics of the customers

Data Mining Algorithms

An algorithm is a mathematical procedure for solving a specific kind of problem. Oracle Data Mining supports at least one algorithm for each data mining function. For some functions, you can choose among several algorithms. For example, Oracle Data Mining supports four classification algorithms.

Each data mining model is produced by a specific algorithm. Some data mining problems can best be solved by using more than one algorithm. This necessitates the development of more than one model. For example, you might first use a feature extraction model to create an optimized set of predictors, then a classification model to make a prediction on the results.

Note:
You can be successful at data mining without understanding the inner workings of each algorithm. However, it is important to understand the general characteristics of the algorithms and their suitability for different kinds of applications.

See Also:
Part III, “Algorithms” for more details about the algorithms supported by Oracle Data Mining

Oracle Data Mining Supervised Algorithms

Oracle Data Mining supports the supervised data mining algorithms described in Table 2-3. The algorithm abbreviations are used throughout this manual.

Table 2-3 Oracle Data Mining Algorithms for Supervised Functions

Algorithm: Decision Tree (DT)
Function: Classification
Description: Decision trees extract predictive information in the form of human-understandable rules. The rules are if-then-else expressions; they explain the decisions that lead to the prediction.

Algorithm: Generalized Linear Models (GLM)
Function: Classification and Regression
Description: GLM implements logistic regression for classification of binary targets and linear regression for continuous targets. GLM classification supports confidence bounds for prediction probabilities. GLM regression supports confidence bounds for predictions.

Algorithm: Minimum Description Length (MDL)
Function: Attribute Importance
Description: MDL is an information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of the data.

Algorithm: Naive Bayes (NB)
Function: Classification
Description: Naive Bayes makes predictions using Bayes’ Theorem, which derives the probability of a prediction from the underlying evidence, as observed in the data.

Algorithm: Support Vector Machine (SVM)
Function: Classification and Regression
Description: Distinct versions of SVM use different kernel functions to handle different types of data sets. Linear and Gaussian (nonlinear) kernels are supported. SVM classification attempts to separate the target classes with the widest possible margin. SVM regression tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it.
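
The "epsilon-wide tube" used by SVM regression can be illustrated without fitting anything. The sketch below takes a candidate function as given and counts how many points fall inside the tube, assuming epsilon is the half-width of the tube as in the standard epsilon-insensitive loss. The candidate function, the data points, and the value of epsilon are invented for the example; finding the function is the part an SVM solver performs.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube; grows linearly with distance outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def count_inside_tube(points, f, epsilon):
    """Number of (x, y) points whose residual is at most epsilon."""
    return sum(abs(y - f(x)) <= epsilon for x, y in points)


def candidate(x):
    """Hypothetical candidate regression function."""
    return 2.0 * x


# Hypothetical (x, y) observations.
points = [(0, 0.1), (1, 2.0), (2, 4.6), (3, 5.9)]
epsilon = 0.5

inside = count_inside_tube(points, candidate, epsilon)
losses = [epsilon_insensitive_loss(y, candidate(x), epsilon) for x, y in points]
```

Only the point at x = 2 lies outside the tube, so it is the only one that contributes a non-zero loss.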

Oracle Data Mining Unsupervised Algorithms

Oracle Data Mining supports the unsupervised data mining algorithms described in Table 2-4. The algorithm abbreviations are used throughout this manual.

Table 2-4 Oracle Data Mining Algorithms for Unsupervised Functions

Algorithm: Apriori (AP)
Function: Association
Description: Apriori performs market basket analysis by discovering co-occurring items (frequent itemsets) within a set. Apriori finds rules with support greater than a specified minimum support and confidence greater than a specified minimum confidence.

Algorithm: k-Means (KM)
Function: Clustering
Description: k-Means is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters. Each cluster has a centroid (center of gravity). Cases (individuals within the population) that are in a cluster are close to the centroid. Oracle Data Mining supports an enhanced version of k-Means. It goes beyond the classical implementation by defining a hierarchical parent-child relationship of clusters.

Algorithm: Non-Negative Matrix Factorization (NMF)
Function: Feature Extraction
Description: NMF generates new attributes using linear combinations of the original attributes. The coefficients of the linear combinations are non-negative. During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model.

Algorithm: One-Class Support Vector Machine (One-Class SVM)
Function: Anomaly Detection
Description: One-class SVM builds a profile of one class and, when applied, flags cases that are somehow different from that profile. This allows for the detection of rare cases that are not necessarily related to each other.

Algorithm: Orthogonal Partitioning Clustering (O-Cluster or OC)
Function: Clustering
Description: O-Cluster creates a hierarchical, grid-based clustering model. The algorithm creates clusters that define dense areas in the attribute space. A sensitivity parameter defines the baseline density level.
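
For readers who want to see the classical form of k-Means, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) with caller-supplied starting centroids. It is not Oracle's enhanced hierarchical version, and the function name, starting centroids, and sample points are assumptions made for the example.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Classical k-Means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical two-attribute cases forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

The number of clusters is fixed in advance by the number of starting centroids, which matches the "predetermined number of clusters" in the description above.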
