com.rapidminer.operator.preprocessing.outlier
Class SearchSpace

java.lang.Object
  extended by com.rapidminer.operator.preprocessing.outlier.SearchSpace

public class SearchSpace
extends java.lang.Object

SearchSpace is a class for building a room full of SearchObjects (see class definition) and provides various methods to place those objects into the SearchSpace (by associating those Objects to the list of objects in the SearchSpace) as well as to do some Outlier Tests on those Objects.

Author:
Stephan Deutsch, Ingo Mierswa

Constructor Summary
SearchSpace()
          This constructor creates a SearchSpace with (integer) 2 dimensions as a default and initializes all fields in the instance of that Class with zero values where appropriate.
SearchSpace(int dim)
          This constructor creates a SearchSpace with (integer) dim dimensions and initializes all fields in the instance of that Class with zero values where appropriate.
SearchSpace(int dim, int minptslb, int minptsub)
          This constructor creates a SearchSpace with (integer) dim dimensions and initializes all fields in the instance of that Class with zero values where appropriate.
 
Method Summary
 void addObject(SearchObject objectToAdd)
          This adds a SearchObject to the SearchSpace.
 void allRadiusSearch(double d, double p, int kindOfDistance)
          This method invokes the class method radiusODSearch on all objects in the SearchSpace (associated to this Searchroom via the listOfObjects vektor). radiusODSearch does a brute force distance Outlier test based on the parameters d and p for DB(p,d)-Outliers acc. to Knorr and Ng's approach to unify statistical Outlier tests.
 void computeDKN(int dk, int n)
          This function computes the D^k_n Outliers according to Ramaswamy, Rastogi and Shim which computes the top-n D^k-Outliers, the outliers (= objects) with the maximum distance to the k-th nearest neighbors.
 void computeLOF(int kMin, int kMax)
          Some deeper magic to compute all the LOFs for the objects in the searchroom up to MinPtsUB = kMax!
 void findAllKdContainers(int kindOfDistance)
          Finds and fills all K distance containers for all objects in the Search Room by invoking the process of finding all k distance containers for one Search Object.
 void findKdistanceContainers(SearchObject so, int kindOfDistance)
          This method processes a sequential search over the SearchSpace for a SearchObject so (named p here to be in line with the literature).
 double[] getAverageDistanceMeasures(int kindOfDistance)
          Returns the average distances measures for the objects in the SearchSpace, calculating: mean distance standard deviation variance The calculation is time consuming and should only be invoked if the data set is parsed for the first time (to get a feeling on it for statistical choices of parameters p and d for e.g.
 double[] getAverageLOFMeasures()
          Returns the average LOF measures for the objects in the SearchSpace, calculating: mean LOF standard deviation variance
 int getDimensions()
          Returns the number of dimensions of the SearchSpace.
 double getMaximumOutlierFactor()
          This method returns the maximum Outlier Factor of all SearchObjects in the SearchSpace.
 int getNumberOfObjects()
          Returns the (integer) number of objects in the Searchroom (associated with it via addObject(SearchObject) to the room) as an integer value as we overall do not expect the searchroom to hold more than 2 billion objects.
 SearchObject getObject(int index)
          This method returns a SearchObject with the i-th index in the listOfObjects; the result has to be casted to SearchObject (Vector Class speciality, as it returns only a JAVA Object Class object).
 java.util.Enumeration getObjects()
          This method returns an Enumeration of all SearchObjects from a SearchSpace.
 boolean getSearchObjectOutlierStatus(int i)
          This method returns the outlierstatus of the Searchobject (element at index i) in the SearchSpace from the Searchroom's listOfObjects.
 java.util.Vector<SearchObject> getSearchObjects()
          Delivers the list of objects.
 void radiusODSearch(double d, double p, SearchObject rObject, int kindOfDistance)
          BruteForce Radius Search to determine the outlier status of an object rObject of the type SearchObject this method takes d and p as parameters acc. to distance based DB(p,D)-Outlier (Knorr, Ng) and identifies an object as being an outlier, if more than a proportion p of the objects is more than distance D from rObject away.
 void resetOutlierStatus()
          This method resets the Outlier Status for all Objects in the Search room to have a clean start or to have a new identification of outliers with a separate method.
 void setDimensions(int dim)
          Sets the number of dimensions for the SearchSpace to dim.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SearchSpace

public SearchSpace(int dim)
This constructor creates a SearchSpace with (integer) dim dimensions and initializes all fields in the instance of that Class with zero values where appropriate.


SearchSpace

public SearchSpace()
This constructor creates a SearchSpace with (integer) 2 dimensions as a default and initializes all fields in the instance of that Class with zero values where appropriate.


SearchSpace

public SearchSpace(int dim,
                   int minptslb,
                   int minptsub)
This constructor creates a SearchSpace with (integer) dim dimensions and initializes all fields in the instance of that Class with zero values where appropriate.

Method Detail

getNumberOfObjects

public int getNumberOfObjects()
Returns the (integer) number of objects in the Searchroom (associated with it via addObject(SearchObject) to the room) as an integer value as we overall do not expect the searchroom to hold more than 2 billion objects.


setDimensions

public void setDimensions(int dim)

Sets the number of dimensions for the SearchSpace to dim.

Attention: This is a value that the SearchSpace keeps for the purpose of consistency checks for all SearchObjects (as each SearchObject has its own number of dimensions and not all the dimensions of the SearchObjects need to be the same - to give implementation freedom).

Parameters:
dim -

getDimensions

public int getDimensions()
Returns the number of dimensions of the SearchSpace.


getSearchObjects

public java.util.Vector<SearchObject> getSearchObjects()
Delivers the list of objects.


getSearchObjectOutlierStatus

public boolean getSearchObjectOutlierStatus(int i)
This method returns the outlierstatus of the Searchobject (element at index i) in the SearchSpace from the Searchroom's listOfObjects.

Parameters:
i -
Returns:
the boolean outlier status

addObject

public void addObject(SearchObject objectToAdd)
This adds a SearchObject to the SearchSpace.

It prints a warning to STDOUT in case the dimensions of the SearchObject and SearchSpace are incompatible, but as the SearchSpace can perform some operations over SearchObjects with different dimensions, this is not a showstopper.

The method also automatically updates the min/max/range information the SearchSpace knows for itself.

Parameters:
objectToAdd -

getObject

public SearchObject getObject(int index)
This method returns a SearchObject with the i-th index in the listOfObjects; the result has to be casted to SearchObject (Vector Class speciality, as it returns only a JAVA Object Class object). This is better than to access the listOfObjects directly, but sadly I do not use it consistently. Maybe in the cleaning-up, this will be changed.

Parameters:
index -

getObjects

public java.util.Enumeration getObjects()
This method returns an Enumeration of all SearchObjects from a SearchSpace.


resetOutlierStatus

public void resetOutlierStatus()
This method resets the Outlier Status for all Objects in the Search room to have a clean start or to have a new identification of outliers with a separate method. As this zeros all boolean outlier statuses of all objects associated to this Searchroom and also zeros all outlier smooth factors, a current status list should be drawn down and stored somewhere before using this method. ATTN: As this only uses references to Objects associated to a Searchroom, in case more than one Searchroom uses a (fraction) range of objects, this might override the results from other detections for those objects. But it is encouraged to associate objects to only one SearchSpace and use duplications of objects with similar vektors in other SearchRooms.


radiusODSearch

public void radiusODSearch(double d,
                           double p,
                           SearchObject rObject,
                           int kindOfDistance)
BruteForce Radius Search to determine the outlier status of an object rObject of the type SearchObject this method takes d and p as parameters acc. to distance based DB(p,D)-Outlier (Knorr, Ng) and identifies an object as being an outlier, if more than a proportion p of the objects is more than distance D from rObject away. The simplest approach is to make a radius search for rObject and compare its distance to all other objects step by step with D (in this case d). If more than M = N(1-p) objects are within d, than rObject is not an Outlier, else it is. Although this is an approach with O(N^2) for all objects (it is O(N) for rObject), this prunes the search as soon as more than M objects are within d from rObject to get some improvement.


allRadiusSearch

public void allRadiusSearch(double d,
                            double p,
                            int kindOfDistance)
This method invokes the class method radiusODSearch on all objects in the SearchSpace (associated to this Searchroom via the listOfObjects vektor). radiusODSearch does a brute force distance Outlier test based on the parameters d and p for DB(p,d)-Outliers acc. to Knorr and Ng's approach to unify statistical Outlier tests. The result of the Outliertest is stored in the Objects themselves, e.g. each SearchObject knows its Outlier status (set recently, e.g. by this search) and can tell it by using the SearchObject's class method getOutlierStatus() (see there!) Added feature: prints progress on STDOUT for each 10% segment (app.) one hash "#" is printed to show progress if brute force should hit complexity boundaries (e.g. with a lot of dimensions as well as lots of objects). This also prints the parameters d and p and N for better understanding


getAverageDistanceMeasures

public double[] getAverageDistanceMeasures(int kindOfDistance)
Returns the average distances measures for the objects in the SearchSpace, calculating:

mean distance

standard deviation

variance The calculation is time consuming and should only be invoked if the data set is parsed for the first time (to get a feeling on it for statistical choices of parameters p and d for e.g. DB(p,d)-Outliers). It parses the objects matrix upper half to build an array of distances between objects (without doubling and without the distances of objects to themselves) which should be (n^2-n)/2 distances of value.

Returns:
double[3] of mean, variance and standard deviation

getAverageLOFMeasures

public double[] getAverageLOFMeasures()
Returns the average LOF measures for the objects in the SearchSpace, calculating:

mean LOF

standard deviation

variance

Returns:
double[3] of mean, variance and standard deviation

getMaximumOutlierFactor

public double getMaximumOutlierFactor()
This method returns the maximum Outlier Factor of all SearchObjects in the SearchSpace. Attn: Due to initializing, the outlier factors should be greater or equal to zero.


findKdistanceContainers

public void findKdistanceContainers(SearchObject so,
                                    int kindOfDistance)

This method processes a sequential search over the SearchSpace for a SearchObject so (named p here to be in line with the literature).

As a result of the search a structure of k-distance-Containers is build and listed within the SearchObject. Each container for a distance of an object or a number of objects o in relation to p is filled with all the objects within that distance. The containers are sorted in a linked list in the SearchObject by increasing distance. Just imagine it like p being a submarine sending a ping and listing all echos in radiuses (=distance) with the echos stored in a band (=container) if they are on the same radius.

Parameters:
so -

findAllKdContainers

public void findAllKdContainers(int kindOfDistance)
Finds and fills all K distance containers for all objects in the Search Room by invoking the process of finding all k distance containers for one Search Object.


computeLOF

public void computeLOF(int kMin,
                       int kMax)

Some deeper magic to compute all the LOFs for the objects in the searchroom up to MinPtsUB = kMax! The LOF output is only done up from kMin!

This one is heavily documented in the source, so if you are interested on how it is done, have a look at the source for the method.

Parameters:
kMin -
kMax -

computeDKN

public void computeDKN(int dk,
                       int n)
This function computes the D^k_n Outliers according to Ramaswamy, Rastogi and Shim which computes the top-n D^k-Outliers, the outliers (= objects) with the maximum distance to the k-th nearest neighbors. Please be aware that this function requires the findAllKdContainers method has to be run first, else it will simply stop or will not work.

Parameters:
dk -
n -


Copyright © 2001-2009 by Rapid-I