Orca: A Program for Mining Distance-Based Outliers

Orca is a program for mining outliers in large multivariate data sets. An outlier is an example that is substantially different from the examples in the remainder of the data. An outlier may have values for an attribute that are unusually large or small, or it may have an unusual combination of values that are rarely seen together.

Orca mines distance-based outliers. That is, Orca uses the distance from a given example to its nearest neighbors to determine its unusualness. The intuition is that if other examples lie close to a candidate in the feature space, then the candidate is probably not an outlier. If the nearest examples are substantially different, then the candidate is likely to be an outlier. Probabilistically, one can view distance-based outlier mining as identifying candidates that lie at points where the nearest neighbor density estimate is small.
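To make the intuition concrete, here is a minimal brute-force sketch in Python. It is not Orca's code: the score function (average Euclidean distance to the k nearest neighbors) and the names knn_outlier_scores and top_outliers are assumptions chosen for illustration, and the sketch expects at least two examples and k of at least 1.

```python
import heapq
import math


def knn_outlier_scores(data, k):
    """Score each example by its average distance to its k nearest neighbors.

    Assumed score function and Euclidean distance; illustrative only.
    """
    scores = []
    for i, x in enumerate(data):
        # Distances from x to every other example (brute force, O(n^2) overall).
        dists = [math.dist(x, y) for j, y in enumerate(data) if j != i]
        nearest = heapq.nsmallest(k, dists)
        scores.append(sum(nearest) / len(nearest))
    return scores


def top_outliers(data, k, n):
    """Return (index, score) pairs for the n highest-scoring examples."""
    scores = knn_outlier_scores(data, k)
    ranked = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    # Four points on a unit square plus one point far away from them.
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_outliers(points, k=2, n=1))
```

An example with close neighbors gets a small score, while an isolated example gets a large one, which matches the low-density view described above.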

Orca will find the top outliers in a multivariate data set. The key features of Orca are:

Orca has excellent scaling properties on large real data sets. Orca can process 1,000,000 census examples in about 20 minutes on a 1.5 GHz Pentium 4 computer.
Orca only requires a limited amount of main memory to run. It does not require loading the entire database into memory. The typical memory footprint is about 3 MB.
Orca can explain why an example is an outlier. Orca can analyze the features of an example and determine their individual contribution to the unusualness.
Orca has options to allow users to change the outlier score function and the distance measure.
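The title of the paper cited under Usage Requirements points to randomization and a simple pruning rule as the source of the scaling behavior. The Python sketch below shows one way such a rule can work; it is a simplified illustration under stated assumptions, not Orca's implementation. The function name top_outliers_pruned, the seed parameter, the average-distance score, and the Euclidean distance are all hypothetical choices. The idea: while scanning neighbors for a candidate, the average of the k closest distances seen so far can only shrink, so the candidate can be dropped as soon as that average falls to the score of the weakest outlier already in the top n.

```python
import heapq
import math
import random


def top_outliers_pruned(data, k, n, seed=0):
    """Find the n examples with the largest average distance to their
    k nearest neighbors, abandoning candidates that cannot reach the top n.

    Illustrative sketch; names and score function are assumptions.
    """
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)  # random order makes early pruning likely
    top = []       # min-heap of (score, index): the best n found so far
    cutoff = 0.0   # score of the weakest entry in `top` once it is full

    for i in order:
        neighbors = []  # negated distances, so the heap root is the largest kept
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(data[i], data[j])
            if len(neighbors) < k:
                heapq.heappush(neighbors, -d)
            elif d < -neighbors[0]:
                heapq.heapreplace(neighbors, -d)
            # Once k neighbors are held, the running average is an upper
            # bound on the final score, so falling to the cutoff is final.
            if len(neighbors) == k and len(top) == n:
                if -sum(neighbors) / k <= cutoff:
                    pruned = True
                    break
        if pruned or not neighbors:
            continue
        score = -sum(neighbors) / len(neighbors)
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)  # (score, index), highest score first
```

Because most examples in real data have close neighbors, most candidates are discarded after only a few distance computations, which is what keeps the nested loop far from its quadratic worst case in practice.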

Instructions

The Orca software package comes with two programs, Orca and DPrep. Orca handles all of the computations associated with finding outliers. DPrep converts data sets that are stored as comma-delimited text files into binary files for use with Orca. Further instructions can be found here:

Copyright and Disclaimer

This software is Copyright 2003 by the Institute for the Study of Learning and Expertise and it may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use Orca for evaluation purposes only. All further uses require prior approval.

This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.

The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.

Usage Requirements

If you use Orca in your research, please cite the following paper:

Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

If you use Orca in applied work, I would appreciate a brief email describing your problem and how you used Orca to solve it. I am trying to keep track of real applications where Orca has proved useful.

Download

Orca is available as a binary executable for Linux (x86), Microsoft Windows, and Solaris (64-bit). Orca is written in C++ and was compiled with gcc-2.9.6 (Linux) and MinGW32 (Windows).

orca.tar.gz binary for Linux (x86), Windows, and 64-bit Solaris (2.8 MB compressed; 13.3 MB uncompressed). Last updated 2003-9-30.
orca-src-2004-7-27.tar.gz source code for Orca.

The above tar file includes the Adult census data set originally from the UCI Machine Learning Repository. The following data sets were used in Bay & Schwabacher (2003) to evaluate an early version of Orca:

Color Histogram
Forest Covertype
Normal30D (192 MB)
Person 1990 (223 MB)
Household 1990 (117 MB)
KDDCup 1999

Most of these additional data sets come from the UCI KDD Archive and the Minnesota Population Center's IPUMS Repository of census microdata. The data set Normal30D is generated from a 30-dimensional Gaussian whose covariance matrix is the identity matrix.
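A data set in the style of Normal30D can be generated with a few lines of Python, sketched below. With an identity covariance matrix, the 30 coordinates are independent and each has unit variance, so every value can be drawn separately. The zero mean, the number of examples, the seed, and the names generate_normal_data and write_csv are assumptions; the page does not state how the original file was produced.

```python
import csv
import random


def generate_normal_data(num_examples, dims=30, seed=0):
    """Draw examples from a Gaussian with identity covariance.

    Zero mean is an assumption; only the covariance is given in the text.
    """
    rng = random.Random(seed)
    # Identity covariance: coordinates are independent with unit variance,
    # so each one is an independent standard normal draw.
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)]
            for _ in range(num_examples)]


def write_csv(rows, path):
    """Write the examples as comma-delimited text, one example per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    rows = generate_normal_data(1000)
    write_csv(rows, "normal30d_sample.csv")  # hypothetical file name
```

The output is plain comma-delimited text, the kind of input DPrep is described as converting; whether DPrep expects a header line or other conventions is not covered here.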

2003-5-6