Orca: A Program for Mining Distance-Based Outliers

Orca is a program for mining outliers in large multivariate data sets. An outlier is an example that is substantially different from the examples in the remainder of the data. An outlier may have values for an attribute that are unusually large or small, or it may have an unusual combination of values that are rarely seen together.

Orca mines distance-based outliers. That is, Orca uses the distance from a given example to its nearest neighbors to determine its unusualness. The intuition is that if other examples lie close to the candidate in the feature space, then the candidate is probably not an outlier. If its nearest examples are substantially different, then the candidate is likely to be an outlier. Probabilistically, one can view distance-based outlier mining as identifying candidates that lie at points where the nearest-neighbor density estimate is small.
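
As a rough illustration of the distance-based idea, the Python sketch below scores each example by its average distance to its k nearest neighbors and reports the highest-scoring examples. It is a brute-force toy and not Orca's actual implementation; the function names, the choice of Euclidean distance, and the use of the average as the score are all assumptions made for this example.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by its average Euclidean distance to its k nearest neighbors.

    Hypothetical helper for illustration only; compares every pair of points.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def top_outliers(points, k=2, n=1):
    """Return the indices of the n points with the largest scores."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Four points in a tight cluster and one point far away from them.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_outliers(points, k=2, n=1))  # prints [4], the isolated point
```

The clustered points all have close neighbors and so receive small scores, while the isolated point has no close neighbors and receives the largest score.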

Orca will find the top outliers in a multivariate data set. The key features of Orca are:


Instructions

The Orca software package comes with two programs, Orca and DPrep. Orca handles all of the computations associated with finding outliers. DPrep converts data sets that are stored as comma-delimited text files into binary files for use with Orca. Further instructions can be found here:


Copyright and Disclaimer

This software is Copyright 2003 by the Institute for the Study of Learning and Expertise and it may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use Orca for evaluation purposes only. All further uses require prior approval.

This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.

The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.

Usage Requirements

If you use Orca in your research, please cite the paper.

If you use Orca in applied work, I would appreciate a brief email describing your problem and how you used Orca to solve it. I am trying to keep track of real applications where Orca has proved useful.

Download

Orca is available as a binary executable for Linux (x86), Microsoft Windows, and Solaris (64-bit). Orca is written in C++ and was compiled with gcc-2.9.6 (Linux) and MinGW32 (Windows).

The above tar file includes the Adult census data set, originally from the UCI Machine Learning Repository. The following data sets were used in Bay & Schwabacher (2003) to evaluate an early version of Orca:

Most of these additional data sets come from the UCI KDD Archive and the Minnesota Population Center's IPUMS Repository of census microdata. The data set Normal30D is generated from a 30-dimensional Gaussian with the covariance matrix equal to the identity matrix.
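
For readers who want a comparable synthetic data set, the sketch below draws rows from a 30-dimensional Gaussian with identity covariance, which amounts to sampling each coordinate independently with unit variance. The zero mean, the seed, the row count, and the function name are assumptions for illustration; the description above does not state the mean or the size of Normal30D.

```python
import random


def make_normal_data(n_rows, n_dims=30, seed=0):
    """Draw n_rows samples from an n_dims-dimensional standard Gaussian.

    With identity covariance the coordinates are independent, so each
    one is an independent draw from N(0, 1). Zero mean is an assumption.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_rows)]


# Example: write a small comma-delimited sample to standard output.
for row in make_normal_data(n_rows=3):
    print(",".join(f"{x:.4f}" for x in row))
```

A fixed seed keeps the output reproducible between runs.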

 

2003-5-6