Using DPrep

DPrep converts a data set stored in a comma delimited text file to the binary format required by ORCA. DPrep will scale continuous features to the range [0,1] or normalize them by subtracting the mean and dividing by the standard deviation. DPrep will also randomize the order of the data set with a disk-based shuffling algorithm.

DPrep is called as follows

The data-file is the name of a comma delimited text file storing the data examples. The fields-file is the name of a file that describes the attributes and indicates which fields to use.

DPrep goes through four stages as follows:

  1. writes a weight file for use with ORCA,
  2. converts the data set to binary,
  3. scales the data set, and
  4. randomizes the data.

To run DPrep on the sample adult database type

Data File

The data file stores the data in a comma delimited format, with one example per line. For example, the first several records of the adult data set should appear as

Missing values for continuous and discrete fields should be represented with a question mark (?). For example, in the record below the second (workclass) and seventh (occupation) fields are missing their values.
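
As a rough illustration of the missing-value convention, the following Python sketch splits one comma delimited record and marks the fields given as a question mark. The function name parse_record and the placeholder record are hypothetical; they are not part of DPrep and not taken from the adult data set.

```python
def parse_record(line):
    """Split one comma delimited record; fields given as '?' become None."""
    fields = [field.strip() for field in line.split(",")]
    return [None if field == "?" else field for field in fields]


# Placeholder record (not real data): the second and seventh fields are missing.
record = parse_record("a, ?, c, d, e, f, ?, h")

# Zero-based positions of the missing fields.
missing = [i for i, field in enumerate(record) if field is None]
```

Here missing holds the zero-based positions 1 and 6, which correspond to the second and seventh fields.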


Fields File

The fields file contains a listing of the attributes in the data set and a description of the allowable values. For example, the fields file for the adult data set should appear as follows,

There is one attribute per line, and each attribute should have a name followed by a description of its allowable values. The attributes can be defined in the four ways listed in Table 1.


 

Options

Table 2 summarizes the options available for DPrep. This is followed by a more detailed description of the individual options.

Table 2: Summary of DPrep options.
Scaling Options
 -snone    no scaling of continuous fields
 -s01      scale continuous fields to range [0,1]
 -sstd     scale continuous fields to zero mean and unit standard deviation
Disk Based Randomization Options
 -rand     randomize
 -norand   do not randomize
 -i X      execute X iterations of shuffling (default 5)
 -rf X     use X temporary files for disk shuffling (default 10)
 -seed X   random number seed X (default is time based)
Miscellaneous Options
 -m X      floating-point number for encoding real missing values
 -cleanf   clean temporary files at end
 -cleand   clean temporary files during execution
 -cleann   do not clean temporary files

 

-snone  no scaling

This option turns off scaling of continuous attributes.

-s01   scale to range [0,1]

This option tells DPrep to scale all continuous features to the range [0,1]. For each continuous feature, DPrep finds the minimum and maximum values. It then scales each feature by subtracting the minimum value and dividing by the range (maximum-minimum).
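
The arithmetic described above can be sketched in Python as follows. This is an illustration of min-max scaling only, not DPrep's own code; the function name scale_01, the sample values, and the handling of a constant feature are assumptions.

```python
def scale_01(values):
    """Scale numbers to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant feature maps to 0.0. The DPrep
        # documentation does not say how this case is handled.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Illustrative values for one continuous feature.
ages = [39, 50, 38, 53, 28]
scaled = scale_01(ages)
```

The smallest value (28) maps to 0, the largest (53) maps to 1, and every other value falls in between.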

-sstd   scale by mean and standard deviation

This option tells DPrep to put all continuous variables into standard form by subtracting the mean from each feature and dividing by the standard deviation.
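
A minimal Python sketch of this standardization is given below. It is not DPrep's implementation: the function name standardize is hypothetical, and the use of the population standard deviation (dividing by n rather than n-1) is an assumption, since the documentation does not say which form DPrep uses.

```python
import math


def standardize(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population variance (divide by n).
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    if std == 0:
        # Assumption: a constant feature maps to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Illustrative values with mean 5 and population standard deviation 2.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

After the transformation the feature has zero mean and unit standard deviation.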

-norand  no randomization

Turns off randomization. That is, DPrep will not randomize the order of examples in data-file and will preserve the ordering in the binary output file. This option should only be used when the data file is already randomized or has no ordering dependencies (such as when artificial data is generated from a known probability distribution).

-i X  number of shufflings

This option sets the number of iterations DPrep uses to randomize the ordering of examples. In each iteration, DPrep assigns each example to a randomly chosen temporary file and then concatenates the temporary files in random order.
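
The shuffling scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not DPrep's code: it shuffles lines of a text file, whereas DPrep operates on its own binary records, and the function name disk_shuffle and its parameters are hypothetical. The defaults mirror the values listed for -i and -rf in Table 2.

```python
import os
import random
import shutil
import tempfile


def disk_shuffle(path, iterations=5, num_files=10, seed=None):
    """Shuffle the lines of the text file at `path` in place.

    Every line is assumed to end with a newline.
    """
    rng = random.Random(seed)
    for _ in range(iterations):
        with tempfile.TemporaryDirectory() as tmp:
            names = [os.path.join(tmp, f"part{i}.tmp") for i in range(num_files)]
            parts = [open(name, "w") for name in names]
            try:
                # Assign each example to a randomly chosen temporary file.
                with open(path) as src:
                    for line in src:
                        rng.choice(parts).write(line)
            finally:
                for part in parts:
                    part.close()
            # Concatenate the temporary files in random order.
            rng.shuffle(names)
            with open(path, "w") as dst:
                for name in names:
                    with open(name) as part:
                        shutil.copyfileobj(part, dst)


# Usage: shuffle a small file of 20 numbered lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    original = [f"{i}\n" for i in range(20)]
    f.writelines(original)
disk_shuffle(f.name, iterations=5, num_files=10, seed=0)
with open(f.name) as g:
    shuffled = g.readlines()
os.remove(f.name)
```

A single pass leaves examples within each temporary file in their original relative order, which is why several iterations are used. Only one example needs to be held in memory at a time, so the approach suits data sets that are too large to shuffle in memory.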

-rf X  number of temporary files

This option sets the number of temporary files to be used during randomization.

Copyright and Usage

This software is Copyright 2003 by the Institute for the Study of Learning and Expertise. DPrep may be freely used for educational and research purposes by non-profit institutions and U.S. Government agencies. Other organizations may use ORCA for evaluation purposes only. All further uses require prior approval.

This software is provided "as is" with no warranties of any kind, either expressed or implied, including, but not limited to implied warranties as to the performance, fitness, and merchantability of the software for a particular purpose.

The entire risk of using the software is with the user. The software is provided without any support or obligation to assist with its use. This software may not be sold or redistributed without prior approval.
 


2003-5-6