Chemistry in silico: Under and over sampling with Weka

Thursday, July 31, 2008

Weka uses the ARFF format for storing data. The development series (3.5.x) introduced XRFF, an XML version of the ARFF format. On the surface there is little reason to use it: the format is far more verbose, so file size quickly swells. It does, however, add three features over the ARFF format:

Class attribute specification
Attribute weight
Instance weight

Typically the class attribute is the last one in the file; otherwise you need to tell the classifier which attribute to use. With XRFF you can mark any attribute as the class attribute:

<attribute class="yes" name="class" type="nominal">

Associate a weight with an attribute (within the header section) using metadata:

<attribute name="petalwidth" type="numeric">
<metadata>
  <property name="weight">0.9</property>
</metadata>
</attribute>

Associate a weight with an individual instance:

<instance weight="0.75">
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>

You can use the weight associated with an individual instance to simulate under- and oversampling. For example, if a dataset has 100 actives and 1000 inactives, you can oversample the actives: training on each active 10 times means the model is built from the equivalent of 1000 actives and 1000 inactives. Granted, the same actives are reused, but this technique has positive effects on skewed datasets. For this dataset, the weight to add to each active instance is 1000 / 100 = 10:

<instance weight="10">
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>active</value>
</instance>
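
Adding the weight by hand gets tedious on a real dataset, so here is a minimal Python sketch of how the same idea could be scripted. It is not part of Weka: the function name `oversample_by_weight` is made up for illustration, and it assumes dense (non-sparse) instances where the class is the last `<value>` of each `<instance>`. It also only helps if the classifier you train actually takes instance weights into account.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def oversample_by_weight(xrff_text, minority_label):
    """Return XRFF text with every minority-class instance weighted up.

    Hypothetical helper, not a Weka API. Assumes the class value is the
    last <value> element of each dense <instance>.
    """
    root = ET.fromstring(xrff_text)
    instances = list(root.iter("instance"))

    # Read the class label of every instance (assumed to be the last value).
    labels = [inst.findall("value")[-1].text for inst in instances]
    counts = Counter(labels)

    # Weight = size of the largest other class / size of the minority class,
    # e.g. 1000 inactives / 100 actives = 10.
    majority = max(n for label, n in counts.items() if label != minority_label)
    weight = majority / counts[minority_label]

    # Set the weight attribute only on the minority-class instances.
    for inst, label in zip(instances, labels):
        if label == minority_label:
            inst.set("weight", f"{weight:g}")

    return ET.tostring(root, encoding="unicode")
```

Called as `oversample_by_weight(xrff_text, "active")` on the 100 actives / 1000 inactives dataset described above, this would put `weight="10"` on each active instance and leave the inactives untouched. Weights below 1 on the majority class would simulate undersampling in the same way.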