Weka uses the ARFF format for storing data. In the development series (3.5.x) an XML version of the ARFF format was introduced, XRFF. On the surface, there is little reason to use it, the format is far more verbose so file size quickly swells up. There are 3 additional features over the ARFF format:
- Class attribute specification
- Attribute weight
- Instance weight
<attribute class="yes" name="class" type="nominal">Associate a weight to a attribute (within the header section) using metadata:
<attribute name="petalwidth" type="numeric">Associate a weight to an individual instance:
<metadata>
<property name="weight">0.9</property>
</metadata>
</attribute>
<instance weight="0.75">You can use the weight associated to an individual instance to simulate under and over sampling. For example, if you have 100 actives in a dataset and 1000 inactives, oversample the actives. This means training on each active 10 times so the model is composed from 1000 actives and 1000 inactives, granted the same actives are used, but this technique has positive effects on skewed datasets. The weight to add for this dataset would be 10 to each active instance.
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
<instance weight="10">
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>active</value>
</instance>