Thursday, July 31, 2008

Under and over sampling with Weka


Weka uses the ARFF format for storing data. The development series (3.5.x) introduced an XML version of ARFF called XRFF. On the surface there is little reason to use it: the format is far more verbose, so file sizes quickly swell. However, it offers three features that ARFF lacks:
  1. Class attribute specification
  2. Attribute weight
  3. Instance weight
In ARFF the class attribute is typically the last one in the file; otherwise you have to tell the classifier which attribute to use. With XRFF you can mark any attribute as the class:
<attribute class="yes" name="class" type="nominal">
Associate a weight with an attribute (within the header section) using metadata:
<attribute name="petalwidth" type="numeric">
<metadata>
<property name="weight">0.9</property>
</metadata>
</attribute>
Associate a weight with an individual instance:
<instance weight="0.75">
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
You can use instance weights to simulate under and over sampling. For example, if your dataset has 100 actives and 1000 inactives, you can oversample the actives by giving each one a weight of 10: the classifier then effectively trains on 1000 actives and 1000 inactives. Granted, the same 100 actives are reused, but this technique often helps with skewed datasets. For this dataset, each active instance gets a weight of 10:
<instance weight="10">
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>active</value>
</instance>
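You can then train on the weighted file from the command line as usual. This is only a sketch: it assumes weka.jar from the 3.5.x series is in the current directory, the file is saved as iris-weighted.xrff, and the command-line evaluation code picks the XRFF loader based on the file extension:
java -cp weka.jar weka.classifiers.trees.J48 -t iris-weighted.xrff -x 10
The -x 10 requests 10-fold cross-validation; a classifier that honours instance weights will then see each active counted ten times.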

Monday, July 28, 2008

Weka Online

Weka is an excellent machine learning/data mining workbench from the University of Waikato. It is Java-based and available under the GNU GPL.

An advantage of being written in Java is that it can easily run on virtually any platform. On the flip side, Java programs are limited by the amount of RAM available to the JVM, and this matters for Weka because it takes a memory-driven rather than disk-driven approach. As data sets get larger and larger, more RAM is required to process them. Couple this with Weka not being specifically designed for large data sets and it isn't hard to need more than 2GB of RAM.

Now for the technical part: on 32-bit hardware and operating systems (x86), a single process can typically only use around 2GB of RAM, regardless of how much the machine actually has. To use more than that per process you need both 64-bit hardware and a 64-bit operating system (x86_64). Thankfully it is increasingly common for new purchases to come with 64-bit hardware as standard.
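If you do have a 64-bit machine, the fix is simply to ask the JVM for a larger heap when launching Weka. For example (the heap size and jar path here are just placeholders, adjust them to your setup):
java -Xmx3000m -cp weka.jar weka.gui.GUIChooser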

However, if you don't have new hardware, another solution has recently become available: Weka Online. They allow you to submit Weka tasks from a web interface to their 64-bit computer cluster (with 2.5-3.5GB RAM available). Alas, as I write this they have disabled submission while they bolster security following a malicious attack.

Once this service is back it actually offers more than standard Weka, via their CEO framework (see more here). I've not actually tried the service myself, but the idea is certainly appealing.

Tuesday, July 22, 2008

Condor 7.1.1 supported ports

The development series for Condor is dropping support for several platforms from 7.1.1 onwards:
  1. Red Hat 9 (this is also the port suitable for openSUSE 10.x)
  2. Solaris 5.8
  3. Mac OS X PowerPC
RHEL 3 binaries should be fine for any Red Hat system (and presumably CentOS). Solaris 5.8 users can use the 5.9 binaries.

It should be noted they are continuing these ports for the current stable series (7.0.x).

Unfortunately the RHEL 3 binaries do not work on openSUSE 10.x (well, they run, but give Shadow exceptions if you try to do anything useful, like run a job!). Looks like a case for compiling from source...

UPDATE: Condor 7.1.1 has been pulled due to numerous problems, look out for 7.1.2.

Saturday, July 19, 2008

Perl on Eclipse 3.4 (Ganymede)

I use Eclipse every day and the ability to use it for multiple languages is crucial. Perl is one of the languages I use and there is an excellent plugin for it: EPIC. However, after installing the recently released Ganymede (3.4), I couldn't install EPIC.

There are multiple versions of Eclipse available to download; typically I pick Classic. However, EPIC will not install onto that version. I found the Eclipse IDE for Java Developers worked fine. Hopefully any other plugins you use won't mind you using this version!

Friday, July 18, 2008

Display source code in MediaWiki

You have three options by default:
  1. Display code inline like script.sh, by using <code>script.sh</code>.
  2. Blocks of code can be wrapped with <pre>insert your code here</pre>. This works across multiple lines but doesn't allow formatting.
  3. Indent your text to get a <pre>-like block, then apply standard '''bold''' and ''italic'' formatting.
The above options generally work quite well, but if you end up with lots of code from different languages and more than a few lines, it would be handy to have syntax highlighting. Thankfully an extension to MediaWiki can do this. Use SyntaxHighlight_GeSHi to colour away; it is also used on Wikipedia.

You will need root access to your server and Subversion installed; then follow the simple instructions on the extension website. Download the extension and GeSHi, then add it to your LocalSettings.php, roughly as sketched below.
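For reference, the install boils down to something like this; the Subversion URL is from memory, so double-check it against the extension page:
cd /path/to/mediawiki/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/SyntaxHighlight_GeSHi/
# then enable it in LocalSettings.php:
# require_once("$IP/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.php");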

Now you have a fourth option: wrap your code with <source lang="X">code here</source>, where X is php, java, bash, ruby or one of the other nearly 50 supported languages!

Thursday, July 17, 2008

Thumbnails and TeX support for MediaWiki on Mac OS X

After you have set up a new MediaWiki installation you will likely want to enable some extra functionality, which requires additional software.

First off, to create thumbnails of images you need to install ImageMagick, which gives you the incredibly handy convert program.

Second, you can add maths support using TeX; for this you need OCaml, LaTeX and dvipng. Grab the required programs from MacPorts (they are available from Fink as well):
  • sudo port install ImageMagick
  • sudo port install ocaml
  • sudo port install tetex
  • sudo port install ghostscript
Texvc will convert your TeX into whatever MediaWiki wants to display (HTML, MathML or PNG), but it needs to know where several programs are. Given you are probably running your webserver as the www user, who has no $PATH settings, how do you tell texvc where to find them? In an unusual move, hardcode them! Edit <mediawiki>/math/render.ml, prefixing /opt/local/bin to the four commands:
let cmd_dvips tmpprefix = "/opt/local/bin/dvips -R -E " ^ tmpprefix ^ ".dvi -f >" ^ tmpprefix ^ ".ps"
let cmd_latex tmpprefix = "/opt/local/bin/latex " ^ tmpprefix ^ ".tex >/dev/null"
let cmd_convert tmpprefix finalpath = "/opt/local/bin/convert -quality 100 -density 120 " ^ tmpprefix ^ ".ps " ^ finalpath ^ " >/dev/null 2>/dev/null"
let cmd_dvipng tmpprefix finalpath = "/opt/local/bin/dvipng -gamma 1.5 -D 120 -T tight --strict " ^ tmpprefix ^ ".dvi -o " ^ finalpath ^ " >/dev/null 2>/dev/null"
Then recompile texvc with make; OCaml will take over from here.
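In other words, something along these lines (substitute your own MediaWiki root):
cd /path/to/mediawiki/math
make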

Tell ImageMagick where gs is by editing /opt/local/lib/ImageMagick-X.X.X/config/delegates.xml, where X.X.X is the version number. Replace every "gs" with "/opt/local/bin/gs"; only edit the "gs" entries, about half a dozen of them.

Finally, tell MediaWiki to use TeX by adding $wgUseTeX = true; to your LocalSettings.php.

Now math support should be good to go. Thanks to this compilation of advice for assistance.

So now this wikitext will produce the following:

== Magical latex in action ==
<math>\left \{ \frac{a}{b} \right \} \quad \left \lbrace \frac{a}{b} \right \rbrace</math>

<math>x \implies y</math> an AMS command

<math>f(n) =
\begin{cases}
n/2, & \mbox{if }n\mbox{ is even} \\
3n+1, & \mbox{if }n\mbox{ is odd}
\end{cases}</math>

== Image thumbnail ==

[[Image:OpenSUSE.png|frame|Full size|center]]
[[Image:OpenSUSE.png|thumb|A thumbnail|center]]

Wednesday, July 16, 2008

Solubility Challenge

UCC has launched a competition in conjunction with JCIM. Essentially, their article (DOI: 10.1021/ci800058v), which appeared on ASAP yesterday, details 132 druglike molecules. They report the solubility for 100 of the molecules and challenge you to predict the other 32 using whatever method you choose.

Submit your predictions by 15th September 2008; the best submissions will then be invited to detail their models in JCIM articles.

Full details are on the Goodman group website, including machine-readable files.

Tuesday, July 8, 2008

Subversion with CruiseControl

As we use Subversion for our version control we need an extra step, because CruiseControl only has limited Subversion support (e.g. it can't check out a project; I'm sure it should be able to, but it has never worked for me).

To give it the power to do so you need to download SvnAnt. Copy the three jars from its lib folder into the lib folder of your installation: /cruisecontrol/apache-ant-1.7.0/lib. This way everything that keeps CruiseControl happy is in one place. Now define a property file giving the location of SvnAnt, something like this svn-build.props:
svnant.version=1.0.0

lib.dir=../apache-ant-1.7.0/lib

svnant.jar=${lib.dir}/svnant.jar
svnClientAdapter.jar=${lib.dir}/svnClientAdapter.jar
svnjavahl.jar=${lib.dir}/svnjavahl.jar
You need to ensure the lib.dir value is valid, depending on where you call this file from (in this example /cruisecontrol/project). As you will see, we make a wrapper script to grab the code from the repository before launching the project's ant script. The wrapper script may live in /cruisecontrol/project, but it defines its basedir as /cruisecontrol/checkout.

A sample script can be found here (Blogger doesn't want to display it). To use it, save it to /cruisecontrol/project and edit the sample_project name and Subversion path. Your project will be checked out to /cruisecontrol/checkout, where it is built, tested, compiled, etc. For the first run I had to check the code out manually, otherwise CruiseControl would kick up a fuss.

In your main config.xml, call the new wrapper script (/cruisecontrol/project/sample_project.xml) in the schedule section. This way a fresh copy of the code is checked out before CruiseControl commences the build.

Thursday, July 3, 2008

Start condor on boot with Mac OS X

Once you have Condor running on your clients you will want it to load by default when booting. The Condor distribution includes Linux-based startup scripts, but there are none for the Mac. Looking through the mailing list there is a suggestion of scripts to use, but they rely on Panther (10.3) era technologies, which are not recommended in Tiger (10.4) and not available in Leopard (10.5).

Delving a bit further, I found another way to start Condor: using cron.

Create a script to start condor: sudo vim /usr/sbin/start_condor

Enter these contents, and customise to your installation:
#!/bin/bash
# Ensure network is all setup
sleep 100

# Ensure condor environment is loaded
source /opt/condor/condor.sh

# Start condor
/opt/condor/sbin/condor_master

Our Condor installation is actually stored on an NFS drive, so the 100-second sleep is there to ensure the NFS drives have mounted before the rest of the script runs. I handle the settings for $CONDOR_CONFIG, $PATH and $MANPATH in a separate script (condor.sh); alternatively you could specify $CONDOR_CONFIG directly in this script.

Tell cron about your script, and that it should be run on boot:
echo "@reboot root /usr/sbin/start_condor" | sudo tee -a /etc/crontab
Piping through sudo tee means the append to /etc/crontab happens with root privileges (a plain sudo echo ... >> would fail, because the redirection runs in your own shell). I have the condor daemons run as root, hence the root user mentioned in this crontab entry.

Test the script by running it directly from the command line first; if it works there, you shouldn't have any trouble when rebooting.
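For example, assuming the paths above (and remembering the script sleeps for 100 seconds before starting anything):
sudo /usr/sbin/start_condor
# wait out the sleep, then check the master is running
ps aux | grep condor_master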