Sunday, December 14, 2008

Keeping Workbench trim and fit!

Under the covers, the Workbench model data is stored in a set of RDF/XML files using the Jena Semantic Web Framework.

At the lowest level, this means that a set of files exists (by default) under "$CTIER_ROOT/workbench/rdfdata" on your ControlTier server for each project you create in Workbench:

$ cd $CTIER_ROOT/workbench/rdfdata
$ ls -lh
total 16M
-rw-rw-r-- 1 anthony anthony 6.1M Dec 8 10:53 Arch_UModules_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 491K Dec 5 17:33 Arch_UObjects_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 6.1M Dec 8 10:53 Arch_UTypes_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 809 Dec 5 14:34 Arch_UXforms_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 490K Dec 8 10:53 Map_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 1022K Dec 8 10:53 Modules_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 56K Dec 5 17:33 Objects_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 1.2M Dec 8 10:53 Types_UPioneerCycling
-rw-rw-r-- 1 anthony anthony 1.4K Dec 5 14:35 Workbench
-rw-rw-r-- 1 anthony anthony 809 Dec 5 14:34 Xforms_UPioneerCycling

A given set of files has the project name appended (in this case "PioneerCycling") and is split into two sets: the primary files and their archives (prefixed with "Arch_").
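To pick out just the archive set for one project, a small shell function along these lines can help. It is only a sketch: it assumes the "Arch_U<Kind>_U<Project>" naming visible in the listing above, and the function name list_archives is made up for illustration.

```shell
# list_archives DIR PROJECT
# Print the archive files for PROJECT, one per line.
# Assumption: archives are named Arch_U<Kind>_U<Project>, as in the listing.
list_archives() {
    dir=$1
    project=$2
    for f in "$dir"/Arch_U*_U"$project"; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

For example, list_archives "$CTIER_ROOT/workbench/rdfdata" PioneerCycling would print the four "Arch_" files shown above.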

This would all be largely academic were it not that managing these files is critical to keeping anything but the most trivial project responsive. Jena relies on file-level locking to manage updates and, in the process, repeatedly copies the entire file to temporary "checkpoint" copies. Of course, at the OS level, copying files of even tens of MB is trivial.

However, streaming the same data through the Jena library turns out to be a significant performance bottleneck, so much so that it really pays to keep the ControlTier repository trim and fit!

The primary way to do this is to navigate to the Workbench administration page, find the "Model Administration (Advanced)" section and run the five file compaction tasks.

This process minimizes the size of the primary data files and can be run as frequently as makes a difference.

Dealing with the archive files is a little more complex.

In normal operation there is no need to track the history of changes to the model so it is reasonable to remove the archive files on a regular basis. The process for achieving this is straightforward:
  1. Shut down Workbench.
  2. Remove the "Arch_" files from $CTIER_ROOT/workbench/rdfdata associated with the required project(s).
  3. Restart Workbench.
There are a few points to note about this process:
  • It is necessary to do this with Workbench stopped as the file set is cached in the JVM's heap and will simply be re-written otherwise.
  • You may wish to skip the "Modules" archive file. Removing it invalidates Workbench's notion of the most recent ("head") version of the packaged modules on the WebDAV, which means you must repackage all Deployment and Package modules, and that is quite a lengthy process.
  • For a project with a stable type model, it is really only the "Objects" archive file that has an impact on performance, so it may only be necessary to remove this file.
  • As a rule of thumb, only worry about files that are > ~20MB.
  • There have been cases where we've set this process up in cron (since Jobcenter/Ctl/Antdepo require Workbench to be available for normal operation).
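Step 2 of the procedure, together with the notes above, could be scripted roughly as follows. Treat it as a sketch rather than a supported tool: the function name prune_archives and its arguments are hypothetical, it assumes the "Arch_U<Kind>_U<Project>" naming shown in the earlier listing, and it deliberately leaves out stopping and restarting Workbench, which you would do around it in whatever way your installation normally does.

```shell
# prune_archives DIR PROJECT [MIN_MB]
# Delete archive files for PROJECT that are at least MIN_MB megabytes
# (default 20, per the rule of thumb), leaving the "Modules" archive alone.
# Run this only while Workbench is stopped.
prune_archives() {
    dir=$1
    project=$2
    min_mb=${3:-20}
    for f in "$dir"/Arch_U*_U"$project"; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$f" ] || continue
        case ${f##*/} in
            Arch_UModules_U*) continue ;;   # keep: avoids repackaging modules
        esac
        # File size in bytes; the arithmetic expansion strips any padding.
        size_bytes=$(($(wc -c < "$f")))
        if [ "$size_bytes" -ge $((min_mb * 1024 * 1024)) ]; then
            rm -- "$f" && printf 'removed %s\n' "$f"
        fi
    done
}
```

A call such as prune_archives "$CTIER_ROOT/workbench/rdfdata" PioneerCycling would then remove only the large non-"Modules" archives for that project. Passing a smaller third argument lowers the size threshold.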
Finally, I should note that this whole issue was much more of a problem under ControlTier 3.1 and that we've done a lot to mitigate its impact on performance under ControlTier 3.2 by eliminating unnecessary model versioning. We have not dealt with the fundamental scaling issues in Jena, so it still pays to be conscious of all this.

Anthony Shortland.
