My PhD thesis

Computational-linguistic approaches to biological text mining, University of London, 2008:

Downloads for MPL, a Meta-Pattern Language

MPL is a language for describing patterns in Stanford-style dependency graphs, with applications in syntactic analysis and semantic relationship extraction. Resources:

Supplementary material for Clegg & Shepherd 2007

In order to facilitate direct comparisons between different systems, I’ve released a patch for the GENIA treebank distribution (beta 2) which will make all the changes necessary to replicate the exact test treebank we used in this paper.

We corrected errors in a few sentences, removed any uncorrectable sentences, and changed the format to make it more typically PTB-like. Although we based our work on beta 1 of the GENIA distribution, this is no longer available, so the patch requires beta 2.

If you use this patch in a published project, please ensure that you credit the GENIA team just as you would for any other use of the GENIA corpus. Please don’t redistribute the patched version of the corpus yourself.


  • Download the file GTB.tar.gz from the GENIA treebank (beta 2), which contains only the first 200 abstracts of the treebank. When we started this project, the second installment of 300 abstracts was not available.
  • Download and install the bdiff package from SourceForge, as this is what I used to create the patch. I used version 0.8 of bdiff; hopefully future versions will be backwards compatible, but if not, contact me and I’ll send you the 0.8 distribution.
  • Download and unzip this file, which is the patch in bdiff format.
  • Concatenate all the .tree files in the distribution into one long file. On UNIX-like systems, you’ll need a command like this:
    $ cat *.tree > sentences.tree
    The resulting file should have an MD5 checksum of 8170ae71d7482f2f8fa8849134fa35ab. This is really important; if your MD5 checksum is different, GENIA have changed the files in their distribution, and the patch will no longer work!
  • Run the patch against this treebank file using the bpatch utility supplied with bdiff. On UNIX-like systems, you need to do this:
    $ bpatch sentences.bdiff < sentences.tree > sentences_edited.tree
    This will replicate our treebank file in a new file called sentences_edited.tree—you might wish to check its MD5 checksum as well just to be sure. This should be 7937abf3d0c0a55abaeb421c5076949c.
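
The two checksum checks in the steps above can also be scripted. What follows is a minimal Python sketch, not part of the GENIA or bdiff distributions. The helper names md5_of_file and verify are made up for illustration; only the file names and the two checksum values come from the steps above.

```python
import hashlib

# Checksums quoted in the steps above.
EXPECTED_MD5 = {
    "sentences.tree": "8170ae71d7482f2f8fa8849134fa35ab",
    "sentences_edited.tree": "7937abf3d0c0a55abaeb421c5076949c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large treebank file is never
        # held in memory all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Hypothetical helper: True if the file's MD5 matches `expected`."""
    return md5_of_file(path) == expected.lower()
```

Calling verify("sentences.tree", EXPECTED_MD5["sentences.tree"]) before running bpatch, and the equivalent call for sentences_edited.tree afterwards, should return True if everything has gone to plan.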

Q & A

  • Why can’t you just distribute sentences_edited.tree directly?
    This would be easier, but might introduce licensing/copyright issues with the University of Tokyo, and with the NLM (as GENIA is still MEDLINE data, plus some annotations). And the GENIA group, understandably, don’t want to reach the situation where there are lots of different versions of their corpus available from different places; obviously this would create data management problems. This is a compromise acceptable to everyone.
  • Okay—but why can’t you make the patch with the standard UNIX diff utility then?
    diff works on text files, line by line, and its output is a human-readable text file. Since every line of the corpus differs between the originally-distributed version and our version (because of formatting differences as much as anything else), any diff we made would contain the entire corpus, and would therefore have the same licensing implications as putting the edited file up for download. However, bdiffs are only machine-readable, and cannot be used to retrieve the edited data without the original corpus present in the same form.
  • Don’t other people redistribute MEDLINE data all the time?
    Yes… But it is a bit of a grey area legally, and the NLM’s own policies on redistribution seem to be somewhat contradictory. So, since I Am Not A Lawyer™ I decided to sidestep the whole issue.
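
The point about diff in the second answer can be illustrated with a small Python sketch using the standard difflib module. The two bracketed "sentences" are invented stand-ins rather than real GENIA data, and added_lines is a hypothetical helper. It shows that when every line differs, the edited text can be read straight back out of a line-based diff.

```python
import difflib


def added_lines(original, edited):
    """Collect the lines a unified diff marks as added ('+')."""
    diff = list(difflib.unified_diff(original, edited, lineterm=""))
    # The first two entries are the '---' and '+++' file headers.
    body = diff[2:]
    return [line[1:] for line in body if line.startswith("+")]


# Invented example lines: same content, different formatting.
original = ["(S (NP a) (VP b))", "(S (NP c) (VP d))"]
edited = ["( (S (NP a) (VP b)) )", "( (S (NP c) (VP d)) )"]

# Every line changed, so the diff carries the whole edited file.
recovered = added_lines(original, edited)
```

Here recovered is identical to edited, which is why a plain diff of the two treebank files would amount to redistributing the edited corpus.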

Supplementary material for Clegg & Shepherd 2005

The manuscript for “Evaluating and integrating treebank parsers on a biomedical corpus”, as well as some scripts and data files that you might find useful, should be available here. At the moment they are not, as the files seem to have vanished. I’ll fix this.
