One of the neat things you can do with many modern relational database servers is import and index hierarchically-structured XML data in a similar way as you would with tabular data — without ‘shredding’ it into tables and rows.
Of course, you can also do this with native XML databases or other query engines, but the advantage of using an RDBMS is that you can integrate queries over XML documents with regular SQL queries.
I was writing some custom, task-specific Perl tools for querying the MeSH controlled vocabulary when it occurred to me that a more flexible, general-purpose solution might be possible by loading it all into SQL Server which we use for normal database purposes anyway (just the free version behind that link). It worked pretty well and I thought I’d share my methods in case anyone else would find it useful.
The first thing you need is the MeSH descriptors XML distribution, available here (the link entitled ‘2008 MeSH in XML format’).
Then you import the XSD schema for MeSH into SQL Server. Well, you don’t actually need to, but it will allow SQL Server to validate the XML as it loads (always a good thing) and is supposed to make things faster too. To do this, you need a little bit of T-SQL as follows:
create xml schema collection [MeSH].[DescriptorSchema] as ' paste_the_XSD_schema_in_here '
Remember those single-quotes!
Then you can create, load and index the table to hold the XML. It has two columns, one to hold the unique identifier of each descriptor record, and the other to hold the entire descriptor record in parsed XML format.
First, create the table.
create table [MeSH].[Descriptor] ( ui char( 7 ) PRIMARY KEY, record XML( content [MeSH].[DescriptorSchema] ) );
Then create a temporary table to load the entire MeSH file into, for shredding.
create table [MeSH].[DescriptorTemp] ( data xml( document [MeSH].[DescriptorSchema] ) );
Load the data into this table to start with. Obviously, replace m:\desc2008.xml with the location you saved the MeSH XML distribution to.
insert into [MeSH].[DescriptorTemp] select * from openrowset( bulk 'm:\desc2008.xml', single_blob ) as data;
This is where it starts to get complex. For each descriptor record, pull out the unique identifier, and store the record with the identifier in two columns of a new row in the main table. (CROSS APPLY is very handy for shredding XML.)
with xmlnamespaces( default 'uri:mesh' ) insert into [MeSH].[Descriptor] select T.c.value( './DescriptorUI', 'char( 7 )' ) as ui, T.c.query( '.' ) as record from [MeSH].[DescriptorTemp] cross apply data.nodes( '/DescriptorRecordSet/DescriptorRecord' ) as T( c );
Then we need to index the XML data column so it’s queriable.
create primary xml index DescriptorIdx on [MeSH].[Descriptor]( record );
Finally, clean up.
drop table [MeSH].[DescriptorTemp];
Now it’s ready to use, and I’ll give some handy example queries in the next post.
In theory, adding additional indices to the XML column should improve performance, but I haven’t found a combination of indices yet that really makes my queries much faster. I suspect it depends very much on how your queries work, and in particular how much happens in the XML as opposed to relational parts of the query.