Querying MeSH with XQuery and SQL

This is the third in a series of posts on MeSH headings and how they can be manipulated in XML form in XML databases like SQL Server. Before reading this, you’ll want to read the previous post on loading the data, which in turn refers to the original post where I described the schema.

Most of the queries shown below use a combination of XQuery and Transact-SQL (Microsoft’s dialect). Some of them might well be achievable in pure XQuery, but in practice I found that it was a lot easier to express my intention using hybrid queries.

A similar compromise exists in the storage format. I originally loaded the XML data into a one-column, one-row table — that is, the entire MeSH tree in a single cell. However, many queries turned out to be rather awkward like this, so I chopped up the tree into individual descriptor records, and held each such record in its own row in the table. I also ‘hoisted’ the unique identifier for each descriptor out of the XML and into a separate char column, for ease of access. If I was working on a system that relied on repeated execution of similar queries I’d probably ‘hoist’ more elements out of the XML and into their own columns (or even lookup tables), for performance and ease of access. This is a good compromise between a pure XML/XQuery solution, and converting your XML schema into an entirely relational model.

The queries

First up, a few checks to make sure everything looks happy:

--  Display the schema, reconstructed from SQL Server's internal representation

select xml_schema_namespace( 'MeSH', 'DescriptorSchema' );

-- Count number of descriptors to check we've loaded everything
-- correctly: should be 24767 (April 2008 MeSH)

select count( * ) from [MeSH].[Descriptor];

-- Select a single descriptor as an XML fragment

select top 1 * from [MeSH].[Descriptor];

select * from [MeSH].[Descriptor] where ui = 'D009125';

There’s essentially two ways to retrieve XML data, as XML fragments or cast into native SQL data types:

-- Get all tree numbers (as XML lists)

with xmlnamespaces( default 'uri:mesh' )
select ui, record.query( '
DescriptorRecord/TreeNumberList/TreeNumber
' ) as TreeNumber
from [MeSH].[Descriptor];


-- Get all tree numbers (flattened out)

with xmlnamespaces( default 'uri:mesh' )
select ui, c.value( '.', 'char( 7 )' ) as TreeNumber
from [MeSH].[Descriptor]
cross apply record.nodes( '
DescriptorRecord/TreeNumberList/TreeNumber
' ) as T( c );

The last two queries demonstrated pretty simple XQuery fragments embedded in SQL (the DescriptorRecord/TreeNumberList/TreeNumber bits which define the elements of interest). Of course, these can be as complex as you like. The next one shows how to look for all String elements that occur in the names of descriptors, concepts or terms — or more specifically, which aren’t in the context of qualifiers or related terms (which I’m not covering yet):

-- Get all descriptor/concept/term strings (distinct per descriptor)

with xmlnamespaces( default 'uri:mesh' )
select distinct ui, c.value( '.', 'varchar( 4000 )' ) as String
from [MeSH].[Descriptor]
cross apply record.nodes( '
DescriptorRecord//String
	[ local-name( .. ) != "QualifierName"
		and local-name( ../.. ) != "DescriptorReferredTo" ]
' ) as T( c );

One useful starting point for these queries is to find all descriptors who match a given substring (at the level of descriptor name, concept names or terms). However, somehow Microsoft managed to make the XQuery contains function case-sensitive, without implementing functions to force lower or upper case, so to do case-insensitive string comparisons you have to do them with likeoutside of the XQuery:

-- Get all descriptors matching a query string (at D, C or T level)

declare @query varchar( 50 );
set @query = '%muscle%';
with xmlnamespaces( default 'uri:mesh' ),
T1 as
(
	select ui, c.value( '.', 'varchar( 4000 )' ) as String
	from [MeSH].[Descriptor]
	cross apply record.nodes( '
	DescriptorRecord//String
		[ local-name( .. ) != "QualifierName"
			and local-name( ../.. ) != "DescriptorReferredTo" ]
	' ) as T( c )
)
select ui from T1
where String like @query;

So that’s one of the hierarchies within MeSH, the descriptor-concept-term hierarchy. Traversing the other hierarchy, the tree of TreeNumbers, is easy to achieve using substring comparisons, either in XQuery or SQL. Putting these together, you can extract data from all the descriptors, concepts and terms within a particular branch (or branches) of the tree:

-- Get all strings at or under a given tree number (remove % from
-- end of @query to restrict to exact node without descendents)

declare @query varchar( 50 );
set @query = 'A02.633%';
with xmlnamespaces( default 'uri:mesh' ),
T1 as
(
	select ui, c.value( '.', 'varchar( 50 )' ) as TreeNumber
	from [MeSH].[Descriptor]
	cross apply record.nodes( '
	DescriptorRecord/TreeNumberList/TreeNumber
	' ) as T( c )
),
T2 as
(
	select ui, c.value( '.', 'varchar( 4000 )' ) as String
	from [MeSH].[Descriptor]
	cross apply record.nodes( '
	DescriptorRecord//String
		[ local-name( .. ) != "QualifierName"
			and local-name( ../.. ) != "DescriptorReferredTo" ]
	' ) as T( c )
)
select T2.ui, T2.String
from T2 inner join T1 on T2.ui = T1.ui
where T1.TreeNumber like @query;

Finally, the query that I needed in the first place, which inspired all this hacking. Given an input string, I wanted to get all the strings in subtrees headed by descriptors matching the input string. This involves first identifying all the descriptors that match the input string, at any descriptor/concept/term level, retrieving all of their TreeNumbers, and then extracting the strings associated with these descriptors and any other descriptors that are descendents of these in the TreeNumber hierarchy.

This requires one expression (T1) that models the TreeNumber hierarchy, and another (T2) to model the hierarchy internal to each descriptor, just like the previous query. However, in this case we need two copies of each of these in the SQL join that brings it all together. This is because the query maps from input string -> TreeNumbers -> descendent TreeNumbers -> output strings. So it’s a little more complicated than before:

-- Get all strings at or under tree nodes matching a given query string
-- (remove %s from @query to match exact string only, no substring matches)

declare @query varchar( 100 );
set @query = '%Muscle%';
with xmlnamespaces( default 'uri:mesh' ),
T1 as
(
	select ui, c.value( '.', 'varchar( 50 )' ) as TreeNumber
	from [MeSH].[Descriptor]
	cross apply record.nodes( '
	DescriptorRecord/TreeNumberList/TreeNumber
	' ) as T( c )
),
T2 as
(
	select ui, c.value( '.', 'varchar( 4000 )' ) as String
	from [MeSH].[Descriptor]
	cross apply record.nodes( '
	DescriptorRecord//String
		[ local-name( .. ) != "QualifierName"
			and local-name( ../.. ) != "DescriptorReferredTo" ]
	' ) as T( c )
)
select distinct T2_out.ui, T2_out.String
from T2 T2_out inner join T1 T1_out on T2_out.ui = T1_out.ui
	inner join T1 T1_in on T1_out.TreeNumber like T1_in.TreeNumber + '%'
	inner join T2 T2_in on T2_in.ui = T1_in.ui
where T2_in.String like @query
order by String asc;

I’m not going to give a line-by-line breakdown of how this works, but if you familiarize yourself with the SQL Server XQuery docs it should become clear. Leave a comment if you get stuck though. Have fun!

Andrew.

Advertisements

One Response

  1. […] Now it’s ready to use, and I’ll give some handy example queries in the next post. […]

Comments are closed.

%d bloggers like this: