| Knowledge Management in SQL |
We are maintaining medical concept dictionaries and ontologies in production grade
relational database management systems (RDBMS.) In the past, RDBMS did not
support transitive relational structures and had therefore been unsuitable for
managing knowledge bases. The revised SQL-99 standard, however, may change
this. In this paper we show that modern RDBMS that support recursive queries
are capable of querying transitive relationships in a generic data model. We
show a simple but efficient indexed representation of transitive closure. We
could confirm that even challenging combined transitive relationships can be
queried in SQL.
IntroductionConceptual graphs are an approach to organize categorical knowledge about concepts. Terminologies, ontologies, and conceptual graphs are quite active areas of research in medical informatics. Many research systems to manage these conceptual graphs (e.g., GRAIL[1], PROTÉGÉ-II[2], LUME[3]) are stand-alone systems (often implemented in LISP) that support expressive description logics and inference, but that have limited data management features. These systems mostly depend on the entire conceptual graph to be loaded into memory. Their innovativeness and power notwithstanding, these specialized research systems do not blend very well into a production data management system. Relational Database Management Systems (RDBMS) on the other hand are a theoretically sound and industrially proven technology for managing large amounts of shared data,[4],[5] but RDBMS have not been used for managing knowledge bases. This is because most RDBMS do not support transitive closure queries, which are an indispensable tool for managing concept networks.[6] However, RDBMS are catching up and it is time to revisit the practicality of maintaining extensive medical knowledge in RDBMS. In this paper, we outline an approach for managing conceptual graphs in an industry supported RDBMS. As we are currently migrating the Regenstrief Medical Record System[7] (RMRS) and its medical concept dictionary to industry standard relational data architecture, this work is a study in the possibilities and limitations of RDBMS meeting the needs of generic data modeling and conceptual graphs. The RMRS dictionary contains approximately 25000 concepts, with approximately 1000 concepts added every year since 1978. In addition we maintain the rapidly growing Logical Observations Identifiers Names and Codes[8] (LOINC) and we are referring to external terminologies, such as ICD-9, SNOMED, and Medispan GPI for use in our medical record system. In addition, having been influential in transforming the HL7 Reference Information Model (RIM) to an abstract generic data model,[9],[10] we are investigating the feasibility and costs of straightforwardly implementing this model in an RDBMS. Example ProblemsThroughout this paper we will use an example database defined by the following SQL data definition (see also Figure 1):
CREATE TABLE Concept (
CREATE TABLE
Relationship ( This data model pattern underlies many generic databases found in the literature, including the HL7 RIM. Databases according to this model are directed graphs with labeled nodes and arcs. In our example we call the node “Concept” and the arc “Relationship”. Attributes as RelationshipsThe generic relationship-table handles all relationships between concepts, including those simple n-to-1 relationships that could be represented as direct attributes (self-referential foreign keys) in the concept table without using the relationship table. The advantage of always using the link table is that we can query our concept database for any kind of relationships between concepts. Thus, we can isolate concepts in the database that are never used (and eliminate those concepts), or we can focus the maintenance effort on those concepts that are most frequently referenced. Transitive Closure QueriesThe generalization relationship (‘ISA’) is probably the most commonly used relationship in concept databases. One common use case for querying ISA links is to test whether a given concept matches an expected concept. This requires querying the ISA relationship between the two concepts. The ISA relationship however is transitive. For example, if gastric mucosa ISA columnar epithelia, and columnar epithelia ISA epithelia, the gastric mucosa ISA+ epithelia (we note transitive relationships using a superscript plus (+) after the relationship type symbol.) With the SQL-92 standard,[11] implemented by most commercial RDBMS today, one cannot query the complete transitive closure. Instead one would either write complex queries that allow a transitive chain up to a certain length, or one would need to write a program that would walk the links to form the transitive closure. Programmatically following links is not only cumbersome (requiring programming instead of just ad-hoc queries,) it is also slow since many queries and their responses need to be communicated between the database client and server. In the new standard SQL-99[12] (also known as “SQL 3”) transitive closure can be queried using named temporary tables. For example, to find out if one concept (spec) is the transitive specialization of another concept (gen) we construct the successor set of the transitive ISA relationship and test whether gen is in that successor set:
WITH tmp(id) AS ( Transitive closure queries in SQL-99 use the WITH-clause that defines a temporary table (here named “tmp”) as the union of two queries. The first part constructs a starter set, while the second part joins the current temporary table with other tables to add more items into the temporary table. This process is repeated until no more items to are found to join. The final query takes the completed temporary table and answers the original question – in this case whether gen is among the generalizations of spec. This example shows that when querying the relationship table, a join with the concept table itself is normally not needed. Thus, the slight disadvantage that we found when modeling attribute links with the relationship table is of no concern if we perform extensive graph traversal. InheritanceThe ISA link logically implies that the specialization inherits all relationships from its generalizations. To find all the transitively inherited relationships of a concept spec we construct the successor set of the ISA+ relationship and then join with the other relationships:
WITH tmp(id) AS ( Besides ISA relationships, other important transitive relationships exist, such as PART-OF (e.g., the aortic valve is PART-OF the heart and the heart is PART-OF the cardiovascular system, hence is the aortic valve PART-OF+ the cardiovascular system.) The causal link (CAUSES) is also transitive (“chain of causality”.) Transitive Closure OptimizationTransitive closure queries are expensive, even with support from an SQL-99 compliant RDBMS and for online applications one must consider techniques for optimization. Many transitive closure algorithms had been described in the recent years,[13] yet most of these algorithms are most sensibly implemented in the core of the RDBMS. As a client-side program, performance of these algorithms is severely impaired by the fact that all of those algorithms have to read the entire relationship table at least once, and this is slowed down by both disk access and network traffic. Materialized Transitive ClosureOne way of optimizing transitive closure is to materialize all transitive relationships in an “ancestor-descendant-table.” To do this, we would add a column to our relationship table distinguishing a direct relationship from a cached transitive relationship:
ALTER TABLE Relationship If the trans_ind value is ‘+’ we know that the link is materialized transitive link. One could update the transitive links either on a regular basis or using a database trigger that computes the transitive closure whenever a new relationship is inserted or an existing relationship deleted. One can build the initial transitive closure table using a variant of the query above:
INSERT INTO Relationships The materialized representation is simple, and queries written against the simple relationship table will continue to work against the transitive relationship table. The disadvantage, however, is that the materialized transitive closure can grow very large and may be impractical with large concept databases. Tree Addresses
ISO Object Identifiers (OID) or IP addresses are the two prototypical examples of tree-address encodings: OIDs are simply dotted lists of numbers with no limitation to the size of each number or the length of the list. IP addresses are fields of 32 bit, where the bits are grouped to address each step in the path. Such bit-fields are efficient to compare and operate on but allow only a limited depth and limited breadth in each level of the graph. On the other hand, OID type addresses are less efficient to operate on and can easily grow quite long. The Interval Method
The interval method assigns an integer-interval to each
link in the graph. For simplicity we require the graph to be a proper tree,
i.e., there must be a single root and besides the root every node in the tree
must have exactly one outgoing link. A more general interval method exists that
works for all graphs,13,[14]
but with the tree assumption the representation and algorithms become very
simple and practically useful.[15]
ALTER TABLE Relationship
CREATE INDEX ON The transitive closure query to find all the generalizations of a concept x is then as simple as:
SELECT gen.* This query can be answered rapidly using only two index scans: one quick lookup to find the first ISA link for the concept X followed by an index scan to find only the concepts which include the given concept’s interval. DiscussionThe interval method is a very efficient transitive closure optimization, because unlike tree addresses the intervals’ storage requirements depend only on the number of nodes in the graph, independent of the depth or balance issues. Both tree addresses and the interval method have some limitations. First, the graph must be a proper tree with a single root. Completing a poly-tree (“forrest”) structure to a singly rooted tree is easy, we just add a root concept (with id = -1). For each transitive relationship type we link all concepts that are not the source of such relationship directly to this root node. We require these root links to store the interval numbers for the concepts at the source of each link. In the simple form shown here, the interval method does not work for cyclic graphs and it does not support, multiple outgoing links per concept (no multiple inheritance.) For concept relationships, we believe multiple inheritance to be much less valuable than it might seem from the many uses of the ISA relationship reported in the literature. For example, it is tempting to define heart ISA cardiovascular organ when in fact what’s being said is that the heart is PART-OF the cardiovascular system. We think the ISA link should be restricted to expressing only substantial relationships defined by the very nature of primary concepts. To that end, concepts that essentially are groupings of other concepts should not be defined as abstract concepts at all. For example, one should define the cardiovascular system (concrete concept) but not cardiovascular organ (abstract concept), whose only definition is: an organ that is PART-OF the cardiovascular system. The same argument can be made for diseases; instead of defining myocardial infarction (M.I.) ISA cardiovascular disease, one should say that M.I. is CAUSED-BY ischemia of the myocardium, which is PART-OF the heart, etc. Interestingly, while the anatomy section of SNOMED RT is quite well designed regarding our criterion, the disease section unnecessarily uses the ISA relationship; e.g., M.I. ISA ischemic heart disease. While one can create arbitrary classifications of concepts based on any attribute or relationship, the ISA link should be reserved for substantial relationship only. Hence, we do not believe that abandoning the option for multiple inheritance is a great loss. When we encounter a concept with two ancestors, we ask whether this is due to arbitrary classification. If both relationships appear substantial, we ask whether the two ancestor concepts might themselves be related (thus the multiple ISA links might be a materialized transitive relationship.) Interactions between Relationships
To test for such composite relationships in our representation we combine the intransitive HAS-LOC relationship with the transitive closure query for the transitive PART-OF relationship and the inheritance through ISA links. Given the conceptual graph in Figure 3 to test whether a femur-shaft-fracture (fsf) is a fracture (frac) of the femur (f) we have to find whether fsf ISA+ frac and fsf (HAS-LOC ° PART-OF+) femur, i.e., that there exist two concepts x and y where fsf ISA+ x which HAS-LOC y which is PART-OF+ femur.
SELECT Px.*, Py.* This query is a join of the same relationship table repeated six times. The complexity however is not as high as it seems, since all but x and y are bound to specific concepts, so the search space is not large and all searches are efficient index lookups. Consider the question “is femur-shaft-fracture a fracture of a bone?” The first part of the question is answered as shown above, i.e. fsf ISA+ frac and fsf ISA+ x where x HAS-LOC y. However, answering whether y is part of a bone requires the transitive closure over the combination of at least one PART-OF link and optional ISA links in any order. For example, possible paths between y and bone could be “y ISA+ diaphysis PART-OF+ long-bone ISA+ bone”; and “y PART-OF+ long-bone ISA+ bone.” This query cannot be answered in a simple join but requires recursion again:
WITH tmp AS ( The above query is only a sketch of the working query; in reality we need a longer query statement preventing infinite recursion and testing whether the path really contains at least one PART-OF link.[17] ConclusionWe have shown that modern RDBMS that support recursive queries are capable of querying certain conceptual graphs structures with only reasonable restrictions. We have further shown a simple but efficient indexed representation of transitive closure. The restrictions on the graph structures (tree-requirement) are no severe shortcoming, in part because the total graph is partitioned per each relationship type. Many more transitive closure optimizations are available, yet none of them is as simple as the one shown. Although there is still much potential left for improving the storage and query optimizers for transitive relational structures in RDBMS, we could confirm that even challenging combined relationships can be managed and queried in a commercial grade RDBMS. We do not pretend to have implemented a sophisticated subsumption and classification algorithm; however our current EMR system does not depend on such algorithms and classifiers. For us and at this time it is more important to free the data from specialized applications and into an open industry supported data architecture. AcknowledgementsThis work has been performed at the Regenstrief Institute for Health Care with support in part by the National Library of Medicine (Contract #N01-LM-6-3546). [*] An algorithm for assigning the intervals is described by Kamfonas15, we found a way to assign the intervals using only SQL statements but we still require an explicit iteration. [1] Rector AL, Bechhofer S, Goble CA, Horrocks I, Nowlan WA, Solomon WD. The GRAIL concept modelling language for medical terminology. Artif Intell Med. 1997 Feb;9(2):139-71. [2] Musen MA, Gennari JH, Eriksson H, Tu SW, Puerta AR. PROTEGE-II: computer support for development of intelligent systems from libraries of components. Medinfo. 1995;8 Pt 1:766-70. [3] Schulz S, Romacker M, Hahn U. Knowledge engineering the UMLS. Stud Health Technol Inform. 2000;77:701-5. [4] Codd EF. A relational model of data for large shared data banks. Communications of the ACM. 1970 Jun;13(6):377-387. [5] Date CJ, Darwen H. Foundations for object/relational databases: The third manifesto. Reading, MA; Addison-Wesley, 1998. [6] Zhe L, Ross KA. On the cost of transitive closures in relational databases [technical report CUCS-004-93]. 1993 Feb. New York; Computer Science Department, Columbia University. [7] McDonald CJ, Overhage JM, Tierney WM, Dexter PR, Martin DK, Suico JG, Zafar A, Schadow G, et al. The Regenstrief Medical Record System: a quarter century experience. Int J Med Inf. 1999 Jun;54(3):225-53. [8] Huff SM, Rocha RA, McDonald CJ, De Moor GJ, Fiers T, Bidgood WD, Forrey AW, Francis WG, et al. Development of the Logical Observation Identifier Names and Codes (LOINC) vocabulary. J Am Med Inform Assoc. 1998 May-Jun;5(3):276-92. [9] Schadow G, Russler DC, McDonald CJ. Conceptual integration of guidelines and workflow into the electronic health record. Stud Health Technol Inform. 2000;77:807-11. [10] Russler DC, Schadow G, Mead C, Snyder T, Quade LM, McDonald CJ. Influences of the Unified Service Action Model on the HL7 Reference Information Model. Proc AMIA Symp. 1999;:930-4. [11] International Organization for Standardization: Information technology – Database languages – SQL [ISO/IEC 9075:1992]. 1992, Geneva, Switzerland; The organization. [12] International Organization for Standardization: Information technology – Database languages – SQL – Part2: Foundation [ISO/IEC 9075-2:1999]. 1992, Geneva, Switzerland; The organization. [13] Nuutila E. Efficient transitive closure computation in large digraphs [dissertation]. 1995. Helsinki, Finnland; Helsinki University of Technology. [14] Agrawal R, Borgida A, Jagadish HV: Efficient management of transitive relationships in large data and knowledge bases. ACM-SIGMOD 1989 Int'l Conf. on Management of Data, Portland, Oregon, 1989. [15] Kamfonas MJ. Breaking the relational taboo [online]. Intel Enterp Database Progr Design. URL: http://www.dbpd.com/vault/9811/kamfn.shtml. [16] Horrocks I, Rector AL, Goble C. A description logic based schema for the classification of medical data. 1996. Proceedings of the 3rd Workshop KRDB'96:24-28. [17] All queries discussed here are available from: http://aurora.regenstrief.org/recursive-SQL.html [GS1]Isn’t this inheritance the other way than the ISA inheritance? In ISA the source gets the properties of the target, in this construct here the end of the has-location link is conducted upwards through the part-of links. |