The extensible markup language XML
has become the most popular standard for information representation and exchange for numerous applications. XML documents can be modeled as ordered labeled trees and queries can be expressed in languages such as XPath
to retrieve documents based on both their structure and values. For example, an XPath query //book[year="2005"][author="Jack"]
selects the node <book>
in the XML document shown below.
We have developed a system called PRIX (Prüfer sequences for Indexing and Querying XML) for indexing and querying XML data. In PRIX, queries can be posed as twig patterns that denote a subset of the XPath language. PRIX finds all occurrences of a twig pattern in the data. Every XML document is transformed into a sequence of labels by Prüfer's method that constructs a one-to-one correspondence between between trees and sequences. During query processing, a twig pattern is also transformed into its Prüfer sequence. By performing subsequence match on the set of sequences, and by performing a series of refinement phases that we have developed, we can find all occurrences of a twig pattern in the database. Our approach allows holistic processing of twig patterns without breaking them into root-to-leaf paths and processing them individually.
Sequencing XML Documents for Indexing
In PRIX, a collection of XML documents is transformed into sequences by Prüfer's method. Prüfer's method constructs a one-to-one mapping between a labeled tree and a sequence by removing nodes from the tree one at a time. The algorithm to construct a sequence from a tree with n nodes labeled from 1 to n works as follows. A leaf node with the smallest number is chosen and deleted, and the parent of this node is noted down. This process is repeated successively until only one node is left. The sequence of numbers, noted down during deletion, denotes the Prüfer sequence of the tree. The original tree can be reconstructed from this sequence.
In PRIX, the set of LPS's for the documents are indexed by building a disk-based virtual trie using B+trees
[Rao and Moon 2006]. The set of NPS's are stored in the database (e.g., as records in a heap file).
Example: Consider a tree T with nodes labeled uniquely with postorder numbers. Prüfer sequence can be constructed for T using the node removal method described above. This sequence consists entirely of postorder numbers and is called Number Prüfer Sequence (NPS). If each number in this NPS is replaced by its corresponding tag, a new sequence that consists entirely of XML tags called Labeled Prüfer Sequence (LPS) can be constructed. For tree T, LPS(T) = A C B C C B A C A E E E D A, and NPS(T) = 15 3 7 6 6 7 15 9 15 13 13 13 14 15. Extended LPS and NPS can be constructed for tree T. First each leaf node is extended with a dummy child node and all the nodes are numbered in postorder. Then the extended tree is transformed into sequences. As a result, all the labels in the original tree appear in the LPS and NPS.
Finding Twig Matches in PRIX
The work focuses on ordered twig pattern matching where the nodes in a twig pattern follow the document order in XML. During query processing, a twig pattern is also transformed into its LPS and NPS. We have proved that by performing subsequence matching, followed by (b) refinement-by-connectedness, and
(c) refinement-by-structure, PRIX can find all occurrences of a twig pattern in an XML database
[Rao and Moon 2006]
. There are no false alarms
and no false dismissals
. The key ideas in each of these phases are summarized below. For complete details, refer to our journal paper [Rao and Moon 2006]
Let us construct extended Prüfer sequences for the data trees and twig patterns.
A query twig Q is first transformed into its LPS and NPS. Then following three phases are performed to find all twig occurrences.
- Filtering by subsequence matching
- All subsequences in the data sequences that match LPS(Q) are found. This filtering phase finds a superset of the true matches. A virtual disk-based trie is used for efficiently finding matching subsequences.
- Each matching subsequence identified in phase (1) is tested for connectedness. The NPS of the corresponding data sequence is used to determine if the nodes in the data that match the query twig nodes are indeed connected. Such matches are passsed to the next phase.
- The matches from phase (2) are further refined by examing their structure with the query twig structure. This phase involves testing for gap consistency and frequency consistency.
PRIX can process query twigs with wildcard '*' and axis '//' with some modifications to the refinement-by-connectedness phase.
The subsequence matching phase is I/O bound. To further speed up subsequence matching, PRIX performs two optimizations. The details are provided in our journal paper [Rao and Moon 2006]
Bi-directional subsequence matching - Rather than performing subsequence matching from left-to-right for a query sequence, PRIX chooses a pivot with high selectivity and finds subsequences from the pivot in both directions. The results are merged to obtain all matching subsequences. The intuition here is that since the pivot has higher selectivity that other elements in the sequence, fewer nodes are explored in the virtual trie resulting in reduced I/O cost.
Use of MaxGap metric - During filtering, based on the postorder numbers of the data trees, PRIX avoids exploring certain partial subsequences that are false matches eventually. As a result, the I/O cost is reduced.
You can obtain the initial release of PRIX (version 1.0) by visiting the Download page.
You can find the recent updates to our software on the Changes page. For questions and bugs,
please email email@example.com.
PRIX is under active development. As part of our future work, we plan to extend PRIX to support unordered twig pattern matching.