The Extensible Markup Language XML has become the de facto standard for information representation and interchange on the Internet. In this dissertation, I address the problem of indexing and querying XML in two environments, namely, (a) a traditional environment where data is centrally stored and (b) a growingly popular peer-to-peer (P2P) environment.
In a traditional environment, a core operation is to find all occurrences of a given query pattern in the database. I propose a new way of indexing XML documents and processing query patterns. Every XML document in the database is transformed into a sequence of labels by Prufer's method that constructs a one-to-one correspondence between trees and sequences. During query processing, a query pattern is also transformed into its Prufer sequence. By performing subsequence matching on the set of sequences in the database, and performing a series of refinement phases that I have developed, all the occurrences of a query pattern can be found in the database. Furthermore, I show that all correct answers are found without any false dismissals or false alarms. I present the design, implementation, and experimental evaluation of the PRIX system that I have developed for this purpose.
Coupled with the growing popularity of P2P systems, XML is commonly
used as an underlying data model for P2P applications. Locating relevant data sources across a large number of participating peers is an important challenge. I propose a novel XML indexing scheme based on polynomial signatures for data stored in a P2P network. In the proposed scheme, each XML document is mapped into an algebraic signature that
captures the structural summary of the document. The participating peers in the network collectively maintain a distributed and hierarchical index over the signatures. By virtue of the signature index, the signatures of documents with similar structural characteristics tend to be stored together at the same peer, and a search for document sources is resolved quickly. I present the design, implementation, and empirical evaluation of the psiX system that I have developed for this purpose.