Distributed Cooperative Digital Archives for Scientific and Geospatial Data

Project Award Number: IIS-9876037


Principal Investigator

Bongki Moon
Department of Computer Science
University of Arizona
P.O. Box 210077
Tucson, AZ 85721-0077
Phone: (520) 621-4326
Fax: (520) 621-4246
Email: bkmoon@cs.arizona.edu
URL: http://www.cs.arizona.edu/~bkmoon


Collaborator (Graduate Student)

Quanzhong Li
Department of Computer Science
University of Arizona
P.O. Box 210077
Tucson, AZ 85721-0077
Phone: (520) 621-2759
Fax: (520) 621-4246
Email: lqz@cs.arizona.edu
URL: http://www.cs.arizona.edu/~lqz

Keywords

Web servers, Scientific and spatial database, Digital archive, XML, Scalability.

Project Summary

With the growing popularity of the Internet and the World Wide Web (WWW), which afford unprecedented access to globally distributed information, there is a fast-growing need to make scientific and geospatial data archives accessible through the Internet to those who do not have access to special software or hardware. Web accessibility will be an essential service that future scientific data archives provide to their clients. This research proposes to build a framework for Distributed Cooperative Digital Archives (DCDA) to make such large-scale scientific and geospatial data discoverable and retrievable on the Internet. The DCDA research will enable a collection of independent data archives to share load and thereby improve their collective performance, while allowing individual archives to remain under the control of the corresponding data producers. In particular, this research will address four specific components: XML for metadata description, distributed cooperative web servers, database support and web integration, and scalable data storage and retrieval. The results of this DCDA research will open the door to building a scalable digital archive from a group of distributed, separate and independent archives, thereby integrating and providing access to heterogeneous data sources.

Publications and Products

Publications

  • Philip J. Harding, Quanzhong Li and Bongki Moon, XISS/R: XML Indexing and Storage System Using RDBMS. Proceedings of the 29th Very Large Databases Conference, Berlin, Germany, September, 2003.
  • Quanzhong Li and Bongki Moon, Partition Based Path Join Algorithms for XML Data. Proceedings of the 14th International Conference on Database and Expert Systems Applications (DEXA'2003), Prague, Czech Republic, September, 2003.
  • Taewon Lee, Bongki Moon and Sukho Lee, Bulk Insertion for R-tree by Seeded Clustering. Proceedings of the 14th International Conference on Database and Expert Systems Applications (DEXA'2003), Prague, Czech Republic, September, 2003.
  • Wonik Choi, Bongki Moon and Sukho Lee, Adaptive Cell-Based Index for Moving Objects. To appear in Data and Knowledge Engineering.
  • Quanzhong Li, Ines Fernando Vega Lopez and Bongki Moon, Skyline Index for Time Series Data. Submitted to IEEE Transactions on Knowledge and Data Engineering.
  • Hyoseop Shin, Bongki Moon and Sukho Lee. Tie-Breaking Strategies for Fast Distance Join Processing. To appear in Data and Knowledge Engineering.
  • Quanzhong Li and Bongki Moon. Indexing and Querying XML Data for Regular Path Expressions. Proceedings of the 2001 International Conference on Very Large Databases (VLDB), Rome, Italy, September 2001.
  • Quanzhong Li and Bongki Moon. Distributed Cooperative Apache Web Server. Proceedings of the International Conference on World Wide Web, Hong Kong, May 2001.
  • Bongki Moon, Ines Fernando Vega Lopez and Vijaykumar Immanuel. Scalable Algorithms for Large Temporal Aggregation. To appear in IEEE Transactions on Knowledge and Data Engineering.
  • Hyoseop Shin, Bongki Moon and Sukho Lee. Adaptive Multi-Stage Distance Join Processing. Proceedings of the 2000 ACM SIGMOD Conference, Dallas, TX, May 2000.
  • Bongki Moon, H. V. Jagadish, Christos Faloutsos and Joel H. Saltz. Analysis of the Clustering Properties of the Hilbert Space-Filling Curve. IEEE Transactions on Knowledge and Data Engineering, 13(1):124-141, Jan/Feb. 2001.
  • Scott M. Baker and Bongki Moon. Distributed Cooperative Web Servers. Computer Networks, 31(11-16):1215-1229, 1999.
  • Bongki Moon, Ines Fernando Vega Lopez and Vijaykumar Immanuel. Scalable Algorithms for Large Temporal Aggregation. Proceedings of the 16th International Conference on Data Engineering, San Diego, CA, March 2000.
  • D. Gao, J. A. G. Gendrano, B. Moon, R. T. Snodgrass, M. Park, B. C. Huang, and J. M. Rodrigue, Exploiting Main Memory for Efficient Parallel Aggregation for Temporal Databases. To appear in Distributed and Parallel Databases Journal.
  • Hyoseop Shin, Bongki Moon and Sukho Lee. Adaptive and Incremental Processing for Distance Join Queries. To appear in IEEE Transactions on Knowledge and Data Engineering.

Software Deliverables

Project Impact

The algorithms developed for web document migration and replication can be applied to web-based distributed data management systems. Resource-aware and deletion-aware techniques have been proposed for replicating and replacing documents when server storage is limited. This study provides enabling technologies for building systems that can be used in practical applications. We have also demonstrated the effectiveness of the proposed algorithms with respect to load balancing and scalability by designing and implementing a prototype system running on a cluster of workstations.

One of the major issues facing the molecular biology community is the need for, and the technical difficulty of, setting up and maintaining distributed databases that are linked, coordinated, and integrated over the web as the basis for analyzing and interpreting biological organisms. The Distributed Cooperative Apache (DC-Apache) web server system, which is one of the main components of the proposed research, will provide the necessary infrastructure for searching, browsing and sharing biological and genomic data.

Goals, Objectives and Targeted Activities

The main objective of the proposed research activities is to make scientific and geospatial data archives accessible through the Internet. Given the explosive growth of data traffic on the World Wide Web (WWW), it is crucial for web servers to achieve scalable performance. Overall performance and resource utilization can be improved by spreading document requests among a group of web servers. This led to the design and implementation of the Distributed Cooperative Apache (DC-Apache) web server. We have built the DC-Apache system atop Apache web servers (version 1.3, based on the pool-of-processes model) by augmenting them with new functionality so that individual web servers can cooperate and share the workload as a collective unit.

We have also addressed the issue of storage management for more effective document replication under limited capacity. We have evaluated the DC-Apache system with real-world data sets such as the Sequoia scientific data and with the standard benchmark suite SPECweb99. In all the experiments, the DC-Apache system demonstrated its ability to achieve high performance and scalability by effectively distributing load among a group of cooperating Apache servers and by eliminating hot spots and performance bottlenecks with replicated documents. In particular, the Resource-Aware method, proposed for data replication under limited storage, turned out to be very effective in replicating and replacing documents.

The second major research activity was to develop new techniques to process distance join queries for spatial and multimedia database applications. In on-line query processing and Internet search environments, the spatial distance join is often combined with additional requirements for ranking and stopping cardinality. These requirements pose new challenges as well as opportunities for more efficient processing of spatial distance join queries. We have developed an efficient k-distance join algorithm that uses new plane-sweeping techniques for fast pruning of distant pairs. We have also developed adaptive multi-stage algorithms for k-distance join and incremental distance join operations. Furthermore, we have found that the strategy used to prioritize tied pairs in the priority queue during distance join processing greatly affects performance. We have proposed a probabilistic tie-breaking priority method to address this issue. Our performance study shows that the proposed strategies outperform previous work by up to an order of magnitude for both k-distance join and incremental distance join queries, under various operational conditions.
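
As a rough illustration of the idea behind a k-distance join with sweep-based pruning, the following Python sketch finds the k closest pairs between two in-memory point sets. It is a simplified sketch under stated assumptions, not the algorithm from the publications listed above: the function name k_distance_join, the use of plain 2-D point lists sorted on the x-axis, and the bounded max-heap are all choices made for this example. A pair is skipped as soon as its x-axis gap alone exceeds the current k-th smallest distance.

```python
import heapq
from bisect import bisect_left
from math import hypot, inf


def k_distance_join(R, S, k):
    """Return the k closest (distance, r, s) pairs between point lists R and S.

    Illustrative only: points are (x, y) tuples held in memory, and pruning
    uses the x-axis gap as a lower bound on the true distance.
    """
    if k <= 0 or not R or not S:
        return []

    S_sorted = sorted(S)
    xs = [s[0] for s in S_sorted]
    heap = []  # max-heap of the best pairs so far, stored as (-distance, r, s)

    def cutoff():
        # Current k-th smallest distance, or infinity while fewer than k pairs are held.
        return -heap[0][0] if len(heap) == k else inf

    def consider(r, s):
        d = hypot(r[0] - s[0], r[1] - s[1])
        if len(heap) < k:
            heapq.heappush(heap, (-d, r, s))
        elif d < cutoff():
            heapq.heapreplace(heap, (-d, r, s))

    for r in sorted(R):
        start = bisect_left(xs, r[0])
        # Sweep right from r: stop once the x-gap alone exceeds the cutoff.
        for i in range(start, len(S_sorted)):
            if xs[i] - r[0] > cutoff():
                break
            consider(r, S_sorted[i])
        # Sweep left from r with the same pruning rule.
        for i in range(start - 1, -1, -1):
            if r[0] - xs[i] > cutoff():
                break
            consider(r, S_sorted[i])

    return sorted((-neg_d, r, s) for neg_d, r, s in heap)


R = [(0, 0), (10, 10)]
S = [(1, 0), (10, 12), (50, 50)]
closest = k_distance_join(R, S, 2)
```

In this sketch, a pair whose distance equals the current cutoff is never admitted, so the order in which tied pairs are encountered decides which of them are reported. This is a much simpler setting than the priority-queue tie-breaking studied in the project, but it shows why the handling of ties matters.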

Area Background

A large class of scientific and geospatial applications involves browsing large-scale multidimensional datasets, for example, to analyze remotely sensed data or to visualize the output of scientific simulations. Because of their huge volume, complex structures and divergent access characteristics, such datasets cannot be easily accommodated by conventional storage systems or database management systems. For example, sheer data volume has been one of the major limiting factors for studies involving remotely sensed data. In general, the main challenges in scientific data archives include providing search and browse capabilities for large-scale data archives; handling data in a wide spectrum of complex formats such as binary large objects and multimedia data; supporting data chunking, subsetting, declustering, and efficient access; and managing data produced and stored in distributed environments.

Area References

Potential Related Projects

Project Websites

DC-Apache: Distributed Cooperative Apache Web Server

The DC-Apache system is a scalable web server solution designed to cope with the explosion of data traffic on the World Wide Web. Our solution takes a graph-based approach and is built on the hypothesis that most web sites have only a few well-known entry points from which users start navigating through the site's documents. The DC-Apache system can dynamically manipulate the hyperlinks embedded in web documents in order to distribute access requests among multiple cooperating web servers.
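
The hyperlink manipulation described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the DC-Apache implementation (which is built into Apache 1.3): the function rewrite_links, the replica table, the per-server load counters, the least-loaded selection policy, and the example.org host names are all hypothetical and introduced only for this example.

```python
import re

# Matches site-local links of the form href="/path"; external links are left alone.
HREF = re.compile(r'href="(/[^"]*)"')


def rewrite_links(html, home, replicas, load):
    """Point each local hyperlink at the least-loaded server holding the document.

    html     -- document text served by the home server
    home     -- host name of the home server (always holds every document)
    replicas -- hypothetical table: path -> list of cooperating hosts with a copy
    load     -- hypothetical per-host request counters, updated in place
    """
    def pick(match):
        path = match.group(1)
        candidates = [home] + replicas.get(path, [])
        server = min(candidates, key=lambda host: load.get(host, 0))
        load[server] = load.get(server, 0) + 1
        return f'href="http://{server}{path}"'

    return HREF.sub(pick, html)


page = '<a href="/a.html">A</a> <a href="/b.html">B</a> <a href="http://x.org/">X</a>'
replicas = {"/a.html": ["coop1.example.org"]}
load = {"www.example.org": 5, "coop1.example.org": 0}
rewritten = rewrite_links(page, "www.example.org", replicas, load)
```

Because users are assumed to enter through a few well-known pages, rewriting the links on those pages is enough to steer later requests toward servers that hold replicas, without a central dispatcher in front of the group.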

XISS: XML Indexing and Storage System

XISS utilizes the extended preorder numbering scheme for XML documents. The extended preorder numbering scheme provides a way to encode the elements and attributes in an XML document, such that the ancestor-descendant relationship can be determined quickly and future insertions can be accommodated gracefully. This numbering scheme also provides opportunities for storing XML data using relational databases, which is demonstrated in the XML Indexing and Storage System using RDBMS (XISS/R).
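
To make the numbering idea concrete, here is a small Python sketch of an interval-style preorder numbering. It assumes each node is labeled with an (order, size) pair such that a node's interval encloses the intervals of all its descendants, and that some unused numbers are reserved in each interval for future insertions. The class Node, the functions assign_numbers and is_ancestor, and the slack parameter are names chosen for this illustration; the sketch is not taken from the XISS code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    order: int = 0
    size: int = 0


def assign_numbers(node, start=1, slack=2):
    """Label the subtree in preorder and return the next free order value.

    Each node's interval [order, order + size] covers all of its descendants
    plus `slack` spare numbers reserved for later insertions (an assumption
    of this sketch).
    """
    node.order = start
    next_order = start + 1
    for child in node.children:
        next_order = assign_numbers(child, next_order, slack)
    node.size = (next_order - 1 - node.order) + slack
    return node.order + node.size + 1


def is_ancestor(x, y):
    """Constant-time ancestor test: y's order falls inside x's interval."""
    return x.order < y.order <= x.order + x.size


section = Node("section")
title = Node("title")
chapter = Node("chapter", [section])
book = Node("book", [title, chapter])
assign_numbers(book)
```

With these labels, is_ancestor(book, section) and is_ancestor(chapter, section) hold, while is_ancestor(title, section) does not. Since each label is just a pair of integers, the pairs can be stored as ordinary columns in a relational table, and the ancestor test becomes a range comparison on those columns.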

Online Data

Real and Synthetic Data Sets for XISS Evaluation

This web page includes real-world data sets such as Shakespeare's Plays and SIGMOD Record, and synthetic data sets generated using the News Industry Text Format (NITF) as the DTD.