The Thoth Project

Emmanuel.Desmontils@irin.univ-nantes.fr
Christine.Jacquin@irin.univ-nantes.fr

Searching for information on the Internet means accessing multiple, heterogeneous and distributed information sources. Moreover, the data they provide are highly changeable: documents of existing sources may be updated, added or deleted; new information sources may appear while others disappear (definitively or not). In addition, network capacity and quality is a parameter that cannot be entirely neglected. In this context, the question is: how can relevant information be searched for on the web more efficiently? Many search engines help us in this difficult task. Most of them use a centralized database and simple keywords to index and access the information. With such systems, recall is often acceptable; conversely, precision is weak. Work in the multi-agent community (Yuwono and Lee, 1996; Ashish and Knoblock, 1997; Cazalens et al., 2000) shows that an intelligent agent hosted by the web site itself may greatly improve the retrieval process. Such an agent knows the content of its pages and can decide whether or not to answer a query. In other words, web sites become "intelligent" and are able to perform a knowledge-based indexation of their pages. The work presented in this paper addresses this framework: how can a web site know the content of its own pages, and how can ontologies be used for information retrieval? Our project, called Thoth (), aims to answer these questions as far as possible.

Keywords, which are used to index a site in the usual approach, are natural language terms, and such terms are inherently ambiguous. Working at the word level is not enough to disambiguate a query (Gonzalo et al., 1998) or to index a document (Luke et al., 1996; Fensel et al., 1998). To improve the information retrieval process on the web, we can work at the conceptual level. To this end, ontologies are often used. They are concept hierarchies that provide the common vocabulary of a specific domain. For instance, a hierarchy such as Yahoo!'s could be considered a simple ontology (Labrou et al., 1999). In information retrieval, these ontologies are used in a natural language context (web pages are written in natural language). For this reason, an ontology concept is represented by a label, which is usually a term. These labels are the bridge between the words in a page and the associated ontology concepts, so a label should ideally correspond to a single meaning. However, in natural language a term may have several meanings, and a meaning may be expressed by several terms. A linguistic ontology can therefore help with this disambiguation problem. In addition, extracting from the pages precise indexes that represent their content can improve the retrieval process: since web pages are written in natural language, the concepts they contain can appear in different forms (synonyms, for example). Natural language techniques, such as terminological extraction and word similarity measures, can greatly help to collect all these forms and determine the associated concepts.
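The term/concept ambiguity described above can be made concrete with a small sketch. The ontology, its concept names and the synonym sets below are invented for illustration (they are not Thoth's actual ontology): a single term such as "chair" maps to several concepts, which is precisely why a further disambiguation step is needed.

```python
# Minimal sketch: mapping page terms to ontology concepts via synonym sets.
# The concepts and synonym lists are illustrative assumptions, not Thoth's data.

ONTOLOGY = {
    "course": {"course", "class", "lecture"},
    "seat": {"chair", "seat"},
    "professor": {"professor", "chair", "faculty member"},
}

def candidate_concepts(term):
    """Return every concept whose synonym set contains the term.

    An ambiguous term yields several candidates, so a later step
    (e.g. frequency- or context-based) must pick the right sense."""
    term = term.lower()
    return sorted(c for c, synonyms in ONTOLOGY.items() if term in synonyms)

print(candidate_concepts("chair"))    # ambiguous: two candidate concepts
print(candidate_concepts("lecture"))  # unambiguous
```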

Web site content highlight

In information retrieval, the major problem is to determine the specific content of documents. To highlight the content of a web site according to an ontology, we propose a semi-automated process that provides a partial indexation by content of the site using natural language techniques. We argue that such a process is necessarily semi-automatic: only the user can finalize it, correcting errors and sometimes completing it with manual extensions concerning specialized concepts.

The inputs of this ontology-based indexation process are (1) typical ontologies covering the knowledge to highlight and (2) the set of HTML pages of a web site. Briefly, an ontology is a set of concepts connected by relations. Our ontologies are enriched using a thesaurus so as to associate with each concept label the set of its possible synonyms. The ontologies are used to index the site (figure 1); that is, we associate with each concept of the ontology the pages where it can be found. We call this process the ontology indexation of web pages. First, we extract from the HTML pages, using natural language techniques, indexes that point out the site content. An index of a web page is a term (a set of words) that may denote a significant concept of this page. Second, we try to determine the concepts these indexes could represent. Then we match them with the concepts of the selected ontologies, and a set of measures is computed to evaluate the result. Finally, either the process ends (the measures satisfy the user) or another indexation process is started with a new ontology (if the results do not suit him/her).
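The extract-then-match loop above can be sketched as follows. This is a deliberately crude illustration, assuming an invented ontology and a whitespace-level tokenizer; Thoth's actual term extraction relies on richer terminological techniques than this.

```python
import re
from collections import Counter

# Illustrative sketch of the indexation loop: extract candidate index
# terms from a page, then weight the ontology concepts they match.
# The ontology below is an invented example, not Thoth's.
ONTOLOGY = {
    "university": {"university", "college"},
    "course": {"course", "lecture"},
    "student": {"student", "students"},
}

def extract_indexes(html):
    """Crude candidate-index extraction: lowercase word frequencies."""
    return Counter(re.findall(r"[a-z]+", html.lower()))

def index_page(html):
    """Associate ontology concepts with a page, weighted by term frequency."""
    freqs = extract_indexes(html)
    total = sum(freqs.values()) or 1
    weights = {}
    for concept, synonyms in ONTOLOGY.items():
        hits = sum(freqs[w] for w in synonyms)
        if hits:
            weights[concept] = hits / total
    return weights

page = "<p>Students follow each course at the university.</p>"
print(index_page(page))
```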

Figure 1. The general process
Our process does not change the web site data (HTML pages, news archives...) but creates an additional XML file (W3C, 1998). This file refers to a DTD close to the one used in the SHOE project (Luke et al., 1996). The result of our process is a set of ontologies that refer to web pages; these ontologies are also described in a separate XML file. The HTML pages thus contain only the owner's information, so they are easy to manage without considering the added knowledge. In addition, we can rebuild the added knowledge without modifying the web pages. Finally, every web browser can display the HTML pages without annotations that might cause problems. Therefore, our process is not an annotation process (where concepts of the ontology are inserted in the text) but an indexation process (where ontology concepts point out the web pages in which they appear). This allows us, when searching for a given concept, to access the relevant pages directly instead of parsing all the pages of the site. It also allows us to qualify answers with a coefficient of convenience.
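The separation between the untouched HTML pages and the standalone index file can be sketched as below. The element and attribute names are invented for illustration; the actual Thoth DTD (close to SHOE's) may differ.

```python
import xml.etree.ElementTree as ET

# Sketch: serialize a concept->pages index to a standalone XML document,
# leaving the HTML pages themselves untouched. Element and attribute
# names here are illustrative assumptions, not the actual Thoth DTD.
def index_to_xml(index):
    """index: {concept: [(url, convenience_coefficient), ...]} -> XML string."""
    root = ET.Element("ontology-index")
    for concept, pages in sorted(index.items()):
        c = ET.SubElement(root, "concept", name=concept)
        for url, coeff in pages:
            ET.SubElement(c, "page", href=url, convenience=f"{coeff:.2f}")
    return ET.tostring(root, encoding="unicode")

xml = index_to_xml({"university": [("http://example.org/index.html", 0.75)]})
print(xml)
```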

To detail this process, we also present:

Some results of the general process

To evaluate the performance of our indexation process, we tried it with Thoth's University ontology and three web sites. Washington was chosen for its a priori high relevance to this ontology; conversely, the two other sites were chosen for their a priori weak relevance to it. For each web site, we ran our indexation with a frequency threshold going from 0 (all senses are accepted in each page) to 1 (only the most important ones are accepted in each page). For each run, we examined the covering degree, the direct indexation degree and the global indexation degree. Figure 2 shows the results for Washington, Cooking with Kids and Sofa and Chair. Appendix 3 shows an extract of the resulting XML file for the Washington web site with a frequency threshold of 0.5.
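Measures of this kind can be sketched as follows. The exact definitions of the covering and indexation degrees are not reproduced here, so the formulas below are plausible illustrative assumptions only, not the paper's definitions.

```python
# Hedged sketch of two plausible evaluation measures. These formulas are
# illustrative assumptions; the paper's actual definitions may differ.

def covering_degree(ontology_concepts, index):
    """Assumed: fraction of ontology concepts that index at least one page."""
    used = [c for c in ontology_concepts if index.get(c)]
    return len(used) / len(ontology_concepts)

def direct_indexation_degree(pages, index):
    """Assumed: fraction of site pages reached directly by some concept."""
    reached = {p for concept_pages in index.values() for p in concept_pages}
    return sum(1 for p in pages if p in reached) / len(pages)

concepts = ["university", "course", "student", "laboratory"]
pages = ["a.html", "b.html", "c.html"]
index = {"university": ["a.html"], "course": ["a.html", "b.html"]}
print(covering_degree(concepts, index))        # 2 of 4 concepts used
print(direct_indexation_degree(pages, index))  # 2 of 3 pages reached
```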


Figure 2. Indexation results

Appendix

References
