Swimmingly Effective Searches
By Michael Kraft
TODAY'S information-centric world is burying the legal community in electronic data. Successful "data mining" depends on effective information retrieval. How good are the search tools available today?
Basically, there are three types of document retrieval systems: categorization, free-text search, and contextual profiling. Most search engines use a combination of these.
The oldest search technology is categorization. This works like a filing cabinet. The document's author or the computer decides which category best describes the document. When the searcher defines a query, the system assigns the query term to a category and looks for it in that category or subcategory. (Yahoo is an example most people are familiar with.) If you've ever searched for something in someone else's files, you've experienced the problems intrinsic to categorization. Different people categorize the same document differently about 80 percent of the time.
Convera's Excalibur, Autonomy's namesake product and Brightstation's SmartLogik all use some sort of categorization, though the technology they use to create the categories varies.
Excalibur uses generic and custom-built semantic networks. These networks define the relationships between words: for example, that "bank" is related to "money," to "saving," to "finance," etc. Creating these networks can be labor-intensive.
Autonomy and SmartLogik both use advanced pattern recognition technology, referred to later as contextual profiling, to create categories. These systems compute the probability that a given document satisfies a search criterion, or that a given document is a member of a specific category. This automates the categorization process and maximizes the likelihood of a correct classification.
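To give a feel for what "computing the probability that a document is a member of a category" can look like, here is a minimal sketch in Python. It uses a naive Bayes classifier with add-one smoothing as a stand-in; this is an assumption for illustration only, not a description of how Autonomy or SmartLogik actually work, and the training sentences and category names are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Count words and documents per category from (category, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def category_probabilities(text, word_counts, doc_counts):
    """Estimate P(category | text) with naive Bayes and add-one smoothing."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        log_scores[category] = score
    # Turn the log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}

# Invented training examples, for illustration only.
training = [
    ("patent", "The inventor filed a patent claim for the device."),
    ("patent", "The patent examiner rejected the claim."),
    ("finance", "The bank raised interest rates on savings."),
    ("finance", "The fund reported strong quarterly earnings."),
]
word_counts, doc_counts = train(training)
probs = category_probabilities("patent claim for a new device", word_counts, doc_counts)
print(max(probs, key=probs.get))  # the most probable category
```

The result is a probability for every category, not a yes-or-no answer. That is why such systems can rank categories for a document but cannot guarantee that the top one is correct.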
However, no system, whether human or automated, can be guaranteed to correctly classify every document. In practice we find that the correctness of a classification changes with the needs and interests of the person using the categories.
For example, for an inventor, a document might be about obtaining a patent on a particular product. Another searcher's interest in the document could be focused on a single process or one minor aspect of the patent.
Web portal search tools, like Excite and Alta Vista, are examples of free-text search engines. These systems index every word in every document and look for specific words in the search process. This is helpful when you know the exact term used in the target document. However, these search tools lack precision because there are many ways to convey any one idea. For instance, there are more than 120 words that refer to thinking: evaluate, consider, contemplate, and assess are a few of them.
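The "index every word in every document" approach can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (the sample memos and the function names are invented), not the implementation of any product named here.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Invented sample documents, for illustration only.
documents = {
    "memo1": "The board will evaluate the merger proposal.",
    "memo2": "Counsel should consider the merger terms carefully.",
    "memo3": "The lease renewal is due in March.",
}
index = build_index(documents)
print(sorted(search(index, "evaluate merger")))  # ['memo1']
print(sorted(search(index, "merger")))           # ['memo1', 'memo2']
```

Note that the query "evaluate merger" misses memo2, even though "consider" expresses the same idea. An exact-word index has no way of knowing that.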
To solve the problem of synonyms, some search tools, like Verity's Portal/One and Inktomi's Ultraseek, add a thesaurus and query expansion. They expand the query term to include its synonyms. Though each provides a standard thesaurus, to work effectively, each organization must customize its thesaurus to match its own interests. Current versions of Portal/One and Inktomi's product, CCE, provide reasonably good document retrieval solutions, but require extensive customization to be effective. They combine the free-text search with categorization. The categories are based on rules, which can be written by the organization or by individual users. Even with the help of the built-in tools, customizing the thesaurus and developing the rules are labor-intensive and expensive.

The other side of the synonym problem is multiple meanings for a word. The 500 most-used words in the English language have an average of 23 different meanings each: grizzly bear, bear market, bear gifts, it bears watching. A plain free-text search engine might deliver tremendous numbers of irrelevant documents in addition to the few relevant ones that were requested.
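Both sides of the synonym problem can be shown with a small Python sketch. The thesaurus, the documents and the function names below are invented for illustration; this does not reproduce how Portal/One, Ultraseek or CCE expand queries.

```python
import re

# Hypothetical thesaurus; a real one would be customized by each organization.
THESAURUS = {
    "evaluate": {"consider", "contemplate", "assess"},
}

# Invented sample documents, for illustration only.
DOCUMENTS = {
    "d1": "The committee will assess the settlement offer.",
    "d2": "Analysts expect a bear market next quarter.",
    "d3": "A grizzly bear was seen near the campsite.",
    "d4": "Please evaluate the licensing agreement.",
}

def words(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def expand(term, thesaurus):
    """Return the term together with any synonyms the thesaurus lists."""
    return {term} | thesaurus.get(term, set())

def search(term, documents, thesaurus=None):
    """Return sorted ids of documents containing the term or, when a
    thesaurus is supplied, any of its synonyms."""
    terms = expand(term, thesaurus) if thesaurus else {term}
    return sorted(doc_id for doc_id, text in documents.items()
                  if terms & words(text))

print(search("evaluate", DOCUMENTS))             # ['d4']
print(search("evaluate", DOCUMENTS, THESAURUS))  # ['d1', 'd4']
print(search("bear", DOCUMENTS))                 # ['d2', 'd3']
```

Expansion recovers the document that says "assess" instead of "evaluate," but only because someone put that synonym in the thesaurus. The search for "bear" returns both the market report and the wildlife sighting, and no thesaurus will separate them.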
In an effort to solve that problem, Hummingbird's Search Server also combines categorization with free-text search. It organizes and references its data through tables, which are constructed and maintained by the administrator. The data stored in these tables (metatags such as author, title and a keyword summary for each document) are entered manually by the document author. When searching for a word within a category, the user is more likely to find occurrences of the word with the correct meaning; however, the original problems involved with categorization still exist. Additionally, studies have shown that depending on document authors to create metatags can result in haphazard and incomplete information.
The third type of search tool uses "contextual profiling" and does not need or use categorization. An example of this is the KnowledgeBox by DolphinSearch. (Although there are some academic or development projects that employ abstract document representations, I am not aware of any other commercially available systems that use techniques similar to those of DolphinSearch.)
This system learns the meaning of documents and of the words within those documents based on the context in which they are used. Then, it forms semantic profiles to represent those meanings.
To retrieve relevant documents, the system creates a semantic profile of the query term and compares it to the profiles of documents on the network. It does not force categorical decisions on its users. People can search for whatever concepts they want without worrying about how someone else categorized the document.
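The idea of comparing a query profile with document profiles can be sketched roughly as follows. DolphinSearch's actual method is not described here, so this Python example substitutes a much simpler technique: each word is profiled by the words that appear near it, a text's profile is the sum of its words' profiles, and profiles are compared by cosine similarity. The documents and all names are invented.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def word_profiles(documents, window=2):
    """Profile each word by counting the words that occur near it."""
    profiles = defaultdict(Counter)
    for text in documents.values():
        tokens = tokenize(text)
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    profiles[word][tokens[j]] += 1
    return profiles

def text_profile(text, profiles):
    """Profile a text as the sum of the profiles of its words."""
    total = Counter()
    for word in tokenize(text):
        total.update(profiles.get(word, {}))
    return total

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank(query, documents, profiles):
    """Return document ids ordered from most to least similar to the query."""
    q = text_profile(query, profiles)
    scores = {doc_id: cosine(q, text_profile(text, profiles))
              for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Invented sample documents, for illustration only.
documents = {
    "wildlife": "grizzly bear cubs forage near the river",
    "markets": "bear market losses worry stock investors",
}
profiles = word_profiles(documents)
print(rank("stock investors", documents, profiles))  # ['markets', 'wildlife']
print(rank("river cubs", documents, profiles))       # ['wildlife', 'markets']
```

No one assigned either document to a category. The ranking comes entirely from the contexts in which the words were used, which is the point of the approach.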
In essence, DolphinSearch creates ad hoc categories in response to user queries, rather than fixed categories that force users to deal with a rigid system.
The problems of synonyms and multiple meanings are minimized because the system learns its vocabulary directly from the documents, reducing the chance of irrelevant meanings for words. Another feature is the development of points of view. The idea behind this is that different departments within an organization, such as human resources, business development, litigation or securities, have different interests and might use the same word in different contexts.
So, when the KnowledgeBox is trained on documents from a department, it learns the point of view of that user group and returns documents, from among the company's entire network, that are relevant to that specific department.
Michael Kraft is general counsel of Kraft, Kennedy & Lesser, a New York City-based legal technology consulting firm.