Developments in Full Text Searching.
Regular readers of my blog know that I have a long-standing interest in
the potential of full-text search technology for both litigation
support and knowledge management applications in large law firms. With
the seeming explosion of new companies offering advanced full-text
approaches, I have been trying to sort out what is really new and what
really works, so I have asked an expert.
A couple of years ago I met Sharon Flank of DataStrategy Consulting. Sharon has a PhD from Harvard in computational linguistics, meaning she is an expert on full-text search technology (her bio
is at the web site). The company offers technology due diligence and
product strategy and technology planning consulting, especially in
information retrieval, natural language processing (NLP), and
visualization.
It occurred to me that she was the perfect
person to ask about the underlying developments. Last week, I sent her
an e-mail message asking the following:
Has there
been any conceptual break-through – at an algorithm level – in
full-text and semantic analysis in the last 10 years? From 1990 to 1995
I looked at many products: PLS, Verity, Excalibur, Conquest, Fulcrum,
and others I can't now remember. It seems to me that those products did
much of what current products currently do, except perhaps the
extensive auto-classification (though that was less of a requirement
back then). Clearly, the ability to process large volumes has gone up
and user interfaces have improved. I'm not close to the computer
science but am curious if the underlying advances have been
significant, perhaps even quantum, or merely incremental. Thanks in
advance for any thoughts.
Sharon was kind enough
to send back the following reply about the current state of natural
language processing and full-text search:
There have been several important underlying developments over the last decade or so:
- Incorporating user feedback to refine search results, usually
  indirectly rather than explicitly, making results better through
  machine learning. [Amazon.com is the most-often cited example of this
  with its "if you like A, you'll also like B."]
- Assessments based on usage or referral. This is what makes Google so
  useful and popular. This approach gives higher rankings if other web
  sites point to a target or if that target gets a lot of hits.
- Various approaches to using taxonomies. The better applications use
  taxonomies as a navigation guide but don't force it or require
  administrators to implement it. Vivisimo.com is an example of an
  interesting, automated clustering approach.
- Better handling of phrases. Google automatically parses phrases and
  deals with search terms as phrases. This now seems natural but in the
  AltaVista days, you couldn't tell a Venetian blind from a blind
  Venetian [example courtesy of Prof. George Miller, Princeton Univ. –
  too good not to cite].
- Context-sensitive search is now an emerging trend. Systems track what
  users have previously searched for and infer interest in the same
  domain to refine search results. So if you look for "line" and a
  system knows you've just looked for "tacklebox," then it infers you
  mean fishing line. Or if you search for "bagels" and the system knows
  you are in 20009, it tells you that you can buy them at Comet Liquors
  (which happens to sell bagels).
- More generally in natural language processing, the statistical and
  linguistic approaches are converging in a new way: use massive
  amounts of data (i.e., the Web) to get statistical answers to deep
  linguistic questions, like "How do we figure out what the most likely
  referent is for the pronoun 'they'?" Or "How do we determine the
  correct sense for ambiguous words?" These things aren't in search
  engines yet, but you can expect to see more intelligent features
  coming out of this approach.
Looking at this list, you can see that the conceptual changes
(breakthroughs?), with the exception of better phrase handling, are
primarily focused around Web searches. When dealing with one-of-a-kind
document collections behind the corporate firewall, many of these
developments turn out not to add much to older approaches. So, at least
for enterprise search, I too remain partial to some of the older
products you mention, though I am disappointed that most of the
old-time vendors have not updated their approaches beyond adding
taxonomy support.
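For readers who want to see the phrase-handling point in concrete terms, here is a minimal Python sketch of the difference between matching words in any order and matching them as a phrase. It is only an illustration under simple assumptions: the two sample documents, the function names, and the tiny positional index are all made up for this example, and it is not how Google, AltaVista, or any of the products named above actually work.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text and split it into words, dropping simple punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_positional_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def bag_of_words_search(index, query):
    """Return documents containing every query word, in any order."""
    terms = tokenize(query)
    if not terms:
        return set()
    doc_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*doc_sets)


def phrase_search(index, query):
    """Return documents containing the query words adjacent and in order."""
    terms = tokenize(query)
    hits = set()
    for doc_id in bag_of_words_search(index, query):
        for start in index[terms[0]][doc_id]:
            # Each later term must appear exactly i positions after the first.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical two-document collection, invented for this example.
docs = {
    "shades": "We installed a Venetian blind in the office.",
    "person": "The blind Venetian walked along the canal.",
}
index = build_positional_index(docs)

print(sorted(bag_of_words_search(index, "Venetian blind")))  # ['person', 'shades']
print(sorted(phrase_search(index, "Venetian blind")))        # ['shades']
print(sorted(phrase_search(index, "blind Venetian")))        # ['person']
```

The any-order search cannot separate the two documents, while the phrase search can, because the index records where each word occurs and not just whether it occurs.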
I appreciate Sharon
taking the time to provide this insight. The bottom line for litigators
and litigation support professionals: you need to keep your eye on
emerging technologies and not necessarily take a one-size-fits-all
approach to managing large volumes of documents. [Strategic Legal Technology]