Developments in Full Text Searching

Regular readers of my blog know that I have a long-standing interest in
the potential of full-text search technology for both litigation
support and knowledge management applications in large law firms. With
the seeming explosion of new companies offering advanced full-text
approaches, I have been trying to sort out what is really new and what
really works, so I have asked an expert. 

A couple of years ago I met Sharon Flank of DataStrategy Consulting.
Sharon has a PhD from Harvard in computational linguistics, meaning she
is an expert on full-text search technology (her bio is at the web
site). The company offers technology due diligence, product strategy,
and technology planning consulting, especially in information
retrieval, natural language processing (NLP), and visualization.

It occurred to me that she was the perfect
person to ask about the underlying developments. Last week, I sent her
an e-mail message asking the following:

“Has there
been any conceptual breakthrough – at an algorithm level – in
full-text and semantic analysis in the last 10 years? From 1990 to 1995
I looked at many products: PLS, Verity, Excalibur, Conquest, Fulcrum,
and others I can’t now remember. It seems to me that those products did
much of what current products do, except perhaps the
extensive auto-classification (though that was less of a requirement
back then). Clearly, the ability to process large volumes has gone up
and user interfaces have improved. I’m not close to the computer
science but am curious if the underlying advances have been
significant, perhaps even quantum, or merely incremental. Thanks in
advance for any thoughts.”

Sharon was kind enough
to send back the following reply about the current state of natural
language processing and full-text search:

“There are several underlying important developments over the last decade or so:

  • Incorporating
    user feedback to refine search results, usually indirectly rather than
    explicitly, making results better through machine learning. [Amazon.com
    is the most often cited example of this, with its “if you like A,
    you’ll also like B.”]
  • Assessments based on usage or
    referral. This is what makes Google so useful and popular. This
    approach gives higher rankings if other web sites point to a target or
    if that target gets a lot of hits.
  • Various approaches to
    using taxonomies. The better applications use taxonomies as a
    navigation guide but don’t force it or require administrators to
    implement it. Vivisimo.com is an example of an interesting, automated
    clustering approach.
  • Better handling of phrases. Google
    automatically parses phrases and deals with search terms as phrases.
    This now seems natural but in the AltaVista days, you couldn’t tell a
    Venetian blind from a blind Venetian [example courtesy of Prof. George
    Miller, Princeton Univ. – too good not to cite].
  • Context-sensitive
    search is now an emerging trend. Systems track what users have
    previously searched for and infer interest in the same domain to refine
    search results. So if you look for “line” and a system knows you’ve just
    looked for “tacklebox,” then it infers you mean “fishing line.” Or if
    you search for bagels and the system knows you are in 20009, it tells
    you that you can buy them at Comet Liquors (which happens to sell
    bagels).

“More generally in natural language processing, the statistical and
linguistic approaches are converging in a new way: use massive amounts
of data (i.e. the Web) to get statistical answers to deep linguistic
questions, like “How do we figure out what the most likely referent is
for the pronoun ‘they’?” Or “How do we determine the correct sense for
ambiguous words?” These things aren’t in search engines yet, but you
can expect to see more “intelligent” features coming out of this
approach.

“Looking at this list, you can see that the conceptual changes
(breakthroughs?), with the exception of better phrase handling, are
primarily focused around Web searches. When dealing with one-of-a-kind
document collections behind the corporate firewall, many of these
developments turn out not to add much to older approaches. So, at least
for enterprise search, I too remain partial to some of the older
products you mention, though I am disappointed that most of the
old-time vendors have not updated their approaches beyond adding
taxonomy support.”
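
Since phrase handling is the one development singled out above as mattering beyond Web search, a small illustration may help. The Python sketch below is only a toy under stated assumptions, not how Google or any of the products named here actually work. It records the position of every word in each document, so that a phrase query can require the words to appear side by side and in order. The function names and the two sample documents are hypothetical, invented for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each lowercased word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def docs_with_all_words(index, words):
    """Return the set of documents that contain every word, in any order."""
    if not words or any(w not in index for w in words):
        return set()
    found = set(index[words[0]])
    for w in words[1:]:
        found &= set(index[w])
    return found


def bag_of_words_search(index, query):
    """Match documents containing all query words, ignoring word order."""
    return sorted(docs_with_all_words(index, query.lower().split()))


def phrase_search(index, phrase):
    """Match documents where the query words occur consecutively, in order."""
    words = phrase.lower().split()
    hits = []
    for doc_id in docs_with_all_words(index, words):
        # Try each occurrence of the first word as a possible phrase start.
        for start in index[words[0]][doc_id]:
            if all(start + offset in index[w][doc_id]
                   for offset, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical sample documents, made up for this illustration.
docs = {
    "catalog": "We sell a Venetian blind for every window",
    "travel": "A blind Venetian gave us directions in Venice",
}
index = build_positional_index(docs)

print(bag_of_words_search(index, "Venetian blind"))  # both documents match
print(phrase_search(index, "Venetian blind"))        # only "catalog" matches
print(phrase_search(index, "blind Venetian"))        # only "travel" matches
```

The order-blind search cannot tell the two documents apart, which is the Venetian blind problem in miniature. The positional version can, at the cost of a larger index. Real engines layer stemming, punctuation handling, and ranking on top of this idea.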

I appreciate Sharon taking the time to provide this insight. The bottom
line for litigators and litigation support professionals: you need to
keep your eye on emerging technologies and not necessarily take a
“one-size-fits-all” approach to managing large volumes of documents.
[Strategic Legal Technology]
