Which gets me to the subject of the Semantic Web. The Semantic Web is about people, and specifically about making people's voices clearly audible and indelible:
- Audible – There are six billion of us on the planet. Some people would have you believe that you should never ask any of us for advice, because we lie. But today I can't even hear your lies. The Internet has made it immensely easier to connect with expertise from other humans who want to share it, but we are still largely shackled by cultural, geographic, social, and technological constraints limiting who we can consult for advice. Today I get most of my lies from whichever barbarians have clawed their way to the top of the local and national media outlets. But sometimes when I see an advertisement for an interesting new product, I want to be able to pick up my remote control and click on “connect me to five people who hate the product and ask them why”. I am sure that there are at least five people who want to give me a perspective different from the one being broadcast, so why can I not hear their voices?
- Indelible – Few people think about the noble role that librarians play. Our ability to collect, organize, and preserve the voices and observations of those who came before us is critical to our continued survival as a species. The story of Babel is a metaphor for what later happened at Alexandria; a reminder that we all suffer when we lose our ability to pass lessons to future generations. It is possible for a single person to memorize the Quran and pass it on to others, but word-of-mouth is not enough to perpetuate the bulk of knowledge that enables the planet to support six billion people today. Without written language and our knowledge stewards, we would have to eliminate many billions of people, because we wouldn't be able to maintain the capabilities that support them all. Again, the Internet has had a profound impact on our ability to preserve our collective memory, but we are still very fragile. A true librarian has vivid memories of Babel and Alexandria (when we also considered ourselves invincible), and lives the motto “never again!”. The first lesson of history (that we must learn and never repeat) is that history lost is humanity lost.
The key point here is that the web, and especially the semantic web, is about capturing and communicating human knowledge. For people who have trouble understanding that “knowledge” is a truth-neutral word, it is fine to say that the semantic web is about capturing and communicating human voices. The web v1.0 was great, but it still has many problems. (For example, you would think that the web would do a good job of documenting the history of the web, but the feedback loop created by copy/paste historians virtually erased Eric Bina from the history of the web initially, while elevating Marc Andreessen to the status of a god.) Most of the web's ability to filter voices is still based on information extraction from raw, unstructured text. Innovations like weblogs have made it easier for normal people to communicate their voices to the world, and people continue to simultaneously evolve the web's ability to filter voices, as demonstrated by Mark Pilgrim's cool use of cite with trackback. So it is obvious to me that the current web is evolving to become more semantic anyway. In fact, I would argue that people like Dave Winer (who overtly disparages certain semantic web technologies while producing code that gives people voice) have done more to advance the semantic web (the web of renmin voice) than many of the semantic web advocates.
RDF is often a whipping boy, but it is a red herring in this discussion. To know why, you need to understand that RDF is simply a syntax for exchanging knowledge representations, and not even a particularly ambitious or cutting-edge syntax. When you want to represent the quantity “five”, you can use Roman numerals, Arabic numerals, or some other symbolism. When you want to represent the statement “the author of http://www.netcrucible.com/blog/ is joe@cisco.com”, you can choose from a variety of different symbolisms. Unsurprisingly, different symbolisms for KR are suited for different tasks (try doing long addition with Roman numerals). The book “Knowledge Representation: Logical, Philosophical, and Computational Foundations” by John F. Sowa discusses the characteristics of KR techniques, from the “symbolic” logic of Aristotle to the current computer-oriented techniques specialized for specific problem domains. RDF isn't even as expressive as Aristotle's original logic, so it's really not accurate to disparage RDF for being “too AI”. It is true that many RDF proponents are “too AI”, but RDF itself is a pretty modest contribution to the field of KR (and KR is no more a subset of AI than Computer Science is). When people bring RDF into the semantic web debate, they are usually arguing from one of a few implicit biases:
- “Formal KR is overkill for the web” – Most of the examples that qualify as “semantic web” today rely on plain old English recognition, and the rest rely on extracting predefined semantics from markup, meta tags, and so on. Written English itself is nothing if not a KR syntax, and it is certainly fair to argue that we don't need a more formal syntax. Sure, things would be really cute if all web pages were written using Prolog instead of HTML (RDF is not too different from Prolog data syntax), but we cope with HTML just fine. In fact, Dare Obasanjo remarked to me long ago that XML is not really different from Lisp's s-expressions (a point elaborated in the paper by Jerome Simeon), so in a sense, Mark Pilgrim and the XHTML advocates are lobbying to have people write their web pages in Lisp instead of HTML. The fact that RDF and XML are more alike than they are different (like Lisp and Prolog) is something that Ashok Malhotra never tires of pointing out (and the only issue I've ever seen him get truly agitated about). Nevertheless, pragmatists will argue that XML is the right balance, that the web has always been an ad-hocracy, and that the web does not need a full-blown formalized KR language. It's not as if RDF is wasting energy trying to push the boundaries of KR, though, and it just might pay off for the web one day. I don't think it is wise at this point to lobby for everyone to use RDF, but I also don't think it is smart to lobby that nobody should. Already people are dissatisfied with the status quo, and are looking for ways to encode knowledge more interchangeably. Perhaps XML will be able to do everything that people want, or perhaps RDF's foundation in KR will ultimately propel it to dominance. Who knows?
- “RDF is too complicated” – This also is a very potent argument. The primary serialization for RDF is XML. This really starts to hurt your brain when you realize that RDF and XML are almost the same thing. Too much meta and your mind can't bootstrap. And the two main non-XML serializations that exist are named “N3” and “N-Triples”, but bear no resemblance to one another — a prank that lends credence to the allegations of gratuitous complexity. Furthermore, the existence of multiple serializations leads people to the understandable misconception that RDF is not simply a syntax for exchanging knowledge representations. Even when people understand that RDF is about KR, they don't seem to expect that the serialization syntax is independent of the KR model, and therefore twice-removed from the knowledge being represented. RDF's relationship to XML paired with the existence of non-XML serializations is just too “meta” for most people. And that only scratches the surface of why people consider RDF to be complex (you can find other reasons described by searching Google). I personally think that the model of RDF is admirably simple, and the syntax is not so terrible once you understand what RDF is trying to do and use it for that. But when considered in context of the big picture, I have to agree that RDF is waaay too difficult to approach for most renmin.
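To make the model-versus-syntax point a little more concrete, here is a minimal sketch in Python (my choice of language, not anything RDF requires). It holds the “author of” statement from above as one abstract triple and writes it out in two notations: a simplified N-Triples-style line, and a Lisp-style form that I invented purely for comparison. The `Triple` class, the helper functions, and the `http://example.org/terms/author` predicate are all assumptions made up for this illustration, and the parser only handles the simple all-IRI lines that the sketch itself produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One statement in the abstract model: subject, predicate, object."""
    subject: str    # IRI of the thing being described
    predicate: str  # IRI naming the relationship
    obj: str        # IRI of the value


# Hypothetical predicate IRI, chosen only for this illustration.
AUTHOR = "http://example.org/terms/author"

statement = Triple(
    subject="http://www.netcrucible.com/blog/",
    predicate=AUTHOR,
    obj="mailto:joe@cisco.com",
)


def to_ntriples(t: Triple) -> str:
    """Line-oriented form: three IRIs in angle brackets, then a dot."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


def to_sexpr(t: Triple) -> str:
    """Lisp-style form; an invented notation, used here only for contrast."""
    return f'(triple "{t.subject}" "{t.predicate}" "{t.obj}")'


def from_ntriples(line: str) -> Triple:
    """Parse the simplified all-IRI line produced by to_ntriples."""
    parts = line.rstrip(" .").split()
    s, p, o = (part.strip("<>") for part in parts)
    return Triple(s, p, o)


if __name__ == "__main__":
    print(to_ntriples(statement))
    print(to_sexpr(statement))
    # Round trip: the notation changes, the statement does not.
    print(from_ntriples(to_ntriples(statement)) == statement)
```

The two printed lines look nothing alike, yet they carry the same statement, and parsing one of them gives back the original triple. That is all “the serialization is independent of the model” really means.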
In any case, it should be clear that the semantic web is about recording and communicating human knowledge, and the whole field of knowledge representation (including RDF) is about the same thing. It is true that memory and cognition are tightly intertwined, but it is still quite useful to build a digital library of semantic information without first knowing all of the details of the digital “brains” that will process the information. The web from the very beginning was a digital library of human knowledge (or lies, it doesn't matter), and all sorts of useful new applications have evolved as people continue to innovate new techniques for representing and mining that knowledge. Give people a taste of a universe where all voices are audible and indelible, and they never want to go back. Millions have already tasted the fruit, and the current state of evolution and innovation on the web is inseparable from that urge to eat more and never go back. Deep in the soul, that is what people are working for, and I wouldn't bet against them. Like a forest fire that has consumed its first acre, like a pile of nuclear material at critical mass, or like free-enterprise in China, the semantic web is. [Better Living Through Software]