Watson – To Talk of Many Things

Watson.

People who don’t build software for a living have a quite understandable attitude that you should write your program, fix all the bugs, then ship it. Why would you ever ship a product that has bugs in it? Surely that is a sign that you don’t care about quality, right?

Those of us who work on non-trivial software projects naturally see this a little differently. There are many factors that determine quality of software, and some of them are counter-intuitive. For example, the first thing to realize is that you are not trying to get rid of every bug. What you are trying to do is maximize the known quality of the product at the time you ship it. If the last bug you have in the product is that a toolbar button sometimes doesn’t display correctly, and you fix it, then you ship the product, how embarrassed will you be a little later when you find out that the code fix you took to get the toolbar to display causes your application to crash when used on machines with certain video cards? Was that the right decision? Fix every bug?

The discipline of software testing is one that is rarely taught in university, and when it is taught, it is taught in a highly theoretical way, not in the way that it needs to be done to successfully produce high quality products on a schedule demanded by commercial software.

I often interview university students applying for jobs at Microsoft. They're usually excellent people and very talented, although we have to turn away most of them. One thing that always intrigues me is when I am interviewing people who want to be a developer or tester. If I ask them to write a little code, most of them can. Then I ask them to prove to me that their code works. The responses I get to this vary so widely it is remarkable. Usually people run the example values I gave them through their algorithm and if it comes out correctly, they say it works. Then I tell them, “OK, let's say I told you this code is for use in a firewall product that is going to be shipped to millions of customers, and if any hackers get through, you get fired”. Then they start to realize that they need to do a little more. So they try a few more numbers and some even try some special numbers called “boundary conditions” – the numbers on either side of a limit. They try a few more ideas. At some point they either say they're now sure, or they get stuck. So I then ask them if they think it is perfect now. Well, how do you know?

This is a central problem in large software projects. Once you get beyond about 100 lines of code, it becomes essentially impossible to prove that code is absolutely correct.

Another interesting detail here is that people constantly confuse the fact that they can’t find a problem with there being no more problems. Let's say you have a test team of one person, and a development team of 10 (at Microsoft, we have a 1:1 ratio of testers to devs, so we'd have 10 testers on that team). But what if our poor lone hypothetical tester goes on vacation for 2 weeks, and while he is away the developers fix the few bugs he has found. Is the program now bug free, just because there are no known bugs in it? It is “bug free” after all, by the definition many people would use. In fact a good friend of mine who runs his own one-man software business told me once his software is in fact bug free, since whenever any customer tells him about a bug, he fixes it. I suppose his reality is subjective. But you see the point – whether you know about the bugs or not, they're in there.

So how do you deal with them? Of course, you test test test. You use automated tests, unit tests that test code modules independently of others, integration tests to verify code works when modules are put together, genetic test algorithms that try to evolve tests that cause problems, code reviews, automated code checkers that look for known bad coding practices, etc. But all you can really say is that after awhile you are finding it hard to find bugs (or at least serious ones), which must mean you're getting it close to being acceptable to ship. Right?

One of my favorite developments in product quality that is sweeping Microsoft and now other companies as well is an effort started by a couple of colleagues of mine in Office, called simply “Watson”. The idea is simple. A few hundred (or even thousands) of professional testers at Microsoft, even aided by their expert knowledge, automated tools and crack development and program management teams cannot hope to cover the stunningly diverse set of environments and activities that our actual customers have. And we hear a lot about people who have this or that weird behavior or crash on their machine that “happens all the time”. Believe me, we don’t see this on our own machines. Or rather, we do at first, then we fix everything we see. In fact, the product always seems disturbingly rock solid when we ship – otherwise we wouldn’t ship it. But would every one of the 400 million Office users out there agree? Everybody has an anecdote about problems, but what are anecdotes worth? What is the true scale of the problem? Is everything random, or are there real problems shared by many people? Watson to the rescue.

As an aside, I suspect that some people will read this and say that open source magic solves this, since as the liturgy goes so many eyes are looking at the code that all the bugs are found and fixed. What you need to realize is that there are very few lawyers who can fix code bugs. And hardly any artists. Not so many high school teachers, and maybe just a handful of administrative assistants. Cathedral, bazaar, whatever. The fact is that the real user base is not part of the development loop in any significant way. With 10^8 diverse users, a base of people reporting bugs on the order of 10^4 or even 10^5, especially with a developer user profile cannot come close to discovering the issues the set of non-computer people have.

The Watson approach was simply to say: let's measure it. We'll match that 10^8 number 1 for 1. In fact, we'll go beyond measuring it, we'll categorize every crash our users have, and with their permission, collect details of the crash environment and upload those to our servers. You've probably seen that dialog that comes up asking you to report the details of your crash to Microsoft. When you report the crash, if that is a crash that someone else has already had, we increment the count on that “bucket”. After a while, we'll start to get a “crash curve” histogram. On the left will be the bucket with the most “hits”. On the far right will be a long list of “buckets” so rare that only one person in all those millions had that particular crash and cared to report it. This curve will then give you a “top N” for crashes. You can literally count what percentage of people would be happier if we fixed just the top 10 crashes.

When we started doing Watson, it was near the end of OfficeXP. We collected data internally for a few months, and started collecting externally for the last beta. We quickly learned that the crash curve was remarkably skewed. We saw a single bug that accounted for 27% of the crashes in one application. We also saw that many people install extra software on their machine that interferes with Office or other applications, and causes them to crash. We also saw that things we thought were fairly innocuous such as grammar checkers we licensed from vendors for some European languages were causing an unbelievable number of crashes in markets like Italy or the Netherlands. So we fixed what we could, and as a result OfficeXp was pretty stable at launch. For the first time, that was not a “gut feel” – we could measure the actual stability in the wild. For follow-on service packs for Office XP, we just attacked that crash curve and removed all the common or even mildly common crashes.

For Office2003 (and OneNote 2003), the Watson team kicked into high gear with more sophisticated debugging tools so we could more often turn those crash reports into fixes. They also started collecting more types of data (such as “hangs”, when an application either goes into an infinite loop, or is taking so long to do something that it might as well be in one). We fixed so many crashing bugs and hangs that real people reported during the betas that we were way down “in the weeds”, in the areas where crashes were reported only a handful of times – hard to say if that is real or just a tester trying to repro a bug they found. So again we know how stable the 2003 generation of Office applications is – we can measure it directly. What a difference from the guesswork that used to go on. In fact we know for a fact that Office 2003 at launch is already more stable than any previous release of Office ever, even after all service packs are applied, since Windows now collects Watson data for all applications.

Microsoft offers this data freely for anyone writing applications on Windows. You can find out if your shareware application is crashing or its process is being killed by your users often – sure signs of bugs you didn’t know you had. The Windows team will also give you crash dumps you can load into your debugger to see exactly what the call stack was that took you down, so you can easily fix your bugs.

This is an example of what BillG calls “Trustworthy Computing”. You may think that's a bunch of hoopla, but in many ways it is quite real – things like Watson are used to directly improve the quality of our products. [Chris_Pratley's WebLog]

Leave a comment