The art of building software: My favorite bug

Thursday, March 18, 2010

My favorite bug

This is the story of my favorite bug. I created it on an early version of the NetFlix site over a decade ago, when they were a relatively small, unknown startup. Now I stream NetFlix on my XBox, and DVDs in the mail seem... so... last century.

Way back when NetFlix was a tiny startup in Scotts Valley, CA, I walked down to their offices from my house up the street and walked out with a job as a senior web developer. They soon moved just over the hill to Los Gatos, CA, where I got to implement this fabulous bug I want to tell you about.

When you browse the NetFlix web site and look at a movie, you see all this information about the movie such as the description, actors, director, and so on. NetFlix used to get this data from the Internet Movie Database. I don't know if they still do.

Anyway, my task was to set up a process for automatically getting the latest content from IMDB and bringing that into the Oracle database so it could be displayed to the user on the site. My basic approach was a standard "Extract, Transform, and Load" or ETL process. I set up a job to retrieve the latest data from IMDB via FTP, then I wrote a program to do some validation and pre-processing on that data, and loaded it into a few new tables I'd created in the Oracle database. Finally, I kicked off a big PL/SQL script I wrote to process the newly inserted rows by updating the actual movie reviews, actors, and so on - where they really lived in the database.
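To make the shape of that pipeline concrete, here is a minimal sketch in Python. It is not the original code: the feed format, the table names (`staging_movie`, `movie`), and the function names are all made up for illustration, an in-memory SQLite database stands in for Oracle, and a hard-coded list stands in for the file fetched over FTP.

```python
import sqlite3

# Hypothetical raw feed lines standing in for a file fetched over FTP.
RAW_FEED = [
    "tt001|The Matrix|A hacker learns the truth about his world.",
    "tt002||Missing title, should be rejected.",
    "malformed line without delimiters",
    "tt003|Toy Story|Toys come to life when nobody is looking.",
]


def transform(lines):
    """Validate and pre-process raw lines into (movie_id, title, description) tuples."""
    rows = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3 or not parts[0] or not parts[1]:
            continue  # skip rows that fail basic validation
        rows.append(tuple(parts))
    return rows


def load_staging(conn, rows):
    """Load validated rows into a staging table, separate from the live table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_movie "
        "(movie_id TEXT PRIMARY KEY, title TEXT, description TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO staging_movie VALUES (?, ?, ?)", rows)
    conn.commit()


def apply_staging(conn):
    """Copy staged content into the live table (the role the PL/SQL script played)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movie "
        "(movie_id TEXT PRIMARY KEY, title TEXT, description TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO movie "
        "SELECT movie_id, title, description FROM staging_movie"
    )
    conn.commit()


def run_etl(conn, lines):
    """Run transform, staging load, and final apply; return the number of rows kept."""
    rows = transform(lines)
    load_staging(conn, rows)
    apply_staging(conn)
    return len(rows)
```

Calling `run_etl(sqlite3.connect(":memory:"), RAW_FEED)` keeps the two well-formed rows and drops the other two. The point of the staging table is that bad input gets caught before it can touch anything the site reads.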

When I ran a test of the PL/SQL script against a test copy of the production database, I noticed that it took several hours to run, during which time the database was so overtaxed that actual end-user response times would have been unacceptable. So I came up with an idea: process one row, sleep for a few seconds, then do another row and sleep, and so on. That way the content would still be imported; it would just take a few days for the process to finish. More importantly, database performance would remain acceptable during the process. Sounded good.
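The process-a-row-then-sleep idea looks roughly like this in Python. Again, this is a sketch rather than the PL/SQL I actually wrote: `process_throttled`, `process_row`, and the 3-second default pause are illustrative assumptions. The `sleep` parameter exists only so the pause can be swapped out in a test.

```python
import time


def process_throttled(rows, process_row, pause_seconds=3.0, sleep=time.sleep):
    """Process rows one at a time, pausing between rows to limit database load."""
    processed = 0
    for i, row in enumerate(rows):
        if i > 0:
            sleep(pause_seconds)  # give the database back to end-user traffic
        process_row(row)
        processed += 1
    return processed
```

The trade-off is plain from the loop: total run time grows to roughly the number of rows times the pause, in exchange for never holding the database for more than one row's worth of work at a time.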

So this passed QA and was put into production. Over the next day, people gradually began seeing the new site content, one new title every few seconds, and they were pleased to see this fresh content. What they didn't realize until some time later was that each time my PL/SQL script updated a movie, it set the available inventory level on that movie to 0. This effectively took it out of stock as far as the web site was concerned, so that movie could no longer be rented through the web site. Over time, the entire inventory was being taken offline, unavailable to be rented. That was their sole revenue stream, mind you.

At some point before the entire inventory was destroyed, we figured out what was going on and ultimately ended up restoring an Oracle database backup and deleting the PL/SQL script I'd written.

Over the next few days, others and I worked to understand the root cause of what happened. How could this have passed through QA? Well, it turns out that NetFlix used Oracle Financials, which was running on the same Oracle database server, and Oracle Financials was not present in the QA test setup. Oracle Financials saw this movie content update as essentially meaning a new movie was in the inventory, and a new title's available inventory starts off at zero until you tell it how many you've got. So Oracle Financials was taking the titles out of inventory.
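Here is a toy model of why QA never saw it. This says nothing about how Oracle Financials actually hooks into a database; the `Catalog` class, the listener mechanism, and every name below are invented purely to show the same update behaving differently depending on what else is installed in the environment.

```python
class Catalog:
    """Toy catalog whose content updates notify any registered listeners."""

    def __init__(self):
        self.titles = {}      # movie_id -> description
        self.inventory = {}   # movie_id -> available copies
        self.listeners = []   # callbacks run after every content update

    def update_content(self, movie_id, description):
        self.titles[movie_id] = description
        for listener in self.listeners:
            listener(self, movie_id)


def treat_update_as_new_title(catalog, movie_id):
    """Hypothetical stand-in for the downstream system: resets stock to zero."""
    catalog.inventory[movie_id] = 0


def build_catalog(with_downstream_system):
    """Build a catalog holding one title, optionally with the downstream system attached."""
    catalog = Catalog()
    catalog.inventory["tt001"] = 12
    if with_downstream_system:
        catalog.listeners.append(treat_update_as_new_title)
    return catalog
```

With `build_catalog(False)`, which plays the role of the QA setup, `update_content("tt001", ...)` leaves the 12 copies alone. With `build_catalog(True)`, the identical call drops the title to zero. The update code is the same in both cases; only the environment differs.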

I had no idea Oracle Financials was even in the picture, and I guess our QA team didn't either. The bug fix for this was really simple once we knew how to get Oracle Financials not to view this as a new title. And eventually the new content got out on the site and all was good.

Over the next few weeks we talked about how we could prevent something like that from happening in the future. I'll never forget this really bright programmer there named Kho telling me that really good programmers just don't write bugs to begin with. Then he proceeded to show me all this bug-free software he'd written. Once every few years, I seem to somehow write a huge block of code, and it just compiles and runs, bug-free. And I am amazed. It can happen. I don't know if it's just luck or whether this can be cultivated. Maybe Kho is right.
