Amazon engineers knew about the Tōhoku earthquake before most of the western world did. They knew because within seconds of the earthquake, their pagers started going off (yes, almost everyone at Amazon has a pager). Their internal monitoring detected a giant drop in the total number of orders that should have been getting processed in Japan and seconds later alarms were going off left and right.
I remember greatly enjoying the details of this when one of my good friends, former Amazonian, and current Thinkfuser Mark Golazeski told me that story. It reminded me of another that I heard when I was working at Google.
Back in the day, Google was renting colocation space in a data center and, as with all things at Google, they had exhaustive logs recording every little detail. One day, their monitoring triggered an alarm when one of the servers reported abnormally high temperatures. As they watched the logs, they became confident that this wasn't a glitch in their monitoring systems. More evidence arrived when surrounding servers started reporting abnormally high temperatures as well. Suddenly, some of the servers started going offline. They watched as it spread server to server. Was it a virus? No. They concluded that their rack was on fire.
The Google engineers called the staff at the data center to let them know that there was a fire and that they were watching it spread in real time. The staff laughed and countered that they were sitting in the data center and would surely know if it was on fire! They told Google that their monitoring software was busted. After a couple more seconds of back-and-forth, sure enough, the data center's fire suppression system kicked in. The data center staff hurriedly said 'Oh my god, sorry!' and hung up without another word. That's the story of how Google engineers knew a data center was on fire before the data center knew.
This philosophy of having intense monitoring around every system is something we heavily embrace at my startup, Thinkfuse. It's great for monitoring. It's great for debugging. But its biggest strength is that it's great for turning horrible customer experiences into phenomenal ones. Let me explain.
In our early days, we would constantly break features while working on new ideas. We warned our users that they were using a prototype that might break frequently, but they started using us for real work anyway and came to depend on us.
We decided that we wanted to be very proactive about errors on the site, so we set up a system to directly email us every time an error occurred. We then personally responded to every single error that any user encountered, often within minutes. Our users loved these responses and it gave them a lot more confidence in using Thinkfuse, despite it being a fairly buggy prototype at the time. They also became much more forgiving of future problems, so much so that almost all of those users are still using us today (2 years later!).
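For readers who want to try something similar, here is a minimal sketch of an "email us on every error" hook. It is not our actual code: the addresses, the `notify_on_error` decorator, the `build_error_email` helper, and the convention that each handler takes the user's email as its first argument are all assumptions made up for illustration. It uses only the Python standard library.

```python
import functools
import smtplib
import traceback
from email.message import EmailMessage
from typing import Callable, Optional

# Hypothetical addresses, for illustration only.
ALERT_FROM = "errors@example.com"
ALERT_TO = "team@example.com"


def build_error_email(exc: BaseException, user_email: Optional[str], action: str) -> EmailMessage:
    """Describe one failure: who hit it, what they were doing, and the traceback."""
    msg = EmailMessage()
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg["Subject"] = f"[error] {type(exc).__name__} during {action}"
    if user_email:
        # Assumption: replying to the alert should start a conversation with the user.
        msg["Reply-To"] = user_email
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    msg.set_content(f"User: {user_email or 'unknown'}\nAction: {action}\n\n{trace}")
    return msg


def send_via_smtp(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Default sender; assumes an SMTP relay is reachable at host:port."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


def notify_on_error(send: Callable[[EmailMessage], None] = send_via_smtp):
    """Wrap a handler so any exception is emailed to the team, then re-raised."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(user_email, *args, **kwargs):
            try:
                return handler(user_email, *args, **kwargs)
            except Exception as exc:
                try:
                    send(build_error_email(exc, user_email, handler.__name__))
                except Exception:
                    pass  # a failed alert must never mask the original error
                raise
        return wrapper
    return decorator
```

Decorating a handler with `@notify_on_error()` sends one email per failure and still lets the original exception propagate, so the site's normal error page is unaffected. The sender is passed in as a parameter so the hook can be exercised without a mail server, for example with `send=some_list.append`.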
Great customer service buys you a lot of slack. I can't overstate that. The reaction that a user has when you email them about an error they experienced, when they never even contacted you about it, is amazing. It turns a horrible situation that might cause the user to walk away and never come back into a situation where they're shouting from a mountaintop about how great you are. One of my favorite tweets was written by Scott Kveton, the CEO of UrbanAirship, when we reached out to him last year after we caught an error. Here is his reaction:
He still uses us to this day.
Our policy of reaching out to users is so valuable that even today, with hundreds of companies using us and 35% month-over-month growth, we still respond personally to every single error we detect as quickly as we can.
With a team of five, this policy has a secondary benefit of forcing us to be better about automated testing and code quality. We couldn't do it if errors scaled with our growth.
Businesses store critical data in our system and we try our best to make sure that errors never happen. The product has also matured quite a bit in the past two years. If a problem does slip by, though, we make sure that the customer knows we're paying attention and are on top of it. Reaching out to them first is the best first step to building credibility and turning a poor customer experience into a great one.
About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at firstname.lastname@example.org.