I love living in Las Vegas — I can get a corned beef reuben at 11:30 pm after a nice relaxing session of high-speed shooting in Gears of War with my friend. In fact, I enjoy this late-night meal so often that I’m known by name to both the wait staff and the check-out counter group at the restaurant.
Last night they lost a customer … two of them. My friend and I won’t return. The situation was mostly due to a series of poor decisions on the part of the staff, but the trigger for the whole fiasco was bad software engineering.
Multi-user distributed systems always have tension around locking information. If the information is not locked while in use, it becomes very confusing for users, and there’s a very real risk of a lost update, where one user’s change is silently overwritten by another’s.
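To make the lost update concrete, here is a minimal sketch with no locking at all. The `order` dictionary and the two terminal variables are made up for illustration; they are not taken from any real point of sale system.

```python
# Hypothetical in-memory order shared by two terminals, with no locking.
order = {"items": ["reuben"]}

# Both terminals read the same starting state.
terminal_a = list(order["items"])
terminal_b = list(order["items"])

# Each one edits its own private copy.
terminal_a.append("coffee")
terminal_b.append("pie")

# Each writes its whole copy back; the later write wins.
order["items"] = terminal_a
order["items"] = terminal_b

print(order["items"])  # ['reuben', 'pie'], the coffee is gone
```

Nothing failed and nobody saw an error, which is exactly why some form of locking is tempting.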
However, the naive solution to this problem is a terrible solution: hold long-lasting locks.
Let’s take the simple use-case of ordering a meal using a point of sale system (such as the one used at the restaurant last night).
The wait-staff takes an order and appends items to the order until the customer’s meal is complete. This is a multi-step operation as customers can change their orders, add desserts, and so on. When the meal is complete, the wait-staff gives the customer a check which lists the items ordered, sub-total, tax, and total. Most importantly to the software system, the check has a barcode which represents the order for lookup at the register.
The checkout operation requires the clerk to scan the check, which brings up the order and lets them enter the tip and swipe the credit card and loyalty program card. The system processes the charge to the credit card and prints the carbon receipt, which the customer signs and keeps a copy of. This completes the transaction.
This use-case is so common in restaurants that most people don’t think about it. You would expect the developers of the vertical application this restaurant used to understand it well, though.
So, what are some exceptional cases that could occur to interfere with this simple use case?
Suppose the client application on the point of sale terminal crashes after the check was scanned and the record selected for update (and thus locked) but before the card is swiped.
At that point, either the lock should be released or, when the application restarts, the operation should transparently continue from where it left off.
That’s what you’d expect. In fact, on reboot this system reported that the check couldn’t be closed out because it was locked. It provided no way to unlock it or to resume that final operation. The check was in limbo.
And it remained locked. So they booted up the next terminal. And let many other customers pay and leave, while we stood waiting. They would re-scan our check after each customer, but it remained locked.
And remained locked.
And remained locked.
For 15 minutes we stood waiting.
Finally, they realized getting a manager might be a good idea. At this point, I assumed we’d simply be comped and apologized to. Nope, they rang up a new check so we could pay one that wasn’t locked. I guess they left a paper note to void the other.
Well, we paid and left.
What should have happened?
On the technical side, the long-term lock was egregiously bad design. They should have either:
- allowed the lock to be over-ridden and canceled (perhaps with a manager code)
- dropped the lock on loss of the terminal (obvious, though many web sites hold locks because HTTP is stateless and they can’t tell whether a person is still active, which is part of why web design is fun)
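One way to sketch both options is a lock that is really a lease: it expires unless the holder keeps renewing it, and a manager code can break it. This is only an illustration under my own assumptions. The `LeaseLock` class, the 30 second timeout, the terminal names, and the manager code are all hypothetical, and the clock is injectable so the expiry can be shown without actually waiting.

```python
import time


class LeaseLock:
    """Record lock that expires unless its holder keeps renewing it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def _expired(self):
        return self.holder is None or self.clock() >= self.expires_at

    def acquire(self, terminal_id):
        # Free, expired, or already ours: take (or renew) the lease.
        if self._expired() or self.holder == terminal_id:
            self.holder = terminal_id
            self.expires_at = self.clock() + self.ttl
            return True
        return False

    def release(self, terminal_id):
        # Only the current holder can release normally.
        if self.holder == terminal_id:
            self.holder = None

    def override(self, manager_code, valid_codes):
        # A manager can break a stuck lock.
        if manager_code in valid_codes:
            self.holder = None
            return True
        return False


# Fake clock so the example runs instantly.
now = [0.0]
check_lock = LeaseLock(ttl_seconds=30, clock=lambda: now[0])

print(check_lock.acquire("terminal-1"))  # True: terminal 1 scans the check
print(check_lock.acquire("terminal-2"))  # False: still locked
now[0] = 31.0                            # terminal 1 crashed and never renewed
print(check_lock.acquire("terminal-2"))  # True: the lease ran out
```

With either the expiry or the override in place, a crashed terminal costs the customer seconds rather than a quarter of an hour.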
When doing the system for Palo Verde we used an explicit lock model, where users would lock the record by selecting edit and any other user who viewed the record was provided a visual indication it was locked. We used both solutions.
We allowed a manager to over-ride a lock. You can’t block a record from change just because someone left it open on their desk or stepped out.
And of course, if our application lost connection (this was not a web application so we had reliable connectivity) we unlocked any locks held by that lost worker.
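Roughly, the model looked like the sketch below. This is not the Palo Verde code; `LockTable`, its method names, and the record and session identifiers are invented for illustration, and a real system would keep this state on the server rather than in a dictionary.

```python
class LockTable:
    """Explicit record locks, tracked per connected session."""

    def __init__(self):
        self._owner = {}  # record_id -> session_id

    def lock(self, record_id, session_id):
        # setdefault stores session_id only if the record is unlocked.
        return self._owner.setdefault(record_id, session_id) == session_id

    def locked_by(self, record_id):
        # Lets the UI show other viewers who holds the record, if anyone.
        return self._owner.get(record_id)

    def manager_override(self, record_id):
        # Break the lock no matter who holds it.
        self._owner.pop(record_id, None)

    def on_disconnect(self, session_id):
        # Release everything the lost session was holding.
        released = [r for r, s in self._owner.items() if s == session_id]
        for record_id in released:
            del self._owner[record_id]
        return released


table = LockTable()
table.lock("check-17", "register-1")
print(table.locked_by("check-17"))        # register-1
print(table.on_disconnect("register-1"))  # ['check-17']
print(table.locked_by("check-17"))        # None
```

The point of `on_disconnect` is that the server, not the crashed client, is responsible for cleaning up.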
This isn’t hard. You’d expect it to be common, but I guess it’s not as common as I would have thought.
The other aspect that should have happened was that their checkout person should have gotten the manager immediately when we were prevented from paying. That is a training issue.
But their training weaknesses and operational policy issues wouldn’t have hampered them had the system been written correctly.
That error cost them at least two customers — and we’ll tell our friends why we no longer go there.
It’s said that a single disgruntled customer can cost you seven more. By that math, 14 more customers could be lost besides the two of us. That’s a lot of money to throw away because some development team decided they were going to use long-term locks.
Keep the Light!