This particular tale involves a company that offers subscriptions to its customers on a monthly, recurring basis. Customers purchase different types of plans and select various options associated with their subscriptions. As is often the case, the billing system was not originally designed to handle the requirements that emerged, and the winds of change that blew through the business, as the years rolled by. As such, the payment system had been patched (and tortured) numerous times over the years to accommodate newly discovered features and the functionality required to support ever-changing business needs.
Businesses have needs
A few years ago, the company decided to offer their customers an option to upgrade their subscription plans to a different tier. If a user was interested in the upgrade, they would click a button and be entered into an upgrade queue. Before each evening's billing, the queue was checked for any subscriptions that had reached their billing cycle date (i.e., were about to be billed). Each matching subscription's plan would then be updated, and the subscription removed from the upgrade queue, so that it would be billed that evening at the new plan amount. The development history was more complicated, but suffice it to say, changes took place and bugs were found and fixed. This cycle repeated until the system seemed to work well from both business and technical perspectives.
Time affects all things, even unchanging code
Fast forward three years. The company received a complaint from a user who had signed up for the upgraded plan three months prior and still had not been upgraded. Development was notified and asked to investigate the account. It looked fine. The database was checked, and the subscription was, in fact, flagged to be upgraded. The Oracle process was investigated, and the selection query was run by hand to see if the subscription would be upgraded. Everything indicated that it should; however, the user had been billed 5 days earlier without being upgraded. Checking the upgrade queue table in detail provided some evidence: there were around 3800 accounts waiting to be upgraded. Past trends indicated this was far too many subscriptions to be in the queue at one time. Something was definitely wrong.
When designing solutions, one must always expect that problems will happen down the line. All of the Oracle database jobs keep logs of when they start, each step they get through, and the final state of the job; so if something ever fails, it is easy to quickly identify the what, where, and how. The summary logs were reviewed.
# -- JOB_ID -- JOB_NAME -- TIMESTAMP -- COMPLETION_STATUS -- COMPLETION_TIMESTAMP
1 -- 60468396 -- PROCESS_UPGRADE -- 20XX-12-25 20:28:39 -- OK -- 20XX-12-25 20:29:08
2 -- 60493968 -- PROCESS_UPGRADE -- 20XX-12-26 20:28:42 -- OK -- 20XX-12-26 20:29:11
3 -- 60519486 -- PROCESS_UPGRADE -- 20XX-12-27 20:28:46 -- OK -- 20XX-12-27 20:29:39
4 -- 60545237 -- PROCESS_UPGRADE -- 20XX-12-28 20:28:50 -- OK -- 20XX-12-28 20:29:15
5 -- 60570885 -- PROCESS_UPGRADE -- 20XX-12-29 20:28:51 -- OK -- 20XX-12-29 20:29:54
6 -- 60596687 -- PROCESS_UPGRADE -- 20XX-12-30 20:28:54 -- OK -- 20XX-12-30 20:29:34
Everything had completed fine -- no errors. The ‘dba_jobs’ table was queried to verify the Oracle schedule for the job.
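The check itself amounts to something like the following query against the dba_jobs view (the filter on the job text is an assumption about how the entry was located):

-- Hypothetical lookup of the job's schedule; the LIKE filter is a guess.
select interval, what
  from dba_jobs
 where lower(what) like '%process_upgrades%';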
# -- INTERVAL -- WHAT
1 -- "sysdate + 1", "process_upgrades;"
Everything looked good at first. However, a second look at the job log made the problem clear: "sysdate + 1". Looking back at the start times in the logs, they had been creeping forward by a few seconds each run. How could this be?
"sysdate + 1" is "1 day from now" in Oracle speak. When a job is scheduled to happen and when it actually starts (calculating the next run schedule) are not the same. So when Oracle calculates the next start time it would sometimes be a few seconds later especially if the database was under heavy load. Each day there was a chance for this to creep forward a little bit. The job to process the evening billing was scheduled to run at 19:45, but the job’s start time had drifted to almost 20:30.
The list processing is designed to take place "before each evening billing." When the upgrade job runs, it checks to see who is to be billed that evening. Time does not typically travel backward; therefore, when the upgrade process runs after the billing process, naturally, it will not find anyone who needs to be billed because... they have already been billed! Thus the accounts would never be upgraded and would never leave the queue.
Bugfixing only fixes the immediate problem
The fix itself is simple. If the desire is to run at 19:30 the next day, the interval is set to trunc(sysdate + 1) + 19.5/24. This always sets the next run to 19:30 "tomorrow," regardless of what time of day the job executes. We then searched to make sure we didn't have this same problem anywhere else. We did. Thankfully, most of these were hourly jobs (sysdate + 1/24), so the drift had no significant business impact; they were fixed anyway to avoid leaving broken windows.
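Applying the fix is roughly a one-liner against DBMS_JOB; the job number below is a placeholder, not the real one.

-- Sketch of the fix: pin the next run to a fixed clock time rather than "now + 1 day".
begin
  dbms_job.interval(12345, 'trunc(sysdate + 1) + 19.5/24');
  commit;
end;
/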
Now for the damage. The job drifts forward, on average, 3 seconds every run, so a minute of drift takes 20 days, and an hour takes roughly 1200 days (60 * 20), or about 3.25 years. That is close enough to match the originally implemented start time. The billing process starts at 19:45, but does a lot of work before it gets to checking subscriptions; comparing against the billing logs, the subscriptions are actually checked at about 20:15. So the upgrade job had been starting after the subscription check for the last 15 minutes of drift (20:30 - 20:15), which at 3 seconds a day is approximately 300 days. The process had been broken for around 10 months.
Looking at the bottom line: the upgrade costs about $5/month. Assuming a 100% success rate and an average retention of 3 months gives a good upper bound: 3800 subscriptions * $5 * 3 months means potentially $57,000 in lost revenue. Luckily, some of the lost revenue can be recovered by processing the accounts still in the queue. A quick check showed that 300 of the requested upgrades belong to still-active accounts, so $4,500 (300 * 5 * 3) can be recovered; however, that still leaves a potential $52,500 loss due to a bug in code written three years ago.
Consider your options, don’t knee-jerk
This raises the obvious question of what could have been done (or could still be done) to prevent this. A different engineering approach is probably the first thing that comes to mind.
Perhaps instead of using Oracle's job scheduler, the process could have been scheduled as a cron job. Cron jobs are more familiar to most developers and probably an easier scheduling method to understand, and this type of time-shift error would not have occurred. Of course, that would mean this particular job's scheduling would no longer live in the database or under the DBAs' direct control. Given that none of the other jobs of this nature are scheduled with cron (they all use the Oracle scheduler), this process would become a special snowflake, which opens the application to other risks.
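For comparison, the cron version of the schedule would be a single crontab line; the wrapper script and log path here are hypothetical.

# Hypothetical crontab entry: start the upgrade processing at 19:30 every day.
30 19 * * * /opt/scripts/run_process_upgrades.sh >> /var/log/process_upgrades.log 2>&1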
Another option would be to make the two processes dependent on each other and have the billing process fail, or delay, until the upgrade job has successfully run. This ties the business logic together in a way that guarantees either both work or neither does. It sounds like an obvious win, but there are severe negatives to consider. If any new problem occurs with the upgrade process, the billing process will no longer run. If that delay stretches into the next day, it could have serious cash flow implications or even cause unrecoverable loss of revenue and subscriptions. Even though billing is monitored, this still increases risk. For this particular client, a partially complete billing is better than no billing at all.
Still another alternative would be to update the billing process itself to handle the queue processing as well. This approach puts all of the logic in one location, which is nice; however, the billing system is already quite complex and difficult to change without breaking it, and folding this feature in would add even more complexity. In general, keeping processes segmented and simple decreases the chances of other subtle bugs creeping in, so we wanted to keep these two pieces separate. If the upgrade queue processor needs to change in the future, having to fiddle with the core billing system to do it is never a great situation.
Monitoring is insurance against future mistakes
In the end, we decided against these additional changes; we already had a working solution, and the goal was to make the software more reliable, not to reinvent the wheel without a compelling advantage. What we did do was implement better monitoring, and more specifically, business logic monitoring. The previous solution had worked well until we hit the timing error, and with that error fixed it would continue to work well. But no matter how well you engineer a solution, nothing in a changing environment is immune to an edge case or to shifting assumptions and dependencies. Bugs cannot be totally avoided; what you can plan for is increased visibility when they do occur.
The business rule was to change the plan for the account on its next cycle and remove it from the list of subscriptions to be upgraded. The cycles were 30 days long, so a monitor could be added to verify that nothing in the list is ever more than 30 days old; if anything is, the monitor alerts for investigation. That's it. That simple check, maybe 15 minutes of someone's time to validate the execution of the business rule, would have been worth more than $50,000. With it in place, the liability is limited, and if an error does occur, the business can attempt recovery with a higher likelihood of success because of the faster response time.
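A sketch of that check, assuming an upgrade_queue table with a requested_date column (both names are made up for illustration); any rows it counts represent a broken business rule and should trigger an alert.

-- Hypothetical staleness monitor: a full 30-day cycle has passed and the
-- subscription is still sitting in the queue, so the rule was not applied.
select count(*) as stale_upgrades
  from upgrade_queue
 where requested_date < sysdate - 30;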
Additional monitors can (and should) be added to check other aspects of this business process. Are items being entered into the list daily? Is the number of newly listed accounts outside a standard deviation of the normal trend? Together, these simple additions can catch all sorts of unimagined engineering failures down the road for a small investment of time. Monitoring is an incremental process: each time a failure occurs, the underlying business goal should be considered and an appropriate monitor put in place to validate that it is being met. Code only exists to serve the goals of the business. Do not monitor the code; monitor the rules and goals of the business.