Failures are stepping stones to success! But definitely not in the production environment 🙂
No doubt you have written a great algorithm. Your code produced amazing results in the development environment, and you received a round of applause from your boss and colleagues. But the day after you deployed it to production, the code broke down.
Don’t worry, it’s not uncommon. It happens to many programmers, and no one likes being in that situation. It spoils the weekend plans and cancels out all the appreciation received. Above all, fragile code is a project risk from the sponsor’s point of view.
What makes the code fragile? An Illustration
Most of the time it happens because parts of your code depend on one another, but those dependencies are not managed correctly. In other words, interdependent code blocks are scattered across your logic as separate statements. This lets one block fail silently while the blocks after it keep running, so the algorithm returns the wrong results.
To illustrate, here is an example from one of my very early projects in Python. I was working on a billing module for a telecom operator. This module validates the outstanding payments and sends notifications to the customers who are nearing the due date.
This module contains four functions:
- Get Active Customers: This function visits the CRM database and retrieves the list of all the customers who have used our service during the last billing cycle.
- Find Outstanding Customers: This function validates each customer against a Payment API. It filters out the paid customers from the active customers and prepares a dataframe of outstanding customers to whom the notification has to be sent.
- Add Contact Details: This function retrieves the contact details of each customer using a Profiler API. It populates the outstanding dataframe with the email and mobile number details.
- Send Notifications: This function sends email and SMS notifications to the customers listed in the dataframe.
Our main method called these four functions one after another, each as a separate statement.
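A minimal sketch of such a main method, for illustration only. The function bodies, the sample data, the `api_up` flag, and the `payment_api_paid_ids` helper are hypothetical stand-ins for the CRM database and the external APIs, not the original project code. The sketch also assumes that the outstanding-customers step swallows the API error, which is one way the silent failure described in this article could arise.

```python
import pandas as pd

PAID_IDS = {2}  # hypothetical: customer 2 has already paid


def get_active_customers() -> pd.DataFrame:
    """Stand-in for the CRM database query."""
    return pd.DataFrame({"customer_id": [1, 2, 3]})


def payment_api_paid_ids(api_up: bool) -> set:
    """Stand-in for the Payment API; raises when the network is down."""
    if not api_up:
        raise ConnectionError("Payment API unreachable")
    return PAID_IDS


def find_outstanding_customers(customers: pd.DataFrame, api_up: bool) -> pd.DataFrame:
    try:
        paid = payment_api_paid_ids(api_up)
    except ConnectionError:
        # Assumption: the error is swallowed, so the caller never sees it
        # and receives the unfiltered list of active customers.
        return customers
    return customers[~customers["customer_id"].isin(paid)]


def add_contact_details(customers: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the Profiler API lookup."""
    contacts = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "email": ["a@example.com", "b@example.com", "c@example.com"],
        "mobile": ["555-0101", "555-0102", "555-0103"],
    })
    return customers.merge(contacts, on="customer_id", how="left")


def send_notifications(customers: pd.DataFrame) -> list:
    """Returns the ids that would be notified instead of sending anything."""
    return customers["customer_id"].tolist()


def main(api_up: bool = True) -> list:
    # Four separate statements: nothing ties one step's success to the next.
    active = get_active_customers()
    outstanding = find_outstanding_customers(active, api_up)
    with_contacts = add_contact_details(outstanding)
    return send_notifications(with_contacts)


print(main(api_up=True))   # [1, 3]
print(main(api_up=False))  # [1, 2, 3]
```

With the simulated API up, only customers 1 and 3 are notified. With it down, the paid customer 2 is notified as well, and no error is reported.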
Our implementation worked perfectly well in the development environment. But in production... oops, it failed.
What went wrong?
The second and third functions both depend on an external API. If the network fails in production, calls to these APIs return errors. When that happens, the functions calling them fail too.
That is what happened in our case. The function that finds outstanding customers failed, so the paid customers were not removed from the active list. But the contact details function proceeded as usual and added the details of all the customers to the dataframe. As a result, notifications went out to customers who had already paid.
How could it have been better?
We spent a lot of time troubleshooting to find the root cause of this problem. It was not just a lack of exception handling. It was mainly that we had not made our statements transactional where they needed to be.
The code should guarantee that the system sends only the correct notifications, or doesn’t send any notifications at all. In other words: do it correctly or don’t do it at all. There should never be a situation where the system sends the wrong notifications. There is always a high possibility that one or another of your code’s dependencies will go down in the live environment. You have to make your code resilient to these changes and interruptions.
The solution is to make steps 2 and 3 of the main method transactional, which means that they either execute together or fail together. If one fails, the other also fails. The statement succeeds only when both functions succeed.
What’s in Pandas
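With the pipe method of the DataFrame, we can chain a series of data transformations into a single statement. Each call passes the dataframe to a function and hands that function’s return value to the next call, so a step runs only after every step before it has succeeded. In this article, running in a transactional way means either:
- All of the transformations complete, in order, or
- The chain stops at the first failing step and returns no result.
This approach guarantees that a later step never works on the output of a step that failed, so the chain cannot return a result built on a partial transformation. It relies on each function raising an error when it fails rather than swallowing it, and it does not undo the side effects of steps that have already run. Under those conditions, our code becomes resilient to environmental failures.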
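As a small sketch of what pipe does, the example below compares a piped chain with the equivalent nested function calls. The helper functions `drop_paid` and `add_flag` and the sample data are made up for illustration.

```python
import pandas as pd


def drop_paid(df: pd.DataFrame, paid_ids: list) -> pd.DataFrame:
    """Keep only the customers whose id is not in paid_ids."""
    return df[~df["customer_id"].isin(paid_ids)]


def add_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Add a column marking every remaining customer for notification."""
    return df.assign(notify=True)


df = pd.DataFrame({"customer_id": [1, 2, 3]})

# df.pipe(func, *args) calls func(df, *args) and returns its result.
nested = add_flag(drop_paid(df, [2]))
piped = df.pipe(drop_paid, [2]).pipe(add_flag)

print(piped.equals(nested))  # True
```

The piped form reads top to bottom in execution order, while the nested form reads inside out. Both produce the same dataframe.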
How to Implement? Illustration Continued
Accordingly, we modified our main method to chain the steps with the pipe method, so that each function receives the dataframe returned by the previous one.
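A minimal sketch of how such a piped main method might look. As before, the function bodies, the sample data, and the `api_up` flag are hypothetical stand-ins rather than the original project code. The sketch assumes that the outstanding-customers step lets the API error propagate instead of handling it internally.

```python
import pandas as pd

PAID_IDS = {2}  # hypothetical: customer 2 has already paid


def get_active_customers() -> pd.DataFrame:
    """Stand-in for the CRM database query."""
    return pd.DataFrame({"customer_id": [1, 2, 3]})


def find_outstanding_customers(customers: pd.DataFrame, api_up: bool = True) -> pd.DataFrame:
    # Assumption: a Payment API failure is raised, not swallowed.
    if not api_up:
        raise ConnectionError("Payment API unreachable")
    return customers[~customers["customer_id"].isin(PAID_IDS)]


def add_contact_details(customers: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the Profiler API lookup."""
    contacts = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "email": ["a@example.com", "b@example.com", "c@example.com"],
    })
    return customers.merge(contacts, on="customer_id", how="left")


def send_notifications(customers: pd.DataFrame) -> list:
    """Returns the ids that would be notified instead of sending anything."""
    return customers["customer_id"].tolist()


def main(api_up: bool = True) -> list:
    # One statement: each step receives the previous step's return value.
    return (
        get_active_customers()
        .pipe(find_outstanding_customers, api_up=api_up)
        .pipe(add_contact_details)
        .pipe(send_notifications)
    )


print(main())  # [1, 3]
```

Calling `main(api_up=False)` raises `ConnectionError` from the second step, so the contact lookup and the notification step are never reached.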
Now let’s see how it behaves in production. The function that sends notifications doesn’t run unless the contact details are added. The contact details are not added unless the outstanding customers are filtered. This ensures that only the outstanding customers receive the notifications.
The function that finds outstanding customers could still fail due to network issues. When it raises an error, none of the later functions in the chain execute. This ensures that notifications are not sent to the paid customers.
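The failure case can be checked with a small sketch that records which steps actually run. The `calls` list and the stubbed functions are hypothetical, and the outage is simulated by raising `ConnectionError` directly.

```python
import pandas as pd

calls = []    # records which steps actually ran
error = None  # holds the message of the error that stopped the chain


def find_outstanding_customers(df: pd.DataFrame) -> pd.DataFrame:
    calls.append("find_outstanding")
    raise ConnectionError("Payment API unreachable")  # simulated outage


def add_contact_details(df: pd.DataFrame) -> pd.DataFrame:
    calls.append("add_contacts")
    return df


def send_notifications(df: pd.DataFrame) -> pd.DataFrame:
    calls.append("send")
    return df


active = pd.DataFrame({"customer_id": [1, 2, 3]})

try:
    (
        active
        .pipe(find_outstanding_customers)
        .pipe(add_contact_details)
        .pipe(send_notifications)
    )
except ConnectionError as exc:
    # The caller sees the failure and can retry or alert someone.
    error = str(exc)

print(calls)  # ['find_outstanding']
print(error)  # Payment API unreachable
```

Only the failing step appears in `calls`. Because the error surfaces at the call site, it can be logged or retried there instead of being discovered after the wrong notifications have gone out.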
Unmanaged dependencies create fragile code. The dataframe pipe method runs a series of transformations together as a single statement. It organizes the code better, makes the dependencies between steps explicit, and makes the code more resilient. The billing illustration shows how the same code behaves before and after using the pipe method.