A Simple Method to Estimate Missing Time Series, Resampler Interpolate

A step by step guide to use pandas in exploratory data analysis


If you want to reorganize the time series data to a different frequency and then replace the resulting NaN values with a scientific estimation then resampler interpolate is the best choice.

Developer Notes

Resample Interpolate function provides a scientific estimation for the missing values arising out of resampling. It performs this estimation by using both the linear & polynomial regression equations.

Photo by M. B. M. on Unsplash

Let’s see it in detail with an example. Implemented this module for a retail client who runs in-store and e-commerce.

Business Case

The price monitoring system of our web store regularly extracts the product data from various competitor websites. It extracts the raw data at a fixed frequency, once every 90 minutes.

The business analysts analyze this data to understand the competitor’s price-changing strategies.

Missing Values From Source
Sometimes the competitor website returns null values due to technical issues. Those missing values have to be filled with the best estimation from the rest of the values.

Though the data extraction is at a fixed frequency of 90 minutes, the data analysis would be at a higher frequency like at 60 or 30 minutes.

Missing Value From Resampling
Resampling the data to a higher frequency is resulting in the introduction of more null values. When the raw data contains no record for the new interval, the resampler assigns a null value to that interval.

Requirement
These null values, either from the source or from resampling, have to be replaced with a scientific estimation. Either by using a linear graph or by using a polynomial curve, the module should find the best fit value for every missing interval.

Let’s see how we can solve this problem using Resample Interpolate method.

Solution

  1. Firstly, get the raw data received as per the baseline frequency. Competitor’s price for the Moisturizer product at every 90 minutes.

Linear Interpolation

  1. Address the null values by using interpolate method. By default, this method performs a linear estimate. The estimation for the first null value is 11.50 which is the mean of its precedessor and successor. In the same way the estimation for the second null value is 11.75.
  1. This dataset has a frequency of 90 minutes. One record for every 90 minutes. Say, the business analyst wants to see this dataset at a higher frequency, say at a 60 minutes interval. If you directly resample this dataset to a higher frequency, you’ll see more NaN values in the resultant.

Polynomial Interpolation

  1. So, you tell the resampler to interpolate the missing values. That is to make an estimation by using a first order polynomial regression. This regression model internally creates a polynomial curve. The missing value will be derived as a best fit point on that curve.
  1. All NaN values are filled up except the last row. The problem with this method is that the curve ends at the last known value. So it cannot predict the value beyond that. So, use a forward fill method to address the last element.

Linear vs Polynomial
The choice of linear vs polynomial is business context-driven. polynomial regression considers more independent variables than the linear estimate. In the first interpolation, the missing values were due to some system failures. What we were doing in that context is a simple approximation (not simulation) of the missing values. In the second interpolation, we are simulating more intervals, so it is better to consider as many inputs as possible from the population. So, we have chosen the polynomial regression method.

  1. Now the table is ready. Resampled at a higher frequency, the intermittent values are interpolated. And the last value is forward filled. We can now plot the graph
  1. That’s it. This is what we need. This provides a better picture to the business analyst on how the prices are being changed by the competitor.

Trend: Every one hour they are increasing the prices till 12 pm and then decreasing it every one hour till 2 pm.

The business analyst can quickly change the resampler frequency from 60 minutes to another level, say 30 minutes, and continue the trend exploration.

Result

We have replaced the missing values of the source data using the Interpolation technique. Resampled the data to a higher frequency and interpolated the missing values using a polynomial estimate. This made the dataset flexible and suitable for exploratory data analysis with consistent values.

What Next?

If you would like to share your business case on how you’ve used this method, please send the details to pub@additionalsheet.com. Our editorial team shall get back to you in adding it to this publication. Please contribute to the integration of knowledge towards building a new generation of fast learners. An example speaks a thousand paragraphs.

Also Read

An Excellent Method for Instant Data Transformation, DF GroupBy Transform

How is GroupBy Rank used in Pricing Intelligence?

Write a Comment

Comment