If you want to reorganize time-series data at a different frequency and replace any resulting NaN values with the next available value, then resampling followed by backfill is the best choice.
When you resample data to a higher frequency, some of the newly introduced intervals may contain NaN values. In such a case, backfill replaces each NaN with the value from the next interval.
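As a minimal sketch of the idea (the dates and values here are hypothetical), consider data reported every 10 days and resampled to a 5-day frequency. `asfreq()` leaves the newly introduced intervals as NaN, and `bfill()` fills each NaN with the next known value:

```python
import pandas as pd

# Hypothetical values: three observations, 10 days apart.
s = pd.Series([10, 20, 30],
              index=pd.date_range("2023-01-01", periods=3, freq="10D"))

# Upsample to 5 days: every second interval is new and has no data.
upsampled = s.resample("5D").asfreq()   # [10, NaN, 20, NaN, 30]

# Backfill: each NaN takes the next known value.
filled = upsampled.bfill()              # [10, 20, 20, 30, 30]
print(filled)
```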
Will there be any impact on the values of the original distribution?
No. Only the new records introduced by the increased frequency are modified by this method, and only if those records contain NaN values.
If there are NaN values in many consecutive intervals, will this replace all of them with the next available value?
Yes. By default, it replaces all of them with the next available value, but you can cap this by passing the limit argument, which specifies the maximum number of consecutive NaN values to replace.
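A small sketch of the limit argument, again with hypothetical values: with four consecutive NaNs and `limit=2`, only the two NaNs nearest the next valid value are filled, and the rest stay NaN:

```python
import pandas as pd

s = pd.Series([10, None, None, None, None, 50])

# Fill at most 2 consecutive NaNs, counting back from the next valid value.
capped = s.bfill(limit=2)   # [10, NaN, NaN, 50, 50, 50]
print(capped)
```

The same argument is accepted on a resampler, e.g. `df.resample("5D").bfill(limit=2)`.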
Let’s see this with an example.
Our analytics server aggregates the weekly sales of each product and sends the details to the server that generates the sales reports. The store manager reviews this report and explores the data to optimize the re-order plan. An optimal re-order plan is essential to ensure there is no lost demand while also ensuring working capital is not locked up in excess inventory.
To perform this exploratory data analysis, the store manager has to resample the original data at various lower and higher frequencies. Missing data has to be handled systematically and automatically during this analysis.
Explore original data
- First, get the raw data received at the baseline frequency: the number of products sold at the end of each week.
Here the time interval is seven days. The number of products sold in each 7-day interval is reported in tabular form for 10 weeks.
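The weekly table can be reconstructed like this; the sales figures below are hypothetical stand-ins, since the article's actual numbers are not shown:

```python
import pandas as pd

# Hypothetical weekly sales for one product over 10 weeks
# (weekly frequency, reported each Sunday).
weekly = pd.Series(
    [25, 20, 30, 35, 28, 32, 27, 30, 26, 29],
    index=pd.date_range("2023-01-01", periods=10, freq="W"),
    name="units_sold",
)
print(weekly)
```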
- Let’s plot a graph to see its trend.
Explore Resampled Data
- Now we want to see the trend at a higher frequency of 5 days, because some suppliers deliver the products once every 5 days. If we have to go with these suppliers, what should our re-order quantities look like? Plot the graph.
Oh! This looks unrealistic. Why are there zero quantities in some of the middle intervals? That’s because resampling introduced some additional intervals. The resampler has no data for these periods in the original dataset, so it simply reports 0 for them.
- Look at the resampled data table
Does it mean that we sold no items during these periods? Obviously, that doesn’t reflect the real trend of sales.
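Continuing with the hypothetical weekly series from above, resampling to 5 days with `.sum()` reproduces this effect: three newly introduced intervals contain no original data points, and the aggregation reports them as 0:

```python
import pandas as pd

# Same hypothetical weekly data as before.
weekly = pd.Series(
    [25, 20, 30, 35, 28, 32, 27, 30, 26, 29],
    index=pd.date_range("2023-01-01", periods=10, freq="W"),
)

# 10 weekly points spread over 13 five-day bins; 3 bins are empty.
five_day = weekly.resample("5D").sum()
print(five_day)
print((five_day == 0).sum())   # number of empty intervals reported as 0
```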
Explore the Resampled and Backfilled Data
- Here comes the real need for the backfill function: ask the resampler to replace the unknown values with the next known value.
You may point out that, since we have introduced 3 new values, the total number of sales in this table doesn’t tally with the original dataset. You are right: the resampled table doesn’t tally on the total count. The purpose of resampling is to show the most realistic count at each interval, not to preserve the total.
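One way to sketch this with the hypothetical data: `sum(min_count=1)` makes the empty intervals come out as NaN rather than 0 (a plain `.sum()` reports 0, which `bfill` would not touch), and `bfill()` then copies the next known value into each of them:

```python
import pandas as pd

weekly = pd.Series(
    [25, 20, 30, 35, 28, 32, 27, 30, 26, 29],
    index=pd.date_range("2023-01-01", periods=10, freq="W"),
)

# min_count=1 leaves bins with no data as NaN instead of 0.
five_day = weekly.resample("5D").sum(min_count=1)
filled = five_day.bfill()
print(filled)

# The backfilled series no longer tallies with the weekly total.
print(weekly.sum(), filled.sum())
```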
- Now plot the trend. It looks more realistic.
Likewise, we can resample at different frequencies. For example, some other suppliers may deliver once every 3 days; you can check the pattern from their perspective by resampling the same dataset at a 3-day frequency.
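The same hypothetical dataset viewed at a 3-day frequency, mirroring a supplier who delivers every 3 days, is a one-line change:

```python
import pandas as pd

weekly = pd.Series(
    [25, 20, 30, 35, 28, 32, 27, 30, 26, 29],
    index=pd.date_range("2023-01-01", periods=10, freq="W"),
)

# Only the frequency string changes; the rest of the pipeline is identical.
three_day = weekly.resample("3D").sum(min_count=1).bfill()
print(three_day)
```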
That’s how it supports open-ended exploratory data analysis in a simplified way.
Resampling can be done at different frequencies, higher and lower, without any need to re-extract different datasets. Backfill ensures the NaNs are filled with more realistic values, making the dataset flexible and suitable for exploratory data analysis.
If you would like to share a business case on how you’ve used this method, please send the details to email@example.com. Our editorial team will get back to you about adding it to this publication. Please contribute to the integration of knowledge towards building a new generation of fast learners. An example speaks a thousand paragraphs.