Correlation is just an opinion until it is statistically measured. Yes, really.
My roommate never keeps his desk clean. He proudly says the chaos correlates with the state of the random thoughts in his mind. I normally neither argue nor advise further, as I do not know how to measure the correlations of imagination. One thing I’m sure of is that he often uses scientific terms in a very casual way. Sometimes it’s pretty, but many times…!?
He works as an assistant store manager at a nearby retail store. One evening he came back early from work and rang the doorbell. The moment I opened the door, he started crying like a child, like a child who had lost his candy on the school bus. Suppressing my laughter, I tried to console him for a while.
Once he had settled down on the sofa, I asked him what had happened. He was about to start crying again, but I offered him some water. After a sip, he started explaining the problem.
“Earlier this month I gave my manager an idea that has generated a lot of sales. But when the time came to give me the incentive at the end of the month, he refused. He says that the growth in sales is general and not linked to my idea.”
“OK, what was your idea?” I asked.
“We sell AA batteries at our store. Expired batteries are a loss, so we need to sell them fast. But sales have not been up to the mark for the last few months. In the ideation meeting, I told the store manager to place the batteries in the stationery section as well as in the existing electronics section.
“Overall, battery sales have increased by 12% this month. But the manager says that’s a general rise in demand for batteries and not specific to the new point of purchase I suggested.”
He said it in a voice of desperation. I thought I could solve this problem, so I asked him to stay calm and set about working through it with him.
The statistical problem here is:
What does the rise in sales correlate with more strongly?
You and your manager differ on the answer. Your manager thinks the rise is more correlated with the general rise in demand. You think the rise has a strong correlation with the new point of purchase.
Let’s put the data together
Give me the sales data for the first 10 days of this month and of last month. How many batteries did you sell per day at each of these points of purchase?
Last month you had only one point of purchase (the electronics section), and this month you have two (the electronics and stationery sections). We now organize the data into three series as follows:
import pandas as pd

# Number of items sold from display point 1 last month
series1 = pd.Series([180, 257, 220, 238, 159, 306, 400, 353, 407, 220])
# Number of items sold from display point 1 this month
series2 = pd.Series([198, 305, 259, 158, 277, 296, 424, 443, 445, 210])
# Number of items sold from display point 2 this month
series3 = pd.Series([315, 230, 335, 133, 263, 235, 169, 200, 280, 140])
Series 1: Number of batteries sold during the first 10 days of last month from the first display point, the electronics section.
Series 2: Number of batteries sold during the first 10 days of the present month from the first display point, the electronics section.
Series 3: Number of batteries sold during the first 10 days of the present month from the second display point, the stationery section.
Well-organized data is a problem half solved. I could see a ray of hope on my friend’s face. Let’s continue.
Let’s restate the problem, now in correlation terms:
Now the problem is to find out which pair of these series is more strongly correlated.
As per your manager:
There is a weak correlation between series 1 and series 2, indicating a change in the regular demand for batteries.
A strong correlation exists between series 2 and series 3, indicating that the items at the second point of purchase were sold only as part of the regular demand. The same additional quantity would have been sold from the electronics section even if the batteries had not been placed at the new point of purchase, the stationery section.
In your view:
There is a strong correlation between series 1 and series 2, showing that there is no big difference in the regular demand.
A weak correlation exists between series 2 and series 3, showing that the new demand is not linked to the regular demand.
Why wait? Let’s find out
c21 = series2.corr(series1)
c32 = series3.corr(series2)
print("Correlation between series 2 & series 1: " + str(round(c21,2)))
print("Correlation between series 3 & series 2: " + str(round(c32,2)))
And the output looked like this:
Correlation between series 2 & series 1: 0.85
Correlation between series 3 & series 2: 0.05
There is something for my friend to cheer about: the output shows a strong correlation between series 1 and series 2.
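For anyone curious what `corr` is doing behind the scenes: by default, `Series.corr` computes Pearson's correlation coefficient. Here is a minimal sketch of that calculation in plain Python, assuming the same three lists of daily sales. The helper name `pearson` and the list names are my own, not part of pandas.

```python
import math

# Daily sales for the first 10 days (same numbers as series 1, 2 and 3)
sales1 = [180, 257, 220, 238, 159, 306, 400, 353, 407, 220]
sales2 = [198, 305, 259, 158, 277, 296, 424, 443, 445, 210]
sales3 = [315, 230, 335, 133, 263, 235, 169, 200, 280, 140]


def pearson(x, y):
    """Pearson's r: how closely two lists rise and fall together (-1 to 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviation of each day's sales from that list's own average
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    # Numerator: sum of the products of paired deviations
    co_variation = sum(a * b for a, b in zip(dev_x, dev_y))
    # Denominator: product of the root sums of squared deviations
    spread_x = math.sqrt(sum(a * a for a in dev_x))
    spread_y = math.sqrt(sum(b * b for b in dev_y))
    return co_variation / (spread_x * spread_y)


r21 = pearson(sales2, sales1)
r32 = pearson(sales3, sales2)
print("Series 2 vs series 1:", round(r21, 2))
print("Series 3 vs series 2:", round(r32, 2))
```

Rounded to two decimals, this should reproduce the 0.85 and 0.05 that pandas reported.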
Here is how the result went
And the winner is my crying roommate. The items sold from display point 1 follow more or less the same pattern as last month. This suggests that the manager’s assumption of a general hike in demand is not correct. If demand had changed, the change should show up at display point 1 as well. Instead, its sales are strongly correlated with the previous month’s data.
The correlation between series 3 and series 2 is weak. So it is not the general demand that is reflected at display point 2. It is a new demand created at that point, not correlated with the regular demand served at display point 1.
When correlation is not measured, it stays just an opinion. A pandas Series provides a beautiful yet simple way of measuring the correlation between two lists of numeric values. Turn your opinion into evidence by calculating the correlation instantly.
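As a closing sketch, the three series can also go into a single DataFrame, so that one call to `DataFrame.corr` returns every pairwise correlation at once. The column names below are labels I made up for illustration; the numbers are the same daily sales used above.

```python
import pandas as pd

# One column per series; the column names are illustrative labels only
sales = pd.DataFrame({
    "electronics_last_month": [180, 257, 220, 238, 159, 306, 400, 353, 407, 220],
    "electronics_this_month": [198, 305, 259, 158, 277, 296, 424, 443, 445, 210],
    "stationery_this_month": [315, 230, 335, 133, 263, 235, 169, 200, 280, 140],
})

# Pairwise Pearson correlations between all columns (Pearson is the default)
corr_matrix = sales.corr()
print(corr_matrix.round(2))
```

The diagonal of the matrix is always 1.0, since every column is perfectly correlated with itself. The off-diagonal entries include the two values my roommate cared about.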