Time and Tide
Thus, armed with our Nuts and Bolts, we can easily arrange and rearrange our data to form observations about our experiment. The first thing we can do is figure out exactly how many of our traders made money as a whole:
Our test period ran from January 1, 2000 to February 27, 2007, during which there were a total of 1,767 trading days. Each day, one of our hypothetical traders made a purchase and liquidated their holdings 180 trading days later. Of the total 1,767 traders, 503 registered a loss and 1,264 made money. In other words, of the 1,767 trading days in our sample, 503 would have produced a loss over the next 180 days, while 1,264 would have produced a gain. This observation is nothing groundbreaking or spectacular. We now know, after the fact, that we have been in a bull market since 2003, most of which was captured by the time frame of our sample, and it is only logical to conclude that the majority of trades made in a bull market would have resulted in a gain. There were 2.51 winning trades for every losing trade, or a 72% win rate in terms of number of trades.
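The two headline figures follow directly from the counts above. Here is a minimal Python sketch of that arithmetic; the variable names are invented for illustration, and only the three counts come from the test itself.

```python
# Counts reported for the test period (January 2000 to February 2007).
total_traders = 1767
losers = 503
winners = 1264

# Sanity check: every trader ended up as either a winner or a loser.
assert winners + losers == total_traders

win_loss_ratio = winners / losers   # winning trades per losing trade
win_rate = winners / total_traders  # share of all trades that made money

print(f"wins per loss: {win_loss_ratio:.2f}")  # 2.51
print(f"win rate: {win_rate:.1%}")             # 71.5%, which rounds to 72%
```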
However, the value of our experiment now lies in disproving our null hypothesis. Were we to group our days by month, this is how the results would look:
Now we’re seeing some value in our observations. Each month has a different mix of winners and losers. Take note that despite the fact that many of these months occurred within the span of the bull market, they all have different performances.
Now before we cry EUREKA and gush over the data, I want to take this time to introduce the not-so-popular topic of significance testing. Statisticians and scientists, when gathering their observations, often test them for significance: in other words, they estimate the likelihood that their observations arose purely by chance.
Consider a coin which we can assume to have a 50/50 chance of landing heads or tails, or in other words, unbiased pure randomness. Even with such a random scenario, it is possible to come up with 9 heads or 9 tails in succession. So what might initially appear to be a pattern might actually be due to nothing more than dumb luck.
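To put a number on the coin example, the chance of a streak can be computed exactly. The sketch below is only an illustration: the function name prob_streak is invented here, and it assumes a perfectly fair coin with independent flips.

```python
def prob_streak(n_flips: int, streak: int) -> float:
    """Chance that n fair flips contain a run of `streak` identical outcomes."""
    if streak < 2:
        raise ValueError("streak must be at least 2")
    if n_flips < streak:
        return 0.0
    # state[r] = probability that the current run has length r
    # and no run of the target length has been seen so far.
    state = [0.0] * streak
    state[1] = 1.0  # after the first flip the run length is always 1
    hit = 0.0       # probability that the streak has already occurred
    for _ in range(n_flips - 1):
        nxt = [0.0] * streak
        for r in range(1, streak):
            p = state[r]
            nxt[1] += 0.5 * p          # flip differs: the run restarts
            if r + 1 == streak:
                hit += 0.5 * p         # flip matches and completes the streak
            else:
                nxt[r + 1] += 0.5 * p  # flip matches: the run grows by one
        state = nxt
    return hit

# Exactly nine flips: 9 heads or 9 tails happens 1 time in 256.
print(prob_streak(9, 9))
# The longer we keep flipping, the more likely a streak shows up somewhere.
print(prob_streak(100, 9))
```

With only nine flips the streak is rare, but the chance grows as the number of flips grows, which is exactly why an apparent pattern needs to be tested before it is trusted.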
In statistics, there are numerous tests to see whether your results are just due to chance variation. I won’t go into a lengthy discussion of them here, but one test in particular, the chi-square non-parametric test, is appropriate for our discussion. Go back to the table above, where each month has a different ratio of winners to losers, both from the other months and from the total sample. If our null hypothesis were true, then all months would have the same ratio of winners to losers. Since we know the ratio of winners to losers for the entire sample (i.e. 2.51), this is what our table should have looked like if our null hypothesis were true:
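The expected figures for any month come from splitting that month’s trades in the same proportion as the whole sample. A small sketch of the idea follows; the function name expected_split and the 150-trade month are hypothetical and are not taken from the table, while the overall counts are the ones reported earlier.

```python
# Overall counts from the full sample of 1,767 trades.
TOTAL_WINS = 1264
TOTAL_LOSSES = 503

def expected_split(trades_in_month: int) -> tuple[float, float]:
    """Expected (wins, losses) for a month that simply mirrors the overall mix."""
    total = TOTAL_WINS + TOTAL_LOSSES
    expected_wins = trades_in_month * TOTAL_WINS / total
    expected_losses = trades_in_month * TOTAL_LOSSES / total
    return expected_wins, expected_losses

# Hypothetical month with 150 trades (an assumption, not a figure from the study).
wins, losses = expected_split(150)
print(f"expected wins: {wins:.1f}, expected losses: {losses:.1f}")
print(f"ratio: {wins / losses:.2f}")  # same 2.51 ratio as the whole sample
```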
This is what we call our “expected distribution”, which we now compare with the “actual distribution” we got earlier. Given an actual distribution and an expected distribution, the chi-square non-parametric test tells us the odds that the differences in the actual distribution occurred just out of dumb luck. Fortunately for us, again, we do not need to be math majors to compute chi-square. A modern Excel spreadsheet can do it for us automatically.
The Excel spreadsheet command for this is =CHITEST([actual range], [expected range]), where the actual range is the selection of cells from our first table, and the expected range is the selection of cells from the second table.
The resulting figure from the chi-square test is 0.00000000000000000000000000015; in other words, the odds of our monthly distribution showing such a different mix of returns purely by chance are 1 in 6.79 billion-billion-billion. So there’s still a chance, but it’s so remote that it’s practically non-existent. Is that significant enough for you?
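For readers without Excel, the same kind of calculation can be sketched in plain Python. Treat this as an illustration under stated assumptions rather than a replacement for the spreadsheet: the helper names chi2_sf and chitest are invented for this sketch, the three-month table is made up and is not the data from the study, and the degrees-of-freedom rule follows the one Excel documents for CHITEST.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail term plus a finite series x^k / (1*3*...*(2k+1)).
    term, total = 1.0, 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    series = math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + series

def chitest(actual, expected):
    """P-value for two equally shaped tables of counts, CHITEST style."""
    rows, cols = len(actual), len(actual[0])
    stat = sum(
        (a - e) ** 2 / e
        for row_a, row_e in zip(actual, expected)
        for a, e in zip(row_a, row_e)
    )
    # Degrees of freedom: (r-1)(c-1), or length-1 for a single row or column.
    df = (rows - 1) * (cols - 1) if rows > 1 and cols > 1 else max(rows, cols) - 1
    return chi2_sf(stat, df)

# Made-up (wins, losses) counts for three months, NOT the study's data.
actual = [[80, 20], [60, 40], [70, 30]]
total_wins = sum(w for w, _ in actual)
total_trades = sum(w + l for w, l in actual)
# Expected counts if every month shared the overall win rate.
expected = []
for w, l in actual:
    exp_wins = (w + l) * total_wins / total_trades
    expected.append([exp_wins, (w + l) - exp_wins])

p_value = chitest(actual, expected)
print(f"p-value: {p_value:.4f}")
```

The smaller the printed value, the less plausible it is that the month-to-month differences are chance alone.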
Next: Not All Days Are Equal…


Isn’t the chitest non-parametric test used to determine the relationship between two nominal variables? I don’t see either data being nominal or any variable violating the normality assumption, so wouldn’t it be more appropriate to use a parametric test, which is viewed as packing a lot more statistical muscle? Also, wouldn’t a significance level lower than what was expected only lead to the conclusion that there is a relationship or, for hypothesis testing, that there was a difference between what was expected and what was observed?
Thanks for your feedback, chupitero. Actually a parametric test (e.g. t-test, etc.) may be performed, but the problems lie in the experiment design:
1. There are no two populations that can be compared against each other.
2. We don’t have a proper assumption on the expected mean of the distribution. Arguably for price data, the variance of the distribution could be infinite, so we can’t predict an “average price”.
As a workaround, chitest was used, by transforming the months into categories and converting the days in each month into frequencies, in effect converting the experiment into nominal data. This avoids the need to presume a mean for the distribution (i.e. the actual price values), while dwelling on the actual occurrences which satisfy the criteria.
The chitest in Excel computes the level at which the test is significant (the p-value), and at such a low level it does suggest that there is a difference between the expected and actual distributions, which leads us to reject the null hypothesis that Seasonality has no effect on returns.
I do agree that parametric tests are more powerful, and I appreciate any suggestions on how we can adjust the experiment to accommodate such a test.
Isn’t the second population the win-loss ratio? If we have the data for each 180-day trial then we can compute the mean, can’t we? Personally though, I would prefer using % gain for each 180-day period instead of a win-loss ratio. As a guide, I think we can use the methodology found in various studies on the January effect in the US stock market.
In the experiment, there’s only one population of 1,767 180-day trials which result in either a Win or a Loss. We then categorize them by month to see if the month (i.e. seasonality) has an effect on the distribution of wins and losses.
Are you suggesting we take the winning trades as one population, and the losing trades as another population, and then check for differences in their means? We already know this, since the winning trades will have a positive return, and the losing trades will have a negative one. This won’t help us in isolating the effect of the month.
As an aside, another way I thought of earlier was to treat each monthly batch of trades as a separate population, then check for differences in the means. But your standard parametric tests won’t accommodate more than two populations, and we’ll have 12 sets of trades to check against each other. So maybe another method (like ANOVA) might do the trick?
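As a rough sketch of what the ANOVA idea could look like, the snippet below computes the one-way ANOVA F statistic for several groups at once. Everything here is hypothetical: the function name one_way_anova_f is invented, the % gains are made-up numbers rather than results from the experiment, and only three groups are shown to keep it short. The resulting F would still need to be compared against an F distribution with the returned degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of each value around its own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up % gains for three monthly batches (NOT real results).
hypothetical_gains = [
    [1.0, 2.0, 3.0],  # e.g. trades opened in one month
    [2.0, 3.0, 4.0],  # trades opened in a second month
    [5.0, 6.0, 7.0],  # trades opened in a third month
]
f_stat, df_between, df_within = one_way_anova_f(hypothetical_gains)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```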
My mistake, there’s only one population. I still think we should use % gain instead of a win-loss ratio, although the sample size would need to be increased from just 7 years. I don’t see why we can’t use the more common t statistic instead. The standard p value of .05 for significance tests still leaves a 5% probability that the result was due to chance, but I think the most important thing is how we explain the results. How do we explain the market anomaly? Is the market just egregiously inefficient or is there some other hypothesis?
I agree that the % gain should also be tested for significance. My current dilemma is how to redesign the experiment to be able to use other tests to isolate the effect of the month-of-the-year variable. The Chi-test only tests for frequencies, which is, in itself, incomplete.
Meanwhile, as to the more illuminating question: “Why?” This study opens a Pandora’s box of other possible things:
1. A cyclical component, which is in itself an anomaly. And I agree that this should be extended further into the past. In the next run I plan to take it back at least 10, maybe 20 years to see if the dynamic persists. If it doesn’t, then it begs another question: why does it work from 2000?
2. Seasonality is tricky since we are likely capturing some other behaviour that just happens to fall at the proper time. This is either an effect on prices in October, or an effect on prices 180 days hence in March/April, or both.
3. For companies: culprits I’d like to check are corporate fiscal years, taxation periods and quarterly earnings. The activity around these items usually falls in September and April, so it may contribute to something.
4. Just out of the blue: Fund rebalancing activity:
a. The Ghost Months phenomenon, which is associated with low fund activity, usually occurs from May to September, which is incidentally the antithesis of the ideal period indicated by the study, October to April. The data shows poor performance of purchases made in January to March; the 180-day sell periods of these months would fall within the Ghost Months.
b. For the same reason, the “window dressing” myth which occurs around December to January could also account for the strong finish of our ideal period.
5. Macroeconomic cycles, weather patterns are also good “oddball” candidates to check.
The problem as always, with any of the above, is the availability of data to test against.