Article By: Ria Persad, President, StatWeather

Whether you are forecasting the weather, stocks, elections, or sports, the question of when to go with a consensus is critical to maximizing gains. This article will present some rules of thumb and the quantitative reasoning behind them.

Let's say that weather forecaster A is usually too warm by 2 degrees. Forecaster B is usually too cool by 2 degrees. It would make sense that the average of forecasts A and B would be more accurate than either of them alone, because their errors cancel.

What if forecaster A consistently hit the nail on the head, but forecaster B consistently was wrong? Averaging these two forecasts would water down the accuracy gained by only going with forecast A. In this case, just going with the better forecast would be the better plan.
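Both scenarios are easy to check numerically. The following is a minimal sketch (not StatWeather's method); all error magnitudes and the Gaussian noise model are illustrative assumptions chosen only to make the two effects visible:

```python
import random

random.seed(0)
truth = [random.uniform(50, 90) for _ in range(1000)]  # hypothetical verifying temps

# Scenario 1: A runs +2 degrees warm, B runs -2 degrees cool (plus small noise).
a = [t + 2 + random.gauss(0, 1) for t in truth]
b = [t - 2 + random.gauss(0, 1) for t in truth]
avg = [(x + y) / 2 for x, y in zip(a, b)]

def mae(forecast):
    """Mean absolute error against the verifying values."""
    return sum(abs(f - t) for f, t in zip(forecast, truth)) / len(truth)

print(f"A alone:   {mae(a):.2f}")    # roughly 2 degrees of error
print(f"B alone:   {mae(b):.2f}")    # roughly 2 degrees of error
print(f"A+B blend: {mae(avg):.2f}")  # opposite biases cancel; error drops sharply

# Scenario 2: pair an accurate forecaster with a consistently wrong one.
good = [t + random.gauss(0, 0.5) for t in truth]
bad = [t + 8 + random.gauss(0, 3) for t in truth]
blend = [(g + w) / 2 for g, w in zip(good, bad)]
print(f"good alone: {mae(good):.2f}")   # small error
print(f"good+bad:   {mae(blend):.2f}")  # the bad forecast dilutes the good one
```

The blend of A and B beats either alone, while blending the accurate forecaster with the poor one roughly halves the bad forecaster's bias into the result rather than removing it.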

So on a very basic level, if we have forecasts that demonstrate “comparable” accuracy, then combining them can result in some cancellation of errors and, hence, a better forecast. However, if one forecast is consistently the “winner”, then combining it with a far inferior forecast is not a good strategy.

However, determining whether the skill of two forecasters is “comparable” is not always a straightforward task.

What if the average error of forecaster A is +/-5 degrees and the average error of forecaster B is +/-6 degrees, but forecaster A consistently runs a -1 degree bias (that is, forecaster A is consistently off-center by 1 degree)? Who would you say is the better forecaster? Should a consensus be used? In this case, forecaster A's predictions can be offset by +1 degree. This is what some climate modelers do when they see a warming or cooling trend: they offset their forecasts to remove the bias and re-center them. But even with this calibration, should forecast A and forecast B be combined for a better forecast?
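One way to see how this plays out is to estimate the bias from past performance, subtract it, and then blend. This is a hypothetical sketch using the article's numbers (-1 degree bias, spreads of roughly 5 and 6 degrees) with assumed Gaussian errors:

```python
import random

random.seed(1)
truth = [random.uniform(40, 80) for _ in range(2000)]

# Assumed setup: A has a consistent -1 degree bias but a tighter spread;
# B is centered but noisier.
a = [t - 1 + random.gauss(0, 5) for t in truth]
b = [t + random.gauss(0, 6) for t in truth]

# Estimate A's bias from historical errors, then offset (calibrate) it.
bias_a = sum(f - t for f, t in zip(a, truth)) / len(truth)  # close to -1
a_cal = [f - bias_a for f in a]

def mae(fc):
    """Mean absolute error against the verifying values."""
    return sum(abs(f - t) for f, t in zip(fc, truth)) / len(truth)

blend = [(x + y) / 2 for x, y in zip(a_cal, b)]
print(f"A raw:        {mae(a):.2f}")
print(f"A calibrated: {mae(a_cal):.2f}")
print(f"A_cal + B:    {mae(blend):.2f}")  # independent noise partially cancels
```

Under these assumptions the answer is yes: once A is re-centered, blending the two independent error streams beats either forecaster alone, because their uncorrelated noise partially cancels.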

In statistics, there are tests that determine whether one distribution differs significantly from another, or whether one group of forecasts is statistically “more” or “less” accurate than another. Perhaps the difference between the two sets of forecasts is small enough that it could simply be due to random variation. It is not a trivial question; at best, we can compute a probability that tells us whether the two forecasters show “comparable” accuracy or whether their differences are statistically significant at a chosen significance threshold.
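One simple test of this kind, sketched here with invented data, is a paired sign-flip permutation test on the absolute errors: under the null hypothesis that the two forecasters are equally accurate, each day's error gap is equally likely to favor either one. (This is just one illustrative choice; paired t-tests and related methods address the same question.)

```python
import random

random.seed(2)
truth = [random.uniform(40, 80) for _ in range(300)]
a = [t + random.gauss(0, 5.0) for t in truth]   # assumed error spreads,
b = [t + random.gauss(0, 5.5) for t in truth]   # for illustration only

# Paired differences of absolute errors: d_i = |err_B| - |err_A|.
d = [abs(y - t) - abs(x - t) for x, y, t in zip(a, b, truth)]
observed = sum(d) / len(d)

# Sign-flip permutation test: randomly flip the sign of each difference
# and see how often the shuffled mean is as extreme as the observed one.
n_perm, extreme = 10000, 0
for _ in range(n_perm):
    flipped = sum(x if random.random() < 0.5 else -x for x in d) / len(d)
    if abs(flipped) >= abs(observed):
        extreme += 1
p_value = extreme / n_perm
print(f"mean |err_B| - |err_A| = {observed:.3f}, p = {p_value:.3f}")
```

A small p-value suggests a real skill gap (favor the winner); a large one says the gap could be noise, so the forecasters are “comparable” and a consensus is worth considering.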

Empirical studies show that where the performance of forecasts is “comparable”, or within a certain margin of each other, combining a greater number of forecasts yields the greatest accuracy. If fewer forecasts are used in the consensus model, then each forecast is “mission critical” and has to “do its job”, so to speak: one bad forecast can ruin the bunch. But if there are a great many forecasts in the mix, a single bad forecaster will not impact the whole nearly as much.
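That dilution effect can be demonstrated directly. In this assumed setup, one badly biased forecaster (+10 degrees) is dropped into ensembles of increasing size; the numbers are illustrative, not measured performance:

```python
import random

random.seed(3)
TRUTH = 70.0  # a single hypothetical verifying temperature

def ensemble_mae(n_good, trials=5000):
    """Average error of a consensus of n_good unbiased forecasters
    plus ONE badly biased forecaster (+10 degrees)."""
    total = 0.0
    for _ in range(trials):
        forecasts = [TRUTH + random.gauss(0, 2) for _ in range(n_good)]
        forecasts.append(TRUTH + 10 + random.gauss(0, 2))  # the bad apple
        consensus = sum(forecasts) / len(forecasts)
        total += abs(consensus - TRUTH)
    return total / trials

for n in (2, 5, 20, 100):
    print(f"{n:>3} good + 1 bad: MAE = {ensemble_mae(n):.2f}")
```

With two good forecasters, the bad one drags the consensus off by about a third of its bias; with a hundred, its influence all but vanishes.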

It then boils down to an optimization problem involving (1) the number of forecasters and (2) the performance of each forecaster in terms of error, which requires a historical analysis of each forecaster's performance. Theoretically, the ideal situation is to go with a single forecaster who is right 99.999% of the time. A second-best scenario would be a consensus of one million forecasters using one million independent methods who are each right 80% of the time (the key being “independent methods”, so that they don't all share the SAME bias and are not all erroneous at the SAME time), so that their combined forecast approaches 100% accuracy.
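For a yes/no forecast, this second scenario can be simulated with a majority vote of independent forecasters, each right 80% of the time. The function below is a toy illustration of that principle (a form of the jury-theorem effect), not a model of any real forecasting system:

```python
import random

random.seed(4)

def majority_accuracy(n_forecasters, p_correct=0.8, trials=10000):
    """Probability that a majority vote of n independent forecasters,
    each individually right with probability p_correct, is right."""
    wins = 0
    for _ in range(trials):
        votes = sum(1 for _ in range(n_forecasters)
                    if random.random() < p_correct)
        if votes > n_forecasters / 2:
            wins += 1
    return wins / trials

for n in (1, 11, 101):
    print(f"{n:>3} forecasters: accuracy = {majority_accuracy(n):.4f}")
```

Even by 101 independent forecasters, the consensus is essentially always right; the crucial assumption is independence, since correlated forecasters add votes without adding information.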

Most of us do not have the luxury of either of these extremes, so the work lies in finding the “happy balance” somewhere in between. In addition to our signature long-range forecasts, at StatWeather we measure the accuracy of numerous forecasters over time (over decades, in some cases) and have developed a system of metrics that categorizes the skill of forecasters and models. (We only track what is publicly available or where others give us permission to do so.) This means that at any given time and location, we can generate the combination of forecasts and/or models that produces the consensus forecast best optimized for accuracy. Many industries, such as energy trading, utilities, commodities, and risk management, benefit from this optimization.

StatWeather's computer program is able to say, based upon past performance metrics, what the most accurate consensus or combination of forecasts will likely be. It might be a subset of forecasts or models, or it might be a single model, or perhaps all models in combination at any given time for any given location.

In my last blog article on Energy Central, “Exactly How Accurate Are Weather Forecasts?”, we established that just going with two or three private vendors doesn't necessarily optimize accuracy. Sometimes publicly available forecasts are more accurate than those of private vendors, and vice versa. In any kind of hedging strategy, there is always risk. The key is to have the resources to gain insight into that uncertainty and quantify the risk. Replacing subjectivity with analytics (that is, simulating a human process of evaluating different forecasts with a more robust, predictable system) is at the heart of any algorithmic system that identifies arbitrage and maximizes returns.

For more information about Consensus Forecasting or Long-Range Forecasts, please contact service@statweather.com.
