The difficulty in testing the profitability of trend-following rules stems from the fact that the testing procedure involves either a single- or multi-variable optimization. Specifically, any trading rule considered in Part 3 has at least one parameter that can take many possible values. For example, the Moving Average Crossover rule, *MAC(s,l)*, has two parameters: the size of the shorter averaging window *s* and the size of the longer averaging window *l*. As a result, testing this trading rule on relevant historical data consists of evaluating the performance of the same rule with many possible combinations of *(s,l)*. When daily data are used, the number of tested combinations can easily exceed 10,000. In addition, there are many types of moving averages (SMA, LMA, EMA, etc.) that can be used to compute the average values in the shorter and longer windows; this further increases the number of specific realizations of the same rule that need to be tested.

The main problem in this case is not computational resources, but how to correctly perform the statistical test of the outperformance hypothesis. In the preceding blog post we considered how to test the outperformance hypothesis for a single specific rule. Testing the outperformance hypothesis for a trading rule that involves parameter optimization is much more complicated. In this blog post we review the two major types of tests used in finance to evaluate the performance of trading rules that require parameter optimization: back-tests (or in-sample tests) and forward tests (or out-of-sample tests). We also describe common pitfalls in testing trading rules.

## Back-Testing Trading Rules
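To see how quickly the number of tested combinations grows, consider a rough count. The window ranges below (*s* up to 50 days, *l* up to 250 days) and the three moving-average types are assumptions chosen purely for illustration, not values from the text:

```python
# Illustrative count of MAC(s, l) parameter combinations.
# Window bounds and MA types are assumptions for illustration.
short_windows = range(1, 51)         # s: 1..50 days
long_windows = range(2, 251)         # l: 2..250 days
ma_types = ["SMA", "LMA", "EMA"]     # moving-average variants

# Only pairs with s < l define a valid crossover rule
pairs = [(s, l) for s in short_windows for l in long_windows if s < l]
print(len(pairs))                    # 11225 distinct (s, l) pairs
print(len(pairs) * len(ma_types))    # 33675 rule realizations to test
```

Even this modest grid already exceeds 10,000 window pairs, and each moving-average type multiplies the count again.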

In our context, back-testing a trading rule consists of simulating the returns to this trading rule using relevant historical data and checking whether the trading rule outperforms its passive counterpart. However, because each trend-following rule has at least one parameter, in reality, when a back-test is conducted, many specific realizations of the same rule are tested. In the end, the rule with the best observed performance in the back-test is selected and its outperformance is analyzed. This process of finding the best rule is called “data-mining”. The problem is that the performance of the best rule found by data-mining systematically overstates the genuine performance of the rule. This systematic error in the performance measurement of the best trading rule in a back-test is called the “data-mining bias”.

The reason for the data-mining bias lies in the random nature of any statistical estimator. Specifically, recall from the previous blog post that the observed outperformance of a trading rule consists of two components, the true outperformance and a random component:

*Observed outperformance = True outperformance + Noise.*

The random component of the observed outperformance can manifest as either “good luck” or “bad luck”. Whereas good luck improves the observed outperformance of a trading rule, bad luck deteriorates it. It turns out that in the process of data-mining the trader tends to find a rule that benefited most from good luck.

A mathematical illustration of the data-mining bias is as follows. Suppose that the trader tests many trading strategies. Suppose in addition that the true performance of each trading strategy equals the performance of its passive benchmark. This means that for all trading strategies the true outperformance is zero and, consequently, the observed outperformance is a zero-mean random variable. The trader uses a significance level of *p=0.05*, that is, 5%.
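The data-mining bias is easy to reproduce in a small simulation. The sketch below uses synthetic zero-mean “outperformances” (an assumption for illustration): the average strategy has no edge at all, yet the best of 100 strategies appears strongly profitable simply because it is the luckiest one.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 strategies, each with zero true outperformance: the observed
# outperformance is pure zero-mean noise. Selecting the best strategy
# in a back-test picks the luckiest one, biasing its measured
# performance upward.
n_strategies, n_trials = 100, 10_000
noise = rng.standard_normal((n_trials, n_strategies))

best = noise.max(axis=1)   # the back-test "best rule" in each trial
print(noise.mean())        # close to 0: the average strategy has no edge
print(best.mean())         # clearly positive: the selected best looks great
```

The mean of the selected maximum is far above zero even though every strategy's true outperformance is zero; that gap is exactly the data-mining bias.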
The test of a single strategy is not data-mining; when the trader tests a single strategy, the probability of a “false discovery” amounts to 5%. In other words, when an active strategy has the same performance as its passive benchmark, the probability that the trader finds that the active strategy “beats” its passive benchmark equals 5%. Now suppose that the trader tests *N* such strategies. We further suppose that the *N* observed outperformances are independent random variables. The probability that with multiple testing at least one of these *N* strategies produces a p-value below the chosen significance level is given by (we skip the details of the derivation)

*p_N = 1 − (1 − p)^N.*

If in a single test *p=5%* and *N=10*, then *p_N=40.1%*. That is, if the trader tests 10 different strategies, then the probability that the trader finds at least one strategy that “outperforms” the passive benchmark is about 40%. If *N=100*, then *p_N=99.4%*. That is, if the number of tested strategies equals 100, then with a probability of almost 100% the trader finds at least one strategy that “outperforms” the passive benchmark. The selected best strategy in a back-test is the strategy that benefited most from luck.

To deal with the data-mining bias in multiple back-tests, one has to somehow adjust the p-value of a single test. Researchers have proposed different methods of performing correct statistical inference in multiple back-tests of trading rules. The majority of these methods are rather sophisticated; their practical implementation requires a deep knowledge of modern statistical techniques.

The main advantage of back-tests is that they utilize the full historical data sample. Since the longer the sample, the larger the power of any statistical test, back-tests decrease the chance of missing “true” discoveries, that is, the chance of missing profitable trading strategies. However, because all methods of adjusting p-values in multiple tests try to minimize the Type I error (the probability of a false discovery), this adjustment also greatly increases the probability of missing true discoveries (the Type II error). That is, when a trading strategy with genuine outperformance has the bad luck to be part of a multiple test where many poorly performing strategies are tested, the outperformance of this superior strategy may not be statistically significant.

## Forward-Testing Trading Rules
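Under the independence assumption, the probability of at least one false discovery among *N* tests at significance level *p* is 1 − (1 − *p*)^*N*; a quick numerical check reproduces the figures quoted in the text:

```python
# Probability that at least one of n independent tests at significance
# level p produces a false discovery: p_N = 1 - (1 - p)**n
def family_wise_error(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

for n in (1, 10, 100):
    print(n, round(family_wise_error(0.05, n), 3))
# 1 0.05
# 10 0.401
# 100 0.994
```

With 10 strategies the chance of a spurious “discovery” is already about 40%, and with 100 strategies it is a near certainty.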

To mitigate the data-mining bias problem in back-testing trading rules, instead of adjusting the p-value of the best rule, an alternative solution is to conduct forward testing. The idea behind a forward test is straightforward: since the performance of the best rule in a back-test overstates the genuine performance of the rule, to validate the rule and provide an unbiased estimate of its performance, the rule must be tested on an additional sample of data (besides the sample used for back-testing). In other words, a forward test augments a back-test with an additional validation test. For this purpose, the total sample of historical data is segmented into a “training” set and a “validation” set. Most often, the training set of data that is used for data-mining is called the “in-sample” segment, while the validation set is termed the “out-of-sample” segment. In this regard, back-tests are often called “in-sample” tests, whereas forward tests are called “out-of-sample” tests.

To illustrate the forward-testing procedure, suppose that the trader wants to forward test the performance of the Momentum rule *MOM(n)*. The procedure begins with splitting the full historical data sample *[1,T]* into the in-sample subset *[1,t]* and the out-of-sample subset *[t+1,T]*, where *T* is the last observation in the full sample and *t* denotes the split point. Then, using the training set of data, the trader determines the best window size *n\** to use in this rule. Formally, the choice of the optimal *n\** is given by

*n\* = argmax_n M(n),*

where *M(n)* denotes the value of the performance measure (computed over *[1,t]*) as a function of the window size. Finally, the best rule discovered in the mined (in-sample) data is evaluated on the out-of-sample data. In practical implementations of out-of-sample tests, the in-sample segment of data is usually changed during the test procedure.
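A minimal sketch of the in-sample selection step is shown below. The synthetic price series, the Sharpe ratio as the performance measure *M(n)*, and the candidate grid of window sizes are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.normal(0.0003, 0.01, 2_000)    # synthetic daily returns
prices = np.cumprod(1.0 + returns)

def mom_returns(n, upto):
    """Returns to MOM(n) over the segment [1, upto]: long the next day
    when today's price exceeds the price n days earlier, else in cash."""
    signal = prices[n:upto - 1] > prices[:upto - n - 1]
    return returns[n + 1:upto] * signal      # signal applied with a one-day lag

def sharpe(r):
    """Annualized Sharpe ratio as the performance measure M(n)."""
    s = r.std()
    return np.sqrt(252) * r.mean() / s if s > 0 else -np.inf

t = 1_000                                    # in-sample segment [1, t]
candidates = range(5, 251, 5)
# n* = argmax_n M(n), with M computed on the in-sample data only
n_star = max(candidates, key=lambda n: sharpe(mom_returns(n, t)))
print(n_star)
```

The selected *n\** would then be evaluated on the untouched out-of-sample segment *[t+1,T]*.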
Specifically, after a period of length *s≥1*, at time *t+s*, the trader can repeat the best-rule selection procedure using a longer in-sample period *[1,t+s]*. Afterwards, the selection procedure can be repeated at times *t+2s*, *t+3s*, and so on. Notice that, since the in-sample segment of data always starts with observation number 1, the size of the in-sample window increases with each iteration of the selection procedure.

The out-of-sample testing procedure thus consists of the following sequence of steps. First, the best parameters are estimated using the in-sample window *[1,t]* and the returns to the best rule are simulated over the out-of-sample sub-period *[t+1,t+s]*. Next, the best parameters are re-estimated using the in-sample window *[1,t+s]* and the returns to the new best rule are simulated over the out-of-sample sub-period *[t+s+1,t+2s]*. This sequence of steps is repeated until the returns have been simulated over the whole out-of-sample period *[t+1,T]*. In the end, the trader evaluates the performance of the trading strategy over the whole out-of-sample period.

The great advantage of out-of-sample testing methods is that, at least in theory, they should provide an unbiased estimate of the rule’s true outperformance. An additional advantage is that the out-of-sample simulation of returns to a trading strategy, with subsequent measurement of its performance, is relatively easy to do compared with implementing the rather sophisticated performance-adjustment methods required in multiple back-tests.
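The expanding-window sequence of steps described above can be sketched as follows. The toy return series, the trailing-mean trend rule, and its in-sample score are placeholders for illustration, not part of the original procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
returns = rng.normal(0.0003, 0.01, 1_500)    # synthetic daily returns

def trailing_signal(rets, n):
    """signal[i] is True when the mean return over the previous n days
    is positive; only past data enter the signal (no look-ahead)."""
    conv = np.convolve(rets, np.ones(n) / n, mode="valid")
    sig = np.zeros(len(rets), dtype=bool)
    sig[n:] = conv[:-1] > 0
    return sig

def best_lookback(in_sample, candidates):
    """In-sample selection of the best window size."""
    score = lambda n: (in_sample * trailing_signal(in_sample, n)).mean()
    return max(candidates, key=score)

t, s = 750, 250                              # split point t, step length s
candidates = (10, 50, 100, 200)
oos = []
for start in range(t, len(returns), s):      # re-select at t, t+s, t+2s, ...
    end = min(start + s, len(returns))
    n = best_lookback(returns[:start], candidates)    # expanding window [1, start]
    sig = trailing_signal(returns[:end], n)
    oos.extend(returns[start:end] * sig[start:end])   # simulate [start+1, end]
print(len(oos))                              # covers the whole period [t+1, T]
```

Note that each block of out-of-sample returns is generated by a rule that was selected using only data available before that block begins, which is what keeps the performance estimate unbiased.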