False discoveries everywhere

Since John Ioannidis published a paper in 2005 provocatively titled Why Most Published Research Findings Are False, both the general public and researchers have gained a greater awareness of the unreliability of scientific discoveries based on “a single study assessed by formal statistical significance, typically for a p-value less than 0.05.” To be sure, scientists in fields such as genomics have long been aware of the need to raise their hurdles for statistical significance to account for the very large number of tests they conduct. It would be fair to say, however, that general awareness of this issue has taken longer to develop in finance and economics, although some academics and practitioners were grappling with multiple testing years ago. It will probably take longer still to reach the retail investor who thinks that running stock trading backtests on his online broker’s website makes him a trading “ninja.”
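
To see how easily such chance “discoveries” arise, here is a minimal simulation sketch in Python; the number of strategies, the sample length, and the volatility are made-up values chosen only for illustration. Every strategy is pure noise, yet several clear the conventional p < 0.05 hurdle.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_strategies = 200   # hypothetical strategies, none with any true edge
n_days = 252         # one year of daily returns
vol = 0.01           # assumed 1% daily volatility

# Pure-noise daily returns for each strategy
returns = rng.normal(loc=0.0, scale=vol, size=(n_strategies, n_days))

# t-test of each strategy's mean return against zero
t_stats, p_values = stats.ttest_1samp(returns, popmean=0.0, axis=1)

print(f"'Significant' at p < 0.05: {np.sum(p_values < 0.05)} of {n_strategies}")
# Roughly 5% of the null strategies -- about ten here -- look significant by chance alone.
```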

Is two still the magic number?

When doing data analysis, we have come to regard two as the threshold that a t-statistic must clear in order to declare a variable statistically significant. As most readers will know, this critical value ensures a 5% level of significance given a reasonably large number of observations. In the context of investments, we might be looking at whether a variable such as a firm’s book-to-market ratio is related to its stock returns or whether the time series of returns from some trading strategy or portfolio manager is reliably different from zero.
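
As a concrete illustration, with simulated numbers standing in for a real track record, the t-statistic for a mean return is just the sample mean divided by its standard error, and the two-sided 5% critical value is close to two once the sample is reasonably large:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical track record: 10 years of monthly excess returns (simulated)
r = rng.normal(loc=0.004, scale=0.03, size=120)

t_stat = r.mean() / (r.std(ddof=1) / np.sqrt(len(r)))
crit = stats.t.ppf(0.975, df=len(r) - 1)   # about 1.98 here, close to the rule-of-thumb two

print(f"t-statistic: {t_stat:.2f}, critical value: {crit:.2f}")
print("significant at the 5% level" if abs(t_stat) > crit else "not significant at the 5% level")
```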

In my previous post on data snooping, I discussed some of the problems that arise from testing multiple hypotheses. In particular, if we conduct a large number of tests, we are very likely to find relationships or patterns in the data simply by chance. Yet we do this all the time in finance when developing trading strategies or evaluating investment managers. So while two might be the appropriate threshold for a t-statistic when conducting a single test, using it in a wide search will leave us with too many strategies or managers that are bound to disappoint.
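
One simple, admittedly conservative, way to see how the hurdle should move is a Bonferroni-style correction: with m independent tests, running each at the 0.05/m level keeps the chance of any false discovery below 5%. A quick sketch of the implied critical values, using the normal approximation for large samples:

```python
from scipy import stats

alpha = 0.05
for m in (1, 10, 100, 1000):
    # Each of m tests is run at the two-sided level alpha / m
    crit = stats.norm.ppf(1 - (alpha / m) / 2)
    print(f"{m:5d} tests -> critical |t| of about {crit:.2f}")

# Prints roughly 1.96, 2.81, 3.48, and 4.06: the wider the search,
# the higher the bar each individual t-statistic must clear.
```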

Data snooping in a nutshell

Data snooping is pervasive in financial research, both in academia and in industry. In my experience, the level of awareness about data snooping varies widely among practitioners. All too often, however, huge amounts of time and effort are wasted on a flawed research process because this critical issue is misunderstood. In this post, I give a non-technical introduction to what data snooping is, how it can occur, and what can be done about it. (Economists often use the term data mining to mean the same thing. However, since there are now respectable conferences on data mining and knowledge discovery, I use the term data snooping here, with its clearly negative connotations.)

Hal White (2000) defined the term as follows:

Data snooping occurs when a given set of data is used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results.

In machine learning terms, this means that data used for training (even indirectly) can no longer be used to provide an unbiased assessment of learning performance.
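
Here is a minimal sketch of that point using scikit-learn on simulated noise; the dataset sizes and the particular feature-selection step are arbitrary choices for illustration. Selecting predictors on the full dataset and then cross-validating reuses the data in exactly the way White warns about, and the results look good even though there is nothing to find.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))   # 500 candidate predictors, all pure noise
y = rng.normal(size=200)          # target unrelated to every predictor

# Snooped: choose the 10 "best" predictors using ALL the data, then cross-validate
X_snooped = SelectKBest(f_regression, k=10).fit_transform(X, y)
snooped = cross_val_score(LinearRegression(), X_snooped, y, cv=5, scoring="r2")

# Clean: the selection step is refit inside each training fold only
pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
clean = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print(f"snooped CV R^2: {snooped.mean():+.3f}")   # spuriously positive
print(f"clean   CV R^2: {clean.mean():+.3f}")     # near zero or negative, as it should be
```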

Noise in asset returns

One of the goals of this blog is to discuss various approaches to forecasting asset returns taken from both the economics and machine learning fields. Before diving into specific models and techniques, however, I begin by discussing the issue of noise in financial markets.

Let’s start by decomposing returns into expected and unexpected returns. We have

[1]\hspace{0.5cm} R_t = E_{t-1}(R_t) + \epsilon_t

where R_t is return in period t, E_{t-1}(R_t) is expected return in period t conditional on the information available at t-1, and \epsilon_t is unexpected return.

The first thing to notice is that \epsilon_t is the noise component and cannot be predicted. The expected return E_{t-1}(R_t) is the component that we hope to model, but it is unobserved. We could use various models, but a simple choice for modeling E_{t-1}(R_t) \equiv \mu_{t-1} is a linear function of a predictor variable x; shifting the time index forward one period, \mu_t \equiv E_t(R_{t+1}) is then the return expected for period t+1, formed with information available at t. Thus, we have

[2]\hspace{0.5cm} \mu_t = \beta x_t + \nu_t

where I assume a zero intercept for simplicity and \nu_t is an error term. We want to run a linear regression, but since \mu_t is unobservable, we must first identify an observable proxy. The usual approach, of course, is to use the realized return R_{t+1} as a proxy for the expected return \mu_t. Most of the time we follow this procedure without thinking much about it. Why bother pointing out the distinction between expected and realized returns here?
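
To make the distinction concrete, here is a small simulation; all of the magnitudes are invented, and for simplicity the expected return is an exact linear function of the predictor (\nu_t = 0). Regressing realized returns on the lagged predictor recovers \beta on average, but the unexpected component dominates, so the fit is poor.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T = 600                                # 50 years of monthly observations (made up)
beta = 0.5
x = rng.normal(scale=0.01, size=T)     # predictor observed at time t
mu = beta * x                          # expected return for period t+1 (nu set to zero)
eps = rng.normal(scale=0.04, size=T)   # unexpected return: far more volatile than mu
R_next = mu + eps                      # realized return for period t+1

# Regress realized returns on the lagged predictor, i.e. use R_{t+1} as a proxy for mu_t
res = stats.linregress(x, R_next)
print(f"estimated beta: {res.slope:.2f} (true value {beta})")
print(f"R-squared: {res.rvalue ** 2:.3f}")   # on the order of 0.01: the noise dominates
```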

The Hedgehog and the Fox Redux

Many fund managers will be aware of Philip Tetlock’s book “Expert Political Judgment,” published in 2005. In the book, Tetlock analyzes forecasts collected from 284 experts over twenty years. While he focuses primarily on the ability of political experts to predict future events, his research has clear applicability to making investment decisions and has been widely covered in the financial press. Using Isaiah Berlin’s prototypes, Tetlock argues that the fox, who knows many little things, is often a better forecaster than the hedgehog, who knows one big thing. More damningly, he finds that the average expert performs only slightly better than random guessing. A more detailed review of the book appeared in the New Yorker.

Some of the results are likely influenced by the incentives that forecasters face. A hedgehog ideologue who holds a strong, non-consensus view is often more useful to a fund manager than a redundant fox who has, perhaps correctly, adjusted his forecast towards the consensus view. However, the fund manager himself must aggregate viewpoints from various analysts along with other information to arrive at his own forecast and may be well-advised to act more like the fox who is skeptical of grand theories and is willing to combine diverse ideas and sources of information.

Is out-of-sample testing of forecasting models a myth?

A well-known observation about forecasting models is that in-sample performance is usually better, often much better, than out-of-sample performance. That is, a model generally produces better forecasts over the data it was constructed on than over new data. Researchers usually attribute this result to the process of data mining, which leads to in-sample overfitting. As many different variables and model specifications are tried and discarded, the final model is likely capturing the idiosyncratic features of the in-sample data. A separate cause of poor out-of-sample performance is structural instability in the data.

A popular strategy for dealing with overfitting, especially in machine learning, is cross-validation. The data are split, once or several times, and the model is estimated on part of the data and validated on the remainder. We then select the model with the best performance over the validation sample(s). Standard cross-validation works when the observations are independent and identically distributed. That assumption is plausible in many engineering applications, such as recognizing objects in images. When modeling financial and economic time series, however, the observations are generally serially dependent and the assumption fails.
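
One partial remedy, sketched below with scikit-learn's TimeSeriesSplit, is to respect the temporal ordering so that each validation block lies strictly after the data used to fit the model. This does not remove the dependence problem, but it at least avoids the look-ahead that a shuffled split would introduce.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten time-ordered observations, purely to show how the splits are formed
X = np.arange(10).reshape(-1, 1)

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X), start=1):
    print(f"fold {fold}: train {train_idx.tolist()}  test {test_idx.tolist()}")

# fold 1: train [0, 1]        test [2, 3]
# fold 2: train [0, 1, 2, 3]  test [4, 5]  ... and so on.
# Each validation block lies strictly after its training window, so the model is
# never fit on observations that come after the ones it is evaluated on.
```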

R/Finance 2014 Conference

Last week I attended the R/Finance conference held in Chicago. About 300 developers, academics, and practitioners gathered at the two-day conference to discuss the latest applications of the R open-source programming language to finance. I’ve mostly coded in Matlab, but the growing popularity of R and the number of people developing leading-edge statistical packages in R have made the language impossible to ignore.

In case you’ve been hiding under a rock, here is a 90-second video that gives an introduction to R. The R language was originally developed by statisticians and has been perceived as being more suited to small data problems. (Machine learning people, of course, tend to prefer Matlab.) However, I think this view is increasingly obsolete. A number of talks at the conference focused on using R in high-performance applications (by calling out to C++, for example) or with very large datasets (by using Hadoop on the back end).

For me, the most interesting presentations were those on applying R to market forecasting over the business cycle. The scarcest resource in developing trading strategies is the researcher’s time. If R can increase your productivity given its large library of packages, it’s well worth looking into.

The conference presentations have just been made available online. You can find them at the conference website.

Big Data and Economics

Lorie and Fisher. Big Data circa 1960.

For much of its history as a discipline, economics has been trapped in a Small Data paradigm. Macroeconomists analyzed output and inflation using annual or quarterly data spanning several decades at best. Microeconomic datasets varied more widely in size, but even large cross-sectional surveys were rarely big by today’s standards. While the econometric methods developed were often highly sophisticated, they generally focused more on estimation of structural equations and testing economic theories than on prediction. Searching for patterns outside a theoretical framework was largely dismissed as data mining.

Thanks to the efforts of Jim Lorie and Larry Fisher in developing the CRSP database at Chicago in the 1960s, financial economists have enjoyed a relative wealth of high-quality data. Furthermore, there has always been an interest in prediction within finance. However, empirical stock market research, at least in academia, has typically focused on discovering return anomalies by sorting stocks along a single dimension such as market capitalization or the book-to-market ratio. Occasionally researchers would double- or even triple-sort stocks into portfolios, but they generally would not explore additional dimensions. Haugen and Baker (1996), which used a few dozen variables to predict stock returns, was a notable exception. More distressingly, generations of researchers “spinning” the CRSP tapes in a collective data mining exercise produced many empirical findings about stock returns that proved unreliable out of sample.

In recent years the data used in economic research have grown tremendously in both volume and variety. High-frequency financial datasets are one notable example. Others include unstructured data such as text from news and social media and audio from management conference calls discussing company earnings. Though somewhat late to the party, economists have increasingly adopted a Big Data paradigm. Reflecting these developments, the NBER devoted its summer institute last year to Econometric Methods for High-Dimensional Data.

More evidence for the growing interest of economists in Big Data comes from Hal Varian’s article Big Data: New Tricks for Econometrics in the latest issue of the Journal of Economic Perspectives. Varian, a long-time economics professor at Berkeley before becoming the chief economist at Google, writes that “my standard advice to graduate students these days is go to the computer science department and take a class in machine learning.”

He makes a number of observations, including: (1) economists immediately reach for a linear or logistic regression when confronted with a prediction problem, even though better models are often available; (2) economists have not been as explicit as machine learning researchers about quantifying the cost of model complexity; and (3) cross-validation is a more realistic measure of performance than the in-sample measures commonly used in economics. While these judgments are perhaps a bit too sweeping, they raise important issues that I will explore in future posts, particularly from the perspective of a financial markets investor.
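
As a toy illustration of points (1) and (3), on entirely simulated data and with a random forest standing in for “a better model,” compare the in-sample fit with cross-validated performance:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=400)   # nonlinear toy target

models = [("linear regression", LinearRegression()),
          ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]

for name, model in models:
    in_sample = model.fit(X, y).score(X, y)                        # in-sample R^2
    cv = cross_val_score(model, X, y, cv=5, scoring="r2").mean()   # cross-validated R^2
    print(f"{name}: in-sample R^2 = {in_sample:.2f}, cross-validated R^2 = {cv:.2f}")

# The flexible model looks better in-sample than it really is; the cross-validated
# numbers give the more honest comparison, which is Varian's third point.
```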
