Data mining

I can’t remember who I was talking to recently, but we were discussing data mining. Data mining is bad. Real bad. Everyone knows this, but why is it so bad?

Researchers in any discipline are supposed to check their biases at the lab door. They are to be objective observers of reality. The facts are the facts and the researcher’s job is to uncover them.

In the physical sciences, I imagine this part of the job is a little bit easier than in the social sciences. It’s easy to separate your personal opinions from empirical observation when the thing being observed is inanimate. There are vested interests, of course. The scientist may feel very strongly about which way the data should come out. He would find it strange to have a result that refuted a law of thermodynamics or violated the speed of light. Not that physicists and chemists can’t be passionate about their work; it just seems more likely that they can separate their feelings from their observations. The picture gets a little more clouded if the scientist realizes that he may lose his funding if his experiments come out the ‘wrong’ way. Someone who spent his whole career experimenting on ether may be less than excited about experiments that show the atomic theory to be correct.

In the softer sciences the object of study isn’t inanimate. Worse, the lab rats for economics, sociology and political science are living, breathing people with feelings and such. People, even the most autistic-like math-nerd theoretical economist types, tend to have strong feelings about how people act and should act as individuals, and how they act and should act collectively.

Worse still, the motivation for most social scientists to enter their field is most likely tied up with their biases. The public economics researcher went into that field because he felt strongly about the role of the government in improving people’s lives. Certainly, the research he decides to do, whether it is studying the impact of education on life outcomes or the effects of immigration on the working poor, stems from his particular opinions and experiences: his biases.

Checking your biases at the door becomes a lot harder when you’re so emotionally invested in the outcome. This is why social scientists have to be especially cautious about researcher bias.

Unlike in the physical sciences, most of the data in social science isn’t acquired experimentally. In experiments, researchers control the environment to a very high degree. When a variable is tweaked, the experiment is set up such that the impact of the tweak is known exactly. Thus, experimental data is context free and so repeatable. Additionally, experiments are generally engineered to reduce the noise-to-signal ratio. Very precise instruments measure to a very precise degree.

Most data in the social sciences are historical, entangled and dirty. They are path dependent, mired in context and contain a lot of noise. On the plus side, there’s a lot more data, essentially because everything is data. The government collects data. Businesses collect data. Everyone collects data. And even where there was no data before (wages in 15th century Cairo), the clever researcher can discover it. The social scientist’s job is to find the data, clean them, and disentangle them to find patterns and relationships. How does he do this?

First, before even looking for data, or looking at it once found, he should have a theory of the data. What relationships should be found? What patterns are expected? How might one find such patterns? Next, he should delineate a research strategy, identify the data he will need and outline the methods that will be used to test the theory. Then, and only then, he should go to the data.

Data mining reverses this method. One starts with the data set and proceeds to find relationships and patterns. This sounds innocuous at first. If the patterns are there, then why should this matter? Well, because of the sheer volume of data, the researcher will no doubt ‘find’ the relationships that correspond to his biases. Results that look significant (those we have 95% confidence in) are more likely to be spurious if one is looking through piles of data to find the pattern he wants to find. If 1 out of 20 times a perceived pattern is just the result of randomness, then the more one ‘mines’ for patterns, the more likely he is to be fooled. If you test 10 independent patterns that each have a 5% chance of showing up by chance alone, you’re about 40% likely to find at least one spurious one (1 - 0.95^10 ≈ 0.40). If you dig for more and more patterns, you’re more and more likely to find what you’re looking for, whether or not it is caused by randomness.
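The 40% figure is easy to check for yourself. Here is a minimal sketch, assuming the tests are independent and each one has the same 5% false-positive rate; the function name is just something I made up for illustration.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests looks
    significant at level `alpha` when there is really nothing there."""
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # Probability that every test correctly comes up empty is (1 - alpha)^n,
    # so the chance of being fooled at least once is the complement.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 50):
        p = prob_at_least_one_false_positive(n)
        print(f"{n:>3} patterns tested -> {p:.1%} chance of at least one fluke")
```

The probability climbs quickly with the number of patterns tried, which is the whole problem with digging until something turns up.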

So ‘don’t data mine’ is the mantra. Just don’t do it.

However, on the off chance that a significant pattern is hiding in the data, it seems odd to dismiss data mining out of hand. The problem with data mining isn’t data mining itself; it’s the bias the data miner brings to the job.

It’s a wonder that data mining rates as such a sin in economics, on par with a certain Victorian-era sin that was said to cause hair to grow on the palms of the hands. Both sins are much committed but never discussed except to say they’re bad. Is this sin always bad? Are there healthy ways to commit this sin?

This paper asks us to consider the difference between classical statistics (the kind of statistics that is more concerned with summarizing data) and econometrics (the kind of statistics that is more concerned with cleaning and disentangling data). Admitting those differences means that we must admit the researcher brings bias to his or her work. As such, we should have standard tests/corrections for researcher bias.

Here are some points that stuck out at me:

Point # 1: Researchers will always respond to incentives and will be more skeptical than standard statistical techniques suggest.

The most well-studied example of researcher initiative bias is data-mining (Leamer, 1978, Lovell, 1983). In this case, consider a researcher presenting a univariate regression explaining an outcome, which might be workers’ wages or national income growth. That researcher has a data set with k additional variables other than the dependent variable. The researcher is selecting an independent variable to maximize the correlation coefficient (r) or r-squared or the t-statistic of the independent variable all of which are identical objective functions… If the true correlation between every one of the k independent variables and the dependent variable is zero, and if all of the independent variables are independent, then the probability that the researcher will be able to produce a variable which is significant at the ninety-five percent level is 1 - .95^k. If the researcher has ten potential independent variables, then the probability that he can come up with a false positive is 40 percent… Lovell (1983) provides a rule of thumb “when a search has been conducted for the best k out of c candidate explanatory variables, a regression coefficient that appears to be significant at the level alpha_hat should be regarded as significant at only the level alpha = 1 - (1 - alpha_hat)^(c/k).”
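Lovell’s rule of thumb is simple enough to turn into a few lines of code. What follows is only a sketch of the formula as quoted above, alpha = 1 - (1 - alpha_hat)^(c/k); the function name and the input checks are my own additions, not anything from the paper.

```python
def lovell_adjusted_level(alpha_hat: float, c: int, k: int) -> float:
    """Apply the quoted rule of thumb: alpha = 1 - (1 - alpha_hat)^(c/k).

    alpha_hat: the significance level the coefficient appears to have
    c:         number of candidate explanatory variables searched over
    k:         number of variables kept in the final regression
    """
    if not 0.0 < alpha_hat < 1.0:
        raise ValueError("alpha_hat must be strictly between 0 and 1")
    if k < 1 or c < k:
        raise ValueError("need 1 <= k <= c")
    return 1.0 - (1.0 - alpha_hat) ** (c / k)


if __name__ == "__main__":
    # Picking the single best regressor out of ten candidates:
    # a nominal 5% result should be read as roughly a 40% one.
    print(f"{lovell_adjusted_level(0.05, c=10, k=1):.3f}")
    # No search at all (c == k) leaves the nominal level unchanged.
    print(f"{lovell_adjusted_level(0.05, c=5, k=5):.3f}")
```

With c = 10 and k = 1 this reproduces the 40 percent false-positive figure from the quoted passage, so the two calculations agree.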

Point # 2: The optimal amount of data mining is not zero.

[E]conomic models have an uncountable number of potential parameters, non-linearities, temporal relationships and probability distributions. It would be utterly impossible to test everything. Selective testing, and selective presentation of testing, is a natural response to a vast number of possible tests.

Point # 5: Increasing methodological complexity will generally increase researcher initiative bias.

[New methodologies] clearly increase the degrees of freedom available to the researcher… A second reason for increased bias is introduced by the existence of particularly complex empirical methodologies. Methodological complexity increases the costs of competitors refuting or confirming results.

Point # 8: The search for causal inference may have increased the scope for researcher initiative bias.

The most worrisome aspect of causal inference is the prevalence of instrumental variables drawn from observational data selected, or even collected by the researcher. In that case, the scope for researcher effort is quite large, although some of this freedom is restricted when the researcher focuses on a clean public policy change that is directly targeted at a particular outcome. The problem will be most severe when the variables are general characteristics that may have only a weak correlation with the endogenous regressor.

