Prediction and Explanation in Social Sciences: Introducing a Set of Useful Methods.

During my Ph.D. studies, I have noticed a growing number of methods that we can learn to answer research questions in the social sciences. The area of computational social science has combined complex social systems problems with advanced statistical and computational techniques, allowing researchers to answer questions that would otherwise be impossible to address. The amount of information generated by social media users is immense; without big data methods to extract and transform it, and the computational power to analyze massive social networks, the study of this kind of phenomenon would be very limited. In this context, scholars have pointed out the importance of understanding how these new tools might improve the quality of research in the social sciences by differentiating how to use them to explain or to predict problems of interest (Hofman et al., 2017). This is the initial post of a series that I am going to publish to explain a set of methods that I have studied and collected from different sources; the idea is to make the material available in one place. The selection you will find in this blog does not pretend, by any means, to be exhaustive; instead, its purpose is to serve as an introduction for social science researchers interested in learning analytical tools that can be used to infer predictions or cause-effect processes.

This initial post contains a review of the concepts of prediction and explanation. I use the classification, based on statistical modeling, made by Shmueli (2010) to explain the differences between how we model both concepts. These differences stem from their scientific goals: empirical prediction and causal explanation. Using Shmueli's classification, and adding the perspective of social scientists on how prediction and explanation have been studied so far, I try to establish a common ground for understanding why the analytical techniques that I explain in later posts can be useful for explanation or prediction. The classification made by Shmueli (2010) distinguishes between explanatory modeling and predictive modeling, but it also adds the notion of descriptive modeling, which is useful considering the state of the art in the social sciences.

Explanatory modeling refers to the set of statistical methods used for testing a causal theory. In this case, researchers, based on their knowledge of the subject under study, elaborate theoretical models which are then applied to data to be tested. I acknowledge that causality is a complicated concept to define. We can find definitions ranging from Aristotle, who proposed four fundamental elements to answer the question "why?" (cause): material cause, formal cause, efficient cause, and final cause; to more recent efforts from social scientists such as Henry Brady (2011), who, based on a review of philosophy and statistics, states that causation should satisfy (1) a constant conjunction of causes and effects, (2) the absence of the effect when the cause is not present (counterfactual analysis), (3) the presence of the effect after the cause is manipulated, and (4) a mechanism that explains the activities and processes linking causes and effects. However, even though there is no consensus about a definition of causality, for the type of analysis described here the notion of causality is given by the theoretical implications specified in the model. Shmueli defines "explaining as causal explanation and explanatory modeling as the use of statistical models for testing causal explanations." A powerful technique for testing causal explanations is the framework developed by Judea Pearl called Structural Causal Modeling; it will be reviewed in detail in one of the following posts of this series.

To illustrate how the social sciences develop explanations, let us consider the diagram in Figure 1, elaborated by Chaffee (1991) to represent explanation as an iterative process in the field of Communication.

Figure 1: Explanation as an iterative process (Chaffee, 1991)

We can observe that the idea is to elaborate a meaning analysis that creates theoretical constructs rather than measurable variables. Normally, these constructs represent an explanation of the phenomenon analyzed via a diagram that shows the causal relationships between them. Because constructs are abstractions, researchers need to establish connections between them and reality. The connection between constructs and observable measurements is called operationalization. Once the operationalization is done, it is possible to measure the variables of interest and generate the data needed to evaluate the research hypotheses in a statistical model. In summary, "explanatory modeling refers here to the application of statistical models to data for testing causal hypotheses about theoretical constructs."
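
As a minimal sketch of the last step of this workflow, the snippet below simulates data for a hypothetical construct that has already been operationalized into two measured variables, exposure and attitude, and tests the theorized effect with an ordinary least squares regression. The variable names, effect size, and simulated data are my own illustrative assumptions, not part of Chaffee's diagram.

```python
# Minimal sketch: testing a theorized effect after operationalization.
# The "exposure -> attitude" relationship and its effect size are hypothetical;
# data are simulated only to show the statistical-testing step.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

exposure = rng.normal(size=n)                    # operationalized measure of the cause
attitude = 0.5 * exposure + rng.normal(size=n)   # theorized effect plus noise

X = sm.add_constant(exposure)                    # add intercept term
model = sm.OLS(attitude, X).fit()
print(model.summary())                           # the coefficient on exposure tests the hypothesis
```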

Predictive modeling, following Shmueli (2010), is defined "as the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations." The idea is to predict an output value given the input values. This is clearest in time series analysis, where we try to forecast the value k steps into the future (t+k) based on previous observations.
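
To make the t+k idea concrete, here is a hedged sketch that fits a simple autoregressive model to a simulated series and forecasts k future values. The simulated AR(1) process and the ARIMA(1,0,0) specification are arbitrary choices for illustration, not a recommendation for any particular dataset.

```python
# Minimal sketch: forecasting k steps ahead (t+1 ... t+k) with an AR(1) model.
# Both the series and the model order are illustrative assumptions.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
n, k = 200, 5

# Simulate an AR(1) process: y_t = 0.7 * y_{t-1} + noise
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()

model = ARIMA(y, order=(1, 0, 0)).fit()   # AR(1) with constant
forecast = model.forecast(steps=k)        # predicted values for t+1 through t+k
print(forecast)
```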

The concept of prediction has historically been ignored in the social sciences because the efforts of scientists have concentrated on finding causal explanations for social phenomena. Recently, however, scholars like Hofman, Sharma, and Watts (2017) have argued that the new tools provided by the increasingly computational nature of social science should encourage a more extensive use of prediction in the field. For the proper incorporation of prediction, the authors highlight three important problems that must be addressed. First, standard measures for prediction need to be defined. These measures improve the comparability of results by establishing consensus on the problems to be solved. When specific problems are defined, the use of the "common task framework" should help to evaluate claims about prediction; in the absence of common tasks and data, researchers must differentiate clearly between exploratory and confirmatory analysis. Second, theoretical limits to prediction must be better characterized; this makes it possible to set expectations of what can be predicted or explained. It has been shown that some events are intrinsically impossible to predict in a meaningful sense, which means that predictive accuracy is subject to limits, and those limits need to be understood. Third, predictive accuracy and interpretability must be recognized as complements for evaluating explanations. This follows from noticing that simple models do not necessarily generalize better than complex models, that there is increasing evidence that the trade-off between accuracy and interpretability may be less severe than we thought, and that the concept of understanding can refer both to the feeling of having made sense of something (interpretation) and to having accounted for observed regularities (prediction).

Descriptive modeling is the third type of modeling. Its objective is to summarize and represent the data in a compact fashion. The difference from explanatory modeling is that in these models the underlying causal implications are absent or incorporated less formally. Compared with predictive modeling, on the other hand, descriptive modeling does not aim to generate predictions of future values. A typical example of this situation is a scientific paper whose main objective is to fit the best regression model to capture the association between dependent and independent variables rather than to produce causal inferences or predictions. I would like to add that even though there is a group of well-defined statistical techniques for causal inference, we can find an important number of social science papers that, without claiming causation, show associations based on theoretical grounds (explanation). The common phrase "correlation is not causation" makes clear that path, regression, or structural equation models (SEM) do not necessarily allow us to compute the causal effect between variables in the statistical sense, but they do present an explanatory association within the model. There is no standard rule to determine whether a social science paper develops an explanatory or a descriptive model, but in general, the presence of path analysis or SEM diagrams indicates the existence of relational (sometimes causal) implications derived from theoretical reasoning.

As I mentioned at the beginning of this post, I am going to explain a set of methods to serve as an introduction for researchers interested in learning analytical tools that can be used to develop explanations and predictions in the social sciences. The methods I collected originate in different areas such as causal inference, time series analysis, and information theory. In Table 1, I present a summary of the important elements of each method.

| Method Name | Explain or predict | Variable | N of variables | Stationarity |
|---|---|---|---|---|
| Structural Causal Modeling | Explain | Numerical/Categorical | Many | No |
| Instrumental Variable | Explain | Numerical | Many | No |
| Granger Causality | Explain/Predict | Numerical | Two | Yes |
| Transfer Entropy | Explain/Predict | Categorical | Two | Yes |
| Modes of Information Flow | Explain | Categorical | Two | Yes |
| ε-Machines | Predict | Categorical | One | Yes |
| ARIMA models | Predict | Numerical | One | Yes |

Table 1: Methods summary

While column 2 of Table 1 shows a classification based on the main concepts I have discussed in this post, columns 3 to 5 add further information. For columns 3 and 4, the classification is straightforward, but I would like to define stationarity in case you are not familiar with the term in column 5.

Among the methods that require stationarity in their data, Granger Causality and ARIMA models are time series analysis techniques, while Transfer Entropy, Modes of Information Flow, and ε-Machines are analytical tools from the field of Information Theory. All of these methods need data with a temporal structure, that is, a succession of data points of the form t1, t2, t3, … called a time series. Stationarity, in an intuitive sense, means that the statistical properties of a time series do not change over time. There is also a practical notion of weak-sense stationarity, which only requires that the mean and variance (more formally, the autocovariance structure) do not change over time. Since stationarity is a required assumption of several of these methods, some techniques allow us to transform non-stationary data into stationary data, for example by differencing. These details will be discussed when needed in the following posts for each method.
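
As a rough sketch of how stationarity is commonly checked and induced in practice (the specific test, significance threshold, and simulated random walk below are my own choices, not requirements of the methods in Table 1), the snippet applies an augmented Dickey-Fuller test and differences the series once when the test cannot reject non-stationarity.

```python
# Minimal sketch: checking stationarity with an ADF test and differencing.
# A random walk is simulated as an example of a non-stationary series.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))       # random walk: non-stationary by construction

stat, p_value, *_ = adfuller(y)
print(f"ADF p-value (levels): {p_value:.3f}")

if p_value > 0.05:                        # cannot reject a unit root
    y_diff = np.diff(y)                   # first difference often restores stationarity
    stat, p_value, *_ = adfuller(y_diff)
    print(f"ADF p-value (differenced): {p_value:.3f}")
```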

Now that I have introduced the main concepts of explanation and prediction, and how they have been developed and understood in the social sciences, I will move forward with the introduction of the methods in the following posts. I argue that, with a proper comprehension of the virtues and limitations of each analytical tool described, the analysis of research questions in the social sciences can be enriched.

References

Brady, H. E. (2011). Causation and Explanation in Social Science. Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199604456.013.0049

Chaffee, S. H. (1991). Explication. Thousand Oaks, CA: Sage.

Hofman, J. M., Sharma, A., & Watts, D. J. (2017). Prediction and explanation in social systems. Science, 355(6324), 486–488. https://doi.org/10.1126/science.aal3856

Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330

Pablo M. Flores