A New Era in Social Science Research

How digital analytics are revolutionizing social sciences

By Hugo Osorio

MAY 2026

Imagine it is February 2020. Alarm bells about a new virus in China are ringing, but health authorities in the United States and Europe still believe it will remain a localized outbreak. Only research laboratories can test for the disease, and contact tracing is limited to recent travelers. There are not enough tests, not enough labs, and your local doctor is likely not prepared to diagnose and treat COVID-19. Now imagine that you begin monitoring digital behavior independently. You notice that someone in your city searches for "fever symptoms," "body aches," and "home remedies" on Google. Days later, their mobility data (the kind Google Maps aggregates anonymously) shows them visiting a pharmacy, then disappearing from their usual commute for three weeks. You cannot confirm that this person had COVID-19, but in a world without tests or functioning surveillance systems, that digital trace might have been the earliest signal you had. Cool, right?

The team behind searchingcovid19.com aggregated Google searches, from "What is Coronavirus?" to "How to use Zoom?", creating an intuitive window into how a society processes a crisis in real time. I invite you to visit their website: it is one of the clearest illustrations of how search trends can reveal what is happening in the world before official data catches up. This idea, however, was not the first of its kind. A well-known earlier attempt was Google Flu Trends, which tried to predict influenza outbreaks by identifying searches like "flu symptoms" and "fever remedy," then running them through a statistical model to estimate case counts that could be compared against official CDC data. Yet this pioneering project in computational social science failed dramatically in 2013, producing estimates nearly double the official figures. The lesson is an important one: digital traces are powerful, but they should not be used in isolation, and they work best when combined with traditional data sources.
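
To make the idea of "running searches through a statistical model" concrete, here is a minimal sketch. It assumes a simple linear fit on the log-odds scale between the share of flu-related searches and the official share of doctor visits for influenza-like illness. All numbers are invented for illustration and the function names are hypothetical; this is not the actual Google Flu Trends pipeline.

```python
import math

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented weekly training data (assumption, not real figures):
# share of searches that are flu-related, and the official share of
# doctor visits for influenza-like illness in the same weeks.
search_share = [0.002, 0.003, 0.005, 0.008, 0.006, 0.004]
official_ili = [0.010, 0.014, 0.022, 0.034, 0.027, 0.018]

intercept, slope = fit_line(
    [logit(q) for q in search_share],
    [logit(p) for p in official_ili],
)

def estimate_ili(query_share):
    """Estimate the illness share from the search share alone."""
    return inv_logit(intercept + slope * logit(query_share))

# Estimate a new week from searches only.
estimate = estimate_ili(0.007)

# Once the official figure for that week is published (invented here),
# the ratio shows how far the search-based estimate has drifted.
later_official = 0.020
ratio = estimate / later_official
print(f"estimate={estimate:.4f}, ratio to official={ratio:.2f}")
```

A ratio well above 1 is the kind of overestimation described above, which is why checking search-based estimates against official data on a regular basis matters.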

An interesting question, for example, is how the drop in physical mobility during the pandemic (that is, how much people stopped moving around) correlates with socioeconomic status. How do you think those two variables would relate? This is exactly what Weill et al. (2020) explored using Google's COVID-19 Community Mobility Reports and data from SafeGraph, an aggregator of mobile phone location data. People higher in the income distribution reduced their mobility more than those lower in the distribution. Perhaps a sign of white-collar remote work, compared to a grocery store cashier who has to show up in person?

This last example illustrates how digital data, or “big data,” has the potential to reveal social patterns of enormous value for social science research.

As you may already know, we are currently living through an explosion of data of all kinds. The Internet of Things creates a constant stream of information through sensors embedded in every type of device. Such data has been used for everything from nowcasting business cycles with toll data (Askitas & Zimmermann, 2013) to forecasting users’ mood and app usage (Suhara et al., 2017; Wang et al., 2016). Private digital traces such as financial transactions, social network interactions, friendship networks, and call records are collected by private companies (Meta, Google, Amazon, etc.) and offer an unprecedented window into human behavior (Blazquez & Domenech, 2018; Rampazzo et al., 2018). We also have more geospatial data than ever before. For example, we can use nighttime satellite imagery of geographic areas to estimate economic activity (Hu & Yao, 2022). I invite you to look at the comparison between the two Koreas, or the radical transformation visible in India and China over just two decades.

This technological progress is especially relevant for the study of demography and population economics. Kashyap (2021) explores the impact of this “data revolution.” Historically, demographers have inferred partner preferences from marriage records, but dating apps like Tinder or Hinge, which aggregate data and interactions from their users, allow us to directly observe the dynamics of partner search and to test qualitative theories about preferences and competition. Where direct observation is not possible, the key word is proxy. For example, by comparing aggregated Google searches about abortion with the legal restrictions in a given area, researchers traced an inverse relationship between search volume and local access, suggesting that people search online for information that is not always available (or is illegal).

Rampazzo et al. (2018) approximated the Mean Age at Childbearing (MAC) by aggregating advertising data from Facebook, finding a high correlation with traditional data from the United Nations. In countries where the state lacks adequate civil registration systems (which are also where the most pressing contemporary questions about fertility arise), these new data sources expand both the possibilities and the scope of inquiry.
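
For readers unfamiliar with the indicator, the conventional demographic definition of MAC is the average age of mothers at birth, weighted by age-specific fertility rates. Here is a minimal sketch of that calculation, assuming five-year age groups and invented rates; the function name is hypothetical, and how Rampazzo et al. derive their inputs from Facebook advertising data is not shown.

```python
def mean_age_at_childbearing(asfr_by_age_group):
    """Average of age-group midpoints weighted by fertility rates.

    asfr_by_age_group maps (start_age, end_age) to the age-specific
    fertility rate for ages start_age up to, but not including, end_age.
    """
    weighted_sum = 0.0
    total_rate = 0.0
    for (start_age, end_age), rate in asfr_by_age_group.items():
        midpoint = (start_age + end_age) / 2.0
        weighted_sum += midpoint * rate
        total_rate += rate
    if total_rate == 0:
        raise ValueError("all fertility rates are zero")
    return weighted_sum / total_rate

# Invented age-specific fertility rates (births per woman per year).
asfr = {
    (15, 20): 0.030,
    (20, 25): 0.110,
    (25, 30): 0.140,
    (30, 35): 0.100,
    (35, 40): 0.050,
    (40, 45): 0.015,
    (45, 50): 0.002,
}

mac = mean_age_at_childbearing(asfr)
print(f"MAC = {mac:.1f} years")
```

The same function can be applied to rates estimated from any source, which is what makes a comparison between digital and traditional estimates possible.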

In the end, it is not just about having more data, but about being closer to human behavior. Traditional censuses are “reactive”: within the context of an interview, people know they are being observed, so genuine opinions, movements, and behavior are difficult to capture (think of an anonymous survey versus one with names, surnames, age, and all the questions that are part of an official census) (Cesare et al., 2018). Kashyap (2021) also points to the timeliness of this data. These “sensors” are always on, unlike censuses, which arrive at 5- or 10-year intervals. Add to that the ability to geolocate this information, and the possibilities are enormous.

However, not everyone in academia is convinced, especially in disciplines with strong theoretical foundations such as economics. The main issue is selection bias. If you want to study populations at the extremes of the age distribution, infants and the elderly, you cannot assume they use smartphones or interact with the internet, so their digital presence is limited. Another problem is that the data is not structured for research. Taylor et al. (2014) suggest that aggregated data is not organized like census data (which often requires a dictionary to be used properly). The samples are also much larger, which makes traditional significance testing far less informative: with billions of observations, even negligible differences become statistically significant.
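
To make the point about sample size concrete, here is a minimal sketch using a two-sample z-test with a known standard deviation. The difference between groups is fixed at a negligible 0.001 standard deviations; only the sample size changes. The numbers and the function name are assumptions chosen for illustration.

```python
import math

def two_sample_z_test(mean_diff, sd, n_per_group):
    """Two-sided z-test for a difference in means between two groups
    of equal size, assuming a known common standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

tiny_effect = 0.001  # difference in means, in standard deviation units

z_survey, p_survey = two_sample_z_test(tiny_effect, 1.0, 1_000)
z_big, p_big = two_sample_z_test(tiny_effect, 1.0, 1_000_000_000)

print(f"n = 1,000 per group:         z = {z_survey:.2f}, p = {p_survey:.3f}")
print(f"n = 1,000,000,000 per group: z = {z_big:.2f}, p = {p_big:.3g}")
```

The effect is identical in both cases. At survey scale it is indistinguishable from noise, while at platform scale it is overwhelmingly "significant," which is why the size of an effect matters more than its p-value when working with data of this volume.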

Since much of this data comes from private sources or is aggregated by profit-driven companies, its provenance is often unknown or not open to academic scrutiny, which limits the reproducibility of research. The algorithms that collect and aggregate data are usually trade secrets and can be modified at the company’s discretion. Companies also have financial incentives that may conflict with the academic objective of conducting impartial and open research. The struggle to maintain academic independence becomes even more pronounced when research can be influenced by a company’s commercial goals (Breen & Feehan, 2025).

Considering this, we should advocate for the establishment of regulatory frameworks for open access to data relevant to the study of the social sciences. From my personal standpoint, if these data are already being used to target ads, sell us products, recommend content, and, as social media replaces traditional media, to shape the narrative of reality for the population, then they could also be used to advance research on human groups, from measuring market sentiment in alternative ways to understanding the dynamics of an epidemic (both viral and of ideas).

While privacy concerns remain central to computational social science, widespread adoption of digital services suggests that users are often willing to trade personal data for functionality and convenience, effectively normalizing large-scale data collection. In practice, the scale and entrenchment of data collection in the digital economy have outpaced individual privacy demands, raising the question of whether meaningful resistance to data extraction remains feasible at the user level. Tracking does not even require knowing who you are. For example, a UUID (universally unique identifier) assigned to an account or device allows platforms to link behavioral data to the same user or device over time, enabling algorithms to group individuals by inferred interests without necessarily relying on their real-world identity.
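
A minimal sketch of how this works in principle, assuming a hypothetical platform that logs a topic for each interaction. The event log, variable names, and "top interest" rule are all invented for illustration; real systems are far more elaborate.

```python
import uuid
from collections import Counter, defaultdict

# Each device gets a random identifier with no link to a real name.
device_a = str(uuid.uuid4())
device_b = str(uuid.uuid4())

# Hypothetical event log: (device identifier, topic of the interaction).
events = [
    (device_a, "running"),
    (device_b, "cooking"),
    (device_a, "running"),
    (device_a, "travel"),
    (device_b, "cooking"),
]

# Link events over time through the shared identifier.
profiles = defaultdict(Counter)
for device_id, topic in events:
    profiles[device_id][topic] += 1

# Infer one dominant interest per device from its history.
top_interest = {
    device_id: topics.most_common(1)[0][0]
    for device_id, topics in profiles.items()
}

# Group devices into audience segments by inferred interest.
segments = defaultdict(list)
for device_id, interest in top_interest.items():
    segments[interest].append(device_id)

for interest, device_ids in segments.items():
    print(interest, len(device_ids))
```

At no point does the code need a name, an email address, or any other real-world identity: the random identifier alone is enough to build a behavioral profile.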

The path forward requires institutional frameworks: agreements between academics, governments, and tech companies that allow researchers to access anonymized, aggregated data streams without compromising individual privacy. Such agreements need to become standard practice.

The social sciences have always evolved alongside the tools available to observe human behavior. The telescope changed astronomy. The survey changed sociology. Big data, used carefully and critically, can do the same for demography, economics, and beyond. The data is already there. We just need the will to use it responsibly.

Disclaimer

I want to be clear about something. I am not arguing that we should abandon surveys, censuses, or traditional fieldwork. A Tinder dataset cannot tell you anything about the partner preferences of a seventy-year-old widow in rural Peru. A Google search trend cannot capture the lived experience of a child under three. For populations at the margins of the digital economy, traditional methods remain irreplaceable, and we should invest in them accordingly.

What I am arguing is simpler: we are leaving an enormous amount of signal on the table. Every day, billions of people generate behavioral data that is already being collected, stored, and monetized by private companies. Amazon knows when a household is expecting a baby before the parents announce it. Spotify infers your emotional state from your listening patterns. TikTok's algorithm understands your political leanings before you do. This data exists. The question is not whether it will be used (it already is), but whether researchers working in the public interest will have access to it.

Works Cited

Askitas, N., & Zimmermann, K. F. (2013). Nowcasting business cycles using toll data. Journal of Forecasting, 32(4), 299–306. https://doi.org/10.1002/for.1261

Blazquez, D., & Domenech, J. (2018). Big Data sources and methods for social and economic analyses. Technological Forecasting and Social Change, 130, 99–113. https://doi.org/10.1016/j.techfore.2017.07.027

Breen, C. F., & Feehan, D. M. (2025). New data sources for demographic research. Population and Development Review, 51(1), 539–573. https://doi.org/10.1111/padr.12671

Cesare, N., Lee, H., McCormick, T., Spiro, E., & Zagheni, E. (2018). Promises and pitfalls of using digital traces for demographic research. Demography, 55(5), 1979–1999. https://doi.org/10.1007/s13524-018-0715-2

Hu, Y., & Yao, J. (2022). Illuminating economic growth. Journal of Econometrics, 228(2), 359–378. https://doi.org/10.1016/j.jeconom.2021.05.007

Kashyap, R. (2021). Has demography witnessed a data revolution? Promises and pitfalls of a changing data ecosystem. Population Studies, 75(sup1), 47–75. https://doi.org/10.1080/00324728.2021.1969031

Rampazzo, F., Zagheni, E., Weber, I., Testa, M. R., & Billari, F. (2018). Mater certa est, pater numquam: What can Facebook advertising data tell us about male fertility rates? arXiv preprint arXiv:1804.04632.

Suhara, Y., Xu, Y., & Pentland, A. (2017). DeepMood: Forecasting depressed mood based on self-reported histories via recurrent neural networks. In Proceedings of the 26th International Conference on World Wide Web (pp. 715–724). ACM Press.

Taylor, L., Schroeder, R., & Meyer, E. (2014). Emerging practices and perspectives on Big Data analysis in economics: Bigger and better or more of the same? Big Data & Society, 1(1), 1–10. https://doi.org/10.1177/2053951714536877

Wang, Y., Yuan, N. J., Sun, Y., Zhang, F., Xie, X., Li, Q., & Chen, E. (2016). A contextual collaborative approach for app usage forecasting. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 1247–1258). ACM Press.

Weill, J. A., Stigler, M., Deschenes, O., & Springborn, M. R. (2020). Social distancing responses to COVID-19 emergency declarations strongly differentiated by income. Proceedings of the National Academy of Sciences, 117(33), 19658–19660. https://doi.org/10.1073/pnas.2009412117