How big data is changing history

Technology is transforming how we see the past

Historian Richard Bulliet appears to have spent much of the 1960s and 1970s poking holes in punch cards. He was studying the growth of Islam across the Middle East, taking family information from biographical compilations and making it machine-readable – essentially creating a massive family tree. “I coded this data in ways that enabled me to search and recover repeated technical terms or family relationships as these matters evolved over five centuries”, he told me. This was possible thanks to the structure of Islamic names, which contain the father’s and paternal grandfather’s names (both “ibn” and “bin” mean “son of”), and to some long-dead elders having had the foresight to keep detailed records.
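To make the idea concrete, here is a minimal sketch of how names built around “ibn” and “bin” can be turned into family links. It is an illustration only, not Bulliet's actual coding scheme: the function names and the example entries are invented, and it assumes that a name alone identifies a person, which real biographical records cannot guarantee.

```python
import re

# Patronymic markers: both "ibn" and "bin" mean "son of".
SON_OF = re.compile(r"\s+(?:ibn|bin)\s+", flags=re.IGNORECASE)


def parse_lineage(full_name):
    """Split 'Ahmad ibn Muhammad ibn Ali' into [person, father, grandfather]."""
    parts = SON_OF.split(full_name.strip())
    return [part.strip() for part in parts if part.strip()]


def lineage_links(full_name):
    """Return the (child, father) pairs implied by a single name."""
    chain = parse_lineage(full_name)
    return list(zip(chain, chain[1:]))


# Invented example entries, not taken from any real biographical compilation.
entries = [
    "Ahmad ibn Muhammad ibn Ali",
    "Yusuf bin Muhammad ibn Ali",
]

# Map each father to the set of sons recorded for him.
# Simplifying assumption: identical names refer to the same individual.
family_tree = {}
for entry in entries:
    for child, father in lineage_links(entry):
        family_tree.setdefault(father, set()).add(child)

for father in sorted(family_tree):
    print(father, "->", sorted(family_tree[father]))
```

Even with two entries, the shared father and grandfather link both men into one family, which is the kind of connection that becomes searchable once the names are machine-readable.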

Having compiled his data, he was then able to analyse it and draw new conclusions that would shift the thinking in his field.

“As it turned out, the study of conversion to Islam up to that point had focused on the first two centuries of Islam. My analysis pointed to the early third century [of Islam] as the midpoint in a process that spread out over four or more centuries”, he says, adding: “The excellent fit between my proposed chronology and the political, institutional, and intellectual history of Middle Eastern Islam led to my chronology becoming generally adopted in the field.”

This new insight was all thanks to quantitative analysis rather than traditional historical methods. “I don’t believe there was any way of arriving at this result in a more convincing fashion by the traditional technique of collecting rare, and often hyperbolic, textual references and guessing at the impact of legal or governmental actions”, he explains.

Big Data

Fast forward 50 years and data has changed everything. “Big Data” is the increasingly passé buzzword that technology pundits like to use to describe the ongoing revolution in computing brought about by ever-increasing storage space and processing power.

To talk about “Big Data” is to talk about the future - with much speculation about how algorithms will improve how Netflix recommends a film, how Uber plans our journeys or how Facebook decides what is the most important thing for us to see when we wake up in the morning. But data isn’t just used to predict the future: this sort of data-driven, quantitative analysis of large datasets can also be used to look back in time.

Huge swathes of our past are slowly but surely being digitised as old books are scanned and organised. It stands to reason that once historians get to work on this material, it could completely transform our understanding of the past. If Richard Bulliet was able to change our understanding of Islam in the digital stone age, just what are the datasets and tools of the 21st century capable of?

The Correlates of War

The Correlates of War Project is an attempt by historians and political scientists to take a quantitative look at the history of war and conflict. The project hosts a stack of spreadsheet-readable data on wars through history.

How this data is presented is obviously going to be contentious. Unlike scientific measurements, it isn’t easy to stuff a complex web of human behaviours and interactions into a box and describe it as a number, as there are too many questions: What’s the difference between a really small war and a mass killing, for instance? And is it possible to define whether it is the state or a non-state actor operating in countries that lack a strong central government? Not to mention, umm, how exactly do you define what a country is, in a meaningful sense, over a long period of time? The historians compiling the data have obviously had to apply some human judgement - but have also stuck to a series of rules (such as the minimum number of deaths for an event to count as a war). These rules ensure at least some consistency – and this consistency means that the data can be crunched.

The CoW data has already been used in at least one high-profile work. In 2012 Steven Pinker released a tome entitled The Better Angels of Our Nature, in which he argued that, contrary to what it might feel like, war and violence are in long-term decline. CoW provided the empirical data to back up many of his claims in the first part of the book, and enabled Pinker to take a longer view of history than would be possible through traditional historical methods.

Tudor Letters

The act of structuring and linking data can also create valuable new historical insights. In 2012, Ruth and Sebastian Ahnert built a dataset based on the connections between letters sent by dissident Protestants during the reign of the Catholic monarch Mary I. They created a database recording who sent each letter and to whom, as well as connections described in the letters (such as when writers described who told them something).

Taking this metadata, they were able to build up a picture of the social relationships that made up this underground movement. This network analysis revealed not only some of the hidden key figures in the movement, but also the much larger role of women as key players. Essentially, the Ahnerts had accidentally proven Edward Snowden’s warning about collecting metadata using 500-year-old test subjects.
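As a rough sketch of how sender-and-recipient metadata alone can point to key figures, the example below counts how many distinct correspondents each person has. This is a deliberately simple stand-in rather than the Ahnerts' actual method, and the letter pairs and single-letter labels are invented placeholders, not real correspondents.

```python
from collections import defaultdict

# Invented (sender, recipient) pairs; the labels are placeholders.
letters = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"), ("D", "E"), ("D", "F"),
]


def build_network(pairs):
    """Link each person to everyone they exchanged a letter with."""
    network = defaultdict(set)
    for sender, recipient in pairs:
        network[sender].add(recipient)
        network[recipient].add(sender)
    return network


def rank_by_connections(network):
    """Order people by number of distinct correspondents, most connected first.

    Ties are broken alphabetically so the ordering is repeatable.
    """
    return sorted(network, key=lambda person: (-len(network[person]), person))


network = build_network(letters)
ranking = rank_by_connections(network)

for person in ranking:
    print(person, len(network[person]))
```

In this toy network C and D come out on top: they are the only links between the two clusters of writers, even though nothing about the content of any letter has been read.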

This was only a relatively small dataset - but now the same principles are being applied to the much larger “Cultures of Knowledge” project, which aims to build a similar but far bigger dataset of letters sent during the Early Modern period.

Analysing Literature

Any data-driven future, though, is likely to arrive incrementally, owing to the huge challenges in assembling, structuring and interpreting the datasets required. Scanning thousands of pages is time-consuming but can be done - the real challenge is in structuring the data so that it can be easily quantified, and then ensuring that the data that does exist is meaningful. Annoyingly, the monks of old didn’t think to ensure their records were comma-separated, so making sense of the messes of old data is hard.

“The tasks at hand can be a challenge to seasoned machine learning, text mining and Natural Language Processing experts as well”, says Dr Christof Schöch from the Department of Literary Computing at the University of Würzburg.

“The main source of complexity is that language, and literary language or historical documents even more so, are multi-layered semiotic systems and artefacts”, he explains. “A lot of information is left implicit and requires a lot of contextual knowledge to be understood.”

But these challenges shouldn’t put us off: Dr Schöch points to literary history as one of the areas that could be transformed by new techniques. This is an area where we’re already seeing the fruits, perhaps because digitisation is relatively straightforward thanks to the foresight of Johannes Gutenberg in introducing movable-type printing to Europe.

Dr Schöch says: “This area will certainly be profoundly changed by quantitative approaches, as they are based not on a small and nonrepresentative sample of canonical texts, but on a large and potentially complete record of literary production for a given period and language.”

Of the approximately 129 million books ever published, Google alone reckons it had already digitised 25 million as of October last year.

“As we continue developing computational techniques to model and identify ever more complex literary phenomena (such as metaphorical language, character networks, narrative techniques, plot structures), we will be able to observe their distribution across literary genres and their evolution over time in ever increasing detail.”

The pay-offs could be numerous – but one thing he thinks big data analysis could do is paint a more detailed picture of the European cultural identity.

The Future

So what does the future hold? Dr Chris Sparks is a digital historian, but he sees some barriers to digital becoming a new normal. “Most historians probably continue to think of Digital Humanities as a niche pursuit”, he told me – arguing that institutional factors could limit the reach of data-driven approaches in the future: “Career progression in mainstream UK history departments remains tied to publication in traditional formats (ie monographs and articles), which digital humanities and big data projects don't always lend themselves to.”

But assuming that the history establishment can learn to adapt, he thinks that techniques in their infancy today could provide an illuminating window not on the distant past, but on the present day once we come to look back on it. “I think that there could be some really ground-breaking social and cultural history of the early twenty-first century written using big datasets”, he says.

Though this isn’t without one major challenge that could perhaps make the 21st century even harder to analyse than the past. Whereas books and documents can be locked away and eventually scanned, our lives take place digitally. So if historians are going to look back on now, they’re going to have to rely on the likes of Facebook and Google opening up their data.

It seems possible that in the not too distant future we could be in the unusual situation of having a wealth of new history being steadily unlocked by new computational power - with Big Data doing for ancient history what it is doing for the present… while simultaneously, the near past will remain locked away and inaccessible. Historical Big Data will be able to tell us plenty about the past - but perhaps not too much about ourselves.