Professor Matthew Connelly, Philippe Roman Chair in History and International Affairs 2014/15, is teaching ‘Hacking the Archive – HY447’ this academic year. He is currently a professor in the Department of History at Columbia University. In our first 2015 NetworkEDGE seminar on 14th January, Professor Connelly will be talking about his course, which uses big data from various International History databases and teaches students new tools and techniques to explore the vast array of material available online. Students are encouraged to rethink historical research in the digital age as older primary sources increasingly become available online alongside newly declassified information and ‘born digital’ electronic records. The seminar is free to attend, but places are limited and will need to be reserved via the staff training and development system or by email. All our talks are live streamed and recorded for those who can’t make it.

I caught up with Matt to find out more about his innovative course and his fascinating historical research and asked him a few questions.

Jane: I’ve heard a lot about digital humanities – can you tell us what this is and why it might be a useful way of approaching the study of a subject like history?

MC: “Digital Humanities” is an umbrella term that can be summed up as the use of computational tools and visualisations to assist in humanities research and presentation. Though there are common practices, tools, and methodologies for doing this across disciplines, history as a subject is particularly suited to these approaches. In fact, I believe it will become increasingly important in years to come.

Consider the traditional activity of the historian. She would find a topic and the appropriate archives that hold relevant documents for her research. She would then consult the finding aids, talk to an archivist familiar with the collection, and search through the archival boxes looking for documents, maps, and photos. This works pretty well for something like studying World War I. But what if there were no finding aids? What if she could not find any archivists with deep knowledge of the collections? What if — and this is the most important point — the “relevant documents” for her research were part of collections that numbered in the millions or even billions?

None of these questions is hypothetical. Archivists and historians together are beginning to struggle with an avalanche of electronic records, especially for periods in which most of the documents were “born digital” — a shift that began to pick up in the 1970s.

Consider our own collections at the Declassification Engine project. We have 1.4 million State Department cables spanning a mere five-year period, 1973–1977. At the US National Archives website there is a search engine and a list of FAQs, but no finding aid. If our historian were studying the end of the Vietnam War, she would have no easy way to grasp the scope and nature of the collection so as to find the most important and relevant cables. Now imagine that our historian is looking into the social history of the post-9/11 period. How will she go through the 300 million messages exchanged on Facebook each day? Using traditional methods to do this would be like drinking water from a fire hose.

This is how digital tools can help. Using Natural Language Processing (NLP) and Machine Learning, we can come up with ways to explore large collections of documents beyond a simple keyword search. Topic Modeling, for instance, allows you to see the main subjects in a collection, and quickly find the documents most representative of the subject you are interested in. And with entity extraction, you can quantify and chart mentions of people and places, something that was all but impossible with traditional methods. But it is important that historians have at the very least some rudimentary understanding of how such tools work if they are to use the results as evidence for their arguments. And of course we’ll need the help of data scientists and developers to develop the tools in the first place.
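As a toy illustration of the entity-extraction idea described above, the sketch below counts place-name mentions across a handful of invented cable subject lines. A real pipeline would use a trained named-entity recogniser rather than a hand-made word list; the cables and gazetteer here are made up for demonstration.

```python
from collections import Counter

# Invented cable subject lines standing in for a real collection.
cables = [
    "EVACUATION PLANNING SAIGON",
    "REFUGEE MOVEMENTS THAILAND AND SAIGON",
    "UN GENERAL ASSEMBLY AGENDA",
]

# Hand-made gazetteer of places; a real system would extract entities
# automatically with an NLP library.
gazetteer = {"SAIGON", "THAILAND"}

# Tally how often each known place is mentioned across the collection.
mentions = Counter(
    token for cable in cables for token in cable.split() if token in gazetteer
)
# mentions == Counter({'SAIGON': 2, 'THAILAND': 1})
```

Even this crude tally shows the shape of the technique: once mentions are quantified, they can be charted over time or by geography, which is impractical to do by hand at scale.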

Jane: Your course ‘Hacking the Archive’ sounds quite different in approach to other history courses, can you tell us a bit more about what students have to do?

MC: The course aims to create a kind of laboratory for exploring large historical datasets. The students don’t have to have prior knowledge of programming or statistics, but they do need to be willing to learn how to do some rudimentary things that might involve both: scraping websites, working with databases, etc.
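To give a flavour of the rudimentary database work mentioned above, here is a minimal sketch of loading scraped records into a queryable store. The fields and records are invented for illustration; it is not material from the course itself.

```python
import sqlite3

# An in-memory database standing in for a real project database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cables (date TEXT, origin TEXT, subject TEXT)")

# Hypothetical scraped records.
conn.executemany(
    "INSERT INTO cables VALUES (?, ?, ?)",
    [
        ("1975-04-01", "SAIGON", "EVACUATION PLANNING"),
        ("1975-04-02", "BANGKOK", "REFUGEE MOVEMENTS"),
        ("1975-04-02", "SAIGON", "AIRLIFT STATUS"),
    ],
)

# Once the data is in a database, questions become one-line queries.
rows = conn.execute(
    "SELECT COUNT(*) FROM cables WHERE origin = 'SAIGON'"
).fetchone()
# rows == (2,)
```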

With the help of my colleague Daniel Krasner, a very talented data scientist and engineer, we introduce students to the concepts and methods of coding, creating datasets, and developing visualizations. Final projects can include a digital archive or a new tool for analysing such collections. We encourage students to use these methods on some topic they are already interested in, which can assist in their eventual dissertation work. Their content knowledge also helps us understand whether the tools we are creating really work.

Jane: You mentioned some of your students have co-authored a paper with you, so they are really part of your research – can you tell us what benefits you think this brings to the student and the teacher?

MC: If you look at the sciences, professors co-author papers with their students all the time. They do this because the research is impossible without a lot of teamwork. Developing datasets of documents and creating digital tools also requires a lot of different skills and content knowledge. For example, digital collections have to be parsed, processed, and stored in a database before they can even become useful for analysis. This is an enormous task.

The benefit of co-authorship for students is that they are recognised for their hard work. And even if they have to do a lot of grunt work, they can see how it contributes to interesting results and novel analysis. And they can often tell the professor whether the analysis is actually correct, because of the kinds of choices that they made in parsing the text, disambiguating the data, etc. Students can also teach us new techniques or point out new technologies that we might not have known about. All this constitutes excellent training both for skill development and critical thinking. I therefore think you’ll see this kind of work become much more common in history, such that history will increasingly be recognised as a data science.

Jane: Tell me more about some of the key historical events you have explored using this data mining approach. What new historical insights has it uncovered?

MC: The single largest collection that we currently work with is the set of State Department cables from 1973-1977. Though limited in timespan, it permits us to take a fresh look at some well-studied historical events.

For example, we had a project run by Shawn Simpson and David Allen, a statistician and a historian, that looked at “bursts” in cable traffic to and from the embassy in Vietnam leading up to the fall of Saigon. What we found was at first not that surprising: there was a dramatic burst of activity, one of the biggest in the entire period. But as one of our Steering Committee members Richard Immerman pointed out, the burst started earlier than when the crisis was publicly acknowledged. Moreover, traffic analysis also shows a tremendous but more sustained “burst” related to refugees from all across Southeast Asia. This is a much less-studied episode than the Vietnam War itself, but in terms of the sheer amount of time and attention it required from the State Department it was no less important (and was certainly important to these refugees).

These methods can thus lead us to reevaluate the relative significance of historical events and trends, but it’s just the beginning. The work on burstiness has forked off into a new project. Two statisticians, Rahul Mazumder and Yuanjun Gao, are working on a method for doing traffic analysis by geography and subject. For example, the graphs of different types of events have different structures. A coup, for example, is usually quite sudden, while bursts about United Nations General Assembly sessions are cyclical. Now we can start to do a lead and lag analysis with media coverage to determine when policymakers are setting the agenda and when they are simply responding to events.
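A minimal version of burst detection can be sketched as flagging days whose traffic exceeds the mean by some multiple of the standard deviation. The daily counts and the threshold below are invented for illustration; the project’s actual burst models are considerably more sophisticated.

```python
import statistics

# Hypothetical daily cable counts: mostly steady traffic with a
# two-day spike, standing in for real embassy traffic data.
daily_counts = [40, 42, 38, 41, 39, 43, 120, 135, 44, 40]

mean = statistics.mean(daily_counts)
stdev = statistics.stdev(daily_counts)

# Flag days more than 1.5 standard deviations above the mean as bursts.
bursts = [i for i, c in enumerate(daily_counts) if c > mean + 1.5 * stdev]
# bursts == [6, 7] — the spike days
```

Grouping counts by embassy or by subject before running the same test is what turns this into traffic analysis by geography and topic.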

We hope that computational methods might uncover events that have previously gone unnoticed by scholars, perhaps because most of the relevant records are still classified. For instance, a large portion of the cables in the NARA releases are “withheld,” meaning that there is no full text available but there is some metadata, such as the sender, receiver, date, subject, etc. This presented an opportunity to use text analysis to try to see what types of features are predictive of a document being withheld. Sasha Rush, a computer scientist who worked on the project, found that cables with the word “boulder” in the subject line were 129 times more likely to still be classified. It turns out that “boulder” refers to a classified surveillance program in which the FBI investigated visa applicants with Arabic-sounding names between 1973 and 1975.
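The underlying idea — comparing the withholding rate of cables with and without a given subject-line word — can be sketched crudely as below. The records are invented, and this is not the actual analysis the project ran, just a minimal version of the feature comparison it describes.

```python
# Hypothetical (subject_line, withheld) metadata records.
cables = [
    ("BOULDER VISA REVIEW", True),
    ("BOULDER APPLICANT LIST", True),
    ("TRADE TALKS GENEVA", False),
    ("UN GENERAL ASSEMBLY AGENDA", False),
    ("BOULDER CASE UPDATE", True),
    ("REFUGEE RESETTLEMENT", False),
]

def withheld_rates(word, cables):
    """Return the withholding rate for cables with and without `word`."""
    with_word = [w for subj, w in cables if word in subj.split()]
    without = [w for subj, w in cables if word not in subj.split()]
    rate_with = sum(with_word) / len(with_word) if with_word else 0.0
    rate_without = sum(without) / len(without) if without else 0.0
    return rate_with, rate_without

rate_with, rate_without = withheld_rates("BOULDER", cables)
# In this toy data, every "BOULDER" cable is withheld and no other is.
```

Ranking words by the ratio of these two rates is one simple way a word like “boulder” could surface as strongly predictive of withholding.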

Historical research has always aimed to uncover heretofore understudied events and trends, and at the same time try to correct the intrinsic bias in the public record created by official secrecy. The data mining and computational approach will be more and more crucial for doing this given the large volume of digital information generated beginning in the 1970s.

Jane: As a history graduate myself, I wonder how you find historians take to learning computer programming? Do you have to give your students lots of help and support, and if so, how do you do this?

MC: At a panel during the American Historical Association this past week, Fred Gibbs, a historian at the University of New Mexico who works extensively on digital projects, was asked whether historians need to learn programming in order to do this kind of work. His response was “If you want to study Russian history, you have to learn Russian.”

Now of course, some people have been able to say insightful things about Russian history using non-Russian sources, and you can also learn a lot from works in translation — much as historians have learned a great deal from search engines without knowing exactly how they work (or don’t work). Similarly, if you are lucky enough to be part of a team with talented programmers, developers, and data scientists, you have a lot to contribute even if you don’t write a line of code. As a historian, you can identify promising questions for computational research, understand and explain the nature of the “data,” and interpret the results.

But I find it extremely helpful to have at least a basic understanding of coding, the kinds of errors that are common, and the kinds of problems that are impossible for computers to solve. I benefited from participating in a terrific program at Columbia called LEDE, which is run out of the Brown Institute for Media Innovation at the Journalism School. It was designed for journalists, so all the exercises use the kinds of data and “data practices” that are especially crucial for journalists. I would love to see more programs like this one, only for students of history, literature, public health, etc. But even when we pick up coding skills, historians may well find that they work best with data scientists and developers in multi-disciplinary teams. That’s why Daniel and I have really enjoyed teaching this course at the LSE. We need more like it.

LTI welcome this innovative approach to teaching history and are delighted to welcome Professor Connelly as our first NetworkEDGE speaker of 2015 on 14th January. We hope many of you will join us either in person or via the live stream, and we look forward to a lively discussion next week!