With the advances in web analysis, Adam Crymble hails the opportunity for historians to turn to the Internet as a rich source in itself. But are historians trained to take advantage of this new opportunity? Corpus linguistics, data manipulation, clustering algorithms, and distant reading will be valuable skills for dealing with this new body of historical data.
The second talk of our 2014 Autumn programme took on the challenge of a new type of source for historians: the Internet. Not online sources and databases, but the Internet itself. The first archived copies of the UK web have started to find their way into scholarly hands. Historians now have the ability to look at webpages as sources in themselves, just as we have previously read manuscripts as a window into the past. The web is a corpus rich in details about what we were like and what we thought was important, not that long ago. For a cultural or social historian, it’s a dream.
Peter Webster introduced the UK Web Archive, which is hosted by the British Library, and contains snapshots of the UK-web (.uk sites) dating back to the 1990s. A team of historians have been given access, to see what they can make of this new (and huge) resource. I want to emphasise the experimental aspect of this project, because in many respects I think we learned more about what these scholars couldn’t achieve than what they did achieve.
That’s not a failing in the quality of the scholars themselves. They managed to do exactly what we could hope of them: to test the limits of the historian’s method on a large, messy, digital archive. They’ve done us a great service in finding some of those limits. The question now ahead of us is: what are we going to do about it?
Two of the scholars were on hand to share their experiences: Gareth Millward, whose project explored hyperlinking behaviour towards the website of the Royal National Institute for the Blind (RNIB) in those early days of the web, and tried to uncover why people were casting those hyperlinks; and Richard Deswarte, who used the archive to explore manifestations of Europhobia online, looking particularly for indicators that people in Britain were using the web to express dissatisfaction with the country’s continued role in the EU.
The projects themselves took on interesting questions appropriate to the type of source. Most interesting for me – and a significant part of both presentations – was the discussion of where they had problems using the corpus. Both scholars complained of noise that made it difficult to identify unique or meaningful mentions. In Millward’s case the noise came in the form of an advertisement in the Guardian for a talking watch that was endorsed by the RNIB. The ad appeared on hundreds of pages, though it really only represented a single match for Millward’s purposes. Deswarte, too, had trouble with a rotating banner on a newspaper website that dramatically overemphasised the number of meaningful links to an article about Europhobia.
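This kind of boilerplate noise is tractable computationally. As a minimal sketch (not the method either scholar used), one could count how often an identical snippet repeats across pages and collapse over-repeated snippets, such as the talking-watch advertisement, down to a single representative match. The function name, the tuple format, and the threshold are all illustrative assumptions:

```python
from collections import Counter

def collapse_boilerplate(matches, threshold=50):
    """Collapse matches whose snippet text repeats across many pages.

    matches: list of (page_url, snippet) tuples, as might come from a
    keyword search over an archived web corpus (hypothetical format).
    Snippets appearing on >= threshold pages are treated as boilerplate
    (ads, banners) and reduced to one representative match.
    Returns (unique_matches, boilerplate_counts).
    """
    counts = Counter(snippet for _, snippet in matches)
    boilerplate = {s: n for s, n in counts.items() if n >= threshold}
    seen = set()
    unique = []
    for url, snippet in matches:
        if snippet in boilerplate:
            if snippet not in seen:  # keep only the first copy
                seen.add(snippet)
                unique.append((url, snippet))
        else:
            unique.append((url, snippet))
    return unique, boilerplate
```

With a corpus where one advertisement appears on a hundred pages alongside two genuine mentions, this reduces the hundred duplicate hits to one, leaving three matches to read.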
Both also noted the sheer number of hits they were getting, and Millward in particular emphasised his attempts to get the list down to a size where he could conduct a close reading. He failed to do so, and was still left with a collection of 39,000 hits. However, both he and Deswarte reflected on that failure, invoking the language of social science and its ideas about representative sampling, which they felt would have been the appropriate approach had they the opportunity to tackle the challenge again. That reflection is significant, because it shows that both Millward and Deswarte recognised the limits of the historian’s skillset for a project such as this.
However, I think we can push those limits further. The very notion that we would do a close reading of the Internet is one that I think only historians would suggest. It shows how deeply the value of close reading is held in the profession, even when it proves entirely inappropriate. We need to move on from the belief that you can only know something if you’ve read it carefully. If we hold on to this mentality we’re going to lose our chance to discover anything at scale, and we’ll be unable to pursue the longue durée that Guldi advocated in our previous seminar.
Sitting in the audience, I couldn’t help but think that the solution wasn’t in sampling and close reading. It was in corpus linguistics, data manipulation, clustering algorithms, and distant reading: skills that are so rarely taught in our history programmes, but that this experiment made clear need to become part of our disciplinary toolkit. And if not our own toolkit, then we need to ingrain the value of collaboration. If you can’t do it yourself, find someone who can and who wants to work with you.
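To make the contrast with close reading concrete, the most basic distant-reading move is simply to count: what does a corpus talk about, without anyone reading it page by page? The sketch below is a deliberately minimal illustration (the function name and the toy pages are my own, not from either project):

```python
import re
from collections import Counter

def keyword_trends(pages):
    """Tally word frequencies across a corpus of page texts.

    pages: iterable of strings, one per archived page (hypothetical
    input format). Returns a Counter of lower-cased word frequencies,
    the crudest distant-reading measure of what a corpus is about.
    """
    counts = Counter()
    for text in pages:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts
```

From here one can chart a term’s frequency over time, compare sites, or feed the counts into clustering, none of which requires reading 39,000 pages.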
The days of the lone scholar intent on close reading are numbered. The UK Web Archive has shown us that. So what are we going to do about it?
Presentations from the event
The UK Web Archive is available to search now. In addition, there are a variety of related research projects, such as the Big UK Domain Data for the Arts and Humanities (BUDDAH) Project. Analysis of the sustainability of the dataset can be found on the website for the Analytical Access to the Domain Dark Archive (AADDA), and an examination of the potential value of the UK Web Domain dataset can be found on the Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research website.
This piece originally appeared on the Digital History Seminar blog and is reposted with the author’s permission.
Note: This article gives the views of the author, and not the position of the Impact of Social Science blog, nor of the London School of Economics. Please review our Comments Policy if you have any concerns on posting a comment below.
Adam Crymble is a convenor of the Digital History seminar at the IHR and a lecturer in digital history at the University of Hertfordshire.