New Digital Project Opens Up 300 Years of Books for Data Analysis
Source Newsroom: University of Nebraska-Lincoln
Newswise — In the 19th century, Britain was the world's superpower, boasting a global empire of 10 million square miles and 400 million royal subjects. And British authors of the era reflected this supremacy, peppering prose with words of command and certainty -- ones like always, never and forever.
At the same time in Ireland, writers echoed a different perspective in their books. With the Irish under the thumb of British rule, the nation's scribes frequently used words that displayed inability or frustration -- ones like almost, nearly or perhaps.
Matthew Jockers knows this to be a fact because it is borne out in his computer-generated data: The University of Nebraska-Lincoln assistant professor of English has combined computer programming with digital text-mining to produce deep thematic and stylistic analyses of 19th-century literary works. He calls the data-driven process macroanalysis, and it's opening up new methods for literary theorists to study classic literature.
"But what we don't know is what happens after the turn of the 20th century," Jockers said. "The 20th century, as we know, is when the British Empire deteriorates and the Irish gain independence. So do each country's authors remain as they were in the previous century? Or if they do begin to change their approach, in what ways do they go about it? That's the kind of question we can address -- with access to proper data, that is."
Now, thanks to an exclusive agreement between UNL and the private company BookLamp, Jockers and research collaborators from several U.S. universities have the tools to begin uncovering the answers to that question -- and many others. This new research collaboration will ultimately allow scholars to access and analyze book data from the 18th, 19th and 20th centuries.
BookLamp uses digital tools to compare books by theme and writing style, suggesting other books a reader might like based on how closely they match previous reads. To power its algorithm, BookLamp works with publishers across the industry to analyze thousands of titles in its Book Genome Project, which it launched in 2003.
"We can learn a lot about ourselves by looking at the writings that have been published over the years as a whole, at a scale that's been difficult to do in the past," said Aaron Stanton, CEO of BookLamp. "We're not providing access to data for individual books, but instead information that can help answer larger questions about changes in society over time."
Jockers, who also is a fellow in UNL's Center for Digital Research in the Humanities, said that in scholarly circles, the arrangement signifies a big step forward: For years, digital researchers have had a difficult time gaining access to the results of digitally text-mined books from the 20th century because of copyright and access issues. While BookLamp will not directly provide scholars with book texts or book-level data, it will provide corpus-level "anonymized" data that allows researchers to ask questions about key thematic and stylistic structures.
One example would be to query how often female writers used keywords related to traditionally male professions in the 1920s compared with, say, the 1980s, to track the changes in women's literary roles over time, researchers said.
"Nearly everyone who does this kind of work focuses on the 19th century, because that's all that's been available in the digital format, outside of copyright," Jockers said. "So unfortunately, we've been kind of stuck in time for a while. But this arrangement will help us clear that hurdle and we'll be able to look more deeply into more modern works."
Jockers leads the collaboration with digital literary scholars at Stanford University's Literary Lab as well as Arizona State University. It starts with a two-year project involving data from BookLamp, as well as data from 18th- and 19th-century novels already compiled in Stanford's Literary Lab.
Organizers have dubbed the effort the "Unfolding the Novel" project. Ultimately, they will consolidate 300 years of high-level book data to study long-term literary trends and patterns.
And in the 20th century, those patterns explode into a multitude of modern genres and open up a swarm of new research questions, Jockers said.
With the BookLamp-provided summary metadata, researchers could query information from a range of years -- the 1950s, for example -- and learn how many times a particular word was used in any of the new genres of the time, from detective stories to romance to science fiction. The text-mined results would shed new, data-supported light upon the various themes and styles authors employed in that decade.
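A query of that kind might look something like the following minimal Python sketch. It assumes the summary data arrives as simple records of publication year, genre, word and count; that layout, the function name `word_count_by_genre` and every number in it are invented here for illustration and do not describe BookLamp's actual data or tools.

```python
from collections import defaultdict

# Hypothetical corpus-level records: (publication year, genre, word, count).
# All values are invented for illustration; they are not BookLamp data.
RECORDS = [
    (1951, "detective", "gun", 120),
    (1953, "romance", "gun", 8),
    (1957, "science fiction", "gun", 45),
    (1958, "detective", "gun", 95),
    (1962, "detective", "gun", 110),   # outside the 1950s, so excluded below
    (1955, "romance", "heart", 210),   # a different word, so excluded below
]


def word_count_by_genre(records, word, start_year, end_year):
    """Sum occurrences of `word` per genre for years in [start_year, end_year]."""
    totals = defaultdict(int)
    for year, genre, token, count in records:
        if token == word and start_year <= year <= end_year:
            totals[genre] += count
    return dict(totals)


# How often does "gun" appear in each genre during the 1950s?
result = word_count_by_genre(RECORDS, "gun", 1950, 1959)
print(result)
```

Because the records are aggregated by year and genre, a query like this never touches an individual book, which is consistent with the corpus-level access described above.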
One of the project's initial queries will be to examine the words and stylistic elements that best allow scholars to distinguish between male and female writers, Jockers said. For example, in the 19th century, male authors were far more likely to use male pronouns than female ones. This suggests their stories were more male-centered than those written by women authors, who used male and female pronouns more evenly during the same period.
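The pronoun comparison can be illustrated with a short Python sketch. It is only a rough approximation under stated assumptions: the pronoun lists, the function names and the sample sentence are all hypothetical, and it works on raw text, whereas the project itself would draw on corpus-level counts rather than book texts.

```python
import re

# Assumed pronoun lists, chosen for illustration only. Note that "her" can be
# either possessive or objective; this sketch does not tell the two apart.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def pronoun_counts(text):
    """Return (male, female) pronoun counts for a piece of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    male = sum(1 for t in tokens if t in MALE)
    female = sum(1 for t in tokens if t in FEMALE)
    return male, female


def male_share(text):
    """Fraction of gendered pronouns that are male, or None if there are none."""
    male, female = pronoun_counts(text)
    total = male + female
    return male / total if total else None


sample = "He gave her his book, and she thanked him."
print(pronoun_counts(sample), male_share(sample))
```

A share near 0.5 would correspond to the more even usage attributed to 19th-century women authors, while a share close to 1.0 would correspond to the male-heavy pattern described for their male contemporaries.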
"We're interested to learn what happens to this tendency in the 20th century," he said. "This is, after all, the period of liberalization, so the theory would be that women would begin writing more female-centered work. And, if these movements had any effects on the males, we should start to see a greater attention to the other gender in works by 20th-century men, as well. It will be interesting to see."
Understanding and organizing data from even 100 years of literature is long, difficult work, Jockers said, and doing so for 300 years is harder still. But he said he thinks that he and his collaborators are inaugurating a game-changing, information-rich era of literary scholarship.
"The potential uses of this information are huge," he said. "BookLamp has been a spectacular partner in the effort; they are genuinely interested in many of the same questions we are, and they are passionate in the pursuit of knowledge.
"The possibilities are practically endless."