Multilingual BookNLP: Building a Literary NLP Pipeline Across Languages
David Bamman’s BookNLP project, which provides large-scale text analysis and is currently available only for English texts, will receive funding from the National Endowment for the Humanities (NEH) to expand its scope to include German, Japanese, Russian, and Spanish.
Natural language processing (NLP) uses computers to analyze, understand, and derive meaning from human language. Bamman’s BookNLP is a trained system that computationally analyzes the linguistic structure of a text. It is distinctive because it is designed for literature and is well suited to analyzing characters and their actions. Most NLP programs are optimized for newspapers or online news articles and are poorly suited to the analysis of fiction.
Bamman said that the idea of developing BookNLP for languages beyond English came from conversations with researchers working on texts in German, Russian, and a range of other languages; they had used BookNLP for English and wanted a comparable tool for their own languages.
“People in the computational humanities want to use these tools to understand something about literary history or literary theory, and if they took tools that were optimized for news, it would just not work quite as well,” Bamman said.
BookNLP has proven to be a game-changer in the computational humanities: it has been used to measure the amount of attention given to characters as a function of their gender, to analyze the relationship between character and literary genre, and to characterize locations in novels written by African American authors in the Black Book Interactive Project. By expanding the range of languages it supports, Bamman hopes to facilitate new research in the computational study of literature.
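To make the first of those uses concrete, here is a minimal sketch in Python of how attention by gender could be tallied once characters and their mention counts have been extracted. The records, the function name, and the data layout below are all hypothetical illustrations; they are not BookNLP's actual output format or the method used in the studies mentioned above.

```python
from collections import Counter

# Hypothetical records: (character label, gender label, number of mentions).
# These values are invented for illustration and are not real BookNLP output.
characters = [
    ("Character A", "female", 120),
    ("Character B", "male", 80),
    ("Character C", "female", 30),
    ("Character D", "male", 20),
]


def attention_share_by_gender(records):
    """Return each gender label's share of all character mentions."""
    totals = Counter()
    for _name, gender, mentions in records:
        totals[gender] += mentions
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {gender: count / grand_total for gender, count in totals.items()}


shares = attention_share_by_gender(characters)
for gender, share in sorted(shares.items()):
    print(f"{gender}: {share:.0%} of character mentions")
```

In this toy example, mention counts stand in for "attention"; a real study would need to decide how characters are identified, how gender is inferred, and what counts as a mention.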
The project will run in three year-long phases and will be advised by a panel of experts (two for each language) with deep knowledge of their language and its literature, as well as extensive experience using computational methods to drive literary inquiry.
In Year 1, Bamman and a graduate student researcher will build a common Multilingual BookNLP architecture that can serve as the customizable foundation for all languages, with the goal of a functioning BookNLP system for each language within that shared infrastructure. In Year 2, the focus will be on improving the multilingual systems. Year 3 will center on documenting the processes used in the first two years, so that others can build and train BookNLP systems for additional languages, and on using BookNLP to measure reasoning about character across several languages.
“I know a lot of people use BookNLP for analyzing English literature, and it was clear that if I were to expand it to other languages, it could have an even bigger impact,” Bamman said. “Given that there’s a demand, and that I have an interest myself in comparative analysis across languages, it seems like a natural next step.”