David Bamman, a scholar of natural language processing, digital humanities, and computational social sciences, is joining the Berkeley School of Information this fall as an assistant professor.
Bamman’s work applies natural language processing and machine learning techniques to empirical questions in the humanities and social sciences. “My work develops statistical models of the world through what people say about it through text,” he explains.
He worked for five years as a senior researcher for the Perseus Project, Tufts University’s digital library of the history, literature, and culture of the Greco-Roman world. Bamman’s work at the Perseus Project focused on natural language processing for Latin and Greek, including treebank construction, computational lexicography, morphological tagging, and word sense disambiguation.
His recent research has explored a wide range of social questions using a variety of text collections.
Recent Research
One recent project attempted to computationally identify sarcasm on Twitter. “Sarcasm detection is a very difficult computational problem,” he explains. Even humans are remarkably bad at identifying sarcasm out of context. Bamman’s approach analyzed more than just the text of each tweet; he also considered the broader context, including the personality of the person tweeting, characteristics of their normal audience, and the immediate context in which the tweet was posted. Taking this additional information into account improved accuracy dramatically over looking at the text of the tweet in isolation. But the sarcasm detection algorithm is more than just a party trick; it also gives social scientists new insight into the characteristics of interpersonal interaction that enable sarcasm in conversation.
Another recent analysis of Twitter data explored differences between how men and women write. By identifying a range of writing styles characteristic of women vs. men, the algorithm is able to correctly identify the author’s gender 88% of the time. But looking more closely, Bamman and his collaborators realized that the truth is more complicated — and more interesting. There aren’t just two different ways of writing (men vs. women); they actually identified a range of different sub-communities on Twitter, each with a distinct communication style. Some communities (like those who tweet mostly about sports or mostly about technology) tend to be predominantly male, others are predominantly female, and others are more mixed. Interestingly, some of the gendered sub-communities used language in ways that were atypical for their gender. By looking past a simple gender dichotomy, this research “offers a new perspective on how gender emerges as individuals position themselves relative to audiences, topics, and mainstream gender norms.”
Another recent project analyzed a completely different text collection: a trove of 23,000 cuneiform tablets discovered in Kültepe, Turkey, documenting the activity of a thriving Bronze Age trade colony (kārum Kaneš) from the 2nd millennium BCE. Bamman analyzed 2,094 Akkadian-language letters between merchants to quantify and diagram the hierarchy of the colony’s social structure, based on linguistic clues and conventions in the texts themselves. He had to develop methods to disambiguate different merchants with the same name or one merchant using different names, while simultaneously learning each merchant’s social rank within the colony. This analysis provides tools for the historians and Assyriologists studying the kārum Kaneš colony to better understand the community’s social and cultural dynamics.
“What excites me about this research is the ability to learn new things about the world — things you couldn’t discover without a large amount of data and computational techniques,” says Bamman.
Crossing Disciplinary Boundaries
Bamman has a varied intellectual background; he has a Ph.D. in computer science from Carnegie Mellon, an M.A. in applied linguistics from Boston University, and a B.A. in classics from the University of Wisconsin. His research has involved working closely with collaborators from a wide range of disciplines.
Because of this interdisciplinary background, Bamman is excited to be joining the School of Information. “This place stands at the intersection of a lot of different disciplinary traditions,” he says. “We see that clearly in the make-up of the faculty and the students. We have the potential to be a common ground for research from across the Berkeley campus, in the humanities, social sciences, computer science, and beyond.”
One of the things Bamman is most looking forward to is working with I School students. “When I’ve visited here in the past, I was really impressed by the depth of the students’ intellectual curiosity, and by the range of different backgrounds they come from,” he says. “I’m excited to see what that future holds.”
In spring 2016, Bamman plans to teach a new course on computational text analysis, and in fall 2016, he will teach the school’s core course, Info 202: Information Organization and Retrieval.