From The Register
OpenAI's ChatGPT may face a copyright quagmire after 'memorizing' these books
By Thomas Claburn
Boffins at the University of California, Berkeley, have delved into the undisclosed depths of OpenAI's ChatGPT and the GPT-4 large language model at its heart, and found they're trained on text from copyrighted books.
Academics Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman describe their work in a paper titled "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4."
"We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web," the researchers explain in their paper...
David Bamman is an associate professor at the I School. He previously won a National Science Foundation (NSF) CAREER award for his research on designing computational methods to improve natural language processing for fiction.
Kent Chang is a Ph.D. student at the I School advised by Professor Bamman. His research uses natural language processing to understand and facilitate the process of meaning-making and social interaction in cultural texts, with particular interests in dialogue and narrative understanding.
Sandeep Soni is a postdoctoral researcher at the I School under Professor Bamman. His research models the dynamics of language with applications in computational social science and computational humanities.