Academic publishers have called for more protections and greater transparency over the way artificial intelligence chatbots are trained, amid a string of lawsuits seeking to protect copyrighted material.
The progress of legal cases alleging that work was copied without consent, credit or compensation by the likes of OpenAI – creator of ChatGPT and GPT-4 – and Google is being closely followed, with experts predicting that large academic publishers might start their own claims in time.
Data “is going to prove to be the moat that companies protect themselves with against the onslaught of generative AI, especially large language models”, predicted Toby Walsh, Scientia professor of artificial intelligence at UNSW Sydney.
“I can’t imagine the publishers are going to watch as their intellectual property is ingested unpaid.”
Thomas Lancaster, a senior teaching fellow in computing at Imperial College London, agreed. “There are academic publishers out there who are very protective of their copyright, so I’m sure some are actively trying to work out what content is included in the GPT-4 archive,” he said.
“I wouldn’t be surprised if we see academic lawsuits in the future, but I suspect a lot will depend on any precedents that come through from the current claims.”
In July, authors Mona Awad and Paul Tremblay filed a class action complaint in a San Francisco court alleging that their books had been “used to train” ChatGPT, because it was able to generate “very accurate summaries”. Comedian Sarah Silverman has started a similar claim.
OpenAI has said little about the sources that have been fed into its model, and it is unclear how academic research was used during its development.
However, Meta’s Galactica – which bills itself as a large language model (LLM) for science – is known to have been trained on millions of articles and claims to be able to summarise academic papers.
Many of these studies are available openly online, and LLMs also draw on news stories and reviews that discuss research findings, suggesting that publishers might find it difficult to prove that their copyright has been violated.
Dr Lancaster said that, after checking for his own papers, it “appears GPT-4 has access to a lot of abstracts, but not the main paper text and detailed content”.
The myriad copyright laws used in different countries are a further complication, he added. Many governments have loosened the rules to enable data mining as a way of encouraging AI development.
Patrick Goold, reader in law at City, University of London, said that even if publishers could prove that books and journals had been used in the training of chatbots, courts would likely rule that copyright had not been infringed because the AI “spits out an expression that is entirely unique”.
Despite the legal uncertainties, publishers told Times Higher Education that more needed to be done to protect academic work and to force AI developers to be more open in acknowledging their sources.
Wiley said it was “closely monitoring industry reports and related litigation claiming that generative AI models are harvesting copyright-protected material for training purposes, while disregarding existing restrictions on that information”.
“We have called for greater regulatory oversight and international collaboration, including transparency and audit obligations for AI language model providers, to address the accuracy of inputs and the potential for unauthorised use of restricted content as an input for model training,” a spokesperson said. “In short, we need more protections for copyrighted materials and other intellectual property.”
The American Association for the Advancement of Science, publisher of the Science family of journals, said there was a need for “appropriate limitations” to be put on text and data mining to avoid “unintended consequences”.
“Given the fast pace of artificial intelligence development, it is critically important to monitor the creation and adoption of guidelines for tools that can be trained on full-text journal articles, including for the purposes of replicating scholarly journal content, to ensure a focus on responsible and ethical development,” a statement said.
Elsevier said it did not permit its content to be input into public AI tools because “doing so may train such tools with Elsevier’s content and data, and other companies may claim ownership on outputs based upon our content and data”.
While there is widespread support for open access to academic publications among scholars, researchers have echoed calls for transparency in the development of AI to ensure that its outputs acknowledge scientific uncertainty and are not accepted uncritically.
Professor Walsh said this would help in the understanding of the “limitations and abilities of these systems”, but companies were generally becoming less transparent, “largely I suspect because they’re trying to avoid legal cases from those whose data they’re using”.
Anyone publishing academic work should be prepared for it to be “synthesised, analysed, recrystallised and sometimes misappropriated”, said Andy Farnell, a visiting professor of signals, systems and cybersecurity at a number of European universities.
“Research depends on exactly that process of ingestion and resynthesis that the AI is now doing better than research scientists, who have become fixated on grant applications and administrivia.”
POSTSCRIPT:
Print headline: Journals seek safeguards on AI’s mining of research