Thousands of Australian authors, including prominent figures like Richard Flanagan, Helen Garner, Tim Winton, and Tim Flannery, have found their books within a pirated dataset of eBooks known as Books3, employed to train generative AI models. This revelation has triggered strong reactions from authors globally who were unaware of their works being used without consent or acknowledgement.
The Search Tool That Unveiled the Inclusion
A search tool recently introduced by The Atlantic allows authors to check whether their books are among the nearly 200,000 works in the Books3 dataset. Authors like Richard Flanagan have expressed a sense of powerlessness at seeing their creative works exploited without permission. Olivia Lanchester, CEO of the Australian Society of Authors, has raised concerns about copyright infringement, highlighting the potential damage to authors' already precarious careers.
As AI technology progresses rapidly, traditional copyright laws appear ill-equipped to address the emerging challenges. Copyright law, which traditionally safeguarded authors and creators from unauthorised usage of their works, is struggling to keep pace with the AI revolution. The legal framework designed for the pre-AI era finds itself out of sync in a world where AI is evolving at breakneck speed.
A Surge in Copyright Disputes
The central question now is whether we can afford to wait for the legal system to catch up with technology or whether we must expedite the process. Authors are increasingly turning to copyright law, but its effectiveness in the context of AI datasets is questionable. Authors find themselves in a race against time as AI innovation continues to surge ahead.
Copyright disputes over AI datasets and copyright-protected works are on the rise. Recently, the US Authors Guild filed a class-action lawsuit against OpenAI for copyright infringement, with authors such as Jonathan Franzen and Jodi Picoult among the plaintiffs.
This lawsuit follows the first copyright case against OpenAI filed in July by authors Mona Awad and Paul Tremblay for unauthorised use of their books to train AI models. In August, Benji Smith had to remove his website, Prosecraft, which used an algorithm to analyse over 25,000 books without authors’ consent to provide writing advice.
Understanding Books3 and Its Origins
Books3, the contentious dataset in question, was spotlighted by The Atlantic in September. It contains about 191,000 ISBNs, with author information available for 183,000 of them, and has been used to train AI models such as Meta’s LLaMA, EleutherAI’s GPT-J and Bloomberg’s BloombergGPT.
Its creator, Shawn Presser, developed the dataset as a resource for independent developers to compete with tech giants like OpenAI. OpenAI itself is believed to have used similar datasets, Books1 and Books2, to train AI models like ChatGPT.
The Ethical Dilemma Regarding Pirated Material
Authors like Dervla McTiernan, whose work was found within Books3, expressed outrage over what they perceive as outright theft. They argue that companies like OpenAI and Meta knowingly used pirated material for their AI models, driven by self-interest. The lack of consent and acknowledgement from authors raises questions about the ethics of such actions, leaving many authors feeling broken and exploited.
OpenAI’s Lack of Transparency Sparks Concern Over Books3 Training Data
At a time when OpenAI has gradually reduced its transparency regarding training data, the veil has been lifted on the origins of the Books3 repository. Books3 is derived from Bibliotik, a so-called “shadow library” akin to industry-derided sources such as Libgen, Z-Library, and Sci-Hub.
To create the dataset, Presser had to develop scripts capable of transforming PDFs and images into usable .txt files, a labour-intensive endeavour. He has stressed the importance of democratising the creation of AI models, stating that anyone should be able to develop their own. He views open access to AI model creation as crucial, much like the democratisation of website creation in the 1990s.
Meanwhile, concerns are mounting over the use of copyrighted content in training AI models. A group led by Maria Fredenslund, director of the Danish anti-piracy organisation Rights Alliance, is contemplating reaching out to Meta to address the issue. Meta is unlikely to retrain its AI model entirely to satisfy copyright holders, and no global regulations currently mandate transparency for AI models.
Copyright Alone Is Not the Solution
While copyright infringement is evident in the unauthorised use of works within AI datasets, pursuing individual legal action can be challenging and may yield only modest damages. Moreover, fair dealing and fair use provisions in copyright law may provide some protection to AI dataset creators. Additionally, AI-generated outputs may not always meet the substantial similarity threshold required for copyright infringement claims in most jurisdictions.
While the European Union is working on the AI Act, which will require companies to provide some transparency about their models, advocates argue that AI developers should be compelled to share the specifics of their training data, including the precise works used to create their AI models.
Conclusion
In the face of AI’s rapid advancement, copyright law is facing its most formidable challenge yet. Authors are grappling with a complex and evolving legal landscape. The outcome of ongoing copyright disputes will undoubtedly shape the future of AI-generated content and the rights of authors in this digital age.
Frequently Asked Questions
What Are Current Copyright Laws Regarding Work Created by Artificial Intelligence?
Copyright laws regarding AI-produced works vary by nation. In the US, for instance, the Copyright Office does not register works created by AI without human involvement. In the UK, the Intellectual Property Office has adopted a similar stance.
How Can Authors Check if Their Books Are in Books3?
Authors can check whether their books are in Books3 by using the search tool introduced by The Atlantic. The tool allows authors to enter their book titles or ISBNs and see whether they match any of the works in the dataset.
What Are Some of the AI Models That Have Used Books3 as a Training Dataset?
Books3 has been used by Meta (developer of LLaMA), EleutherAI (GPT-J), and Bloomberg (BloombergGPT) to train their language models. The dataset is sourced from unauthorised platforms and includes 183,000 books for which author information has been identified.