Books3 Dataset Used to Train Generative AI Included Pirated Ebooks by Australian Authors

Thousands of Australian authors, including prominent figures like Richard Flanagan, Helen Garner, Tim Winton, and Tim Flannery, have found their books within a pirated dataset of eBooks known as Books3, employed to train generative AI models. This revelation has triggered strong reactions from authors globally who were unaware of their works being used without consent or acknowledgement.


The Search Tool That Unveiled the Inclusion

A search tool recently introduced by The Atlantic allows authors to check whether their books are among the nearly 200,000 works in the Books3 dataset. Authors like Richard Flanagan have expressed a sense of powerlessness at seeing their creative works exploited without permission. Olivia Lanchester, CEO of the Australian Society of Authors, has raised concerns about copyright infringement, highlighting the potential damage to authors' already precarious careers.


As AI technology progresses rapidly, traditional copyright laws appear ill-equipped to address the emerging challenges. Copyright law, which traditionally safeguarded authors and creators from unauthorised usage of their works, is struggling to keep pace with the AI revolution. The legal framework designed for the pre-AI era finds itself out of sync in a world where AI is evolving at breakneck speed.


A Surge in Copyright Disputes

The central question now confronting us is whether we can afford to wait for the legal system to catch up with technology or if we must expedite this process. Authors are increasingly turning to copyright law, but its effectiveness in the context of AI data sets is questionable. Authors find themselves in a race against time as AI innovation continues to surge ahead.

So my books are also being used to train AI without my permission using the Books3 database. Millions of hours of authors' work being exploited by big tech with zero payment. Funny how writers keep getting shafted. RAGE.

Article here: https://t.co/N88Ay6NJbE pic.twitter.com/rAKzfaw3ri
— Sathnam Sanghera (@Sathnam) September 26, 2023

Copyright disputes over AI datasets that contain copyright-protected works are on the rise. Recently, the US Authors Guild filed a class-action lawsuit against OpenAI for copyright infringement, with authors such as Jonathan Franzen and Jodi Picoult among the plaintiffs.

This lawsuit follows the first copyright case against OpenAI filed in July by authors Mona Awad and Paul Tremblay for unauthorised use of their books to train AI models. In August, Benji Smith had to remove his website, Prosecraft, which used an algorithm to analyse over 25,000 books without authors’ consent to provide writing advice.

Understanding Books3 and Its Origins

Books3, the contentious dataset in question, was spotlighted by The Atlantic in September. Author information is available for 183,000 of the 191,000 ISBNs the dataset contains. Books3 has been used to train AI models such as Meta’s LLaMA, EleutherAI’s GPT-J and Bloomberg’s BloombergGPT.

CALLING ALL AUTHORS!

Find out precisely how many of your books have been stolen to train AI right here: https://t.co/rqYbtrqnKU
— Val McDermid (@valmcdermid) September 26, 2023

Its creator, Shawn Presser, developed the dataset as a resource for independent developers to compete with tech giants like OpenAI. OpenAI itself is believed to have used similar datasets, Books1 and Books2, to train AI models like ChatGPT.

The Ethical Dilemma Regarding Pirated Material

Authors like Dervla McTiernan, whose work was found within Books3, expressed outrage over what they perceive as outright theft. They argue that companies like OpenAI and Meta knowingly used pirated material for their AI models, driven by self-interest. The lack of consent and acknowledgement from authors raises questions about the ethics of such actions, leaving many authors feeling broken and exploited.

OpenAI’s Lack of Transparency Sparks Concern Over Books3 Training Data

At a time when OpenAI has gradually reduced its transparency regarding training data, the veil has been lifted on the origins of the Books3 repository. Books3 is derived from the Bibliotik library, which is categorised as a “shadow library”, akin to industry-derided sources such as Libgen, Z-Library, and Sci-Hub.


To create this dataset, Shawn Presser had to develop scripts capable of transforming PDFs and images into usable .txt files, a labour-intensive endeavour. Presser highlighted the importance of democratising the creation of AI models, stating that anyone should be able to develop their own. He views open access to AI model creation as crucial, similar to the democratisation of website creation in the 1990s.

Meanwhile, concerns are mounting regarding the use of copyrighted content in training AI models. A group led by Fredenslund is contemplating reaching out to Meta to address this issue. While it is unlikely that Meta would retrain its AI model entirely to satisfy copyright holders, the episode makes evident the lack of global regulations mandating transparency for AI models.

Copyright Alone Is Not the Solution

While copyright infringement is evident in the unauthorised use of works within AI datasets, pursuing individual legal action can be challenging and may yield only modest damages. Moreover, fair dealing and fair use provisions in copyright law may provide some protection to AI dataset creators. Additionally, AI-generated outputs may not always meet the substantial similarity threshold required for copyright infringement claims in most jurisdictions.


While the European Union is working on the AI Act, which will require companies to provide some transparency about their models, advocates argue that AI developers should be compelled to share the specifics of their training data, including the precise works used to create their AI models.


Conclusion

In the face of AI’s rapid advancement, copyright law is facing its most formidable challenge yet. Authors are grappling with a complex and evolving legal landscape. The outcome of ongoing copyright disputes will shape the future of AI-generated content and the rights of authors in this digital age.

Frequently Asked Questions

What Are Current Copyright Laws Regarding Work Created by Artificial Intelligence?

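Copyright laws regarding AI-produced works vary by nation. In the US, for instance, the Copyright Office does not register works created by AI without human involvement. In the UK, by contrast, the Copyright, Designs and Patents Act 1988 provides protection for computer-generated works, treating the person who made the arrangements necessary for the work's creation as its author.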

How Can Authors Check if Their Books Are in Books3?

Authors can check if their books are in Books3 by using the search tool introduced by The Atlantic. The tool allows authors to enter their book titles or ISBNs and see if they match any of the works in the dataset.
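For readers curious about what an ISBN match involves, here is a minimal Python sketch. It does not reflect how The Atlantic's tool is built, which the article does not describe. The function names and the `dataset_isbns` list are hypothetical, and the sketch assumes you already have a plain list of ISBNs to compare against. The main wrinkle is that the same book can appear as a 10-digit or 13-digit ISBN, with or without hyphens, so both sides are normalised to ISBN-13 before comparing.

```python
def to_isbn13(raw: str) -> str:
    """Normalise an ISBN-10 or ISBN-13 string to a bare 13-digit form."""
    # Keep digits and the 'X' check character used by some ISBN-10s.
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()
    if len(cleaned) == 13:
        return cleaned
    if len(cleaned) != 10:
        raise ValueError(f"not an ISBN-10 or ISBN-13: {raw!r}")
    # ISBN-10 -> ISBN-13: prefix 978, drop the old check digit, recompute.
    core = "978" + cleaned[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def find_matches(my_isbns, dataset_isbns):
    """Return the entries of my_isbns that also appear in dataset_isbns.

    dataset_isbns is a hypothetical list of ISBNs; this sketch makes no
    assumption about where such a list would come from.
    """
    index = {to_isbn13(isbn) for isbn in dataset_isbns}
    return [isbn for isbn in my_isbns if to_isbn13(isbn) in index]


if __name__ == "__main__":
    mine = ["0-306-40615-2", "978-1-4028-9462-6"]
    listed = ["9780306406157"]
    print(find_matches(mine, listed))
```

Matching on normalised ISBNs avoids false negatives caused by formatting differences, although different editions of the same title carry different ISBNs and would need to be checked separately.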

What Are Some of the AI Models That Have Used Books3 as a Training Dataset?

The Books3 dataset contains about 191,000 ISBNs sourced from unauthorised platforms, with author information available for roughly 183,000 of them. It has been used by organisations such as Meta (developer of LLaMA), EleutherAI (GPT-J), and Bloomberg (BloombergGPT) to train their language models.

Author Profile

Scott Faulkner
