A prominent news media trade group, the News Media Alliance, has raised concerns about AI technology companies that scrape the web to train their chatbots and generative AI systems. The organisation, which represents nearly 2,000 media outlets in the United States, recently published research that highlights how companies like OpenAI and Google have used news, magazine, and digital media content to train their AI systems.
Notably, the research revealed that these AI companies have instructed their bots to place significantly more trust in information sourced from reputable publishers, as opposed to content from other sources on the internet. This raises important questions about the use of copyrighted news material and the credibility of AI-generated information.
“The research and analysis we’ve conducted shows that AI companies and developers are not only engaging in unauthorised copying of our members’ content to train their products, but they are using it pervasively and to a greater extent than other sources,” said Danielle Coffey in a statement.
What Is Web Scraping?
“This acknowledgment underscores their awareness of the distinctive worth of our content. However, it’s crucial to note that many of these developers are not acquiring the necessary permissions through licensing agreements or providing compensation to publishers for utilising this content,” Coffey emphasised. “This failure to respect high-quality, human-generated content negatively impacts not only publishers but also the long-term viability of AI models and the accessibility of dependable, credible information.”
In its published white paper, the trade group unequivocally dismissed arguments suggesting that AI bots have merely “learned” facts in the same manner as humans do, by absorbing information from various datasets. The group asserted that such a conclusion is “inaccurate” since these models retain the expressions of facts present in the works included in their training materials (which are protected by copyright) without genuinely comprehending the underlying concepts.
Publishers, who have been engaged in a sort of Cold War with AI companies, have recently begun implementing defensive measures to safeguard their content. In August, a review by Reliable Sources revealed that a dozen prominent media firms had embedded code into their websites to protect their content from AI bots that scrape the internet for information. Furthermore, many more publishers have since adopted similar protective measures to preserve the integrity of their content.
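The review cited above does not spell out what code the publishers embedded. One widely used mechanism is a robots.txt file that asks named crawlers to stay away, so the following is a minimal sketch under that assumption. The rules, the example URL, and the helper name is_allowed are illustrative and not taken from any publisher's site; GPTBot is used here as an example crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules (an assumption for illustration):
# ask one AI crawler to stay away while leaving other agents unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://news.example.com/story.html"  # placeholder address
    for agent in ("GPTBot", "SomeBrowser"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

Rules like these are a request rather than an enforcement mechanism: they only work when the crawler chooses to honour them, and they do nothing about content that was copied before the rules were added.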
Unlawful News Scraping Has Got to Stop
Indeed, these defensive measures primarily focus on safeguarding news organisations from future web scraping activities. Unfortunately, they do not address the issue of prior scraping, which, as noted by news outlets, has been used to train AI bots. To address this challenge, the News Media Alliance has put forward a set of recommendations aimed at preserving the place of news publishers in this rapidly evolving landscape.
These recommendations call for policymakers to acknowledge that the unauthorised use of copyrighted material for training AI bots constitutes infringement, and they emphasise the importance of allowing publishers to efficiently license the use of their content under fair terms.
“Our culture, our economy, and our democracy require a solution that allows the news and media industry to grow and flourish, and both to share in the profit from and participate in the development of the GAI revolution that is being built upon the fruits of its labor,” the News Media Alliance said.
Why Is News Scraping Bad for the Industry? 7 Potential Dangers
Web scraping, the automated process of extracting news content from websites and other sources, is considered problematic for several reasons:
1. Copyright Infringement
Many news articles, images, and videos are protected by copyright law, meaning that they are owned by the original content creators or publishers. Web scraping without proper authorisation can infringe upon these copyrights, as the scraper may use this content without permission.
2. Revenue Impact
News organisations rely on various revenue streams, such as advertising, subscriptions, and content licensing, to support their journalism. Web scraping can undermine these revenue models by freely reproducing content that should be subject to licensing fees or advertising placements.
3. Credibility and Accuracy
Scraped content may not always be properly attributed or may be taken out of context. This can affect the credibility and accuracy of the information presented, potentially misleading readers or distorting the original intent of the content.
4. Privacy Concerns
Web scraping can also raise privacy concerns if personal or sensitive information is extracted and used without consent. This may apply to both the individuals mentioned in news articles and the website visitors themselves.
5. Resource Drain
Frequent web scraping can put a strain on the resources and infrastructure of news websites, potentially leading to slower loading times, increased server costs, and other technical issues.
6. Content Manipulation
Some scrapers use the content for malicious purposes, such as generating fake news or spamming online forums and social media with misleading information.
7. Ethical Considerations
Engaging in web scraping without proper authorisation may be seen as unethical, as it may violate the principles of respecting intellectual property, fair use, and the terms of service of websites.
The Good of the People Is the Greatest Law
Discussions and legal battles over web scraping are ongoing, and they centre on striking a balance between data access for legitimate purposes (such as data analytics) and the protection of content creators’ rights. Data protection is also essential to safeguard individuals’ privacy and prevent unauthorised access to or misuse of sensitive information. What are your thoughts? Do you think web scraping is harmful, or is it an essential tool for work efficiency?