The publisher becomes the latest news organization to block the artificial intelligence company from harvesting its content to develop its tools
The Guardian has blocked OpenAI from using its content to power artificial intelligence products such as ChatGPT. OpenAI's use of unlicensed content to build its AI tools has prompted lawsuits from writers, while the creative industries push for measures to protect their intellectual property.
The publisher has confirmed that it has blocked OpenAI from using software to scrape its content.
Generative AI, the term for products that can produce convincing text, images, and audio from simple human prompts, has gripped the public imagination since the launch of an advanced version of the ChatGPT chatbot last year. But concerns have also been raised about the technology's potential to spread misinformation at scale, and about how these tools are built.
ChatGPT and similar tools are "trained" on vast datasets scraped from the open internet, including news articles. This training enables the tools to predict the most probable word or sentence to follow a user's prompt.
OpenAI, which does not disclose the data used to build the ChatGPT model, announced in August that it would allow website operators to block its web crawler from accessing their content. The move does not, however, remove material from existing training datasets. A number of publishers and websites have since blocked the GPTBot crawler.
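In practice, OpenAI's documented mechanism for this is the standard robots.txt exclusion protocol: a site operator adds a rule targeting the GPTBot user agent to the robots.txt file at the root of their domain. A minimal example, blocking the crawler from an entire site, looks like this:

```text
# robots.txt — placed at the site root, e.g. https://example.com/robots.txt
# Blocks OpenAI's GPTBot crawler from all pages on the site
User-agent: GPTBot
Disallow: /
```

Operators who want finer control can instead disallow only specific directories; as the article notes, this prevents future crawling but does not remove pages already present in existing training datasets.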
A spokesperson for Guardian News & Media, publisher of the Guardian and Observer, said: "The extraction of intellectual property from the Guardian's website for commercial purposes goes against our terms of service, both historically and presently. The Guardian's commercial licensing team engages in numerous mutually beneficial partnerships with developers worldwide and anticipates fostering additional such collaborations in the future."
According to Originality.ai, a platform that detects AI-generated content, a number of news websites now block the GPTBot crawler, which scrapes data from webpages to feed OpenAI's models. They include CNN, Reuters, the Washington Post, Bloomberg, the New York Times, and its sports site, the Athletic. Other sites that have blocked GPTBot include Lonely Planet, Amazon, the job listings site Indeed, the question-and-answer site Quora, and dictionary.com.
British book publishers have urged Rishi Sunak to put the protection of creative industries' intellectual property on the agenda of November's AI safety summit in the UK. In a letter to the prime minister, the Publishers Association, which represents publishers of digital and print books, research journals, and educational content, urged him to stress that intellectual property law must be upheld as AI systems are developed.
In July, Elon Musk imposed restrictions on his Twitter platform, now known as X, in response to what he said were "excessive levels of data scraping" by AI companies building their models. He said in a tweet that "virtually every AI company" was extracting "enormous quantities of data" from Twitter, which he claimed had forced the company to bring additional servers online, at extra cost, to handle the demand.
However, Musk has also confirmed that he intends to use public tweets to train models built by his newly announced AI venture, xAI.
Google's privacy policy now states that the company, whose web crawlers help surface search results for users, may collect publicly available data to train models for its AI products, including the Bard chatbot. Meta, the parent company of Facebook and Instagram and itself a major AI developer, has meanwhile introduced a policy that lets users indicate whether they wish to opt out of having their personal information used to train AI models.