“OpenAI, Anthropic bypass web crawling blocks…collect data indiscriminately”

(Photo = Shutterstock)

It has been revealed that the world's two largest AI startups, which are leading the development of cutting-edge artificial intelligence (AI), are indiscriminately 'crawling' the web to collect data for model training. OpenAI and Anthropic were identified as the companies involved.

Business Insider reported on the 22nd that the two companies were found to be ignoring or bypassing robots.txt, a mechanism websites use to prevent automated scraping.

The issue came to light the day before through TollBit, a startup that brokers paid licensing agreements between publishers and AI companies.

TollBit discovered that several AI companies were behaving this way and notified certain large publishers by letter. The letter did not name the companies accused of violating the rules.

As a result, most attention turned to Perplexity, an AI search startup that recently drew controversy over accusations that it stole a Forbes article. However, Business Insider confirmed the names of the companies and said they were OpenAI and Anthropic.

Notably, the two companies have publicly stated that they respect the rules of robots.txt and that their own web crawlers, GPTBot and ClaudeBot, are blocked accordingly.

However, according to TollBit's findings, this announcement was only a formality. Some companies, including these two, were found to be bypassing robots.txt in order to retrieve or scrape all content from specific websites or pages.

OpenAI and Anthropic declined to comment specifically on this.

robots.txt is a simple piece of code that has been used since the late 1990s as a way for websites to tell crawlers that they do not want their data scraped and collected. It has been widely accepted as one of the unofficial rules underpinning the web.
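As a rough illustration of how the mechanism works, the sketch below shows how a well-behaved crawler might check robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `example.com` URL, the `SomeOtherBot` agent, and the helper name `is_allowed` are all hypothetical, chosen only for this example. It also shows why the rule is informal: the check runs inside the crawler itself, so nothing stops a crawler from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: two named AI
# crawlers are blocked from the whole site, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/article.html"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

A compliant crawler would run a check like this and skip any URL it is not allowed to fetch. A crawler that ignores the file faces no technical barrier, which is the kind of behavior the report describes.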

However, as AI competition has intensified and data shortages have become more severe, these informal rules have weakened.

OpenAI has been signing large contracts with global media companies and publishers one after another this year. At the same time, it is also involved in about 20 copyright lawsuits. Last month, it also announced that it was developing a 'Media Manager' that will allow creators and content owners to exclude content posted on the web from AI training.

Little is known about Anthropic's copyright agreements or lawsuits. However, it has recently drawn attention with the release of Claude 3 and 3.5, and is expected to face data problems just as much as OpenAI.

Reporter Lim Da-jun ydj@aitimes.com
