35 US Newspaper Publishers Sue OpenAI and Microsoft Over AI Training Data Use

Thirty-five US newspaper publishers have launched a copyright lawsuit against Microsoft and OpenAI, accusing the companies of copying millions of articles—including paywalled content—into datasets used to train AI models like ChatGPT and Microsoft Copilot without permission or compensation.

Publishers accuse OpenAI of scraping paywalled articles and removing copyright notices.
Over 115 million tokens from plaintiffs’ content identified in datasets approximating OpenAI’s training corpora.
Claims include direct infringement and DMCA violations for removing copyright management information.

What happened

On June 24, 2026, 35 local and regional newspaper publishers filed a copyright lawsuit against Microsoft and OpenAI entities in the Southern District of New York. The complaint alleges that the defendants used automated tools to crawl the plaintiffs' websites and copied millions of articles, including paywalled content, without authorization. The publishers state that copyright management information such as author names and publication details was deliberately stripped from the collected data before it was used to train AI models like ChatGPT and Microsoft Copilot.

The plaintiffs range from large regional chains to small family-owned newspapers. The complaint details token counts from datasets approximating OpenAI’s training corpora, showing millions of tokens sourced directly from these publications. Examples cited include Ogden Newspapers, with over 71 million tokens, and AIM Media Indiana, with close to 900,000 tokens, in separate datasets.

Why it matters

This litigation highlights growing tensions between traditional media companies and AI developers over the use of copyrighted content in training datasets. The publishers argue that OpenAI and Microsoft have leveraged their original journalism to develop profitable AI products without compensation or acknowledgment. Beyond standard copyright infringement claims, the case also centers on the removal of copyright management information, which the plaintiffs say was done intentionally to conceal the use of their works and complicate enforcement efforts.

A ruling in favor of the publishers could set significant precedents on data collection practices and usage rights for AI training. The case underscores the broader industry debate about how AI companies should legally and ethically source the content that powers language models, particularly paywalled and proprietary material.

What to watch next

The lawsuit will be closely observed as it proceeds through the courts, with early stages focusing on discovery and evidentiary details around how content was collected and used by OpenAI and Microsoft. Potential settlements or rulings could influence whether large AI developers need to negotiate licensing deals with news publishers or implement new data filtering protocols.

Regulators and industry stakeholders may also respond by proposing clearer guidelines for using copyrighted materials in AI training, balancing innovation with intellectual property protections. The outcome could impact not only the news media industry but also technology companies worldwide that rely on large-scale data scraping for machine learning.

Source assisted: This briefing began from a discovered source item from MediaNama.
