NY Times Vs OpenAI Court Case: Who Will Solve Information Overload?

The battle for copyright is a critical factor for publishers to secure their place in the attention spans of consumers. The New York Times' case against ChatGPT is ground zero.

Dec 29, 2023


The shift from traditional publishing to digital distribution has produced an era of information overload and, with it, a need for new ways to consume and manage content. The resulting deluge makes it hard to filter and process everything that is available. AI platforms like ChatGPT stand out for their ability to aggregate, process, and personalize information. These developments mark a crucial step in managing the abundance of digital content, offering tailored, relevant, and manageable streams of information in an age otherwise characterized by an overwhelming flood of data.

Amidst this transformation, the battle for copyright has emerged as a critical factor for media publishers to secure their place in the attention spans of consumers. The New York Times' case against ChatGPT stands at ground zero of these developments, highlighting the urgent need for clear guidelines and fair practices in the use of AI in content creation and distribution. The unfolding legal battles and strategic partnerships in the industry are reshaping the landscape, making it imperative for media companies to adapt and innovate while fiercely protecting their intellectual property.

The Battle Over AI and Copyright

In a landmark case that could redefine the boundaries of artificial intelligence development and copyright law, The New York Times has launched a federal lawsuit against tech giants OpenAI and Microsoft. Filed on December 27, 2023, the lawsuit alleges unauthorized use of copyrighted content to train advanced AI models and power products built on them, including OpenAI's GPT models and Microsoft's Bing Chat. This high-profile case not only challenges the legal tenets of fair use but also raises profound ethical questions regarding AI development.

The NYT Case: A Landmark in Copyright and AI Development

The lawsuit accuses the defendants of copyright infringement under U.S. law, particularly 17 U.S.C. § 106. The crux of the case revolves around the NYT's claim to exclusive rights over its literary works, which it alleges were used to train and distribute the defendants' AI models. The complaint highlights potential financial impacts and threats to future business models, including AI licensing, vital for funding journalism globally. The NYT emphasizes the extensive human and financial investment in reporting critical global events, underscoring the stakes of this case.

Central to the NYT's lawsuit is the pursuit of fair compensation for its content and the fostering of a responsible AI ecosystem within a healthy news landscape. The complaint details instances of alleged verbatim use of NYT content, comparing it to other search engines' approaches. It also addresses the purported preferential use of NYT-sourced content in AI training, indicating potential market destabilization for journalism.

The Constitution and the Copyright Act recognize the critical importance of giving creators exclusive rights over their works. Since our nation’s founding, strong copyright protection has empowered those who gather and report news to secure the fruits of their labor and investment. Copyright law protects The Times’s expressive, original journalism, including, but not limited to, its millions of articles that have registered copyrights….
The Times objected after it discovered that Defendants were using Times content without permission to develop their models and tools. For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple)…
These negotiations have not led to a resolution. Publicly, Defendants insist that their conduct is protected as “fair use” because their unlicensed use of copyrighted content to train GenAI models serves a new “transformative” purpose. But there is nothing “transformative” about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it. Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use. (Case 1:23-cv-11195, Document 1, filed 12/27/23)

Key Allegations

The New York Times (NYT) raises four major complaints centered around copyright infringement during the development and use of generative AI models:

1. Unauthorized Reproduction of NYT Works During GPT Model Training

The NYT alleges that OpenAI used millions of copyrighted NYT works, including from the website and third-party datasets, in building training datasets for GPT models. Common Crawl, a significant dataset in GPT-3's training, is highlighted as a primary source of this content. It's described as a comprehensive snapshot of the internet, with www.nytimes.com being one of the most represented sources. The GPT models, including the speculated larger GPT-4, are believed to have been trained on extensive data including NYT content.

Case 1:23-cv-11195 Document 1 Filed 12/27/23

Common Crawl, a significant dataset in GPT-3's training, is essentially a 'copy of the Internet' made available by a non-profit organization. OpenAI claims fair use of openly available datasets; however, Common Crawl's terms of use make clear that the crawled content may remain subject to its owners' terms:

"You also acknowledge and agree that all information, data, text, scripts, web pages, web sites, software, html page links, open data APIs, metadata or other materials (collectively, the "Crawled Content ") may be subject to separate terms of use or terms of service from the owners of such Crawled Content."

2. Incorporation of NYT Works in GPT Models

The lawsuit claims that the GPT language models have "memorized" many NYT works, encoding them into their parameters. As evidence, it points to the models producing near-verbatim outputs of significant portions of NYT works when prompted. For instance, a Pulitzer Prize-winning series by The Times on predatory lending was allegedly recited nearly verbatim by the AI models.
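One way to picture what "near-verbatim" means in practice is to measure how much of a model's output consists of long word runs copied from a source text. The sketch below is an illustration only, not the method used in the complaint: the function names, the `min_run` threshold of five words, and the sample strings are all assumptions made for this example. It relies on Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def longest_shared_run(source: str, output: str) -> str:
    """Return the longest run of consecutive words found in both texts."""
    src_words = source.split()
    out_words = output.split()
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(src_words), 0, len(out_words))
    return " ".join(src_words[match.a:match.a + match.size])


def verbatim_share(source: str, output: str, min_run: int = 5) -> float:
    """Fraction of the output's words that sit inside shared runs.

    Only runs of at least `min_run` consecutive words are counted, so that
    common short phrases do not inflate the score. The threshold is an
    arbitrary choice for this sketch.
    """
    src_words = source.split()
    out_words = output.split()
    if not out_words:
        return 0.0
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    shared = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_run
    )
    return shared / len(out_words)


# Hypothetical strings standing in for an article excerpt and a model output.
article = "the quick brown fox jumps over the lazy dog near the river bank"
generated = "reports say the quick brown fox jumps over the lazy dog today"

print(longest_shared_run(article, generated))
print(verbatim_share(article, generated))
```

A high share across long passages would suggest copying rather than paraphrase, although any real analysis would also need to handle punctuation, casing, and quotation of public-domain material.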

3. Unauthorized Display of NYT Works in Product Outputs

NYT contends that generative outputs from products like ChatGPT display NYT content by showing memorized copies or derivatives of their works. ChatGPT, for example, is said to produce summaries and texts closely resembling NYT articles that would typically be locked behind a paywall.

4. Unauthorized News Dissemination via Search Applications

The complaint extends to synthetic search applications like Bing Chat, which are said to display extensive paraphrases or excerpts of NYT content. These applications are accused of creating natural-language substitutes for NYT articles, reducing the need for users to access the original sources, thereby impacting the NYT's audience and revenue.

Prayer for Relief

As the legal battle between The New York Times and technology giants OpenAI and Microsoft unfolds, a crucial segment of the lawsuit, known as the "Prayer for Relief," merits special attention. This part of the litigation is where the NYT explicitly outlines the specific remedies and resolutions it seeks from the court. The requests made in this section are pivotal in understanding the depth and seriousness of the allegations, as well as the broader implications for the media, technology, and legal landscapes. The Prayer for Relief is a window into the NYT's strategic goals in the lawsuit, encapsulating its efforts to protect its copyrighted materials and set precedents in the rapidly evolving domain of artificial intelligence and copyright law.

Monetary Compensation: The NYT is seeking various forms of financial compensation. This includes statutory damages (set amounts of money defined by law), compensatory damages (to cover losses or harm suffered), restitution (repayment for unjust gains), disgorgement (forcing the defendant to relinquish unjust profits), and any other financial relief permissible under law.
Permanent Injunction: The NYT is asking the court to permanently prohibit the defendants from continuing the unlawful and infringing activities that were alleged in the lawsuit.
Destruction of Infringing Materials: Under 17 U.S.C. § 503(b), the NYT requests an order for the destruction of all AI models and training sets (GPT or other large language models) that incorporate copyrighted works from The Times.
Legal Costs and Attorneys’ Fees: The NYT seeks an award to cover the costs and expenses incurred during the lawsuit, as well as the fees paid to their attorneys, as allowed by law.
Additional Relief: The NYT also requests any other relief that the court may consider appropriate, just, and fair in the context of the case.


The Ripple Effect

OpenAI and Microsoft's Defense: Both OpenAI and Microsoft have rejected the allegations, standing firm on the principle of fair use. They contend that their use of The Times's content falls within legal parameters and is essential for the progress of AI technology. This defense opens a complex debate over where the line is drawn between fair use and copyright infringement in the age of AI. The allegations are only days old, so this will be an important factor to monitor in the coming weeks.

Implications for AI Development: The outcome of this lawsuit could be a turning point for AI research and development. A victory for The New York Times might set a precedent, making it harder for AI companies to use copyrighted material without explicit consent. This could slow AI progress or increase costs as companies might have to negotiate rights for content usage. Conversely, if OpenAI and Microsoft succeed, it could reinforce the scope of fair use, potentially accelerating AI advancements.

Elon Musk's X: Amidst this legal battle, Elon Musk's X (formerly Twitter) has revised its terms to prohibit data scraping, a move likely aimed at protecting its data from being used to train AI models. The new terms explicitly ban scraping or crawling without prior written consent, a stark shift from the previous stance that allowed crawling in line with robots.txt instructions. This change, especially the restriction on bots other than Google's, signifies a tightening grip on how public social media data is used in the AI arena.
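For readers unfamiliar with robots.txt, the file is a plain-text list of rules that tells each crawler, identified by its user agent, which paths it may fetch. The sketch below uses Python's standard `urllib.robotparser` to evaluate a made-up rule set that admits Google's crawler and turns everyone else away. The rules, the `example.com` URL, and the `ExampleAIBot` user agent are hypothetical and are not X's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, all others nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Check whether `user_agent` is permitted to fetch `url` under the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/some-post"  # placeholder URL
print(may_crawl("Googlebot", page))      # allowed by the first rule group
print(may_crawl("ExampleAIBot", page))   # falls through to the catch-all group
```

robots.txt is a voluntary convention that well-behaved crawlers honor, which is one reason a platform might back it with contractual terms of service.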

As this lawsuit unfolds, it has become a watershed moment in the discourse around AI, copyright, and ethics. It not only challenges existing legal frameworks but also compels the tech industry, legal experts, and creators to reevaluate the intersection of AI development and intellectual property rights. Regardless of the outcome, this case is set to have far-reaching implications on how we approach AI training and the balance between innovation and the protection of creators' rights in the digital age.

Axel Springer Partners with OpenAI

In a groundbreaking move that contrasts sharply with ongoing legal disputes, Axel Springer, a global news publisher headquartered in Berlin, has announced a partnership with OpenAI. This venture, unveiled on December 13, 2023, represents a strategic alignment between a major news organization and an AI technology leader. Unlike the contentious lawsuit involving The New York Times and OpenAI, this deal exemplifies a mutually beneficial relationship between AI development and content creators. Axel Springer has effectively addressed each of the NY Times' claims within this commercial agreement.

Compensation for Training Data

OpenAI has agreed to compensate Axel Springer for using its content, including archived material, to train its language models. While the financial details remain undisclosed, the arrangement is a multi-year, non-exclusive deal, signaling a sustainable and potentially replicable model for AI and journalism collaborations.

Incorporation of Axel Springer Works into ChatGPT

Under this agreement, when users pose questions to ChatGPT, the AI chatbot will provide summaries of relevant news stories from Axel Springer's brands, including Politico, Business Insider, Bild, and Welt. Remarkably, these summaries will incorporate material from subscription-based articles, citing the original publication as the source and offering links to full articles. This integration not only brings breaking news into the ChatGPT experience but also aims to drive traffic and subscription revenue towards Axel Springer's publications.

Favorable Position for Axel Springer Content

Reuters reported that, according to a source familiar with the deal, Axel Springer's content will receive a "favorable position" in ChatGPT search results. This strategic placement is intended to enhance user access to quality journalism while simultaneously boosting Axel Springer's digital footprint and financial returns.

Axel Springer's Vision

Axel Springer CEO Mathias Doepfner articulated the company's ambition to harness AI in journalism. The goal is to elevate the quality, societal relevance, and economic viability of journalism through AI integration, marking a significant shift from traditional news publishing paradigms.

Earlier, OpenAI had struck a similar deal with the Associated Press, focusing on leveraging AI technology without displaying content directly. News Corp is also reportedly in advanced discussions for a similar arrangement, indicating a growing trend of news organizations actively engaging with AI companies.

EU Media Companies Are More Protected than American Ones

The willingness of European companies like Axel Springer to collaborate with AI platforms like ChatGPT can be better understood by examining the differences in legal protections between European and American laws, especially in the context of database rights and copyright directives.

Database Rights in the EU

EU Database Rights: In the European Union, there are "sui generis" database rights, as part of the European Database Directive. These rights are specifically designed to protect the significant investment made by a database creator in assembling, verifying, and presenting the contents of a database. They grant the database maker exclusive rights to prevent the extraction and/or re-use of substantial parts of the database. This legal framework is unique to the EU and is not found in American law.

US Law: In the United States, there is no direct equivalent to the EU's sui generis database rights. Protection for databases in the US is generally afforded under copyright law, but this only extends to the creative elements of the database (such as annotations or an original selection and arrangement of the contents). The raw data itself, especially if it is considered factual information, is not protected in the same way as in the EU. Thus, the legal environment in the US is generally more permissive for the use and aggregation of data in databases.

European Copyright Directive vs. American Copyright Law

EU Copyright Directive (Articles 15 and 17): The European Copyright Directive, particularly Articles 15 and 17 (formerly known as Articles 11 and 13), focuses on protecting press publications and ensuring fair remuneration for online content use. Article 15 allows publishers to obtain fair and proportionate remuneration for the digital use of their press publications by information society service providers. Article 17 deals with the use of copyrighted content by online content-sharing platforms, requiring them to obtain licenses for the content and to filter out unauthorized uploads.

American Copyright Law: US copyright law does not have provisions directly equivalent to Articles 15 and 17 of the EU Copyright Directive. While it provides broad protection for original works of authorship, the US approach is more focused on fair use, which allows limited use of copyrighted material without permission for purposes like criticism, commentary, news reporting, and education. This fair use doctrine creates a more flexible environment for the use of copyrighted materials in the US compared to the EU.

Impact on Axel Springer's Collaboration with ChatGPT

Understanding these legal differences helps explain why a company like Axel Springer might be more open to collaborating with AI platforms. In the EU, the stronger database rights and the new Copyright Directive could provide Axel Springer with a greater degree of control and assurance over how its content is used and monetized. Collaborating with ChatGPT under a formal agreement allows Axel Springer to leverage AI technology while ensuring compliance with European legal standards and securing fair compensation for the use of its content.

In contrast, the legal environment in the US might not provide the same level of protection or control for content creators, which may make US companies more cautious or lead them to seek different terms in their collaborations with AI platforms.

ChatGPT as a Content Platform

The landscape of information dissemination has undergone seismic shifts over the past few decades, driven by the advent of the internet, mobile technology, and social media. These shifts, while democratizing access to information, have also led to an overwhelming surplus of content. The result is an evolution from traditional media to the current era of information saturation, and an emergent reliance on curated platforms like Substack and AI platforms like ChatGPT to navigate this deluge.

The Digital Revolution in Publishing

The combination of computers, Google, Kindle, and the Internet revolutionized traditional publishing. Traditional publishers, once constrained by the logistics of physical newspaper and book distribution, found in digital platforms the means to reach broader audiences at significantly lower costs. This digital proliferation not only expanded reach but also democratized access, allowing for a more diverse set of voices to be heard. However, this ease of access and distribution also sowed the seeds for future challenges of information oversaturation.

The Rise of Microbloggers and Social Media Influencers

The personal phone, coupled with platforms like Instagram, TikTok, and YouTube, ushered in the era of the microblogger and social media influencer. Individuals leveraged these tools to create content, reach vast audiences, and, in many instances, achieve celebrity status. This period marked a significant power shift from traditional media houses to individual content creators, highlighting the power of personal branding and niche content.

The Challenge of Information Overload

As the internet has become saturated with content, the challenge has shifted from accessing information to filtering it. The analogy of "sipping from a fire hose" aptly describes the overwhelming nature of trying to stay informed in the current digital landscape. Users are bombarded with an unending stream of data, making it increasingly difficult to find relevant, trustworthy, and high-quality content.


The Emergence of Aggregators and AI Platforms

In response to this information overload, we are witnessing a return to curated sources, through platforms like Substack and AI tools like ChatGPT. Substack offers a curated experience, allowing users to subscribe to specific writers or topics, thereby filtering the noise and focusing on trusted sources. ChatGPT, on the other hand, represents a more advanced solution. As an AI platform, it not only aggregates information but also processes, summarizes, and personalizes it. ChatGPT's ability to assist in various tasks, from rewriting emails to providing research support and copywriting, positions it as a powerful personal assistant in the digital age.

OpenAI is the Best, But Retention is Still Uncertain

AI tools like ChatGPT, despite their groundbreaking capabilities, still face challenges in user retention and daily engagement compared to established apps such as YouTube, Instagram, TikTok, or WhatsApp. This aspect of user behavior is crucial to understand in an era where technology adoption is rapidly changing.

ChatGPT, with its innovative approach, initially attracted a vast number of users, fueled largely by the hype surrounding its capabilities. This phenomenon mirrored the initial buzz generated by Meta's Threads app. The surge in interest was primarily driven by curiosity – a desire to explore and understand what these new tools were all about. However, for many of these users, their engagement was fleeting. They were not necessarily looking to permanently incorporate these tools into their daily routines but were instead drawn in by the novelty and the widespread conversation surrounding these technologies.

This pattern of onboarding, characterized by a spike in initial usage followed by a decline, highlights a critical challenge for emerging AI applications. While they capture attention and interest at launch, converting that initial curiosity into long-term, sustained usage is a more complex endeavor. It underscores the importance of not only innovating and introducing new technologies but also ensuring that these tools can effectively integrate into and enhance the everyday lives of users.

Daily Active Users/Monthly Active Users, State of AI Report
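The DAU/MAU ratio referenced in the chart caption is a common "stickiness" measure: the share of a product's monthly users who show up on a typical day. A minimal sketch of the arithmetic follows. The user counts are invented for illustration and are not figures from the State of AI Report or from any of the apps named above.

```python
def stickiness(daily_active_users: float, monthly_active_users: float) -> float:
    """Return the DAU/MAU ratio, a rough gauge of how habit-forming a product is."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return daily_active_users / monthly_active_users


# Hypothetical products: (daily active users, monthly active users), in millions.
examples = {
    "messaging app (hypothetical)": (85, 100),
    "AI assistant (hypothetical)": (14, 100),
}

for name, (dau, mau) in examples.items():
    print(f"{name}: {stickiness(dau, mau):.0%} of monthly users are active daily")
```

A ratio near 100% means nearly everyone who uses the product in a month uses it every day, while a low ratio is consistent with the curiosity-driven, occasional usage described above.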

Impact on Database owners and AI Developers

The integration of Artificial Intelligence (AI), especially with the development of frontier models, is revolutionizing media. This technological leap is not just transforming the tools of trade but is also reshaping market dynamics. Key players, from large entities like Axel Springer and The New York Times to smaller organizations, are finding themselves caught between innovation and copyright protection.

As we stand at the crossroads of a rapidly evolving AI landscape, two distinct futures are emerging, each with profound implications for market leaders and the broader industry. The advent of frontier AI models, characterized by their vast scale and complexity, is not merely a technological revolution; it's reshaping the very fabric of how businesses operate and compete.

In this dichotomy, two possible worlds emerge: one where market leaders successfully integrate AI to reinforce their positions, and another where AI platforms like OpenAI become central to everything, reshaping markets in their image. Monitoring this situation closely is not just advisable; it's imperative for any business looking to thrive in the coming decades.

Key Takeaways

Below we unravel the implications of the AI revolution on industry giants, the challenges faced by smaller entities, and the strategic adaptations necessary for survival and success in this new era. To fully grasp the magnitude of these changes and to gain insightful forecasts on how this revolution will shape future industry, join us behind the paywall and upgrade to a paid subscription.


Zen on Tech