Category: Training Data

  • The Death of the Internet and the Genesis of Language 1.0

    Around the turn of the century, the Internet was becoming the sum-total repository of all human knowledge. All thoughts, all books, all diaries, all photos, all videos… basically, a repository of every significant (and insignificant) piece of media that humans had ever produced, from the beginning of history to the present…

  • WebText2, Webtext, OpenWebText : Deep Inside the AI Datasets

    Webtext was OpenAI’s attempt to give AI higher-quality input than the mess of Common Crawl. WebText2 upped the ante. We dig into the contents.

    The Rationale for Webtext & Webtext2: Quality

    Prior to GPT-2 (which was really the breakthrough chat AI), Deep Learning LLMs were generally fed on diets composed wholly of Wikipedia, public-domain…

  • AI Training: the Terrifying Difference a Single Word Makes

    It fascinates me to no end that the current main line of thinking in AI research can be summed up in three simple words: “Just Scale It.” Let’s talk a little about AI training, and more specifically, the black magic of AI initialization prompts. I’ve summed up the (fairly detailed) steps of how a modern…

  • the Deep Learning Revolution: Why Today’s AI so Radically Transcends the Last 50 Years

    The purpose of this post is to explain the fundamentals of the present Deep Learning Revolution, and at the same time to debunk two very common myths that I hear over and over again from normal, intelligent people.

    Debunking Common AI Myths

    Those being: AI is just one more innovation in a long string of…

  • AI Training Datasets: the Books1+Books2 that Big AI eats for breakfast

    It’s good to know, when dealing with an AI, just how it obtained its vast knowledge of “the world” (or more particularly, the world as described on the internet… which, while similar to our physical reality, isn’t quite the same thing…). And as you might imagine, not all AIs are created equal. One of the…

  • The Future of AI Training Data 2023: The Untapped Digital Content Well

    Scientists say “AI Training Data”… we say: “It ate the Internet.” All the Text, All the Songs, All the Streams, All the Feeds… ALL THE EVERYTHING

    The particular entity I’m conversing with these days (c. November 2022) is a 5-year-old bot, very un-creatively named “GPT3,” which was instantiated c. MMXVII [AD 2017]. In general, the…