According to reputable sources, this blog contains 0.00006% of the world’s knowledge.
- The large language models (LLMs) that underlie tools like ChatGPT and Bing-AI are being used as question-answering tools. If you listen to the hype surrounding what LLMs can do, you can hardly be faulted for thinking that they have every fact known to humankind and can answer any question.
- One of the most popular large language models, GPT-3, was trained with several large text datasets.
- One dataset, C4 (a filtered version of the Common Crawl), is 60% of the text used in training.
- According to this article in the Washington Post, dltj.org is 0.0001% of the tokens in the C4 dataset.
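The headline figure comes from multiplying the last two bullets together: dltj.org's share of C4 times C4's share of the GPT-3 training text. Here is a minimal sketch of that arithmetic in Python. The function and constant names are made up for illustration, and the calculation assumes the C4 figures from the Washington Post apply unchanged to the Common Crawl portion of GPT-3's training data.

```python
# Figures as quoted in this post, converted from percentages to fractions.
C4_SHARE_OF_GPT3 = 0.60          # C4 as a fraction of GPT-3's training text
DLTJ_SHARE_OF_C4 = 0.0001 / 100  # dltj.org as a fraction of C4 tokens


def share_of_training_set(site_share_of_c4, c4_share=C4_SHARE_OF_GPT3):
    """Scale a site's share of C4 down to its share of the whole training set.

    Assumption (not confirmed by OpenAI): a site's share of C4 carries over
    directly to the Common Crawl portion of the GPT-3 training text.
    """
    return site_share_of_c4 * c4_share


dltj_share = share_of_training_set(DLTJ_SHARE_OF_C4)
print(f"dltj.org is about {dltj_share * 100:.5f}% of the training text")
```

Swap in any other site's C4 percentage (divided by 100) to get its estimated share of the full training set.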
How much is 0.00006% of the GPT-3 training set?
It is a quarter of an inch (half a centimeter) above sea level on a climb up Mount Everest. (Source: Wolfram Alpha)
It is almost 8 feet (2.5 meters) of a journey from Washington, DC, to San Francisco, California. (Source: Wolfram Alpha)
In contrast, the content from the New York Times is 0.036% of the training dataset, or 9/10ths of a mile (1.4km) on that journey.
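These comparisons are just a percentage applied to a length. The sketch below reproduces them in Python under two stated assumptions: Mount Everest is taken as roughly 8,849 meters tall, and the Washington, DC, to San Francisco trip is taken as a straight-line distance of roughly 3,930 km. The post's own figures came from Wolfram Alpha, which may use slightly different values, so treat the results as approximate.

```python
# Reference lengths in meters. Both are assumptions for illustration,
# not the exact values Wolfram Alpha used for the post.
EVEREST_HEIGHT_M = 8_849   # approximate height of Mount Everest
DC_TO_SF_M = 3_930_000     # approximate straight-line DC to San Francisco

METERS_PER_FOOT = 0.3048
METERS_PER_MILE = 1609.344


def scaled_length(share_percent, total_m):
    """Return the part of total_m that corresponds to share_percent percent."""
    return total_m * share_percent / 100


# Shares of the GPT-3 training set quoted in the post, in percent.
dltj_on_everest_m = scaled_length(0.00006, EVEREST_HEIGHT_M)
dltj_on_trip_m = scaled_length(0.00006, DC_TO_SF_M)
nyt_on_trip_m = scaled_length(0.036, DC_TO_SF_M)

print(f"dltj.org on Everest: {dltj_on_everest_m * 100:.2f} cm")
print(f"dltj.org on the trip: {dltj_on_trip_m / METERS_PER_FOOT:.1f} ft")
print(f"New York Times on the trip: {nyt_on_trip_m / METERS_PER_MILE:.2f} mi")
```

With these inputs the results land close to the figures above: about half a centimeter, a little under 8 feet, and a little under a mile.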
(A note about assumptions: OpenAI hasn’t published the contents of the training data for GPT-3.5, which is used in ChatGPT, or for GPT-4. So this post uses the data from GPT-3 as listed in Wikipedia.)
You can use the search tool near the bottom of the Washington Post article to see where your favorite website ranks.
But also read the article to explore what is in the C4 version of the Common Crawl.
As much as OpenAI is trying to put guardrails on the output, the model itself is trained on some pretty offensive stuff.