This website contains 0.00006% of the world’s knowledge

April 30, 2023

According to reputable sources, this blog contains 0.00006% of the world’s knowledge.

The large language models (LLMs) that underlie tools like ChatGPT and Bing-AI are being used as question-answering tools. If you listen to the hype surrounding what LLMs can do, you can hardly be faulted for thinking that they contain every fact known to humankind and can answer any question.

One of the most popular large language models, GPT-3, was trained with several large text datasets.
One dataset, C4 (a filtered version of the Common Crawl), makes up 60% of the text used in training.
According to this article in the Washington Post, dltj.org is 0.0001% of the tokens in the C4 dataset.
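The 0.00006% figure is the product of those two shares: 60% of 0.0001%. A minimal sketch of the arithmetic, where the constant and function names are made up for illustration and the percentages are the ones quoted above:

```python
# Shares quoted in the post; both are rounded published figures.
C4_SHARE_OF_GPT3 = 0.60          # C4 is 60% of the GPT-3 training text
DLTJ_SHARE_OF_C4 = 0.0001 / 100  # dltj.org is 0.0001% of the C4 tokens


def share_of_training_set(share_of_c4, c4_share=C4_SHARE_OF_GPT3):
    """Return a site's fraction of the whole training set (hypothetical helper)."""
    return share_of_c4 * c4_share


dltj_share = share_of_training_set(DLTJ_SHARE_OF_C4)
print(f"{dltj_share * 100:.5f}%")  # prints 0.00006%
```

This treats a share of tokens as a share of "knowledge", which is the same tongue-in-cheek simplification the title makes.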

Screen capture of ‘dltj.org’ search results in Washington Post article.

How much is 0.00006% of the GPT-3 training set?
It is a quarter of an inch (half a centimeter) off sea level on a climb up Mount Everest. (Source: Wolfram Alpha)
It is almost 8 feet (2.5 meters) of a journey from Washington, DC, to San Francisco, California. (Source: Wolfram Alpha)
In contrast, the content from the New York Times is 0.036% of the training dataset, or 9/10ths of a mile (1.4km) on that journey.
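These comparisons can be roughly reproduced by scaling a total distance by each share. The sketch below uses the 0.00006% share from the top of this post and the 0.036% share for the New York Times. The two distances are my own assumptions (Everest's summit at about 8,849 m and a straight-line Washington-to-San Francisco distance of about 3,930 km), not values taken from the Wolfram Alpha results, so expect small rounding differences.

```python
# Assumed reference distances (approximate; not taken from the original sources).
EVEREST_HEIGHT_M = 8_849   # summit elevation above sea level, in meters
DC_TO_SF_KM = 3_930        # rough straight-line distance, in kilometers


def scaled(total, share_percent):
    """Return the portion of `total` represented by a share given in percent."""
    return total * share_percent / 100


everest_cm = scaled(EVEREST_HEIGHT_M, 0.00006) * 100   # meters -> centimeters
trip_m = scaled(DC_TO_SF_KM, 0.00006) * 1000           # kilometers -> meters
nyt_km = scaled(DC_TO_SF_KM, 0.036)                    # New York Times share

print(f"Everest: {everest_cm:.2f} cm")
print(f"DC to SF: {trip_m:.2f} m")
print(f"New York Times: {nyt_km:.2f} km")
```

With these assumed distances the results come out at about half a centimeter, a bit under 2.5 meters, and about 1.4 km, in line with the figures above.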

(A note about assumptions: OpenAI hasn’t published the contents of the training data for GPT-3.5, which is used in ChatGPT, or for GPT-4. So this post uses the data from GPT-3 as listed in Wikipedia.)

You can use the search tool near the bottom of the Washington Post article to see where your favorite website ranks.
But also read the article to explore what is in the C4 version of the Common Crawl.
As much as OpenAI is trying to put guardrails on the output, the model itself is trained on some pretty offensive stuff.
