Thursday, April 20, 2023

0.0004%

The Washington Post has an interesting search engine that searches Google's C4 dataset to show how much of the dataset is occupied by a given website. C4 is often used to train Large Language Models (although not, as far as we know, ChatGPT, which seems to use a much, much larger dataset that was obtained independently). Blogspot blogs that have been around for a while tend to do fairly well, since they are easily accessible large collections of webpages; Siris is ranked 21,453 in C4 (out of about 15 million websites), occupying 0.0004% of the whole dataset. (For comparison, Wikipedia occupies 0.19% of the dataset and the largest single source, patents.google.com, occupies 0.46%.) The ranking is determined by counting 'tokens', which are roughly the words and word-like segments of the archived websites that LLMs analyze statistically, and there are about 670,000 Siris-originated tokens in C4.
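
As a rough sanity check, the two Siris figures above (about 670,000 tokens, 0.0004% of the dataset) imply a total size for C4. The sketch below works only from those two rounded numbers; the function name is mine, for illustration, and the result is a back-of-the-envelope estimate rather than an official count.

```python
# Figures quoted in the post; both are rounded, so the result is approximate.
siris_tokens = 670_000
siris_share_percent = 0.0004


def implied_total_tokens(tokens, share_percent):
    """Total dataset size implied by a token count and its percentage share."""
    return tokens / (share_percent / 100)


# Point estimate from the quoted figures.
estimate = implied_total_tokens(siris_tokens, siris_share_percent)

# Assumption: 0.0004% is rounded to one significant figure, so the true
# share could lie anywhere between roughly 0.00035% and 0.00045%.
low = implied_total_tokens(siris_tokens, 0.00045)
high = implied_total_tokens(siris_tokens, 0.00035)

print(f"Implied size of C4: about {estimate / 1e9:.1f} billion tokens")
print(f"Plausible range: {low / 1e9:.0f} to {high / 1e9:.0f} billion tokens")
```

So the quoted figures put the whole dataset somewhere in the range of a hundred and fifty to two hundred billion tokens, which gives a sense of how small a 0.0004% share is.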