December 20, 2011

Google Books 1-grams

Peter de Blanc

Google has some very cool n-gram data sets available for download. Alas, the files are quite large. For example, the 3-grams are split into 200 zip files, each weighing in at 440 MB. That’s about 88 GB total.

The 1-grams are much lighter, totaling 2 GB. I was able to reduce this to about 35 MB by throwing away the time information (the original files indicate the year each data point came from). That’s smaller than the original files by a factor of 57.
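Here is a minimal sketch of how that kind of reduction might be done. It is not necessarily the script used for the files below. It assumes each line of the raw data is tab-separated, with the 1-gram in the first column, the year in the second, and the match count in the third; any further columns are ignored. The function name collapse_years and the sample rows are made up for illustration.

```python
from collections import Counter


def collapse_years(lines):
    """Sum the match counts for each 1-gram, discarding the year column.

    Assumes tab-separated lines: ngram, year, match_count, then optional
    extra columns that are ignored.
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return totals


# Made-up sample rows in the assumed layout, not real data.
sample = [
    "apple\t1990\t10\t8\t5\n",
    "apple\t1991\t12\t9\t6\n",
    "pear\t1990\t3\t3\t2\n",
]

totals = collapse_years(sample)
for ngram, count in sorted(totals.items()):
    print(f"{ngram}\t{count}")
```

Since the function accepts any iterable of lines, an open file handle could be passed in directly, so a large file would not need to be read into memory all at once. Only the per-word totals are kept.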

1-grams in lexicographical order | 1-grams by count | 1-grams by length

The 2-grams, 3-grams, and so on could be compressed the same way by anyone able to download the files. I’d be happy to help if anyone wants to try.

