Google has some very cool n-gram data sets available for download. Alas, the files are quite large. For example, the 3-grams are split into 200 zip files, each weighing in at 440 MB. That’s about 88 GB total.
The 1-grams are much lighter, totaling 2 GB. I was able to reduce this to about 35 MB by throwing away the time information (the original files indicate the year each data point came from). That’s smaller than the original files by a factor of about 57.
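For anyone curious what "throwing away the time information" could look like in practice, here is a minimal sketch in Python. It assumes each line of a file is tab-separated with the n-gram first, the year second, and the match count third (any further columns are ignored), and it assumes the per-year counts are summed into one total per n-gram. The function name `collapse_years` and the sample rows are made up for illustration; this is not necessarily the exact procedure used to get the 35 MB figure.

```python
from collections import Counter


def collapse_years(lines):
    """Sum match counts per n-gram, discarding the year column.

    Assumed line layout (tab-separated): ngram, year, match_count, ...
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, _year, match_count = fields[:3]
        totals[ngram] += int(match_count)
    return totals


# Made-up sample rows in the assumed layout.
sample = [
    "circumvallate\t1978\t313\t215\t85\n",
    "circumvallate\t1979\t183\t147\t77\n",
    "zebra\t1978\t10\t9\t4\n",
]

totals = collapse_years(sample)
for ngram, count in sorted(totals.items()):
    print(f"{ngram}\t{count}")
```

Since an open file is itself an iterable of lines, passing one to `collapse_years` streams through it without loading the whole thing at once; only the per-n-gram totals are held in memory.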
The 2-grams, 3-grams, and so on could be compressed the same way by anyone who can get the files. I’d be happy to help if anyone wants to do this.