Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups, 2005-2010 (40 GB)
Westbury Lab Wikipedia Corpus: snapshot of all articles in the English section of Wikipedia, created in April 2010. It was processed to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).
WorldTree Corpus of Explanation Graphs for Elementary Science Questions: a corpus of manually-constructed explanation graphs, explanatory role ratings, and an associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)
Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)
Wikipedia XML data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML (500 GB). A parsing sketch follows this list.
Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)
Yahoo! Answers consisting of questions asked in French: subset of the Yahoo! Answers corpus from 2006 to 2015, consisting of 1.7 million questions posed in French and their matching answers. (3.8 GB)
Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from the 10/25/2007 dump, selected for their linguistic properties.
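The Wikipedia XML data entry above ships as wikitext source and metadata embedded in MediaWiki export XML. Below is a minimal sketch of stream-parsing such a dump with Python's standard library; it assumes the usual export layout (each <page> element contains a <title> and a revision <text> holding the wikitext), and the file name is a placeholder, not part of the list above.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    # Tags arrive namespaced, e.g. "{http://www.mediawiki.org/xml/export-0.10/}page";
    # keep only the part after the closing brace.
    return tag.rsplit("}", 1)[-1]

def iter_pages(path):
    """Yield (title, wikitext) pairs without loading the whole dump into memory."""
    title, text = "", ""
    for _event, elem in ET.iterparse(path, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # release the parsed subtree to keep memory use flat

if __name__ == "__main__":
    # Placeholder file name; point this at whichever dump you downloaded.
    for title, wikitext in iter_pages("enwiki-latest-pages-articles.xml"):
        print(title, len(wikitext))
        break  # print only the first page as a smoke test
```

Incremental parsing via iterparse matters at the 500 GB scale quoted above: each <page> subtree is discarded as soon as it has been yielded, so memory stays roughly constant regardless of dump size.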