Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)

Westbury Lab Wikipedia Corpus: Snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).

WorldTree Corpus of Explanation Graphs for Elementary Science Questions: a corpus of manually constructed explanation graphs, explanatory role ratings, and an associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)

Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)

Wikipedia XML Data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500 GB)

Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)

Yahoo! Answers consisting of questions asked in French: Subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. (3.8 GB)

Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from the 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)

Yahoo! HTML Forms Extracted from Publicly Available Webpages: a small sample of pages that contain complex HTML forms, with 2.67 million complex forms in total. (50+ GB)

Yahoo N-Gram Representations: This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for the word and phrase similarity task, which is common in NLP research. (2.6 GB)

Yahoo! N-Grams, version 2.0: n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12000 news-oriented sites (12 GB)

Yahoo! Search Logs with Relevance Judgments: Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 GB)

Yahoo! Semantically Annotated Snapshot of the English Wikipedia: English Wikipedia dated 2006-11-04, processed with a number of publicly available NLP tools. 1,490,688 entries. (6 GB)

Yelp: including restaurant rankings and 2.2M reviews (on request)

YouTube: information on 1.7 million YouTube videos (torrent)

Awesome public datasets/NLP (includes more lists)
AWS Public Datasets
CrowdFlower: Data for Everyone (lots of small surveys they conducted and data obtained by crowdsourcing for a specific task)
Kaggle 1, 2 (make sure, though, that the Kaggle competition data can be used outside of the competition!)
Open Library
Quora (mainly annotated corpora)
/r/datasets (endless list of datasets, though most are scraped by amateurs and not properly documented or licensed)
Rs.io (another big list)
Stackexchange: Opendata
Stanford NLP group (mainly annotated corpora and TreeBanks or actual NLP tools)
Yahoo! Webscope (also includes papers that use the data that is provided)

SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)

Collection of Urdu Datasets for POS, NER and NLP tasks.

German Political Speeches Corpus: collection of recent speeches held by top German representatives (25 MB, 11 MTokens)

NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free to all universities and non-profit organizations. You need to sign and send a form to obtain it. (on request)

Ten Thousand German News Articles Dataset: 10273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)

100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)
