In this section, we will learn about the concept of a corpus (plural: corpora) and how to use corpora from NLTK. Here we cover only the basics; in the following sections, we will go deeper into the more valuable things that corpora provide, such as part-of-speech tags, dialogue tags or syntactic trees…
What is a Corpus/Corpora?
Simply put, a corpus is a collection of digitized text and language data; corpora is the plural of corpus. A corpus is usually preprocessed data used as input to NLP algorithms. For example, the Book collection in NLTK is a supplied corpus. Corpora have many features, but this article will not go into their details. To understand more about corpora, you can refer to these resources: Overview of Corpus; Deeper understanding of Corpus.
Corpora Available in NLTK
In the first lessons, we used one NLTK corpus, Book. In this article, we will explore other corpora that NLTK provides to diversify the data for our language processing.
Gutenberg Corpus
Project Gutenberg provides 25,000 free e-books; its website is https://www.gutenberg.org/. NLTK includes a small selection of the books from this project in its data packages. To list the Project Gutenberg books that NLTK provides, we ask the nltk.corpus.gutenberg reader for its file identifiers.
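A minimal sketch of listing the Gutenberg books could look like the following. It assumes NLTK is installed and the data has been fetched once with nltk.download("gutenberg"). The helper name file_sizes is my own, not part of NLTK, and the word count per file is an extra I added so the listing shows how large each book is.

```python
def file_sizes(corpus):
    """Return (fileid, number of word tokens) for every file in a corpus reader.

    `corpus` is any object with NLTK-style fileids() and words(fileid) methods.
    This helper is illustrative and not part of NLTK.
    """
    return [(fileid, len(corpus.words(fileid))) for fileid in corpus.fileids()]


if __name__ == "__main__":
    try:
        # Assumes NLTK is installed and nltk.download("gutenberg") was run once.
        from nltk.corpus import gutenberg

        for fileid, n_words in file_sizes(gutenberg):
            print(fileid, n_words)
    except (ImportError, LookupError) as err:
        print("NLTK or the Gutenberg data is not available:", err)
```

Each file identifier, such as a name ending in .txt, can then be passed to gutenberg.words() or gutenberg.raw() to read that book.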
Web and Chat Text Corpus
Although Gutenberg offers many books, it only provides us with formal texts and classics. Now we want less formal documents, such as conversations on the Web or on social networks. NLTK also provides us with these documents. Try viewing the Web Text documents: get each file name and the first 65 characters of its content.
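One way to preview the Web Text files is sketched below. It assumes NLTK is installed and nltk.download("webtext") has been run. The helper preview_files is a name I made up for this example; the calls it relies on, fileids() and raw(), are the corpus reader methods NLTK exposes.

```python
def preview_files(corpus, width=65):
    """Return (fileid, first `width` characters of raw text) for each file.

    `corpus` is any object with NLTK-style fileids() and raw(fileid) methods.
    This helper is illustrative and not part of NLTK.
    """
    return [(fileid, corpus.raw(fileid)[:width]) for fileid in corpus.fileids()]


if __name__ == "__main__":
    try:
        # Assumes NLTK is installed and nltk.download("webtext") was run once.
        from nltk.corpus import webtext

        for fileid, snippet in preview_files(webtext):
            print(fileid, snippet, "...")
    except (ImportError, LookupError) as err:
        print("NLTK or the Web Text data is not available:", err)
```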
Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English, dating back to 1961. It contains text from 500 sources, and the sources are grouped by category, such as news, editorials… You can see the full list here (http://icame.uib.no/brown/bcm-los.html). Now let’s explore the Brown Corpus a little. First, we load the corpus and look at its categories.
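Loading the Brown Corpus and inspecting its categories might be done as in this sketch. It assumes NLTK is installed and nltk.download("brown") has been run. The helper files_per_category is a hypothetical name of mine; it relies on the categories() and fileids(categories=...) methods of NLTK's categorized corpus readers.

```python
def files_per_category(corpus):
    """Map each category of a categorized corpus reader to its number of files.

    `corpus` needs NLTK-style categories() and fileids(categories=...) methods.
    This helper is illustrative and not part of NLTK.
    """
    return {
        category: len(corpus.fileids(categories=category))
        for category in corpus.categories()
    }


if __name__ == "__main__":
    try:
        # Assumes NLTK is installed and nltk.download("brown") was run once.
        from nltk.corpus import brown

        for category, n_files in files_per_category(brown).items():
            print(category, n_files)
    except (ImportError, LookupError) as err:
        print("NLTK or the Brown data is not available:", err)
```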
The Brown Corpus is a great resource for studying systematic differences between genres of text, a kind of linguistic inquiry called “stylistics”. Let’s compute a small statistic about the use of modal verbs in the text genres that the Brown Corpus provides. First, we need a list of modal verbs: modals = ['can', 'could', 'may', 'might', 'must', 'will'].
To count them, I use NLTK's ConditionalFreqDist class. We will study it later; for now we use it only to see how often each modal verb occurs in each type of text.
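The modal verb count could be sketched as follows, assuming NLTK is installed and nltk.download("brown") has been run. The generator genre_word_pairs is my own helper name; ConditionalFreqDist and its tabulate() method are NLTK's. Words are compared as they appear in the text, so a capitalized "Can" at the start of a sentence is not counted here.

```python
MODALS = ["can", "could", "may", "might", "must", "will"]


def genre_word_pairs(corpus, genres):
    """Yield (genre, word) pairs, the input shape ConditionalFreqDist expects.

    `corpus` needs an NLTK-style words(categories=...) method.
    This helper is illustrative and not part of NLTK.
    """
    for genre in genres:
        for word in corpus.words(categories=genre):
            yield genre, word


if __name__ == "__main__":
    try:
        # Assumes NLTK is installed and nltk.download("brown") was run once.
        import nltk
        from nltk.corpus import brown

        genres = brown.categories()
        # One frequency distribution of words per genre (the "condition").
        cfd = nltk.ConditionalFreqDist(genre_word_pairs(brown, genres))
        # Print a genre-by-modal table of counts.
        cfd.tabulate(conditions=genres, samples=MODALS)
    except (ImportError, LookupError) as err:
        print("NLTK or the Brown data is not available:", err)
```

Each row of the printed table is a genre and each column a modal verb, which makes differences between genres easy to compare.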
Reuters Corpus
The Reuters Corpus contains more than 10,000 news documents totaling 1.3 million words. The documents have been classified into 90 topics and grouped into two sets, “training” and “test”. The split exists for use in Machine Learning, which I will write about in the future. File identifiers reflect the set a document belongs to, for example “training/1234” or “test/1234”.
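As a rough sketch, the training/test split can be recovered from the file identifiers alone. This assumes NLTK is installed and nltk.download("reuters") has been run. The helper split_train_test is a name I chose for the example; fileids() and categories() are the reader methods NLTK provides.

```python
def split_train_test(fileids):
    """Group file identifiers by the prefix before the first slash.

    For Reuters identifiers such as "training/1234" or "test/1234" the keys
    are "training" and "test". This helper is illustrative, not part of NLTK.
    """
    groups = {"training": [], "test": []}
    for fileid in fileids:
        prefix = fileid.split("/", 1)[0]
        groups.setdefault(prefix, []).append(fileid)
    return groups


if __name__ == "__main__":
    try:
        # Assumes NLTK is installed and nltk.download("reuters") was run once.
        from nltk.corpus import reuters

        groups = split_train_test(reuters.fileids())
        print(len(groups["training"]), "training documents")
        print(len(groups["test"]), "test documents")
        print(len(reuters.categories()), "topics")
        # A document can belong to several topics.
        first = groups["test"][0]
        print(first, reuters.categories(first))
    except (ImportError, LookupError) as err:
        print("NLTK or the Reuters data is not available:", err)
```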
Annotated Text Corpus
NLTK provides many annotated corpora, which carry linguistic annotations such as POS tags, named entities, syntactic structures… If you already have basic knowledge of NLP, you can access these resources at any time.
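To give a taste of what an annotated corpus looks like, this sketch counts the most frequent POS tags in a sample of the tagged Brown Corpus. It assumes NLTK is installed and nltk.download("brown") has been run. The helper most_common_tags, its sample size and its cutoff are my own choices for illustration; tagged_words() is the NLTK reader method that yields (word, tag) pairs.

```python
from collections import Counter
from itertools import islice


def most_common_tags(tagged_words, sample_size=10000, top=5):
    """Count tags over the first `sample_size` (word, tag) pairs.

    Returns the `top` most frequent tags with their counts.
    This helper is illustrative and not part of NLTK.
    """
    tags = Counter(tag for _, tag in islice(tagged_words, sample_size))
    return tags.most_common(top)


if __name__ == "__main__":
    try:
        # Assumes NLTK is installed and nltk.download("brown") was run once.
        from nltk.corpus import brown

        tagged = brown.tagged_words(categories="news")
        for tag, count in most_common_tags(tagged):
            print(tag, count)
    except (ImportError, LookupError) as err:
        print("NLTK or the Brown data is not available:", err)
```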