This documentation describes the Python modules that I use frequently in my Natural Language Processing research.
Below is a list of Python modules in this package.
| Modules | Description |
|---|---|
| bagofwords | This module contains classes and methods that are useful when handling bag-of-words data structures. |
| bigvocab | This module contains classes and methods for handling the vocabulary of corpora with millions of tokens. |
| bleu | This module contains methods for computing the BLEU precision/recall from two given lists of tokens. |
| corpus | This module contains classes and methods that are useful when handling corpora. |
| tfidf | This module contains classes and methods for computing TF-IDF values. |
| tokenize | This module contains functions to split a piece of text into individual sentences and individual words. |
| tsvio | This module contains the TSVFile class, which handles reading from and writing to tab-separated values (TSV) files. |
| urls | This module contains classes and methods that are useful when trying to download data from the web. |
| misc | This module contains miscellaneous functions that don’t really fall under any category. |
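
To make the bag-of-words and TF-IDF descriptions above concrete, here is a minimal standalone sketch using only the standard library. It does not use the package's own API: the function name `tf_idf` and the weighting formula (relative term frequency times the natural log of inverse document frequency) are assumptions chosen for illustration, and the `bagofwords` and `tfidf` modules may compute things differently.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: weight} dict per tokenized document.

    Hypothetical helper for illustration; not part of yc-pyutils.
    """
    n_docs = len(docs)
    # Bag of words: token -> count within each document.
    bags = [Counter(doc) for doc in docs]
    # Document frequency: number of documents containing each token.
    df = Counter(token for bag in bags for token in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            token: (count / total) * math.log(n_docs / df[token])
            for token, count in bag.items()
        })
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
weights = tf_idf(docs)
```

With this formula, a token that appears in every document (here `"the"`) gets a weight of zero, while tokens unique to one document get a positive weight.
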

Here are some useful scripts for performing common NLP preprocessing tasks. These scripts depend heavily on the above modules and classes. They can be found in the scripts/ directory of the package.
| Scripts | Description |
|---|---|
| prune-vocab.py | This script prunes the vocabulary file according to criteria specified on the command line. |
| tokenize-docs.py | This script tokenizes large collections of text using ycutils.tokenize. |
| vocab-stats.py | This script displays statistics about the given ycutils.corpus.CorpusVocabulary file. |
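
As an illustration of what vocabulary pruning typically involves, the sketch below filters a token-count mapping by a minimum count and an optional maximum size. The function `prune_vocab` and its `min_count` and `max_size` parameters are hypothetical; the actual criteria and command-line options accepted by prune-vocab.py are not documented here and may differ.

```python
from collections import Counter


def prune_vocab(counts, min_count=1, max_size=None):
    """Keep tokens seen at least min_count times, then the max_size most frequent.

    Hypothetical helper for illustration; not the logic of prune-vocab.py.
    """
    kept = Counter({tok: c for tok, c in counts.items() if c >= min_count})
    # most_common(None) returns every entry, sorted by descending count.
    return dict(kept.most_common(max_size))


counts = Counter("a a a b b c".split())
pruned = prune_vocab(counts, min_count=2)
```

Here `pruned` keeps only the tokens that occur at least twice in the toy input.
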
yc-pyutils is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
yc-pyutils is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with yc-pyutils. If not, see http://www.gnu.org/licenses/.