This documentation describes the handy Python modules that I use frequently in the course of my research in Natural Language Processing.
There are several submodules for handling NLP text-processing operations.
|bagofwords||Classes and methods that are useful when handling bag-of-words data structures.|
|bigvocab||Classes and methods for handling vocabularies of corpora with millions of tokens. (unreliable, untested!)|
|bleu||Methods for computing the BLEU precision/recall from two given lists of tokens.|
|corpus||Classes and methods that are useful when handling corpora.|
|tfidf||Classes and methods for computing TF-IDF values.|
|tokenize||Methods to split a piece of text into individual sentences and individual words. (deprecated)|
|tokenizer||Classes and methods for tokenizing text. This is a more robust replacement for the previous ycutils.nlp.tokenize module.|
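To illustrate the kind of computation the tfidf module performs, here is a minimal, self-contained sketch of the standard TF-IDF weighting in plain Python. The function name and the exact formula (raw term count times log inverse document frequency) are illustrative assumptions, not the actual ycutils API:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the raw term count within a document; IDF is log(N / df(t)),
    where df(t) is the number of documents containing term t.
    This is a generic sketch, not the ycutils.nlp.tfidf implementation.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
w = tf_idf(docs)
# "the" occurs in every document, so its IDF (and hence its weight) is 0
```

Terms that appear in every document receive zero weight, while rarer terms are weighted up, which is the core idea behind TF-IDF scoring.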
These modules deal with reading and writing files in various formats, as well as handling files in general.
|io||Miscellaneous IO methods.|
|tsvio||ycutils.io.tsvio.TSVFile class handles reading/writing to tab separated values files.|
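The exact interface of ycutils.io.tsvio.TSVFile is not documented here, but tab-separated values can be read and written portably with the standard library's csv module by setting the delimiter. This sketch uses an in-memory buffer so it is fully self-contained:

```python
import csv
import io

# Round-trip tab-separated rows using the standard csv module.
# (This is a generic illustration, not the TSVFile class itself.)
rows = [["token", "count"], ["cat", "3"], ["dog", "1"]]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

buf.seek(0)
reader = csv.reader(buf, delimiter="\t")
parsed = list(reader)  # recovers the original rows
```

Using csv with `delimiter="\t"` rather than naive `str.split("\t")` correctly handles quoted fields that may themselves contain tabs.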
This module contains classes and methods that are useful when trying to download data from the Web.
TODO: Standardize download mechanisms for all these below and more robust JSON handling.
|webpages||Download webpages by calling Wget with subprocess. (TODO: a cURL implementation and more robust support of proxies, Tor, etc.)|
|wikipedia||Methods for downloading Wikipedia articles.|
|googlebooks||Downloading descriptions of books from Google books.|
|youtube||Methods for downloading YouTube video descriptions.|
|printable||Methods for identifying “printable” links and downloading printable versions of webpages.|
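As a sketch of the "call Wget with subprocess" approach used by the webpages module, the snippet below builds a Wget command line and runs it. The helper names and the specific flags chosen (`--quiet`, `--timeout`, `--output-document`) are my own conservative assumptions, not the options the module actually uses:

```python
import subprocess

def build_wget_command(url, output_path, timeout=30):
    """Construct a Wget invocation that saves a page to output_path.

    Hypothetical helper: the real ycutils.web.webpages flags may differ.
    """
    return [
        "wget",
        "--quiet",
        "--timeout", str(timeout),
        "--output-document", output_path,
        url,
    ]

def download(url, output_path):
    """Run Wget; returns True when Wget exits with status 0."""
    return subprocess.call(build_wget_command(url, output_path)) == 0

cmd = build_wget_command("http://example.com/", "page.html")
```

Delegating to Wget keeps retry, redirect, and proxy handling in a battle-tested external tool at the cost of a subprocess dependency, which is presumably why a cURL alternative is on the TODO list above.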
Currently a single module file containing miscellaneous convenience string methods.
Modules and methods that do not have a category of their own.
|misc||Miscellaneous functions that don’t really fall under any category.|
Here are some useful scripts for performing common NLP preprocessing tasks. These scripts depend heavily on the above modules and classes. They can be found in the scripts/ directory of the package.
|plot-likelihood.py||This script displays a graphical plot of likelihood against iterations.|
|tokenize-docs.py||This script tokenizes large collections of text using ycutils.tokenize.|
|build-vocab.py||This script builds a vocabulary of the corpus using ycutils.corpus methods.|
|prune-vocab.py||This script prunes the vocabulary file according to criterion specified on the command line.|
|vocab-stats.py||This script displays statistics about the given ycutils.corpus.CorpusVocabulary file.|
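The build-and-prune workflow of build-vocab.py and prune-vocab.py can be sketched in a few lines: count token frequencies over a tokenized corpus, then drop tokens below a frequency threshold. The function names and the min-count criterion are illustrative assumptions; the actual scripts use ycutils.corpus.CorpusVocabulary and support other pruning criteria via the command line:

```python
from collections import Counter

def build_vocab(corpus):
    """Count token frequencies across a tokenized corpus (sketch)."""
    vocab = Counter()
    for doc in corpus:
        vocab.update(doc)
    return vocab

def prune_vocab(vocab, min_count=2):
    """Drop tokens occurring fewer than min_count times (sketch)."""
    return {t: c for t, c in vocab.items() if c >= min_count}

corpus = [["the", "cat", "sat"], ["the", "dog"], ["the", "cat"]]
vocab = build_vocab(corpus)   # "the" -> 3, "cat" -> 2, "sat" -> 1, "dog" -> 1
pruned = prune_vocab(vocab)   # keeps only "the" and "cat"
```

Pruning rare tokens is a standard way to shrink the vocabulary of a large corpus before training, trading a little coverage for a much smaller model.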
yc-pyutils is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
yc-pyutils is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with yc-pyutils. If not, see http://www.gnu.org/licenses/.