Documentation for yc-pyutils 1.0dev

This documentation describes the handy Python modules that I use frequently in the course of my research in Natural Language Processing (NLP).

NLP modules

There are several submodules for handling NLP text-processing operations.

Modules     Description
bagofwords  Classes and methods that are useful when handling bag-of-words data structures.
bigvocab    Classes and methods for handling vocabularies of corpora with millions of tokens. (unreliable, untested!)
bleu        Methods for computing the BLEU precision/recall from two given lists of tokens.
corpus      Classes and methods that are useful when handling corpora.
tfidf       Classes and methods for computing TF-IDF values.
tokenize    Methods to split a piece of text into individual sentences and into individual words.
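
The API of the tfidf module is not shown on this page, so the following is only a minimal, standard-library sketch of what computing TF-IDF values over tokenized documents can look like. The function name `tfidf`, the raw-count term frequency, and the `log(N / df)` inverse document frequency are assumptions made for illustration; the module itself may use a different weighting.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper, not part of ycutils. Uses raw term counts for
    TF and log(N / df) for IDF; other TF-IDF variants exist.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
for row in tfidf(docs):
    print(row)
```

A term that occurs in every document (here "the") gets weight zero under this variant, which is the usual reason for applying IDF in the first place.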

IO modules

These modules deal with reading and writing of files, in various formats, and/or handling files in general.

Modules  Description
io       Miscellaneous IO methods.
tsvio    The ycutils.io.tsvio.TSVFile class handles reading from and writing to tab-separated values (TSV) files.
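
The interface of ycutils.io.tsvio.TSVFile is not documented on this page, so the sketch below does not use it. It only illustrates, with Python's standard csv module, the kind of tab-separated reading and writing that such a class wraps. The helper names `write_tsv` and `read_tsv` are hypothetical.

```python
import csv
import io


def write_tsv(handle, header, rows):
    """Write a header line and data rows as tab-separated values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_tsv(handle):
    """Read tab-separated values into a list of dicts keyed by the header."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Round-trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_tsv(buffer, ["word", "count"], [["cat", 3], ["dog", 5]])
buffer.seek(0)
records = read_tsv(buffer)
print(records)
```

Note that values come back as strings when read; any numeric conversion is left to the caller in this sketch.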

URL modules

These modules contain classes and methods that are useful when downloading data from the Web.

TODO: Standardize the download mechanisms of all the modules below and make JSON handling more robust.

Modules      Description
webpages     Download webpages by calling Wget with subprocess. (TODO: a cURL implementation and more robust support of proxies, Tor, etc.)
wikipedia    Methods for downloading Wikipedia articles.
googlebooks  Methods for downloading descriptions of books from Google Books.
youtube      Methods for downloading YouTube video descriptions.
printable    Methods for identifying “printable” links and downloading printable versions of webpages.
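
As a rough illustration of the approach the webpages module describes (calling Wget through subprocess), here is a minimal sketch. The function names `build_wget_command` and `download`, the default timeout and retry values, and the choice of flags are assumptions for this example and are not taken from the module itself.

```python
import shutil
import subprocess


def build_wget_command(url, output_path, timeout=30, tries=3):
    """Build the argument list for a quiet Wget download (hypothetical helper)."""
    return [
        "wget",
        "-q",                       # suppress progress output
        "--timeout=%d" % timeout,   # network timeout in seconds
        "--tries=%d" % tries,       # number of attempts before giving up
        "-O", output_path,          # write the page to this file
        url,
    ]


def download(url, output_path):
    """Fetch url into output_path; raises CalledProcessError if Wget fails."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    subprocess.run(build_wget_command(url, output_path), check=True)


# Only show the command here; calling download() needs network access.
print(build_wget_command("http://example.com/", "example.html"))
```

Passing the command as a list rather than a shell string avoids quoting problems with URLs that contain characters such as `&` or `?`.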

String module

Currently a single module file containing miscellaneous convenience string methods.

strings

Miscellaneous modules

Modules and methods that do not have a category of their own.

Modules  Description
misc     Miscellaneous functions that don’t really fall under any category.

Useful scripts

Here are some useful scripts for performing common NLP preprocessing tasks. These scripts depend heavily on the above modules and classes. They can be found in the scripts/ directory of the package.

Scripts             Description
plot-likelihood.py  This script displays a graphical plot of likelihood against iterations.
tokenize-docs.py    This script tokenizes large collections of text using ycutils.tokenize.
build-vocab.py      This script builds a vocabulary of the corpus using ycutils.corpus methods.
prune-vocab.py      This script prunes the vocabulary file according to criteria specified on the command line.
vocab-stats.py      This script displays statistics about the given ycutils.corpus.CorpusVocabulary file.
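
The command-line options of build-vocab.py and prune-vocab.py are not listed on this page. The sketch below only illustrates the general idea of building a vocabulary from tokenized documents and then pruning it. The function names and the two pruning criteria (a minimum count and a maximum vocabulary size) are assumptions chosen for illustration, not the scripts' actual options.

```python
from collections import Counter


def build_vocab(docs):
    """Count how often each token occurs across all tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def prune_vocab(counts, min_count=2, max_size=None):
    """Keep at most max_size of the most frequent words, dropping rare ones.

    Hypothetical criteria: max_size=None means no size limit.
    """
    top = counts.most_common(max_size)
    return {word: count for word, count in top if count >= min_count}


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
vocab = build_vocab(docs)
print(prune_vocab(vocab, min_count=2))
```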

License

yc-pyutils is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

yc-pyutils is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with yc-pyutils. If not, see http://www.gnu.org/licenses/.