IMDB sentiment analysis with QRNNs

Written 15 Mar 2017 by Sergei Turukin

Some time ago MetaMind published a paper on quasi-recurrent neural networks (QRNNs). See their blog post for some nice explanations. I became interested in QRNNs because the Deep Voice paper used them as the “conditioning” part of its synthesis model.

Toy problem definition

In the paper, the authors used several problems to demonstrate the novel network: sentiment analysis, language modeling and character-level translation. This post describes how to implement a QRNN and run the sentiment analysis experiment on the IMDB dataset.

In his post, James Bradbury mentioned that he used chainer for his experiments, and he published a chainer implementation of the QRNN layer. That’s very convenient and useful. Thanks to him! We will be using his implementation (as well as a modified chainer) in our experiments.

You can find the whole QRNN model implementation in my repo in case you’re interested.

IMDB dataset

Stanford offers the IMDb dataset on their website. It’s a well-balanced dataset: it consists of 25k positive and 25k negative samples. Every sample is a short review with a rating on a 10-point scale. Samples are split into pos/ and neg/ folders. As a bonus, the dataset contains bag-of-words representations and a vocabulary file.

To replicate the paper’s experimental setup we will also need GloVe embeddings.

word vectors initialized using 300-dimensional cased GloVe embeddings

Here you can download the cased 300-dimensional embeddings trained on Common Crawl.

Data preparation and preprocessing

Data preparation includes several steps. We need to convert the dataset samples from ASCII text into numbers, that is, create a mapping word -> number. To do that we have to create a vocabulary first. Then we can turn each text into a vector of integers using the created vocabulary as a reference.

Create vocabulary
Convert text samples into vectors of integers
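The two steps above can be sketched in a few lines. This is only a minimal illustration, not the code from my scripts: the function names, the whitespace tokenization, and padding short samples with index 0 are assumptions made for the example.

```python
def build_vocabulary(words):
    """Map every distinct word to a positive integer; 0 stays reserved."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    return vocab


def encode(text, vocab, maxlen):
    """Turn text into a fixed-length list of integers.

    Unknown words map to 0. Assumption for this sketch: short samples are
    padded with 0 as well, and long samples are truncated to maxlen.
    """
    tokens = text.split()  # naive whitespace tokenization
    ids = [vocab.get(token, 0) for token in tokens][:maxlen]
    return ids + [0] * (maxlen - len(ids))


vocab = build_vocabulary(['the', 'movie', 'was', 'great'])
encoded = encode('the movie was truly great', vocab, maxlen=7)
print(encoded)
```

Here "truly" is not in the vocabulary, so it is encoded with the reserved index.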

I used the glove_to_npy.py script to perform the data conversion. It takes the GloVe embeddings (as a text file) and output paths for both the embeddings and the vocabulary. We will use the vocabulary for dataset creation and the embeddings for weight initialization.

For example, use it like this:

python3 glove_to_npy.py -d data/glove.840B.300d.txt --npy_output data/embeddings.npy --dict_output data/vocab.pckl --dict_whitelist data/aclImdb/imdb.vocab

The original Common Crawl GloVe embeddings have a vocabulary of 2.2M words, which is quite large. Holding an embedding matrix of that size requires ~2.5GB of memory (2.2M × 300 float32 values). For this and one other reason I decided to reduce its size: only words from the IMDB dataset (listed in imdb.vocab) go into our vocabulary. That reduces the vocabulary to 68K words and the memory requirement to ~80MB. Good enough.

The other reason is that chainer (version 1.21) had problems allocating such a huge matrix and gave me errors once it tried to access the allocated data. Neither PyTorch nor TensorFlow had this problem (at the time of writing).
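As a quick sanity check on the memory figures, here is a back-of-the-envelope calculation. It assumes a dense float32 matrix (4 bytes per value) and uses the rounded 2.2M vocabulary size; the helper name is made up for this example.

```python
def embedding_bytes(vocab_size, dimensions=300, bytes_per_value=4):
    """Size in bytes of a dense float32 embedding matrix."""
    return vocab_size * dimensions * bytes_per_value


full = embedding_bytes(2_200_000)   # full Common Crawl vocabulary (rounded)
reduced = embedding_bytes(68_379)   # vocabulary restricted to IMDB words

print('full:    %.2f GiB' % (full / 2 ** 30))
print('reduced: %.2f MiB' % (reduced / 2 ** 20))
```

The reduced size comes out at roughly 78 MiB, which is in line with the 79M embeddings.npy file in the directory listing further down.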

Here is a small code snippet that traverses the GloVe embeddings file and produces the npy embeddings file and the vocabulary:

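import re

import numpy as np

# matches the first " <float>" in a line, i.e. where the vector starts
float_re = re.compile(r' [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?')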

with open(args.dataset) as ofile, \
     open(args.dict_output, 'wb') as dfile, \
     open(args.npy_output, 'wb') as nfile:

    # explanation why we initialize it to 1 will follow
    idx = 1
    for line in ofile:
        pos = next(re.finditer(float_re, line)).start()
        word, vector = line[:pos], line[pos+1:].split()

        # ignore words that we're not interested in
        if word not in whitelist:
            continue

        embedding = np.fromiter([float(d) for d in vector], np.float32)
        embeddings.append(embedding)
        data[word] = idx

        idx += 1
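To see why the regular expression is used instead of a plain split on the first space, here is the same parsing logic applied to a single made-up line. The helper name parse_glove_line and the sample line are hypothetical; the point is that a token containing a space still ends up whole in word.

```python
import re

import numpy as np

float_re = re.compile(r' [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?')


def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector)."""
    # the vector starts at the first " <float>" occurrence
    pos = float_re.search(line).start()
    word, values = line[:pos], line[pos + 1:].split()
    return word, np.array([float(v) for v in values], dtype=np.float32)


word, vector = parse_glove_line('hello world 0.1 -0.2 3e-2\n')
print(repr(word), vector)
```

Note that this heuristic would still cut a token short if the token itself contained a space followed by a number.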

A small trick is to reserve an index for the <UNK> token, used when we encounter words that aren’t in our vocabulary. chainer has built-in functionality for that, but here I do so explicitly.

# during vocabulary building, reserve 0 for unknown words
data = {
    '': 0
}
embeddings = [
    np.zeros((300), dtype=np.float32)
]

# initialize index with 1 as 0 is reserved
idx = 1

Hopefully, after running the preprocessing script you’ll have all the files we need:

$ ll data
total 5.4G
drwxr-xr-x 4 xxx xxx 4.0K Jun 26  2011 aclImdb
-rw-rw-r-- 1 xxx xxx  79M Mar 14 12:55 embeddings.npy
-rw-rw-r-- 1 xxx xxx 5.3G Oct 24  2015 glove.840B.300d.txt
-rw-rw-r-- 1 xxx xxx 1.4M Mar 14 12:55 vocab.pckl

For model training we actually want to use these embeddings. The following function loads them from the *.npy file:

import os

import numpy as np


def load_embeddings(path, size, dimensions):
    # preallocate the full matrix up front (premature memory optimization :)
    ret = np.zeros((size, dimensions), dtype=np.float32)

    # As the embedding matrix could be quite big, the preprocessing script
    # 'streams' it into the file chunk by chunk. One chunk shape could be
    # [size // 10, dimensions]. So to load the whole matrix we keep reading
    # chunks until the file is exhausted.
    file_size = os.stat(path).st_size
    with open(path, 'rb') as ifile:
        pos = 0
        idx = 0
        while pos < file_size:
            chunk = np.load(ifile)
            chunk_size = chunk.shape[0]
            ret[idx:idx+chunk_size, :] = chunk
            idx += chunk_size
            pos = ifile.tell()
    return ret
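For completeness, here is a sketch of what the writing side of this chunked format could look like, together with a compact reader for a round-trip check. The save_embeddings helper and the choice of ten chunks are assumptions for illustration and may differ from what glove_to_npy.py does; the only thing relied on is that consecutive np.save calls on one open file can be read back with consecutive np.load calls.

```python
import os
import tempfile

import numpy as np


def save_embeddings(path, matrix, n_chunks=10):
    """Write the matrix as several consecutive .npy records in one file."""
    with open(path, 'wb') as ofile:
        for chunk in np.array_split(matrix, n_chunks):
            if chunk.shape[0] > 0:  # skip empty chunks for tiny matrices
                np.save(ofile, chunk)


def read_chunks(path):
    """Read records back one by one until the file is exhausted."""
    file_size = os.stat(path).st_size
    chunks = []
    with open(path, 'rb') as ifile:
        while ifile.tell() < file_size:
            chunks.append(np.load(ifile))
    return np.concatenate(chunks, axis=0)


matrix = np.random.rand(25, 4).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'embeddings.npy')
    save_embeddings(path, matrix)
    restored = read_chunks(path)

print(restored.shape)
```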

Model architecture

This time there was no need to reinvent the wheel, as James Bradbury published his implementation of the QRNN layer for chainer. Nice! We will use it!

For this implementation to work, some changes have to be made to the chainer codebase. One way is to clone the jekbradbury chainer repo, check out the raw-kernel branch and build from source (consult the chainer docs for more information in case of problems):

git clone https://github.com/jekbradbury/chainer
cd chainer
git checkout raw-kernel
python setup.py install

That didn’t work for my setup (python3.6, newer versions of packages; I couldn’t get chainer to compile). The other way is to apply the following patch (extracted from the jekbradbury tree and modified) to the most recent chainer source code:

git clone https://github.com/pfnet/chainer.git
cd chainer
git apply patch.diff
python setup.py install

The contents of patch.diff are:

diff --git a/chainer/cuda.py b/chainer/cuda.py
index 5daaea0b..96ea6cfa 100644
--- a/chainer/cuda.py
+++ b/chainer/cuda.py
@@ -537,6 +537,24 @@ def reduce(in_params, out_params, map_expr, reduce_expr, post_map_expr,
         identity, name, **kwargs)


+@memoize(for_each_device=True)
+def raw(operation, name, **kwargs):
+    """Creates a global raw kernel function.
+
+    This function uses :func:`~chainer.cuda.memoize` to cache the resulting
+    kernel object, i.e. the resulting kernel object is cached for each argument
+    combination and CUDA device.
+
+    The arguments are the same as those for
+    :class:`cupy.RawKernel`, except that the ``name`` argument is
+    mandatory.
+
+    """
+    check_cuda_available()
+    return cupy.RawKernel(
+        operation, name, **kwargs)
+
+
 # ------------------------------------------------------------------------------
 # numpy/cupy compatible coding
 # ------------------------------------------------------------------------------
diff --git a/cupy/__init__.py b/cupy/__init__.py
index dcbd288a..fb4774e0 100644
--- a/cupy/__init__.py
+++ b/cupy/__init__.py
@@ -393,5 +393,6 @@ from cupy.util import clear_memo  # NOQA
 from cupy.util import memoize  # NOQA

 from cupy.core import ElementwiseKernel  # NOQA
+from cupy.core import RawKernel  # NOQA
 from cupy.core import ReductionKernel  # NOQA

diff --git a/cupy/core/__init__.py b/cupy/core/__init__.py
index 21324a9a..0ce05faf 100644
--- a/cupy/core/__init__.py
+++ b/cupy/core/__init__.py
@@ -38,6 +38,7 @@ from cupy.core.core import ndarray  # NOQA
 from cupy.core.core import negative  # NOQA
 from cupy.core.core import not_equal  # NOQA
 from cupy.core.core import power  # NOQA
+from cupy.core.core import RawKernel  # NOQA
 from cupy.core.core import ReductionKernel  # NOQA
 from cupy.core.core import remainder  # NOQA
 from cupy.core.core import right_shift  # NOQA
diff --git a/cupy/core/core.pyx b/cupy/core/core.pyx
index e1903d3f..06473cfa 100644
--- a/cupy/core/core.pyx
+++ b/cupy/core/core.pyx
@@ -1399,6 +1399,7 @@ cpdef vector.vector[Py_ssize_t] _get_strides_for_nocopy_reshape(
 include "carray.pxi"
 include "elementwise.pxi"
 include "reduction.pxi"
+include "raw.pxi"


 # =============================================================================
diff --git a/cupy/core/raw.pxi b/cupy/core/raw.pxi
new file mode 100644
index 00000000..453458ac
--- /dev/null
+++ b/cupy/core/raw.pxi
@@ -0,0 +1,74 @@
+import string
+
+import numpy
+import six
+
+from cupy import util
+
+from cupy.cuda cimport device
+from cupy.cuda cimport function
+
+
+@util.memoize(for_each_device=True)
+def _get_raw_kernel(
+        module_code, name,
+        options=()):
+    module = compile_with_cache(module_code, options)
+    return module.get_function(name)
+
+
+cdef class RawKernel:
+
+    """User-defined raw CUDA kernel.
+
+    This class can be used to define a raw CUDA kernel by writing the entire
+    function declaration and body as CUDA-C code.
+
+    The kernel is compiled at an invocation of the
+    :meth:`~RawKernel.__call__` method, which is cached for each device.
+    The compiled binary is also cached into a file under the
+    ``$HOME/.cupy/kernel_cache/`` directory with a hashed file name. The cached
+    binary is reused by other processes.
+
+    Args:
+        operation (str): Raw CUDA-C/C++ code with one or more kernels.
+        name (str): Name of the kernel function to call. It should be set for
+            readability of the performance profiling. It must be identical
+            to the name of a function defined in the CUDA-C code.
+        options (tuple): Options passed to the ``nvcc`` command.
+
+    """
+
+    cdef:
+        readonly str operation
+        readonly str name
+        readonly tuple options
+
+    def __init__(self, operation,
+                 name='kernel', options=()):
+        self.operation = operation
+        self.name = name
+        self.options = options
+
+    def __call__(self, grid=(1,), block=(1,), *args, stream=None):
+        """Compiles and invokes the raw kernel.
+
+        The compilation runs only if the kernel is not cached.
+
+        Args:
+            grid (tuple): Grid sizes (number of blocks in x,y,z dimensions).
+            block (tuple): Block sizes (number of threads/block in x,y,z dims).
+            args: Arguments of the kernel.
+            stream: CUDA stream or None.
+
+        Returns:
+            None
+
+        """
+
+        cdef function.Function kern
+
+        kern = _get_raw_kernel(
+            self.operation,
+            self.name, options=self.options)
+        kern(grid, block, *args, stream=stream)

I also had to make minor changes to the published code to make it work: the Linear layer wasn’t designed to produce output with a different number of filters (channels/units/neurons, i.e. a non-square weight matrix).

For the IMDb sentiment analysis experiment I used an architecture inspired by the description in the paper:

Our best performance on a held-out development set was achieved using a four-layer densely-connected QRNN with 256 units per layer and word vectors initialized using 300-dimensional cased GloVe embeddings (Pennington et al., 2014). Dropout of 0.3 was applied between layers, and we used \(L2\) regularization of \(4 \times 10^{-6}\). Optimization was performed on minibatches of 24 examples using RMSprop (Tieleman & Hinton, 2012) with learning rate of 0.001, \(\alpha = 0.9\), and \(\epsilon = 10^{-8}\).
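To make the optimizer settings in the quote concrete, here is a minimal NumPy sketch of one RMSprop step with those hyperparameters. It is an illustration rather than chainer’s code; in particular, applying the L2 term as weight decay added to the gradient is an assumption of this sketch.

```python
import numpy as np


def rmsprop_step(param, grad, ms, lr=0.001, alpha=0.9, eps=1e-8, l2=4e-6):
    """One RMSprop update; ms is the running mean of squared gradients."""
    grad = grad + l2 * param                    # L2 regularization term
    ms = alpha * ms + (1 - alpha) * grad ** 2   # update running average
    param = param - lr * grad / (np.sqrt(ms) + eps)
    return param, ms


param = np.array([1.0])
grad = np.array([0.5])
ms = np.zeros_like(param)
new_param, new_ms = rmsprop_step(param, grad, ms)
print(new_param, new_ms)
```

On the very first step the update size is close to lr / sqrt(1 - alpha) regardless of the gradient magnitude, because the running average starts at zero.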

Here is the chainer model definition. The differences from the paper (the known ones, at least) are:

I didn’t implement dense connections (connections between each pair of layers; for details look here)
The kernel width of every layer is always 2 (in the paper they propose using a different width for the first layer)
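For reference, dense connectivity as I understand it from the paper means each layer’s input is concatenated to its output along the channel dimension before it is fed to the next layer, and only the last layer’s output is used as the result. Below is a NumPy sketch of just that wiring. The toy layer is a hypothetical stand-in (a per-timestep projection with tanh), not a QRNN layer, and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_toy_layer(in_size, out_size):
    """Hypothetical stand-in for a QRNN layer: projection followed by tanh."""
    w = (rng.standard_normal((in_size, out_size)) * 0.01).astype(np.float32)
    return lambda x: np.tanh(x @ w)


embed_size, hidden_size, n_layers = 300, 256, 4

# With dense connections, layer i sees the embeddings plus the outputs of
# all previous layers, so its input size grows by hidden_size every layer.
layers = [make_toy_layer(embed_size + i * hidden_size, hidden_size)
          for i in range(n_layers)]


def dense_forward(x):
    h = x
    for layer in layers[:-1]:
        # concatenate the layer's input with its output along channels
        h = np.concatenate([h, layer(h)], axis=-1)
    # only the last layer's output is used as the result
    return layers[-1](h)


x = rng.standard_normal((2, 5, embed_size)).astype(np.float32)
out = dense_forward(x)
print(out.shape)
```

In the chainer model this would translate into larger input sizes for layer2 to layer4 and a concatenation between the layer calls.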

class QRNNModel(chainer.Chain):
    def __init__(self, vocab_size, out_size, hidden_size, dropout):
        super().__init__(
            layer1=QRNNLayer(out_size, hidden_size),
            layer2=QRNNLayer(hidden_size, hidden_size),
            layer3=QRNNLayer(hidden_size, hidden_size),
            layer4=QRNNLayer(hidden_size, hidden_size),
            fc=L.Linear(None, 2)
        )
        self.embed = L.EmbedID(vocab_size, out_size)
        self.dropout = dropout
        self.train = True

    def __call__(self, x):
        h = self.embed(x)
        h = F.dropout(self.layer1(h), self.dropout, self.train)
        h = F.dropout(self.layer2(h), self.dropout, self.train)
        h = F.dropout(self.layer3(h), self.dropout, self.train)
        h = F.dropout(self.layer4(h), self.dropout, self.train)
        return self.fc(h)

Training results

The training routine is quite simple: load the dataset, initialize the embeddings, set up the model and optimizer, and start training. Here is an excerpt from my train.py:

def main():
    train, test = IMDBDataset(os.path.join(args.dataset, 'train'), args.vocabulary, args.maxlen),\
                  IMDBDataset(os.path.join(args.dataset, 'test'), args.vocabulary, args.maxlen)

    model = L.Classifier(QRNNModel(
        args.vocab_size, args.out_size, args.hidden_size, args.dropout))

    if args.embeddings:
        model.predictor.embed.W.data = util.load_embeddings(
            args.embeddings, args.vocab_size, args.out_size)

    optimizer = chainer.optimizers.RMSprop(lr=0.001, alpha=0.9)
    optimizer.setup(model)
    optimizer.add_hook(chainer.optimizer.WeightDecay(5e-4))

    train_iter = chainer.iterators.SerialIterator(train, args.batchsize)
    test_iter = chainer.iterators.SerialIterator(test, args.batchsize,
                                                 repeat=False, shuffle=False)

    updater = training.StandardUpdater(train_iter, optimizer, device=args.gpu)
    trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

    trainer.run()

The following command starts training with default parameters.

$ python3 train.py -g0 -o results/ --dataset data/aclImdb/ --vocabulary data/vocab.pckl --embeddings data/embeddings.npy

Default parameters are:

Batch size: 24
Vocabulary size: 68379
Embeddings dimension size: 300
QRNN layer number of units (hidden dimension size): 256
Maximum document length (number of words in a single document): 400
Dropout ratio: 0.3
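These defaults could be wired up with argparse roughly as follows. This is a sketch, not the parser from train.py: the attribute names follow the training excerpt above, but the long flag spellings (other than --dataset, --vocabulary and --embeddings), the help texts and the defaults for the GPU and paths are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description='Train QRNN on IMDB')
    # flags used in the command above (long forms of -g/-o are assumed)
    parser.add_argument('-g', '--gpu', type=int, default=-1,
                        help='GPU id, negative value means CPU (assumed)')
    parser.add_argument('-o', '--out', default='results/')
    parser.add_argument('--dataset', default='data/aclImdb/')
    parser.add_argument('--vocabulary', default='data/vocab.pckl')
    parser.add_argument('--embeddings', default=None)
    # defaults listed above (flag spellings are assumptions)
    parser.add_argument('--batchsize', type=int, default=24)
    parser.add_argument('--vocab_size', type=int, default=68379)
    parser.add_argument('--out_size', type=int, default=300)
    parser.add_argument('--hidden_size', type=int, default=256)
    parser.add_argument('--maxlen', type=int, default=400)
    parser.add_argument('--dropout', type=float, default=0.3)
    parser.add_argument('--epoch', type=int, default=20)
    return parser


args = build_parser().parse_args(['-g0', '--embeddings', 'data/embeddings.npy'])
print(args)
```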

It trains the model for 20 epochs; one epoch takes 30 seconds on my NVIDIA Titan X Pascal GPU, and the model achieves ~87% accuracy on the validation set. One could go further, improving the model and tuning hyperparameters, to get closer to the numbers reported in the paper.

Regularization

I’ve experimented with the model and found that using pretrained GloVe embeddings acts as a regularizer: it doesn’t let the model fit the embeddings exactly to the task (or data). Using pretrained embeddings helps the model achieve higher accuracy.

