Bricks

class blocks.bricks.Bias(*args, **kwargs)

Bases: blocks.bricks.Feedforward, blocks.bricks.Initializable

Add a bias (i.e. sum with a vector).

apply

Add the bias to the input.

Parameters:input (TensorVariable) – The input to which the bias is added
Returns:output – The input plus the bias
Return type:TensorVariable
get_dim(name)
input_dim
output_dim
class blocks.bricks.Feedforward(name=None)

Bases: blocks.bricks.base.Brick

Declares an interface for bricks with one input and one output.

Many bricks have just one input and just one output (activations, Linear, MLP). To make such bricks interchangeable in most contexts, they should share an interface for configuring their input and output dimensions. This brick declares such an interface.

input_dim

int

The input dimension of the brick.

output_dim

int

The output dimension of the brick.

class blocks.bricks.FeedforwardSequence(application_methods, **kwargs)

Bases: blocks.bricks.Sequence, blocks.bricks.Feedforward

A sequence where the first and last bricks are feedforward.

Parameters:application_methods (list) – List of BoundApplication to apply. The first and last application method should belong to a Feedforward brick.
input_dim
output_dim
class blocks.bricks.Identity(name=None)

Bases: blocks.bricks.Activation

Elementwise application of identity function.

apply

Apply the identity function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply identity to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Initializable(*args, **kwargs)

Bases: blocks.bricks.base.Brick

Base class for bricks which push parameter initialization.

Many bricks will initialize children which perform a linear transformation, often with biases. This brick allows the weights and biases initialization to be configured in the parent brick and pushed down the hierarchy.

Parameters:
  • weights_init (object) – A NdarrayInitialization instance which will be used to initialize the weight matrix. Required by initialize().
  • biases_init (object, optional) – A NdarrayInitialization instance that will be used to initialize the biases. Required by initialize() when use_bias is True. Only supported by bricks for which has_biases is True.
  • use_bias (bool, optional) – Whether to use a bias. Defaults to True. Required by initialize(). Only supported by bricks for which has_biases is True.
  • rng (numpy.random.RandomState) –
has_biases

bool

False if the brick does not support biases, and only has weights_init. For an example of this, see Bidirectional. If this is False, the brick does not support the arguments biases_init or use_bias.

has_biases = True
rng
seed
seed_rng = <mtrand.RandomState object>
class blocks.bricks.Linear(*args, **kwargs)

Bases: blocks.bricks.Initializable, blocks.bricks.Feedforward

A linear transformation with optional bias.

Brick which applies a linear (affine) transformation by multiplying the input with a weight matrix. By default, a bias term is added (see Initializable for information on disabling this).

Parameters:
  • input_dim (int) – The dimension of the input. Required by allocate().
  • output_dim (int) – The dimension of the output. Required by allocate().

Notes

See Initializable for initialization parameters.

A linear transformation with bias is a matrix multiplication followed by a vector summation.

\[f(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}\]
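As a rough illustration of the formula above, here is a NumPy sketch; it is not Blocks code. It assumes a batch of row vectors, so the product is written `x @ W` with `W` of shape (input_dim, output_dim). The function name and the example values are made up for illustration.

```python
import numpy as np

def linear(x, W, b=None):
    """Affine map on a batch of row vectors: x @ W, plus b if given."""
    out = x @ W
    if b is not None:
        out = out + b  # the bias broadcasts over the batch axis
    return out

x = np.ones((1, 2))        # one example, input_dim = 2
W = np.full((2, 3), 2.0)   # constant weights, output_dim = 3
b = np.ones(3)             # constant bias
y = linear(x, W, b)        # every entry is 2 + 2 + 1 = 5
```

Passing `b=None` corresponds to the case where the bias is disabled.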
W
apply

Apply the linear transformation.

Parameters:input (TensorVariable) – The input on which to apply the transformation
Returns:output – The transformed input plus optional bias
Return type:TensorVariable
b
get_dim(name)
class blocks.bricks.LinearMaxout(*args, **kwargs)

Bases: blocks.bricks.Initializable, blocks.bricks.Feedforward

Maxout pooling following a linear transformation.

This code combines the Linear brick with a Maxout brick.

Parameters:
  • input_dim (int) – The dimension of the input. Required by allocate().
  • output_dim (int) – The dimension of the output. Required by allocate().
  • num_pieces (int) – The number of linear functions. Required by allocate().

Notes

See Initializable for initialization parameters.

apply

Apply the linear transformation followed by maxout.

Parameters:input (TensorVariable) – The input on which to apply the transformations
Returns:output – The transformed input
Return type:TensorVariable
input_dim
class blocks.bricks.Logistic(name=None)

Bases: blocks.bricks.Activation

Elementwise application of logistic function.

apply

Apply the logistic function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply logistic to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.MLP(*args, **kwargs)

Bases: blocks.bricks.Sequence, blocks.bricks.Initializable, blocks.bricks.Feedforward

A simple multi-layer perceptron.

Parameters:
  • activations (list of Brick, BoundApplication, or None) – A list of activations to apply after each linear transformation. Give None to not apply any activation. It is assumed that the application method to use is apply. Required for __init__().
  • dims (list of ints) – A list of input dimensions, as well as the output dimension of the last layer. Required for allocate().

Notes

See Initializable for initialization parameters.

Note that the weights_init, biases_init and use_bias configurations will overwrite those of the layers each time the MLP is re-initialized. For more fine-grained control, push the configuration to the child layers manually before initialization.

>>> from blocks.bricks import MLP, Tanh
>>> from blocks.initialization import IsotropicGaussian, Constant
>>> mlp = MLP(activations=[Tanh(), None], dims=[30, 20, 10],
...           weights_init=IsotropicGaussian(),
...           biases_init=Constant(1))
>>> mlp.push_initialization_config()  # Configure children
>>> mlp.children[0].weights_init = IsotropicGaussian(0.1)
>>> mlp.initialize()
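To make the meaning of dims concrete, the following NumPy sketch (not Blocks code) pairs consecutive entries of dims into layer shapes, so that dims=[30, 20, 10] yields a 30→20 layer followed by a 20→10 layer. The helper name `mlp_forward`, the random weights and the constant biases are assumptions chosen for the example.

```python
import numpy as np

def mlp_forward(x, dims, activations, rng):
    """Chain len(dims) - 1 affine layers, each followed by its activation."""
    assert len(activations) == len(dims) - 1
    layer_shapes = zip(dims[:-1], dims[1:])  # (30, 20), (20, 10), ...
    for (n_in, n_out), act in zip(layer_shapes, activations):
        W = rng.normal(scale=0.1, size=(n_in, n_out))  # illustrative weights
        b = np.ones(n_out)                             # illustrative biases
        x = x @ W + b
        if act is not None:  # None means: no activation after this layer
            x = act(x)
    return x

rng = np.random.RandomState(0)
out = mlp_forward(np.zeros((4, 30)), [30, 20, 10], [np.tanh, None], rng)
```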
input_dim
output_dim
class blocks.bricks.Maxout(*args, **kwargs)

Bases: blocks.bricks.base.Brick

Maxout pooling transformation.

A brick that does max pooling over groups of input units. If you use this code in a research project, please cite [GWFM13].

[GWFM13]Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, Maxout networks, ICML (2013), pp. 1319-1327.
Parameters:num_pieces (int) – The size of the groups the maximum is taken over.

Notes

Maxout applies a set of linear transformations to a vector and selects for each output dimension the result with the highest value.
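The pooling step can be sketched in NumPy as follows. This is an illustration rather than the brick's implementation: it assumes that the groups are formed from consecutive units along the last axis, and the function name `maxout` is hypothetical.

```python
import numpy as np

def maxout(x, num_pieces):
    """Max over consecutive groups of num_pieces units along the last axis."""
    last = x.shape[-1]
    if last % num_pieces != 0:
        raise ValueError("last dimension must be divisible by num_pieces")
    grouped = x.reshape(x.shape[:-1] + (last // num_pieces, num_pieces))
    return grouped.max(axis=-1)

x = np.array([[1.0, 5.0, 2.0, 0.0, -1.0, -3.0]])
pooled = maxout(x, num_pieces=2)  # groups: (1, 5), (2, 0), (-1, -3)
```

With six input units and num_pieces=2, the output has three units, one per group.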

apply

Apply the maxout transformation.

Parameters:input (TensorVariable) – The input on which to apply the transformation
Returns:output – The transformed input
Return type:TensorVariable
class blocks.bricks.NDimensionalSoftmax(name=None)

Bases: blocks.bricks.Softmax

A wrapped brick class.

This brick was automatically constructed by wrapping Softmax with WithExtraDims.

See also

BrickWrapper
For explanation of brick wrapping.

Softmax WithExtraDims

apply

Wraps the application method with reshapes.

Parameters:extra_ndim (int, optional) – The number of extra dimensions. Default is zero.

See also

Softmax.apply()
For documentation of the wrapped application method.
apply_delegate()
categorical_cross_entropy

Wraps the application method with reshapes.

Parameters:extra_ndim (int, optional) – The number of extra dimensions. Default is zero.

See also

Softmax.categorical_cross_entropy()
For documentation of the wrapped application method.
categorical_cross_entropy_delegate()
decorators = [<blocks.bricks.wrappers.WithExtraDims object>]
log_probabilities

Wraps the application method with reshapes.

Parameters:extra_ndim (int, optional) – The number of extra dimensions. Default is zero.

See also

Softmax.log_probabilities()
For documentation of the wrapped application method.
log_probabilities_delegate()
class blocks.bricks.Random(theano_seed=None, **kwargs)

Bases: blocks.bricks.base.Brick

A mixin class for Bricks which need Theano RNGs.

Parameters:theano_seed (int or list, optional) – Seed to use for a MRG_RandomStreams object.
seed_rng = <mtrand.RandomState object>
theano_rng

Returns Brick’s Theano RNG, or a default one.

The default seed can be set through blocks.config.

theano_seed
class blocks.bricks.Rectifier(name=None)

Bases: blocks.bricks.Activation

Elementwise application of rectifier function.

apply

Apply the rectifier function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply rectifier to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Sequence(application_methods, **kwargs)

Bases: blocks.bricks.base.Brick

A sequence of bricks.

This brick applies a sequence of bricks, assuming that their in- and outputs are compatible.

Parameters:application_methods (list) – List of BoundApplication to apply
apply
apply_inputs()
apply_outputs()
class blocks.bricks.Softmax(name=None)

Bases: blocks.bricks.base.Brick

A softmax brick.

Works with 2-dimensional inputs only. If you need more, see NDimensionalSoftmax.

apply

Standard softmax.

Parameters:input (Variable) – A matrix, each row contains unnormalized log-probabilities of a distribution.
Returns:output – A matrix with probabilities in each row for each distribution from input.
Return type:Variable
categorical_cross_entropy

Computationally stable cross-entropy for pre-softmax values.

Parameters:
  • y (TensorVariable) – In the case of a matrix argument, each row represents a probability distribution. In the vector case, each element represents a distribution by specifying the position of 1 in a 1-hot vector.
  • x (TensorVariable) – A matrix, each row contains unnormalized log-probabilities (pre-softmax values) of a distribution.
Returns:

cost – A vector of cross-entropies between respective distributions from y and x.

Return type:

TensorVariable

log_probabilities

Normalize log-probabilities.

Converts unnormalized log-probabilities (exponents of which do not sum to one) into actual log-probabilities (exponents of which sum to one).

Parameters:input (Variable) – A matrix, each row contains unnormalized log-probabilities of a distribution.
Returns:output – A matrix with normalized log-probabilities in each row for each distribution from input.
Return type:Variable
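The normalization and the cross-entropy described above can be sketched in NumPy. This is an illustration under stated assumptions, not the Theano code used by the brick: subtracting the row maximum is one standard way to keep the computation stable, and the function bodies are written only from the parameter descriptions.

```python
import numpy as np

def log_probabilities(x):
    """Row-wise log-softmax; the max shift avoids overflow in exp."""
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def categorical_cross_entropy(y, x):
    """Cross-entropy between targets y and pre-softmax values x."""
    log_p = log_probabilities(x)
    if y.ndim == 1:  # vector case: y holds the position of the 1
        return -log_p[np.arange(len(y)), y]
    return -(y * log_p).sum(axis=1)  # matrix case: rows are distributions

x = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
log_p = log_probabilities(x)
cost_idx = categorical_cross_entropy(np.array([2, 0]), x)
cost_onehot = categorical_cross_entropy(np.eye(3)[[2, 0]], x)
```

The second row of `x` would overflow a naive `exp`, yet both target formats give the same finite cost here.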
class blocks.bricks.Softplus(name=None)

Bases: blocks.bricks.Activation

Elementwise application of softplus function.

apply

Apply the softplus function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply softplus to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.Tanh(name=None)

Bases: blocks.bricks.Activation

Elementwise application of tanh function.

apply

Apply the tanh function element-wise.

Parameters:input (TensorVariable) – Theano variable to apply tanh to, element-wise.
Returns:output – The input with the activation function applied.
Return type:TensorVariable
class blocks.bricks.lookup.LookupTable(*args, **kwargs)

Bases: blocks.bricks.Initializable

Encapsulates representations of a range of integers.

Parameters:
  • length (int) – The size of the lookup table, or in other words, one plus the maximum index for which a representation is contained.
  • dim (int) – The dimensionality of representations.

Notes

See Initializable for initialization parameters.

W
apply

Perform lookup.

Parameters:indices (TensorVariable) – The indices of interest. The dtype must be integer.
Returns:output – Representations for the indices of the query. Has \(k+1\) dimensions, where \(k\) is the number of dimensions of the indices parameter. The last dimension stands for the representation element.
Return type:TensorVariable
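The shape rule for the output can be checked with plain NumPy indexing, used here as a stand-in for the lookup; the table contents and index values are made up for the example.

```python
import numpy as np

length, dim = 5, 3
# Row i of W is the representation of integer i.
W = np.arange(length * dim, dtype=float).reshape(length, dim)

indices = np.array([[0, 4], [2, 2]])  # k = 2 dimensions of integer indices
output = W[indices]                   # k + 1 = 3 dimensions
```

The last axis of `output` has size `dim` and holds the representation elements.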
has_bias = False

Convolutional bricks

class blocks.bricks.conv.Convolutional(*args, **kwargs)

Bases: blocks.bricks.Initializable

Performs a 2D convolution.

Parameters:
  • filter_size (tuple) – The height and width of the filter (also called kernels).
  • num_filters (int) – Number of filters per channel.
  • num_channels (int) – Number of input channels in the image. For the first layer this is normally 1 for grayscale images and 3 for color (RGB) images. For subsequent layers this is equal to the number of filters output by the previous convolutional layer. The filters are pooled over the channels.
  • batch_size (int, optional) – Number of examples per batch. If given, this will be passed to Theano convolution operator, possibly resulting in faster execution.
  • image_size (tuple, optional) – The height and width of the input (image or feature map). If given, this will be passed to the Theano convolution operator, resulting in possibly faster execution times.
  • step (tuple, optional) – The step (or stride) with which to slide the filters over the image. Defaults to (1, 1).
  • border_mode ({‘valid’, ‘full’}, optional) – The border mode to use, see scipy.signal.convolve2d() for details. Defaults to ‘valid’.
  • tied_biases (bool) – If True, it indicates that the biases of every filter in this layer should be shared amongst all applications of that filter. Setting this to False will untie the biases, yielding a separate bias for every location at which the filter is applied. Defaults to False.
apply

Perform the convolution.

Parameters:input (TensorVariable) – A 4D tensor with the axes representing batch size, number of channels, image height, and image width.
Returns:output – A 4D tensor of filtered images (feature maps) with dimensions representing batch size, number of filters, feature map height, and feature map width.

The height and width of the feature map depend on the border mode. For ‘valid’ it is image_size - filter_size + 1 while for ‘full’ it is image_size + filter_size - 1.

Return type:TensorVariable
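The two size formulas can be written as a small helper. This sketch assumes the default step of (1, 1), since the formulas above do not involve the step; the function name is hypothetical.

```python
def feature_map_size(image_size, filter_size, border_mode="valid"):
    """Feature map (height, width) for a step of (1, 1)."""
    if border_mode == "valid":
        return tuple(i - f + 1 for i, f in zip(image_size, filter_size))
    if border_mode == "full":
        return tuple(i + f - 1 for i, f in zip(image_size, filter_size))
    raise ValueError("border_mode must be 'valid' or 'full'")

valid_size = feature_map_size((28, 28), (5, 5), "valid")  # (24, 24)
full_size = feature_map_size((28, 28), (5, 5), "full")    # (32, 32)
```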
get_dim(name)
class blocks.bricks.conv.ConvolutionalActivation(*args, **kwargs)

Bases: blocks.bricks.Sequence, blocks.bricks.Initializable

A convolution followed by an activation function.

Parameters:activation (BoundApplication) – The application method to apply after convolution (i.e. the nonlinear activation function)

See also

Convolutional
For the documentation of other parameters.
get_dim(name)
class blocks.bricks.conv.ConvolutionalLayer(*args, **kwargs)

Bases: blocks.bricks.Sequence, blocks.bricks.Initializable

A complete convolutional layer: Convolution, nonlinearity, pooling.

Todo

Mean pooling.

Parameters:activation (BoundApplication) – The application method to apply in the detector stage (i.e. the nonlinearity before pooling). Needed for __init__.

See also

Convolutional
Documentation of convolution arguments.
MaxPooling
Documentation of pooling arguments.

Notes

Uses max pooling.

get_dim(name)
class blocks.bricks.conv.ConvolutionalSequence(*args, **kwargs)

Bases: blocks.bricks.Sequence, blocks.bricks.Initializable, blocks.bricks.Feedforward

A sequence of convolutional operations.

Parameters:
  • layers (list) – List of convolutional bricks (i.e. ConvolutionalActivation or ConvolutionalLayer)
  • num_channels (int) – Number of input channels in the image. For the first layer this is normally 1 for grayscale images and 3 for color (RGB) images. For subsequent layers this is equal to the number of filters output by the previous convolutional layer.
  • batch_size (int, optional) – Number of images in batch. If given, will be passed to theano’s convolution operator resulting in possibly faster execution.
  • image_size (tuple, optional) – Width and height of the input (image/featuremap). If given, will be passed to theano’s convolution operator resulting in possibly faster execution.

Notes

The passed convolutional operators should be constructed lazily, that is, without specifying batch_size, num_channels and image_size. The main feature of ConvolutionalSequence is that it sets the input dimensions of each layer to the output dimensions of the previous layer through the push_allocation_config() method.

get_dim(name)
class blocks.bricks.conv.Flattener(name=None)

Bases: blocks.bricks.base.Brick

Flattens the input.

It may be used to pass multidimensional objects like images or feature maps of convolutional bricks into bricks which allow only two dimensional input (batch, features) like MLP.

apply
class blocks.bricks.conv.MaxPooling(*args, **kwargs)

Bases: blocks.bricks.Initializable, blocks.bricks.Feedforward

Max pooling layer.

Parameters:
  • pooling_size (tuple) – The height and width of the pooling region i.e. this is the factor by which your input’s last two dimensions will be downscaled.
  • step (tuple, optional) – The vertical and horizontal shift (stride) between pooling regions. By default this is equal to pooling_size. Setting this to a lower number results in overlapping pooling regions.
  • input_dim (tuple, optional) – A tuple of integers representing the shape of the input. The last two dimensions will be used to calculate the output dimension.
apply

Apply the pooling (subsampling) transformation.

Parameters:input (TensorVariable) – A tensor with at least 2 dimensions. The last two dimensions will be downsampled. For example, with images this means that the last two dimensions should represent the height and width of your image.
Returns:output – A tensor with the same number of dimensions as input, but with the last two dimensions downsampled.
Return type:TensorVariable
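For the default case where step equals pooling_size, the downsampling can be sketched in NumPy. This is an illustration, not the Theano operator behind the brick: it only handles inputs whose last two dimensions divide evenly by the pooling size, and the function name is hypothetical.

```python
import numpy as np

def max_pool(x, pooling_size):
    """Non-overlapping max pooling over the last two axes of x."""
    ph, pw = pooling_size
    h, w = x.shape[-2:]
    if h % ph or w % pw:
        raise ValueError("this sketch needs evenly divisible height and width")
    lead = x.shape[:-2]
    # Split each spatial axis into (number of regions, region size).
    blocks = x.reshape(lead + (h // ph, ph, w // pw, pw))
    return blocks.max(axis=(-3, -1))  # reduce over the two region-size axes

x = np.arange(16.0).reshape(1, 1, 4, 4)
pooled = max_pool(x, (2, 2))
```

A 4 x 4 input pooled with (2, 2) regions gives a 2 x 2 output, and the leading (batch, channel) axes are untouched.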
get_dim(name)

Routing bricks

class blocks.bricks.parallel.Distribute(*args, **kwargs)

Bases: blocks.bricks.parallel.Fork

Transform an input and add it to other inputs.

This brick is designed for the following scenario: one has a group of variables and another separate variable, and one needs to somehow distribute information from the latter across the former. We call that “to distribute a variable across other variables”, and refer to the separate variable as “the source” and to the variables from the group as “the targets”.

Given a prototype brick, a Distribute brick makes several copies of it (each with its own parameters). At application time the copies are applied to the source and the transformation results are added to the targets (in the literal sense).

>>> from theano import tensor
>>> from blocks.bricks.parallel import Distribute
>>> from blocks.initialization import Constant
>>> x = tensor.matrix('x')
>>> y = tensor.matrix('y')
>>> z = tensor.matrix('z')
>>> distribute = Distribute(target_names=['x', 'y'], source_name='z',
...                         target_dims=[2, 3], source_dim=3,
...                         weights_init=Constant(2))
>>> distribute.initialize()
>>> new_x, new_y = distribute.apply(x=x, y=y, z=z)
>>> new_x.eval({x: [[2, 2]], z: [[1, 1, 1]]}) 
array([[ 8.,  8.]]...
>>> new_y.eval({y: [[1, 1, 1]], z: [[1, 1, 1]]}) 
array([[ 7.,  7.,  7.]]...
Parameters:
  • target_names (list) – The names of the targets.
  • source_name (str) – The name of the source.
  • target_dims (list) – A list of target dimensions, corresponding to target_names.
  • source_dim (int) – The dimension of the source input.
  • prototype (Feedforward, optional) – The transformation prototype. A copy will be created for every input. By default a linear transformation is used.
target_dims

list

source_dim

int

Notes

See Initializable for initialization parameters.

apply

Distribute the source across the targets.

Parameters:**kwargs (dict) – The source and the target variables.
Returns:output – The new target variables.
Return type:list
apply_inputs()
apply_outputs()
class blocks.bricks.parallel.Fork(*args, **kwargs)

Bases: blocks.bricks.parallel.Parallel

Several outputs from one input by applying similar transformations.

Given a prototype brick, a Fork brick makes several copies of it (each with its own parameters). At the application time the copies are applied to the input to produce different outputs.

A typical use case for this brick is to produce inputs for gates of gated recurrent bricks, such as GatedRecurrent.

>>> from theano import tensor
>>> from blocks.bricks.parallel import Fork
>>> from blocks.initialization import Constant
>>> x = tensor.matrix('x')
>>> fork = Fork(output_names=['y', 'z'],
...             input_dim=2, output_dims=[3, 4],
...             weights_init=Constant(2), biases_init=Constant(1))
>>> fork.initialize()
>>> y, z = fork.apply(x)
>>> y.eval({x: [[1, 1]]}) 
array([[ 5.,  5.,  5.]]...
>>> z.eval({x: [[1, 1]]}) 
array([[ 5.,  5.,  5.,  5.]]...
Parameters:
  • output_names (list of str) – Names of the outputs to produce.
  • input_dim (int) – The input dimension.
  • prototype (Feedforward, optional) – The transformation prototype. A copy will be created for every input. By default an affine transformation is used.
input_dim

int

The input dimension.

output_dims

list

The output dimensions as a list of integers, corresponding to output_names.

apply
apply_outputs()
class blocks.bricks.parallel.Merge(*args, **kwargs)

Bases: blocks.bricks.parallel.Parallel

Merges several variables by applying a transformation and summing.

Parameters:
  • input_names (list) – The input names.
  • input_dims (list) – List of input dimensions, corresponding to input_names.
  • output_dim (int) – The output dimension of the merged variables.
  • prototype (Feedforward, optional) – A transformation prototype. A copy will be created for every input. If None, a linear transformation is used.
  • child_prefix (str, optional) – A prefix for children names. By default “transform” is used.
Warning: If you want to have a bias, you can pass a Linear brick as a prototype, but this will result in several redundant biases. It is a better idea to use merge.children[0].use_bias = True.
input_names

list

The input names.

input_dims

list

List of input dimensions corresponding to input_names.

output_dim

int

The output dimension.

Examples

>>> from theano import tensor
>>> from blocks.bricks.parallel import Merge
>>> from blocks.initialization import Constant
>>> a = tensor.matrix('a')
>>> b = tensor.matrix('b')
>>> merge = Merge(input_names=['a', 'b'], input_dims=[3, 4],
...               output_dim=2, weights_init=Constant(1.))
>>> merge.initialize()
>>> c = merge.apply(a=a, b=b)
>>> c.eval({a: [[1, 1, 1]], b: [[2, 2, 2, 2]]})  
array([[ 11.,  11.]]...
apply
apply_inputs()
class blocks.bricks.parallel.Parallel(*args, **kwargs)

Bases: blocks.bricks.Initializable

Apply similar transformations to several inputs.

Given a prototype brick, a Parallel brick makes several copies of it (each with its own parameters). At the application time every copy is applied to the respective input.

>>> from theano import tensor
>>> from blocks.bricks import Linear
>>> from blocks.bricks.parallel import Parallel
>>> from blocks.initialization import Constant
>>> x, y = tensor.matrix('x'), tensor.matrix('y')
>>> parallel = Parallel(
...     prototype=Linear(use_bias=False),
...     input_names=['x', 'y'], input_dims=[2, 3], output_dims=[4, 5],
...     weights_init=Constant(2))
>>> parallel.initialize()
>>> new_x, new_y = parallel.apply(x=x, y=y)
>>> new_x.eval({x: [[1, 1]]}) 
array([[ 4.,  4.,  4.,  4.]]...
>>> new_y.eval({y: [[1, 1, 1]]}) 
array([[ 6.,  6.,  6.,  6.,  6.]]...
Parameters:
  • input_names (list) – The input names.
  • input_dims (list) – List of input dimensions, given in the same order as input_names.
  • output_dims (list) – List of output dimensions.
  • prototype (Feedforward) – The transformation prototype. A copy will be created for every input.
  • child_prefix (str, optional) – The prefix for children names. By default “transform” is used.
input_names

list

The input names.

input_dims

list

Input dimensions.

output_dims

list

Output dimensions.

Notes

See Initializable for initialization parameters.

apply
apply_inputs()
apply_outputs()

Recurrent bricks

class blocks.bricks.recurrent.BaseRecurrent(name=None)

Bases: blocks.bricks.base.Brick

Base class for bricks with a recurrent application method.

has_bias = False
initial_states

Return initial states for an application call.

The default implementation assumes that the recurrent application method is called apply. It fetches the state names from apply.states and returns a zero matrix for each of them.

SimpleRecurrent, LSTM and GatedRecurrent override this method with trainable initial states initialized with zeros.

Parameters:
  • batch_size (int) – The batch size.
  • *args – The positional arguments of the application call.
  • **kwargs – The keyword arguments of the application call.
initial_states_outputs()
class blocks.bricks.recurrent.Bidirectional(*args, **kwargs)

Bases: blocks.bricks.Initializable

Bidirectional network.

A bidirectional network is a combination of forward and backward recurrent networks which process inputs in different order.

Parameters:prototype (instance of BaseRecurrent) – A prototype brick from which the forward and backward bricks are cloned.

Notes

See Initializable for initialization parameters.

apply

Applies forward and backward networks and concatenates outputs.
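The idea can be sketched in NumPy with a toy recurrence standing in for the prototype. Everything here is an assumption made for illustration: the running sum replaces a real recurrent brick, time is taken to be the first axis, and the outputs are concatenated along the last (feature) axis.

```python
import numpy as np

def running_sum(inputs, reverse=False):
    """Toy recurrence h_t = h_{t-1} + x_t over the time axis (axis 0)."""
    if reverse:
        # Process the sequence back to front, then restore time order.
        return np.cumsum(inputs[::-1], axis=0)[::-1]
    return np.cumsum(inputs, axis=0)

def bidirectional(inputs):
    """Run the toy recurrence in both directions and concatenate features."""
    forward = running_sum(inputs)
    backward = running_sum(inputs, reverse=True)
    return np.concatenate([forward, backward], axis=-1)

inputs = np.arange(1.0, 4.0).reshape(3, 1, 1)  # (time, batch, features)
outputs = bidirectional(inputs)
```

At each time step, the first half of the features summarizes the past and the second half summarizes the future.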

apply_delegate()
has_bias = False
class blocks.bricks.recurrent.GatedRecurrent(*args, **kwargs)

Bases: blocks.bricks.recurrent.BaseRecurrent, blocks.bricks.Initializable

Gated recurrent neural network.

Gated recurrent neural network (GRNN) as introduced in [CvMG14]. Every unit of a GRNN is equipped with update and reset gates that facilitate better gradient propagation.

Parameters:
  • dim (int) – The dimension of the hidden state.
  • activation (Brick or None) – The brick to apply as activation. If None a Tanh brick is used.
  • gate_activation (Brick or None) – The brick to apply as activation for gates. If None a Logistic brick is used.

Notes

See Initializable for initialization parameters.

[CvMG14]Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP (2014), pp. 1724-1734.
apply

Apply the gated recurrent transition.

Parameters:
  • states (TensorVariable) – The 2 dimensional matrix of current states in the shape (batch_size, dim). Required for one_step usage.
  • inputs (TensorVariable) – The 2 dimensional matrix of inputs in the shape (batch_size, dim)
  • gate_inputs (TensorVariable) – The 2 dimensional matrix of inputs to the gates in the shape (batch_size, 2 * dim).
  • mask (TensorVariable) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be 1-s only if not given.
Returns:

output – Next states of the network.

Return type:

TensorVariable
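One common formulation of a gated recurrent step, written only from the parameter descriptions above, can be sketched in NumPy. It is not taken from the Blocks source. The split of the gate values into an update half followed by a reset half is an assumption, as are the function name and the use of tanh and logistic activations (the documented defaults).

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_recurrent_step(states, inputs, gate_inputs,
                         state_to_state, state_to_gates, mask=None):
    """One step: states (batch, dim) -> next states (batch, dim)."""
    dim = states.shape[1]
    gates = logistic(states @ state_to_gates + gate_inputs)  # (batch, 2 * dim)
    update, reset = gates[:, :dim], gates[:, dim:]           # assumed ordering
    candidate = np.tanh((states * reset) @ state_to_state + inputs)
    next_states = update * candidate + (1.0 - update) * states
    if mask is not None:
        m = mask[:, None]  # (batch, 1); where 0, the old state is kept
        next_states = m * next_states + (1.0 - m) * states
    return next_states

rng = np.random.RandomState(0)
batch, dim = 2, 3
states = rng.randn(batch, dim)
next_states = gated_recurrent_step(
    states, rng.randn(batch, dim), rng.randn(batch, 2 * dim),
    rng.randn(dim, dim), rng.randn(dim, 2 * dim),
    mask=np.array([1.0, 0.0]))
```

In this sketch, the example with mask 0 simply carries its previous state forward.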

get_dim(name)
initial_states
state_to_gates
state_to_state
class blocks.bricks.recurrent.LSTM(*args, **kwargs)

Bases: blocks.bricks.recurrent.BaseRecurrent, blocks.bricks.Initializable

Long Short Term Memory.

Every unit of an LSTM is equipped with input, forget and output gates. This implementation is based on code by Mohammad Pezeshki that implements the architecture used in [GSS03] and [Grav13]. It aims to do as many computations in parallel as possible and expects the last dimension of the input to be four times the output dimension.

Unlike a vanilla LSTM as described in [HS97], this model has peephole connections from the cells to the gates. The output gates receive information about the cells at the current time step, while the other gates only receive information about the cells at the previous time step. All ‘peephole’ weight matrices are diagonal.

[GSS03]Gers, Felix A., Nicol N. Schraudolph, and Jürgen Schmidhuber, Learning precise timing with LSTM recurrent networks, Journal of Machine Learning Research 3 (2003), pp. 115-143.
[Grav13]Graves, Alex, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850 (2013).
[HS97]Sepp Hochreiter, and Jürgen Schmidhuber, Long Short-Term Memory, Neural Computation 9(8) (1997), pp. 1735-1780.
Parameters:
  • dim (int) – The dimension of the hidden state.
  • activation (Brick, optional) – The activation function. The default and by far the most popular is Tanh.

Notes

See Initializable for initialization parameters.

apply

Apply the Long Short Term Memory transition.

Parameters:
  • states (TensorVariable) – The 2 dimensional matrix of current states in the shape (batch_size, features). Required for one_step usage.
  • cells (TensorVariable) – The 2 dimensional matrix of current cells in the shape (batch_size, features). Required for one_step usage.
  • inputs (TensorVariable) – The 2 dimensional matrix of inputs in the shape (batch_size, features * 4). The inputs need to be four times the dimension of the LSTM brick to ensure that each of the four gates receives a different transformation of the input. See [Grav13] equations 7 to 10 for more details.
  • mask (TensorVariable) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be 1-s only if not given.

Returns:

  • states (TensorVariable) – Next states of the network.
  • cells (TensorVariable) – Next cell activations of the network.
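A peephole LSTM step consistent with the description above can be sketched in NumPy. It is an illustration, not the Blocks implementation: the order in which the four slices of the input are assigned (input gate, forget gate, cell candidate, output gate), the weight names `W_state`, `w_ci`, `w_cf`, `w_co` and the function name are all assumptions. The diagonal peephole matrices are represented as vectors applied elementwise.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(states, cells, inputs, W_state, w_ci, w_cf, w_co, mask=None):
    """One step; inputs has shape (batch, 4 * dim)."""
    dim = states.shape[1]
    pre = inputs + states @ W_state  # (batch, 4 * dim)
    i_pre, f_pre, c_pre, o_pre = (pre[:, k * dim:(k + 1) * dim]
                                  for k in range(4))  # assumed slice order
    in_gate = logistic(i_pre + cells * w_ci)      # peeks at previous cells
    forget_gate = logistic(f_pre + cells * w_cf)  # peeks at previous cells
    next_cells = forget_gate * cells + in_gate * np.tanh(c_pre)
    out_gate = logistic(o_pre + next_cells * w_co)  # peeks at current cells
    next_states = out_gate * np.tanh(next_cells)
    if mask is not None:
        m = mask[:, None]  # where 0, keep the previous states and cells
        next_states = m * next_states + (1.0 - m) * states
        next_cells = m * next_cells + (1.0 - m) * cells
    return next_states, next_cells

rng = np.random.RandomState(0)
batch, dim = 2, 3
states, cells = rng.randn(batch, dim), rng.randn(batch, dim)
next_states, next_cells = lstm_step(
    states, cells, rng.randn(batch, 4 * dim), rng.randn(dim, 4 * dim),
    rng.randn(dim), rng.randn(dim), rng.randn(dim),
    mask=np.array([1.0, 0.0]))
```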

get_dim(name)
initial_states
class blocks.bricks.recurrent.RecurrentStack(transitions, fork_prototype=None, states_name='states', skip_connections=False, **kwargs)

Bases: blocks.bricks.recurrent.BaseRecurrent, blocks.bricks.Initializable

Stack of recurrent networks.

Builds a stack of recurrent layers from a supplied list of BaseRecurrent objects. Each object must have sequences, contexts, states and outputs parameters to its apply method, such as the ones required by the recurrent decorator from blocks.bricks.recurrent.

In Blocks, each brick can have an apply method, and this method has attributes that list the names of the arguments that can be passed to the method and the names of the outputs it returns. The attributes of the apply method of this class are made by concatenating the attributes of the apply methods of each of the transitions from which the stack is made. To avoid conflicts, the names appearing in the states and outputs attributes of the apply method of each layer are renamed. The names of the bottom layer are used as-is, and a suffix of the form ‘#<n>’ is added to the names from other layers, where ‘<n>’ is the number of the layer, starting from 1 for the first layer above the bottom.

The contexts of all layers are merged into a single list of unique names, and no suffix is added. Different layers with the same context name will receive the same value.

The names that appear in sequences are treated in the same way as the names of states and outputs if skip_connections is True. The only exception is the “mask” element that may appear in the sequences attribute of all layers: no suffix is added to it, and all layers receive the same mask value. If you set skip_connections to False, then only the arguments of the sequences from the bottom layer appear in the sequences attribute of the apply method of this class. When using this class with skip_connections set to True, you can supply all inputs to all layers using a single fork which is created with output_names set to the apply.sequences attribute of this class. For example, SequenceGenerator will create such a fork.
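The naming rules described so far can be illustrated with a few lines of plain Python. This is a sketch of the rules as stated in the text, not the code used by RecurrentStack; the helper names are hypothetical, and the order of the resulting list is an arbitrary choice made for the example.

```python
def stacked_names(layer_names, level, shared=("mask",)):
    """Suffix the names of layers above the bottom with '#<level>'."""
    if level == 0:
        return list(layer_names)  # bottom layer: names used as-is
    return [name if name in shared else "{}#{}".format(name, level)
            for name in layer_names]

def stack_sequences(per_layer_sequences, skip_connections):
    """Collect the sequence names exposed by the whole stack."""
    layers = per_layer_sequences if skip_connections else per_layer_sequences[:1]
    names = []
    for level, sequences in enumerate(layers):
        for name in stacked_names(sequences, level):
            if name not in names:  # "mask" is shared, so it appears once
                names.append(name)
    return names

per_layer = [["inputs", "mask"], ["inputs", "mask"]]
with_skip = stack_sequences(per_layer, skip_connections=True)
without_skip = stack_sequences(per_layer, skip_connections=False)
```

For a two-layer stack, enabling skip connections exposes "inputs#1" in addition to "inputs" and the shared "mask".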

Whether or not skip_connections is set, each layer above the bottom also receives an input (values for its sequences arguments) from a fork of the state of the layer below it. This internal fork is not to be confused with the external fork discussed in the previous paragraph. It is assumed that all states attributes have a “states” argument name (this can be configured with the states_name parameter). The output argument with this name is forked and then added to all the elements appearing in the sequences of the next layer (except for “mask”). If skip_connections is False, then this fork has a bias by default. This allows direct usage of this class with input supplied only to the first layer. But if you do supply inputs to all layers (by setting skip_connections to True), then by default there is no bias, and the external fork you use to supply the inputs should have its own separate bias.

Parameters:
  • transitions (list) – List of recurrent units to use in each layer. Each must be derived from BaseRecurrent. Note: a suffix with the layer number is added to the transitions’ names.
  • fork_prototype (Feedforward, optional) – A prototype for the transformation applied to states_name from the states of each layer. The transformation is used when the states_name argument from the outputs of one layer is used as input to the sequences of the next layer. By default a Linear transformation is used, with bias if skip_connections is False. If you supply your own prototype you have to enable or disable the bias depending on the value of skip_connections.
  • states_name (string) – In a stack of RNNs the state of each layer is used as input to the next. states_name identifies the argument of the states and outputs attributes of each layer that should be used for this task. By default the argument is called “states”. To be more precise, this is the name of the argument in the outputs attribute of the apply method of each transition (layer). It is used, via a fork, as the sequences (input) of the next layer. The same element should also appear in the states attribute of the apply method.
  • skip_connections (bool) – By default False. When True, the sequences of all layers are added to the sequences of the apply method of this class. When False, only the sequences of the bottom layer appear in the sequences of the apply method of this class; in this case the default fork used internally between layers has a bias (see fork_prototype). External code can inspect the sequences attribute of the apply method of this class to decide which arguments it needs (and in what order). With skip_connections you control what is exposed to the external code: if it is False, the external code is expected to supply inputs only to the bottom layer, and if it is True, the external code is expected to supply inputs to all layers. There is one caveat: the external inputs to the layers above the bottom layer are added to a fork of the state of the layer below. As a result the outputs of two forks are added together, which is problematic if both have a bias. It is assumed that the external fork has a bias, and therefore by default the internal fork does not have a bias if skip_connections is True.

Notes

See BaseRecurrent for more initialization parameters.

apply

Apply the stack of transitions.

Parameters:
  • low_memory (bool) – Use the slow, but also memory efficient, implementation of this code.
  • *args

    Positional arguments in the order in which they appear in self.apply.sequences followed by self.apply.contexts.

  • **kwargs

    Named arguments defined in self.apply.sequences, self.apply.states or self.apply.contexts.

Returns:

outputs – The outputs of all transitions as defined in self.apply.outputs

Return type:

(list of) TensorVariable

See also

See the docstring of this class for the arguments appearing in the lists self.apply.sequences, self.apply.states and self.apply.contexts. See recurrent() for all other parameters such as iterate and return_initial_states; note that reverse is currently not implemented.

do_apply(*args, **kwargs)

Apply the stack of transitions.

This is the undecorated implementation of the apply method. A method with an @apply decoration should call this method with iterate=True to indicate that the iteration over all steps should be done internally by this method. A method with a @recurrent decoration should call it with iterate=False (or unset) to indicate that the iteration over all steps is done externally.

get_dim(name)
initial_states
low_memory_apply
normal_inputs(level)
static split_suffix(name)
static suffix(name, level)
static suffixes(names, level)
class blocks.bricks.recurrent.SimpleRecurrent(*args, **kwargs)

Bases: blocks.bricks.recurrent.BaseRecurrent, blocks.bricks.Initializable

The traditional recurrent transition.

The most well-known recurrent transition: a matrix multiplication, optionally followed by a non-linearity.

Parameters:
  • dim (int) – The dimension of the hidden state
  • activation (Brick) – The brick to apply as activation.

Notes

See Initializable for initialization parameters.

W
apply

Apply the simple transition.

Parameters:
  • inputs (TensorVariable) – The 2D inputs, in the shape (batch, features).
  • states (TensorVariable) – The 2D states, in the shape (batch, features).
  • mask (TensorVariable) – A 1D binary array in the shape (batch,) which is 1 if there is data available, 0 if not. Assumed to be all ones if not given.
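A single step of such a transition can be sketched without Theano. The function below is a plain-Python illustration, not the brick's implementation: it assumes that the inputs are added to the product of the states with the weight matrix `W`, that `tanh` stands in for the activation brick, and that batch entries whose mask is 0 simply keep their previous state.

```python
import math


def simple_recurrent_step(inputs, states, W, mask=None, activation=math.tanh):
    """One step: activation(inputs + states . W), keeping old states where mask is 0.

    inputs, states: lists of rows, shape (batch, features).
    W: list of rows, shape (features, features).
    mask: optional list of 0/1 values, shape (batch,).
    """
    dim = len(W)
    next_states = []
    for b, (x, h) in enumerate(zip(inputs, states)):
        # Product of the previous state row with W, plus the input row.
        pre = [x[j] + sum(h[k] * W[k][j] for k in range(dim))
               for j in range(dim)]
        new = [activation(v) for v in pre]
        m = 1 if mask is None else mask[b]
        # Assumption: without data (mask 0) the previous state is carried over.
        next_states.append([m * n + (1 - m) * o for n, o in zip(new, h)])
    return next_states


W = [[0.5, 0.0], [0.0, 0.5]]
inputs = [[1.0, 0.0], [1.0, 0.0]]
states = [[0.0, 2.0], [0.0, 2.0]]
out = simple_recurrent_step(inputs, states, W, mask=[1, 0])
```

Here the first batch entry is updated while the second, masked-out entry is returned unchanged.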
get_dim(name)
initial_states
blocks.bricks.recurrent.recurrent(*args, **kwargs)

Wraps an apply method to allow its iterative application.

This decorator allows you to implement only one step of a recurrent network and enjoy applying it to sequences for free. The idea behind it is that, in its most general form, the information flow of an RNN can be described as follows: depending on the context and driven by input sequences, the RNN updates its states and produces output sequences.

Given a method describing one step of an RNN and a specification which of its inputs are the elements of the input sequence, which are the states and which are the contexts, this decorator returns an application method which implements the whole RNN loop. The returned application method also has additional parameters, see documentation of the recurrent_apply inner function below.

Parameters:
  • sequences (list of strs) – Specifies which of the arguments are elements of input sequences.
  • states (list of strs) – Specifies which of the arguments are the states.
  • contexts (list of strs) – Specifies which of the arguments are the contexts.
  • outputs (list of strs) – Names of the outputs. The outputs whose names match those in the states parameter are interpreted as next step states.
Returns:

recurrent_apply – The new application method that applies the RNN to sequences.

Return type:

Application
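The idea of turning a one-step function into a loop over sequences can be illustrated with an ordinary Python decorator. This is a simplified sketch, not the library's decorator: the name `iterate_over` is hypothetical, the loop is an explicit Python loop rather than a symbolic scan, all outputs are assumed to be states, and every keyword argument that is neither a sequence nor a state is treated as a context.

```python
import functools


def iterate_over(sequences, states):
    """Hypothetical decorator: turn a one-step function into a loop over time."""
    def decorator(step):
        @functools.wraps(step)
        def recurrent_apply(iterate=True, **kwargs):
            if not iterate:
                # Single-step mode: call the wrapped function directly.
                return step(**kwargs)
            seqs = {name: kwargs.pop(name) for name in sequences}
            current = {name: kwargs.pop(name) for name in states}
            contexts = kwargs  # whatever is left is the same at every step
            n_steps = len(next(iter(seqs.values())))
            history = {name: [] for name in states}
            for t in range(n_steps):
                step_inputs = {name: seq[t] for name, seq in seqs.items()}
                step_inputs.update(current)
                step_inputs.update(contexts)
                result = step(**step_inputs)
                # Outputs named like a state become the next step's states.
                current = dict(zip(states, result))
                for name in states:
                    history[name].append(current[name])
            return [history[name] for name in states]
        return recurrent_apply
    return decorator


@iterate_over(sequences=["inputs"], states=["total"])
def running_sum(inputs, total, scale=1):
    """One step: add the scaled input to the running total."""
    return (total + scale * inputs,)


totals, = running_sum(inputs=[1, 2, 3], total=0, scale=2)
```

The decorated `running_sum` only describes one step, yet calling it with a whole sequence yields the state after every step; passing `iterate=False` falls back to the single-step behaviour.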

Attention bricks

This module defines the interface of attention mechanisms and a few concrete implementations. For a gentle introduction and usage examples see the tutorial TODO.

An attention mechanism decides to what part of the input to pay attention. It is typically used as a component of a recurrent network, though one can imagine it used in other conditions as well. When the input is big and has a certain structure, for instance when it is a sequence or an image, an attention mechanism can be applied to extract only the information which is relevant for the network in its current state.

For the purpose of documentation clarity, we fix the following terminology in this file:

  • network is the network, typically a recurrent one, which uses the attention mechanism.
  • The network has states. Using this word in plural might seem weird, but some recurrent networks like LSTM do have several states.
  • The big structured input, to which the attention mechanism is applied, is called the attended. When it has variable structure, e.g. a sequence of variable length, there might be a mask associated with it.
  • The information extracted by the attention from the attended is called a glimpse, or more specifically glimpses, because there might be several pieces of this information.

Using this terminology, the attention mechanism computes glimpses given the states of the network and the attended.

An example: in the machine translation network from [BCB] the attended is a sequence of so-called annotations, that is states of a bidirectional network that was driven by word embeddings of the source sentence. The attention mechanism assigns weights to the annotations. The weighted sum of the annotations is further used by the translation network to predict the next word of the generated translation. The weights and the weighted sum are the glimpses. A generalized attention mechanism for this paper is represented here as SequenceContentAttention.

class blocks.bricks.attention.AbstractAttention(*args, **kwargs)

Bases: blocks.bricks.base.Brick

The common interface for attention bricks.

First, see the module-level docstring for terminology.

A generic attention mechanism functions as follows. Its inputs are the states of the network and the attended. Given these two it produces so-called glimpses, that is, it extracts information from the attended which is necessary for the network in its current states.

For computational reasons we separate the process described above into two stages:

  1. The preprocessing stage, preprocess(), includes computations that do not involve the state. These can often be performed in advance. The outcome of this stage is called preprocessed_attended.

  2. The main stage, take_glimpses(), includes all the rest.

When an attention mechanism is applied sequentially, some glimpses from the previous step might be necessary to compute the new ones. A typical example for that is when the focus position from the previous step is required. In such cases take_glimpses() should specify such need in its interface (its docstring explains how to do that). In addition initial_glimpses() should specify some sensible initialization for the glimpses to be carried over.

Todo

Only single attended is currently allowed.

preprocess() and initial_glimpses() might end up needing masks, which are currently not provided for them.

Parameters:
  • state_names (list) – The names of the network states.
  • state_dims (list) – The state dimensions corresponding to state_names.
  • attended_dim (int) – The dimension of the attended.
state_names

list

state_dims

list

attended_dim

int

get_dim(name)
initial_glimpses(batch_size, attended)

Return sensible initial values for carried over glimpses.

Parameters:
  • batch_size (int or Variable) – The batch size.
  • attended (Variable) – The attended.
Returns:

initial_glimpses – The initial values for the requested glimpses. These might simply consist of zeros or be somehow extracted from the attended.

Return type:

list of Variable

preprocess

Perform the preprocessing of the attended.

Stage 1 of the attention mechanism, see AbstractAttention docstring for an explanation of stages. The default implementation simply returns attended.

Parameters:attended (Variable) – The attended.
Returns:preprocessed_attended – The preprocessed attended.
Return type:Variable
take_glimpses(attended, preprocessed_attended=None, attended_mask=None, **kwargs)

Extract glimpses from the attended given the current states.

Stage 2 of the attention mechanism, see AbstractAttention for an explanation of stages. If preprocessed_attended is not given, should trigger the stage 1.

This application method must declare its inputs and outputs. The glimpses to be carried over are identified by their presence in both inputs and outputs list. The attended must be the first input, the preprocessed attended must be the second one.

Parameters:
  • attended (Variable) – The attended.
  • preprocessed_attended (Variable, optional) – The preprocessed attended computed by preprocess(). When not given, preprocess() should be called.
  • attended_mask (Variable, optional) – The mask for the attended. This is required in the case of padded structured output, e.g. when a number of sequences are forced to be the same length. The mask identifies the positions of the attended that actually contain information.
  • **kwargs (dict) – Includes the states and the glimpses to be carried over from the previous step in the case when the attention mechanism is applied sequentially.
class blocks.bricks.attention.AbstractAttentionRecurrent(name=None)

Bases: blocks.bricks.recurrent.BaseRecurrent

The interface for attention-equipped recurrent transitions.

When a recurrent network is equipped with an attention mechanism its transition typically consists of two steps: (1) the glimpses are taken by the attention mechanism and (2) the next states are computed using the current states and the glimpses. Certain use cases (such as a sequence generator) require that, apart from a do-it-all recurrent application method, separate interfaces for the first and second steps of the transition are provided.

apply(**kwargs)

Compute next states taking glimpses on the way.

compute_states(**kwargs)

Compute next states given current states and glimpses.

take_glimpses(**kwargs)

Compute glimpses given the current states.

class blocks.bricks.attention.AttentionRecurrent(transition, attention, distribute=None, add_contexts=True, attended_name=None, attended_mask_name=None, **kwargs)

Bases: blocks.bricks.attention.AbstractAttentionRecurrent, blocks.bricks.Initializable

Combines an attention mechanism and a recurrent transition.

This brick equips a recurrent transition with an attention mechanism. In order to do this two more contexts are added: one to be attended and a mask for it. It is also possible to use the contexts of the given recurrent transition for these purposes and not add any new ones, see the add_contexts parameter.

At the beginning of each step the attention mechanism produces glimpses; these glimpses together with the current states are used to compute the next state and finish the transition. In some cases glimpses from the previous steps are also necessary for the attention mechanism, e.g. in order to focus on an area close to the one from the previous step. This is also supported: such glimpses become states of the new transition.

To let the user control the way glimpses are used, this brick also takes a “distribute” brick as parameter that distributes the information from glimpses across the sequential inputs of the wrapped recurrent transition.
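The order of operations within one step of such a transition can be sketched as follows. The three callables are hypothetical stand-ins for the attention mechanism, the distribute brick and the wrapped transition; scalars replace tensors to keep the flow visible.

```python
def attention_step(take_glimpses, distribute, transition,
                   inputs, states, attended):
    """One attention-equipped step: glimpse, distribute, then transition."""
    # 1. The attention mechanism looks at the attended given the current states.
    glimpses = take_glimpses(states, attended)
    # 2. The glimpse information is mixed into the sequential inputs.
    inputs = distribute(inputs, glimpses)
    # 3. The wrapped transition computes the next states as usual.
    next_states = transition(inputs, states)
    return next_states, glimpses


# Toy components (assumptions for illustration only).
def mean_glimpse(states, attended):
    return sum(attended) / len(attended)


def add_glimpse(inputs, glimpses):
    return inputs + glimpses


def accumulate(inputs, states):
    return states + inputs


next_states, glimpses = attention_step(
    mean_glimpse, add_glimpse, accumulate,
    inputs=1.0, states=0.0, attended=[2.0, 4.0])
```

In this toy run the glimpse is the mean of the attended values, which is added to the input before the transition accumulates it into the state.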

Parameters:
  • transition (BaseRecurrent) – The recurrent transition.
  • attention (Brick) – The attention mechanism.
  • distribute (Brick, optional) – Distributes the information from glimpses across the input sequences of the transition. By default a Distribute is used, and those inputs containing the “mask” substring in their name are not affected.
  • add_contexts (bool, optional) – If True, new contexts for the attended and the attended mask are added to this transition, otherwise existing contexts of the wrapped transition are used. True by default.
  • attended_name (str) – The name of the attended context. If None, “attended” or the first context of the recurrent transition is used depending on the value of the add_contexts flag.
  • attended_mask_name (str) – The name of the mask for the attended context. If None, “attended_mask” or the second context of the recurrent transition is used depending on the value of the add_contexts flag.

Notes

See Initializable for initialization parameters.

Wrapping your recurrent brick with this class makes all the states mandatory. If you feel this is a limitation for you, try to make it better! This restriction does not apply to sequences and contexts: those keep being as optional as they were for your brick.

Those coming to Blocks from Groundhog might recognize that this is a RecurrentLayerWithSearch, but on steroids :)

apply

Preprocess a sequence attending the attended context at every step.

Preprocesses the attended context and runs do_apply(). See do_apply() documentation for further information.

apply_contexts()
apply_delegate()
compute_states

Compute current states when glimpses have already been computed.

Combines an application of the distribute brick, which alters the sequential inputs of the wrapped transition, and an application of the wrapped transition. All unknown keyword arguments go to the wrapped transition.

Parameters:**kwargs – Should contain everything that self.transition needs and in addition the current glimpses.
Returns:current_states – Current states computed by self.transition.
Return type:list of TensorVariable
compute_states_outputs()
do_apply

Process a sequence attending the attended context every step.

In addition to the original sequence this method also requires its preprocessed version, the one computed by the preprocess method of the attention mechanism. Unknown keyword arguments are passed to the wrapped transition.

Parameters:**kwargs – Should contain current inputs, previous step states, contexts, the preprocessed attended context, previous step glimpses.
Returns:outputs – The current step states and glimpses.
Return type:list of TensorVariable
do_apply_contexts()
do_apply_outputs()
do_apply_sequences()
do_apply_states()
get_dim(name)
initial_states
initial_states_outputs()
take_glimpses

Compute glimpses with the attention mechanism.

A thin wrapper over self.attention.take_glimpses: takes care of choosing and renaming the necessary arguments.

Parameters:**kwargs – Must contain the attended, previous step states and glimpses. Can optionally contain the attended mask and the preprocessed attended.
Returns:glimpses – Current step glimpses.
Return type:list of TensorVariable
take_glimpses_outputs()
class blocks.bricks.attention.GenericSequenceAttention(*args, **kwargs)

Bases: blocks.bricks.attention.AbstractAttention

Logic common for sequence attention mechanisms.

compute_weighted_averages

Compute weighted averages of the attended sequence vectors.

Parameters:
  • weights (Variable) – The weights. The shape must be equal to the attended shape without the last dimension.
  • attended (Variable) – The attended. The index in the sequence must be the first dimension.
Returns:

weighted_averages – The weighted averages of the attended elements. The shape is equal to the attended shape with the first dimension dropped.

Return type:

Variable
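The shape contract described above can be made concrete with nested lists. This sketch assumes weights of shape (time, batch) and an attended of shape (time, batch, dim); the function name is a placeholder, not the brick's method.

```python
def weighted_averages(weights, attended):
    """weights: (time, batch); attended: (time, batch, dim) -> (batch, dim)."""
    n_time = len(attended)
    n_batch = len(attended[0])
    dim = len(attended[0][0])
    # Sum over the first (time) dimension, which is therefore dropped.
    return [[sum(weights[t][b] * attended[t][b][j] for t in range(n_time))
             for j in range(dim)]
            for b in range(n_batch)]


weights = [[0.25], [0.75]]                 # two time steps, batch of one
attended = [[[4.0, 0.0]], [[0.0, 4.0]]]    # each element has two features
averages = weighted_averages(weights, attended)
```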

compute_weights

Compute weights from energies in softmax-like fashion.

Todo

Use Softmax.

Parameters:
  • energies (Variable) – The energies. Must be of the same shape as the mask.
  • attended_mask (Variable) – The mask for the attended. The index in the sequence must be the first dimension.
Returns:

weights – Non-negative weights summing to 1, of the same shape as energies.

Return type:

Variable
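A softmax-like normalization that respects a mask might look like the sketch below. It assumes energies and mask of shape (time, batch), normalizes over the time dimension, and treats a mask value of 0 as a padded position; the function name is hypothetical and the case of a fully masked sequence is not handled.

```python
import math


def masked_softmax_over_time(energies, mask=None):
    """energies, mask: lists of rows, shape (time, batch). Returns same-shape weights."""
    n_time, n_batch = len(energies), len(energies[0])
    weights = [[0.0] * n_batch for _ in range(n_time)]
    for b in range(n_batch):
        column = [energies[t][b] for t in range(n_time)]
        top = max(column)  # subtracted for numerical stability
        unnorm = [math.exp(e - top) for e in column]
        if mask is not None:
            # Padded positions contribute nothing to the normalization.
            unnorm = [u * mask[t][b] for t, u in enumerate(unnorm)]
        total = sum(unnorm)
        for t in range(n_time):
            weights[t][b] = unnorm[t] / total
    return weights


energies = [[0.0, 1.0], [0.0, 1.0], [0.0, 5.0]]
mask = [[1, 1], [1, 1], [1, 0]]
weights = masked_softmax_over_time(energies, mask)
```

In the second sequence the last position is masked out, so its large energy is ignored and the two remaining positions share the weight equally.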

class blocks.bricks.attention.SequenceContentAttention(*args, **kwargs)

Bases: blocks.bricks.attention.GenericSequenceAttention, blocks.bricks.Initializable

Attention mechanism that looks for relevant content in a sequence.

This is the attention mechanism used in [BCB]. The idea in a nutshell:

  1. The states and the sequence are transformed independently,
  2. The transformed states are summed with every transformed sequence element to obtain match vectors,
  3. A match vector is transformed into a single number interpreted as energy,
  4. Energies are normalized in softmax-like fashion. The resulting weights, which sum to one, are called attention weights,
  5. Weighted average of the sequence elements with attention weights is computed.

In terms of the AbstractAttention documentation, the sequence is the attended. The weighted averages from 5 and the attention weights from 4 form the set of glimpses produced by this attention mechanism.
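The five steps can be traced for a single example with plain lists. In this sketch the matrices `W_state` and `W_seq` and the vector `v` are hypothetical stand-ins for the state transformer, the attended transformer and the energy computer; biases, masks and batching are left out.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def content_attention(state, sequence, W_state, W_seq, v):
    """Return the glimpses (weighted_average, weights) for one example.

    state: vector; sequence: list of vectors with time as the first dimension.
    """
    # Step 1: transform the state and every sequence element independently.
    t_state = matvec(W_state, state)
    t_seq = [matvec(W_seq, element) for element in sequence]
    # Step 2: match vectors are sums of the two transformed parts.
    matches = [[a + b for a, b in zip(t_state, e)] for e in t_seq]
    # Step 3: each match vector becomes one energy (tanh, then weighted sum).
    energies = [sum(vi * math.tanh(m) for vi, m in zip(v, match))
                for match in matches]
    # Step 4: softmax-like normalization gives the attention weights.
    top = max(energies)
    exps = [math.exp(e - top) for e in energies]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Step 5: weighted average of the original sequence elements.
    dim = len(sequence[0])
    average = [sum(w * element[j] for w, element in zip(weights, sequence))
               for j in range(dim)]
    return average, weights


sequence = [[1.0, 0.0], [0.0, 1.0]]
average, weights = content_attention(
    state=[1.0], sequence=sequence,
    W_state=[[0.0]], W_seq=[[1.0, -1.0]], v=[1.0])
```

With these toy parameters the first sequence element receives the higher energy, so it also receives the larger attention weight.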

Parameters:
  • state_names (list of str) – The names of the network states.
  • attended_dim (int) – The dimension of the sequence elements.
  • match_dim (int) – The dimension of the match vector.
  • state_transformer (Brick) – A prototype for state transformations. If None, a linear transformation is used.
  • attended_transformer (Feedforward) – The transformation to be applied to the sequence. If None an affine transformation is used.
  • energy_computer (Feedforward) – Computes energy from the match vector. If None, an affine transformation preceded by \(\tanh\) is used.

Notes

See Initializable for initialization parameters.

[BCB](1, 2) Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.
compute_energies
get_dim(name)
initial_glimpses
preprocess

Preprocess the sequence for computing attention weights.

Parameters:attended (TensorVariable) – The attended sequence, time is the first dimension.
take_glimpses

Compute attention weights and produce glimpses.

Parameters:
  • attended (TensorVariable) – The sequence, time is the first dimension.
  • preprocessed_attended (TensorVariable) – The preprocessed sequence. If None, is computed by calling preprocess().
  • attended_mask (TensorVariable) – A 0/1 mask specifying available data. 0 means that the corresponding sequence element is fake.
  • **states – The states of the network.
Returns:

  • weighted_averages (Variable) – Linear combinations of sequence elements with the attention weights.
  • weights (Variable) – The attention weights. The first dimension is batch, the second is time.

take_glimpses_inputs()
class blocks.bricks.attention.ShallowEnergyComputer(*args, **kwargs)

Bases: blocks.bricks.Sequence, blocks.bricks.Initializable, blocks.bricks.Feedforward

A simple energy computer: first tanh, then weighted sum.

input_dim
output_dim

Sequence generators

Recurrent networks are often used to generate/model sequences. Examples include language modelling, machine translation, handwriting synthesis, etc. A typical pattern in this context is that sequence elements are generated one after another, and every generated element is fed back into the recurrent network state. Sometimes an attention mechanism is also used to condition sequence generation on some structured input like another sequence or an image.

This module provides SequenceGenerator that builds a sequence generating network from three main components:

  • a core recurrent transition, e.g. LSTM or GatedRecurrent
  • a readout component that can produce sequence elements using the network state and the information from the attention mechanism
  • an attention mechanism (see attention for more information)

Implementation-wise SequenceGenerator fully relies on BaseSequenceGenerator. At the level of the latter an attention is mandatory, moreover it must be a part of the recurrent transition (see AttentionRecurrent). To simulate optional attention, SequenceGenerator wraps the pure recurrent network in FakeAttentionRecurrent.

class blocks.bricks.sequence_generators.AbstractEmitter(name=None)

Bases: blocks.bricks.base.Brick

The interface for the emitter component of a readout.

readout_dim

int

The dimension of the readout. Is given by the Readout brick when allocation configuration is pushed.

See also

Readout

SoftmaxEmitter
for integer outputs
cost(readouts, outputs)

Implements the respective method of Readout.

emit(readouts)

Implements the respective method of Readout.

initial_outputs(batch_size)

Implements the respective method of Readout.

class blocks.bricks.sequence_generators.AbstractFeedback(name=None)

Bases: blocks.bricks.base.Brick

The interface for the feedback component of a readout.

feedback(outputs)

Implements the respective method of Readout.

class blocks.bricks.sequence_generators.AbstractReadout(*args, **kwargs)

Bases: blocks.bricks.Initializable

The interface for the readout component of a sequence generator.

The readout component of a sequence generator is a bridge between the core recurrent network and the output sequence.

Parameters:
  • source_names (list) – A list of the source names (outputs) that are needed for the readout part e.g. ['states'] or ['states', 'weighted_averages'] or ['states', 'feedback'].
  • readout_dim (int) – The dimension of the readout.
source_names

list

readout_dim

int

See also

BaseSequenceGenerator
see how exactly a readout is used
Readout
the typically used readout brick
cost(readouts, outputs)

Compute generation cost of outputs given readouts.

Parameters:
  • readouts (Variable) – Readouts produced by the readout() method of a (..., readout dim) shape.
  • outputs (Variable) – Outputs whose cost should be computed. Should have as many dimensions as readouts, or one fewer. If readouts has n dimensions, the first n - 1 dimensions of outputs should match those of readouts.
emit(readouts)

Produce outputs from readouts.

Parameters:readouts (Variable) – Readouts produced by the readout() method of a (batch_size, readout_dim) shape.
feedback(outputs)

Feeds outputs back to be used as inputs of the transition.

initial_outputs(batch_size)

Compute initial outputs for the generator’s first step.

In the notation from the BaseSequenceGenerator documentation this method should compute \(y_0\).

readout(**kwargs)

Compute the readout vector from states, glimpses, etc.

Parameters:**kwargs (dict) – Contains sequence generator states, glimpses, contexts and feedback from the previous outputs.
class blocks.bricks.sequence_generators.BaseSequenceGenerator(*args, **kwargs)

Bases: blocks.bricks.Initializable

A generic sequence generator.

This class combines two components, a readout network and an attention-equipped recurrent transition, into a context-dependent sequence generator. A third component must also be given, which forks feedback from the readout network to obtain inputs for the transition.

The class provides two methods: generate() and cost(). The former is to actually generate sequences and the latter is to compute the cost of generating given sequences.

The generation algorithm description follows.

Definitions and notation:

  • States \(s_i\) of the generator are the states of the transition as specified in transition.state_names.
  • Contexts of the generator are the contexts of the transition as specified in transition.context_names.
  • Glimpses \(g_i\) are intermediate entities computed at every generation step from states, contexts and the previous step glimpses. They are computed in the transition’s apply method when not given or by explicitly calling the transition’s take_glimpses method. The set of glimpses considered is specified in transition.glimpse_names.
  • Outputs \(y_i\) are produced at every step and form the output sequence. A generation cost \(c_i\) is assigned to each output.

Algorithm:

  1. Initialization.

    \[\begin{split}y_0 = readout.initial\_outputs(contexts)\\ s_0, g_0 = transition.initial\_states(contexts)\\ i = 1\\\end{split}\]

    By default all recurrent bricks from recurrent have trainable initial states initialized with zeros. Subclass them or BaseRecurrent directly to get custom initial states.

  2. New glimpses are computed:

    \[g_i = transition.take\_glimpses( s_{i-1}, g_{i-1}, contexts)\]
  3. A new output is generated by the readout and its cost is computed:

    \[\begin{split}f_{i-1} = readout.feedback(y_{i-1}) \\ r_i = readout.readout(f_{i-1}, s_{i-1}, g_i, contexts) \\ y_i = readout.emit(r_i) \\ c_i = readout.cost(r_i, y_i)\end{split}\]

    Note that the new glimpses and the old states are used at this step. The reason for not merging all readout methods into one is to make an efficient implementation of cost() possible.

  4. New states are computed and iteration is done:

    \[\begin{split}f_i = readout.feedback(y_i) \\ s_i = transition.compute\_states(s_{i-1}, g_i, fork.apply(f_i), contexts) \\ i = i + 1\end{split}\]
  5. Back to step 2 if the desired sequence length has not yet been reached.

A scheme of the algorithm described above follows.
../_images/sequence_generator_scheme.png
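The control flow of the algorithm above can also be written as a short Python loop. The sketch mirrors the five steps with duck-typed objects; the toy `CountingReadout` and `CountingTransition` classes are assumptions made up for this example and operate on plain integers instead of tensors.

```python
def generate(readout, transition, fork, contexts, n_steps):
    """Run the generation loop; return the emitted outputs and their costs."""
    y = readout.initial_outputs(contexts)                 # step 1
    s, g = transition.initial_states(contexts)
    outputs, costs = [], []
    for _ in range(n_steps):                              # step 5
        g = transition.take_glimpses(s, g, contexts)      # step 2
        f = readout.feedback(y)                           # step 3
        r = readout.readout(f, s, g, contexts)
        y = readout.emit(r)
        costs.append(readout.cost(r, y))
        f = readout.feedback(y)                           # step 4
        s = transition.compute_states(s, g, fork(f), contexts)
        outputs.append(y)
    return outputs, costs


class CountingReadout:
    """Toy readout: the readout is state plus feedback, emitted unchanged."""

    def initial_outputs(self, contexts):
        return 0

    def feedback(self, y):
        return y

    def readout(self, f, s, g, contexts):
        return s + f

    def emit(self, r):
        return r

    def cost(self, r, y):
        return 0.0


class CountingTransition:
    """Toy transition without real glimpses: the state grows by its input plus one."""

    def initial_states(self, contexts):
        return 0, None

    def take_glimpses(self, s, g, contexts):
        return None

    def compute_states(self, s, g, inputs, contexts):
        return s + inputs + 1


outputs, costs = generate(CountingReadout(), CountingTransition(),
                          fork=lambda f: f, contexts={}, n_steps=3)
```

Note how the readout at step i uses the old state together with the new glimpses, and how the freshly emitted output is fed back before the new state is computed.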
Parameters:
  • readout (instance of AbstractReadout) – The readout component of the sequence generator.
  • transition (instance of AbstractAttentionRecurrent) – The transition component of the sequence generator.
  • fork (Brick) – The brick to compute the transition’s inputs from the feedback.

See also

Initializable
for initialization parameters
SequenceGenerator
more user friendly interface to this brick
cost

Returns the average cost over the minibatch.

The cost is computed by averaging the sum of per token costs for each sequence over the minibatch.

Warning

Note that the computed cost can be problematic when batches consist of sequences of vastly different lengths.

Parameters:
  • outputs (TensorVariable) – The 3-dimensional (or 2-dimensional) tensor containing output sequences. Axis 0 must stand for time, axis 1 for the position in the batch.
  • mask (TensorVariable) – The binary matrix identifying fake outputs.
Returns:

cost – Theano variable for cost, computed by summing over timesteps and then averaging over the minibatch.

Return type:

Variable

Notes

The contexts are expected as keyword arguments.

Adds the average cost per sequence element to the computational graph as an AUXILIARY variable named per_sequence_element.
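The difference between the two averages can be seen on a small example. The sketch below assumes a cost matrix and a mask of shape (time, batch), with mask value 1 marking real elements; the function name is hypothetical.

```python
def sequence_costs(cost_matrix, mask):
    """cost_matrix, mask: lists of rows, shape (time, batch).

    Returns (cost, per_sequence_element): the batch average of the summed
    per-sequence costs, and the total cost divided by the number of real
    sequence elements.
    """
    n_batch = len(cost_matrix[0])
    masked = [[c * m for c, m in zip(c_row, m_row)]
              for c_row, m_row in zip(cost_matrix, mask)]
    total = sum(sum(row) for row in masked)
    cost = total / n_batch
    per_sequence_element = total / sum(sum(row) for row in mask)
    return cost, per_sequence_element


# Two sequences with the same per-token cost but lengths 1 and 3.
cost_matrix = [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
mask = [[1, 1], [0, 1], [0, 1]]
cost, per_element = sequence_costs(cost_matrix, mask)
```

Both sequences cost 2.0 per token, yet the longer one contributes three times as much to the summed cost, which is why batches with very different sequence lengths can be problematic.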

cost_matrix

Returns generation costs for output sequences.

See also

cost()
Scalar cost.
generate

A sequence generation step.

Parameters:outputs (TensorVariable) – The outputs from the previous step.

Notes

The contexts, previous states and glimpses are expected as keyword arguments.

generate_delegate()
generate_outputs()
generate_states()
get_dim(name)
initial_states
initial_states_outputs()
class blocks.bricks.sequence_generators.FakeAttentionRecurrent(transition, **kwargs)

Bases: blocks.bricks.attention.AbstractAttentionRecurrent, blocks.bricks.Initializable

Adds fake attention interface to a transition.

BaseSequenceGenerator requires its transition brick to support AbstractAttentionRecurrent interface, that is to have an embedded attention mechanism. For the cases when no attention is required (e.g. language modeling or encoder-decoder models), FakeAttentionRecurrent is used to wrap a usual recurrent brick. The resulting brick has no glimpses and simply passes all states and contexts to the wrapped one.

Todo

Get rid of this brick and support attention-less transitions in BaseSequenceGenerator.

apply
apply_delegate()
compute_states
compute_states_delegate()
get_dim(name)
initial_states
initial_states_outputs()
take_glimpses
class blocks.bricks.sequence_generators.LookupFeedback(num_outputs=None, feedback_dim=None, **kwargs)

Bases: blocks.bricks.sequence_generators.AbstractFeedback, blocks.bricks.Initializable

A feedback brick for the case when readouts are integers.

Stores and retrieves distributed representations of integers.

feedback
get_dim(name)
class blocks.bricks.sequence_generators.Readout(emitter=None, feedback_brick=None, merge=None, merge_prototype=None, post_merge=None, merged_dim=None, **kwargs)

Bases: blocks.bricks.sequence_generators.AbstractReadout

Readout brick with separated emitter and feedback parts.

Readout combines a few bits and pieces into an object that can be used as the readout component in BaseSequenceGenerator. This includes an emitter brick, to which emit(), cost() and initial_outputs() calls are delegated, a feedback brick to which feedback() functionality is delegated, and a pipeline to actually compute readouts from all the sources (see the source_names attribute of AbstractReadout).

The readout computation pipeline is constructed from the merge and post_merge bricks, whose responsibilities are described in the respective docstrings.

Parameters:
  • emitter (an instance of AbstractEmitter) – The emitter component.
  • feedback_brick (an instance of AbstractFeedback) – The feedback component.
  • merge (Brick, optional) – A brick that takes the sources given in source_names as an input and combines them into a single output. If given, merge_prototype cannot be given.
  • merge_prototype (Feedforward, optional) – If merge isn’t given, the transformation given by merge_prototype is applied to each input before being summed. By default a Linear transformation without biases is used. If given, merge cannot be given.
  • post_merge (Feedforward, optional) – This transformation is applied to the merged inputs. By default Bias is used.
  • merged_dim (int, optional) – The input dimension of post_merge i.e. the output dimension of merge (or merge_prototype). If not given, it is assumed to be the same as readout_dim (i.e. post_merge is assumed to not change dimensions).
  • **kwargs (dict) – Passed to the parent’s constructor.
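With the defaults described above, the pipeline reduces to a sum of per-source linear maps followed by a bias. The sketch below illustrates this for one example; the function and variable names are hypothetical and the weights are plain nested lists rather than brick parameters.

```python
def linear(matrix, vector):
    """Apply a bias-free linear map given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def compute_readout(sources, weights, bias):
    """sources: name -> vector; weights: name -> matrix with merged_dim rows;
    bias: vector of length merged_dim."""
    merged = [0.0] * len(bias)
    for name, vector in sources.items():
        # merge: each source gets its own linear transformation, then all are summed.
        transformed = linear(weights[name], vector)
        merged = [m + t for m, t in zip(merged, transformed)]
    # post_merge: by default only a bias is added, so dimensions do not change.
    return [m + b for m, b in zip(merged, bias)]


sources = {"states": [1.0, 2.0], "feedback": [3.0]}
weights = {"states": [[1.0, 0.0], [0.0, 1.0]],
           "feedback": [[1.0], [-1.0]]}
readout = compute_readout(sources, weights, bias=[0.5, 0.5])
```

Keeping the per-source transformations bias-free and adding a single bias afterwards avoids summing several redundant bias vectors.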

See also

BaseSequenceGenerator
see how exactly a readout is used

AbstractEmitter, AbstractFeedback

cost
emit
feedback
get_dim(name)
initial_outputs
readout
class blocks.bricks.sequence_generators.SequenceGenerator(readout, transition, attention=None, add_contexts=True, **kwargs)

Bases: blocks.bricks.sequence_generators.BaseSequenceGenerator

A more user-friendly interface for BaseSequenceGenerator.

Parameters:
  • readout (instance of AbstractReadout) – The readout component for the sequence generator.
  • transition (instance of BaseRecurrent) – The recurrent transition to be used in the sequence generator. Will be combined with attention, if that one is given.
  • attention (object, optional) – The attention mechanism to be added to transition, an instance of AbstractAttention.
  • add_contexts (bool) – If True, the AttentionRecurrent wrapping the transition will add additional contexts for the attended and its mask.
  • **kwargs (dict) – All keyword arguments are passed to the base class. If the fork keyword argument is not provided, a Fork is created that forks all sequential inputs of the transition whose names do not contain the substring “mask”.
class blocks.bricks.sequence_generators.SoftmaxEmitter(initial_output=0, **kwargs)

Bases: blocks.bricks.sequence_generators.AbstractEmitter, blocks.bricks.Initializable, blocks.bricks.Random

A softmax emitter for the case of integer outputs.

Interprets readout elements as energies corresponding to their indices.

Parameters:initial_output (int or a scalar Variable) – The initial output.
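To make "energies corresponding to their indices" concrete, the sketch below turns a vector of energies into probabilities with a softmax, samples an index from them, and scores a target index. It is a plain-Python illustration under stated assumptions, not the brick's code: the function names are hypothetical, and treating the cost as the negative log-probability of the target is an assumption not stated above.

```python
import math
import random


def softmax(energies):
    """Convert energies into probabilities; index i gets weight exp(energies[i])."""
    shift = max(energies)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(e - shift) for e in energies]
    total = sum(exps)
    return [e / total for e in exps]


def emit_sketch(energies, rng):
    """Sample an integer output according to the softmax probabilities."""
    probs = softmax(energies)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def cost_sketch(energies, target):
    """Assumed cost: negative log-probability of the target index."""
    return -math.log(softmax(energies)[target])


rng = random.Random(0)
energies = [0.0, 0.0, 0.0]  # equal energies give a uniform distribution
sample = emit_sketch(energies, rng)
uniform_cost = cost_sketch(energies, 1)
```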
cost
emit
get_dim(name)
initial_outputs
probs
class blocks.bricks.sequence_generators.TrivialEmitter(*args, **kwargs)

Bases: blocks.bricks.sequence_generators.AbstractEmitter

An emitter for the trivial case when readouts are outputs.

Parameters:readout_dim (int) – The dimension of the readout.

Notes

By default, cost() always returns a zero tensor.

cost
emit
get_dim(name)
initial_outputs
class blocks.bricks.sequence_generators.TrivialFeedback(*args, **kwargs)

Bases: blocks.bricks.sequence_generators.AbstractFeedback

A feedback brick for the case when readouts are outputs.

feedback
get_dim(name)

Cost bricks

class blocks.bricks.cost.AbsoluteError(name=None)

Bases: blocks.bricks.cost.CostMatrix

cost_matrix
class blocks.bricks.cost.BinaryCrossEntropy(name=None)

Bases: blocks.bricks.cost.CostMatrix

cost_matrix
class blocks.bricks.cost.CategoricalCrossEntropy(name=None)

Bases: blocks.bricks.cost.Cost

apply
class blocks.bricks.cost.Cost(name=None)

Bases: blocks.bricks.base.Brick

apply
class blocks.bricks.cost.CostMatrix(name=None)

Bases: blocks.bricks.cost.Cost

Base class for costs which can be calculated element-wise.

Assumes that the data has format (batch, features).
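The relationship between an element-wise cost matrix and a scalar cost can be sketched as follows. The reduction used here (sum over features, then mean over the batch) is an assumption made for illustration and is not stated in this reference; the function names are hypothetical as well.

```python
def squared_error_matrix(y, y_hat):
    """Element-wise squared error for data laid out as (batch, features)."""
    return [[(a - b) ** 2 for a, b in zip(row, row_hat)]
            for row, row_hat in zip(y, y_hat)]


def reduce_cost_matrix(matrix):
    """Assumed reduction: sum each example's features, then average over the batch."""
    per_example = [sum(row) for row in matrix]
    return sum(per_example) / len(per_example)


y = [[1.0, 2.0], [3.0, 4.0]]
y_hat = [[1.0, 0.0], [0.0, 4.0]]
matrix = squared_error_matrix(y, y_hat)
cost = reduce_cost_matrix(matrix)
```

Under this assumption, a subclass only needs to supply the element-wise part, and the reduction to a scalar is shared.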

apply
cost_matrix
class blocks.bricks.cost.MisclassificationRate(top_k=1)

Bases: blocks.bricks.cost.Cost

Calculates the misclassification rate for a mini-batch.

Parameters:top_k (int, optional) – If the ground truth class is within the top_k highest responses for a given example, the model is considered to have predicted correctly. Default: 1.

Notes

Ties for top_k-th place are broken pessimistically, i.e. in the (in practice, rare) case that there is a tie for top_k-th highest output for a given example, it is considered an incorrect prediction.
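The pessimistic tie-breaking rule can be sketched in plain Python: a prediction counts as a mistake when at least top_k other classes score at least as high as the true class. This is an illustrative sketch with hypothetical names and data, not the brick's Theano code.

```python
def misclassification_rate_sketch(y, y_hat, top_k=1):
    """Fraction of examples whose true class is not strictly within the top_k scores."""
    mistakes = 0
    for target, scores in zip(y, y_hat):
        truth = scores[target]
        # Count the other classes scoring at least as high as the true class.
        # Using >= (and subtracting 1 for the true class itself) means a tie
        # counts against the model.
        rivals = sum(1 for s in scores if s >= truth) - 1
        if rivals >= top_k:
            mistakes += 1
    return mistakes / len(y)


y = [0, 1, 2]
y_hat = [
    [0.9, 0.05, 0.05],  # clear win for class 0
    [0.5, 0.5, 0.0],    # true class 1 ties with class 0 for first place
    [0.2, 0.3, 0.5],    # clear win for class 2
]
rate_top1 = misclassification_rate_sketch(y, y_hat, top_k=1)
rate_top2 = misclassification_rate_sketch(y, y_hat, top_k=2)
```

With top_k=1 the tied second example is counted as an error, while with top_k=2 the same example is counted as correct because only one other class matches its score.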

apply
class blocks.bricks.cost.SquaredError(name=None)

Bases: blocks.bricks.cost.CostMatrix

cost_matrix

Wrapper bricks

class blocks.bricks.wrappers.BrickWrapper

Bases: object

Base class for wrapper metaclasses.

Sometimes one wants to extend a brick with the capability to handle inputs different from what it was designed to handle. A typical example is inputs with more dimensions than were foreseen at the development stage. One way to proceed in such a situation is to write a decorator that wraps all application methods of the brick class with some additional logic before and after the application call. BrickWrapper serves as a convenient base class for such decorators.

Note that, since directly applying a decorator to a Brick subclass will only take place after __new__() is called, subclasses of BrickWrapper should be applied by setting the decorators attribute of the new brick class, as in the example below:

>>> from blocks.bricks.base import Brick
>>> from blocks.bricks.wrappers import WithExtraDims
>>> class WrappedBrick(Brick):
...     decorators = [WithExtraDims()]
__call__(mcs, name, bases, namespace)

Calls wrap() for all applications of the base class.

wrap(wrapped, namespace)

Wrap an application of the base brick.

This method should be overridden to write all required changes into its namespace argument.

Parameters:
  • mcs (type) – The metaclass.
  • wrapped (Application) – The application to be wrapped.
  • namespace (dict) – The namespace of the class being created.
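The interplay between a decorators attribute, __call__() and wrap() can be illustrated with a self-contained sketch in plain Python. It does not use Blocks and does not reproduce its internals: `SketchMeta`, `SketchWrapper`, `Doubling` and the toy classes are hypothetical stand-ins, and the rule used to discover "applications" (public callables of the base classes) is an assumption.

```python
class SketchWrapper(object):
    """Hypothetical stand-in for a wrapper base class."""

    def __call__(self, mcs, name, bases, namespace):
        # Visit every public callable of the base classes and let wrap()
        # write its changes into the namespace of the class being created.
        for base in bases:
            for attr_name in dir(base):
                attr = getattr(base, attr_name)
                if callable(attr) and not attr_name.startswith("_"):
                    self.wrap(attr, namespace)

    def wrap(self, wrapped, namespace):
        raise NotImplementedError


class Doubling(SketchWrapper):
    """Toy wrapper: the wrapped method's result is multiplied by two."""

    def wrap(self, wrapped, namespace):
        def wrapper(self_, x):
            return 2 * wrapped(self_, x)
        namespace[wrapped.__name__] = wrapper


class SketchMeta(type):
    """Toy metaclass that runs the listed decorators before the class exists."""

    def __new__(mcs, name, bases, namespace):
        for decorator in namespace.get("decorators", []):
            decorator(mcs, name, bases, namespace)
        return super().__new__(mcs, name, bases, namespace)


class Base(metaclass=SketchMeta):
    def apply(self, x):
        return x + 1


class Wrapped(Base):
    decorators = [Doubling()]


print(Base().apply(3), Wrapped().apply(3))
```

Because the decorators run inside the metaclass before the class object is built, the wrapped method is already in place when the class is created, which an ordinary class decorator applied afterwards could not guarantee.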
class blocks.bricks.wrappers.WithExtraDims

Bases: blocks.bricks.wrappers.BrickWrapper

Wraps a brick’s applications to handle inputs with extra dimensions.

A brick can often be reused even when data has more dimensions than in the default setting. An example is a situation when one wants to apply categorical_cross_entropy() to temporal data, that is, when an additional ‘time’ axis is prepended to both its x and y inputs.

This wrapper adds reshapes required to use application methods of a brick with such data by merging the extra dimensions with the first non-extra one. Two key assumptions are made: that all inputs and outputs have the same number of extra dimensions and that these extra dimensions are equal throughout all inputs and outputs.

While this might be inconvenient, the wrapped brick does not try to guess the number of extra dimensions, but demands it as an argument. Considerations of simplicity and reliability motivated this design choice. Once Blocks provides a mechanism to request the expected number of dimensions for an input of a brick, this can be reconsidered.
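The reshaping idea can be sketched with nested Python lists standing in for tensors: the extra leading axes are merged into the first regular axis, the original function is applied, and the leading axes are restored on the output. This is an illustration only, not the wrapper's code; the helper names and the `extra_ndim` keyword used here are assumptions for the sketch.

```python
def flatten_leading(data, extra_ndim):
    """Merge extra_ndim leading axes into the axis that follows them."""
    sizes = []
    for _ in range(extra_ndim):
        sizes.append(len(data))  # length of the leading axis before this merge
        data = [item for sub in data for item in sub]
    return data, sizes


def restore_leading(data, sizes):
    """Undo flatten_leading by splitting the first axis back, innermost first."""
    for size in reversed(sizes):
        chunk = len(data) // size
        data = [data[i * chunk:(i + 1) * chunk] for i in range(size)]
    return data


def with_extra_dims_sketch(function):
    """Wrap a function so the caller states how many extra leading axes exist."""
    def wrapped(*inputs, extra_ndim=0):
        flat, sizes = [], []
        for value in inputs:
            # All inputs are assumed to share the same extra axes.
            value, sizes = flatten_leading(value, extra_ndim)
            flat.append(value)
        return restore_leading(function(*flat), sizes)
    return wrapped


def row_sums(matrix):
    """Toy function written for (batch, features) input."""
    return [sum(row) for row in matrix]


# (time=2, batch=3, features=2) data handled by a (batch, features) function.
temporal = [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
result = with_extra_dims_sketch(row_sums)(temporal, extra_ndim=1)
plain = with_extra_dims_sketch(row_sums)([[1, 2], [3, 4]])
```

The caller passes the number of extra axes explicitly, mirroring the design choice described above: the sketch never inspects the data to guess it.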

wrap(wrapped, namespace)