Algorithms

class blocks.algorithms.AdaDelta(decay_rate=0.95, epsilon=1e-06)

Bases: blocks.algorithms.StepRule

Adapts the step size over time using only first order information.

Parameters:
  • decay_rate (float, optional) – Decay rate in [0, 1]. Defaults to 0.95.
  • epsilon (float, optional) – Stabilizing constant for RMS. Defaults to 1e-6.

Notes

For more information, see [ADADELTA].

[ADADELTA] Matthew D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, arXiv:1212.5701.
compute_step(parameter, previous_step)
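As a rough illustration of the rule described in [ADADELTA], the following is a minimal scalar sketch in plain Python. It follows the update equations of the paper, not the library's source; the function name `adadelta_step` and the `state` dictionary are hypothetical. The sign convention assumes the step is subtracted from the parameter.

```python
import math


def adadelta_step(grad, state, decay_rate=0.95, epsilon=1e-6):
    """Return the step for one scalar gradient and update `state` in place.

    `state` is a hypothetical dict holding the two running averages.
    """
    # Running average of squared gradients.
    state["mean_sq_grad"] = (decay_rate * state["mean_sq_grad"]
                             + (1 - decay_rate) * grad ** 2)
    rms_grad = math.sqrt(state["mean_sq_grad"] + epsilon)
    # RMS of previous steps; epsilon keeps the very first step non-zero.
    rms_step = math.sqrt(state["mean_sq_step"] + epsilon)
    step = rms_step / rms_grad * grad
    # Running average of squared steps, used by the next call.
    state["mean_sq_step"] = (decay_rate * state["mean_sq_step"]
                             + (1 - decay_rate) * step ** 2)
    return step


state = {"mean_sq_grad": 0.0, "mean_sq_step": 0.0}
x = 1.0
for _ in range(3):
    x -= adadelta_step(2 * x, state)  # gradient of x**2 is 2*x
```

No learning rate appears: the ratio of the two RMS values sets the step size.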
class blocks.algorithms.AdaGrad(learning_rate=0.002, epsilon=1e-06)

Bases: blocks.algorithms.StepRule

Implements the AdaGrad learning rule.

Parameters:
  • learning_rate (float, optional) – Step size. Defaults to 0.002.
  • epsilon (float, optional) – Stabilizing constant for one over root of sum of squares. Defaults to 1e-6.

Notes

For more information, see [ADAGRAD].

[ADAGRAD] Duchi J, Hazan E, Singer Y., Adaptive subgradient methods for online learning and stochastic optimization, http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

compute_step(parameter, previous_step)
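A minimal scalar sketch of the AdaGrad rule, assuming the standard formulation from [ADAGRAD] in which the gradient is divided by the root of the accumulated sum of squares plus `epsilon`. The function name `adagrad_step` is hypothetical and the code is not taken from the library.

```python
import math


def adagrad_step(grad, sum_sq, learning_rate=0.002, epsilon=1e-6):
    """Return (step, new_sum_sq) for one scalar gradient."""
    # Accumulate the squared gradient over the whole history.
    sum_sq = sum_sq + grad ** 2
    # epsilon stabilizes the division when the sum is close to zero.
    step = learning_rate * grad / (math.sqrt(sum_sq) + epsilon)
    return step, sum_sq


sum_sq = 0.0
steps = []
for grad in [1.0, 1.0, 1.0]:
    step, sum_sq = adagrad_step(grad, sum_sq)
    steps.append(step)
```

With a constant gradient the steps shrink over time, because the accumulated sum only grows.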
class blocks.algorithms.Adam(learning_rate=0.002, beta1=0.1, beta2=0.001, epsilon=1e-08, decay_factor=0.99999999)

Bases: blocks.algorithms.StepRule

Adam optimizer as described in [King2014].

[King2014] Diederik Kingma, Jimmy Ba, Adam: A Method for Stochastic Optimization, http://arxiv.org/abs/1412.6980
Parameters:
  • learning_rate (float, optional) – Step size. Default value is set to 0.002.
  • beta1 (float, optional) – Exponential decay rate for the first moment estimates. Default value is set to 0.1.
  • beta2 (float, optional) – Exponential decay rate for the second moment estimates. Default value is set to 0.001.
  • epsilon (float, optional) – Default value is set to 1e-8.
  • decay_factor (float, optional) – Default value is set to 1 - 1e-8.
compute_step(parameter, previous_step)
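The defaults 0.1 and 0.001 suggest that here `beta1` and `beta2` weight the new gradient, so the usual decay rates are `1 - beta1` and `1 - beta2`. The scalar sketch below is written under that assumption; it omits `decay_factor` entirely and is not the library's implementation. The function name `adam_step` and the `state` dictionary are hypothetical.

```python
import math


def adam_step(grad, state, learning_rate=0.002, beta1=0.1, beta2=0.001,
              epsilon=1e-8):
    """Return the step for one scalar gradient and update `state` in place."""
    state["time"] += 1
    t = state["time"]
    # Bias-corrected learning rate (1 - beta plays the role of the decay rate).
    rate = (learning_rate * math.sqrt(1 - (1 - beta2) ** t)
            / (1 - (1 - beta1) ** t))
    # Exponential moving averages of the gradient and its square.
    state["mean"] = beta1 * grad + (1 - beta1) * state["mean"]
    state["variance"] = beta2 * grad ** 2 + (1 - beta2) * state["variance"]
    return rate * state["mean"] / (math.sqrt(state["variance"]) + epsilon)


adam_state = {"time": 0, "mean": 0.0, "variance": 0.0}
first_step = adam_step(1.0, adam_state)
```

Under these assumptions the first step has a magnitude close to `learning_rate`, whatever the scale of the gradient.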
class blocks.algorithms.BasicMomentum(momentum=0.0)

Bases: blocks.algorithms.StepRule

Accumulates step with exponential discount.

Parameters: momentum (float, optional) – The momentum coefficient. Defaults to 0.

Notes

This step rule is intended to be used in conjunction with another step rule, _e.g._ Scale. For an all-batteries-included experience, look at Momentum.

compute_step(parameter, previous_step)
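"Accumulates step with exponential discount" can be sketched for a scalar as follows. This assumes the velocity is simply the discounted sum of all previous steps; the function name `basic_momentum_step` is hypothetical.

```python
def basic_momentum_step(previous_step, velocity, momentum=0.0):
    """Return (step, new_velocity) for one scalar previous step."""
    # Discount the accumulated velocity and add the incoming step.
    step = momentum * velocity + previous_step
    # The new velocity is the step that was just produced.
    return step, step


velocity = 0.0
history = []
for _ in range(3):
    step, velocity = basic_momentum_step(1.0, velocity, momentum=0.5)
    history.append(step)
```

No learning rate is involved, which is why this rule is meant to be chained with a rule such as Scale.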
class blocks.algorithms.BasicRMSProp(decay_rate=0.9, max_scaling=100000.0)

Bases: blocks.algorithms.StepRule

Scales the step size by a running average of the recent step norms.

Parameters:
  • decay_rate (float, optional) – How fast the running average decays, value in [0, 1] (lower is faster). Defaults to 0.9.
  • max_scaling (float, optional) – Maximum scaling of the step size, in case the running average is really small. Needs to be greater than 0. Defaults to 1e5.

Notes

This step rule is intended to be used in conjunction with another step rule, _e.g._ Scale. For an all-batteries-included experience, look at RMSProp.

In general, this step rule should be used _before_ other step rules, because it has normalization properties that may undo their work. For instance, it should be applied first when used in conjunction with Scale.

For more information, see [Hint2014].

compute_step(parameter, previous_step)
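A scalar sketch of the scaling described above, assuming the running average is taken over squared steps and that `max_scaling` bounds the factor from above by flooring the RMS at `1 / max_scaling`. The function name `basic_rmsprop_step` is hypothetical and the code is illustrative only.

```python
import math


def basic_rmsprop_step(previous_step, mean_sq, decay_rate=0.9,
                       max_scaling=1e5):
    """Return (step, new_mean_sq) for one scalar previous step."""
    # Running average of the squared step.
    mean_sq = decay_rate * mean_sq + (1 - decay_rate) * previous_step ** 2
    # Floor the RMS so the scaling factor never exceeds max_scaling.
    rms = max(math.sqrt(mean_sq), 1.0 / max_scaling)
    return previous_step / rms, mean_sq


large_step, large_mean_sq = basic_rmsprop_step(4.0, 0.0)
tiny_step, tiny_mean_sq = basic_rmsprop_step(1e-9, 0.0)
```

For the tiny input the floor is active, so the result is the input multiplied by `max_scaling`.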
class blocks.algorithms.CompositeRule(components)

Bases: blocks.algorithms.StepRule

Chains several step rules.

Parameters: components (list of StepRule) – The learning rules to be chained. The rules are applied in the order given.
compute_steps(previous_steps)
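The effect of the ordering can be sketched without Theano. In the illustration below each rule is a plain function from a dictionary of steps to a `(steps, updates)` pair; `scale`, `clip` and `compose` are hypothetical stand-ins, not the library's classes.

```python
def scale(learning_rate):
    """Rule that multiplies every step by a constant."""
    def rule(previous_steps):
        return {p: learning_rate * s for p, s in previous_steps.items()}, []
    return rule


def clip(limit):
    """Rule that clips every scalar step to [-limit, limit]."""
    def rule(previous_steps):
        return {p: max(-limit, min(limit, s))
                for p, s in previous_steps.items()}, []
    return rule


def compose(components):
    """Chain rules: the output of one is the input of the next."""
    def rule(previous_steps):
        steps, updates = previous_steps, []
        for component in components:
            steps, more_updates = component(steps)
            updates.extend(more_updates)
        return steps, updates
    return rule


clip_then_scale, _ = compose([clip(1.0), scale(0.1)])({"w": 5.0})
scale_then_clip, _ = compose([scale(0.1), clip(1.0)])({"w": 5.0})
```

The two orderings give different steps for the same input, so the order of `components` matters.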
class blocks.algorithms.DifferentiableCostMinimizer(cost, parameters)

Bases: blocks.algorithms.TrainingAlgorithm

Minimizes a differentiable cost given as a Theano expression.

Very often the goal of training is to minimize the expected value of a Theano expression. Batch processing in this case typically consists of running one (or a few) Theano functions. DifferentiableCostMinimizer is the base class for such algorithms.

Parameters:
  • cost (TensorVariable) – The objective to be minimized.
  • parameters (list of TensorSharedVariable) – The parameters to be tuned.
updates

list of TensorSharedVariable updates

Updates to be done for every batch. It is required that the updates are done using the old values of optimized parameters.

cost

TensorVariable

The objective to be minimized.

parameters

list of TensorSharedVariable

The parameters to be tuned.

Notes

Changing the updates attribute or calling add_updates after the initialize method has been called has no effect.

Todo

Some shared variables are not parameters (e.g. those created by random streams).

Todo

Because the ComputationGraph class is still immature, parameters that are used only inside scans are currently not fetched.

add_updates(updates)

Add updates to the training process.

The updates will be done _before_ the parameters are changed.

Parameters: updates (list of tuples or OrderedDict) – The updates to add.
inputs

Return inputs of the cost computation graph.

Returns: inputs – Inputs to this graph.
Return type: list of TensorVariable
updates
class blocks.algorithms.GradientDescent(step_rule=None, gradients=None, known_grads=None, consider_constant=None, on_unused_sources='raise', theano_func_kwargs=None, **kwargs)

Bases: blocks.algorithms.DifferentiableCostMinimizer

A base class for all gradient descent algorithms.

By “gradient descent” we mean a training algorithm of the following form:

for batch in data:
    steps = step_rule.compute_steps(parameters,
                                    gradients_wr_parameters)
    for parameter in parameters:
        parameter -= steps[parameter]

Note that the step is subtracted, not added. This is done in order to make step rule chaining possible.
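The loop can be made concrete with a small plain-Python sketch. It assumes a toy cost (the sum of squared parameters) and a step rule that only multiplies the gradients by a learning rate; the function `compute_steps` and the parameter names `w` and `b` are hypothetical, and no Theano graph is built.

```python
from collections import OrderedDict


def compute_steps(previous_steps, learning_rate=0.1):
    """Toy step rule: scale every gradient by the learning rate."""
    return OrderedDict((name, learning_rate * gradient)
                       for name, gradient in previous_steps.items())


parameters = OrderedDict([("w", 4.0), ("b", -2.0)])
for _ in range(100):
    # Gradient of the toy cost sum(p**2) with respect to each parameter.
    gradients = OrderedDict((name, 2 * value)
                            for name, value in parameters.items())
    steps = compute_steps(gradients)
    for name in parameters:
        # The step is subtracted, as in the scheme above.
        parameters[name] -= steps[name]
```

Both parameters move towards zero, the minimizer of the toy cost.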

Parameters:
  • step_rule (instance of StepRule, optional) – An object encapsulating most of the algorithm’s logic. Its compute_steps method is called to get Theano expressions for the steps. Note that the step rule may have state, e.g. to remember a weighted sum of gradients from previous steps, as is done in gradient descent with momentum. If None, an instance of Scale is created.
  • gradients (dict, optional) – A dictionary mapping a parameter to an expression for the cost’s gradient with respect to the parameter. If None, the gradients are taken automatically using theano.gradient.grad().
  • known_grads (dict, optional) – A passthrough to theano.tensor.grad’s known_grads argument. Useful when you know the [approximate] gradients of some sub-expressions and would like Theano to use that information to compute parameter gradients. Only makes sense when gradients is None.
  • consider_constant (list, optional) – A passthrough to theano.tensor.grad’s consider_constant argument. A list of expressions through which gradients will not be backpropagated. Only makes sense when gradients is None.
  • on_unused_sources (str, one of ‘raise’ (default), ‘ignore’, ‘warn’) – Controls behavior when not all sources are used.
  • theano_func_kwargs (dict, optional) – A passthrough to theano.function for additional arguments. Useful for passing profile or mode arguments to the theano function that will be compiled for the algorithm.
gradients

dict

The gradient dictionary.

step_rule

instance of StepRule

The step rule.

initialize()
process_batch(batch)
class blocks.algorithms.Momentum(learning_rate=1.0, momentum=0.0)

Bases: blocks.algorithms.CompositeRule

Accumulates step with exponential discount.

Combines BasicMomentum and Scale to form the usual momentum step rule.

Parameters:
  • learning_rate (float, optional) – The learning rate by which the previous step is scaled. Defaults to 1.
  • momentum (float, optional) – The momentum coefficient. Defaults to 0.
learning_rate

SharedVariable

A variable for learning rate.

momentum

SharedVariable

A variable for momentum.

See also

SharedVariableModifier

class blocks.algorithms.RMSProp(learning_rate=1.0, decay_rate=0.9, max_scaling=100000.0)

Bases: blocks.algorithms.CompositeRule

Scales the step size by a running average of the recent step norms.

Combines BasicRMSProp and Scale to form the step rule described in [Hint2014].

[Hint2014] Geoff Hinton, Neural Networks for Machine Learning, lecture 6a, http://cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Parameters:
  • learning_rate (float, optional) – The learning rate by which the previous step is scaled. Defaults to 1.
  • decay_rate (float, optional) – How fast the running average decays (lower is faster). Defaults to 0.9.
  • max_scaling (float, optional) – Maximum scaling of the step size, in case the running average is really small. Defaults to 1e5.
learning_rate

SharedVariable

A variable for learning rate.

decay_rate

SharedVariable

A variable for decay rate.

See also

SharedVariableModifier

class blocks.algorithms.RemoveNotFinite(scaler=1)

Bases: blocks.algorithms.StepRule

A step rule that skips steps with non-finite elements.

Replaces a step (the parameter update of a single shared variable) which contains non-finite elements (such as inf or NaN) with a step rescaling the parameters.

Parameters: scaler (float, optional) – The scaling applied to the parameter in case the step contains non-finite elements. Defaults to 1, which means that parameters will not be changed.

Notes

This rule should be applied last!

This trick was originally used in the GroundHog framework.

compute_step(parameter, previous_step)
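Because steps are subtracted from parameters, "rescaling the parameters" by `scaler` corresponds to a step of `(1 - scaler) * parameter`. The sketch below illustrates this on plain lists of floats; it follows the description above, not the library's source, and the function name `remove_not_finite` is hypothetical.

```python
import math


def remove_not_finite(parameter, previous_step, scaler=1.0):
    """Return the step to subtract from `parameter` (lists of floats)."""
    if any(not math.isfinite(s) for s in previous_step):
        # parameter - (1 - scaler) * parameter == scaler * parameter
        return [(1 - scaler) * p for p in parameter]
    # All elements are finite: pass the step through unchanged.
    return list(previous_step)


parameter = [1.0, 2.0]
kept = remove_not_finite(parameter, [0.1, 0.2])
skipped = remove_not_finite(parameter, [float("nan"), 0.2])
shrunk = remove_not_finite(parameter, [float("inf"), 0.2], scaler=0.5)
```

With the default `scaler` of 1 a bad step becomes a zero step, so the parameter is left as it was.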
class blocks.algorithms.Restrict(step_rule, variables)

Bases: blocks.algorithms.StepRule

Applies a given StepRule only to certain variables.

Example applications include clipping steps on only certain parameters, or scaling a certain kind of parameter’s updates (e.g. adding an additional scalar multiplier to the steps taken on convolutional filters).

Parameters:
  • step_rule (StepRule) – The StepRule to be applied on the given variables.
  • variables (iterable) – A collection of Theano variables on which to apply step_rule. Variables not appearing in this collection will not have step_rule applied to them.
compute_steps(previous_steps)
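The selection logic can be sketched in plain Python. Here a rule is modelled as a function from a dictionary of steps to a `(steps, updates)` pair; `restrict` and `halve` are hypothetical names, and strings stand in for Theano variables.

```python
def restrict(rule, variables):
    """Apply `rule` only to the steps of the given variables."""
    selected = set(variables)

    def restricted(previous_steps):
        chosen = {p: s for p, s in previous_steps.items() if p in selected}
        new_steps, updates = rule(chosen)
        # Variables outside the selection keep their previous step.
        merged = {p: new_steps.get(p, s) for p, s in previous_steps.items()}
        return merged, updates

    return restricted


def halve(previous_steps):
    """Toy rule: halve every step."""
    return {p: 0.5 * s for p, s in previous_steps.items()}, []


steps, updates = restrict(halve, ["w"])({"w": 2.0, "b": 2.0})
```

Only the step for `w` is halved; the step for `b` passes through untouched.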
class blocks.algorithms.Scale(learning_rate=1.0)

Bases: blocks.algorithms.StepRule

A step in the direction proportional to the previous step.

If used in GradientDescent alone, this step rule implements steepest descent.

Parameters: learning_rate (float) – The learning rate by which the previous step is multiplied to produce the step.
learning_rate

TensorSharedVariable

The shared variable storing the learning rate used.

compute_step(parameter, previous_step)
class blocks.algorithms.StepClipping(threshold=None)

Bases: blocks.algorithms.StepRule

Rescales an entire step if its L2 norm exceeds a threshold.

When the previous steps are the gradients, this step rule performs gradient clipping.

Parameters: threshold (float, optional) – The maximum permitted L2 norm for the step. The step will be rescaled so that its norm does not exceed this quantity. If None, no rescaling will be applied.
threshold

tensor.TensorSharedVariable

The shared variable storing the clipping threshold used.

compute_steps(previous_steps)
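A minimal sketch of the rescaling, assuming the L2 norm is computed jointly over the steps of all parameters (the "entire step"). Steps are plain lists of floats and the function name `clip_steps` is hypothetical.

```python
import math


def clip_steps(previous_steps, threshold=None):
    """Rescale all steps together if their joint L2 norm exceeds threshold."""
    if threshold is None:
        return {p: list(step) for p, step in previous_steps.items()}
    # Joint norm over every element of every step.
    norm = math.sqrt(sum(s ** 2
                         for step in previous_steps.values()
                         for s in step))
    factor = 1.0 if norm <= threshold else threshold / norm
    return {p: [factor * s for s in step]
            for p, step in previous_steps.items()}


raw = {"w": [3.0], "b": [4.0]}  # joint norm is 5
clipped = clip_steps(raw, threshold=1.0)
unclipped = clip_steps(raw)
```

All steps are multiplied by the same factor, so the direction of the overall step is preserved.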
class blocks.algorithms.StepRule

Bases: object

A rule to compute steps for a gradient descent algorithm.

compute_step(parameter, previous_step)

Build a Theano expression for the step for a parameter.

This method is called by the default implementation of compute_steps(), which saves you from writing a loop over the parameters each time.

Parameters:
  • parameter (TensorSharedVariable) – The parameter.
  • previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns:

  • step (Variable) – Theano variable for the step to take.
  • updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum which need to update shared variables after iterations.

compute_steps(previous_steps)

Build a Theano expression for steps for all parameters.

Override this method if you want to process the steps with respect to all parameters as a whole, not parameter-wise.

Parameters: previous_steps (OrderedDict) – An OrderedDict of (TensorSharedVariable, TensorVariable) pairs. The keys are the parameters being trained, the values are the expressions for quantities related to gradients of the cost with respect to the parameters, either the gradients themselves or steps in related directions.
Returns:
  • steps (OrderedDict) – A dictionary of the proposed steps in the same form as previous_steps.
  • updates (list) – A list of tuples representing updates to be performed.
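The relationship between the two methods can be sketched as follows. The classes `SketchStepRule` and `Halve` are hypothetical stand-ins that operate on floats keyed by name, not on Theano variables, and the loop is only an assumption about what a default `compute_steps` might look like.

```python
from collections import OrderedDict


class SketchStepRule:
    """Stand-in base class: compute_steps loops over compute_step."""

    def compute_step(self, parameter, previous_step):
        raise NotImplementedError

    def compute_steps(self, previous_steps):
        steps, updates = OrderedDict(), []
        for parameter, previous_step in previous_steps.items():
            step, more_updates = self.compute_step(parameter, previous_step)
            steps[parameter] = step
            updates.extend(more_updates)
        return steps, updates


class Halve(SketchStepRule):
    """Parameter-wise rule: only compute_step needs to be written."""

    def compute_step(self, parameter, previous_step):
        return 0.5 * previous_step, []


steps, updates = Halve().compute_steps(OrderedDict([("w", 2.0), ("b", 4.0)]))
```

A rule that needs to look at all parameters at once would override `compute_steps` instead.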
class blocks.algorithms.TrainingAlgorithm

Bases: object

Base class for training algorithms.

A training algorithm object has a simple life-cycle. First it is initialized by calling its initialize() method. At this stage, for instance, Theano functions can be compiled. After that the process_batch() method is repeatedly called with a batch of training data as a parameter.

initialize(**kwargs)

Initialize the training algorithm.

process_batch(batch)

Process a batch of training data.

batch

dict

A dictionary of (source name, data) pairs.
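The life-cycle can be illustrated with a toy class. `CountingAlgorithm` and the source name `features` are hypothetical; a real algorithm would compile Theano functions in `initialize()` and run them in `process_batch()`.

```python
class CountingAlgorithm:
    """Toy algorithm that only counts the examples it is given."""

    def __init__(self):
        self.initialized = False
        self.examples_seen = 0

    def initialize(self, **kwargs):
        # A real implementation could compile Theano functions here.
        self.initialized = True

    def process_batch(self, batch):
        if not self.initialized:
            raise RuntimeError("initialize() must be called first")
        # `batch` maps source names to data.
        self.examples_seen += len(batch["features"])


algorithm = CountingAlgorithm()
algorithm.initialize()
for batch in [{"features": [1, 2, 3]}, {"features": [4, 5]}]:
    algorithm.process_batch(batch)
```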

class blocks.algorithms.VariableClipping(threshold, axis=None)

Bases: blocks.algorithms.StepRule

Clip the maximum norm of individual variables along certain axes.

This StepRule can be used to implement L2 norm constraints on e.g. the weight vectors of individual hidden units, convolutional filters or entire weight tensors. Combine with Restrict (and possibly CompositeRule) to apply such constraints only to certain variables and/or apply different norm constraints to different variables.

Parameters:
  • threshold (float) – Maximum norm for a given (portion of a) tensor.
  • axis (int or iterable, optional) – An integer single axis, or an iterable collection of integer axes over which to sum in order to calculate the L2 norm. If None (the default), the norm is computed over all elements of the tensor.

Notes

Because of the way the StepRule API works, this particular rule implements norm clipping of the value after update in the following way: it computes parameter - previous_step, scales it to have (possibly axes-wise) norm(s) of at most threshold, then subtracts that value from parameter to yield an ‘equivalent step’ that respects the desired norm constraints. This procedure implicitly assumes one is doing simple (stochastic) gradient descent, and so steps computed by this step rule may not make sense for use in other contexts.

Investigations into max-norm regularization date from [Srebro2005]. The first appearance of this technique as a regularization method for the weight vectors of individual hidden units in feed-forward neural networks may be [Hinton2012].

[Srebro2005] Nathan Srebro and Adi Shraibman. “Rank, Trace-Norm and Max-Norm”. 18th Annual Conference on Learning Theory (COLT), June 2005.
[Hinton2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov. “Improving neural networks by preventing co-adaptation of feature detectors”. arXiv:1207.0580.
compute_step(parameter, previous_step)
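The "equivalent step" procedure from the notes can be sketched for a flat vector. This covers only the `axis=None` case, uses plain lists of floats, and the function name `variable_clipping_step` is hypothetical.

```python
import math


def variable_clipping_step(parameter, previous_step, threshold):
    """Return a step such that parameter - step has norm <= threshold."""
    # Value the parameter would take after a plain update.
    updated = [p - s for p, s in zip(parameter, previous_step)]
    norm = math.sqrt(sum(u * u for u in updated))
    if norm > threshold:
        # Scale the updated value back onto the norm ball.
        updated = [u * threshold / norm for u in updated]
    # Equivalent step: subtracting it from parameter yields `updated`.
    return [p - u for p, u in zip(parameter, updated)]


parameter = [3.0, 4.0]
step = variable_clipping_step(parameter, [0.0, 0.0], threshold=1.0)
new_parameter = [p - s for p, s in zip(parameter, step)]
new_norm = math.sqrt(sum(v * v for v in new_parameter))
```

Even with a zero incoming step the rule returns a non-zero step, because the parameter itself violates the constraint.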