Algorithms¶
- class blocks.algorithms.AdaDelta(decay_rate=0.95, epsilon=1e-06)¶
Bases: blocks.algorithms.StepRule
Adapts the step size over time using only first-order information.
Parameters: - decay_rate (float, optional) – Decay rate in [0, 1]. Defaults to 0.95.
- epsilon (float, optional) – Stabilizing constant for RMS. Defaults to 1e-6.
Notes
For more information, see [ADADELTA].
[ADADELTA] Matthew D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, arXiv:1212.5701.
- compute_step(parameter, previous_step)¶
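The update described in [ADADELTA] can be sketched in plain NumPy. This is an illustration of the rule from the paper, not the Blocks implementation; the function name adadelta_step and the explicit state arguments are assumptions made for the example.

```python
import numpy as np


def adadelta_step(grad, mean_sq_grad, mean_sq_step, decay_rate=0.95, epsilon=1e-6):
    """One ADADELTA update for one parameter (illustrative NumPy sketch)."""
    # Running average of the squared gradients.
    mean_sq_grad = decay_rate * mean_sq_grad + (1 - decay_rate) * grad ** 2
    # RMS of past steps divided by RMS of gradients sets the step size.
    step = np.sqrt(mean_sq_step + epsilon) / np.sqrt(mean_sq_grad + epsilon) * grad
    # Running average of the squared steps, used on the next call.
    mean_sq_step = decay_rate * mean_sq_step + (1 - decay_rate) * step ** 2
    return step, mean_sq_grad, mean_sq_step


# Toy problem: minimize f(x) = x**2, whose gradient is 2 * x.
x = np.array([1.0])
mean_sq_grad = np.zeros_like(x)
mean_sq_step = np.zeros_like(x)
for _ in range(50):
    step, mean_sq_grad, mean_sq_step = adadelta_step(2 * x, mean_sq_grad, mean_sq_step)
    x = x - step  # the step is subtracted, as in GradientDescent
```

No learning rate appears: the ratio of the two running RMS values plays that role, and epsilon keeps the very first step from being zero.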
- class blocks.algorithms.AdaGrad(learning_rate=0.002, epsilon=1e-06)¶
Bases: blocks.algorithms.StepRule
Implements the AdaGrad learning rule.
Parameters: - learning_rate (float, optional) – Step size. Default value is set to 0.002.
- epsilon (float, optional) – Stabilizing constant for one over root of sum of squares. Defaults to 1e-6.
Notes
For more information, see [ADAGRAD].
[ADAGRAD] Duchi J, Hazan E, Singer Y., *Adaptive subgradient methods for online learning and stochastic optimization*.
- compute_step(parameter, previous_step)¶
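As a rough illustration of the AdaGrad rule, the NumPy sketch below divides each gradient by the root of the accumulated sum of squares. The function name adagrad_step and the explicit sum_sq state are assumptions of the example, and placing epsilon outside the square root is one common choice rather than a documented detail.

```python
import numpy as np


def adagrad_step(grad, sum_sq, learning_rate=0.002, epsilon=1e-6):
    """One AdaGrad update for one parameter (illustrative NumPy sketch)."""
    # Accumulate the squared gradients seen so far.
    sum_sq = sum_sq + grad ** 2
    # Scale the gradient by one over the root of the sum of squares.
    step = learning_rate * grad / (np.sqrt(sum_sq) + epsilon)
    return step, sum_sq


grad = np.array([3.0, -4.0])
first_step, sum_sq = adagrad_step(grad, np.zeros_like(grad))
second_step, sum_sq = adagrad_step(grad, sum_sq)
```

Because the sum of squares only grows, repeating the same gradient yields a smaller step on the second call than on the first.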
- class blocks.algorithms.Adam(learning_rate=0.002, beta1=0.1, beta2=0.001, epsilon=1e-08, decay_factor=0.99999999)¶
Bases: blocks.algorithms.StepRule
Adam optimizer as described in [King2014].
[King2014] Diederik Kingma, Jimmy Ba, Adam: A Method for Stochastic Optimization, http://arxiv.org/abs/1412.6980
Parameters: - learning_rate (float, optional) – Step size. Default value is set to 0.002.
- beta1 (float, optional) – Exponential decay rate for the first moment estimates. Default value is set to 0.1.
- beta2 (float, optional) – Exponential decay rate for the second moment estimates. Default value is set to 0.001.
- epsilon (float, optional) – Default value is set to 1e-8.
- decay_factor (float, optional) – Default value is set to 1 - 1e-8.
- compute_step(parameter, previous_step)¶
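The defaults beta1=0.1 and beta2=0.001 suggest that these coefficients weight the new gradient rather than the running averages. The NumPy sketch below assumes that reading, and also assumes that decay_factor slowly reduces the weight given to the old first-moment estimate; both points are assumptions, and the function name adam_step is hypothetical.

```python
import numpy as np


def adam_step(grad, mean, variance, t, learning_rate=0.002, beta1=0.1,
              beta2=0.001, epsilon=1e-8, decay_factor=1 - 1e-8):
    """One Adam-style update; t is the 1-based iteration count (sketch)."""
    # Bias-corrected learning rate for iteration t.
    lr_t = learning_rate * np.sqrt(1 - (1 - beta2) ** t) / (1 - (1 - beta1) ** t)
    # Assumed role of decay_factor: anneal the first-moment coefficient.
    beta1_t = 1 - (1 - beta1) * decay_factor ** (t - 1)
    # Exponential moving averages of the gradient and its square.
    mean = beta1_t * grad + (1 - beta1_t) * mean
    variance = beta2 * grad ** 2 + (1 - beta2) * variance
    step = lr_t * mean / (np.sqrt(variance) + epsilon)
    return step, mean, variance


grad = np.array([0.5, -3.0])
step, mean, variance = adam_step(grad, np.zeros(2), np.zeros(2), t=1)
```

Under these assumptions the first step has magnitude close to learning_rate for every element, whatever the scale of the gradient.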
- class blocks.algorithms.BasicMomentum(momentum=0.0)¶
Bases: blocks.algorithms.StepRule
Accumulates step with exponential discount.
Parameters: momentum (float, optional) – The momentum coefficient. Defaults to 0.
Notes
This step rule is intended to be used in conjunction with another step rule, _e.g._ Scale. For an all-batteries-included experience, look at Momentum.
- compute_step(parameter, previous_step)¶
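A minimal NumPy sketch of accumulating a step with exponential discount follows. The function name basic_momentum_step and the explicit velocity argument are assumptions of the example; in Blocks the state would live in a shared variable.

```python
import numpy as np


def basic_momentum_step(previous_step, velocity, momentum=0.0):
    """Add the discounted velocity to the incoming step (illustrative sketch)."""
    velocity = momentum * velocity + previous_step
    # The new velocity is both the step to take and the state to keep.
    return velocity, velocity


velocity = np.zeros(1)
history = []
for _ in range(3):
    step, velocity = basic_momentum_step(np.array([1.0]), velocity, momentum=0.5)
    history.append(float(step[0]))
# history is now [1.0, 1.5, 1.75]
```

With a constant incoming step of 1 and momentum 0.5, the emitted step approaches 1 / (1 - 0.5) = 2, which is why this rule is usually paired with a learning-rate rule such as Scale.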
- class blocks.algorithms.BasicRMSProp(decay_rate=0.9, max_scaling=100000.0)¶
Bases: blocks.algorithms.StepRule
Scales the step size by a running average of the recent step norms.
Parameters: - decay_rate (float, optional) – How fast the running average decays, value in [0, 1] (lower is faster). Defaults to 0.9.
- max_scaling (float, optional) – Maximum scaling of the step size, in case the running average is really small. Needs to be greater than 0. Defaults to 1e5.
Notes
This step rule is intended to be used in conjunction with another step rule, _e.g._ Scale. For an all-batteries-included experience, look at RMSProp.
In general, this step rule should be used _before_ other step rules, because it has normalization properties that may undo their work. For instance, it should be applied first when used in conjunction with Scale.
For more information, see [Hint2014].
- compute_step(parameter, previous_step)¶
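The scaling described above can be sketched in NumPy as follows. The function name basic_rmsprop_step and the explicit mean_square state are assumptions of the example, and the sketch reads max_scaling as a lower bound of 1 / max_scaling on the running RMS.

```python
import numpy as np


def basic_rmsprop_step(previous_step, mean_square, decay_rate=0.9, max_scaling=1e5):
    """Divide the step by a running RMS of recent steps (illustrative sketch)."""
    # Running average of the squared steps.
    mean_square = decay_rate * mean_square + (1 - decay_rate) * previous_step ** 2
    # Bound the RMS from below so the scaling never exceeds max_scaling.
    rms = np.maximum(np.sqrt(mean_square), 1.0 / max_scaling)
    return previous_step / rms, mean_square


normal_step, _ = basic_rmsprop_step(np.array([2.0]), np.zeros(1))
tiny_step, _ = basic_rmsprop_step(np.array([1e-12]), np.zeros(1))
```

For the tiny input the running RMS falls below 1 / max_scaling, so the step is multiplied by exactly max_scaling instead of growing without bound.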
- class blocks.algorithms.CompositeRule(components)¶
Bases: blocks.algorithms.StepRule
Chains several step rules.
Parameters: components (list of StepRule) – The learning rules to be chained. The rules are applied in the order given.
- compute_steps(previous_steps)¶
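To see why the order of the components matters, here is a small pure-Python sketch of chaining. The rules scale and clip_value, the helper composite, and the use of strings as parameter keys are all hypothetical stand-ins for StepRule objects and Theano variables.

```python
from collections import OrderedDict


def scale(learning_rate):
    """Hypothetical rule: multiply every step by learning_rate."""
    def compute_steps(previous_steps):
        steps = OrderedDict((p, learning_rate * s) for p, s in previous_steps.items())
        return steps, []
    return compute_steps


def clip_value(limit):
    """Hypothetical rule: clamp every step to [-limit, limit]."""
    def compute_steps(previous_steps):
        steps = OrderedDict((p, max(-limit, min(limit, s)))
                            for p, s in previous_steps.items())
        return steps, []
    return compute_steps


def composite(components):
    """Apply the component rules one after another, in the order given."""
    def compute_steps(previous_steps):
        steps, updates = previous_steps, []
        for component in components:
            steps, more_updates = component(steps)
            updates.extend(more_updates)
        return steps, updates
    return compute_steps


gradients = OrderedDict([("w", 5.0)])
clip_then_scale, _ = composite([clip_value(1.0), scale(0.1)])(gradients)
scale_then_clip, _ = composite([scale(0.1), clip_value(1.0)])(gradients)
```

Clipping first and then scaling gives a step of about 0.1, while scaling first leaves a step of 0.5 that the clip no longer touches.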
- class blocks.algorithms.DifferentiableCostMinimizer(cost, parameters)¶
Bases: blocks.algorithms.TrainingAlgorithm
Minimizes a differentiable cost given as a Theano expression.
Very often the goal of training is to minimize the expected value of a Theano expression. Batch processing in this case typically consists of running one or a few Theano functions. DifferentiableCostMinimizer is the base class for such algorithms.
Parameters: - cost (TensorVariable) – The objective to be minimized.
- parameters (list of TensorSharedVariable) – The parameters to be tuned.
- updates¶
list of TensorSharedVariable updates
Updates to be done for every batch. It is required that the updates are done using the old values of optimized parameters.
- cost¶
TensorVariable
The objective to be minimized.
- parameters¶
list of TensorSharedVariable
The parameters to be tuned.
Notes
Changing the updates attribute or calling add_updates after the initialize method has been called will have no effect.
Todo
Some shared variables are not parameters (e.g. those created by random streams).
Todo
Because the ComputationGraph class is still immature, parameters used only inside scans are currently not fetched.
- add_updates(updates)¶
Add updates to the training process.
The updates will be done _before_ the parameters are changed.
Parameters: updates (list of tuples or OrderedDict) – The updates to add.
- inputs¶
Return inputs of the cost computation graph.
Returns: inputs – Inputs to this graph.
Return type: list of TensorVariable
- updates
- class blocks.algorithms.GradientDescent(step_rule=None, gradients=None, known_grads=None, consider_constant=None, on_unused_sources='raise', theano_func_kwargs=None, **kwargs)¶
Bases: blocks.algorithms.DifferentiableCostMinimizer
A base class for all gradient descent algorithms.
By “gradient descent” we mean a training algorithm of the following form:
for batch in data:
    steps = step_rule.compute_steps(parameters, gradients_wr_parameters)
    for parameter in parameters:
        parameter -= steps[parameter]
Note that the step is subtracted, not added! This is done in order to make step rule chaining possible.
Parameters: - step_rule (instance of StepRule, optional) – An object encapsulating most of the algorithm’s logic. Its compute_steps method is called to get Theano expressions for the steps. Note that the step rule might have a state, e.g. to remember a weighted sum of gradients from previous steps, as is done in gradient descent with momentum. If None, an instance of Scale is created.
- gradients (dict, optional) – A dictionary mapping a parameter to an expression for the cost’s gradient with respect to the parameter. If None, the gradients are taken automatically using theano.gradient.grad().
- known_grads (dict, optional) – A passthrough to theano.tensor.grad’s known_grads argument. Useful when you know the [approximate] gradients of some sub-expressions and would like Theano to use that information to compute parameter gradients. Only makes sense when gradients is None.
- consider_constant (list, optional) – A passthrough to theano.tensor.grad’s consider_constant argument. A list of expressions through which gradients will not be backpropagated. Only makes sense when gradients is None.
- on_unused_sources (str, one of ‘raise’ (default), ‘ignore’, ‘warn’) – Controls behavior when not all sources are used.
- theano_func_kwargs (dict, optional) – A passthrough to theano.function for additional arguments. Useful for passing profile or mode arguments to the theano function that will be compiled for the algorithm.
- gradients¶
dict
The gradient dictionary.
- initialize()¶
- process_batch(batch)¶
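The training loop described for GradientDescent can be imitated in NumPy on a toy quadratic cost. Everything here is a stand-in: scale_rule mimics a step rule that multiplies gradients by a learning rate, parameters are keyed by name instead of by shared variable, and the gradient is written by hand rather than derived by Theano.

```python
import numpy as np


def scale_rule(learning_rate):
    """Hypothetical step rule: step = learning_rate * gradient."""
    def compute_steps(previous_steps):
        return {name: learning_rate * g for name, g in previous_steps.items()}
    return compute_steps


# Toy cost: sum((w - target) ** 2), with gradient 2 * (w - target).
target = np.array([1.0, 1.0])
parameters = {"w": np.array([4.0, -2.0])}
compute_steps = scale_rule(0.1)

for _ in range(100):
    gradients = {"w": 2 * (parameters["w"] - target)}
    steps = compute_steps(gradients)
    for name in parameters:
        # The step is subtracted, not added.
        parameters[name] = parameters[name] - steps[name]
```

Because every rule returns a quantity to subtract, the output of one rule can be fed to the next as its previous step, which is what makes chaining possible.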
- class blocks.algorithms.Momentum(learning_rate=1.0, momentum=0.0)¶
Bases: blocks.algorithms.CompositeRule
Accumulates step with exponential discount.
Combines BasicMomentum and Scale to form the usual momentum step rule.
Parameters: - learning_rate (float, optional) – The learning rate by which the previous step is scaled. Defaults to 1.
- momentum (float, optional) – The momentum coefficient. Defaults to 0.
- learning_rate¶
SharedVariable
A variable for learning rate.
- momentum¶
SharedVariable
A variable for momentum.
See also
SharedVariableModifier
- class blocks.algorithms.RMSProp(learning_rate=1.0, decay_rate=0.9, max_scaling=100000.0)¶
Bases: blocks.algorithms.CompositeRule
Scales the step size by a running average of the recent step norms.
Combines BasicRMSProp and Scale to form the step rule described in [Hint2014].
[Hint2014] Geoff Hinton, Neural Networks for Machine Learning, lecture 6a, http://cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Parameters: - learning_rate (float, optional) – The learning rate by which the previous step is scaled. Defaults to 1.
- decay_rate (float, optional) – How fast the running average decays (lower is faster). Defaults to 0.9.
- max_scaling (float, optional) – Maximum scaling of the step size, in case the running average is really small. Defaults to 1e5.
- learning_rate¶
SharedVariable
A variable for learning rate.
- decay_rate¶
SharedVariable
A variable for decay rate.
See also
SharedVariableModifier
- class blocks.algorithms.RemoveNotFinite(scaler=1)¶
Bases: blocks.algorithms.StepRule
A step rule that skips steps with non-finite elements.
Replaces a step (the parameter update of a single shared variable) which contains non-finite elements (such as inf or NaN) with a step rescaling the parameters.
Parameters: scaler (float, optional) – The scaling applied to the parameter in case the step contains non-finite elements. Defaults to 1, which means that parameters will not be changed.
Notes
This rule should be applied last!
This trick was originally used in the GroundHog framework.
- compute_step(parameter, previous_step)¶
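One way to read the behaviour described above, sketched in NumPy: since steps are subtracted, replacing a bad step with (1 - scaler) * parameter leaves the parameter multiplied by scaler. The function name remove_not_finite_step is an assumption of the example.

```python
import numpy as np


def remove_not_finite_step(parameter, previous_step, scaler=1.0):
    """Replace a step containing inf or NaN (illustrative sketch)."""
    if np.all(np.isfinite(previous_step)):
        return previous_step
    # parameter - (1 - scaler) * parameter == scaler * parameter
    return (1 - scaler) * parameter


parameter = np.array([1.0, 2.0])
bad_step = np.array([np.nan, 0.1])

kept = parameter - remove_not_finite_step(parameter, bad_step)
halved = parameter - remove_not_finite_step(parameter, bad_step, scaler=0.5)
```

With the default scaler of 1 the replacement step is zero, so the parameter is left untouched; with scaler=0.5 it is halved.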
- class blocks.algorithms.Restrict(step_rule, variables)¶
Bases: blocks.algorithms.StepRule
Applies a given StepRule only to certain variables.
Example applications include clipping steps on only certain parameters, or scaling a certain kind of parameter’s updates (e.g. adding an additional scalar multiplier to the steps taken on convolutional filters).
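Parameters: - step_rule (StepRule) – The StepRule to be applied to the given variables.
- variables (iterable) – A collection of Theano variables to which step_rule is applied. Variables not in this collection do not have step_rule applied to them.
- compute_steps(previous_steps)¶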
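The idea of applying a rule to only some variables can be sketched in pure Python. The helper restrict, the rule halve, and the string keys "conv_filters" and "bias" are hypothetical; real keys would be Theano shared variables.

```python
from collections import OrderedDict


def restrict(rule, variables):
    """Wrap `rule` so that it only sees the steps of `variables` (sketch)."""
    selected = set(variables)

    def compute_steps(previous_steps):
        subset = OrderedDict((p, s) for p, s in previous_steps.items()
                             if p in selected)
        new_steps, updates = rule(subset)
        # Steps of all other parameters pass through unchanged.
        steps = OrderedDict((p, new_steps.get(p, s))
                            for p, s in previous_steps.items())
        return steps, updates

    return compute_steps


def halve(previous_steps):
    """Hypothetical rule: halve every step it receives."""
    return OrderedDict((p, 0.5 * s) for p, s in previous_steps.items()), []


steps, _ = restrict(halve, ["conv_filters"])(
    OrderedDict([("conv_filters", 2.0), ("bias", 2.0)]))
```

Only the step for "conv_filters" is halved; the step for "bias" is returned as it came in.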
- class blocks.algorithms.Scale(learning_rate=1.0)¶
Bases: blocks.algorithms.StepRule
A step in the direction proportional to the previous step.
If used in GradientDescent alone, this step rule implements steepest descent.
Parameters: learning_rate (float) – The learning rate by which the previous step is multiplied to produce the step.
- learning_rate¶
TensorSharedVariable
The shared variable storing the learning rate used.
- compute_step(parameter, previous_step)¶
- class blocks.algorithms.StepClipping(threshold=None)¶
Bases: blocks.algorithms.StepRule
Rescales an entire step if its L2 norm exceeds a threshold.
When the previous steps are the gradients, this step rule performs gradient clipping.
Parameters: threshold (float, optional) – The maximum permitted L2 norm for the step. The step will be rescaled so that its norm does not exceed this quantity. If None, no rescaling will be applied.
- threshold¶
tensor.TensorSharedVariable
The shared variable storing the clipping threshold used.
- compute_steps(previous_steps)¶
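Rescaling an entire step by its joint L2 norm can be sketched in NumPy as below. The function name step_clipping and the string keys are assumptions of the example; the norm is taken over all parameters together, which is why this rule works on the whole dictionary of steps rather than one parameter at a time.

```python
import numpy as np


def step_clipping(previous_steps, threshold=None):
    """Rescale all steps jointly if their L2 norm exceeds threshold (sketch)."""
    if threshold is None:
        return dict(previous_steps)
    # L2 norm over every element of every step.
    norm = np.sqrt(sum(np.sum(step ** 2) for step in previous_steps.values()))
    multiplier = 1.0 if norm < threshold else threshold / norm
    return {p: step * multiplier for p, step in previous_steps.items()}


steps = step_clipping({"W": np.array([3.0]), "b": np.array([4.0])}, threshold=1.0)
```

The joint norm of the input is 5, so both steps are multiplied by 1/5 and the direction of the overall step is preserved.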
- class blocks.algorithms.StepRule¶
Bases: object
A rule to compute steps for a gradient descent algorithm.
- compute_step(parameter, previous_step)¶
Build a Theano expression for the step for a parameter.
This method is called by the default implementation of compute_steps(); it saves you from writing a loop each time.
Parameters: - parameter (TensorSharedVariable) – The parameter.
- previous_step (TensorVariable) – Some quantity related to the gradient of the cost with respect to the parameter, either the gradient itself or a step in a related direction.
Returns: - step (Variable) – Theano variable for the step to take.
- updates (list) – A list of tuples representing updates to be performed. This is useful for stateful rules such as Momentum which need to update shared variables after iterations.
- compute_steps(previous_steps)¶
Build a Theano expression for steps for all parameters.
Override this method if you want to process the steps with respect to all parameters as a whole, not parameter-wise.
Parameters: previous_steps (OrderedDict) – An OrderedDict of (TensorSharedVariable, TensorVariable) pairs. The keys are the parameters being trained, the values are the expressions for quantities related to gradients of the cost with respect to the parameters, either the gradients themselves or steps in related directions.
Returns: - steps (OrderedDict) – A dictionary of the proposed steps in the same form as previous_steps.
- updates (list) – A list of tuples representing updates to be performed.
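The relationship between compute_step and compute_steps can be sketched with a small stand-alone class. SketchStepRule and Halve are hypothetical names that mirror the interface described above without depending on Theano; plain floats stand in for Theano expressions.

```python
from collections import OrderedDict


class SketchStepRule:
    """Stand-in for a step rule base class (illustrative only)."""

    def compute_step(self, parameter, previous_step):
        raise NotImplementedError

    def compute_steps(self, previous_steps):
        # Default behaviour: call compute_step once per parameter and
        # collect the resulting steps and updates.
        steps, updates = OrderedDict(), []
        for parameter, previous_step in previous_steps.items():
            step, more_updates = self.compute_step(parameter, previous_step)
            steps[parameter] = step
            updates.extend(more_updates)
        return steps, updates


class Halve(SketchStepRule):
    """Parameter-wise rule: only compute_step needs to be written."""

    def compute_step(self, parameter, previous_step):
        return 0.5 * previous_step, []


steps, updates = Halve().compute_steps(OrderedDict([("W", 2.0), ("b", 4.0)]))
```

A rule that needs all steps at once, such as one that clips by a joint norm, would override compute_steps instead.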
- class blocks.algorithms.TrainingAlgorithm¶
Bases: object
Base class for training algorithms.
A training algorithm object has a simple life-cycle. First it is initialized by calling its initialize() method. At this stage, for instance, Theano functions can be compiled. After that the process_batch() method is repeatedly called with a batch of training data as a parameter.
- initialize(**kwargs)¶
Initialize the training algorithm.
- class blocks.algorithms.VariableClipping(threshold, axis=None)¶
Bases: blocks.algorithms.StepRule
Clip the maximum norm of individual variables along certain axes.
This StepRule can be used to implement L2 norm constraints on e.g. the weight vectors of individual hidden units, convolutional filters or entire weight tensors. Combine with Restrict (and possibly CompositeRule) to apply such constraints only to certain variables and/or apply different norm constraints to different variables.
Parameters: - threshold (float) – Maximum norm for a given (portion of a) tensor.
- axis (int or iterable, optional) – An integer single axis, or an iterable collection of integer axes over which to sum in order to calculate the L2 norm. If None (the default), the norm is computed over all elements of the tensor.
Notes
Because of the way the StepRule API works, this particular rule implements norm clipping of the value after update in the following way: it computes parameter - previous_step, scales it to have (possibly axes-wise) norm(s) of at most threshold, then subtracts that value from parameter to yield an ‘equivalent step’ that respects the desired norm constraints. This procedure implicitly assumes one is doing simple (stochastic) gradient descent, and so steps computed by this step rule may not make sense for use in other contexts.
Investigations into max-norm regularization date from [Srebro2005]. The first appearance of this technique as a regularization method for the weight vectors of individual hidden units in feed-forward neural networks may be [Hinton2012].
[Srebro2005] Nathan Srebro and Adi Shraibman. “Rank, Trace-Norm and Max-Norm”. 18th Annual Conference on Learning Theory (COLT), June 2005.
[Hinton2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov. “Improving neural networks by preventing co-adaptation of feature detectors”. arXiv:1207.0580.
- compute_step(parameter, previous_step)¶
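The 'equivalent step' construction from the notes can be sketched in NumPy. The function name variable_clipping_step is an assumption, axis is taken to be None, an int, or a tuple of ints, and the small floor on the norm is a guard added for this sketch only.

```python
import numpy as np


def variable_clipping_step(parameter, previous_step, threshold, axis=None):
    """Return a step that keeps the updated value within the norm bound."""
    # Value the parameter would take after a plain gradient descent update.
    updated = parameter - previous_step
    norms = np.sqrt(np.sum(updated ** 2, axis=axis, keepdims=True))
    # Shrink only where the norm exceeds the threshold; the floor avoids
    # division by zero and is not part of the documented rule.
    factor = np.minimum(1.0, threshold / np.maximum(norms, 1e-12))
    # Subtracting this step from parameter yields the clipped value.
    return parameter - factor * updated


weights = np.array([[3.0, 4.0], [0.3, 0.4]])
gradient_step = np.zeros_like(weights)
step = variable_clipping_step(weights, gradient_step, threshold=1.0, axis=1)
clipped = weights - step
```

The first row has norm 5 and is scaled down to norm 1, while the second row, with norm 0.5, already satisfies the constraint and keeps its original step.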