basic¶
The numqi.optimize module provides wrappers for the L-BFGS-B algorithm from scipy.optimize.minimize and the Adam optimizer from torch.optim. It is designed so that the optimization algorithm can be applied directly to a torch.nn.Module with a few lines of code.
import time
import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt
import torch
try:
    import numqi
except ImportError:
    %pip install numqi
    import numqi
mathematical statement¶
Let's start with a simple example, the Booth function: a quadratic function $\mathcal{L}:\mathbb{R}^2\mapsto\mathbb{R}$ of two variables:
$$ \mathcal{L}(x,y)=(x+2y-7)^2 + (2x+y-5)^2 $$
One can easily verify that the minimum of this function is
$$ \min_{x,y} \mathcal{L}(x,y)=\mathcal{L}(1,3)=0 $$
with a unique solution $(x,y)=(1,3)$.
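As a quick sanity check (not part of the numqi workflow, and reusing the imports from the first cell), one can minimize the Booth function directly with scipy and confirm the analytic optimum:
def hf_booth(xy):
    x, y = xy
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2

# starting from the origin, the minimizer converges to (1, 3) with loss ~0
res = scipy.optimize.minimize(hf_booth, np.array([0.0, 0.0]))
print(res.x, res.fun)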
model definition¶
To optimize such a loss function in numqi, one first needs to wrap it in a torch.nn.Module. If you have used PyTorch before, you will notice this is just an ordinary torch.nn.Module with a forward function. A torch.nn.Module defines a model that contains the parameters and specifies how to evaluate the target function. In the __init__ function (the constructor of the class), we define two torch parameters x and y with random initial values. In the forward function, we calculate the loss from these parameters. The main difference from a usual torch.nn.Module is that the forward() function takes no data or label inputs, as they are not needed for this loss function.
class DummyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        np_rng = np.random.default_rng()
        self.x = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1), dtype=torch.float64))
        self.y = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1), dtype=torch.float64))

    def forward(self):
        x,y = self.x, self.y
        loss = (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2
        return loss
Create an instance of the model and play with its function value and gradients at different points. numqi.optimize.hf_model_wrapper converts a torch.nn.Module into a pure function $\mathbb{R}^2\mapsto\mathbb{R}$ that takes a vector of length 2 (the first element is model.x and the second is model.y, sorted alphabetically) and outputs the loss value and the gradient (by default; one can turn off the gradient calculation by passing tag_grad=False to reduce computation time).
model = DummyModel()
hf_model = numqi.optimize.hf_model_wrapper(model)
for x,y in [(0,0), (1,1), (1,3), (3,1)]:
    fval, gradient = hf_model(np.array([x,y]), tag_grad=True)
    print(f'f(x={x}, y={y})={fval}, gradient={gradient.tolist()}')
f(x=0, y=0)=74.0, gradient=[-34, -38]
f(x=1, y=1)=20.0, gradient=[-16, -20]
f(x=1, y=3)=0.0, gradient=[0, 0]
f(x=3, y=1)=8.0, gradient=[4, -4]
Apparently, we want to find the optimal point $(x=1,y=3)$, where the gradient is also zero. The gradient is obtained by automatic back-propagation from the PyTorch framework, which is the main advantage of using torch.nn.Module to define the model.
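To see the automatic differentiation explicitly, one can also call plain PyTorch directly on the model instance from the previous cell (a small illustrative sketch; the hf_model wrapper does this bookkeeping for you):
model.zero_grad()   # clear any gradients accumulated by earlier calls
loss = model()      # forward pass at the current (x, y)
loss.backward()     # reverse-mode automatic differentiation
print(model.x.grad.item(), model.y.grad.item())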
optimize with L-BFGS-B algorithm¶
One can use numqi.optimize.minimize() to search for the optimal point. This function returns the optimal value with some default hyper-parameter settings. Under the hood, the model is randomly re-initialized for three rounds to avoid local minima, and the L-BFGS-B algorithm from scipy.optimize.minimize is applied to optimize the model (see the scipy documentation and the L-BFGS-B Wikipedia page). numqi.optimize.minimize returns the same data structure as scipy.optimize.minimize: one can access the optimal parameters with res.x, a 1-dimensional vector (the first element is model.x and the second is model.y), and the optimal loss with res.fun. One can also access the optimal values through model.x and model.y directly, and call model() to get the optimal loss, which equals res.fun.
theta_optim = numqi.optimize.minimize(model, tol=1e-12, num_repeat=3)
print(f"Optimal theta: {theta_optim.x}")
print(f'model.x={model.x.item()}, model.y={model.y.item()}')
print(f'model()={model().item()}')
[round=0] min(f)=2.5427764811558198e-17, current(f)=2.5427764811558198e-17
[round=1] min(f)=1.2511137640880374e-20, current(f)=1.2511137640880374e-20
[round=2] min(f)=2.9528901728331637e-23, current(f)=2.9528901728331637e-23
Optimal theta: [1. 3.]
model.x=0.999999999996105, model.y=3.000000000003782
model()=2.9528901728331637e-23
In most runs, the optimal value approaches the theoretical optimum $(x,y)=(1,3)$ up to machine precision (roughly $10^{-7}$ for single precision dtype=float32 and $10^{-14}$ for double precision dtype=float64).
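The quoted precisions roughly track the machine epsilon of the two floating-point types, which can be checked with numpy:
# machine epsilon for single and double precision
print(np.finfo(np.float32).eps)   # about 1.2e-07
print(np.finfo(np.float64).eps)   # about 2.2e-16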
visualize the loss landscape¶
Let's take a closer look at this optimization problem. Since the target function has only two parameters, we can draw the contour lines around the optimal point $(x,y)=(1,3)$.
xdata = np.linspace(-1, 3, 50)
ydata = np.linspace(1, 5, 50)
hf_model = numqi.optimize.hf_model_wrapper(model)
zdata = np.array([[hf_model(np.array([x, y]), tag_grad=False) for x in xdata] for y in ydata])
fig,ax = plt.subplots()
hcontour = ax.contour(xdata, ydata, zdata, levels=13)
ax.clabel(hcontour, inline=True, fontsize=10)
ax.plot([1], [3], 'rx', label='optimal')
ax.legend()
ax.set_xlabel('x')
_ = ax.set_ylabel('y')
We can think of the optimization process as a ball rolling on the surface of the loss landscape. Initially, the ball is placed randomly on the surface (by default, uniformly in the range [0,1]). The ball then rolls down the slope and finally stops at the bottom of the valley (the cross in the figure). The rolling process resembles the gradient descent algorithm, and the bottom of the valley is the optimal point.
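To make the rolling-ball picture concrete, here is a hand-written gradient-descent loop on a fresh DummyModel (an illustrative sketch only; numqi.optimize.minimize uses L-BFGS-B rather than plain gradient descent):
model_gd = DummyModel()
optimizer = torch.optim.SGD(model_gd.parameters(), lr=0.05)
for _ in range(200):
    optimizer.zero_grad()
    loss = model_gd()   # forward pass
    loss.backward()     # gradient of the loss w.r.t. x and y
    optimizer.step()    # roll a small step downhill
print(model_gd.x.item(), model_gd.y.item(), model_gd().item())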
why gradient backpropagation is important¶
Without gradient backpropagation, one can still approximate the gradient by the finite difference method (see the Wikipedia page on finite differences):
$$ \frac{\partial f}{\partial x}=\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon} $$
which is not exact (though accurate enough in most cases, see the accuracy-and-order discussion on the Wikipedia page) and has to be evaluated for each parameter separately. Assuming we have $N$ parameters, the finite difference method costs $O(N)$ function evaluations (forward passes). With gradient backpropagation, however, the cost is a single function evaluation (forward) plus one gradient evaluation (backward) for all parameters, i.e. $O(1)$.
From personal experience, the backward time is comparable to the forward time (mostly 1-2 times as long) (TODO: more solid reference). Backpropagation also needs to save intermediate variables for the backward calculation, which may cost more memory. In some worst cases (a large computational graph with few parameters), gradient backpropagation may even be slower than the finite difference method.
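The central-difference formula above can be checked directly against the autograd gradient using the hf_model wrapper from the Booth example (hf_finite_difference_grad is a helper defined here just for this check; scipy uses its own finite-difference routine when jac=False):
def hf_finite_difference_grad(hf0, x0, eps=1e-6):
    # central difference: one pair of forward evaluations per parameter
    grad = np.zeros_like(x0)
    for i in range(len(x0)):
        xp, xm = x0.copy(), x0.copy()
        xp[i] += eps
        xm[i] -= eps
        grad[i] = (hf0(xp, tag_grad=False) - hf0(xm, tag_grad=False)) / (2*eps)
    return grad

x0 = np.array([0.3, -0.7])
_, grad_ag = hf_model(x0, tag_grad=True)
grad_fd = hf_finite_difference_grad(hf_model, x0)
print(grad_ag, grad_fd)  # the two should agree to many digits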
Let's use a target function with many more parameters to compare the computational time of gradient backpropagation and the finite difference method.
$$ \mathcal{L}(x,y)=\sum_{i}\left(xA_{i}y-b_{i}\right)^{2}:\mathbb{R}^{n}\times\mathbb{R}^{n}\mapsto\mathbb{R} $$
class DummyModel01(torch.nn.Module):
    def __init__(self, matA, matB):
        super().__init__()
        np_rng = np.random.default_rng()
        N0,N1,_ = matA.shape
        self.matA = torch.tensor(matA, dtype=torch.float64)
        self.matB = torch.tensor(matB, dtype=torch.float64)
        self.x = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1,size=N1), dtype=torch.float64))
        self.y = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1,size=N1), dtype=torch.float64))

    def forward(self):
        tmp0 = (self.matA @ self.x) @ self.y - self.matB
        loss = torch.dot(tmp0, tmp0)
        return loss
We generate some random matrices $A_{i}$ and random values $b_{i}$ for testing.
num_matrix = 64
dimension = 32
np_rng = np.random.default_rng(233) #fix seed for reproducibility
matA = np_rng.uniform(-1,1, size=(num_matrix, dimension, dimension))
matB = np_rng.uniform(-1,1, size=num_matrix)
model = DummyModel01(matA, matB)
hf_model = numqi.optimize.hf_model_wrapper(model)
hf_model_ag = lambda x: hf_model(x, tag_grad=True)
hf_model_fd = lambda x: hf_model(x, tag_grad=False)
x0 = np_rng.uniform(-1,1,size=2*dimension)
t0 = time.time()
theta_optim = scipy.optimize.minimize(hf_model_ag, x0, jac=True, tol=1e-10)
print(f'gradient backpropagation: time={time.time()-t0:.4f}s, loss={theta_optim.fun:.7f}')
t0 = time.time()
theta_optim = scipy.optimize.minimize(hf_model_fd, x0, jac=False, tol=1e-10)
print(f'finite difference: time={time.time()-t0:.4f}s, loss={theta_optim.fun:.7f}')
gradient backpropagation: time=0.2089s, loss=0.0256487
finite difference: time=3.2941s, loss=0.0256487
On my computer, the computational time is about 0.13s for gradient backpropagation and about 1.6s for the finite difference method. The difference becomes even more significant for a larger dimension; you may give it a try.
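As a starting point for that experiment, one can sweep the dimension and time both approaches (a sketch reusing the definitions from the cells above; absolute timings depend on the machine):
for dimension in [16, 32, 64]:
    matA = np_rng.uniform(-1, 1, size=(num_matrix, dimension, dimension))
    matB = np_rng.uniform(-1, 1, size=num_matrix)
    hf0 = numqi.optimize.hf_model_wrapper(DummyModel01(matA, matB))
    x0 = np_rng.uniform(-1, 1, size=2*dimension)
    t0 = time.time()
    scipy.optimize.minimize(lambda x: hf0(x, tag_grad=True), x0, jac=True, tol=1e-10)
    t_ag = time.time() - t0
    t0 = time.time()
    scipy.optimize.minimize(lambda x: hf0(x, tag_grad=False), x0, jac=False, tol=1e-10)
    t_fd = time.time() - t0
    print(f'dimension={dimension}: backprop={t_ag:.3f}s, finite difference={t_fd:.3f}s')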