basic¶
The numqi.optimize module provides wrappers for the L-BFGS-B algorithm from scipy.optimize.minimize and the Adam optimizer from torch.optim. It is designed so that the optimization algorithm can be applied directly to a torch.nn.Module with a few lines of code.
import time
import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt
import torch
try:
    import numqi
except ImportError:
    %pip install numqi
    import numqi
mathematical statement¶
Let's start with a simple example, the Booth function: a quadratic function $\mathcal{L}:\mathbb{R}^2\mapsto\mathbb{R}$ of two variables:
$$ \mathcal{L}(x,y)=(x+2y-7)^2 + (2x+y-5)^2 $$
One can easily verify that the minimum of this function is
$$ \min_{x,y} \mathcal{L}(x,y)=\mathcal{L}(1,3)=0 $$
with a unique solution $(x,y)=(1,3)$.
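As a quick sanity check (not part of the numqi workflow, and reusing the imports from the first cell), one can minimize the Booth function directly with scipy and confirm the analytic optimum:
def hf_booth(xy):
    x, y = xy
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2

# starting from the origin, the minimizer converges to (1, 3) with loss ~0
res = scipy.optimize.minimize(hf_booth, np.array([0.0, 0.0]))
print(res.x, res.fun)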
model definition¶
To optimize such a loss function in numqi, one first needs to wrap it in a torch.nn.Module. If you have used PyTorch before, you will notice this is just an ordinary torch.nn.Module with a forward function. A torch.nn.Module defines a model that contains the parameters and specifies how to evaluate the target function. In the __init__ function (the constructor of the class), we define two torch parameters x and y with random initial values. In the forward function, we calculate the loss from these parameters. The main difference from a usual torch.nn.Module is that the forward() function takes no data or label inputs, as they are not needed for this loss function.
class DummyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        np_rng = np.random.default_rng()
        self.x = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1), dtype=torch.float64))
        self.y = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1), dtype=torch.float64))

    def forward(self):
        x,y = self.x, self.y
        loss = (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2
        return loss
Create an instance of the model and play with its function value and gradients at different points. numqi.optimize.hf_model_wrapper converts a torch.nn.Module into a pure function $\mathbb{R}^2\mapsto\mathbb{R}$ that takes a vector of length 2 (the first element is model.x and the second is model.y, sorted alphabetically) and outputs the loss value and the gradient (by default; one can turn off the gradient calculation by passing tag_grad=False to reduce computation time).
model = DummyModel()
hf_model = numqi.optimize.hf_model_wrapper(model)
for x,y in [(0,0), (1,1), (1,3), (3,1)]:
    fval, gradient = hf_model(np.array([x,y]), tag_grad=True)
    print(f'f(x={x}, y={y})={fval}, gradient={gradient.tolist()}')
f(x=0, y=0)=74.0, gradient=[-34, -38]
f(x=1, y=1)=20.0, gradient=[-16, -20]
f(x=1, y=3)=0.0, gradient=[0, 0]
f(x=3, y=1)=8.0, gradient=[4, -4]
Apparently, we want to find the optimal point $(x=1,y=3)$, where the gradient is also zero. The gradient is obtained by automatic back-propagation from the PyTorch framework, which is the main advantage of using torch.nn.Module to define the model.
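To see the automatic differentiation explicitly, one can also call plain PyTorch directly on the model instance from the previous cell (a small illustrative sketch; the hf_model wrapper does this bookkeeping for you):
model.zero_grad()   # clear any gradients accumulated by earlier calls
loss = model()      # forward pass at the current (x, y)
loss.backward()     # reverse-mode automatic differentiation
print(model.x.grad.item(), model.y.grad.item())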
optimize with L-BFGS-B algorithm¶
One can use numqi.optimize.minimize() to search for the optimal point. This function returns the optimal value with some default hyper-parameter settings. Under the hood, the model is randomly re-initialized for three rounds to avoid local minima, and the L-BFGS-B algorithm from scipy.optimize.minimize is applied to optimize the model (see the scipy documentation and the L-BFGS-B Wikipedia page). numqi.optimize.minimize returns the same data structure as scipy.optimize.minimize: one can access the optimal parameters with res.x, a 1-dimensional vector (the first element is model.x and the second is model.y), and the optimal loss with res.fun. One can also access the optimal values through model.x and model.y directly, and call model() to get the optimal loss, which equals res.fun.
theta_optim = numqi.optimize.minimize(model, tol=1e-12, num_repeat=3)
print(f"Optimal theta: {theta_optim.x}")
print(f'model.x={model.x.item()}, model.y={model.y.item()}')
print(f'model()={model().item()}')
[round=0] min(f)=2.5427764811558198e-17, current(f)=2.5427764811558198e-17
[round=1] min(f)=1.2511137640880374e-20, current(f)=1.2511137640880374e-20
[round=2] min(f)=2.9528901728331637e-23, current(f)=2.9528901728331637e-23
Optimal theta: [1. 3.]
model.x=0.999999999996105, model.y=3.000000000003782
model()=2.9528901728331637e-23
In most runs, the optimal value approaches the theoretical optimum $(x,y)=(1,3)$ up to machine precision (roughly $10^{-7}$ for single precision dtype=float32 and $10^{-14}$ for double precision dtype=float64).
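The quoted precisions roughly track the machine epsilon of the two floating-point types, which can be checked with numpy:
# machine epsilon for single and double precision
print(np.finfo(np.float32).eps)   # about 1.2e-07
print(np.finfo(np.float64).eps)   # about 2.2e-16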
visualize the loss landscape¶
Let's take a closer look at this optimization problem. Since the target function has only two parameters, we can draw the contour lines around the optimal point $(x,y)=(1,3)$.
xdata = np.linspace(-1, 3, 50)
ydata = np.linspace(1, 5, 50)
hf_model = numqi.optimize.hf_model_wrapper(model)
zdata = np.array([[hf_model(np.array([x, y]), tag_grad=False) for x in xdata] for y in ydata])
fig,ax = plt.subplots()
hcontour = ax.contour(xdata, ydata, zdata, levels=13)
ax.clabel(hcontour, inline=True, fontsize=10)
ax.plot([1], [3], 'rx', label='optimal')
ax.legend()
ax.set_xlabel('x')
_ = ax.set_ylabel('y')
We can think of the optimization process as a ball rolling on the surface of the loss landscape. Initially, the ball is placed randomly on the surface (by default, uniformly in the range [0,1]). The ball then rolls down the slope and finally stops at the bottom of the valley (the cross in the figure). The rolling process resembles the gradient descent algorithm, and the bottom of the valley is the optimal point.
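To make the rolling-ball picture concrete, here is a hand-written gradient-descent loop on a fresh DummyModel (an illustrative sketch only; numqi.optimize.minimize uses L-BFGS-B rather than plain gradient descent):
model_gd = DummyModel()
optimizer = torch.optim.SGD(model_gd.parameters(), lr=0.05)
for _ in range(200):
    optimizer.zero_grad()
    loss = model_gd()   # forward pass
    loss.backward()     # gradient of the loss w.r.t. x and y
    optimizer.step()    # roll a small step downhill
print(model_gd.x.item(), model_gd.y.item(), model_gd().item())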
why gradient backpropagation is important¶
Without gradient backpropagation, one can still approximate the gradient by the finite difference method (see the Wikipedia page on finite differences):
$$ \frac{\partial f}{\partial x}=\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon} $$
which is not exact (though accurate enough in most cases, see the accuracy-and-order discussion on the Wikipedia page) and has to be evaluated for each parameter separately. Assuming we have $N$ parameters, the finite difference method costs $O(N)$ function evaluations (forward passes). With gradient backpropagation, however, the cost is a single function evaluation (forward) plus one gradient evaluation (backward) for all parameters, i.e. $O(1)$.
From personal experience, the backward time is comparable to the forward time (mostly 1-2 times as long) (TODO: more solid reference). Backpropagation also needs to save intermediate variables for the backward calculation, which may cost more memory. In some worst cases (a large computational graph with few parameters), gradient backpropagation may even be slower than the finite difference method.
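The central-difference formula above can be checked directly against the autograd gradient using the hf_model wrapper from the Booth example (hf_finite_difference_grad is a helper defined here just for this check; scipy uses its own finite-difference routine when jac=False):
def hf_finite_difference_grad(hf0, x0, eps=1e-6):
    # central difference: one pair of forward evaluations per parameter
    grad = np.zeros_like(x0)
    for i in range(len(x0)):
        xp, xm = x0.copy(), x0.copy()
        xp[i] += eps
        xm[i] -= eps
        grad[i] = (hf0(xp, tag_grad=False) - hf0(xm, tag_grad=False)) / (2*eps)
    return grad

x0 = np.array([0.3, -0.7])
_, grad_ag = hf_model(x0, tag_grad=True)
grad_fd = hf_finite_difference_grad(hf_model, x0)
print(grad_ag, grad_fd)  # the two should agree to many digits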
Let's use a target function with many more parameters to compare the computational time of gradient backpropagation and the finite difference method.
$$ \mathcal{L}(x,y)=\sum_{i}\left(xA_{i}y-b_{i}\right)^{2}:\mathbb{R}^{n}\times\mathbb{R}^{n}\mapsto\mathbb{R} $$
class DummyModel01(torch.nn.Module):
    def __init__(self, matA, matB):
        super().__init__()
        np_rng = np.random.default_rng()
        N0,N1,_ = matA.shape
        self.matA = torch.tensor(matA, dtype=torch.float64)
        self.matB = torch.tensor(matB, dtype=torch.float64)
        self.x = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1,size=N1), dtype=torch.float64))
        self.y = torch.nn.Parameter(torch.tensor(np_rng.uniform(-1,1,size=N1), dtype=torch.float64))

    def forward(self):
        tmp0 = (self.matA @ self.x) @ self.y - self.matB
        loss = torch.dot(tmp0, tmp0)
        return loss
We generate some random matrices $A_{i}$ and random values $b_{i}$ for testing.
num_matrix = 64
dimension = 32
np_rng = np.random.default_rng(233) #fix seed for reproducibility
matA = np_rng.uniform(-1,1, size=(num_matrix, dimension, dimension))
matB = np_rng.uniform(-1,1, size=num_matrix)
model = DummyModel01(matA, matB)
hf_model = numqi.optimize.hf_model_wrapper(model)
hf_model_ag = lambda x: hf_model(x, tag_grad=True)
hf_model_fd = lambda x: hf_model(x, tag_grad=False)
x0 = np_rng.uniform(-1,1,size=2*dimension)
t0 = time.time()
theta_optim = scipy.optimize.minimize(hf_model_ag, x0, jac=True, tol=1e-10)
print(f'gradient backpropagation: time={time.time()-t0:.4f}s, loss={theta_optim.fun:.7f}')
t0 = time.time()
theta_optim = scipy.optimize.minimize(hf_model_fd, x0, jac=False, tol=1e-10)
print(f'finite difference: time={time.time()-t0:.4f}s, loss={theta_optim.fun:.7f}')
gradient backpropagation: time=0.2089s, loss=0.0256487
finite difference: time=3.2941s, loss=0.0256487
On my computer, the computational time is about 0.13s for gradient backpropagation and about 1.6s for the finite difference method. The difference becomes even more significant for a larger dimension; you may give it a try.
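As a starting point for that experiment, one can sweep the dimension and time both approaches (a sketch reusing the definitions from the cells above; absolute timings depend on the machine):
for dimension in [16, 32, 64]:
    matA = np_rng.uniform(-1, 1, size=(num_matrix, dimension, dimension))
    matB = np_rng.uniform(-1, 1, size=num_matrix)
    hf0 = numqi.optimize.hf_model_wrapper(DummyModel01(matA, matB))
    x0 = np_rng.uniform(-1, 1, size=2*dimension)
    t0 = time.time()
    scipy.optimize.minimize(lambda x: hf0(x, tag_grad=True), x0, jac=True, tol=1e-10)
    t_ag = time.time() - t0
    t0 = time.time()
    scipy.optimize.minimize(lambda x: hf0(x, tag_grad=False), x0, jac=False, tol=1e-10)
    t_fd = time.time() - t0
    print(f'dimension={dimension}: backprop={t_ag:.3f}s, finite difference={t_fd:.3f}s')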