A General Optimization Framework for Dynamic Time Warping
Dave Deriso · 2026
Given two signals $x$ and $y$, the goal of dynamic time warping (DTW) is to find a warping function $\phi$ such that $x(\phi(t)) \approx y(t)$. In other words, instead of modifying amplitude, we modify (or "warp") time to align the two signals.
This problem arises naturally whenever you compare signals that share the same shape but differ in timing. A cardiologist comparing heartbeats across patients sees the same P-wave, QRS complex, and T-wave — but each occurs at slightly different times and speeds. An ADC clock hang stalls the digitized signal while the reference keeps running, then races to catch up. A multimodal waveform may stretch or compress differently across each of its lobes. In each case, a meaningful comparison requires first removing the timing differences — and that means finding the right warping function.
The figures above show three such problems. In each, the blue and orange signals share the same underlying shape, but their timing differs — the gray lines show the pointwise correspondence found by a warping function. Simple operations like shifting or scaling the time axis cannot fix these misalignments; the timing distortion varies across the signal, so only a nonlinear warping function can bring them into register.
Classic DTW
DTW is the classic tool for this, used wherever signals must be aligned before they can be compared. The classic DTW algorithm (Sakoe & Chiba, 1978) discretizes both signals into $N$ time steps and fills in an $N \times N$ cost matrix $D$ via the recurrence

$$ D(i, j) = d(x_i, y_j) + \min\bigl\{\, D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \,\bigr\}, $$

where each cell accumulates the local distance $d(x_i, y_j)$ plus the cheapest path from one of three neighbors. The optimal path is then recovered by backtracking from $(N, N)$ to $(1, 1)$. The three allowed transitions — right, up, or diagonal — enforce monotonicity and continuity, while the boundary conditions $(1, 1)$ and $(N, N)$ anchor the endpoints.
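To make the recurrence concrete, here is a minimal NumPy sketch of the fill-and-backtrack procedure described above. It is only an illustration (not the demo's implementation): it assumes 1-D signals, possibly of different lengths, and uses the absolute difference as the local distance.

```python
import numpy as np

def classic_dtw(x, y):
    """Classic DTW: fill the cumulative cost matrix, then backtrack the optimal path."""
    n, m = len(x), len(y)
    D = np.full((n, m), np.inf)
    D[0, 0] = abs(x[0] - y[0])
    # Fill each cell with the local distance plus the cheapest of the three neighbors.
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                D[i - 1, j] if i > 0 else np.inf,                  # from below
                D[i, j - 1] if j > 0 else np.inf,                  # from the left
                D[i - 1, j - 1] if (i > 0 and j > 0) else np.inf,  # diagonal
            )
            D[i, j] = abs(x[i] - y[j]) + prev
    # Backtrack from the last cell to (0, 0), always moving to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((D[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((D[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return D[-1, -1], path[::-1]
```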
The interactive demo below shows the classic DTW algorithm in action. The distance matrix is shown as a heatmap with the optimal path overlaid. Try adjusting $N$ to see how resolution affects the alignment.
Despite its popularity, DTW has a longstanding problem: the warping path is restricted to a discrete grid, so the resulting warping function is piecewise constant — a staircase. Flat segments of the staircase are singularities, where one signal's time stands still while the other advances, collapsing many time points onto one. The comparison below illustrates this issue.
[Figure: Classic DTW alignment]
Most approaches to reducing singularities fall into two camps: transforming the input signals (via derivatives, square-root velocity functions, adaptive down-sampling, or wavelet ensembles) to indirectly influence warping smoothness, or varying the continuity constraints by expanding allowable transitions via step patterns. Both are ad-hoc techniques applied outside the optimization itself, and both require hand-tuning for different signal types.
A continuous-time optimization framework
Instead of relying on pre- or post-processing, we handle singularities directly within the optimization. We pose DTW in continuous time, choosing a warping function $\phi$ that solves the following optimization problem:

$$
\begin{array}{ll}
\text{minimize} & \mathcal{L}(\phi) \;+\; \lambda_{\mathrm{cum}}\, \mathcal{R}_{\mathrm{cum}}(\phi) \;+\; \lambda_{\mathrm{inst}}\, \mathcal{R}_{\mathrm{inst}}(\phi) \\
\text{subject to} & \phi(0) = 0, \quad \phi(1) = 1, \quad s^{\min} \le \phi'(t) \le s^{\max},
\end{array}
$$

where $\lambda_{\mathrm{cum}}, \lambda_{\mathrm{inst}} \ge 0$ are hyperparameters that control the trade-off between alignment fidelity and warping regularity. The loss $\mathcal{L}$ penalizes misalignment, while the two regularization terms $\mathcal{R}_{\mathrm{cum}}$ and $\mathcal{R}_{\mathrm{inst}}$ penalize cumulative warping (limiting overfitting, analogous to ridge regularization) and the instantaneous warping rate (directly producing smoother warping functions), respectively. Let's unpack each term.
The loss functional
The loss $\mathcal{L}$ measures how well the time-warped signal $x(\phi(t))$ approximates the target $y(t)$. The penalty function $L$ maps pointwise residuals to non-negative costs:

$$ \mathcal{L}(\phi) = \int_0^1 L\bigl( x(\phi(t)) - y(t) \bigr)\, dt. $$
Common choices for the penalty $L$ include:
- Squared loss: $L(u) = u^2$
- Absolute loss: $L(u) = |u|$
- Huber loss: quadratic for small deviations, linear for outliers (a robust compromise; see the sketch below)
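For reference, here is a short sketch of these three penalty choices. The Huber threshold `delta` is a free parameter chosen by the user, not something fixed by the framework.

```python
import numpy as np

def squared_loss(u):
    """L(u) = u^2: smooth everywhere, but sensitive to outliers."""
    return u ** 2

def absolute_loss(u):
    """L(u) = |u|: robust to outliers, non-differentiable at zero."""
    return np.abs(u)

def huber_loss(u, delta=1.0):
    """Quadratic for |u| <= delta, linear beyond: a compromise between the two."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u ** 2,
                    delta * (np.abs(u) - 0.5 * delta))
```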
Cumulative warp regularization
The cumulative warp $\phi(t) - t$ measures how much the warped time deviates from the original time at each moment. The penalty function $R_{\mathrm{cum}}$ maps this deviation to a cost:

$$ \mathcal{R}_{\mathrm{cum}}(\phi) = \int_0^1 R_{\mathrm{cum}}\bigl( \phi(t) - t \bigr)\, dt. $$
This term limits overfitting, similar to ridge or lasso regularization in machine learning. The larger $\lambda_{\mathrm{cum}}$, the less $\phi(t)$ deviates from $t$.
Instantaneous warp regularization
The instantaneous rate of time warping $\phi'(t)$ measures how quickly the warping is changing at each moment. The penalty function $R_{\mathrm{inst}}$ maps this rate to a cost:

$$ \mathcal{R}_{\mathrm{inst}}(\phi) = \int_0^1 R_{\mathrm{inst}}\bigl( \phi'(t) \bigr)\, dt. $$
This is the key to eliminating singularities. By penalizing large instantaneous warping rates, we directly encourage smooth warping functions. The larger $\lambda_{\mathrm{inst}}$, the smoother the time warping function.
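Putting the three terms together, the sketch below evaluates a discretized version of the objective for a candidate piecewise-linear warp on a uniform grid. The squared penalties, the simple Riemann-sum integration, and the choice of penalizing the slope's deviation from 1 are illustrative assumptions rather than the only options, and the helper `x_interp` (any callable that evaluates $x$ at arbitrary times, e.g. built from `np.interp`) is a stand-in, not a library API.

```python
import numpy as np

def gdtw_objective(tau, t, x_interp, y, lam_cum, lam_inst):
    """Evaluate a discretized GDTW-style objective for a piecewise-linear warp.

    tau      : candidate warp values, tau[i] = phi(t[i]), with tau[0] = 0, tau[-1] = 1
    t        : uniform time grid on [0, 1]
    x_interp : callable that evaluates the signal x at arbitrary times
    y        : target signal sampled on the grid t
    """
    dt = t[1] - t[0]                        # uniform grid spacing
    residual = x_interp(tau) - y            # pointwise alignment error x(phi(t)) - y(t)
    loss = np.sum(residual ** 2) * dt       # squared-loss choice for L
    cum = np.sum((tau - t) ** 2) * dt       # squared penalty on the cumulative warp phi(t) - t
    slope = np.diff(tau) / dt               # instantaneous rate phi'(t) on each interval
    inst = np.sum((slope - 1.0) ** 2) * dt  # penalize deviation of the rate from 1
    return loss + lam_cum * cum + lam_inst * inst
```

With the identity warp (`tau = t`), both regularization terms vanish and the objective reduces to the integrated squared error between $x$ and $y$; under this choice of penalty, increasing `lam_inst` pushes the recovered warp toward that unit slope.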
Explore the impact of the smoothing regularization below to see how it improves over the classic DTW formulation.
[Interactive comparison: Classic DTW vs. GDTW]
Interactive Demo
The demo below shows how all of the terms in the objective interact. The top chart shows the signals and their alignment via warp lines. Below it, three charts display $\phi(t)$ directly (dotted diagonal = identity), the cumulative deviation $\phi(t) - t$, and the instantaneous warping rate $\phi'(t)$. Adjust the regularization parameters to see how they control smoothness and total warping.
A sine wave warped by a quadratic time function $\phi(t) = t^2$, so $y(t) = x(t^2)$.
Higher values of $\lambda_{\mathrm{inst}}$ result in smoother warping functions. Lower values may produce singularities.
Higher values of $\lambda_{\mathrm{cum}}$ limit the total amount of warping, preventing overfitting.
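To reproduce the demo's setup offline, a minimal sketch is shown below. The sine frequency and the number of samples are arbitrary choices; the quadratic warp $\phi(t) = t^2$ matches the caption above.

```python
import numpy as np

n = 500
t = np.linspace(0.0, 1.0, n)

x = np.sin(2 * np.pi * 3 * t)         # reference signal: three cycles of a sine wave
phi_true = t ** 2                     # quadratic time warp, phi(t) = t^2
y = np.sin(2 * np.pi * 3 * phi_true)  # target signal: the same sine, warped in time

# A good alignment should recover phi(t) close to t**2, so that x(phi(t)) matches y(t).
```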
Solving via dynamic programming with refinement
The optimization problem above is infinite-dimensional and generally non-convex. But the state dimension is one — a scalar state $\phi(t)$ with scalar control $\phi'(t)$ — so it can be solved to global optimality, after discretization, by dynamic programming.
Discretization
We discretize time with $N$ values $0 = t_1 < t_2 < \cdots < t_N = 1$ and assume $\phi$ is piecewise linear, so it is fully specified by the vector $\tau = (\tau_1, \ldots, \tau_N)$ where $\tau_i = \phi(t_i)$. We then discretize the values each $\tau_i$ can take into $M$ linearly spaced possibilities between lower and upper bounds $l_i$ and $u_i$.
This yields a directed graph with $NM$ nodes, where each node $(i, j)$ represents assigning $\tau_i$ to the $j$-th discretized value. Each node at time step $i$ has $M$ outgoing edges to nodes at step $i+1$, for a total of $(N-1)M^2$ edges. The discretized objective decomposes naturally onto this graph: each node carries the loss plus cumulative regularization cost for that $(t_i, \tau_i)$ pair, and each edge carries the instantaneous regularization cost for the slope between adjacent knot points. The total cost of a path from $\tau_1 = 0$ to $\tau_N = 1$ is exactly our discretized objective, and standard dynamic programming finds the globally optimal path in $O(NM^2)$ time.
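The forward pass and backtracking over this graph fit in a few lines. The sketch below assumes the node and edge costs have already been assembled into arrays (shapes noted in the docstring); it is meant to show the $O(NM^2)$ structure, not the engineered solver, and `solve_dp` is a hypothetical helper name.

```python
import numpy as np

def solve_dp(node_cost, edge_cost):
    """Shortest path through the N x M warping grid.

    node_cost : (N, M) array; loss plus cumulative-regularization cost of each node
    edge_cost : (N-1, M, M) array; edge_cost[i, j, k] is the instantaneous-
                regularization cost of moving from the j-th value at step i
                to the k-th value at step i+1
    Returns the index of the chosen value at each time step.
    """
    N, M = node_cost.shape
    total = np.full((N, M), np.inf)   # cheapest cost to reach each node
    parent = np.zeros((N, M), dtype=int)
    total[0] = node_cost[0]
    for i in range(1, N):             # forward pass: O(N * M^2)
        # candidate[j, k] = cost of reaching node (i, k) through node (i-1, j)
        candidate = total[i - 1][:, None] + edge_cost[i - 1] + node_cost[i][None, :]
        parent[i] = np.argmin(candidate, axis=0)
        total[i] = np.min(candidate, axis=0)
    # Backtrack from the cheapest terminal node to recover the optimal path.
    path = np.zeros(N, dtype=int)
    path[-1] = np.argmin(total[-1])
    for i in range(N - 1, 0, -1):
        path[i - 1] = parent[i, path[i]]
    return path
```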
Baking slope constraints into the grid
The slope bounds $s^{\min}$ and $s^{\max}$ are hard constraints on $\phi'(t)$ — the minimum and maximum allowed rate of time warping. In general, slope constraints can be enforced by assigning infinite cost to edges whose slope falls outside the allowed range. But this wastes computation on infeasible nodes that can never appear in an optimal path.
Instead, we choose the bounds on $\tau_i$ so that every node in the grid is reachable from $\tau_1 = 0$ and can reach $\tau_N = 1$ without violating the slope constraints:

$$ l_i = \max\bigl( s^{\min} t_i,\; 1 - s^{\max}(1 - t_i) \bigr), \qquad u_i = \min\bigl( s^{\max} t_i,\; 1 - s^{\min}(1 - t_i) \bigr). $$
Together they carve out a parallelogram-shaped feasible region. Every point inside this region is feasible, so dynamic programming never encounters infeasible nodes or edges. Provided $M$ is large enough, the grid spacing is fine enough that adjacent columns always overlap, and the slope constraints are fully encoded in the grid geometry.
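Computing the bounds themselves is a one-liner per side; here is a sketch under the convention $\phi(0) = 0$, $\phi(1) = 1$ (the function name is hypothetical).

```python
import numpy as np

def feasible_bounds(t, s_min, s_max):
    """Per-knot bounds on tau_i = phi(t_i) defining the parallelogram-shaped region.

    A value is kept only if it is reachable from tau_1 = 0 and can still reach
    tau_N = 1 without the slope ever leaving [s_min, s_max].
    """
    lower = np.maximum(s_min * t, 1.0 - s_max * (1.0 - t))
    upper = np.minimum(s_max * t, 1.0 - s_min * (1.0 - t))
    return lower, upper
```

Linearly spacing $M$ values between `lower[i]` and `upper[i]` then gives the columns of the grid described in the previous section.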
The visualization below shows the computational grid for a single iteration. Hover over grid points to see the warping line each point represents. Points near the diagonal ($\tau_i \approx t_i$) produce nearly vertical warping lines (minimal warping), while points far from the diagonal create slanted lines (significant time distortion). The optimal path (in orange) traces the sequence of alignments that minimizes total cost. Adjust $s^{\min}$ and $s^{\max}$ to see how the slope bounds reshape the feasible region.
Iterative refinement
After solving the discretized problem, the accuracy of $\tau$ is limited by the spacing between adjacent grid values. To reduce this discretization error without increasing $M$, we shrink the bounds toward the current solution $\tau^\star$ by a factor $\eta \in (0, 1)$:

$$ l_i \leftarrow \max\Bigl( l_i^{(0)},\; \tau_i^\star - \tfrac{\eta}{2}\,(u_i - l_i) \Bigr), \qquad u_i \leftarrow \min\Bigl( u_i^{(0)},\; \tau_i^\star + \tfrac{\eta}{2}\,(u_i - l_i) \Bigr), $$

where $l_i^{(0)}$ and $u_i^{(0)}$ denote the original slope-feasible bounds.
The same grid points now span a narrower range, so the spacing between them is finer. Clipping against the original bounds $l_i^{(0)}$ and $u_i^{(0)}$ ensures the slope constraints remain satisfied. Each iteration costs $O(NM^2)$ — the same as the initial solve — and convergence typically occurs within three to four iterations.
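One refinement step, written as a sketch consistent with the description above (the function name and the default shrink factor are arbitrary choices):

```python
import numpy as np

def refine_bounds(tau_star, lower, upper, lower0, upper0, eta=0.5):
    """Shrink the per-knot search interval around the current solution tau_star.

    lower, upper   : bounds used in the last solve
    lower0, upper0 : original slope-feasible bounds, used for clipping
    eta            : shrink factor in (0, 1); the default here is arbitrary
    """
    half_width = 0.5 * eta * (upper - lower)
    new_lower = np.clip(tau_star - half_width, lower0, upper0)
    new_upper = np.clip(tau_star + half_width, lower0, upper0)
    return new_lower, new_upper
```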
Looking ahead
This post introduced the core idea behind GDTW: by recasting dynamic time warping as a continuous-time optimization problem with explicit regularization, we can eliminate the singularities that have plagued classical DTW for decades. But this framework opens the door to far more than just smoother alignments.
In upcoming posts, we'll explore these ideas in much greater depth:
- Group Signal Alignment — jointly aligning collections of signals to a common reference via iterative time warping
- Warping Basis Functions — extracting principal modes of amplitude and timing variability via SVD of aligned signals and warps
- Inside the GDTW Solver — a detailed look at the computational machinery: distance matrices, the forward pass, backtracking, and iterative refinement
- Computing Gradients Through Time Warping — how to differentiate through the warping process, enabling end-to-end learning
- GDTW as a Differentiable Loss Function — using GDTW directly as a training objective for neural networks, including denoising autoencoders and time-series GANs
The unifying theme across all of these is that time warping, when treated as a first-class optimization variable rather than a preprocessing step, becomes a remarkably flexible primitive for signal processing and machine learning. Stay tuned.