A General Optimization Framework for Dynamic Time Warping

Dave Deriso · 2026

Given two signals $x$ and $y$, the goal of dynamic time warping (DTW) is to find a warping function $\phi$ such that $x \circ \phi \approx y$. In other words, instead of modifying amplitude, we modify (or "warp") time to align the two signals.

This problem arises naturally whenever you compare signals that share the same shape but differ in timing. A cardiologist comparing heartbeats across patients sees the same P-wave, QRS complex, and T-wave — but each occurs at slightly different times and speeds. An ADC clock hang stalls the digitized signal while the reference keeps running, then races to catch up. A multimodal waveform may stretch or compress differently across each of its lobes. In each case, a meaningful comparison requires first removing the timing differences — and that means finding the right warping function.

[Figure: three alignment examples (heartbeat morphology, ADC clock hang, multimodal waveform), with warp lines showing the pointwise correspondence.]

The figures above show three such problems. In each, the blue and orange signals share the same underlying shape, but their timing differs — the gray lines show the pointwise correspondence found by a warping function. Simple operations like shifting or scaling the time axis cannot fix these misalignments; the timing distortion varies across the signal, so only a nonlinear warping function can bring them into register.

Classic DTW

DTW is the classic tool for this, used wherever signals must be aligned before they can be compared. The classic DTW algorithm (Sakoe & Chiba, 1978) discretizes both signals into $N$ time steps and fills in a cost matrix $D \in \mathbb{R}^{N \times N}$ via the recurrence

D(i,j) = \|x(\tau_j) - y(t_i)\| + \min \begin{cases} D(i+1,\, j) \\ D(i+1,\, j+1) \\ D(i,\, j+1) \end{cases}

where each cell accumulates the local distance plus the cheapest path from one of three neighbors. The optimal path is then recovered by tracing the minimizing transitions from $D(1,1)$ to $D(N,N)$. The three allowed transitions — right, up, or diagonal — enforce monotonicity and continuity, while the boundary conditions $D(1,1)$ and $D(N,N)$ anchor the endpoints.
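For concreteness, here is a minimal NumPy sketch of this recurrence (an illustration, not a reference implementation). It assumes both signals are already sampled at the same $N$ time points and uses the absolute difference as the local distance:

```python
import numpy as np

def classic_dtw(x, y):
    """Cost-to-go DTW following the recurrence above; x and y have length N."""
    N = len(x)
    local = np.abs(np.subtract.outer(y, x))   # local[i, j] = |x(tau_j) - y(t_i)|
    D = np.full((N + 1, N + 1), np.inf)       # pad with an infinite border
    D[N - 1, N - 1] = local[N - 1, N - 1]
    # Fill the matrix backward: each cell adds the cheapest of its three successors.
    for i in range(N - 1, -1, -1):
        for j in range(N - 1, -1, -1):
            if i == N - 1 and j == N - 1:
                continue
            D[i, j] = local[i, j] + min(D[i + 1, j], D[i + 1, j + 1], D[i, j + 1])
    # Recover the optimal path by tracing the cheapest transitions from (0, 0).
    path, i, j = [(0, 0)], 0, 0
    while (i, j) != (N - 1, N - 1):
        i, j = min([(i + 1, j), (i + 1, j + 1), (i, j + 1)], key=lambda s: D[s])
        path.append((i, j))
    return D[:N, :N], path
```

Runs of purely horizontal or vertical moves in the returned path are exactly the staircase singularities discussed below.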

The interactive demo below shows the classic DTW algorithm in action. The distance matrix $D$ is shown as a heatmap with the optimal path overlaid. Try adjusting $N$ to see how resolution affects the alignment.


Despite its popularity, DTW has a longstanding problem: the warping path is restricted to a discrete grid, so the resulting warping function is piecewise constant — a staircase. Flat segments of the staircase are singularities, where one signal's time stands still while the other advances, collapsing many time points onto one. The comparison below illustrates this issue.

[Interactive comparison: Classic DTW alignment, showing the staircase warping path and its singularities.]

Most approaches to reducing singularities fall into two camps: transforming the input signals (via derivatives, square-root velocity functions, adaptive down-sampling, or wavelet ensembles) to indirectly influence warping smoothness, or varying the continuity constraints by expanding allowable transitions via step patterns. Both are ad-hoc techniques applied outside the optimization itself, and both require hand-tuning for different signal types.

A continuous-time optimization framework

Instead of relying on pre- or post-processing, we handle singularities directly within the optimization. We pose DTW in continuous time, choosing a warping function $\phi: [0,1] \to [0,1]$ that solves the following optimization problem:

\begin{array}{ll} \text{minimize} & \mathcal{L}(\phi) + \lambda^{\text{cuml}} \cdot \mathcal{R}^{\text{cuml}}(\phi) + \lambda^{\text{inst}} \cdot \mathcal{R}^{\text{inst}}(\phi) \\ \text{subject to} & \phi(0) = 0, \quad \phi(1) = 1 \end{array}

where $\lambda^{\text{cuml}}, \lambda^{\text{inst}} \in \mathbb{R}_+$ are hyperparameters that control the trade-off between alignment fidelity and warping regularity. The loss $\mathcal{L}$ penalizes misalignment, while the two regularization terms $\mathcal{R}^{\text{cuml}}$ and $\mathcal{R}^{\text{inst}}$ penalize cumulative warping (limiting overfitting, analogous to ridge regularization) and the instantaneous warping rate (directly producing smoother warping functions), respectively. Let's unpack each term.

The loss functional

The loss measures how well the time-warped signal approximates the target. The penalty function $L : \mathbb{R} \to \mathbb{R}_+ \cup \{+\infty\}$ maps pointwise residuals to non-negative costs:

\mathcal{L}(\phi) = \int_0^1 L(x(\phi(t)) - y(t)) \, dt

Common choices for $L$ include (sketched in code below):

  • Squared loss: $L(u) = \|u\|_2^2$
  • Absolute loss: $L(u) = \|u\|_1$
  • Huber loss: squared for small deviations, linear for outliers
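A small NumPy sketch of these penalties (the function names and the Huber threshold `delta` are illustrative choices, not part of the formulation):

```python
import numpy as np

def squared_loss(u):
    return np.sum(u ** 2)                          # ||u||_2^2

def absolute_loss(u):
    return np.sum(np.abs(u))                       # ||u||_1

def huber_loss(u, delta=1.0):
    # Quadratic for residuals within delta, linear beyond it (robust to outliers).
    a = np.abs(u)
    return np.sum(np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta)))
```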

Cumulative warp regularization

The cumulative warp $\phi(t) - t$ measures how much the warped time deviates from the original time at each moment. The penalty function $R^{\text{cuml}} : \mathbb{R} \to \mathbb{R}_+ \cup \{+\infty\}$ maps this deviation to a cost:

\mathcal{R}^{\text{cuml}}(\phi) = \int_0^1 R^{\text{cuml}}(\phi(t) - t) \, dt

This term limits overfitting, similar to ridge or lasso regularization in machine learning. The larger $\lambda^{\text{cuml}}$, the less $\phi(t)$ deviates from $t$.

Instantaneous warp regularization

The instantaneous rate of time warping $\phi'(t) - 1$ measures how quickly the warping is changing at each moment. The penalty function $R^{\text{inst}} : \mathbb{R} \to \mathbb{R}_+ \cup \{+\infty\}$ maps this rate to a cost:

\mathcal{R}^{\text{inst}}(\phi) = \int_0^1 R^{\text{inst}}(\phi'(t) - 1) \, dt

This is the key to eliminating singularities. By penalizing large instantaneous warping rates, we directly encourage smooth warping functions. The larger $\lambda^{\text{inst}}$, the smoother the time warping function.
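With all three terms defined, the sketch below evaluates the discretized objective for a candidate piecewise-linear warp $\tau_i = \phi(t_i)$, using squared penalties throughout (names and defaults are illustrative, not the library's API):

```python
import numpy as np

def gdtw_objective(tau, t, x, y, lam_cuml=0.1, lam_inst=1.0):
    """Discretized objective for a candidate warp tau_i = phi(t_i)."""
    dt = np.diff(t)
    residual = np.interp(tau, t, x) - y              # x(phi(t_i)) - y(t_i)
    loss = np.sum(residual[:-1] ** 2 * dt)           # alignment loss
    r_cuml = np.sum((tau[:-1] - t[:-1]) ** 2 * dt)   # cumulative warp penalty
    slopes = np.diff(tau) / dt                       # phi'(t) on each segment
    r_inst = np.sum((slopes - 1.0) ** 2 * dt)        # instantaneous warp penalty
    return loss + lam_cuml * r_cuml + lam_inst * r_inst

t = np.linspace(0, 1, 200)
x = np.sin(10 * np.pi * t)
y = np.sin(10 * np.pi * t ** 2)                      # y = x o phi with phi(t) = t^2
print(gdtw_objective(t, t, x, y))                    # identity warp: large loss
print(gdtw_objective(t ** 2, t, x, y))               # true warp: loss ~ 0, some penalty
```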

Explore the impact of the smoothing regularization below to see how it improves over the classic DTW formulation.

[Interactive comparison: Classic DTW vs. GDTW, with an adjustable smoothing parameter.]

Interactive Demo

The demo below shows how all of the terms in the objective interact. The top chart shows the signals and their alignment via warp lines. Below it, three charts display $\phi(t)$ directly (dotted diagonal = identity), the cumulative deviation $\phi(t) - t$, and the instantaneous warping rate $\phi'(t) - 1$. Adjust the regularization parameters to see how they control smoothness and total warping.

A sine wave $x(t) = \sin(10\pi t)$ warped by a quadratic time function $\phi(t) = t^2$, so $y = x \circ \phi$.

[Interactive demo: sliders for $\lambda^{\text{inst}}$ (smoothness; higher values give smoother warping functions, lower values may produce singularities) and $\lambda^{\text{cuml}}$ (cumulative warp; higher values limit the total amount of warping, preventing over-fitting), plus loss and penalty selectors.]

Solving via dynamic programming with refinement

The optimization problem above is infinite-dimensional and generally non-convex. But the state dimension is one — scalar $\phi(t)$ with control $u(t) = \phi'(t)$ — so it can be solved by discretization and dynamic programming: the discretized problem is solved to global optimality despite the non-convexity.

Discretization

We discretize time with $N$ values and assume $\phi$ is piecewise linear, so it is fully specified by the vector $\tau \in \mathbb{R}^N$ where $\tau_i = \phi(t_i)$. We then discretize the values each $\tau_i$ can take into $M$ linearly spaced possibilities between lower and upper bounds $l_i$ and $u_i$.

This yields a directed graph with $MN$ nodes, where each node $(i, j)$ represents assigning $\tau_i$ to the $j$-th discretized value. Each node at time step $i$ has $M$ outgoing edges to nodes at step $i+1$, for a total of $(N-1)M^2$ edges. The discretized objective decomposes naturally onto this graph: each node carries the loss plus cumulative regularization cost for that $(t_i, \tau_j)$ pair, and each edge carries the instantaneous regularization cost for the slope between adjacent knot points. The total cost of a path from $(1, 1)$ to $(N, M)$ is exactly our discretized objective, and standard dynamic programming finds the globally optimal path in $O(NM^2)$ time.
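A compact version of this dynamic program might look like the sketch below (illustrative, not the library's implementation). It assumes squared penalties, a uniform time grid, and that the per-step candidate values `grid[i]` are already supplied:

```python
import numpy as np

def gdtw_dp(t, x, y, grid, lam_cuml=0.1, lam_inst=1.0, smin=0.0, smax=np.inf):
    """Shortest path over the N x M graph; grid[i] holds the M candidates for tau_i.

    Node costs carry the loss plus cumulative penalty, edge costs the instantaneous
    penalty; the common dt factor is dropped (uniform grid), which only rescales.
    """
    N, M = grid.shape
    dt = t[1] - t[0]                      # uniform spacing assumed
    # Node costs: squared residual plus cumulative-warp penalty at each (t_i, tau_j).
    node = np.empty((N, M))
    for i in range(N):
        resid = np.interp(grid[i], t, x) - y[i]
        node[i] = resid ** 2 + lam_cuml * (grid[i] - t[i]) ** 2
    cost = node[0].copy()                 # cheapest cost of any path ending at (0, j)
    choice = np.zeros((N, M), dtype=int)  # best predecessor column for each node
    for i in range(1, N):
        # Edge costs between every pair of columns at steps i-1 and i.
        slope = (grid[i][None, :] - grid[i - 1][:, None]) / dt
        edge = lam_inst * (slope - 1.0) ** 2
        edge[(slope < smin) | (slope > smax)] = np.inf   # forbid infeasible slopes
        total = cost[:, None] + edge + node[i][None, :]
        choice[i] = np.argmin(total, axis=0)
        cost = total[choice[i], np.arange(M)]
    # Trace the optimal column at each step back from the cheapest terminal node.
    j = int(np.argmin(cost))
    cols = [j]
    for i in range(N - 1, 0, -1):
        j = int(choice[i, j])
        cols.append(j)
    cols.reverse()
    return grid[np.arange(N), cols]       # tau_i^* at each time step
```

Here infeasible slopes are simply assigned infinite edge cost; the next section shows how choosing the grid bounds carefully avoids creating those edges in the first place.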

Baking slope constraints into the grid

The slope bounds $S_{\min}$ and $S_{\max}$ are hard constraints on $\phi'$ — the minimum and maximum allowed rate of time warping. In general, slope constraints can be enforced by assigning infinite cost to edges whose slope falls outside the allowed range. But this wastes computation on infeasible nodes that can never appear in an optimal path.

Instead, we choose the bounds on $\tau_i$ so that every node in the grid is reachable from $\tau_1 = 0$ and can reach $\tau_N = 1$ without violating the slope constraints:

l_i = \max\{\, S_{\min}\, t_i,\; 1 - S_{\max}(1-t_i) \,\}, \qquad u_i = \min\{\, S_{\max}\, t_i,\; 1 - S_{\min}(1-t_i) \,\}.

Together they carve out a parallelogram-shaped feasible region. Every point inside this region is feasible, so dynamic programming never encounters infeasible nodes or edges. Provided $M > N / S_{\max}$, the grid spacing is fine enough that adjacent columns always overlap, and the slope constraints are fully encoded in the grid geometry.
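In code, the bounds and the per-step candidate grids might be built as follows (an illustrative helper following the formulas above; the name and defaults are hypothetical):

```python
import numpy as np

def make_grid(t, M, smin=0.5, smax=2.0):
    """M candidate values per time step inside the slope-feasible parallelogram."""
    l = np.maximum(smin * t, 1.0 - smax * (1.0 - t))   # lower bound l_i
    u = np.minimum(smax * t, 1.0 - smin * (1.0 - t))   # upper bound u_i
    # Linearly spaced candidates; the candidate sets at t = 0 and t = 1 collapse
    # to the single values 0 and 1, enforcing phi(0) = 0 and phi(1) = 1.
    return np.linspace(l, u, M, axis=-1)               # shape (N, M)
```

Passing this grid to a dynamic-programming pass like the sketch in the previous section ties the two pieces together.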

The visualization below shows the computational grid for a single iteration. Hover over grid points to see the warping line each point represents. Points near the diagonal ($\tau \approx t$) produce nearly vertical warping lines (minimal warping), while points far from the diagonal create slanted lines (significant time distortion). The optimal path (in orange) traces the sequence of alignments that minimizes total cost. Adjust $S_{\min}$ and $S_{\max}$ to see how the slope bounds reshape the feasible region.

[Interactive figure: computational grid with the slope-feasible region and the optimal path.]

Iterative refinement

After solving the discretized problem, the accuracy of $\tau^*$ is limited by the spacing between adjacent grid values. To reduce this discretization error without increasing $M$, we shrink the bounds $[l_i, u_i]$ toward the current solution by a factor $\eta$:

l_i^{(q+1)} = \max\!\left\{\, \tau_i^{*(q)} - \eta\,\frac{u_i^{(q)} - l_i^{(q)}}{2},\; l_i^{(0)} \right\}, \qquad u_i^{(q+1)} = \min\!\left\{\, \tau_i^{*(q)} + \eta\,\frac{u_i^{(q)} - l_i^{(q)}}{2},\; u_i^{(0)} \right\}

The same $M$ grid points now span a narrower range, so the spacing between them is finer. Clipping against the original bounds $l_i^{(0)}$ and $u_i^{(0)}$ ensures the slope constraints remain satisfied. Each iteration costs $O(NM^2)$ — the same as the initial solve — and convergence typically occurs within three to four iterations.
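A sketch of the refinement loop (the `solve` callable stands in for one dynamic-programming pass, such as the sketch above; `eta` and the iteration count are illustrative defaults):

```python
import numpy as np

def refine(solve, l0, u0, M, eta=0.15, iters=4):
    """Shrink the per-step bounds toward the current solution and re-solve."""
    l, u = l0.copy(), u0.copy()
    tau = None
    for _ in range(iters):
        grid = np.linspace(l, u, M, axis=-1)   # M candidates per time step
        tau = solve(grid)                      # one dynamic-programming pass
        half = eta * (u - l) / 2.0
        # Narrow the bounds around the current solution, clipped to the original
        # bounds so the slope constraints stay satisfied.
        l = np.maximum(tau - half, l0)
        u = np.minimum(tau + half, u0)
    return tau
```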


Looking ahead

This post introduced the core idea behind GDTW: by recasting dynamic time warping as a continuous-time optimization problem with explicit regularization, we can eliminate the singularities that have plagued classical DTW for decades. But this framework opens the door to far more than just smoother alignments.

In upcoming posts, we'll explore these ideas in much greater depth:

The unifying theme across all of these is that time warping, when treated as a first-class optimization variable rather than a preprocessing step, becomes a remarkably flexible primitive for signal processing and machine learning. Stay tuned.