A General Optimization Framework for Dynamic Time Warping
Dave Deriso · 2026
Given two signals $x$ and $y$, the goal of dynamic time warping (DTW) is to find a warping function $\phi$ such that $x(\phi(t)) \approx y(t)$. In other words, instead of modifying amplitude, we modify (or "warp") time to align the two signals.
This problem arises naturally whenever you compare signals that share the same shape but differ in timing. A cardiologist comparing heartbeats across patients sees the same P-wave, QRS complex, and T-wave — but each occurs at slightly different times and speeds. An ADC clock hang stalls the digitized signal while the reference keeps running, then races to catch up. A multimodal waveform may stretch or compress differently across each of its lobes. In each case, a meaningful comparison requires first removing the timing differences — and that means finding the right warping function.
The figures above show three such problems. In each, the blue and orange signals share the same underlying shape, but their timing differs — the gray lines show the pointwise correspondence found by a warping function. Simple operations like shifting or scaling the time axis cannot fix these misalignments; the timing distortion varies across the signal, so only a nonlinear warping function can bring them into register.
Classic DTW
DTW is the classic tool for this, used wherever signals must be aligned before they can be compared. The classic DTW algorithm (Sakoe & Chiba, 1978) discretizes both signals into $N$ time steps and fills in an $N \times N$ cost matrix $D$ via the recurrence

$$ D(i, j) = d(x_i, y_j) + \min\bigl\{\, D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \,\bigr\}, $$

where each cell accumulates the local distance $d(x_i, y_j)$ plus the cheapest path from one of three neighbors. The optimal path is then recovered by backtracking from $(N, N)$ to $(1, 1)$. The three allowed transitions — right, up, or diagonal — enforce monotonicity and continuity, while the boundary conditions $(1, 1)$ and $(N, N)$ anchor the endpoints.
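To make the recurrence concrete, here is a minimal NumPy sketch of the fill-and-backtrack procedure described above. It is only an illustration (not the demo's implementation): it assumes 1-D signals, possibly of different lengths, and uses the absolute difference as the local distance.

```python
import numpy as np

def classic_dtw(x, y):
    """Classic DTW: fill the cumulative cost matrix, then backtrack the optimal path."""
    n, m = len(x), len(y)
    D = np.full((n, m), np.inf)
    D[0, 0] = abs(x[0] - y[0])
    # Fill each cell with the local distance plus the cheapest of the three neighbors.
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                D[i - 1, j] if i > 0 else np.inf,                  # from below
                D[i, j - 1] if j > 0 else np.inf,                  # from the left
                D[i - 1, j - 1] if (i > 0 and j > 0) else np.inf,  # diagonal
            )
            D[i, j] = abs(x[i] - y[j]) + prev
    # Backtrack from the last cell to (0, 0), always moving to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((D[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((D[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return D[-1, -1], path[::-1]
```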
The interactive demo below shows the classic DTW algorithm in action. The distance matrix is shown as a heatmap with the optimal path overlaid. Try adjusting $N$ to see how resolution affects the alignment.
Despite its popularity, DTW has a longstanding problem: the warping path is restricted to a discrete grid, so the resulting warping function is piecewise constant — a staircase. Flat segments of the staircase are singularities, where one signal's time stands still while the other advances, collapsing many time points onto one. The comparison below illustrates this issue.
[Figure: Classic DTW alignment]
Most approaches to reducing singularities fall into two camps: transforming the input signals (via derivatives, square-root velocity functions, adaptive down-sampling, or wavelet ensembles) to indirectly influence warping smoothness, or varying the continuity constraints by expanding allowable transitions via step patterns. Both are ad-hoc techniques applied outside the optimization itself, and both require hand-tuning for different signal types.
A continuous-time optimization framework
Instead of relying on pre- or post-processing, we handle singularities directly within the optimization. We pose DTW in continuous time, choosing a warping function $\phi$ that solves the following optimization problem:

$$
\begin{array}{ll}
\text{minimize} & \mathcal{L}(\phi) \;+\; \lambda_{\mathrm{cum}}\, \mathcal{R}_{\mathrm{cum}}(\phi) \;+\; \lambda_{\mathrm{inst}}\, \mathcal{R}_{\mathrm{inst}}(\phi) \\
\text{subject to} & \phi(0) = 0, \quad \phi(1) = 1, \quad s^{\min} \le \phi'(t) \le s^{\max},
\end{array}
$$

where $\lambda_{\mathrm{cum}}, \lambda_{\mathrm{inst}} \ge 0$ are hyperparameters that control the trade-off between alignment fidelity and warping regularity. The loss $\mathcal{L}$ penalizes misalignment, while the two regularization terms $\mathcal{R}_{\mathrm{cum}}$ and $\mathcal{R}_{\mathrm{inst}}$ penalize cumulative warping (limiting overfitting, analogous to ridge regularization) and the instantaneous warping rate (directly producing smoother warping functions), respectively. Let's unpack each term.
The loss functional
The loss $\mathcal{L}$ measures how well the time-warped signal $x(\phi(t))$ approximates the target $y(t)$. The penalty function $L$ maps pointwise residuals to non-negative costs:

$$ \mathcal{L}(\phi) = \int_0^1 L\bigl( x(\phi(t)) - y(t) \bigr)\, dt. $$
Common choices for the penalty $L$ include:
- Squared loss: $L(u) = u^2$
- Absolute loss: $L(u) = |u|$
- Huber loss: quadratic for small deviations, linear for outliers (a robust compromise; see the sketch below)
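For reference, here is a short sketch of these three penalty choices. The Huber threshold `delta` is a free parameter chosen by the user, not something fixed by the framework.

```python
import numpy as np

def squared_loss(u):
    """L(u) = u^2: smooth everywhere, but sensitive to outliers."""
    return u ** 2

def absolute_loss(u):
    """L(u) = |u|: robust to outliers, non-differentiable at zero."""
    return np.abs(u)

def huber_loss(u, delta=1.0):
    """Quadratic for |u| <= delta, linear beyond: a compromise between the two."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u ** 2,
                    delta * (np.abs(u) - 0.5 * delta))
```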
Cumulative warp regularization
The cumulative warp $\phi(t) - t$ measures how much the warped time deviates from the original time at each moment. The penalty function $R_{\mathrm{cum}}$ maps this deviation to a cost:

$$ \mathcal{R}_{\mathrm{cum}}(\phi) = \int_0^1 R_{\mathrm{cum}}\bigl( \phi(t) - t \bigr)\, dt. $$
This term limits overfitting, similar to ridge or lasso regularization in machine learning. The larger $\lambda_{\mathrm{cum}}$, the less $\phi(t)$ deviates from $t$.
Instantaneous warp regularization
The instantaneous rate of time warping $\phi'(t)$ measures how quickly the warping is changing at each moment. The penalty function $R_{\mathrm{inst}}$ maps this rate to a cost:

$$ \mathcal{R}_{\mathrm{inst}}(\phi) = \int_0^1 R_{\mathrm{inst}}\bigl( \phi'(t) \bigr)\, dt. $$
This is the key to eliminating singularities. By penalizing large instantaneous warping rates, we directly encourage smooth warping functions. The larger $\lambda_{\mathrm{inst}}$, the smoother the time warping function.
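Putting the three terms together, the sketch below evaluates a discretized version of the objective for a candidate piecewise-linear warp on a uniform grid. The squared penalties, the simple Riemann-sum integration, and the choice of penalizing the slope's deviation from 1 are illustrative assumptions rather than the only options, and the helper `x_interp` (any callable that evaluates $x$ at arbitrary times, e.g. built from `np.interp`) is a stand-in, not a library API.

```python
import numpy as np

def gdtw_objective(tau, t, x_interp, y, lam_cum, lam_inst):
    """Evaluate a discretized GDTW-style objective for a piecewise-linear warp.

    tau      : candidate warp values, tau[i] = phi(t[i]), with tau[0] = 0, tau[-1] = 1
    t        : uniform time grid on [0, 1]
    x_interp : callable that evaluates the signal x at arbitrary times
    y        : target signal sampled on the grid t
    """
    dt = t[1] - t[0]                        # uniform grid spacing
    residual = x_interp(tau) - y            # pointwise alignment error x(phi(t)) - y(t)
    loss = np.sum(residual ** 2) * dt       # squared-loss choice for L
    cum = np.sum((tau - t) ** 2) * dt       # squared penalty on the cumulative warp phi(t) - t
    slope = np.diff(tau) / dt               # instantaneous rate phi'(t) on each interval
    inst = np.sum((slope - 1.0) ** 2) * dt  # penalize deviation of the rate from 1
    return loss + lam_cum * cum + lam_inst * inst
```

With the identity warp (`tau = t`), both regularization terms vanish and the objective reduces to the integrated squared error between $x$ and $y$; under this choice of penalty, increasing `lam_inst` pushes the recovered warp toward that unit slope.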
Explore the impact of the smoothing regularization below to see how it improves over the classic DTW formulation.
[Interactive comparison: Classic DTW vs. GDTW]
Interactive Demo
The demo below shows how all of the terms in the objective interact. The top chart shows the signals and their alignment via warp lines. Below it, three charts display $\phi(t)$ directly (dotted diagonal = identity), the cumulative deviation $\phi(t) - t$, and the instantaneous warping rate $\phi'(t)$. Adjust the regularization parameters to see how they control smoothness and total warping.
A sine wave warped by a quadratic time function $\phi(t) = t^2$, so $y(t) = x(t^2)$.
Higher values of $\lambda_{\mathrm{inst}}$ result in smoother warping functions. Lower values may produce singularities.
Higher values of $\lambda_{\mathrm{cum}}$ limit the total amount of warping, preventing overfitting.
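To reproduce the demo's setup offline, a minimal sketch is shown below. The sine frequency and the number of samples are arbitrary choices; the quadratic warp $\phi(t) = t^2$ matches the caption above.

```python
import numpy as np

n = 500
t = np.linspace(0.0, 1.0, n)

x = np.sin(2 * np.pi * 3 * t)         # reference signal: three cycles of a sine wave
phi_true = t ** 2                     # quadratic time warp, phi(t) = t^2
y = np.sin(2 * np.pi * 3 * phi_true)  # target signal: the same sine, warped in time

# A good alignment should recover phi(t) close to t**2, so that x(phi(t)) matches y(t).
```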
Solving via dynamic programming with refinement
The optimization problem above is infinite-dimensional and generally non-convex. But the state dimension is one — a scalar state $\phi(t)$ with scalar control $\phi'(t)$ — so it can be solved to global optimality, after discretization, by dynamic programming.
Discretization
We discretize time with $N$ values $0 = t_1 < t_2 < \cdots < t_N = 1$ and assume $\phi$ is piecewise linear, so it is fully specified by the vector $\tau = (\tau_1, \ldots, \tau_N)$ where $\tau_i = \phi(t_i)$. We then discretize the values each $\tau_i$ can take into $M$ linearly spaced possibilities between lower and upper bounds $l_i$ and $u_i$.
This yields a directed graph with $NM$ nodes, where each node $(i, j)$ represents assigning $\tau_i$ to the $j$-th discretized value. Each node at time step $i$ has $M$ outgoing edges to nodes at step $i+1$, for a total of $(N-1)M^2$ edges. The discretized objective decomposes naturally onto this graph: each node carries the loss plus cumulative regularization cost for that $(t_i, \tau_i)$ pair, and each edge carries the instantaneous regularization cost for the slope between adjacent knot points. The total cost of a path from $\tau_1 = 0$ to $\tau_N = 1$ is exactly our discretized objective, and standard dynamic programming finds the globally optimal path in $O(NM^2)$ time.
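The forward pass and backtracking over this graph fit in a few lines. The sketch below assumes the node and edge costs have already been assembled into arrays (shapes noted in the docstring); it is meant to show the $O(NM^2)$ structure, not the engineered solver, and `solve_dp` is a hypothetical helper name.

```python
import numpy as np

def solve_dp(node_cost, edge_cost):
    """Shortest path through the N x M warping grid.

    node_cost : (N, M) array; loss plus cumulative-regularization cost of each node
    edge_cost : (N-1, M, M) array; edge_cost[i, j, k] is the instantaneous-
                regularization cost of moving from the j-th value at step i
                to the k-th value at step i+1
    Returns the index of the chosen value at each time step.
    """
    N, M = node_cost.shape
    total = np.full((N, M), np.inf)   # cheapest cost to reach each node
    parent = np.zeros((N, M), dtype=int)
    total[0] = node_cost[0]
    for i in range(1, N):             # forward pass: O(N * M^2)
        # candidate[j, k] = cost of reaching node (i, k) through node (i-1, j)
        candidate = total[i - 1][:, None] + edge_cost[i - 1] + node_cost[i][None, :]
        parent[i] = np.argmin(candidate, axis=0)
        total[i] = np.min(candidate, axis=0)
    # Backtrack from the cheapest terminal node to recover the optimal path.
    path = np.zeros(N, dtype=int)
    path[-1] = np.argmin(total[-1])
    for i in range(N - 1, 0, -1):
        path[i - 1] = parent[i, path[i]]
    return path
```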
Baking slope constraints into the grid
The slope bounds $s^{\min}$ and $s^{\max}$ are hard constraints on $\phi'(t)$ — the minimum and maximum allowed rate of time warping. In general, slope constraints can be enforced by assigning infinite cost to edges whose slope falls outside the allowed range. But this wastes computation on infeasible nodes that can never appear in an optimal path.
Instead, we choose the bounds on $\tau_i$ so that every node in the grid is reachable from $\tau_1 = 0$ and can reach $\tau_N = 1$ without violating the slope constraints:

$$ l_i = \max\bigl( s^{\min} t_i,\; 1 - s^{\max}(1 - t_i) \bigr), \qquad u_i = \min\bigl( s^{\max} t_i,\; 1 - s^{\min}(1 - t_i) \bigr). $$
Together they carve out a parallelogram-shaped feasible region. Every point inside this region is feasible, so dynamic programming never encounters infeasible nodes or edges. Provided $M$ is large enough, the grid spacing is fine enough that adjacent columns always overlap, and the slope constraints are fully encoded in the grid geometry.
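Computing the bounds themselves is a one-liner per side; here is a sketch under the convention $\phi(0) = 0$, $\phi(1) = 1$ (the function name is hypothetical).

```python
import numpy as np

def feasible_bounds(t, s_min, s_max):
    """Per-knot bounds on tau_i = phi(t_i) defining the parallelogram-shaped region.

    A value is kept only if it is reachable from tau_1 = 0 and can still reach
    tau_N = 1 without the slope ever leaving [s_min, s_max].
    """
    lower = np.maximum(s_min * t, 1.0 - s_max * (1.0 - t))
    upper = np.minimum(s_max * t, 1.0 - s_min * (1.0 - t))
    return lower, upper
```

Linearly spacing $M$ values between `lower[i]` and `upper[i]` then gives the columns of the grid described in the previous section.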
The visualization below shows the computational grid for a single iteration. Hover over grid points to see the warping line each point represents. Points near the diagonal ($\tau_i \approx t_i$) produce nearly vertical warping lines (minimal warping), while points far from the diagonal create slanted lines (significant time distortion). The optimal path (in orange) traces the sequence of alignments that minimizes total cost. Adjust $s^{\min}$ and $s^{\max}$ to see how the slope bounds reshape the feasible region.
Iterative refinement
After solving the discretized problem, the accuracy of $\tau$ is limited by the spacing between adjacent grid values. To reduce this discretization error without increasing $M$, we shrink the bounds toward the current solution $\tau^\star$ by a factor $\eta \in (0, 1)$:

$$ l_i \leftarrow \max\Bigl( l_i^{(0)},\; \tau_i^\star - \tfrac{\eta}{2}\,(u_i - l_i) \Bigr), \qquad u_i \leftarrow \min\Bigl( u_i^{(0)},\; \tau_i^\star + \tfrac{\eta}{2}\,(u_i - l_i) \Bigr), $$

where $l_i^{(0)}$ and $u_i^{(0)}$ denote the original slope-feasible bounds.
The same grid points now span a narrower range, so the spacing between them is finer. Clipping against the original bounds $l_i^{(0)}$ and $u_i^{(0)}$ ensures the slope constraints remain satisfied. Each iteration costs $O(NM^2)$ — the same as the initial solve — and convergence typically occurs within three to four iterations.
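One refinement step, written as a sketch consistent with the description above (the function name and the default shrink factor are arbitrary choices):

```python
import numpy as np

def refine_bounds(tau_star, lower, upper, lower0, upper0, eta=0.5):
    """Shrink the per-knot search interval around the current solution tau_star.

    lower, upper   : bounds used in the last solve
    lower0, upper0 : original slope-feasible bounds, used for clipping
    eta            : shrink factor in (0, 1); the default here is arbitrary
    """
    half_width = 0.5 * eta * (upper - lower)
    new_lower = np.clip(tau_star - half_width, lower0, upper0)
    new_upper = np.clip(tau_star + half_width, lower0, upper0)
    return new_lower, new_upper
```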
Looking ahead
This post introduced the core idea behind GDTW: by recasting dynamic time warping as a continuous-time optimization problem with explicit regularization, we can eliminate the singularities that have plagued classical DTW for decades. But this framework opens the door to far more than just smoother alignments.
In upcoming posts, we'll explore these ideas in much greater depth:
- Group Signal Alignment — jointly aligning collections of signals to a common reference via iterative time warping
- Warping Basis Functions — extracting principal modes of amplitude and timing variability via SVD of aligned signals and warps
- Inside the GDTW Solver — a detailed look at the computational machinery: distance matrices, the forward pass, backtracking, and iterative refinement
- Computing Gradients Through Time Warping — how to differentiate through the warping process, enabling end-to-end learning
- GDTW as a Differentiable Loss Function — using GDTW directly as a training objective for neural networks, including denoising autoencoders and time-series GANs
The unifying theme across all of these is that time warping, when treated as a first-class optimization variable rather than a preprocessing step, becomes a remarkably flexible primitive for signal processing and machine learning. Stay tuned.