EchoTracker consists of two stages, as shown in the figure: “Initialization” and “Iterative reinforcement”. The approach follows a two-fold coarse-to-fine strategy inspired by TAPIR. In the first stage, a coarse network initializes the trajectories from coarse-resolution feature maps. In the second stage, a fine network iteratively refines these trajectories using fine-grained feature maps. This strategy not only speeds up computation but also prevents the loss of important information due to downsampling. Although the networks in the two stages estimate trajectories independently, both exploit the point locations in the first frame to maintain spatial correlation and estimate coherent trajectories. Additionally, frame flow, defined as the difference between consecutive frames, is simply passed to the model to make it aware of global appearance changes. The model can process ultrasound sequences of any length with any number of query points, subject to available memory.
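The two-stage flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`frame_flow`, `track`, `constant_coarse`, `zero_fine`), the zero flow assigned to the first frame, the stacking of flow as an extra input channel, the residual form of the refinement update, and the default of four iterations are all assumptions made for illustration. The two networks are replaced by trivial stand-in callables.

```python
import numpy as np


def frame_flow(frames):
    """Difference between consecutive frames; frames has shape (T, H, W).

    Assumption: the first frame has no predecessor, so its flow is zero.
    """
    frames = np.asarray(frames, dtype=np.float32)
    flow = np.zeros_like(frames)
    flow[1:] = frames[1:] - frames[:-1]
    return flow


def track(frames, queries, coarse_net, fine_net, num_iters=4):
    """Hypothetical two-stage tracker returning trajectories of shape (T, N, 2).

    queries: (N, 2) point locations in the first frame.
    coarse_net / fine_net: stand-ins for the coarse and fine networks.
    """
    frames = np.asarray(frames, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)

    # Assumption: frame flow is stacked with the frames as a second channel.
    inputs = np.stack([frames, frame_flow(frames)], axis=-1)  # (T, H, W, 2)

    # Stage 1: initialization from the coarse network.
    trajectories = coarse_net(inputs, queries)

    # Stage 2: iterative refinement. Assumption: the fine network
    # predicts a residual that is added to the current estimate.
    for _ in range(num_iters):
        trajectories = trajectories + fine_net(inputs, queries, trajectories)
    return trajectories


def constant_coarse(inputs, queries):
    """Stand-in coarse network: every point stays at its query location."""
    num_frames = inputs.shape[0]
    return np.repeat(queries[None], num_frames, axis=0)


def zero_fine(inputs, queries, trajectories):
    """Stand-in fine network: predicts no correction."""
    return np.zeros_like(trajectories)
```

Both stand-in networks receive the first-frame query locations, mirroring the statement that each stage conditions on those points. In this sketch, nothing fixes the sequence length `T` or the number of query points `N`; only the array sizes, and therefore memory, grow with them.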