With the recent surge in sports tracking projects, many inspired by Skalski’s popular soccer tracking project, there has been a notable shift toward automated player tracking among sports hobbyists. Most of these approaches follow a well-recognized workflow: collect labeled data, train a YOLO model, project player coordinates onto an overhead view of the field or court, and use this tracking data to generate advanced analytics for potential competitive insights. In this project, however, we provide the tools to bypass the need for labeled data, relying instead on GroundingDINO’s zero-shot detection capabilities together with a Kalman filter implementation to overcome GroundingDINO’s noisy outputs.
Our dataset comes from a collection of broadcast videos made publicly available under an MIT License thanks to Hayden Faulkner and team.¹ The data includes footage from various tennis matches during the 2012 Olympics at Wimbledon; we focus on a match between Serena Williams and Victoria Azarenka.
GroundingDINO, for those not familiar, merges object detection with language, allowing users to provide a prompt like “a tennis player,” which leads the model to return candidate detection boxes that fit the description. RoboFlow has an excellent tutorial here for those interested in using it, but I have pasted some very basic code below as well. As shown, you can prompt the model to identify objects that would very rarely, if ever, be tagged in an object detection dataset, like a dog’s tongue!
from groundingdino.util.inference import load_model, load_image, predict, annotate

# load the pretrained model; adjust the config and weight paths to your setup
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)

BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25
TEXT_PROMPT = "dog tongue, dog"

# processes the image to GroundingDINO standards
image_source, image = load_image("dog.jpg")

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)
However, distinguishing players on a professional tennis court isn’t as simple as prompting for “tennis players.” The model often misidentifies other individuals on the court, such as line judges, ball people, and umpires, causing jumpy and inconsistent annotations. Moreover, the model sometimes fails to detect the players at all in certain frames, resulting in gaps and non-persistent boxes that don’t reliably appear in each frame.
To handle these challenges, we apply a few targeted methods. First, we narrow the detections down to just the three highest-probability boxes. Line judges often have a higher probability score than the players, which is why we don’t filter down to only two boxes. However, this raises a new question: how can we automatically distinguish players from line judges in each frame?
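Before turning to that question, here is a minimal sketch of the top-three filtering step, assuming boxes and logits are the tensors returned by the predict call above (the helper name keep_top_k is purely illustrative):

import torch

def keep_top_k(boxes: torch.Tensor, logits: torch.Tensor, k: int = 3):
    """Keep only the k highest-confidence detections in a frame."""
    k = min(k, logits.shape[0])
    top = torch.topk(logits, k=k).indices
    return boxes[top], logits[top]

top_boxes, top_logits = keep_top_k(boxes, logits, k=3)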
We observed that detection boxes for line judges and ball people typically persist for shorter time spans, often just a few frames. Based on this, we hypothesized that by associating boxes across consecutive frames, we could filter out those that appear only briefly, thereby isolating the players.
So how do we achieve this kind of association between objects across frames? Fortunately, the field of multi-object tracking has studied this problem extensively. Kalman filters are a mainstay in multi-object tracking, often combined with other identification metrics, such as color. For our purposes, a basic Kalman filter implementation is sufficient. In simple terms (for a deeper dive, check this article out), a Kalman filter is a method for probabilistically estimating an object’s position based on previous measurements. It is particularly effective with noisy data, but it also works well for associating objects across time in videos, even when detections are inconsistent, such as when a player is not tracked in every frame. We implement a full Kalman filter here but will walk through some of the essential steps in the following paragraphs.
A Kalman filter state for two dimensions is quite simple, as shown below. All we have to do is keep track of the x and y location as well as the object’s velocity in both directions (we ignore acceleration).
class KalmanStateVector2D:
    x: float
    y: float
    vx: float
    vy: float
The Kalman filter operates in two steps: it first predicts an object’s location in the next frame, then updates this prediction based on a new measurement, in our case from the object detector. However, in our setting a new frame may contain multiple new objects, or it may drop objects that were present in the previous frame, raising the question of how we associate boxes we have seen previously with those seen currently.
We do this using the Mahalanobis distance, coupled with a chi-squared test, to evaluate the probability that a current detection matches a past object. Moreover, we keep a queue of past objects so we have a longer ‘memory’ than just one frame. Specifically, our memory stores the trajectory of any object seen during the last 30 frames. Then, for each object we find in a new frame, we iterate over our memory and find the previous object most likely to match the current one, based on the probability given by the Mahalanobis distance. However, it is also possible we are seeing an entirely new object, in which case we should add a new object to our memory. If an object has less than a 30% probability of being associated with any box in our memory, we add it to our memory as a new object. A sketch of this matching step follows the full filter code below.
We provide our full Kalman filter implementation below for those who prefer code.
from dataclasses import dataclass

import numpy as np
from scipy import stats


class KalmanStateVectorNDAdaptiveQ:
    states: np.ndarray  # for two dimensions these are [x, y, vx, vy]
    cov: np.ndarray  # 4x4 covariance matrix

    def __init__(self, states: np.ndarray) -> None:
        self.state_matrix = states
        self.q = np.eye(self.state_matrix.shape[0])
        self.cov = None
        # assumes a single-step transition
        self.f = np.eye(self.state_matrix.shape[0])
        # divide by 2 as we have a velocity for each state
        index = self.state_matrix.shape[0] // 2
        self.f[:index, index:] = np.eye(index)

    def initialize_covariance(self, noise_std: float) -> None:
        self.cov = np.eye(self.state_matrix.shape[0]) * noise_std**2

    def predict_next_state(self, dt: float) -> None:
        self.state_matrix = self.f @ self.state_matrix
        self.predict_next_covariance(dt)

    def predict_next_covariance(self, dt: float) -> None:
        self.cov = self.f @ self.cov @ self.f.T + self.q

    def __add__(self, other: np.ndarray) -> np.ndarray:
        return self.state_matrix + other

    def update_q(
        self, innovation: np.ndarray, kalman_gain: np.ndarray, alpha: float = 0.98
    ) -> None:
        innovation = innovation.reshape(-1, 1)
        self.q = (
            alpha * self.q
            + (1 - alpha) * kalman_gain @ innovation @ innovation.T @ kalman_gain.T
        )
class KalmanNDTrackerAdaptiveQ:

    def __init__(
        self,
        state: KalmanStateVectorNDAdaptiveQ,
        R: float,  # measurement noise standard deviation
        Q: float,  # initial state covariance standard deviation
        h: np.ndarray = None,
    ) -> None:
        self.state = state
        self.state.initialize_covariance(Q)
        self.predicted_state = None
        self.previous_states = []
        self.h = np.eye(self.state.state_matrix.shape[0]) if h is None else h
        self.R = np.eye(self.h.shape[0]) * R**2
        self.previous_measurements = []
        self.previous_measurements.append(
            (self.h @ self.state.state_matrix).reshape(-1, 1)
        )

    def predict(self, dt: float) -> None:
        self.previous_states.append(self.state)
        self.state.predict_next_state(dt)

    def update_covariance(self, gain: np.ndarray) -> None:
        self.state.cov -= gain @ self.h @ self.state.cov

    def update(
        self, measurement: np.ndarray, dt: float = 1, predict: bool = True
    ) -> None:
        """Measurement is expected to be an (x, y) position."""
        self.previous_measurements.append(measurement)
        assert dt == 1, "Only single-step transitions are supported due to the F matrix"
        if predict:
            self.predict(dt=dt)
        innovation = measurement - self.h @ self.state.state_matrix
        gain_invertible = self.h @ self.state.cov @ self.h.T + self.R
        gain_inverse = np.linalg.inv(gain_invertible)
        gain = self.state.cov @ self.h.T @ gain_inverse
        new_state = self.state.state_matrix + gain @ innovation
        self.update_covariance(gain)
        self.state.update_q(innovation, gain)
        self.state.state_matrix = new_state

    def compute_mahalanobis_distance(self, measurement: np.ndarray) -> float:
        innovation = measurement - self.h @ self.state.state_matrix
        return np.sqrt(
            innovation.T
            @ np.linalg.inv(self.h @ self.state.cov @ self.h.T + self.R)
            @ innovation
        )

    def compute_p_value(self, distance: float) -> float:
        # the squared Mahalanobis distance follows a chi-squared distribution
        return 1 - stats.chi2.cdf(distance**2, df=self.h.shape[0])

    def compute_p_value_from_measurement(self, measurement: np.ndarray) -> float:
        """Returns the probability that the measurement is consistent with the predicted state."""
        distance = self.compute_mahalanobis_distance(measurement)
        return self.compute_p_value(distance)
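To make the matching step described above concrete, below is a minimal sketch of how detections in a new frame could be associated with the tracks in memory using compute_p_value_from_measurement. The helper name, the R and Q values, and the greedy one-pass matching are illustrative assumptions rather than the exact implementation, and pruning of tracks that go unmatched for 30 frames is omitted for brevity.

P_VALUE_THRESHOLD = 0.30  # below this, a detection is treated as a new object

def associate_detections(
    trackers: list[KalmanNDTrackerAdaptiveQ], detections: list[np.ndarray]
) -> None:
    """Greedily match each detection to the most probable existing tracker."""
    h = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])  # observe x, y only
    for detection in detections:  # detection is an (x, y) array
        p_values = [t.compute_p_value_from_measurement(detection) for t in trackers]
        if p_values and max(p_values) >= P_VALUE_THRESHOLD:
            best = int(np.argmax(p_values))
            trackers[best].update(detection)
        else:
            # no existing track is a plausible match: start a new one
            state = KalmanStateVectorNDAdaptiveQ(
                np.array([detection[0], detection[1], 0.0, 0.0])
            )
            trackers.append(KalmanNDTrackerAdaptiveQ(state, R=5.0, Q=1.0, h=h))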
Having tracked every detected object over the past 30 frames, we can now devise heuristics to pinpoint which boxes most likely represent our players. We tested two approaches: choosing the boxes nearest the center of the baseline, and picking those with the longest observed history in our memory. Empirically, the first strategy often flagged line judges as players whenever the actual player moved away from the baseline, making it less reliable. Meanwhile, we noticed that GroundingDINO tends to “flicker” between different line judges and ball people, while real players maintain a fairly stable presence. As a result, our final rule is to pick the boxes in our memory with the longest tracking history as the true players. As you can see in the initial video, it’s surprisingly effective for such a simple rule!
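A minimal sketch of that selection rule, using the previous_measurements list each tracker already maintains (the helper name is illustrative):

def select_players(
    trackers: list[KalmanNDTrackerAdaptiveQ], n_players: int = 2
) -> list[KalmanNDTrackerAdaptiveQ]:
    """Pick the longest-lived tracks as the true players."""
    by_history = sorted(
        trackers, key=lambda t: len(t.previous_measurements), reverse=True
    )
    return by_history[:n_players]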
With our tracking system established in image space, we can move toward a more traditional analysis by tracking players from a bird’s-eye perspective. This viewpoint enables the evaluation of key metrics, such as total distance traveled, player speed, and court positioning trends. For instance, we could analyze whether a player frequently targets their opponent’s backhand based on their location during a point. To do this, we need to project the player coordinates from the image onto a standardized overhead court template, aligning the perspective for spatial analysis.
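As a quick illustration, once we have per-frame court positions (obtained via the homography described next), distance and speed follow from simple differences between consecutive frames. The array name, the assumption that court coordinates are in meters, and the frame rate value are all ours:

import numpy as np

def distance_and_speed(court_positions: np.ndarray, fps: float = 25.0):
    """court_positions: (n_frames, 2) array of a player's court coordinates in meters."""
    step_lengths = np.linalg.norm(np.diff(court_positions, axis=0), axis=1)
    total_distance = step_lengths.sum()  # meters
    speeds = step_lengths * fps          # meters per second, frame by frame
    return total_distance, speeds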
This is where homography comes into play. A homography describes the mapping between two planes, which, in our case, means mapping points in our original image to an overhead court view. By identifying a few keypoints in the original image, such as line intersections on the court, we can calculate a homography matrix that translates any point to a bird’s-eye view. To create this homography matrix, we first need to identify these ‘keypoints.’ Various open-source, permissively licensed models on platforms like RoboFlow can help detect these points, or we can label them ourselves on a reference image to use in the transformation.
After labeling these keypoints, the next step is to match them with corresponding points on a reference court image to generate a homography matrix. Using OpenCV, we can then create this transformation matrix with a few simple lines of code!
import numpy as np
import cv2

# the order of the points matters: source[i] must correspond to target[i]
source = np.array(keypoints)     # (n, 2) matrix of image-space keypoints
target = np.array(court_coords)  # (n, 2) matrix of reference-court coordinates
m, _ = cv2.findHomography(source, target)
With the homography matrix in hand, we can map any point from our image onto the reference court. For this project, our focus is on each player’s position on the court. To determine this, we take the midpoint at the bottom of each player’s bounding box and use it as their location on the court in the bird’s-eye view.
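A minimal sketch of that projection, assuming the tracked boxes have already been converted to pixel-space (x1, y1, x2, y2) format and m is the homography matrix from above:

import numpy as np
import cv2

def project_players_to_court(boxes_xyxy: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Map the bottom-center of each bounding box onto the reference court."""
    # bottom midpoint of each box: ((x1 + x2) / 2, y2)
    feet = np.stack(
        [(boxes_xyxy[:, 0] + boxes_xyxy[:, 2]) / 2, boxes_xyxy[:, 3]], axis=1
    )
    # cv2.perspectiveTransform expects an (n, 1, 2) float array
    feet = feet.reshape(-1, 1, 2).astype(np.float32)
    court_points = cv2.perspectiveTransform(feet, m)
    return court_points.reshape(-1, 2)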
In summary, this project demonstrates how we can use GroundingDINO’s zero-shot capabilities to track tennis players without relying on labeled data, transforming generic object detection into actionable player tracking. By tackling key challenges, such as distinguishing players from other on-court personnel, ensuring consistent tracking across frames, and mapping player movements to a bird’s-eye view of the court, we’ve laid the groundwork for a robust tracking pipeline, all without the need for explicit labels.
This approach doesn’t just unlock insights like distance traveled, speed, and positioning; it also opens the door to deeper match analytics, such as shot targeting and strategic court coverage. With further refinement, including distilling a YOLO or RT-DETR model from GroundingDINO outputs, we could even develop a real-time tracking system that rivals existing commercial solutions, providing a powerful tool for both coaching and fan engagement in the world of tennis.