This is a guest blog post by the Pollen Robotics team. We are the creators of Reachy, an open-source humanoid robot designed for manipulation in the real world.
In the context of autonomous behaviors, the essence of a robot's usability lies in its ability to understand and interact with its environment. This understanding primarily comes from visual perception, which enables robots to identify objects, recognize people, navigate spaces, and much more.
We're excited to share the initial release of our open-source pollen-vision library, a first step towards giving our robots the autonomy to grasp unknown objects. This library is a carefully curated collection of vision models chosen for their direct applicability to robotics. Pollen-vision is designed for ease of installation and use, and is composed of independent modules that can be combined to create a 3D object detection pipeline, getting the position of objects in 3D space (x, y, z).
We focused on selecting zero-shot models, eliminating the need for any training and making these tools instantly usable right out of the box.
Our initial release is focused on 3D object detection, laying the groundwork for tasks like robotic grasping by providing a reliable estimate of objects' spatial coordinates. Currently limited to positioning in 3D space (not extending to full 6D pose estimation), this functionality establishes a solid foundation for basic robotic manipulation tasks.
The Core Models of Pollen-Vision
The library encapsulates several key models. We want the models we use to be zero-shot and versatile, allowing a wide range of detectable objects without re-training. The models also need to be "real-time capable", meaning they should run at least at a few fps on a consumer GPU. The first models we selected are:
- OWL-ViT (Open World Localization – Vision Transformer, by Google Research): This model performs text-conditioned zero-shot 2D object localization in RGB images. It outputs bounding boxes (like YOLO).
- MobileSAM: A lightweight version of the Segment Anything Model (SAM) by Meta AI. SAM is a zero-shot image segmentation model. It can be prompted with bounding boxes or points.
- RAM (Recognize Anything Model, by the OPPO Research Institute): Designed for zero-shot image tagging, RAM can determine the presence of an object in an image based on textual descriptions, laying the groundwork for further analysis.
Get started in just a few lines of code!
Below is an example of how to use pollen-vision to build a simple object detection and segmentation pipeline, taking only images and text as input.
```python
from pollen_vision.vision_models.object_detection import OwlVitWrapper
from pollen_vision.vision_models.object_segmentation import MobileSamWrapper
from pollen_vision.vision_models.utils import Annotator, get_bboxes

owl = OwlVitWrapper()
sam = MobileSamWrapper()
annotator = Annotator()

im = ...
predictions = owl.infer(im, ["paper cups"])  # zero-shot object detection from a text prompt
bboxes = get_bboxes(predictions)

masks = sam.infer(im, bboxes=bboxes)  # zero-shot segmentation prompted with the boxes
annotated_im = annotator.annotate(im, predictions, masks=masks)
```
OWL-ViT's inference time depends on the number of prompts provided (i.e., the number of objects to detect). Here are some measurements on a laptop with an RTX 3070 GPU:
| Number of prompts | Inference time per frame |
|-------------------|--------------------------|
| 1                 | ~75 ms                   |
| 2                 | ~130 ms                  |
| 3                 | ~180 ms                  |
| 4                 | ~240 ms                  |
| 5                 | ~330 ms                  |
| 10                | ~650 ms                  |
So, performance-wise, it is interesting to only prompt OWL-ViT with objects that we know are in the image. That's where RAM is useful, as it is fast and provides exactly this information.
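For illustration, here is a minimal sketch of that idea. The `RamWrapper` class name and its `infer` method are assumptions made for this example and may not match the actual pollen-vision API; `owl` and `im` are reused from the snippet above.

```python
# Hypothetical sketch: use an image tagger (RAM) to narrow down the prompts sent to OWL-ViT.
# `RamWrapper` and its `infer` method are assumed names, not necessarily the real API.
candidate_objects = ["paper cups", "mug", "screwdriver"]

ram = RamWrapper()    # assumed zero-shot image-tagging wrapper
tags = ram.infer(im)  # e.g. ["cup", "table", "hand", ...]

# Only prompt OWL-ViT with the candidates that RAM actually tagged in the image
prompts = [obj for obj in candidate_objects if any(tag in obj for tag in tags)]
predictions = owl.infer(im, prompts) if prompts else []
```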
A robotics use case: grasping unknown objects in unconstrained environments
With the object's segmentation mask, we can estimate its (u, v) position in pixel space by computing the centroid of the binary mask. Here, having the segmentation mask is very useful because it allows us to average the depth values inside the mask rather than inside the whole bounding box, which also contains background that would skew the average.
One way to do that is by averaging the u and v coordinates of the non-zero pixels in the mask:
```python
import numpy as np

def get_centroid(mask):
    # np.argwhere returns (row, col) pairs, so the mean over them gives (v, u)
    v_center, u_center = np.argwhere(mask == 1).sum(0) / np.count_nonzero(mask)
    return int(u_center), int(v_center)
```
We can now bring in depth information in order to estimate the z coordinate of the object. The depth values are already in meters, but the (u, v) coordinates are expressed in pixels. We can get the (x, y, z) position of the centroid of the object in meters using the camera's intrinsic matrix (K).
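As a reminder, K here is the standard pinhole intrinsic matrix (assuming zero skew), where \\(f_x\\) and \\(f_y\\) are the focal lengths in pixels and \\((c_x, c_y)\\) is the principal point:

$$
K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}
$$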
```python
def uv_to_xyz(z, u, v, K):
    cx = K[0, 2]  # principal point (x)
    cy = K[1, 2]  # principal point (y)
    fx = K[0, 0]  # focal length (x)
    fy = K[1, 1]  # focal length (y)

    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    return np.array([x, y, z])
```
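Putting the two helpers together, an end-to-end sketch could look like the following. The `mask`, the `depth` map (in meters, aligned with the RGB image), and the intrinsic values in `K` are assumptions for this example; in practice they come from the segmentation step, your depth camera, and its calibration.

```python
import numpy as np

# Assumed inputs: `mask` comes from MobileSamWrapper, `depth` is an HxW depth map
# in meters aligned with the RGB image, and K is the camera's intrinsic matrix
# (the values below are purely illustrative; use your camera's calibration).
K = np.array([
    [600.0,   0.0, 320.0],
    [  0.0, 600.0, 240.0],
    [  0.0,   0.0,   1.0],
])

u, v = get_centroid(mask)
z = float(depth[mask == 1].mean())  # average depth inside the mask, in meters
xyz_camera = uv_to_xyz(z, u, v, K)  # 3D position in the camera frame
```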
We now have an estimate of the 3D position of the object in the camera's reference frame.
If we know where the camera is positioned relative to the robot's origin frame, we can perform a simple transformation to get the 3D position of the object in the robot's frame. This means we can move the end effector of our robot to where the object is, and grasp it! 🥳
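As a minimal sketch, assuming we know a 4x4 homogeneous matrix `T_robot_camera` (from calibration or the robot's kinematic model) that expresses camera-frame points in the robot's origin frame, the transformation boils down to a single matrix product:

```python
import numpy as np

# Assumed: T_robot_camera is a 4x4 homogeneous transform (rotation + translation),
# obtained from calibration or the robot's kinematic model, that maps points
# expressed in the camera frame to the robot's origin frame.
def camera_to_robot_frame(xyz_camera, T_robot_camera):
    point_h = np.append(xyz_camera, 1.0)   # homogeneous coordinates (x, y, z, 1)
    return (T_robot_camera @ point_h)[:3]  # (x, y, z) in the robot's frame

xyz_robot = camera_to_robot_frame(xyz_camera, T_robot_camera)
```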
What’s next?
What we presented in this post is a first step towards our goal, which is autonomous grasping of unknown objects in the wild. There are a few issues that still need to be addressed:
- OWL-ViT does not detect everything every time and can be inconsistent. We are looking for a better option.
- There is no temporal or spatial consistency so far; everything is recomputed at every frame.
- We are currently working on integrating a point tracking solution to enhance the consistency of the detections.
- The grasping technique (only front grasps for now) was not the focus of this work. We will be working on different approaches to enhance the grasping capabilities, both in terms of perception (6D detection) and grasping pose generation.
- Overall speed could be improved.
Try pollen-vision
Want to try pollen-vision? Check out our GitHub repository!
