Pixels have an area


The tile vs the point

There are generally two different and incompatible perspectives on pixels:

One school of thought considers a pixel to be an infinitesimally small point, and thus an image is a grid of points sparsely sampling the color information of the captured scene. The other school of thought considers a pixel to be a tiny square (sometimes rectangular) area whose color summarizes all the light falling within that area; think of a screen pixel or a camera's receptor.

Both of these are valid, depending on the context they are used in. However, it can be important to understand the difference to avoid subtle mistakes. Computer vision and image processing libraries typically treat pixels as having an area.

One common situation where this becomes important is when one wants to resample vectors. For example, in ViT, when we change the resolution of the input, we resample the position embeddings. More precisely, if we train a ViT with patch-size 16px² on 224px² images, that model has 14² input tokens, and thus 14² learned position embedding vectors. If we want to increase the resolution of the model to, say, 448px², we now need to turn these 14² position embedding vectors into 28² position embedding vectors. Intuitively, since we stretch the image, we'd want to stretch the position embeddings. It sounds like resizing a 14x14 "image" of position embeddings to a 28x28 "image" of position embeddings.

Except, not quite. In the official ViT implementation, we use the rather obscure scipy.ndimage.zoom API instead of cv2.resize, tf.image.resize, or similar. Why is that? It is because position embeddings indicate points, not areas, and hence we need to resample them, not resize them. That's a mouthful, so here's a visual representation of the difference between the two:

Pixels with an area: resizing

[Figure: resizing using PIL, cv2, {jax,tf}.image, torchvision, ...; the panels show source values, source support, target support, and target values.]

Points without area: resampling

[Figure: resampling using scipy.ndimage, scipy.interpolate, ...; the panels show source values, source support, target support, and target values.]
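
To make the position-embedding example above concrete, here is a minimal sketch of the idea (the resample_posemb helper, the dummy posemb array, and the choice of order=1 are mine, not copied from the official repository): reshape the 14² learned vectors into a 14x14 grid, resample that grid as a grid of points with scipy.ndimage.zoom, and flatten it back.

import numpy as np
import scipy.ndimage

def resample_posemb(posemb, old_grid=14, new_grid=28):
    """Resample ViT position embeddings, treating each one as a point sample.

    posemb: array of shape (old_grid**2, dim), row-major over the grid.
    Returns an array of shape (new_grid**2, dim).
    """
    dim = posemb.shape[-1]
    grid = posemb.reshape(old_grid, old_grid, dim)
    # Stretch the two spatial axes only; leave the embedding dimension alone.
    zoom = (new_grid / old_grid, new_grid / old_grid, 1)
    grid = scipy.ndimage.zoom(grid, zoom, order=1)  # order=1: linear interpolation
    return grid.reshape(new_grid * new_grid, dim)

posemb = np.random.randn(14 * 14, 768).astype(np.float32)  # dummy embeddings
print(resample_posemb(posemb).shape)  # (784, 768)

Incidentally, newer scipy versions expose exactly this distinction on scipy.ndimage.zoom via the grid_mode argument: the default grid_mode=False treats the samples as points (as above), whereas grid_mode=True treats them as cells with an area.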

There is a closely related gotcha in computer vision which almost everyone gets wrong the first time they deal with bounding boxes. Bounding boxes are usually specified in normalized image coordinates, with 0.0 being the top-left and 1.0 the bottom-right.

But: what is top-left? What is bottom-right? Is it the center of the bottom-right pixel, or the bottom-right "end" of the image? The difference manifests itself as an off-by-one error when converting between normalized coordinates and actual pixels:

x_rel = x_pix / width    vs    x_rel = x_pix / (width + 1)

The picture below illustrates this. For larger boxes it does not really matter in practice, but for small boxes this mistake can have a large effect on metrics such as intersection over union (IoU). While modeling, it is safe to do everything in relative (0..1) coordinates and not think about this; no mistake can happen there. It is when converting between relative coordinates and pixel values, on both the user-input and the output side, that one needs to be careful to stay consistent. The vast majority of codebases follow the pixel-center convention.

Pixel center (left) vs pixel border (right) for bounding-box coordinates:
[Figure: (0.0, 0.0) marks the top-left in both panels; with pixel centers (left), (1.0, 1.0) = (4, 6), while with pixel borders (right), (1.0, 1.0) = (5, 7).]
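
As a minimal sketch of the two conventions (the helper names and the n = 5 example are mine; n denotes the number of pixels along one axis, so the last pixel index is n - 1):

def rel_to_pix(rel, n, pixel_center=True):
    """Normalized (0..1) coordinate -> pixel coordinate along one axis of n pixels."""
    if pixel_center:
        # 1.0 is the center of the last pixel (index n - 1).
        return rel * (n - 1)
    # 1.0 is the far border of the image.
    return rel * n

def pix_to_rel(pix, n, pixel_center=True):
    """Pixel coordinate -> normalized (0..1) coordinate along one axis of n pixels."""
    return pix / (n - 1) if pixel_center else pix / n

# Example: an axis with n = 5 pixels.
print(rel_to_pix(1.0, 5, pixel_center=True))   # 4.0
print(rel_to_pix(1.0, 5, pixel_center=False))  # 5.0

Mixing the two, for instance normalizing with one convention and denormalizing with the other, is exactly the off-by-one error described above.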