The tile vs the point
There are generally two different and incompatible perspectives on pixels:
One school of thought considers a pixel to be an infinitesimally small point, and thus an image is a grid of points sparsely sampling the color information of the captured scene. The other school of thought considers a pixel to be a tiny square (sometimes rectangular) area, whose color is the summary of all light in that area; think of a screen pixel or a camera's receptor.
Both of these are valid, depending on the context they are used in. However, it can be important to understand the difference to avoid subtle mistakes. Computer vision and image processing libraries typically treat pixels as having an area.
One common situation where this becomes important is when one wants to resample vectors. For example, in ViT, when we change the resolution of the input, we resample the position embeddings. More precisely, if we train a ViT with patch-size 16px² on 224px² images, that model has 14² input tokens, and thus 14² learned position embedding vectors. If we want to increase the resolution of the model to, say, 448px², we now need to turn these 14² position embedding vectors into 28² position embedding vectors. Intuitively, since we stretch the image, we'd want to stretch the position embeddings. It sounds like resizing a 14x14 "image" of position embeddings to a 28x28 "image" of position embeddings.
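To make this concrete, the naive version could look roughly like this (a quick sketch, not the actual ViT code; the function name, shapes, and the choice of jax.image.resize are mine):

```python
import jax
import numpy as np

def naive_resize_posemb(posemb, new_hw=(28, 28), old_hw=(14, 14)):
    """Reshape the (14*14, d) position embeddings into a 14x14 'image' and resize it."""
    d = posemb.shape[-1]
    grid = posemb.reshape(old_hw[0], old_hw[1], d)                   # (14, 14, d)
    # jax.image.resize (like typical image-resize functions) uses half-pixel
    # centers, i.e. it treats each entry as a little tile with an area.
    grid = jax.image.resize(grid, (new_hw[0], new_hw[1], d), method="bilinear")
    return grid.reshape(-1, d)                                        # (28*28, d)

posemb = np.random.randn(14 * 14, 768).astype(np.float32)
print(naive_resize_posemb(posemb).shape)  # (784, 768)
```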
Except... not quite.
In the official ViT implementation, we use the rather obscure scipy.ndimage.zoom API instead of cv2.resize, tf.image.resize, or similar.
Why is that? Because position embeddings indicate points, not areas, and hence we need to resample them, not resize them.
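In code, that point-based resampling looks roughly like this (a simplified sketch of the idea, ignoring the class-token embedding; the function name and shapes are mine, not the exact implementation):

```python
import numpy as np
import scipy.ndimage

def resample_posemb(posemb, new_hw=(28, 28), old_hw=(14, 14)):
    """Resample a grid of position embeddings, treating each vector as a point sample."""
    d = posemb.shape[-1]
    grid = posemb.reshape(old_hw[0], old_hw[1], d)                   # (14, 14, d)
    zoom = (new_hw[0] / old_hw[0], new_hw[1] / old_hw[1], 1)         # don't zoom the feature axis
    # order=1 is (bi)linear interpolation between the sample points themselves.
    grid = scipy.ndimage.zoom(grid, zoom, order=1)                   # (28, 28, d)
    return grid.reshape(-1, d)                                        # (28*28, d)

posemb = np.random.randn(14 * 14, 768).astype(np.float32)
print(resample_posemb(posemb).shape)  # (784, 768)
```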
That's a mouthful, so here's a visual representation of the difference between the two:
Left: pixels with an area, i.e. resizing.
Right: points without area, i.e. resampling.
Whether the left or the right one is done depends on the library you use. Some functions even let you choose, with a flag often called align_corners or similar. TensorFlow 1 had it, but it disappeared in v2.
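To spell out what such a flag actually changes, here are the two coordinate mappings written out by hand (an illustrative sketch, not the code of any particular library):

```python
import numpy as np

def src_coords_align_corners(out_size, in_size):
    """'Points' view: the first and last output samples land exactly on the first and last input samples."""
    return np.arange(out_size) * (in_size - 1) / (out_size - 1)

def src_coords_half_pixel(out_size, in_size):
    """'Tiles' view: each pixel is an area whose center sits at a half-integer offset."""
    return (np.arange(out_size) + 0.5) * in_size / out_size - 0.5

print(np.round(src_coords_align_corners(8, 4), 2))  # [0.   0.43 0.86 1.29 1.71 2.14 2.57 3.  ]
print(np.round(src_coords_half_pixel(8, 4), 2))     # [-0.25  0.25  0.75  1.25  1.75  2.25  2.75  3.25]
```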
Related: box coordinates - pixel corner or pixel center?
There is a closely related and very common gotcha in computer vision which almost everyone gets wrong the first time they deal with bounding boxes. Bounding boxes are usually specified in normalized image coordinates, with (0.0, 0.0) being the top-left and (1.0, 1.0) the bottom-right corner.
But: what is the top-left, exactly? And the bottom-right? Is it the center of the bottom-right pixel, or the bottom-right "end" of the image? The difference manifests itself as an off-by-one error when converting between normalized coordinates and actual pixels.
The picture below illustrates this. For larger boxes the difference does not really matter in practice, but for small boxes this mistake can have a large effect on metrics such as intersection over union (IoU). While modeling, it is safe to do everything in relative (0..1) coordinates and not think about this; no mistake can happen there. It is when converting between relative coordinates and pixel values, on both the input and the output side, that one needs to be careful to stay consistent. The vast majority of codebases follow the pixel-center convention.
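A tiny sketch of that conversion makes the off-by-one concrete (the function names and image width here are made up for illustration):

```python
def norm_to_px_center(x, size):
    """Pixel-center convention: 1.0 maps to the center of the last pixel."""
    return x * (size - 1)

def norm_to_px_corner(x, size):
    """Pixel-corner convention: 1.0 maps to the outer edge of the image."""
    return x * size

W = 640
print(norm_to_px_center(1.0, W))  # 639.0
print(norm_to_px_corner(1.0, W))  # 640.0
# One pixel of difference: negligible for a large box, but a big chunk of a small one's IoU.
```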
If you'd like to read even further on the topic, Yuxin Wu (of Detectron fame) wrote a very in-depth post about it a few years ago and has contributed to several interesting discussions on the subject.