Various Image Tasks

Why Create This Page

I want to create this page to test out various tasks that can be accomplished with machine learning models on images. Some of these tasks include:

Image Feature Extraction

Image feature extraction is the task of extracting semantically meaningful features from a given image. This has many use cases, including image similarity and image retrieval. Moreover, most computer vision models can be used for image feature extraction, where one can remove the task-specific head (image classification, object detection, etc.) and get the features. These features are very useful at a higher level: edge detection, corner detection, and so on.

I am using the google/vit-base-patch16-384 model for image feature extraction. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pre-trained in a supervised fashion on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224 and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384.
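Below is a minimal sketch of how this might look with the transformers library, assuming the ViTModel class and a placeholder sample image URL; the [CLS] token embedding is taken as the image's feature vector.

```python
# Minimal sketch: image feature extraction with google/vit-base-patch16-384.
# The image URL is a placeholder assumption; substitute any image you like.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-384")
model = ViTModel.from_pretrained("google/vit-base-patch16-384")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding as a single feature vector for the whole image.
features = outputs.last_hidden_state[:, 0]
print(features.shape)  # (1, 768)
```

These vectors can then be compared with, for example, cosine similarity to support the image similarity and retrieval use cases mentioned above.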

Image Captioning

Image captioning is the task of predicting a caption for a given image. Common real-world applications include aiding visually impaired people as they navigate through different situations. Image captioning therefore helps improve content accessibility by describing images to people.

I use the Salesforce/blip-image-captioning-large model for image captioning. BLIP is a vision-language pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. It makes effective use of noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones.
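A minimal sketch of generating a caption with this model, assuming the BlipProcessor and BlipForConditionalGeneration classes from transformers and a placeholder sample image URL:

```python
# Minimal sketch: image captioning with Salesforce/blip-image-captioning-large.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Unconditional captioning: the model generates a caption from the image alone.
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```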

Text to Image Generation

Text-to-image is the task of generating images from input text. These pipelines can also be used to modify and edit images based on text prompts.

Text-to-image models comparable to those offered by large tech companies today are impossible to run cheaply on a rented GPU, and they are not particularly cheap even when called through an API. Having used it a few times, I nevertheless recommend OpenAI's DALL·E 3 model for text-to-image generation.
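For reference, here is a hedged sketch of what a DALL·E 3 call might look like through the OpenAI Python SDK; the prompt is made up, and the call assumes the openai package is installed and an OPENAI_API_KEY is set in the environment.

```python
# Hedged sketch: text-to-image generation via the OpenAI API (not run locally).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="a watercolor painting of a lighthouse at dusk",  # example prompt
    size="1024x1024",
    n=1,
)

# The response contains a URL to the generated image.
print(response.data[0].url)
```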

Depth Estimation

Depth estimation is the task of predicting depth of objects present in an image.

I am using the Intel/dpt-large model for depth estimation. The model was trained on 1.4 million images for monocular depth estimation and was first introduced in the paper Vision Transformers for Dense Prediction.
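A minimal sketch using the transformers depth-estimation pipeline with this model, assuming a placeholder sample image URL:

```python
# Minimal sketch: monocular depth estimation with Intel/dpt-large.
import requests
from PIL import Image
from transformers import pipeline

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth_estimator(image)

# result["depth"] is a PIL image of the predicted depth map;
# result["predicted_depth"] is the raw tensor.
result["depth"].save("depth.png")
```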

Object Detection

Object detection is the computer vision task of detecting instances (such as humans, buildings, or cars) in an image. Object detection models receive an image as input and output coordinates of the bounding boxes and associated labels of the detected objects. An image can contain multiple objects, each with its own bounding box and a label (e.g. it can have a car and a building), and each object can be present in different parts of an image (e.g. the image can have several cars). This task is commonly used in autonomous driving for detecting things like pedestrians, road signs, and traffic lights. Other applications include counting objects in images, image search, and more.

I use the facebook/detr-resnet-50 model for object detection. The DETR model was trained end-to-end on COCO 2017 object detection (118k annotated images). It is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and an MLP for the bounding boxes. You can use the raw model for object detection.
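A minimal sketch of running the raw model with transformers, assuming the DetrImageProcessor and DetrForObjectDetection classes and a placeholder sample image URL:

```python
# Minimal sketch: object detection with facebook/detr-resnet-50.
import torch
import requests
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold and map boxes back to image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```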

Image Segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of image segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.

I am using the nvidia/segformer-b1-finetuned-ade-512-512 model for image segmentation. The SegFormer model was fine-tuned on ADE20k at resolution 512x512. It consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head, and achieves strong results on semantic segmentation benchmarks. You can use the raw model for semantic segmentation.
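A minimal sketch with transformers, assuming the SegformerImageProcessor and SegformerForSemanticSegmentation classes and a placeholder sample image URL:

```python
# Minimal sketch: semantic segmentation with nvidia/segformer-b1-finetuned-ade-512-512.
import torch
import requests
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b1-finetuned-ade-512-512")
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b1-finetuned-ade-512-512")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upsample the logits to the input size and take the per-pixel argmax
# to get one ADE20K class index per pixel.
segmentation = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(segmentation.shape)  # (height, width)
```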

Keypoint Detection

Many applications benefit from features localized in (x, y) (image registration, panorama stitching, motion estimation and tracking, recognition, ...). Desirable properties of a keypoint detector include accurate localization; invariance to shift, rotation, scale, and brightness change; robustness against noise; and high repeatability.

I am using the magic-leap-community/superpoint model for keypoint detection. This model is the result of self-supervised training of a fully-convolutional network for interest point detection and description. The model is able to detect interest points that are repeatable under homographic transformations and provides a descriptor for each point.
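A minimal sketch, assuming the SuperPointForKeypointDetection class available in recent transformers releases and a placeholder sample image URL:

```python
# Minimal sketch: keypoint detection with magic-leap-community/superpoint.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, SuperPointForKeypointDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# keypoints holds (x, y) locations, scores their confidences, and descriptors the
# per-keypoint feature vectors used for matching; mask marks the valid entries.
mask = outputs.mask[0].bool()
print(outputs.keypoints[0][mask].shape, outputs.descriptors[0][mask].shape)
```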

Optical Character Recognition

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example, the text on signs and billboards in a landscape photo), or from subtitle text superimposed on an image (for example, from a television broadcast).

OCR models that can accurately capture the text of a large document or image require a GPU to run, so I am not going to demonstrate that functionality here.