Various Image Tasks
Why Create This Page
I created this page to test out various tasks that can be accomplished with machine learning models with respect to images. Some of these tasks include:
- Image Feature Extraction
- Image Captioning
- Image Safe Search
- Text to Image Generation
- Depth Estimation
- Object Detection
- Image Segmentation
- Keypoint Detection
- Optical Character Recognition
Image Feature Extraction
Image feature extraction is the task of extracting semantically meaningful features from an image. This has many use cases, including image similarity and image retrieval. Moreover, most computer vision models can be used for image feature extraction: one removes the task-specific head (image classification, object detection, etc.) and uses the remaining network to produce features. These features are useful at a higher level for tasks such as edge detection and corner detection.
I am using the google/vit-base-patch16-384 model for image feature extraction. It is a Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224 and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384.
This model is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion.
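Below is a minimal sketch of extracting features with the transformers library. The file name is a placeholder, and using the [CLS] token's final hidden state as the image embedding is my assumption; mean-pooling the patch embeddings is another common choice.

```python
# A minimal sketch of image feature extraction with transformers.
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-384")
model = AutoModel.from_pretrained("google/vit-base-patch16-384")

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's hidden state as a single feature vector for the image.
features = outputs.last_hidden_state[:, 0]
print(features.shape)  # torch.Size([1, 768])
```

Feature vectors produced this way can be compared with cosine similarity for image retrieval.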
Image Safe Search
Image safe search is important for sites and applications that intend to promote content to people of all ages. It is the task of finding inappropriate content in an image and flagging it.
I use the Falconsai/nsfw_image_detection model to detect not-safe-for-work images. If you expect a lot of people to upload images to your site, most external APIs that detect inappropriate images for you quickly become cost-prohibitive:
This Fine-Tuned ViT is a variant of the transformer encoder architecture, similar to BERT, that has been adapted for image classification tasks. The overall objective of [the] training process was to impart the model with a deep understanding of visual cues, ensuring its robustness and competence in tackling the specific task of NSFW image classification.
This model is intended to be used for NSFW Image Classification. It has been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications.
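A minimal sketch of running this model through a transformers image-classification pipeline; the file name and the flagging logic are placeholders of my own:

```python
# A minimal sketch of NSFW image classification with a transformers pipeline.
from transformers import pipeline
from PIL import Image

classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

image = Image.open("upload.jpg")  # hypothetical user-uploaded image
results = classifier(image)
print(results)  # e.g. [{'label': 'normal', 'score': ...}, {'label': 'nsfw', 'score': ...}]

# Flag the image if the 'nsfw' label is the top prediction.
top = max(results, key=lambda r: r["score"])
is_flagged = top["label"] == "nsfw"
```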
Text to Image Generation
Text-to-image is the task of generating images from input text. These pipelines can also be used to modify and edit images based on text prompts.
Text-to-image models comparable to those provided by large tech companies today are impossible to run cheaply on a rented GPU, and they are not especially cheap even when called through an API. Having used it a few times, I recommend OpenAI's DALL·E 3 model for text-to-image generation.
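For reference, a call to DALL·E 3 through OpenAI's Python SDK looks roughly like this; the prompt and size are placeholders, and the exact parameters should be checked against OpenAI's current documentation:

```python
# A minimal sketch of text-to-image generation via the OpenAI API.
# Assumes the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()
response = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,
)
print(response.data[0].url)  # URL of the generated image
```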
Depth Estimation
Depth estimation is the task of predicting the depth of objects present in an image.
I am using the Intel/dpt-large model for depth estimation. The model was trained on 1.4 million images for monocular depth estimation. It was first introduced in the paper Vision Transformers for Dense Prediction.
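A minimal sketch using the transformers depth-estimation pipeline; the input and output file names are placeholders:

```python
# A minimal sketch of monocular depth estimation with a transformers pipeline.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

image = Image.open("example.jpg")  # hypothetical input image
result = depth_estimator(image)

# The pipeline returns the predicted depth as a PIL image plus the raw tensor.
result["depth"].save("depth_map.png")
print(result["predicted_depth"].shape)
```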
Object Detection
Object detection is the computer vision task of detecting instances (such as humans, buildings, or cars) in an image. Object detection models receive an image as input and output coordinates of the bounding boxes and associated labels of the detected objects. An image can contain multiple objects, each with its own bounding box and a label (e.g. it can have a car and a building), and each object can be present in different parts of an image (e.g. the image can have several cars). This task is commonly used in autonomous driving for detecting things like pedestrians, road signs, and traffic lights. Other applications include counting objects in images, image search, and more.
I use the facebook/detr-resnet-50 model for object detection. The DETR model [was] trained end-to-end on COCO 2017 object detection (118k annotated images).
The model is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and an MLP for the bounding boxes. You can use the raw model for object detection.
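A minimal sketch using the transformers object-detection pipeline; the file name is a placeholder:

```python
# A minimal sketch of object detection with a transformers pipeline.
from transformers import pipeline
from PIL import Image

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

image = Image.open("street.jpg")  # hypothetical input image
detections = detector(image)

for det in detections:
    # Each detection has a label, a confidence score, and a bounding box.
    print(det["label"], round(det["score"], 3), det["box"])
```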
Image Segmentation
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of image segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.
I am using the nvidia/segformer-b1-finetuned-ade-512-512 model for image segmentation. The SegFormer model [was] fine-tuned on ADE20k at resolution 512x512.
It consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head to achieve great results on semantic segmentation benchmarks. You can use the raw model for semantic segmentation.
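A minimal sketch using the transformers image-segmentation pipeline; the file name is a placeholder:

```python
# A minimal sketch of semantic segmentation with a transformers pipeline.
from transformers import pipeline
from PIL import Image

segmenter = pipeline(
    "image-segmentation", model="nvidia/segformer-b1-finetuned-ade-512-512"
)

image = Image.open("room.jpg")  # hypothetical input image
segments = segmenter(image)

for segment in segments:
    # Each segment has a label and a binary PIL mask covering that region.
    print(segment["label"], segment["mask"].size)
```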
Keypoint Detection
Many applications benefit from localized features: image registration, panorama stitching, motion estimation and tracking, recognition, and so on. Desirable properties of a keypoint detector include accurate localization; invariance to shift, rotation, scale, and brightness change; robustness against noise; and high repeatability.
I am using the magic-leap-community/superpoint model for keypoint detection. This model is the result of a self-supervised training of a fully-convolutional network for interest point detection and description. The model is able to detect interest points that are repeatable under homographic transformations and provide a descriptor for each point.
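A minimal sketch following the model card's usage; the file name is a placeholder, it requires a recent transformers version, and the output handling (masking out padded keypoints) reflects my reading of the SuperPoint API, so verify it against the current documentation:

```python
# A minimal sketch of keypoint detection with SuperPoint in transformers.
from transformers import AutoImageProcessor, SuperPointForKeypointDetection
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

image = Image.open("building.jpg")  # hypothetical input image
inputs = processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Keep only the keypoints the model actually detected (mask == 1).
mask = outputs.mask[0].bool()
keypoints = outputs.keypoints[0][mask]      # (num_keypoints, 2) coordinates
scores = outputs.scores[0][mask]            # confidence per keypoint
descriptors = outputs.descriptors[0][mask]  # 256-d descriptor per keypoint
print(keypoints.shape, descriptors.shape)
```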
Optical Character Recognition
Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example, the text on signs and billboards in a landscape photo), or from subtitle text superimposed on an image (for example, from a television broadcast).
OCR models that can accurately capture the text of a large document or image require a GPU to run, so I am not going to demonstrate that functionality here.