Various Image Tasks

Why Create This Page

I want to create this page to test out various tasks that can be accomplished with machine learning models on images. Some of these tasks include:

Image Feature Extraction

Image feature extraction is the task of extracting semantically meaningful features from a given image. This has many use cases, including image similarity and image retrieval. Moreover, most computer vision models can be used for image feature extraction, where one can remove the task-specific head (image classification, object detection, etc.) and get the features. These features are very useful at a higher level: edge detection, corner detection, and so on.

I am using the google/vit-base-patch16-384 model for image feature extraction. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pre-trained in a supervised fashion on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224 and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384.
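Below is a minimal sketch of how this might look with the transformers library, assuming the ViTModel class and a placeholder sample image URL; the [CLS] token embedding is taken as the image's feature vector.

```python
# Minimal sketch: image feature extraction with google/vit-base-patch16-384.
# The image URL is a placeholder assumption; substitute any image you like.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-384")
model = ViTModel.from_pretrained("google/vit-base-patch16-384")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding as a single feature vector for the whole image.
features = outputs.last_hidden_state[:, 0]
print(features.shape)  # (1, 768)
```

These vectors can then be compared with, for example, cosine similarity to support the image similarity and retrieval use cases mentioned above.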

Image Captioning

Image captioning is the task of predicting a caption for a given image. Common real-world applications include aiding visually impaired people as they navigate through different situations. Image captioning therefore helps improve content accessibility by describing images to people.

I use the Salesforce/blip-image-captioning-large model for image captioning. BLIP is a vision-language pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. It makes effective use of noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones.
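A minimal sketch of generating a caption with this model, assuming the BlipProcessor and BlipForConditionalGeneration classes from transformers and a placeholder sample image URL:

```python
# Minimal sketch: image captioning with Salesforce/blip-image-captioning-large.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Unconditional captioning: the model generates a caption from the image alone.
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```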

Text to Image Generation

Text-to-image is the task of generating images from input text. These pipelines can also be used to modify and edit images based on text prompts.

Text-to-image models comparable to those offered by large tech companies today are impossible to run cheaply on a rented GPU, and they are not particularly cheap even when called through an API. Having used it a few times, I nevertheless recommend OpenAI's DALL·E 3 model for text-to-image generation.
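For reference, here is a hedged sketch of what a DALL·E 3 call might look like through the OpenAI Python SDK; the prompt is made up, and the call assumes the openai package is installed and an OPENAI_API_KEY is set in the environment.

```python
# Hedged sketch: text-to-image generation via the OpenAI API (not run locally).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="a watercolor painting of a lighthouse at dusk",  # example prompt
    size="1024x1024",
    n=1,
)

# The response contains a URL to the generated image.
print(response.data[0].url)
```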

Depth Estimation

Depth estimation is the task of predicting depth of objects present in an image.

I am using the Intel/dpt-large model for depth estimation. The model was trained on 1.4 million images for monocular depth estimation and was first introduced in the paper Vision Transformers for Dense Prediction.
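A minimal sketch using the transformers depth-estimation pipeline with this model, assuming a placeholder sample image URL:

```python
# Minimal sketch: monocular depth estimation with Intel/dpt-large.
import requests
from PIL import Image
from transformers import pipeline

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth_estimator(image)

# result["depth"] is a PIL image of the predicted depth map;
# result["predicted_depth"] is the raw tensor.
result["depth"].save("depth.png")
```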

Object Detection

Object detection is the computer vision task of detecting instances (such as humans, buildings, or cars) in an image. Object detection models receive an image as input and output coordinates of the bounding boxes and associated labels of the detected objects. An image can contain multiple objects, each with its own bounding box and a label (e.g. it can have a car and a building), and each object can be present in different parts of an image (e.g. the image can have several cars). This task is commonly used in autonomous driving for detecting things like pedestrians, road signs, and traffic lights. Other applications include counting objects in images, image search, and more.

I use the facebook/detr-resnet-50 model for object detection. The DETR model was trained end-to-end on COCO 2017 object detection (118k annotated images). It is an encoder-decoder transformer with a convolutional backbone. Two heads are added on top of the decoder outputs in order to perform object detection: a linear layer for the class labels and an MLP for the bounding boxes. You can use the raw model for object detection.
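A minimal sketch of running the raw model with transformers, assuming the DetrImageProcessor and DetrForObjectDetection classes and a placeholder sample image URL:

```python
# Minimal sketch: object detection with facebook/detr-resnet-50.
import torch
import requests
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold and map boxes back to image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```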

Image Segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of image segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.

I am using the nvidia/segformer-b1-finetuned-ade-512-512 model for image segmentation. The SegFormer model was fine-tuned on ADE20k at resolution 512x512. It consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head, and achieves strong results on semantic segmentation benchmarks. You can use the raw model for semantic segmentation.
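A minimal sketch with transformers, assuming the SegformerImageProcessor and SegformerForSemanticSegmentation classes and a placeholder sample image URL:

```python
# Minimal sketch: semantic segmentation with nvidia/segformer-b1-finetuned-ade-512-512.
import torch
import requests
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b1-finetuned-ade-512-512")
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b1-finetuned-ade-512-512")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upsample the logits to the input size and take the per-pixel argmax
# to get one ADE20K class index per pixel.
segmentation = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(segmentation.shape)  # (height, width)
```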

Keypoint Detection

Many applications benefit from features localized in (x, y) (image registration, panorama stitching, motion estimation and tracking, recognition, ...). Desirable properties of a keypoint detector include accurate localization; invariance to shift, rotation, scale, and brightness change; robustness against noise; and high repeatability.

I am using the magic-leap-community/superpoint model for keypoint detection. This model is the result of self-supervised training of a fully-convolutional network for interest point detection and description. The model is able to detect interest points that are repeatable under homographic transformations and provides a descriptor for each point.
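A minimal sketch, assuming the SuperPointForKeypointDetection class available in recent transformers releases and a placeholder sample image URL:

```python
# Minimal sketch: keypoint detection with magic-leap-community/superpoint.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, SuperPointForKeypointDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # assumed sample image
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# keypoints holds (x, y) locations, scores their confidences, and descriptors the
# per-keypoint feature vectors used for matching; mask marks the valid entries.
mask = outputs.mask[0].bool()
print(outputs.keypoints[0][mask].shape, outputs.descriptors[0][mask].shape)
```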

Optical Character Recognition

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example, the text on signs and billboards in a landscape photo), or from subtitle text superimposed on an image (for example, from a television broadcast).

OCR models that can accurately capture the text of a large document or image require a GPU to run, so I am not going to demonstrate that functionality here.