GPT-4V: Experiments for Object Detection

Nov 16, 2023

GPT-4 Vision (GPT-4V) has broad general knowledge and can answer complex queries about the contents of an image and the relationships between the objects in it. This capability has prompted discussion about whether GPT-4V could enhance or even replace traditional object detection models, which pinpoint objects’ locations in images.

This article will delve into our findings from experimenting with GPT-4V for object detection. We aim to illuminate why specialized, fine-tuned models might be better suited for this task, adding depth to the question: “What impact will GPT have on object detection?”

Let’s dive in!

Exploring GPT-4V’s Object Detection Abilities

In an initial experiment, we tasked GPT-4V with detecting a car in a photo and supplying the bounding box coordinates. Contrary to our expectations, the coordinates GPT-4V returned did not align with the car’s actual location.

Consider this scenario: we have an image with a car, and our goal is to pinpoint its exact location.

We uploaded the image with the following prompt:

**User input**
Return the coordinates of the car in the following image in the format x0, y0, x1, y1, class bounding box coordinates.

and we got the following response:

```python
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from skimage import io
from skimage.measure import find_contours

# Load the image
image_path = '/mnt/data/oldtimer-1197800_640.jpg'
image = io.imread(image_path)

# Display the image
fig, ax = plt.subplots()
ax.imshow(image)

# Since we can't use ML models, let's try a simple approach using find_contours
# on the assumption that the car will have a distinctive edge compared to the background.

# Convert to grayscale for find_contours
gray_image = np.mean(image, axis=2).astype(np.uint8)

# Find contours at a constant value
contours = find_contours(gray_image, level=0.8)

# Assuming the largest contour will be the car
largest_contour = max(contours, key=lambda x: x.shape[0])

# Get the bounding box coordinates
x0, y0 = np.min(largest_contour, axis=0)
x1, y1 = np.max(largest_contour, axis=0)

# Create a Rectangle patch
rect = patches.Rectangle((y0, x0), y1-y0, x1-x0, linewidth=1, edgecolor='r', facecolor='none')

# Add the patch to the Axes
ax.add_patch(rect)

plt.show()

(x0, y0, x1, y1)
```
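One detail worth checking in a response like this is axis order. `find_contours` returns points as (row, column) pairs, so in the code above the variables named `x0` and `x1` hold row values, while the prompt asked for x0, y0, x1, y1, where x conventionally runs along columns. The plain-Python sketch below shows one way to turn (row, column) points into an (x0, y0, x1, y1) box and into the anchor, width, and height that `matplotlib.patches.Rectangle` takes. The helper names and sample points are hypothetical and chosen only for illustration.

```python
def box_from_row_col_points(points):
    """Return (x0, y0, x1, y1) for an iterable of (row, col) points.

    x follows columns (horizontal), y follows rows (vertical).
    """
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    return (min(cols), min(rows), max(cols), max(rows))


def rectangle_args(box):
    """Convert (x0, y0, x1, y1) into (xy, width, height) for a Rectangle patch."""
    x0, y0, x1, y1 = box
    return (x0, y0), x1 - x0, y1 - y0


# Hypothetical contour points in (row, col) order, not taken from the real image
sample_points = [(120.0, 80.5), (300.0, 80.5), (300.0, 560.0), (120.0, 560.0)]

box = box_from_row_col_points(sample_points)
xy, width, height = rectangle_args(box)
print(box)
print(xy, width, height)
```

Keeping the conversion in one place makes it harder for the returned tuple and the drawn rectangle to disagree about which axis is which.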

To check the results, we asked GPT-4V to draw the bounding box:

**User input**
Use the car coordinates to draw a bounding box

**GPT4 Response**
```python
# Since the initial approach did not yield a correct bounding box,
# we will visually inspect the image and estimate the coordinates manually.

# We will now draw the bounding box around the car manually by estimating the coordinates
# Display the image again
fig, ax = plt.subplots()
ax.imshow(image)

# Estimate the bounding box coordinates
# These coordinates are estimated based on visual inspection of the image and may not be precise
x0, y0 = 150, 100  # Top-left corner
x1, y1 = 150 + 400, 100 + 250  # Bottom-right corner (width, height)

# Create a Rectangle patch
rect = patches.Rectangle((y0, x0), y1 - y0, x1 - x0, linewidth=2, edgecolor='r', facecolor='none')

# Add the patch to the Axes
ax.add_patch(rect)

plt.show()

(x0, y0, x1, y1)
```
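To judge an estimate like this quantitatively, a common measure is intersection over union (IoU) against a hand-labelled box. This article does not include a ground-truth annotation, so the reference box below is a made-up placeholder and `iou` is a hypothetical helper name. Treat it as a minimal sketch of how such a check could look, not as the evaluation we ran.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


estimated = (150, 100, 550, 350)  # coordinates from the response above
reference = (100, 150, 600, 400)  # placeholder, NOT a real annotation

print(round(iou(estimated, reference), 3))
```

Detection benchmarks often count a prediction as correct only above some IoU threshold such as 0.5, so a box that looks roughly right can still fail this kind of check.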

Can GPT-4V also classify the car? Let’s see.

**User input**
What is the model of the car in the image?
**GPT4 Response**
The car in the image is a classic model, which appears to be from the mid-20th century. However, without specific expertise in classic cars or additional identifying details, it’s challenging to determine the exact model from the image alone.
Typically, identifying the model of a classic car would involve examining specific design features, such as the grille, badges, body shape, and other distinctive characteristics that could be matched to known models from car manufacturers of the era.
If you have any interest in a more detailed analysis or identification, I would recommend consulting a classic car expert or using an online community or forum where enthusiasts and experts may be able to provide more insight.

Conclusion: Assessing GPT-4V’s Role in Object Detection and Classification

In our experiments, GPT-4V did not return usable bounding box coordinates directly: its first response fell back on a contour heuristic, and its second estimated a box by visual inspection. It also could not name the car’s model. While GPT-4V demonstrates an impressive breadth of knowledge, its current capabilities in precise object localization and fine-grained classification, such as identifying classic car models, are limited. This underscores the importance of specialized, fine-tuned models for tasks requiring high accuracy and specificity.

The future of AI in object detection and classification is promising, but as of now, it seems clear that a combination of AI models, including both general and specialized ones, will be necessary for the most effective and accurate results. GPT-4V’s evolution will undoubtedly continue to impact the field, and we eagerly anticipate its advancements. For now, experts and custom solutions remain indispensable for certain intricate tasks, highlighting the harmonious blend of human expertise and AI in addressing complex challenges.
