Computer Vision Basics: How AI Actually "Sees" Your Photos

If you have ever unlocked your phone with your face, searched Google Photos for “beach,” or watched a self-driving car demo and thought “how on earth can a computer see?”, you are asking about computer vision.

To you, an image is a picture of your dog on the couch. To a computer, it starts as a giant grid of numbers – pixel values. The magic of modern AI is that it can turn that grid of numbers into structure and meaning: “there is a dog here, that blob is the couch, and this scene looks like a living room.”

In this post, we will walk through the basics of how that works, no PhD required. We will start with how images are represented, move into how neural networks like convolutional neural networks (CNNs) process them, and then look at common vision tasks such as image classification, object detection, and segmentation. Along the way you will see how the same core ideas power everything from simple filters to advanced systems behind search in Google Photos and many vision APIs. Google’s own tutorials describe very similar building blocks under the hood.

By the end, you will understand enough to read AI product announcements with a more critical eye, and to start tinkering with basic vision code yourself using libraries like OpenCV and modern models provided via APIs or tools like ChatGPT, Claude, and Gemini.

What is computer vision, really?

Computer vision is a field of AI that lets computers interpret and understand visual data—images and video—in a way that is useful for tasks in the real world.

Typical computer vision systems aim to:

Recognize what is in an image (a cat, a car, a tumor, a road sign)
Locate where things are (bounding boxes, masks, keypoints)
Understand how things relate (this lane belongs to this car, this tissue boundary belongs to that organ)

Modern computer vision is dominated by deep learning, especially CNNs and, increasingly, vision transformers. CNNs are a type of neural network specifically designed to work with grid-like data such as images and are now considered the de‑facto standard for many vision tasks. IBM’s overview of CNNs notes that they replaced a lot of older, manual feature engineering in vision with automated feature learning.

Before we get into CNNs, you need to understand what an image looks like to a computer.

How AI “sees” pixels: images as numbers

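A typical digital image is a 2D grid of pixels. Each pixel stores one or more numbers: a single brightness value for grayscale, or three values for color, usually red, green, and blue (RGB). In common 8-bit images each value ranges from 0 to 255: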

A grayscale image can be represented as a matrix of shape H × W (height by width), each entry a brightness value.
A color image is often represented as H × W × 3, one channel for each color.
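
To make those shapes concrete, here is a minimal sketch using NumPy arrays as stand-ins for real photos. The pixel values are made up for illustration, and NumPy is an assumption here (the text above only describes the shapes).

```python
import numpy as np

# A tiny 2x3 grayscale image: H x W brightness values in 0..255.
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# A 2x3 color image: H x W x 3, one channel each for red, green, blue.
color = np.zeros((2, 3, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]   # top-left pixel is pure red
color[1, 2] = [0, 0, 255]   # bottom-right pixel is pure blue

print(gray.shape)    # (2, 3)
print(color.shape)   # (2, 3, 3)
print(color[0, 0])   # [255   0   0]
```

One practical wrinkle: libraries do not all agree on channel order. OpenCV's imread, for example, returns color images in BGR order rather than RGB.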

To the model, that is just a tensor (a multi‑dimensional array) of numbers. The job of a vision model is to turn that raw tensor into higher‑level features, like:

Edges
Textures
Corners
Parts (eyes, wheels, leaves)
Whole objects (faces, cars, trees)

This is where convolutions come in.

Convolutional neural networks: from edges to objects

A convolutional neural network (CNN) processes images in layers. Each layer applies a set of small filters (also called kernels) that slide across the image and compute new values—this operation is the “convolution.”

Hugging Face’s community computer vision course explains that a convolution takes a small matrix and moves it across the input, computing sums of elementwise products to produce a feature map that highlights specific patterns, such as edges or textures. Their intro to CNNs includes simple visual examples of this process.
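
The sliding-window idea can be sketched in a few lines of NumPy. This is an illustrative toy, not how any framework implements it: the function name convolve2d, the 5x5 test image, and the hand-written Sobel-style filter are all assumptions chosen for the example. In a real CNN the filter weights are learned rather than written by hand.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w), dtype=np.float32)
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]
            # Sum of elementwise products between the patch and the filter.
            feature_map[y, x] = np.sum(patch * kernel)
    return feature_map

# 5x5 image: dark on the left, bright on the right, so there is one vertical edge.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float32)

# Hand-written vertical-edge filter (Sobel-style).
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=np.float32)

feature_map = convolve2d(image, vertical_edge)
print(feature_map)
# Each row is [4. 4. 0.]: strong response where the window covers the edge,
# zero where the window sees only uniform brightness.
```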

Conceptually, here is what happens as you go deeper into a CNN:

Early layers: edges and simple patterns
Filters behave like edge detectors, spotting horizontal, vertical, or diagonal changes in pixel intensity. These are similar to classic image processing filters you might see in OpenCV tutorials.
Middle layers: textures and parts
By combining many edges, the network starts recognizing textures (fur, grass, metal) and parts of objects (eyes, wheel rims, leaves).
Deep layers: whole objects and categories
Near the end, the network has abstract representations that strongly respond to whole objects or specific configurations, like “dog‑shaped thing” or “stop sign‑like pattern.”

Research summaries and teaching notes on CNNs emphasize this hierarchical feature learning: simple to complex as you go layer by layer. Recent lecture slides on CNNs describe how features such as edges, textures, and objects are learned and can appear anywhere in the image while retaining their meaning.

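After several convolution and pooling layers (pooling shrinks each feature map by summarizing small neighborhoods, for example by keeping only the largest value in each one), the network typically flattens the features and passes them through fully connected layers to output predictions, such as class probabilities.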
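
Here is a rough sketch of that final stretch: pool, flatten, apply a fully connected layer, and convert scores to probabilities. The feature map values are invented, and the random weights are a hypothetical stand-in for what training would produce, so the resulting probabilities are meaningless beyond showing the mechanics.

```python
import numpy as np

def max_pool2x2(feature_map):
    """Downsample by keeping the largest value in each 2x2 block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop an odd trailing row/column
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))           # subtract max for numerical stability
    return exp / exp.sum()

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]], dtype=np.float32)

pooled = max_pool2x2(feature_map)    # [[4, 2], [2, 8]]
flat = pooled.reshape(-1)            # shape (4,)

# Hypothetical fully connected layer: random weights stand in for learned ones.
rng = np.random.default_rng(0)
num_classes = 3
weights = rng.normal(size=(flat.size, num_classes))
bias = np.zeros(num_classes)

probs = softmax(flat @ weights + bias)
print(pooled)
print(probs.sum())
```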

Common computer vision tasks (and what they mean)

When you see an AI feature advertised in a product, it is usually one of a small set of standard computer vision tasks. Clarifai’s developer docs outline several of the most popular: image classification, object detection, and segmentation. Their guide is a good reference for the differences.

Here is how they break down:

1. Image classification

Image classification answers the question: “What is in this image overall?”

Input: one image
Output: one or more labels (e.g., “cat,” or “cat + couch”)

Your phone recognizing “food” photos, or a model telling you “this X‑ray looks normal vs. abnormal,” are image classification problems. Under the hood, a CNN ingests the whole image and outputs a label.
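
The "one or more labels" distinction usually comes down to how the model's raw scores are turned into a decision. The sketch below shows two common options under made-up scores: softmax plus argmax for a single label, and a per-class sigmoid with a threshold for multiple labels. The label names, scores, and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

labels = ["cat", "couch", "dog"]
scores = np.array([2.0, 1.0, -3.0])   # made-up raw model outputs (logits)

# Single-label: softmax over all classes, then pick the most probable one.
exp = np.exp(scores - scores.max())
probs = exp / exp.sum()
single_label = labels[int(np.argmax(probs))]

# Multi-label: an independent sigmoid per class, keep everything above a threshold.
sigmoid = 1.0 / (1.0 + np.exp(-scores))
multi_labels = [name for name, p in zip(labels, sigmoid) if p >= 0.5]

print(single_label)   # cat
print(multi_labels)   # ['cat', 'couch']
```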

Google’s Machine Learning Crash Course includes an image classification practicum that shows how such a model can learn to power search in Google Photos by mapping images to labels that users can query later. That practicum walks through how the model is trained and evaluated.

2. Object detection

Object detection answers: “What is in this image, and where are the things?”

Input: one image
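Output: a set of bounding boxes plus labels, like “dog at (x1, y1, x2, y2), person at (…)”, where (x1, y1) and (x2, y2) are commonly the top-left and bottom-right corners of the box in pixel coordinates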

Instead of just saying “there are cars here,” detection will draw rectangles around each car. This is vital for applications like:

Self‑driving cars (detecting other vehicles, pedestrians, traffic lights)
Retail analytics (counting people entering a store)
Security cameras (flagging intrusions in specific regions)

Detection models often build on CNNs but add extra heads that predict box coordinates and labels.
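
To get a feel for what detection output looks like in code, here is a small sketch. The Detection class, the sample boxes, and the 0.5 confidence threshold are all hypothetical; real detectors each have their own output format. The counting step mirrors the retail analytics example above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float   # model confidence in [0, 1]
    box: tuple     # (x1, y1, x2, y2) in pixel coordinates

    def area(self):
        x1, y1, x2, y2 = self.box
        return max(0, x2 - x1) * max(0, y2 - y1)

# Made-up detector output for one camera frame.
detections = [
    Detection("person", 0.92, (10, 20, 60, 180)),
    Detection("person", 0.35, (200, 30, 240, 170)),   # low confidence, likely noise
    Detection("dog", 0.81, (100, 120, 170, 190)),
    Detection("person", 0.77, (300, 25, 350, 185)),
]

def keep_confident(dets, threshold=0.5):
    """Drop detections the model is unsure about."""
    return [d for d in dets if d.score >= threshold]

confident = keep_confident(detections)
counts = Counter(d.label for d in confident)

print(counts["person"])       # 2
print(confident[0].area())    # 8000
```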

3. Image segmentation

Segmentation goes one step further and reasons at the pixel level. IBM’s overview explains that image segmentation partitions an image into coherent groups of pixels—segments—that correspond to objects or regions. Their description distinguishes segmentation from classification and detection.

There are a few common flavors:

Semantic segmentation: Label every pixel with a class (“road,” “sky,” “car”) but do not distinguish between different instances of the same class.
Instance segmentation: Separate individual objects of the same class (car #1 vs. car #2), giving each its own mask.
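Panoptic segmentation: Combine both, so every pixel gets a class label, and pixels that belong to countable objects (cars, people) also get an instance ID, while background regions (road, sky) carry only their class.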
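
The three flavors are easier to tell apart as arrays. In this toy sketch, a 4x6 "scene" contains road and two cars. The class and instance IDs are arbitrary choices for the example, and storing panoptic output as a stacked (class, instance) pair is just one possible representation.

```python
import numpy as np

# Semantic map: one class id per pixel. 0 = road, 1 = car.
semantic = np.array([[0, 0, 0, 0, 0, 0],
                     [0, 1, 1, 0, 1, 1],
                     [0, 1, 1, 0, 1, 1],
                     [0, 0, 0, 0, 0, 0]])

# Instance map: 0 = no countable object, 1 = car #1, 2 = car #2.
instance = np.array([[0, 0, 0, 0, 0, 0],
                     [0, 1, 1, 0, 2, 2],
                     [0, 1, 1, 0, 2, 2],
                     [0, 0, 0, 0, 0, 0]])

# Semantic view: all car pixels are lumped together.
car_pixels = int((semantic == 1).sum())

# Instance view: each car has its own mask.
car2_mask = instance == 2
num_cars = len(np.unique(instance[instance > 0]))

# Panoptic view: every pixel carries a (class id, instance id) pair.
panoptic = np.stack([semantic, instance], axis=-1)

print(car_pixels)       # 8
print(num_cars)         # 2
print(panoptic[1, 4])   # [1 2] -> class "car", instance #2
```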

Segmentation is heavily used in:

Medical imaging (tracing tumors, organs, or lesions)
Autonomous driving (precisely understanding drivable areas and lane markings)
Satellite imagery (mapping buildings, fields, forests)

Classic tools vs modern AI: where OpenCV fits in

If you want to get hands‑on with computer vision, you will very quickly meet OpenCV. It is an open source computer vision and image processing library started at Intel and now maintained by the OpenCV.org foundation. Official docs describe it as “the world’s largest resource of Computer Vision” and a go‑to toolkit for image processing, video analysis, and more. The OpenCV “Get Started” page positions it as the foundation for many vision projects.

OpenCV gives you:

Basic image I/O (read/write images and video)
Classic operations: blurring, edge detection, color conversions
Geometric transforms: rotation, scaling, perspective
Feature detectors: corners, keypoints, descriptors
Integration points with deep learning frameworks (DNN module, ONNX models, etc.)

You can think of OpenCV as the “Swiss Army knife” for low‑level vision operations, while modern deep learning frameworks (PyTorch, TensorFlow, JAX) and APIs power the learning of features and end‑to‑end models.

In practice, real systems often combine both:

Preprocess with OpenCV (resize, normalize, crop)
Run a CNN or transformer model for classification/detection/segmentation
Post‑process outputs with OpenCV (draw boxes, masks, overlays)
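
The shape of that three-stage pipeline can be sketched as follows. To keep the example dependency-light, it uses plain NumPy stand-ins: in a real project, OpenCV functions such as cv2.resize and cv2.rectangle would typically handle the resizing and drawing. The fake_detector function is entirely hypothetical and simply boxes the brightest pixel; it stands in for a real CNN or transformer.

```python
import numpy as np

def preprocess(image, size=8):
    """Center-crop to a square, resize with nearest neighbor, scale to 0..1."""
    h, w = image.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = image[top:top + side, left:left + side]
    idx = np.arange(size) * side // size      # nearest-neighbor sample positions
    resized = square[idx][:, idx]
    return resized.astype(np.float32) / 255.0

def fake_detector(tensor):
    """Hypothetical stand-in for a real model: boxes the brightest pixel."""
    gray = tensor.mean(axis=-1)
    y, x = np.unravel_index(np.argmax(gray), gray.shape)
    x1, y1 = max(x - 1, 0), max(y - 1, 0)
    x2 = min(x + 1, gray.shape[1] - 1)
    y2 = min(y + 1, gray.shape[0] - 1)
    return [("bright_spot", (int(x1), int(y1), int(x2), int(y2)))]

def draw_box(image, box, color=(0, 255, 0)):
    """Paint a 1-pixel rectangle outline onto a copy of the image."""
    x1, y1, x2, y2 = box
    out = image.copy()
    out[y1, x1:x2 + 1] = color
    out[y2, x1:x2 + 1] = color
    out[y1:y2 + 1, x1] = color
    out[y1:y2 + 1, x2] = color
    return out

# Synthetic 16x24 "photo" with a small white patch.
photo = np.zeros((16, 24, 3), dtype=np.uint8)
photo[8:10, 12:14] = 255

tensor = preprocess(photo)                        # 1. preprocess
detections = fake_detector(tensor)                # 2. run the "model"
small = (tensor * 255).astype(np.uint8)
annotated = draw_box(small, detections[0][1])     # 3. post-process

print(tensor.shape)    # (8, 8, 3)
print(detections)      # [('bright_spot', (3, 3, 5, 5))]
```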

How this connects to ChatGPT, Claude, Gemini and friends

You might associate tools like ChatGPT, Claude, and Gemini with text, but the newest versions are multimodal: they can take images as input. When you upload a screenshot or a photo, under the hood there is usually:

A vision encoder (often CNN‑like or transformer‑based) that turns the image into a set of feature vectors.
A language model that reasons over those features and generates a natural language explanation.

While vendors do not always publish exact model internals, the same basic computer vision ideas apply:

Images are tensors of pixel values.
A stack of learned layers extracts features (edges → textures → objects → scene).
Higher‑level components reason over those features to answer questions or follow instructions.
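
As a rough illustration of "turning an image into a set of feature vectors," here is the patch-based approach used by transformer-style encoders: cut the image into small squares, flatten each one, and project it to a vector. This is a sketch of the general idea only. The patch size, embedding size, and random projection weights are assumptions, and none of this reflects any specific vendor's internals.

```python
import numpy as np

def image_to_patch_vectors(image, patch=8):
    """Cut an H x W x C image into non-overlapping patches and flatten each one."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "sketch assumes evenly divisible sizes"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, c)
    return grid.reshape(-1, patch * patch * c)    # one row per patch

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3), dtype=np.float32)   # random stand-in for a photo

patches = image_to_patch_vectors(image)             # 16 patches, 192 numbers each

# Hypothetical projection: random weights stand in for a trained encoder layer.
embed_dim = 64
projection = rng.normal(size=(patches.shape[1], embed_dim)).astype(np.float32)
feature_vectors = patches @ projection

print(patches.shape)           # (16, 192)
print(feature_vectors.shape)   # (16, 64)
```

In a multimodal system, vectors like these (after many more learned layers) are what the language model receives in place of raw pixels.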

So when you ask ChatGPT to “describe what is happening in this diagram,” you are indirectly using a computer vision system very similar in spirit to the CNNs and segmentation models described above, just deeply integrated with a powerful language model.

Where to go next if you want to learn or build

If this overview has demystified things a bit, you might be wondering what to do next. Here are a few concrete, low‑friction steps:

Play with tutorials that explain CNNs visually
Resources like Hugging Face’s computer vision course and Google’s ML Crash Course image classification practicum are designed for developers coming from regular software backgrounds, not math PhDs. Start with something like their “introduction to convolutions” and step through the animations to see how filters detect edges and patterns.
Install OpenCV and do a tiny project
Use OpenCV in Python to:
- Load an image
- Convert it to grayscale
- Run an edge detector
- Draw contours around objects
  Even this simple pipeline will make the “images as matrices” idea very concrete.
Experiment with pre‑built vision models
Use a hosted API or a framework hub (for example, via cloud providers, or tools that let you call image classification or detection models from Python/JavaScript) instead of training from scratch. Focus on:
- Feeding in your own images
- Inspecting outputs (labels, boxes, masks)
- Thinking about what the model “saw” and where it struggles

If you do those three things, you will go from “AI is magic” to “I can reason about what this vision system is doing and when I should or should not trust it”—which is exactly the kind of literacy you want as AI becomes part of everyday software.
