From RANSAC to ResNet: A Brief History of Computer Vision

Zain Raza
14 min read · Dec 25, 2023



What I Learned Over an Entire Semester of Grad School, in 14 Minutes

Like the pyramids of Egypt, computer vision history carries many secret treasures. Photo by The New York Public Library on Unsplash.

Disclaimer: you can’t understand the entirety of computer vision (abbreviated “CV”) in 14 minutes. But this blog will at least give you a deeper intuition of things.

For context, 5 months ago I thought that I knew CV pretty well. After all:

  • As a TA in undergrad, I taught people how to implement their own 2D convolution function in Python.
  • I designed and built a prototype sentiment classifier using TinyYOLOv2 for a research project at iQ3Connect.
  • I joined PTC, where I currently work on MLOps for a team improving the capabilities of Windchill+ for certain use cases (out of scope for this post) that rely heavily on metric learning in CNNs.

The point is, even though I thought I knew a lot, the truth is I was only familiar with the deep learning era of CV. And then grad school happened.

But there’s more to this field than that. This past Fall 2023 semester gave me a tremendous wake-up call to the classical era of CV. I think that’s important to share, because when I tell people I’m studying CS 558, aka Computer Vision (CV) at Stevens Institute of Technology, it usually sparks the same kinds of follow-up questions:

  • Could you explain the kinds of models you work with?
  • How would you go about solving [insert problem] in my domain? (usually, my domain is something like robotics, computer graphics, etc.)

These are the big-picture questions of CV, and getting familiar with the various answers people have come up with over the years will help you better appreciate CV, from the automated object recognition in the Photos app on iOS to the photorealistic 3D reconstruction included in several XR apps today.

So let’s get started!

Part I: Classical Era (circa 1982–2012): The First Principles of Computer Vision Emerge

Images are all around us. But images don’t just happen. There’s a gargantuan variety of environments, events, cultures, and more that are captured by cameras today, whether on our phones, mounted atop traffic lights, or aboard satellites in space. The study of how such images are created in the real world might be called computational photography nowadays. Or, if they’re created in CGI environments, that probably falls under computer graphics.

But, we’re not here to look at how images are created. CV studies how to go in the opposite direction of image formation: image understanding.

In a sentence, CV is the study of how computers can take an image and process it enough to understand the source environment in which it was taken, whether in terms of geometry (i.e., the shapes in the scene), semantics (i.e., the names of objects in the scene), or more.

How good that understanding can be depends on how well we understand vision itself. In the classical era, we tried to figure that out by answering a few key questions:

  1. How do we pick out the most useful aspects of an image (i.e., features) for the purpose of ML?
  2. How do we describe the features of an image using math (e.g., parametric models for shapes like lines)?
  3. How can we track those features, if we’re given a dataset of images that are of the same scene, but are taken from different viewpoints?

Geometric Computer Vision for Scene Understanding

Imagine this: you’re scrolling through the National Geographic’s account on Instagram. You come across an image of something like this young polar bear in Norway, except it’s not a normal image. A normal image is a static, flat 2D photo, but what’s on your display is a moving photo, akin to one of the portraits in Hogwarts Castle. It lets you rotate around the polar bear and truly see the world from its point of view.

If you can imagine such a piece of content, you’ve just taken your first steps in grasping the promise of scene understanding.

Data Preprocessing

Let’s say we have a dataset of images. Great. What are we going to do with them?

Well, let’s step back: if this were an intro course on ML and someone handed you a dataset that was just a CSV file, what would you do? You’d preprocess it! Duh :)

The same answer applies in CV. The most common way of preprocessing images is to apply filters to them. And what are filters?

  • High school answer: it’s a box of numbers
  • Undergrad answer: it’s a tensor that we dot with the image to distinguish certain features that we care about
  • Grad school answer: it’s a discrete representation of a 2D kernel function, whose convolution with the image results in a response function that’s more amenable to further CV processing

(OK, I admit these 3 answers above all refer to the same thing, but I made the grad school answer above arbitrarily complex and wordy. The grad students will appreciate the dry humor here though :).

What are some things filters can help us do to images? Well, think of the use cases we have for preprocessing in other domains of ML:

  • Denoising an image
  • Normalizing values (in this case, of the pixels in the image)

Since we’re dealing with images, there are some unique use cases for preprocessing as well:

  • Emphasizing contrast between the left/right sides of an image (what a simple horizontal-gradient filter does)
  • Blurring an image (which oddly enough, can sometimes be one of the techniques we use for image denoising)
  • See this blog for more details / interactive visualizations
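To make the “box of numbers” idea concrete, here’s a minimal sketch of blurring a grayscale image by convolving it with a small Gaussian-like kernel. I’m assuming NumPy and SciPy here, and the exact kernel values are just one illustrative choice:

```python
# A minimal sketch of filtering as "convolving a box of numbers with an image".
import numpy as np
from scipy.signal import convolve2d

def gaussian_blur(image: np.ndarray) -> np.ndarray:
    """Blur a grayscale image with a small, normalized Gaussian-like kernel."""
    kernel = np.array([[1, 2, 1],
                       [2, 4, 2],
                       [1, 2, 1]], dtype=float)
    kernel /= kernel.sum()  # normalize so overall brightness is preserved
    # 'same' keeps the output the same size; 'symm' mirrors pixels at the border
    return convolve2d(image, kernel, mode="same", boundary="symm")

# Usage (with any 2D grayscale array): blurred = gaussian_blur(noisy_image)
```

Swapping in a different kernel (e.g., one with negative values) is what turns the same operation into an edge emphasizer instead of a blur.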

Feature Engineering

Ok! So we can preprocess the data we want to use for some algorithm. Are we ready to do some CV?

Not. Just. Yet. Oftentimes, even a filtered image has too much useless information to be practical for CV. This was especially true in the classical era, when datasets tended to be smaller than the ones we have today (looking at you, ImageNet) — thus, we need a way to further distill our images down to just the features that will be the most useful to learn on.

Gradient-Based Image Feature Detection

The two most common types of features we want to detect in an image are arguably edges and corners (note that corners are just a special kind of edge, so it’s really one type). They can be found using gradient-based detectors; the classic examples are edge detectors like Sobel and Canny, and corner detectors like Harris (which we’ll use below).

These detectors work by using the gradient of an image, which can be computed with special kinds of filters. Their output is just a new, transformed version of the image.
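As a rough illustration (not the exact code from the course), here’s how one might compute image gradients and Harris corner responses with OpenCV. The file name and the threshold below are assumptions, and the Harris parameters are common defaults rather than tuned values:

```python
import cv2
import numpy as np

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

# Image gradients: special filters (Sobel kernels) approximate the derivative
# of intensity along x and along y.
grad_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
grad_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
edge_strength = np.sqrt(grad_x**2 + grad_y**2)  # large values ~ likely edges

# Corners: the Harris detector measures how the gradient varies in a local
# window (arguments: blockSize=2, ksize=3, k=0.04 -- common defaults).
harris_response = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
corners = harris_response > 0.01 * harris_response.max()  # boolean corner mask
```

The outputs (`edge_strength`, `corners`) are themselves just images, which is exactly the point made above.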

Parameter Estimation (aka Model Fitting)

A note on language here: many of us coming to CV from an ML background are used to hearing the word “fitting” and thinking of the training step of some ML algorithm, e.g., support vector machines. But in classical CV, a model can refer to much more than that. Here, we’re talking about geometric models; think of an algorithm that could tell you where all the lines in an image are. This is a step beyond something like the edge detector mentioned above, since we can generalize the approach to virtually any kind of geometric shape (which is why classical CV is often conflated with the term “geometric CV”). The other strength of these approaches is that we can encode the detected features mathematically. This allows for a much more compact representation that depends on just a few parameters (e.g., the slope and intercept, if we’re talking about a line model).

There are a few main techniques to highlight here; the one that matters most for what follows is RANSAC (Random Sample Consensus), which we’ll lean on again for image alignment below.
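Since RANSAC shows up in this post’s title, here’s a minimal, toy sketch of the idea for the simplest case: fitting a line to noisy 2D points. The iteration count and inlier threshold are arbitrary assumptions:

```python
import numpy as np

def ransac_line(points: np.ndarray, n_iters: int = 200, threshold: float = 1.0):
    """points: (N, 2) array of (x, y). Returns (m, b) of the best line found."""
    best_inliers, best_model = 0, None
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        # 1. Sample the minimal set needed for a line: two distinct points.
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if np.isclose(x1, x2):
            continue  # skip vertical pairs for this simple y = m*x + b model
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        # 2. Count inliers: points whose vertical distance to the line is small.
        residuals = np.abs(points[:, 1] - (m * points[:, 0] + b))
        inliers = np.sum(residuals < threshold)
        # 3. Keep the candidate model that the most points agree with.
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (m, b)
    return best_model
```

The same sample-score-keep loop generalizes to circles, planes, homographies, and so on; only the minimal sample size and the residual change.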

Image Alignment

One of the major problems we want to solve in scene understanding, once we can find features in one image, is: can we track how they would move, as a function of moving the camera around the original 3D scene where the image(s) were first captured?

There are lots of subproblems in the field of scene understanding. The one we studied in the CV course at Stevens was image alignment: “stitching” together photos that are of the same scene but shifted slightly relative to one another, so that they form a larger panorama (see the example below). Using what we know so far, this actually isn’t too bad to implement. It comes down to detecting features in both images, using a model-fitting technique (e.g., RANSAC) to estimate an unknown function representing the camera motion between the two images, and then using matrix math to overlay one image on top of the other (a rough code sketch follows the example steps below).

Example: Aligning Two Images of the Same Scene

Step 1: Have 2 images of the same scene that have an overlapping region.

Step 2: Detect the features in each image (in this case, we use Harris corner detection).

Step 3: Estimate the parameters of the shift that transforms the left image into the right one.

Step 4: “Stitch” the two images into one unified panorama.
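For the curious, here’s roughly how this pipeline can look in code. This is a hedged sketch, not the course’s exact solution: the example above used Harris corners and a simple shift, while the sketch below uses ORB features and a full homography via OpenCV, with placeholder file names:

```python
import cv2
import numpy as np

# Placeholder file names; in practice these are the two overlapping photos.
left = cv2.imread("left.jpg")
right = cv2.imread("right.jpg")
gray_left = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
gray_right = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)

# Step 2: detect features and compute descriptors in both images.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(gray_left, None)
kp2, des2 = orb.detectAndCompute(gray_right, None)

# Match descriptors between the two images (brute force, Hamming distance).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

# Step 3: estimate the transform, using RANSAC to reject bad matches.
pts_left = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
pts_right = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(pts_right, pts_left, cv2.RANSAC, 5.0)

# Step 4: warp the right image into the left image's frame and overlay.
h, w = left.shape[:2]
panorama = cv2.warpPerspective(right, H, (w * 2, h))
panorama[0:h, 0:w] = left  # the left image anchors the panorama
```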

As this example shows, using geometric techniques we can learn a mathematical representation of the camera motion taking place between the images of a given dataset. This idea plays directly into approaches for higher-dimensional scene understanding, like Structure from Motion (SfM). After all, if we can sufficiently “reconstruct” a panorama of images in 2D, then in 3D, why shouldn’t we be able to reconstruct full holograms of real-world scenes, given enough data?

(Hint: there is a whole area of active research in this area. For those looking for fun deep learning projects to work on, please explore using tools like NeRFStudio!)

Geometric Computer Vision for Classification

Classification is another important problem in CV, and it’s highly related to scene understanding. Why? Because once we know how to detect features, those features can be used not only to estimate the relationships between images in a dataset, but also to recognize what’s in the image itself.

From Detection to Description

Detected features are complicated. On their own, they are often hard to use for classifying objects without first doing your homework to understand the nuances of your specific application.

  1. For example, let’s say you’re building an app to help kids eat more healthily. You want it to be able to recognize occurrences of fruits in an image. You could say you’re using color as a feature in your model. But then, how would you be able to tell an apple from a strawberry?
  2. Alternatively, you might say you want to use scale… but then, how will you be able to recognize fruits when the camera zooms in or out?

Questions like these are why we usually don’t use detected features by themselves; we build feature descriptors instead. “Feature descriptor” is an umbrella term, but it (informally) refers to any kind of representation we use to capture all the characteristics of a feature that we want to look for in an image.

Feature descriptors can take many forms, but the simplest is usually a histogram. For example, a detected feature might simply be the texture you find on different fruits in an image, whereas a feature descriptor would be a histogram of that texture, built for each fruit. Why is that powerful?

  • Key: in the training phase of a classifier, if you take the histogram of textures for multiple fruits, then you have created a sort of numerical “index” of the dataset — this unlocks your ability to apply ML.
  • Continuing with the example, you could now take a new test image, extract a feature descriptor for one of its fruits, and compare it to the histograms created during training. The one with the highest similarity can then be used to label the fruit in the new image (a tiny sketch of this comparison follows below)!
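Here’s that comparison step as a tiny sketch, assuming each fruit’s texture has already been boiled down to a 1-D array of measurements (the random values below are placeholders, not real data):

```python
import numpy as np

def texture_histogram(texture_values: np.ndarray, bins: int = 16) -> np.ndarray:
    """Build a normalized histogram descriptor from raw texture measurements."""
    hist, _ = np.histogram(texture_values, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    """Similarity in [0, 1]: how much the two histograms overlap."""
    return float(np.minimum(h1, h2).sum())

# "Training": one stored descriptor per known fruit (placeholder values).
index = {
    "apple": texture_histogram(np.random.rand(500)),
    "strawberry": texture_histogram(np.random.rand(500)),
}

# "Testing": label a new fruit by its most similar stored descriptor.
query = texture_histogram(np.random.rand(500))
label = max(index, key=lambda name: histogram_intersection(index[name], query))
```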

Sidebar: the kind of data structure (i.e., the index of histograms) used to enable this kind of labeling is often referred to as a Visual Bag of Words (VBoW). For folks familiar with Natural Language Processing (NLP), the benefits and use cases of this kind of representation parallel those of using a bag of words for text corpora.

Another Example of Using a Visual Bag of Words: Image Data Association

To make the idea of VBoWs more tangible, consider an additional scenario: data association.

  • Let’s say you have a dataset, maybe based on the popular SfM Camera Calibration Benchmark, that contains images of a whole bunch of different locations, each from a variety of viewpoints. Before even classifying any objects, our goal is to reorganize the images so that images of the same location are grouped together.
  • For the data scientists in the room, this is a good scenario in which to apply clustering!
  • To do this, we start by describing features. In this example, our descriptors for the features in a single image will be more sophisticated than a plain histogram: we’ll use SIFT (a popular technique described in the OpenCV docs).
  • After that, we need a more “global” representation of each image. That is, we create a histogram of the SIFT features in each image; this will act as our VBoW across the dataset.
  • Finally, we use a technique like K-Means to group the photos into location-wise groups (a rough sketch of this whole pipeline follows this list).
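Here’s a rough, hedged sketch of that pipeline, assuming OpenCV (with SIFT available) and scikit-learn. The file names, vocabulary size, and cluster counts are placeholders:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

# Placeholder dataset; in practice, many photos of several different locations.
image_paths = ["img_000.jpg", "img_001.jpg", "img_002.jpg", "img_003.jpg"]
sift = cv2.SIFT_create()

# 1. Extract SIFT descriptors from every image (assumes each image yields some).
per_image_descriptors = []
for path in image_paths:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(gray, None)
    per_image_descriptors.append(descriptors)

# 2. Learn a "visual vocabulary" by clustering all descriptors together.
vocab_size = 50
all_descriptors = np.vstack(per_image_descriptors)
vocabulary = KMeans(n_clusters=vocab_size, n_init=10).fit(all_descriptors)

# 3. Describe each image as a histogram of visual words: its VBoW.
vbows = []
for descriptors in per_image_descriptors:
    words = vocabulary.predict(descriptors)
    hist, _ = np.histogram(words, bins=vocab_size, range=(0, vocab_size))
    vbows.append(hist / max(hist.sum(), 1))

# 4. Cluster the VBoW histograms to (hopefully) group images by location.
location_groups = KMeans(n_clusters=2, n_init=10).fit_predict(np.array(vbows))
```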

Why is that cool? When visualizing the images, we might see we have photos of 1 scene that has a tractor, next to photos of a totally different scene with something like a water fountain. This heterogeneity makes scene understanding much more difficult. But after using an approach like this, we can visualize one of the K clusters our algorithm automatically found, and see that they’re all from the same scene:

A cluster of images automatically found by applying K-Means clustering to a set of VBoWs of an image dataset, with multiple different locations.

Part II: Deep Learning Era (post-2012): Implicit Representations Drive Computer Vision Forward

Time moves on: neural networks take off in 2012, thanks to breakthroughs from Geoffrey Hinton’s lab in applying them specifically to CV (the AlexNet moment). New key questions arise in the field:

  1. How do approaches for image classification that use deep learning differ from those in the classical era of CV?
  2. How are they similar?

Deep Learning for Scene Understanding

The course at Stevens actually didn’t go too deep into this area during the semester. So we’ll leave it as a question mark for now.

(Brownie points if you caught the pun above :).

Deep Learning for Classification

To recap, in the era before deep learning became ubiquitous in ML, image classification pipelines boiled down to something akin to the following:

  1. Data Preprocessing
  2. Feature Engineering (Detection + Description)
  3. Using some kind of ML algorithm (most commonly an SVM, or an ensemble like a random forest) to predict labels by comparing the index of descriptors built up during training to those computed on test images.

On its surface, image classification via deep learning may seem like a mysterious black box — but the truth is, the rules of this game don’t change!

Rise of The CNNs

CNNs are a type of deep learning model developed specifically for dealing with images. They have remained popular from 2012 to this day (even with the rise of more modern approaches like transformers). They essentially build on the three-step process used in the classical era:

  1. Automate the process of engineering features for classification (i.e., steps 1 and 2 above), as we’re no longer handcrafting them manually.
  2. Get an output prediction at the end (same as step 3 above).
  3. Use an iterative optimization technique, gradient descent driven by backpropagation, to eventually reach a high accuracy.

What’s unique to CNNs is how they do step 1 above. What replaces hand-crafted feature engineering? Convolutional layers.

At a high level, convolutional layers are a set of trainable filters used to preprocess a set of images. As we learned above, filters transform images into a representation more amenable for learning. So once those filters are optimized, CNNs can achieve stunning results in terms of robust generalization for applications like image classification (more details on this are linked in the “Further Reading” section below).
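To make that concrete, here’s a minimal sketch of a small CNN classifier. I’m assuming PyTorch, and the layer sizes, input resolution, and number of classes are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolutional layers: trainable filters that replace the hand-crafted
        # feature detection/description of the classical pipeline.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 16x16 -> 8x8
        )
        # Classifier head: plays the role the SVM / random forest played above.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, start_dim=1))

# Usage: logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # fake 32x32 RGB image
```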

Toward the end of this part of the course, Stevens discussed several famous CNN architectures, from AlexNet up through ResNet.

For the sake of brevity, I’ll let you read into the technical details of these architectures on your own. The key takeaway is that, as time went on, each of these models added more layers than the last, and the deeper they got, the better performance they achieved, thanks to the power of convolutions. ResNet’s big contribution was the residual (skip) connection, which made it practical to train networks that deep in the first place.
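Here’s a hedged sketch of that key ResNet idea: a residual block whose skip connection adds the input back onto the block’s output (again assuming PyTorch; the channel count is arbitrary):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip connection: add the input back in

# Usage: y = ResidualBlock(32)(torch.randn(1, 32, 16, 16))
```

Because each block only has to learn a small correction on top of its input, stacking many of them stays trainable, which is what let ResNet go far deeper than its predecessors.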

Conclusions

To those of you who are curious about CV and ask questions: we’ve come full circle. When I tell people I’m studying Computer Vision (CV) at Stevens Institute of Technology, I usually get asked the following kinds of questions:

  • Could you explain the kinds of models you work with?
  • How would you go about solving [insert problem] in my domain? (usually, my domain is something like robotics, computer graphics, etc.)

And the answer I give is: CV is about more than just understanding the various models and techniques. There are numerous applications that don’t have to be related to robots navigating deep-sea environments or AR/VR; it could be a simple app you make to help kids eat cleaner.

As I write this, there are still so many unanswered questions in computer vision. Questions for you and me to keep exploring, and (as I would encourage you) to keep trying to solve. And then someday, maybe you’ll be in the pages of CV history as well. :)

Further Reading

  1. Big shout-outs are in order for my CV professor (and his TAs) at Stevens: Professor Dunn, Siyuan, and Liyan. Please go follow their work, as they are active contributing researchers.
  2. This is the repo for the code I implemented in the course. Admittedly it’s a little messy, but please feel free to make PRs (and/or leave a comment if you would like a future blog on any of the modules, as this post was intentionally light on the coding details)!
  3. Get a better intuition for how 2D convolutions (as used in CNNs) work through this post by Irhum Shafkat: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science.
  4. Better understand the pre-requisite Math for CV, through this self-paced course at Stanford: Mathematical Methods for Computer Vision, Robotics, and Graphics.
