Scientific Dissection of Human Vision
Why has image recognition been such a difficult task? Decades ago, when computers became capable of crunching through thousands of pixels, we started wondering whether we could give a computer the ability to see. Image processing soon became a big domain. We learned concepts such as brightness and contrast, hue and saturation, and shadows and highlights. We then developed edge detectors, and we soon started realizing that object recognition was a super-hard problem: hand-coding rules to decode concepts from noisy, high-variance data becomes extremely tedious.

But then came neural networks! Convolutional Neural Networks became the de facto standard for classifying images. Neural networks are curve-fitters, or as mathematicians call them, universal function approximators. They are great at generalizing. I am, however, a little unimpressed with the flexibility they offer. Firstly, training a neural net has to be preceded by curating a dataset of hundreds of thousands of labelled images. When you train a CNN to classify labels from a given image, you are simply giving a computer a problem to solve whose solution you might never come to know. You are gifted a black box that takes inputs and gives you the outputs you asked for. The black box also comes with constraints on how much information you will be given about a test image you supply for classification. My mind did not agree with this method, and hence I am going to follow a classical approach and find methods to recognize objects and concepts in images without using a strict convolutional neural network.
Dissecting an Image
Properties of a Pixel
In terms of RGB data, a pixel can contain 0-100% red, 0-100% green, and 0-100% blue. A grayscale pixel contains only a luminosity, from 0 to 100%. 100% generally corresponds to 255 at 8 bits per channel. To derive the luminosity of a pixel from its three colors, we average the three color values of that pixel.
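As a minimal sketch in Python (assuming 8-bit channels and the simple-average definition above):

```python
def luminosity(r, g, b):
    # Plain average of the three 8-bit channel values, as defined above.
    # (Perceptual formulas weight the channels unequally, but we use the
    # simple average here.)
    return (r + g + b) / 3

print(luminosity(255, 255, 255))  # 255.0 -> 100% luminosity
print(luminosity(0, 0, 0))        # 0.0   -> 0% luminosity
```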
The complexity starts to grow when you start placing pixels next to or below one another, and low-level properties emerge.
Properties of an Image
Global image properties are intrinsic characteristics of an image as a whole; they exist regardless of the objects within the image.
Brightness — The brightness of an image is how far all of its pixels, considered together, are from blackness, or a luminosity of 0. When we increase or decrease the brightness, we simply add a brightness factor to, or subtract it from, all the color values of all the pixels.
Luminance Contrast — The luminance contrast of an image is how wide the gap is between its darkest pixels and its brightest pixels. To increase the contrast, we increase the luminosity of the lightest pixels and decrease the luminosity of the darkest pixels; to decrease it, we do the opposite.
Color Contrast — The color contrast of an image is how wide the gap is between the least red and the most red pixels, the least green and the most green pixels, and the least blue and the most blue pixels. Increasing or decreasing color contrast is calculated the same way as luminance contrast, but applied to each color channel separately.
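As a rough sketch of both adjustments (assuming 8-bit NumPy image arrays; the gain parameter and the clipping behaviour are my own choices, not from the text):

```python
import numpy as np

def adjust_brightness(img, factor):
    # Add a constant brightness factor to every color value of every pixel,
    # clipping the result to the valid 8-bit range.
    return np.clip(img.astype(np.int16) + factor, 0, 255).astype(np.uint8)

def adjust_contrast(img, gain):
    # Push values away from (gain > 1) or toward (gain < 1) the mean,
    # widening or narrowing the gap between darkest and brightest pixels.
    # Applied per channel, this matches the color-contrast description.
    mean = img.mean(axis=(0, 1), keepdims=True)
    out = (img.astype(np.float32) - mean) * gain + mean
    return np.clip(out, 0, 255).astype(np.uint8)
```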
Feature Patterns
Think about this: no matter how bright or contrasted an image is, the objects inside it remain recognizable. This means that pixel values change, but something stays intact: the spatial distribution of pixels. This trickles down from the full view of the image to a tiny section of it. The fact that the pixels showing a shoe lie below the pixels showing a leg is good enough for us to recognize that it is a person's leg. Yet how did you know it was a person's shoe? Because the pixels in the different regions of the shoe were in a particular spatial arrangement: the laces in the middle, the sole at the bottom, and the shoe texture on the sides.
What's important here is that a higher-level pattern feature is made from many lower-level pattern features. Of course, you can dissect only so many times; at some point the features dissolve into pixelated fragments that make no sense.
The spatial arrangement should be picked up, and should be improvable for each object type as the number of images grows; however, we need to structure the algorithm in such a way that it does not require thousands of images per category. We also need our algorithm, unlike neural networks, to know why it thinks a certain object exists in the image.
Dimensionality
A problem we will encounter when trying to create links between different features is the two-dimensional nature of images: change gradients have to be recorded across 360 degrees. We also need the algorithm to work irrespective of the size of the object in the test image, so the feature patterns detected in a test image should be recorded in a relative manner, without capturing the actual number of pixels within the spatial distribution.
How do we create a haystack of a 2-dimensional arrangement of features?
Concepts
Humans are adept at generalizing and at finding commonalities between different samples of an object. Higher-level features can be called concepts.
An ability to form concepts on the fly when pinpointed would be great, since it would demonstrate one-shot learning from only a few samples. Of course, a single 512-by-512-pixel image can represent objects of millions of types, and therefore we can't take the neural networks route. We have to look past them for a while.
The algorithm needs the power to find commonalities between things, beyond just dissecting images into features, and to generate a formula from them.
Detecting low-resolution features after training on high-resolution features is how a bike shown at a distance in a test image can be picked up.
Theory of Commonalities
The reason two objects are referred to by the same label is that they have something in common. Mining for those commonalities can be our training. Often an object, such as a bus, has many sub-types: red bus, school bus, new bus, old bus, double-decker, single-decker, and so on. Each of these categories has its own commonalities. The commonalities between multiple school buses still include the commonalities of all buses, but they have additional commonalities comprising the features shared by school buses. Therefore, each category needs to host its sub-categories, and each category and sub-category will store only the commonalities at its own level. Buses will host the most basic commonalities, but as we go more specific, the commonalities of the parent category supplement the commonalities of the child category.
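One way this hierarchy could look in code (a sketch; the category names and the dictionary layout are illustrative, not from the text):

```python
# Each category stores only the commonalities found at its own level;
# a sub-category is matched by combining its commonalities with those
# of all of its ancestors.
category_tree = {
    "bus": {
        "commonalities": {"wheels_below_body", "windows_in_a_row"},
        "children": {
            "school_bus": {"commonalities": {"yellow_body"}, "children": {}},
            "double_decker": {"commonalities": {"two_window_rows"}, "children": {}},
        },
    }
}

def effective_commonalities(tree, path):
    # Walk down the category path, accumulating commonalities on the way.
    node, acc = {"children": tree}, set()
    for name in path:
        node = node["children"][name]
        acc |= node["commonalities"]
    return acc

print(effective_commonalities(category_tree, ["bus", "school_bus"]))
# {'wheels_below_body', 'windows_in_a_row', 'yellow_body'}
```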
Programming
Feature Encoding
Features have to be captured mathematically, cached, and stored, ready to be compared. Let's symbolize each unique feature as a vector: the direction of the vector shows the direction of change, and the magnitude shows how much change occurred.
The vector of each PPP (Pixels per [super]-pixel) can be computed using matrix arithmetic.
With some experimentation, we have found that a gradient can be obtained using a numerical differentiation method. The gradient provides a direction for each point within a PPP. We then average the dXs and dYs separately, and finally calculate the angle component of the vector as arctan(avg dY / avg dX). The magnitude is given by dX + dY at this point, since higher dX and dY mean there is higher continuous contrast within the PPP.
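A minimal sketch of that computation (assuming NumPy; I use arctan2 rather than a plain arctan so the quadrant is preserved, and absolute values of the averaged gradients for the magnitude):

```python
import numpy as np

def ppp_vector(ppp):
    # ppp: 2-D array of luminosities for one super-pixel.
    # Numerical differentiation yields per-point gradients along y and x.
    dy, dx = np.gradient(ppp.astype(np.float32))
    avg_dx, avg_dy = dx.mean(), dy.mean()
    # Angle component of the vector, in degrees.
    angle = np.degrees(np.arctan2(avg_dy, avg_dx))
    # Magnitude: dX + dY, a proxy for the continuous contrast in the PPP.
    magnitude = abs(avg_dx) + abs(avg_dy)
    return angle, magnitude
```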
By extension, therefore, we limit the number of possible features to 180 × 100 = 18,000, minus constraints placed by the limited size of the PPP. That's not too big for a computer to store in RAM. We can separately store the average color of each PPP for color distinction. Colors too can be rounded off to the nearest 100th in order to save memory.
To do this, compute the mean of the image; then, for each PPP, compute its mean and see which regions of the PPP fall into white and which into black. This gives us the direction of the vector. The magnitude of the vector is the mean of that PPP.
After extracting vectors, the image would look something like a field of small arrows, one per PPP, overlaid on the original picture.
Now that we have vectors for each PPP, pointing perpendicular to the change in luminosity, let's find ways to represent their spatial distribution.
Choosing a key for a vector:
direction in degrees rounded + magnitude rounded to the first decimal
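For example (the underscore separator is my own choice):

```python
def vector_key(angle_deg, magnitude):
    # Direction rounded to the nearest degree, magnitude to one decimal.
    return f"{round(angle_deg)}_{round(magnitude, 1)}"

print(vector_key(44.7, 0.3481))  # "45_0.3"
```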
Spatial Distribution Encoding
There's a way to encode a 2-dimensional relationship into a 1-dimensional string: denote every unit change in direction with a unique character.
Because we want to encode 2-dimensional movement, we can use n, e, w, s, a, b, c, and d for north, east, west, south, north-east, north-west, south-east, and south-west respectively.
Therefore, a random line-trace from the matrix below

[ 15, 20,  3
   2, 15, 18
  19,  4, 45 ]

can be represented as 15s2e15s4: start at the top-left 15, step south to 2, east to 15, and south again to 4.
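A sketch of this encoding, assuming a trace is supplied as a list of (row, column) cells:

```python
# Direction characters as defined above: n, e, w, s for the cardinal
# directions and a, b, c, d for NE, NW, SE, SW.
STEP_CHAR = {
    (-1, 0): "n", (0, 1): "e", (0, -1): "w", (1, 0): "s",
    (-1, 1): "a", (-1, -1): "b", (1, 1): "c", (1, -1): "d",
}

def encode_trace(matrix, cells):
    # cells: the (row, col) positions visited by the line-trace.
    parts = [str(matrix[cells[0][0]][cells[0][1]])]
    for (r0, c0), (r1, c1) in zip(cells, cells[1:]):
        parts.append(STEP_CHAR[(r1 - r0, c1 - c0)])
        parts.append(str(matrix[r1][c1]))
    return "".join(parts)

m = [[15, 20, 3], [2, 15, 18], [19, 4, 45]]
print(encode_trace(m, [(0, 0), (1, 0), (1, 1), (2, 1)]))  # 15s2e15s4
```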
It was observed through experimentation that storing the spatial distribution encoding for every key-pair at increasing lengths is memory-intensive. Therefore, we can instead store 15s2e15s4 as 15.s.2.e.15.s.4 in object format, where a path such as 15.s.3.x.y will be placed inside 15.s.
The data structure can be:
a{{}=class_names, lastDir:’’, D.b{{}=class_names, lastDir:’’, D.e{{}=class_names, D.f{…}}}}
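That notation is shorthand; a minimal concrete sketch of one possible realization, using nested dictionaries (the exact layout is my assumption):

```python
# Each trie node keys on the next path element (a vector key or a
# direction character) and carries the class names seen along it.
def new_node():
    return {"class_names": [], "children": {}}

def insert_path(root, path, class_name):
    # path alternates vector keys and direction characters,
    # e.g. ["15", "s", "2", "e", "15", "s", "4"]
    node = root
    for part in path:
        node = node["children"].setdefault(part, new_node())
        node["class_names"].append(class_name)

trie = new_node()
insert_path(trie, ["15", "s", "2", "e", "15", "s", "4"], "bus")
```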
Possible loopholes of this type of spatial encoding:
If we store the above spatial encoding example 15.s.2.e.15.s.4 as it is, then when a test image is shown, we want the system to match 15.s.2.e.15.s.4 as well as just 2.e.15.s.4. Therefore, we will store each vector as a child of its previous vector.
The memory capacity for storing things this way would be enormous. We should find a way to match a portion of a feature without having to store it separately.
Storage of Features-to-class Mapping
In order to store the features efficiently, we allocate an array inside the feature-data of every spatial distribution. This allows us to accurately pick up class names from the features of a test image.
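Continuing the trie sketch above, a lookup could walk the stored structure and read off the class-name array at the deepest node reached:

```python
def classes_for_path(root, path):
    # Follow the observed path as far as it exists in the trie; the
    # class-name array at the deepest node tells us which classes
    # share this spatial distribution.
    node = root
    for part in path:
        if part not in node["children"]:
            break
        node = node["children"][part]
    return node["class_names"]

print(classes_for_path(trie, ["15", "s", "2"]))  # ['bus']
```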
Unsupervised Learning
Obviously, as the system sees more and more images, certain features start repeating. The repeating features are cached as commonalities. The mere fact that some features repeat means that those images have things in common. This wisdom can be learned without supplying labels; labels come in when the commonalities are given a name.
We can perform unsupervised learning by passing each image through the features cache to prime it. Each image will then yield the vectors it has in common with the global cache.
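A minimal sketch of that mining step (the min_repeats threshold is my own illustrative parameter):

```python
from collections import Counter

def mine_commonalities(images_to_keys, min_repeats=2):
    # images_to_keys: one set of vector keys per image. Any key that
    # shows up in several images is cached as a commonality.
    counts = Counter(k for keys in images_to_keys for k in keys)
    return {k for k, n in counts.items() if n >= min_repeats}

cache = mine_commonalities([{"45_0.3", "90_1.2"}, {"45_0.3", "180_0.7"}])
print(cache)  # {'45_0.3'}
```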
Testing
Finally, the testing procedure should take an input image, get all of its vectors, loop through x and y, and check whether the entry at (x, y) has a key set in the learned memory. If yes, check whether any of its neighbors has been added as a child of the currently matched object in the cache. Repeat this process for all x and y.
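A rough sketch of that loop, reusing the trie layout from above and checking only one step ahead (a full version would follow chains recursively):

```python
def match_image(vectors, learned):
    # vectors: 2-D grid of vector keys for the test image.
    # learned: trie root built during training (see above).
    matches = []
    h, w = len(vectors), len(vectors[0])
    for y in range(h):
        for x in range(w):
            node = learned["children"].get(vectors[y][x])
            if node is None:
                continue  # this vector was never seen in training
            # Does any 4-neighbour continue a stored path?
            for dy, dx, step in ((-1, 0, "n"), (0, 1, "e"),
                                 (0, -1, "w"), (1, 0, "s")):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    child = node["children"].get(step)
                    if child and vectors[ny][nx] in child["children"]:
                        matches.append((y, x, step))
    return matches
```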
Scoring
Scoring is important. Simple objects with basic shapes will produce many single-vector matches against complex test images with a lot of noise. We need to penalize single-vector matches and incentivize multi-vector matches. We can square the number of vectors that match consecutively.
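For instance, a sketch of the squared-run scoring:

```python
def score(run_lengths):
    # Squaring the length of each consecutive-match run rewards long
    # coherent chains and keeps isolated single-vector matches from
    # dominating the score.
    return sum(n * n for n in run_lengths)

print(score([1, 1, 5]))  # 27: one 5-vector chain outweighs many singletons
```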