“Punched cards” object recognition

Petr Kovalev
4 min readNov 10, 2020


We are animals with a long evolutionary history. We were born to survive: to hunt, to spot predators, and to hide.

A key survival skill is the ability to identify prey or a predator in milliseconds. Due to the limitations of our optical system, we can see only a small area of the field of view clearly. Jeff Hawkins describes it as looking at the world through a straw:

High-resolution areas of the field of view

Our eyes make many saccades when we see a new object:

Eye saccades. An early study by Yarbus, 1967

For these “animal” reasons, object recognition should be done in the shortest possible time and in an energy-efficient way. I’d like to avoid the math here because I’m pretty sure our brain isn’t solving optimization problems when it learns to recognize objects. It simply receives the relevant electrical impulses from the optic nerves and matches them as quickly as possible with known objects.

What do time and energy efficiency imply?

  • The number of saccades is minimal, ideally one-shot object recognition
  • The saccade trajectory is the shortest and is directed to the next optimal point, necessary for quick recognition of similar objects
  • The number of electrical impulses (neuron firings) is minimal to minimize energy consumption

In most cases, it is impossible to recognize the entire object without many saccades:

Object recognition from multiple senses

Sometimes seeing just the eye is enough to recognize an animal:

Horse’s eye

Numenta proposed the idea that our brain performs object recognition at multiple levels of detail: from tiny high-resolution patches to an ensemble of perceptions across the whole body. But at each layer, the brain tries to recognize an object.

Saccades do not apply to object recognition in digital images. A machine is not limited by the eye’s field of view: it can process any specific area of the image simultaneously and analyze combinations of its different parts.

An interesting fact about object recognition by artificial neural networks: here the machine is better at filling in the gaps in an image than a human is. So it is generally possible to automatically reconstruct and recognize objects without having the full information:

Denoising for ray tracing (filling the gaps)

Multiple parts of the image can potentially help identify an object faster. Given a training set, it is possible to find the pixels that are most representative of a given label. It is like viewing a test image through various preprocessed “punched cards”. Sometimes it takes only one “card” to recognize an object; sometimes more. Each subsequent choice of “punched card” should be optimal in order to eliminate ambiguity between objects.

Punched card
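
The author’s proof-of-concept is in C#; the matching idea can be illustrated with a minimal Python sketch. The toy images, hole count, and `punch_card` helper below are hypothetical, not part of the original implementation:

```python
import random

random.seed(0)

# A "punched card" is a fixed set of pixel positions ("holes") on a
# binarized image. Reading a card means sampling the image at those holes.
def punch_card(image_bits, holes):
    """Return the tuple of bit values visible through the card's holes."""
    return tuple(image_bits[i] for i in holes)

# Toy 4x4 binary "images" flattened to 16 bits (hypothetical data).
circle = [0,1,1,0, 1,0,0,1, 1,0,0,1, 0,1,1,0]
cross  = [1,0,0,1, 0,1,1,0, 0,1,1,0, 1,0,0,1]

# One card with 4 randomly punched holes (sparsity 4/16).
card = random.sample(range(16), 4)

# Each label is represented by the pattern its training image shows
# through the card; a test image matches the label with the same pattern.
known = {"circle": punch_card(circle, card),
         "cross": punch_card(cross, card)}
test = circle  # pretend this is an unseen sample
matches = [label for label, pattern in known.items()
           if punch_card(test, card) == pattern]
print(matches)
```

Because the two toy images differ at every pixel, a single card is enough to disambiguate them; real data would need more cards.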

The approach of using Transformers for object detection by calculating the “attention” of each pixel of the image feels similar.

How do we find the optimal “punched cards”? Randomly select “holes” with a given sparsity and identify the most representative cards on the training set!

But if we choose the positions of the “holes” at random, we can just as well switch from image data to the raw binary representation: “punch” bits, not pixels. The “punched card” then becomes a binary “sensor” on the input bits. At this level, the nature of the data becomes irrelevant (as long as the samples share structural similarity).
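
A minimal sketch of that bit-level view, again in Python rather than the original C#. The sample data, card count, and “separates the labels” criterion are illustrative assumptions:

```python
import random

random.seed(1)

def make_cards(input_bits, num_cards, holes_per_card):
    """Generate random sparse 'punched cards': each is a list of bit positions."""
    return [random.sample(range(input_bits), holes_per_card)
            for _ in range(num_cards)]

def read_card(bits, card):
    """The card acts as a binary sensor: it reads only the punched bits."""
    return tuple(bits[i] for i in card)

# Raw binary inputs: the data's origin (pixels, audio, text) is irrelevant,
# as long as samples of the same label share structure (hypothetical data).
samples = {
    "A": [1,1,0,0,1,0,1,0],
    "B": [0,0,1,1,0,1,0,1],
}

cards = make_cards(input_bits=8, num_cards=3, holes_per_card=3)
for card in cards:
    readings = {label: read_card(bits, card) for label, bits in samples.items()}
    # A card is "representative" if its readings separate the labels.
    print(card, "separates labels:", readings["A"] != readings["B"])
```

On real data most random cards would not separate all labels, which is why the representative ones must be identified on the training set.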

The advantages of this approach:

  • Single-shot object recognition starting from one sample per label
  • Ability to update the optimal “punched cards” after each new training sample: continual learning
  • Minimal computational efforts on object recognition (sparse binary operations), easy parallelization
  • Small dependency on the specific pixels: robust to noise
  • Robust to any kind of training data pixel permutations
  • No back-propagation required
  • Interpretability

Potential issues:

  • It is a known fact that the visual cortex is sensitive to object edges. This is not the case for this “punched cards” model
  • A statistical approach, no “smart” logic: hard to gain SOTA-comparable accuracy
  • Dependency on the training data structural similarity

I developed a proof-of-concept in C# and applied it to the QMNIST and Fashion-MNIST datasets. “Punched cards” are ranked in descending order of the number of unique training-set binary input vectors they produce.
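
That ordering criterion can be sketched as follows (a Python illustration, not the C# proof-of-concept; the tiny random training set and card sizes are hypothetical):

```python
import random

random.seed(2)

def unique_readings(card, training_bits):
    """Count distinct bit patterns the training set produces through a card."""
    return len({tuple(bits[i] for i in card) for bits in training_bits})

# Hypothetical tiny training set of 8-bit binary inputs.
training = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
cards = [random.sample(range(8), 3) for _ in range(5)]

# Rank cards in descending order of unique readings, as described above.
ranked = sorted(cards, key=lambda c: unique_readings(c, training), reverse=True)
for card in ranked:
    print(card, unique_readings(card, training))
```

A 3-hole card can produce at most 2³ = 8 distinct readings, which hints at the plateau discussed below: once every training vector reads uniquely, longer cards stop adding information.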

For QMNIST, the results are promising:

Correct handwritten digit recognitions out of 60,000 test samples

The Fashion-MNIST numbers are disappointing:

Correct object recognitions out of 10000 test samples

As the bit length of the “punched card” increases, all binary input vectors become unique, so the accuracy plateaus.

The main question to answer next: how do we determine the optimal “punched cards” for a given training set when all the training samples’ binary inputs are unique? It would also be interesting to try a multi-layer “punched cards” approach.
