How the Theory of Active Perception Is Going to Change Computer Vision Technology (The Terminator, Only a Kinder One, Will Become a Reality)

Imagine James Cameron’s Terminator equipped with modern computer vision technology. How much would such a robot be able to see and recognize while walking down the street? Apparently, it would be more like a blind puppy than an intimidating robot: modern computer vision technology would not let it take even a few steps down a busy street. For example, a Tesla can recognize only a very limited number of objects, which is next to nothing compared with people or the Terminator from the movie, yet it is still considered a breakthrough in computer vision technology. Viewed against the recognition abilities of the human brain, though, this technology remains primitive.

The catch is that a neural network trained to recognize license plates will not be able to recognize one in a photo or video if, besides the license plate, there is someone’s face in it. Likewise, a recognition system integrated into a car that does well at recognizing other cars and light sources will not notice an elk on the road, let alone read (recognize) a shop sign near the highway. Handling such cases will very likely require additional neural networks or additional training, which translates into additional resources. Just imagine the amount of resources needed to recognize objects in all possible classes. But the Terminator from the movie apparently could see (classify and recognize objects and solve other standard vision-related tasks) “like a human being”. We believe this is already possible today with computer vision based on the Theory of Active Perception (TAPe). We have already taken the first steps in this direction and obtained some interim results.

Recognition without Convolution

As one such result, we have developed a video-comparison technology that can be used, for example, to search for and recognize, in real time, hundreds of thousands of specific video clips across thousands of channels, movie libraries, and video hosting services. Currently, searching for videos by video does not account for even a remotely significant share of video search requests: everyone searches using text first, that is, titles, names, descriptions, or tags. It is what people are used to, but in reality it is neither convenient nor precise: you have to wade through a lot of unnecessary information (for example, tons of repeated content) before you find what you really want. Searching for videos by video with the help of TAPe would make the process much more convenient and straightforward. And all we need to solve the above-mentioned tasks is a single server without any graphics cards.

One of the reasons behind such efficiency is that our algorithms do not use convolution, the most resource-intensive operation and one that most modern computer vision neural networks depend on. The human brain does not need this type of operation, and our technology is built to follow the processing pattern used by the human brain, or the language of thought. Like the human brain, TAPe-based technology processes any image at once as a whole, and the recognition results do not depend on the absence of noise.
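To give a rough sense of why convolution is considered expensive, the sketch below counts the multiply-accumulate operations in a single standard convolutional layer. It illustrates conventional convolutional networks only, not TAPe; the function name and the layer dimensions are hypothetical and chosen purely for illustration.

```python
def conv2d_macs(height, width, in_channels, out_channels,
                kernel_size, stride=1, padding=0):
    """Count multiply-accumulate operations for one 2D convolution layer.

    Hypothetical helper for illustration; it ignores bias terms and
    assumes a square kernel.
    """
    # Spatial size of the output feature map.
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    # Every output value needs one multiply-accumulate per kernel weight.
    macs_per_output = kernel_size * kernel_size * in_channels
    return out_h * out_w * out_channels * macs_per_output


# Assumed example layer: 224x224 RGB input, 64 filters of size 3x3, padding 1.
macs = conv2d_macs(224, 224, 3, 64, 3, stride=1, padding=1)
print(f"{macs:,} multiply-accumulates for one layer")
```

Under these assumptions a single layer already needs tens of millions of operations per image, and a typical network stacks many such layers, which is why this kind of workload is usually run on graphics cards.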

Simultaneous Reading of Key Features

The second reason why the technology is so efficient is that it can simultaneously obtain a map of any image’s key features at any level of detail. “Simultaneously” means that the features are read all together. And the number of those key features is the minimum sufficient to solve any computer vision task. What does this mean?

Currently, developers train image recognition neural networks by first applying several dozen feature detectors to the training images, thus generating feature maps. The efficiency of image and video recognition by modern algorithms depends in many ways on the adequacy and precision of the feature maps: while some developers may need, for example, 100 features to recognize faces, others will employ only 80, and still others will use 150.
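The conventional approach described above can be sketched in a few lines. This is a minimal illustration of how feature detectors produce feature maps in standard pipelines, not TAPe's method; the two edge detectors, the function names, and the toy image are assumptions made for the example.

```python
def apply_detector(image, kernel):
    """Slide one feature detector (kernel) over a grayscale image.

    Uses no padding and a stride of 1. As in most neural network
    libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Two hand-picked detectors; real systems use dozens, often learned ones.
DETECTORS = {
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
}


def feature_maps(image, detectors=DETECTORS):
    """Return one feature map per detector, keyed by detector name."""
    return {name: apply_detector(image, k) for name, k in detectors.items()}


# Toy 4x4 image: dark left half, bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
maps = feature_maps(image)
for name, fmap in maps.items():
    print(name, fmap)
```

Each detector responds only to the pattern it was designed for: here the vertical-edge map lights up while the horizontal-edge map stays at zero. Choosing how many detectors to use, and which ones, is the tuning work the paragraph above refers to.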

The Theory of Active Perception makes it possible to avoid this painstaking process: by modeling the way the brain works, it “reads” the features needed to recognize an image all at once. According to TAPe, this is exactly how the brain recognizes information. Unlike standard neural networks, TAPe-based technologies do not need prior training to find features in a pixel array. According to the theory, an image (in the broadest sense of the word) read by the human visual analyzer is “automatically” broken down by the brain into those very features, which are constant and do not change irrespective of the task. Our technology does not require breaking the image down into pixels either. According to TAPe, any object (image) has a minimum sufficient number of features, and it was TAPe that helped us develop an algorithm for reading those features.

What's Next

As we already said at the beginning of the article, we believe that the Theory of Active Perception, if used in developing computer vision technology, will lead, first, to an essentially different architecture of neural networks and other similar algorithms within what is referred to as artificial intelligence. In addition, it seems likely enough that the architecture of computer processors may change and be renewed. Moreover, we are convinced that a broader perspective points to an essentially new information processing method. Indeed, the system of 0s and 1s that computers are based on is ingeniously simple and at the same time efficient, but it is still not enough for computers to keep up with the brain in terms of efficiency. By describing the “mathematics”, or rather the processing done by the innate human perception mechanism, the Theory of Active Perception offers a way to bridge, or at least substantially reduce, the gigantic gap between technology that imitates the operation of the human brain and the brain itself.
