The Theory of Active Perception (TAPe)
With the Help of Group Theory, TAPe Describes the Way the Human Brain Perceives Information, While the Discovered Isomorphism between TAPe and Natural Human Language Suggests a New Information Processing Method
We have developed the Theory of Active Perception (TAPe), which describes the way the human brain perceives information. The Theory is based on a mathematical model that relies on group theory. We have also discovered an isomorphism between TAPe and natural human language. Together, these findings suggest a new information processing method that could be applied in a wide variety of areas.

At this stage, we are referring primarily to computer technology. New principles for building both neural network architectures and computer processor architectures might become possible. As an additional example, technologies developed using TAPe can help make a leap forward in computer vision: the Theory can be used to create algorithms able to recognize any image with the same apparent ease as the human brain. While what is referred to as artificial intelligence uses large image databases to learn to recognize objects within a single class (for example, faces, fingerprints, or tiger skin patterns, with each class requiring a separate AI instance), our brain accomplishes all of those tasks at once. Using TAPe, we are creating a technology that is able to do roughly the same.

Brief Description of the Core Principles in the Theory of Active Perception

We will refrain from giving a detailed mathematical description of the Theory here, as it constitutes our know-how. Nevertheless, we are going to disclose its basic principles.

We believe that the Theory of Active Perception mathematically describes what is referred to as the language of thought. It is important to understand that the brain, naturally, does not use the mathematics we are familiar with, and that it is not a computer dealing with 1's and 0's. If we draw a parallel with computers, the brain instead deals with elements and symbols that constitute a system, a kind of “alphabet” that we have decided to call languagemathics. We will take the liberty of using this newly coined term, as we are convinced that it describes the essence of the brain's information-perception processes most accurately. What can conventionally be referred to as language elements (“letters”) interact with one another according to the mathematical laws of group theory, generating new, more complex elements (“words” and “sentences”). It is this process of elements interacting with one another and generating new elements that the Theory of Active Perception describes.

Note: Strictly speaking, this is not AI but a neural network–based classifier. However, we will henceforth stick with the more colloquial term, which is more widespread and routinely understood by everyone.

Note: The first person to have put forward a language of thought (mentalese) hypothesis, as far back as the 1970s, was Jerry Fodor, an American philosopher and psycholinguist. He also suggested that the internal mental language is a means of coding information, while the predicates of that language are innate. Fodor's hypotheses chime with generative linguistics and the theory of innate language structure developed by the American linguist Noam Chomsky.

Key Components of the Theory of Active Perception

The Theory of Active Perception uses a finite number of elements that, according to certain laws, are grouped at three different levels. The first-level elements amount to a couple of dozen; they can merge with one another to generate second-level elements. The second-level elements already number a couple of hundred, and they, too, can merge with one another to generate third-level elements. The third-level elements, in their turn, already number a few tens of thousands, and they represent more complex objects. It is with their help that information, for example an image, is recognized. There is a minimally sufficient number of first-level, second-level, and third-level elements, meaning they make up exactly the amount necessary to perceive any piece of information.
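
To make this structure more tangible, here is a minimal sketch of how a small first-level “alphabet” could combinatorially generate the second and third levels. The element names and combination rules are hypothetical placeholders chosen only to reproduce the orders of magnitude quoted above; the actual TAPe elements and laws are not disclosed in this article.

```python
from itertools import combinations

# Illustrative sketch only: the element names and combination rules below are
# hypothetical placeholders, chosen to reproduce the orders of magnitude
# quoted in the text, not TAPe's actual (undisclosed) elements or laws.

first_level = [f"e{i}" for i in range(24)]            # "a couple of dozen"

# Hypothetical law: a second-level element is an unordered pair of distinct
# first-level elements (24 * 23 / 2 = 276, i.e. "a couple of hundred").
second_level = [frozenset(p) for p in combinations(first_level, 2)]

# Hypothetical law: a third-level element is a pair of second-level elements
# that share no first-level element (tens of thousands of variants).
third_level = [
    (a, b) for a, b in combinations(second_level, 2) if not (a & b)
]

print(len(first_level), len(second_level), len(third_level))
# -> 24 276 31878: the same orders of magnitude as in the text
```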

Now, how does this relate to the human brain? Based on TAPe, we believe that the human brain uses certain filters to perceive information (visual information, for example). In TAPe, those filters are represented by the first-level elements. They constitute raw data: the very features used in recognition technology. To recognize an image, the brain needs a minimum number of those feature filters. Apparently, when the human visual analyzer perceives (“sees”) certain information, the filter “assumes” a part of the information load, and this information is used in a neural network. We neither know how exactly that happens, nor is it important to us. What is important is that, according to TAPe, any kind of visual information can be broken down into the first-level elements.

We believe that music is a good analogy for this element structure and for the interactions between elements. In music, the first-level elements are represented by notes.

The second-level elements are constituted, first and foremost, by the laws according to which the first-level elements form certain connections, in certain sequences, and are thus joined into groups. Those laws, together with the groups of elements resulting from connections between the first-level elements, make up the second-level elements. To continue the musical analogy, the second-level elements would be chords built from notes. The notes are combined according to a certain law; otherwise there would be no chords.

Finally, the third-level elements, which describe any visual information to a T, result from joining or combining the second-level elements. Sometimes they can be made up only of first-level elements, or of combinations of first-level and second-level elements (while the second-level elements can only be made up of several first-level elements). These variations amount to a few tens of thousands, which is not that many, yet is already enough to recognize any image at all, even under conditions of a priori uncertainty.

In our musical analogy, it is the third-level elements that would constitute the music itself. Music results from combinations of chords and/or notes. Sometimes music can be made up of repetitions of a single note, for example the C note.

Laws Governing TAPe Elements
Group theory serves as the basis for the mathematical description of the Theory of Active Perception. The group elements are interconnected in such a way that one level of elements generates the next. Relations between those elements are antitransitive.

Antitransitivity leads to a rigid hierarchy of elements: they follow a single possible pattern depending on the values they take. Knowing how the first-level elements have behaved, we can tell with certainty what will become of them further: which second-level and third-level groups will be activated. That is useful, among other things, for recognition speed, both when the characteristic features (attributes, traits) are being set and when the actual recognition takes place.
It is appropriate to recall at this point that a transitive relation has the following form: if A=B and B=C, then A=C. There also exists the notion of intransitivity, meaning the absence of the transitivity property. It can be expressed by phrases such as “wolves feed on deer, and deer feed on grass, but wolves do not feed on grass” or “Mary hates Ann, Ann hates Sophie, but it is not certain that Mary hates Sophie.” Antitransitivity describes relations between any three elements that are not transitive: a relation is antitransitive if its holding between elements (a, b) and between elements (b, c) excludes the possibility of the same relation holding between elements (a, c).
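
For readers who prefer code to prose, here is a small sketch that checks this property on the wolves example; the function and data are purely illustrative.

```python
# A small check of the antitransitivity property described above: a relation R
# is antitransitive if R(a, b) and R(b, c) together exclude R(a, c).

def is_antitransitive(relation):
    for a, b in relation:
        for b2, c in relation:
            if b == b2 and (a, c) in relation:
                return False
    return True

# "Wolves feed on deer, and deer feed on grass, but wolves do not feed on grass"
feeds_on = {("wolf", "deer"), ("deer", "grass")}
print(is_antitransitive(feeds_on))                         # True

# Adding ("wolf", "grass") would violate antitransitivity on that triple
print(is_antitransitive(feeds_on | {("wolf", "grass")}))   # False
```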

Hierarchical Element Structure
So, if we know the values of the first-level groups, we can pin down what the second-level and third-level groups of elements will be.
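
The practical consequence of this rigidity can be pictured as a chain of lookups with no search involved: once the first-level values are known, the higher-level groups follow mechanically. The sketch below is only an illustration of that idea; the tables are hypothetical placeholders, not TAPe's actual laws.

```python
# A minimal sketch of the determinism described above, assuming each
# higher-level group is fixed by the lower-level values.

SECOND_FROM_FIRST = {        # first-level value pair -> second-level group
    ("e1", "e7"): "G2_a",
    ("e3", "e5"): "G2_b",
}
THIRD_FROM_SECOND = {        # tuple of second-level groups -> third-level group
    ("G2_a", "G2_b"): "G3_x",
}

def activate(first_level_values):
    """Derive the higher-level groups from first-level values with no search."""
    pairs = zip(first_level_values[::2], first_level_values[1::2])
    second = tuple(SECOND_FROM_FIRST[p] for p in pairs)
    return second, THIRD_FROM_SECOND.get(second)

print(activate(("e1", "e7", "e3", "e5")))   # (('G2_a', 'G2_b'), 'G3_x')
```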

Using the first-level, second-level, and, even more so, third-level elements, it is possible to recognize any image at all. It is very likely that the brain does not need to make calculations up to the third level every time: we do not scrutinize an object each and every time; slight recognition is often enough. Besides, the brain is able to build the image of an object that we have seen many times before without resorting to deep recognition. For all the seeming diversity of existing images, their number is finite, just as the number of words in a language is finite. The number of third-level elements is sufficient for the brain to recognize any image, even under conditions of a priori uncertainty. Modern computer vision technology, unlike the human brain, cannot recognize images under conditions of a priori uncertainty. On the contrary, it requires, if you will, “a priori certainty”, meaning the neural network “must know” what exactly it is trying to find, and where. Again, the brain can very well do without that.

So, TAPe can help develop technologies for building recognition algorithms for any image in any class, without either prior training or prior tasking. Learning will happen while the recognition process is underway, just as it does for people, who learn as they live and who, in the course of such natural learning, often “re-solve” the same recognition tasks over and over again.

Isomorphism between the Theory of Active Perception and the Language of Thought
While working on the Theory of Active Perception, we noticed that its structure is similar to that of natural language (that is, a language used by people for communication). This similarity got us interested, and we dug deeper into theories of the origin of language: in particular, we studied the works of Noam Chomsky, Jerry Fodor, Svetlana Burlak, and researchers in allied sciences, as well as of philosophers who addressed the issues of information perception. And it is thanks to our Theory of Active Perception that we discovered the isomorphism between TAPe and natural language: the structures of the two systems are isomorphic, that is, they share the same structure.

Why is this isomorphism so important? Because, firstly, it corroborates the Theory of Active Perception and, secondly, studying and analyzing the structure of natural language will help us progress faster in exploring the possibilities of using the Theory of Active Perception in computer vision.

When we refer to the isomorphism between TAPe and natural language, we mean the following:

● The elements in natural language, similarly to those in the Theory of Active Perception, are grouped together according to certain laws at three different levels; those laws are the same for both systems.
● In natural language, the first-level elements interact with one another according to certain laws and generate the second-level elements, which, in their turn, generate the third-level elements, exactly as happens with the TAPe elements.
● Even the number of elements in natural language and in TAPe is roughly the same, though it is the isomorphism between elements and connections that matters rather than the numbers being equal (a schematic sketch of this correspondence follows below).
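
Schematically, the claimed isomorphism can be pictured as a mapping that carries the “generates” relation of one system onto the other. The sketch below uses hypothetical element names on both sides; it only illustrates what “structure-preserving” means here.

```python
# Schematic only: hypothetical element names on both sides. A mapping phi is
# structure-preserving on these toy rules if, whenever some elements generate
# a higher-level element in TAPe, their images under phi generate the
# corresponding element in natural language.

TAPE_GENERATES = {("t1", "t2"): "T_a"}                 # TAPe: level 1 -> level 2
LANG_GENERATES = {("sound1", "sound2"): "morpheme_a"}  # language counterpart

phi = {"t1": "sound1", "t2": "sound2", "T_a": "morpheme_a"}

def preserves_structure(gen_src, gen_dst, mapping):
    """True if every generation rule of the source maps onto one of the target."""
    return all(
        gen_dst.get(tuple(mapping[x] for x in parts)) == mapping[whole]
        for parts, whole in gen_src.items()
    )

print(preserves_structure(TAPE_GENERATES, LANG_GENERATES, phi))   # True
```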

Why is any person able to acquire any language from birth? How exactly does the human brain perceive a system as complex as the grammar of a language? What laws govern the way word-like elements are grouped together in a language? Those are the questions that Noam Chomsky (together with thousands of other researchers around the world) has tried, and is still trying, to answer. But he did not go further than developing a set of rather general concepts as to why the different elements of the language of thought interact with one another in this exact way and generate new elements (meanings).

Still, his theories and concepts regarding the origin and organization of language drew our attention to the isomorphism between the Theory of Active Perception and the language of thought. The similarity of structures in the Theory of Active Perception and in language is not surprising: people have an innate ability to perceive language; from birth they are capable of discerning human speech from any other noise; and, sure enough, the way our brain perceives language is isomorphic to the way it perceives other types of information, such as visual information.

In his works, Chomsky does not use the term “the language of thought”, but he puts forward a hypothesis that language, as an innate system, started, at some point in history, to be used by people first of all as a tool for thought and only later as a means of communication. What matters here is the distinction between language as an innate system and language as speech or text, which are external interpretations of that internal innate system. So, it is clear to us that when Chomsky or other researchers refer to an innate ability to acquire a natural language, or to universal grammar, what is implied in the first place is a kind of language of origin, a protolanguage inherent to the human brain: languagemathics.

How the Theory of Active Perception Is Going to Change Computer Vision Technology (the Terminator, Only a Kinder One, Will Become a Reality)
Imagine James Cameron's Terminator, but equipped with modern computer vision technology. How much would such a robot be able to see and recognize while walking down the street? Apparently, it would be more like a blind puppy than an intimidating robot: modern computer vision technology would not allow it to take even a few steps down a busy street. For example, a Tesla can only recognize a very limited number of objects, which is next to nothing compared to people or to the Terminator from the movie, yet it is nevertheless considered a breakthrough in computer vision technology. Measured against the recognition abilities of the human brain, though, this technology remains primitive.

The trick is that a neural network taught to recognize license plates will not be able to recognize one in a photo or video if, besides the plate, there is someone's face in it. A recognition system integrated into a car and doing well at recognizing other cars and sources of light will not be able to notice an elk on the road, let alone read (recognize) a shop sign near the highway. This will very likely require additional neural networks or additional training, which translates into additional resources. Just imagine the amount of resources needed to recognize objects in all possible classes. But the Terminator from the movie apparently could see (classify and recognize objects and solve other standard vision-related tasks) “like a human being”. We believe this is possible even today, with the help of TAPe-based computer vision. We have already taken the first steps in this direction and obtained some interim results.

Recognition without Convolution
To begin with, we have developed a video-comparison technology that can be used, for example, to search for and recognize, in real time, hundreds of thousands of specific video clips across thousands of channels, movie libraries, and video hosting services. Currently, searching for videos based on other videos does not make up even a remotely significant share of video search requests: everyone searches for videos using text first of all: titles, names, descriptions, or tags. It is what people are used to, but in reality it is neither handy nor precise: you have to sift through a lot of unnecessary information (for example, tons of duplicated content) before you find what you really want. Searching for videos using videos, with the help of TAPe, would make the process much more convenient and straightforward. And all we need to solve the above-mentioned tasks is a single server without any graphics cards.
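
The article does not disclose how the TAPe-based comparison works, so, purely for illustration, here is a generic sketch of content-based video matching using coarse per-frame signatures; it is not our algorithm, only a hint at the kind of task being solved.

```python
import numpy as np

# Generic illustration only (NOT the TAPe algorithm): reduce every frame to a
# coarse block-mean signature and compare clips by average signature distance.

def frame_signature(frame, grid=8):
    """Downsample a grayscale frame to a grid x grid block-mean signature."""
    h, w = frame.shape
    cropped = frame[: h - h % grid, : w - w % grid]
    bh, bw = cropped.shape[0] // grid, cropped.shape[1] // grid
    return cropped.reshape(grid, bh, grid, bw).mean(axis=(1, 3))

def clip_distance(clip_a, clip_b):
    """Mean per-frame signature distance between two equally long clips."""
    return float(np.mean([
        np.abs(frame_signature(fa) - frame_signature(fb)).mean()
        for fa, fb in zip(clip_a, clip_b)
    ]))

# Toy usage: an exact copy matches better than a noisy one.
rng = np.random.default_rng(0)
clip = [rng.random((64, 64)) for _ in range(10)]
noisy = [f + 0.3 * rng.random((64, 64)) for f in clip]
print(clip_distance(clip, clip), "<", clip_distance(clip, noisy))
```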

One of the reasons behind such efficiency is that our algorithms do not rely on what is referred to as the convolution method, the most resource-intensive operation that no modern computer vision neural network can do without. The human brain does not need this type of operation, and our technology is built to follow the processing pattern used by the human brain, or the language of thought. TAPe-based technology, like the human brain, processes any image right away as a whole, and the recognition results are not conditional upon the absence of noise.
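
To make the cost argument concrete, here is a naive 2-D convolution and a back-of-the-envelope operation count for a single small filter on one full-HD frame; conventional networks repeat this for dozens or hundreds of filters per layer. This is standard CNN machinery, shown only for contrast, and is not part of TAPe.

```python
import numpy as np

# Naive 2-D convolution: every output pixel costs k*k multiply-adds, so one
# filter on one frame costs roughly H * W * k * k operations.

def convolve2d(image, kernel):
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

small = np.random.rand(32, 32)
kernel = np.ones((3, 3)) / 9.0            # a single 3x3 averaging filter
print(convolve2d(small, kernel).shape)    # (30, 30)

# Back-of-the-envelope cost for one full-HD frame and this single filter:
print(1080 * 1920 * 3 * 3, "multiply-adds")   # 18,662,400
```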

Simultaneous Reading of Key Features
The second reason the technology is so efficient is that it can simultaneously obtain a map of any image's key features at any level of detail. “Simultaneously” means that the features are read all together. And the number of those key features is minimally sufficient to solve any computer vision task. What does this mean?

Currently, developers train image recognition neural networks by first applying several dozen feature detectors to the training images, thus generating feature maps. The efficiency of image and video recognition by modern algorithms depends largely on the adequacy and precision of the feature map: to recognize faces, some developers may need, for example, 100 features, others will employ only 80, and still others will use 150.

The Theory of Active Perception allows us to avoid this painstaking process: by modeling the way the brain works, it “reads” the features needed to recognize an image all at once; according to TAPe, this is exactly how the brain recognizes information. Unlike standard neural networks, TAPe-based technologies do not need prior training to find features in a pixel array. According to the Theory, an image (in the broadest sense of the word) read by the human visual analyzer is “automatically” broken down by the brain into those very features, which are constant and do not change irrespective of the task. Nor does our technology require breaking the image down into pixels. According to TAPe, any object (image) has a minimally sufficient number of features, and it was TAPe that helped us develop an algorithm for reading those features.

As we said at the beginning of the article, we believe that if the Theory of Active Perception is used in developing computer vision technology, it will result, firstly, in an essentially different architecture for neural networks and other similar algorithms within the framework of what is referred to as artificial intelligence. Additionally, it seems likely that the architecture of computer processors may also change. More than that, we are convinced that, in a broader perspective, the Theory suggests an essentially new information processing method. Indeed, the system of 0's and 1's that computers are based on is ingeniously simple and at the same time efficient, but it is still not enough for computers to keep up with the brain. And by describing the “mathematics”, or rather the processing performed by the human innate perception mechanism, the Theory of Active Perception offers the possibility of bridging, or at least substantially reducing, the gigantic gap between technology imitating the human brain and the brain itself.
