What is Mind?
What is Consciousness?
How Can We Build and Understand Intelligent Systems?
May 2015: Last month, SPIE awarded me a leadership award for my role in the historical development of neural networks from the late 1960s to now. To go with that, I gave a colorful slide show and a citeable paper (see Google Scholar) for a general audience on that history.
New in 2009: Intelligence in the Brain: A theory of how it works and how to build it, Neural Networks, in press (electronic version available online March 29, 2009).
Added in 2016: Click here for a brief history of how I developed the concept of reinforcement learning by approximating dynamic programming (RLADP) from the 1960s through 1988, with the key papers I wrote on that subject in that period, including the 1987 paper which showed how RLADP can explain the major structures of the brain. Click here to see the slides for my keynote plenary talk at the IEEE winter school on big data in computational intelligence -- covering emerging risks and opportunities (both far larger than you probably imagine!), and also the key mathematical principles for accuracy in prediction and data analysis. Click here to see video of the actual talk, with far more life and detail, for the risks and opportunities part. Click here for our new paper in Quantum Information Processing which discusses what kinds of new work in basic physics would be needed in order to build or understand intelligent systems making full use of the most that quantum computing has to offer.
Understanding the Mind: The Subjective Approach Versus the Objective Approach
At a conference in
New in 2008: People have
been studying the brain and the mind for centuries, of course. Why is it that
human society has yet to come up with a basic functional understanding of how
brains work, that would also fit with and explain our subjective experience? The
papers posted here actually do provide a new understanding – but if you
are an expert on brain research, you would want to know what is
the new approach that makes this possible. In April, 2008, we had a
meeting for all the people in the
New in 2010: The “big picture” here is a simplified summary of the levels of intelligence or consciousness we see in nature and in mathematics. But of course, “sanity” in human life involves more than the kind of sanity I refer to here. For a more complete discussion of other aspects of sanity and human development, see the (updated) page on human potential.
New in 2007: Click here for an updated technical review on intelligence in the brain, new research opportunities, and the connection between the scientific and the subjective points of view. Also new: Here are the slides and here you can find the video transcript of a talk on “mathematics and the brain” for high school mathematics students. The video transcript may be easier to follow if you bring up the slides on your computer at the same time.
A Scientific/Engineering View of Consciousness and How to Build It
At the first big international conference on consciousness,
held at the United Nations University in
Here is one of the key ideas in that paper: intelligence or “consciousness,” as we see it in nature, is not an “either-or” sort of thing. Nor is it just a matter of degree, like temperature or IQ. Rather, we generally see a kind of staircase, of ever-higher levels of intelligence. (To be precise – once we actually start building these systems, we see something more like a kind of ordered lattice, but let me start from the simpler version of the story.) Each level represents a kind of fundamental qualitative advance over the one below it, just as color television is a fundamental improvement over black and white television.
The central challenge to solid mathematical, scientific understanding of intelligence today is to understand and replicate the highest level of intelligence that we can find even in the brain of the smallest mouse. This is clearly far beyond the qualitative level of capabilities or functionality that we find in even the most advanced engineering systems today. It is even further beyond the simple “Hebbian” models or the “Q-learning” families of models in vogue today in neuroscience, models which typically could not handle even a typical 5-variable real world nonlinear control problem. The brain is living proof that it is possible to develop systems to learn, predict and act upon their environment in a way which is far more powerful and far more universal than one would expect, based on the conventional wisdoms in statistics and control theory. (In every discipline I have ever studied, it is important to learn the difference between the powerful ideas, the real knowledge and the common popular conventional wisdoms.)
Another key idea in that paper: the brain as a whole system is an “intelligent controller,” a system which learns to output external actions as a function of sensor inputs and memory and internal actions, so as to achieve the “best” possible results. What is “best” is defined by a kind of performance measure or utility function, which is defined by the user when we build these systems, and defined by a kind of inborn “primary motivation system” when we look at organisms in nature. (In fact, however, the performance measure inserted into an artificial brain looks “inborn” from the viewpoint of that brain.) In actuality, I have been trying to figure out how to build these kinds of systems since I was 14 or 15, based in part on the inspiration of John von Neumann and Oskar Morgenstern’s classic book, Theory of Games and Economic Behavior, the first book to formulate a clear notion of rationality and of cardinal utility functions. The claim is not that mammals are “born rational,” but that they are born with a learning mechanism evolved to strive ever closer to the outcomes desired by the organism, as much as possible.
Click here for a simplified explanation of how we learn in big jumps and small jumps; this little piece is subtitled “creativity, backpropagation and evolutionary computing made simple.” It also addresses both everyday human experience and the kind of intelligence we see in economic and political systems.
In this view, capabilities like expectation, memory, emotional value assessment and derivative calculation are all essential subsystems of the overall intelligent control architecture. Evolution also tries to provide the best possible starting point for the learning process, but this does not imply that the organism could not have learned it all on its own, with enough time and enough motivation.
Some people believe that effects from quantum mechanics are essential to understanding even the lowest levels of consciousness or intelligence. Others believe that “quantum computing” effects cannot possibly be useful at all, in any kind of intelligence. In my view, the truth is somewhere between these two extremes. Even the human brain probably does not use “quantum computing” effects, but we could build computers at a higher level of consciousness which do. If you are interested in that aspect, see comments on quantum mind.
New in December 2009: slides on brain-like prediction (2 megs), the statistical principles which make it possible to build a universal system which learns to “predict anything” with inputs far more complex than traditional learning or statistics systems ever could. (Slides prepared for lecture at IEEE Latin American Summer School on Computational Intelligence. These are substantially more complete than my earlier (2007) slides on how to get more accurate prediction (slides only, no text, 300K), a keynote talk for the 2007 international competition in forecasting. “Cognitive prediction” was one of the two most important streams of research leading us towards brain-like intelligence, as in the recent NSF funding announcement on COPN.) Click here for slides and text (1.5 megs) of a more mathematical explanation, given as a talk in the 2010 Erdos Lectures conference.
The bulk of my own work in the neural network field is aimed at replicating the “basic mammal level” of intelligence. Thus I will first list and describe some papers aimed at that level:
1. A general tutorial on neural networks presented at the IEEE Industrial Electronics Conference (IECON) 2005, slightly updated from the version I gave at the International Joint Conference on Neural Networks (IJCNN) 2005. (4 megabyte file).
(Here is a shortened 1 meg, 60 slide pdf talk to be given
2. A general
current overview of Adaptive or Approximate Dynamic Programming (ADP), the
lead chapter of the Handbook of Learning and Approximate Dynamic Programming,
IEEE Press, 2004. (Si, Barto, Powell and Wunsch eds.) For more complete
information on ADP, and for important background papers leading up to that
book, see www.eas.asu.edu/~nsfadp.
The idea of building “reinforcement learning systems” by use of ADP
was first proposed in my
elements of intelligence,” Cybernetica (
3. A review
of how to calculate and use ordered derivatives when building complex
intelligent systems. I first proved
the chain rule for ordered derivatives back in 1974, as part of my Harvard PhD
thesis. The concept propagated from there in many directions, sometimes called “backpropagation,”
sometimes called the “reverse method or second adjoint
method for automatic differentiation,” and sometimes used in “adjoint circuits.” But it is a general and powerful
principle, far more powerful than is understood by those who have only
encountered second-hand or popularized versions. The review here was published
in Martin Bucker, George Corliss, Paul Hovland,
Uwe Naumann & Boyana Norris (eds),
Automatic Differentiation: Applications, Theory and Implementations,
4. P. Werbos, Backpropagation through time: what it does and how to do it. (1 megabyte). Proc. IEEE, Vol. 78, No. 10, October 1990. A slightly updated version appears in my book The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, Wiley 1994. Among the most impressive real-world applications of neural networks to date is the work at Ford Research, which has developed a bullet-proof package for using backpropagation through time to train time-lagged recurrent networks (TLRN). In one simulation study, they showed how TLRNs trained with their package can estimate unobserved state variables with higher accuracy than extended Kalman filters, and can do as well as particle filters at much less computational cost:
5. Chapters 3, 10 and 13 of the Handbook of Intelligent Control, White and Sofge eds, Van Nostrand, 1992. Those chapters provide essential mathematical details and concepts which even now cannot be found elsewhere. (1-2-megabyte files).
Chapter 3 provides a kind of general introduction to ADP, and to how to integrate the two major kinds of subsystem it requires – a trained model of the environment (the real focus of chapter 10), and the “Critic” and “Action” components (chapter 13) which provide what some call “values,” “emotions,” “shadow prices,” “cathexis” or a “control Liapunov function”, and a “controller” or “motor system” or “policy” or “stimulus-response system.” The mathematics itself is universal, and essentially provides one way to unify all these disparate-sounding concepts. There are some advocates of reinforcement learning who like simple designs which require no model of the environment at all; however, the ability to adapt to complex environments does require some use of a trained model or “expectation system,” and half the experiments in animal learning basically elaborate on how the expectation systems in animal brains actually behave. ADP design does strive to be highly robust with respect to the model of its environment, but it cannot escape the need to have such a model.
Chapter 3 also mentions a design idea which I call “syncretism,” which I have written about in many obscure venues, starting from my paper “Advanced forecasting for global crisis warning and models of intelligence,” General Systems Yearbook, 1977 issue. (See also my list of supplemental papers.) I deeply regret that I have not had time to work more on this idea, because it plays a central role in overcoming certain all-pervasive dilemmas in intelligent systems and in statistics, and it is essential to understanding certain aspects of human experience as described by Freud. In essence – even today, there is a division between those who try to make predictions based on some kind of trained or estimated general model of their environment, and those who try to make predictions based on the similarity of present cases to past examples. Examples of the former are econometric or time-series models or Time-Lagged Recurrent Networks (TLRN), or hybrid S/TLRN, the most general neural form of that class of model (see chapter 10). Examples of the latter are heteroassociative memory systems, Kohonen’s prototype systems, and most of the modern “kernel” methods.
But in actuality, both extreme cases have many crucial limitations. To overcome these limitations, when learning a simple static input-output relationship, one needs to combine the two in a principled way. That is what syncretism is about. One maintains a global model, which is continually updated based on learning from current experience and on learning from memory.
(Chris Atkeson has at least implemented a part of this, and shown how useful it is in robotics.) But one also monitors how well the global model has explained or fitted each important memory so far. When one encounters a situation similar to an unexplained or undigested memory, one’s expectation is a kind of weighted combination of what the global model and the past example would predict. In the end, we move towards a mathematical version of Freud’s image of the interplay between the “ego” and the “id,” between a person’s integrated global understanding and the (undigested) memories which perturb it. However, in this model, global understanding may be perturbed just as easily by undigested euphoric experiences (outcomes more pleasant than expected) as by traumatic experiences. Again, see the UNU paper for citations to papers which discuss the neuroscience and psychology correlates in more detail.
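The weighted combination described above can be illustrated with a very small sketch. This is only an illustration under stated assumptions, not the design from chapter 3: the global model is a plain least-squares line, and the function names, the Gaussian similarity measure, and the `width` and `tolerance` parameters are all hypothetical choices made for this example. The weight given to a remembered case grows with its similarity to the present situation and with how poorly the global model explained that case.

```python
import math

def fit_line(points):
    """Least-squares line through (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def global_predict(x, model):
    """Prediction of the global model alone."""
    slope, intercept = model
    return slope * x + intercept

def syncretic_predict(x, model, memories, width=0.5, tolerance=0.5):
    """Blend the global model with the most relevant unexplained memory.

    width and tolerance are hypothetical tuning constants: width sets how
    quickly similarity falls off with distance, and tolerance sets how large
    a residual must be before a memory counts as "undigested".
    """
    base = global_predict(x, model)
    best_weight, best_value = 0.0, base
    for mem_x, mem_y in memories:
        residual = abs(mem_y - global_predict(mem_x, model))
        similarity = math.exp(-(((x - mem_x) / width) ** 2))
        unexplained = residual / (residual + tolerance)
        weight = similarity * unexplained
        if weight > best_weight:
            best_weight, best_value = weight, mem_y
    # Weighted combination of the global model and the past example.
    return (1.0 - best_weight) * base + best_weight * best_value

if __name__ == "__main__":
    # Made-up data: roughly y = 2x, with one surprising case at x = 3.
    data = [(0, 0), (1, 2), (2, 4), (3, 10), (4, 8), (5, 10)]
    model = fit_line(data)
    for query in (3.0, 20.0):
        print(query, global_predict(query, model),
              syncretic_predict(query, model, data))
```

Near the surprising case the prediction is pulled towards the remembered outcome; far from every memory it falls back to the global model.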
Chapter 10 mainly aims at the issue of learning to understand the dynamics of one’s environment. For example, it addresses how to reconcile traditional probabilistic notions of how to learn time-series dynamics, versus “robust” approaches which have recently become known as “empirical risk approaches” à la Vapnik. Chapter 10 shows how a pure “robust” method of his type – first formulated and applied in my 1974 PhD thesis – substantially outperforms the usual least squares methods in predicting simulated and actual chemical plants. It also explains how the “pure robust” method is substantially different from both the “parallel” and “series” system identification methods used in adaptive control, even though it seems similar at first glance. I referred to that chapter, and to the example of ridge regression, when – in visiting Johann Suykens years ago – I suggested to him that one could substantially improve the well-known SVM methods by accounting for these more general principles. Suykens’ modified version of SVM is now part of the best state of the art in data mining – but it is only a first step in this direction. Today’s conventional wisdom in data mining has not yet faced up to key issues regarding causality and unobserved variables which world-class econometrics already knew about decades ago; chapter 10 cites and builds upon that knowledge. Chapter 10 also gives details about how to apply the chain rule for ordered derivatives in a variety of ways.
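For readers who have not seen the chain rule for ordered derivatives in action, here is a minimal sketch of backpropagation through time for a single recurrent unit. It is an illustration under assumptions, not code from chapter 10 or from the Ford package: the one-unit network h[t] = tanh(w*h[t-1] + x[t]), the squared-error loss, and all function names are chosen only for this example. The backward sweep accumulates, for each time step, the direct effect of h[t] on the error plus its indirect effect through h[t+1], and the result is checked against a finite-difference estimate.

```python
import math

def forward(w, xs, h0=0.0):
    """Run h[t] = tanh(w*h[t-1] + x[t]); returns all states, h[0] first."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def loss(w, xs, ys):
    """Half the sum of squared errors between h[1..T] and the targets ys."""
    hs = forward(w, xs)
    return 0.5 * sum((h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt_gradient(w, xs, ys):
    """Derivative of loss with respect to w, computed by a backward sweep."""
    hs = forward(w, xs)
    grad_w = 0.0
    f_net_next = 0.0  # ordered derivative w.r.t. the net input at time t+1
    for t in range(len(xs), 0, -1):
        # Direct effect on the error at time t, plus the indirect effect
        # that flows through the net input of the next time step.
        f_h = (hs[t] - ys[t - 1]) + w * f_net_next
        # Pass back through the tanh nonlinearity at time t.
        f_net = f_h * (1.0 - hs[t] ** 2)
        # The weight w multiplies h[t-1] inside the net input at time t.
        grad_w += f_net * hs[t - 1]
        f_net_next = f_net
    return grad_w

if __name__ == "__main__":
    xs = [0.5, -0.3, 0.8, 0.1]   # made-up inputs
    ys = [0.2, 0.0, 0.4, 0.3]    # made-up targets
    w, eps = 0.7, 1e-6
    numeric = (loss(w + eps, xs, ys) - loss(w - eps, xs, ys)) / (2 * eps)
    print(bptt_gradient(w, xs, ys), numeric)
```

The backward sweep costs about as much as one forward pass, whereas the finite-difference check needs two extra forward passes for every weight being differentiated.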
Chapter 13 mainly addresses ways to adapt Critic and Action systems. It shows how to adapt any kind of parameterized differentiable Critic or Action system; it uses a notation which allows one to plug in a neural network, an adaptable fuzzy logic system, an econometric system, a soft gain-scheduling system, or anything else of that sort. It describes a variety of designs in some detail, including Dual Heuristic Programming (DHP), which has so far outperformed all the other reinforcement learning systems in controlling systems which involve continuous variables. I first proposed DHP in my 1977 paper (cited above), but chapter 13 contains the first real consistency proof – and it contains the specific details needed to meet the terms of that proof.
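To make the DHP idea more concrete, here is a toy sketch of adapting a DHP-style Critic for a fixed Action rule. Everything specific in it is an assumption made for illustration and is not taken from chapter 13: the scalar linear plant, the quadratic cost (minimized here, rather than a utility being maximized), the linear Critic lam(x) = w*x, the constants, and the function names. The Critic estimates the derivative of the long-term cost J with respect to the state, and its training target is built from derivatives of the cost and of the plant model, which is where the model of the environment enters.

```python
# All constants are hypothetical, chosen only for this illustration.
A, B = 0.9, 0.5     # plant model: x_next = A*x + B*u
K = 0.4             # fixed Action rule: u = -K*x
R = 0.1             # one-step cost: U = x**2 + R*u**2
GAMMA = 0.95        # discount factor

def dhp_target(x, w):
    """Target for the Critic output lam(x) = w*x, an estimate of dJ/dx."""
    u = -K * x
    x_next = A * x + B * u
    lam_next = w * x_next           # Critic evaluated at the next state
    dU_dx = 2.0 * x                 # direct derivative of cost w.r.t. state
    dU_du = 2.0 * R * u             # derivative of cost w.r.t. action
    du_dx = -K                      # derivative of the Action rule
    dnext_dx = A + B * du_dx        # total derivative of x_next w.r.t. x
    return dU_dx + dU_du * du_dx + GAMMA * lam_next * dnext_dx

def train_critic(steps=2000, lr=0.1):
    """Adapt w by moving w*x towards the DHP target on sample states."""
    w = 0.0
    states = [-1.0, -0.5, 0.5, 1.0]
    for i in range(steps):
        x = states[i % len(states)]
        error = dhp_target(x, w) - w * x
        w += lr * error * x
    return w

def exact_weight():
    """Closed-form dJ/dx coefficient for this linear-quadratic toy problem."""
    closed_loop = A - B * K
    p = (1.0 + R * K ** 2) / (1.0 - GAMMA * closed_loop ** 2)
    return 2.0 * p                  # J(x) = p*x**2, so dJ/dx = 2*p*x

if __name__ == "__main__":
    print(train_critic(), exact_weight())
```

In this toy case the learned weight can be compared with the closed-form answer; a full design would also adapt the Action rule, which this sketch leaves out.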
Chapter 13 also shows how the idea of a Critic network can be turned on its head, to provide a real-time time-forwards way to train the time-lagged recurrent networks described in chapter 10. It also describes the Stochastic Encoder/Decoder/Predictor design, which addresses the full general challenge of adaptive nonlinear stochastic time-series modeling.
6. Finally, for a more complete history and a more robust extension of these methods, see my 1998 paper on the mathematical relations between ADP and modern control theory (optimal, robust and adaptive). A brief summary of the important conclusions of that paper may be found in my supplemental papers.
But again, all of this addresses the “subsymbolic” kind of intelligence that can be found in the brain of the smallest mouse. Thus I claim that even the smallest mouse experiences the interplay between traumatic experiences and understanding that Freud talked about. Even the smallest mouse experiences the flow of “cathexis” or emotional value signals that Freud talked about. These provide the foundation which higher intelligence is built upon. It builds upon these systems, and does not replace or transcend them.
If you look closely at this work, you will see that by 1992 I was beginning to question the relatively simple model of intelligence as an emergent phenomenon which I first sketched out in “Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research,” IEEE Trans. SMC, Jan./Feb. 1987. In 1997, I sketched out a kind of alternative, more complex model which was essentially a unification of my previous model with some of the ideas in a classic paper by Jim Albus. The goal was to attach the minimum possible a priori structure, while still coping with the prodigious complexity in time and space that organisms must deal with. But I now begin to see a third alternative, intermediate between the two.
A clue to this new alternative came in the plenary talk by Petrides at IJCNN in 2005. A crude simplification of his talk: “We have studied very carefully the real functioning of the very highest areas of the frontal lobes. The two most advanced areas appear to be designed to answer the two most fundamental questions in human life: (1) where did I leave my car this time in the parking lot; and (2) what was it I was trying to do anyway?” The newly discovered “loops” between the cerebral cortex, the basal ganglia and the thalamus do not represent hierarchical levels, as was once thought; rather, they represent alternative qualitative types of window into the system (alternative types of output with corresponding error measures). The appearance of a hierarchy of tasks within tasks, or time levels within time levels, is actually generated by a system that looks more like a kind of generalized “subroutine calling” procedure, in which “hierarchy” is an emergent result of recursive function calls. Again – I would have written more about this by now, if we were not living in a world whose very survival is at stake or if I had no way of improving the global probability of survival.
2. Beyond the “Mouse Level”
We do not have functioning designs or true mathematical models of the higher levels of intelligence, but I claim that we can develop a much better preliminary or qualitative understanding of what we can build at those higher levels by fully accounting for what we have learned at the subsymbolic level.
Many of my thoughts on these lines are somewhat premature, scientifically, and I find it hard to take time to write up details which people are simply not yet ready for. For now I include just three papers here:
“third person” viewpoint described in Kuhn’s famous book on the philosophy and history of science.
firmly scientific in spirit (as in Francis Bacon’s historic efforts), it does represent a “first person viewpoint.” There are many
philosophers, like Chalmers and the existentialists, who stress the fact that we all ultimately start from a first-person viewpoint.
In my homepage, I mentioned my view that a rational person should never feel obliged to “choose a theory” and be committed to it forever. Rather, we should entertain a menu of vivid, clear, different theories; we should constantly strive to improve all of them, and we should constantly be ready to adapt the level of probability we assign to them. This follows the concept of rationality developed by John Von Neumann, which is also the foundation for “decision analysis” as promulgated by Howard Raiffa.
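The idea of keeping a menu of theories and adapting the probability assigned to each can be written down in a few lines using Bayes’ rule. This is a minimal sketch under assumptions: the theory names, the prior probabilities and the likelihood values are invented for illustration, and the function name is hypothetical.

```python
def update_beliefs(priors, likelihoods):
    """Bayes' rule over a menu of theories.

    priors: probability currently assigned to each theory (sums to 1).
    likelihoods: probability each theory gave to the evidence just observed.
    Returns the revised probabilities; no theory is ever "chosen".
    """
    unnormalized = {name: priors[name] * likelihoods[name] for name in priors}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}

if __name__ == "__main__":
    # Invented numbers, for illustration only.
    beliefs = {"theory_a": 0.5, "theory_b": 0.3, "theory_c": 0.2}
    evidence = {"theory_a": 0.1, "theory_b": 0.6, "theory_c": 0.3}
    beliefs = update_beliefs(beliefs, evidence)
    print(beliefs)
```

Every theory stays on the menu after the update; only the weight attached to each one changes, and the next observation can shift the weights again.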
Yet when we observe human behavior all around us, we see people who “choose” between possible theories like a vain person in a clothing store – trying them on, preening, looking at themselves in the mirror, and then buying one. (The best dressed of all dress up like Cardinals.) They then feel obliged to fight to the death for a particular theory as if it were part of their body, regardless of evidence or of objective, good judgment. Is this really just one of many examples proving that human brains (let alone mouse brains) totally lack any kind of tendency at all towards rationality or intelligence as I have described it? Does it totally invalidate this class of model of the mind?
Not really – and I was fully aware of such behavior when I first started to develop this kind of theory. What it shows is that humans are halfway between two classical views of the human mind. In one view (espoused by B.F. Skinner and one of the sides of the philosopher Wittgenstein), humans play “word games” without any natural tendency to treat words and symbols as if they had any meaning at all; words and theories are truly treated on an equal footing with other objects seen in the world, like pants and dresses. In the opposite view (espoused by the “other” Wittgenstein!), humans are born with a kind of natural tendency to do “symbolic reasoning,” in which words have meanings and the meanings are always respected; however, because modern artificial intelligence often treats “symbolic reasoning” as if symbols were devoid of meaning, it is now more precise to call this “semiotic intelligence.” (There are major schools of semiotics within artificial intelligence, and even Marvin Minsky has given talks about how to fill in the gap involving “meaning.”)
My claim is that the first is a good way of understanding mouse-level intelligence. But human intelligence is a kind of halfway point between the two. Humanity is a kind of early prototype species, like the other prototype species which have occurred in the early stages of a “quantum leap” in evolution, as described by the great scientist George Gaylord Simpson, who originated many of the ideas now attributed to Stephen Jay Gould. (Though Gould did, of course, have important new ideas as well.) As an early prototype, it has enough of the new capabilities to “conquer the world” as a single species, but not enough to really perfect these capabilities in a mature or stable way.
Unfortunately, the power of our new technology – nuclear technology, especially, but others as well – is so great that the continued survival of this species may require a higher level of intelligence than what we are born with. Only by learning to emulate semiotic intelligence (or even something higher) do we have much of a chance of survival. The “semiotic” level of intelligence has a close relation to Freud’s notion of “sanity.” Unfortunately, Freud sometimes uses the word “ego” to represent global understanding, sometimes to represent the symbolic level of human intelligence, and sometimes in other ways; however, the deep and empirically-rooted insights there are well worth trying to disentangle.
We do not yet know exactly what a fully evolved “semiotic intelligence” or sapient would really look like. Some things have to be learned, because of their complexity. (For example, probability theory has to be learned, before the “symbolic” level of our mind can keep up with the subsymbolic level, in paying attention to the uncertainties in our life.) Sometimes the best that evolution can do is to create a strong predisposition and ability to learn something. But certainly we humans have a lot to learn, in order to cope more effectively with all of the megachallenges listed on my homepage.
Finally, I should summarize my own personal view of the levels of intelligence, stretching from the “mouse level” up to the highest that I can usefully imagine. First, between the mouse and the human are actually some sublevels, as I discussed in some of these papers; there are evolutionary stages like the first development of “mirror neurons,” learning from vicarious experience, transmission of experience through trance-like dances, languages which function as “word movies,” and so on. Above the level of today’s human is the sapient level. Somewhere above that is when the sapient level gets coupled with the full power of quantum computing effects. Much higher than that is the best that one could do by building a truly integrated “multiagent system” made up of components at the quantum sapient level, with some sort of matrix to hold them together; I think of this as “multimodular intelligence,” and I feel that it would be radically different in many respects from today’s relatively feeble multiagent systems and from the conflict-prone or tightly regulated social organizations we see all around us in everyday life. Still, it would have to rely heavily on the same universal principles of feedback, learning and so on. But as one goes up the ladder, one does begin to get further away from what we really know as yet…