Andrzej J. Pindor,
University of Toronto, Computers and Communications
pindor@utirc.utoronto.ca
.QP
.B
Abstract
.R
A solution to the symbol grounding problem proposed by Harnad requires giving
a system both linguistic and sensorimotor capacities indistinguishable from
those of a human. The symbols are then grounded by the fact that the analog
sensorimotor projections on transducer surfaces, coming from real-world
objects, and the successively formed sensory invariants of nonarbitrary shape
constrain the symbol combinations over and above what is imposed by syntax,
and tie the symbols to those real objects.
It is argued here that the full sensorimotor capacity may indeed be
a crucial factor, since it is capable of providing the symbols (corresponding
to language terms) with a deep underlying structure, which creates a network
of intricate correlations among them at the level of primitive symbols based
on inputs from the transducers. On the other hand, the nonarbitrary shapes
of sensory invariants as well as the analog nature of sensorimotor projections
seem to be of no consequence. Grounding is then seen as coming from this
low-level correlation structure and, once known, could in principle be
programmed into a system without the need for transducers.
.RE
In a series of papers Stevan Harnad has suggested a solution to
the "symbol grounding" problem (Harnad 1990, Harnad 1993, Harnad
1993a). The essence of the problem is that symbols manipulated by
digital computers or even neural nets (in SIM or IMP
implementations, see Harnad 1993a) do not seem to be about
anything in particular - their only meaning comes from the mind
of an interpreter. The symbols themselves are manipulated according
to syntactic rules, on the basis of their shapes only, these
shapes being unrelated to what the symbols can be interpreted as
standing for. This lack of meaning of the symbols is, Harnad
claims, evident for instance from the fact that one cannot learn
Chinese from a Chinese-Chinese dictionary (Harnad 1993a).
Consequently, there is no guarantee that a TT-passing system, say
in Chinese, really understands Chinese - it may be simply
manipulating symbols (Chinese characters) syntactically, without
any regard for what these symbols are _about_.
Harnad suggests that the symbols of a system can be grounded if
(and only if) the system can pass the Total Turing Test (TTT), i.e. has
both linguistic _and_ sensorimotor capacity totally indistinguishable
from our own. Such a system would have to be equipped with a full range
of transducers, giving it what he calls a complete robotic capacity.
Harnad then proposes a more detailed model describing how
the robotic capacity leads to the grounding of symbols. He argues that
inputs from senses (or sensors for a robot) in the form of analog
"sensory projections" connect to symbols of the system (i.e. language
terms) through sensory invariants of _nonarbitrary shape_ (these
invariants, he suggests, could be extracted from the sensory projections,
in the case of a robot, using neural nets). This fact, according to him,
puts additional constraints on symbol combinations, over and
above syntactic constraints, and results in grounding. Symbols
are about the real-world objects whose sensory-projection invariants
they correspond to (Harnad 1993a).
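To make this three-stage picture concrete, a minimal, purely schematic
sketch is given below (in Python). The stimulus representation, the
trivial averaging "invariant extractor" standing in for the connectionist
component, and the category names are placeholders of my own; they are
not part of Harnad's proposal.
.DS
# Schematic sketch of the analog -> connectionist -> symbolic pipeline.
# All names and the trivial "invariant extractor" are illustrative
# placeholders, not Harnad's actual model.

def sensory_projection(stimulus):
    """Stand-in for the analog projection of an object onto a
    transducer surface: here just a short list of intensity samples."""
    return [stimulus["size"] + 0.01 * i for i in range(8)]

def extract_invariant(projection):
    """Placeholder for the connectionist component: reduce the
    projection to a single invariant feature (its mean value)."""
    return sum(projection) / len(projection)

def name_category(invariant, threshold=1.0):
    """Symbolic component: attach a category name to the invariant."""
    return "big-thing" if invariant > threshold else "small-thing"

if __name__ == "__main__":
    for stimulus in ({"size": 0.3}, {"size": 1.7}):
        p = sensory_projection(stimulus)
        print(stimulus, "->", name_category(extract_invariant(p)))
.DE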
Before discussing Harnad's suggested solution to the symbol
grounding problem it may be appropriate to comment on his use of
the term "symbol". In most of his arguments this word is used in
the meaning of a computational token capable of being interpreted
as being _about something_, i.e. corresponding to a language
term. However he also talks about "...the manipulations of
physical 'symbol tokens' on the basis of syntactic rules that
operate only on the 'shapes' of the symbols (...), as in a
digital computer or its idealization, a Turing machine
manipulating, say, 0's and 1's" (Harnad 1993a). This indicates
that he also considers a digital computer's 0's and 1's to be
(perhaps primitive) symbols, out of which the higher-level symbols,
capable of being interpreted as corresponding to language terms,
are built. This is an important point, relevant to my
discussion below of Harnad's stress on the analog nature of
'sensory projections'.
The main idea of Harnad's model, the need for a system to have
the full sensorimotor capability of a human being in order for its
symbols to be grounded, expresses the fact that the terms of
a language we use are not defined by their relationships to
(correlations with) other language terms only - they are defined
by a cooperation, so to speak, of all sensory inputs at our disposal.
When we say, for instance, "cat", understanding of this term involves
all experiences we have had with cats - through vision, touch, smell
etc. A single language dictionary (like Chinese-Chinese in Harnad's
example) can only relate language terms among themselves.
Relating them to real world objects requires full sensorimotor
capacity indistinguishable from our own (Harnad 1993). It is no
surprise that a TT-passing system which demonstrably does not
have such a capacity (for instance Searle's Chinese Room, see
Searle 1980) is suspect with respect to its understanding of
the language it seems to use so expertly. After all, we do not
expect a person blind from birth to understand how colours
influence our interaction with the world, regardless of the
amount of verbal explanation.
How the sensorimotor inputs lead to the grounding of top-level symbols
(i.e. language terms) is another story, and below I criticize two
aspects of Harnad's model.
The first aspect which I would like to discuss is his claim of
"nonarbitrary shapes" of sensory invariants extracted from sensory
projections from real world objects onto a system's transducer
surfaces. The word "shape" above has to be interpreted in a somewhat
generic sense - in the case of senses other than vision and touch it
must mean some particular feature of the sensory invariants which is
somehow fixed by the nature of an object (phenomenon) it corresponds
to. This "nonarbitrariness of shape", in Harnad's eyes, imposes a
constraint on a system's symbols assigned to represent such an
invariant.
To what extent are the shapes of sensorimotor projections
'nonarbitrary'? I will consider below several examples indicating
that the shapes of the sensorimotor projections seem to be to a
large extent dependent on the physical nature of the transducers,
which are, in a sense, the results of evolutionary 'accidents'
(naturally optimized within an accessible range of physical
parameters) and are thus to a large degree arbitrary.
1. Colours.
Colour vision is dependent on six types of cells in the eye's
retina sensitive to light in various parts of the spectrum
(DeValois, Abramov and Jacobs 1966, DeValois and Jacobs 1968).
Two of these have to do with perception of blue and yellow, two
with perception of red and green and two are sensitive only to
intensity of light within the 'visible range'. The terms 'blue',
'yellow', 'green' and 'red' refer to various ranges of light
wavelength covering the visible portion of the spectrum. Now, it
is most likely an evolutionary 'accident' how the visible
spectrum is divided into these four regions. With a somewhat
different chemistry the ranges of sensitivity of the colour cells
might have been different, resulting in a different colour
perception. One can also conceivably imagine that, had the
evolution of the human eye gone somewhat differently, we might
have ended up with a colour vision mechanism distinguishing three
or five colours. Consequently, sensory projections of real
objects, coming from the colour vision system, would have
different "colour shapes", which are to a large extent determined
by the physical nature of the _transducers_ and not the objects
themselves.
2. Visual shapes.
Due to the nature of human eye optics, projections of real
objects on the eye's retina are already distorted - for instance
many straight lines in the outside world project on the retina as
curved lines. In addition, as is well known, these projections
are upside down. The fact that we see the real world objects
"right way up" is a result of the brain learning to _correlate_
shapes of sensory projections from the visual system with other
sensory projections. If we perform an experiment in which
subjects are made to wear glasses that invert the image falling on
the retina (so that it now corresponds to the "real", upright
orientation), the subjects are at first very confused and have
difficulty moving around, grasping objects, etc. However, after
a certain time there seems to be a discontinuous transition to a
state in which the subjects report that they see everything
"normally" and have no more problems performing tasks requiring
vision. Obviously, their brains have learned to _correlate_ the
new "shapes" of the sensory projections from the vision system
with other sensorimotor projections.
A similar effect arises if we try to trace a pattern (say with
a stylus) looking not at the pattern itself, but at its
reflection in a mirror. Initially we are quite confused, but if
we persist at the task, after a while it becomes as natural
tracing the original pattern - the brain learns to compensate for
the reversal of left and right.
One could also speculate that, had evolution taken a slightly
different route in the distant past, we might have ended up
with eyes more like those of insects - the sensory projections of our
visual system, coming from real-world objects, would then be very
different, and there is no reason to doubt that our brains would
learn to deal with such a situation.
We see again that the shapes of the sensory projections are in
some sense arbitrary, determined by the physical nature of the
transducers.
3. Touch.
Let us perform a very simple experiment - we cross the index
finger with the middle finger of our right hand in such a way
that the tip of the middle finger is to the left of the tip of
the index finger (and vice versa). Now if we touch a small round
object with these two fingers simultaneously (i.e. the object
touches the left side of the tip of the index finger and the
right side of the tip of the middle finger) we have the impression
that we are touching two objects and not one.
We see that even such basic information about real objects as
whether we deal with a single object or with two separate objects
cannot be reliably extracted from a single sensory projection -
we need _correlations_ from other sensory projections to form a
picture which makes sense.
The above examples seem to cast doubt on Harnad's claim that
the "nonarbitrary shapes" of sensorimotor projections from real
objects onto transducer surfaces are a crucial element of symbol
grounding. The shapes of the sensorimotor projections are shown to be
arbitrary to a large extent, and it is the _correlations_ among
these projections which appear to play the dominant role.
Harnad illustrates the categorization process leading to the
grounding of category names with an imaginary example of
learning to distinguish between edible and poisonous mushrooms
(Harnad 1993). It is interesting to note that in his example the
grounding of the mushroom names ("mushrooms" for the edible ones
and "toadstools" for the poisonous ones) takes place on the basis
of _correlations_ between various sensory projections. The _shapes_
of the projection invariants do not enter in any way.
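As a toy illustration of this point, the sketch below (a small Python
example of my own devising, not Harnad's) learns an "edible"/"poisonous"
label from several sensory channels at once; the decision rests on the
combination of evidence across channels, and nothing in it depends on
the "shape" of any single projection.
.DS
# Toy mushroom example: the category is learned from the joint pattern
# of several noisy sensory channels. Feature values and the simple
# perceptron are illustrative choices of my own.

import random

random.seed(0)

def sample(poisonous):
    # Three "senses": colour, smell, texture; each channel is noisy.
    base = [0.7, 0.6, 0.4] if poisonous else [0.4, 0.3, 0.7]
    return [b + random.gauss(0, 0.15) for b in base], (1 if poisonous else -1)

data = [sample(poisonous=(i % 2 == 0)) for i in range(200)]

# A simple perceptron trained on the combined channels.
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(20):
    for x, y in data:
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            b += y

correct = sum(1 for x, y in data
              if (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y > 0))
print("training accuracy:", correct / len(data))
.DE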
The second aspect of Harnad's model is his claim that the
sensorimotor projections coming from the system's transducers, fed
subsequently to a neural net for the purpose of categorization
(extraction of invariants), are analog. For instance he writes
(Harnad 1993):
"...it [Harnad's model] is 3-way (analog-connectionist-symbolic)
with the connectionist component just a place-holder for any
mechanism able to learn invariants in the analog sensorimotor
projections that allow the system to do categorisation"
and further down:
"...performance requirements of such a T3 [i.e. TTT] -scale robot
depend essentially on analog and other nonsymbolic forms of
internal structure and function."
However nowhere in his arguments does Harnad convincingly show
that this analog feature of the input (in the form of
sensorimotor projections) to neural nets which do the invariant
extraction is, in fact, essential. Any analog signal can be
approximated with arbitrary accuracy by a digital signal. Since
neural nets can have only finite sensitivity, whether they are
fed an analog signal or a correspondingly finely graded digitized
signal cannot matter for further processing. Once we accept this,
these digitized signals from the transducers (sensorimotor
projections) can be viewed as primitive symbols, in the same
spirit as the 0's and 1's of a Turing machine. All further processing
can be considered as symbol manipulation which, one way or another,
leads to the construction of high-level symbols representing language
terms (category names). This may very well happen with the use of
neural nets to extract invariants from sensory projections and perhaps
perform categorization. Since any neural net may be emulated by
a suitably programmed digital computer, all these steps can be achieved
without the need for analog devices.
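A small numerical illustration of this digitization argument is sketched
below (in Python), under assumptions of my own choosing: a smooth waveform
stands in for the "analog" sensory projection, and a thresholded feature
detector stands in for the finite-sensitivity invariant extractor.
.DS
# Quantize an "analog" projection ever more finely and check whether a
# finite-sensitivity detector ever notices the difference.

import math

def projection(t):
    """Pretend-analog sensory projection (a smooth waveform)."""
    return math.sin(2 * math.pi * t) + 0.3 * math.sin(6 * math.pi * t)

def quantize(x, levels):
    """Round the signal to a finite number of amplitude levels."""
    step = 2.6 / (levels - 1)   # signal lies roughly in [-1.3, 1.3]
    return round(x / step) * step

def detector(samples, threshold=0.5):
    """Finite-sensitivity invariant extractor: is the mean rectified
    amplitude above a threshold?"""
    return sum(abs(s) for s in samples) / len(samples) > threshold

ts = [i / 200 for i in range(200)]
analog = [projection(t) for t in ts]

for levels in (4, 16, 256, 65536):
    digital = [quantize(s, levels) for s in analog]
    same = detector(digital) == detector(analog)
    print(levels, "levels ->", "same decision" if same else "different decision")
.DE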
The above analysis suggests that full robotic capacity
of a system might provide high-level symbols with a deeper structure
based in correlations among the primitive symbols, the sources of which
are inputs from sensorimotor transducers. Symbol grounding would then
be achieved by the presence of such an underlying structure, which
would give the symbols a much richer (and more intricate) set of
relationships than can be offered by a (single-language) dictionary.
These relationships mirror the experiences of interacting with the real
world, making the symbols effective in such interactions and justifying
the claim that the symbols are grounded.
It is nevertheless worth pointing out that there does not seem to
be a reason why the underlying structure discussed above, once
established, could not be built (programmed) into a symbolic system,
without the need to give the system the full robotic capacity. Such
a system would be capable of passing the TT and should perhaps also be
considered to possess understanding of the language it uses.
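A minimal sketch of what building such a structure directly into a
symbolic system might look like is given below; the primitive token
names and the numerical profiles are invented purely for illustration.
Each high-level term is stored together with a profile over primitive,
transducer-level tokens, and relations between terms are then read off
the correlations between these profiles rather than taken from
dictionary definitions alone.
.DS
# Each high-level symbol carries a profile over primitive sensorimotor
# tokens; relations between symbols follow from these profiles.
# All names and numbers are invented for illustration.

# Positions in each profile correspond to these primitive tokens.
PRIMITIVES = ["fur-texture", "purring-sound", "meow-sound", "bark-sound",
              "wet-nose", "citrus-smell"]

PROFILES = {
    "cat":    [0.9, 0.8, 0.9, 0.0, 0.6, 0.0],
    "kitten": [0.9, 0.9, 0.8, 0.0, 0.5, 0.0],
    "dog":    [0.8, 0.0, 0.0, 0.9, 0.8, 0.0],
    "orange": [0.0, 0.0, 0.0, 0.0, 0.0, 0.9],
}

def similarity(a, b):
    """Cosine similarity between two primitive-level profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

for term in ("kitten", "dog", "orange"):
    print("cat ~", term, "=", round(similarity(PROFILES["cat"], PROFILES[term]), 2))
.DE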
There is one more aspect of the grounding problem, as
discussed above, which deserves mention. There are situations
in which we deal with concepts defined solely using language, without
reference to sensorimotor projections from real-world objects.
Such situations arise, for instance, in the case of mathematics.
If we consider abstract set theory or abstract group theory, we
define objects (sets, group elements) purely syntactically and
then proceed to draw all possible conclusions concerning the
consequences of these definitions. In spite of the fact that the
symbols we manipulate do not require grounding in sensorimotor
projections from real world objects, and the manipulations depend
only on shapes of these symbols (which are completely arbitrary),
we do talk about "understanding" mathematics (abstract set
theory, abstract group theory, etc.). It is clear that
understanding in this case means knowledge of (or the ability to
deduce) _correlations_ among symbols of increasing complexity,
arising from the definitions of the basic symbols from which these
higher-level symbols are constructed.
In conclusion, it is argued above that even though two aspects
of Harnad's model for symbol grounding seem unjustified:
- the shapes of sensorimotor projections from real objects onto
transducer surfaces do not appear to be relevant and hence cannot
play a role in restricting symbol combinations;
- the importance of the analog nature of the sensorimotor
projections, fed subsequently to neural nets for invariant
feature extraction, is not apparent (there are reasons to think
that these projections might just as well be digitized, leaving us
with pure symbol manipulation);
the main idea of the model - TTT capacity - may be crucial
for symbol grounding. It may be the combination of various sensorimotor
experiences with real objects which leads to the formation of a deep
structure underlying the high-level symbols, a structure which provides
the (epistemological) meaning of language terms. This underlying
structure may be somewhat akin to the semantic structure of
language J. Katz is attempting to establish in "The Metaphysics of
Meaning" (Katz 1990), although he takes a decidedly Platonic view,
whereas the structure referred to here has a very specific sensorimotor
basis.
There also appears to be a possibility that if a symbolic system
works with digitized inputs, corresponding to sensorimotor
projections coming from transducers, as its basic symbols, it might
possess understanding without TTT capability. The possibility of
ascribing understanding to a purely symbolic system seems in
accordance with using the term "understanding" in the case of
abstract mathematics, where the (mathematical) terms used are
described verbally only, without recourse to the full sensorimotor
capacities of a human being.
.B
References
.R
DeValois, R.L., I. Abramov, and G.H. Jacobs (1966) Analysis of
Response Patterns of LGN Cells. Journal of the Optical Society of
America 56: 966-977.
DeValois, R.L., and G.H. Jacobs (1968) Primate Color Vision.
Science 162: 533-540.
Harnad, S. (1990) The Symbol Grounding Problem.
Physica D 42: 335-346.
Harnad, S. (1993) Symbol Grounding is an Empirical Problem:
Neural Nets are Just a Candidate Component. Proceedings of the
Fifteenth Annual Meeting of the Cognitive Science Society. NJ: Erlbaum.
Harnad, S. (1993a) Grounding Symbols in the Analog World with
Neural Nets. Think 2: 12-78 (Special Issue on "Connectionism
versus Symbolism", D.M.W. Powers & P.A. Flach, eds.).
Katz, J.J. (1990) The Metaphysics of Meaning. MIT Press, Cambridge,
Massachusetts.
Searle, J. R. (1980) Minds, brains and programs.
Behavioral and Brain Sciences 3: 417-424.