                           Viterbi Decoder
                           ---------------
                      Linas Vepstas March 2013

This directory contains code for a prototype Viterbi decoder for Link
Grammar. It barely functions and is currently very slow.


Motivation
==========
The goal of providing this decoder is to present a flexible, powerful
interface for implementing high-level semantic algorithms on top of
the low-level link-grammar syntactic parser, and, in particular,
for steering the parse based on high-level semantic knowledge.  This
allows the parser to move beyond being merely a syntactic parser,
and to become fully integrated with general semantic artificial
intelligence.

A less abstract list of expected benefits includes:

 * Incremental parsing: the ability to obtain partial results after
   providing partial sentences, a word at a time.
 * Less sensitivity to sentence boundaries, allowing longer,
   run-on sentences to be parsed far more quickly.
 * Mitigation of the combinatorial explosion of parses.
 * Allow grammatically broken/incorrect chat dialog to be parsed;
   in general, to do better with slang and hip-speak.
 * Enable co-reference resolution and anaphora resolution across
   sentences (resolve pronouns, etc.)
 * Enable annotation of the parse graph with word-sense data, entity
   markers.
 * Allow richer state to be passed up to higher layers: specifically,
   alternate parses for fractions of a sentence, alternative reference
   resolutions.
 * Allow a plug-in architecture, so that plug-ins, employing higher-
   level semantic (AGI) algorithms can provide parse guidance and
   parse disambiguation.
 * Eliminate many of the hard-coded array sizes in the code.
 * Fix the word-count problem during spell-guessing. So, for
   example, if the mis-spelled word "dont" shows up in the input, it
   could be issued as one word ("done") or two ("do n't"). The
   current suffix-stripping/word-issuing algorithm cannot deal with this
   correctly. This is also an issue for the Russian dictionary, where
   the stem+suffix processing can generate variable word counts.


The data structures used to implement this resemble those of the
OpenCog AtomSpace.  All data classes inherit from a class called Atom
(which is an atomic predicate, in the sense of mathematical logic).
Atoms are typed; the two core types are Links and Nodes. Thus, all data
is represented in the form of a "term algebra" (aka the "Free Theory",
in the sense of model theory).  This structure allows all data to be
represented as (hyper-)graphs, which in turn makes graph algorithms
easier to implement.  All these theoretical considerations provide
a natural setting for storing Viterbi state information.  Put
differently, this provides a generic, uniform way of holding the
various partly-finished parses, and effecting state transformations
on them.
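
To make this concrete, here is a minimal sketch of how such a class
hierarchy might look in C++.  This is illustrative only; the actual
class declarations in this directory differ in detail.

   #include <string>
   #include <vector>

   // Everything is an Atom; the float is the 'truth value'
   // (a single cost; see the AtomSpace Design section below).
   class Atom
   {
   public:
      Atom(float tv = 0.0f) : _tv(tv) {}
      virtual ~Atom() {}
      float _tv;
   };

   // Nodes carry names; for example, Word is a subtype of Node.
   class Node : public Atom
   {
   public:
      Node(const std::string& name) : _name(name) {}
      std::string _name;
   };

   // Links carry ordered lists of atoms (the 'outgoing set');
   // for example, Set is a subtype of Link.
   class Link : public Atom
   {
   public:
      Link(const std::vector<Atom*>& oset) : _oset(oset) {}
      std::vector<Atom*> _oset;
   };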

Making the internal state directly visible allows low-level syntactic
algorithms, as well as high-level semantic algorithms, to control parsing.
In other words, the intended use of the Viterbi decoder is to provide
a framework for parsing that should make it possible to integrate
tightly (and cleanly) with high-level semantic analysis algorithms.
Thus, reference and anaphora resolution can be done using the same
graph structure as used for parsing; it should also allow graphical
transformations, such as those currently implemented in RelEx.

Since all of the data is represented dynamically (at run-time) by
these (hyper-)graphs composed of atoms, developing custom algorithms
to manipulate the parse becomes easy: there are no strange compile-time
structures to master.  All algorithms can access the data in a uniform,
common way.

One may argue that Viterbi is a more natural, biological way of working
with sequences.  Some experimental, psychological support for this can
be found here:
http://www.sciencedaily.com/releases/2012/09/120925143555.htm
per Morten Christiansen, Cornell professor of psychology.


Why "Viterbi"?
==============
The parser implemented here is called a "Viterbi decoder" because it is
inspired by (and vaguely resembles) the Viterbi algorithm famous from signal
processing.  A characteristic feature of that algorithm is that it maintains
a set of states in parallel. As each new bit is received, some of the states
become inherently inconsistent (e.g. because some checksum is violated),
while other new states become possible.  Once a certain number of bits
have been received, the ones that can be consistently interpreted with the
checksum constraints can be output. The process then repeats with each new bit
streaming in.  Likewise, the code here keeps a set of states in parallel;
each state is tied to some possible pending output that is consistent with
that state.  As each new word comes in, some of the states become invalid
and are discarded (because the word cannot link to that state). When all
states agree on some linkage, that linkage can be confidently output.
As words flow in, the process is repeated.  If there are ambiguities, then
multiple parse alternatives can be output (this is rather unlike the
original Viterbi algorithm, whose goal is to generate only one output,
the single most likely output).  For language parsing, however, it is
useful to convey parse alternatives to higher-order processes, which can
make a final determination (analogous to having a higher-order checksum
functioning within the Viterbi decoder).
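
As a purely conceptual sketch, the per-word update might look like the
following.  All names here are hypothetical stand-ins; none of this
matches the actual decoder classes; it only illustrates the control
flow just described.

   #include <string>
   #include <vector>

   struct ParseState { /* a partial parse, plus pending output */ };

   // Hypothetical stand-ins for the real linkage tests.
   static bool can_link(const ParseState&, const std::string&)
      { return true; /* stub */ }
   static ParseState extend(ParseState s, const std::string&)
      { return s; /* stub */ }

   void observe_word(std::vector<ParseState>& states,
                     const std::string& word)
   {
      std::vector<ParseState> next;
      for (const ParseState& s : states)
      {
         // Discard states that the incoming word cannot link to;
         // keep (and extend) the states that remain consistent.
         if (can_link(s, word))
            next.push_back(extend(s, word));
      }
      states.swap(next);
      // When all surviving states agree on some initial linkage,
      // that linkage can be confidently output.
   }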

I like this analogy because it is vaguely biological as well; or perhaps
I should say "neural net-ish".  The multiple, provisional states that
are kept around are sort of like the activation states of a feed-forward
artificial neural network.  But this is not very deep: the feed-forward
neural net looks like a Hidden Markov Model (HMM), and the Viterbi
algorithm is essentially an HMM algorithm.  No surprise!


Status
======
Currently, the parser can correctly parse many short sentences.  It
runs very slowly, as no pruning algorithms have yet been implemented.
No probabilities or weights have yet been implemented; groundwork
for this is being laid now.


Using the parser
================
To use this parser, you must first configure with the --enable-viterbi
flag:

   configure --enable-viterbi

Then, in the client, switch to the viterbi parser with the !viterbi
command.  (If confused, try !help and !var for general help.)

The vitest.cc file contains unit tests. Currently, it consists of 95
tests, and all but the last of them pass.  More tests will be added.


Design
======
What follows is a writeup of the current design.  It is incomplete.
Parts of it may also be stale & out-of-date.  This part is meant only for
programmers wishing to update this code.  If you are just a user, you
can skip reading this section.


AtomSpace Design
----------------
The concept of Atoms, Nodes, Links used in the implementation here
generally resembles the concepts with the matching names in OpenCog.
That is, Nodes and Links are two fundamental types of Atoms.  Nodes
have names, whereas Links contain lists of Atoms.  These two types are
used to form hypergraphs, where nodes are labelled, but links are not.
In C++, these are implemented as derived classes.  Each, in turn, has
further subtypes: for example, Word is a special type of Node, and Set
is a subtype of Link.

Each atom has associated with it a 'truth value'.  In OpenCog, truth
values may be rather complicated; in the current implementation here,
each truth value is just a single floating point number.  It is meant to
be interpreted as the 'log likelihood', or the 'cost' or the 'entropy',
depending on the context.  The word 'cost' comes from link-grammar,
and is used to indicate preference: a parse with the lowest total cost
is the most likely parse.  Note that costs are additive.

Roughly speaking, probability P can be defined as P = exp(-C) where
C is the cost. Traditional OpenCog stores P in truth values; we store
C instead.  It just seems easier to debug using C instead of P.  One
can always convert the one into the other.
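
In code, the conversion is a one-liner in each direction (function
names here are just for illustration):

   #include <cmath>

   float prob_from_cost(float c) { return std::exp(-c); }   // P = exp(-C)
   float cost_from_prob(float p) { return -std::log(p); }   // C = -log(P)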

It is the intent of the Viterbi parser design to not only use cost
as a guide for selecting and prioritizing parses, but also to eventually
collect statistical data on real-world corpora, and automatically,
through machine learning, update the dictionary with costs and new
disjuncts.  Thus, the hypergraphs will behave somewhat like Bayesian
networks, or perhaps somewhat like Markov Logic Networks.  The
detailed interpretation of costs, and how they are distributed and
propagated through the network, both during parsing and during
learning, has not yet been figured out.  It's kind of complicated :-)

The traditional OpenCog AtomSpace design insists that there can only
ever be one single instance of any given atom at any time.  This is
not yet enforced in the code here, but might be 'real soon now'.  It
is certainly shaping design decisions.


Disjunct Design
---------------
The core underlying concept of link-grammar is the 'disjunct': a set
of connectors, each of which must be connected in order for the
disjunct to be satisfied.  A given dictionary word may have many
disjuncts; however, only one may be satisfied at a time.  A typical
dictionary entry for a word consists of several connectors composed
with '&' and 'or' operators.  Although it is very tempting to think of
'&' and 'or' as being boolean operators, they are not. First of all,
connectors must be kept in sequential order: they do not commute
under '&' (however, 'or' does commute).  Next, one can distribute
'or' over '&' but not vice versa.  Thus, if A+, B+ and C+ are
connectors, then it is true that

   A+ & (B+ or C+) == A+ & (C+ or B+)
                   == (A+ & B+) or (A+ & C+)
                   == (A+ & C+) or (A+ & B+)

However, one cannot distribute the other way:

   A+ or (B+ & C+) != (A+ or B+) & (A+ or C+)

There are several ways to understand the above prohibition:
 1) Non-copying of connectors: connectors cannot be duplicated. The
    RHS of the above has two copies of A+, but the LHS has just one.
    (Non-copying is quantum-like. How weird is that?).

 2) One might say that 'or' does behave like boolean OR, but '&' does
    not.

 3) One might say that 'or' behaves like 'tensor plus' while '&'
    behaves like 'tensor times' in linear logic. Except that tensor
    times is commutative, and & is not...

 4) Alternately, 'or' behaves sort-of like 'exclusive-or', in that
    only one of the arguments to 'exclusive-or' is allowed to hold.
    An N>2 argument XOR is more correctly called a 'choose-one-of-N'
    function: a logic demultiplexer.

The simplest interpretation seems to be a 'many-worlds' interpretation:
That is, use 4) so that 'or' is 'choose-one-world-of-many', and enforce
1) non-copying of connectors within a single world.  BTW, note that any
connector prefixed with a '@' is explicitly copyable!

In the C++ code, the '&' operator will be called 'Tensand' (tensor-and),
while the 'or' operator will be called 'Plexor' (multiplexed-or).
(TODO: change the names.)
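
Continuing the illustrative class sketch from earlier, these two
operators can be pictured as Link subtypes (again, the actual
declarations differ):

   // '&' preserves order: connectors do not commute under it.
   class Tensand : public Link
   {
   public:
      Tensand(const std::vector<Atom*>& oset) : Link(oset) {}
   };

   // 'or' commutes: it holds a set of alternative 'worlds'.
   class Plexor : public Link
   {
   public:
      Plexor(const std::vector<Atom*>& oset) : Link(oset) {}
   };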

Link-grammar dictionary entries are written as expressions of connectors,
'&' and 'or'.  In order to actually use them in parsing a sentence, it
seems easiest to expand them into disjunctive normal form (DNF). This
makes it easy to determine that one and only one disjunct is being used,
and that all of the connectors in the disjunct are being satisfied.
The disjoin() methods on the various classes are used to place
expressions into DNF.

To be clear: (A+ & B+) is a disjunct (all of the connectors are
disjoined).  By contrast, A+ & (B+ or C+) is NOT a disjunct; but it
can be disjoined into two of them.
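
As a rough sketch of how the expansion into DNF might go, in terms of
the illustrative classes above: treat a disjunct as an ordered list of
connectors, and a DNF expression as a list of alternative disjuncts.
The real disjoin() methods are member functions on the expression
classes, and also propagate costs, which this sketch omits.

   #include <vector>

   typedef std::vector<Atom*> Disjunct;   // ordered connector list
   typedef std::vector<Disjunct> Dnf;     // alternative disjuncts

   Dnf disjoin(Atom* expr)
   {
      if (Plexor* po = dynamic_cast<Plexor*>(expr))
      {
         // 'or': just pool the alternatives of each argument.
         Dnf result;
         for (Atom* a : po->_oset)
            for (const Disjunct& alt : disjoin(a))
               result.push_back(alt);
         return result;
      }
      if (Tensand* ta = dynamic_cast<Tensand*>(expr))
      {
         // '&': distribute 'or' over '&' by taking the
         // order-preserving cartesian product of alternatives.
         Dnf result(1);   // start with a single empty disjunct
         for (Atom* a : ta->_oset)
         {
            Dnf alts = disjoin(a);
            Dnf expanded;
            for (const Disjunct& head : result)
               for (const Disjunct& tail : alts)
               {
                  Disjunct d = head;
                  d.insert(d.end(), tail.begin(), tail.end());
                  expanded.push_back(d);
               }
            result.swap(expanded);
         }
         return result;
      }
      // Leaf (e.g. a Connector): one single-connector disjunct.
      return Dnf(1, Disjunct(1, expr));
   }

Applied to A+ & (B+ or C+), this yields the two disjuncts (A+ & B+)
and (A+ & C+), as expected.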

Several complications arise when we want to consider truth values
(probabilities, or 'costs') associated with disjuncts.  First of all,
costs need to be correctly distributed when disjuncts are being formed.
Thus, for example, using [[]] to denote cost (just as in the
link-grammar dicts), we have:

   [[A+]] & (B+ or C+) == ([[A+]] & B+) or ([[A+]] & C+)
                       == [[A+ & B+]] or [[A+ & C+]]
                       == [[A+ & (B+ or C+)]]
                       == A+ & [[B+ or C+]]

while

   A+ & ([[B+]] or C+) == (A+ & [[B+]]) or (A+ & C+)
                       == [[A+ & B+]] or (A+ & C+)

In this last expression, the disjunct [[A+ & B+]] is costly, while
(A+ & C+) is not.
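
Since costs are additive, the total cost of a disjunct produced this
way is just the sum of the costs of its parts.  In terms of the sketch
above (keeping in mind that, as discussed next, the real code attaches
the result to the word-disjunct pair, not to the disjunct alone):

   // Illustrative: the cost of a disjunct is the sum of the
   // costs of its connectors.
   float disjunct_cost(const Disjunct& d)
   {
      float cost = 0.0f;
      for (Atom* a : d) cost += a->_tv;
      return cost;
   }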

The next issue is that the same disjunct can have different costs when
used with different words. Thus, [[A+ & B+]] may be costly when used
with the verb 'to be' but not when used with the verb 'to go'.  Thus,
the cost annotation must be associated with the word-disjunct pair,
and not with the disjunct alone.  Putting all these different
observations together results in the following representation for an
entry in the dictionary (after it has been placed in DNF):

Dictionary entry as it appears in the link-grammar dicts:

   jabberwoky: A+ & ([[B+]] or C+);

Dictionary entry, as represented in the atomspace, after being disjoined:

   Plexor
       WordCset    (cost 2.0)      ; Notice non-zero cost
          Word     "jabberwoky"
          Tensand                  ; Notice cost of zero for tensand
             Connector "A+"
             Connector "B+"

       WordCset                    ; Zero cost here
          Word     "jabberwoky"
          Tensand
             Connector "A+"
             Connector "C+"

Note that the act of disjoining has to have the Plexor 'pass through'
the word labels (but not the Tensand).  So, for example, just after
the dictionary entry is read from the file, but before it is placed
placed into the atomspace, it is represented as:

   WordCset
      Word     "jabberwoky"
      Tensand
         Connector "A+"
         Plexor
             Connector "B+"  (cost 2.0)
             Connector "C+"

which is certainly more compact than above, except that:

1) It's very hard to develop an algorithm that will correctly couple
   connectors during a parse, when the entry is in this form.
   Thus, we want DNF just to make the algorithms doable.

2) The cost on B+ means that we have to be careful that this connector
   is not in the atomspace.  This is because the atomspace rules allow
   only one, singular, unique atom called "Connector B+", and thus,
   changing its cost would affect all expressions that contain it!

3) Updating costs on Tensands (disjuncts) also raises similar issues:
   the atomspace allows one and only one disjunct of the form
   (Tensand (Connector "A+") (Connector "B+")) in the atomspace, so
   setting its cost would change the perceived cost for any other
   expression that might contain this sub-expression.



Historical dates
================
Idea conceived: April 2008
Coding started: October 2012
First successful long sentence parse: March 2013