Talk:List of self-segregating morphology methods

Jim Henry asked me to add these methods.
--

The following three techniques are a little different from Jim's technique #2, which is their closest equivalent already on the list:


 * 9. You could divide up the phonological segments into the following classes:
 * a. Segments that can be the first segment of a morpheme, but can't be any non-first segment.
 * b. Segments that can't be the first segment of a morpheme, but can be any non-first segment.

Then the morphemes will look like a, ab, abb, abbb, abbbb, ... etc. Morpheme boundaries would occur just before each a.
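
Here is a minimal sketch of how method 9 could be segmented. The inventories are assumptions made up for illustration (p, t, k as class a; a, i, u as class b), and the function name is hypothetical.

```python
# Hypothetical inventories, chosen only for illustration.
CLASS_A = set("ptk")   # may only begin a morpheme
CLASS_B = set("aiu")   # may only continue a morpheme


def segment_initial_marked(utterance):
    """Split an utterance into morphemes, cutting just before each class-a segment."""
    morphemes = []
    for seg in utterance:
        if seg in CLASS_A:
            morphemes.append(seg)        # a class-a segment opens a new morpheme
        elif seg in CLASS_B:
            if not morphemes:
                raise ValueError("utterance cannot start with a class-b segment")
            morphemes[-1] += seg         # a class-b segment extends the current morpheme
        else:
            raise ValueError(f"unknown segment: {seg!r}")
    return morphemes


print(segment_initial_marked("pataiku"))   # ['pa', 'tai', 'ku']
```

Note that the listener only knows a morpheme has ended when the next class-a segment (or the end of the utterance) arrives.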


 * 10. You could divide up the phonological segments into the following classes:
 * c. Segments that can be the last segment of a morpheme, but can't be any non-last segment.
 * d. Segments that can't be the last segment of a morpheme, but can be any non-last segment.

Then the morphemes will look like c, dc, ddc, dddc, ddddc, ... etc. Morpheme boundaries would occur just after each c.
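
A comparable sketch for method 10, again with made-up inventories (p, t, k as class c; a, i, u as class d) and a hypothetical function name. Here each boundary is known the moment a class-c segment is heard.

```python
# Hypothetical inventories, chosen only for illustration.
CLASS_C = set("ptk")   # may only end a morpheme
CLASS_D = set("aiu")   # may only occur before the end of a morpheme


def segment_final_marked(utterance):
    """Split an utterance into morphemes, cutting just after each class-c segment."""
    morphemes = []
    current = ""
    for seg in utterance:
        if seg not in CLASS_C and seg not in CLASS_D:
            raise ValueError(f"unknown segment: {seg!r}")
        current += seg
        if seg in CLASS_C:               # a class-c segment closes the morpheme
            morphemes.append(current)
            current = ""
    if current:
        raise ValueError("utterance ends in the middle of a morpheme")
    return morphemes


print(segment_final_marked("apiuktk"))   # ['ap', 'iuk', 't', 'k']
```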


 * 11. If you require every morpheme to contain at least two segments, you could divide up the phonological segments into the following classes:
 * e. Segments that can be the first or last segment of a morpheme, but can't be any non-first, non-last segment.
 * f. Segments that can be neither the first nor the last segment of a morpheme, but can be any non-first, non-last segment.

Then the morphemes will look like ee, efe, effe, efffe, effffe, ... etc. (Without the two-segment minimum, ee might be "e, e" or might be "ee".) Morpheme boundaries would occur just after each fe and just before each ef, but a string of "ee" morphemes would have to be parsed globally; you couldn't tell how to parse it unless you had the whole thing.
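
A sketch of method 11 under the two-segment minimum, with made-up inventories (p, t, k as class e; a, i, u as class f) and a hypothetical function name. It reads left to right from the very start of the utterance, which is the "whole thing" requirement in practice.

```python
# Hypothetical inventories, chosen only for illustration.
CLASS_E = set("ptk")   # may only be the first or last segment of a morpheme
CLASS_F = set("aiu")   # may only be an interior segment


def segment_edge_marked(utterance):
    """Split an utterance into morphemes of the shape e, f..., e (at least two segments)."""
    morphemes = []
    current = ""
    for seg in utterance:
        if seg in CLASS_E:
            current += seg
            if len(current) > 1:         # a second class-e segment closes the morpheme
                morphemes.append(current)
                current = ""
        elif seg in CLASS_F:
            if not current:
                raise ValueError("class-f segment outside a morpheme")
            current += seg
        else:
            raise ValueError(f"unknown segment: {seg!r}")
    if current:
        raise ValueError("utterance ends in the middle of a morpheme")
    return morphemes


print(segment_edge_marked("ptkaiptuk"))   # ['pt', 'kaip', 'tuk']
```

Started in the middle of a run of class-e segments, the same loop could pair them up wrongly, since it has no way to tell which segment opened a morpheme.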

--

The following technique is also different from anything already on the list:


 * 12. Require the last segment of each morpheme to code the length of the morpheme.
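
A rough sketch of method 12. Everything specific here is an assumption of mine: the length code (final a = 2 segments, i = 3, u = 4, counting the final segment itself), the function name, and the choice to parse from the end of the utterance backwards, which is the direction in which a final length marker is easiest to use.

```python
# Hypothetical length code: the final segment tells how long the morpheme is.
LENGTH_CODE = {"a": 2, "i": 3, "u": 4}


def segment_length_coded(utterance):
    """Split an utterance into morphemes, reading length markers from the end backwards."""
    morphemes = []
    end = len(utterance)
    while end > 0:
        last = utterance[end - 1]
        if last not in LENGTH_CODE:
            raise ValueError(f"{last!r} is not a length-coding segment")
        length = LENGTH_CODE[last]
        if length > end:
            raise ValueError("length marker reaches past the start of the utterance")
        morphemes.append(utterance[end - length:end])
        end -= length
    morphemes.reverse()                  # restore the order of utterance
    return morphemes


print(segment_length_coded("katsimuku"))   # ['ka', 'tsi', 'muku']
```

In this sketch an interior segment may also be one of the length-coding segments (the first u of "muku"), because only the segment at the current right edge is ever consulted.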

--

= Self-Segregating Syntax =
About: ("auto-" or "self-") + ("isolating" or "segregating") syntax.

i.e. "unambiguous parseability"; or, "uniquely defined parse-trees uniquely recoverable from the utterance-string".

--

Note: perhaps this belongs in a slightly-different topic. "Self-Segregating Morphology" is about making morpheme-boundaries, syllable-boundaries, and/or word-boundaries, uniquely recoverable from the utterance string.

I am talking here about making phrase-boundaries, clause-boundaries, sentence-boundaries, and syntagm-boundaries uniquely recoverable from the utterance string. In this case, I assume word-boundaries are already established somehow; I don't discuss how.

Many of the methods are parallel, of course.

--

I am going to mention two techniques similar to, and inspired by, what has been written to me by Jonathan Knibb and by And Rosta.

I do not yet have a proof, or at least not one that I personally comprehend, that either one of these techniques is actually bound to work. But I think they're at least worth looking into.

--

Jonathan Knibb has a conlang, "T4", in which every lexical item (every "word") is monomorphemic (so there is no difference between a "morpheme" and a "word"), and in which every syntagm is a "phrase".

T4 satisfies the following recursively-applicable two-rule conditions:
 * 1. A phrase may consist of a single word, or;
 * 2. a phrase may consist of two phrases, separated by an "operator".

By arranging that one can always locate the operator(s) and the phrase-boundaries and the word-boundaries, Jonathan makes T4 unambiguously parseable; there is only one way to assign a tree structure to any given T4 utterance, and it has the property that every non-leaf node has exactly two children and is labeled by an "operator".

--

And Rosta has a conlang (I think it's called Livagian; forgive me if I've misspelled it, And) which marks operators and operands with inflections or desinences about each other so that each syntagm can be uniquely recovered from the utterance string.

The following discussion is about a generalization of various techniques, some of which And has used in various incarnations of Livagian to date.

A syntagm may consist of any two or more constituents; exactly one constituent of a syntagm is its operator, all of the others are its operands.

Each operand constituent is either a word or a syntagm. The operator constituent must be a word, not a more complex syntagm.

(NOTE: The requirement that the operator be a single word, not periphrastic, may not be necessary.)

A word which is not an operator is marked with an inflection indicating that fact. (Maybe that inflection is spelled-out phonologically as a zero.) In effect words so marked are the "leaf-nodes".

If a syntagm is not a constituent of a larger syntagm, its operator-constituent is marked with an inflection indicating that fact. (Maybe that inflection is spelled-out phonologically as a zero.) In effect syntagms so marked are the "root-nodes".

If a word is an operand, or is the operator of a syntagm which is an operand, it is inflected to indicate the following:
 * 1. It is the first operand
 * OR ELSE it is not the first operand of its operator.
 * 2. It is the last operand
 * OR ELSE it is not the last operand of its operator.
 * 3. It immediately precedes its operator
 * OR it precedes its operator, but not immediately
 * OR it immediately follows its operator
 * OR it follows its operator, but not immediately.

(Counting the "it's not an operand and doesn't have an operator" possibility, there are 13 possible combinations, because the first operand can't follow the operator except immediately, and the last operand can't precede the operator except immediately.)

(So the possible markings are:)
 * 1. I'm a root node;
 * 2. I'm my operator's only operand and I immediately precede it;
 * 3. I'm my operator's only operand and I immediately follow it;
 * 4. I'm my operator's first operand, but not its last, and I precede it, and am separated from it;
 * 5. I'm my operator's first operand, but not its last, and I immediately precede it;
 * 6. I'm my operator's first operand, but not its last, and I immediately follow it;
 * 7. I'm my operator's last operand, but not its first, and I immediately precede it;
 * 8. I'm my operator's last operand, but not its first, and I immediately follow it;
 * 9. I'm my operator's last operand, but not its first, and I follow it, but am separated from it;
 * 10. I am neither the first nor the last operand of my operator, and I precede it but am separated from it;
 * 11. I am neither the first nor the last operand of my operator, and I immediately precede it;
 * 12. I am neither the first nor the last operand of my operator, and I immediately follow it;
 * 13. I am neither the first nor the last operand of my operator, and I follow it but am separated from it.
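
The count of 13 can be checked mechanically. This is only a sketch; the position labels and the function name are my own shorthand for the options listed above.

```python
from itertools import product

POSITIONS = (
    "precedes, separated",
    "immediately precedes",
    "immediately follows",
    "follows, separated",
)


def operand_markings():
    """List every consistent operand marking, plus the root-node marking."""
    markings = ["root node"]
    for is_first, is_last, position in product((True, False), (True, False), POSITIONS):
        # The first operand can't follow its operator except immediately.
        if is_first and position == "follows, separated":
            continue
        # The last operand can't precede its operator except immediately.
        if is_last and position == "precedes, separated":
            continue
        markings.append((is_first, is_last, position))
    return markings


print(len(operand_markings()))   # 13
```

An only operand is both first and last, so both restrictions apply to it and it keeps just the two "immediately" positions.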

If a word is the operator of a syntagm, it is inflected to indicate the following:
 * 1. how many operands it has
 * 2. whether its first operand:
 * 2a precedes it, but not immediately so
 * 2b immediately precedes it
 * 2c immediately follows it.
 * 3. whether its last operand:
 * 3a immediately precedes it
 * 3b immediately follows it
 * 3c follows it, but not immediately so.

Obviously the number of different combinations available depends on the answer to
 * 1. how many operands it has.

Among the possibilities:
 * 1. I have no operands because I am a leaf-node.
 * 2. I have exactly one operand, which immediately precedes me.
 * 3. I have exactly one operand, which immediately follows me.
 * 4. I have exactly two operands; they both precede me, the last one immediately so.
 * 5. I have exactly two operands; the first immediately precedes me, the last immediately follows me.
 * 6. I have exactly two operands; they both follow me, the first one immediately so.
 * 7. I have exactly three operands; all of them precede me, the last one immediately so.
 * 8. I have exactly three operands; the first one precedes me but is separated from me, while the last one immediately follows me.
 * 9. I have exactly three operands; the last one follows me but is separated from me, while the first one immediately precedes me.
 * 10. I have exactly three operands; they all follow me, the first one immediately so.
 * 11.n. I have exactly n operands (n>3); they all precede me, the last one immediately so.
 * 12.n. I have exactly n operands (n>3); all but the last one precede me; the first is separated from me, while the last immediately follows me.
 * 13.n. I have exactly n operands (n>3); more than one precedes me and more than one follows me; the first one precedes me but is separated from me, and the last one follows me but is separated from me.
 * 14.n. I have exactly n operands (n>3); all but the first one follow me; the first one immediately precedes me, and the last one follows me but is separated from me.
 * 15.n. I have exactly n operands (n>3); all of them follow me; the first one immediately follows me.
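
As a cross-check on the list above, here is a sketch that counts the distinct operator markings for a given number of operands. It assumes the only thing that varies is how many of the n operands precede the operator; the labels and the function name are mine.

```python
def operator_shapes(n):
    """Distinct (first-operand position, last-operand position) pairs for n >= 1 operands."""
    if n < 1:
        raise ValueError("a leaf-node has no operands to place")
    shapes = set()
    for before in range(n + 1):          # number of operands uttered before the operator
        after = n - before
        if before >= 2:
            first = "precedes, separated"
        elif before == 1:
            first = "immediately precedes"
        else:
            first = "immediately follows"
        if after >= 2:
            last = "follows, separated"
        elif after == 1:
            last = "immediately follows"
        else:
            last = "immediately precedes"
        shapes.add((first, last))
    return shapes


for n in (1, 2, 3, 4, 7):
    print(n, len(operator_shapes(n)))    # 1 2 / 2 3 / 3 4 / 4 5 / 7 5
```

So there are 2, 3 and 4 markings for one, two and three operands, and 5 for any n>3, matching possibilities 2 through 15.n.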

(BTW In natlangs apparently the 13.n. option is not often taken.)

--

I don't know for sure that the above would work. But if it would, it would work for trees of arbitrary depth, unlike some other proposals which run out of steam after a certain rather shallow depth; because a word can get inflected both as an operator and as an operand.

--

= How well, and under what circumstances, will this work? =

I have done some investigating about how well such a scheme works, in terms of making utterances unambiguously parseable.

1. If each operator comes just before its first operand, and each operand comes somewhere after its operator, this scheme will work.

2. If each operator comes just after its last operand, and each operand comes somewhere before its operator, this scheme will work.

Under each of these conditions, there is no difficulty deciding which of two operators to apply first; the one closest to the operand(s) has to be applied first, then the next closest, etc. With "prefix" operators (condition 1), the operator "closest" to the operand(s) will be the latest-uttered one, that is, "the operator furthest to the 'right' must be applied first". With "postfix" operators (condition 2), the operator "closest" to the operand(s) will be the first-uttered one, that is, "the operator furthest to the 'left' must be applied first".

So, under conditions 1 or 2, the big question that must be answered to make parsing unambiguous is, "How many operands does each operator take?" Since this is marked by an inflection on the operator-word, and since, also, the last operand is marked with the "I'm the last operand!" inflection, the type of scheme discussed here _will_ result in unambiguous parseability provided either all operators are in "pre-order" ("Polish Notation"), or else all operators are in "post-order" ("Reverse Polish Notation").
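
To make conditions 1 and 2 concrete, here is a sketch of the two standard parsers. I am assuming each word arrives as a (word, arity) pair, with the arity standing in for the "how many operands" inflection and 0 meaning a leaf-node; the example words and function names are hypothetical.

```python
def parse_prefix(tokens):
    """Parse (word, arity) pairs in which every operator precedes all of its operands."""
    def parse_at(i):
        if i >= len(tokens):
            raise ValueError("ran out of words before all operands were found")
        word, arity = tokens[i]
        i += 1
        operands = []
        for _ in range(arity):           # each operand is itself a word or a syntagm
            subtree, i = parse_at(i)
            operands.append(subtree)
        return ((word, *operands) if arity else word), i

    tree, end = parse_at(0)
    if end != len(tokens):
        raise ValueError("leftover words after the root syntagm")
    return tree


def parse_postfix(tokens):
    """Parse (word, arity) pairs in which every operator follows all of its operands."""
    stack = []
    for word, arity in tokens:
        if arity == 0:
            stack.append(word)
            continue
        if len(stack) < arity:
            raise ValueError(f"{word!r} needs {arity} operands")
        operands = stack[-arity:]
        del stack[-arity:]
        stack.append((word, *operands))
    if len(stack) != 1:
        raise ValueError("utterance does not reduce to a single root syntagm")
    return stack[0]


pre = [("give", 3), ("I", 0), ("the", 1), ("book", 0), ("you", 0)]
post = [("I", 0), ("book", 0), ("the", 1), ("you", 0), ("give", 3)]
print(parse_prefix(pre))    # ('give', 'I', ('the', 'book'), 'you')
print(parse_postfix(post))  # ('give', 'I', ('the', 'book'), 'you')
```

Neither parser ever has a choice to make, which is the sense in which the arity marking alone gives a unique tree in these two orders.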

Now, suppose for purposes of simplicity that all of the operators are _unary_. One thing this implies is that every operator is either a prefix-ordered operator (coming just before its first -- and only -- operand) or a postfix-ordered operator (coming just after its last -- and only -- operand).

There is no difficulty deciding which of two "prefix" operators to apply first; the one "on the right" is closer to the operand, so it gets applied first.

There is also no difficulty deciding which of two "postfix" operators to apply first; the one "on the left" is closer to the operand, so it gets applied first.

But when one operator is a prefix operator and the other is a postfix operator, some sort of precedence rule must be used.

3. If all operators are either prefix operators or postfix operators, and all prefix operators are applied before any postfix operators, then the scheme will work, provided that either 3a all prefix operators are unary or 3b all postfix operators are unary.

4. If all operators are either prefix operators or postfix operators, and all postfix operators are applied before any prefix operators, then the scheme will work, provided that either 4a all prefix operators are unary or 4b all postfix operators are unary.

I will now provide an example of why the scheme leaves an ambiguity (possibly resolvable by other means -- still to be determined).

Suppose A and B and C are all unary operators. Suppose A has precedence over both of B and C; and suppose both A and B have precedence over C.

If A and C are prefix operators, and B is a postfix operator, this means that

"A x B" will _always_ be interpreted as

"(A x) B" and

"C x B" will _always_ be interpreted as

"C (x B)"

But should

"A C x B" be interpreted as

"(A (C x)) B" (which violates the "B before C" rule) or as

"A (C (x B))" (which violates the "A before B" rule)?

Obviously a similar problem could come up if B were the prefix operator and A and C were the postfix operators. How then should one interpret

"B x C A"; as

"((B x) C) A" (which violates the "A before B" rule) or as

"B ((x C) A)" (which violates the "B before C" rule)?

--
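
The first example above ("A C x B") can be checked by brute force. This sketch lists every order in which the three unary operators could be applied and tests each against the two prefix-versus-postfix rules, "A before B" and "B before C". (A versus C is settled by position, since both are prefix.) The function names are mine.

```python
def application_orders(prefixes, postfixes):
    """Every order in which unary operators around a single operand can be applied.

    Both lists are in order of utterance. Among prefixes the rightmost applies
    first; among postfixes the leftmost applies first.
    """
    def interleave(a, b):
        if not a or not b:
            yield a + b
            return
        for rest in interleave(a[1:], b):
            yield [a[0]] + rest
        for rest in interleave(a, b[1:]):
            yield [b[0]] + rest

    yield from interleave(list(reversed(prefixes)), list(postfixes))


def violations(order, rules):
    """Return the (earlier, later) rules that the given application order breaks."""
    return [(x, y) for x, y in rules if order.index(x) > order.index(y)]


rules = [("A", "B"), ("B", "C")]
for order in application_orders(["A", "C"], ["B"]):
    print(order, violations(order, rules))
# ['C', 'A', 'B'] [('B', 'C')]
# ['C', 'B', 'A'] [('A', 'B'), ('B', 'C')]
# ['B', 'C', 'A'] [('A', 'B')]
```

The first and last lines are the two bracketings discussed above; the middle one, "A ((C x) B)", breaks both rules. No order satisfies both.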

As far as I know so far, it may still be possible that the scheme will provide unambiguous parseability when all operators are either prefix or postfix and either all prefix operators take precedence over all postfix operators or else all postfix operators take precedence over all prefix operators, even when there are both binary prefix operators and binary postfix operators. But I haven't looked into it yet.

--

If some of the operators are infix operators -- that is, occurring somewhere after (perhaps immediately after) their first operand, yet still somewhere before (perhaps immediately before) their last operand -- then the problem of "associativity" comes up.

Suppose every operator is either unary or binary; and suppose every operator occurs just before its last operand.

Now suppose A is a binary operator among such operators.

Do we interpret

"x A y A z" as

"x A (y A z)" or as

"(x A y) A z"?

--
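
The associativity problem for "x A y A z" can be made concrete with a sketch that lists every way to bracket a chain of binary infix operators. The function name is mine, and the input is assumed to alternate operand, operator, operand, and so on.

```python
def bracketings(tokens):
    """All binary trees over a list that alternates operand, operator, operand, ..."""
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    for i in range(1, len(tokens), 2):   # try each operator as the outermost one
        for left in bracketings(tokens[:i]):
            for right in bracketings(tokens[i + 1:]):
                trees.append((left, tokens[i], right))
    return trees


for tree in bracketings(["x", "A", "y", "A", "z"]):
    print(tree)
# ('x', 'A', ('y', 'A', 'z'))
# (('x', 'A', 'y'), 'A', 'z')
```

With four operands the same function returns five trees, so the ambiguity grows quickly unless an associativity rule picks one.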

So far, it's looking a little like the scheme will work only if some additional rules are applied. The simplest sets of such rules -- or at least the easiest ones to think of first -- would appear to be the following:

1. Every operator should be either a prefix operator or a postfix operator. There shouldn't be any infix operators.

2. Either every prefix operator takes precedence over every postfix operator, or every postfix operator takes precedence over every prefix operator.

3. Every n-ary operator for n>=2 must be of the same positional type; that is, either all such operators are prefix operators, or all such operators are postfix operators.

--

= Grammatical category, word-class, and distribution. =

The "functional type" or "grammatical category" or "distributional class" of a syntagm or phrase can be determined in basically two different ways: "Opaque" type-assignment or "Transparent" type-assignment.

In "Opaque" type-assignment, the lexical entry for the operator-word declares that any syntagm of which that word is the "operator" -- the syndetic or ligating word of that syntagm or phrase -- will be of a certain type, regardless of the type(s) of its operand(s).

In "Transparent" type-assignment, the operator's operand(s) may be of any type (so long as they are all of the same type as each other, if there is more than one operand); and the syntagm or phrase will then have the same type as (all of) the operand(s).
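
The two kinds of type-assignment can be sketched as one small function. The dictionary layout, the type names, and the two sample operators are all hypothetical, chosen only to show the difference.

```python
def syntagm_type(operator, operand_types):
    """Return the type of a syntagm from its operator entry and its operands' types."""
    if operator["assignment"] == "opaque":
        # The lexical entry fixes the result, whatever the operands are.
        return operator["result"]
    # Transparent: all operands must share one type, which the syntagm inherits.
    if len(set(operand_types)) != 1:
        raise ValueError("transparent operators need operands of one shared type")
    return operand_types[0]


NOT = {"assignment": "transparent"}                      # hypothetical negator
IN = {"assignment": "opaque", "result": "adverbial"}     # hypothetical adposition

print(syntagm_type(NOT, ["adjective"]))   # adjective
print(syntagm_type(IN, ["noun"]))         # adverbial
```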

--

"Transparent" type-assignment is similar to what people call "endocentric" or "the head-word".

--

BTW Jonathan's scheme, and And's scheme too if all of its operators are binary, are among the grammars which are in "Chomsky Normal Form".

--

And's actual use of the scheme I "imputed" to him does not necessarily involve absolutely every operator getting a non-zero phonological spell-out. In other words, sometimes it's there even if you don't have to say it out loud. Also, in some (at least one?) of his uses, either all of the operands precede all of the operators, or vice versa; also, in some (at least one?) of his uses, all of the operators are binary.

About word-classes.

Jonathan Knibb's T4 does not have traditional word-classes; i.e. doesn't have nouns and verbs and pronouns and adjectives and adverbs and adpositions and conjunctions. Instead T4 has just "words" and "operators"; and the "operators" are an extremely small closed class.

And Rosta's Livagian's operators also do not belong to any of the traditional word-classes (adjective, adposition, adverb, conjunction, noun, pronoun, verb). Like T4's operators they form a closed class, but in the Livagian case this class is _not_ all that dang small.

--

A "transparent type-assignment" operator taking two or more operands could be considered to correspond with a "conjunction". It conjoins two or more words and/or phrases (syntagms) of a certain type into a new syntagm also of that same type.

The negative particle could be considered a one-operand operator with "transparent" type-assignment. Whatever kind of word or phrase "X" is, "not X" is a phrase of the same type, just with "the opposite" meaning.

Certain other words could be considered one-operand operators with transparent type-assignment, only with restrictions on what kind of operands they can take. For instance:
 * an adverb of degree might be considered an operator which takes an adjective or phrasal adjective and produces a syntagm which again acts like an adjective.
 * a sentential adverb might be considered as an operator which takes a verb or periphrastic verb and produces a phrasal verb.
 * an adjective might be considered an operator which takes a noun or nominal phrase and produces an expanded nominal phrase.

As for the word-class types of syntagms whose operators have "opaque" type-assignment: From some points of view, some linguists seem to think these will mostly turn out to be either noun-like syntagms or adverb-like syntagms. From other points of view, some linguists seem to think these will mostly turn out to be either adjective-like syntagms or adverb-like syntagms. In a natlang, an opaquely-assigning operator taking three or more operands will almost surely correspond to a verb. In a natlang, almost all opaquely-assigning operators taking two operands also correspond to verbs; the remaining few almost surely correspond to adpositions. An adpositional phrase generally is used either like an adjective or like an adverb.

--

There are many languages which have many verbs of which most forms are single words, but, if the verb is inflected into certain combinations of tense, aspect, mood, and voice, the result is necessarily analytic or periphrastic -- in other words, a phrasal verb is produced instead of a "monolexemic" verb. The extra words common to some of these forms might be considered operators which produce syntagms which are, once again, operators. Thus, the restriction that operators have to be single words, not syntagms, is not for the purposes of making the conlang resemble natlangs better; instead, it is for making it easier to see that the parsing is unambiguous. For all I know, allowing operators which are syntagms is _not_ incompatible with "unambiguous parseability"; but, if it can be done, I'll bet it's harder to see that it has been done correctly.