HOWTO-grm

id est

How to write a PolyGen grammar


Index

1.0 What is a grammar?

     1.0.1 Subproductions
     1.0.2 Optional subproductions
     1.0.3 Comments


2.0 Advanced features
     
     2.0.1 Concatenation
     2.0.2 Epsilon
     2.0.3 Controlling the probability of a production
     2.0.4 Unfolding

               2.0.4.1 Non-terminal symbols
            2.0.4.2 Subproductions
            2.0.4.3 Optional subproductions
               2.0.4.4 Permutable subproductions
            2.0.4.5 Deeply unfolded subproductions

     2.0.5
Attributes
     
            2.0.5.1
Labels and selection
             2.0.5.2 Multiple selection

      2.0.6 Capitalization
      2.0.7 Permutation
     2.0.8 Deep unfolding
     2.0.9 Folding


2.1 Advanced techniques

     2.1.1 Recursion
     2.1.2 Grouping
     2.1.3 Controlling the probabilty of an optional subproduction
     2.1.4 Selection reset


3.0 Static checking of grammars

     3.0.1 Errors

            3.0.1.1 Undefined non-terminal simbols
            3.0.1.2 Cyclic recursion and non-termination
            3.0.1.3 Recursive unfolding
            3.0.1.4 Epsilon-productions
            3.0.1.5 Overriding of non-terminal symbols
            3.0.1.6 Illegal character
            3.0.1.7 Unexpected token

     3.0.2 Warnings

               3.0.2.0 Level 0
           
               3.0.2.1 Level 1

                     3.0.2.1.1 Undefined symbol I
                     3.0.2.1.2 Potential epsilon-productions

               3.0.2.2 Level 2
           
                     3.0.2.2.1
 Useless permutation
                     3.0.2.2.2 Potential cyclic recursion


4.0 Appendix

     4.1.1 Concrete syntax
     4.1.2 Abstract syntax
     4.1.3 Lexical rules
     4.1.4 Escape sequences
      4.1.5 Translation rules




1.0 What is a grammar?

A grammar is an ASCII text file providing the definition of the syntactical structure and terms used by the program to build sentences. PolyGen is able to interpret a language designed for defining type-2 grammars (according to Chomsky classification) consisting in an extension of the BNF (Backus Naur Form) - a very simple and common form for the description of the syntax of a language.

A definition consists in specifying for a given symbol a set of productions interleaved by a pipe | and followed by a semicolon ; , which acts as terminator:

EXAMPLE

S ::= the apple | an orange ;


PRODUCES

the apple
an orange



Such definition of symbol S (said non-terminal) allows the generation of symbols the apple as well as an orange (said terminal).
The probability for the output the apple to be generated is equal to 1 every 3 times; and the same for an orange: thus, when 2 productions occur, we have 1 in 2 chances each; when 5 occur, we have 1 in 5, etc.

You're allowed to define several non-terminal symbols and reference them from any productions, in order to let more complex sentences be generated:

EXAMPLE

S ::= the Animal is eating a Animal ;

Animale ::= cat | dog ;



PRODUCES

the cat is eating a cat
the cat is eating a dog
the dog is eating a cat
the dog is eating a dog


ecc.


Note: PolyGen uses by default the non-terminal symbol S as the starting one: every grammar must therefore define it at least unless another starting symbol has been specified as argument to the program.

By default, a term beginning with a capital letter is considered as non-terminal (thus bound to a definition) and a term beginning with a non-capital letter as terminal (a simple word). If you need then to specify a capital word you must quote it in order to get the program not to mistake it for a non-terminal symbol:

EXAMPLE

S ::= a Pet called "Pet" ;

Pet ::= cat | pig | dog ;



PRODUCES

a cat called Pet
a pig called Pet
a dog called Pet



Keep in mind that many characters (punctuation marks, parentheses, etc), including those interpret as keywords by the program, must be quoted to be output (see section 4.1.3 for complete lexical rules).

EXAMPLE

S ::= "(" (apple | orange) ")" ;


PRODUCES

( apple )
( orange )




1.0.1 Subproductions

After the keyword ::= in a definition, a subproduction of any form can be specified between round brackets:

EXAMPLE

S ::= an (apple | orange) is on the (table | desk) ;


PRODUCES

an apple is on the table
an orange is on the table
an apple is on the desk
an orange is on the desk


Subproductions are generated as standalone blocks, that is as they were bound to a non-terminal symbol.

1.0.2 Optional subproductions

A subproduction specified between square brakets is considered as optional and is generated 1 every 2 times, i.e. 50% probability:

EXAMPLE

S ::= an (apple | orange) is on the (table | desk) [in the (living | dining) room] ;


PRODUCES

an apple is on the table
an apple is on the table in the living room
an apple is on the table in the dining room
an orange is on the table
an orange is on the table in the living room


ecc.


Optional subproductions, apart from being generated once every two times, behave as normal subproductions.


1.0.3 Comments

You can write any kind of text within a pair of (* and *) keywords. Such text will be completely ignored by PolyGen.

EXAMPLE

S ::= apple | rainge (* | banana *) | mango ;
(* this is comment too *)


PRODUCES

apple
orange
mango




2.0 Advanced features

PolyGen provides a set of keywords that raises the language expressivity beyond BNF.


2.0.1 Concatenation

The cap ^ can be either prefixed or suffixed to as well as infixed in any point within a production in order to make the program not insert a white space character in the output string:

EXAMPLE

S ::= "(" ^ (apple | orange) ^ ")" ;


PRODUCES

(apple)
(orange)


Concatenation, as a feature, is particularly useful every time you wish to generate words by assembling syllabes or letters coming from different productions:

EXAMPLE

S ::= "I" Verb ^ e Verb ^ ing ;

Verb ::= lov | hat ;


PRODUCES

I love hating
I love loving
I hate hating
I hate loving



Keep in mind that specifying many caps is perfectly equal to specifying one only.



2.0.2 Epsilon

The underscore keyword _ stands for the empty production, formally called epsilon.

EXAMPLE

S ::= ball | _ ;


PRODUCES

ball
_


Notice that an epsilon-production is neither the underscore character itself nor the white space, rather it stands for no output at all - the empty string, if you prefer. The former example is perfectly equivalent to the following:

EXAMPLE

S ::= [palla] ;


PRODUCES

palla
_


That is a grammar generating either a or nothing as output.



2.0.3 Controlling the probability of a production

The plus keyword + , when prefixed to a (sub)production (however nested), raises the probability for it to be generated, in respect to the others of that very series; simmetrically, the minus keyword - lowers it down. Any number of + and - keywords may be specified:

EXAMPLE

S ::= the cat is eating (+ an apple |- an orange | some meat |-- a lemon) ;


PRODUCES

the cat is eating an apple
the cat is eating an orange
the cat is eating some meat
the cat is eating a lemon




The set of sentences that can be produced is as expected; indeed, the definition for the non-terminal symbol S is internally interpretet as follows:

S ::= the cat is eating ( an apple | an apple | an apple | an apple
                        | an orange | an orange
                        | some meat | some meat | some meat
                        | a lemon) ;


as requested, proportion of probabilty raising and lowering holds: the probability for an apple to be generated is higher than an orange, which is higher that some meat, on its turn higher than a lemon.




2.0.4 Unfolding

PolyGen provides a powerful unfolding system which, in general, allows to take a series of productions (otherwise folded either by a subproduction or a non-terminal symbol) to the level of the current sequence.
Roughly, you may look at this operation as at the flattening of a portion of grammar that is performed before the generation and that may thus affect it just as far as the probability are concerned, since the transformation does not alter the source grammar semantics - as the traslation rules in section 4.1.5 confirm.

Not every atom though supports unfolding, rather only those which such operation makes sense for: refer to section 4.1.1 for a syntactical formalization of such subset.

2.0.4.1 Non-terminal symbols

Consider the following scenario:

EXAMPLE

S ::= ugly cat | nice Animal ;

Animal ::= dog | bull | pig ;


PRODUCES

ugly cat
nice dog
nice bull
nice pig


Produced output deserves no surprises, still the chance for ugly cat to be generated is 1 every 2 times, but it is not the same for nice dog, nice bull and nice pig, even though a user may find it reasonable for all them to be generated with the same probability.
The problem is due to productions ugly cat and nice Animal equally sharing the unit of prabability of S: thus the chances for ugly cat to be generated is equal to the chances for nice Animal, i.e. one among nice dog, nice bull and nice pig. In the example above the probability distribution appears as follows:

ugly cat
1/2
nice dog
1/2 * 1/3 = 1/6
nice bull
1/2 * 1/3 = 1/6
nice pig
1/2 * 1/3 = 1/6


As a proof, 1/2 + 1/6 + 1/6 + 1/6 = 1.

In order to equally redistribute prababilities of subproductions, the user should write S this way:

S ::= ugly cat | nice dog | nice bull | nice pig ;

though loosing the original architecture, which used to fold all animals within a non-terminal symbol, and moreover dramatically increasing the amount of editing duties.
In order to solve this problem, itself an instance of the dishomogeneity of probability distribution problem in case of subproductions, the language offers an operator for unfolding non-terminal symbols:

EXAMPLE

S ::= ugly cat | nice >Animal ;

Animal ::= dog | bull | pig ;



Prefixing the > keyword to a non-terminal symbol, during the preprocessing phase the program performs the translation above, changing the probability distribution as follows:

ugly cat
1/4
nice dog
1/4
nice bull
1/4
nice pig
1/4



2.0.4.2 Subproductions

It is not uncommon to use subproductions in order to decrease grammar verbosity, e.g. by collecting a set of phrasal verbs according to the supported preposition.

EXAMPLE

S ::= (walk | pass) through
   
|  look at
   
|  (go | come | move | link | run) to ;


Grammar architecture and scalability increase, on one side, while, on the other side, output quality lowers down, since 1 every 3 times look at will be generated for the same reason discussed in section 2.0.4.1. In order to take output etherogeneity to the desired level, that is where each single verb may be produced with the same probability, the user should avoid round bracket usage at all, so that there would be no more 3 macro-productions, and should suffix the proper preposition to each verb.
For this very purpose, any subproduction may be unfolded analogously to what stated in section 2.0.4.1 regarding non-terminal symbols. The operator > makes the program delegate to the preprocessor the unfolding of the following subproduction, allowing the user to keep the original source architecture unchanged.

EXAMPLE

S ::= >(walk | pass) through
   
|  look at
   
|  >(go | come | move | link | run) to ;


is translated into:

S ::= walk through | pass through
   
|  look at
   
|  go to | come to | move to | link to | run to ;


that is what one would expect: a flat series of productions.

A more complex example could be:

Digit ::= z: 0 | nz: >(1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9) ;

is translated into:

Digit ::= z: 0 | nz: 1 | nz: 2 | nz: 3 | nz: 4 | nz: 5 | nz: 6
       
nz: 7 | nz: 8 | nz: 9 ;


2.0.4.3 Optional subproductions

A subproduction in square brackets (see section 1.0.2) is equal to a subproduction in round brackets that produces either the original content or epsilon (see the example in section 2.1.3).
Thus, unfolding an optional subproduction is perfectly legal and the result is analogous to what said in section 2.0.4.2.


2.0.4.4 Permutable subproductions

As the translation rules in section 4.1.5 reveal, the unfolding is performed by the preprocessor after all the permutations (see section 2.0.7) occured: a permutable subproduction bound to a > operator is therefore first permutated, while the unfolding holds and is then performed at the new position within the sequence.

EXAMPLE

S ::= >{the >(dog | cat)} and {a (fish | bull)} ;


Pay attention to how the two unfoldings, the former inside the curly brackets and the latter outside, behave: the translation is as follows:

S ::= the dog and a (fish | bull)
   |  the cat and a (fish | bull)
   |  (fish | bull) and the dog
   
|  (fish | bull) and the cat ;


2.0.4.5 Deeply unfolded subproductions

As stated in section 2.0.8, deep unfolding leads to a subproduction where everything has been flattened within.
Though, one may sometimes wish to perform a further unfolding: of that very subproduction.

EXAMPLE

S ::=  > >> the (dog | cat) | a (fish | bull) << | an alligator ;


that is translated into:

S ::= the dog | the cat | a fish | a bull | an alligator ;




2.0.5 Attributes

2.0.5.1 Labels and selection

Any (sub)production, however nested, can be bound to a label and it is eventually possible to limit generation to a subset of a series of productions by using the dot operator.

EXAMPLE

S ::= Verb.inf | Verb.ing ;

Verb ::= (inf: to) (eat | drink | jump) (ing: ^ing) ;


PRODUCES

to eat
to drink
to jump
eating
drinking
jumping


Selection simply deletes all (sub)productions bound to a label other than the one selected. More precisely, a selection propagates the label specified on the right hand of the dot operator for the whole generation of what stands on the left of it; during the generation will be considered as valid both those (sub)productions that are bound to no label and those bound to a label that has been selected.
Notice therefore that you're allowed to use selection many times in order to enrich the list of selected labels: such a technique may be useful for propagating some attributes that are wanted to affect generation.

EXAMPLE

S ::= (Conjug.S | Conjug.P).sp | (Conjug.S | Conjug.P).pp ;

Conjug ::= (Pronoun Verb).1 | 
(Pronoun Verb).2 | (Pronoun Verb).3 ;

Pronoun ::= S: (1: "I" | 2: you | 3: (he | she | it))
         |  P: 
(1: we | 2: you | 3: they) ;

Verb ::= (pp: Be) (eat | drink) (sp:
(S: (3: ^s)) | pp: ^ing) ;

Be ::= S: (1: am | 2: are | 3: is) | P: are ;



PRODUCES

I eat
you eat
he eats
she eats
it eats
we eat
they eat
I am eating
you are eating
he is eating
we are eating


etc.


In the example above, assuming labels 1,2,3,S and P respectively identify syntactical forms for the first, second and third persons, singular and plural, we managed to correctly conjugate both simple present and present progressive tenses according to a pronoun.
Notice that whenever you decide to extend the set of verbs, you'll just need to add more radixes in the proper subproduction of Verb.


2.0.5.2 Multiple selection

Reconsider the example in seciton 2.0.5.1: the definition of the non-terminal S simply activates all combos of label pairs S,P and sp,pp for the production of Conjug. In order to avoid such frequent uncomfortable solutions you're allowed to specifiy, on the right of the dot operator, a set of labels in round brackets interleaved by the pipe keyword.
Modifying the example above:

EXAMPLE

S ::= Conjug.(S|P).(sp|pp) ;


Analogously to what stated in section 2.0.3 for grammar productions, it is possible to specify probablity modifiers for labels too, by means of the + and - keywords.

EXAMPLE

S ::= Ogg.(+S|--P).(sp|-pp) ;


That is internally treated as:

S ::= (Conjug.S | Conjug.S | Conjug.S | Conjug.S | Conjug.P).sp
   |  (Conjug.S | Conjug.S | Conjug.S | Conjug.S | Conjug.P).sp
   |  (Conjug.S | Conjug.S | Conjug.S | Conjug.S | Conjug.P).pp ;



2.0.6 Capitalization

It is often needed, mainly for style purposes, to respect capitalization rules, for instance after a point mark.
Nevertheless, a complex grammar architecture, providing recursive productions generating subclauses, may make such operation impossible, unless the user rewrites part of the source.
In order to solve this problem, the language provides the backslash keyword \, which makes the program perform the capitalization of the very following terminal symbol, i.e. switching its first letter to uppercase.

EXAMPLE

S ::= \ smith (is | "." \) Eulogy ^ "." ;

Eulogy
::= rather a smart man
        
|  really a gentleman ;

PRODUCE

Smith is rather a smart man.
Smith. Rather a smart man.
Smith is really a gentleman.
Smith. Really a gentleman.


Keep in mind that capitalization is active until the following terminal symbol resulting from the generation, therefore every other atom (epsilon, concatenation or the capitalization operator itself) found in the middle simply acts as usual.

EXAMPLE

S ::= a \ ^ \ _ b

PRODUCES


aB



2.0.7 Permutation

Many spoken languages allow to change the order of some words (or groups of words) in a sentence holding the original meaning; analogously, at a macroscopic level, it sometimes makes sense to exchange the sentences of a phrase.
In order to avoid the user to rewrite the same sequence over and over exchanging atom positions, specifying some subprodutions within curly brackets { and }, the program automatically performs all the permutations among them.

EXAMPLE

S ::= whether {is} {therefore} {he} ;

PRODUCES


whether is therefore he
whether is he therefore
whether therefore is he
whether therefore he is
whether he therefore is
whether he is therefore


Keep in mind that the permutability of a subproduction only holds within the sequence containing it: no permutation occurs if permutable subproductions are specified in different subsequences (or subprodutions - permutable or not). See the difference in the two example that follow:

EXAMPLE

S ::= {in 10 minutes}^, {at 3 o'clock}^, {"I" {will depart} {alone}} ;

PRODUCES


in 10 minutes, at 3 o'clock, I will depart alone
at 3 o'clock, in 10 minutes, I will depart alone
in 10 minutes, I will depart alone, at 3 o'clock
at 3 o'clock, I will depart alone, in 10 minutes
I will depart alone, in 10 minutes, at 3 o'clock
I will depart alone, at 3 o'clock, in 10 minutes
in 10 minutes, at 3 o'clock, I alone will depart
at 3 o'clock, in 10 minutes, I alone will depart
in 10 minutes, I alone will depart, at 3 o'clock
at 3 o'clock, I alone will depart, in 10 minutes
I alone will depart, in 10 minutes, at 3 o'clock
I alone will depart, at 3 o'clock, in 10 minutes


EXAMPLE

S ::= {in 10 minutes}^, {at 3 o'clock}^, ("I" {will depart} {alone}) ;

PRODUCES


in 10 minutes, at 3 o'clock, I will depart alone
at 3 o'clock, in 10 minutes, I will depart alone
in 10 minutes, at 3 o'clock, I alone will depart
at 3 o'clock, in 10 minutes, I alone will depart


2.0.8 Deep unfolding

The language allows the deep unfolding of a subproduction specified in reverse-doubleangle brackets >> and <<: any atom, however nested, for which the unfolding operation makes sense (see section 2.0.4) is unfolded. As a result, the complete flattening of every subproduction and non-terminal symbol is done:

EXAMPLE

S ::= look at >> the (dog | (sorian | persiancat)
               |(cow | bull | Animal)
              << ;


Animal
::= pig | (weird | ugly) chicken ;



The non-terminal S is translated into:

S ::= look at ( the dog
           
  | the sorian cat
              | the persian cat
              | a cow
           
  | a bull
           
  | a pig
              | a (weird | ugly) chicken
              ) ;


Deeply unfolded subproductions are therefore translated into a subproduction with everything has been recursively flattened in; exceptions are subproductions bound to non-terminal symbols. The reason of this behaviour is dued to deep unfolding being an operation which actually performs a simple unfolding of every (sub)atom for which such operation makes sense: thus, while non-terminals are unfolded, productions bound to them are left untouched. Even though such a policy may seem unjustified at first, it rather allows the user to specify any non-terminal symbol within a reverse-doubleangle subproduction without unintentionally generating either a huge series of unfoldings or - even worse - cyclic unfoldings (see section 3.0.1.3).



2.0.9 Folding

Deep unfolding, as described in section 2.0.8, may sometimes not be what one wishes: on one hand, it would realistically be mostly used to avoid the explicit specification of a > operator for every subproduction or non-terminal symbol within a given subproduction; on the other hand, it still is sometimes impossible to perform a deep unfolding of every (sub)atom without generating (unintentional) errors. The PolyGen grammar definition language allows therefore the user to lock the unfolding (of an atom for which such operation would make sense) by means of the prefix operator <.

EXAMPLE

S ::= look at >> the (dog | <(sorian | persiancat)
               |(cow | bull | <Animal)
              << ;


Animal
::= pig | (weird | ugly) chicken ;


where the non-terminal S is translated into:

S ::= look at ( the dog
           
  | the (sorian | persian) cat
              | a cow
           
  | a bull
           
  | Animal
              ) ;

Keep in mind that folding an unfolded atom and viceversa are syntax errors - see section 4.1.1.





2.1 Advanced techniques

2.1.1 Recursion

It is perfectly legal for you to specify in a production the non-terminal symbol you're defining, in order to make PolyGen recur:

EXAMPLE

S ::= Digit [^ S] ;

Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ;


PRODUCES

0
23
853211
000000
00011122335

etc.


i.e. any natural number made up by a random number of digits randomly chosen between 0 and 9.
Keep in mind then that it is up to you providing a non-recursive production somewhere in order to let the program stop recurring sooner or later, otherwise a cyclic recursion error will be generated by the grammar checker (see section 3.0.1.2).

As an exercize, try to define a grammar for generating variably-lengthened sentences obtained by recursively linking clauses.



2.1.2 Grouping

In order to gain control over the probability distribution in even a more flexible way than how described in sections 2.0.3 and 2.0.4, proper usage of round brakets should be considered:

EXAMPLE

S1 ::= canarin | cow | camel ;

S2 ::= canarin | (cow | camel) ;


PRODUCES

canarin
cow
camel


Although S1 and S2 outputs are equal, the probability distribution for the former is:

cat
1/3
cow
1/3
camel 1/3


while for the latter:

cat 1/2
cow 1/2 * 1/2 = 1/4
camel 1/2 * 1/2 = 1/4


All this because the subproduction (cow | camel) is interpreted someway as a whole block by the program.



2.1.3 Controlling the probability of an optional subproduction

The grammar definition language interpreted by PolyGen does not allow any control over the probability for an optional subproduction to occur. In other words, there is no plus- or minus-like operator for subproductions in square brackets.
Nevertheless there exists a simple technique to get this result:

EXAMPLE

S ::= a (+ _ | beautiful) house ;


PRODUCE

a house
a beautiful house


Being an optional subproduction equivalent to a non-optional one generating either a given output or epsilon (see section 2.0.2), it is possible to make this translation by hand and eventually put plus and minus operators.
In the former example, the probability for nothing to be produced is greater than the probability for beautiful.



2.1.4 Selection reset

Keep in mind that the selection operator adds the specified label to the set of already active ones; this leads to the need to someway manually reset such set. For instance, let generate natural numbers (included zero) of arbitrary length without any non-significative zero:

EXAMPLE

S ::= Digit | S.nz [^ S.] ;

Digit ::= z: 0 | nz: {
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9} ;


PRODUCES

0
1
23
23081993
112358
20020723

ecc.

The dot operator followed by no label resets the set of active selections at the time it is encountered during generation; in other words, it stops propagation of labels selected that far.



3.0 Static checking of grammars

PolyGen features a powerful algorithm performing a static checking of a source file: it is therefore able to verify the correctness of a whole grammar in finite time despite its complexity.
A grammar source succeding the checking phase is guaranteed to always generate a valid output - as kind of a soundness.
Thus, since the checking phase always precedes generation, if the program outputs with no error messages, then the grammar is entirely correct.

Where specified within the message generated by the program, warnings and errors refer to a source text file area between two pairs of coordinates showing a line number and a column number.


3.0.1 Errors

Errors are classified as cases that break the grammar correctness definition.
An error halts the program execution.

3.0.1.1 Undefined non-terminal symbols

Each non-terminal symbol appearing in the right-hand of a definition is checked for actually existence in order to avoid usage of non-terminal symbols that have not been defined.

EXAMPLE

S ::= A | B ;
A ::= a ;


Such grammar generates an error message, since B is not defined.


3.0.1.2 Cyclic recursions and non-termination

The checking algorithm is able to verify each non-terminal symbol to possibly produce an output, i.e. - in other words - that the generation will actually terminate without an infinite recursion.

EXAMPLE

S ::= S | A ;
A ::= B ;
B ::= S | A ;



This grammar could never produce any output, since generation, despite any starting non-terminal symbol, would loop forever.
Other, more tricky, cases may lead to such a subcycle - kind of a circuit:

S ::= a | A ;
A ::= B ;
B ::= A ;



Although here generation may not lead to a cyclic recursion thanks to the terminal a, it is still possible for a non-terminating path to be entered: such cases are therefore signaled by an error message too.


3.0.1.3 Recursive unfoldings

You're not allowed to prefix the unfolding operator > (see section 2.0.4.1) to a non-terminal symbol that would cause a cyclic recursion.

EXAMPLE

S ::= >A ;
A ::= >B ;
B ::= >S ;



Such grammar would lead to an unfolding loop that would expand it infinitely and it therefore generates an error message.


3.0.1.4 Epsilon-productions

A grammar may satisfy the termination clause thanks to an epsilon-production (see section 2.0.2). In such case the grammar is considered to be correct, still it possibly produces an empty output.

EXAMPLE

S ::= A.3 ;

A ::= 1: a | 2: b ;


PRODUCES

_


A destructive selection is not an error itself, since in some situations it might be exactly one's will (see section 3.0.2.2); nevertheless all cases where generation would lead to just epsilon-productions are notified by an error message.


3.0.1.5 Overriding of non-terminal symbols

A grammar is verified to not define the same non-terminal symbol more than once.

EXAMPLE

A ::= apple | orange | banana ;

A
::= mandarin | melon ;


Such grammar leads to an error since A is defined twice.


3.0.1.6 Illegal character

Properly an error generated by the lexer in case a character not belonging to any known token (i.e. not defined by the lexical rules in section 4.1.3) has been detected during the syntactical analisis phase of a source file.



3.0.1.7 Unexpected token

An error generated by the parser in case a valid token in a wrong position has been detected during the syntactical analisis phase of a source file, leading to the breaking of the syntactical rules in section 4.1.1.




3.0.2 Warnings

Warnings are classified as cases that do not break the grammar correctness definition, still they may lead to unexpected results.
Warning messages thus mean not the grammar source to be uncorrect, indeed not robust.
A warning does not halt the program execution.

Warnings are divided into levels, depending on their gravity: the lower a level, the more urgent the warnings grouped within; level 0 contains warnings that cannot be ignored (still not being dangerous for the generation).

3.0.2.0 Level 0

There exists no warning within this group, by now.

3.0.2.1 Level 1

3.0.2.1 Undefined symbol I

The lack of definition for the non-terminal symbol I does not allow the usage of the program -info option.
Such symbol typically specifies an information string for the grammar (its author, title, etc.) and it is a good habit to follow such convention by properly defining it, though it is actually not an error to omitt it.

3.0.2.2 Potential epsilon-productions

There exist cases of a grammar not always generating an epsilon (see section 3.0.1.4), but just possibly.

EXAMPLE

S ::= A.3 | c ;

A ::= 1: a | 2: b ;



PRODUCES

c
_


Such case are not errors and are notified by a warning message.


3.0.2.1 Level 2

3.0.2.3 Useless permutation

In case only one permutable subproduction appears within a sequence (see section 2.0.7), no permutation is actually performed for obvious reasons.

EXAMPLE

S ::= a {b} c ;


Though such case actually is not an error, just a low-gravity warning message is output.


3.0.2.2.2 Potential cyclic recursion

Recursion is rather a powerful and useful tool, probably quite frequently used in complex grammars. Though all non-termination cases caused by a abuse of recursion are trapped by the grammar checker (see section 3.0.1.2), on the other hand, formally legal recursions may be enough tangled to  slow generation down for a sensible amount.
Every single recursion case may therefore be signaled a low-gravity warning, somehow in order to aid the user at localizing unwanted recursions.



4.0 Appendix


4.1.1 Concrete syntax

Type-2 syntax (in BNF notation) of the grammar definition language interpreted by PolyGen and described in this document follows below.
Non-terminal symbols bound to productions are capitalized; non-terminal symbols bound to regular expressions begin with a capital letter (see section 4.1.3); terminal symbols are quoted; S is the starting non-terminal symbol.


S      ::= DEF
        |  DEF S

DEF    ::= Nonterm "::=" PRODS ";"

PRODS  ::= PROD
        |  PROD "|" PRODS

PROD   ::= SEQ
        |  MODIF SEQ

MODIF  ::= "+"
        |  "-"
        |  "+" MODIF
        |  "-" MODIF

LABELS ::= LABEL
        |  LABEL "|" LABELS

LABEL  ::= Label
        |  MODIF Label

SEQ    ::= ATOMS
        |  Label ":" ATOMS

ATOMS  ::= ATOM
        |  ATOM ATOMS

ATOM   ::= Term
        |  "^"
        |  "_"
        |  "\"
        | UNFOLDABLE
        |  ">" UNFOLDABLE
        |  "<" UNFOLDABLE
        |  ATOM "."
        |  ATOM DotLabel
        |  ATOM ".(" LABELS ")"


UNFOLDABLE ::= Nonterm
       
|      "(" PRODS ")"
        |      "[" PRODS "]"
        |      "{" PRODS "}"
        |      ">>" PRODS "<<"




4.1.2 Abstract syntax

For the sake of completeness, the abstract syntax of the language interpreted by PolyGen follows cleared from syntactical sugars and terms involved in preprocessing only.


S      ::= DEF
        |  DEF S

DEF    ::= Nonterm "::=" PRODS ";"

PRODS  ::= SEQ
        |  SEQ "|" PRODS

SEQ    ::= ATOMS
        |  Label ":" ATOMS

ATOMS  ::= ATOM
        |  ATOM ATOMS

ATOM   ::= Nonterm
        |  Term
        |  "^"
        |  "_"
        |  "(" PRODS ")"
        |  ATOM "."
        |  ATOM DotLabel




4.1.3 Lexical rules

Type-3 lexical rules (in regular expression notation) follow.


Term     ::= [a-z 0-9 , '][a-z A-Z 0-9 , ']*
          |  " [A-Z a-z 0-9 ( ) _ - ? . , ! : \ & # + * / % $ � [ ] { } ~ @ ; : | < > = ^ ' \ "]* "

Nonterm  ::= [A-Z][A-Z a-z 0-9 _]*

Label    ::= [A-Z a-z 0-9 _]+

DotLabel ::= . Label


Notice that the white space character can be quoted as a terminal symbol too.



4.1.4 Escape sequences

Regular expression Nonterm in section 4.1.3 recognizes the backslash character within quotes. A terminal symbol is therefore allowed to contain any escape sequence among the following:


\\ backslash
\" quote
\n new line
\r carriage return
\b backspace
\t tab
\xyz ASCII decimal code xyz




4.1.5 Translation rules

As a formal reference, translation rules from concrete syntax to abstract syntax together with proper precedence (where the least number stands for the highest priority) follow. They therefore refer to the operations performed by either the parser (in case of syntactical sugars) or the preprocessor (otherwise).



concrete syntax
abstract syntax

1
A.(+(a1)-(b1)l1|...|+(an)-(bn)ln)
(A.l1 |(1) ... |(w1) A.l1 | ... | A.ln |(1) ... |(wn) A.ln)

where wi = ai - bi - min {a1-b1 ... an-bn}

2
+(a1)-(b1) P1 | ... |+(an)-(bn) Pn
P1 |(1) ... |(w1) P1 | ... | Pn |(1) ... |(wn) Pn

where wi = ai - bi - min {a1-b1 ... an-bn}
3
[P]
(_ | (P))

4
>> P <<
(P')

where P' is isomorph to P where unfoldable atoms are unfolded



5
P1 | ... | A1 {Q1}.l1 ... An {Qn}.ln | ... | Pn
P1 | ... | A1 (Q11).l1 ... An (Q1n).ln
   
| ... | A1 (Qn!n).l1 ... An (Qn!n).ln | ... | Pn

where Qji is the i-th element of the j-th permutation
of
Q1..Qn(with i = 1..n, j = 1..n!)

6
P1 | ... | L: A >(Q1 | ... | Qm).l| ... | Pn
P1 | ... | L: A (Q1).l| ... | L: A (Qm).l| ... | Pn

6
P1 | ... | L: A >X.l| ... | Pn

where X ::= Q1 | ... | Qm
P1 | ... | L: A (Q1).l| ... | L: A (Qm).l| ... | Pn


Where the following notational conventions hold:


P, Q productions or series of productions
A, B atoms o atom (sub)sequences
L, l labels
X, Y
non-terminali symbols
+(n)-(m) juxtaposition of n and m, respectively, + and - operators
P |(1) ... |(n) P n-lengthened series of productions P