
The ModulaTor

Oberon-2 and Modula-2 Technical Publication

Erlangen's First Independent Modula-2 Journal! Nr. 4, May-1994 


Character-less Programming 
__________________________

by K Hopper, RH Barbour, Dept of Computer Science, University of Waikato, 
New Zealand, Email: kh@athens.cs.waikato.ac.nz 

 

Abstract 

This paper discusses issues of programming language translator portability. A 
proposed mechanism to provide for defining and implementing programming 
languages is described. The mechanism depends upon the separation of the 
language, character set and other culturally dependent features from the 
process of translating a program or executing it when translated. 

The mechanism is developed against a background of a review of programming 
languages and their visible manifestation. The proposed mechanism applies the 
principles of abstraction using abstract data types in order to specify some 
aspects of programming languages. A corresponding strategy for the 
implementation of programming language tools and the environment in which 
they may be used is also described. 

 

Introduction 

New techniques are urgently required to provide people of all nations and 
cultures with access to Information Technology. To help meet this need, 
research into portable software for programming language translation has been 
carried out in New Zealand. 

Such software isolates translator activities from the peculiarities of various 
machine and operating system environments. The research questioned what 
was common across cultures in every mapping from an external representation 
of a programmer's source text to an internal encoding suitable for a programming 
language translator. 

Further work is continuing to prepare lists of those common features of 
individual cultures which require specific hardware or software solutions. These 
lists, together with the portable translators referred to, will form the enabling 
technology for each culture to participate in Information Technology by invitation 
and on its own terms. 

 

Cultural Needs 

The need for improved inter-cultural communication poses a challenge to the 
Information Technology community. This need has indeed generated interest 
and effort among international standards groups working on behalf of the IT 
industry and its clients. A recent development from this effort has been the 
recognition of a requirement for a generic set of tools within a coherent 
theoretical framework. These are needed to provide culturally appropriate 
representations of ideas to, from and between computer systems [Bar]. 

A programming language, used to describe to a computer the desired 
manipulation of ideas, is the principal technology currently available. The general 
concept of language is intended to describe the sounds, marks and signs which 
people use for communication. The mechanism proposed for programming 
language translation reflects the nature of the human communication process 
rather than any specific cultural perspective. 

It is suggested, based upon experimental implementation, that the proposed 
mechanisms and techniques are as applicable to those programming languages 
currently in use as to those yet to be developed. 

 

The Expression of Meaning 

Languages vary in complexity and expressive power. All language users rely on the 
notion of 'token-in-shared-context' for determining the meaning of a particular 
utterance (a mark or sign) in a particular context. Examples are two friends in 
idle conversation or a programmer and a translation tool. 

The shared context is provided in two parts :-- 

- The structure of the utterance itself -- that is the syntax of the language. 

- The common milieu of those communicating. For example, when using the 
English word path, the two friends in conversation would probably be referring to 
a walkway between two hedgerows 'over there'. The programmer 
communicating with his computer almost certainly considers that path refers to 
a sequence of tokens used to identify some name space. 

The token made visible as the word path therefore has no intrinsic meaning of 
its own. All English speakers who are computer literate, however, have an 
internal map from visible (audible) token to (at least) the two 
'meanings-in-context' given as examples. 

To extend this notion further, homophones may have different 
'mappings-in-context' just as homographs may. Provided the recipient of the 
utterance knows the context, all is well. However, if the shared context is 
not in fact determinate as, for example, on reading the isolated visible token fin 
-- is it English and part of a fish or aeroplane, or French and this is the end! -- 
then communication is not possible. 

 

Token Abstraction 

Linguists are aware that no token has an intrinsic meaning in isolation from a 
context. This deduction has rarely been applied in computing -- particularly not 
in the design and implementation of computer programming languages. This is 
understandable from a historical perspective. In the past, computer 
representations of visible character sets were in use well before the 
development of language theory as it is known today. The definition of the Algol 
60 language was given in terms of the visible marks in the document because of 
the differences in available character sets. 

Current practice also reflects the natural attitude of programmers who are 
working within a single culture. A programmer would take for granted that other 
people would share both context and the meaning of the tokens used, without 
need for further explanation. 

The idea of token abstraction was first used in the programming language APL. 
It was defined in this way because the richness of the visible forms of the tokens 
used to express it was not available in any computer character encoding at the 
time of its development. The latest version of this language, Extended APL (DIS 
13751 [ISO-a]), follows this scheme by naming tokens rather than giving them 
explicit visible expression in the standard (although, naturally, the names 
themselves are visible when printed). 

An implementation of APL is required to provide a mapping from some external 
representation of a token to the language-specified token. No distinction is 
necessarily made, however, between these two conceptually different things in 
the language implementation. 

The programming language Ada (including its latest revision as Ada-9X [ISO-b]) 
goes a step further in indicating a possible solution to the overall problem 
(although it must be pointed out that this has not yet been taken to its inevitable 
conclusion as seen from the perspective of the suggestions offered in this 
paper). The Ada language defines enumeration types, each value of which is 
itself a token, provided with a position (in the enumeration). In a real sense the 
value of this token is its position in the value domain defined by the type. 
However, Ada goes one step further -- it provides for an explicit representation 
to be defined for each value. Referring to the heading of this section and 
inverting the above statement, the Ada language provides an abstract value for 
each representation (visible/audible token) in an enumeration. 

The Modula-2 programming language (DIS 10514 [ISO-c]) offers a variant, 
though related, definition of the pre-defined language type CHAR which it 
defines as an enumeration. Some of the names for the values of the type are 
pre-defined in the standard while additional names are (if provided) to be 
defined by an implementation. The important point about this for the purposes of 
this paper, however, is that Modula-2 leaves a person carrying out an 
implementation entirely free to choose any suitable representation for the 
transfer of values into and out of a program. 
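
In Modula-2 terms the consequence is that a program can manipulate CHAR 
values purely by their positions in the enumeration, never by any particular 
encoding. A minimal sketch (the module name is illustrative): 

  MODULE CharAsEnum;

  VAR
    c : CHAR;
    n : CARDINAL;

  BEGIN
    c := 'A';
    n := ORD (c);         (* position of the value in the enumeration *)
    c := VAL (CHAR, n);   (* the value at a given position *)
    c := CHR (n)          (* pre-defined shorthand for VAL (CHAR, n) *)
  END CharAsEnum.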

 

Language Tokens and Representation 

While all programming language standards (or other specifying documents) are 
provided as written/displayable text, this is solely for the human reader. If it were 
necessary to provide standards documents for blind people -- unable to see -- 
then it is likely that they would be prepared as audio recordings of the spoken 
word. The language tokens would then be phoneme rather than grapheme 
sequences. 

Similarly, in the foreseeable future programming will become an activity in which 
some design tool generates the syntactic tokens of a language -- not a string of 
lexemes derived from a visible character representation. It is this notion, 
together with those ideas taken from APL, Ada and Modula-2 language 
definitions, which leads to the first part of the proposed solution to token-rich but 
character-less programming. 

Provided that a programming language translator is passed a stream of syntax 
tokens in whatever internal representation it requires, it can attempt to carry 
out the translation desired. The task of providing the internal representation of 
the syntax tokens is a simple one of mapping short sequences of 
graphemes/phonemes into a token. The encoding of the lexemes is irrelevant -- 
provided that a mapping is made available to the translator for its use when 
translating! Using this technique it is possible to produce an almost 
language-independent lexical analyser. The only problems arise where a 
language makes use of constructed tokens (usually only numbers). 

What is needed, therefore, is a list of all of those terminal tokens which exist in a 
programming language -- a list specified as an ordered enumeration in the 
language standard (or other specifying document). A typical list would include 
such tokens as -- Full-stop, End-token, Line-mark, For-token, White-space, 
Procedure-token, Quote-mark. The list, apart from being ordered in the 
language specification, must contain all of the tokens specified by the language, 
including separately those lexemes used in constructing tokens such as 
numbers. 
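
Expressed as a Modula-2 enumeration, the beginning of such a list might read 
as follows (a sketch only; the selection and ordering of token names is 
illustrative, not taken from any standard): 

  TYPE
    Token = (FullStop, EndToken, LineMark, ForToken, WhiteSpace,
             ProcedureToken, QuoteMark,
             DigitZero, DigitOne, ExponentMark);  (* composing lexemes *)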

It is necessary to distinguish tokens in this list from those specified in an 
individual program. Programmer-defined tokens may be divided into the 
following classes :-- 

- Compositions of language defined lexemes into a constructed token. In all 
known programming languages such compositions are restricted to the 
representation in visible form of a numeric value (which requires translator 
processing in order to determine its value). For languages in which such 
compositions are needed then the list of terminal tokens will necessarily include 
such additional language-defined tokens as Digit-zero, Digit-one, 
Exponent-mark, etc. 

- Bracketed data. For such data values a language need only define one or 
more tokens to provide the brackets. The single token to 'toggle' between 
program source and data (the Quote-mark) as used in this sentence is another 
possible mechanism. Many existing languages consider strings of visible 
characters in this kind of way (see Note). 

- All other programmer-defined tokens -- considered as identifiers. 

Note: Where an existing language does interpret the contents of some character 
string in a particular way (eg for I/O formatting) there is always some 'escape' 
token followed by tokens known to the I/O formatting translator -- sometimes 
followed by a further separator to indicate the end of the 'nested' bracketing 
(dependent on the syntax of the formatting language). This is just another 
language for which syntax tokens need definition. The fact that it may be 
incorporated within some programming language translator must not be allowed 
to obscure the fact that it is a different language and, as such, needs definition 
too. 

It is suggested that a lexical analyser has no need to be aware of the external 
form of representation of any token at all. The lexical analyser for a 
(programming) language translator only needs to be given a mapping of the 
external representations for language tokens, including those necessary to 
provide for programmer composable numeric value representations. It has no 
need to be aware of any character encodings used in its host environment -- for 
example any use of codes which a suitable rendering engine may produce as 
Arabic, English, Chinese or any other humanly readable/audible language 
'words' or 'digits' or 'punctuation'. 
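
An interface for such an analyser's mapping might therefore be as small as the 
following Modula-2 sketch (all names are assumptions made for illustration; 
Token is the enumeration sketched earlier, held in a hypothetical module 
LanguageTokens): 

  DEFINITION MODULE LexMap;

  (* Illustrative only: a lexical mapping loaded from a
     culture-specific file and consulted by the analyser. *)

  FROM LanguageTokens IMPORT Token;

  TYPE
    Octet = [0 .. 255];

  PROCEDURE LoadMap (mapFile: ARRAY OF CHAR);
  (* Read the mapping file for the current culture into an
     internal look-up table. *)

  PROCEDURE Lookup (VAR enc: ARRAY OF Octet; length: CARDINAL;
                    VAR tok: Token): BOOLEAN;
  (* Return TRUE and the corresponding token if the encoding
     sequence is in the map; FALSE otherwise (the sequence is
     then an identifier or data). *)

  END LexMap.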

 

Multi-culturalism 

A mapping provided for a translator in this way should be made available in the 
environment in which it is to execute as part of the culture-dependent 
components of the host operating system. A multi-cultural operating system will 
provide, say, for Arab, English and Chinese programmers to work in their own 
natural language alongside each other on the same joint project and all be able 
to use the same programming language tools, including the translator. 

The prospect described in the previous paragraph of joint project working in a 
multi-cultural environment does raise a pair of additional problems which 
need to be addressed if the technique of providing mapping files for lexical 
analysers is to be a useful step towards improving application portability. There 
needs to be :-- 

- An operating system 'lingua franca' so that those multi-cultural users may 
share objects which others have created. 

- A way of sharing concepts identified by some encoding of this lingua franca so 
that users from different cultures can use their own native terminology and 
produce a dynamic addition to what is effectively a project common concept 
name space. 

While this observation may seem to offer cause for reconsideration of the 
practicality of the projected mechanism, it is really only serving to reiterate the 
point made earlier that before communication can take place there must be a 
shared concept about which communication would be meaningful. 

The consequences, therefore, are that two additional facilities are potentially 
needed where joint working is important, whether this is on a joint 
project or a multi-user machine :-- 

- A facility for creating unique identifiers in the name-space concerned when 
creating a new object. This is equivalent to adding a new element to the 
enumerated list of lexical tokens for a programming language definition. It is this 
identifier, known only to the underlying operating system or project data-base 
(eg where a project is under PCTE [ECMA] control), to which a user may attach 
an encoding of an associated lexical token for the associated object. 

- An addition to the lexical mapping for the environment tools which need to find 
and manipulate such an object. 

This, therefore, extends the concept to the dynamic creation (and, necessarily, 
destruction too) of new identifiers and encoding mappings. Such a concept has, 
it may be remarked, been used in 'secret' by the Unix operating system for many 
years -- identifiers for file system objects are contained in uniquely numbered 
and dynamically created entities known as i-nodes. The dynamically extensible 
mapping is provided by a directory system. The technique is thus well-known by 
implementers and this paper is merely suggesting that it can be extended into 
other areas for the purposes of promoting application portability and the goals of 
culturally sensitive Information Technology products in general. 

 

The Mechanism in Practice 

Whether used statically in relation to the definition of a (programming) language 
standard, or dynamically for common access to a project name-space, the 
technique described is intended to be used by the appropriate tool(s) in the 
following general manner. 

The initialization phase of a lexical analyser/generator involves first finding (in 
an environment dependent way) the mapping file for the currently defined 
culture. The contents of the map must then be read into some internal data 
structure for later use as a lexicon for analysis (or generation) of external 
representation encodings. 

During subsequent translation of some unit or program the actions of a lexical 
analyser are as follows -- the encodings received in the source stream are 
matched against the mappings and further lexical actions proceed in the way 
outlined below :-- 

- If a match is found then the appropriate token value (not the encoding) is used 
in further processing. 

- If the lexeme found is needed to start a token construction then subsequent 
lexemes are used in such construction as may be required by language-specific 
rules until a non-composing lexeme is detected. 

- If the token is a start-data-delimiter then subsequent encodings (these are not 
tokens since there is no mapping for them) are collected as data (for 
interpretation by some other language engine as needed at some later time) 
until an appropriate end-data-delimiter token is detected. 

- Every other sequence of encodings is treated as an identifier token and (most 
likely) looked-up/entered into the translator identifier table. 

Note: Comments in the source of a translation unit could be treated as data 
which may or may not be discarded for the purposes of the translator. 
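
In Modula-2 terms the analysis loop just described might be skeletonised as 
follows (a sketch only: EncodingSequence and the routines NextEncoding, 
InMap, IsComposing, ComposeToken, CollectData and EnterIdentifier are all 
assumed helpers, not part of any defined interface): 

  PROCEDURE GetToken (VAR tok: Token);
  VAR
    enc: EncodingSequence;          (* raw encodings, never characters *)
  BEGIN
    NextEncoding (enc);
    IF InMap (enc, tok) THEN
      IF IsComposing (tok) THEN
        ComposeToken (tok)          (* eg build up a numeric value *)
      ELSIF tok = QuoteMark THEN
        CollectData (tok)           (* gather raw data encodings until
                                       the end-data-delimiter is seen *)
      END
    ELSE
      EnterIdentifier (enc, tok)    (* look up / enter in identifier table *)
    END
  END GetToken;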

Further translation syntactic and semantic processing can then take place in the 
normal way. 

For the generation of a stream of lexemes, similar (though rather simpler) rules 
of operation may be described. 

Note that no characters at all are needed in this description or in the actual 
processing involved in implementing such a mechanism -- merely an encoding 
map together with three auxiliary lexical rules, only the composition and data 
delimiting rules being language-dependent as would be expected. 

Annex A contains a set of guidelines for use by programming language and data 
abstraction designers and tool implementers, based on the above description of 
operation. The formal specification of the format for a universal file to contain 
such a mapping for any (programming) language and cultural environment is 
given in Annex B and an example programming language lexical token 
enumeration in Annex C. 

The format specification is suitable for standardisation so that anyone with a 
modicum of skill (and/or intelligent tools) can define a mapping file for a 
language which is sensitive to the local culture, devices and operating system. 
Providing facilities to enable this kind of thing to be done locally is a very 
important part of being culturally sensitive in spreading the benefits of IT to a 
wider community. 

Note 1: Any need to translate the source form of a program exported from some 
other site therefore requires a copy of the mapping file used originally. 

Note 2: Any need to read the source form will also require a suitable rendering 
engine for the encoding used -- possibly also someone who knows the language 
involved to interpret the lexical tokens presented. 

 

Data Manipulation 

The suggested language-defined translator map completely specifies a 
character-less (and hence considerably more portable and culturally-sensitive) 
translator mechanism. It does not at first sight appear to solve the allied problem 
of data manipulation by an executing program. 

Where data appeared in the source of a program, the translator lexical analyser 
merely collected encodings in a totally transparent manner. After all, it is the 
manipulation of such unknown data, together with that read from some input 
channel during processing, which is the principal purpose of a program. The 
production of data through some output channel is only of related concern if it is 
intended for use by some other tool (eg a rendering engine producing sounds or 
visible marks for human users). 

The encoding of data, whether as part of some program source or as data 
obtained/generated during program execution, is therefore of major concern to 
the writer of a program -- but only insofar as the actual encoding employed may 
be 'understood' by the program. The solution to this problem given above in the 
case of a programming language translator (which is, after all, only another 
program), relied upon providing a mapping from encodings to lexical tokens in a 
completely portable, transparent, manner. 

While it would, perhaps, be ideal for the specification of every program to 
indicate its input/output languages in such a way, this is unlikely to be practical 
for some time to come (if ever). As an interim solution, therefore, it will be 
necessary to provide a simple, practical mechanism which will improve upon the 
present situation -- ie it must be possible to determine the incoming tokens from 
the external representation. 

The current work being undertaken world-wide to collect a number of abstract 
concepts which are shared by more than one culture offers, it is suggested, a 
small step in the right direction. If, for each shared abstraction, a unique 
Abstract Data Type (ADT) is provided as part of the environment within which a 
program executes then the program may make use of such a type and its 
standard operations without any concern for the culturally specific interpretation 
of such a concept. Similarly, in addition to not needing to be concerned with 
local interpretation, the program does not need to 'understand' the syntax or 
encoding of any human interpretable form of the abstraction -- that is hidden 
inside the implementation of the lexical analyser/generator for import/export of 
values of the abstraction. 
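
As an illustration, a shared abstraction such as calendar time might be offered 
to programs as a Modula-2 definition module of the following kind (a sketch 
under assumption: neither the module nor its operations exist in any standard 
library): 

  DEFINITION MODULE CalendarTime;

  (* A culturally shareable abstraction.  The human-readable
     form of a value is hidden behind ReadTime/WriteTime, which
     consult the locally installed encoding map and grammar. *)

  TYPE
    Time;                              (* opaque type *)

  PROCEDURE Add (t: Time; ticks: CARDINAL): Time;
  PROCEDURE Earlier (t1, t2: Time): BOOLEAN;

  PROCEDURE ReadTime (VAR t: Time; VAR ok: BOOLEAN);
  (* Internalise a value from the current input channel using
     the culture-specific syntax rules. *)

  PROCEDURE WriteTime (t: Time);
  (* Externalise a value in the local culture's form. *)

  END CalendarTime.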

The implementer of the ADT facility for some specific culture and system will, for 
lexical analysis/generation purposes, need to initialise an internal mapping of a 
culturally-specific encoding in exactly the same way that a language translator 
lexical analyser had to read its lexical mapping. Once again, there is no need for 
the culturally-specific ADT implementer to be aware of what form an encoding 
may take. 

This notion of providing input and output of values of an abstract data 
type is not, however, quite as simple as it may seem. Until a general 
technique can be developed for describing language syntax dynamically in a 
simple tabular manner, the implementation of the ADT must contain its 
own culturally appropriate syntax rules for the internalisation and externalisation 
of values of the type. That this is current practice is, of course, known -- but 
the separation of the production/analysis of the lexis for the ADT's 
human-interpretable language now merely needs to conform to the rules given 
for the language translator lexical analyser mapping file. 

The use of ADTs in this way for compiler lexical token lists and for cultural 
service lexical token lists implies that the provider of a computing environment in 
which these operate specifies, in a mapping file, only the encodings which 
represent the concepts concerned. This in turn implies that these may be 
rendered in a visible form in any natural language. It seems, therefore, that this 
offers the opportunity for a system programmer who is a member of any culture 
to prepare mapping files provided only that he/she shares the concepts used in 
defining either a programming language or an ADT. 

 

Are Characters Really Necessary? 

While the discussion in this paper has so far hardly referred to the word at all, 
characters are necessary, although only in a very limited way -- far more limited 
than much current thinking about programming would indicate. 

A quick review of any average selection of programs written in whatever 
programming language will reveal that, with very few exceptions, characters are 
currently used for the following :-- 

- Expressing in visible form in program source some complete or partial 
message for export to a human (or other text reading entity) -- eg ''Death in 
family!''. 

- For use in converting to/from some internal token or value (converting a 
number or date, say). 

- Either individually or in combination as a substitute syntax token (in much the 
same way as programs are currently written, just as some form of textual data). 

The relation of these three to comments/data, language tokens and value 
construction as specified for programming languages is evident. What, perhaps, 
is not so obvious is that these have essentially all been eliminated from 
application programs which :-- 

- Provide for 'messages' to be externalised as a list of messages in the same 
order as some internal enumeration (see the sketch after this list). Naturally this 
is done to permit the use of different natural languages and orthographic forms 
of message. Such messages then become a program-specific token list and mapping! 

- Use culturally shareable ADTs which embody conversion analyser/generator 
code as described earlier. 

- Merely use encoding pattern matching to determine whether or not some 
encoding sequence matches a known syntax token. This uses encodings but 
not characters! 
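
The first of these points might be realised as simply as this (a sketch; the 
enumeration and procedure are invented for illustration): 

  TYPE
    Message = (FileMissing, DiskFull, OutOfMemory);

  PROCEDURE Report (msg: Message);
  (* Emit the externally held text found at position ORD (msg)
     in a separately prepared, culture-specific message list;
     the program itself never touches characters. *)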

The only point where characters are actually needed, therefore, is as visible 
forms (where an encoding uses a suitable rendering engine) for human 
interpretation/generation. This could occur, for example, where a human user is 
preparing a list of messages for some program to use. Note that the list is 
prepared using the visible marks on a display, say -- but that what is really 
wanted is the encoding of the message. The using program merely uses the 
encoding, not the characters. 

In essence, programming is not a task which involves characters. In limited 
circumstances detailed manipulation of character encodings may be required, 
but this is almost exclusively restricted to the use and preparation of mappings 
to lexical tokens, tokens which have some representation internal to a using 
program which is not the concern of any external entity. 

 

Words and Strings 

It is proposed, as a final part of the interim solution using ADTs suggested 
earlier, that two very special ADTs be made mandatory in every program 
execution environment -- Word and String, where a String is defined as a 
sequence of words separated by white space encodings. 

From the point of view of the human user, the ability of a computer to organise 
visible text in accordance with some culturally appropriate rules is of paramount 
importance. The organised list of names, etc is a very important feature of many 
human activities. 

Strangely enough, these two ADTs (and others which may be derived from 
them) are the only ones which necessarily use the concept of character as 
understood by the human user. The need is to divide the encoding stream into 
sub-components which, when applied to a rendering engine, result in the 
generation of a single identifiably separate mark representing what is called a 
character. Once this has been done, the human concept of character is needed 
in the semantics of the operations on these objects, which include such things 
as ordering and equality/inequality testing. 
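
A Modula-2 rendering of the two proposed ADTs might begin like this (a 
sketch; all names are assumptions): 

  DEFINITION MODULE WordString;

  (* Word: a sequence of encodings which renders as a sequence
     of identifiably separate marks (characters).
     String: a sequence of Words separated by white space. *)

  TYPE
    Word;      (* opaque *)
    String;    (* opaque *)

  PROCEDURE Equal (w1, w2: Word): BOOLEAN;
  PROCEDURE Precedes (w1, w2: Word): BOOLEAN;
  (* Culturally appropriate collation order. *)

  PROCEDURE WordCount (s: String): CARDINAL;
  PROCEDURE NthWord (s: String; n: CARDINAL): Word;

  END WordString.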

 

Summary 

The paper has identified an urgent need for the expansion of information 
technology to all nations and cultures. As the result of some New Zealand 
research into translator portability, it has been possible to propose a generic 
mechanism to extend the way in which the lexis of programming language 
standards is defined to improve the cultural sensitivity of implementations. A 
similar approach to the definition of shared cultural concepts permits greater 
cultural independence of application programs written using existing 
programming languages. The key components of this are :-- 

- A named list of lexical tokens for a language. 

- A locally-defined mapping between representation encodings and the items in 
this list. 

- Where separate translation of program components is permitted, a tool to 
map between different cultural encodings of names defined for use in another 
component may be needed. 

- Application of the above principles in the design and implementation of ADTs 
which are shared by more than one culture. 

 

Acknowledgments 

Many thanks are owed to those kind people who have listened to our often 
tentative explanations of embryo ideas -- D Andrews, A la Bonte, R Hicks, H 
Jespersen, RJ Mathis, PJ Plauger, K Pronk, P Rabin, K Simonsen, M 
Woodman, D Wong. Much of the credit for the ideas expressed is due to them; 
any errors are ours. 

 

Bibliography 

[Bar] Barbour RH, Cunningham S-J & Ford G: Maori word-processing for 
indigenous New Zealand young children, British Journal of Educational 
Technology, (2), p114-124, 1993 

[ECMA] European Computer Manufacturers Association: Portable Common 
Tools Environment -- Abstract Specification, ECMA-149, June 1993 

[ISO-a] ISO/IEC: Information Technology -- Programming Languages, their 
environments and system software interfaces -- Programming Language 
Extended APL, CD 13751 draft, 26 Aug 1993 

[ISO-b] ISO/IEC: Information Technology -- Programming Languages, their 
environments and system software interfaces -- Programming Language Ada, 
CD 8652:1993 draft, 16 Sep 1993 

[ISO-c] ISO/IEC: Information Technology -- Programming Languages, their 
environments and system software interfaces -- Programming Language 
Modula-2, DIS 10514 draft, 31 Jan 1994 

[ISO-d] ISO/IEC: Information Technology -- Universal Multiple-Octet Coded 
Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, 
ISO/IEC 10646-1, 1 May 1993 

[ISO-e] ISO/IEC: Information Technology -- Programming languages and their 
environments and system software interfaces -- Vienna Development 
Method/Specification language, Part 1 -- Base language, CD 13817-1 draft, Dec 
1993 

 

Annex A 

Design Guidelines 

The following paragraphs contain suggested guidelines which, if adopted by the 
designers of programming languages and implementers of translators, should 
lead to the benefits described in this paper. 

It is important to recognise that all tools which provide abstractions of human 
concepts are responsible for translating the values embodied in those concepts 
between the form in which they may be uttered/absorbed by a human and some 
internal concretisation of the abstraction concerned. In this respect a 
programming language is one of the most complex abstractions currently 
handled by computer software tools. As such it requires greater effort in 
providing a tool -- but that tool must operate on the same principles as one 
which merely recognises a simpler abstraction. 

 

Language Design 

The designer of a programming language which has a specified form of 
concrete representation should consider the following :-- 

- What are the legal tokens required by the language? These do not include 
user-defined tokens (which may be called names or identifiers) -- which are 
defined in a language dependent manner in terms of the legal tokens in this list 
and other previously defined tokens in the program unit being translated. 

- What are the needs of the language for determining a meaning from some 
primitive token structure in order to produce a composed language token? For 
example, what is the set of character tokens used in presenting a numeric value 
to a language translator? Is this a sub-language to be treated by a separate 
recogniser? 

- What subordinate (sub-)languages are used in the definition of the 
programming language itself? For example, does some input/output formatting 
sub-language also need definition? 

- For the programming language itself and for each sub-language as defined 
above, produce an enumeration of all of the tokens which are either a legal 
token or needed to compose a legal token. Define this as the lexis for the 
language. 

 

Language Implementation 

The implementer of a programming language should consider treating each of 
the sub-languages defined (eg I/O formatting) as a separate abstraction to be 
handled in the way defined below for data abstractions in general. 

The implementation of the lexical analyser should be initialised by reading a 
lexical token mapping file as defined in Annex B, and should then use that map 
for look-up purposes in determining the lexical tokens needed for syntax analysis. 

Where some part of the input stream is determined to be data for interpretation 
by some other sub-language abstraction, then the raw encodings should be 
passed to that as required without modification. 

It should be noted that the look-up tables provided in this way are very similar to 
the lexicon or dictionary provided for natural language users. 

 

Data Abstraction Design 

It is most important that the concept for which an abstraction is being designed 
is fully and unambiguously defined. There is, however, little point in designing an 
abstraction around a concept which is not shared by at least two cultures. It is, 
for example, important to distinguish between the concept of time as a 
countable series of regularly occurring events (eg clock ticks) and time as a 
sequence of events which are ordered but not regular -- these two are entirely 
different concepts of time -- both equally valid, just different abstract concepts. 

Once a concept has been fully defined, this definition should be standardised. 

The grammar (syntax) used for humanly visible/audible forms of utterance of 
values of the abstraction will differ from culture to culture and a registry of such 
grammars should be set up and maintained. It is important that all such 
grammars include a 

Employing the same mapping file principle as used for programming languages, 
individual cultures are then free to prepare appropriate mappings as may be 
required. 

 

Annex B 

Proposed Lexical Mapping File Format 

The format described below for the contents of a mapping file to provide the 
lexical independence described is based upon the fact that all standard 
character encodings (for example those defined in ISO/IEC 10646-1 [ISO-d]) 
occupy an integral number of octets. 

The lexical mapping file contains a file header followed by one or more mapping 
entries as described in the following sections. There shall be as many entries as 
there are lexical tokens specified in the lexis of the language for which the 
map-file has been created. The effect of there being any lesser number of 
entries is tool-dependent. 

 

File header 

Apart from the first octet in this header, two encodings are defined. These two 
are specified so that they may not form part of any external encoding which the 
file defines. They are only used by the translation tool which converts this 
mapping file into a lexical look-up facility. As with the external encoding entries 
defined after the header, their meaning is determined by their location in the 
mapping file. 

- First octet -- this shall be an unsigned binary numeric value which specifies the 
number of octets of which all other entries in the file are to be a multiple. This is 
called size in the following sub-paragraphs. 

- The next size octets -- this shall be an encoding for a separator token which is 
not used elsewhere in the mapping file as a component of any other external 
encoding for a token. 

- The next size octets -- this shall be an encoding for a token to be known as an 
'alternate token'. The encoding for this alternate token shall not be used 
anywhere else in the mapping file as a token with any other meaning than the 
following. The alternate token shall indicate that for the entry (defined below) in 
which it appears, the external encoding following it is a valid alternative external 
encoding to the one(s) preceding it. 

- The next size octets -- a separator token. 
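
Reading this header is straightforward; a Modula-2 sketch follows (NextOctet is 
an assumed routine delivering successive octets of the mapping file, and 
MaxSize an assumed upper bound on size, not part of the format): 

  CONST
    MaxSize = 4;

  TYPE
    Octet    = [0 .. 255];
    Encoding = ARRAY [1 .. MaxSize] OF Octet;
    MapHeader = RECORD
      size      : CARDINAL;   (* octets per encoding unit *)
      separator : Encoding;
      alternate : Encoding;
    END;

  PROCEDURE ReadHeader (VAR hdr: MapHeader);
  VAR
    i    : CARDINAL;
    skip : Octet;
  BEGIN
    hdr.size := NextOctet ();                           (* first octet *)
    FOR i := 1 TO hdr.size DO
      hdr.separator[i] := NextOctet ()                  (* separator token *)
    END;
    FOR i := 1 TO hdr.size DO
      hdr.alternate[i] := NextOctet ()                  (* alternate token *)
    END;
    FOR i := 1 TO hdr.size DO skip := NextOctet () END  (* trailing separator *)
  END ReadHeader;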

 

File Entries 

Each entry in the file consists of one or more encoding sequences followed by a 
separator. Where there is more than one sequence in an entry then each 
sequence is separated from a following one by an alternate token as defined in 
the file header. Each encoding sequence must be a multiple of size octets. 

 

Formal Definition 

The following abstract syntax (in the form specified by CD 13817 [ISO-e]) 
formally defines the required file format in such a way that it may be made 
concrete by implementing the data structures defined as streams of binary digits 
where the abstract type Binary-digit is implemented as a single bit. 

types
  Map-file ::
    header : Map-header
    entries : Map-entry+

  Map-header ::
    size : Encoding-sizes
    separator : Map-code
    alternate : Map-code
    terminator : Map-code
  
  inv mk-Map-header(size,sep,alt,term) =^
    (sep = term) &
    (sep # alt) &
    (sep = SEPARATOR) &
    (alt = ALTERNATE) &
    (size = OCTET-COUNT)
  
  Encoding-sizes =
    N
  
  inv sz : Encoding-sizes =^
    exists bin : Octet . (bin = sz)
  
  Map-code =
    Octet+

  inv mp : Map-code =^
    len mp = OCTET-COUNT
  
  Octet =
    Binary-digit*
  
  inv oct : Octet =^
    len oct = 8

  Binary-digit = (SET | CLEAR)
  
  values
    SEPARATOR : Map-code = <File-dependent>
    ALTERNATE : Map-code = <File-dependent>
    OCTET-COUNT : Encoding-sizes = <File-dependent>
  
Note: These three values must, of course, conform to the invariants not only in 
their own type-definitions, but also to that in the type Map-header; they are 
otherwise arbitrarily chosen according to the practical needs of the encodings 
being defined in the mapping file. 

types
  Map-entry ::
    encode : Encoding
    alternates : [Option*]
    term : Map-code

  inv mk-Map-entry(enc,alts,term) =^
    (forall elem in set inds enc .
       (enc(elem) # SEPARATOR) & (enc(elem) # ALTERNATE)) &
    ((alts # nil) =>
       forall opt in set inds alts .
         let mk-Option(_,val) = alts(opt) in
           forall elem in set inds val .
             (val(elem) # SEPARATOR) & (val(elem) # ALTERNATE)) &
    (term = SEPARATOR)

  Encoding = Map-code+
  
  Option ::
    all-sep : Map-code
    val : Encoding
  
  inv mk-Option(alt,_) =^
    (alt = ALTERNATE)
 

 

Annex C 

Example Language Lexical Token Enumeration 

The following listing of lexical tokens for the Modula-2 language is merely an 
exemplar for which the source material was readily available to the authors. 
Following the listing there is some discussion about the way in which the entries 
in the encoding file could be made. 

types

  Modula-lexeme =
    Digit-0 | Digit-1 | Digit-2 | Digit-3 |
    Digit-4 | Digit-5 | Digit-6 | Digit-7 |
    Digit-8 | Digit-9 | Digit-10 | Digit-11 |
    Digit-12 | Digit-13 | Digit-14 | Digit-15 |
    Hex-number-mark | Octal-number-mark | Character-mark |
    Exponent-mark | Single-quote | Double-quote |
    White-space | Comment-start | Comment-end |
    Source-code-directive-start | Source-code-directive-end | Colon |
    Comma | Ellipsis | Equals | Period | Semicolon |
    Left-parenthesis | Right-parenthesis | Left-bracket | Right-bracket |
    Left-brace | Right-brace | Assignment-operator | Plus-operator |
    Minus-operator | Logical-disjunction | 
    Multiplication-operator | Division-operator |
    Logical-conjunction | Logical-negation | Inequality | Less-than |
    Greater-than | Less-than-or-equal | 
    Greater-than-or-equal | Dereferencing |
    AND-SY | ARRAY-SY | BEGIN-SY | BY-SY |
    CASE-SY | CONST-SY | DEFINITION-SY | DIV-SY |
    DO-SY | ELSE-SY | ELSIF-SY | END-SY |
    EXIT-SY | EXCEPT-SY | EXPORT-SY | FINALLY-SY |
    FOR-SY | FORWARD-SY | FROM-SY | IF-SY |
    IMPLEMENTATION-SY | IMPORT-SY | IN-SY | LOOP-SY |
    MOD-SY | MODULE-SY | NOT-SY | OF-SY |
    OR-SY | PACKEDSET-SY | POINTER-SY | PROCEDURE-SY |
    QUALIFIED-SY | RECORD-SY | REM-SY | RETRY-SY |
    REPEAT-SY | RETURN-SY | SET-SY | THEN-SY |
    TO-SY | TYPE-SY | UNTIL-SY | VAR-SY |
    WHILE-SY | WITH-SY |
    ABS-Ident | BITSET-Ident |
    BOOLEAN-Ident | CARDINAL-Ident | CAP-Ident | CHR-Ident |
    CHAR-Ident | COMPLEX-Ident | CMPLX-Ident | DEC-Ident |
    DISPOSE-Ident | EXCL-Ident | FALSE-Ident | FLOAT-Ident |
    HALT-Ident | HIGH-Ident | IM-Ident | INC-Ident |
    INCL-Ident | INT-Ident | INTERRUPTIBLE-Ident | INTEGER-Ident |
    LENGTH-Ident | LFLOAT-Ident |
    LONGCOMPLEX-Ident | LONGREAL-Ident |
    MAX-Ident | MIN-Ident | NEW-Ident | NIL-Ident |
    ODD-Ident | ORD-Ident | PROC-Ident | PROTECTION-Ident |
    RE-Ident | REAL-Ident | SIZE-Ident | TRUE-Ident |
    TRUNC-Ident | UNINTERRUPTIBLE-Ident | VAL-Ident
 

 

Remarks 

These are the hundred and thirty or so lexical tokens of the Modula-2 
programming language. Some of them are permitted more than one encoding 
by the nature of the language, for example the visible marks '#' and '<>' are, as 
currently defined, valid alternatives in an 'English' representation of the 
Inequality lexical token. 
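
For instance, with size = 1 and (arbitrarily chosen) header values of 00H for the 
separator and 01H for the alternate token, the map-file entry for Inequality in an 
ASCII-based 'English' mapping could be the octet sequence 

  23H 01H 3CH 3EH 00H

that is '#', alternate-token, '<', '>', separator. 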

White-space too has several possibilities including a space (' '), a horizontal tab 
and a line mark (however that may be indicated/detected in a particular 
operating system). 

Note that in an 'English' version of a mapping file there are a couple of 
potentially awkward problems as the letter 'C' is used as a Character-mark as 
well as Digit-12. Such duplication is, however, peculiar to a particular form of 
visible representation which is always context dependent in a composition. 
Disambiguation of the context dependency is a simple problem within the value 
composing routine. 

The way in which the enumeration has been defined is designed to show that, 
for example, Arabic or Chinese digit codings (as well as Roman digits, etc) 
could be used in an appropriate culture (and made visible by an appropriate 
local rendering engine) for Digit-zero, Digit-one, etc! 

[Ed. note: This article was submitted for publication in the ModulaTor by the 
authors on 10-Mar-94.] 

________________________________________________________________

IMPRESSUM: The ModulaTor is an unrefereed journal. Technical papers are to be taken as working papers and personal rather than organizational statements. Items are printed at the discretion of the Editor based upon his judgement on the interest and relevancy to the readership. Letters, announcements, and other items of professional interest are selected on the same basis. Office of publication: The Editor of The ModulaTor is Guenter Dotzel; he can be reached by tel/fax: [removed due to abuse] or by mailto:[email deleted due to spam]