
7. X-Symbol Internals

This section is outdated: it currently describes Version 3.4.2 of X-Symbol.

Package X-Symbol is distributed in two ways. End-users should use the binary package which contains pre-compiled files. X-Symbol developers should use the source package which contains some additional files.

7.1 Internal Representation of X-Symbol Characters  How X-Symbol represents X-Symbol chars.
7.2 Defining X-Symbol Charsets  How X-Symbol defines additional chars.
7.3 Defining Input Methods  How X-Symbol defines the input methods.
7.4 Extending Package X-Symbol  How to add fonts and token languages.
7.5 Various Internals  How X-Symbol handles other aspects.
7.6 Design Alternatives  Why X-Symbol is not designed differently.
7.7 Language Internals  How X-Symbol handles languages.
7.8 Miscellaneous Internals  Various. TODO.

7.1 Internal Representation of X-Symbol Characters

As mentioned in 6.1 Pseudo Token Language "x-symbol charsym", most functions do not operate on X-Symbol characters directly; they use "x-symbol charsyms" instead. Each charsym has a symbol property x-symbol-cstring which points to a string, called cstring, containing the X-Symbol character.

If the character is also an 8bit character in some encoding (see section 3.2.2 File Coding of 8bit Characters), the charsym also has the symbol property x-symbol-file-cstrings for the representation in the file and the property x-symbol-buffer-cstrings to recognize character aliases (see section 3.2.7 Character Aliases). E.g., under XEmacs/no-Mule, where `\335' is Yacute and `\251' is copyright, we get

(get 'Idotaccent 'x-symbol-file-cstrings)
     => (iso-8859-9 "\335" iso-8859-3 "\251")
(get 'Idotaccent 'x-symbol-buffer-cstrings)
     => (iso-8859-9 "\234\335" iso-8859-3 "\235\251")

The values are plists (see section `Property Lists' in XEmacs Lisp Reference Manual) mapping the file coding to the strings in the file or the buffer, respectively.
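As a minimal sketch of how such a plist can be queried, the following uses only standard Emacs Lisp functions. The charsym my-demo-charsym and the helper my-demo-file-cstring are hypothetical names introduced for this illustration; the property value is copied from the example above.

```elisp
;; Hypothetical charsym for illustration only; in a running X-Symbol
;; session the property is already set on the real charsym.
(put 'my-demo-charsym 'x-symbol-file-cstrings
     '(iso-8859-9 "\335" iso-8859-3 "\251"))

(defun my-demo-file-cstring (charsym coding)
  "Return the file representation of CHARSYM under CODING, or nil."
  (plist-get (get charsym 'x-symbol-file-cstrings) coding))

(my-demo-file-cstring 'my-demo-charsym 'iso-8859-3)   ; => "\251"
(my-demo-file-cstring 'my-demo-charsym 'iso-8859-1)   ; => nil
```

A coding that is not listed simply yields nil, which is how a caller could tell that the character is not an 8bit character in that encoding.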

After token languages have been initialized, the charsym also has the symbol properties x-symbol-tokens (see section 3.1 Token Language) and x-symbol-classes (see section 3.6 Character Group and Token Classes):

(get 'Idotaccent 'x-symbol-tokens)
     => (sgml "İ" tex "{\\.I}")
(get 'Idotaccent 'x-symbol-classes)
     => (sgml (non-l1) tex (text aletter))

7.2 Defining X-Symbol Charsets

An X-Symbol charset, called cset in the code and the docstrings, handles one font used by package X-Symbol. Each cset must use the same char registry-encoding as the corresponding variables for the fonts (see section 2.9 Lisp Coding when Using Other Fonts).

You have to tell X-Symbol how to define Mule charsets with Emacs or XEmacs/Mule and which leading character to use with XEmacs/no-Mule. As an example, we use the definition of the Adobe symbol font.

(defvar x-symbol-xsymb0-cset
  '((("adobe-fontspecific") ?\233 -3600)
    (xsymb0-left  "X-Symbol characters 0, left"  94 ?:) .
    (xsymb0-right "X-Symbol characters 0, right" 94 ?\;)))

Mule charsets (see section `Charsets' in XEmacs Lisp Reference Manual) may be used for 94 or 96 characters (this example: 94; only charsets of dimension 1 can be defined with X-Symbol). Thus, if your font provides more characters, you are likely to use both the left and the right half of the font to define two Mule charsets. For both of them, you have to define a unique, free final character/byte of the standard ISO 2022 escape sequence designating the charset (this example: `:' and `;'). The remaining free final characters (reserved by Emacs for users) are `>' and `?'; the latter is already used in XEmacs.

For XEmacs/no-Mule, you have to define the leading character (this example: `\233').
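The layout of such a definition can be taken apart with plain list functions. The following is only a sketch: the names my-demo-cset and my-demo-cset-summary are hypothetical, and reading the third element of the first sublist as the cset score is an assumption based on the description of scores in section 7.3.3.

```elisp
;; Copy of the definition above under a hypothetical name, so that the
;; sketch runs without X-Symbol being loaded.
(defvar my-demo-cset
  '((("adobe-fontspecific") ?\233 -3600)
    (xsymb0-left  "X-Symbol characters 0, left"  94 ?:) .
    (xsymb0-right "X-Symbol characters 0, right" 94 ?\;)))

(defun my-demo-cset-summary (cset)
  "Return a plist describing CSET, which has the form (INFO LEFT . RIGHT)."
  (let ((info  (car cset))
        (left  (cadr cset))
        (right (cddr cset)))            ; nil if only one half is defined
    (list :registry (car (car info))
          :leading  (nth 1 info)        ; leading char for XEmacs/no-Mule
          :score    (nth 2 info)        ; assumed to be the cset score
          :finals   (list (nth 3 left) (nth 3 right)))))

(plist-get (my-demo-cset-summary my-demo-cset) :finals)   ; => (58 59)
```

The two numbers are the character codes of `:' and `;', the final characters chosen for the left and right half.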

Cset definitions using only the upper halves of fonts whose corresponding Mule charsets are known and whose characters are considered 8bit characters in the corresponding encoding, see 3.2.2 File Coding of 8bit Characters.

Cset definitions using both halves of fonts for which no corresponding Mule charsets are known yet.

7.3 Defining Input Methods

This is probably the hardest section in this manual....

7.3.1 Defining Input Methods: Objectives  Input methods should be intuitive/consistent.
7.3.2 X-Symbol Character Descriptions: Example  An example introducing char descriptions.
7.3.3 Defining Input Methods by Character Descriptions  The aspects and the contexts of a character.
7.3.4 Defining Input Methods: Example  A complete example defining input methods.
7.3.5 Customizing Input Methods  How to customize the input methods.

7.3.1 Defining Input Methods: Objectives

Input methods should be intuitive. This requires consistency:

Observation: It is impossible to define the input methods directly, i.e., by something like define-key, especially since character definitions can be loaded later on. The solution is an indirect definition with "character descriptions".

7.3.2 X-Symbol Character Descriptions: Example

As an example for "character descriptions", look at the definition of longarrowright in x-symbol-xsymb1-table (`95' is the encoding in the font and not of interest here). Some terms are defined in the next section:

(longarrowright 95
 (arrow) (size big . arrowright) nil ("->" t "-->") (emdash))

With this definition, package X-Symbol automatically defines:

Suppose this character were missing in package X-Symbol and you wanted to define it yourself (in your own font). With the current scheme, the one line above is enough! Have fun defining all the consequences directly instead....

7.3.3 Defining Input Methods by Character Descriptions

Characters are defined with character descriptions which consist of different aspects and contexts, which can also be inherited from a parent character. All characters which are connected with parents, form a component. Aspects and contexts are used to determine the modify-to and rotate-to chain for characters, the contexts for input method Context and Electric, the key bindings, and the position in the Menu and the Grid.

If you want to check the component, scores, etc. of a specific character, look at the symbol properties (e.g., with M-x hyper-apropos-get-doc) of the corresponding charsym, e.g., arrowright. See also the docstrings of x-symbol-init-cset and x-symbol-init-input.

Remember, all characters which are connected with parents, form a component. Contexts are the contexts of input method Context (see section 4.7 Input Method Context: Replace Char Sequence). If a table entry of a charsym does not define its own contexts, they are the same as the contexts of the charsym in an earlier position in the modify chain (see below), or the contexts of the first charsym with defined contexts in the modify chain. The modify context of a charsym is the first context.

Characters in the same component whose aspects only differ by their direction (east,...), a key in this alist, are circularly connected by "rotate-to". The sequence in the rotate chain is determined by rotate scores depending on the values in the rotate aspects. Charsyms with the same "rotate-aspects" are not connected (charsyms with the smallest modify scores are preferred).

(get 'longarrowright 'x-symbol-rotate-aspects)
     => (-1500 direction east)

Characters in the same component whose aspects only differ by their size (big,...), shape (round, square...) and/or shift (up, down,...), keys in this alist, are circularly connected by "modify-to", if all their modify contexts are used exclusively, i.e., no other modify chain uses any of them. The sequence in the modify chain is determined by modify scores depending on the values in the modify aspects, the charsym score defined in the definition tables and the score of the whole cset (see section 7.2 Defining X-Symbol Charsets).

(get 'longarrowright 'x-symbol-score)
     => -3500
(get 'longarrowright 'x-symbol-modify-aspects)
     => (1500 shift nil shape nil size big)

Otherwise, the "modify chain" is divided into modify subchains, which are those charsyms sharing the same modify context. All modify subchains using the same modify context, build a horizontal chain whose charsyms are circularly connected by "modify-to".

We build a key chain for all contexts (not just modify contexts), consisting of all charsyms (sorted according to modify scores) having the context. Input method Context modifies the context to the first charsym in the key chain.

If there is only one charsym in the key chain, C-= plus the context inserts the charsym. Otherwise, we determine a suffix for each charsym in the key chain by its index and this string. C-= plus the context plus the suffix inserts the charsym.
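The construction of key chains and suffixes can be sketched in a few lines. This is a simplified model and not X-Symbol's implementation: the data are the charsyms of the example in section 7.3.4 with their contexts already resolved (inherited ones included), the suffix is assumed to be the one-based index written as a digit, and all my-demo-* names are hypothetical.

```elisp
(require 'cl-lib)
(require 'seq)

;; Each entry is (NAME MODIFY-SCORE CONTEXT...).
(defvar my-demo-charsyms
  '(("1w" 150 "a" "c") ("2w" 200 "b") ("3w" 350 "b")
    ("1e" 100 "a" "b") ("2e" 250 "a" "b") ("3e" 300 "a")
    ("1n" 100 "d" "c") ("2n" 200 "c")))

(defun my-demo-key-chain (context)
  "Return the names of all charsyms having CONTEXT, sorted by modify score."
  (mapcar #'car
          (sort (seq-filter (lambda (entry) (member context (cddr entry)))
                            my-demo-charsyms)
                (lambda (a b) (< (cadr a) (cadr b))))))

(defun my-demo-bindings (context)
  "Return an alist (KEY . NAME) of the keys typed after C-= for CONTEXT."
  (let ((chain (my-demo-key-chain context)))
    (cond ((null chain) nil)
          ;; A single charsym needs no suffix.
          ((null (cdr chain)) (list (cons context (car chain))))
          ;; Otherwise the index in the key chain becomes the suffix.
          (t (cl-loop for name in chain
                      for index from 1
                      collect (cons (format "%s%d" context index) name))))))

(my-demo-key-chain "b")   ; => ("1e" "2w" "2e" "3w")
(my-demo-bindings "d")    ; => (("d" . "1n"))
```

The first element of each key chain is also the charsym that input method Context produces for that context.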

7.3.4 Defining Input Methods: Example

An example:  Modify  Modify  Rotate  Rotate  Modify   Other
             Score   Aspect  Score   Aspect  Context  Contexts
charsym 1w    150     nil     100     west    `a'      `c'
charsym 2w    200     nil     100     west    `b'       -
charsym 3w    350     big     100     west   (`b')     (-)
charsym 1e    100     nil     200     east   (`a')    (`b')
charsym 2e    250     big     200     east    `a'      `b'
charsym 3e    300     big     200     east    `a'       -
charsym 1n    100     nil     300     north   `d'      `c'
charsym 2n    200     big     300     north   `c'       -

Assuming that all charsyms form one component, we have:

Rotate chains:     (1w,2w)-1e-1n and 3w-(2e,3e)-2n.
Modify chains:     1w-2w-3w and 1e-2e-3e and 1n-2n.
Horizontal chains: 1e-1w-2e-3e       (for modify context `a')
                   2w-3w             (for modify context `b')
Key chains:        1e-1w-2e-3e       (for context `a')
                   1e-2w-2e-3w       (for context `b')
                   1n-1w-2n          (for context `c')
                   1n                (for context `d')

That makes the following bindings:

Rotate-to: 1w->1e, 2w->1e, 1e->1n, 1n->1w
           3w->2e, 2e->2n, 3e->2n, 2n->3w
Modify-to: 1e->1w, 1w->2e, 2e->3e, 3e->1e     (horizontal chain)
           2w->3w, 3w->2w                     (horizontal chain)
           1n->2n, 2n->1n  (modify chain with exclusive modify contexts)
CONTEXTS:  `a'->1e, `b'->1e, `c'->1n, `d'->1n
KEY:       `a1'=1e, `a2'=1w, `a3'=2e, `a4'=3e, `b1'=1e, ..., `d'=1n
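The rotate-to bindings listed above can be reproduced with a small sketch. It is a simplified model, not X-Symbol's implementation: each charsym is reduced to its modify score, modify aspect and rotate score as given in the table, and the names my-demo-rotate-table and my-demo-rotate-to are hypothetical.

```elisp
(require 'seq)

;; Each entry is (NAME MODIFY-SCORE MODIFY-ASPECT ROTATE-SCORE).
(defvar my-demo-rotate-table
  '(("1w" 150 nil 100) ("2w" 200 nil 100) ("3w" 350 big 100)
    ("1e" 100 nil 200) ("2e" 250 big 200) ("3e" 300 big 200)
    ("1n" 100 nil 300) ("2n" 200 big 300)))

(defun my-demo-rotate-to (name)
  "Return the name of the charsym that NAME rotates to, or nil."
  (let* ((entry  (assoc name my-demo-rotate-table))
         (aspect (nth 2 entry))
         (rotate (nth 3 entry))
         ;; Candidates share the modify aspect but have another direction.
         (others (seq-filter (lambda (e)
                               (and (eq (nth 2 e) aspect)
                                    (/= (nth 3 e) rotate)))
                             my-demo-rotate-table))
         ;; Take the next larger rotate score, wrapping around at the end.
         (scores (sort (mapcar (lambda (e) (nth 3 e)) others) #'<))
         (next   (or (seq-find (lambda (s) (> s rotate)) scores)
                     (car scores))))
    ;; Among several charsyms with that direction, prefer the one with
    ;; the smallest modify score.
    (car (car (sort (seq-filter (lambda (e) (= (nth 3 e) next)) others)
                    (lambda (a b) (< (cadr a) (cadr b))))))))

(my-demo-rotate-to "1n")   ; => "1w"
(my-demo-rotate-to "3w")   ; => "2e"
```

Both 2e and 3e have the direction east and the aspect big, so 3w rotates to the one with the smaller modify score.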

7.3.5 Customizing Input Methods

When defining contexts for characters, you should try to use default contexts to make them and key bindings as consistent as possible. E.g., package X-Symbol only defines explicit contexts for 186 of the 437 characters.

Defines default scores and bindings for characters of a group (see section 3.6 Character Group and Token Classes). E.g., the definition (in x-symbol-latin1-table)

(aacute 225 (acute "a" Aacute))

defines aacute without any explicit contexts, but having the group acute and the subgroup `a'. The default input for the group is defined by the following element in this variable:

(acute 0 "%s'" t "'%s")

That means: 0 is added to the normal "modify-score" of the character. `%s'' and `'%s' with `%s' substituted by the subgroup, i.e., `a'' and `'a', are the contexts for aacute. The context `'a' is also used for input method Electric since it is prefixed by t.
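The substitution described here can be tried out with a short sketch. The function my-demo-default-contexts is hypothetical; it assumes that a t in the group element marks only the context immediately following it as a context for input method Electric.

```elisp
(defun my-demo-default-contexts (element subgroup)
  "Expand the group ELEMENT for SUBGROUP.
ELEMENT has the form (GROUP SCORE ITEM...), where each ITEM is a format
string or t.  Return a list (SCORE CONTEXTS ELECTRIC-CONTEXTS)."
  (let ((score (cadr element))
        contexts electric marked)
    (dolist (item (cddr element))
      (if (eq item t)
          (setq marked t)               ; the next context is electric
        (let ((context (format item subgroup)))
          (push context contexts)
          (when marked (push context electric))
          (setq marked nil))))
    (list score (nreverse contexts) (nreverse electric))))

(my-demo-default-contexts '(acute 0 "%s'" t "'%s") "a")
;; => (0 ("a'" "'a") ("'a"))
```

With the subgroup `A' of Aacute, the same element would yield the contexts `A'' and `'A'.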

It is quite unlikely that a one-character context is not the prefix of another context, at least when loading additional font definitions. In order not to have to change key bindings C-= key to C-= key 1, it is required that the length of the key binding without C-= is at least 2.

7.4 Extending Package X-Symbol

In this section, you are told what to consider and what to do when extending package X-Symbol with new characters and new token languages. If you only want to define a token language using existing characters, you only have to read the last section.

7.4.1 Extending X-Symbol with New Fonts  How to add fonts to X-Symbol.
7.4.2 Guidelines for Input Definitions  Guidelines for input definitions.
7.4.3 Emacs Lisp File Defining a New Font  How to define new characters in a file.
7.4.4 Emacs Lisp File Extending a Token Language  Extending an existing language.
7.4.5 Emacs Lisp File Defining a New Token Language  Defining a new language.

7.4.1 Extending X-Symbol with New Fonts

If you add a new token language to package X-Symbol which should represent tokens by characters which are not yet defined by package X-Symbol, you have to add a new font to package X-Symbol, first.

When adding new fonts to package X-Symbol, consider that X-Symbol has to run under Emacs, XEmacs/Mule and XEmacs/no-Mule.

Running under Emacs and XEmacs/Mule means that you cannot use all encodings in a font for characters: you should probably only use encodings 33 to 126 and 160 to 255. You should also use a unique pair of charset properties `CHARSET_REGISTRY' and `CHARSET_ENCODING'.
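A font table can be checked against this recommendation with a small helper. The function names are hypothetical, and the sketch only encodes the two ranges mentioned above; it assumes table entries whose second element is the encoding, as in the tables of section 7.3.

```elisp
(defun my-demo-usable-encoding-p (code)
  "Return non-nil if CODE lies in the range 33..126 or 160..255."
  (or (<= 33 code 126) (<= 160 code 255)))

(defun my-demo-unusable-entries (table)
  "Return the entries of TABLE whose encoding is outside the usable ranges."
  (delq nil (mapcar (lambda (entry)
                      (unless (my-demo-usable-encoding-p (cadr entry))
                        entry))
                    table)))

(my-demo-unusable-entries '((first 33) (second 128) (third 255)))
;; => ((second 128))
```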

Running under XEmacs/no-Mule can lead to problems when major modes do not check whether the previous character is an escape character (in our case, a leading character, see section 7.1 Internal Representation of X-Symbol Characters) when looking at a character. Thus, you should probably not use encodings which represent characters in your default font with a special syntax.

You have to tell package X-Symbol which fonts to use for the normal text, subscripts and superscripts. See section 2.9 Lisp Coding when Using Other Fonts.

You have to tell X-Symbol how to define Mule charsets with Emacs and XEmacs/Mule and which leading character to use with XEmacs/no-Mule. See section 7.2 Defining X-Symbol Charsets.

7.4.2 Guidelines for Input Definitions

Read section 7.3 Defining Input Methods. Look at the tables in `x-symbol.el'. Here are some guidelines of how to define the input methods for new characters:

  1. Define reasonable character groups for new characters, see 3.6 Character Group and Token Classes. E.g., if you add the IPA font for phonetic characters, you are likely to define at least one additional character group. If you do not know whether to use one or two groups for a set of characters, use two.

  2. Define under which Grid/Menu header the characters of the new character group should appear. You may also want to add additional headers for these characters. See section 3.6 Character Group and Token Classes.

  3. If reasonable, define default contexts for characters of a group, see 7.3.5 Customizing Input Methods.

  4. For the other characters, define contexts by Ascii sequences which look similar to the character.

  5. Form a component for a set of characters which are strongly related to each other. In most cases, characters of a component are in the same group but not vice versa. E.g., the simple arrows already defined by package X-Symbol form one component. You form a component of characters by specifying parents in their definition, see 7.3.3 Defining Input Methods by Character Descriptions.

  6. Use aspects to describe the new characters. Add new aspects to x-symbol-modify-aspects-alist and x-symbol-rotate-aspects-alist if necessary (see section 7.3.3 Defining Input Methods by Character Descriptions).

  7. Finish the definition of your font file (see section 7.4.3 Emacs Lisp File Defining a New Font), load it with M-x load-file, and initialize the input methods, e.g., by invoking the grid (M-x x-symbol-grid).

  8. If there are no errors, you are likely to get warnings about equal modify scores. In this case, the sequence of characters in the modify-to chain is random, so are the numerical suffixes of key bindings.

    1. Define a base score for the whole X-Symbol charset ("cset score") which should be a positive number in order not to change the key bindings of previously defined X-Symbol characters.

    2. Define reasonable scores for newly defined aspects and character groups.

    3. Finally, fine-tune your definitions by charsym scores in the tables. This should be necessary only for a few characters.

7.4.3 Emacs Lisp File Defining a New Font

Now put all things together in a separate font definition file. You should not put it in a language definition file.

Here is a tiny example using only the lower half of the font:

(provide 'x-symbol-myfont)
(defvar x-symbol-myfont-fonts
  ;; The font specification is missing here; for the expected format,
  ;; see section 2.9 Lisp Coding when Using Other Fonts.
  ...)
(defvar x-symbol-myfont-cset
  '((("xsymb-myfont") ?\200 1000)
    (myfont-left "My font characters, left"  94 63) . nil))

(defvar x-symbol-myfont-table
  '((longarrownortheast 33 (arrow) (size big . arrownortheast))
    (koerper 34 (setsymbol "K"))
    (circleS 35 (symbol "S") nil nil "SO")))
(x-symbol-init-cset x-symbol-myfont-cset x-symbol-myfont-fonts
                    x-symbol-myfont-table)

Due to an XEmacs bug with char syntax inherit, you should also add the following line to files `x-symbol-xmas20.el' and `x-symbol-xmas21.el':

  (modify-syntax-entry ?\200 "\\" (standard-syntax-table))

7.4.4 Emacs Lisp File Extending a Token Language

If you want to use the new font to extend an existing token language, define a new token language which inherits most variables from the "parent language". E.g., token language utex inherits most variables from tex, see `x-symbol-utex.el'.

A language must define variables for all language aspects, see 7.7 Language Internals. Our example defines a language mytex using the additional characters from 7.4.3 Emacs Lisp File Defining a New Font.

First, you have to register the language in a startup file:

(defvar x-symbol-mytex-name "My TeX macro")
(defvar x-symbol-mytex-modes nil)
(x-symbol-register-language 'mytex 'x-symbol-mytex x-symbol-mytex-modes)

The language definition file should look like (leaving out most parts which are similar to the ones in `x-symbol-utex.el'):

(provide 'x-symbol-mytex)
(require 'x-symbol-tex)
(defvar x-symbol-mytex-required-fonts '(x-symbol-myfont))
(put 'mytex 'x-symbol-font-lock-keywords 'x-symbol-tex-font-lock-keywords)

(defvar x-symbol-mytex-user-table nil)
(defvar x-symbol-mytex-myfont-table
  '((longarrownortheast (math arrow user) "\\longnortheastarrow")
    (koerper (math letter user) "\\setK")
    (circleS (math ordinary amssymb) "\\circledS")))
(defvar x-symbol-mytex-table
  (append x-symbol-mytex-user-table
          '(nil)
          x-symbol-mytex-myfont-table
          x-symbol-tex-table))

It is important that you do not define a variable for the language access x-symbol-font-lock-keywords, but rather use the variable of the parent language directly, see 7.7 Language Internals.

During the testing phase, you should probably leave out the `'(nil)' which prevents warnings about redefinitions for the following elements.

7.4.5 Emacs Lisp File Defining a New Token Language

You might also want to define a new token language not based on another language.

As an example, consider a token language "My Unicode" (myuc) for buffers with major mode myuc-mode. Thus, we register the language by:

(defvar x-symbol-myuc-name "My Unicode")
(defvar x-symbol-myuc-modes '(myuc-mode))
(x-symbol-register-language 'myuc 'x-symbol-myuc x-symbol-myuc-modes)

Each token of language myuc consists of `#' plus the hexadecimal representation of the Unicode value, where the case of the digits is not important and the preferred case is upcase. A single `#' is represented by the token ##. In order to be more flexible, we want to define the tokens by their decimal value in the table. There are no subscripts and no images. The code below (`x-symbol-myuc.el') is included in the source distribution of package X-Symbol.

(provide 'x-symbol-myuc)
(defvar x-symbol-myuc-required-fonts nil)
(defvar x-symbol-myuc-modeline-name "myuc")
(defvar x-symbol-myuc-class-alist
  '((VALID "My Unicode" (x-symbol-info-face))
    (INVALID "no My Unicode" (red x-symbol-info-face))))
(defvar x-symbol-myuc-font-lock-keywords nil)
(defvar x-symbol-myuc-image-keywords nil)

(defvar x-symbol-myuc-case-insensitive 'upcase)
(defvar x-symbol-myuc-token-shape '(?# "#[0-9A-Fa-f]+\\'" . "[0-9A-Fa-f]"))
(defvar x-symbol-myuc-exec-specs '(nil (nil . "#[0-9A-Fa-f]+")))
(defvar x-symbol-myuc-input-token-ignore nil)

(defun x-symbol-myuc-default-token-list (tokens)
  (list (format "#%X" (car tokens))))
(defvar x-symbol-myuc-token-list 'x-symbol-myuc-default-token-list)
(defvar x-symbol-myuc-user-table nil)
(defvar x-symbol-myuc-xsymb0-table
  '((alpha () 945) (beta () 946)))
(defvar x-symbol-myuc-table
  (append x-symbol-myuc-user-table x-symbol-myuc-xsymb0-table))
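To see what the table entries turn into, the token function can be tried in isolation. The sketch below repeats its body under the hypothetical name my-demo-myuc-token-list so that it runs without X-Symbol; the regexp is the first one of x-symbol-myuc-token-shape above.

```elisp
(defun my-demo-myuc-token-list (tokens)
  "Return the token list for the decimal Unicode value in (car TOKENS)."
  (list (format "#%X" (car tokens))))

(my-demo-myuc-token-list '(945))   ; => ("#3B1"), the token for alpha
(my-demo-myuc-token-list '(946))   ; => ("#3B2"), the token for beta

;; Lowercase digits are accepted as well when a token is matched.
(string-match-p "#[0-9A-Fa-f]+\\'" "#3b1")   ; => 0
```

The format directive %X produces uppercase digits, which agrees with the preferred case upcase.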

7.5 Various Internals

7.5.1 Tagging Insert Commands for Token and Electric  Don't break input methods Token and Electric.
7.5.2 Avoiding Hide/Show-Invisible Flickering  Moving cursor in invisible commands.

7.5.1 Tagging Insert Commands for Token and Electric

Input methods Token (see section 4.2 Input Method Token: Replace Token by Character) and Electric (see section 4.8 Input Method Electric: Automatic Context) stop their auto replacement if you use a command which is not an insert command.

These commands and commands aliased to them are recognized as input commands by having a non-nil value of their symbol property x-symbol-input.
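A user-defined insert command could be tagged in the same way. This is a sketch under the assumption that setting the property with put is sufficient; the command my-demo-insert-arrow is hypothetical.

```elisp
(defun my-demo-insert-arrow ()
  "Hypothetical user command which inserts the characters of an arrow."
  (interactive)
  (insert "->"))

;; Tag the command as an insert command, so that input methods Token and
;; Electric are not stopped by it.
(put 'my-demo-insert-arrow 'x-symbol-input t)

(get 'my-demo-insert-arrow 'x-symbol-input)   ; => t
```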

7.5.2 Avoiding Hide/Show-Invisible Flickering

Starting a command makes a previously revealed super- or subscript command (see section 5.1 Super- and Subscripts) invisible again. Repeatedly invoking commands which move the point just by a small amount can lead to some flickering.

If the point position after the execution of these commands is still "at" the super- or subscript command, the command won't be made invisible in the first place. Each of these four commands has a function (1+ or 1-) as the value of its symbol property x-symbol-point-function which returns the position "after" when called with the position "before".
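The mechanism can be sketched as follows, assuming that the property is looked up with get and the function is called with the current position. The command my-demo-step-right and the helper my-demo-point-after are hypothetical and not part of X-Symbol.

```elisp
(defun my-demo-step-right ()
  "Hypothetical command moving point one character to the right."
  (interactive)
  (forward-char 1))

;; The function maps the position before the command to the one after it.
(put 'my-demo-step-right 'x-symbol-point-function '1+)

(defun my-demo-point-after (command before)
  "Return the position COMMAND would reach from BEFORE, or nil if unknown."
  (let ((fn (get command 'x-symbol-point-function)))
    (and fn (funcall fn before))))

(my-demo-point-after 'my-demo-step-right 10)   ; => 11
```

Knowing the position in advance is what allows the decision not to hide the super- or subscript command before the command has actually run.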

7.6 Design Alternatives

This section describes potential design alternatives and why they were not used.

7.6.1 Alternative Token Representations  Why we need the conversion.
7.6.2 Alternative Ways to Turn on X-Symbol Globally  How to turn on X-Symbol globally.
7.6.3 Alternative Auto Conversion Methods  When do we convert automatically.

7.6.1 Alternative Token Representations

Package X-Symbol represents tokens in the file by characters in the buffer. This requires an automatic conversion when visiting a file or saving a buffer, see 3.2 Conversion: Decoding and Encoding.

Another possibility would be to use the tokens directly in the buffer and just display them differently. You would need no conversion and you could copy the text easily to a message buffer. This could be done by a special face and an additional font-lock keyword for every token. The disadvantages make this approach unfeasible:

Another possibility would be to adapt TeX to the representations of the corresponding characters in Emacs' buffer. Again, you would need no conversion. The disadvantages make this approach too restrictive:

A third alternative would be very similar to the method used in this package. There would be just a slight difference when running under XEmacs/no-Mule: the internal representation of a character is always just one character, but we would also provide font properties for characters not of your default font. The disadvantages make this approach too unsafe:

7.6.2 Alternative Ways to Turn on X-Symbol Globally

This package hooks itself into hack-local-variables-hook which makes the installation very simple.
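As a rough sketch of this kind of hook-based activation (not the code X-Symbol uses), a function added to hack-local-variables-hook can decide per buffer whether to turn something on. The variable my-demo-mode-on and the function my-demo-turn-on-maybe are hypothetical, and the test for text-mode merely stands in for whatever condition a package would apply.

```elisp
(defvar-local my-demo-mode-on nil
  "Non-nil if the hypothetical demo mode is turned on in this buffer.")

(defun my-demo-turn-on-maybe ()
  "Turn on the demo mode if the major mode of the buffer is suitable."
  (when (derived-mode-p 'text-mode)
    (setq my-demo-mode-on t)))

;; The hook runs after the local variables of a file have been
;; processed, so the function also sees settings made in the file.
(add-hook 'hack-local-variables-hook #'my-demo-turn-on-maybe)
```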

Another possibility would be to use the major-mode hooks, which is the normal way to turn on a minor mode. The disadvantages are:

Another possibility would be to hook X-Symbol into find-file-hooks, as it is done in old versions of package X-Symbol. It would be as easy as the current approach but we would have to be careful with the sequence of functions in find-file-hooks, especially with the function hooked in by font-lock.

7.6.3 Alternative Auto Conversion Methods

Without package crypt, this package automatically decodes tokens when turning on the minor mode (in hack-local-variables-hook, see section 7.6.2 Alternative Ways to Turn on X-Symbol Globally) or in after-insert-file-functions. This package automatically encodes characters in write-region-annotate-functions. The disadvantage is that the possibility to change buffers in write-region-annotate-functions is not official (see section 9.2.3 Wishlist: Changes in Emacs/XEmacs), i.e., not mentioned in the docstring (only mentioned for corresponding encode-functions of package format which use a similar loop in the C code).

With package crypt, this package automatically decodes tokens when turning on the minor mode and automatically encodes characters in write-file-hooks. The disadvantages are that the encoding is slower (use jka-compr instead of crypt) and that there is a problem with vc-next-action (see section 8.2 Spurious Encodings).

Without package crypt, Version 2.6 of this package automatically encoded characters in write-file-data-hooks. The advantage was that changing buffers there is official, the disadvantage is that it is also more complicated.

A totally different method would be to use package format. Unfortunately, this is not really possible, since a regexp in format-alist is much too weak, i.e., X-Symbol's decoding does not change any file headers which would represent the file format. In XEmacs, this package also fails to work properly with jka-compr and crypt.

7.7 Language Internals

In order to use a token language or to access one of the language-dependent values, the following conditions must be met:

Language-dependent values are accessed by language accesses:

Returns the language-dependent value. Also initializes the language if necessary. E.g., we get the name of a language by the language access x-symbol-name. With a simplified expansion, we get

(x-symbol-language-value 'x-symbol-name 'tex)
     ==> (symbol-value (get 'tex 'x-symbol-name))
     => (symbol-value 'x-symbol-tex-name)
     => "TeX macro"
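The simplified expansion can be mimicked with standard functions. In this sketch, my-demo-tex, my-demo-tex-name and my-demo-language-value are hypothetical stand-ins; unlike x-symbol-language-value, the helper does not initialize the language.

```elisp
;; The variable holding the language-dependent value.
(defvar my-demo-tex-name "TeX macro")

;; The language access is a property of the language symbol; its value
;; is the symbol naming the variable.
(put 'my-demo-tex 'x-symbol-name 'my-demo-tex-name)

(defun my-demo-language-value (access language)
  "Return the value of the variable registered for ACCESS in LANGUAGE."
  (symbol-value (get language access)))

(my-demo-language-value 'x-symbol-name 'my-demo-tex)   ; => "TeX macro"
```

Because the property only names a variable, a derived language can point an access at the variable of its parent language instead of defining its own.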

List of all language accesses. A token language must define all variables accessed by language accesses. A language access is a property of the language symbol, its value is the symbol naming a variable whose value is used.

If the language is a derived language, e.g., like language utex, the language access x-symbol-font-lock-keywords should point directly to the variable of the parent language (here tex), see file `x-symbol-utex.el'.

7.8 Miscellaneous Internals

TODO. This is currently just a collection of unrelated stuff.

Characters might also define a subgroup which is a string defining some order on characters in the same group (see section 3.6 Character Group and Token Classes) and is also used for default contexts/bindings (see section 7.3.5 Customizing Input Methods).

Lists all valid character groups. Under Emacs and XEmacs/Mule, this list also determines the syntax of characters.

The character group could probably also be used to define character categories if they are implemented in XEmacs.

This document was generated by Christoph Wedler on December, 8 2003 using texi2html