LaTeX/Special Characters


In this chapter we will tackle matters related to input encoding, typesetting diacritics and special characters.

In this chapter, "special characters" means all symbols other than A-Za-z0-9 and English punctuation marks.

This chapter is tightly linked with the font encoding issue. You should have a look at Fonts on the topic.

Some languages usually need a dedicated input system to ease document writing. This is the case for Arabic, Chinese, Japanese, Korean and others. This specific matter will be tackled in Internationalization.

The rules for producing characters with diacritical marks, such as accents, differ somewhat depending on whether you are in text mode, math mode, or the tabbing environment.

Input encoding

A technical matter

Most modern computer systems allow you to input letters of non-ASCII alphabets directly from the keyboard. If you type these special characters in your LaTeX source file and compile it, you may notice that they do not get printed at all.

A LaTeX source document is a plain text file. A computer stores data in a binary format, that is a sequence of bits (0 and 1). To display a plain text file, we need a code which tells which sequence of bits corresponds to which sequence of characters. This association is called input encoding, character encoding, or more informally charset.

For historical reasons, there are many different input encodings. Unicode is an attempt to unify them with a single specification that covers every symbol known from human history. Unicode itself only defines code points, that is, one number per symbol; it does not define how those numbers are represented in binary. That is the job of the Unicode encodings. There are several of them, UTF-8 being one.

The ASCII encoding is an encoding which defines 128 characters on 7 bits. Its widespread use has led the vast majority of encodings to have backward compatibility with ASCII, by defining the first 128 characters the same way. The other characters are added using more bits (8 or more).

This is actually a big issue: if you do not use the right encoding to display a file, it will show weird characters. Most programs try to guess the encoding statistically by analyzing frequent sequences of bits. Sadly, this is not 100% safe. Some text editors do not bother guessing and just use the OS default encoding. Keep in mind that other people might not be able to display your input files directly on their computer, because their default encoding for text files is different. This does not mean that they cannot use another encoding, only that it has to be configured. For example, the German umlaut ä is encoded as 132 on OS/2 and as 228 in Latin1, while in the Cyrillic encoding cp1251 this letter does not exist at all. Therefore you should choose your encoding with care.

The following table shows the default encodings for some operating systems.

Operating system                             Default encoding
                                             Western Latin      Cyrillic
Modern Unices (*BSD, Mac OS X, GNU/Linux)    utf-8              utf-8
Mac (before OS X)                            applemac           maccyr
Unix (old)                                   latin1             koi8-ru
Windows                                      ansinew, cp1252    cp1251
DOS, OS/2                                    cp850              cp866nav

UTF-8 and Latin1 are not compatible. If you open a Latin1-encoded file as UTF-8, it will display odd symbols, but only where you used accents: both encodings are ASCII supersets, so they encode the plain letters the same way. There aren't many advantages in using Latin1 over UTF-8, which is technically superior. UTF-8 is also becoming the most widely used encoding (on the Web, in modern Unices, etc.).

Dealing with LaTeX

TeX uses ASCII by default. But 128 characters are not enough to support non-English languages. TeX has its own way of handling this, with commands for every diacritical mark (see Escaped codes). But if we want accents and other special characters to appear directly in the source file, we have to tell TeX that we want to use a different encoding.

Several encodings are available to LaTeX. In the following we will assume you want to use UTF-8.

To specify the encoding, make sure your text editor saves the file as UTF-8, then declare the encoding in the preamble:

\usepackage[utf8]{inputenc}

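The inputenc package [2] tells LaTeX what the text encoding of your .tex files is. Since the April 2018 release of LaTeX, UTF-8 is the default input encoding, so on current installations this line is optional for UTF-8 files.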
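As a minimal sketch, a complete UTF-8 document for pdfLaTeX might look like the following. The T1 font encoding line is an addition not discussed above (see Fonts); it is assumed here so that accented letters are taken from real glyphs in the font.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}    % assumption: 8-bit font encoding with accented glyphs
\usepackage[utf8]{inputenc} % the source file is saved as UTF-8
\begin{document}
% Accented characters typed directly from the keyboard:
Café, naïve, Ångström, Straße.
\end{document}
```

If the output shows wrong or missing characters, the usual cause is that the editor saved the file in an encoding other than the one declared.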

The inputenc package also allows the user to change the encoding within the document with the command \inputencoding{'encoding name'}.

\usepackage[utf8]{inputenc}
% ...
% In this area
% The UTF-8 encoding is specified.
% ...
\inputencoding{latin1}
% ...
% Here the text encoding is specified as ISO Latin-1.
% ...
\inputencoding{utf8}
% Back to the UTF-8 encoding.
% ...

Extending the support

LaTeX's support for UTF-8 is fairly specific: it covers only a limited range of Unicode input characters, namely those symbols that are known to be available with the current font encoding. You might therefore run into an error such as:

! Package inputenc Error: Unicode char \u8:ũ not set up for use with LaTeX.

This happens because the utf8 definition does not necessarily have a mapping for every character you are able to enter on your keyboard. Examples of such characters are

ŷ Ŷ ũ Ũ ẽ Ẽ ĩ Ĩ

In such a case, you may try the utf8x option, which defines more character combinations. utf8x is not officially supported, but can be viable in some cases. However, it might break compatibility with some packages such as csquotes.

Another possibility is to stick with utf8 and to define the characters yourself. This is easy:

\DeclareUnicodeCharacter{'codepoint'}{'TeX sequence'}

where codepoint is the Unicode code point of the desired character, written in hexadecimal, and TeX sequence is what to print when that character is met. Code points are easy to find on the web. Example:

\DeclareUnicodeCharacter{0177}{\^y}

Now inputting 'ŷ' will effectively print 'ŷ'.
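The same approach extends to the other characters listed above. The following is a sketch of a complete document, assuming pdfLaTeX with inputenc; the code points are the standard Unicode values for these letters, and the T1 font encoding line is an assumption rather than a requirement.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}    % assumption: not strictly required for this example
\usepackage[utf8]{inputenc}
% Map each code point to the equivalent accent command.
\DeclareUnicodeCharacter{0177}{\^y}     % y with circumflex
\DeclareUnicodeCharacter{0169}{\~u}     % u with tilde
\DeclareUnicodeCharacter{1EBD}{\~e}     % e with tilde
\DeclareUnicodeCharacter{0129}{\~{\i}}  % i with tilde, built on the dotless i
\begin{document}
ŷ ũ ẽ ĩ
\end{document}
```

On recent LaTeX releases some of these characters may already be set up; redeclaring them as above simply overrides the existing mapping.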

With XeTeX and LuaTeX the inputenc package is no longer needed. Both engines support UTF-8 directly and allow the use of TTF and OpenType fonts to support Unicode characters. See the Fonts section for more information.
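A minimal sketch for these engines, assuming the fontspec package with its default font (Latin Modern), which is expected to contain the glyphs used here:

```latex
% Compile with xelatex or lualatex, not pdflatex.
\documentclass{article}
\usepackage{fontspec} % font selection for XeTeX and LuaTeX; no inputenc needed
\begin{document}
ŷ ũ ẽ ĩ, Ångström, Straße
\end{document}
```

If a character is missing from the output, the selected font most likely lacks that glyph, and another font has to be chosen.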

Escaped codes

In addition to direct UTF-8 input, LaTeX supports the composition of special characters. This is convenient if your keyboard lacks some desired accents and other diacritics.

The following accents may be placed on letters. Although the letter 'o' is used in most of the examples, the accents may be placed on any letter. Accents may even be placed above a "missing" letter; for example, \~{} produces a tilde over a blank space.

The following commands may be used only in paragraph (default) or LR (left-right) mode.

LaTeX command   Sample   Description
\`{o}           ò        grave accent
\'{o}           ó        acute accent
\^{o}           ô        circumflex
\"{o}           ö        umlaut, trema or dieresis
\H{o}           ő        long Hungarian umlaut (double acute)
\~{o}           õ        tilde
\c{c}           ç        cedilla
\k{a}           ą        ogonek
\l{}            ł        barred l (l with stroke)
\={o}           ō        macron accent (a bar over the letter)
\b{o}           o̱        bar under the letter
\.{o}           ȯ        dot over the letter
\d{u}           ụ        dot under the letter
\r{a}           å        ring over the letter (for å there is also the special command \aa)
\u{o}           ŏ        breve over the letter
\v{s}           š        caron/háček ("v") over the letter
\t{oo}          o͡o       "tie" (inverted u) over the two letters
\o              ø        slashed o (o with stroke)

To place a diacritic on top of an i or a j, its dot has to be removed. The dotless versions of these letters are obtained by typing \i and \j. For example, \^{\i} produces î and \"{\i} produces ï.
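The accent commands can be used anywhere in running text. The sketch below uses only commands from the table above, plus \L, the uppercase counterpart of \l; the sample words are arbitrary.

```latex
\documentclass{article}
\usepackage[T1]{fontenc} % assumption: gives proper accented glyphs and hyphenation
\begin{document}
Schr\"odinger, Erd\H{o}s, \L{}ukasiewicz, fa\c{c}ade,
\v{S}koda, sm\o{}rrebr\o{}d, na\"{\i}ve, ma\^{\i}tre.
\end{document}
```

The empty braces after \L and \o end the command name, so the letters that follow are not read as part of it.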

If a document is written entirely in a language that uses particular diacritics frequently, the right configuration allows those characters to be entered more directly. For example, to make umlauts easier to type, the babel package can be loaded as \usepackage[german]{babel}. This provides the shorthand "o for \"o. This is very useful when text accents are needed in a label, since a backslash is not accepted there.
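As a short sketch of the babel shorthand in use (the sentence itself is only sample text):

```latex
\documentclass{article}
\usepackage[T1]{fontenc}   % assumption: not required for the shorthand itself
\usepackage[german]{babel} % activates the " shorthands
\begin{document}
Sch"one B"ucher aus K"oln. % "o, "u stand for \"o, \"u
\end{document}
```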

More information regarding language configuration can be found in the Internationalization section.

Less than < and greater than >

The two symbols '<' and '>' are actually ASCII characters, but you may have noticed that in text mode with the default font encoding they print '¡' and '¿' respectively. This is a font encoding issue. If you want them to print their real symbol, you will have to use another font encoding such as T1, loaded with the fontenc package. See Fonts for more details on font encoding.

Alternatively, they can be printed with dedicated commands:

\textless
\textgreater
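Both approaches can be compared in a small test document. This is a sketch; the inequality is just sample text.

```latex
\documentclass{article}
\usepackage[T1]{fontenc} % with T1, < and > print as themselves in text mode
\begin{document}
Typed directly: 1 < 2 > 0.

With commands: 1 \textless{} 2 \textgreater{} 0.
\end{document}
```

Removing the fontenc line shows the difference: the first line then prints inverted punctuation marks, while the second is unaffected.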

Euro currency symbol

When writing about money these days, you need the euro sign. The textcomp package features a \texteuro command which gives you the euro symbol as supplied by your current text font. Depending on your chosen font this may be quite far from the official symbol.

An official version of the euro symbol is provided by eurosym. Load it in the preamble (optionally with the official option):

\usepackage[official]{eurosym}

Then you can insert it with the \euro{} command. Finally, if you want a euro symbol that matches the current font style (e.g., bold, italics, etc.) you can use a different option:

\usepackage[gen]{eurosym}

Again, you can insert the euro symbol with \euro{}.

Alternatively you can use the marvosym package which also provides the official euro symbol.

\usepackage{marvosym}
% ...

\EUR{}

Now that you have succeeded in printing a euro sign, you may want the '€' on your keyboard to actually print the euro sign as above. There is a simple method to do that. You must make sure you are using UTF-8 encoding along with a working \euro{} or \EUR{} command.

\DeclareUnicodeCharacter{20AC}{\euro{}}
% or
\DeclareUnicodeCharacter{20AC}{\EUR{}}

Complete example:

\usepackage[utf8]{inputenc}
\usepackage{marvosym}
\DeclareUnicodeCharacter{20AC}{\EUR{}}
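The same setup works with eurosym. The following sketch is a full document, assuming pdfLaTeX and the official option; the amounts are sample text.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[official]{eurosym}
% Make the keyboard character print the eurosym symbol.
\DeclareUnicodeCharacter{20AC}{\euro{}}
\begin{document}
Typed directly: 25\,€. With the command: 25\,\euro{}.
\end{document}
```

Both forms should produce the same symbol, since the declaration maps the typed character to \euro{}.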

Degree symbol for temperature and math

The easiest way to print temperature and angle values is to use the \SI{value}{unit} command from the siunitx package, which works both in text and math mode:

\usepackage{amsmath}
\usepackage{siunitx}
%...

A \SI{45}{\degree} angle.

It is $\SI{17}{\degreeCelsius}$ outside.

For more information, see the documentation of the siunitx package.

A common mistake is to use the \circ command. It will not print the correct character (though $^\circ$ will). Use the textcomp package instead, which provides a \textdegree command.

\usepackage{textcomp}
%...

A $45$\textdegree{} angle.

For temperature, you can use the same command or opt for the gensymb package and write

\usepackage{gensymb}
\usepackage{textcomp}
%...

17\,\celsius % best (with textcomp)

Some keyboard layouts feature the degree symbol; you can use it directly if you are using UTF-8 and textcomp. For better results (font quality) we recommend the use of an appropriate font, like lmodern:

\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage{textcomp}

% ...

17\,°C

17\,℃ % best

Other symbols

LaTeX has many symbols at its disposal. The majority of them are within the mathematical domain, and later chapters will cover how to get access to them. For the more common text symbols, use the following commands:

Command               Character
\%                    %
\$                    $
\{                    {
\}                    }
\_                    _
\#                    #
\&                    &
\P                    ¶
\S                    §
\dag                  †
\ddag                 ‡
\textbar              |
\textbackslash        \
\textless             <
\textgreater          >
\textendash           –
\textemdash           —
\texttrademark        ™
\textregistered       ®
\copyright            ©
\pounds               £
\textexclamdown       ¡
\textquestiondown     ¿
\textsuperscript{a}   ᵃ
\textcircled{a}       ⓐ

The tilde (~) is not in the table above because LaTeX uses it in the source to produce a non-breaking space. To print a tilde sign, write either \~{} or \textasciitilde{}. A visible space can be created with \textvisiblespace.
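The different roles of the tilde can be seen side by side in a small sketch (the name and letters are sample text):

```latex
\documentclass{article}
\begin{document}
Dr.~Knuth % the tilde keeps "Dr." and "Knuth" on the same line

A printed tilde: \~{} or \textasciitilde{}

A visible space: a\textvisiblespace{}b
\end{document}
```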

For some more interesting symbols, the PostScript ZapfDingbats font is available thanks to the pifont package. Add the declaration to your preamble: \usepackage{pifont}. Then the command \ding{number} prints the symbol with the given number.
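As a brief sketch of \ding in use: in the ZapfDingbats font, number 51 is a check mark and number 55 a ballot cross. The surrounding words are sample text.

```latex
\documentclass{article}
\usepackage{pifont}
\begin{document}
Done \ding{51} \quad Failed \ding{55}
\end{document}
```

The pifont documentation contains the full chart of numbers and symbols.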

In special environments

Math mode

Several of the above and some similar accents can also be produced in math mode. The following commands may be used only in math mode.

LaTeX command     Description                                          Text-mode equivalent
\hat{o}           circumflex                                           \^
\widehat{oo}      wide version of \hat over several letters
\check{o}         vee or check                                         \v
\tilde{o}         tilde                                                \~
\widetilde{oo}    wide version of \tilde over several letters
\acute{o}         acute accent                                         \'
\grave{o}         grave accent                                         \`
\dot{o}           dot over the letter                                  \.
\ddot{o}          two dots over the letter (umlaut in text mode)       \"
\breve{o}         breve                                                \u
\bar{o}           macron                                               \=
\vec{o}           vector (arrow) over the letter

When applying accents to letters i and j, you can use \imath and \jmath to keep the dots from interfering with the accents:

LaTeX command     Description
\hat{\imath}      circumflex on letter i without upper dot
\vec{\jmath}      vector (arrow) on letter j without upper dot
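The math-mode accents can be tried out in a short test document. This is a sketch; the letters are arbitrary.

```latex
\documentclass{article}
\begin{document}
Single letters: $\hat{x}$, $\bar{x}$, $\dot{x}$, $\ddot{x}$, $\vec{v}$.

Several letters: $\widehat{xyz}$, $\widetilde{abc}$.

Dotless letters: $\hat{\imath}$, $\vec{\jmath}$.
\end{document}
```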

Tabbing environment

Some of the accent marks used in running text have other uses in the tabbing environment. There, the accents are produced with the \a command instead: \a' for an acute accent, \a` for a grave accent, and \a= for a macron.
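A short sketch of accents inside tabbing, using the standard \a forms (\a' for acute, \a` for grave, \a= for macron). Within this environment, \= and \> set and move to tab stops, which is why the usual accent commands are unavailable. The words are sample text.

```latex
\documentclass{article}
\begin{document}
\begin{tabbing}
r\a`{e}gle \quad \= Accent \kill % template line: sets the tab stop, prints nothing
caf\a'{e}  \> acute accent \\
r\a`{e}gle \> grave accent \\
\a={o}     \> macron
\end{tabbing}
\end{document}
```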

Unicode keyboard input

Some operating systems provide a keyboard combination to input any Unicode code point, the so-called unicode compose key.

Many X applications (*BSD and GNU/Linux) support the Ctrl+Shift+u combination. A 'u' symbol should appear. Type the code point and press enter or space to actually print the character. Example:

<Ctrl+Shift+u> 20AC <space>

will print the euro character.

Desktop environments like GNOME and KDE may feature a customizable compose key for more memorable sequences.

Xorg features advanced keyboard layouts with variants that let you enter many characters easily by combining keys with the appropriate modifier, such as Alt Gr. The result depends heavily on the selected layout and variant, so we suggest you experiment a bit with your keyboard, preceding every key and dead key with the Alt Gr modifier.

Notes and References

  1. For a quick explanation of character sets, see this article on Joel Spolsky's blog.
  2. For detailed information on the package, see the complete specification written by the package's authors.
This article is issued from Wikibooks. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.