Material and methods

In this section we take a close look at the complexities
involved in the recognition process.

Number of Character Shapes

In Arabic, each letter can take different shapes depending on its position in
the word: initial, medial or final. Some letters join with their neighbors on
both sides, some join on only one side, and some do not join at all. Each
connected run of characters is known as a ligature or sub word, so a word can
consist of one or more sub words. In Urdu, the shape of a character depends not
only on its position but also on the character to which it is joined:
characters change their shape in accordance with their neighbors. This feature
of Nastaliq is known as context sensitivity. Thus in Urdu the possible shapes
of a single character are not limited to three; a character can have many more
shapes depending on the preceding and following characters. Among the character
classes, hamza (ء) does not join on either side and has only one primary shape,
while all other characters connect from either the right or both sides.

Figure: Different shapes of the character bay (ب) when joined with characters
from different classes at different positions.
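
The joining behavior described above determines where a word breaks into sub words. The following is a minimal sketch, not a method from this paper: it assumes the usual joining behavior of the Arabic script, in which letters such as alif, daal and ray join only to the preceding letter and therefore end a sub word. The set contents and the name `split_sub_words` are illustrative assumptions.

```python
# Letters that join only to the preceding letter (assumed set, following the
# usual joining behavior of the Arabic script): a sub word ends after them.
RIGHT_JOINING = {
    "\u0627",  # alif
    "\u0622",  # alif madda
    "\u062f",  # daal
    "\u0688",  # ddaal
    "\u0630",  # zaal
    "\u0631",  # ray
    "\u0691",  # rray
    "\u0632",  # zay
    "\u0698",  # zhay
    "\u0648",  # wao
    "\u06d2",  # bari yay
}
# Letters that join on neither side (the text names hamza).
NON_JOINING = {"\u0621"}


def split_sub_words(word):
    """Split a word into its connected pieces (ligatures / sub words)."""
    sub_words = []
    current = ""
    for ch in word:
        if ch in NON_JOINING:
            # A non-joining letter closes the running piece and stands alone.
            if current:
                sub_words.append(current)
                current = ""
            sub_words.append(ch)
            continue
        current += ch
        if ch in RIGHT_JOINING:
            # This letter does not connect to what follows it.
            sub_words.append(current)
            current = ""
    if current:
        sub_words.append(current)
    return sub_words


# "Pakistan": pay, alif, kaf, seen, tay, alif, noon
pakistan = "\u067e\u0627\u06a9\u0633\u062a\u0627\u0646"
pieces = split_sub_words(pakistan)
```

Here the word splits into three sub words (pay-alif, kaf-seen-tay-alif, noon). The sketch only finds the connected pieces; it says nothing about which contextual shape each letter takes inside a piece, which is the harder problem in Nastaliq.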


Sloping

The calligraphic nature of Nastaliq also introduces sloping in the text.
Sloping means that as new letters are joined to previous letters, a slope is
introduced because the letters are written diagonally from top right to bottom
left. One of the major advantages of sloping is that it conserves a lot of
writing space.

Sloping also means that characters no longer join with each other on the
baseline, which is an important property of Naskh and one that character
segmentation algorithms for Arabic/Persian text rely on. The character
segmentation algorithms designed for Arabic/Persian text therefore cannot be
applied to Urdu text. The number of character shapes and sloping make Nastaliq
character segmentation the most challenging task in the whole recognition
process, and to our knowledge no algorithm exists so far that promises decent
results in segmenting sub words into individual characters. This is also one of
the main hurdles that keeps most researchers from taking up the challenge of
Nastaliq character recognition.
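
To illustrate why the loss of a common baseline matters, the sketch below uses a horizontal projection profile, a common way of locating a baseline in binarized text. It is an illustrative assumption rather than an algorithm from this paper; the tiny 0/1 images and the function names are hypothetical.

```python
def horizontal_projection(image):
    """Count ink pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]


def baseline_row(image):
    """Index of the row with the most ink, taken as the baseline estimate."""
    profile = horizontal_projection(image)
    return max(range(len(profile)), key=profile.__getitem__)


def baseline_strength(image):
    """Fraction of all ink that lies on the estimated baseline row."""
    profile = horizontal_projection(image)
    total = sum(profile)
    return max(profile) / total if total else 0.0


# Naskh-like toy image: letters share one horizontal joining line.
flat = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

# Nastaliq-like toy image: the stroke runs from top right to bottom left.
sloped = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
```

In the flat image one row carries most of the ink, so the baseline is well defined. In the sloped image every row carries the same amount of ink, so the profile has no peak and a method that cuts characters along the baseline has nothing to hold on to.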

Stretching

Another very important property of the Nastaliq style is stretching. Stretching
means that letters are replaced with a longer version instead of their standard
version. Some characters even change their default shape when stretched, e.g.
seen (س), whereas others only change their width. The purpose of stretching is
not only to make the text more beautiful; it also serves as a tool for
justification. Justification means that the text meets the boundaries of the
bounded area irrespective of the varying length of the sentences. However, it
should be noted that not every character in Urdu can be stretched. For example,
alif (ا), ray (ر) and daal (د) cannot be stretched, but bay (ب), seen (س) and
fay (ف) can. It should also be noted that stretching works closely with the
context-sensitive property of Nastaliq: a certain class of characters can only
be stretched when joined with a character of a certain class or written at a
certain position (initial, medial or final). All these attributes show that
stretching is a complex procedure that also increases the complexity of machine
recognition. Standard Nastaliq fonts used in print normally do not support
stretching; however, it is commonly used in book titles and calligraphic art.
So if we are dealing only with machine-printed Nastaliq text, we normally do
not need to worry about stretching, but if we are dealing with calligraphic or
handwritten Nastaliq documents, it is very likely that we will have to deal
with stretched versions of characters.
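
The role of stretching in justification can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of how any Nastaliq font or calligrapher works: widths are abstract integer units, the stretchable set contains only the three letters named above, and the context rules (class of the neighboring character, position in the word) are ignored. The name `justify_by_stretching` is hypothetical.

```python
# Letters the text names as stretchable; alif, ray and daal are not.
STRETCHABLE = {"bay", "seen", "fay"}


def justify_by_stretching(glyphs, target_width):
    """Widen stretchable glyphs so the line reaches target_width.

    glyphs is a list of (letter_name, width) pairs in abstract units.
    The extra width is shared as evenly as possible among stretchable
    glyphs; the line is returned unchanged if nothing can be stretched
    or the line is already wide enough.
    """
    extra = target_width - sum(width for _, width in glyphs)
    candidates = [i for i, (name, _) in enumerate(glyphs) if name in STRETCHABLE]
    if extra <= 0 or not candidates:
        return list(glyphs)
    share, remainder = divmod(extra, len(candidates))
    result = list(glyphs)
    for k, i in enumerate(candidates):
        name, width = result[i]
        # The first `remainder` candidates absorb one extra unit each.
        result[i] = (name, width + share + (1 if k < remainder else 0))
    return result


line = [("alif", 2), ("bay", 4), ("ray", 3), ("seen", 6)]
justified = justify_by_stretching(line, 20)
```

For a recognizer, the consequence is that the same letter can appear with many different widths, and in some cases a different shape, within one document.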

Positioning and Spacing

Like stretching, positioning and spacing are important tools for justification
in Nastaliq and are also used to beautify the text. Positioning means the
placement of ligatures and sub words, and spacing means the space between two
consecutive ligatures. Normally, each ligature is written to the left of the
previous ligature with a small standard spacing. Positioning, however, allows
ligatures to be placed at different positions: a new ligature may start
somewhere near the top of the previous ligature, or it may be placed right
above it, even if it belongs to another word. Positioning even allows two
ligatures to overlap and connect if the need arises. Unlike stretching,
positioning is quite common and is used extensively in news headlines in the
Urdu print media because of its ability to fit long, large headings into small
spaces on the page. All these flexibilities and strengths of Nastaliq make it a
real challenge for machine recognition. On the one hand, context sensitivity
and sloping make character segmentation a very difficult task; on the other
hand, positioning makes ligature and sub word segmentation equally difficult.
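
The effect of positioning on ligature segmentation can be illustrated with a vertical projection split, a simple technique that cuts a text line wherever a column contains no ink. The sketch is an illustrative assumption, not a method proposed in this paper; the toy images and the name `split_on_empty_columns` are hypothetical.

```python
def split_on_empty_columns(image):
    """Return (start, end) column ranges separated by ink-free columns."""
    width = len(image[0])
    has_ink = [any(row[x] for row in image) for x in range(width)]
    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                       # a new segment begins
        elif not ink and start is not None:
            segments.append((start, x - 1))  # an empty column ends it
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments


# Two ligatures side by side with standard spacing: one empty column between.
spaced = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1],
]

# The second ligature starts above the first one, so their columns overlap.
stacked = [
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
]
```

With standard spacing the split recovers both ligatures. Once the second ligature is positioned above the first, no ink-free column remains between them and the two are returned as a single segment, even though they do not touch.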

