Channel: Typophile - Comments

In reply to Adobe Devanagari Font:

For those who wonder how the statistics based on the dictionaries can be obtained, here is the method I used (using Python regular expressions).

As John said, a compound is a sequence of virama-separated consonants. A consonant can be defined by the Python regular expression

   [\u0915-\u0939\u0958-\u095F]

(I just rewrote in Python what John wrote in words above.) The virama is \u094D. A compound is thus a sequence of one or more consonant-plus-\u094D pairs (whence the + operator in the code below), with a final consonant appended to it. For grouping, I used the non-capturing form (?:expression) to avoid the back-referencing mechanism; with a capturing group, re.findall would return only the contents of the group instead of the whole match.
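As a quick check of the expression described above, here is a minimal Python 3 sketch applied to a single word. The names CONSONANT, VIRAMA and sample are only illustrative, and the sample word is my own choice, not taken from the dictionary discussed here. It also shows what happens when the group is a capturing one.

```python
import re

# Consonant class and virama exactly as given in the text above.
CONSONANT = '[\u0915-\u0939\u0958-\u095F]'
VIRAMA = '\u094D'

# One or more consonant+virama pairs, then a final consonant.
compound = re.compile('(?:' + CONSONANT + VIRAMA + ')+' + CONSONANT)

# Same pattern with a capturing group, for comparison only.
capturing = re.compile('(' + CONSONANT + VIRAMA + ')+' + CONSONANT)

# Illustrative sample word: KA VIRAMA SSA, vowel sign E, TA VIRAMA RA (क्षेत्र).
sample = '\u0915\u094D\u0937\u0947\u0924\u094D\u0930'

matches = compound.findall(sample)        # whole matches: KA+VIRAMA+SSA and TA+VIRAMA+RA
group_only = capturing.findall(sample)    # only the last repetition of the group

print(matches)
print(group_only)
```

With the non-capturing group, findall returns the two complete compounds; with the capturing group it returns only the consonant-plus-virama pair captured last in each match, which is not what is wanted here.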

The dictionary contains one word per line. The following program reads it line by line, finds the compounds in each line, and outputs them directly.

import re, sys

compound = ur'(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]'
f=open(sys.argv[1])
word = f.readline().decode('utf-8')
while word:
  for comb in re.findall(compound, word):
    print comb.encode('utf-8')
  word = f.readline().decode('utf-8')

If we call this stub compounds.py and if the dictionary is aspell.txt, then

python compounds.py aspell.txt

outputs the compounds (I should rather say the candidates for producing compounds), one per line, as many times as they occur. That should work on any platform.
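The program above is written for Python 2 (the ur'' prefix, the print statement and the explicit decode/encode calls do not exist in that form in Python 3). For readers on Python 3, a rough equivalent could look like the sketch below. It assumes the dictionary file is UTF-8 encoded, as in the original, and the function names iter_compounds and main are mine.

```python
import re
import sys

# Same expression as above; in Python 3 a plain string literal already
# holds Unicode, so the ur'' prefix is not needed.
COMPOUND = re.compile(
    '(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]'
)


def iter_compounds(lines):
    """Yield every compound candidate found in an iterable of text lines."""
    for line in lines:
        yield from COMPOUND.findall(line)


def main(path):
    # Assumption: the dictionary is UTF-8 encoded, one word per line.
    with open(path, encoding='utf-8') as f:
        for comb in iter_compounds(f):
            print(comb)


if __name__ == '__main__' and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved under a name of your choice, it is called the same way as compounds.py, with the dictionary file as its only argument, and it can be piped into sort and uniq in the same manner.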

To get a more sophisticated output, you can write a more involved Python program, or just use standard unix commands if you are on Linux or OS X. The first thing to do is to sort the compounds so that identical ones end up on adjacent lines (uniq only compares neighbouring lines), and then use the unix command uniq with the option -c to count them.

python compounds.py aspell.txt | sort | uniq -c

Here are the first lines of the output:

223 क्क
1 क्क्क
1 क्क्ड़
32 क्ख

So there are 223 occurrences of क्क. It would be interesting now to have those numbers in descending order. Again, all that is needed is to sort those last lines according to the numerical value (option -n), and I'll choose the reverse order (option -r). Here is the full command:

python compounds.py aspell.txt | sort | uniq -c | sort -n -r

Here are the first lines of the output:

3200 प्र
1614 त्र
1599 क्ष
961 स्त
924 र्ण

Once you know the compounds, you can search for the words containing them in the dictionary using the unix command grep. Very little programming is thus required.
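On a platform without grep, the same lookup can be approximated in a few lines of Python 3. This is only a sketch: the function name words_containing and the two sample words are hypothetical, and the search is a plain substring test, so, like an unanchored grep, looking for क्क also returns words containing the longer sequence क्क्क.

```python
def words_containing(compound, lines):
    """Return the words (one per line) that contain `compound` as a substring."""
    return [line.strip() for line in lines if compound in line]


# Illustrative input standing in for the dictionary file.
sample_words = [
    '\u091A\u0915\u094D\u0915\u0930\n',        # चक्कर
    '\u092A\u094D\u0930\u0915\u093E\u0936\n',  # प्रकाश
]

kka = '\u0915\u094D\u0915'                     # क्क
found = words_containing(kka, sample_words)
print(found)
```

To run it against a real dictionary, pass an open file instead of sample_words, for instance open('aspell.txt', encoding='utf-8') if the file is UTF-8 encoded.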

Of course, to produce a table for LaTeX, the compounds found were put in a Python dictionary and the processing was all done in Python. The full source is 39 lines of Python after removing comments and blank lines (but including the 8 lines above).
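The full source is not reproduced here, so the following is not that program. It is only a minimal Python 3 sketch of the counting step, using collections.Counter (a dictionary subclass) in place of the sort | uniq -c | sort -n -r pipeline. The function name count_compounds and the sample words are assumptions made for illustration.

```python
import re
from collections import Counter

COMPOUND = re.compile(
    '(?:[\u0915-\u0939\u0958-\u095F]\u094D)+[\u0915-\u0939\u0958-\u095F]'
)


def count_compounds(lines):
    """Count every compound candidate found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(COMPOUND.findall(line))
    return counts


# Illustrative input standing in for the dictionary file.
sample_words = [
    '\u092A\u094D\u0930\u0915\u093E\u0936',        # प्रकाश
    '\u0915\u094D\u0937\u0947\u0924\u094D\u0930',  # क्षेत्र
    '\u092A\u094D\u0930\u0947\u092E',              # प्रेम
]

counts = count_compounds(sample_words)

# most_common() lists the entries from the highest count to the lowest.
for comb, n in counts.most_common():
    print(n, comb)
```

The order of compounds that share the same count may differ from what sort -n -r prints, since most_common() keeps tied entries in the order they were first seen.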

Michel
Note: I removed duplicates in aspell.txt after I produced my last tex files; the number of occurrences for स्त has decreased from 962 to 961.

