One may compare the Mahabharata to the Greek Iliad or the Persian Shahnameh. Such epics contribute considerably to the culture of a nation, provide a vehicle for learning national languages, and may provide background information on the dominant religion of the country.
Now and then, one wishes to search the original text of an epic to answer questions such as "When did Paris first see Helen?" in the context of the Iliad, or "Where did Kṛṣṇa and Arjuna first meet?" in the context of the Mahabharata. If the person asking the question is not conversant with the classical language of the text, it may become necessary to work with a translation. For the Iliad and other ancient Greek works, a few high-quality online searchable indexes and concordances exist. For the Mahabharata, printed resources are available, but their online versions are almost always PDF scans of hardcopy, which cannot be searched. A Web search turns up one or two Unicode text versions of the Mahabharata, and I wished to see whether a usable index could be constructed from these resources.
Recently, my interest in building such an index was sharpened by an ongoing TV series on the Mahabharata on Star TV (India). With over a hundred half-hour episodes released as of mid-February 2014, the series has covered almost one book (Parva) of the eighteen books of the Mahabharata. Viewers often ask whether the TV realizations are faithful to the original text, whether sections have been left out, and whether new material has been added that was not present in the written text. While paging through a printed Hindi translation of the Mahabharata, which runs to six quarto volumes with a total of 6,511 pages, I found it difficult to locate the passages that I sought. Although the book has a table of contents, there is no index, and the table of contents is itself quite long!
The SDP platform used for the indexing work
I agreed to work on developing an index for the Mahabharata when Intel loaned me a Haswell SDP (Software Development Platform). One question about the task may be dealt with first: why use a convertible Ultrabook/tablet for this work, given that a desktop or a well-equipped (in terms of I/O features) laptop is perfectly capable of doing much of it? Although an Ultrabook may have a keyboard that takes getting used to, and the particular Ultrabook that I received (Lenovo Yoga 2 Pro, 1.6 GHz i5-4200U, 4 GB RAM, 128 GB SSD) had barely sufficient space to hold all the large software packages that I would need (such as Intel Parallel Studio, Microsoft Visual Studio, and Microsoft Office), it did have one advantage: portability combined with reasonably long battery life. One can carry an Ultrabook to a library and put it to use in the special collections section.
To assure myself that the Ultrabook was capable of performing the intended work, I ran some benchmarks on the SDP Ultrabook and compared the results with those from my Core-2 E8400 desktop, a desktop-replacement laptop with an i7-2720M CPU, and another laptop with an i3-2350M CPU. The Ultrabook turned out to be the second fastest while performing CPU-intensive tasks such as collecting index entries from the Unicode text of the Mahabharata and sorting the index. With the exception of some multithreaded benchmarks, for which the four cores and eight threads of the i7-2720M gave it an edge over the i5-4200U, the Haswell i5 came close to matching the Sandy Bridge i7's performance.
The trackpad gave me trouble at times, causing the cursor to go off screen or jump in startling ways; plugging in a USB mouse overcame this problem. I reduced the screen resolution to 2048 × 1152 from the recommended 3200 × 1800, and set font sizes to values that made text comfortable to read. Even so, some software packages could not accommodate the high screen resolution. For example, the Opera browser rendered its menu text so small that I could not decipher the menus, and I had to resort to Alt+F4 to quit the program.
Parsing the text of the Mahabharata and collecting the index entries
I elected to restrict myself to authoritative versions of the Mahabharata that were available in Unicode Devanagari. I had gained some experience with Unicode text processing from a project reviewing and editing an English–Kannada dictionary a couple of years earlier, and some of the software utilities that I had developed then could be reused with slight modification for the new task. Because of my unfamiliarity with other encoding and transliteration schemes such as ITRANS, and given that I had only about two weeks in which to do something useful, I did not attempt to use non-Unicode versions of the text, however worthy those might be.
There are two freely available (but not necessarily copyright-free) sources of the Mahabharata in Unicode, and both originated from the work of Dr. Muneo Tokunaga. The first version, available from John Smith's Indology page, has a simpler structure and is the result of corrections made to Tokunaga's work. The second version is from the Bhandarkar Oriental Research Institute (BORI), and displays titles for some of the subchapters. I am not an Indologist, nor do I know Sanskrit well, but it appeared to me that the second version had had the benefit of careful editing and showed word breaks in places that those knowledgeable in Sanskrit would find more natural. Therefore, despite the slightly more complicated structure, I chose the Devanagari UTF-8 HTML files of the BORI version as my primary sources.
I ran a sed script on the downloaded HTML files to strip off the HTML header and footer markup, writing the output into UTF-8 text files. While parsing these files I discovered some irregularities, which I corrected manually using a Unicode-enabled editor. I then wrote a utility in C to read the files and replace the vertical line ('|', U+007C) characters used to demarcate the verses with the more appropriate single and double danda ('।', U+0964; '॥', U+0965) characters of the Devanagari script. (Note: these characters will not display properly unless your browser is set to use a Unicode font that covers the Devanagari range, U+0901 to U+0970.)
The parser, also written in C, reads the eighteen files prepared above and writes a single output file, with one Sanskrit word per line, labelled with the Parva (book), Anuparva (chapter), and Sloka (verse) numbers. This file is about 30 MB in size and contains 692,000 lines. Many words appear multiple times in the files, but each instance of such a word will generally carry different book/chapter/verse labels. No attempt is made to split apart conjoined words (Sanskrit is replete with such words, and the Sandhi rules for joining and decomposing them are complex). The size of the file makes exhaustive checking by a single reviewer infeasible unless a team of workers and a generous amount of time are available.
The last step within the scope of this task is to sort the word file, preserving the labels and preserving the order of lines with matching words. A number of variations of quicksort and mergesort were implemented and tested; details of the sorting methods and comparisons of their features and performance will be described in a separate article. Let it suffice to note here that the filters described earlier take a second or two of CPU time, and the sorting takes less than five seconds on the Haswell-powered SDP Ultrabook.
What's next?
Once the sorted index of Sanskrit words and associated section numbers is available, a number of applications become possible. A cross-linked HTML help file could be constructed; concordances could be produced; GUI applications could be built with which an Indologist or other interested person could select a name and receive the text of all the verses containing that name. The needs and preferences of these end users would have to be ascertained and accommodated.