PhD Defense of Computer Science Scholar, Mr. Sharaf Hussain
Title: Retrieval of Mathematical Information with Syntactic and Semantic Structure
Date: Friday, April 9, 2021
Venue: MCC 11, Aman CED, IBA Main Campus, Karachi, Pakistan.
Advisor: Dr. Shakeel Khoja
Dr. Shafay Shamail (LUMS) and Dr. Malik Muhammad Saad Missen (IUB Bahawalpur)
Retrieving mathematical expressions efficiently over the web is more complex than simple text search. It is possible only when the syntactic (i.e., textual) and semantic (i.e., structural) information of a mathematical expression is retrieved properly and analyzed methodically. This research proposes a technique that indexes expressions along with their syntactic and semantic information. The proposed technique also improves the storage efficiency of the inverted index by encoding indexing terms in Braille Unicode.
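The abstract does not spell out the encoding scheme, so the following is only a minimal sketch of the general idea, assuming each distinct indexing token is mapped to one code point of the Unicode Braille Patterns block (U+2800 to U+28FF). The function names and the sample vocabulary are hypothetical and are not taken from the thesis.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block
BRAILLE_SIZE = 256     # the block spans U+2800..U+28FF


def build_codebook(vocabulary):
    """Assign one Braille character to each distinct token (assumed scheme)."""
    tokens = sorted(set(vocabulary))
    if len(tokens) > BRAILLE_SIZE:
        raise ValueError("this sketch supports at most 256 distinct tokens")
    return {tok: chr(BRAILLE_BASE + i) for i, tok in enumerate(tokens)}


def encode_term(tokens, codebook):
    """Encode a sequence of tokens as a compact Braille string."""
    return "".join(codebook[tok] for tok in tokens)


def decode_term(encoded, codebook):
    """Recover the original token sequence from a Braille string."""
    reverse = {char: tok for tok, char in codebook.items()}
    return [reverse[char] for char in encoded]


# Hypothetical vocabulary of Content MathML element names.
codebook = build_codebook(["apply", "divide", "ci", "cn"])
encoded = encode_term(["apply", "divide", "ci", "cn"], codebook)
decoded = decode_term(encoded, codebook)
```

In this sketch a four-token term shrinks to four characters. The actual saving depends on token length and on the on-disk encoding, since each Braille character occupies three bytes in UTF-8.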
The mathematical expressions are originally represented in Content MathML (CMML) for indexing. However, most collections of scientific documents contain mathematical expressions in LaTeX math style. Therefore, a rule-based conversion technique, termed LaTeX Math Grammar (LMG), is developed for transforming LaTeX math expressions into CMML.
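To illustrate what a single rule-based conversion step might look like, the sketch below rewrites a simple LaTeX fraction into Content MathML. It is not the LMG grammar itself: the regular expression, the helper names, and the restriction to plain identifiers and integers are all assumptions made for this example.

```python
import re

# Matches only \frac{a}{b} where a and b are plain identifiers or integers.
FRAC = re.compile(r"\\frac\{(\w+)\}\{(\w+)\}")


def leaf(token):
    """Wrap a token as a CMML number (<cn>) or identifier (<ci>)."""
    tag = "cn" if token.isdigit() else "ci"
    return f"<{tag}>{token}</{tag}>"


def frac_to_cmml(latex):
    """Convert a simple LaTeX fraction to a Content MathML <apply> element."""
    match = FRAC.fullmatch(latex.strip())
    if match is None:
        raise ValueError("only simple \\frac{a}{b} expressions are supported")
    numerator, denominator = match.groups()
    return f"<apply><divide/>{leaf(numerator)}{leaf(denominator)}</apply>"


cmml = frac_to_cmml(r"\frac{x}{2}")
```

A full converter would need many such rules plus handling for nesting and operator precedence, which this sketch deliberately leaves out.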
A weighting function that assigns a weight to each indexing term is introduced to improve the ranking of retrieved documents. Each term's weight contributes to the ranking function, which raises the rank of documents that contain query terms. Multiple indices are created in a distributed environment to avoid storing a large inverted index in a centralized location. Additionally, a graphical user interface is developed so that both experienced and general users can use the system without any hassle.
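The abstract does not give the weighting or ranking formulas. As a rough sketch of how per-term weights can feed a ranking function, the example below scores each document by summing the weights of the query terms it contains. The additive scoring rule, the function name, and the sample data are assumptions, not the thesis's method.

```python
from collections import defaultdict


def rank_documents(query_terms, inverted_index, term_weight):
    """Rank documents by the summed weight of the query terms they contain."""
    scores = defaultdict(float)
    for term in set(query_terms):
        # Every document in the term's posting list receives the term's weight.
        for doc_id in inverted_index.get(term, ()):
            scores[doc_id] += term_weight.get(term, 0.0)
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical index and weights for illustration only.
inverted_index = {"divide": ["d1", "d2"], "power": ["d2"]}
term_weight = {"divide": 1.0, "power": 2.0}
ranking = rank_documents(["divide", "power"], inverted_index, term_weight)
```

Here document d2 ranks first because it matches both query terms and accumulates both weights.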
The proposed technique has been evaluated on the Wikipedia and arXiv NTCIR-12-MathIR corpora. In addition, three sets of arXiv document dumps are selected for testing the performance of the system on a large collection of mathematical expressions.
The performance metrics are divided into two categories: retrieval performance and system execution performance. Retrieval performance is measured using the NTCIR-MathIR evaluation criteria. At the top 5 documents, the Wikipedia queries without wildcards yielded an nDCG value of 49.02%, an MSnDCG value of 49.66%, a Precision value of 45.50%, an Average Precision (AP) value of 49.32%, and an nERR value of 65.69%. At the top 5 documents, the arXiv queries without text yielded an nDCG value of 48.38%, an MSnDCG value of 47.88%, a Precision value of 44%, an AP value of 34.83%, and an nERR value of 56.20%. For system execution performance on an uncompressed index (i.e., without Braille encoding), 18.63 million formulae are stored per gigabyte, 53.26 million formulae are indexed per hour, and the average search time of a query is 267 milliseconds. In contrast, on a compressed index (i.e., with Braille encoding), 45.10 million formulae are stored per gigabyte, 49.66 million formulae are indexed per hour, and the average search time of a query is 347 milliseconds. Braille encoding thus stores about 2.4 times as many formulae per gigabyte, at the cost of somewhat slower indexing and search.
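For readers unfamiliar with nDCG at a cutoff, the sketch below shows the textbook formulation with a logarithmic rank discount. It is not the NTCIR evaluation tool, and it simplifies by computing the ideal ordering from the retrieved list only rather than from all judged documents. The function names and the sample relevance grades are illustrative assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of the top 5 results for one query.
score = ndcg_at_k([1, 3, 0, 2, 0], k=5)
```

A list that is already in descending order of relevance scores 1.0, and any misordering lowers the value toward 0.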