PhD Public Defense, Anwar ul Haque

Title: The Storyteller - Computer Vision Driven Context and Content Transcription System for Images 
Speaker: Anwar ul Haque, PhD scholar, Department of Computer Science, IBA
Advisor: Dr. Sayeed Ghani
External Examiners: Dr. Arif Mahmood (Information Technology University, Lahore) and Dr. Waheed Iqbal (Faculty of Computing & Information Technology, University of the Punjab).
Date: June 13, 2025, at 2:30 PM
Venue: Tabba Conference Room, Tabba Academic Block, Second Floor, Main Campus, IBA Karachi

Abstract:
Biological vision arises from specialized light-sensitive cells (photoreceptors), which allow living organisms to view and understand their surroundings. It is one of the most vital senses a human being possesses. Broadly, human vision combines a sensor (the eye) that captures light, a signal pathway (the optic nerve) along which the light-induced signals travel to the visual cortex, and the cortex itself, which interprets them. Working together, these systems let us see and understand the world.

The last 30 years have seen significant research aimed at giving machines and computers a similar kind of vision. Early work focused simply on capturing light to form an image (photography). Expectations then grew toward perceiving and understanding what is in an image, as humans do. However, human-level perception, understanding, and recognition rest on an enormous amount of experience that our brains have inherited over thousands of years of human evolution. For a machine to reach a similar capability, it would need comparable experience and algorithms that process images the way our visual cortex does. A machine also perceives an image very differently from a human: to a machine, an image is simply a matrix of rows and columns of pixel values. An ideal computer vision system would interpret this matrix the way our brain interprets a scene, but that remains far in the future, even in this modern age.
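To make the matrix view concrete, the short Python sketch below (an illustrative example only, not part of the thesis; the file name sample.jpg is hypothetical) shows how a machine sees an image as rows and columns of pixel intensities.

import numpy as np
from PIL import Image  # Pillow; any image-loading library would do

# Hypothetical input file: to the machine it is nothing more than numbers.
img = Image.open("sample.jpg").convert("L")  # grayscale for simplicity
matrix = np.asarray(img)                     # shape: (rows, columns)

print(matrix.shape)   # e.g. (480, 640) -- rows x columns of pixels
print(matrix[0, 0])   # intensity of the top-left pixel (0-255)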

A general path toward computer vision is to train machines on the fundamental geometry, shapes, and orientations of objects in an image. The era of the 1990s began with human-crafted descriptions of shapes, sizes, and other fundamental elements of objects, used to train machines to understand an image and its contents autonomously. Despite notable successes, this research was far from practical, because small variations in any object can create millions of distinct viewpoints; rotating an object by even a single degree could require an entirely new feature map for the machine to learn. In the 2000s, machine learning made it possible to train machines on a set of given feature matrices and to tolerate many variations. Even so, manually created feature matrices were not enough for success at a generic scale.

Deep learning opened a new frontier in training machines for computer vision. Rather than relying on manually generated feature matrices, deep models learn representative features automatically, which broadens what computer vision systems can learn. As a result, machines are now well trained for various computer vision tasks and sometimes rival humans in specific challenges, such as detecting and segmenting objects within images. Despite this level of success, true mimicry of human-level vision remains unsolved, limited by the amount of training data, computational resources, the algorithms themselves, and variations in lighting, color, angle, shape, and other factors.

Computer vision spans a large number of tasks; the primary ones required to cover the outcomes of human-level vision include object detection, object segmentation, object tracking, image classification, scene classification, object classification, image captioning, and video captioning. Among these, image captioning is one of the most complex tasks in computer vision research, since it involves detecting and classifying objects, generating the relationships among them, and then describing the entire image in human language (a conceptual sketch of this decomposition follows below). Much innovative and significant research over the past six years has achieved state-of-the-art results, generating quite human-like captions for images. However, these approaches do not focus on the content and context of the story a caption should tell from the image's perspective, as we humans do.
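The following conceptual sketch illustrates how these sub-tasks compose into a captioning pipeline. Every function is a hypothetical placeholder standing in for a real model component; none of it is taken from the thesis.

# Conceptual sketch of a generic captioning pipeline; all functions are
# hypothetical placeholders for real model components.

def detect_objects(image):
    return [{"box": (10, 20, 110, 220)}, {"box": (150, 40, 300, 260)}]  # placeholder boxes

def classify_objects(image, objects):
    return ["person", "bicycle"]  # placeholder labels, one per detected box

def infer_relations(labels):
    return [("person", "riding", "bicycle")]  # placeholder subject-predicate-object triple

def describe(relations):
    subject, predicate, obj = relations[0]
    return f"A {subject} is {predicate} a {obj}."

def caption_image(image):
    objects = detect_objects(image)
    labels = classify_objects(image, objects)
    relations = infer_relations(labels)
    return describe(relations)

print(caption_image(image=None))  # "A person is riding a bicycle."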

In The Storyteller, we present a more human-like storytelling system that captions images from the perspective of content, context, and knowledge. We have attempted to provide a working solution for several applications, including generating datasets for training self-driving vehicles, video-to-subtitle generation, and suggestive reasoning over MRI images. Our methodology combines capsule networks for image encoding, knowledge graphs for context and content, and transformer neural networks for text generation. Capsule networks extract spatial and orientational details from the images during feature extraction. A knowledge graph then acts as a knowledge engine, finding content, context, and semantics in the corpus against the features produced by the encoding stage. The decoding phase comprises transformer neural networks fed by the knowledge-graph-driven annotation iterator. We use dynamic multi-headed attention in the transformer to keep the model tractable in terms of time and memory. Transformers handle long-term dependencies well, which is essential for generating the referential context of a sentence from the sentence before it.
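A rough PyTorch sketch of this encode, enrich, and decode flow is given below. It is only an approximation under strong assumptions: the capsule encoder is reduced to a single convolution, the knowledge graph to a toy embedding lookup, standard multi-head attention stands in for the dynamic variant, and all names and dimensions are illustrative rather than taken from the thesis.

import torch
import torch.nn as nn

class CapsuleStyleEncoder(nn.Module):
    """Stand-in for the capsule network that produces image feature tokens."""
    def __init__(self, d_model=256):
        super().__init__()
        self.conv = nn.Conv2d(3, d_model, kernel_size=9, stride=4)
    def forward(self, images):                     # (B, 3, H, W)
        feats = self.conv(images)                  # (B, d_model, h, w)
        return feats.flatten(2).transpose(1, 2)    # (B, h*w, d_model) feature tokens

def knowledge_lookup(feature_tokens, graph_embeddings):
    """Toy stand-in for the knowledge-graph engine: attend image features over
    pre-computed graph-node embeddings to inject content/context cues."""
    scores = feature_tokens @ graph_embeddings.T        # (B, T, N_nodes)
    weights = scores.softmax(dim=-1)
    return feature_tokens + weights @ graph_embeddings  # enriched decoder memory

class CaptionDecoder(nn.Module):
    """Transformer decoder generating text conditioned on the enriched memory."""
    def __init__(self, vocab_size=10000, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)
    def forward(self, tokens, memory):             # tokens: (B, L) word ids
        x = self.embed(tokens)
        return self.out(self.decoder(x, memory))   # (B, L, vocab_size) logits

# Tiny smoke test with random data.
enc, dec = CapsuleStyleEncoder(), CaptionDecoder()
images = torch.randn(2, 3, 128, 128)
graph_nodes = torch.randn(50, 256)                 # hypothetical graph-node embeddings
memory = knowledge_lookup(enc(images), graph_nodes)
logits = dec(torch.randint(0, 10000, (2, 12)), memory)
print(logits.shape)                                # torch.Size([2, 12, 10000])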

We used the MS COCO dataset for training, validation, and testing; it is the benchmark dataset used by state-of-the-art image-captioning research. The results show good content and context understanding, with scores of BLEU-4 (B4): 71.93, METEOR (M): 39.14, CIDEr (C): 136.53, and ROUGE (R): 94.32. The generated sentences use adverbs and adjectives that reflect the objects' geometric and semantic relationships remarkably well. The results also show an in-depth understanding of positional information within the generated text, owing to the positional understanding encoding engine. A key attribute of The Storyteller's output is dense captioning: a single image is captioned in up to three sentences that remain cohesive, concise, and properly connected to one another.
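For reference, BLEU-4 is the n-gram-overlap metric abbreviated B4 above. The minimal NLTK example below computes it for a single hypothetical caption; the sentences are made up for illustration and are unrelated to the reported results.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference and generated captions, tokenized into words.
reference = ["a", "man", "is", "riding", "a", "bicycle", "down", "the", "street"]
candidate = ["a", "man", "rides", "a", "bicycle", "down", "the", "street"]

# BLEU-4 averages 1- to 4-gram precision; smoothing avoids zero scores on short texts.
b4 = sentence_bleu([reference], candidate,
                   weights=(0.25, 0.25, 0.25, 0.25),
                   smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {b4:.3f}")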