VisualAIze: Accessible Spatial Awareness Using AI
Problem & Motivation
An estimated 2.2 billion people worldwide have vision impairment or blindness. Navigating large, crowded indoor spaces, such as airports, can be challenging for these individuals and often requires heavy reliance on assistance from others or memorization of layouts. Enhancing accessibility for visually impaired individuals improves their quality of life and promotes inclusivity and diversity in society. Accessible environments and technologies empower individuals with visual impairments to participate fully in social, educational, and professional activities.
The VisualAIze team aims to address these challenges by leveraging large language models (LLMs) and multimodal datasets to generate real-time descriptions of images taken at various airport locations. By providing detailed descriptions of a user's surroundings, we hope to help visually impaired individuals quickly understand their environment and navigate airports more independently and confidently. The VisualAIze project contributes to the broader goal of creating a more inclusive society in which individuals of differing abilities feel empowered to explore the world with confidence.
Data Source & Data Science Approach
Our project used the Indoor Scene Recognition Dataset from MIT. This dataset includes high-quality indoor images of airports and fits our project's needs well, with one initial concern about image quality: the photos are too clean to realistically represent images captured by visually impaired individuals. This observation stemmed from our review of the VizWiz dataset, in which many photos taken by visually impaired people are blurred, out of focus, or do not fully capture the intended objects.
To address this discrepancy, we applied random Gaussian and motion blur, exposure adjustments, rotations, and cropping to the images to simulate the conditions of photos taken by visually impaired users. In doing so, we created a robust, proprietary dataset tailored to our task, allowing us to evaluate models on these degraded images. We also performed extensive exploratory data analysis, including t-SNE and PCA, to confirm that the dataset had enough variety and was suitable for our use case.
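As a rough illustration of this preprocessing step, the sketch below shows one way such augmentations could be composed using the albumentations library; the specific parameter values and file paths are illustrative assumptions, not the exact settings we used.

```python
# Sketch of the blur/exposure/rotation/crop augmentations, assuming the
# albumentations and OpenCV libraries; parameter values are illustrative.
import cv2
import albumentations as A

augment = A.Compose([
    A.SmallestMaxSize(max_size=512),                  # normalize size before cropping
    A.MotionBlur(blur_limit=7, p=0.5),                # simulate camera shake
    A.GaussianBlur(blur_limit=(3, 7), p=0.5),         # simulate out-of-focus shots
    A.RandomBrightnessContrast(brightness_limit=0.3,
                               contrast_limit=0.3, p=0.5),  # exposure adjustment
    A.Rotate(limit=20, p=0.5),                        # tilted framing
    A.RandomCrop(height=384, width=384, p=0.5),       # partial capture of the scene
])

image = cv2.cvtColor(cv2.imread("airport_example.jpg"), cv2.COLOR_BGR2RGB)
degraded = augment(image=image)["image"]
cv2.imwrite("airport_example_degraded.jpg", cv2.cvtColor(degraded, cv2.COLOR_RGB2BGR))
```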
Because the output is open-ended description rather than a fixed label, we generated our own golden annotations. We took 92 images (a random 10% of the dataset), sent them to GPT-4 for description, then manually reviewed the results and made them more descriptive, paying special attention to accessibility features in the photos, such as tactile guide strips and signage. With this golden set, we can quickly compute evaluation scores such as ROUGE and BERTScore and compare them across models.
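To make this scoring step concrete, the sketch below shows how a model description could be compared against a golden annotation using the Hugging Face evaluate package; the example texts are hypothetical, not taken from our dataset.

```python
# Sketch of automated scoring against golden annotations, assuming the
# Hugging Face `evaluate` package; the texts below are hypothetical examples.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["A wide corridor with a moving walkway on the left and gate signs overhead."]
references  = ["A corridor with a moving walkway to the left, overhead gate signage, "
               "and a tactile guide strip along the floor."]

print(rouge.compute(predictions=predictions, references=references))   # ROUGE-1/2/L F-scores
print(bertscore.compute(predictions=predictions, references=references,
                        lang="en"))                                     # precision/recall/F1
```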
Evaluation
Evaluation is important because it lets us compare our approach against a well-known baseline and examine how feasible our product is for airport navigation. We chose to compare four models: BLIP, Llama 3, GPT-4o, and GPT-4o mini.
We first evaluated these models using automated metrics: ROUGE, BERTScore, and runtime. For our baseline, we chose BLIP, a well-known image captioning model.
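For context, generating a BLIP caption and measuring its runtime takes only a few lines. The sketch below assumes the public Salesforce/blip-image-captioning-base checkpoint on Hugging Face rather than our exact configuration, and the image path is illustrative.

```python
# Sketch of the BLIP baseline caption plus a runtime measurement, assuming the
# public Salesforce/blip-image-captioning-base checkpoint; paths are illustrative.
import time
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("airport_example_degraded.jpg").convert("RGB")

start = time.perf_counter()
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
runtime = time.perf_counter() - start

print(f"{caption} ({runtime:.2f}s)")  # unlike the LLM-based models, BLIP takes no prompt to tune
```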
As expected, our baseline scored the lowest of our chosen models, given that it does not accept an input prompt or support prompt tuning. Next, we tried Llama 3 because it is open and free; its scores improved over the baseline, but it was actually slower in terms of runtime. Finally, the two GPT-4o models (mini and regular) were very similar in performance, speed, and cost, and both outperformed the baseline and Llama 3.
While quantitative evaluations are helpful, we believe qualitative evaluations are just as important, if not more so, for ensuring that the outputs are actually useful for navigating an airport. To evaluate this quality, we manually reviewed each model's outputs on a random subset of images for conciseness, informativeness, and accuracy. Given the similarity between the two GPT-4o models, we annotated only GPT-4o, not GPT-4o mini, to save time.
For this human evaluation, we created a scoring rubric for each metric and had multiple annotators score the outputs of each model to ensure sufficient inter-annotator agreement. Interestingly, GPT-4o scored highly on informativeness and accuracy, but on conciseness it failed to follow our prompt instruction to respond in fewer than three sentences. It was, however, much more accurate than Llama 3 and BLIP.
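We do not detail the agreement statistic here, but a common choice for ordinal rubric scores is a weighted Cohen's kappa, sketched below with hypothetical annotator scores; this is an illustration of the kind of check involved, not our exact procedure.

```python
# Sketch of an inter-annotator agreement check on 1-5 rubric scores, using a
# quadratically weighted Cohen's kappa; the scores below are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 4, 3, 5, 2, 4, 3]  # e.g., "informativeness" scores from annotator A
annotator_b = [5, 4, 3, 3, 5, 2, 4, 4]  # the same outputs scored by annotator B

kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")   # values near 1 indicate strong agreement
```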
Next, we checked the same random subset of model outputs to see whether each model identified key vision accessibility features when they were present in a given image, as instructed in our prompts. For example, if a tactile yellow guide strip appeared in the image, the model's description should mention it.
In one example, all four models mentioned a handrail, but BLIP failed to mention the tactile guide strip and other important information, including the moving walkway and signage.
We then reviewed another important qualitative behavior: whether the models can detect when a photo contains insufficient information for navigation. Given an image of a ceiling, for example, we expect the models to ask for more information, as instructed in our prompts. In this case, all three models other than BLIP noted the lack of relevant navigation information in the image, though Llama 3 did not request another photo. In the same description, Llama 3 suggested asking airport staff for assistance. While this is not bad advice, it is notable because it runs counter to our mission of increasing the independence of visually impaired travelers.
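To make the behavior we are probing concrete, the sketch below shows one way such an instruction could be passed to GPT-4o through the OpenAI Python SDK; the prompt wording and file path are illustrative, not our exact production prompt.

```python
# Sketch of a GPT-4o description request that asks the model to flag unusable
# photos, assuming the OpenAI Python SDK; prompt wording is illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("ceiling_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "You describe airport scenes for blind and low-vision travelers in at most "
            "three sentences, highlighting accessibility features such as tactile guide "
            "strips and signage. If the photo lacks useful navigation information, say so "
            "and ask the user to take another photo.")},
        {"role": "user", "content": [
            {"type": "text", "text": "Describe my surroundings."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]},
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```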
Key Learnings & Impact
This project has been an impactful learning experience: the team applied cutting-edge AI and LLM methods in support of visually impaired travelers exploring the world. Throughout development and in collaboration with our users, the team found that individual accommodations are needed to support independent, confident travel.
Data Quality and Annotation Solution: We applied rigorous data preprocessing techniques and manually annotated a golden subset of the dataset to ensure high-quality inputs and references for the models.
Model Optimization and Deployment Solution: We conducted extensive hyperparameter tuning, model evaluation, and cost analysis to confirm GPT-4o as the model we will deploy, and we paid for the API to save time and reduce complexity.
Accessibility Features Integration Solution: To enhance the user experience, we developed and integrated multiple accessibility features, such as audio input/output, increased font size, and dynamic prompting, and we highlighted the accessibility features present within the scene (a minimal sketch of the audio output follows below).
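As a small illustration of the audio-output piece, a generated description can be converted to speech with a text-to-speech library such as gTTS; this is a sketch under that assumption, not necessarily the stack used in the deployed app.

```python
# Sketch of audio output for a generated description, assuming the gTTS
# text-to-speech library; the deployed app's actual TTS stack may differ.
from gtts import gTTS

description = ("You are in a wide corridor. A moving walkway is on your left, "
               "and a tactile guide strip runs along the floor toward the gates.")

gTTS(text=description, lang="en").save("description.mp3")  # playable audio for the user
```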
Acknowledgments
We are grateful for the guidance and encouragement of our Capstone Advisors, Joyce Shen and Kira Wetzel, who supported and advised us on our Capstone Project for the Master of Information and Data Science program at the School of Information, University of California, Berkeley. We especially thank our volunteer users for providing feedback and helping us create a better product that safely fulfills their needs.