A Computational Analysis of Language, Place, and Sentiment in Anglophone Kashmiri Literature

This project uses Python-based text analysis to explore how place, language, and emotion intersect in Anglophone Kashmiri literature. It investigates which geographic locations are most frequently referenced, what proportion of the words are romanised Kashmiri rather than English, and how that proportion correlates with emotional tone. The study combines natural language processing with literary interpretation to surface patterns that traditional close reading can miss.

Methodology Overview

To process the multilingual corpus, I created three custom wordlists:
- English Dictionary: Compiled from open-source resources, including irregular verbs and UK/US variants.
- Romanised Urdu-Hindi Dictionary: Includes commonly used words transliterated from Devanagari and Nastaliq scripts.
- Romanised Kashmiri Wordlist: Manually curated based on linguistic research and textual context.
Textual data from a selected corpus of novels and memoirs was extracted and normalised. Inconsistencies in romanised spelling and diacritics were handled with Python's unicodedata module. Language tagging combined dictionary filtering with regex pattern matching.
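The normalisation and tagging steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the three tiny wordlists, the function names, and the tagging priority (Kashmiri first, then Urdu-Hindi, then English) are all assumptions made for the example.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical miniature wordlists; the real lists are far larger.
ENGLISH = {"the", "was", "cold", "and", "my", "she", "wore", "a"}
URDU_HINDI = {"zindagi", "azadi"}
KASHMIRI = {"kangri", "pheran", "wazwan"}

# Runs of ASCII letters, optionally with one internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def normalise(text: str) -> str:
    """Lower-case the text and strip combining diacritics."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def tag_token(token: str) -> str:
    """Tag one token by dictionary lookup (priority order is an assumption)."""
    if token in KASHMIRI:
        return "kashmiri"
    if token in URDU_HINDI:
        return "urdu_hindi"
    if token in ENGLISH:
        return "english"
    return "unknown"


def language_proportions(text: str) -> dict:
    """Return the share of tokens assigned to each language tag."""
    tokens = TOKEN_RE.findall(normalise(text))
    counts = Counter(tag_token(t) for t in tokens)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


print(language_proportions("The kãngrī was cold"))
```

Checking the Kashmiri list first means a word shared between lists is counted as Kashmiri. A different priority order would shift the proportions, so that choice should be stated alongside any reported figures.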
Rule-based sentiment analysis was performed using TextBlob, scoring the emotional polarity of sentences or paragraphs containing Kashmiri words. Statistical validation was attempted using Pearson correlation and the Mann-Whitney U test to examine relationships between sentiment and language presence.
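As a rough illustration of the two statistical checks, the sketch below computes Pearson's r and a Mann-Whitney U statistic using only the standard library. The write-up does not say which variables were compared, so the pairing here is an assumption: per-passage Kashmiri word share against polarity for the correlation, and polarity of passages with versus without Kashmiri words for the U test. The sample numbers are invented placeholders, and the p-value uses a plain normal approximation without tie or continuity correction. In practice, scipy.stats.pearsonr and scipy.stats.mannwhitneyu are the usual library routes.

```python
import math
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mann_whitney_u(a, b):
    """Return (U, two-sided p) via the normal approximation.

    No tie or continuity correction is applied, so the p-value is only
    indicative for small samples.
    """
    n1, n2 = len(a), len(b)
    # Count pairs where the value from `a` exceeds the value from `b`;
    # ties contribute one half.
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u = min(u_a, n1 * n2 - u_a)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma  # z <= 0 because u is the smaller of the two U values
    return u, min(1.0, 2 * NormalDist().cdf(z))


# Invented placeholder values, one per passage.
kashmiri_share = [0.00, 0.02, 0.05, 0.08, 0.10]
polarity = [0.10, 0.05, -0.02, -0.10, -0.15]
print(pearson_r(kashmiri_share, polarity))

with_kashmiri = [-0.10, -0.15, -0.02]
without_kashmiri = [0.10, 0.05, 0.20]
print(mann_whitney_u(with_kashmiri, without_kashmiri))
```

Pearson's r assumes a roughly linear relationship. The Mann-Whitney U test makes no normality assumption, which suits bounded and often skewed polarity scores.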
Programming: Python (VS Code + Jupyter extension)
Text Processing: re, unicodedata
Visualisation: Matplotlib, Seaborn, WordCloud
Sentiment Analysis: NLTK
Statistical Testing: Pearson Correlation, Mann-Whitney U

Findings

Most Represented Areas in Anglophone Kashmiri Literature

Srinagar, the summer capital and largest urban area, dominates the landscape of modern Kashmiri literature, reflecting a city-centric focus in narratives about conflict, loss, and memory. Mapping place mentions revealed how geographic references cluster around conflict zones, governmental hubs, and sites of personal significance.
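Counting place mentions of this kind can be sketched with a small gazetteer and a regex per place. The gazetteer, its spelling variants, and the function name below are hypothetical. The project's actual place list and matching rules are not given in this write-up.

```python
import re
from collections import Counter

# Hypothetical gazetteer: canonical name -> lower-case spelling variants.
GAZETTEER = {
    "Srinagar": ["srinagar"],
    "Baramulla": ["baramulla", "baramula"],
    "Sopore": ["sopore", "sopur"],
}


def count_places(text: str, gazetteer: dict = GAZETTEER) -> Counter:
    """Count whole-word mentions of each place, merging spelling variants."""
    lowered = text.lower()
    counts = Counter()
    for canonical, variants in gazetteer.items():
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        counts[canonical] += len(re.findall(pattern, lowered))
    return counts


sample = "Srinagar was quiet. From Baramula to Srinagar the road was closed."
print(count_places(sample).most_common())
```

Merging variants under one canonical name matters for romanised text, where the same town can appear under several spellings.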

Linguistic Composition of Texts

The analysis used the same texts as the place mapping above. Although written by Kashmiris and deeply rooted in Kashmiri themes, these works are in English, which situates them within a postcolonial literary framework where global legibility often comes at the cost of linguistic authenticity. This raises important questions about cultural identity, language preservation, and the role of English as both a bridge and a barrier. The word-frequency analysis, visualised as a pie chart, invites a deeper conversation about the identity of such literature. Extended across the corpus, it underscores the need for a more nuanced classification, one that acknowledges both the geopolitical origins of these texts and their linguistic realities.

Enhanced Sentiment Analysis Using Manually Annotated Kashmiri Lexicon Data

Since off-the-shelf sentiment analysis tools do not account for the Kashmiri language, I developed a program to extract romanised Kashmiri words from selected texts. The extracted words were collected, checked for errors and duplicates, and manually annotated as positive, negative, or neutral.
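One way an annotated lexicon like this could be folded into an existing polarity score is sketched below. The lexicon entries and their labels are illustrative placeholders, not the project's annotations. The weighted-average combination and the `weight` parameter are also assumptions, since the write-up does not specify how the scores were merged. The `base_polarity` argument stands in for whatever score the English-language tool returns for the passage.

```python
import re

# Illustrative placeholder labels (+1 positive, 0 neutral, -1 negative);
# these are NOT the project's manual annotations.
LEXICON = {"mouj": 1, "kangri": 0, "zulm": -1}


def lexicon_score(text: str, lexicon: dict = LEXICON):
    """Mean label of lexicon words found in the text, or None if none occur."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else None


def combined_score(base_polarity: float, text: str,
                   lexicon: dict = LEXICON, weight: float = 0.5) -> float:
    """Blend a base polarity with the lexicon score (weighting is assumed)."""
    lex = lexicon_score(text, lexicon)
    if lex is None:
        # No annotated Kashmiri words: keep the base score unchanged.
        return base_polarity
    return (1 - weight) * base_polarity + weight * lex


print(combined_score(0.0, "the zulm continued"))
```

Passages with no annotated words keep their base score, so the lexicon only adjusts sentences where Kashmiri vocabulary actually appears.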

The sentiment annotations for the romanised Kashmiri words (marked as positive, neutral, or negative) are based on manual labelling and may not fully capture the nuanced emotional context within the novel. These labels are approximate and subjective, and therefore the combined sentiment scores should be interpreted with caution. Future improvements could include refining this lexicon using native speaker validation or machine learning techniques tailored to the Kashmiri language.