Comprehension Workshop

Predicting Syntactic Dependencies in Albanian Language by leveraging mouse tracking data

Authors:
Haveriku, Alba, alba.haveriku@fti.edu.al, Polytechnic University of Tirana
Kote, Nelda, nkote@fti.edu.al, Polytechnic University of Tirana
Çepani, Anila, anila.cepani@unitir.edu.al, University of Tirana
Çerpja, Adelina, adicerpja@gmail.com, Academy of Sciences of Albania
Kajo Meçe, Elinda, ekajo@fti.edu.al

Keywords: mouse tracking, universal dependencies, low-resource languages, Albanian

Abstract:

Purpose: High-quality eye-tracking equipment is costly, which is a particular barrier for understudied languages such as Albanian, where no dedicated laboratory infrastructure exists. To address this, we explore low-cost alternatives for studying reading behavior. Specifically, we use the MoTR (Mouse Tracking for Reading) tool developed by Wilcox et al. [1] to improve the prediction of syntactic dependencies in Albanian.

Methods: Our study integrates two resources: (1) a mouse tracking corpus collected from 50 native Albanian speakers using the adaptations made to the MoTR tool in Haveriku et al. [2]; and (2) a Stanza model trained on a Universal Dependencies treebank for the Standard Albanian Language (SAL), presented in Kote et al. [3], composed of 24,537 tokens and annotated with part-of-speech (PoS) tags, morphological features, lemmas and syntactic dependencies. We merge the MoTR corpus with the CoNLL-U files automatically generated by the Albanian model to examine whether adding mouse tracking data improves the accuracy of syntactic dependency prediction. Four models were tested on the merged corpus: Random Forest, XGBoost, LightGBM and Multi-Layer Perceptron (MLP). Each was trained to predict the HEAD and DEPREL fields, and accuracy was compared with and without gaze duration as a feature.
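The merging step described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the three-token sample sentence is hand-made, the gaze values are invented for illustration, and the names `parse_conllu`, `attach_gaze`, `to_features` and the `(sentence index, token id)` alignment key are all assumptions. Only the ten-column CoNLL-U layout itself is standard.

```python
from typing import Dict, List, Optional, Tuple

# Standard CoNLL-U column order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[dict]]:
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):          # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        token = dict(zip(CONLLU_FIELDS, cols))
        token["id"] = int(cols[0])
        token["head"] = int(cols[6]) if cols[6].isdigit() else None
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def attach_gaze(sentences: List[List[dict]],
                gaze_ms: Dict[Tuple[int, int], float]) -> List[dict]:
    """Flatten tokens into rows and attach a gaze duration where one exists.

    The (sentence index, token id) key is a hypothetical alignment scheme.
    """
    rows = []
    for sent_idx, sentence in enumerate(sentences):
        for token in sentence:
            row = dict(token)
            row["sent_idx"] = sent_idx
            row["gaze_ms"] = gaze_ms.get((sent_idx, token["id"]))  # None if missing
            rows.append(row)
    return rows


def to_features(row: dict, use_gaze: bool) -> Dict[str, Optional[object]]:
    """Build one feature dict; `use_gaze` toggles the with/without comparison."""
    features = {"form": row["form"], "lemma": row["lemma"], "upos": row["upos"]}
    if use_gaze:
        features["gaze_ms"] = row["gaze_ms"]
    return features


# Hand-made toy sentence, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = Unë lexoj.",
    "\t".join(["1", "Unë", "unë", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "lexoj", "lexoj", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
]) + "\n"

# Invented gaze durations in milliseconds; punctuation has no measurement here.
TOY_GAZE = {(0, 1): 210.0, (0, 2): 340.0}

rows = attach_gaze(parse_conllu(SAMPLE), TOY_GAZE)
baseline = [to_features(r, use_gaze=False) for r in rows]
with_gaze = [to_features(r, use_gaze=True) for r in rows]
```

The two feature lists differ only in the `gaze_ms` entry, so any classifier trained once on each gives a like-for-like comparison of the kind described above, with `head` and `deprel` from the same rows serving as targets.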

Results & Discussion: Our preliminary results indicate that incorporating total gaze duration improves the predictions of every model. The XGBoost model’s accuracy increased from 98.4% to 98.7% for HEAD prediction, while for DEPREL prediction the LightGBM model improved from 98.6% to 98.8%. Similar trends are observed for the other models, indicating that the collected reading times reflect syntactic complexity. The findings suggest that, even in the absence of eye-tracking data, mouse tracking data can enhance existing models for low-resource languages like Albanian, opening new possibilities for more advanced data processing and predictive model evaluation.
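Because the baseline accuracies are already high, the gains may be easier to read as a relative reduction in error rate. The short calculation below uses only the four accuracy figures reported above; the function name `relative_error_reduction` is ours, and this framing is an illustrative reading, not an additional result. Under it, the HEAD improvement corresponds to roughly 19% fewer errors and the DEPREL improvement to roughly 14% fewer.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of errors removed when accuracy (in percent) rises."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Accuracy figures (percent) as reported in the abstract.
reported = {
    "HEAD (XGBoost)": (98.4, 98.7),
    "DEPREL (LightGBM)": (98.6, 98.8),
}

for task, (before, after) in reported.items():
    reduction = relative_error_reduction(before, after)
    print(f"{task}: {reduction:.1%} relative error reduction")
```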

REFERENCES

[1] Wilcox, E. G., Ding, C., Sachan, M., Jäger, L. A.: Mouse Tracking for Reading (MoTR): A new naturalistic incremental processing measurement tool. In: Journal of Memory and Language, 138 (2024). https://doi.org/10.1016/j.jml.2024.104534

[2] Haveriku, A., Bedulla, S., Kote, N., Meçe, E. K.: Understanding Reading Patterns of Albanian Native Readers through Mouse Tracking Analysis. In: The 39th International Conference on Advanced Information Networking and Applications (AINA-2025) (in press).

[3] Kote, N., Rushiti, R., Çepani, A., Haveriku, A., Trandafili, E., Meçe, E. K., Rakipllari, E. S., Xhanari, L., Deda, A.: Universal Dependencies Treebank for Standard Albanian: A New Approach. In: Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024), pp. 80-89, Sofia, Bulgaria, Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences (2024). https://aclanthology.org/2024.clib-1.7/