Annotation Error Detection and Correction for Indonesian POS Tagging Corpus

Muhammad Alfian; Umi Laili Yuhana; Daniel Siahaan; Harum Munazharoh

doi:10.24843/LKJITI.2025.v16.i01.p04

Authors

Muhammad Alfian, Department of Informatics, Institut Teknologi Sepuluh Nopember
Umi Laili Yuhana, Department of Informatics, Institut Teknologi Sepuluh Nopember
Daniel Siahaan, Department of Informatics, Institut Teknologi Sepuluh Nopember
Harum Munazharoh, Department of Indonesian Language and Literature, Universitas Airlangga

Keywords:

Annotation Error Detection, Annotation Error Correction, POS Tagging

Abstract

A linguistic corpus is the primary material for training and evaluating machine learning models, especially for POS tagging. However, human-annotated corpora are not free from annotation errors, and these errors degrade model performance. We therefore propose a procedure for annotation error detection and correction. We detect annotation errors in an Indonesian POS tagging corpus using the n-gram variation method, then correct the corpus using an expert-voting approach. Detection yielded 6,536 annotation error candidates, each of which is either (i) an ambiguous word or (ii) an incorrect annotation. In the correction step, a group of experts validated and corrected the candidates by majority voting, identifying and correcting 503 words from 1,918 sentences. Finally, we compared the performance of a POS tagging model using the corpus before and after correction. The corrected corpus showed a significant improvement in F1-score (+9.69%) over the uncorrected corpus.
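
To make the two steps in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simplified form of n-gram variation detection: a word is flagged as a candidate when it occurs in an identical word context (here one word on each side) with more than one tag. The function names `variation_ngrams` and `majority_vote`, the toy sentences, and the tags shown are illustrative assumptions and are not taken from the paper or its corpus.

```python
from collections import Counter, defaultdict


def variation_ngrams(corpus, n=3):
    """Flag words tagged inconsistently within an identical n-word context.

    corpus: list of sentences, each a list of (word, tag) pairs.
    n: odd context size; the middle word is the one whose tags are compared.
    Returns {context tuple: {tag: count}} for contexts with more than one tag.
    Simplified illustration only (assumption, not the paper's exact method).
    """
    half = n // 2
    seen = defaultdict(Counter)
    for sent in corpus:
        words = [w.lower() for w, _ in sent]
        for i in range(half, len(sent) - half):
            context = tuple(words[i - half:i + half + 1])
            seen[context][sent[i][1]] += 1
    return {ctx: dict(tags) for ctx, tags in seen.items() if len(tags) > 1}


def majority_vote(votes):
    """Return the tag chosen by a strict majority of experts, else None.

    votes: non-empty list of tags, one per expert.
    """
    tag, count = Counter(votes).most_common(1)[0]
    return tag if count > len(votes) / 2 else None


# Toy data (hypothetical): "bisa" tagged two ways in the same context.
toy_corpus = [
    [("dia", "PRP"), ("bisa", "MD"), ("datang", "VB")],
    [("dia", "PRP"), ("bisa", "NN"), ("datang", "VB")],
    [("saya", "PRP"), ("bisa", "MD"), ("pergi", "VB")],
]

candidates = variation_ngrams(toy_corpus)
decision = majority_vote(["MD", "MD", "NN"])
```

In this sketch a flagged context is only a candidate: as the abstract notes, it may reflect a genuinely ambiguous word rather than an error, which is why the final decision is left to the expert vote. Returning `None` when no strict majority exists is a design choice of the sketch, so that ties are left unresolved rather than settled arbitrarily.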
