Volume 12, Number 2, December 2016 - DOI: http://dx.doi.org/10.21700/ijcis.2016.274

IJCIS

Computing and Information Sciences is a peer-reviewed journal committed to the timely publication of original research, survey, and tutorial contributions on the analysis and development of computing and information science. The journal is designed mainly to serve researchers and developers working with information and computing. Papers that provide both theoretical analysis and carefully designed computational experiments are particularly welcome. The journal is published 2-3 times per year and distributed to librarians, universities, research centers, and researchers in computing, mathematics, and information science. The journal maintains strict refereeing procedures through its editorial policies in order to publish only papers of the highest quality. Refereeing is done by anonymous reviewers. Reviews often take four to six months to obtain, occasionally longer, and the publication process takes several additional months.

DOI: http://dx.doi.org/10.21700/ijcis.2016.119

Automatic Diacritics Restoration for Dialectal Arabic Text

Ayman A. Zayyan1,*  email: zayyan@qu.edu.qa
Mohamed Elmahdy2
Husniza binti Husni3
Jihad M. Al Ja’am1

1Department of Computer Science and Engineering, College of Engineering, Qatar University, Qatar

2Faculty of Media Engineering and Technology, German University in Cairo, Cairo, Egypt.

3School of Computing, College of Arts and Sciences, Universiti Utara Malaysia, Malaysia

 

 

*Corresponding author

Received: 4 June 2016
Revised: 10 August 2016
Accepted: 25 August 2016
Published: 25 December 2016

Abstract: This paper addresses the problem of missing diacritic marks in most written dialectal Arabic resources. Our aim is to implement a scalable and extensible platform for automatically restoring the diacritic marks in undiacritized dialectal Arabic texts. Different rule-based and statistical techniques are proposed, including maximum likelihood estimation and statistical n-gram models. The proposed platform includes helper tools for text pre-processing and encoding conversion. The diacritization accuracy of each technique is evaluated in terms of Diacritic Error Rate (DER) and Word Error Rate (WER). The approach trains several n-gram models on different lexical units. A data pool of both Modern Standard Arabic (MSA) and dialectal Arabic data was used to train the models.
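
As a rough illustration of the kind of technique the abstract names, the following Python sketch builds a word-level maximum likelihood baseline (each undiacritized word is mapped to its most frequent diacritized form in the training data) and scores the output with DER and WER. It is not the authors' implementation. The function names, the set of code points treated as diacritics, and the metric definitions (DER as the fraction of letters whose diacritics differ from the reference, WER as the fraction of words containing at least one such letter) are assumptions made for this example, and the paper's own definitions may differ.

```python
from collections import Counter, defaultdict

# Assumption: diacritics are U+064B..U+0652 (tanween, short vowels, shadda,
# sukun) plus the superscript alef U+0670.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(word):
    """Remove all diacritic marks, leaving only the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def train_mle(diacritized_words):
    """Map each bare word to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for word in diacritized_words:
        counts[strip_diacritics(word)][word] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}


def diacritize(words, model):
    """Look up each bare word; unseen words are returned unchanged."""
    return [model.get(word, word) for word in words]


def split_letters(word):
    """Split a word into (letter, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # a diacritic with no preceding letter is ignored
                pairs[-1][1] += ch
        else:
            pairs.append([ch, ""])
    return [(letter, marks) for letter, marks in pairs]


def error_rates(reference, hypothesis):
    """Return (DER, WER) for two aligned lists of diacritized words.

    Assumes both lists contain the same words with the same base letters,
    differing only in their diacritics.
    """
    letter_errors = total_letters = word_errors = 0
    for ref_word, hyp_word in zip(reference, hypothesis):
        ref_pairs = split_letters(ref_word)
        hyp_pairs = split_letters(hyp_word)
        errors = sum(r[1] != h[1] for r, h in zip(ref_pairs, hyp_pairs))
        letter_errors += errors
        total_letters += len(ref_pairs)
        word_errors += errors > 0
    return letter_errors / total_letters, word_errors / len(reference)
```

A baseline like this ignores context entirely, so a word with several valid diacritizations always receives the same form. The n-gram models mentioned in the abstract are one way to use the surrounding words or letters to choose among the candidates.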

Keywords: Diacritization; Vowelization; Dialectal Arabic; Text Processing.



