Features of Distributional Method for Indonesian Word Clustering

Herry Sujaini

Abstract


We describe the results of a study to determine the best features for the EWSB (Extended Word Similarity Based) algorithm. EWSB is a word clustering algorithm that can be applied to any language using a common feature. We propose four alternative features for word similarity computation and experiment on Indonesian to determine the best feature format for that language. We found that the best feature for EWSB on Indonesian is the t w w' format (3-gram) with 0 (zero) word relation. Moreover, 3-gram features outperform 4-gram features for all of the proposed features: the average recall with 3-grams is 83.50%, while the average recall with 4-grams is 57.25%.
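To make the idea of an n-gram feature for word similarity concrete, the following is a minimal sketch, not the EWSB implementation. It assumes that in the t w w' format t is the target word and w, w' are the two words that follow it; the abstract does not define these symbols, so this reading is an assumption. Cosine similarity is used only as a simple stand-in for whatever similarity measure EWSB actually applies, and the toy corpus, function names, and recall helper are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def trigram_features(tokens):
    """Map each target word t to counts of the (w, w') pairs that follow it.

    Assumption: 't w w'' is read as target word plus its next two words.
    """
    feats = defaultdict(Counter)
    for t, w, w2 in zip(tokens, tokens[1:], tokens[2:]):
        feats[t][(w, w2)] += 1
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (stand-in measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(predicted, gold):
    """Fraction of gold cluster members that also appear in the predicted cluster."""
    gold = set(gold)
    return len(set(predicted) & gold) / len(gold) if gold else 0.0


# Hypothetical toy corpus, for illustration only.
tokens = ("saya makan nasi goreng kamu makan nasi goreng "
          "saya minum teh manis").split()
feats = trigram_features(tokens)

# 'saya' and 'kamu' share the following context (makan, nasi).
sim_pronouns = cosine(feats["saya"], feats["kamu"])
# 'makan' and 'minum' share no following context in this corpus.
sim_verbs = cosine(feats["makan"], feats["minum"])
```

Under the same assumption, a 4-gram variant would simply extend the context tuple by one more word, which makes each feature sparser on a small corpus.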

Keywords


n-gram; word clustering; word similarity; EWSB






DOI: http://dx.doi.org/10.26418/jp.v5i2.33049



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.  