Features of Distributional Method for Indonesian Word Clustering

Herry Sujaini


We describe the results of a study to determine the best features for the EWSB (Extended Word Similarity Based) algorithm. EWSB is a word clustering algorithm that can be applied to any language using a common feature format. We proposed four alternative features for computing word similarity and ran experiments on Indonesian to determine the best feature format for the language. We found that the best feature for EWSB on Indonesian is the t w w' format (3-gram) with 0 (zero) word relation. Moreover, 3-grams performed better than 4-grams for all of the proposed features: the average recall with 3-grams is 83.50%, while the average recall with 4-grams is 57.25%.
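To make the idea of n-gram features for word similarity concrete, the following is a minimal Python sketch of a generic distributional approach. It is not the EWSB algorithm and does not reproduce the t w w' feature format, whose exact definition is not given in this abstract. The function names `ngram_features` and `cosine`, the choice of "the words that follow a target word inside an n-gram" as the feature, the use of cosine similarity, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngram_features(sentences, n=3):
    """Collect, for each word, counts of the (n-1) words that follow it.

    Assumption: the feature of a word is the tuple of words that follow it
    inside an n-gram window. This is an illustrative choice, not the
    feature definition used by EWSB.
    """
    feats = defaultdict(Counter)
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            word = tokens[i]
            context = tuple(tokens[i + 1:i + n])
            feats[word][context] += 1
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus (tokenized Indonesian sentences).
sentences = [
    ["saya", "makan", "nasi", "goreng"],
    ["kami", "makan", "nasi", "goreng"],
    ["saya", "minum", "teh", "manis"],
]

feats = ngram_features(sentences, n=3)

# "saya" and "kami" share the context ("makan", "nasi").
print(round(cosine(feats["saya"], feats["kami"]), 3))  # 0.707
# "makan" and "minum" share no context in this toy corpus.
print(cosine(feats["makan"], feats["minum"]))  # 0.0
```

Words whose feature vectors score highly against each other would be candidates for the same cluster. Changing `n` from 3 to 4 in this sketch lengthens the context tuple, which is one simple way to compare 3-gram and 4-gram features on a real corpus.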


Keywords: n-gram; word clustering; word similarity; EWSB



DOI: http://dx.doi.org/10.26418/jp.v5i2.33049


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License