SCStemmers - a collection of stemmers for Serbian and Croatian

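This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian: the greedy and the optimal stemmers of Kešelj and Šipka, Milošević's refinement of the greedy stemmer, and the stemmer of Ljubešić and Pandžić.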

Text Encoding

All stemmers expect the input text to be encoded in UTF-8. Their outputs are also UTF-8 encoded.

Since Serbian is a digraphic language, the input text can be in either the Cyrillic or the Latin script. All stemmers produce output in the Latin script.

Dual1 Coding System

The stemmers for Serbian internally use the so-called dual1 coding system, in which only Latin script characters without diacritical marks are allowed. To obtain dual1-coded text, all Cyrillic characters are first transliterated into their Latin script equivalents. Afterwards, all characters with diacritical marks are replaced in the following manner:

  • Č/č is coded as Cx/cx
  • Ć/ć is coded as Cy/cy
  • Dž/dž is coded as Dx/dx
  • Đ/đ is coded as Dy/dy
  • Ž/ž is coded as Zx/zx
  • Š/š is coded as Sx/sx

The greedy and the optimal stemmers of Kešelj and Šipka (but not Milošević's refinement of the greedy stemmer) also apply the following:

  • Lj/lj is coded as Ly/ly
  • Nj/nj is coded as Ny/ny

The stemmers for Serbian also accept texts in the dual1 coding as input, but will still produce the normal Latin script text as output. However, this behavior can easily be changed by applying the coding transformation methods, supplied within the SerbianStemmer class, to the output text.
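The coding rules listed above can be sketched in a few lines of Java. This is only an illustration of the table, not the coding transformation methods of the SerbianStemmer class: the class name `Dual1Sketch`, the method `toDual1`, and the `codeLjNj` flag are hypothetical, the input is assumed to already be in the Latin script, and only the letter pairs listed above are handled. Unicode escapes are used so the source stays ASCII-only regardless of the compiler's encoding setting.

```java
// Hypothetical sketch of the dual1 coding table; not part of the SCStemmers API.
final class Dual1Sketch {

    // Rules are applied in order. The "dz-caron" digraph must come before the
    // bare z-caron rule, otherwise it would be coded as "dzx" instead of "dx".
    private static final String[][] BASE_RULES = {
        {"D\u017E", "Dx"}, {"d\u017E", "dx"}, // D + z-caron digraph
        {"\u010C", "Cx"}, {"\u010D", "cx"},   // C with caron
        {"\u0106", "Cy"}, {"\u0107", "cy"},   // C with acute
        {"\u0110", "Dy"}, {"\u0111", "dy"},   // D with stroke
        {"\u017D", "Zx"}, {"\u017E", "zx"},   // Z with caron
        {"\u0160", "Sx"}, {"\u0161", "sx"},   // S with caron
    };

    // Extra rules described above for the greedy and optimal stemmers of
    // Keselj and Sipka (but not for Milosevic's refinement).
    private static final String[][] LJ_NJ_RULES = {
        {"Lj", "Ly"}, {"lj", "ly"},
        {"Nj", "Ny"}, {"nj", "ny"},
    };

    // Codes Latin-script text into dual1. Assumption: every "lj"/"nj"
    // sequence is treated as a digraph when codeLjNj is true.
    static String toDual1(String latinText, boolean codeLjNj) {
        String out = latinText;
        for (String[] rule : BASE_RULES) {
            out = out.replace(rule[0], rule[1]);
        }
        if (codeLjNj) {
            for (String[] rule : LJ_NJ_RULES) {
                out = out.replace(rule[0], rule[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Input spells three words containing dz-caron, c-caron, s-caron and lj.
        String input = "d\u017Eep \u010Da\u0161a ljubav";
        System.out.println(toDual1(input, true));  // prints "dxep cxasxa lyubav"
        System.out.println(toDual1(input, false)); // prints "dxep cxasxa ljubav"
    }
}
```

Decoding would apply the same table in the opposite direction; that step is not shown here.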

Usage

All stemmers can be used in a program through the interface declared in the SCStemmer abstract class, via the methods:

public String stemWord (String word)
public String stemLine (String line)
public String stemText (String text)
public void stemFile (String fileInput, String fileOutput)
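As a rough illustration of how these four entry points relate by granularity (word, line, text, file), here is a self-contained sketch. `StemmerApiSketch` and `ToyStemmer` are hypothetical stand-ins that only mirror the signatures above. The whitespace tokenization, the line joining, and the truncation used as "stemming" are assumptions made for the example and do not reflect how the package's stemmers work internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in mirroring the four method signatures; NOT SCStemmer itself.
abstract class StemmerApiSketch {

    public abstract String stemWord(String word);

    // Assumption: words are separated by whitespace and re-joined with one space.
    public String stemLine(String line) {
        StringBuilder sb = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(stemWord(word));
        }
        return sb.toString();
    }

    // Assumption: a text is processed line by line.
    public String stemText(String text) {
        String[] lines = text.split("\\R", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) sb.append('\n');
            sb.append(stemLine(lines[i]));
        }
        return sb.toString();
    }

    // Reads and writes UTF-8 explicitly, matching the encoding the README requires.
    public void stemFile(String fileInput, String fileOutput) {
        try {
            byte[] raw = Files.readAllBytes(Paths.get(fileInput));
            String stemmed = stemText(new String(raw, StandardCharsets.UTF_8));
            Files.write(Paths.get(fileOutput), stemmed.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Toy "stemmer" that keeps at most the first four characters of a word.
// It has no linguistic meaning and exists only to make the sketch runnable.
final class ToyStemmer extends StemmerApiSketch {
    @Override
    public String stemWord(String word) {
        return word.length() > 4 ? word.substring(0, 4) : word;
    }

    public static void main(String[] args) {
        StemmerApiSketch stemmer = new ToyStemmer();
        System.out.println(stemmer.stemWord("stemming"));              // prints "stem"
        System.out.println(stemmer.stemLine("stemming Serbian text")); // prints "stem Serb text"
    }
}
```

With the actual package, the same four calls would be made on an instance of one of its concrete stemmer classes instead of `ToyStemmer`.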

Command-line interface

The supplied SCStemmers.jar file makes it possible to stem the contents of textual files using the command line. Stemmers from the SCStemmers package can be invoked by the following command:

java -jar SCStemmers.jar StemmerID InputFile OutputFile

where StemmerID is a number identifying the stemming algorithm:

  • 1 - Kešelj & Šipka - Greedy
  • 2 - Kešelj & Šipka - Optimal
  • 3 - Milošević
  • 4 - Ljubešić & Pandžić

InputFile is the path of the UTF-8 encoded TXT file that is to be stemmed. The stemmed text is written to the file given by the OutputFile argument.
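When the stemming step is part of a larger Java pipeline, the command above can also be launched as a child process. The following is a minimal sketch under stated assumptions: the class `StemmerCliSketch`, the enum `Algorithm`, and the file names are hypothetical, and only the argument order and the StemmerID numbers are taken from the description above. Nothing is assumed about the meaning of the process exit code.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles and runs the documented command line.
final class StemmerCliSketch {

    // StemmerID values as listed above; the constant names are made up here.
    enum Algorithm {
        KESELJ_SIPKA_GREEDY(1),
        KESELJ_SIPKA_OPTIMAL(2),
        MILOSEVIC(3),
        LJUBESIC_PANDZIC(4);

        final int id;

        Algorithm(int id) {
            this.id = id;
        }
    }

    // Builds: java -jar <jarPath> <StemmerID> <InputFile> <OutputFile>
    static List<String> command(String jarPath, Algorithm algorithm,
                                String inputFile, String outputFile) {
        return Arrays.asList("java", "-jar", jarPath,
                Integer.toString(algorithm.id), inputFile, outputFile);
    }

    // Starts the process, forwards its console output, and waits for it to end.
    static int run(String jarPath, Algorithm algorithm,
                   String inputFile, String outputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                command(jarPath, algorithm, inputFile, outputFile))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) to actually execute it.
        System.out.println(String.join(" ",
                command("SCStemmers.jar", Algorithm.MILOSEVIC, "input.txt", "output.txt")));
        // prints "java -jar SCStemmers.jar 3 input.txt output.txt"
    }
}
```

Using an enum instead of a bare number keeps an invalid StemmerID from reaching the command line.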

Weka

Alternatively, the stemmers can be utilized as an unofficial plug-in module within Weka (Waikato Environment for Knowledge Analysis). To do so, download the SCStemmers Weka package. Open the Weka package manager (available in Weka >= 3.7) and use the "Unofficial - File/URL" option to select and install SCStemmers. After restarting Weka, the list of available stemmers (within the StringToWordVector filter) will also contain the four stemmers from this package.

References

If you wish to use this package in your paper or project, please include a reference to the following paper in which it was presented:

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset, Vuk Batanović, Boško Nikolić, Milan Milosavljević, in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2688-2696, Portorož, Slovenia (2016).

Be sure to also cite the original paper of each stemmer you use.

Additional Documentation

All classes and non-trivial methods contain extensive documentation and comments, in both Serbian and English. If you have any questions about the stemmers' functioning, please review the supplied javadoc documentation, the source code, and the papers listed above. If no answer can be found, feel free to contact me at: vuk.batanovic / at / ic.etf.bg.ac.rs

License

GNU General Public License 3.0 (GNU GPL 3.0)