A multilingual lexicon of words to hurt.

franciellevargas/Hurtlex


Hurtlex

HurtLex is a lexicon of offensive, aggressive, and hateful words in over 50 languages. The words are divided into 17 categories, plus a macro-category indicating whether a stereotype is involved. The 17 categories are:

Label Description
PS negative stereotypes and ethnic slurs
RCI locations and demonyms
PA professions and occupations
DDF physical disabilities and diversity
DDP cognitive disabilities and diversity
DMC moral and behavioral defects
IS words related to social and economic disadvantage
OR plants
AN animals
ASM male genitalia
ASF female genitalia
PR words related to prostitution
OM words related to homosexuality
QAS words with potential negative connotations
CDS derogatory words
RE felonies and words related to crime and immoral behavior
SVP words related to the seven deadly sins of the Christian tradition

HurtLex has a two-level structure. Each lemma belongs to one of these levels:

  • conservative: obtained by translating offensive senses of the words in the original lexicon.
  • inclusive: obtained by translating all the potentially relevant senses of the words in the original lexicon.
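
As a minimal sketch of how the two levels might be used in practice, the snippet below parses a tab-separated lexicon and keeps only the conservative entries. The column names (`id`, `pos`, `category`, `stereotype`, `lemma`, `level`), the helper `load_lexicon`, and the sample rows with placeholder lemmas are assumptions made for illustration; check the header of the file you actually download before relying on them.

```python
import csv
import io

# Hypothetical sample in the assumed TSV layout; the lemmas are placeholders.
SAMPLE_TSV = (
    "id\tpos\tcategory\tstereotype\tlemma\tlevel\n"
    "EX0001\tn\tan\tno\tlemma_a\tconservative\n"
    "EX0002\tn\tps\tyes\tlemma_b\tinclusive\n"
    "EX0003\ta\tcds\tno\tlemma_c\tconservative\n"
)


def load_lexicon(handle, level=None):
    """Read lexicon rows from a TSV file object, optionally keeping one level."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if level is None or row["level"] == level]


# Keep only the entries translated from offensive senses.
conservative = load_lexicon(io.StringIO(SAMPLE_TSV), level="conservative")
print([row["lemma"] for row in conservative])
```

To read a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `open(path, encoding="utf-8", newline="")`.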

Lexica

Here is the updated list of the Hurtlex word lists in all languages.

Language Available versions
AF Afrikaans 1.0 1.1 1.2
AR Arabic 1.0 1.1 1.2
BG Bulgarian 1.0 1.1 1.2
BN Bengali 1.0 1.1 1.2
CA Catalan 1.0 1.1 1.2
CS Czech 1.0 1.1 1.2
CY Welsh 1.0 1.1 1.2
DA Danish 1.0 1.1 1.2
DE German 1.0 1.1 1.2
EL Greek 1.0 1.1 1.2
EN English 1.0 1.1 1.2
EO Esperanto 1.0 1.1 1.2
ES Spanish 1.0 1.1 1.2
ET Estonian 1.0 1.1 1.2
EU Basque 1.0 1.1 1.2
FA Persian 1.0 1.1 1.2
FI Finnish 1.0 1.1 1.2
FR French 1.0 1.1 1.2
GA Irish 1.0 1.1 1.2
GL Galician 1.0 1.1 1.2
HE Hebrew 1.0 1.1 1.2
HI Hindi 1.0 1.1 1.2
HR Croatian 1.0 1.1 1.2
HU Hungarian 1.0 1.1 1.2
ID Indonesian 1.0 1.1 1.2
IS Icelandic 1.0 1.1 1.2
IT Italian 1.0 1.1 1.2
JA Japanese 1.0 1.1 1.2
KO Korean 1.0 1.1 1.2
LT Lithuanian 1.0 1.1 1.2
LV Latvian 1.0 1.1 1.2
MK Macedonian 1.0 1.1 1.2
MS Malay 1.0 1.1 1.2
MT Maltese 1.0 1.1 1.2
NL Dutch 1.0 1.1 1.2
NO Norwegian 1.0 1.1 1.2
PL Polish 1.0 1.1 1.2
PT Portuguese 1.0 1.1 1.2
RO Romanian 1.0 1.1 1.2
RU Russian 1.0 1.1 1.2
SIMPLE Simple English 1.0 1.1 1.2
SK Slovak 1.0 1.1 1.2
SL Slovenian 1.0 1.1 1.2
SQ Albanian 1.0 1.1 1.2
SR Serbian 1.0 1.1 1.2
SV Swedish 1.0 1.1 1.2
SW Swahili 1.0 1.1 1.2
TH Thai 1.0 1.1 1.2
TL Tagalog 1.0 1.1 1.2
TR Turkish 1.0 1.1 1.2
UK Ukrainian 1.0 1.1 1.2
VI Vietnamese 1.0 1.1 1.2
ZH Chinese 1.0 1.1 1.2

New in version 1.2: a table aligning lemmas across languages is now available.

Revised Hurtlex (IT)

The Revised HurtLex is a lexicon in which every headword is annotated with an offensiveness score. Focusing on the Italian entries, we revised the terms in HurtLex and derived an offensiveness score for each lexical item by applying an Item Response Theory model to the ratings provided by a large number of annotators.

Publications

Hurtlex is described in this paper:

Elisa Bassignana, Valerio Basile, Viviana Patti. Hurtlex: A Multilingual Lexicon of Words to Hurt. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-It 2018)

http://ceur-ws.org/Vol-2253/paper49.pdf

The Revised HurtLex dictionary is described in a paper currently under review.

Contribute

Contributions are welcome in the form of revised lexica. Anyone who is a native speaker of a language is invited to fork the repository and file a pull request.

Please try to limit your modifications to the following operations:

  • add: add a new item to a lexicon, by creating a new line. Fill in all the column values, including category and stereotype, set level="conservative", and add a new unique ID for the lemma.
  • remove: remove an item considered wrong for a lexicon, by removing the corresponding line.
  • update: change the lemma or the category of an item, e.g. because of a misspelling.
  • add offensiveness score: create a new column with a real value between 0 and 1 to indicate a score for the offensiveness of an item in a lexicon.
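
The add and add offensiveness score operations above could be checked before committing with a sketch like the following. It assumes each lexicon row is a dictionary keyed by the hypothetical column names in `REQUIRED`, and that the new score column is called `offensiveness`; neither name is prescribed by this README, and the sample lemmas are placeholders.

```python
# Assumed column names; adapt them to the header of the lexicon being edited.
REQUIRED = ("id", "pos", "category", "stereotype", "lemma", "level")


def add_entry(rows, entry):
    """Return a new list with `entry` appended, after basic sanity checks."""
    missing = [col for col in REQUIRED if not entry.get(col)]
    if missing:
        raise ValueError(f"missing column values: {missing}")
    if entry["level"] != "conservative":
        raise ValueError('new items should set level="conservative"')
    if any(row["id"] == entry["id"] for row in rows):
        raise ValueError(f"ID already in use: {entry['id']}")
    return rows + [entry]


def with_score(row, score):
    """Return a copy of `row` with an offensiveness score between 0 and 1."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return {**row, "offensiveness": score}


lexicon = [
    {"id": "EX0001", "pos": "n", "category": "an",
     "stereotype": "no", "lemma": "lemma_a", "level": "conservative"},
]
new_item = {"id": "EX0002", "pos": "n", "category": "cds",
            "stereotype": "no", "lemma": "lemma_b", "level": "conservative"}

lexicon = add_entry(lexicon, new_item)
scored = with_score(lexicon[1], 0.4)
print(len(lexicon), scored["offensiveness"])
```

Both helpers return new objects rather than mutating their inputs, which makes it easier to compare the lexicon before and after a change.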

Frequent issues:

  • Some languages are written in more than one script (e.g. Hindi, Bangla, Bulgarian, Russian): in these cases it is good practice to harmonize the lexicon by adding the missing spelling and keeping the same ID for the same lemma written in different scripts.
  • Some lexicons contain inflected forms instead of lemmas. These are mistakes introduced by the automatic processing. It is safe to remove such words if the corresponding lemma is already in the lexicon, or to modify them if it is not.

Please create a new version directory for the lexicon you submit. If yours is the first manually corrected version of a lexicon (that is, the last version is 1.*) please create the directory for version 2.0. Otherwise, proceed incrementally (2.0 -> 2.1, 2.1 -> 2.2, ...).
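
The versioning rule above can be expressed as a small helper. This is only a sketch: `next_version` is a hypothetical name, and it assumes version strings always have the form `major.minor`.

```python
def next_version(last_version):
    """Return the directory version to create, given the latest existing one."""
    major, minor = (int(part) for part in last_version.split("."))
    if major < 2:
        # The latest version is 1.*: this is the first manually corrected one.
        return "2.0"
    # Otherwise proceed incrementally: 2.0 -> 2.1, 2.1 -> 2.2, ...
    return f"{major}.{minor + 1}"


print(next_version("1.2"))
print(next_version("2.0"))
```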

Finally, do not forget to add a README.md file in your newly created directory, indicating what has changed and giving your contact details for due credit.

LICENSE

Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
  • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices:

  • You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
  • No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

https://creativecommons.org/licenses/by-nc-sa/4.0/
