📄 SemEval 2024 Task 8: Artificial Intelligence Text Detection System using Natural Language Processing and Neural Network techniques.

arhcoder/AI-Textification

📚 AI-TEXTIFICATION PROJECT

👅 Natural Language Processing

💻 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas

💻 Facultad de Ciencias

🏫 Universidad Nacional Autónoma de México


馃 AI-TEXTIFICATION

📄 Authorship detection in AI vs. human texts:

🔵 Task A: Binary Classification:

  1. Human text.
  2. Artificial Intelligence text.

🔵 Task B: Multiclass Classification:

  1. ChatGPT text.
  2. Cohere text.
  3. Davinci text.
  4. Dolly text.
  5. Human text.

👬 Authors:

  • Alejandro Ramos @arhcoder.
  • Manuel León Rosas.

🎯 Objective

This notebook obtains the dataset and applies preprocessing transformations to it in several combinations:

  1. PA: Cleaned => Lemma => UNK => GENSIM Embeddings.
  2. PB: Cleaned => Lemma => UNK => OWN Embeddings.
  3. PC: Cleaned => UNK => GENSIM Embeddings.
  4. PD: Cleaned => UNK => OWN Embeddings.

Neural network techniques are also used.
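
The shared front end of the four combinations (Cleaned, optional Lemma, UNK) can be sketched roughly as below. This is a minimal illustration and not the notebook's actual code: the `<UNK>` token, the `min_count` threshold, the regex-based cleaning, and `toy_lemma` (a stand-in for a real lemmatizer such as the ones in spaCy or NLTK) are all assumptions. The embedding step (GENSIM or own) is left out.

```python
import re
from collections import Counter

UNK = "<UNK>"  # assumed placeholder for rare tokens


def clean(text):
    """Cleaned step: lowercase and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_lemma(token):
    """Hypothetical stand-in for a real lemmatizer: strips a final 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_vocab(docs, min_count=2):
    """Keep tokens that appear at least min_count times in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_unk(doc, vocab):
    """UNK step: replace out-of-vocabulary tokens with the UNK token."""
    return [tok if tok in vocab else UNK for tok in doc]


def preprocess(texts, lemmatize, min_count=2):
    """lemmatize=True mirrors PA/PB; lemmatize=False mirrors PC/PD."""
    docs = [clean(t) for t in texts]
    if lemmatize:
        docs = [[toy_lemma(tok) for tok in doc] for doc in docs]
    vocab = build_vocab(docs, min_count)
    return [replace_unk(doc, vocab) for doc in docs]


texts = ["The models write texts.", "A model writes text."]
with_lemma = preprocess(texts, lemmatize=True)
without_lemma = preprocess(texts, lemmatize=False)
```

On this toy corpus, lemmatizing first merges "models"/"model" and "texts"/"text", so fewer tokens fall below the frequency threshold and become `<UNK>`.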

馃摀 Dataset

SemEval2024-task8

Dataset information:

SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection

Large language models (LLMs) are becoming mainstream and easily accessible, ushering in an explosion of machine-generated content over various channels, such as news, social media, question-answering forums, educational, and even academic contexts. Recent LLMs, such as ChatGPT and GPT-4, generate remarkably fluent responses to a wide variety of user queries. The articulate nature of such generated texts makes LLMs attractive for replacing human labor in many scenarios. However, this has also resulted in concerns regarding their potential misuse, such as spreading misinformation and causing disruptions in the education system. Since humans perform only slightly better than chance when classifying machine-generated vs. human-written text, there is a need to develop automatic systems to identify machine-generated text with the goal of mitigating its potential misuse.

We offer three subtasks over two paradigms of text generation: (1) full text when a considered text is entirely written by a human or generated by a machine; and (2) mixed text when a machine-generated text is refined by a human or a human-written text paraphrased by a machine.

Subtasks

Subtask A. Binary Human-Written vs. Machine-Generated Text Classification: Given a full text, determine whether it is human-written or machine-generated. There are two tracks for subtask A: monolingual (only English sources) and multilingual.

Subtask B. Multi-Way Machine-Generated Text Classification: Given a full text, determine who generated it. It can be human-written or generated by a specific language model.

Subtask C. Human-Machine Mixed Text Detection: Given a mixed text, where the first part is human-written and the second part is machine-generated, determine the boundary, where the change occurs.

Downloading

Task            File ID
Whole dataset   14DulzxuH5TDhXtviRVXsH5e2JTY2POLi
Subtask A       1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc
Subtask B       11YeloR2eTXcTzdwI04Z-M2QVvIeQAU6-
Subtask C       16bRUuoeb_LxnCkcKM-ed6X6K5t_1C6mL
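
As a rough sketch of how these IDs might be used, the snippet below turns each one into a download URL. It assumes the IDs are Google Drive file IDs, which the table itself does not state; the `FILE_IDS` dictionary keys and the `drive_url` helper are hypothetical names introduced only for illustration.

```python
# File IDs copied from the table above; the dictionary keys are illustrative.
FILE_IDS = {
    "whole_dataset": "14DulzxuH5TDhXtviRVXsH5e2JTY2POLi",
    "subtask_a": "1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc",
    "subtask_b": "11YeloR2eTXcTzdwI04Z-M2QVvIeQAU6-",
    "subtask_c": "16bRUuoeb_LxnCkcKM-ed6X6K5t_1C6mL",
}


def drive_url(file_id):
    """Build a direct-download URL, assuming a Google Drive file ID."""
    return f"https://drive.google.com/uc?id={file_id}"


# Map every task to its assumed download URL.
urls = {task: drive_url(fid) for task, fid in FILE_IDS.items()}
```

If the assumption holds, a URL such as `urls["subtask_a"]` could then be passed to a downloader like the third-party `gdown` package, for example `gdown.download(urls["subtask_a"], "subtask_a_download")`, where the output name is a placeholder.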