Kritika97Gaikwad/AI-Generated-Text-Detection

AI Generated Text Detection

AI-generated text has become increasingly prevalent across industries, with applications in content generation, personalized marketing, virtual assistants, and creative writing. These advances also bring challenges that must be addressed to ensure responsible and ethical use.

Project Overview

This project develops a machine learning model with scikit-learn, using ensemble techniques, PCA, correlation analysis, and extensive feature engineering. The goal is to classify documents as either human-generated (0) or AI-generated (1) based on document embeddings, word count, and punctuation.

Requirements

  • Python 3.8 or higher
  • Jupyter Notebook or Google Colab

Usage

Exploratory Data Analysis (EDA)

In the EDA phase, we analyze the dataset using the following visualizations and statistics:

  • Distribution of the target variable (ind): Understand the class imbalance in the dataset.
  • Distribution of word counts: Analyze the length of the documents.
  • Frequency of punctuation marks: Examine the usage of punctuation in the documents.
  • Correlation heatmap of document embeddings: Identify relationships between different embedding dimensions.
  • PCA and t-SNE visualizations of document embeddings: Reduce dimensions to visualize the embeddings in 2D space.
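
The statistics behind these plots can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: the `text` field name, the toy documents, and the random stand-in embeddings are all assumptions, since this README only names the target column `ind`. Plotting and t-SNE are left out to keep the sketch short.

```python
import string
from collections import Counter

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy records; only the target name `ind` comes from this README.
docs = [
    {"text": "Hello, world! This is a short note.", "ind": 0},
    {"text": "As an overview, the following points apply: first, second, third.", "ind": 1},
    {"text": "Why not? It works; it really does.", "ind": 0},
]

# Distribution of the target variable: how many documents per class.
target_counts = Counter(d["ind"] for d in docs)

# Distribution of word counts: whitespace-separated tokens per document.
word_counts = [len(d["text"].split()) for d in docs]

# Frequency of punctuation marks across all documents.
punct_counts = Counter(
    ch for d in docs for ch in d["text"] if ch in string.punctuation
)

# Stand-in embeddings (assumption): 50 documents, 8 dimensions of random noise.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 8))

# Correlation between embedding dimensions; this matrix is what a heatmap would show.
corr = np.corrcoef(embeddings, rowvar=False)

# Project the embeddings to 2D with PCA for a scatter plot.
coords = PCA(n_components=2, random_state=0).fit_transform(embeddings)

print(target_counts, word_counts, punct_counts.most_common(3))
print(corr.shape, coords.shape)
```

With real data, `corr` would typically be passed to a heatmap function and `coords` to a scatter plot colored by `ind`.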

Data Preparation

During data preparation:

  • Feature Engineering: Create additional features such as average word length and number of unique words.
  • Train-Test Split: Split the data into training and testing sets (90/10 split) with a fixed random seed for reproducibility.
  • Class Imbalance Handling: Use techniques like SMOTE to balance the classes in the training set.
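
The three steps above can be sketched as follows. This is a hedged example under stated assumptions: the feature names, the toy documents, the seed value 42, and the use of stratification are illustrative choices, not details taken from the notebook. SMOTE lives in the separate imbalanced-learn package, so it is imported lazily inside a helper and the rest of the sketch runs without it.

```python
from sklearn.model_selection import train_test_split


def engineer_features(text):
    """Return simple per-document features (feature names are illustrative)."""
    words = text.split()
    n_words = len(words)
    return {
        "word_count": n_words,
        # Attached punctuation counts toward word length in this simple version.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "unique_words": len({w.lower() for w in words}),
    }


def balance_with_smote(X_train, y_train, seed=42):
    """Oversample the minority class with SMOTE.

    Apply this to the training split only, so the test set keeps its
    original class distribution. Requires the imbalanced-learn package.
    """
    from imblearn.over_sampling import SMOTE

    return SMOTE(random_state=seed).fit_resample(X_train, y_train)


# Toy imbalanced data (assumption): 80 human (0) and 20 AI (1) documents.
texts = [f"sample document number {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20
X = [list(engineer_features(t).values()) for t in texts]

# 90/10 split with a fixed seed; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.10, stratify=labels, random_state=42
)

print(len(X_train), len(X_test), sum(y_test))
```

Calling `balance_with_smote(X_train, y_train)` after the split would return a resampled training set with equal class counts, leaving `X_test` and `y_test` untouched.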

Model Training and Evaluation

We train the following models:

  • Logistic Regression
  • Random Forest
  • AdaBoost
  • SVC
  • Gradient Boosting
  • AutoML / TPOT
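
A minimal sketch of fitting the scikit-learn models from this list on the same split. The synthetic data from `make_classification` and all hyperparameters are assumptions for illustration; the notebook's settings may differ. TPOT is omitted because it is a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix (assumption), mildly imbalanced.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

# One instance per model family; settings are illustrative defaults.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVC": SVC(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Keeping the models in a dictionary makes it easy to reuse the same loop for the evaluation steps that follow.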

For evaluation, we:

  • Generate learning curves for accuracy and loss.
  • Create confusion matrices.
  • Produce classification reports.
  • Calculate F1 scores, precision, and recall.
  • Compute permutation importance.
  • Create partial dependence plots.
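
Several of these steps map directly onto scikit-learn functions. The sketch below uses a tiny hand-checkable label set for the metrics and a synthetic dataset for permutation importance; both are assumptions for illustration, as is the choice of a random forest. Learning curves and partial dependence plots are left out because they need plotting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)
from sklearn.model_selection import train_test_split

# Part 1: metrics on a small hand-checkable example (0 = human, 1 = AI).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Precision, recall, and F1 for the positive (AI-generated) class.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

print(cm)
print(classification_report(y_true, y_pred, target_names=["human (0)", "AI (1)"]))

# Part 2: permutation importance on synthetic data (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, mean_drop in enumerate(result.importances_mean):
    print(f"feature {i}: mean importance = {mean_drop:.3f}")
```

In the hand-checkable example each class has two correct and one incorrect prediction, so precision, recall, and F1 all equal 2/3.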

Results

The results section in AI_Generated_Text_Detection_Project.ipynb provides a detailed analysis of model performance, highlighting the strengths and weaknesses of each model.

Contributing

Contributions are welcome! If you have any improvements or bug fixes, please open an issue or submit a pull request.
