Innoplexus-webpage-classification

Solution for this competition, hosted on Analytics Vidhya (competition link).

Simple Linear SVM model for webpage classification. Weighted F1 score of ~0.75.

Only TF-IDF vectors of the webpage content are used for model training.

Requirements:

nltk
beautifulsoup4
pandas
scikit-learn (imported as sklearn)
scipy

Problem statement:

Classify each given webpage into its category (Tag), such as news, profile, or publication.

Data:

The training dataset contains each webpage's domain name, URL, HTML content, and category (Tag).
The test dataset contains each webpage's domain name, URL, and HTML content.
The webpage content amounts to 6.75 GB of data.
Based on this data, we have to tag every webpage.

Solution approach:

Because the webpage content is in HTML format, we first have to remove all HTML tags, CSS, and JavaScript from the dataset.
For that we can use BeautifulSoup, which gives us the text content of each webpage without the HTML.
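
The cleaning step can be sketched roughly as follows. This is a minimal illustration and not necessarily the code in data_preparation.ipynb: the function name `html_to_text`, the choice of the built-in `html.parser` backend, and the sample page are all assumptions made for the example.

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code or styling, not page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Join the remaining text nodes with spaces, then collapse whitespace.
    text = soup.get_text(separator=" ")
    return " ".join(text.split())


# Made-up sample page, used only to demonstrate the helper.
sample = (
    "<html><head><style>p {color: red;}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))  # Title Hello world
```

With 6.75 GB of HTML, it is probably worth applying such a function in chunks (for example with the `chunksize` argument of `pandas.read_csv`) rather than loading everything into memory at once.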
Next, convert this text to TF-IDF vectors using sklearn.
I used only these vectors to train a simple Linear SVM model.
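
A minimal sketch of this step, assuming the cleaned text and tags are available as Python lists. The toy documents, the `sublinear_tf=True` setting, and `C=1.0` are illustrative assumptions, not the settings used in Models.ipynb.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned webpage text and its tags (made up).
train_texts = [
    "breaking news report on the election today",
    "latest news headlines and weather update",
    "professor profile biography and contact details",
    "staff profile research interests and education",
    "journal publication abstract methods and results",
    "conference publication paper abstract and citations",
]
train_tags = ["news", "news", "profile", "profile", "publication", "publication"]

# TF-IDF features feed directly into a linear SVM.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), LinearSVC(C=1.0))
model.fit(train_texts, train_tags)

# Weighted F1 averages the per-class F1 scores, weighted by class frequency.
# Scoring on the training data only shows the API; use a held-out split in practice.
pred = model.predict(train_texts)
score = f1_score(train_tags, pred, average="weighted")
print(f"weighted F1 on the toy training set: {score:.2f}")
```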
A Linear SVM model takes very little time to train, so hyperparameter optimization can be done quickly.
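
Because each fit is cheap, an exhaustive grid search is practical. The sketch below uses `GridSearchCV` with the `f1_weighted` scorer. The parameter grid, the 2-fold split, and the toy data are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: four made-up documents per tag so that 2-fold stratified CV works.
texts = [
    "breaking news report on the election",
    "latest news headlines and weather",
    "news update on the stock market",
    "daily news bulletin and top stories",
    "professor profile biography and contact",
    "staff profile research interests",
    "doctor profile education and experience",
    "team member profile and short bio",
    "journal publication abstract and results",
    "conference publication paper and citations",
    "publication list with abstract and authors",
    "research publication methods and references",
]
tags = ["news"] * 4 + ["profile"] * 4 + ["publication"] * 4

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "linearsvc__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)
search.fit(texts, tags)

print(search.best_params_)
print(f"best cross-validated weighted F1: {search.best_score_:.2f}")
```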


NishantBhavsar/Innoplexus-webpage-classification
