Building FastText in vector space for Genre Classification of URLs

Description.

1. Datasets

 a. Mainly using the DMOZ dataset.

    https://www.kaggle.com/shawon10/url-classification-dataset-dmoz

 b. Malicious URLs dataset.

    https://www.kaggle.com/sid321axn/malicious-urls-dataset

 c. URL dataset (ISCX-URL2016)

    https://www.unb.ca/cic/datasets/url-2016.html

 d. Detecting Malicious URLs

    http://www.sysnet.ucsd.edu/projects/url/

 e. ANT Datasets

    https://ant.isi.edu/datasets/all.html

    

 The bottom four datasets are included for comparison in the classification experiments; a minimal loading sketch for the DMOZ data follows below.
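 A minimal sketch of loading the DMOZ data with pandas; the file name and column names below are assumptions and should be adjusted to the actual Kaggle export:

    import pandas as pd

    # Assumed file and column names for the Kaggle DMOZ export; adjust as needed.
    df = pd.read_csv("URL Classification.csv", names=["id", "url", "genre"])

    print(df["genre"].value_counts())  # class distribution over genres
    print(df.sample(5))                # a few (url, genre) examples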

2. Introduction

The introduction should cover basic information about URL structure and how information can be retrieved from it, along with the equations used and the potential benefits of the project.
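For illustration, the basic parts of a URL can be retrieved with Python's standard urllib.parse; a minimal sketch (the example URL is made up):

    from urllib.parse import urlparse

    url = "https://www.example.com/Arts/Music/Bands/radiohead?page=2"
    parts = urlparse(url)

    print(parts.scheme)   # 'https'
    print(parts.netloc)   # 'www.example.com'
    print(parts.path)     # '/Arts/Music/Bands/radiohead'
    print(parts.query)    # 'page=2'

    # The path segments carry most of the genre signal.
    segments = [s for s in parts.path.split("/") if s]
    print(segments)       # ['Arts', 'Music', 'Bands', 'radiohead']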

3. Related work.

Papers such as "What's in a URL? Genre Classification from URLs" are suitable for the related work section.

4. Methodology

 a. Explain where the data comes from and why such data is required.

 b. Describe the experimental setup.

  e.g. which dataset is preprocessed, how it is fed to the models, etc.

 c. Explain how the features are extracted,

  i.e. using wordsegment and Universal Word Segmentation.

  An explanation of how this is done is required (a minimal feature-extraction sketch follows below).
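 A minimal sketch of the first segmentation steps with the Python wordsegment package; the example URL is made up, and Universal Word Segmentation is run separately with its own tool as described in Task 1:

    import re
    from wordsegment import load, segment  # pip install wordsegment

    load()  # load the word-frequency data once

    def url_to_tokens(url):
        # Drop the scheme, split by '/', then by remaining punctuation,
        # then word-segment each chunk (e.g. 'baseballbats' -> ['baseball', 'bats']).
        url = re.sub(r"^https?://", "", url.lower())
        tokens = []
        for chunk in url.split("/"):
            for piece in re.split(r"[^a-z0-9]+", chunk):
                if piece:
                    tokens.extend(segment(piece))
        return tokens

    print(url_to_tokens("https://www.sportinggoods-store.com/baseballbats"))
    # e.g. ['www', 'sporting', 'goods', 'store', 'com', 'baseball', 'bats']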

5. Experimentation and Results

Task 1.

This will be a simple genre classification of URLs.

 a. Must use PyTorch for the environment.

 b. Use the RoBERTa-base and RoBERTa-large models to run the genre classification.

 c. Use only the URLs at first: split the URLs by '/', then by punctuation, then apply the Python wordsegment package, and finally Universal Word Segmentation (https://aclanthology.org/Q18-1030.pdf).

 Github link for Universal Word Segmentation: https://github.com/yanshao9798/segmenter

  The command to run Universal Word Segmentation is as follows: python segmenter.py tag -p "path to my file" -m seg_Eng -r "path to my target file" -opth "path to my output file"

  Please note that Universal Word Segmentation requires Python 2.7.

  Segmented pickle files can be provided if needed.

 d. First, run the classification on plain RoBERTa, then fine-tune the models with the descriptions from DMOZ.

  This can be done with masked language modeling (MaskedLM); a minimal fine-tuning sketch follows after the note below.

 * Result tables with comparisons, a performance graph across epochs (max 20 epochs), and the implemented equations are required, along with an explanation of the algorithm and why these equations are used.
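 A minimal PyTorch/Transformers sketch of the classification fine-tuning, assuming the input is already-segmented URL text with integer genre labels; the toy examples, label count, and hyperparameters below are placeholders, not the final setup:

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

    texts = ["arts music bands radiohead", "sports baseball bats store"]  # segmented URLs
    labels = [0, 1]                                                       # genre ids
    num_genres = 2

    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    model = RobertaForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=num_genres)

    enc = tokenizer(texts, padding=True, truncation=True, max_length=64,
                    return_tensors="pt")
    loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"],
                                      torch.tensor(labels)),
                        batch_size=2, shuffle=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for epoch in range(20):                       # max 20 epochs as required
        for input_ids, attention_mask, y in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()                   # cross-entropy from the classification head
            optimizer.step()
            optimizer.zero_grad()

    # For item d, the backbone can first be adapted on the DMOZ descriptions with
    # transformers' RobertaForMaskedLM before this classification fine-tuning.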

Task 2.

This is essentially a ranking task: predicting URLs from descriptions.

 a. Using sentence_transformers, embed DMOZ's descriptions. From the pre-trained models, use "all-mpnet-base-v2" and "all-MiniLM-L6-v2"; also train your own models on top of RoBERTa-base and RoBERTa-large (please refer to this link: https://www.sbert.net/docs/training/overview.html#creating-networks-from-scratch).

 b. After embedding the descriptions together with their matching URLs, run sentence-transformers (see the usage examples in the following link: https://www.sbert.net/examples/applications/semantic-search/README.html).

 c. For the loss function, try BatchAllTripletLoss, BatchHardSoftMarginTripletLoss, MultipleNegativesRankingLoss, and TripletLoss.

 d. A comparison table of each model and loss function is needed.

 e. For the evaluator, use Mean Reciprocal Rank (MRR); refer to the following link: https://en.wikipedia.org/wiki/Mean_reciprocal_rank (a training and MRR-evaluation sketch follows below).

 * A perplexity score for the ranking task must also be provided.
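 A minimal sentence-transformers sketch of training with MultipleNegativesRankingLoss and evaluating with MRR, assuming (description, segmented-URL) pairs from DMOZ; the toy pairs and hyperparameters are placeholders:

    from sentence_transformers import SentenceTransformer, InputExample, losses, util
    from torch.utils.data import DataLoader

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Placeholder (description, url-text) pairs standing in for DMOZ rows.
    train_examples = [
        InputExample(texts=["British alternative rock band discography",
                            "arts music bands radiohead"]),
        InputExample(texts=["Store selling baseball equipment",
                            "sports baseball bats store"]),
    ]
    train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
    train_loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)

    # Ranking: embed descriptions and URLs, then score with cosine similarity.
    descriptions = ["British alternative rock band discography"]
    urls = ["arts music bands radiohead", "sports baseball bats store"]
    desc_emb = model.encode(descriptions, convert_to_tensor=True)
    url_emb = model.encode(urls, convert_to_tensor=True)
    hits = util.semantic_search(desc_emb, url_emb, top_k=len(urls))

    # MRR = (1/|Q|) * sum_i 1/rank_i, where rank_i is the position of the
    # first correct URL for query i.
    gold = [0]  # index of the correct URL for each description
    mrr = 0.0
    for i, query_hits in enumerate(hits):
        for rank, hit in enumerate(query_hits, start=1):
            if hit["corpus_id"] == gold[i]:
                mrr += 1.0 / rank
                break
    mrr /= len(hits)
    print("MRR:", mrr)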

Task 3.

 This will combine the work of Tasks 1 and 2.

 a. First, run the model from Task 2 with the given URL against some sample URLs and rank the URLs that are similar to the given URL.

 b. Then, run the classification model from Task 1 to show the genres of the ranked URLs.

 c. Filter out the URLs that are not classified as the same genre as the given URL.

 d. Return the result as a recommendation of what the user might be interested in.

  Use semantic search from sentence-transformers; refer to the following link: https://www.sbert.net/examples/applications/semantic-search/README.html (a combined pipeline sketch follows below).

 * An explanation of the models and algorithms used is required, along with the result table.
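 A minimal sketch of the combined pipeline, assuming the Task 2 ranker and Task 1 classifier are already fine-tuned; the model names, label count, and example URLs below are placeholders:

    import torch
    from sentence_transformers import SentenceTransformer, util
    from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

    # Placeholders for the fine-tuned artifacts from Tasks 1 and 2.
    ranker = SentenceTransformer("all-MiniLM-L6-v2")
    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    classifier = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                                  num_labels=2)
    classifier.eval()

    def predict_genre(url_text):
        # Classify one segmented URL string and return the predicted genre id.
        enc = tokenizer(url_text, return_tensors="pt", truncation=True, max_length=64)
        with torch.no_grad():
            logits = classifier(**enc).logits
        return int(logits.argmax(dim=-1))

    query_url = "arts music bands radiohead"
    candidate_urls = ["arts music bands coldplay", "sports baseball bats store"]

    # Step a: rank candidates by semantic similarity to the given URL.
    query_emb = ranker.encode(query_url, convert_to_tensor=True)
    cand_emb = ranker.encode(candidate_urls, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, cand_emb, top_k=len(candidate_urls))[0]

    # Steps b-d: keep only candidates whose predicted genre matches the query's.
    query_genre = predict_genre(query_url)
    recommendations = [candidate_urls[h["corpus_id"]] for h in hits
                       if predict_genre(candidate_urls[h["corpus_id"]]) == query_genre]
    print("You might be interested in:", recommendations)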

6. Conclusion

 Any conclusions drawn from the results are acceptable, whether positive or negative.

 Also, some possible future work must be mentioned.

7. References

 All the referenced papers must be listed here.

* The format guideline is attached.