Using Weka to experiment with learning algorithms. Text Classification
-----------------------------------------------------------------

Weka is a free, open-source tool written in Java that implements many data mining and machine learning algorithms. You can use it to understand the algorithms and, more interestingly, to understand the results they produce. The online lecture by Ian Witten [W3] can help you get started.

This lab is based on the tutorial [Wei]: Brandon Weinberg - Weka Text Classification for First Time & Beginner Users, from https://www.youtube.com/watch?v=IY29uC4uem8
The video also has a table of contents with jumps to different timestamps inside the video. For example, the explanation of Step 2 starts at minute 5:06.

Step 1) For the example in the video you need to download the dataset from http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

Step 2) The command below creates the .arff file needed by Weka from the text files of the movie review dataset:

    java weka.core.converters.TextDirectoryLoader -dir "c:/Horia/3-AI/lab12/txt_sentoken/" > "c:/Horia/3-AI/lab12/IMDB.arff"

In this example, "c:/Horia/3-AI/lab12/txt_sentoken" is the folder where I unpacked the movie review dataset and "c:/Horia/3-AI/lab12/IMDB.arff" is the output file (with its path). The quotation marks are meant to handle paths that contain spaces. However, in my experience the command does not work with DOS-like paths (with \), and it also does not work with paths that contain spaces, so use forward slashes and a path without spaces. Afterwards, it is better to move the IMDB.arff file into Weka's data folder, which on my computer is c:\Program Files\Weka-3-8\data\

Step 3) Load the IMDB.arff file in the Weka Explorer.

Step 4) Apply a filter to the data, with different parameters (explained in the video between minutes 8 and 27).

Step 5) 28:07 AttributeSelect, for reducing the dataset to improve classifier performance.

Step 6) 37:37 Cost Sensitivity and Class Imbalance.
(What can you do when you have too little or too much data?)

Step 7) Using different classifiers (min 45:45).

Bibliography:

[Wei] Brandon Weinberg - Weka Text Classification for First Time & Beginner Users
      https://www.youtube.com/watch?v=IY29uC4uem8

[W3] Ian Witten - Advanced Data Mining with Weka (online, last visited Dec. 2021)
      https://www.cs.waikato.ac.nz/ml/weka/mooc/advanceddataminingwithweka/

[W2] Ian Witten - More Data Mining with Weka (2.4: Document classification)
      (the book) https://www.cs.waikato.ac.nz/ml/weka/mooc/moredataminingwithweka/
      (the lesson) https://www.cs.waikato.ac.nz/ml/weka/mooc/moredataminingwithweka/slides/Class2-MoreDataMiningWithWeka-2014.pdf
      (video for the lesson) https://youtu.be/Tggs3Bd3ojQ
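
Note on Step 2: to make it easier to see what the conversion produces, here is a minimal sketch in plain Java (no Weka classes) that writes the same kind of file: one row per review, holding the quoted text and the name of the subfolder it came from as the class label. It is not Weka's implementation. The class name ArffSketch and the relation name "reviews" are hypothetical, and the attribute names "text" and "@@class@@" only mirror what TextDirectoryLoader appears to write, so treat them as assumptions. The sketch also assumes that the subfolder names (e.g. neg, pos) contain no spaces.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only: builds ARFF text from a folder
// that contains one subfolder per class (e.g. neg/ and pos/).
public class ArffSketch {

    // Quote a string value for ARFF: escape backslashes first, then quotes,
    // line breaks and tabs, and wrap the result in single quotes.
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\")
                      .replace("'", "\\'")
                      .replace("\n", "\\n")
                      .replace("\r", "\\r")
                      .replace("\t", "\\t") + "'";
    }

    // Return the entries of a directory in sorted order.
    static List<Path> sortedChildren(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.sorted().collect(Collectors.toList());
        }
    }

    // Build the ARFF text: a header with two attributes, then one data row per file.
    static String toArff(Path root) throws IOException {
        List<Path> classDirs = new ArrayList<>();
        for (Path p : sortedChildren(root)) {
            if (Files.isDirectory(p)) {
                classDirs.add(p);
            }
        }
        String labels = classDirs.stream()
                .map(p -> p.getFileName().toString())
                .collect(Collectors.joining(","));

        StringBuilder sb = new StringBuilder();
        sb.append("@relation reviews\n\n");            // relation name is an assumption
        sb.append("@attribute text string\n");         // attribute names are assumptions
        sb.append("@attribute @@class@@ {").append(labels).append("}\n\n");
        sb.append("@data\n");
        for (Path dir : classDirs) {
            String label = dir.getFileName().toString();
            for (Path file : sortedChildren(dir)) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                // Decoding as UTF-8 here is an assumption about the input files.
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                sb.append(quote(text)).append(',').append(label).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ArffSketch <dataset folder>");
            return;
        }
        System.out.print(toArff(Paths.get(args[0])));
    }
}
```

After compiling, it could be run as: java ArffSketch c:/Horia/3-AI/lab12/txt_sentoken > sketch.arff (the output file name is just an example). Opening the result in a text editor next to IMDB.arff shows how each review becomes one long quoted string followed by its class.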