In this project, you will select a dataset from the two available options. You will then train a machine learning classifier to make predictions based on your features developed in VBA.
There are two datasets available: 1) Yelp reviews (project_yelp.txt) and 2) IMDB movie reviews (project_imdb.txt).
Citation: [1]
Both of these datasets are in the format: <Review><TAB><Class>, where <Class> = 0 for negative sentiment and 1 for positive sentiment.
You can select ONE of these datasets to engineer features for, using Excel and VBA. It is possible your code might work well on both.
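As a rough illustration of the <Review><TAB><Class> format, the sketch below reads one of the dataset files and places each review in column A and its class in column B, with the headings the project asks for. The Sub name ImportReviews and the file location (the same folder as the workbook) are assumptions made for this example, not part of the assignment; Excel's own text import works equally well.

```vba
Option Explicit

' Sketch only: loads "<Review><TAB><Class>" lines into columns A and B of the active sheet.
' ASSUMPTION: the dataset file sits in the same folder as this workbook.
Sub ImportReviews()
    Dim filePath As String
    filePath = ThisWorkbook.Path & Application.PathSeparator & "project_yelp.txt"

    ' Read the whole file in one go so any line-ending style can be handled
    Dim fileNum As Integer
    Dim content As String
    fileNum = FreeFile
    Open filePath For Binary Access Read As #fileNum
    content = Space$(LOF(fileNum))
    Get #fileNum, , content
    Close #fileNum

    ' Normalise CRLF and CR line endings to LF, then split into lines
    content = Replace(content, vbCrLf, vbLf)
    content = Replace(content, vbCr, vbLf)
    Dim fileLines() As String
    fileLines = Split(content, vbLf)

    Range("A1").Value = "REVIEW"
    Range("B1").Value = "SENTIMENT CLASS"

    Dim i As Long, tabPos As Long, rowNum As Long
    rowNum = 2
    For i = LBound(fileLines) To UBound(fileLines)
        ' The class label follows the last tab on the line; lines without a tab are skipped
        tabPos = InStrRev(fileLines(i), vbTab)
        If tabPos > 0 Then
            ' The leading apostrophe makes Excel store the review as plain text
            Cells(rowNum, 1).Value = "'" & Left$(fileLines(i), tabPos - 1)
            Cells(rowNum, 2).Value = Val(Mid$(fileLines(i), tabPos + 1))
            rowNum = rowNum + 1
        End If
    Next i
End Sub
```

To try it on the IMDB data, change the file name to project_imdb.txt.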
Submission Requirements: Submit an XLSM file <your_last_name>_project.xlsm and an accuracy.txt file with the copied output for the LinearSVMBinary from Visual Studio (on Windows) or the custom Mac software.
Project Requirements:
Import the CSV data into a macro-enabled Excel workbook. Give the first column a heading called “REVIEW” and the second column, with the class labels, a heading called “SENTIMENT CLASS” (0 marks).
Develop VBA features, implemented as Subs, to process text – requirement of 8 (5 * 8 = 40 marks).
(0 – Poor, 1, 2 – Marginal Quality, 2,3 – Acceptable Quality, 4,5 – Good Quality)
You will write 8 features, implemented as VBA subs, to process the text in this data to be fed into a machine learning classifier.
Each feature will have its own column. The values for each feature MUST be numeric only. Marks will be assigned based on code quality, originality, and suspected performance (i.e., the potential for the feature to improve the classifier accuracy). The features should be reasonably distinct from each other.
IMPORTANT: Not all of these must be complex. The TA will take into consideration the balance of complexity of your overall project. For instance, a high-scoring project’s features might look like: first 2-3 simple, next 4 moderate complexity, last 2 more original; these might demonstrate your creativity and ability to code in VBA. See “Sample feature ideas” below for some suggestions of “simple/moderate” and “original/more advanced”.
Clarification of Original/More Advanced: By “original/more advanced” we mean these might be a little more complex (some ideas are listed under “Sample feature ideas”) or items the instructor did not suggest. Please keep in mind the standard for this is not particularly high given this is an introductory course.
IMPORTANT # 2: If you cannot come up with 8 features, the Instructor will assist you with ideas.
IMPORTANT # 3: You can write more features if you want; you must clearly label the ones you want us to mark.
A Sub Main, to call all of your features on the data. (4 marks).
Overall sufficient code comments in the Module, including a comment header with separate lines consisting of your name, the course code, “Winter 2020”, and the Instructor name. There should be good naming of Subs + the Module that contains your features (8 marks - subjective).
Overall Good code organization (8 marks - subjective).
Good Accuracy scores (up to 10 marks, with a potential of 2 bonus for a maximum of 12 marks).
The baseline score for both of the datasets will be set at the instructor’s discretion and posted on OWL.
Those who score below the baseline will get < 6 marks. A score at the baseline will get 6 marks.
A score greater than the baseline will get > 6 marks.
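One possible shape for the Sub Main requirement is sketched here: Main calls each feature Sub in turn, and each feature writes its numeric values to its own column. The feature names, the column letters, and the LastReviewRow helper are hypothetical placeholders chosen for this example; only the layout (reviews in column A, headings in row 1) comes from the requirements above.

```vba
Option Explicit

' Sketch of a Sub Main that runs every feature in turn.
' ASSUMPTION: feature names and output columns are placeholders.
Sub Main()
    Application.ScreenUpdating = False   ' speeds up writing many cells
    FeatureReviewLength "C"
    FeatureExclamationCount "D"
    ' ...call the remaining features here, one column each
    Application.ScreenUpdating = True
End Sub

' Row number of the last review in column A (row 1 holds the heading)
Function LastReviewRow() As Long
    LastReviewRow = Cells(Rows.Count, "A").End(xlUp).Row
End Function

' Feature: number of characters in the review
Sub FeatureReviewLength(outCol As String)
    Dim r As Long
    Range(outCol & "1").Value = "LENGTH"
    For r = 2 To LastReviewRow()
        Range(outCol & r).Value = Len(Range("A" & r).Value)
    Next r
End Sub

' Feature: number of exclamation marks in the review
Sub FeatureExclamationCount(outCol As String)
    Dim r As Long
    Dim reviewText As String
    Range(outCol & "1").Value = "EXCLAMATIONS"
    For r = 2 To LastReviewRow()
        reviewText = Range("A" & r).Value
        ' Removing every "!" shortens the string by exactly the number of "!" it held
        Range(outCol & r).Value = Len(reviewText) - Len(Replace(reviewText, "!", ""))
    Next r
End Sub
```

Passing the output column as an argument keeps each feature independent, so reordering or dropping a feature only means editing Main.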
Sample feature ideas:
Simple/moderate:
Punctuation
Word lists (https://gist.github.com/mkulakowski2/4289437 - positive, https://gist.github.com/mkulakowski2/4289437 - negative, PLEASE CITE their work in a code comment) (googling positive word list + github or negative word list + github might help too, please CITE anything you borrow)
Words
Letters in words, positions of words, positions of punctuation
length of specific words, length of longest word
mean word length
Capital letters/lowercase letters
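As one illustration of the word-list idea, the sketch below counts how many words in each review appear in a list of positive words. The five words shown are a tiny made-up sample for demonstration only; a real feature would load a cited list. The names CountListedWords, StripPunctuation, and FeaturePositiveWordCount, and the choice of column C for output, are assumptions of this example.

```vba
Option Explicit

' Counts how many words of a review appear in wordList.
' Punctuation attached to a word is stripped before comparing.
Function CountListedWords(ByVal review As String, wordList As Variant) As Long
    Dim words() As String
    Dim i As Long, j As Long
    Dim w As String

    words = Split(LCase$(review), " ")
    For i = LBound(words) To UBound(words)
        w = StripPunctuation(words(i))
        If Len(w) > 0 Then
            For j = LBound(wordList) To UBound(wordList)
                If w = wordList(j) Then
                    CountListedWords = CountListedWords + 1
                    Exit For
                End If
            Next j
        End If
    Next i
End Function

' Keeps only the letters a-z and apostrophes of an already lower-cased word
Function StripPunctuation(ByVal w As String) As String
    Dim i As Long
    Dim ch As String
    For i = 1 To Len(w)
        ch = Mid$(w, i, 1)
        If (ch >= "a" And ch <= "z") Or ch = "'" Then StripPunctuation = StripPunctuation & ch
    Next i
End Function

' Writes the positive-word count of each review in column A to column C
Sub FeaturePositiveWordCount()
    Dim positiveWords As Variant
    positiveWords = Array("good", "great", "love", "excellent", "amazing") ' illustrative sample only

    Dim r As Long, lastRow As Long
    lastRow = Cells(Rows.Count, "A").End(xlUp).Row
    For r = 2 To lastRow
        Range("C" & r).Value = CountListedWords(Range("A" & r).Value, positiveWords)
    Next r
End Sub
```

The same CountListedWords function could be reused with a negative word list to produce a second, distinct feature.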
Suggested more original/advanced:
characteristics (i.e., length, position, letters, etc.) of combinations of words, in particular groups of 2 or 3 words (bigrams/trigrams)
consonants/vowels
positions of letters in words
specific mentions of aspects of food (yelp) or movies (imdb)
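A bigram feature could look something like the following sketch, which counts pairs such as "not good": a negation word directly followed by a positive word. Both word lists are short made-up samples, and the names IsInList, CountNegatedPositives, and FeatureNegatedPositives, as well as the output column D, are assumptions of this example rather than a prescribed design.

```vba
Option Explicit

' True when candidate equals one of the entries in wordList
Function IsInList(ByVal candidate As String, wordList As Variant) As Boolean
    Dim j As Long
    For j = LBound(wordList) To UBound(wordList)
        If candidate = wordList(j) Then
            IsInList = True
            Exit Function
        End If
    Next j
End Function

' Counts bigrams where a negation word is directly followed by a positive word
Function CountNegatedPositives(ByVal review As String) As Long
    Dim negations As Variant, positives As Variant
    negations = Array("not", "never", "no", "wasn't", "isn't")   ' illustrative sample only
    positives = Array("good", "great", "recommend", "worth")     ' illustrative sample only

    ' Lower-case the text and drop common punctuation so "good." matches "good"
    Dim p As Variant
    review = LCase$(review)
    For Each p In Array(".", ",", "!", "?")
        review = Replace(review, p, "")
    Next p

    Dim words() As String
    Dim i As Long
    words = Split(review, " ")
    ' Stop one word early so words(i + 1) always exists
    For i = LBound(words) To UBound(words) - 1
        If IsInList(words(i), negations) And IsInList(words(i + 1), positives) Then
            CountNegatedPositives = CountNegatedPositives + 1
        End If
    Next i
End Function

' Writes the bigram count of each review in column A to column D
Sub FeatureNegatedPositives()
    Dim r As Long, lastRow As Long
    lastRow = Cells(Rows.Count, "A").End(xlUp).Row
    For r = 2 To lastRow
        Range("D" & r).Value = CountNegatedPositives(Range("A" & r).Value)
    Next r
End Sub
```

Extending the loop to look at words(i + 2) as well would turn this into a trigram feature.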
Sample feature code:
Note: It is recommended you take a look at the sample clickbait project. Some sample subs, one working with individual characters, and one with words, are shown below.
Example 1: Iterating through the characters in an IMDB or Yelp Review
Sub CountNumberOfAsInReview()
    Dim i As Integer
    Dim reviewsToProcess As Range
    Set reviewsToProcess = Range("A1")
    Dim review As Variant
    Dim aCount As Integer
    For Each review In reviewsToProcess
        For i = 1 To Len(review) Step 1
            'Mid(review, i, 1) can be used to get characters from a string
            'review is the string, i is the index, and 1 means 1 character
            If StrComp(Mid(review, i, 1), "a", vbTextCompare) = 0 Then
                aCount = aCount + 1
            End If
        Next
    Next
    'output the feature value here
    Range("A2").Value = aCount
End Sub
Example 2: Iterating through the words in an IMDB or Yelp review, placing the number of words in the B column
Sub CalcWordLength()
    Dim reviews As Range
    Set reviews = Range("A2:A49")
    Dim words() As String
    Dim i As Integer
    Dim strReview As Variant
    i = 2
    For Each strReview In reviews
        'use Split to get the dynamic array of words in a string
        words = Split(strReview, " ")
        'UBound is the last zero-based index, so add 1 to get the number of words
        Range("B" & i).Value = UBound(words) + 1
        i = i + 1
    Next
End Sub
Training your ML Classifier (Mac ONLY):
Use the EasyClassifier program. Note that this program should not be used unless you CANNOT access a Windows PC for Visual Studio. We will not accept Windows submissions using this.
This uses a simple form of Machine Learning (K-Nearest Neighbours) so only use this if you don’t have Windows. You may want to run this a few times as your score may vary since KNN is very basic.
Download EasyClassifier from OWL (do not distribute this file). Extract the ZIP.
Open “easyclassifier.html”. Press Browse. Go to your CSV file.
NOTE: Your csv file should be in the format:
FeatScore1, FeatScore2, FeatScore3, FeatScore4, FeatScore5, FeatScore6, FeatScore7, FeatScore8, CLASS
An example from clickbait feature data has been included so you know what your CSV should look like.
Submit a screenshot of the output of your program as an attachment with your project Excel file like below (there will be text where the red underline is that should also be present):
Training your ML Classifier (Windows Only):
Download Visual Studio Community 2019 for Windows (left) from: https://visualstudio.microsoft.com/
In the Installer setup program, only select Desktop development.
Register for the COMMUNITY edition of the software (this is free for education use). You might be able to use your UWO login.
Download and install the ML.NET Model Builder (you can OPEN this file and it will set everything up for you) https://marketplace.visualstudio.com/items?itemName=MLNET.07
Export your XLSM feature data into a new CSV file. You should be able to just copy+paste it.
Select New > Project > Console Application
Press “Create”
Right-click on “ConsoleApp1” below “Solution ConsoleApp1”, hover over “Add”, then go to “Machine Learning” and click “Custom Scenario” (NOT “Sentiment Analysis”).
Test out your CSV file. Make sure the column to predict is the SENTIMENT CLASS label (0 – negative sentiment or 1 – positive sentiment).
Train for 10 seconds under “binary-classification”
Then click “Evaluate” to get the accuracy score. Check what it is for the LinearSVMBinary, OR for the MOST ACCURATE classifier if LinearSVMBinary does not appear in the output window (if the window is hidden, click View > Output).
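For the CSV export step, copy and paste is usually enough, but the export can also be scripted. The sketch below writes the feature columns followed by the class label, one row per review, to a file named features.csv beside the workbook. The column positions (features in C to J, class in B), the output file name, and the Sub name ExportFeaturesToCsv are all assumptions for illustration; adjust them to your own sheet.

```vba
Option Explicit

' Sketch: writes the numeric feature columns plus the class label to a CSV file.
' ASSUMPTION: features sit in columns C:J and the class label in column B.
Sub ExportFeaturesToCsv()
    Const FIRST_FEATURE_COL As Long = 3   ' column C
    Const LAST_FEATURE_COL As Long = 10   ' column J
    Const CLASS_COL As Long = 2           ' column B

    Dim outPath As String
    outPath = ThisWorkbook.Path & Application.PathSeparator & "features.csv"

    Dim fileNum As Integer
    Dim r As Long, c As Long, lastRow As Long
    Dim lineText As String
    lastRow = Cells(Rows.Count, CLASS_COL).End(xlUp).Row

    fileNum = FreeFile
    Open outPath For Output As #fileNum
    For r = 1 To lastRow                  ' row 1 is the heading row
        lineText = ""
        For c = FIRST_FEATURE_COL To LAST_FEATURE_COL
            ' CStr follows the system locale; with a comma decimal separator,
            ' non-integer values would need extra handling
            lineText = lineText & CStr(Cells(r, c).Value) & ","
        Next c
        ' The class label goes last on each line
        lineText = lineText & CStr(Cells(r, CLASS_COL).Value)
        Print #fileNum, lineText
    Next r
    Close #fileNum
End Sub
```

Putting the class label last matches the FeatScore1, ..., FeatScore8, CLASS layout described for the Mac tool, and the heading row gives Model Builder a column name to select as the label.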
Dataset References
[1] D. Kotzias, M. Denil, N. De Freitas, and P. Smyth, “From group to individual labels using deep features,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, doi: 10.1145/2783258.2783380.