Sentiment Analysis Part 1 (Sentiment Analyst)

Machine learning is sweeping through the industry, and we now see it used in almost every area of technology: face detection, voice recognition, text recognition, and many more, each with its own approach to machine learning.

Here is how Wikipedia explains sentiment analysis:

" Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. "

To keep up with this pace as developers, we need to stay on board the fast train of technology, and that means supporting our products with machine learning.

In this article, we are going to look at one of the most popular areas of machine learning, "Sentiment Analysis". We will train a model on the IMDB Movie Reviews dataset, then build an example application that uses the trained model to predict whether users' comments are positive or negative.

This is how our example application will look when we finish it.

The example project consists of 3 parts:

Sentiment Analyst, the first part, which is responsible for the whole machine learning process.

Trainer, the second part, which is responsible for training the model.

Interface, the third part, which is responsible for interaction with the users.

What do we need for this project?

Microsoft ML.NET

Metro Set UI

Html Agility Pack

Important Note: Microsoft ML.NET supports x64, and our example also targets x64. If you try it with x86, you may run into problems.
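
If you are unsure which bitness your process actually runs with, a quick runtime check can confirm it before you call into ML.NET. The sketch below is an illustration only; the `PlatformCheck` class and its message texts are hypothetical and not part of the example project.

```csharp
using System;

// Hypothetical helper, not part of the article's source code.
public static class PlatformCheck
{
    // Returns a short message describing whether the process matches
    // the x64 target that this article assumes.
    public static string Describe(bool is64BitProcess)
    {
        return is64BitProcess
            ? "Running as a 64-bit process."
            : "Running as a 32-bit process; switch Platform Target to x64.";
    }

    public static void Main()
    {
        // Environment.Is64BitProcess reports the bitness of the current process.
        Console.WriteLine(Describe(Environment.Is64BitProcess));
    }
}
```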

You can download the source code from here.

Brief Explanation of Project

The aim of this example project is to understand the practical usage of machine learning. To achieve this, we will create an example application that helps us analyze users' comments about movies, and we will train the machine learning model with movie reviews.

Please note that we train on "Movie Reviews" in order to make predictions about movie reviews, not about "Twitter Messages or something else". If you want to use machine learning for a different purpose, you need to train your model with a data set that matches that purpose.

Part I - Sentiment Analyst

ML.Net supports many machine learning algorithms, but here we need sentiment analysis, so we will focus on exactly this feature of the framework. I like to keep things simple and organized, so we are going to create a dedicated helper class that lets us use the machine learning features easily and in a manageable way. The "Trainer" that we build later will also need some features of this helper class.

So let's start by creating an empty solution and naming it "Movie Reviews".

After we click "Create", Visual Studio creates an empty solution for us, to which we are going to add our first project, "Sentiment Analyst".

Right click the solution and click "Add New Project" as shown below.

Select and add "Class Library" as shown below.

We are going to name it "Sentiment Analyst" and then create it by clicking "Create".

Great, now we have a "Solution" with a "Class Library" project in it. Before we start coding, we need to install ML.Net via NuGet. Please open the "Manage NuGet Packages for Solution" window (Tools Menu -> NuGet Package Manager -> Manage NuGet Packages for Solution).

Make sure that the "Browse" tab is selected, then find the ML.Net package as shown below and install it into the project we just created.

After the installation has completed, you should be able to see the references in the "Solution Explorer" under "References" as shown below.

We are almost there! One last thing before we start: we need to set "Platform Target" to x64. After that, we are ready to go!

To summarize, here are the steps we need to complete before starting to code.

Create a "Solution" and name it "Movie Reviews".

Create a "Class Library" project and name it "Sentiment Analyst".

Install the NuGet package for ML.Net.

Set "Platform Target" to x64.

If you have completed all the steps above, we are ready to start coding.

We will keep the model definitions under the "Models" folder, and in the root folder of the project we are going to create the "SentimentAnalyst" class, where most of the work is done.

Root Folder;
SentimentAnalyst.cs

Models Folder;
CrossValidationResult.cs
Data.cs
Definitions.cs
LearningMethodResult.cs
Prediction.cs

We are going to see all the details step by step for each of the components which we created. Let's get started with the models.

CrossValidationResult

We are going to use this class to return the results of the cross-validation that we run on the training data. It is a simple class with 4 fields, as you can see below.


    public class CrossValidationResult
    {
        public string Trainer;
        public double AccuracyAverage;
        public double AccuraciesStdDeviation;
        public double AccuraciesConfidenceInterval95;
    }

Data

The Data class is what we use to pass training data to ML.Net. As you can see below, we specify the column name and the column index that each property maps to in the training data.

Our training data contains two columns, "Review" and "Sentiment".


    public class Data
    {
        [ColumnName("review")]
        [LoadColumn(0)]
        public string Review { get; set; }

        [ColumnName("sentiment")]
        [LoadColumn(1)]
        public bool Sentiment { get; set; }
    }

Definitions

These are the definitions of the learning methods. With this enum, we can try different trainers on the training data.


    public enum Trainers
    {
        LbfgsLogisticRegression,
        SgdCalibrated,
        SdcaLogisticRegression,
        AveragedPerceptron,
        LinearSvm
    }
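
Because the trainers are plain enum values, trying every one of them is just a loop over the enum. The sketch below shows only that loop; the `evaluate` delegate is a hypothetical stand-in for whatever trains and scores a model, and the enum is repeated so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Repeated from the article so that the snippet is self-contained.
public enum Trainers
{
    LbfgsLogisticRegression,
    SgdCalibrated,
    SdcaLogisticRegression,
    AveragedPerceptron,
    LinearSvm
}

// Hypothetical helper, not part of the article's source code.
public static class TrainerLoop
{
    // Calls `evaluate` once per enum value and returns the scores by trainer.
    public static Dictionary<Trainers, double> EvaluateAll(Func<Trainers, double> evaluate)
    {
        var scores = new Dictionary<Trainers, double>();
        foreach (var trainer in (Trainers[]) Enum.GetValues(typeof(Trainers)))
            scores[trainer] = evaluate(trainer);
        return scores;
    }

    public static void Main()
    {
        // Placeholder scoring: uses the enum's numeric value, not a real accuracy.
        var scores = EvaluateAll(t => (double) (int) t);
        foreach (var pair in scores)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}
```

Adding a new trainer to the enum is then enough for it to be picked up by the loop, which is the main reason to keep the list in one place.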

LearningMethodResult

This is the class that we use to return the training results.


    public class LearningMethodResult
    {
        public string Trainer;
        public double Accuracy;
        public double AreaUnderRocCurve;
        public double F1Score;
    }
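
To make these numbers less abstract, here is how accuracy and F1 score follow from the counts of a binary confusion matrix. This is a plain arithmetic sketch with made-up counts, not ML.NET's implementation, and `BinaryMetricsSketch` is a hypothetical name. The area under the ROC curve needs the ranked scores of all predictions, so it is left out.

```csharp
using System;

// Hypothetical helper, not part of the article's source code.
public static class BinaryMetricsSketch
{
    // Fraction of all reviews that were classified correctly.
    public static double Accuracy(int tp, int tn, int fp, int fn)
    {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    // Harmonic mean of precision and recall for the positive class.
    public static double F1Score(int tp, int fp, int fn)
    {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void Main()
    {
        // Made-up counts for illustration only.
        int tp = 40, tn = 45, fp = 5, fn = 10;
        Console.WriteLine("Accuracy: {0:F4}", Accuracy(tp, tn, fp, fn));
        Console.WriteLine("F1 score: {0:F4}", F1Score(tp, fp, fn));
    }
}
```

With these counts, 85 of 100 reviews are classified correctly, so the accuracy is 0.85, while the F1 score is 80 / 95, roughly 0.84, because it ignores the true negatives.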

Prediction

The Prediction class is what we return as a result after asking ML.Net to make a prediction for the input data. Please note that this class inherits from the Data class.


    public class Prediction : Data
    {
        [ColumnName("PredictedLabel")]
        public bool PredictionValue { get; set; }
        public float Score { get; set; }
    }

Now it is time to talk about the SentimentAnalyst class, where all the processing is done.

In general, the machine learning class we define here has 3 major parts:

Loading data

Training the model

Making predictions according to input data

We need two variables to store the "Path" information for loading the data and for saving the model after training. Let's start by defining them.


    private readonly string _dataPath;
    private readonly string _modelPath;

All ML.Net operations share one common context, so let's define it as below.


    private readonly MLContext _mlContext;

Now we need 2 more fields, one for the trained model (an ITransformer) and one for the loaded data (an IDataView).


    private ITransformer _model;
    private IDataView _dataViewPrimary;

I think we are clear up to here. Now we are going to define two more variables. These two are entirely optional, and we could do all the training and prediction without them, but I wanted to add flexibility to the SentimentAnalyst class.

We have 5 different learning methods, as already defined above in "Trainers":


    LbfgsLogisticRegression,
    SgdCalibrated,
    SdcaLogisticRegression,
    AveragedPerceptron,
    LinearSvm

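Here are those two variables:


    private EstimatorChain<BinaryPredictionTransformer<CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> _trainingPipelinePlat;
    private EstimatorChain<BinaryPredictionTransformer<LinearBinaryModelParameters>> _trainingPipeline;

We define "_trainingPipelinePlat" for the LbfgsLogisticRegression, SgdCalibrated and SdcaLogisticRegression learning methods, and "_trainingPipeline" for AveragedPerceptron and LinearSvm. The first three trainers produce a calibrated model, so their pipeline has a different type from that of the last two.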

Then let's define the constructor to initialize the variables as below.


    public SentimentAnalyst(string dataPath = null, string modelPath = null)
    {
        _mlContext = new MLContext();
        _dataPath = dataPath;
        _modelPath = modelPath;
    }

Now that we have defined all the variables we need, we can start to code the functions.

As I already mentioned, we have 3 major parts:
Loading data
Training the model
Making predictions according to input data

Let's start with "Loading Data".

Machine learning needs data; its power comes from data, and feeding a model with enough well-organized data gives us much more accurate predictions. To train our model, we first need to load the data. For this we are going to create the LoadData function below, which uses MLContext's LoadFromTextFile function with the "Data" class that is defined and explained above.


        /// <summary>
        ///     Loads data set
        /// </summary>
        private DataOperationsCatalog.TrainTestData LoadData()
        {
            IDataView dataView = null;
            if (_dataPath == null)
                throw new Exception("Data Path is undefined");

            // The column layout comes from the attributes on the Data class
            _dataViewPrimary = _mlContext.Data.LoadFromTextFile<Data>(
                _dataPath,
                hasHeader: true,
                separatorChar: ',',
                allowQuoting: true
            );

            dataView = _dataViewPrimary;
            // 80% of the data for training, 20% for testing
            var splitDataView = _mlContext.Data.TrainTestSplit(dataView, 0.2);
            return splitDataView;
        }

After loading the data, we split it into training data and test data. It is good practice to use something like 80% of the data for training and 20% for testing. So this function loads the data and prepares both the training set and the test set.
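
Conceptually, the split shuffles the rows and sets a fraction of them aside. The sketch below illustrates that idea with an in-memory list; it is not how `TrainTestSplit` is implemented, and `SplitSketch` with its `seed` parameter is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the article's source code.
public static class SplitSketch
{
    // Shuffles the rows with a seeded random generator, then keeps
    // `testFraction` of them for testing and the rest for training.
    public static (List<T> Train, List<T> Test) Split<T>(
        IEnumerable<T> rows, double testFraction, int seed)
    {
        var random = new Random(seed);
        var shuffled = rows.OrderBy(_ => random.Next()).ToList();
        var testCount = (int) Math.Round(shuffled.Count * testFraction);
        return (shuffled.Skip(testCount).ToList(), shuffled.Take(testCount).ToList());
    }

    public static void Main()
    {
        var (train, test) = Split(Enumerable.Range(1, 10), 0.2, seed: 1);
        Console.WriteLine("Train rows: {0}, test rows: {1}", train.Count, test.Count);
    }
}
```

The important property is that no row ends up in both sets, so the evaluation measures how the model behaves on reviews it has never seen.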

Next, we use the loaded training data to train our model and then evaluate it to understand how well the model has been trained. Let's create a function for it as below:


        /// <summary>
        ///     Trains model according to selected trainer
        /// </summary>
        /// <param name="targetTrainer">Trainer to use</param>
        public LearningMethodResult Train(Trainers targetTrainer = Trainers.SdcaLogisticRegression)
        {
            //Load data
            var splitDataView = LoadData();

            //Build and train
            _targetTrainer = targetTrainer;
            _model = BuildAndTrainModel(splitDataView.TrainSet);

            //Evaluate
            var learningMethodResult = Evaluate(_model, splitDataView.TestSet);

            var directoryInfo = new FileInfo(_modelPath).Directory;
            if (directoryInfo != null)
            {
                var path = directoryInfo.FullName;
                if (!Directory.Exists(path))
                    Directory.CreateDirectory(path);
            }

            // Save _model
            _mlContext.Model.Save(_model, _dataViewPrimary.Schema, _modelPath);
            return learningMethodResult;
        }

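As you can see above, we load the data, train the model, evaluate it with the test set, then save the trained model and return the evaluation results. The selected trainer is stored in the "_targetTrainer" field, and the BuildAndTrainModel and Evaluate helpers are part of the full listing of the class.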
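
One detail worth knowing about the save step: `Directory.CreateDirectory` does nothing when the directory already exists, so the explicit existence check is optional. Here is a minimal sketch of the same preparation, assuming a hypothetical helper named `ModelPathHelper` and a made-up demo path.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the article's source code.
public static class ModelPathHelper
{
    // Makes sure the directory that will contain `modelPath` exists
    // and returns that directory's full path.
    public static string EnsureParentDirectory(string modelPath)
    {
        if (string.IsNullOrWhiteSpace(modelPath))
            throw new ArgumentException("Model path is undefined", nameof(modelPath));

        var directory = Path.GetDirectoryName(Path.GetFullPath(modelPath));
        // CreateDirectory is a no-op when the directory already exists.
        if (!string.IsNullOrEmpty(directory))
            Directory.CreateDirectory(directory);
        return directory;
    }

    public static void Main()
    {
        // Made-up location under the temp folder, used only for this demo.
        var modelPath = Path.Combine(Path.GetTempPath(), "movie-reviews-demo", "model.zip");
        Console.WriteLine(EnsureParentDirectory(modelPath));
    }
}
```

Checking the path up front also gives a clearer error than the one raised by `new FileInfo(null)` when no model path was passed to the constructor.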

These are all the steps we need in order to use ML.Net in our example project. To keep the article focused on ML.Net, I am not going to go into more detail about the algorithms; there are other articles on the "Articles" page for those who want more details about them.

Here you can find the full code of the SentimentAnalyst class. There are two more functions that I have not described yet, TrainMultiple and CrossValidate.

TrainMultiple tries all the learning methods at once and returns their results.

CrossValidate is the function you can apply to your trained model to understand whether it is overfitted or well trained. It needs the training pipeline, so call it after Train on the same instance.
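
To interpret the cross-validation output, it helps to see how the summary numbers come from the per-fold accuracies. The sketch below uses the same formulas as the helper methods in the full listing (sample standard deviation, and 1.96 times that value divided by the square root of n - 1); the fold accuracies are made up for illustration and `FoldStats` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the article's source code.
public static class FoldStats
{
    // Sample standard deviation (divides by n - 1).
    public static double StandardDeviation(IReadOnlyCollection<double> values)
    {
        var average = values.Average();
        var sumOfSquares = values.Sum(v => (v - average) * (v - average));
        return Math.Sqrt(sumOfSquares / (values.Count - 1));
    }

    // Half-width of the 95% interval, computed as in the article's class.
    public static double ConfidenceInterval95(IReadOnlyCollection<double> values)
    {
        return 1.96 * StandardDeviation(values) / Math.Sqrt(values.Count - 1);
    }

    public static void Main()
    {
        // Made-up accuracies for five folds.
        var accuracies = new[] { 0.86, 0.88, 0.87, 0.85, 0.89 };
        Console.WriteLine("Average: {0:F4}", accuracies.Average());
        Console.WriteLine("Std deviation: {0:F4}", StandardDeviation(accuracies));
        Console.WriteLine("95% interval: +/- {0:F4}", ConfidenceInterval95(accuracies));
    }
}
```

A small spread between the folds suggests that the model behaves consistently, while a large spread is a hint that it is sensitive to which rows it was trained on.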


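    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using Microsoft.ML;
    using Microsoft.ML.Calibrators;
    using Microsoft.ML.Data;
    using Microsoft.ML.Trainers;

    public class SentimentAnalyst
    {
        private readonly string _dataPath;
        private readonly string _modelPath;

        private readonly MLContext _mlContext;
        private ITransformer _model;
        private IDataView _dataViewPrimary;

        private Trainers _targetTrainer;

        // Pipeline for trainers that produce a Platt-calibrated model
        private EstimatorChain<BinaryPredictionTransformer<CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> _trainingPipelinePlat;

        // Pipeline for trainers that produce a non-calibrated linear model
        private EstimatorChain<BinaryPredictionTransformer<LinearBinaryModelParameters>> _trainingPipeline;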

        public SentimentAnalyst(string dataPath = null, string modelPath = null)
        {
            _mlContext = new MLContext();
            _dataPath = dataPath;
            _modelPath = modelPath;
        }


        /// <summary>
        ///     Loads trained model for prediction
        /// </summary>
        public void LoadTrainedModel()
        {
            if (_modelPath == null)
                throw new Exception("Model Path is undefined");


            // Load trained _model
            if (File.Exists(_modelPath))
                _model = _mlContext.Model.Load(_modelPath, out _);
        }


        /// <summary>
        ///     Trains model according to selected trainer
        /// </summary>
        /// <param name="targetTrainer">Trainer to use</param>
        public LearningMethodResult Train(Trainers targetTrainer = Trainers.SdcaLogisticRegression)
        {
            //Load data
            var splitDataView = LoadData();

            //Build and train
            _targetTrainer = targetTrainer;
            _model = BuildAndTrainModel(splitDataView.TrainSet);

            //Evaluate
            var learningMethodResult = Evaluate(_model, splitDataView.TestSet);

            var directoryInfo = new FileInfo(_modelPath).Directory;
            if (directoryInfo != null)
            {
                var path = directoryInfo.FullName;
                if (!Directory.Exists(path))
                    Directory.CreateDirectory(path);
            }


            // Save _model
            _mlContext.Model.Save(_model, _dataViewPrimary.Schema, _modelPath);

            return learningMethodResult;
        }


        /// <summary>
        ///     Trains model with several trainers
        /// </summary>
        public List<LearningMethodResult> TrainMultiple()
        {
            var learningMethodResults = new List<LearningMethodResult>();

            //Load data
            var splitDataView = LoadData();

            foreach (var trainer in (Trainers[]) Enum.GetValues(typeof(Trainers)))
            {
                _targetTrainer = trainer;

                Console.WriteLine("Trainer:{0}", _targetTrainer);

                //Build and train
                _model = BuildAndTrainModel(splitDataView.TrainSet);

                //Evaluate
                learningMethodResults.Add(Evaluate(_model, splitDataView.TestSet));
            }

            return learningMethodResults;
        }


        /// <summary>
        ///     Starts cross validation for the model
        /// </summary>
        /// <param name="folds">Number of folds</param>
        public CrossValidationResult CrossValidate(int folds = 5)
        {
            var crossValidationResult = new CrossValidationResult();
            IReadOnlyList<TrainCatalogBase.CrossValidationResult<BinaryClassificationMetrics>> crossValidationResults =
                null;
            if (_trainingPipelinePlat != null)
                crossValidationResults =
                    _mlContext.BinaryClassification.CrossValidateNonCalibrated(_dataViewPrimary, _trainingPipelinePlat,
                        folds, "sentiment");
            else if (_trainingPipeline != null)
                crossValidationResults =
                    _mlContext.BinaryClassification.CrossValidateNonCalibrated(_dataViewPrimary, _trainingPipeline,
                        folds, "sentiment");


            var metricsInMultipleFolds =
                (crossValidationResults ?? throw new InvalidOperationException()).Select(r => r.Metrics);
            var accuracyValues = metricsInMultipleFolds.Select(m => m.Accuracy);
            var accuracyAverage = accuracyValues.Average();
            var accuraciesStdDeviation = CalculateStandardDeviation(accuracyValues);
            var accuraciesConfidenceInterval95 = CalculateConfidenceInterval95(accuracyValues);


            crossValidationResult.AccuracyAverage = accuracyAverage;
            crossValidationResult.AccuraciesStdDeviation = accuraciesStdDeviation;
            crossValidationResult.AccuraciesConfidenceInterval95 = accuraciesConfidenceInterval95;
            crossValidationResult.Trainer = _targetTrainer.ToString();

            return crossValidationResult;
        }


        /// <summary>
        ///     Makes prediction according to input sentiment
        /// </summary>
        /// <param name="sentiment">Input sentiment</param>
        public Prediction Predicate(Data sentiment)
        {
            var predictionFunction = _mlContext.Model.CreatePredictionEngine<Data, Prediction>(_model);
            return predictionFunction.Predict(sentiment);
        }


        /// <summary>
        ///     Makes multiple predictions according to multiple input sentiments
        /// </summary>
        /// <param name="sentiments">Input sentiments</param>
        public IEnumerable<Prediction> MultiPredicate(IEnumerable<Data> sentiments)
        {
            var sentimentPredictionResultList = new List<Prediction>();
            var batchComments = _mlContext.Data.LoadFromEnumerable(sentiments);
            var predictions = _model.Transform(batchComments);

            // Use _model to predict whether comment data is Positive (1) or Negative (0).
            var predictedResults = _mlContext.Data.CreateEnumerable<Prediction>(predictions, false);

            foreach (var prediction in predictedResults)
            {
                var sentimentPrediction = new Prediction
                {
                    PredictionValue = prediction.PredictionValue,
                    Score = prediction.Score
                };
                sentimentPredictionResultList.Add(sentimentPrediction);
            }
            return sentimentPredictionResultList;
        }


        /// <summary>
        ///     Loads data set
        /// </summary>
        private DataOperationsCatalog.TrainTestData LoadData()
        {
            IDataView dataView = null;

            if (_dataPath == null)
                throw new Exception("Data Path is undefined");


            _dataViewPrimary = _mlContext.Data.LoadFromTextFile<Data>(
                _dataPath,
                hasHeader: true,
                separatorChar: ',',
                allowQuoting: true
            );

            dataView = _dataViewPrimary;

            // 80% of the data for training, 20% for testing
            var splitDataView = _mlContext.Data.TrainTestSplit(dataView, 0.2);
            return splitDataView;
        }


        /// <summary>
        ///     Builds and trains model
        /// </summary>
        /// <param name="splitTrainSet">Training data set</param>
        private ITransformer BuildAndTrainModel(IDataView splitTrainSet)
        {
            var dataProcessPipeline = _mlContext.Transforms.Text.FeaturizeText("review_tf", "review")
                .Append(_mlContext.Transforms.CopyColumns("Features", "review_tf"))
                .Append(_mlContext.Transforms.NormalizeMinMax("Features", "Features")
                    .AppendCacheCheckpoint(_mlContext));


            switch (_targetTrainer)
            {
                case Trainers.LbfgsLogisticRegression:
                {
                    var trainer = _mlContext.BinaryClassification.Trainers.LbfgsLogisticRegression("sentiment");
                    _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                    _trainingPipeline = null;
                    return _trainingPipelinePlat.Fit(splitTrainSet);
                }

                case Trainers.SgdCalibrated:
                {
                    var trainer = _mlContext.BinaryClassification.Trainers.SgdCalibrated("sentiment");
                    _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                    _trainingPipeline = null;
                    return _trainingPipelinePlat.Fit(splitTrainSet);
                }

                case Trainers.SdcaLogisticRegression:
                {
                    var trainer = _mlContext.BinaryClassification.Trainers.SdcaLogisticRegression("sentiment");
                    _trainingPipelinePlat = dataProcessPipeline.Append(trainer);
                    _trainingPipeline = null;
                    return _trainingPipelinePlat.Fit(splitTrainSet);
                }

                case Trainers.AveragedPerceptron:
                {
                    var trainer = _mlContext.BinaryClassification.Trainers.AveragedPerceptron("sentiment");
                    _trainingPipeline = dataProcessPipeline.Append(trainer);
                    _trainingPipelinePlat = null;
                    return _trainingPipeline.Fit(splitTrainSet);
                }

                case Trainers.LinearSvm:
                {
                    var trainer = _mlContext.BinaryClassification.Trainers.LinearSvm("sentiment");
                    _trainingPipeline = dataProcessPipeline.Append(trainer);
                    _trainingPipelinePlat = null;
                    return _trainingPipeline.Fit(splitTrainSet);
                }

                default:
                    throw new ArgumentOutOfRangeException(nameof(_targetTrainer), _targetTrainer, null);
            }
        }

        /// <summary>
        ///     Evaluates model by test data set
        /// </summary>
        /// <param name="model">Model to evaluate</param>
        /// <param name="splitTestSet">Test data set</param>
        private LearningMethodResult Evaluate(ITransformer model, IDataView splitTestSet)
        {
            var learningMethodResult = new LearningMethodResult();
            var predictions = model.Transform(splitTestSet);
            var metrics = _mlContext.BinaryClassification.EvaluateNonCalibrated(predictions, "sentiment");

            learningMethodResult.Accuracy = metrics.Accuracy;
            learningMethodResult.AreaUnderRocCurve = metrics.AreaUnderRocCurve;
            learningMethodResult.F1Score = metrics.F1Score;
            learningMethodResult.Trainer = _targetTrainer.ToString();

            return learningMethodResult;
        }


        /// <summary>
        ///     Calculates standard deviation for cross validation results
        ///     This is an auto-generated file by Microsoft ML.NET CLI (Command-Line Interface) tool.
        /// </summary>
        /// <param name="values">Values to summarize</param>
        private static double CalculateStandardDeviation(IEnumerable<double> values)
        {
            var average = values.Average();
            var sumOfSquaresOfDifferences = values.Select(val => (val - average) * (val - average)).Sum();
            var standardDeviation = Math.Sqrt(sumOfSquaresOfDifferences / (values.Count() - 1));
            return standardDeviation;
        }

        /// <summary>
        ///     Calculates confidence interval
        ///     This is an auto-generated file by Microsoft ML.NET CLI (Command-Line Interface) tool.
        /// </summary>
        /// <param name="values">Values to summarize</param>
        private static double CalculateConfidenceInterval95(IEnumerable<double> values)
        {
            var confidenceInterval95 = 1.96 * CalculateStandardDeviation(values) / Math.Sqrt(values.Count() - 1);
            return confidenceInterval95;
        }
    }

That's it, we have finished the first part. We now have a SentimentAnalyst class based on ML.Net that we can use to train on data and make predictions!

Please continue with Sentiment Analysis Part 2 (Trainer) for the "Trainer" project.