Human aggressiveness and reactions towards uncertain decisions

Article history: Received 1 January 2019; received in revised form 6 May 2019; accepted 29 May 2019

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Data comes in different formats, such as multilingual, structured, and unstructured text and emails. The challenges of dealing with such a huge amount of data include analysis, search, sharing, transfer, and visualization. Not all collected data contains useful information, so the data must be refined to filter out the useful part. Tweets posted on Twitter express opinions, and these opinions can be used for different purposes, such as gauging public views on uncertain decisions. Such decisions have a direct impact on users' lives, and violations and aggressiveness are common outcomes. For this purpose, we collected opinions from Twitter on some popular decisions taken in the past decade. We divided the tweet text into two classes: anger (negative) and positive. We propose a model to predict public opinion towards such decisions. We used Support Vector Machine (SVM), Naïve Bayes (NB), and Logistic Regression (LR) classifiers for the text classification task, and we compared the SVM results with those of NB and LR. The research helps to predict the early behaviors and reactions of people before the major consequences of such decisions unfold. Moreover, the results highlight the feasibility of using social media to predict public opinion.


The Internet provides almost every service an ordinary user looks for. Health, education, government, and business: every area of modern life is now covered on the Internet. It connects people and makes information publicly available worldwide. Similarly, social media platforms such as Facebook, Twitter, and YouTube help people stay up to date with current news and affairs. Through social media, people can share news and opinions and participate in activities held online. Social Networking Sites (SNS) are widely used for expressing opinions on different issues.
Twitter is a popular platform for sharing opinions. Many people prefer to share their opinions on Twitter because its community is considered more educated and well organized. A study of the Twitter stream (Nawaz et al., 2017) revealed that, despite the high level of noise, a major proportion of Twitter data contains informative content. Various studies have extracted information related to health, education, and market trends (Liu and Zhang, 2012). Opinion mining techniques can be used to detect and extract subjective information from tweets (Adedoyin-Olowe et al., 2013). One method for detecting human aggressiveness is sentiment analysis (Jiang et al., 2011; Barbosa and Feng, 2010; Torunoğlu et al., 2013; Bravo-Marquez et al., 2013; Lee et al., 2014), which extracts subjective information from a tweet. Among the various social platforms, Twitter was chosen for this study for the following reasons (Mustafa et al., 2017a):
• Twitter is a platform where high-profile individuals feel comfortable sharing their views on any topic.
• It hosts real-time conversation, with breaking news shared publicly for people to comment on.
• The Twitter API for collecting data is easier to use than those of other social media such as Facebook, LinkedIn, and Tumblr.
In this work, we collected users' tweets using the Twitter streaming Application Programming Interface (API) (Twitter.com). We classified the tweets into two categories, positive and negative, and then proposed a hypothesis model for future use. We used the most popular machine learning algorithms for text classification, namely SVM, NB, and LR (Sebastiani, 2002), and performed training and testing in WEKA (Hall et al., 2009). Finally, we compared the performance of these algorithms to find the best among them on the proposed dataset. We believe that this research will be helpful for politicians, industries, and anyone who needs a quick and early response to decisions of this type.
The rest of the article is organized as follows. Section 2 discusses related work. Section 3 describes the approach we used to detect human anger in tweet text. Section 4 presents the evaluation of the proposed approach and discusses the results. Finally, Section 5 draws a conclusion.

Literature review
Machine learning, data mining, and Natural Language Processing (NLP) are widely used together for the classification of text documents. These three techniques are also used to discover patterns in electronic documents. Text mining is used to discover useful hidden information in documents and deals with operations such as retrieval, classification (supervised, unsupervised, and semi-supervised), and summarization (Khan et al., 2011).
There have been many efforts on text classification in the past. Chodey and Hu (2016) analyzed large volumes of clinical data in an attempt to identify clinical disorders. Saini and Kohli (2016) used machine learning techniques to analyze social network E-Health data. Fernandes and D'Souza (2016) analyzed product value using sentiment analysis of opinions publicly posted on Twitter. To spare a single user from reading millions of reviews of a particular product, they developed a model that uses the posted reviews to classify the product in terms of positive, negative, and neutral reviews.
In the same context, Barnaghi et al. (2016) used Twitter sentiment to predict the winner of an event. They used Bayesian Logistic Regression (BLR) and manually labeled tweets into two categories, positive and negative. Their model can be used to predict the winner of any event from sentiment. Kashyap et al. (2016) worked on music lyrics to categorize the mood of individuals, using different text mining and data mining approaches. They considered music associations, melody choice, and music proposal as features to represent the data, which is beneficial for a more accurate understanding of music mood in the mood mapping process.
Similarly, many studies have investigated online business trends using social data. Online businesses and large companies worldwide use the feedback that users post on social sites to improve their products and meet business needs over time. The text shared on Twitter in the form of tweets contains valid information that can be used to track the progress of a product. Žunić et al. (2016) categorized such data as positive or negative using machine learning clustering algorithms. They found that data available online can be used for information extraction, and that this helps companies track the progress of their products and plan for the future.
Rathore and Ilavarasan (2017) collected Twitter data to predict the success of a pizza after its launch. This type of methodology can be used to predict the behavior of any user towards a particular product. They used R and NodeXL to analyze the collected tweets, together with different text mining, NLP, and network analysis techniques to predict user behavior. Any company, including a food delivery company, can use this sort of information to anticipate the success or failure of a product.
However, to our knowledge, no previous work has analyzed reactions to uncertain decisions and the impact of those decisions on human life. This work proposes a methodology to analyze the pattern of human behavior towards uncertain decisions. For that, we trained and evaluated three classifiers: SVM, NB, and LR. In addition, Section 4 presents a comparative analysis of the classifiers, which can help in selecting an appropriate classifier for this task in the future. Our proposed methodology saves the time and cost of processing the huge volume of public feedback posted daily on social networks.

Methodology
The solution we suggest is based on Twitter data. Tweets were collected with the Twitter API. Our methodology consists of two phases: training and testing. The training phase comprises tweet collection, feature representation, and classifier training, while the testing phase has four steps: tweet collection for testing, feature representation, hypothesis prediction, and evaluation. The first two tasks (i.e., tweet collection and feature representation) are shared between the training and testing phases. The SVM, NB, and LR classifiers are used for training and hypothesis prediction. We used the WEKA tool for training and testing the proposed methodology. First, we divided the data set into two parts: training data and testing data. Our proposed solution focuses on uncertain decisions, so we collected tweets from well-known Twitter trends of the past year. Hashtagify is a free platform that provides the most used trends on any topic (Hashtagify.com). The decisions considered for data collection are listed in Table 1. We collected 5000 tweets for each topic and stored all the data in an SQL database.

Data preprocessing
Pre-processing reshapes the data into the desired form. The collected data is not clean enough for classification, so we applied data processing methods to transform it into meaningful features. Pre-processing involves tokenization (or featuring), feature weighting, and data cleaning (removal of irrelevant features). Once the data was collected, Uniform Resource Locators (URLs) were removed from the tweets. Tweets that contained only an image or a link, with no textual information, were also removed. Stop words give no information about the topic and only add noise, so they were removed using a stop-word list. Pre-processing the data also reduces the time the classifier needs for classification. The collected tweets were further preprocessed with the following steps:
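The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the URL pattern, and the function names `clean_tweet` and `preprocess` are hypothetical, since the text does not specify which stop-word list or tooling was used for this step.

```python
import re

# Hypothetical, very small stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "for", "to", "and", "in", "on"}

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")


def clean_tweet(text):
    """Remove URLs and stop words; return None if no text remains."""
    without_urls = URL_PATTERN.sub(" ", text)
    words = [w for w in without_urls.split() if w.lower() not in STOP_WORDS]
    return " ".join(words) if words else None


def preprocess(tweets):
    """Clean every tweet and drop those left without textual content."""
    cleaned = (clean_tweet(t) for t in tweets)
    return [c for c in cleaned if c is not None]
```

A tweet consisting only of a link is dropped entirely, which mirrors the removal of tweets that carry no textual information.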

Tokenization
Tokenization breaks long text strings into substrings, which may be phrases or words and are collectively known as tokens. Text can be tokenized in two ways, by words (often called a bag of words) or by phrases; word-level tokenization is considered more effective because of its statistical significance. In this process, a sentence such as "the financial crises created the conditions for the worst period of wage growth" is broken into the tokens "the, financial, crises, created, the, conditions, for, the, worst, period, of, wage, growth". Some tokenization algorithms split the tokens on whitespace, while others rely on a built-in dictionary.
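As a minimal sketch of word-level tokenization, the example sentence can be split on whitespace and counted as a bag of words. The function name `tokenize_words` and the lowercasing step are assumptions made for illustration; dictionary-based tokenizers are not covered here.

```python
from collections import Counter


def tokenize_words(sentence):
    """Split a sentence into lowercase word tokens on whitespace."""
    return sentence.lower().split()


sentence = ("the financial crises created the conditions "
            "for the worst period of wage growth")
tokens = tokenize_words(sentence)

# A bag of words keeps only how often each token occurs, not its position.
bag_of_words = Counter(tokens)
```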

Feature weighting
A standard function for computing the weights is TF-IDF, which combines two parts: TF and IDF. TF stands for term frequency and counts how often a term (token) occurs in a document, giving a measure of term occurrence within that document. IDF stands for inverse document frequency and reflects how rare a term is across the collection of documents, so that terms appearing in many documents receive lower weight (Salton and Buckley, 1988).
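The weighting can be illustrated with one common TF-IDF variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. This particular formula and the function name `tf_idf` are assumptions for the sketch; the exact variant applied inside WEKA may differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents: list of token lists.
    weight = term count in the document * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Under this variant, a term that occurs in every document gets weight zero, because it does not help to tell the documents apart.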

Sentiment classification
Once pre-processing was applied, the data was in a suitable format for classification. We categorized the data into two classes: tweets containing negative words were labeled Negative, and tweets containing positive words were labeled Positive. A sample of tweets is listed in Table 2. Different algorithms are available in this domain for training on the classification task, and various experimental studies have analyzed these methods for text categorization. These experiments found SVM, LR, and NB to be very effective algorithms (Sebastiani, 2002).

Table 2: Sample of tweets (tweet text, followed by its trend in brackets)
• Bjp is likely to show sunny leone ji CD so that people will forget about GST & NOTEBANDI [#NoteBandi]
• its good news for Saudi women that they can drive,they can contest in municipal election and attend in to social media [#SaudiWomenCanDrive]
• Imagine all the crashes now this is going to happen. And all that bad parking. #SaudiWomenDriving [#SaudiWomenCanDrive]
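To make the classification step more concrete, here is a small multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written in plain Python. It is a sketch under assumptions, not the WEKA implementation used in this study: the class name `NaiveBayes`, the smoothing choice, and the toy training tokens in the usage note are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes over token lists (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for w in tokens:
                if w in self.vocab:  # ignore tokens never seen in training
                    count = self.word_counts[label][w]
                    # add-one smoothing avoids zero probabilities
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

For example, after fitting on a few hand-labeled token lists such as `["good", "news"]` (Positive) and `["bad", "parking"]` (Negative), calling `predict` on a new token list returns the class with the higher log-probability.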

WEKA
To perform the desired task, we used WEKA.
WEKA is free, open-source software that has been used for a wide range of machine learning problems. It contains features for classification, preprocessing, clustering, visualization, association rules, and more. Machine learning methods closely resemble data mining algorithms, and WEKA offers a collection of such algorithms that can be applied to data to extract the desired results.
Standard 10-fold cross-validation is employed to evaluate the performance of the classifiers. Cross-validation is a model validation technique that estimates how well the results of a model generalize to an independent data set. In 10-fold cross-validation, the dataset is randomly partitioned into 10 sub-datasets. One of the 10 sub-datasets is selected as the validation set for model testing, and the remaining 9 sub-datasets are used for model training. This process is repeated 10 times in total, so that each sub-dataset is used exactly once as the validation set. A single estimate of the result is obtained by averaging the 10 results (Choi et al., 2010).
Three evaluation measures (precision, recall, and F-measure) are used to evaluate the performance of the classifiers. These three measures were chosen because they evaluate the category-wise predictions of a classifier. For the positive class, precision and recall are defined as follows:

\[ \text{Precision} = \frac{CPP}{PP} \tag{1} \]

\[ \text{Recall} = \frac{CPP}{PE} \tag{2} \]

In Eqs. 1 and 2, CPP, PE, and PP stand for correct positive predictions, positive examples, and positive predictions, respectively. The F-measure is the harmonic mean of precision and recall.
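The evaluation procedure can be sketched as follows. The first function partitions item indices into k folds, each used once for validation; the second computes precision, recall, and F-measure for the positive class from the CPP, PP, and PE counts. The function names, the fixed random seed, and the round-robin fold assignment are assumptions for illustration and do not attempt to reproduce WEKA's own fold assignment.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)         # random partition
    folds = [indices[i::k] for i in range(k)]    # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def precision_recall_f(actual, predicted, positive="Positive"):
    """Return (precision, recall, F-measure) for the positive class."""
    # CPP: correct positive predictions
    cpp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    pp = sum(1 for p in predicted if p == positive)  # PP: positive predictions
    pe = sum(1 for a in actual if a == positive)     # PE: positive examples
    precision = cpp / pp if pp else 0.0
    recall = cpp / pe if pe else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

In a full run, a classifier would be trained on each `training` list, scored on the matching `validation` list, and the 10 scores averaged into a single estimate.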

Results and discussion
The results of sentiment classification using SVM are given in Table 3. Precision, recall, and F-measure are approximately 89%, 90%, and 88%, respectively, and the accuracy lies in the range of 88-90%. SVM showed high precision and recall on #PanamaVerdict, #SaudiWomenCanDrive, and #NoBanNoWall.
The results of sentiment classification using NB are given in Table 4. Precision, recall, and F-measure are all approximately 85%, and the accuracy lies in the range of 85-86%.
The results of sentiment classification using LR are given in Table 5. Precision, recall, and F-measure are approximately 85%, 84%, and 85%, respectively, and the accuracy lies in the range of 84-85%. High precision and recall are achieved for #NoBanNoWall and #PanamaVerdict.
To compare the accuracy of the three classifiers, a paired t-test (corrected) was performed in WEKA. A paired t-test compares two datasets in which each observation in one dataset can be paired with an observation in the other. Its objective is to determine whether there is statistical evidence that the mean difference between the paired observations on a particular outcome is significantly different from zero (Mustafa et al., 2017a). The results indicate that some differences exist in the accuracies of SVM, LR, and NB, as shown in Table 6. However, these differences are not statistically significant.
SVM, NB, and LR have also been used for various other classification purposes, where they have shown different accuracies. On Urdu tweets, high precision was achieved by SVM and LR (91.9% and 92.7%, respectively), whereas high recall and F-measure were achieved by NB (Mustafa et al., 2017b).
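The idea behind the paired t-test can be illustrated with the plain (uncorrected) paired t statistic, computed from per-fold score differences between two classifiers. This sketch is an assumption-laden simplification: the corrected variant applied in WEKA adjusts the variance term and is not reproduced here, and the function name `paired_t_statistic` is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Plain paired t statistic for two equally long lists of paired scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    # mean difference divided by its standard error
    return mean(diffs) / (sd / math.sqrt(n))
```

The resulting value is compared against a t distribution with n - 1 degrees of freedom; a value close to zero gives no evidence that the mean difference between the two classifiers differs from zero.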

Conclusion
Twitter is one of the most important social sharing platforms. Tweets posted on Twitter express opinions, and these opinions can be used for different purposes, such as gauging public views on uncertain decisions. Such decisions have a direct impact on users' lives, and violations and aggressiveness are common outcomes. We collected tweets about such decisions and labeled them with two categories: anger (negative) and positive. We used the SVM, NB, and LR classification algorithms to build models, and we compared the SVM results with those of NB and LR. The research is useful for predicting the early behaviors and reactions of people before the major consequences of such decisions unfold. In the future, we intend to build a tool that works as a recommender system and automatically classifies tweets into the two categories, anger and positive.