Detecting abnormal electricity usage using unsupervised learning model in unlabeled data

© 2021 The Authors. Published by IASE. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Smart-home systems have achieved great popularity in the last decade as they increase comfort and quality of life. Reducing energy consumption has become a very important goal in the context of the explosive technological development of modern society, with a major impact on the future development of mankind. Moreover, the large amount of data available from smart meters installed in households can be leveraged to find data abnormalities for better monitoring and forecasting. Detecting data anomalies supports better decisions for reducing wasted energy. In recent years, machine learning models have been widely used for developing intelligent systems. Currently, researchers mainly focus on developing supervised learning models for predicting anomalies. However, it is challenging to train models on data that carries no labels indicating whether a point is an anomaly. In this paper, abnormalities in electricity usage are detected using unsupervised learning and evaluated using Excess Mass. The unsupervised anomaly detection models are based on the Gaussian Mixture Model (GMM) and Isolation Forest (iForest). The models are compared with Local Outlier Factor (LOF) and One-class support vector machine (OCSVM). The proposed framework is tested with actual electricity usage data and with temperature data obtained from the Numenta Anomaly Benchmark (NAB), which contains normal and anomalous data in time series. Finally, it has been observed that iForest performed best as the detection model for the selected use case. The outcome showed that iForest can quickly detect anomalies in electricity usage data from only a sequence of data, without feature extraction. The proposed model suits the practical requirements of a Smart Home Energy Management System and can be implemented in different houses independently. The proposed system can also be extended to other use cases with similar data types.

Keywords:
Electricity usage; Unsupervised learning; Excess mass; Machine learning; iForest; Gaussian mixture model

Introduction
The combination of the Internet of Things (IoT) and Smart Home Energy Management Systems (SHEMS) has become a hot topic for extracting valuable knowledge, as it helps in better management and visualization of electricity usage. In India, Sial et al. (2019) stated that over 65% of electricity is generated from non-renewable sources, i.e., thermal power plants burning fuels such as coal. In 2018, 26.9% of greenhouse gas emissions were due to electricity, and the residential and commercial sector accounted for 32% of this emission (EPA, 2018). These gases have far-reaching environmental and health effects; if individuals are alerted to abnormal electricity usage, they can make a few small changes to save electricity. As MEC (2014) noted, saving electricity and contributing to a greener environment does not require significant physical changes. Moreover, it was reported in Malaysia and elsewhere in the world that household electricity consumption increased during the Movement Control Order (MCO) and Conditional Movement Control Order (CMCO) periods of COVID-19. If consumers can track this consumption, they will be able to take action accordingly. The government (MEC, 2019) has indicated the importance of saving electricity and reducing wastage through monitoring systems.
Anomaly detection approaches are grouped into three methods, i.e., supervised, semi-supervised, and unsupervised (Ayodele, 2010). Supervised and semi-supervised methods require labeled datasets indicating whether each point is an anomaly, whereas unsupervised methods work on unlabeled datasets. Here, the main objective is to detect abnormal usage, which is not labeled and differs between types of consumers. Hence, the task is to train unsupervised machine learning (ML) models, i.e., the Gaussian Mixture Model (GMM) and Isolation Forest (iForest), for anomaly detection in batches of sequential active power data. The algorithm is designed for batch learning without labeled data. For the result analysis in this paper, the data was obtained from Malaysia's telecom company, and the active power values are assigned to two categories: normal and abnormal usage. The results are finally compared with other known algorithms to identify the best model for this case.
The rest of the paper is organized as follows: Section II reviews recent literature related to anomaly detection, followed by the methodology in Section III; Section IV contains results and discussion. Finally, Section V presents the conclusion and future work related to the research presented in this paper.

Literature review
As society's demand for electricity grows, an energy supply crisis becomes a significant bottleneck for an economy, so energy wastage needs to be reduced. Previously, an individual had to wait for the monthly bill to understand a change in electricity consumption. With smart meters and SHEMS, consumption has become easy to visualize. However, it is still difficult for an ordinary user to point out abnormal usage. Hence, researchers have done some work related to this kind of anomaly detection, and a few very recent works are discussed in this section. Wang and Ahn (2020) worked on real-time anomaly detection with a classification model using labeled data. They integrated the support vector machine (SVM) algorithm and the k-nearest neighbors (kNN) model with the cross-entropy loss function to develop an anomaly detection process for checking data correctness in an electrical load dataset. Yip et al. (2018) used anomaly detection to evaluate energy usage behavior and identify outliers caused by electricity fraud and faulty meters. They used Linear Programming and also trained on labeled datasets. Similarly, another network fault prediction architecture was developed, but it also relies on labeled data indicating which sessions are faulty (Emerson et al., 2020).
Moreover, Sial et al. (2019) used four different heuristic methods to identify anomalies in data obtained from the smart meters installed in the IIIT-Delhi campus hostels. They presented an empirical evaluation to demonstrate the effectiveness of their system.
Another work (Saad and Sisworahardjo, 2017) presented a contextual anomaly detection model for detecting irregular power usage using an unsupervised ML algorithm with temporal context obtained from meter data. Hu et al. (2018) proposed a system that first performs feature selection on a multivariate time-series dataset and then trains an OCSVM model. However, the main issue the authors reported is that the number of identified discords is usually larger than the number of detected outliers in real-world situations.

Methodology
This section presents the technique used in this paper for detecting anomalies in electricity usage sequence data. The overall system flow is illustrated in Fig. 1. Time-series data is preprocessed by reshaping it into one column to train the models. The Excess Mass (EM) method is used to evaluate the outcome, which indicates the trained model's performance. The evaluation identifies which technique gives the better outcome and should be used for abnormal electricity usage detection. Finally, the best model is stored for the selected house for future detection. The process is briefly explained in this section.
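The reshaping step mentioned above can be sketched as follows, assuming the readings are already loaded into a one-dimensional array. The name `active_power` and its values are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Hypothetical hourly active-power readings (illustrative values only).
active_power = np.array([310.0, 295.5, 402.1, 388.7, 1250.0, 301.2])

# scikit-learn estimators expect a 2-D array of shape (n_samples, n_features),
# so the univariate series is reshaped into a single column.
X = active_power.reshape(-1, 1)

print(X.shape)
```

Each row of `X` is then one observation with a single feature, which is the input layout that the models discussed below accept.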

Data collection
Initially, data was collected through the SHEMS API of Malaysia's Telecom Company for this research work. The data is the active power value obtained from smart meters installed in a house in a residential area. The dataset contains hourly data from 2019-01-10 to 2020-02-06. The active power value is plotted as a time series in Fig. 2a. Here, assume univariate time series data $D_{Train} = \{x_1, x_2, x_3, \ldots, x_n\}$, where $n$ is the length of the sequence data $D_{Train}$. Further, the system is tested using benchmark data designed for evaluating anomaly detection systems (Lavin and Ahmad, 2015), known as the Numenta Anomaly Benchmark (NAB). This dataset can confirm that the system works on evaluated data, as it contains real streaming evaluated data from different domains and applications. The temperature value obtained from the NAB data is plotted as a time series in Fig. 2b. From the graphs presented in Fig. 2, it can be seen that the two datasets have different forms, but both are evaluated data. In Fig. 2a, there is a slight increase in electricity usage from March 2019 to January 2020. In Fig. 2b, the data value increases from September 2013 to January 2014 and decreases from January 2014 to May 2014. For better visualization of the selected dataset $D_{Train}$, all data values are plotted in a histogram, as shown in Fig. 3. The plot indicates each active power value and its frequency in the selected dataset.

Unsupervised machine learning model
The dataset provided by the SHEMS API does not contain any labels indicating whether a point is an anomaly. Hence, the model is trained so that it groups the data according to the data itself. A probabilistic model, the Gaussian Mixture Model (GMM), and a tree-based model, Isolation Forest (iForest) (Garcia-Font et al., 2018), are developed. These models are also compared with other available models (i.e., Local Outlier Factor (LOF) and One-class support vector machine (OCSVM)). For training these models, a total of $n$ = 7,739 and $n$ = 7,267 points of $D_{Train}$ were finally selected from the electricity usage and temperature data, respectively.

Gaussian mixture modeling
A GMM is used to obtain a trained unsupervised ML model. A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities (Reynolds, 2009). Here, a GMM is implemented as a probabilistic model for clustering active power data. It assumes that all usage points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
The GMM is used to cluster unlabelled data, and it accounts for the variance of the data. Mathematically, in the univariate (one-dimensional) case, the Gaussian mixture probabilities can be calculated using Eq. 1, followed by the N value calculation using Eq. 2.
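Eq. 1 and Eq. 2 are not reproduced in this text. For reference, the standard univariate form of a Gaussian mixture and of its Gaussian component density is given below. The symbols $\phi_i$ (component weight), $\mu_i$ (mean), $\sigma_i$ (standard deviation) and $m$ (number of components) follow common convention and are an assumption, not necessarily the authors' notation.

```latex
p(x) = \sum_{i=1}^{m} \phi_i \, \mathcal{N}(x \mid \mu_i, \sigma_i),
\qquad \sum_{i=1}^{m} \phi_i = 1

\mathcal{N}(x \mid \mu_i, \sigma_i)
  = \frac{1}{\sigma_i \sqrt{2\pi}}
    \exp\!\left( -\frac{(x - \mu_i)^2}{2 \sigma_i^2} \right)
```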
To summarize, in this approach 'm' Gaussians are fitted to the data. The algorithm then finds the Gaussian distribution parameters (mean and variance) for each cluster, together with the weight of each cluster. Finally, for each data point, the probabilities of belonging to each cluster are calculated.
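A minimal sketch of these steps with scikit-learn's `GaussianMixture` is shown below. The data are synthetic, and the number of components and the 0.5% density cut-off are assumptions chosen for illustration; they are not values reported in this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the active-power series: two usage regimes plus
# a few extreme readings (illustrative values, not the paper's data).
rng = np.random.default_rng(0)
usage = np.concatenate([
    rng.normal(300.0, 20.0, 500),          # low-usage regime
    rng.normal(600.0, 40.0, 500),          # high-usage regime
    np.array([1500.0, 1600.0, 1700.0]),    # injected extreme readings
])
X = usage.reshape(-1, 1)

# Fit m = 2 Gaussians; m is an assumed value chosen for this toy data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Probability that each point belongs to each Gaussian component.
membership = gmm.predict_proba(X)

# Log-density of each point under the fitted mixture; low values are unusual.
log_density = gmm.score_samples(X)

# Assumed rule: flag the roughly 0.5% least likely points as anomalies.
threshold = np.quantile(log_density, 0.005)
is_anomaly = log_density < threshold

print(membership.shape, int(is_anomaly.sum()))
```

Thresholding the mixture log-density is one common way to turn a fitted GMM into an anomaly detector; other rules, such as a fixed density cut-off, would work with the same fitted model.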

Isolation forest
The Isolation Forest algorithm (Liu et al., 2008) explicitly isolates abnormal data instead of first profiling the normal data. The concept builds on the fundamental processes of decision tree algorithms. Tree partitions are generated by randomly selecting a feature and then randomly selecting a split value between the selected feature's minimum and maximum values. Next, an anomaly score is calculated using Eq. 3 to make decisions.
where h(x) is the path length of usage value x, and n is the size of the selected set. Moreover, c(n) is given by Eq. 4.
Here, n is the size of the training dataset and H(i) is the harmonic number, which can be estimated using Eq. 5, where γ is 0.5772156649 (known as the Euler-Mascheroni constant).
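Eq. 3 to Eq. 5 are not reproduced in this text. The sketch below implements the standard forms from Liu et al. (2008), which are assumed to correspond to them: s(x, n) = 2^(-E[h(x)] / c(n)), c(n) = 2H(n-1) - 2(n-1)/n, and H(i) ≈ ln(i) + γ. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, as quoted in the text


def harmonic(i: float) -> float:
    """Approximate harmonic number H(i) = ln(i) + gamma (assumed Eq. 5)."""
    return math.log(i) + EULER_GAMMA


def c(n: int) -> float:
    """Average path length normaliser c(n) for n > 1 points (assumed Eq. 4)."""
    if n <= 1:
        raise ValueError("c(n) is only defined here for n > 1")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Anomaly score s(x, n) = 2 ** (-E[h(x)] / c(n)) (assumed Eq. 3)."""
    return 2.0 ** (-mean_path_length / c(n))


# A point isolated after few splits scores close to 1 (likely anomaly);
# a point whose mean path length equals c(n) scores exactly 0.5.
short_path = anomaly_score(2.0, 256)
typical_path = anomaly_score(c(256), 256)
long_path = anomaly_score(20.0, 256)
print(short_path, typical_path, long_path)
```

Scores well above 0.5 indicate points that are isolated unusually quickly, which is how the forest singles out abnormal usage values.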

Excess mass (EM)
The EM method is used here to evaluate the trained models. EM is a well-known criterion for evaluating unsupervised anomaly detection algorithms (Goix, 2016) and is based on the notion of density contour clusters. In Section IV, the EM score is presented for evaluation. A higher score indicates a better-trained model.
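As a rough illustration of the idea, the sketch below estimates an empirical EM curve by Monte Carlo, in the spirit of Goix (2016): for each level t, it takes the largest gap between the share of data above a score level and t times the estimated volume of that level set. The function names, the toy scoring functions and the range of t are assumptions for this example, not the evaluation code used in this paper.

```python
import numpy as np


def excess_mass_curve(scores_data, scores_uniform, volume, t_values):
    """Empirical EM curve: EM(t) = max over u of [P(s(X) > u) - t * Leb(s > u)].

    scores_data:    scores of the observed points (higher = more normal)
    scores_uniform: scores of points drawn uniformly over the data support
    volume:         Lebesgue measure (length, in 1-D) of that support
    """
    em = np.zeros_like(t_values, dtype=float)  # EM(t) >= 0 by definition
    for u in np.unique(scores_data):
        mass = np.mean(scores_data > u)              # share of data above level u
        leb = volume * np.mean(scores_uniform > u)   # Monte Carlo volume of {s > u}
        em = np.maximum(em, mass - t_values * leb)
    return em


def curve_area(t_values, em):
    """Trapezoidal area under the EM curve, used as a single summary score."""
    return float(np.sum((em[1:] + em[:-1]) / 2.0 * np.diff(t_values)))


def good_score(v):
    """Toy scorer that is high near the mode of the data."""
    return -np.abs(v)


def bad_score(v):
    """Toy scorer that is (wrongly) high in the tails."""
    return np.abs(v)


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 2000)         # toy 1-D data
lo, hi = x.min(), x.max()
uniform = rng.uniform(lo, hi, 5000)    # uniform draws over the data range
t = np.linspace(0.0, 0.5, 51)

em_good = excess_mass_curve(good_score(x), good_score(uniform), hi - lo, t)
em_bad = excess_mass_curve(bad_score(x), bad_score(uniform), hi - lo, t)
area_good = curve_area(t, em_good)
area_bad = curve_area(t, em_bad)
print(area_good > area_bad)
```

A scoring function whose high-score regions are small in volume but contain most of the data yields a larger area, which matches the statement that a higher EM score indicates a better model.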

Tools
Python was used to develop the proposed system, so a Python 3.0 environment was installed. Several Python libraries were also used for other tasks: 'Numpy' and 'Pandas' for data structures and data processing, and 'Matplotlib' for data visualization. Finally, the 'sklearn' library was used for the Gaussian Mixture model implementation.

Evaluation using energy consumption data
GMM was selected for clustering because it can account for the variance of the dataset, as shown in Fig. 4, where the calculated probability is plotted.
For evaluating and cross-checking the EM score, the GMM is trained on different subsets and the results are tabulated in Table 1, here 2,361. The EM scores are high and close to each other. In addition, the model plots 12 anomalies for the first half and for the second half of the data, while it detects 24 anomalies on the complete dataset. This consistency suggests that the anomaly detection is plausibly correct. For further evaluation, each model's EM score is calculated after training, using the decision function to predict confidence scores, and is presented in Table 2. iForest and GMM tend to outperform the other models. Following the EM evaluation, all data is clustered using OCSVM, iForest, LOF, and GMM, and the detected anomalies are plotted, as shown in Fig. 5. Here, it is clear that the abnormalities detected by iForest and GMM are almost the same. Besides having excellent EM scores, iForest and GMM both tend to detect similar anomalies.
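The comparison of the four detectors can be sketched with scikit-learn as follows. The data are synthetic, and `contamination`, `nu`, `n_neighbors` and the single-component GMM are assumed settings for illustration only, since the text above does not state the hyperparameters used.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic usage values with three injected extreme readings at the end.
rng = np.random.default_rng(1)
usage = np.concatenate([rng.normal(500.0, 50.0, 600), [1500.0, 1550.0, 1600.0]])
X = usage.reshape(-1, 1)
contamination = 0.01  # assumed share of anomalies; not a value from the paper

# Fit each model on the same unlabeled data.
iforest = IsolationForest(contamination=contamination, random_state=0).fit(X)
ocsvm = OneClassSVM(nu=contamination, gamma="scale").fit(X)
lof = LocalOutlierFactor(n_neighbors=20).fit(X)
gmm = GaussianMixture(n_components=1, random_state=0).fit(X)

# Each entry maps a model name to a "higher = more normal" score per point.
scores = {
    "iForest": iforest.decision_function(X),
    "OCSVM": ocsvm.decision_function(X),
    "LOF": lof.negative_outlier_factor_,
    "GMM": gmm.score_samples(X),
}

# Flag the lowest-scoring fraction of points under each model.
flagged = {name: s <= np.quantile(s, contamination) for name, s in scores.items()}
for name, mask in flagged.items():
    print(name, int(mask.sum()))
```

Because every model is reduced to a per-point score with the same orientation, the same scores can be passed to an EM-style evaluation or compared point by point to see where the detectors agree.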
Moreover, the histogram in Fig. 6 indicates normal and abnormal usage. These plots give the user a clear visualization of abnormal energy usage at specific active power values. They show that active power from 0.75M to 1.0M is anomalous usage for the selected house. This detection will be different for different homes. Hence, the trained model is more reliable and does not depend on a specific household. The detection is not statistically based, and each model must operate independently according to the household's usage.
The number of anomalies detected is not the same across the four methods. However, the abnormal usage detected by iForest and GMM is almost the same, and even the number of detections is the same. The number of anomalies detected by each trained model is presented in Table 3. Here, as stated by Hu et al. (2018), it is also observed that the number of anomalies detected is more than usual. Finally, it can be concluded that iForest and GMM can be used for detecting abnormal electricity usage in unlabeled data. A hybrid of iForest and GMM can help to find the exact anomaly points.

Evaluation using benchmark data
The known NAB dataset is used to train each model (i.e., One-Class SVM, iForest, LOF, and GMM), and the EM score is then obtained using the decision function to predict confidence scores. Here, the temperature sensor data is used, and the EM scores are tabulated in Table 4. As with the energy consumption data, iForest and GMM outperformed the other models in anomaly detection. The NAB dataset can confirm that the selected algorithms can detect anomalous data, as it contains both clean data and anomaly data. As with electricity usage, iForest and GMM also detected similar anomalies in this second dataset. The time series plot with the detected anomalies for this dataset is presented in Fig. 7.

Conclusion
In this paper, a system is presented and evaluated for studying abnormal data in sequential active power data. The results suggest that it is an effective way to detect unusual electricity consumption. iForest obtained the best EM score for detecting anomalies, followed by GMM. To conclude, this strategy of detecting abnormal usage could eventually benefit any individual using a smart home energy monitoring system. The benefit comes from saving electricity by reducing energy wastage and cost. Correctly implementing the system's approach will also play a significant role in the smart home market. The study showed that unsupervised techniques can bring considerable value to smart homes. The approach can also be considered for detecting unusual energy usage or energy theft, or for other similar evaluated datasets.
In the future, more investigation will be carried out using other unsupervised ML models (such as DBSCAN (Sheridan et al., 2020)). The evaluation currently considers data outliers; novelty detection can be considered next. As mentioned by Carreño et al. (2019), there is a difference between outliers and novelties. Next, the data visualization shows that some data are missing in the time series. It is also essential to handle these missing values (Jesmeen et al., 2018; 2019). Moreover, handling these missing data might enhance the EM score. In the case of energy consumption anomaly detection, it is also necessary to detect seasonal anomalies and to discriminate between actual anomalies and seasonal anomalies.

Funding
This work was supported in part by Telekom Malaysia under Grant TMRND [MMUE/190007].

Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.