Real time end-to-end glass break detection system using LSTM deep recurrent neural network

This paper proposes a new design for a glass break detection system that uses a deep long short-term memory (LSTM) recurrent neural network in an end-to-end approach to reduce the false positive alarms of state-of-the-art glass break detectors. We use raw waveform audio data to detect glass break events in an end-to-end learning approach. The key benefit of end-to-end learning is that it avoids the need for hand-crafted audio features. To address the vanishing and exploding gradient problems of conventional recurrent neural networks, this paper proposes a deep LSTM recurrent neural network to handle the sequence of input audio data. In real-time detection, the proposed approach has a clear advantage over conventional glass break detection systems: it yields significantly higher detection accuracy (99.999988%) and suffers less from environmental noise that might cause a false alarm.


Introduction
Glass is increasingly used in the construction of offices and residences because of its advantages in comfort, well-being, style and light sustainability. Despite these benefits, glass is also a security risk at night or when nobody is present, because an intruder can easily smash a glass door, reach inside and open the latch lock. Glass break detection therefore plays an important role in securing offices and residences, since most burglars and intruders enter through glass doors and windows. A glass break detector is an electronic sensor that detects the breakage vibrations or shattering sounds of glass panes. It can be used to protect both the internal and external perimeter of a building. When a glass pane shatters or breaks, it generates sound over a wide band of vibrations and frequencies. These shattering sounds have a distinct frequency signature, generally in the range of 3 to 5 kHz, depending on the type of glass and the presence of an interconnected plastic layer. Most conventional electronic glass break detectors use pre-determined frequency, amplitude and vibration thresholds to determine whether the glass has broken. Conventional glass break detectors can be grouped into three main categories: active detectors, physical vibration detectors and acoustic detectors. Active detectors send energy at a set of frequencies towards the window panes and receive the reflected energy. Any change observed in the reflected energy triggers an alarm or activates another circuit (Clark and Lewis, 1996; Zidan, 2015). A physical vibration detector is built around a piezoelectric element. Whenever the glass is broken, vibrations are induced in its molecules. The detector senses these vibrations and converts them into an electric signal, which then triggers the alarm system (Sharapov, 2011).
Acoustic glass break detectors contain one or more acoustic transducers that produce an electrical signal in response to the high-amplitude, high-frequency sound created when a glass plate breaks. If a burglar tries to break through a window, the detector picks up the high-pitched shattering sound and the pre-determined frequency composition of the breaking event, and trips the alarm (Cecic and Fong, 1997; Matesa, 2015; Rickman, 1995). In practice, distinguishing glass breaking sounds from other loud anomalous sounds (such as gunshots, thunder, people shouting, and objects being dropped or hit) remains a challenging task despite decades of work. The chance of false alarms in glass break detectors is high, because shocks and anomalous loud sounds have frequency and vibration characteristics similar to the pre-defined thresholds for glass breaking sounds (Clavel et al., 2005). Recent developments in technology have made progress towards overcoming this drawback. Artificial Intelligence (AI) has shown continuing success over the past few years in video and audio applications (such as speech recognition, computer vision and voice translation), and smart security surveillance systems are shifting from conventional electronic sensor based classification techniques to modern machine learning and deep learning methods. Among them, Conte et al. (2012) proposed abnormal audio event detection in an urban area, Mahler et al. (2017) discussed a home interior security system, Dufaux et al. (2000) proposed an impulsive sound detection system in a public square, and Zidan (2015) studied protection of nuclear facilities using hardware sensors. Gestner et al. (2007) proposed Digital Signal Processing (DSP) based glass break detectors for homes and offices. Peng et al. (2014) focused on an impulsive sound detection and surveillance system for public transport. Aurino et al. (2014) discussed anomaly detection in automatic surveillance applications. Kiktova et al. (2015) proposed a gunshot and shout detection system for a city environment, which is particularly applicable to surveillance tasks in which audio can contribute continuously.
In this paper, we propose a new architecture for a glass break detection system that reduces false detection alarms using a long short-term memory (LSTM) deep recurrent neural network in an end-to-end approach.

Outline
The rest of the paper is organized as follows. Section 3 discusses the acquisition of acoustic audio signals. Section 4 describes the LSTM deep recurrent neural network. Section 5 explains the end-to-end LSTM deep recurrent neural network approach to glass break detection. Section 6 summarizes the experimental results of the proposed deep learning model, and Section 7 provides conclusions.

Data acquisition
For data acquisition, we manually collected and annotated a dataset of glass break and non-glass break acoustic audio for training, validation and testing of the proposed system. Input audio signals were recorded with the built-in microphone of a laptop. The audio was sampled at 44,100 Hz (44.1 kHz) and cut into 2-second time frames.
We collected 5000 audio (.wav) slices under various noise level environments, as shown in Fig. 1. The dataset comprises two sound classes: 2500 slices of breaking glass sounds (recorded at different noise levels) and 2500 slices of non-breaking glass sounds drawn from environmental sounds and noise (a combination of shouting, car horns, household sounds, alarms, animals, farm sounds, children playing and people conversing), as shown in Fig. 2.
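To make the slicing concrete, the following is a minimal sketch of how a raw recording could be cut into 2-second frames at 44,100 Hz. The function name `frame_signal` and the `overlap` parameter are illustrative assumptions, not the code used in this work; the 50% overlap mentioned in the Experiments section corresponds to `overlap=0.5`.

```python
import numpy as np

SAMPLE_RATE = 44_100   # Hz, sampling rate stated in the text
FRAME_SECONDS = 2      # frame length stated in the text


def frame_signal(signal, sample_rate=SAMPLE_RATE,
                 frame_seconds=FRAME_SECONDS, overlap=0.0):
    """Cut a 1-D raw waveform into fixed-length frames (hypothetical helper).

    overlap is the fraction shared by consecutive frames, in [0, 1).
    Trailing samples that do not fill a whole frame are dropped.
    """
    signal = np.asarray(signal)
    frame_len = int(sample_rate * frame_seconds)
    hop = int(frame_len * (1.0 - overlap))
    if hop <= 0:
        raise ValueError("overlap must be in the range [0, 1)")
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])


# Six seconds of synthetic noise standing in for a microphone recording.
recording = np.random.default_rng(0).normal(size=6 * SAMPLE_RATE)
frames_no_overlap = frame_signal(recording)
frames_half_overlap = frame_signal(recording, overlap=0.5)
```

Each row of the returned array holds 88,200 raw samples (2 s × 44,100 Hz), which is the unit the network would see in an end-to-end setting.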

Deep recurrent neural network (long short-term memory, LSTM)
Image and audio analysis have been primary research topics in deep learning. Although many sources of image data have become available in recent years, audio analysis was limited until recently. The scarcity of publicly available sound data and the complex sequential characteristics of audio data (such as frequency features and energy levels) explain the limited research in this field (Graves et al., 2013). Recurrent networks are a kind of artificial neural network designed to recognize patterns in sequential data (such as text, speech, numerical sensor readings, and time series from stock exchanges and social networks). Recurrent neural networks (RNNs) differ from conventional feed-forward neural networks in that they have a feedback loop tied to their previous decisions, taking their own outputs as an input at each timestamp. An RNN discovers correlations between events separated by many timestamps; these correlations are referred to as "long-term dependencies" (associated with vanishing gradients) and "short-term dependencies" (associated with exploding gradients) (Sak et al., 2014).
The first drawback of conventional recurrent neural networks is the difficulty of finding correlations between current events and the long-term memory of past timestamps (the vanishing gradient problem). The weight updates are too small, so the weights remain almost unchanged and many iterations are needed to update them. The second drawback is the difficulty of finding correlations between current events and the short-term memory of recent timestamps (the exploding gradient problem). The weight updates are too large, so the updated weights end up too far from the current weights (Gers et al., 2000).
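The effect of repeated multiplication on the gradient can be illustrated with a toy calculation. The sketch below assumes a scalar linear recurrence h_t = w · h_{t-1}, for which the gradient of h_T with respect to h_0 is w^T; the function name and the chosen weights are illustrative and are not taken from the experiments in this paper.

```python
def gradient_through_time(recurrent_weight, n_steps):
    """Gradient of h_T w.r.t. h_0 for the toy recurrence h_t = w * h_{t-1}.

    Backpropagating through n_steps timestamps multiplies the gradient
    by the recurrent weight once per step.
    """
    grad = 1.0
    for _ in range(n_steps):
        grad *= recurrent_weight
    return grad


# A weight below 1 shrinks the gradient towards zero (vanishing),
# a weight above 1 blows it up (exploding).
vanishing = gradient_through_time(0.5, 50)
exploding = gradient_through_time(1.5, 50)
print(f"w=0.5 over 50 steps: {vanishing:.3e}")
print(f"w=1.5 over 50 steps: {exploding:.3e}")
```

In a real RNN the scalar weight is replaced by a weight matrix and an activation derivative, but the same compounding behaviour is what makes the weight updates either negligible or unstable.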
The LSTM architecture has been shown to be particularly effective when stacked into a deep configuration for handling the vanishing and exploding gradient issues of traditional recurrent neural networks. In the LSTM structure, the recurrent hidden layer consists of a set of recurrently connected subnets called "memory blocks". Each memory block includes one or more self-connected memory cells and three multiplicative gates that control the flow of information (Gers et al., 2000).
Fig. 3 illustrates how an LSTM carries memory forward. The LSTM architecture works as follows. The first gate decides what to forget from the data (forget gate); the second gate stores new information in the cell state (input gate); the final gate produces the new output based on what was decided (output gate). This is essentially how an LSTM handles complex sequences of data across different timestamps.

Forget gate
The first step in the LSTM is to identify information that is not required and will be discarded from the cell state. This decision is made by a sigmoid layer called the forget gate layer.
Graphical representation of the forget gate is shown in Fig. 4. In Eq. 1,

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (1)$$

the forget gate is denoted as $f_t$ and the cell state as $C_t$.
The hidden state at the previous timestamp is $h_{t-1}$, and the current input is denoted as $x_t$. The previous hidden state $h_{t-1}$ and the current input $x_t$ are concatenated at the same timestamp, multiplied by a weight matrix $W_f$ and summed with the bias $b_f$ of the forget gate. The result is squashed by the sigmoid activation function $\sigma$, which is a standard tool for condensing very large or very small values, as well as for keeping gradients workable for backpropagation through time. The final value of the forget gate ($f_t$) lies between 0 and 1, according to the output of the sigmoid activation function. A value of 0 means that the information from the previous timestamp should be forgotten completely; a value of 1 means that it should be remembered completely for the current state (Pascanu et al., 2013).

Input gate
This step decides what new information to store in the cell state.
Graphical representation of the input gate is shown in Fig. 5. In Eq. 2,

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \qquad (2)$$

$i_t$ denotes a sigmoid layer called the "input gate layer" that decides which values will be updated. From Eq. 3 by Gers et al. (2000),

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \qquad (3)$$

$\tilde{C}_t$ denotes a tanh layer that creates a vector of "new candidate values" that could be added to the state. We then combine these two to update the state. From Eq. 4 (Gers et al., 2000),

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \qquad (4)$$

we update the old cell state ($C_{t-1}$) into the new cell state ($C_t$).
First, we multiply the old state ($C_{t-1}$) by $f_t$, forgetting the things we decided to forget earlier. Then, we add $i_t * \tilde{C}_t$. This is the new candidate value, scaled by how much we decided to update each state value.

Output gate
The Output Gate decides what part of the cell state to output.
Graphical representation of the Output Gate is shown in Fig. 6. From Eq. 5 (Gers et al., 2000),

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (5)$$

$o_t$ denotes the output gate, which runs as a sigmoid layer that decides what part of the cell state to output.
From Eq. 6 (Gers et al., 2000),

$$h_t = o_t * \tanh(C_t) \qquad (6)$$

the updated cell state ($C_t$) passes through the tanh activation function, which pushes its values between -1 and 1, and is multiplied by the output gate, so that the LSTM (RNN) produces only the new output information ($h_t$) relevant to what comes next (Li and Wu, 2015).
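The forget, input and output gate steps described above can be sketched as a single forward step of one LSTM cell. This is a minimal NumPy illustration of the standard LSTM formulation, not the TensorFlow implementation used in the experiments; the names `lstm_step` and `init_params`, the parameter dictionary layout, and the chosen sizes are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function; squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four LSTM layers."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "c", "o"):
        params[f"W_{gate}"] = rng.normal(
            0.0, 0.1, size=(hidden_size, hidden_size + input_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM cell (illustrative sketch)."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])    # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                           # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])        # output gate
    h_t = o_t * np.tanh(c_t)                                     # new hidden state
    return h_t, c_t


# Run the cell over a short random sequence (5 timestamps, 8 features each).
input_size, hidden_size = 8, 4
params = init_params(input_size, hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
sequence = np.random.default_rng(1).normal(size=(5, input_size))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
```

Because the hidden state is the product of a sigmoid output and a tanh output, every entry of `h` stays strictly between -1 and 1 regardless of the input scale.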

Experiments
In the experiment, variable-length audio sequences (glass breaking and non-glass breaking sound samples) are recorded in 2-second time frames with 50% overlap. The dataset is transformed into a fixed-length raw temporal image form of shape (299 × 299 × 3), and then further reshaped from the fixed-length input image into a 2048-dimensional bottleneck tensor in byte vector array form. If sufficient training data are available, feeding the raw temporal acoustic waveform directly to the entire neural network works well for target classification, as opposed to using hand-crafted heuristic spectral audio features. Before training the LSTM RNN, the 5000 audio samples in the dataset (2500 glass break, 2500 non-glass break) are randomly split in 10-fold cross-validation form into training (70%), validation (10%) and test (10%) sets. The training and validation sets are used to train an LSTM recurrent neural network with three time-delay hidden layers, which computes the sigmoid and tanh activation functions of a weighted sum at each timestamp. For network training, we set the initial and final learning rates in a range from 0.0005 to 0.001 for stable convergence. To prevent overfitting during training, we used early termination together with L2 regularization and dropout. The truncated backpropagation through time (BPTT) learning algorithm is then adopted to minimize the cost function during optimization. The gradients are computed for each subsequence and back-propagated to its start. After the model has updated the parameters of the LSTM network, the output gate of the LSTM classifies the N-length audio sequence as a glass break or not. All experiments were conducted with the TensorFlow library in the Python environment.
The system architectures of the proposed end-to-end glass break detection system using a deep LSTM recurrent neural network and of the conventional hand-crafted feature based glass break detection system are shown in Fig. 7 and Fig. 8.
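The early termination used to prevent overfitting can be sketched as a simple rule on the validation loss. The exact stopping criterion is not specified in this paper, so the patience-based rule below, the function name `early_stopping_epoch`, and the example loss values are all assumptions for illustration only.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, best_loss) under a patience-based stopping rule.

    Training is considered stopped once the validation loss has failed to
    improve by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; keep the weights from best_epoch
    return best_epoch, best_loss


# Made-up validation losses: improvement stalls after epoch 2.
example_losses = [0.9, 0.7, 0.6, 0.65, 0.64, 0.66, 0.5]
best_epoch, best_loss = early_stopping_epoch(example_losses, patience=3)
```

With a patience of 3, the loop stops at epoch 5 and never sees the later value 0.5, which shows the trade-off in choosing the patience: a small value saves training time but can stop before a late improvement.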

Result and discussions
To measure the offline detection accuracy of the proposed LSTM model during training, we split the data in 10-fold cross-validation form into training/validation/test sets. Running 10-fold cross-validation without replacement prevents sub-segments of the same recording from being used in both training and validation. Cross entropy and mean squared error are used as the measures for the proposed classification. The experimental results of the proposed glass break detection system show a training set accuracy of 100%, a validation set accuracy of 100% and an unseen test set accuracy of 99.999% for the 5000-sample audio dataset. In the online experiments, a microphone records audio (.wav) in 2-second time frames at a sampling rate of 44,100 Hz (44.1 kHz). The audio recorded with the laptop's built-in microphone is analyzed with the proposed LSTM (deep recurrent neural network) end-to-end learning approach to detect glass breaking sounds. To measure the accuracy of the online system, we ran the proposed model on a Raspberry Pi device and tested it with 48 hours of non-stop detection under different noises (such as human speech, clapping, doors opening and closing, bells and horns).
During the two days (48 hours) of testing, only two false glass break alarms occurred (e.g., the system was sensitive to cough sounds). This means that the proposed online glass break detection model detects glass breaking sounds with a detection accuracy of 99.999988%. To resolve the false positive alarms caused by new environmental noise (such as cough sounds), we recorded these false alarm sounds, added them to the training data and retrained the detection model to improve its performance. Table 1 describes the experimental results of the state of the art and a methodological comparison of hand-crafted feature based and sensor based glass break detection systems. According to the experimental results in Table 1, the proposed end-to-end glass break detection system achieves good detection with the fewest false alarm errors compared with conventional electronic glass break detectors and hand-crafted feature based machine learning methods.
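One way to guarantee that sub-segments of the same recording never appear on both sides of a cross-validation split is to assign whole recordings, rather than individual slices, to folds. The sketch below illustrates this idea; it assumes a hypothetical per-slice `recording_ids` array, which is not described in this paper, and it is not the splitting code used in the experiments.

```python
import numpy as np


def group_kfold_indices(recording_ids, n_folds=10, seed=0):
    """Yield (train_idx, val_idx) pairs that keep each recording in one fold.

    recording_ids[i] identifies the source recording of slice i
    (a hypothetical annotation assumed for this example).
    """
    recording_ids = np.asarray(recording_ids)
    unique_ids = np.unique(recording_ids)
    if len(unique_ids) < n_folds:
        raise ValueError("need at least as many recordings as folds")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    # Each fold holds out a disjoint group of recordings.
    for held_out in np.array_split(shuffled, n_folds):
        val_mask = np.isin(recording_ids, held_out)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)


# Toy example: 20 recordings, each cut into 5 slices (100 slices in total).
recording_ids = np.repeat(np.arange(20), 5)
splits = list(group_kfold_indices(recording_ids, n_folds=10))
```

Every slice lands in the validation side of exactly one fold, and no recording contributes slices to both sides of any split, which is the property the cross-validation scheme is meant to provide.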

Conclusion
The major drawback of conventional glass break detectors is false alarms. Sounds such as thunder, shouting, gunshots and objects being hit are similar in frequency and threshold value to glass breaking sounds, and may therefore cause false positives in the alarm system. This research therefore proposed a new architecture for glass break detection based on an LSTM deep recurrent neural network, to improve detection accuracy with fewer false alarms. In this approach, we used raw waveform audio data to detect glass break events in an end-to-end learning approach. The key benefit of end-to-end learning is that it avoids the need for hand-crafted audio features. To address the vanishing and exploding gradient problems of conventional recurrent neural networks, this paper proposed a deep long short-term memory (LSTM) recurrent neural network to handle the sequence of input audio data. In real-time detection, the proposed approach has a clear advantage over conventional glass break detection systems: it yields significantly higher detection accuracy (99.999988%) and suffers less from environmental noise that might cause a false alarm. With sufficient computational power and data now available for embedded applications, deep learning has become practical and increasingly present in powerful and intelligent security surveillance applications.