### 1. Introduction

### 2. Machine Learning Theory

### 2.1 Definitions, Types, and Basic Concepts of Machine Learning

The supervised learning problem can be written as

*y* = *f*(*x*) + *ϵ* (1)

where *N* is the number of training data input/output pairs. *x* is the training data input and is referred to as a feature; it can have a complex structure, such as an image or a time-series signal. In principle, the output *y* can be anything, but when *y* is a categorical variable, the problem in Eq. (1) becomes a classification or pattern recognition problem, and when *y* is a real value, it becomes a regression problem (Bishop, 2006; Murphy, 2012). The most basic unit of data for creating such a system is called a sample. Normally, the collected samples are divided into two sets: a training set that is used to create the system and a test set that is used to evaluate the system's performance after it has been created. The difference between *f*(*x*), which is predicted from the input *x*, and *y*, which is actually observed, is expressed as *ϵ*; in acoustics, this normally corresponds to noise. A minimal sketch of this setup follows.
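The sketch below illustrates Eq. (1) and the training/test split described above. It is a minimal example, assuming scikit-learn and a simple linear *f*; neither the library nor the model is prescribed by the text.

```python
# Supervised learning setup y = f(x) + eps: generate noisy samples,
# split them into training and test sets, fit a model, evaluate it.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(200, 1))          # inputs (features)
eps = rng.normal(0.0, 0.1, size=200)              # noise term eps
y = 3.0 * x[:, 0] + eps                           # y = f(x) + eps, with f(x) = 3x (assumed)

# Divide the collected samples into a training set and a test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(x_train, y_train)  # learn an estimate of f
print("test R^2:", model.score(x_test, y_test))   # evaluate on unseen samples
```

Because *y* here is real-valued, this is a regression problem; replacing the targets with class labels and the model with a classifier would give the classification case.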

In unsupervised learning, only the input *x* is provided, without any sample class information, and the goal is to find new patterns in this input data. As such, the problem is much less clearly defined than supervised learning, and there is no clear error metric. In underwater acoustics, a considerable number of previous sonar application studies have used machine learning for classification purposes, such as target type/state classification (Choi et al., 2019; Fischell and Schmidt, 2015; Ke et al., 2018; Wang et al., 2019) and target and clutter signal classification (Allen et al., 2011; Murphy and Hines, 2014; Young and Hines, 2007). In many of these studies, the properties of the data used for learning were known beforehand owing to the goals of the studies, so supervised learning was used rather than unsupervised learning. In addition, studies on underwater source localization (Das, 2017; Das and Sejnowski, 2017; Gemba et al., 2019; Gerstoft et al., 2016; Nannuru et al., 2019) or seabed classification (Buscombe and Grams, 2018; Diesing et al., 2014) have used various machine learning algorithms that include unsupervised learning.

Features whose values differ among the *N* classes can be considered good features. Observed samples are generally expressed in the form of feature vectors. However, using many features is not necessarily favorable, so it is important to select only the part of the feature set that is highly useful. In addition, the amount of computation may increase exponentially as the dimensions of the feature vectors increase; this is called the curse of dimensionality (Hastie et al., 2009; Murphy, 2012). As a result, feature extraction methods vary according to the field where pattern recognition is used, and researchers often reduce dimensionality by converting the extracted feature values into different values or by employing techniques such as principal component analysis (PCA) (Murphy, 2012). A brief sketch of such feature selection follows.
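As one hedged illustration of keeping only the most useful features, the sketch below uses scikit-learn's SelectKBest with an ANOVA F-score on synthetic data; neither the method nor the data set is named in the text.

```python
# Trim a 50-dimensional feature vector down to its few informative components.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic samples: 50 features, of which only 5 are actually informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# Keep the 5 features whose values best separate the classes (ANOVA F-score).
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)   # (300, 50) -> (300, 5)
```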

A model's misclassification rate is computed over the *N* classes. Sometimes, when there are models with various degrees of complexity, each model's misclassification rate on the training set is calculated and compared with the others in order to select the most appropriate model (Hastie et al., 2009). However, machine learning systems are built to recognize and classify new samples; therefore, a true performance evaluation can only be accomplished by measuring the performance of the system on a sample set other than the training set, i.e., a test set. The performance that the system shows on the test set is called its generalization ability (Hastie et al., 2009), and the standard for selecting machine learning models is to choose models with excellent generalization ability. However, it may not be possible to use a test set depending on the circumstances. In such cases, if the training set is very large, it may be split in two, with one part used for training and the other part used for measuring the model's performance; the latter sample set is called the validation set. The learning and validation process is then repeated for several models, and the best model is selected (Hastie et al., 2009). In reality, there are many cases where there is insufficient data to split the training set in two. In such cases, researchers use resampling techniques that reuse the same samples several times; typical methods include cross-validation and bootstrapping (Kohavi, 1995), as in the sketch below.
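The sketch below shows one concrete instance of the resampling idea: 5-fold cross-validation used to compare models of different complexity. The iris data set and the k-nearest neighbor models are illustrative assumptions, not choices made in the text.

```python
# Compare models of different complexity by k-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold CV: each sample is used for validation exactly once,
    # so no separate validation set is required.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```

The model with the best cross-validated score would then be selected as the one expected to generalize best.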

In a classification problem, a new sample *x* is given, and it must be assigned to the most likely class. Here, the qualitative standard of "most likely" is defined by the quantitative standard of the posterior probability *p*(*y*|*x*). That is, in classification problems, success is achieved by finding the class with the greatest posterior probability and classifying the target as that class. Based on Bayesian statistics, *p*(*y*|*x*) can be estimated by using the Bayes rule shown below (Bishop, 2006).

*p*(*y*|*x*) = *p*(*x*|*y*) *p*(*y*) / *p*(*x*)

The Bayes rule combines the prior *p*(*y*) and the likelihood *p*(*x*|*y*). The probability distribution of each of these is estimated through learning or training, and the estimation methods can be broadly divided into parametric and nonparametric methods (Murphy, 2012). Parametric methods can only be applied to probability distributions that can be expressed with parameters; typical methods include the maximum likelihood method and the maximum a posteriori method. Nonparametric methods are suitable for data that do not actually follow a particular probability distribution; a well-known example is the k-nearest neighbor method.
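The sketch below works through this pipeline for a two-class problem: the prior *p*(*y*) and a Gaussian likelihood *p*(*x*|*y*) are estimated from training data by maximum likelihood, then combined via the Bayes rule. The one-dimensional Gaussian model is an illustrative parametric assumption, not one mandated by the text.

```python
# Bayes-rule classification with parametric (Gaussian) density estimation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 100)      # training samples of class 0
x1 = rng.normal(2.0, 1.0, 50)       # training samples of class 1

# Maximum-likelihood estimates of the prior p(y) and likelihood p(x|y).
prior = np.array([len(x0), len(x1)], dtype=float)
prior /= prior.sum()
mu = [x0.mean(), x1.mean()]
sigma = [x0.std(), x1.std()]

def posterior(x):
    """p(y|x) proportional to p(x|y) p(y), normalized over the two classes."""
    joint = np.array([norm.pdf(x, mu[c], sigma[c]) * prior[c] for c in (0, 1)])
    return joint / joint.sum()

# Classify a new sample as the class with the greatest posterior probability.
print(posterior(1.5).argmax(), posterior(1.5))
```

A nonparametric alternative such as k-nearest neighbors would instead approximate *p*(*y*|*x*) directly from the class votes of the nearest training samples, without assuming a Gaussian form.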

### 2.2 Supervised Learning

#### 2.2.1 Support vector machine (SVM)

#### 2.2.2 Neural network: multi-layer perceptron

A neural network learns the relationship *y* = *f*(*x*). The perceptron is an algorithm that makes neural networks mathematically tractable. The simplest perceptron model is the single-layer perceptron, which consists of an input layer and an output layer, as shown in Fig. 2(a) (Goodfellow et al., 2016).

Each *x*_{i} is an input value, and *w*_{i} is the weight for that value. The circle between the input layer and the output layer is the activation function, which keeps the learning calculations simple by imitating a biological neuron that only passes signals above a fixed level (Goodfellow et al., 2016). A single-layer perceptron can only be applied to problems that can be expressed linearly, and it is difficult to apply to problems that have a nonlinear structure. This is resolved by adding new layers between the input layer and the output layer. A neural network structure that includes such a new layer, i.e., a hidden layer, as shown in Fig. 2(b), is called a multi-layer perceptron, and a neural network with several hidden layers is called a deep neural network (Goodfellow et al., 2016). Deep neural networks are discussed again in the deep learning part of Section 2.4. A minimal sketch of the single-layer case follows.
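The sketch below implements the single-layer perceptron of Fig. 2(a) in plain numpy: a weighted sum of the inputs *x*_{i} passed through a step activation function, trained with the classic perceptron learning rule. The AND-gate data and learning rate are illustrative assumptions.

```python
# Single-layer perceptron: weighted sum of inputs, step activation.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                      # AND gate: linearly separable

w = np.zeros(2)                                 # weights w_i
b = 0.0                                         # bias

step = lambda z: (z >= 0).astype(float)         # step activation function

for _ in range(10):                             # perceptron learning rule
    for xi, yi in zip(X, y):
        err = yi - step(w @ xi + b)             # prediction error for this sample
        w += 0.1 * err * xi                     # nudge weights toward the target
        b += 0.1 * err
print(step(X @ w + b))                          # -> [0. 0. 0. 1.]
```

Replacing the targets with XOR would make the problem non-linearly separable, and this model could no longer fit it; that is exactly the case the hidden layer of the multi-layer perceptron resolves.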

### 2.3 Unsupervised Learning

Unsupervised learning uses only the input *x*, without the sample's class information. In sonar application research, various techniques that employ unsupervised learning have been attempted in studies on underwater source localization and seabed classification.

#### 2.3.1 K-means clustering

#### 2.3.2 Principal component analysis

*A* = [*a*_{1} … *a*_{N}] ∈ *R*^{N×N} is the matrix of principal component vectors, whose columns are the principal components *a*_{1}, …, *a*_{N}. They can be found using singular value decomposition (Murphy, 2012). Generally, the direction of the axis selected by the first principal component shows the largest variance in the data, and the amount of variance becomes progressively smaller for each subsequent component. Therefore, PCA can be used to reduce the dimensions of feature vectors while minimizing information loss, and because of this it is widely used in fields such as data visualization and compression (Murphy, 2012). A brief sketch follows.

#### 2.3.3 Gaussian mixture model

For data *x*, the probability that *x* will occur is expressed as the sum of several Gaussian probability density functions, as shown below.

*p*(*x*) = Σ_{k=1}^{K} *π*_{k} *N*(*x* | *μ*_{k}, *Σ*_{k})

*π*_{k} is the probability of selecting the k-th Gaussian distribution; it is set so that 0 < *π*_{k} ≤ 1 and Σ_{k} *π*_{k} = 1. The parameters *π*_{k}, *μ*_{k}, and *Σ*_{k} are estimated from the given data, and the expectation-maximization (EM) method is generally employed as the estimation method (Dempster et al., 1977), as in the sketch below.
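The sketch below fits the mixture weights *π*_{k}, means *μ*_{k}, and covariances *Σ*_{k} by EM, using scikit-learn's GaussianMixture; the library and the two-component synthetic data are illustrative choices, not ones named in the text.

```python
# Fit a two-component Gaussian mixture model by expectation-maximization.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300),   # component 1
                    rng.normal(3.0, 1.0, 700)])   # component 2

gmm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))
print("pi_k:   ", gmm.weights_)          # mixing probabilities, sum to 1
print("mu_k:   ", gmm.means_.ravel())    # should recover roughly -2 and 3
print("Sigma_k:", gmm.covariances_.ravel())
```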

#### 2.3.4 Autoencoder

#### 2.3.5 Sparse dictionary learning

Sparse dictionary learning models the input as *x* = *Dα*, where *x* is the input data and *D* is defined as the dictionary matrix. The column vectors of *D* are defined as atoms; that is, *x* can be expressed as a linear combination of the atoms that make up the dictionary. Among the coefficient vectors *α* that satisfy this relation, the solution with the most coefficients equal to 0 is sought. This is expressed mathematically as shown below.

*α*^{*} = argmin_{*α*} ‖*α*‖_{0} subject to *x* = *Dα*

*α*^{*} is the sparse representation coefficient, and ‖∙‖_{0} represents the *l*_{0}-norm, which counts the number of elements in the vector that are not 0. However, the *l*_{0}-norm is a non-convex function, which makes an exact solution difficult to find. As an alternative, the condition is sometimes relaxed to an *l*_{1}-norm, and a fast approximate solution method such as orthogonal matching pursuit (Elad, 2010) or sparse Bayesian learning (Wipf and Rao, 2004) is used, as in the sketch below.
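The sketch below recovers a sparse coefficient vector *α* from *x* = *Dα* with orthogonal matching pursuit, the fast approximate method cited above, via scikit-learn. The random Gaussian dictionary and the known sparsity level are illustrative assumptions.

```python
# Sparse coding with orthogonal matching pursuit (OMP).
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 200))                   # dictionary: 200 atoms in R^50
alpha_true = np.zeros(200)
alpha_true[[10, 40, 170]] = [1.5, -2.0, 0.7]     # only 3 nonzero coefficients
x = D @ alpha_true                               # x: sparse combination of atoms

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
omp.fit(D, x)                                    # columns of D act as features
alpha_hat = omp.coef_
print(np.flatnonzero(alpha_hat))                 # expected: indices 10, 40, 170
```

In dictionary learning proper, *D* itself would also be learned from a collection of training signals rather than drawn at random as it is here.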