Optimization of Input Variables for Salinity Modeling in the Nakdong River Estuary Using Exploratory Data Analysis

Article information

J. Ocean Eng. Technol. 2025;39(4):379-393
Publication date (electronic) : 2025 July 18
doi : https://doi.org/10.26748/KSOE.2025.013
1Professor, Department of Civil and Environmental Engineering, Konkuk University, Seoul, Korea
2Researcher, Industry-University Cooperation Foundation, Inje University, Gimhae, Gyeongsangnam-do, Korea
3Professor, Department of Business Administration, Tech University of Korea, Gyeonggi-do, Korea
4CEO, Haekwang Eng., Haenam-Gun, Jeollanam-do, Korea
5Professor, Department of Fire and Disaster Prevention Engineering, Inje University, Gimhae, Gyeongsangnam-do, Korea
Corresponding author Jong-Sung Yoon: +82-55-320-3434, civyunjs@inje.ac.kr
Received 2025 March 5; Revised 2025 May 1; Accepted 2025 June 16.

Abstract

The Nakdong River Estuary is a vital ecological and water resource facing management challenges due to seawater intrusion and seasonal hydrological changes. The continuous opening of the estuarine bank since 2022 has increased the need for accurate salinity prediction to support ecosystem restoration and water-resource management. This study aims to improve the reliability and efficiency of salinity prediction models by optimizing the input variables using exploratory data analysis. Multidimensional observational data collected from January 2019 to June 2022 were analyzed to identify the statistical characteristics and seasonal variability of key factors. Principal component analysis was used to select critical variables, and a data-driven variable selection process was implemented. Experiments using a long short-term memory deep learning model showed that the optimized variable set improved the prediction performance by approximately 4.5% compared to models using all available variables. These results demonstrate that a data-centric approach enables the development of robust and interpretable prediction models that can be utilized by non-specialists in environmental hydraulics. The findings are expected to expand the practical application of artificial intelligence technologies in environmental management and contribute to the development of data-driven decision-support systems in estuarine environments.

1. Introduction

An estuarine zone, where freshwater and seawater mix, plays a vital role as a habitat for various species and as a repository of ecological diversity. In addition, estuarine zones provide essential resources such as drinking water, agricultural water, and fisheries to nearby residents, serving important economic and social functions. However, estuarine zones face various challenges due to both natural and anthropogenic factors. Saltwater intrusion caused by tidal backflow, water-quality degradation, flood damage, and ecosystem change are common global issues in estuarine zone management. In particular, these issues are more severe in estuaries located near densely populated areas.

The Nakdong River Estuary is a representative example of an estuary facing these issues. This estuarine zone is ecologically important because of its role as a habitat for diverse and internationally protected bird species. It also plays a crucial role as the primary source of domestic and agricultural water for Busan and surrounding areas. However, in the Nakdong River Estuary, tidal backflow caused by ocean currents frequently leads to flood damage and salinity intrusion into agricultural land. To address these issues, the Nakdong River Estuary bank was constructed in 1987 (Fig. 1). The bank blocks the backflow of seawater, enabling stable water supply and effective water-quality management. However, the construction of the estuary bank has led to new ecological concerns, such as a decline in the diversity of fish species and a reduction in the number of habitats for migratory birds. These issues have become major factors affecting the sustainability of the Nakdong River Estuary, highlighting the need for effective management strategies.

Fig. 1

Nakdong River estuary: (a) Installation location, (b) Marine environment observation sites, (c) Overview of the estuary (Jo, 2022)

As a solution for ecosystem restoration, the local community and environmental organizations in Busan confirmed the potential for ecological recovery through pilot gate openings of the Nakdong River Estuary bank in 2017 (Noh et al., 2020). Consequently, regular gate openings have been implemented during spring tides since February 2022. However, these gate operations have introduced new challenges in the management of freshwater ecosystems, drinking water, and agricultural water because of changes in salinity levels. Accordingly, salinity prediction in the Nakdong River Estuary is emerging as a critical element for ecosystem protection and water-resource management.

Several studies have attempted to predict the salinity levels in the Nakdong River Estuary by applying a range of methodologies, including numerical and statistical models. Traditionally, numerical models have been employed to simulate the physical processes of riverine and marine environments accurately. For instance, Han et al. (2011) used ECOMSED, a three-dimensional hydrodynamic and sediment transport model based on shallow water equations, to estimate the salinity intrusion upstream of the estuary bank caused by tidal phenomena around the Nakdong River. However, numerical models have practical limitations, as they require substantial computational resources and complex initialization of input parameters.

An increasing number of recent studies have focused on utilizing machine learning (ML) and artificial intelligence (AI) for salinity prediction. Several AI-based studies have overcome the limitations of traditional physics-based models and offered new possibilities. Lee et al. (2022) applied different ML techniques to predict the salinity levels in the Nakdong River, demonstrating the potential of ML for salinity prediction in the Nakdong River Estuary. In addition, they showed that the LightGBM model based on decision trees outperformed other algorithms in terms of prediction accuracy. Moreover, AI-based models enable rapid and effective predictions using only a simple set of input variables. Woo et al. (2022) utilized the long short-term memory (LSTM) deep learning algorithm to predict salinity up to approximately 5 km upstream of the Nakdong River Estuary bank, indicating that statistical models can outperform numerical models even under more complex conditions. These results are attributed to the superior ability of statistical models to efficiently learn nonlinear correlations and enhance the prediction accuracy.

Salinity prediction using AI models offers two advantages. The first is its short computation time, which enables near-real-time prediction. This provides a significant advantage in situations that require urgent decision-making, because predictions can be utilized promptly. The second advantage of AI models is their high user accessibility, which allows even non-experts in hydrology and hydraulics to easily utilize them. For example, a manager of the Nakdong River Estuary can obtain prediction results by simply inputting basic data without requiring a complex theoretical background or expertise in numerical computation, which significantly contributes to improving the efficiency of on-site management. Although AI models are powerful tools, they have certain limitations. In particular, the selection of the input variables is crucial, because it significantly influences the performance of the model. The model performance is highly dependent on the quality and relevance of the input data, and if non-experts make inappropriate judgments during the variable selection process, prediction results may be distorted. For example, even if variables unrelated to salinity are included, an AI model may still produce highly accurate results by learning patterns in the input data. However, despite its high accuracy, the model may lack interpretability regarding the significance of the variables, or may lead to overfitting. This may undermine the transparency and interpretability of the model, potentially leading to misguided decision-making. Therefore, the careful selection of input variables is essential when using AI models.

This study aims to optimize the input variables of available data sources to build a salinity prediction model for the Nakdong River Estuary. To this end, an exploratory data analysis (EDA) was conducted on observational data around the Nakdong River to identify the key factors among the different variables that significantly influence salinity prediction. In addition, the impact of input-variable optimization on the prediction accuracy and reliability of AI models was systematically evaluated with the aim of contributing to the development of efficient and reliable prediction models.

The novelty of this study is clearly demonstrated in its approach compared with previous research. Previous studies primarily relied on domain experts to select key input variables based on their specialized knowledge, followed by the optimization of the model's hyperparameters to enhance the performance. Such an approach has limitations in that it requires domain-specific knowledge during the variable selection process, making it difficult for non-experts to utilize it effectively. By contrast, the approach proposed in this study enables non-experts to select the key input variables for the model by utilizing the statistical characteristics of the data. To achieve this, EDA was conducted to systematically analyze the statistical characteristics of the variables and their relationships with salinity prediction, and the input variables were optimized based on these findings. This approach is significant because it reduces reliance on domain expertise during the variable selection process and offers a data-driven, objective, and reproducible methodology. As a result, this study opens up the possibility of broader application of AI models through a data-driven variable selection method accessible to non-experts, which is expected to significantly enhance the practical applicability in the field of environmental hydraulics.

2. Research Data

This section outlines the data collection process and examines the key characteristics of salinity in the Nakdong River Estuary based on the collected data. The seasonal patterns in observational data according to different gate-operation methods are also analyzed.

2.1 Data Collection

To develop a salinity prediction model for the Nakdong River Estuary, this study defined an approximately 80-km stretch from the Changnyeong-Haman Weir in the upper Nakdong River to the Nakdong River Estuary bank as the target area. Publicly available observational datasets were collected from online sources. Salinity data for the Nakdong River Estuary were collected from bottom-layer measurements at the Nakdong Bridge station provided by the Busan Institute of Environmental Research. Hydraulic, hydrological, and water quality data were obtained from MyWater (Water Information Portal) and Automatic Water Quality Monitoring Network, both operated by the Korea Water Resources Corporation. During the data collection phase, all available data were included without any prior assumptions regarding the relevance of specific variables to salinity, adopting an inclusive approach. Fourteen observational variables were obtained from the five monitoring stations, and each variable was selected to comprehensively reflect the hydrological and environmental conditions of the Nakdong River Estuary. The detailed observational variables for each monitoring station are summarized in Table 1.

Table 1

Status of observational data collection in the Nakdong River Estuary and surrounding areas

The collected observational data comprised site-specific time series from January 1, 2019, to June 30, 2022. The temporal resolution of the data was hourly from 2019 to 2021, with 10-min intervals for the 2022 data. Fig. 2 presents a visualization of the data for the main observational variables. The visualized variables included the water level, rainfall, and total discharge at the Changnyeong-Haman Weir; the water temperature, electrical conductivity, and dissolved oxygen at Sangdong; the water level at the Gupo Bridge; the salinity, water temperature, and electrical conductivity at the West Busan Nakdong River Bridge; and the internal water level, external water level, rainfall, and total discharge at the estuary bank. These data were used to understand the seasonal and annual variability of the Nakdong River Estuary and to analyze the impact of gate operation methods on the performance of salinity prediction models.

Fig. 2

Observation data near the Nakdong River Estuary: (a) Changnyeong-Hamanbo total discharge, (b) Changnyeong-Hamanbo water level, (c) Changnyeong-Hamanbo rainfall, (d) Sang-dong water temperature, (e) Sang-dong electrical conductivity, (f) Sang-dong dissolved oxygen, (g) Gupo Bridge water level, (h) Nakdong River Bridge salinity, (i) Nakdong River Bridge water temperature, (j) Nakdong River Bridge electrical conductivity, (k) Banks of the Nakdong River outer water level, (l) Banks of the Nakdong River inner water level, (m) Banks of the Nakdong River rainfall, (n) Banks of the Nakdong River total discharge.

2.2 Characteristics of Salinity in the Nakdong River Estuary

Analysis of the collected data revealed that the salinity levels in the Nakdong River Estuary exhibited distinct seasonal and spatial characteristics. Fig. 3 shows the annual and monthly variabilities in the salinity levels during the observation period. From 2019 to 2021, when the Nakdong River gates were opened irregularly, salinity levels generally remained low, with temporary increases observed only during specific periods when the gate was open. By contrast, from 2022 onward, with the gates continuously opened, salinity levels generally remained high, but the variability differed across periods. This is interpreted as a result of the more frequent seawater inflow owing to the continuous opening of the gates. These patterns of salinity variation suggest that differences in gate operation methods directly influence the salinity patterns in the Nakdong River Estuary. In particular, the distinct changes in data patterns since 2022 further emphasize the impact of changes in gate operation methods on the salinity.

Fig. 3

Salinity variations in the Nakdong River Estuary during temporary and continuous gate opening periods: (a) Entire period (2019.01.01–2021.12.31 for temporary opening, 2022.02.18–2022.06.30 for continuous opening), (b) Monthly distribution

2.3 Seasonal Pattern Analysis of Observational Data According to Nakdong River Estuary Bank Operation Methods

To analyze the impact of gate operation methods on the salinity, the study period was divided into a temporary opening phase from January 2019 to December 2021 and a continuous opening phase from February 2022 to June 2022, and the data were compared accordingly. Figs. 4 and 5 illustrate seasonal variations in salinity during each period. In these figures, the horizontal axis represents each physical parameter, and the vertical axis depicts the seasonal density distribution.

Fig. 4

Seasonal characteristics of the marine environment around the Nakdong River Estuary during the temporary gate opening period (2019.01–2021.12)

Fig. 5

Seasonal characteristics of the marine environment around the Nakdong River Estuary during the continuous gate opening period (2022.02–2022.06)

In Fig. 4, the seasonal density distribution of rainfall shows that rainfall remains nearly absent or very low during winter and spring, while it increases and reaches high values in summer. Due to the influence of rainfall, the water levels within the estuary bank and discharge from the Changnyeong-Haman Weir exhibit low and stable distributions during winter and spring, but tend to show increased variability in summer. This characteristic indicates that the marine environment of the Nakdong River Estuary basin is significantly influenced by seasonal variations.

In Fig. 5, the variables related to rainfall also exhibit low variability in winter and significant changes in summer. However, following the continuous opening of the estuary bank, the variability of the internal water level increases, which can be interpreted as a reduction in the difference between internal and external water levels. This suggests a direct impact of continuous gate opening on the salinity and water-level fluctuations in the estuarine area.

3. Research Methodology

This section quantitatively investigates the influence of various independent variables on the salinity based on the observational data collected around the Nakdong River through various statistical analyses and visualizations. It also presents selection and optimization methodologies for the input variables that significantly impact the accuracy of the prediction model and spatiotemporal variation patterns.

3.1 Exploratory Data Analysis

EDA was conducted based on observational data to select the key variables necessary for building the salinity prediction model. Through this process, the statistical characteristics of each variable were identified and their correlations with the salinity were analyzed. The EDA procedure was structured into three main components.

3.1.1 Data preprocessing

Outliers were identified for each variable to ensure the reliability of the observational data. In this study, considering that non-experts may perform the tasks, cases with abnormal distributions were removed, and extreme values were visually inspected and excluded. To ensure data completeness, missing values were handled using linear interpolation and standard scaling was applied to eliminate the effects of scale differences during the analysis process.
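The gap-filling and scaling steps described above can be illustrated with a minimal sketch. It assumes a single hourly series stored as a Python list with `None` marking missing observations; the function names, the sample values, and the choice to fill leading or trailing gaps with the nearest observed value are illustrative assumptions and are not taken from the study.

```python
from statistics import mean, pstdev


def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest observed neighbours.

    Gaps at the start or end of the series are filled with the nearest
    observed value (an assumption made for this sketch).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def standard_scale(values):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical salinity readings with missing observations.
salinity = [1.0, None, 3.0, None, None, 6.0]
filled = interpolate_linear(salinity)
scaled = standard_scale(filled)
print(filled)
print(scaled)
```

Scaling each variable separately in this way keeps variables with large numeric ranges, such as discharge, from dominating the subsequent correlation analysis and PCA.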

3.1.2 Correlation analysis – identification of variables influencing salinity

In this study, a correlation analysis was performed among the observed variables to identify the key factors that significantly influenced salinity prediction. Correlation analysis is a statistical method that quantitatively assesses the relationships between variables by measuring how closely changes in one variable are associated with those in another. Correlation coefficients are typically used to represent the linear relationship between two variables. The coefficient ranges between −1 and 1, where values closer to +1 indicate a strong positive correlation, meaning that as one variable increases, the other tends to increase as well. Conversely, a correlation coefficient close to −1 signifies a strong negative linear relationship, meaning that as one variable increases, the other decreases. A correlation coefficient close to zero indicates little to no linear relationship between the two variables. Correlation analysis was conducted for the following three cases. First, the long-term correlations were analyzed based on data from the entire observation period. Subsequently, seasonal data during the temporary opening period of the estuary bank gates were analyzed to assess the seasonal variability. Finally, data from the continuous opening period of the Nakdong River Estuary bank gates were analyzed to examine the impact of gate operation methods on the salinity. The results of the correlation analysis were used to identify the key variables that significantly influenced the salinity and served as the basis for optimizing the input variables.
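As a hedged illustration of this per-period analysis, the sketch below computes the Pearson correlation coefficient between discharge and salinity separately for each gate-operation period. The record layout, the period labels, and all numerical values are hypothetical and chosen only to show the mechanics; they are not observations from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def correlation_by_period(records):
    """Group (period, discharge, salinity) records by period and correlate each group."""
    groups = {}
    for period, discharge, salinity in records:
        xs, ys = groups.setdefault(period, ([], []))
        xs.append(discharge)
        ys.append(salinity)
    return {period: pearson(xs, ys) for period, (xs, ys) in groups.items()}


# Hypothetical records: (gate-operation period, discharge, salinity).
records = [
    ("temporary", 100.0, 0.3), ("temporary", 200.0, 0.1),
    ("temporary", 300.0, 0.3), ("temporary", 400.0, 0.1),
    ("continuous", 100.0, 20.0), ("continuous", 200.0, 15.0),
    ("continuous", 300.0, 10.0), ("continuous", 400.0, 5.0),
]
result = correlation_by_period(records)
for period, r in result.items():
    print(f"{period}: r = {r:.2f}")
```

The same grouping key could be extended to a (period, season) pair to reproduce the seasonal breakdown described in the text.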

3.1.3 Principal component analysis – identification of key variables and data pattern analysis

Principal component analysis (PCA) was applied to reduce the dimensionality of the data and extract key patterns based on correlations among observational variables. PCA is a powerful analytical tool that transforms high-dimensional data into lower-dimensional data while preserving essential information. PCA restructures data based on correlations among variables, with new axes called principal components designed to explain the maximum variance in the data. This process facilitates data visualization and is effective in identifying key variables and patterns. In this study, PCA was applied separately to the entire data period, temporary opening period, and continuous opening period of the Nakdong River Estuary bank gates. Thus, the key variables influencing salinity were identified more specifically, and the relative importance of each variable was analyzed.
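A minimal sketch of this procedure is given below, assuming NumPy is available and that the observations are arranged as a samples-by-variables array. PCA is performed here by eigendecomposition of the correlation matrix of the standardized data; the synthetic variables (two nearly identical rainfall-like series and one independent temperature-like series) are invented for illustration and are not the study's data.

```python
import numpy as np


def pca(data):
    """PCA via eigendecomposition of the correlation matrix.

    data: array of shape (n_samples, n_variables).
    Returns (explained_variance_ratio, loadings, scores), with components
    sorted from largest to smallest explained variance.
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None) # guard against tiny negatives
    eigvecs = eigvecs[:, order]
    ratio = eigvals / eigvals.sum()
    # Loadings: correlation of each original variable with each component.
    loadings = eigvecs * np.sqrt(eigvals)
    scores = z @ eigvecs
    return ratio, loadings, scores


# Synthetic example: two strongly related series and one independent series.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
rain_a = base + 0.1 * rng.normal(size=200)
rain_b = base + 0.1 * rng.normal(size=200)
temp = rng.normal(size=200)
data = np.column_stack([rain_a, rain_b, temp])

ratio, loadings, scores = pca(data)
print("explained variance ratio:", np.round(ratio, 3))
print("loadings on PC1 and PC2:\n", np.round(loadings[:, :2], 3))
```

In a loading plot, the first two columns of `loadings` give the vector drawn for each variable, so variables with nearly parallel vectors (here the two rainfall-like series) are the ones that carry redundant information.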

3.2 Input Variable Optimization

Based on the key variables identified through the EDA during the continuous opening period of the Nakdong River Estuary bank gates, three groups of input variables were constructed (Table 2). To compare the impact of input-variable optimization on the performance of the salinity prediction models, an LSTM-based salinity prediction model (Woo et al., 2022) was applied.

Experimental conditions and parameters for LSTM

The model performance was quantitatively evaluated using the root mean square error (RMSE). The RMSE is calculated by taking the square root of the average of the squared differences between the predicted and actual values, and it is widely used as a metric for evaluating model prediction performance.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{n}}$$

Here, $\hat{y}_i$ represents the predicted value for the i-th observation, $y_i$ denotes the actual value of that observation, and $n$ represents the total number of observations.
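The RMSE definition translates directly into code. The sketch below is illustrative only: the predicted and observed values are hypothetical, and the `relative_improvement` helper is an assumed convenience for comparing two input-variable sets, not a function described in the study.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return sqrt(squared / len(observed))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Percentage reduction in RMSE of a candidate model relative to a baseline."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse * 100.0


# Hypothetical salinity values, for illustration only.
predicted = [10.0, 12.0, 9.0, 15.0]
observed = [11.0, 12.0, 7.0, 15.0]
error = rmse(predicted, observed)
print(f"RMSE = {error:.4f}")

# Hypothetical RMSE values for a baseline and an optimized variable set.
improvement = relative_improvement(2.0, 1.5)
print(f"improvement = {improvement:.1f}%")
```

Because the RMSE is expressed in the same units as salinity, lower values indicate predictions that are closer to the observations on average.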

First, the performance of the initial salinity prediction model was evaluated using all the available observational data as input variables. Subsequently, two input variable groups derived from the EDA were applied to the same prediction model to compare their performances.

Through this comparative analysis, the optimal combination of input variables was identified and the impact of input-variable optimization on the model's prediction accuracy and reliability was systematically evaluated. The variable-optimization approach in this study differs from traditional expert-driven variable selection methods by introducing a data-driven statistical approach, thereby presenting the possibility for non-experts to utilize it efficiently.

4. Results and Discussion

4.1 Exploratory Data Analysis of Observational Data around the Nakdong River

4.1.1 Correlation analysis

(1) Correlation Analysis Among Variables Over the Entire Period

Fig. 6 shows the correlations among the observational variables over the entire period from January 2019 to June 2022. Salinity (NRB salinity) exhibited weak correlations with most variables and a weak negative correlation with discharge from Changnyeong-Haman Weir (CHR discharge). In Figs. 6–8, the larger circle sizes (●) indicate stronger correlations between the variables.

Fig. 6

Correlation analysis of observational variables during the entire data period (2019.01–2022.06)

Fig. 7

Seasonal correlation analysis of observational variables during the temporary gate opening period in the Nakdong River Estuary (2019.01–2021.12): (a) Winter, (b) Spring, (c) Summer, (d) Autumn

Fig. 8

Seasonal correlation analysis of observational variables during the continuous gate opening period in the Nakdong River Estuary (2022.02–2022.06): (a) Winter, (b) Spring, (c) Summer

A relatively high positive correlation of 0.43 was observed between the CHR discharge and rainfall (CHR rainfall), indicating that the discharge tends to increase with increasing rainfall. Additionally, the water level at the Changnyeong-Haman Weir (CHR WL) exhibited a negative correlation of −0.22 with the discharge. This is interpreted as reflecting the physical characteristic that the water level decreases as the discharge increases.

Notably, a very strong positive correlation of 0.99 was observed between the internal water level of the estuary bank (bank inner WL) and rainfall (bank rainfall), indicating that localized rainfall has a direct impact on water level fluctuations within the estuary bank. Overall, correlations among variables such as rainfall, discharge, and water level fluctuations reflect the dynamic interactions between the river and estuary. However, the analysis over the entire period without considering the operation methods of the estuary bank gates indicated that their influence on salinity changes was relatively minor. Therefore, to accurately characterize salinity variations in the Nakdong River Estuary, it is necessary to develop a model that distinguishes and interprets data according to the operational status of the estuary bank gates.

(2) Seasonal Correlation Analysis of Variables during the Temporary Opening Period of the Nakdong River Estuary Bank

Fig. 7 illustrates the seasonal correlations among the observational variables during the temporary opening period of the Nakdong River Estuary bank from January 2019 to December 2021. The salinity in the Nakdong River Estuary bank (NRB salinity) exhibits weak correlations with most variables, which is interpreted as a result of the environmental conditions when the gates remain closed. When the gates are closed, direct hydrological exchange with the sea is restricted; therefore, the effects of changes in rainfall, discharge, and water level on the salinity are relatively diminished.

In particular, the low or seasonally-varying correlation between the CHR discharge and salinity reflects these environmental factors. Generally, an increase in discharge is expected to enhance freshwater inflow into estuaries, leading to a decrease in salinity. However, when the gates are closed, this effect may not manifest directly or may be significantly delayed.

The water temperature (SD water temp) and electrical conductivity (SD cond) at Sangdong exhibit strong positive correlations across all seasons, suggesting that the physical and chemical characteristics of the river and estuary are more influenced by freshwater conditions than by seawater. Additionally, a pronounced negative correlation between electrical conductivity and dissolved oxygen (SD DO) is observed, particularly in summer and autumn, which is interpreted as reflecting a physicochemical change where rising temperatures lead to increased water temperatures and decreased dissolved oxygen concentrations.

The positive correlation between the internal water level (bank inner WL) and rainfall (bank rainfall), especially during summer, indicates that rainfall directly impacts the internal water level of the estuary bank. However, the weak correlation between the salinity and these variables is attributed to the limited mixing of freshwater and seawater compared to when the gates are open and seawater inflow occurs.

In conclusion, when the gates are closed, the influences of rainfall, discharge, and water level fluctuations on the salinity are limited, and the physical and chemical characteristics of the estuary tend to follow freshwater-dominated variation patterns.

(3) Seasonal Correlation Analysis of Variables during the Continuous Opening Period

Fig. 8 illustrates the seasonal correlations among the observational variables during the continuous opening period of the estuary bank from February 2022 to June 2022. During the continuous opening period, unlike the temporary opening period, a greater impact on the salinity is expected owing to changes in water temperature, rainfall, and water level as a result of the active mixing between freshwater and seawater.

Notably, during the summer season, the relationship between the salinity at the Nakdong River Bridge and the internal water level of the estuary bank, which showed a weak correlation during the temporary opening period, shifted to a negative correlation following continuous opening. This is interpreted as reflecting the removal of the controlled regulation of internal water level changes owing to the shift from temporary to continuous opening, where fluctuations in the internal water level either suppress or promote saltwater intrusion and salt-wedge retreat.

Moreover, the relationship between the water temperature and salinity at the Nakdong River Bridge, both of which exhibit strong seasonality, differs between the continuous and temporary opening periods. Unlike the temporary opening period, which shows a weak negative correlation regardless of the season, the continuous opening period exhibits a positive correlation between the water temperature and salinity in winter and summer, indicating that the salinity increases with increasing water temperature. However, in spring, a negative correlation is observed, which is attributed to the influence of thermal (or density) inversion phenomena occurring during this season, causing the relationship between the salinity and water temperature to exhibit a distinctly different pattern compared with other seasons. These changes reflect shifts in the data patterns corresponding to altered environmental conditions following continuous gate opening.

In summary, the correlation analysis results indicate that during the continuous opening period of the estuary bank, the correlations among variables tend to vary more distinctly with seasonal changes than during the temporary opening period. This is attributed to the continuous opening of the estuary bank, which clarifies the interactions among the physical and chemical variables of the Nakdong River, resulting in relatively enhanced seasonality in the relationships among these variables.

During the estuary bank opening period, the continuity of data leads to relatively stronger correlations among the variables, offering the advantage of maximizing the explanatory power of the independent variables for the dependent variable (salinity). However, an increase in the correlations among multiple variables introduces the risk of multicollinearity, which may undermine the interpretability of the model. For example, as summer approaches, both the water temperature and precipitation tend to increase (Kim, 2020), leading to a strong correlation between the two variables. This increase in correlation raises the likelihood of multicollinearity during modeling, which can result in one of the regression coefficients becoming excessively large or diminished. This poses the risk of overestimating or underestimating the effect of specific independent variables on the dependent variable (salinity), thereby compromising the reliability of the model interpretation. Variable selection techniques are necessary to address such issues in model interpretation, and the application of statistical methods such as PCA should be considered. Therefore, in the modeling process using data from the estuary bank opening period, it is essential to adopt an approach that systematically accounts for seasonality and time-varying patterns in the data and effectively manages the interactions among variables that evolve over time.
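One common way to quantify the multicollinearity risk described above is the variance inflation factor (VIF). The VIF is not used in the study and is introduced here only as an illustrative diagnostic. For the special case of two predictors, the VIF reduces to 1 / (1 − r²), where r is their Pearson correlation; the sketch below uses that identity with hypothetical water temperature and precipitation values.

```python
from math import sqrt, inf


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r_squared = pearson(x1, x2) ** 2
    if r_squared >= 1.0:
        return inf  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


# Hypothetical values that rise together toward summer.
water_temp = [5.0, 10.0, 15.0, 20.0, 25.0]
precipitation = [10.0, 20.0, 35.0, 38.0, 52.0]
vif_correlated = vif_two_predictors(water_temp, precipitation)

# Hypothetical uncorrelated pair for comparison.
vif_independent = vif_two_predictors([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0])

print(f"VIF (correlated pair)  = {vif_correlated:.1f}")
print(f"VIF (independent pair) = {vif_independent:.1f}")
```

A VIF near 1 indicates little shared information, whereas values above roughly 10 are often treated as a rule-of-thumb warning that coefficient estimates may be unstable. With more than two predictors, r² would be replaced by the R² from regressing each predictor on all the others.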

4.1.2 Principal component analysis

Fig. 9 presents the PCA of seasonal observational variables during the temporary opening period of the Nakdong River Estuary bank, visualizing the relationships among variables in the principal component space (PC1, PC2). In the figure, the variables of the analyzed data are represented as vectors, where vectors pointing in the same or similar directions can be interpreted as variables with high mutual correlation. For example, bank rainfall and CHR rainfall are positioned in similar directions and can be grouped together. Variables measured in nearby areas, such as the water level and water temperature, exhibit similar patterns and can be classified into the same group. The vectors for NRB temp and SD temp are nearly orthogonal to that of NRB salinity, indicating that these variables are largely uncorrelated with salinity. This indicates that the saltwater inflow caused by the temporary opening of the Nakdong River Estuary bank has a more dominant influence on salinity changes than physical causality does.

Fig. 9

Principal component analysis of seasonal observational variables during the temporary gate opening period in the Nakdong River Estuary (2019.01–2021.12)
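The reading of the loading vectors described above (similar directions imply high correlation, orthogonal directions imply weak correlation) can be sketched as follows. This is a minimal illustration with NumPy on synthetic data; the series, their names, and the function names `pca_loadings` and `cosine_matrix` are assumptions made for the example and do not reproduce the analysis behind Fig. 9.

```python
import numpy as np

def pca_loadings(X, n_components=2):
    """Loadings (variables x components) of a PCA on standardized data."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal axes of the standardized data.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = s**2 / (len(Z) - 1)
    # Scale each axis by sqrt(eigenvalue) so vector length reflects explained variance.
    return (Vt[:n_components] * np.sqrt(eigvals[:n_components, None])).T

def cosine_matrix(loadings):
    """Cosine of the angle between every pair of loading vectors."""
    unit = loadings / np.linalg.norm(loadings, axis=1, keepdims=True)
    return unit @ unit.T

# Synthetic stand-ins (assumption): two rainfall gauges driven by the same storms,
# and a water temperature series generated independently of them.
rng = np.random.default_rng(0)
n = 1000
storm = rng.gamma(2.0, 1.0, size=n)
bank_rain = storm + rng.normal(0.0, 0.3, n)
chr_rain = storm + rng.normal(0.0, 0.3, n)
nrb_temp = rng.normal(15.0, 5.0, n)

L = pca_loadings(np.column_stack([bank_rain, chr_rain, nrb_temp]))
C = cosine_matrix(L)
print(C.round(2))  # rainfall pair close to 1, rainfall vs temperature close to 0
```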

Fig. 10 presents the results of the PCA of the observational variables during the continuous opening period of the Nakdong River Estuary bank from February 2022 to June 2022, shown by season. Clusters can be observed along the PC1 axis in chronological order from left to right, corresponding to winter, spring, and summer. Among them, the CH WL, NRB temp, and NRB salinity play significant roles in the variance along the PC1 axis, as indicated by the magnitude of their vectors. This indicates that these variables exhibit strong seasonality and are significantly influenced by time.

Fig. 10

Principal component analysis of seasonal observational variables during the continuous gate opening period in the Nakdong River Estuary (2022.02–2022.06)

Furthermore, the seasonal clustering observed in the data during the continuous opening period shows a more distinct separation by season than the PCA results of the temporary opening period (Fig. 9). This indicates that the influence of seasonal factors on data variance is enhanced following the transition to continuous gate opening. Additionally, the representation of variables such as the NRB salinity, bank rainfall, CH rainfall, and CH WL as long vectors indicates that these variables are major contributors to model uncertainty following the continuous opening period. This reflects the fact that, following continuous gate opening, external factors such as rainfall and discharge fluctuations are no longer artificially controlled, so that natural phenomena involving interactions with chemical characteristics, such as the salinity concentration and water temperature, can be observed in the data.

4.2 Input Variable Optimization of Observational Data around the Nakdong River for Building a Salinity Prediction Model

Based on the correlation and PCA analyses of the data observed during the continuous opening period of the Nakdong River Estuary bank, variables with similar directions (loadings) in the PCA were found to exhibit high correlations with each other. Accordingly, in this study, variables with similar directions in the PCA loading plot were grouped together as one cluster and representative variables within each group were selected to serve as model input variables. For example, bank rainfall and CH rainfall, as well as discharge at the estuary bank and Changnyeong-Haman Weir, exhibited similar directions in the PCA and were classified into the same group. Only representative variables were selected as the final input variables. This variable selection approach was employed to reduce redundancy among variables and enhance the interpretability and efficiency of the model. The PCA results showed that variables with similar physical and chemical properties were grouped into the same principal components, supporting the validity of the approach for reducing redundancy among variables and selecting representative variables. Based on this analysis, three experiments were conducted to optimize the input variables; detailed descriptions of the experiments are presented in Table 3. The first experiment (Scenario 1) used eight input variables, excluding the NRB cond, which has a direct relationship with NRB salinity. The second experiment (Scenario 2) involved grouping variables and selecting representative variables from each group, resulting in six variables used for modeling. The third experiment (Scenario 3) expanded the scope of the variable groups and selected representative variables within each group using four variables to build a model aimed at minimizing the redundancy among the variables. This optimization process is expected to enhance the efficiency of the data analysis and improve the reliability of the model.
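The grouping step described above can be sketched in a few lines. The text does not state how similarity of direction was measured or how the representative of each group was chosen, so the cosine threshold (`min_cos=0.9`), the greedy grouping, and the "longest loading vector" rule below are all assumptions made for illustration, and the loading values are invented rather than taken from the PCA results.

```python
import numpy as np

def group_and_select(names, loadings, min_cos=0.9):
    """Group variables whose loading vectors point in similar directions and
    pick the variable with the longest vector in each group (assumed rule)."""
    L = np.asarray(loadings, dtype=float)
    lengths = np.linalg.norm(L, axis=1)
    unit = L / lengths[:, None]
    groups = []  # each group is a list of indices; its first index is the seed
    for i in range(len(names)):
        for g in groups:
            if unit[i] @ unit[g[0]] >= min_cos:
                g.append(i)
                break
        else:
            groups.append([i])
    grouped_names = [[names[i] for i in g] for g in groups]
    representatives = [names[max(g, key=lambda i: lengths[i])] for g in groups]
    return grouped_names, representatives

# Hypothetical (PC1, PC2) loadings, for illustration only.
names = ["bank rainfall", "CH rainfall", "bank discharge", "CH discharge", "NRB temp"]
loadings = [[0.80, 0.30], [0.70, 0.35], [0.20, 0.85], [0.25, 0.75], [-0.60, 0.10]]

groups, reps = group_and_select(names, loadings)
print(groups)
print(reps)
```

With these illustrative loadings, the two rainfall variables and the two discharge variables fall into separate groups, and one variable per group is retained, which is the pattern the scenarios in Table 3 follow with progressively broader groups.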

Table 3

Performance of salinity prediction models based on input variable groups

4.3 Evaluation of the Impact of Input Variable Optimization on the Performance of the Nakdong River Salinity Prediction Model

Building on the study by Woo et al. (2022), this study selected the optimal combination of input variables for a time-series prediction model using the deep learning algorithm LSTM to predict the salinity levels in the Nakdong River Estuary. Woo et al. (2022) developed an LSTM-based salinity prediction model using data reflecting the spatiotemporal characteristics of the Nakdong River Estuary, such as the discharge, water level, and CHR rainfall. They analyzed the performance of the model and identified the optimal sequence length. In the study by Woo et al. (2022), salinity predictions were made for a one-year validation dataset (2020) covering only the temporary opening period, and the results showed no significant difference in the model prediction accuracy according to the sequence length. However, in this study, because data from both continuous and temporary opening periods were included, the sequence length was found to affect the accuracy of the model. Therefore, in this study, a model was developed by determining the sequence length that yielded the highest accuracy through hyperparameter optimization during model construction.
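To make the role of the sequence length concrete, the sketch below shows one common way of turning a multivariate time series into fixed-length input windows for an LSTM. It assumes that each window of `seq_len` consecutive observations is used to predict the salinity at the next time step; the text does not specify the prediction horizon, so this pairing and the function name `make_windows` are assumptions.

```python
import numpy as np

def make_windows(features, target, seq_len):
    """Build inputs of shape (samples, seq_len, n_features) and next-step targets."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(features) - seq_len
    # Window i covers time steps i .. i + seq_len - 1.
    X = np.stack([features[i:i + seq_len] for i in range(n)])
    # The target for window i is the value right after the window ends (assumption).
    y = target[seq_len:seq_len + n]
    return X, y

# Toy series: 20 time steps, 4 input variables, sequence length 6 as in Table 2.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 4))
target = rng.normal(size=20)

X, y = make_windows(features, target, seq_len=6)
print(X.shape, y.shape)
```

Repeating this construction for several candidate values of `seq_len` and comparing validation error is one straightforward way to carry out the sequence-length search mentioned above.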

Fig. 11 presents the results of the salinity prediction at the Nakdong River Bridge using the LSTM model according to the variable selection scenarios outlined in Table 3. The key parameters and experimental conditions of the LSTM model used in this study are presented in Table 2, based on which all prediction experiments were conducted. A comparison of the salinity prediction performance according to variable optimization showed that all three scenarios exhibited generally similar prediction trends. However, at certain intervals, Scenario 3 exhibited patterns that were more closely aligned with the observed values and achieved more accurate predictions, whereas Scenarios 1 and 2 tended to underestimate compared to the observations. This suggests that Scenario 3, through the variable optimization process, was more sensitive to specific environmental changes. The analysis of the prediction performance confirmed that variable optimization improves the data efficiency. In particular, enhancing the accuracy and sensitivity of the model during periods of atypical environmental changes in the observational data (highlighted by dashed circles in the figure) necessitates prior implementation of the variable optimization process.

Fig. 11

Prediction of salinity in the Nakdong River using representative variables identified through EDA-based variable combination analysis
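The scenario comparison rests on the root mean square error (RMSE) values listed in Table 3. The short sketch below defines RMSE and computes the relative reduction of each scenario with respect to Scenario 1; the RMSE values are those reported in Table 3, while the function name and the choice of Scenario 1 as the baseline are assumptions of this example.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean square error between two equally long series."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# RMSE values reported in Table 3 (scenario number -> RMSE).
rmse_by_scenario = {1: 0.5962, 2: 0.5905, 3: 0.5695}

# Relative reduction with respect to Scenario 1 (assumed baseline), in percent.
baseline = rmse_by_scenario[1]
improvement_pct = {
    k: 100.0 * (baseline - v) / baseline for k, v in rmse_by_scenario.items()
}
for k, pct in improvement_pct.items():
    print(f"Scenario {k}: {pct:.2f}% lower RMSE than Scenario 1")
```

Under this reading, Scenario 3 lowers the RMSE by about 4.5% relative to Scenario 1, which appears consistent with the improvement stated in the abstract.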

5. Conclusions

This study aimed to optimize the input variables to construct a salinity prediction model for the Nakdong River Estuary, thereby improving prediction accuracy and reliability. EDA was systematically conducted to select variables that significantly influenced salinity prediction, based on which an efficient model was designed.

The summarized results are as follows.

  • (1) The correlation analysis revealed that during the continuous opening period of the estuary bank, the correlations among variables exhibited more distinct seasonal variations than during the temporary opening period. This is attributed to the continuous opening of the estuary bank, which clarified the interactions among the physical and chemical variables of the Nakdong River, resulting in relatively enhanced seasonality in the relationships between these variables.

  • (2) As the correlations among variables increase, multicollinearity may cause the influence of certain independent variables on the dependent variable (salinity) to be over- or underestimated, thereby hampering the model interpretation. Variable selection techniques are necessary to address such issues, and the application of statistical methods such as PCA should be considered. Therefore, in the modeling process utilizing data from the estuary bank opening period, it is essential to adopt an approach that systematically accounts for seasonality and time-varying patterns in the data and effectively manages the interactions among variables that evolve over time.

  • (3) A comparison of the salinity prediction performance according to variable optimization showed that while all three scenarios exhibited generally similar trends, Scenario 3 provided more accurate predictions with patterns closely matching the observed values during specific environmental change periods. This suggests that variable optimization not only enhances the data efficiency, but also plays a crucial role in increasing the sensitivity of the model to specific environmental changes.

In contrast to previous studies that relied on domain experts for variable selection, this study presents an approach that enables non-experts to identify key variables by utilizing the statistical characteristics of the data. This methodology enables the development of objective and reproducible models based on data-driven approaches. Consequently, this study is expected to contribute to solving problems in the field of environmental hydraulics by expanding the practical applicability of AI models.

However, the model exhibited relatively low prediction accuracy in sections where the salinity increases rapidly. This was primarily attributed to insufficient generalization of the model to abrupt change zones (extrapolation zones) that were not represented in the training data, which limited its ability to predict rapid salinity increases caused by changes in external factors. Addressing this issue requires integrating physics-informed data models that incorporate the actual physical laws. Physics-based models have the advantage of providing robust predictions based on physical constraints, even under environmental changes or extreme conditions not present in the observational data. The development of physics-informed data models that can more accurately predict abrupt natural environmental changes is underway, and the results of these modeling efforts will be compiled and presented in future publications.

Notes

The authors declare that they have no conflicts of interest.

This research was supported by the National Research Foundation of Korea (NRF) funded by the government (Ministry of Science and ICT) (No. 202301190002).

References

Han C. S, Jung S. W, Roh T. Y. 2011;The study of salinity distribution of Nakdong River estuary. Journal of Korean Society of Coastal and Ocean Engineers 23(1):101–108. https://doi.org/10.9765/KSCOE.2011.23.1.101.
Lee H. J, Jo M. G, Han J. K, Chun S. J. 2022;Nakdong River estuary salinity prediction using machine learning methods. Smart Media Journal 11(2):31–38. https://doi.org/10.30693/SMJ.2022.11.2.31.
Kim J. W. 2020;Prediction on the ratio of added value in industry using forecasting combination based on machine learning method. Journal of the Korea Contents Association 20(12):49–57. https://doi.org/10.5392/JKCA.2020.20.12.049.
Woo J. W, Kim Y. J, Yoon J. S. 2022;Prediction of salinity of Nakdong River estuary using deep learning algorithm (LSTM) for time series analysis. Journal of Korean Society of Coastal and Ocean Engineers 34(4):128–134. https://doi.org/10.9765/KSCOE.2022.34.4.128.
Noh H. K, Ryu H. K, Ryu J. H, Kim H. Y, Chun J. H. 2020. Status and plan of ‘Operation rule improvement and ecological restoration plan of Nakdong estuary’. Proceedings of the Korea Water Resources Association Conference 21–22. https://koreascience.kr/article/CFKO202019762875454.pdf.
Jo M. 2022, February 18. 낙동강 하굿둑 35년 만에 빗장 풀렸다 [The Nakdong River Estuary Barrage opened for the first time in 35 years]. Busan is Good. https://www.busan.go.kr/news/snsbusan01/view?dataNo=66395&curPage=1.

Fig. 1

Nakdong River estuary: (a) Installation location, (b) Marine environment observation sites, (c) Overview of the estuary (Jo, 2022)

Fig. 2

Observation data near the Nakdong River Estuary: (a) Changnyeong-Hamanbo total discharge, (b) Changnyeong-Hamanbo water level, (c) Changnyeong-Hamanbo rainfall, (d) Sang-dong water temperature, (e) Sang-dong electrical conductivity, (f) Sang-dong dissolved oxygen, (g) Gupo Bridge water level, (h) Nakdong River Bridge salinity, (i) Nakdong River Bridge water temperature, (j) Nakdong River Bridge electrical conductivity, (k) Banks of the Nakdong River outer water level, (l) Banks of the Nakdong River inner water level, (m) Banks of the Nakdong River rainfall, (n) Banks of the Nakdong River total discharge.

Fig. 3

Salinity variations in the Nakdong River Estuary during temporary and continuous gate opening periods: (a) Entire period (2019.01.01–2021.12.31 for temporary opening, 2022.02.18–2022.06.30 for continuous opening), (b) Monthly distribution

Fig. 4

Seasonal characteristics of the marine environment around the Nakdong River Estuary during the temporary gate opening period (2019.01–2021.12)

Fig. 5

Seasonal characteristics of the marine environment around the Nakdong River Estuary during the continuous gate opening period (2022.02–2022.06)

Fig. 6

Correlation analysis of observational variables during the entire data period (2019.01–2022.06)

Fig. 7

Seasonal correlation analysis of observational variables during the temporary gate opening period in the Nakdong River Estuary (2019.01–2021.12): (a) Winter, (b) Spring, (c) Summer, (d) Autumn

Fig. 8

Seasonal correlation analysis of observational variables during the continuous gate opening period in the Nakdong River Estuary (2022.02–2022.06): (a) Winter, (b) Spring, (c) Summer

Table 1

Status of observational data collection in the Nakdong River Estuary and surrounding areas

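Observation site | Observation parameter | Data period | Temporal resolution | Measurement unit
Changnyeong Hamanbo Reservoir | Discharge | 2022. 2. 1–2022. 6. 30 | 10 min | m³/s
Changnyeong Hamanbo Reservoir | Water level | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | EL.m
Changnyeong Hamanbo Reservoir | Rainfall | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | mm
Sang-dong | Water temperature | 2019. 1. 1–2021. 12. 31 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | °C
Sang-dong | Electrical conductivity | 2019. 1. 1–2021. 12. 31 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | μS/cm
Sang-dong | Dissolved oxygen | 2019. 1. 1–2021. 12. 31 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | mg/L
Gupo Bridge | Water level | 2019. 1. 1–2021. 12. 31 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | EL.m
Nakdong River Bridge | Salinity | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | psu
Nakdong River Bridge | Water temperature | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | °C
Nakdong River Bridge | Electrical conductivity | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | μS/cm
Banks of the Nakdong River | Outer water level | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | EL.m
Banks of the Nakdong River | Inner water level | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | EL.m
Banks of the Nakdong River | Rainfall | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | mm
Banks of the Nakdong River | Discharge | 2019. 1. 1–2022. 6. 30 | 2019. 01–2021. 12: 1 h; 2022. 01–2022. 06: 10 min | m³/s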

Table 2

Experimental conditions and parameters for LSTM

Parameter Value
Hidden units 64
Num layers 5
Epochs 100
Sequence length 6
Activation (gates) Sigmoid
Activation (cell) tanh
Optimizer Adam
Learning rate 0.01
Loss function MSE

Table 3

Performance of salinity prediction models based on input variable groups

No. | Number of input variables | Input variables | RMSE
1 | 8 | Changnyeong Hamanbo Reservoir water level; Changnyeong Hamanbo Reservoir rainfall; Nakdong River Bridge water temperature; the banks of the Nakdong River outer water level; the banks of the Nakdong River inner water level; the banks of the Nakdong River rainfall; the banks of the Nakdong River discharge; Nakdong River Bridge water temperature | 0.5962
2 | 6 | Changnyeong Hamanbo Reservoir water level; Changnyeong Hamanbo Reservoir rainfall; the banks of the Nakdong River inner water level; the banks of the Nakdong River outer water level; the banks of the Nakdong River discharge; Changnyeong Hamanbo Reservoir water level | 0.5905
3 | 4 | Nakdong River Bridge water temperature; the banks of the Nakdong River inner water level; the banks of the Nakdong River outer water level; the banks of the Nakdong River outer water level | 0.5695