Freshwater ecosystems are facing significant challenges due to the extinction of endemic species, often caused by the invasion of aggressive species in degraded habitats. Overfishing and unsustainable practices are further threatening fish populations, destabilizing aquatic environments. This crisis is rooted in the exponential growth of the human population, which intensifies environmental degradation. To address declining wild fish stocks, aquaculture has emerged as a vital solution, not only for food production but also for environmental conservation by restoring natural habitats and replenishing wild populations. In this context, deep learning techniques are revolutionizing aquaculture by enabling precise monitoring and management of aquatic environments. The ability to process and analyze large volumes of visual data in real-time helps in accurately detecting, tracking, and understanding fish behavior, which is crucial for both optimizing aquaculture practices and preserving natural ecosystems.
This thesis presents a real-time system for detecting and tracking fish in underwater environments, built around a custom fish detector called YOLO-FishScale, based on the YOLOv8 algorithm. The detector addresses the challenge of detecting small fish, whose apparent size varies with distance within frames. It extends YOLOv8 by adding a new detection head that uses features from the $P_2$ layer and by replacing the Conv module with SPD-Conv, improving performance on small and low-resolution targets. Additionally, CBAM (Convolutional Block Attention Module) modules are integrated to enhance feature fusion, resulting in more accurate fish detection and tracking.
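For reference, below is a minimal PyTorch sketch of the SPD-Conv building block following its published space-to-depth formulation; the exact module used in YOLO-FishScale may differ in kernel size, normalization, or activation.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution (SPD-Conv).

    Replaces a stride-2 convolution: instead of discarding pixels while
    down-sampling, the space-to-depth step rearranges each 2x2 spatial
    block into the channel dimension (C -> 4C), so a subsequent stride-1
    convolution can exploit all of the fine-grained information.
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size,
                              stride=1, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Space-to-depth: (B, C, H, W) -> (B, 4C, H/2, W/2)
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))
```

Because no pixel is thrown away during down-sampling, this block is particularly beneficial for the small, low-resolution fish targeted by the new $P_2$ head.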
[!NOTE] The original YOLOv8 model employs a backbone network that down-samples the image through five stages, resulting in five feature layers ($P_1$, $P_2$, $P_3$, $P_4$, and $P_5$), where each $P_i$ layer has a resolution of $1/2^i$ of the original image. To detect small fish more effectively, YOLO-FishScale adds a new detection head to the YOLOv8 architecture. This additional head utilizes features from the $P_2$ layer, enhancing micro-target detection: small objects are particularly challenging because their low resolution provides limited information from which to learn discriminative patterns.
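To make the arithmetic concrete, the snippet below prints the grid size at each pyramid level, assuming the common 640×640 training resolution (the resolution itself is an assumption here):

```python
# Grid size per pyramid level: each P_i has stride 2**i, i.e. a
# resolution of 1/2**i of the input image.
IMG = 640  # assumed training resolution
for i in range(1, 6):
    side = IMG // 2 ** i
    print(f"P{i}: stride {2 ** i:>2} -> {side}x{side} grid")
# P2 yields a 160x160 grid, four times as many cells as P3's 80x80,
# so each cell covers only a 4x4-pixel patch: fine enough to localize
# very small fish.
```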
To train new custom models, modify the `train.py` and `train.sh` scripts according to your specific requirements. After configuring these files, execute them to initiate the training process. To test all the trained models, ensure the appropriate permissions are granted (e.g., `chmod +x test.sh`) and then run the `test.sh` script.
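As an illustration of what such a training script typically contains (the actual `train.py` in this repository may differ; the config file names below are assumptions), the Ultralytics API boils down to:

```python
from ultralytics import YOLO

# Hypothetical config names; point these at the files used in this repo.
MODEL_CFG = "yolov8s-fishscale.yaml"  # custom architecture: P2 head, SPD-Conv, CBAM
DATA_CFG = "fishscale.yaml"           # dataset config: train/val paths, class names

model = YOLO(MODEL_CFG)
model.train(data=DATA_CFG, epochs=100, imgsz=640, batch=16)
```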
To train the YOLO-FishScale model, a new comprehensive dataset was constructed. The FishScale Dataset is an extensive collection of images and labels, compiled from three renowned fish datasets:

- DeepFish1
- Fish4Knowledge3
- OzFish4
For more details, see the corresponding README.md. The proposed final model demonstrates superior performance compared to the baseline models. However, when tested on a completely different dataset, its performance dropped significantly. Notably, the FishScale dataset is characterized by high-quality images, with only a limited number of samples captured by low-resolution cameras. To address this issue and enhance the robustness of the system, an additional private dataset provided by my Department (DIEM, UNISA) was utilized.
This section delves into the performance evaluation of various models trained on the FishScale dataset. The comparative analysis encompasses multiple models, all assessed on a consistent test set, identical to the one employed by A. A. Muksit et al. in their research2. To ensure a fair evaluation, all models were subjected to the same primary validation thresholds: `conf_tresh = 0.15` and `nms_tresh = 0.6`. These values were chosen to strike a good trade-off between precision and recall.
This subsection highlights the results achieved by the models developed by A. A. Muksit et al.2, trained on the combined DeepFish + OzFish dataset. It is noteworthy that the results reported here diverge from those documented in their paper. The primary reason for this discrepancy lies in the application of different threshold settings: specifically, `conf_tresh = 0.25` and `nms_tresh = 0.45` were employed in their original work. This variation in threshold values significantly influences performance metrics such as precision, recall, and F1-score. As a result, the models from Muksit et al. may exhibit different detection capabilities than observed in this analysis, which applies stricter thresholds to refine the results. (In the table below, ❌ marks a metric not available for that model.)
Model | Precision ↑ | Recall ↑ | F1-Score ↑ | mAP(50) ↑ | mAP(50-95) ↑ | Parameters ↓ | Gflops ↓ |
---|---|---|---|---|---|---|---|
YOLOv3 | 0.67 | 0.72 | 0.690 | 0.739 | ❌ | 61,576,342 | 139.496 |
YOLO-Fish-1 | 0.70 | 0.71 | 0.705 | 0.745 | ❌ | 61,559,958 | 173.535 |
YOLO-Fish-2 | 0.74 | 0.69 | 0.714 | 0.736 | ❌ | 62,610,582 | 174.343 |
YOLOv4 | 0.59 | 0.79 | 0.675 | 0.787 | ❌ | 64,003,990 | 127.232 |
The following table summarizes the results of the models I developed from scratch, specifically tuned for fish detection, incorporating adaptations such as the $P_2$ detection head, SPD-Conv, and CBAM modules to enhance performance. All of these models were trained on the FishScale Dataset + Private Dataset.
Model | Precision ↑ | Recall ↑ | F1-Score ↑ | mAP(50) ↑ | mAP(50-95) ↑ | Parameters ↓ | Gflops ↓ |
---|---|---|---|---|---|---|---|
YOLOv8s | 0.856 | 0.706 | 0.774 | 0.822 | 0.51 | 11,125,971 | 28.40 |
YOLOv8s-P2 | 0.85 | 0.72 | 0.779 | 0.829 | 0.519 | 10,626,708 | 36.60 |
YOLOv8s-p2-SPD | 0.867 | 0.717 | 0.785 | 0.831 | 0.52 | 12,187,284 | 55.70 |
YOLOv8s-p2-CBAM | 0.844 | 0.719 | 0.778 | 0.83 | 0.512 | 15,808,192 | 50.90 |
YOLOv8s-FishScale | 0.854 | 0.725 | 0.784 | 0.833 | 0.529 | 17,358,240 | 70.00 |
Although YOLOv8s-FishScale’s precision is slightly lower than YOLOv8s-p2-SPD’s (85.4% vs. 86.7%), its higher recall and comparable F1-Score highlight a well-rounded performance profile. This makes YOLOv8s-FishScale suitable when the goal is a balance between minimizing false positives and capturing as many true instances as possible. However, YOLOv8s-p2-SPD remains competitive with slightly better precision and lower computational requirements.
Fine-tuning YOLOv8s-FishScale directly from the YOLOv8s weights proved ineffective: the layers added in FishScale prevent the exact weight matching that YOLO's checkpoint loading requires. Consequently, to enhance performance, a weight-merging approach was adopted, structured as follows:
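The sketch below shows one plausible implementation of such a merge, under the assumption that shared layers keep their parameter names and shapes (config file names are hypothetical); it may differ from the exact procedure used here. Every matching pretrained tensor is copied, while the new layers keep their fresh initialization.

```python
from ultralytics import YOLO

# Hypothetical file names. Build the custom model, then copy every
# pretrained YOLOv8s tensor whose name and shape also occur in the
# FishScale model; the added layers (P2 head, SPD-Conv, CBAM) keep
# their fresh initialization. If the added layers shift module indices,
# a name-remapping step would be needed before matching.
fishscale = YOLO("yolov8s-fishscale.yaml").model
pretrained = YOLO("yolov8s.pt").model.state_dict()

state = fishscale.state_dict()
merged = {k: v for k, v in pretrained.items()
          if k in state and v.shape == state[k].shape}
state.update(merged)
fishscale.load_state_dict(state)
print(f"transferred {len(merged)}/{len(state)} tensors")
```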
Fine-tuning the model with these merged weights led to an increase in performance across all metrics, as shown in the table below, demonstrating the effectiveness of the weight-merging approach:
Model | Precision ↑ | Recall ↑ | F1-Score ↑ | mAP(50) ↑ | mAP(50-95) ↑ | Parameters ↓ | Gflops ↓ |
---|---|---|---|---|---|---|---|
YOLOv8s-Fishscale † | 0.853 | 0.736 | 0.79 | 0.839 | 0.537 | 17,358,240 | 70.00 |
YOLOv8s-Fishscale ☨ | 0.873 | 0.732 | 0.796 | 0.844 | 0.540 | 17,358,240 | 70.00 |
The difference between the two models lies in additional data augmentation techniques employed to further enhance performance.
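The exact augmentation recipe is not detailed here; purely for illustration, Ultralytics exposes augmentation strength directly as training hyperparameters:

```python
from ultralytics import YOLO

# Illustrative values only; the recipe actually used for the two
# variants is not specified in this section.
model = YOLO("yolov8s-fishscale.yaml")  # hypothetical config name
model.train(
    data="fishscale.yaml",
    epochs=100,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,       # colour jitter (useful underwater)
    degrees=10.0, translate=0.1, scale=0.5,  # geometric transforms
    fliplr=0.5, mosaic=1.0,                  # horizontal flip and mosaic
)
```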
Islam et al.5 proposed an innovative yet straightforward conditional GAN-based model, FUnIE-GAN, designed to enhance underwater images. The model centers on a generator network that learns to map a distorted image $X$ to an enhanced output $Y$ through a dynamic, adversarial relationship with a discriminator network.
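In the standard conditional adversarial formulation this builds on (shown here in its generic pix2pix-style form; the full FUnIE-GAN objective also adds per-pixel and content-similarity terms), the generator $G$ tries to minimize and the discriminator $D$ to maximize:

$$
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{X,Y}\big[\log D(X, Y)\big] + \mathbb{E}_{X,Z}\big[\log\big(1 - D(X, G(X, Z))\big)\big]
$$

where $Z$ denotes the random noise input to the generator.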
The following models were trained for a reduced number of epochs (50 instead of 100) starting from the YOLOv8s-FishScale ☨ weights. As shown in the table, the FUnIEGAN + YOLOv8s-Fishscale ☨ model, which was trained by freezing all encoder layers of FUnIEGAN except the last one and fine-tuning the remaining layers, achieved a slight improvement in precision, recall, and F1-score.
Model | Precision ↑ | Recall ↑ | F1-Score ↑ | mAP(50) ↑ | mAP(50-95) ↑ | Parameters ↓ | Gflops ↓ |
---|---|---|---|---|---|---|---|
FUnIEGAN (freezed) + ☨ | 0.858 | 0.693 | 0.767 | 0.811 | 0.515 | 24,388,355 | 198.40 |
FUnIEGAN + ☨ | 0.875 | 0.735 | 0.799 | 0.845 | 0.541 | 24,388,355 | 198.40 |
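A minimal sketch of the partial-freezing scheme described above is shown below; attribute names are hypothetical, since the actual FUnIE-GAN generator layout depends on the implementation used.

```python
import torch.nn as nn

def freeze_encoder_except_last(generator: nn.Module) -> None:
    """Freeze all encoder blocks of the generator except the last one."""
    blocks = list(generator.encoder.children())  # hypothetical attribute
    for block in blocks[:-1]:
        for p in block.parameters():
            p.requires_grad = False
    # The last encoder block and the whole decoder remain trainable.

# Only trainable parameters are then handed to the optimizer, e.g.:
# optimizer = torch.optim.Adam(
#     [p for p in generator.parameters() if p.requires_grad], lr=1e-4)
```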
Saleh, Alzayat, Laradji, Issam H., Konovalov, Dmitry A., Bradley, Michael, Vazquez, David, Sheaves, Marcus, 2020. A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Sci. Rep. 10 (1), 1–10. doi:10.1038/s41598-020-71639-x. ↩
A. A. Muksit, F. Hasan, M. F. Hasan Bhuiyan Emon, M. R. Haque, A. R. Anwary, and S. Shatabda, “Yolo-fish: A robust fish detection model to detect fish in realistic underwater environment,” Ecological Informatics, vol. 72, p. 101847, 2022. doi:10.1016/j.ecoinf.2022.101847. ↩ ↩2 ↩3
Fish4Knowledge Dataset. g18L5754. Fish4Knowledge Dataset. Open Source Dataset. Roboflow Universe, October 2023. Available at: https://universe.roboflow.com/g18l5754/fish4knowledge-dataset. ↩
Australian Institute Of Marine Science, 2020. Ozfish dataset - machine learning dataset for baited remote underwater video stations. ↩
M. J. Islam, Y. Xia and J. Sattar, “Fast Underwater Image Enhancement for Improved Visual Perception,” in IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3227-3234, April 2020, doi:10.1109/LRA.2020.2974710. ↩