A Comparative Study of Deep CNN Architectures for Static American Sign Language Recognition

Authors

  • Chi Zhang

DOI:

https://doi.org/10.61173/yackf205

Keywords:

Sign language recognition, deep learning, convolutional neural networks

Abstract

Sign Language Recognition (SLR) is a key computer vision task that aims to automatically interpret sign gestures in order to lower communication barriers between the deaf and hearing communities. Despite advances in deep learning, achieving both high accuracy and deployment efficiency in real-world SLR systems remains challenging. This work presents a comparative analysis of four Convolutional Neural Network (CNN) architectures (a Custom CNN, ResNet-50, EfficientNet-B0, and Inception-V3) for static American Sign Language (ASL) fingerspelling classification. Using pre-trained models, the study applied data augmentation, transfer learning, and fine-tuning to the ASL Alphabet dataset, which comprises more than 87,000 images across 29 classes. All models were trained in PyTorch under a consistent protocol that included early stopping and learning rate scheduling. The results show that EfficientNet-B0 achieved the highest accuracy, 99.8%, with minimal misclassifications, outperforming ResNet-50 (99.6%) and the Custom CNN (99.2%). Inception-V3 performed substantially worse, at 84.3% accuracy with a noisier confusion matrix, indicating more errors in distinguishing visually similar gestures. Confusion matrices confirmed that EfficientNet-B0 and ResNet-50 produced highly reliable, nearly diagonal predictions, while the Custom CNN, though slightly less accurate, offered a lightweight baseline. These findings demonstrate the benefits of transfer learning and modern model-scaling strategies for high-accuracy ASL recognition, while underscoring the need to balance accuracy against computational cost for real-time deployment in practical applications.
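The abstract notes that all models were trained under a consistent protocol with early stopping. The paper's own training code is not shown here; as an illustration of the general technique, the sketch below implements a patience-based early-stopping check in plain Python (the class name, `patience`, and `min_delta` parameters are illustrative assumptions, not taken from the study).

```python
class EarlyStopper:
    """Signal a stop when validation loss fails to improve for `patience` epochs.

    Illustrative sketch of patience-based early stopping; not the paper's code.
    """

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to tolerate without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record this epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0          # improvement: reset the patience counter
        else:
            self.counter += 1         # no improvement this epoch
        return self.counter >= self.patience
```

In a typical training loop, `step()` would be called once per epoch after evaluating on the validation set, and the loop would break when it returns True; the same idea pairs naturally with a learning-rate scheduler such as PyTorch's `ReduceLROnPlateau`, which also monitors a stalled validation metric.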

Published

2025-08-26

Issue

Section

Articles