Introduction

Over the last few decades, machine vision has become an important part of human society. It has long been used in industrial fields such as manufacturing, but it is also becoming part of everyday life for individuals. Self-driving cars are a good example of machine vision applied for everyone's benefit, and the systems are now developed enough that these cars can drive long distances with very few accidents.

However, the development of this technology also poses a new threat to people, especially regarding privacy. Hundreds of cameras on the road can accurately identify specific individuals and track their movements and actions. Cars can also be identified automatically by their number plates, exposing information about individuals that could be used to harm them. In most of the world this is not happening yet because of ethical concerns, but some countries actively use the technology to monitor and control their citizens. It is important not only to learn about the technology, but also to understand what the right applications of machine vision are.

One remarkable thing about machine vision is that the development of many tools has made it much easier for individuals to use the technology. Even students like us can use tools such as PyTorch and TensorFlow to build a machine vision model that achieves a specific goal in a fairly short amount of time. For this project, we will focus on creating a model that can be helpful to people while avoiding the harms that can come with the technology.

Application of Machine Vision in Sign Languages

Creating a machine vision application that can correctly recognize sign language can help many deaf individuals. People who are born deaf often have problems with reading and writing because getting an education is harder for them. One study in Korea states that about 30% of deaf people in Korea are illiterate, which makes it hard for them to communicate with others and to put their ideas into written form.

An accurate machine vision model that can detect sign language and interpret it into speech or written words could greatly help people communicate with those who cannot hear. The stakeholders are people who want to communicate with deaf people, especially deaf people who cannot speak or write, as well as people who do not know sign language.

This is not easy, because building a model that accurately detects sign language is essentially building a program that translates between languages. Like any other translator, the model always carries the risk of translating a message incorrectly, which can lead to miscommunication. To prevent this, the model that recognizes sign language should be as accurate as possible to minimize errors.

With the skillset we have, it is almost impossible to create a machine vision model that will perfectly translate sign language into English. Considering that even Google has not managed to build a perfect translator, this is not a project that can be completed at the individual level. What we can do, however, is build a model that handles basic detection: sign language gestures to alphabet letters.

Using the Sign Language MNIST dataset (https://www.kaggle.com/datamunge/sign-language-mnist), we will build and improve a CNN (Convolutional Neural Network) machine vision model with TensorFlow that can accurately classify which sign a person is making. As stated earlier, our main goal is to make the model as accurate as possible to prevent the errors and miscommunication that could happen in real applications.

The dataset includes 27,455 training images of signs that represent alphabet letters, plus 7,172 test images for evaluating how well the model works after training. Both sets label the images from 0 (A) to 25 (Z) to cover all 26 letters. Each image is stored as a single row in a CSV file containing the 784 grayscale pixel values (0 to 255) needed for a 28 × 28 image, the size and value range typically used in MNIST-style applications. Unfortunately, the dataset has no samples for the letters J and Z, because signing them correctly involves motion, so labels 9 and 25 are missing. As this exception shows, most signs are not motionless, so the model is inherently limited even with high accuracy. Still, building a model that handles the very basics can help with the future creation of better models, or provide useful insight for improving an application.

A CNN is an effective way to build an efficient model for many machine vision applications, and sign language in particular. A CNN breaks an image down into features: the image is translated into sets of numbers that describe what tendencies it has and what the computer can interpret from it. Because each sign has a distinctive hand shape, if we can train the model to recognize these shapes well, we should be able to build a CNN model with high accuracy.

After building a working model, it can be evaluated on real-world data, possibly created by ourselves. Ideally, we could build real-time interpretation software that speaks or writes down the detected sign. Even just testing the model on our own photos of signs would be a useful check of whether it still translates correctly when the data is collected differently. However, if none of these real applications turn out to be feasible, we believe that building the most accurate model possible will be enough for this project.

Importing Dataset and Libraries

The dataset was downloaded in advance and uploaded to our Google Drive. After importing the necessary libraries for handling the dataset (especially pandas), the CSV files were loaded into Google Colab and checked to confirm they were imported correctly.
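A minimal sketch of this step, assuming the two CSV files from the Kaggle dataset were copied to the root of My Drive (the exact paths are an assumption):

```python
# Minimal loading sketch; the Drive paths below are assumptions.
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

train_df = pd.read_csv('/content/drive/MyDrive/sign_mnist_train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/sign_mnist_test.csv')

# Sanity check: expect 27455 and 7172 rows, each with 785 columns (label + 784 pixels)
print(train_df.shape, test_df.shape)
train_df.head()
```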

Organize Dataset

The training and testing datasets were split into x (pixel) and y (label) parts to be used in training. The pixel values in the x parts were normalized for later calculations, and the x parts were reshaped back into 28 × 28 image form. The histogram shows that labels 9 and 25 (J and Z) are missing, as mentioned earlier, but otherwise the training data is evenly distributed across the labels, which allows balanced training.
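A sketch of the split, normalization, reshape, and label histogram described above (the variable names are our own):

```python
import numpy as np
import matplotlib.pyplot as plt

# Separate labels (y) from pixel columns (x)
y_train = train_df['label'].values
x_train = train_df.drop(columns='label').values
y_test = test_df['label'].values
x_test = test_df.drop(columns='label').values

# Normalize pixels to [0, 1] and reshape each 784-value row into a 28x28 grayscale image
x_train = (x_train / 255.0).reshape(-1, 28, 28, 1)
x_test = (x_test / 255.0).reshape(-1, 28, 28, 1)

# Label histogram; bins 9 (J) and 25 (Z) should be empty
plt.hist(y_train, bins=np.arange(27) - 0.5, rwidth=0.8)
plt.xlabel('label')
plt.ylabel('number of training images')
plt.show()
```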

Training The First Model

The first model was very simple. It had one convolutional layer, one max-pooling layer, and two dense layers. It has all the basics needed to form a CNN, but the results show that it quickly overfits the training data, reaching about 100% accuracy on the training data while staying around 85~88% on the test data. Because the model tuned itself so tightly to the training data, it does not generalize well to new images. The accuracy is not bad for a first try, but we want it higher for a real application.
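A sketch of a first model consistent with that description; the filter count, kernel size, and hidden-layer width are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

first_model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(26, activation='softmax'),   # labels 0-25 (9 and 25 never occur)
])

first_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
history = first_model.fit(x_train, y_train, validation_data=(x_test, y_test),
                          epochs=10, batch_size=100)
```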

Convolutional Layer

The most important layer in a CNN is the convolutional layer. Using kernels with different weights, the model tries to capture the relationships between the pixels in an image. As training continues, the convolutional layer adjusts the kernel weights so that it can effectively extract the desired features from images.

The figure above shows all 32 channels of the convolutional layer after applying 32 different kernels to one image. Each channel looks different, capturing a different aspect of the image. The weights of each kernel start as random values but gradually settle on values that yield the highest accuracy for our model.

Having more filters (kernels) can improve the accuracy of the model, but a higher number of filters always carries a risk of overfitting the data. So it is important to find the most suitable value for training.
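A sketch of how feature maps like the ones in the figure can be pulled out of the first convolutional layer (the layer index and the sample image are our own choices):

```python
import matplotlib.pyplot as plt
import tensorflow as tf

# Sub-model that stops at the first convolutional layer
feature_model = tf.keras.Model(inputs=first_model.input,
                               outputs=first_model.layers[0].output)

feature_maps = feature_model.predict(x_train[:1])   # shape (1, 26, 26, 32)

fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(feature_maps[0, :, :, i], cmap='gray')  # one channel per kernel
    ax.axis('off')
plt.show()
```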

Max Pooling Layer

Max pooling in a CNN is used to effectively reduce the amount of data the model has to deal with while maintaining its accuracy. You want to shrink the data so that training does not take forever, while keeping the feature information gained from the convolutional layer. With a 2 × 2 pool size, the resulting image is one quarter of the original, so you only have to deal with 25% of the original data. Reducing the data does not just mean faster training; it also helps keep the model from overfitting.

The effectiveness of pooling can be explained simply by counting the data. Every time data goes through a convolutional layer, the amount of data is multiplied by the number of filters, which is 32 in our model. If we stack 3 convolutional layers without pooling, the amount of data grows by roughly 32 × 32 × 32 = 32,768 times. If we apply 2 × 2 max pooling after each convolutional layer, this becomes roughly (32/4) × (32/4) × (32/4) = 512 times. Put simply, pooling reduces the amount of data the model has to deal with by a factor of 64.

Max pooling or average pooling is usually used. Max pooling keeps the most prominent value inside each pool, which tends to preserve the feature information you want for the model. Average pooling instead captures the general tendency inside the pool, which can also be helpful depending on the type of data.

We used max pooling for our model. Looking at each of the 32 channels produced by the pooling layer, the resolution is lower than in the convolutional layer output, but each channel still keeps its major shape, which shows why max pooling is effective in CNN applications.
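A small sketch of this reduction using standalone Keras layers ('same' padding is used here just to keep the arithmetic round):

```python
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 1))                     # one grayscale image
conv = tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu')
pool = tf.keras.layers.MaxPooling2D((2, 2))

features = conv(x)        # (1, 28, 28, 32): 32 channels, 32x the input values
pooled = pool(features)   # (1, 14, 14, 32): 2x2 pooling keeps a quarter of them

print(features.shape, pooled.shape)
```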

Improving The Model

Optimizer

We chose Adam as our optimizer because we learned that it is efficient in most machine learning cases. After some experiments swapping optimizers, Adam trained in a reasonable amount of time while giving high accuracy. For example, using Adagrad on the first model gave only 47% accuracy in evaluation, and RMSprop gave slightly lower accuracy than Adam while taking more time to compute. Tuning the hyperparameters would probably improve the accuracy for each optimizer, but it did not seem like a useful investment of time since we already had a fast and accurate one.
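A sketch of how the comparison was run, assuming a hypothetical build_first_model() helper that returns a fresh, uncompiled copy of the first model above:

```python
import tensorflow as tf

def evaluate_optimizer(optimizer_name):
    """Train a fresh copy of the first model with the given optimizer and return test accuracy."""
    model = build_first_model()   # hypothetical helper returning an uncompiled model
    model.compile(optimizer=optimizer_name,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=10, batch_size=100, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return accuracy

for name in ['adam', 'adagrad', 'rmsprop']:
    print(name, evaluate_optimizer(name))
```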

Dropout

Dropout is an easy way to prevent overfitting by randomly discarding nodes during training. As mentioned several times already, having too many parameters always carries a risk of overfitting. By dropping random nodes during training, the model cannot rely too heavily on any particular node and is pushed toward more general features, which lowers the chance of overfitting the data.

By adding a single Dropout(0.25) layer, we saw a 1~2% increase in evaluation accuracy. The accuracy tended to increase with a higher dropout value, but only up to a certain point, around 50%. We assume that dropping too many nodes leaves the model too little capacity to learn the dataset.

By adding multiple dropout layers with different values, we could see that the model was less fitted to the training data: training accuracy no longer hit 100%, and evaluation accuracy increased. After several experiments, we found that two 50% dropout layers in between the dense layers raised the test accuracy by about 4~5%.
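A sketch of the classifier head this points to; placing one 50% dropout before and one after the hidden dense layer is our reading of "in between the dense layers", and the dense width is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

classifier_head = tf.keras.Sequential([
    layers.Flatten(),
    layers.Dropout(0.5),                    # 50% dropout before the hidden dense layer
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),                    # and another before the output layer
    layers.Dense(26, activation='softmax'),
])
```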

Dense Layers

When we tried adding more dense layers, the accuracy dropped. However, deleting one of the two existing dense layers also dropped the accuracy, so one hidden dense layer (plus the output layer) appears to be right for this application.

The number of neurons also mattered: values around 256~512 gave the best accuracy. We assume that too few neurons do not give the network enough capacity to compute well, while too many overfit the model again.

We decided on ReLU as our activation function because it is easy to implement, efficient, and fast. ReLU is commonly used in most machine learning applications, especially machine vision, because it introduces the needed non-linearity while remaining cheap to compute.

Learning Rate

Decreasing the learning rate from the default of 0.001 to 0.0001 just made training slower and did not increase the accuracy at all. Raising it to 0.01 did not help either; the loss got stuck at one value and the model stopped improving. So we concluded that leaving the learning rate at the default is the best choice.
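A sketch of that comparison, again assuming a hypothetical build_model() helper that returns a fresh, uncompiled copy of the model being tuned (0.001 is Adam's default in Keras):

```python
import tensorflow as tf

for lr in [0.0001, 0.001, 0.01]:
    model = build_model()   # hypothetical helper returning an uncompiled model
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                        epochs=10, batch_size=100, verbose=0)
    print(lr, max(history.history['val_accuracy']))
```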

Convolutional Layers

Adding one more convolutional layer raised the test accuracy to 94~95%, but it would not go above that. After the epoch that hit 95%, the accuracy started going down again, meaning the model begins overfitting the training data from that point. We then tried adding a third layer and found that the test accuracy tended to go up by a little bit more.

We then tweaked the kernel sizes and found that a larger kernel (5, 5) gives a better result for the first convolutional layer, while smaller kernels (3, 3) in the later layers helped raise the accuracy. One explanation is that a kernel that is too small on the larger input image produces overly specific channels of the picture and fails to capture the broader patterns that distinguish different signs.

The final change was the number of filters. A high filter count made training very slow but tended to give higher test accuracy in general. After some testing, gradually increasing the filter count from layer to layer had almost the same effect as using a high count everywhere, but needed much less training time.
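Put together, these experiments point to a convolutional stack like the sketch below: a 5x5 kernel first, 3x3 kernels afterward, and filter counts that grow from block to block (the exact counts 32/64/128 are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

conv_stack = tf.keras.Sequential([
    layers.Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
])
conv_stack.summary()
```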

Batch Size and Epochs

Changing the batch size, as long as it was not too small or too large, did not change the model's maximum accuracy. Different batch sizes needed different numbers of epochs, but they eventually ended up in the same place. To make the computation faster, we decided on a batch size of 100, which gave good accuracy with a fair training time.

The Final Model

The final model was created using the following layers:
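Below is a sketch that combines the choices described in the previous sections (three conv/pool blocks with growing filters, two 50% dropout layers around a single hidden dense layer, Adam with the default learning rate, batch size 100); the exact filter counts, dense width, and epoch count are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

final_model = tf.keras.Sequential([
    layers.Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(26, activation='softmax'),   # labels 0-25; 9 (J) and 25 (Z) never occur
])

final_model.compile(optimizer='adam',         # default learning rate 0.001
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
final_model.fit(x_train, y_train, validation_data=(x_test, y_test),
                epochs=20, batch_size=100)
```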

It became a little more complex than the first model, and we can confirm that the new model is better by looking at the accuracy. Even though the training accuracy decreased slightly, the test accuracy increased from 85~88% to 96~97%. Getting roughly 1 in 25 signs wrong instead of roughly 1 in 8 wrong means the model was effectively improved through multiple iterations.

One observation from the accuracy graph is that the accuracy tends to fluctuate around its maximum (and the loss around its minimum). This is why we were not able to improve the model further. More hyperparameter tweaking might get better results, but at this point we are unsure how to approach it.

Wrong Values and Analysis

Top 9 Wrong Signs

To see which signs the model tends to get wrong, we visualized the top 9 signs most often mispredicted by the model. Looking at the percentage of missed predictions per label, we realized that the model is poor at recognizing certain signs in particular. The errors are not spread evenly across all signs; the 96~97% accuracy mostly comes from specific labels being guessed wrong.
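A sketch of how those top 9 labels can be found and plotted (the letter mapping simply offsets each label from 'A'):

```python
import numpy as np
import matplotlib.pyplot as plt

y_pred = np.argmax(final_model.predict(x_test), axis=1)

labels = np.unique(y_test)
miss_rate = np.array([np.mean(y_pred[y_test == c] != c) for c in labels])

worst = np.argsort(miss_rate)[::-1][:9]               # positions of the 9 worst labels
letters = [chr(ord('A') + c) for c in labels[worst]]

plt.bar(letters, miss_rate[worst])
plt.ylabel('fraction of test images predicted wrong')
plt.show()
```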

Actual Label and Prediction Visuals

For the top 2 most-missed labels, we compared images of the actual label with images of the predicted label. The two signs resemble each other in shape, for example both showing only a fist, or a similar number of fingers pointing in the same direction. The model was doing a reasonable job, but some signs share almost the same features, and since a CNN predicts based on the features it extracts from the images, these pairs are especially challenging.

This investigation suggested that the CNN approach has a real limitation here. Since every image is resized to a small resolution, some images are hard even for humans to interpret correctly. It is not surprising that the machine could not differentiate between some signs if they became harder to recognize while preparing the dataset for training. One possible solution is to use higher-resolution images, but bigger images mean longer training, which makes it take much more time to evaluate whether a model is effective.

Confusion Matrix

A confusion matrix was created to see the relationship between the classes. The matrix is normalized per class, since the signs do not all have the same number of samples. It looks good, with dark blue cells along the diagonal, but some light-blue cells off the diagonal show that some predictions were wrong. Most labels have a value of 1, meaning every prediction for them was right; some sit around 0.9-1.0, and a few have relatively low accuracy, showing false negatives and false positives.
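A sketch of how the normalized matrix can be produced, here using scikit-learn for brevity:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = np.argmax(final_model.predict(x_test), axis=1)
labels = np.unique(y_test)

cm = confusion_matrix(y_test, y_pred, labels=labels, normalize='true')  # rows sum to 1
disp = ConfusionMatrixDisplay(cm, display_labels=[chr(ord('A') + c) for c in labels])
disp.plot(cmap='Blues')
plt.show()
```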

Overall, the model is effective for most signs, except for a few that it fails to recognize correctly.

Effectiveness and Limitations

The final model reaches 96~97% accuracy, meaning it interprets the sign language alphabet correctly most of the time. Because our team had the unusual circumstance of having only one member, we were not able to implement real-time interpretation of sign language or evaluate the model on data we created ourselves. Still, building a fully working CNN model and improving its accuracy from 85% to 97% was a valuable learning experience.

Given the nature of language, accuracy close to 100% is important to prevent any kind of misinterpretation. For example, even with a 97% chance of interpreting each letter correctly, the probability of getting a 5-letter message entirely right is 0.97^5 ≈ 86%, and for a 10-letter message only 0.97^10 ≈ 74%. Also, the alphabet is not the only thing used in sign languages; it is rather rare to communicate using fingerspelling alone. So the biggest limitations of this project are that, even with a CNN, it was hard to reach the very high accuracy needed for real-world usage, and the model cannot interpret signs that involve motion.

These limitations make the model hard to use in real-world applications, but we believe it can be a basic step toward building a functional sign language interpreter. Other machine learning models might achieve higher accuracy, or a different approach could handle motion-based signs. In either direction, we believe it is important for us as engineers to keep trying so that the technology eventually serves society without causing harm.