Fig. 1: The NN written as a mathematical function which maps an image M to a character sequence (c1, c2, …).

Operations

CNN: the input image is fed into the CNN layers. These layers are trained to extract relevant features from the image. First, the convolution operation applies a filter kernel of size 5×5 in the first two layers and 3×3 in the last three layers to the input. Then, the non-linear RELU function is applied. Finally, a pooling layer summarizes image regions and outputs a downsized version of the input. While the image height is downsized by 2 in each layer, feature maps (channels) are added, so that the output feature map (or sequence) has a size of 32×256.

RNN: the feature sequence contains 256 features per time-step, and the RNN propagates relevant information through this sequence. The popular Long Short-Term Memory (LSTM) implementation of RNNs is used, as it is able to propagate information over longer distances and provides more robust training characteristics than a vanilla RNN. The RNN output sequence is mapped to a matrix of size 32×80. The IAM dataset consists of 79 different characters, and one additional character is needed for the CTC operation (the CTC blank label), so there are 80 entries for each of the 32 time-steps.

CTC: while training the NN, the CTC is given the RNN output matrix and the ground-truth text, and it computes the loss value. During inference, the CTC is only given the matrix, which it decodes into the final text. Both the ground-truth text and the recognized text can be at most 32 characters long.

Input: a gray-value image of size 128×32. The images from the dataset usually do not have exactly this size, so we resize them (without distortion) until they either have a width of 128 or a height of 32. Then, we copy the image into a (white) target image of size 128×32.
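Two of the steps described above can be sketched in plain NumPy: pasting a distortion-free resized image into a white 128×32 target, and greedy (best-path) CTC decoding of the 32×80 output matrix. This is a minimal illustration with hypothetical helper names, not the article's actual TensorFlow implementation (which, for example, uses a proper image-resampling routine and TensorFlow's built-in CTC decoder).

```python
import numpy as np

TARGET_W, TARGET_H = 128, 32

def preprocess(img):
    """Scale img (H x W gray-value array, white=255) to fit 128x32 without
    distortion, then paste it onto a white target image."""
    h, w = img.shape
    # one scale factor for both axes keeps the aspect ratio intact
    f = min(TARGET_W / w, TARGET_H / h)
    new_w, new_h = max(1, int(w * f)), max(1, int(h * f))
    # nearest-neighbour resize; a real system would use e.g. cv2.resize
    rows = (np.arange(new_h) / f).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / f).astype(int).clip(0, w - 1)
    resized = img[np.ix_(rows, cols)]
    target = np.full((TARGET_H, TARGET_W), 255, dtype=img.dtype)
    target[:new_h, :new_w] = resized
    return target

def ctc_best_path(mat, charset):
    """Greedy CTC decoding of a (time-steps x classes) matrix: pick the most
    probable label per time-step, collapse repeats, then drop blanks."""
    blank = len(charset)  # the blank label occupies the extra last entry
    best = np.argmax(mat, axis=1)
    out, prev = [], None
    for label in best:
        if label != prev and label != blank:
            out.append(charset[label])
        prev = label
    return "".join(out)

# usage: a toy 2-character alphabet, so class 2 is the CTC blank
mat = np.eye(3)[[0, 0, 2, 1, 1, 2, 0]]   # "aa-bb-a" over 7 time-steps
print(ctc_best_path(mat, "ab"))          # -> aba
```

Best-path decoding is the simplest CTC decoder; the article's system can also use beam search, which scores whole label sequences instead of committing to the single best character at each time-step.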