16-bit Floating Point Adder

Today’s post is based on the master thesis of Arturo Barrabés Castillo titled Design of Single Precision Float Adder (32-bit Numbers) according to IEEE 754 Standard Using VHDL.

Since DLS doesn’t support more than 16 bits per wire/pin, I’ll apply the same algorithms on 16-bit floating point numbers. I kept the same component names to easily find the connection between the paper and the schematics below. There are also some differences from the paper which I’ll point out when describing the relevant part of the circuit.

Figure 1 shows the final component. It has 3 inputs (A, B, and fsel) and 1 output (F). A and B are the two 16-bit floating point numbers. fsel is the function select signal, with 0 for addition and 1 for subtraction. F is the result of the operation.

Figure 1: 16-bit Floating Point Adder component

Step 1: Check for a special case

The first step is to check if we have a special case. Special cases are considered all the cases for which the result of the operation between the 2 inputs can be determined without performing the operation. I.e. adding 0 to a number results in the number itself, adding opposite sign infinities results in a NaN, etc. (see the paper for all the cases).

The block handling this part is called n_case (figure 2). It has 2 outputs, S and enable. S is the result of the operation if a special case is detected, otherwise Undefined. enable is used to turn on the rest of the circuit in case the current combination cannot be handled by this block.

Figure 2: n_case block

There are 2 bugs in the n_case code from the paper.

First, the classification of inputs A and B (intermediate signals outA and outB) are wrong when A or B are powers of 2 (0 < E < 255 and M = 0). Both codes on pages 22 and 96 end up producing a value of 000 for outA and outB, meaning the numbers are zero. In order to fix this problem, we just have to ignore the mantissa, in case E is in the range (0, 255) (or (0, 31) in our case of 16-bit numbers).

The second bug has to do with the treatment of B. The n_case block takes as input only the two numbers and ignores the selected operation. If fsel is 1 (subtraction), B’s sign should be reversed before checking the special cases. If we don’t do that, subtracting any number from 0 will result in the number itself, instead of its negative value (e.g. 0 - 1 = 1).

Both problems have been addressed in the circuit shown in figure 2. Other than that, the rest of the circuit consists of simple checks for each case. I decided to reverse B’s sign before making any checks, in case of subtraction. This means that 0 - 0 results in a negative zero, because the case where A is zero has higher priority than the case where B is zero. Not much of a problem, but it should be mentioned.

Step 2: Prepare for addition

If the enable output of the n_case component is HIGH, it means that both A and B are regular numbers (normals or subnormals). Before being able to add them together, we have to align their decimal points. This is done by finding the exponent of the largest number and shifting the mantissa of the other to the right, until both exponents are equal. Remember that shifting the decimal point to the left means that the mantissa is shifted to the right. Also, for each shifted position, the exponent is incremented by 1. This part is handled by the preadder component (figure 3).

Figure 3: preadder block

The preadder takes as input the two 16-bit numbers and the enable signal from the n_case component. First it expands both numbers to 21 bits, by introducing the implicit bit (1 for normals and 0 for subnormals) and adding 4 guard bits at the end. This is done in the selector (figure 4). The selector also outputs a 2-bit signal e_data indicating the type of both numbers (00 when both are subnormals, 01 when both are normals and 10 in case of a combination).

Figure 4: preadder selector

Note that the expanded 21-bit numbers are broken into 2 signals. A 16-bit for the lower part and a 5-bit signal for the higher part.

After expanding the numbers, depending on the value of e_data the correct path is chosen. If both numbers are subnormals, they are handled by the n_subn block (figure 5). In this case the output exponent is 0 (which the same for both numbers). The two mantissas are compared and the largest one is placed on MA and the other on MB. If the A input is larger than B, then Comp is HIGH, otherwise it’s LOW.

Figure 5: preadder n_subn

Side note: I’ve used a 16-bit magnitude comparator to compare the 2 mantissas, eventhough the extended mantissas are 15 bits wide. The 16th bit is taken from the exponent, because I know that both numbers are subnormals and thus have an exponent of zero.

If the numbers are both normals or a combination of a normal and a subnormal, they are routed to the n_normal block (figure 6 and figure 7). Again both numbers are compared and the output exponent EO is set to the largest exponent. MA holds the largest number’s mantissa. The smaller mantissa (Mshft output from the comp_exp component) is shifted to the right based on the difference of the two exponents.

Figure 6: preadder n_normal
Figure 7: preadder n_normal comp_exp

Side note: The comp_exp component uses the same 16-bit magnitude comparator as in the previous case. This time, since the exponents are non-zero, I made the 1st bit equal to 0 and the rest of the bits come from the extended mantissa.

After this step, all outputs from n_subn and n_normal are routed to the mux_adder and the final outputs of the preadder are calculated, based on e_data.

The outputs of the preadder block are the following:

  • SA and SB: The signs of the two numbers
  • C: 1-bit flag indicating if number A is larger than number B
  • Eout: The final exponent, equal to the largest exponent
  • MAout: The mantissa of the largest number
  • MBout: The mantissa of the smallest number

Step 3: The addition/subtraction

We now have both numbers in the order and the format required to perform the selected operation. Since both exponents are equal, the only thing required is to add/subtract the mantissas. This is handled by the block_adder component (figure 8). The block_adder takes as input the two signs, SA and SB, the two mantissas A and B, the selected operation (0 for addition, 1 for subtraction) and the Comp flag which indicates if A is larger than B.

Figure 8: block_adder

The signout block (figure 9) is responsible for calculating the sign of the result, the true operation to be performed between the two inputs and for reordering the inputs in case the true operation is different than the selected operation. The true operation is the actual operation to be performed between the two inputs. I.e. if we subtract a negative number from a positive number, the true operator is addition, which is different than the selected subtraction.

Figure 9: block_adder signout

The final addition/subtraction is performed using a trimmed 16-bit CLA. The circuit is similar to the one presented in the previous article, with the only difference being that the 16-th bit of the result is used as the carry out of the addition and the sum is only 15 bits wide.

Step 4: Normalization

We now have all the parts necessary to reconstruct the result. Figure 10 shows the normalization block (norm_vector from the paper).

Figure 10: norm_vector

It might seem a bit more complicated than necessary so I’ll try to describe the circuit in pseudocode below. The job of this component is to normalize the mantissa and correct the exponent accordingly. It gets as inputs the sign of the result (SS), the exponent (ES), the mantissa (MS) and the carry out of the addition (Co). Note that this part is very different than the one found in the paper. I probably didn’t understand the VHDL correctly and my attempts to map it directly to circuit ended up producing invalid results. The circuit shown in figure 10 seems to behave correctly in all tested cases.

There are several cases to consider when normalizing the mantissa. The code below shows all of them in pseudocode (assume a combined 16-bit number ZX.YYYYYYYYYYYYYY, where Z = Co and X = MSB(M))

if Z == 1 then
    E' = E + 1
    M' = M >> 1 // M' = 1.XYYYYYYYYYYYYY
else
    if X == 1 then
        if E == 0 then // subnormal + subnormal = normal
            E' = 1
            M' = M // M' = 1.YYYYYYYYYYYYYY
        else
            E' = E
            M' = M // M' = 1.YYYYYYYYYYYYYY
        end
    else
        lz = leading_zeroes(M)
        if E < lz or lz == 15 then
            E' = 0
            M' = M << E // Shift the mantissa to the left until the exponent becomes zero
        else
            E' = E - lz
            M' = M << lz // Shift the mantissa to the left until an 1 appears on the MSB
        endif
    endif
endif

Hope it’s a bit more clear than the circuit :)

There are 2 special cases to consider. When adding two subnormal numbers, the result might end up being a normal (e.g. 0x0200 + 0x0200 = 0x0400 which is the smallest normal). In this case Z = Co = 0, X = MSB(M) = 1 and E = 0. The mantissa is already normalized (has an one at the MSB), so the exponent needs to be incremented to 1. If it’s left at 0, the result will be wrong when rounding the mantissa (see below).

The other case is then both X and Z are zero. In this case we have to shift the mantissa to the left until it becomes normalized or until the the exponent gets to 0. This can happen if the number of leading zeroes in the mantissa is larger than the exponent or if the mantissa is 0 (lz = 15). If the mantissa is zero at this point, it means that the result is 0 independently of the exponent (i.e 0x4000 - 0x4000 = 0 eventhough ES = 0x10).

Side note: The leading_zeroes component has been implemented using a 32k x 4-bit ROM.

Other than those 2 cases, the rest should be easily derived from any explanation on floating point addition.

The final step after calculating E' and M' is to round the mantissa. This involves ignoring the implicit bit (15-th bit) and rounding the guard bits. In order to keep things as simple as possible I decided to just cut off the guard bits, which is effectively rounding towards zero.

The final floating point adder circuit is shown in figure 11.

Figure 11: 16-bit Floating Point Adder

Thanks for reading. Comments and corrections are welcome.