Today’s post is based on the master’s thesis of Arturo Barrabés Castillo, titled *Design of Single Precision Float Adder (32-bit Numbers) according to IEEE 754 Standard Using VHDL*.

Since DLS doesn’t support more than 16 bits per wire/pin, I’ll apply the same algorithms on 16-bit floating point numbers. I kept the same component names to easily find the connection between the paper and the schematics below. There are also some differences from the paper which I’ll point out when describing the relevant part of the circuit.

Figure 1 shows the final component. It has 3 inputs (`A`, `B`, and `fsel`) and 1 output (`F`). `A` and `B` are the two 16-bit floating point numbers. `fsel` is the function select signal, with `0` for addition and `1` for subtraction. `F` is the result of the operation.

## Step 1: Check for a special case

The first step is to check if we have a special case. A special case is any case for which the result of the operation between the 2 inputs can be determined without performing the operation. For example, adding 0 to a number results in the number itself, adding opposite sign infinities results in a NaN, etc. (see the paper for all the cases).

The block handling this part is called `n_case` (figure 2). It has 2 outputs, `S` and `enable`. `S` is the result of the operation if a special case is detected, otherwise `Undefined`. `enable` is used to turn on the rest of the circuit in case the current combination cannot be handled by this block.

There are 2 bugs in the `n_case` code from the paper.

First, the classification of inputs `A` and `B` (intermediate signals `outA` and `outB`) is wrong when `A` or `B` is a power of 2 (0 < E < 255 and M = 0). Both code listings, on pages 22 and 96, end up producing a value of `000` for `outA` and `outB`, meaning the numbers are zero. To fix this problem, we just have to ignore the mantissa when E is in the range (0, 255) (or (0, 31) in our case of 16-bit numbers).

The second bug has to do with the treatment of `B`. The `n_case` block takes as input only the two numbers and ignores the selected operation. If `fsel` is 1 (subtraction), `B`’s sign should be reversed before checking the special cases. If we don’t do that, subtracting any number from 0 will result in the number itself, instead of its negative value (e.g. `0 - 1 = 1`).

Both problems have been addressed in the circuit shown in figure 2. Other than that, the rest of the circuit consists of simple checks for each case. I decided to reverse `B`’s sign before making any checks, in case of subtraction. This means that `0 - 0` results in a negative zero, because the case where `A` is zero has higher priority than the case where `B` is zero. Not much of a problem, but it should be mentioned.
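The two fixes can be sketched in Python as follows. This is only an illustration of the idea, not a transcription of the circuit in figure 2: the function names `classify` and `effective_b` and the string labels are my own assumptions, and the paper uses 3-bit codes instead of labels.

```python
EXP_BITS, MAN_BITS = 5, 10          # half precision: 1 sign, 5 exponent, 10 mantissa bits
EXP_MAX = (1 << EXP_BITS) - 1       # 31, reserved for infinity and NaN

def classify(x: int) -> str:
    """Classify a 16-bit pattern (labels are illustrative, not the paper's codes)."""
    e = (x >> MAN_BITS) & EXP_MAX
    m = x & ((1 << MAN_BITS) - 1)
    if e == 0:
        return "zero" if m == 0 else "subnormal"
    if e == EXP_MAX:
        return "infinity" if m == 0 else "nan"
    return "normal"                 # 0 < E < 31: the mantissa is ignored (first fix)

def effective_b(b: int, fsel: int) -> int:
    """Flip B's sign bit when fsel = 1, so A - B is checked as A + (-B) (second fix)."""
    return b ^ ((fsel & 1) << 15)
```

With these two helpers, `classify(0x4000)` (the number 2.0, a power of 2) is reported as a normal number instead of zero, and `effective_b(0x3C00, 1)` turns `1.0` into `-1.0` before any special case is checked.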

## Step 2: Prepare for addition

If the `enable` output of the `n_case` component is HIGH, it means that both `A` and `B` are regular numbers (normals or subnormals). Before being able to add them together, we have to align their binary points. This is done by finding the exponent of the larger number and shifting the mantissa of the other to the right, until both exponents are equal. Remember that shifting the binary point to the left means that the mantissa is shifted to the right. Also, for each shifted position, the exponent is incremented by 1. This part is handled by the `preadder` component (figure 3).

The `preadder` takes as input the two 16-bit numbers and the `enable` signal from the `n_case` component. First it expands both numbers to 21 bits, by introducing the implicit bit (1 for normals and 0 for subnormals) and adding 4 guard bits at the end. This is done in the `selector` (figure 4). The `selector` also outputs a 2-bit signal `e_data` indicating the type of both numbers (`00` when both are subnormals, `01` when both are normals and `10` in case of a combination).

Note that the expanded 21-bit numbers are broken into 2 signals: a 16-bit signal for the lower part and a 5-bit signal for the higher part.
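As a rough sketch of what the `selector` computes, the expansion can be written in Python as below. The bit layout of the 21-bit value (sign, then exponent, then the 15-bit extended mantissa) and the helper names `expand`, `split` and `e_data` are assumptions made for illustration; the actual wiring is the one in figure 4.

```python
def expand(x: int) -> int:
    """Expand a 16-bit float to 21 bits: sign | exponent | implicit bit, mantissa, 4 guard bits."""
    sign = (x >> 15) & 1
    exp = (x >> 10) & 0x1F
    frac = x & 0x3FF
    implicit = 1 if exp != 0 else 0          # 1 for normals, 0 for subnormals
    mant = (implicit << 14) | (frac << 4)    # 15 bits, guard bits start at 0
    return (sign << 20) | (exp << 15) | mant

def split(v: int) -> tuple:
    """Break a 21-bit value into a 5-bit high part and a 16-bit low part."""
    return (v >> 16) & 0x1F, v & 0xFFFF

def e_data(a: int, b: int) -> int:
    """0b00: both subnormal, 0b01: both normal, 0b10: one of each."""
    a_normal = ((a >> 10) & 0x1F) != 0
    b_normal = ((b >> 10) & 0x1F) != 0
    if a_normal and b_normal:
        return 0b01
    if not a_normal and not b_normal:
        return 0b00
    return 0b10
```

For example, `expand(0x3C00)` (the number 1.0) gives `0x7C000`, which `split` breaks into the pair `(0x7, 0xC000)`.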

After expanding the numbers, the correct path is chosen depending on the value of `e_data`. If both numbers are subnormals, they are handled by the `n_subn` block (figure 5). In this case the output exponent is `0` (which is the same for both numbers). The two mantissas are compared and the larger one is placed on `MA` and the other on `MB`. If the `A` input is larger than `B`, then `Comp` is HIGH, otherwise it’s LOW.

Side note: I’ve used a 16-bit magnitude comparator to compare the 2 mantissas, even though the extended mantissas are 15 bits wide. The 16th bit is taken from the exponent, because I know that both numbers are subnormals and thus have an exponent of zero.

If the numbers are both normals or a combination of a normal and a subnormal, they are routed to the `n_normal` block (figure 6 and figure 7). Again both numbers are compared and the output exponent `EO` is set to the larger exponent. `MA` holds the larger number’s mantissa. The smaller mantissa (`Mshft` output from the `comp_exp` component) is shifted to the right based on the difference of the two exponents.

Side note: The `comp_exp` component uses the same 16-bit magnitude comparator as in the previous case. This time, since the exponents are non-zero, I made the 1st bit equal to 0 and the rest of the bits come from the extended mantissa.

After this step, all outputs from `n_subn` and `n_normal` are routed to the `mux_adder` and the final outputs of the preadder are calculated, based on `e_data`.

The outputs of the preadder block are the following:

• `SA` and `SB`: The signs of the two numbers
• `C`: 1-bit flag indicating if number A is larger than number B
• `Eout`: The final exponent, equal to the largest exponent
• `MAout`: The mantissa of the largest number
• `MBout`: The mantissa of the smallest number
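A behavioural sketch of the whole `preadder`, under stated assumptions, could look like this. The function names are hypothetical, bits shifted out of the smaller mantissa are simply dropped, and a subnormal is treated as having an effective exponent of 1 when computing the shift amount. That last point is my assumption based on how IEEE 754 defines subnormals; the circuit may handle it differently.

```python
def fields(x: int) -> tuple:
    """Return (sign, exponent, 15-bit extended mantissa) of a 16-bit float."""
    exp = (x >> 10) & 0x1F
    mant = ((1 if exp else 0) << 14) | ((x & 0x3FF) << 4)
    return (x >> 15) & 1, exp, mant

def preadder(a: int, b: int) -> tuple:
    """Return (SA, SB, C, Eout, MAout, MBout) for two regular numbers."""
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    c = 1 if (ea, ma) > (eb, mb) else 0      # 1 only if |A| is strictly larger than |B|
    if c:
        e_big, m_big, e_small, m_small = ea, ma, eb, mb
    else:
        e_big, m_big, e_small, m_small = eb, mb, ea, ma
    # Assumption: exponent 0 (subnormal) has the same scale as exponent 1.
    shift = max(e_big, 1) - max(e_small, 1)
    return sa, sb, c, e_big, m_big, m_small >> shift
```

For `1.0 + 0.5` (`0x3C00` and `0x3800`) this gives `C = 1`, `Eout = 15`, `MAout = 0x4000` and `MBout = 0x2000`, i.e. the mantissa of 0.5 shifted one position to the right.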

## Step 3: Addition

We now have both numbers in the order and the format required to perform the selected operation. Since both exponents are equal, the only thing required is to add/subtract the mantissas. This is handled by the `block_adder` component (figure 8). The `block_adder` takes as input the two signs, `SA` and `SB`, the two mantissas `A` and `B`, the selected operation (0 for addition, 1 for subtraction) and the `Comp` flag which indicates if `A` is larger than `B`.

The `signout` block (figure 9) is responsible for calculating the sign of the result, the true operation to be performed between the two inputs and for reordering the inputs in case the true operation differs from the selected operation. The true operation is the actual operation to be performed between the two inputs. For example, if we subtract a negative number from a positive number, the true operation is addition, which differs from the selected subtraction.
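The sign and true-operation logic can be sketched with the usual sign-magnitude rules. This is the textbook formulation, not a transcription of figure 9; the function name and argument order are assumptions, and the reordering of the inputs is left out.

```python
def signout(sa: int, sb: int, fsel: int, comp: int) -> tuple:
    """Return (sign of the result, true operation), with 0 = add and 1 = subtract."""
    sb_eff = sb ^ fsel              # sign of B after applying the selected operation
    true_op = sa ^ sb_eff           # equal effective signs add, different ones subtract
    if true_op == 0:
        sign = sa                   # both operands point the same way
    else:
        # The larger magnitude decides the sign. Equal magnitudes are not
        # treated specially in this sketch.
        sign = sa if comp else sb_eff
    return sign, true_op
```

Subtracting a negative number from a positive one, `signout(0, 1, 1, 1)`, returns `(0, 0)`: a positive result obtained by a true addition. For `1 - 2`, `signout(0, 0, 1, 0)` returns `(1, 1)`: a negative result obtained by a true subtraction.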

The final addition/subtraction is performed using a trimmed 16-bit CLA. The circuit is similar to the one presented in the previous article, with the only difference being that the 16th bit of the result is used as the carry out of the addition and the sum is only 15 bits wide.
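In software terms, the trimmed adder behaves roughly like the sketch below. The function name `add_sub_15` is hypothetical, and the sketch assumes that for a subtraction the first operand is the larger mantissa, so the result never goes negative.

```python
def add_sub_15(ma: int, mb: int, true_op: int) -> tuple:
    """Add (true_op = 0) or subtract (true_op = 1) two 15-bit mantissas.

    Returns (Co, sum): bit 15 of the 16-bit result acts as the carry out,
    the lower 15 bits are the mantissa of the result. Assumes ma >= mb
    when subtracting.
    """
    raw = (ma - mb if true_op else ma + mb) & 0xFFFF
    return (raw >> 15) & 1, raw & 0x7FFF
```

Adding two mantissas of `1.0`, `add_sub_15(0x4000, 0x4000, 0)`, returns `(1, 0)`: the sum overflowed into the carry. `add_sub_15(0x4000, 0x2000, 1)` returns `(0, 0x2000)`.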

## Step 4: Normalization

We now have all the parts necessary to reconstruct the result. Figure 10 shows the normalization block (`norm_vector` from the paper).

It might seem a bit more complicated than necessary, so I’ll try to describe the circuit in pseudocode below. The job of this component is to normalize the mantissa and correct the exponent accordingly. It gets as inputs the sign of the result (`SS`), the exponent (`ES`), the mantissa (`MS`) and the carry out of the addition (`Co`). Note that this part is very different from the one found in the paper. I probably didn’t understand the VHDL correctly, and my attempts to map it directly to a circuit ended up producing invalid results. The circuit shown in figure 10 seems to behave correctly in all tested cases.

There are several cases to consider when normalizing the mantissa. The code below shows all of them in pseudocode (assume a combined 16-bit number `ZX.YYYYYYYYYYYYYY`, where `Z = Co` and `X = MSB(M)`)

``````
if Z == 1 then
    E' = E + 1
    M' = M >> 1          // M' = 1.XYYYYYYYYYYYYY
else
    if X == 1 then
        if E == 0 then   // subnormal + subnormal = normal
            E' = 1
            M' = M       // M' = 1.YYYYYYYYYYYYYY
        else
            E' = E
            M' = M       // M' = 1.YYYYYYYYYYYYYY
        endif
    else
        if E < lz or lz == 15 then
            E' = 0
            M' = M << E  // Shift the mantissa to the left until the exponent becomes zero
        else
            E' = E - lz
            M' = M << lz // Shift the mantissa to the left until a 1 appears on the MSB
        endif
    endif
endif
``````

Hope it’s a bit more clear than the circuit :)

There are 2 special cases to consider. When adding two subnormal numbers, the result might end up being a normal (e.g. `0x0200 + 0x0200 = 0x0400`, which is the smallest normal). In this case `Z = Co = 0`, `X = MSB(M) = 1` and `E = 0`. The mantissa is already normalized (has a one at the MSB), so the exponent needs to be incremented to 1. If it’s left at 0, the result will be wrong when rounding the mantissa (see below).

The other case is when both `X` and `Z` are zero. In this case we have to shift the mantissa to the left until it becomes normalized or until the exponent gets to 0. The latter can happen if the number of leading zeroes in the mantissa is larger than the exponent or if the mantissa is 0 (`lz = 15`). If the mantissa is zero at this point, it means that the result is 0 independently of the exponent (e.g. `0x4000 - 0x4000 = 0` even though `ES = 0x10`).

Side note: The `leading_zeroes` component has been implemented using a 32k x 4-bit ROM.

Other than those 2 cases, the rest should be easily derived from any explanation on floating point addition.
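For readers who prefer something executable, the normalization pseudocode above can be transcribed to Python more or less line by line. The sketch below mirrors the pseudocode only: it is not derived from the circuit in figure 10, the function names are my own, and `leading_zeroes` is computed arithmetically instead of with a ROM lookup. An exponent overflow after `E + 1` is not handled, just as in the pseudocode.

```python
MANT_BITS = 15                       # implicit bit + 10 mantissa bits + 4 guard bits
MANT_MASK = (1 << MANT_BITS) - 1

def leading_zeroes(m: int) -> int:
    """Number of leading zeroes of a 15-bit value (15 when m is zero)."""
    return MANT_BITS - m.bit_length()

def normalize(e: int, m: int, co: int) -> tuple:
    """Return (E', M') for exponent e, 15-bit mantissa m and carry out co."""
    z = co & 1
    x = (m >> (MANT_BITS - 1)) & 1   # MSB of the mantissa
    if z:
        # Shift the combined 16-bit value ZX.YYY... one position to the right.
        return e + 1, ((z << MANT_BITS) | m) >> 1
    if x:
        # Already normalized; subnormal + subnormal may have become a normal.
        return (1 if e == 0 else e), m
    lz = leading_zeroes(m)
    if e < lz or lz == MANT_BITS:
        return 0, (m << e) & MANT_MASK
    return e - lz, (m << lz) & MANT_MASK
```

The two special cases from the text come out as expected: `normalize(0, 0x4000, 0)` returns `(1, 0x4000)` for `0x0200 + 0x0200`, and `normalize(16, 0, 0)` returns `(0, 0)` for `0x4000 - 0x4000`.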

The final step after calculating `E'` and `M'` is to round the mantissa. This involves ignoring the implicit bit (15th bit) and rounding the guard bits. In order to keep things as simple as possible I decided to just cut off the guard bits, which is effectively rounding towards zero.
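A minimal sketch of this truncating step, assuming the 15-bit mantissa layout used throughout this post (implicit bit, 10 mantissa bits, 4 guard bits) and a hypothetical `pack` helper that reassembles the 16-bit result:

```python
def pack(sign: int, e: int, m: int) -> int:
    """Build the 16-bit result from the sign, E' and the 15-bit M'."""
    frac = (m & 0x3FFF) >> 4         # drop the implicit bit, cut off the 4 guard bits
    return ((sign & 1) << 15) | ((e & 0x1F) << 10) | frac
```

For the subnormal example above, `pack(0, 1, 0x4000)` gives `0x0400`, the smallest normal. Any non-zero guard bits are simply discarded: `pack(0, 15, 0x600F)` gives `0x3E00`.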

The final floating point adder circuit is shown in figure 11.