Decimal Points

Decimal points separate whole numbers from their fractional part.

Consider a binary number as 1001 . 1010 the first part, 1001 is equivalent to 9 whilst the second part is equal to 0.625 (0.5 + 0.125), making the whole number 9.625

Converting denary to binary

11.75 → 11 | 0.75 11 = 1011 0.75 = 0.11

11.75 → 1011.11

Fixed point binary system:

It can be noted that using this method, only certain fractional parts can be represented.
In other words, only the fractional numbers that can be written as a sum of those numbers specified in the table can be converted to binary using this method accurately.
Consider an 8-bit register, using two bits for the fractional part means there will be only 6 bits to store the whole number.

17.5

→ 17 | 0.5

16 8 4 2 1 1 0 0 0 1

0.5 0.1

10001.10

(correct, use longer word length)

-17.5 01110.01

32.015625 32 | 0.015625

32 16 8 4 2 1 1 0 0 0 0 0

0.15625

0.5 → 0.25 → 0.125 → 0.0625 → 0.03125 →0.015625 0.00001

100000.00001

Floating-point numbers

1.3x10

^7 is an exponent

1.3 is the mantissa

1234 = 0.1234 x 10

Step 1 ) Find exponent and check sign bit

Step 2 ) Modulate the mantissa by the exponent

Mantissa is the actual number The exponent is the modifier to be applied to the mantissa.

FPB: 001010.01

32 16 8 4 2 1 . 1/2 1/4

10.25

1001.0010

17.125 (Incorrect, negative due to sign bit)

-17.5 to bin

1001.1

0.111 010

Exponent = 2 011.1010 3 + 1/2 + 1/8 = 3.625 Mantissa = 3.625

0.110111 0100 Exponent = 4 Mantissa =

-32 16 8 4 2 1 1 1 0 1 1 1

-32 + 16 + 4 + 2 +1 = -9

Act: 13.75

Rounding Errors

Binary hits some issues when we deal with some numbers that aren’t factors of 2.
It can’t be done accurately, we have to settle on being as close as possible.

Absolute Error

The difference between the number you’re trying to calculate and the number you’ve managed to store.

Relative Error

Essentially the same as a relative error, but it is expressed as a percentage.
It gives us an idea of the scale of the error that we’re dealing with and how much we can trust the numbers.
Divide the absolute error by the number that we’re trying to represent, and times the result by 100 to get a percentage.
So 0% is entirely accurate, and 100% is completely inaccurate.
The closer the percentage value is to 0, the greater the accuracy of the number.

Target Number: 25000 Absolute Error: 0.5 Relative Error: 0.002%

Target Number: 100 Absolute Error: 0.5 Relative Error: 0.5%

Target Number: 5000 Absolute Error: 0.5 Relative Error: 0.01%

Target Number: 10 Absolute Error: 0.00000005 Relative Error: 0.0000005%

FLOP to Denary

0.010 0110

Exponent: 6

0.010

0.100 1.0000 10 100 1000 10000 16 8 4 2 1

0.1101 011

Exponent: 3

0110.1 4 2. 1/2

6.5

Normalisation

The process of improving the accuracy of a number with decimal points.

46321 (denary) can be represented as 0.46321x10^4 0.00463x10^6 ( less accurate)

Out of these representations, the first is the most precise.

This same logic can be applied to binary numbers.

An unormalised positive number consists of a sign bit (0) and one or more zeros after the decimal point.

For a normalised positive binary number, the sign bit is 0 and the bit after the sign bit is always 1.

To normalise a binary number, there must be a 1 after the decimal point. The mantissa may be lessened and the exponent will be directly increased.

To normalised a negative binary number, the first bit of both the mantissa and the exponent is 1.

The exponent may also be lessened and the mantissa increased.

01 10

- left
- right

3.5 = 00011.100

0.1110 010

Standard Form

In standard form, binary numbers must start with:

Pos: 0.1 Neg: 1.0

Arithmetic Operations

If you need to add two numbers together, they must be converted into the same form. “ So 4.63x10^6 cannot be added to 4.63x10^8 until the power is normalised.

Underflow error

An underflow error occurs when a number is too small for your register to hold. “There isn’t enough space to store a number with any accuracy, so 0 is stored.”

‎‎

📚 Seth Maurice-Brant

Explorer