A binary point separates the whole part of a number from its fractional part, just as a decimal point does in denary.
Consider the binary number 1001.1010. The first part, 1001, is equivalent to 9, whilst the second part, .1010, is equal to 0.625 (0.5 + 0.125), making the whole number 9.625.
Converting denary to binary
11.75 → 11 | 0.75
11 = 1011
0.75 = 0.11
11.75 → 1011.11
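The split-and-convert method above can be sketched in Python. This is a minimal illustration, not a definitive routine: the name `to_binary_fixed` and the `frac_bits` parameter are assumptions made for this example, it handles non-negative numbers only, and any fractional bits beyond `frac_bits` are simply cut off.

```python
def to_binary_fixed(value, frac_bits=4):
    """Convert a non-negative denary number to a binary string with a binary point."""
    whole = int(value)                # part before the point
    frac = value - whole              # part after the point
    whole_bits = bin(whole)[2:]       # e.g. 11 -> '1011'
    frac_digits = []
    for _ in range(frac_bits):
        frac *= 2                     # doubling brings the next bit in front of the point
        bit = int(frac)
        frac_digits.append(str(bit))
        frac -= bit
    return whole_bits + "." + "".join(frac_digits)


print(to_binary_fixed(11.75, 2))   # 1011.11
print(to_binary_fixed(9.625, 4))   # 1001.1010
```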
Fixed point binary system:
- Using this method, only certain fractional parts can be represented exactly.
- In other words, only fractions that can be written as a sum of the fractional place values (1/2, 1/4, 1/8, ...) convert to binary accurately with this method.
- Consider an 8-bit register: using two bits for the fractional part leaves only 6 bits to store the whole number.
17.5 → 17 | 0.5
16  8  4  2  1
 1  0  0  0  1
0.5 = 0.1
17.5 = 10001.10
(correct, use longer word length)
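-17.5: write +17.5 as 010001.10, invert the bits (101110.01) and add 1 to the last bit → 101110.10
Check: -32 + 8 + 4 + 2 + 1/2 = -17.5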
32.015625 → 32 | 0.015625
32 16  8  4  2  1
 1  0  0  0  0  0
0.015625:
0.5 → 0.25 → 0.125 → 0.0625 → 0.03125 → 0.015625, so the 1 falls in the sixth fractional place: 0.000001
32.015625 = 100000.000001
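The effect of a fixed register size can be sketched as follows. This example assumes two's complement for negative values and the 8-bit layout mentioned above (6 whole-number bits including the sign bit, 2 fractional bits). The function name `encode_fixed` and its parameters are made up for illustration.

```python
def encode_fixed(value, int_bits=6, frac_bits=2):
    """Encode `value` as two's complement fixed point, or raise ValueError."""
    total_bits = int_bits + frac_bits
    scaled = value * 2 ** frac_bits            # move the binary point to the right-hand end
    if scaled != int(scaled):
        raise ValueError(f"{value} needs more than {frac_bits} fractional bits")
    scaled = int(scaled)
    if not -(2 ** (total_bits - 1)) <= scaled < 2 ** (total_bits - 1):
        raise ValueError(f"{value} does not fit in {int_bits} whole-number bits")
    # The modulo wraps negative values into their two's complement bit pattern
    pattern = format(scaled % 2 ** total_bits, f"0{total_bits}b")
    return pattern[:int_bits] + "." + pattern[int_bits:]


print(encode_fixed(17.5))               # 010001.10
print(encode_fixed(-17.5))              # 101110.10
print(encode_fixed(32.015625, 7, 6))    # 0100000.000001 (needs a longer word)
```

With the default 8-bit layout, `encode_fixed(32.015625)` raises `ValueError`, because 1/64 cannot be written with only two fractional bits.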
Floating-point numbers
1.3 × 10^7
7 is the exponent
1.3 is the mantissa
1234 = 0.1234 × 10^4
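Step 1) Find the exponent and check the sign bit.
Step 2) Move the binary point of the mantissa by the number of places given by the exponent.
The mantissa holds the actual digits of the number. The exponent is the modifier applied to the mantissa: it says how far the point moves.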
FPB: 001010.01
32 16  8  4  2  1 . 1/2 1/4
 0  0  1  0  1  0 .  0   1
8 + 2 + 1/4 = 10.25
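1001.0010
17.125 (incorrect: the leading 1 is the sign bit, so the number is negative)
With place values -8 4 2 1 . 1/2 1/4 1/8 1/16: -8 + 1 + 1/8 = -6.875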
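-17.5 to binary
+17.5 = 010001.1; invert the bits and add 1 to the last bit → 101110.1
Check: -32 + 8 + 4 + 2 + 1/2 = -17.5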
0.111 010
Exponent = 2
Move the binary point two places right: 011.1010
3 + 1/2 + 1/8 = 3.625
Value = 3.625
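0.110111 0100
Exponent = 0100 = 4
Incorrect attempt: reading the mantissa bits 110111 under the place values -32 16 8 4 2 1 gives -32 + 16 + 4 + 2 + 1 = -9.
Actual: move the binary point four places right: 01101.11 = 8 + 4 + 1 + 1/2 + 1/4 = 13.75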
Rounding Errors
- Binary runs into problems with fractions that are not sums of powers of 2 (for example 0.1).
- Such numbers cannot be stored exactly, so we have to settle for the closest value that fits.
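To see this rounding in action, the sketch below finds the closest value that a given number of fractional bits can hold. It uses Python's `fractions.Fraction` so that the arithmetic itself introduces no further error. The helper name `nearest_fixed` is an assumption for this example.

```python
from fractions import Fraction


def nearest_fixed(value, frac_bits):
    """Return the closest value that fits in `frac_bits` fractional bits."""
    step = Fraction(1, 2 ** frac_bits)        # smallest place value available
    return round(Fraction(value) / step) * step


for bits in (2, 4, 8):
    stored = nearest_fixed("0.1", bits)
    print(bits, "bits:", stored, "=", float(stored))
```

With 4 fractional bits the closest value to 0.1 is 1/8 (0.125). With 8 bits it is 13/128 (0.1015625), closer but still not exact. A value such as 0.75 is stored exactly with only 2 bits.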
Absolute Error
- The difference between the number you’re trying to calculate and the number you’ve managed to store.
Relative Error
- Essentially the same as the absolute error, but expressed as a percentage of the target number.
- It gives us an idea of the scale of the error that we’re dealing with and how much we can trust the numbers.
- Divide the absolute error by the number that we’re trying to represent, and multiply the result by 100 to get a percentage.
- So 0% is entirely accurate, and 100% is completely inaccurate.
- The closer the percentage value is to 0, the greater the accuracy of the number.
Target Number: 25000 Absolute Error: 0.5 Relative Error: 0.002%
Target Number: 100 Absolute Error: 0.5 Relative Error: 0.5%
Target Number: 5000 Absolute Error: 0.5 Relative Error: 0.01%
Target Number: 10 Absolute Error: 0.00000005 Relative Error: 0.0000005%
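The two definitions translate directly into code. In this small sketch the function names are illustrative, and the stored values are made up so that they reproduce the absolute errors listed above.

```python
def absolute_error(target, stored):
    """Difference between the intended number and the stored number."""
    return abs(target - stored)


def relative_error_percent(target, stored):
    """Absolute error as a percentage of the intended number."""
    return absolute_error(target, stored) / abs(target) * 100


# Same absolute error (0.5), very different relative errors
for target in (25000, 5000, 100):
    stored = target + 0.5
    print(target, absolute_error(target, stored), relative_error_percent(target, stored))
```

The same absolute error of 0.5 matters far more for a target of 100 than for a target of 25000, which is why the relative error is the better guide to how far a stored number can be trusted.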
FLOP to Denary
0.010 0110
Exponent: 0110 = 6
Mantissa: 0.010
Move the binary point six places right: 0.010 → 0.10 → 1.0 → 10 → 100 → 1000 → 10000
16  8  4  2  1
 1  0  0  0  0
= 16
0.1101 011
Exponent: 011 = 3
Move the binary point three places right: 0110.1
4 + 2 + 1/2 = 6.5
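The worked examples above can be checked with a short sketch. It assumes the mantissa is written with a single sign bit before the point and that both mantissa and exponent are two's complement. The names `twos_complement` and `decode_float` are illustrative only.

```python
def twos_complement(bits):
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":                 # leading 1 means the top place value is negative
        value -= 2 ** len(bits)
    return value


def decode_float(mantissa, exponent):
    """Decode a mantissa such as '0.1101' and an exponent such as '011' to denary."""
    m_bits = mantissa.replace(".", "")
    frac_places = len(m_bits) - 1      # every bit after the sign bit is fractional
    m = twos_complement(m_bits) / 2 ** frac_places
    e = twos_complement(exponent)
    return m * 2 ** e                  # the exponent moves the binary point


print(decode_float("0.010", "0110"))     # 16.0
print(decode_float("0.1101", "011"))     # 6.5
print(decode_float("0.110111", "0100"))  # 13.75
```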
Normalisation
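The process of adjusting the mantissa and exponent of a floating-point number so that it is stored as precisely as the available digits allow.
46321 (denary) can be represented as 0.46321 × 10^5 or as 0.00463 × 10^7 (less accurate: the leading zeros push the last two digits out of the mantissa).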
Out of these representations, the first is the most precise.
This same logic can be applied to binary numbers.
An unnormalised positive number has a sign bit (0) followed by one or more zeros after the binary point.
For a normalised positive binary number, the sign bit is 0 and the bit after the sign bit is always 1.
To normalise a positive binary number, there must be a 1 directly after the binary point. If the mantissa has to be made smaller (point moved left), the exponent is increased by the same number of places.
For a normalised negative binary number, the sign bit is 1 and the bit after the sign bit is always 0.
The exponent may also be lessened and the mantissa increased (point moved right).
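- A normalised mantissa starts with 01 (positive) or 10 (negative).
- Moving the binary point left: the exponent increases.
- Moving the binary point right: the exponent decreases.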
3.5 = 00011.100
Normalised: mantissa 0.1110, exponent 010 (0.1110 × 2^2)
Standard Form
In standard form, binary numbers must start with:
Pos: 0.1 Neg: 1.0
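The normalisation rule can be sketched numerically. A positive mantissa starting 0.1 lies between 0.5 and 1, and a negative mantissa starting 1.0 lies between -1 and -0.5. The sketch below works on values rather than bit patterns, and the function name `normalise` is an assumption for this example.

```python
def normalise(mantissa, exponent):
    """Return (mantissa, exponent) with the same value in normalised form."""
    if mantissa == 0:
        return 0.0, 0                  # zero has no normalised form; convention assumed here
    # Mantissa too large: move the point left and raise the exponent
    while mantissa >= 1 or mantissa < -1:
        mantissa /= 2
        exponent += 1
    # Mantissa too small: move the point right and lower the exponent
    while -0.5 <= mantissa < 0.5:
        mantissa *= 2
        exponent -= 1
    return mantissa, exponent


print(normalise(3.5, 0))      # (0.875, 2)  -> 0.1110 with exponent 010
print(normalise(0.125, 3))    # (0.5, 1)    -> 0.1000 with exponent 001
```

Halving and doubling are exact in binary floating point, so the value `mantissa * 2 ** exponent` is unchanged by the loop.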
Arithmetic Operations
If you need to add two numbers together, they must first be converted into the same form, with the same exponent. So 4.63 × 10^6 cannot be added to 4.63 × 10^8 until the powers are made equal.
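A minimal sketch of this alignment step, using the denary example above. The function name `add_aligned` is hypothetical, and `Fraction` is used so the mantissas stay exact. The result is not re-normalised afterwards.

```python
from fractions import Fraction


def add_aligned(m1, e1, m2, e2, base=10):
    """Add m1*base**e1 and m2*base**e2 by first matching the exponents."""
    # Rewrite the number with the smaller exponent so both share the larger one
    if e1 < e2:
        m1, e1 = m1 / base ** (e2 - e1), e2
    elif e2 < e1:
        m2, e2 = m2 / base ** (e1 - e2), e1
    return m1 + m2, e1


mantissa, exponent = add_aligned(Fraction("4.63"), 6, Fraction("4.63"), 8)
print(float(mantissa), exponent)   # 4.6763 8
```

Here 4.63 × 10^6 is rewritten as 0.0463 × 10^8, after which the mantissas can be added directly.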
Underflow error
An underflow error occurs when a number is too small (too close to zero) for your register to hold. “There isn’t enough space to store a number with any accuracy, so 0 is stored.”
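Underflow can be illustrated with a register that keeps only two fractional bits, so its smallest non-zero step is 0.25. The helper `store_fixed` is a made-up name, and it truncates rather than rounds, which is an assumption for this sketch.

```python
def store_fixed(value, frac_bits=2):
    """Truncate a non-negative value to `frac_bits` fractional bits."""
    step = 2 ** -frac_bits             # smallest value the register can hold
    return int(value / step) * step


print(store_fixed(0.75))   # 0.75
print(store_fixed(0.1))    # 0.0  (smaller than the smallest step, so 0 is stored)

# Python's own floats behave the same way at their lower limit:
print(5e-324 / 2)          # 0.0
```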