# 5. Data representation

## 5.1. What’s the big picture?

Computers are machines that do stuff with information. They let you view, listen, create, and edit information in documents, images, videos, sound, spreadsheets and databases. They let you play games in simulated worlds that don’t really exist except as information inside the computer’s memory and displayed on the screen. They let you compute and calculate with numerical information; they let you send and receive information over networks. Fundamental to all of this is that the computer has to represent that information in some way inside the computer’s memory, as well as storing it on disk or sending it over a network.

To make computers easier to build and keep them reliable, everything is represented using just two values. You may have seen these two values represented as 0 and 1, but on a computer they are represented by anything that can be in two states. For example, in memory a high or low voltage is used to store each 0 or 1. On a magnetic disk it’s stored with, surprisingly, magnetism (whether a tiny spot on the disk is magnetised north or south).

When we write what is stored in a computer on paper, we normally use “0” for one of the states, and “1” for the other state. If a piece of computer memory had the following voltages: “low”, “low”, “high”, “low”, “high”, “high”, “high”, “high”, “low”, “high”, “low”, “low”, we could allocate “0” to “low” and “1” to “high”, and write this sequence down as 001011110100. While this notation is used extensively, and you may often hear the data being referred to as being “0’s and 1’s”, it is important to remember that a computer does not store 0’s and 1’s; it has no way of doing this. It just uses physical mechanisms such as high and low voltage, north or south polarity, and light or dark materials.

Jargon Buster

The use of the two digits 0 and 1 is so common that some of the best known computer jargon is used for them. Since there are only two digits, the system is called binary. The short word for a “binary digit” is made by taking the first two letters and the last letter — a bit is just a digit that can have two values.

Every file you save, every picture you make, every download, is just a whole lot of bits. Computer scientists don’t spend a lot of time reading bits themselves, but knowing how they are stored is really important because it affects the amount of space that data will use, the amount of time it takes to send the data to a friend (as data that takes more space takes longer to send!) and the quality of what is being stored. You may have come across things like “24-bit colour”, “128-bit encryption”, “32-bit IPv4 addresses” or “8-bit ASCII”. Understanding what the bits are doing enables you to work out how much space will be required to get high-quality colour, hard-to-crack secret codes, a unique ID for every device in the world, or text that uses more characters than the usual English alphabet.

This chapter is about some of the different methods that computers use to code different kinds of information in patterns of these bits, and how this affects the cost and quality of what we do on the computer, or even if something is feasible at all.

## 5.2. Getting Started

More than 200 years ago a 15-year-old French boy invented a system for representing text using combinations of flat and raised dots on paper so that they could be read by touch. The system became very popular with people who had visual impairment as it provided a relatively fast and reliable way to “read” text without seeing it. Louis Braille’s system is an early example of a “binary” representation of data — there are only two symbols (raised and flat), and yet combinations of them can be used to represent reference books and works of literature. Each character in braille is represented with a cell of 6 dots. Each dot can either be raised or not raised. Different numbers and letters can be made by using different patterns of raised and not raised dots. Let’s work out how many different patterns can be made using the 6 dots in a Braille character. When working through the material in this section, a good way to draw braille on paper without having to actually make raised dots is to draw a rectangle with 6 small circles in it, and to colour in the circles that are raised, and not colour in the ones that aren’t raised.

If braille used only 2 dots, there would be 4 patterns, and with 3 dots there would be 8 patterns. You may have noticed that there are twice as many patterns with 3 dots as there are with 2 dots. It turns out that every time you add an extra dot, that gives twice as many patterns (why?), so with 4 dots there are 16 patterns, 5 dots has 32 patterns, and 6 dots has 64 patterns.

So, Braille can make 64 patterns. That’s enough for all the letters of the alphabet, and other symbols too, such as digits and punctuation.
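The doubling can be checked by listing every pattern. The following is a minimal Python sketch, not part of the original material; the function name `dot_patterns` is made up for this illustration, and a raised dot is written as 1 and a flat dot as 0.

```python
from itertools import product

def dot_patterns(num_dots):
    """Return every pattern of raised (1) and flat (0) dots as a list of tuples."""
    return list(product([0, 1], repeat=num_dots))

for n in range(1, 7):
    # Each extra dot doubles the count, because every existing pattern can be
    # extended with either a flat dot or a raised dot.
    print(n, "dots:", len(dot_patterns(n)), "patterns")
```

Running it for 1 to 6 dots shows the count doubling each time, ending at 64.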

Braille also illustrates why binary representation is so popular. It would be possible to have three kinds of dot: flat, half raised, and raised. A skilled braille reader could distinguish them, and with three values per dot, you would only need 4 dots to represent 64 patterns. The trouble is that you would need more accurate devices to create the dots, and people would need to be more accurate at sensing them. If a page was squashed, even very slightly, it could leave the information unreadable.

Digital devices almost always use two values (binary) for similar reasons: computer disks and memory can be made cheaper and smaller if they only need to be able to distinguish between two extreme values (such as a high and low voltage), rather than fine-grained distinctions between very subtle differences in voltages. Arithmetic is also easy with binary values; if you have only two digits (0 and 1), then there aren’t many rules to learn - adding digits only requires circuits to calculate 0+0, 0+1, 1+0 and 1+1. You might like to work out how many combinations of decimal digits you need to be able to add if you’re doing conventional arithmetic!

In fact, every kind of file on a computer is represented using just a whole lot of binary digits — text, pictures, spreadsheets, web pages, songs — everything is stored using just two values. Even the programs (apps) that you run use binary representation — sometimes a program file that the computer can run is referred to as a “binary file”, which is a bit odd since every file on a computer is binary!

## 5.3. Representing text with bits

We saw above that 64 unique patterns can be made using 6 dots in Braille. Count how many different upper-case letters, lower-case letters, numbers, and symbols that you could insert into a text editor using your keyboard. (Don’t forget to count both of the symbols that share the number keys, and the symbols to the side that are for punctuation!) The collective name for these is characters e.g. a, D, 1, h, 6, *, ], and ~ are all characters.

Would 6 dots (which can represent 64 patterns) be enough to represent all these characters? If you counted correctly, you should find that there were more than 64 characters! How many bits would you need to be able to represent all the characters you counted on your keyboard?

It turns out that 7 bits is enough, as this gives 128 possible patterns, and this is exactly what the ASCII code for text does. ASCII is one of the main systems that computers use to represent English text. It was first used commercially in 1963, and despite the big changes in computers since then, it is still the basis of how English text is stored on computers.

ASCII assigned a different pattern of bits to each of the characters, along with a few other “control” characters that you don’t need to worry about yet. For reasons that we will get to later, each pattern in ASCII is usually stored in 8 bits, with one wasted bit, rather than 7 bits. However, the first bit in each 8-bit pattern is a 0, meaning there are still only 128 possible patterns.
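One way to see these 8-bit patterns is to ask a programming language for them. The following is a minimal sketch, assuming Python 3; the function names `to_ascii_bits` and `from_ascii_bits` are made up for this illustration.

```python
def to_ascii_bits(text):
    """Return the 8-bit ASCII pattern for each character, separated by spaces."""
    patterns = []
    for ch in text:
        code = ord(ch)  # the number that identifies this character
        if code > 127:
            raise ValueError(f"{ch!r} is not an ASCII character")
        # Write the number in binary, padded with leading 0s to 8 bits.
        patterns.append(format(code, "08b"))
    return " ".join(patterns)

def from_ascii_bits(bits):
    """Decode a space-separated string of 8-bit patterns back into text."""
    return "".join(chr(int(pattern, 2)) for pattern in bits.split())

print(to_ascii_bits("c"))           # 01100011
print(from_ascii_bits("01100011"))  # c
```

Because every pattern here starts with the padding 0, only 128 of the 256 possible 8-bit patterns are ever produced.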

Below is a table that shows the patterns of bits that ASCII uses for each of the characters.

| Binary | Char | Binary | Char | Binary | Char |
|---|---|---|---|---|---|
| 0100000 | Space | 1000000 | @ | 1100000 | \` |
| 0100001 | ! | 1000001 | A | 1100001 | a |
| 0100010 | " | 1000010 | B | 1100010 | b |
| 0100011 | # | 1000011 | C | 1100011 | c |
| 0100100 | $ | 1000100 | D | 1100100 | d |
| 0100101 | % | 1000101 | E | 1100101 | e |
| 0100110 | & | 1000110 | F | 1100110 | f |
| 0100111 | ' | 1000111 | G | 1100111 | g |
| 0101000 | ( | 1001000 | H | 1101000 | h |
| 0101001 | ) | 1001001 | I | 1101001 | i |
| 0101010 | * | 1001010 | J | 1101010 | j |
| 0101011 | + | 1001011 | K | 1101011 | k |
| 0101100 | , | 1001100 | L | 1101100 | l |
| 0101101 | - | 1001101 | M | 1101101 | m |
| 0101110 | . | 1001110 | N | 1101110 | n |
| 0101111 | / | 1001111 | O | 1101111 | o |
| 0110000 | 0 | 1010000 | P | 1110000 | p |
| 0110001 | 1 | 1010001 | Q | 1110001 | q |
| 0110010 | 2 | 1010010 | R | 1110010 | r |
| 0110011 | 3 | 1010011 | S | 1110011 | s |
| 0110100 | 4 | 1010100 | T | 1110100 | t |
| 0110101 | 5 | 1010101 | U | 1110101 | u |
| 0110110 | 6 | 1010110 | V | 1110110 | v |
| 0110111 | 7 | 1010111 | W | 1110111 | w |
| 0111000 | 8 | 1011000 | X | 1111000 | x |
| 0111001 | 9 | 1011001 | Y | 1111001 | y |
| 0111010 | : | 1011010 | Z | 1111010 | z |
| 0111011 | ; | 1011011 | [ | 1111011 | { |
| 0111100 | < | 1011100 | \\ | 1111100 | \| |
| 0111101 | = | 1011101 | ] | 1111101 | } |
| 0111110 | > | 1011110 | ^ | 1111110 | ~ |
| 0111111 | ? | 1011111 | _ | 1111111 | DEL |

For example, the letter c (lower-case) in the table has the pattern “01100011” (the 0 at the front is just extra padding to make it up to 8 bits). The letter o has the pattern “01101111”. You could write a word out using this code, and if you give it to someone else, they should be able to decode it exactly.

Computers can represent pieces of text with sequences of these patterns, much like Braille does. For example, the word “computers” (all lower-case) would be 01100011 01101111 01101101 01110000 01110101 01110100 01100101 01110010 01110011.

How would you represent the word “science” in ASCII? What about “Wellington” (note that it starts with an upper-case “W”)? How would you represent “358” in ASCII (it is three characters, even though it looks like a number)? What about the sentence “Hello, how are you?” (look for the comma, question mark, and space characters in the ASCII table).

Curiosity

If you only wanted to represent the 26 letters of the alphabet, and weren’t worried about upper-case or lower-case, you could get away with using just 5 bits, which allows for up to 32 different patterns.
Have a look at the last 5 bits of each of the 26 lower-case letters in ASCII. Do any of the 26 lower-case letters have the same last 5 bits? Have a look at the 26 upper-case letters. Do any of the upper-case letters have the same last 5 bits?

You may have noticed that none of the lower-case letters have the same last 5 bits, but they do have the same last 5 bits as their corresponding upper-case letter! For example, a = 1100001 and A = 1000001; they both have 00001 as their last 5 bits. As another example, s = 1110011 and S = 1010011; they both have 10011 as their last 5 bits.

An easy way to allocate patterns in this 5 bit system would be to just use the last 5 bits for each character in the ASCII table. Therefore a would be 00001, b would be 00010, c would be 00011, etc.

The word “water” would be 10111 00001 10100 00101 10010.

There is also an activity that uses five-bit text codes hidden in music.

English text can easily be represented using ASCII, but what about languages such as Chinese where there are thousands of different characters? The 128 patterns aren’t nearly enough to represent such languages! That’s where codes that use more than 7 bits become important, and in a later section we’ll look at these, but first we need to explore binary number representation and develop some efficient ways to talk about longer binary numbers.

Curiosity

The name “ASCII” stands for “American Standard Code for Information Interchange”, which was a particular way of assigning bit patterns to the characters on a typewriter. The ASCII system even includes “characters” for ringing a bell (useful for getting attention on old telegraph systems), deleting the previous character (kind of an early “undo”), and “end of transmission” (to let the receiver know that the message was finished). These days those characters are rarely used, but the codes for them still exist (they are the missing patterns in the table above).
Nowadays ASCII has been supplanted by a code called “UTF-8”, which happens to be the same as ASCII if the extra left-hand bit is a 0, but opens up a huge range of characters if the left-hand bit is a 1. There are several other codes that were popular before ASCII, including the Baudot code and EBCDIC. A widely used variant of the Baudot code was the “Murray code”, named after New Zealand born inventor Donald Murray. One of Murray’s significant improvements was to introduce the idea of “control characters”, such as the carriage return (new line). The “control” key still exists on modern keyboards.

## 5.4. Representing numbers with bits

The number system that humans normally use is in base 10 (also known as decimal). It’s worth revising quickly, because binary numbers use the same ideas as decimal numbers, just with fewer digits!

In decimal, the value of each digit in a number depends on its place in the number. For example, in the amount $123, the 3 represents $3, whereas the 1 represents $100. Each place value in a number is worth 10 times more than the place value to its right, i.e. there are the “ones”, the “tens”, the “hundreds”, the “thousands”, the “ten thousands”, the “hundred thousands”, the “millions”, etc. Also, there are 10 different digits (0,1,2,3,4,5,6,7,8,9) that can be at each of those place values.

If you were only able to use one digit to represent a number, then the largest number would be 9. After that, you need a second digit, which goes to the left, giving you the next ten numbers (10, 11, 12... 19). It’s because we have 10 digits that each one is worth 10 times as much as the one to its right.

You may have encountered different ways of expressing numbers using “expanded form”. For example, if you want to write the number 90328 in expanded form you might have written it as:

90328 = 90000 + 300 + 20 + 8

A more sophisticated way of writing it is:

90328 = (9 x 10000) + (0 x 1000) + (3 x 100) + (2 x 10) + (8 x 1)

If you’ve learnt about exponents, you could write it as $90328 = (9 \times 10^4) + (0 \times 10^3) + (3 \times 10^2) + (2 \times 10^1) + (8 \times 10^0)$

Remember that any number to the power of 0 is 1, i.e. the $8 \times 10^0$ is 8, because $10^0$ is 1.

The key ideas to notice from this are that the digit on the right (such as the 8 in 90328) is the one that’s worth the least, and that because we have 10 digits, each place is worth 10 times as much as the one to the right (e.g. the 2 in 90328 is the number of tens, the 3 is the number of 100s, and so on). Exactly the same happens with binary numbers.
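The expanded form can also be produced mechanically. This is a minimal Python sketch, not part of the original material; the function name `expanded_form` and its `base` parameter are made up for this illustration.

```python
def expanded_form(number, base=10):
    """Return (digit, place value) pairs, most significant digit first.

    Works for positive whole numbers; 0 gives an empty list.
    """
    digits = []
    while number > 0:
        # divmod peels off the rightmost digit, which is worth the least.
        number, digit = divmod(number, base)
        digits.append(digit)
    digits.reverse()
    highest = len(digits) - 1
    # Each place is worth `base` times as much as the one to its right.
    return [(d, base ** (highest - i)) for i, d in enumerate(digits)]

for digit, place in expanded_form(90328):
    print(f"{digit} x {place}")
```

Calling the same function with `base=2` gives the place values used for binary numbers.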

### 5.4.1. Binary numbers

As discussed earlier, computers can only store information using bits, which only have 2 possible states. This means that they cannot represent base 10 numbers using digits 0 to 9, the way we write down numbers in decimal; instead, they use a base 2 number system, also called binary.

Curiosity

The base 10 (decimal) system is sometimes called denary, which is more consistent with the name binary for the base 2 system. The word “denary” also refers to the Roman denarius coin, which was worth ten asses (an “as” was a copper or bronze coin).

Because binary is base 2, there are only 2 possible digits (0 and 1), as opposed to the 10 in our standard number system, and each place value is 2 times bigger than the one to its right (in contrast to our base 10 number system where each place is 10 times bigger).

The interactive below illustrates how this binary number system represents decimal numbers. Have a play around with it to see what patterns you can see. The decimal (base 10) representation for the binary number currently shown is given by the interactive on the far right.

To ensure you are understanding correctly how to use the interactive, verify that when you enter the binary number 101101 it shows that the decimal representation is 45, that when you enter 100000 it shows that the decimal representation is 32, and when you enter 001010 it shows the decimal representation is 10.

You should try using the interactive to convert a decimal number to binary. For example, choose a number less than 61 (perhaps your house number, a friend’s age, or the day of the month you were born on), set all the binary digits to zero, and then start with the left-most digit (32), trying out if it should be zero or one. See if you can find a method for converting the number without too much trial and error.

Can you figure out the binary representation for 23 without using the interactive? What about 4, 0, and 32? Check all your answers using the interactive to verify they are correct.

What is the largest number you can make with this binary interactive? What is the smallest? Is there any integer value in between the biggest and the smallest that you can’t make? Are there any numbers with more than one representation? Why or why not?

You have probably noticed from the interactive that when set to 1, the leftmost bit (the “most significant bit”) adds 32 to the total, the next adds 16, and then the rest add 8, 4, 2, and 1 respectively. When set to 0, a bit does not add anything to the total. So the idea is to make numbers by adding some or all of 32, 16, 8, 4, 2, and 1 together, and each of those numbers can only be included once.
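Adding up place values in this way can be written as a short program. This is a minimal Python sketch, not part of the original material; the function name `binary_to_decimal` is made up for this illustration, and the bits are given as a string of 0s and 1s.

```python
def binary_to_decimal(bits):
    """Add up the place values of the bits that are set to 1."""
    total = 0
    place_value = 1  # the rightmost bit is worth 1
    for bit in reversed(bits):
        if bit == "1":
            total += place_value
        place_value *= 2  # each place is worth twice the one to its right
    return total

print(binary_to_decimal("101101"))  # 45
print(binary_to_decimal("001010"))  # 10
```

Python's built-in `int("101101", 2)` does the same conversion; the loop is spelled out here only to show the place values at work.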

Rather than just using trial and error to figure out what a decimal number is in binary, could you figure out a systematic approach? Have a look at what 100000 is in decimal. What about 011111? Is it possible to make a number of 32 or more if the most significant bit is set to a 0? Why? And what about 001000 and 000111? Can you see a pattern that would lead to a systematic way of converting decimal numbers to binary? Hint: start with deciding the leftmost bit, and then work along to the right, bit by bit.
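If you want to check your method once you have worked one out, the hint can be followed literally in code. This is one possible sketch in Python, not the only systematic approach; the function name `decimal_to_binary` and the `num_bits` parameter are made up for this illustration.

```python
def decimal_to_binary(number, num_bits=6):
    """Decide each bit from left to right, as the hint suggests."""
    if not 0 <= number < 2 ** num_bits:
        raise ValueError(f"{number} does not fit in {num_bits} bits")
    bits = ""
    for position in range(num_bits - 1, -1, -1):
        place_value = 2 ** position  # 32, 16, 8, 4, 2, 1 for 6 bits
        if number >= place_value:
            # This place value fits into what is left, so the bit is 1.
            bits += "1"
            number -= place_value
        else:
            bits += "0"
    return bits

print(decimal_to_binary(45))  # 101101
```

The bit must be 1 whenever the place value fits, because all the bits to its right together add up to one less than that place value.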

So what happens if we have fewer than 6 bits? For example, with 5 bits, the place values would be 16, 8, 4, 2 and 1, so the largest value is 11111 in binary, or 31 in decimal. What’s the largest value you can store with 4 bits? 3 bits?

What would happen if we have 7 bits instead of 6? The seventh bit would have a value of 64, and it would be possible to store numbers up to 127.

Extra for Experts

Can you figure out a systematic approach to counting in binary? i.e. start with the number 0, then increment it to 1, then 2, then 3, etc, all the way up to the highest number that can be made with the 7 bits. Try counting from 0 to 16, and see if you can detect a pattern. Hint: Think about how you add 1 to a number in base 10. e.g. how do you work out 7 + 1, 38 + 1, 19 + 1, 99 + 1, 230899999 + 1, etc? Can you apply that same idea to binary?

Using your new knowledge of the binary number system, can you figure out a way to count to higher than 10 using your 10 fingers? What is the highest number you can represent using your 10 fingers? What if you included your 10 toes as well (so you have 20 fingers and toes to count with)?
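Once you have tried counting by hand, the hint about adding 1 can be compared with the following sketch. It is a minimal Python illustration, not part of the original material; the function name `increment` is made up, and the number is held as a fixed-width string of bits.

```python
def increment(bits):
    """Add 1 to a fixed-width binary string, working from the right."""
    digits = list(bits)
    for i in range(len(digits) - 1, -1, -1):
        if digits[i] == "0":
            digits[i] = "1"  # no carry needed, so we are finished
            return "".join(digits)
        digits[i] = "0"      # 1 + 1 = 10: write 0 and carry to the left
    raise OverflowError("ran out of bits")

value = "00000"
for _ in range(16):
    print(value)
    value = increment(value)
print(value)  # 10000, which is 16
```

This mirrors adding 1 in decimal: a 9 rolls over to 0 and carries, just as a 1 rolls over to 0 and carries in binary.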

An important concept with binary numbers is the range of values that can be represented using a given number of bits. One bit on its own might not seem very useful, but it’s enough to store things like the state of a checkbox (checked or not checked). When we have 8 bits the binary numbers start to get useful — they can represent values from 0 to 255, so it is enough to store someone’s age, the day of the month, and so on.

Jargon Buster

Groups of 8 bits are so useful that they have their own name: a byte. Computer memory and disk space is usually divided up into bytes, and bigger values are stored using more than one byte. For example, two bytes (16 bits) are enough to store numbers from 0 to 65,535. Four bytes (32 bits) can store numbers up to 4,294,967,295. You can check these numbers by working out the place values of the bits. Every bit that’s added will double the range of the number.
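These limits can be checked with a couple of lines of arithmetic. A minimal Python sketch, with the function name `largest_value` made up for this illustration:

```python
def largest_value(num_bits):
    """Largest whole number that fits in num_bits bits, counting from 0."""
    # There are 2 ** num_bits patterns, and one of them is used for 0.
    return 2 ** num_bits - 1

for bits in (1, 8, 16, 32):
    print(f"{bits} bits: 0 to {largest_value(bits):,}")
```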

Curiosity

Candles on birthday cakes use the base 1 numbering system, where each place is worth 1 times the one to its right(!) For example, the number 3 is 111, and 10 is 1111111111. This can cause problems as you get older: if you’ve ever seen a cake with 100 candles on it, you’ll be aware that it’s a serious fire hazard. Luckily it’s possible to use binary notation for birthday candles, where each candle is either lit or not lit. For example, if you are 18, the binary notation is 10010, and you need 5 candles (with only two of them lit).

### 5.4.2. Shorthand for binary numbers

Most of the time binary numbers are stored electronically, and we don’t need to worry about making sense of them. But sometimes it’s useful to be able to write down and share numbers, such as the unique identifier assigned to each digital device (MAC address), or the colours specified in an HTML page.

Writing out long binary numbers is tedious — for example, suppose you need to copy down the 16-bit number 0101001110010001. A widely used shortcut is to break the number up into 4-bit groups (in this case, 0101 0011 1001 0001), and then write down the digit that each group represents (giving 5391). There’s just one small problem: each group of 4 bits can go up to 1111, which is 15, and the digits only go up to 9.

The solution is simple: we introduce symbols for the digits for 1010 (10) to 1111 (15), which are just the letters A to F. So, for example, the 16-bit binary number 1011 1000 1110 0001 can be written more concisely as B8E1. The “B” represents the binary 1011, which is the decimal number 11, and the E represents binary 1110, which is decimal 14.

Because we now have 16 digits, this representation is called hexadecimal (or hex for short). Converting between binary and hexadecimal is very simple, and that’s why hexadecimal is a very common way of writing down large binary numbers.

Here’s a full table of all the 4-bit numbers and their hexadecimal digit equivalent:

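| Binary | Hex |
|---|---|
| 0000 | 0 |
| 0001 | 1 |
| 0010 | 2 |
| 0011 | 3 |
| 0100 | 4 |
| 0101 | 5 |
| 0110 | 6 |
| 0111 | 7 |
| 1000 | 8 |
| 1001 | 9 |
| 1010 | A |
| 1011 | B |
| 1100 | C |
| 1101 | D |
| 1110 | E |
| 1111 | F |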

For example, the largest 8-bit binary number is 11111111. This can be written as FF in hexadecimal. Both of those representations mean 255 in our conventional decimal system (you can check that by converting the binary number to decimal).

The largest 16 bit binary number is 1111111111111111, or FFFF in hexadecimal. Both of these represent 65535 in decimal.
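The grouping into 4-bit chunks can be sketched in a few lines of Python. This is an illustration only; the names `HEX_DIGITS` and `binary_to_hex` are made up here.

```python
HEX_DIGITS = "0123456789ABCDEF"

def binary_to_hex(bits):
    """Convert a string of bits to hexadecimal, one 4-bit group at a time."""
    # Pad on the left with 0s so the length is a multiple of 4.
    bits = "0" * (-len(bits) % 4) + bits
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    # Each group is a number from 0 to 15, which picks out one hex digit.
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)

print(binary_to_hex("1011100011100001"))  # B8E1
print(binary_to_hex("11111111"))          # FF
```

Each hexadecimal digit depends only on its own group of 4 bits, which is why the conversion is so much simpler than converting to decimal.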

The hexadecimal system is also known as base 16. The following interactive converts hexadecimal numbers to decimal (base 10), which provides another way of thinking about them. But don’t forget that the main point is that hexadecimal is an easy shorthand for binary representation.

Which notation you use will depend on the situation; binary numbers represent what is actually stored, but can be confusing to read and write; hexadecimal numbers are a good shorthand; and decimal numbers are used if you’re trying to understand the meaning of the number. All three get used in computer science.

### 5.4.3. How do binary numbers affect us?

The length of a binary number determines the range of values it can represent. Often on computers we are dealing with text, images and sound rather than numbers, but they do appear in quite a few places, and the accuracy with which they are represented can affect what we can do on a computer.

For example, numbers in spreadsheets usually have a finite precision. Try putting the formula “=1/3” into a spreadsheet, and have it represented with maximum accuracy. How many decimal places does it store? This will be dictated by the number of binary digits that the spreadsheet is storing.

Many programming languages allow the programmer to specify the number of bits used to represent each variable (e.g. in the C language a “short int” is 16 bits or more, and a “long int” is at least 32 bits); if you are working with such a language, you could investigate the limits on how numbers are represented. Note that some languages, including Python, seamlessly change the size of the representation of an integer if it gets too large, so it’s harder to explore these issues in Python.
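The difference can still be imitated in Python by throwing away the bits that would not fit. The sketch below is only a model of a 16-bit unsigned value that wraps around; the function name `as_unsigned_16_bit` is made up, and what really happens when a number gets too big in another language depends on that language and the type used.

```python
def as_unsigned_16_bit(value):
    """Keep only the lowest 16 bits (an assumed model of a 16-bit unsigned variable)."""
    return value & 0xFFFF

big = 65535 + 1
print(big)                       # Python simply uses more bits: 65536
print(big.bit_length())          # 17 bits are needed for this value
print(as_unsigned_16_bit(big))   # with only 16 bits, the model wraps to 0
```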

Another situation where the number of bits in a representation is important is IP (Internet Protocol) and MAC (media access control) addresses for devices; the recent change from IPv4 to IPv6 was driven by the number of devices that could be represented. If you are interested in networks, you could explore the number of bits used for an address, and how many possible devices could exist before we run out of numbers.
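As a starting point for that exploration, the number of possible addresses is just 2 raised to the number of bits. In the Python sketch below, the 32-bit IPv4 size is the one mentioned earlier in this chapter; the 48-bit MAC and 128-bit IPv6 sizes are the standard ones but are not stated in the text above. The counts are upper bounds, since some addresses are reserved for special purposes.

```python
# Address sizes in bits (MAC and IPv6 sizes are added here for illustration).
ADDRESS_BITS = {"IPv4": 32, "MAC": 48, "IPv6": 128}

for name, bits in ADDRESS_BITS.items():
    # Every extra bit doubles the number of possible addresses.
    print(f"{name}: {bits} bits, {2 ** bits:,} possible addresses")
```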

## 5.5. Representing images with bits

Warning

This section assumes that you understand binary numbers. If you are confused by binary numbers still, you should go back to the binary numbers section and work through the material there again until you understand it. The first part of this section is possible to understand without understanding binary numbers, although in order to actually use the material for assessment purposes, you will need to understand binary numbers, as the key idea is representing colours using bits, and the bits in colours are decided based on numbers.

In school or art class you may have mixed different colours of paint or dye together in order to make new colours. This was probably very helpful if the exact colour you wanted was not present in your palette, in addition to just being fun to experiment with! When mixing paints, red and blue would give purple. If you mixed yellow and blue, you would get green. If you mixed red and yellow, you would get orange. If you mixed an even amount of the 3 primary colours (blue, red, and yellow) together, you should get black, although often it would be a murky brown. By mixing together various amounts of the three primary colours, along with white and black, you can make many different colours.

Actually, while the colours blue, red and yellow are commonly used in art classes, the very similar primary colours that work better for printing are cyan, magenta and yellow (CMY), which are commonly found in computer printers as well as printing presses. This kind of mixing is called “subtractive mixing”, because it starts with a white canvas or paper, and subtracts colour from it. The below interactive allows you to experiment with these in case you’re not familiar with them, or in case you just like mixing colours. We’ve also added a “black” mix; it’s not strictly necessary (you can get black by putting all the other colours on full), but it’s useful for printers because it’s such a common colour.

#### CMY Colour Mixer

Computer screens and related devices also rely on mixing colours, except they go about it in quite a different way — they use a different set of primary colours, because they are additive, starting with a black screen and adding colour to it. For additive colour on computers, the colours red, green and blue (RGB) are used. Each pixel on a screen has 3 tiny lights; one red, one green, and one blue. By increasing and decreasing the amount of light coming out of each of these 3 lights, all the different colours can be made.

You can try additive colours in the following interactive; try different combinations of each slider. How do you generate yellow? What happens if they are all at zero? All at full value (255)? Halfway? What happens if one colour is at full, and the other two are at halfway? How do you get shades of purple, yellow, orange, and pink? What happens when you have the same amount of each colour? How do you get black? How do you get white?

The key idea is that you can specify the colour of a pixel using three numbers. In the above example, each number is from 0 to 255. With 256 possible values for each of the three components, that gives 256 x 256 x 256 = 16,777,216 possible colours, which is more than the human eye can detect. In other words, using just three numbers, you can specify pretty much any colour you want — and probably a lot that you don’t.

Of course, a computer screen or printout doesn’t have just one colour on it — it has millions of small pixels, each of which has a particular colour.

The following interactive allows you to zoom in on an image to see the pixels that are used to represent it. Each pixel is a solid colour square, and the computer needs to store the colour for each pixel. If you zoom in far enough, the interactive will show you the red-green-blue values for each pixel. You can pick a pixel and put the values on the slider above - it should come out as the same colour as the pixel.

Jargon Buster

The word pixel is short for “picture element”. On computer screens and printers an image is created by a grid of pixels, each one set to the required colour. A pixel is typically a fraction of a millimeter across, and images can be made up of millions of pixels (one megapixel is a million pixels).

Curiosity

The human eye has millions of light sensors in it, and the ones that detect colour are called “cones”. There are three different kinds of cones, which detect red, blue, and green light respectively. Colours are perceived by the amount of red, blue, and green light in them. Computer screen pixels take advantage of this by releasing the amounts of red, blue, and green light that will be perceived as the desired colour by your eyes. So when you see “purple”, it’s really the red and blue cones in your eyes being stimulated, and your brain converts that to a perceived colour.

Even the smallest computer screens have millions of pixels on them, and the computer needs to represent a colour for each one of those pixels. These days photographs are measured in megapixels (millions of pixels). To store the image, your computer is storing a colour for every one of those pixels, and each of those could be using the three numbers above. So a 2 megapixel photo, in its simplest form, needs 6 million numbers to be recorded to represent it accurately.
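The arithmetic behind that figure is short enough to check directly. In this Python sketch the function name `numbers_to_store` is made up, and the 2000 by 1000 pixel size is just one hypothetical way to reach 2 megapixels.

```python
def numbers_to_store(width, height, numbers_per_pixel=3):
    """Count the values in an uncompressed image: red, green and blue per pixel."""
    return width * height * numbers_per_pixel

# A hypothetical 2 megapixel photo: 2000 x 1000 = 2,000,000 pixels.
print(numbers_to_store(2000, 1000))  # 6000000
```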

### 5.5.1. Representing high quality images using bits

So now, how can computers represent each possible colour using bits? You may have noticed in the above interactive that for each of red, green, and blue, there are 256 different positions the slider can be in (don’t forget to include setting the slider to 0). From the numbers section, you may remember that to get 256 different possibilities, you need 8 bits. So for example, to represent the current value of the red slider, you would need 8 bits ($2^8 = 2 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 \times 2 = 256$).

Because there are three primary colours, each of which has 256 different possible values, we need 24 bits in order to have enough possible bit patterns to represent all the possible colours that this scheme can represent (3 x 8 = 24).

If you calculate $2^{24}$ (i.e. the number of bit patterns you can get with 24 bits), and 256 x 256 x 256 (i.e. the number of possible colours that can be represented using the above interactive), you will find that the results of these two calculations are the same: 16,777,216. This means that there are 16,777,216 different possible colours that can be represented using this scheme, and that’s more colours than most people can distinguish, which is why 24-bit colour is regarded as high quality.

So now that we know we’ll need 24 bits to represent all the possible colours that can be made from the scheme in the interactive, how can we assign colours to bit patterns?

A sensible way is to use 3 binary numbers that represent the amount of each of red, green, and blue in the pixel. In order to do this, you can simply convert the decimal values on the interactive that specify how much of each of the primary colours is making up the resulting colour into binary, and put them side by side to make a full pattern of 24 bits. Because consistency is important in order for a computer to make sense of the bit pattern, the binary number for red should be put first, followed by green, and then finally blue.

As an example, suppose you have the colour that has red = 145, green = 50, and blue = 123 (it is a shade of purple; you can see it if you set the sliders to those values in the interactive above). You need to convert each of the 3 numbers into binary, using 8 bits for each. You can either do this by hand if you are confident with binary numbers, use the binary number interactive with 8 columns, or use a binary piano. You should get red = 10010001, green = 00110010, and blue = 01111011. This can be written as 100100010011001001111011, which is the bit pattern for representing that shade of purple.

Note that there are no spaces between the 3 numbers, as this is a pattern of bits rather than actually being 3 binary numbers, and computers don’t have any such concept of a space between bit patterns anyway; everything must be a 0 or a 1. You could write it with spaces to make it easier to read, and to represent the idea that they are likely to be stored in 3 8-bit bytes, but inside the computer memory there is just a sequence of high and low voltages, so even writing 0 and 1 is an arbitrary notation. Note that all leading and trailing 0’s on each of the components are kept; without them, it would be representing a shorter number. Make sure you work through this example yourself, to understand how it works.
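After working through the example by hand, you can check your answer with a short program. This is a minimal Python sketch, not part of the original material; the function name `rgb_to_bits` is made up for this illustration.

```python
def rgb_to_bits(red, green, blue):
    """Join three 8-bit binary numbers (red first, then green, then blue)."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError(f"{value} does not fit in 8 bits")
    # "08b" keeps the leading 0s, so every component is exactly 8 bits long.
    return "".join(format(value, "08b") for value in (red, green, blue))

print(rgb_to_bits(145, 50, 123))  # 100100010011001001111011
```

Keeping each component at exactly 8 bits is what lets the pattern be split back into red, green and blue without any separators.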

As long as the computer knows this is a colour (typically because it has been taken from a file that is specifying colours, such as GIF or HTML), it will know that the first 8 bits specify the amount of red, the next 8 bits specify the amount of green, and the last 8 bits specify the amount of blue. The computer won’t actually convert the number into decimal, as it works with the binary directly — most of the process that takes the bits and makes the right pixels appear is typically done by a graphics card or a printer.

24 bit colour is sometimes referred to in settings as “True Color” (because it is more accurate than the human eye can see). On Apple systems, it is called “Millions of colours”.

When writing HTML code, you often need to specify colours for text, backgrounds, etc. One way of doing this is to specify the colour name, for example “red”, “blue”, “purple”, or “gold”. The use of names limits the number of colours you can represent and the shade might not be exactly the one you wanted. A better way is to specify the 24 bit colour directly. The problem is that strings of 24 binary digits are hard to read, and so colours in HTML use hexadecimal codes as a quick way to write the 24 bits, for example #00FF9E. The hash sign just means that it should be interpreted as a hexadecimal representation, and since each hexadecimal digit corresponds to 4 bits, the 6 digits represent 24 bits of colour information. This “hex triplet” format is used in HTML pages to specify colours for things like the background of the page, the text, and the colour of links. It is also used in CSS, SVG, and other applications.

In the 24 bit colour example earlier, the 24 bit pattern was 100100010011001001111011. This can be broken up into groups of 4 bits: 1001 0001 0011 0010 0111 1011. Substituting a hexadecimal digit for each of the 4-bit groups (using the table above) gives 91327B. This is the hexadecimal code for this colour!
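The grouping into 4-bit chunks can also be tried out in Python. This is only a sketch, and `bits_to_hex` is a name invented here; it splits the pattern into groups of 4 bits and writes one hexadecimal digit for each group.

```python
def bits_to_hex(bits):
    """Convert a string of bits (length a multiple of 4) to hexadecimal."""
    if len(bits) % 4 != 0:
        raise ValueError("the number of bits must be a multiple of 4")
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    # int(group, 2) reads the group as binary; "X" writes it as one hex digit
    return "".join(format(int(group, 2), "X") for group in groups)

print(bits_to_hex("100100010011001001111011"))  # 91327B
```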

The hexadecimal notation is extremely useful for people to read or write, as it is much easier to type 6 characters rather than 24 1’s and 0’s when specifying a colour!

For example, to specify the background colour of a page in HTML, the body tag can have a hexadecimal colour added to it like this:

<body bgcolor="#00FF9E">

You can use an HTML page to experiment with hexadecimal colours.

Understanding how these hexadecimal colour codes are derived also allows you to change them slightly without having to refer back to the colour table, when the colour isn’t exactly the one you want. Remember that in the 24 bit colour code, the first 8 bits specify the amount of red (so this is the first 2 digits of the hexadecimal code), the next 8 bits specify the amount of green (the next 2 digits of the hexadecimal code), and the last 8 bits specify the amount of blue (the last 2 digits of the hexadecimal code). To increase the amount of any one of these colours, you can change the appropriate hexadecimal digits.

For example, #000000 has zero for red, green and blue, so setting a higher value to the middle two digits (such as #002300) will add some green to the colour. What colours will the following codes give? #FF0000, #FF00FF, #FFFFFF ? (You can try them out using an HTML file).
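If you would like to see the red, green and blue amounts hidden in a hex code as ordinary decimal numbers, a small Python sketch like the following can help. The function name `hex_to_rgb` is an assumption made for this example; it takes two hexadecimal digits at a time and reads each pair as a number in base 16.

```python
def hex_to_rgb(code):
    """Split a hex triplet such as "#002300" into (red, green, blue)."""
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected 6 hexadecimal digits")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb("#000000"))  # (0, 0, 0)
print(hex_to_rgb("#002300"))  # (0, 35, 0)
```

You could use it to check your predictions for the codes in the question above.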

### 5.5.3. Representing colours using fewer bits

What if we were to use fewer than 24 bits to represent each colour, i.e. each slider didn’t have as many possible positions it could be in? The following interactive shows what would happen with this limitation. You can select a colour by clicking on the image on the left, and then try to match it with the 24-bit colour sliders (if it’s too difficult, the system will offer to help you; to move the sliders by small amounts, you can use the arrow keys).

It should be possible to get a perfect match using 24 bit colour. Now try the 8-bit sliders. These ones have only 8 values for red and green, and just 4 values for blue!

The above system used 3 bits to specify the amount of red (8 possible values), 3 bits to specify the amount of green (again 8 possible values), and 2 bits to specify the amount of blue (4 possible values). This gives a total of 8 bits (hence the name), which can be used to make 256 different bit patterns, and thus can represent 256 different colours.
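One possible way to squeeze a 24-bit colour into this 8-bit layout is sketched below in Python. It assumes that the most significant bits of each component are the ones that are kept, and the name `pack_332` is invented for the example; the interactive may map its slider positions differently.

```python
def pack_332(red, green, blue):
    """Pack a 24-bit colour into 8 bits: 3 for red, 3 for green, 2 for blue."""
    r3 = red >> 5    # keep the top 3 of 8 bits (0 to 7)
    g3 = green >> 5  # keep the top 3 of 8 bits (0 to 7)
    b2 = blue >> 6   # keep the top 2 of 8 bits (0 to 3)
    return (r3 << 5) | (g3 << 2) | b2

packed = pack_332(145, 50, 123)
print(format(packed, "08b"))  # 10000101
```

Reading the result from the left, 100 is the amount of red, 001 is the amount of green, and 01 is the amount of blue.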

Using this scheme to represent all the pixels of an image takes one third of the number of bits required for 24-bit colour, but it is not as good at showing smooth changes of colours or subtle shades, because there are only 256 possible colours for each pixel. This is one of the big tradeoffs in data representation: do you allocate less space (fewer bits), or do you want higher quality?

Jargon Buster

The number of bits used to represent the colours of pixels in a particular image is sometimes referred to as its “colour depth” or “bit depth”. For example, an image or display with a colour depth of 8 bits has a choice of 256 colours for each pixel. There is more information about this in Wikipedia. Drastically reducing the bit depth of an image can make it look very strange; sometimes this is used as a special effect called “posterisation” (i.e. making it look like a poster that has been printed with just a few colours).

The following interactive shows what happens to images when you use a smaller range of colours (right down to zero bits!). You can choose an image using the menu. In which cases is the change in quality most noticeable? In which is it not? In which would you actually care about the colours in the image? In which situations is colour actually not necessary (i.e. we are fine with two colours)?
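To get a feel for what reducing the range of colours does to the numbers, here is a rough Python sketch for a single colour component. The function `reduce_depth` and the way it spreads the remaining levels back over 0 to 255 are assumptions made for illustration, not necessarily how the interactive does it.

```python
def reduce_depth(value, bits):
    """Approximate an 8-bit colour component (0 to 255) using fewer bits."""
    if not 0 <= bits <= 8:
        raise ValueError("bits must be between 0 and 8")
    if bits == 0:
        return 0  # zero bits: there is only one possible value
    level = value >> (8 - bits)      # which of the 2**bits levels this falls in
    top_level = 2 ** bits - 1
    return level * 255 // top_level  # spread the levels back over 0 to 255

for bits in (8, 4, 2, 1, 0):
    print(bits, [reduce_depth(v, bits) for v in (145, 50, 123)])
```

With fewer bits, different original values end up sharing the same result, which is why subtle shades disappear.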

One other interesting thing to think about is whether or not we’d want more than 24 bit colour. It turns out that the human eye can only differentiate around 10 million colours, so the 16 million provided by 24 bit colour is already beyond what our eyes can distinguish. However, if the image were to be processed by some software that enhances the contrast, it may turn out that 24-bit colour isn’t sufficient. Choosing the representation isn’t simple!

So is it worth the space saving to put up with a lower quality image? An image represented using 24 bit colour would have 24 bits per pixel. A 600 x 800 pixel image (which is a reasonable size for a photo) contains 600 x 800 = 480,000 pixels, and thus would use 480,000 x 24 bits = 11,520,000 bits. Dividing by 8 bits per byte gives 1,440,000 bytes, or around 1.44 megabytes. If we use 8-bit colour instead, it will use a third of the memory (480,000 bytes), so it would save nearly a megabyte of storage.
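The same arithmetic can be written as a short Python sketch, which makes it easy to try other image sizes. The function name `image_size_bytes` is made up here, and it ignores any extra information a real image file would also store.

```python
def image_size_bytes(width, height, bits_per_pixel):
    """Size of an uncompressed image, ignoring any file header."""
    total_bits = width * height * bits_per_pixel
    return total_bits // 8  # 8 bits per byte

size_24 = image_size_bytes(600, 800, 24)
size_8 = image_size_bytes(600, 800, 8)
print(size_24)           # 1440000 bytes, about 1.44 megabytes
print(size_8)            # 480000 bytes
print(size_24 - size_8)  # 960000 bytes saved
```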

8 bit colour is not used much anymore, although it can still be helpful in situations such as accessing a computer desktop remotely on a slow internet connection, as the image of the desktop can be sent using 8 bit colour instead of 24 bit colour. Even though this may make the desktop look a bit strange, it doesn’t stop you from getting done whatever it was you needed to do. There are also other situations where colour doesn’t matter at all, for example diagrams, and black and white printed images.

If space really is an issue, then this crude method of reducing the range of colours isn’t usually used; instead, compression methods such as JPEG, GIF and PNG are used. These make much more clever compromises to reduce the space that an image takes, without making it look so bad, including choosing a better palette of colours to use rather than just using the simple representation discussed above. However, compression methods require a lot more processing, and images need to be decoded to the representations discussed in this chapter before they can be displayed. We will look at compression methods in a later chapter. The ideas in this present chapter more commonly come up when designing systems (such as graphics interfaces) and working with high-quality images (such as RAW photographs), and typically the goal is to choose the best representation possible without wasting too much space.

For the purposes of the New Zealand NCEA standards, reducing the bit depth of an image is ok as a second compression method to compare to specialised compression methods (JPEG, PNG, GIF etc.), but isn’t very suitable for explaining how compression works (in the Achieved level requirements).

Now that you know how the 24 bit and 8 bit colour schemes work and how to represent them using bits, what are the implications of this in practice? The following interactive can be used to upload your own image, and experiment with allocating different numbers of bits to each colour. You can use it to demonstrate the effect of the different numbers of bits for this data representation.

## 5.6. General representations of text

In the introduction we looked at 8-bit ASCII representations of text (which really use 7 bits, allowing for 128 different symbols). As with any other kind of data represented in binary, we can get improvements by considering larger (or smaller) representations.

In the curiosity section earlier we observed that 5 bits are sufficient for simple coding of the English alphabet, and for very slow coding systems (like the video that contains hidden text using musical notes) using 5 bits instead of 8 can save some time. The braille system uses only 6 bits for each character, which allows for 64 different characters, and it is also better than using 8 bits since it would take more paper and more time to read if the longer code was used.

But some languages have way more than 32, or 64, or even 128 characters in their alphabet. In fact, the majority of the world’s population use such languages! In this case, longer codes are needed, and the most widely used approach is a system called Unicode. A commonly used version of Unicode allows 16 bits per character. Because every extra bit that is added doubles the number of patterns possible, 16-bit codes have many more representations than 8 bit codes. In fact, with 16 bits there are 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 = 2^16 = 65,536 patterns that can be represented. This is enough to assign a unique pattern of bits to the main characters in the most common languages, although there are also standards that allow 32 bits (4 bytes) for each character.
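The doubling rule is easy to explore with a couple of lines of Python. In this sketch the names `patterns` and `bits_needed` are invented for the example: the first gives the number of patterns for a given number of bits, and the second works backwards from the number of symbols you need to represent.

```python
def patterns(bits):
    """How many different bit patterns exist with this many bits."""
    return 2 ** bits

def bits_needed(symbols):
    """Smallest number of bits giving at least `symbols` patterns."""
    return max(1, (symbols - 1).bit_length())

print(patterns(5), patterns(6), patterns(8), patterns(16))  # 32 64 256 65536
print(bits_needed(26))     # 5
print(bits_needed(65536))  # 16
```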

The Unicode table is far too big to display in this book, but you can find a variety of tables on the internet, and use them to look up codes. This website displays all unicode characters with geographical data for appropriate characters. The 16- and 32-bit codes are usually written using hexadecimal since this is an easy abbreviation for the long binary codes, and sections of the Unicode alphabet (different languages) tend to be in multiples of 16.

The modern codes associated with Unicode are usually flexible in the size of the representation, so 8-bit characters can be used if that is sufficient, but 16- or 32-bit characters can be invoked for larger alphabets. If you are investigating these codes, you will come across standards such as the Universal Character Set (UCS), the Unicode/UCS Transformation Format (UTF-8, UTF-16, etc.), and the GB 18030 standard (which was mandated in the People’s Republic of China from the year 2000).
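You can see this flexibility for yourself in Python, which can encode text using UTF-8 and UTF-16. The sketch below picks four sample characters (the choice of characters is just for illustration) and reports how many bits each one takes in the two encodings.

```python
samples = {
    "Latin letter A": "A",
    "e with acute accent": "\u00e9",
    "Chinese character (U+4E2D)": "\u4e2d",
    "emoji (U+1F600)": "\U0001F600",
}

for label, char in samples.items():
    utf8 = char.encode("utf-8")
    utf16 = char.encode("utf-16-be")  # "-be" avoids the extra byte-order mark
    print(f"{label}: code point {ord(char):X}, "
          f"{len(utf8) * 8} bits in UTF-8, {len(utf16) * 8} bits in UTF-16")
```

Notice that the code points are printed in hexadecimal, which is how Unicode tables usually list them.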

## 5.7. Computers representing numbers in practice

A common place that numbers are stored on computers is in spreadsheets or databases. Some of the things that we might think of as numbers, such as the telephone number (03) 555-1234, aren’t actually stored as numbers, as they contain important characters (like dashes and spaces) as well as the leading 0 which would be lost if it was stored as a number (the above number would come out as 35551234, which isn’t quite right). On the other hand, things that don’t look like a number (such as “30 January 2014”) are often stored using a value that is converted to a format that is meaningful to the reader (try typing two dates into Excel, and then subtract one from the other — the result is a useful number). Numbers are commonly used to store things as diverse as student marks, prices, statistics, and scientific readings.
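Both of these effects can be seen in a short Python sketch. It uses the phone number from above and, as an assumed second date for the subtraction, 1 January 2014.

```python
from datetime import date

phone = "(03) 555-1234"
digits_only = "".join(ch for ch in phone if ch.isdigit())
print(digits_only)       # 035551234
print(int(digits_only))  # 35551234 (the leading 0 is gone)

# Dates can be treated as a count of days, so subtracting them gives a number.
gap = date(2014, 1, 30) - date(2014, 1, 1)
print(gap.days)          # 29
```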

Any system that stores numbers needs to make a compromise between the number of bits allocated to store the number, and the range and precision of the values that can be stored. For example, Excel spreadsheets have a limit on how precisely a value can be stored: try calculating 1/3, and display it to as many places of accuracy as possible. In some systems (like the Java and C programming languages and databases) it’s possible to specify how accurately numbers should be stored; in others it is fixed in advance (such as in spreadsheets). Some are able to work with arbitrarily large numbers by increasing the space used to store them as necessary (e.g. integers in the Python programming language).

There are two commonly used kinds of numbers: integers and floating point numbers. Integers are what you might know as whole numbers, and can be positive or negative, whereas floating point numbers can have a decimal point in them, and can also be positive or negative. In this section we are just going to focus on integers, as representing floating point numbers is a bit more difficult to understand (but well worth understanding if you use them a lot)!

The binary number representation in the previous section only allowed us to represent positive numbers. In practice, we will want to be able to represent negative numbers as well (such as when the amount of money earned goes to a negative amount, or the temperature falls below zero!) In our normal representation of base 10 numbers, we represent negative numbers by putting a minus sign in front of the number. On a computer we don’t have minus signs, but we can do it by allocating one extra bit, called a sign bit, to represent the minus sign. We can choose the leftmost bit as the sign bit — when the sign bit is set to “0”, that means the number is positive and when the sign bit is set to “1”, the number is negative (just as if there were a minus sign in front of it). For example, if we wanted to represent the number 41 using 6 bits (like above) along with an additional 7th bit that is the sign bit, assuming the sign bit is first, we would represent it by 0101001. The first bit is a 0, meaning the number is positive, then the remaining 6 bits give 41, meaning the number is +41. If we wanted to make -59, this would be 1111011. The first bit is a 1, meaning the number is negative, and then the remaining 6 bits give 59, meaning the number is -59.
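The sign bit idea can be sketched in Python as follows. The function name `to_sign_bit` is made up for this example; it writes the sign bit first, then the size of the number using 6 bits, and complains if the number is too big to fit.

```python
def to_sign_bit(number, value_bits=6):
    """Encode an integer as a sign bit followed by `value_bits` bits."""
    magnitude = abs(number)
    if magnitude >= 2 ** value_bits:
        raise ValueError(f"{number} does not fit in {value_bits} value bits")
    sign = "1" if number < 0 else "0"
    return sign + format(magnitude, f"0{value_bits}b")

print(to_sign_bit(41))   # 0101001
print(to_sign_bit(-59))  # 1111011
```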

Using 7 bits as described above (one for the sign, and 6 for the actual number), what would be the binary representations for 1, -1, -8, 34, -37, -88, and 102?

Suppose we have 8-bit numbers, with the left-most bit as a sign bit. What would the decimal values be for the following: 10000110? 01111111? How about 10000000?

The representation 10000000 highlights a problem with this notation, as it represents the number -0, which is the same as 0. That is, there are two ways to represent the number 0, which is wasteful, and potentially confusing.
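The two ways of writing zero show up clearly if the decoding is written out in Python. This is just a sketch, with `from_sign_bit` as an invented name; it treats the first bit as the sign and the remaining bits as an ordinary binary number.

```python
def from_sign_bit(bits):
    """Decode a pattern whose first bit is the sign and the rest the value."""
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude

print(from_sign_bit("00000000"))  # 0
print(from_sign_bit("10000000"))  # 0 (this is "-0", which equals 0)
```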

It turns out that there’s a notation called “two’s complement” for negative numbers, which avoids this wastage, and more importantly, makes it easier to do arithmetic with negative numbers. It’s beyond what is needed for this topic, but the following box gives some more information if you’d like to look into it.

Extra for Experts

Negative numbers are more often stored on computers using a system called “two’s complement”. This system makes it very easy to do arithmetic without having to treat negative numbers as a special case, so it’s faster and uses less circuitry. The principle is based on a fairly simple idea: for example, in decimal, if you had to subtract the number 4 from a value, it’s the same if you add 6 and subtract 10. Using the complement of the number -4 (i.e. 6) plus an indicator that it’s negative can make calculations quicker and simpler. A similar approach applies in binary, and it’s even easier because there are only two digits. More information is available here on how negative numbers work, and also on the Wikipedia page about two’s complement, although it’s quite technical.
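For the curious, here is a minimal Python sketch of the binary version of this trick, assuming 8-bit numbers. The names `twos_complement` and `add_patterns` are invented for the example. In 8 bits, -4 is stored as the pattern for 256 - 4 = 252, and adding patterns while ignoring any carry past the 8th bit gives the right answer without treating negative numbers specially.

```python
def twos_complement(number, bits=8):
    """Bit pattern for `number` in two's complement with the given width."""
    if not -(2 ** (bits - 1)) <= number < 2 ** (bits - 1):
        raise ValueError(f"{number} does not fit in {bits} bits")
    # Negative numbers are stored as 2**bits + number, e.g. 256 - 4 = 252
    return format(number % (2 ** bits), f"0{bits}b")

def add_patterns(a, b, bits=8):
    """Add two bit patterns as plain binary, throwing away any extra carry."""
    total = (int(a, 2) + int(b, 2)) % (2 ** bits)
    return format(total, f"0{bits}b")

print(twos_complement(-4))  # 11111100
# 10 + (-4): ordinary addition of the patterns gives the pattern for 6
print(add_patterns(twos_complement(10), twos_complement(-4)))  # 00000110
```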

Curiosity

In some programming languages there isn’t a check for when a number gets too big (overflows). For example, if you have an 8-bit number using two’s complement, then 01111111 is the largest number (127), and if you add one without checking, it will change to 10000000, which happens to be the number -128. This can cause serious problems if not checked for, and is behind a variant of the Y2K problem, called the Year 2038 problem, involving a 32-bit number overflowing for dates on Tuesday, 19 January 2038.
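Python’s own integers don’t overflow, but the effect can be imitated with a little arithmetic. In the sketch below, `wrap_signed` is a made-up helper that works out what a two’s complement number of a given size would end up holding. The date calculation assumes time is counted in seconds since 1 January 1970 (UTC) in a signed 32-bit number, which is the situation behind the Year 2038 problem.

```python
from datetime import datetime, timedelta, timezone

def wrap_signed(value, bits=8):
    """Reduce `value` to what a two's complement integer of this width holds."""
    half = 2 ** (bits - 1)
    return (value + half) % (2 ** bits) - half

print(wrap_signed(127 + 1))  # -128

# The largest number of seconds a signed 32-bit integer can count up to
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
largest = 2 ** 31 - 1
print(epoch + timedelta(seconds=largest))  # 2038-01-19 03:14:07+00:00
print(wrap_signed(largest + 1, bits=32))   # -2147483648
```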

Because of the way computer memory is constructed, memory is most commonly used in chunks of 8 bits or 32 bits (or even 64 bits) at a time. That means that if the computer is representing an integer as a binary number with a sign bit, it will commonly use 32 bits, where the first bit is the sign bit, and the other 31 bits represent the value of the number.

In a computer that uses 32 bits for a number, how many different numbers could it represent? What’s the largest number it could represent? Remember that every bit you add doubles how many numbers you can make. If you double 64 another 25 times (so that it is up to 31 bits), i.e. 128, 256, 512, 1024, 2048 and so on, you get an end result of 2,147,483,648. This means that there are 2,147,483,648 numbers that can be represented with 31 bits, the highest of which is 2,147,483,647. This number is just over 2 billion. With the 32nd bit, the sign bit, this means that the number can be positive or negative. This is called a signed 32 bit integer. So with the signed 32 bit integer, you can represent any number between -2,147,483,647 and +2,147,483,647. (The table below shows one extra negative value, -2,147,483,648, because it is based on two’s complement, which doesn’t use up a pattern on -0.)

There is also such a thing as a 32 bit *unsigned* integer. This does not have a sign bit, and the 32nd bit is included as part of the value. As a result, it can represent twice as many positive numbers (but no negative numbers) as the 32 bit signed integer above. This would be 4,294,967,296 different numbers, with 4,294,967,295 being the highest.

How many people are in the world? Would a 32 bit integer as described above be large enough to store a different identifier number for each person in the world? How many bits of accuracy would you want to allow for possible population growth?

| Size of number | Unsigned range | Signed range |
|---|---|---|
| 8 bit | 0 to 255 | -128 to 127 |
| 16 bit | 0 to 65,535 | -32,768 to 32,767 |
| 32 bit | 0 to 4,294,967,295 | -2,147,483,648 to 2,147,483,647 |
| 64 bit | 0 to 18,446,744,073,709,551,615 | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
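The ranges in the table above can be worked out for any number of bits. The following Python sketch assumes the signed numbers use two’s complement (which is why there is one more negative value than positive); the function names are made up for the example.

```python
def unsigned_range(bits):
    """Smallest and largest values when every bit is part of the number."""
    return 0, 2 ** bits - 1

def signed_range(bits):
    """Range for two's complement, the usual signed representation."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

for bits in (8, 16, 32, 64):
    low_u, high_u = unsigned_range(bits)
    low_s, high_s = signed_range(bits)
    print(f"{bits} bit: unsigned {low_u:,} to {high_u:,}; "
          f"signed {low_s:,} to {high_s:,}")
```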

So when you are storing values on a computer with very limited space, you need to be careful to pick a suitable kind of integer that has enough space, but isn’t wasting space. You also need to think about whether or not a number could potentially be negative.

Think of a few different examples for different sized integers (both signed and unsigned ones) of a piece of data that you could store in that sized integer. For example, the age of a person could be stored in an 8 bit unsigned integer (people can’t be a negative age!), and the number of students in your school could be stored in an 8 bit or 16 bit integer, depending on how big your school is! What other examples can you think of?

What are some examples of numbers you could not represent using any of these integers?

Extra for Experts

Another type of number used in computer systems is the “floating point” value. While we won’t look at it in detail, to get a taste of what’s involved, consider the bit values in a 4-bit number, which are 8, 4, 2 and 1. What would the value of a bit to the right of the one bit be? And to the right of that one?

The following version of the base conversion interactive has bits that are smaller than the 1-bit. Try representing the decimal number 3.5 using this system. How about 2.8125? What about 2.8?

This system is a fixed-point number system; floating point numbers are based on this idea, but they keep the number of digits fixed while allowing the position of the point to change (by giving an exponent value).
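A fixed-point system like this can be sketched in Python, assuming 4 bits before the point and 4 bits after it (so the bits after the point are worth 1/2, 1/4, 1/8 and 1/16). The function `to_fixed_point` is invented for this example; it returns the closest pattern and says whether that pattern is exactly the number asked for.

```python
from fractions import Fraction

def to_fixed_point(decimal_text, whole_bits=4, fraction_bits=4):
    """Return (pattern, exact) for a non-negative decimal such as "5.625"."""
    scaled = Fraction(decimal_text) * 2 ** fraction_bits
    nearest = round(scaled)  # closest value this many fraction bits allows
    if nearest >= 2 ** (whole_bits + fraction_bits):
        raise ValueError("number is too large for this many bits")
    bits = format(nearest, f"0{whole_bits + fraction_bits}b")
    return bits[:whole_bits] + "." + bits[whole_bits:], scaled == nearest

print(to_fixed_point("5.625"))  # ('0101.1010', True)
print(to_fixed_point("0.1"))    # ('0000.0010', False)
```

Some numbers that look simple in decimal, such as 0.1, can’t be written exactly with a limited number of binary places.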

### 5.7.1. Numbers in programming languages

If you are programming in a language (e.g. Python, Java, C, C++, C#) then the limitations of data representations become important very quickly, as you will have to choose what kind of data representation you want to use, and if it is too small then it can “overflow”. For example, if you allocate a variable to be stored as a 16 bit unsigned integer, and you are counting how many characters there are in a file, then it will fail after 65,535 characters — that’s just a 65 kilobyte file.
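The character counting example can be imitated in Python. Python’s integers don’t overflow by themselves, so this sketch deliberately keeps only the lowest 16 bits of the counter to show what a 16 bit unsigned integer would do if it wrapped around; the function name is made up, and real languages differ in exactly how they behave on overflow.

```python
def count_characters_16bit(text):
    """Count characters using a counter limited to 16 unsigned bits."""
    count = 0
    for _ in text:
        count = (count + 1) & 0xFFFF  # keep only the lowest 16 bits
    return count

print(count_characters_16bit("a" * 65535))  # 65535
print(count_characters_16bit("a" * 65536))  # 0 (the counter wrapped around)
print(count_characters_16bit("a" * 70000))  # 4464
```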

If the amount of memory your computer has to store its data in is very limited (for example, on a small portable device), you might not want to reserve 32 bits for a number if it is never going to be over 100. Or even if there is plenty of memory, if you are storing millions of data values then using 16-bit integers instead of 8-bit integers will waste millions of bytes of memory.
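To put some numbers on this, the sketch below uses Python’s `struct` module to report how many bytes each size of integer takes, and multiplies that by a hypothetical one million data values.

```python
import struct

# Format codes from Python's struct module; "<" selects the standard sizes.
formats = {"8-bit": "<b", "16-bit": "<h", "32-bit": "<i", "64-bit": "<q"}
values = 1_000_000  # hypothetical number of data values to store

for name, code in formats.items():
    size = struct.calcsize(code)  # bytes needed for one integer
    print(f"{name}: {size} byte(s) each, {size * values:,} bytes in total")
```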

Working out the size of an integer used in a particular programming language may take some investigation, as they are usually declared with names like “int” and “long”, which don’t say explicitly how many bits they use. For example, in the Java programming language, there is a data type called the “byte”, which is an 8-bit integer that includes negative numbers (it goes from -128 to 127), whereas a “short” integer is 16 bits, an “int” is 32 bits, and a “long” is 64 bits. In some cases (such as the “int” type in C) the length of an integer depends on the version of the language or the type of computer it is running on, and in other cases (such as integers in Python) the representation is automatically changed for you if the number gets too big!
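Both of these cases can be investigated from Python. The sketch below uses the standard `ctypes` module to ask how big the C “int” type is on the computer running the code (the answer may differ between machines), and then shows a Python integer that is too big for 64 bits.

```python
import ctypes

# Size of the C "int" type on the machine running this code (often 32 bits,
# but the C language does not guarantee that).
print(ctypes.sizeof(ctypes.c_int) * 8)

# Python integers grow as needed, so this does not overflow.
big = 2 ** 64
print(big)               # 18446744073709551616
print(big.bit_length())  # 65
```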

## 5.8. The whole story!

The kind of image representations covered here are the basic ones used in most digital systems, and the main point of this chapter is to understand how digital representations work, and the compromises needed between the number of bits, storage used, and quality.

The colour representation discussed is what is often referred to as “raw” or “bitmap” (bmp) representation. For large images, real systems use compression methods such as JPEG, GIF or PNG to reduce the space needed to store an image, but at the point where an image is being captured or displayed it is inevitably represented using the raw bits as described in this chapter, and the basic choices for capturing and displaying images will affect the quality and cost of a device. Compression is regarded as a form of encoding, and is covered in a later chapter.

The representation of numbers is a whole area of study in itself. The choice of representation affects how quickly arithmetic can be done on the numbers, how accurate the results are, and how much memory or disk space is used up storing the data. Even integers have issues like the order in which a large number is broken up across multiple bytes. Floating point numbers generally follow common standards (the IEEE 754 standard is the most common one) to make it easy to design compatible hardware to process them. Spreadsheets usually store numbers using a floating point format, which limits the precision of calculations (typically about 64 bits are used for each number). There are many experiments that can be done (such as calculating 1/3, or adding a very large number to a very small one) that demonstrate the limitations of floating point representations.
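A few of these experiments can be run directly in Python, whose ordinary decimal numbers are 64-bit floating point values. The sketch below is only a starting point for your own investigation.

```python
import struct

third = 1 / 3
print(third)                          # only about 16 significant digits are kept
print(len(struct.pack("<d", third)))  # 8 bytes, i.e. 64 bits

# Adding a small number to a very large one can have no effect at all.
print(1e16 + 1 == 1e16)               # True

# Small rounding errors show up even in simple sums.
print(0.1 + 0.2 == 0.3)               # False
```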

The chapter does not (yet) cover other forms of data representation, and you may wish to explore these as alternatives. The common ones are:

• sound (wave files and related storage; for example, 16-bit samples are used for “CD quality”, but professional systems use 24-bit or even higher) — for some information, see the Teach with ICT page on sound representation.
• video (which is based on multiple images being played one after the other; however, video files are so large that they are almost never stored in a “raw” representation)