The Limitations of Localisation through Image Recognition in Robotics



At its most basic, localisation in robotics means determining where a robot is. It is often treated together with map building: simultaneous localisation and mapping (SLAM) is “the concurrent construction of a model of the environment (the map), and the estimation of the state of the robot moving within it” (Cadena, et al., 2016). As humans, we may think that nothing is easier than determining our position: we know, for example, that we are in a library, on the third floor, in the second room next to the staircase. But how do we know this? We acquire this knowledge through perception: we can see books around us and remember going up the staircase to the third floor. Now imagine a robot doing the same: how could it recognise the books or the staircase? The obvious answer would probably be “object recognition”, which we humans think of as the easiest task imaginable. However, it seems that the easier a task is for a human, the harder it is for a robot. For example, a robot can evaluate a complex integral within seconds but may fail to recognise a banana held in front of its face (camera). To make sense of this, one first has to consider how robots perceive images.

Images and Pixels

An image is made up of small units called pixels, where a “pixel itself is a single picture element [] [which] can be only one colour” (Bradley). A whole picture consists of thousands of these pixels in varying colours, which become individually visible only under heavy magnification, as can be seen in Figure 1:


Figure 1: Individual pixels (Bradley)

To understand how a robot sees these images, one has to know how a computer stores them: in binary, a number system that uses only the digits zero and one. Each colour is represented by a specific binary number. An example is the black and white image below, where white pixels are represented by a value of zero and black pixels by a value of one.


Figure 2: Black and white image in binary form (Bitesize)
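The idea behind Figure 2 can be sketched in a few lines of Python. The 5x5 picture and the helper names `render` and `to_bitstring` are made up for illustration and are not taken from Bitesize; the sketch only shows that a black and white image is nothing more than a grid of zeros and ones.

```python
# A hypothetical 5x5 black-and-white image: 0 = white pixel, 1 = black pixel.
IMAGE = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]


def render(image):
    """Turn the matrix of bits back into something a human can read."""
    return "\n".join("".join("#" if bit else "." for bit in row) for row in image)


def to_bitstring(image):
    """Flatten the matrix row by row into one long sequence of bits."""
    return "".join(str(bit) for row in image for bit in row)


print(render(IMAGE))        # what a human sees
print(to_bitstring(IMAGE))  # what the computer stores
```

A human spots the letter "A" in the first printout at once; the second printout, which holds exactly the same information, is all the computer has to work with.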

To represent pixels of different colours, the binary system above has to be extended. This is achieved by increasing the colour depth of an image, that is, the number of bits per pixel, as “[t]he number of bits indicates how many colours are available for each pixel” (Bitesize). The black and white image above could, for example, be represented with a 1-bit colour depth, which allows the values 0 and 1. More colourful images need higher colour depths. A 2-bit colour depth allows the values 00, 01, 10 and 11 and can thus represent four different colours. Most cameras nowadays produce 24-bit images, in which the largest value is 1111 1111 1111 1111 1111 1111 in binary; this gives about 16.7 million (2^24 = 16,777,216) possible colours per pixel (Bitesize).
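The arithmetic behind colour depth is simply two to the power of the number of bits. The short Python sketch below checks the figures mentioned above. The split of a 24-bit pixel into three 8-bit channels (red, green, blue) is a common convention assumed here for illustration, not something stated by Bitesize, and the function names are hypothetical.

```python
def colours_for_depth(bits):
    """Number of distinct values an n-bit pixel can take: 2 to the power n."""
    return 2 ** bits


for bits in (1, 2, 8, 24):
    print(f"{bits:>2}-bit: {colours_for_depth(bits):,} colours")


# Assumption: a 24-bit pixel holds three 8-bit channels in the order
# red, green, blue, with red in the highest bits.
def pack_rgb(red, green, blue):
    """Combine three 0-255 channel values into one 24-bit number."""
    return (red << 16) | (green << 8) | blue


def unpack_rgb(pixel):
    """Split a 24-bit number back into its three 0-255 channel values."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF


banana_yellow = pack_rgb(255, 225, 53)  # an arbitrary yellowish colour
print(format(banana_yellow, "024b"))    # the pixel as 24 binary digits
```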


Making Sense of this Information

So now we know how a robot perceives images: it sees a matrix, that is, a rectangular array, of multi-bit binary numbers. As a single image contains thousands of pixels, each stored as several bits, the robot perceives hundreds of thousands of zeros and ones. This makes it clear why it is harder for a robot to identify objects than to perform complicated calculations. In addition, multiple pictures of the same kind of object will not produce identical arrangements of numbers, as the object could have been rotated slightly, could be a banana of a different size or could have a somewhat different colour. All these challenges make image recognition extremely difficult but not impossible. One approach is to create neural networks (AI), which process the individual pixels of an image and store them in a database. In Greene's simplified description: “When you feed [such a system] [] an image of something, it compares every pixel of that image to every picture [] it’s ever seen. If the input meets a minimum threshold of similar pixels, the AI declares it a [match] []” (Greene, 2018).
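The pixel-by-pixel comparison that Greene describes can be written down directly. The following Python sketch is a deliberately naive version of that comparison and not a neural network; the 0.9 threshold, the striped test image and all function names are assumptions chosen for illustration. It shows why a slightly moved object is a problem: shifting the picture by a single pixel destroys the match.

```python
def similarity(a, b):
    """Fraction of positions where two equally sized images have the same pixel."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def is_match(a, b, threshold=0.9):
    """Declare a match if enough pixels agree (threshold is a made-up value)."""
    return similarity(a, b) >= threshold


def shift_right(image, fill=0):
    """Move every row one pixel to the right, as if the object had moved slightly."""
    return [[fill] + row[:-1] for row in image]


# Hypothetical 1-bit image with vertical stripes.
STRIPES = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
]

print(is_match(STRIPES, STRIPES))               # same picture twice
print(is_match(STRIPES, shift_right(STRIPES)))  # same stripes, moved by one pixel
```

The striped image is a worst case picked on purpose: after the shift, not a single pixel agrees with the original, even though a human would call the two pictures nearly identical.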

Limitations of Localisation through Image Recognition

To use this method, one needs an enormous amount of image data to “train” the robot's neural network. In September 2016, Google released such a dataset to the public; it contains annotated YouTube videos that can be fed into a learning algorithm (Figure 3).


Figure 3: YouTube-8M Dataset as of 2016 (Vijayanarasimhan & Natsev, 2016)

Datasets like this make localisation through image recognition much easier, but it is still far from child's play, as computers and robots continue to mix up different objects. A robot might, for instance, walk into a picture of a staircase instead of going down a real one. These challenges make it nearly impossible for a robot to recognise objects perfectly, and so many other ways of localisation have been developed. After all, there is much more data available from our surroundings than just images; vision seems so important mainly because humans perceive most things through it. Sound, distance and light intensity can all be used for localisation as well. A popular method, for example, combines lasers, which measure the distance between the robot and the nearest obstacle, with probabilistic calculations, which assign to each area of a map the probability that the robot is located there. The area with the highest probability is then the robot's most likely position.
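A minimal sketch of this laser-plus-probability idea, in Python, might look as follows. Everything specific in it is an assumption made for illustration: the corridor with five cells, the expected distances, the Gaussian noise model with a spread of 0.5 metres and the function names. It performs a single Bayes update of the kind used in grid-based localisation and then picks the most probable cell.

```python
import math


def normalise(weights):
    """Scale the weights so that they sum to one, i.e. form probabilities."""
    total = sum(weights)
    return [w / total for w in weights]


def measurement_update(belief, expected, measured, sigma=0.5):
    """Bayes update: weight each cell by how well it explains the laser reading."""
    likelihood = [
        math.exp(-((measured - e) ** 2) / (2 * sigma ** 2)) for e in expected
    ]
    return normalise([b * l for b, l in zip(belief, likelihood)])


# Hypothetical corridor of 5 cells. The laser points at the far wall, so the
# expected distance in metres shrinks as the cell index grows.
EXPECTED_DISTANCE = [5.0, 4.0, 3.0, 2.0, 1.0]

belief = [0.2] * 5  # no prior knowledge: every cell is equally likely

# The laser reports 2.1 m to the next barrier.
belief = measurement_update(belief, EXPECTED_DISTANCE, measured=2.1)

# The cell with the highest probability is the most likely position.
best_cell = max(range(len(belief)), key=belief.__getitem__)
print(best_cell, [round(p, 3) for p in belief])
```

A real robot would repeat this update with every new reading and also shift the probabilities whenever it moves; the sketch leaves the motion step out to keep the core idea visible.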



Bitesize. (n.d.). Encoding images. Retrieved October 9, 2018, from BBC:

Bradley, H. (n.d.). Image Size and Resolution Explained for Print and Onscreen. Retrieved October 9, 2018, from Digital Photography School:

Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., . . . Leonard, J. J. (2016, December). Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Transactions on Robotics, 32(6), 1309-1332. doi:10.1109/TRO.2016.2624754

Greene, T. (2018, July 18). A beginner’s guide to AI: Computer vision and image recognition. Retrieved October 9, 2018, from The Next Web:

Hawes, D. (Performer). (2018, September 14). Taster Lecture: Probabilistic Robots. Department of Engineering Science, Oxford University, Oxford, Oxfordshire, United Kingdom.

Vijayanarasimhan, S., & Natsev, P. (2016, September 28). Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research. Retrieved October 9, 2018, from Google AI Blog:


