
Problem Statement

With smartphones now in widespread use, many applications have started to exploit the hardware available on these devices. Some use the camera to capture pictures and send them to a server for processing, “Google Goggles” being one example. Others capture video. The latest trend, and user expectation, is for the application to match the image and provide analysis or details about it. The difficulty lies in analyzing the moving frames of a video or live camera feed on a mobile device, for the following reasons:

  • Limitation of processing speed.
  • Limited RAM/ In Memory.
  • Limited space to store the target images.
  • Limited bandwidth of internet connection.

Processing Speed:

Image matching is CPU-intensive. In most cases each frame of the video output has to be checked against multiple target images. As the number of target images grows, it becomes difficult for a mobile device to complete the matching within a fraction of a second.

Limited RAM/ In Memory:

Mobile devices have limited memory for running applications. This restricts both the number of target images that can be matched and the size of each image.

Limited space to store the target images:

Although on-device storage has grown over the last few years, bundling a sizeable number of target images (to match against) into the mobile application itself can still be a problem. Fetching them from the server instead introduces latency that depends on the internet connection.

Limited bandwidth of internet connection:

This is a major concern when a large amount of data has to travel over the network. With server-side image matching, the video stream must be sent to the server and the app must wait for the match result, which may not make for a pleasant user experience. The other option is to send the target images from the server to the mobile app and let the app do the matching, but this in turn becomes challenging once the number of target images grows.


These problems can be addressed by intelligently reducing the image data through feature extraction. Instead of matching against the complete image data, we identify a limited set of features that capture what is unique about the image and match only those. A feature could be an edge or a piece of text, optionally together with its coordinates. The uniqueness can be captured as a set of image chunks or a feature vector that is a good approximation of the whole image.

This solution can be applied on both the server side and the client side (processing on the mobile device).

The uniqueness of an image can be extracted in two ways:

  • Manual

In this approach the end user defines, for each target image, the points or small rectangles that represent the uniqueness of the whole image. The platform should provide a UI that lets the end user define them.

The limitation of this approach is the number of target images: we cannot expect an end user to perform this operation on more than 20 to 30 images. Since each target object may need 5 to 10 images with varying lighting or camera angles, the approach becomes impractical as the number of target objects grows.

  • Automated

In this approach an image processing algorithm learns the unique points, or identifies small rectangles (e.g. 3x3 pixels), within the image that are sufficient to detect or identify it. The algorithm reduces the effort required from the end user.

A simple algorithm could take a few hundred random points or small random rectangles from the image. Another approach is to create clusters within the image and derive the unique points or rectangles from those clusters. A third simple approach is to convert the image to black and white and extract the edges for matching. More generally, the automated approach can use any of the known feature extraction algorithms in this area to extract the features of objects.
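The first of these ideas, sampling small random rectangles, can be sketched as follows. This is a minimal illustration only, assuming a grayscale image stored as a list of rows of pixel intensities; the function names `sample_patches` and `patch_difference` and the comparison by mean absolute difference are assumptions made for this sketch, not part of any particular library.

```python
import random


def sample_patches(image, count=200, size=3, seed=0):
    """Sample `count` random size x size patches from a grayscale image.

    `image` is a list of rows of pixel intensities. Each patch is stored as
    (top, left, pixels) so the same positions can be checked in a frame.
    """
    height, width = len(image), len(image[0])
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    patches = []
    for _ in range(count):
        top = rng.randrange(height - size + 1)
        left = rng.randrange(width - size + 1)
        pixels = [
            image[top + dy][left + dx]
            for dy in range(size)
            for dx in range(size)
        ]
        patches.append((top, left, pixels))
    return patches


def patch_difference(patches, frame, size=3):
    """Mean absolute pixel difference between stored patches and a frame."""
    total = 0
    count = 0
    for top, left, pixels in patches:
        index = 0
        for dy in range(size):
            for dx in range(size):
                total += abs(frame[top + dy][left + dx] - pixels[index])
                index += 1
                count += 1
    return total / count


# Hypothetical 8x8 target image; a real one would come from the camera.
target = [[y * 8 + x for x in range(8)] for y in range(8)]
patches = sample_patches(target, count=10)
print(patch_difference(patches, target))  # identical frame, so 0.0
```

Only the sampled patches need to be stored and compared, which is the point of the approach: the data kept per target image is far smaller than the image itself. The sketch assumes the frame is already aligned with the target, which a real matcher could not take for granted.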

The automated approach removes the manual approach's limit on the number of images, and we can probably match from 50 up to a few hundred objects on a mobile device. Its limits are driven by two parameters: the frame change rate (that is, user patience; a user can be expected to wait about a second) and the maximum amount of data that can be stored in the mobile application.

For a good user experience the match has to be computed quickly, and given the frame change rate it is unrealistic to match a huge number of target objects in a fraction of a second. Today, many machine learning techniques are applied on the server side to learn from source images and create a learned model or extracted feature vector, which is then applied to the camera frame. This is in effect a two-phase approach: extraction or learning is performed on the server, and the extracted features are passed to the mobile application for classification (matching the camera frame), where the actual recognition happens. In this case the number of target objects is limited by the size of the extracted data and by the speed with which that data can be applied to each frame. Because mobile devices now have faster processors and more memory, we can expect a good number of target objects to be recognized while still providing a good user experience.
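The two-phase split can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: the extractor `grid_features` (mean intensity per grid cell) is a deliberately simple stand-in for a real learned model or feature extractor, and `build_model`, `match_frame` and the `max_distance` threshold are hypothetical names and values chosen for the example.

```python
def grid_features(image, cells=2):
    """Stand-in extractor: mean intensity of each cell in a cells x cells grid."""
    height, width = len(image), len(image[0])
    features = []
    for row in range(cells):
        for col in range(cells):
            ys = range(row * height // cells, (row + 1) * height // cells)
            xs = range(col * width // cells, (col + 1) * width // cells)
            values = [image[y][x] for y in ys for x in xs]
            features.append(sum(values) / len(values))
    return features


def build_model(targets, cells=2):
    """Phase 1 (server): reduce every target image to a small feature vector."""
    return {label: grid_features(img, cells) for label, img in targets.items()}


def match_frame(frame, model, max_distance=20.0, cells=2):
    """Phase 2 (device): return the closest target label, or None if too far."""
    frame_features = grid_features(frame, cells)
    best_label, best_distance = None, float("inf")
    for label, features in model.items():
        # Euclidean distance between the frame and target feature vectors
        distance = sum(
            (a - b) ** 2 for a, b in zip(frame_features, features)
        ) ** 0.5
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance <= max_distance else None


# Hypothetical 4x4 target images; only the model is shipped to the device.
targets = {
    "dark": [[10] * 4 for _ in range(4)],
    "bright": [[200] * 4 for _ in range(4)],
}
model = build_model(targets)
frame = [[205] * 4 for _ in range(4)]
print(match_frame(frame, model))  # bright
```

The cost of `match_frame` grows linearly with the number of entries in the model, which is why the number of target objects is bounded by both the size of the extracted data and the time available per frame.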

Figure: Traditional approach of image recognition at server side


Figure: Mobile image recognition approach


In both approaches we should also extract the textual information in the image, because matching text within the image improves the chances of finding a match. OCR (Optical Character Recognition) has matured and is a powerful aid for matching images.


Limitation of Solution

The main limitation of image detection on a mobile camera device will always be the memory space and processing speed of the device. The main challenge will always be the number of target objects that can be matched at the given frame rate.

In server-based matching the main limitation has been internet bandwidth. As bandwidth improves, we could get the same experience as with client-based matching, but against a huge number of target images or other content, as in the existing “Google Goggles” concept.

Overcoming the limitation

To further enhance the automated extraction approach and address the limits on processing speed and storage, the recommendation is to add a “selection parameter” that narrows the matching problem. Selection parameters let the mobile device keep only the relevant extracted files locally; changing the selection parameter syncs the device with a different dataset. The selection parameter can either be chosen by the end user or be set automatically by the mobile application. Examples of user-specific selection parameters are “topic of interest”, “industry” and “sports”. Examples of automated parameters are “geo-position” (geographical latitude and longitude) and time-based parameters such as “month”, “day” and “night”.

In this approach the mobile application first syncs the data specific to the selection parameter (manual or automated). To do so it passes the parameter to the server, which uses a “Data Selector” layer to fetch the extracted image data specific to that “selection parameter”. This subset of extracted data is sent to the mobile device. The difference from the earlier approach is that the data on the device changes whenever the user changes the “selection parameter”. The mobile device must sync the relevant subset of data with the server at a periodicity defined by the application.
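The sync flow can be sketched as follows. The class names `DataSelector` and `MobileCache`, the in-memory dictionary standing in for the server's store, and the sample feature values are all assumptions made for illustration; the post does not specify how the “Data Selector” layer is implemented.

```python
class DataSelector:
    """Server-side layer: returns the extracted data for one selection parameter."""

    def __init__(self, datasets):
        # datasets maps (parameter name, value) -> {target label: feature vector}
        self._datasets = datasets

    def select(self, name, value):
        # Return a copy so the device never mutates the server's store.
        return dict(self._datasets.get((name, value), {}))


class MobileCache:
    """Device-side store holding only the subset for the current parameter."""

    def __init__(self, selector):
        self._selector = selector
        self.current = None
        self.features = {}

    def sync(self, name, value):
        """Fetch a new subset only when the selection parameter changes.

        Returns True if the local data was replaced, False if it was kept.
        """
        if (name, value) == self.current:
            return False
        self.features = self._selector.select(name, value)
        self.current = (name, value)
        return True


# Hypothetical extracted feature vectors grouped by selection parameter.
selector = DataSelector({
    ("industry", "retail"): {"shelf-label": [0.1, 0.9]},
    ("industry", "logistics"): {"pallet-tag": [0.7, 0.2]},
})
cache = MobileCache(selector)
cache.sync("industry", "retail")
print(sorted(cache.features))  # ['shelf-label']
```

Because the device holds only one subset at a time, the per-frame matching cost depends on the size of that subset rather than on the size of the full server-side dataset.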

This approach further enhances the existing matching solution and provides a much more robust user experience, tailored to the user's settings or geo-location.

To elaborate on this approach, take the example of finding the availability of an office meeting room just by pointing the mobile camera at it. Every office has meeting rooms with a number or name on the door. We can take a picture of each door (including the meeting room number, which improves the chance of a camera match) at the City A location and at the City B location. The earlier approach would match these images well, but as the number of meeting rooms grows, performance soon drops below what a good user experience requires. It therefore makes sense for the mobile device to match only against the City A meeting room doors, not against the whole database. In this example we send the person's geo-position to the server and synchronize the mobile device with the extracted image features for City A only, instead of all the data. This ensures that speed is not compromised, and when we travel to the other location (City B) the mobile device automatically synchronizes the City B dataset. In this way we overcome the hardware limitation by combining geo-location with image matching. The two complement each other: inside an office, GPS (Global Positioning System) positioning is not very precise, but image matching can still recognize the objects and fetch the data.
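Choosing the dataset from the geo-position can be sketched as below, assuming the server knows the coordinates of each office site. The coordinates for “City A” and “City B” are made-up placeholders, and `haversine_km` and `nearest_site` are hypothetical helper names; the great-circle distance formula itself is the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_site(lat, lon, sites):
    """Return the name of the site closest to the reported device position."""
    return min(
        sites,
        key=lambda name: haversine_km(lat, lon, sites[name][0], sites[name][1]),
    )


# Placeholder coordinates for the two office locations in the example.
sites = {
    "City A": (12.97, 77.59),
    "City B": (19.07, 72.87),
}

# A coarse position is enough: it only has to pick the right city's dataset.
print(nearest_site(12.90, 77.60, sites))  # City A
```

The position only has to be accurate enough to select the right site. Recognizing the individual meeting room door is left to image matching against that site's extracted features.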

We can come up with many more examples of such applications, such as finding the next bus at a bus stop or the next train at a train station.

Figure: Enhanced approach to mobile image recognition


Application of the solution

We already see many applications trying this out, such as Layar, “Google Goggles” and other augmented reality applications, as well as platforms that themselves provide implementations of similar approaches.

SAP can build on the enhanced mobile client recognition approach in many of its enterprise applications and provide a great business user experience: users could retrieve data, reports or status for a material, document, piece of equipment and so on just by pointing the camera at it, and many actions could be triggered the same way.


  1. Kartik R

    Very well explained concept. Adds a lot of value both to the user and to SAP, as this can be utilised in almost any industry using enterprise applications.



