The Science of Raster to Vector Conversion

Automatically Identifying Areas for Correction in Raster to Vector Conversion

Vectorization converts a raster image, which is made up of pixels, into geometry described by coordinates. The process is prone to introducing errors, and those errors form the basis for measuring a conversion system’s performance, as detailed in the study titled Using local deviations of vectorization to enhance the performance of raster-to-vector conversion systems.

Naturally, after converting an image from raster to vector format, a user will cross-check the vector file to determine whether its elements are in their precise positions. A system that creates a precise vector image from a raster file saves the cross-checker a lot of time and effort. One that produces many errors, on the other hand, necessitates post-processing and compounds the time required to scrutinize the vector file.

In this regard, Bodansky and Pilouk, in their 2000 study, contended that the performance of vectorization systems should be measured by the amount of time and effort a user must spend to get a vector file that is as precise as possible relative to the raster image. The researchers also noted that efforts to enhance a raster-to-vector conversion system’s performance should focus on making otherwise time-consuming procedures more efficient. Their study was built on precisely this premise.

Research Motivation

Responding to the shortcomings of pre-2000 performance evaluation methods, Bodansky and Pilouk advanced a methodology that could be used to assess irregular curves. They observed that previous studies described methods of analyzing the performance of vectorization algorithms and systems that applied only to regular lines (straight lines, circles, and circle arcs), a limitation they sought to address.

Additionally, Bodansky and Pilouk observed that previous performance analysis methodologies concentrated on obtaining values denoting the errors that arise from vectorizing raster linear objects. These methodologies also relied on approximated centerlines to measure the errors. Hori and Doermann developed one such methodology in their 1995 study.

Notably, approximated centerlines were only observed whenever vectorization of raster images was undertaken. In short, methods that relied on approximated centerlines could not be applied to the analysis of raster images.

Furthermore, evaluations with previous methodologies found some approximation errors to be unacceptable. For this reason, a way was needed to localize the segments that produced unacceptable approximation errors. The new method also had to handle vector files obtained by converting arbitrary raster linear objects, where arbitrary means both regular and irregular linear objects.

Simply put, Bodansky and Pilouk sought a universal method of analyzing vectorized raster images that featured all types of linear objects. This method would also automatically identify areas for correction within the approximated line segments in raster-to-vector conversions. In doing so, it would provide a basis for analyzing the accuracy (performance) of vectorization, straightening, and smoothing algorithms. Here’s the approach they took.


The researchers took a 3-tiered approach to developing their evaluation method. Each of the 3 tiers was based on a figure of a linear object. Importantly, each subsequent tier addressed the shortcomings observed when the preceding tier’s figure and its characteristics were analyzed. Combining the 3 figures made Bodansky and Pilouk’s evaluation method more effective and reliable.

Tier 1

Fig. 1 for tier 1

Bodansky and Pilouk defined the following terms:

  1.     Linear object: a connected component of a black-and-white raster image that can be described precisely with a radius and a centerline function.
  2.     Centerline: a line that has no junctions and lies within the confines of the linear object. Each point along the centerline should be almost equidistant from the two sides of the linear object, where distance means the shortest distance to each side. Finally, if a linear object is not closed, as in Fig. 1 above, the centerline’s ends, i.e., ps and pf, should be as close to the linear object’s end-points as possible.

In Fig. 1 above:

  • γ is the border of the linear object
  • lc is the centerline
  • lr is the approximated centerline that results from vectorizing the original raster image

The method developed for tier 1 was aimed at evaluating the approximated centerline (lr) at any point within the linear object, particularly in cases where the true centerline lc is not known. Bodansky and Pilouk used the centerline property that each point should be equidistant from the boundaries to measure how asymmetrically a given point sits between the opposite edges of the linear object.

For their analysis, they considered points a and b on the approximated centerline. For point a, they located points al and ar, the points on opposite sides of the border that are closest to a. The researchers then defined the vectors connecting point a to al and ar and used them in the formula below for calculating the local deviation.

Equation 1

At point a, the result of equation 1 was 0, but at b, the local deviation was almost equal in magnitude to the linear object’s thickness.
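The behavior described above can be illustrated with a minimal Python sketch. Equation 1 itself is not reproduced in this article, so the sketch assumes that the local deviation is the length of the sum of the two vectors from a point to its nearest border points. That assumption is consistent with the reported results (0 at a centered point, close to the thickness near a border), but it may not match the paper's exact formula. The function names and the sample strip are hypothetical.

```python
import math


def nearest(point, candidates):
    """Return the candidate closest to point (Euclidean distance)."""
    return min(candidates, key=lambda c: math.dist(point, c))


def local_deviation(point, left_border, right_border):
    """Length of the sum of the vectors from point to its nearest
    neighbours on the two opposite borders (assumed form of the measure)."""
    lx, ly = nearest(point, left_border)
    rx, ry = nearest(point, right_border)
    px, py = point
    # Vectors point -> a_l and point -> a_r cancel when the point is centered.
    sx = (lx - px) + (rx - px)
    sy = (ly - py) + (ry - py)
    return math.hypot(sx, sy)


# Hypothetical horizontal strip of thickness 4: borders at y = 4 and y = 0.
left = [(x * 0.5, 4.0) for x in range(21)]
right = [(x * 0.5, 0.0) for x in range(21)]

centered = local_deviation((5.0, 2.0), left, right)   # point on the true centerline
shifted = local_deviation((5.0, 3.9), left, right)    # point hugging one border
print(centered, shifted)
```

In this toy strip the centered point gives a deviation of 0, while the point next to the border gives a value close to the strip's thickness of 4.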

However, the tier 1 approach had a shortcoming: it could not recognize some points that had a significant local deviation, for instance, where the boundary bulged outwards and changed the line’s thickness sharply, as shown in Fig. 2 below.

Fig. 2 for tier 2

In Fig. 2, equation 1 would underestimate the local deviation at point b. Thus, tier 2 was necessary.

Tier 2

To address the shortcomings of the approach in tier 1, Bodansky and Pilouk introduced a small change: they approximated the border γ using a piecewise line, which they defined as the linear object’s discrete border and designated γd. The vertices of this piecewise line were spaced so that the distance between adjacent vertices did not exceed h.

They constrained h in two ways:

  • It must be smaller than the linear object’s thickness.
  • When it is used to analyze raster linear objects, it must be at least 2 or 3 times smaller than a pixel.
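As a rough illustration of the discrete border, the sketch below inserts extra vertices along a border polyline so that no two adjacent vertices are more than h apart. It is only a minimal sketch under stated assumptions: the function name, the sample border, and the choice of h = 0.5 pixels are hypothetical, and the paper may construct γd differently.

```python
import math


def discretize_border(vertices, h):
    """Insert vertices along a polyline so that no two adjacent
    vertices are more than h apart."""
    if h <= 0:
        raise ValueError("h must be positive")
    result = [vertices[0]]
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        # Number of equal pieces needed so that each piece is at most h long.
        pieces = max(1, math.ceil(length / h))
        for k in range(1, pieces + 1):
            t = k / pieces
            result.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return result


# Hypothetical border following pixel edges (pixel units), with h = half a pixel.
border = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
dense = discretize_border(border, h=0.5)
print(dense)
```

Splitting each segment into equal pieces keeps the original vertices in place, so the discrete border follows the same path as the input polyline.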

From here on, the researchers combined the parameters of tiers 2 and 3.

Tier 3

Fig. 3 for tier 3

What follows from this point on is a series of 6 equations whose derivation is somewhat technical and falls outside this article’s purview. One of them, equation 4 in the research article, was used to calculate the local deviation at point cj, the point on the approximated centerline in Fig. 3 that is closest to gi, where i = 0, 1, 2, 3… n.

Equation 5, by contrast, used relative (dimensionless) local deviations. According to the researchers, this makes equation 5 appropriate for evaluating images with linear objects of different thicknesses.

All in all, the combination of these equations created a method of automatically identifying areas for correction in the raster to vector conversion process. This method can be summarized by equation 8 below. The equation was used to calculate the relative local deviation.

Equation 8

Given that equation 8 summarized their methodology, Bodansky and Pilouk tested it in the experiment described in the section below.

Experiment Testing the Methodology

To test their evaluation methodology (equation 8), the researchers used a fragment of a contour map (shown in the figure below), which they segmented using TROUT, a program that was to be integrated into ArcInfo’s upcoming release.

Bodansky and Pilouk digitized a raster linear object to create one approximation of the centerline, shown in Fig. 5 below. They also used TROUT to compute a second approximation of the centerline, shown in Fig. 6 below. For both approximations, they applied equation 8 to calculate the relative deviation at each point on the segmented linear object.

Fig. 5 Approximated centerline using digitization
Fig. 6 Approximated Centerline Using TROUT

A comparison of the results showed that the deviations differed. For instance, the maximum relative deviation calculated from the approximated centerline obtained by digitization (Fig. 5) was 1.05, and all the points marked in the image had a relative deviation greater than 0.8. In contrast, the maximum relative deviation calculated from the approximated centerline created by TROUT (Fig. 6) was less than 0.8.

The researchers subsequently conducted a visual check, which revealed two things. Firstly, points with large values of the relative deviation (0.7 or more) represented points on the centerline that had shifted to one side of the linear object. Secondly, areas close to the border were more likely to contain noise that resulted from binarization.
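The idea of flagging points by relative deviation can be sketched in a few lines of Python. Equations 5 and 8 are not reproduced in this article, so the sketch assumes that the relative deviation is the local deviation divided by the local thickness. The 0.7 threshold comes from the visual check described above, while the function name and the sample values are hypothetical.

```python
def flag_for_correction(samples, threshold=0.7):
    """Return (index, relative_deviation) for samples at or above the threshold.

    samples: iterable of (local_deviation, thickness) pairs in the same units.
    """
    flagged = []
    for i, (deviation, thickness) in enumerate(samples):
        if thickness <= 0:
            raise ValueError("thickness must be positive")
        # Assumed normalization: dividing by thickness makes the value dimensionless.
        relative = deviation / thickness
        if relative >= threshold:
            flagged.append((i, relative))
    return flagged


# Hypothetical (local deviation, thickness) pairs measured in pixels.
samples = [(0.2, 4.0), (3.0, 4.0), (1.0, 2.0), (2.1, 2.0)]
flagged = flag_for_correction(samples)
print(flagged)
```

Because the values are normalized by thickness, the same threshold can be applied to thick and thin lines alike, which is the benefit the researchers attribute to relative deviations.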

Binarization refers to the process of converting a color or grayscale image into a black and white image. Notably, though, other definitions exist.
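A minimal sketch of one common form of binarization, fixed-threshold binarization of a grayscale image, is shown below. The threshold of 128 and the convention that dark pixels become foreground (1) are assumptions made for illustration, not details taken from the study.

```python
def binarize(gray, threshold=128):
    """Map each grayscale value (0-255) to 1 (black, foreground)
    or 0 (white, background) using a fixed threshold."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Hypothetical 2x3 grayscale patch crossing the edge of a dark line.
gray = [
    [250, 120, 30],
    [200, 130, 10],
]
binary = binarize(gray)
print(binary)
```

The two middle pixels have similar gray values (120 and 130) but land on opposite sides of the threshold, which hints at why pixels near a line's border are prone to binarization noise.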

The noise reduces the evaluation process’s precision when the local deviations are being analyzed. The researchers observed that the noise is usually more pronounced in thin linear objects, i.e., those whose widths are about 4 pixels or less. However, smoothing the image reduces the noise from binarization.

Bodansky and Pilouk observed that reduced noise had a corresponding effect on the relative deviations: they decreased in value. Upon smoothing the image in Fig. 5, the maximum relative deviation dropped from 1.05 to less than 0.85. The maximum relative deviation for the smoothed Fig. 6, on the other hand, was below 0.35.
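The article does not say which smoothing technique the researchers applied. Purely as an illustration of how smoothing can remove border noise from a binary image, the sketch below uses a 3x3 majority filter, a swapped-in technique chosen for simplicity. The function name and the sample image are hypothetical.

```python
def majority_smooth(image):
    """Apply a 3x3 majority filter to a binary image (list of rows of 0/1)."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            # Collect the pixel and its in-bounds neighbours.
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            # Keep the pixel as foreground only if most of the window is foreground.
            out[r][c] = 1 if 2 * sum(window) > len(window) else 0
    return out


# Hypothetical horizontal line 3 pixels thick with one stray pixel on its border.
clean = [[0] * 7, [0] * 7, [1] * 7, [1] * 7, [1] * 7, [0] * 7, [0] * 7]
noisy = [row[:] for row in clean]
noisy[1][3] = 1  # binarization noise just above the line
smoothed = majority_smooth(noisy)
print(smoothed == clean)
```

In this example the stray border pixel is removed while the line itself is left intact, so a centerline computed afterwards is less likely to be pulled toward the noisy side.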

The researchers developed a method of identifying areas for correction in raster to vector conversions. This method relied on the calculation of the relative deviation using equation 8 above. High relative deviations showed that the points in question were in areas that had binarization noise. Consequently, these areas were smoothed, and the relative deviation dropped significantly.

Suffice it to say that Bodansky and Pilouk advanced a method that identified areas wherein binarization noise made the approximation of the centerline difficult, resulting in high relative deviation. By knowing this relationship, it becomes easier to identify areas of a vectorized image that need smoothing.