Exploring Accuracy & Possible Data Loss in Raster-to-Vector Conversion

Converting vector files to raster (rasterization) and raster files to vector (vectorization) is common, particularly in geographical information systems (GIS). However, each conversion raises accuracy concerns, which have been the subject of multiple research studies. Russell Congalton, in his paper ‘Exploring and Evaluating the Consequences of Vector-to-Raster and Raster-to-Vector Conversion,’ builds on the findings of these investigations to explore the accuracy of GIS vector to raster to vector conversion further.

Raster and Vector Data

Raster Data

Raster data, which encompasses scanned maps and satellite images, comprises grid cells, each of which is coded. The code allocated to each grid cell is based on either of two classification systems: the binary system or the color system.

In the binary system, each cell holds either a 0 or a 1. The 1s identify an area where a particular feature, such as a road, is located, while the 0s represent an area that doesn’t have the feature.

The color system depends on the amount of light within a given color band that reaches a satellite from the earth. Each band is recorded as a range of numbers from 0 to 255. As such, the system works by assigning the grid cell a number from 0 to 255 based on the satellite’s reading of the light reflected from the earth. Regardless of the classification system used, raster data is stored in a raster format, i.e., as a collection of pixels.

Vector Data

In contrast, vector data consists of lines, polygons, and points, which are also coded. Notably, a line is a collection of points, while a polygon is made up of many lines. The coding system in vector data relies on the coordinates of each point (x, y) and its identification number. It follows that when the coordinates and identification numbers for all points in the map are known, the lines and polygons are said to be coded.
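The point, line, and polygon hierarchy described above can be sketched with a few plain Python structures. This is only an illustration: the `Point` class, the identifiers, and the unit-square coordinates are hypothetical, and real vector formats store considerably more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    point_id: int  # identification number
    x: float
    y: float


# Hypothetical unit square: four coded points, four lines, one polygon.
points = {
    1: Point(1, 0.0, 0.0),
    2: Point(2, 1.0, 0.0),
    3: Point(3, 1.0, 1.0),
    4: Point(4, 0.0, 1.0),
}
lines = {"a": [1, 2], "b": [2, 3], "c": [3, 4], "d": [4, 1]}  # ordered point ids
polygon = ["a", "b", "c", "d"]  # ordered line ids forming a closed ring


def polygon_ring(polygon, lines, points):
    """Resolve a polygon to its ordered (x, y) vertices.

    Each line's last point is skipped because the next line starts there.
    """
    ring = []
    for line_id in polygon:
        for point_id in lines[line_id][:-1]:
            p = points[point_id]
            ring.append((p.x, p.y))
    return ring


def shoelace_area(ring):
    """Area of a simple polygon from its vertices (shoelace formula)."""
    total = 0.0
    for i, (x0, y0) in enumerate(ring):
        x1, y1 = ring[(i + 1) % len(ring)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


ring = polygon_ring(polygon, lines, points)
print(ring)
print(shoelace_area(ring))
```

Because the polygon is defined by exact coordinates, its area can be computed directly, which is the reference value a rasterized version gets compared against.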

The difference in the building blocks and identifiers when dealing with raster and vector data means that they can’t be used interchangeably without first changing these underlying characteristics. Thus, rasterization and vectorization are necessary. This is one avenue that creates accuracy concerns.

The second avenue relates to the digitization of maps. In the past, this was done manually, which introduced new errors. Differences in scale and in the width of the raster lines were further sources of error. To this end, researchers called for the formulation and use of digitizing standards. In his study, published in 1997 in the journal Photogrammetric Engineering & Remote Sensing, Russell instead explored the errors that emanate from the first avenue, i.e., rasterization and vectorization.

Sources of Inaccuracies in Rasterization and Vectorization

Earlier researchers, whose observations Congalton summarized, found that the inaccuracies differed based on the type of conversion being conducted.

Rasterization

For vector to raster conversions (rasterization), they noted that three issues gave rise to the inaccuracies:

Polygon fill
Cell classification
Cell size

Polygon fill issues relate to the use of Arc/Info GIS, which assigns the same value to every cell located within a polygon based on the user’s direction. However, thereafter, the user cannot alter the process. The cell classification issue occurs because Arc/Info assigns a given value of a raster cell to the classification that occupies the largest percentage of the area in the cell. In doing so, it disregards other possible classifications.

Lastly, the cell size, when carrying out a vector to raster conversion, determines two fundamental parameters: the information that will be stored in the raster format and the accuracy of that information.
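The cell classification rule described above can be sketched in a few lines. This is a simplified illustration of a largest-share rule, not Arc/Info’s actual implementation; the class names and area shares are hypothetical.

```python
def classify_cell(area_shares):
    """Return the class covering the largest share of a cell's area.

    On a tie, the class listed first wins (a choice made for this sketch).
    """
    return max(area_shares, key=area_shares.get)


def discarded_share(area_shares):
    """Share of the cell's area whose class is lost after classification."""
    return 1.0 - max(area_shares.values())


# Hypothetical cell crossed by three land-cover classes (shares sum to 1).
cell = {"forest": 0.45, "water": 0.35, "road": 0.20}

print(classify_cell(cell))
print(round(discarded_share(cell), 2))
```

In this made-up cell, the winning class covers less than half of the area, yet the whole cell takes its value. Larger cells are more likely to straddle several classes, which is one way cell size feeds into accuracy.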

Vectorization

For raster to vector conversions (vectorization), the issues identified include:

The pre-processing of raster files to reduce the amount of information to be processed.
The enhancement of the raster image by reducing the number of classified pixels and fine-tuning lines using a thresholding function.
The post-processing of data, which includes changing the blocky appearance of the shapes, smoothing lines (to remove stair-steps), or reducing the number of corners (vertices).
Differences between raster and vector files. Shapes such as diagonal lines and circles in vector files appear stair-stepped once rasterization is complete, and diagonal shapes also show pixel protrusions. As such, the conversion process dramatically changes shapes.
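To make the post-processing step more concrete, here is a minimal sketch of one common way to reduce vertices and flatten stair-steps: the Douglas-Peucker line simplification algorithm. The source doesn’t say which method the GIS tools used, so treat this as a stand-in technique; the `simplify` function, the tolerance values, and the stair-stepped test line are assumptions for illustration.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def simplify(points, tolerance):
    """Douglas-Peucker: drop vertices lying within `tolerance` of the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior vertex farthest from the chord between the endpoints.
    max_dist, max_idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, max_idx = d, i
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    # Keep that vertex and simplify both halves recursively.
    left = simplify(points[: max_idx + 1], tolerance)
    right = simplify(points[max_idx:], tolerance)
    return left[:-1] + right


# A diagonal line as it might look after rasterization: a staircase.
stair = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3)]

print(simplify(stair, 1.0))  # generous tolerance: staircase collapses
print(simplify(stair, 0.1))  # tight tolerance: every step is kept
```

The tolerance is the trade-off: a large value removes the stair-steps but can also cut real corners, while a small value preserves the blocky outline.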

These observations are consistent with some of our recommendations on thresholding and smoothing the image or lines.

Russell’s Study

Motivation and Approach

Upon identifying the sources of inaccuracies, Russell also identified the research gaps that motivated his 1997 study. He noted that the methods used to measure the conversions’ accuracy, e.g., comparing area and perimeter, fell short because they didn’t provide enough information to give a clear picture.

As such, his study addressed these deficiencies by proposing a new method that measured the displacement, i.e., the difference between the area of shapes in the vector and raster files, and then converted the difference to percentages. Furthermore, his study included the conversion of vector data to raster data and back to vector data. In the process, some shapes would lose area that belonged to them (omission), and some would gain area that didn’t (commission).

Thus, Russell measured the error based on the amount of area committed to each polygon and the area omitted. He also quantified the area that was correctly assigned. Subsequently, he represented the analysis in an error matrix.
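The idea of tallying correct, omitted, and committed area in an error matrix can be sketched as follows. This is an illustration under simplifying assumptions, not the paper’s procedure: both shapes are represented as binary grids of equal-area cells, and the grids and function names are hypothetical.

```python
def error_matrix(reference, converted):
    """Cross-tabulate two binary grids of equal size.

    Rows are the reference (original) shape, columns the converted shape:
    [[inside/inside, inside/outside], [outside/inside, outside/outside]].
    """
    matrix = [[0, 0], [0, 0]]
    for ref_row, conv_row in zip(reference, converted):
        for ref, conv in zip(ref_row, conv_row):
            matrix[1 - ref][1 - conv] += 1
    return matrix


def error_percentages(matrix):
    """Express correct, omitted and committed cells as % of the reference area."""
    (correct, omitted), (committed, _) = matrix
    reference_cells = correct + omitted
    return {
        "correct": 100.0 * correct / reference_cells,
        "omission": 100.0 * omitted / reference_cells,
        "commission": 100.0 * committed / reference_cells,
    }


# Hypothetical shape before and after conversion (1 = inside the shape).
reference = [[1, 1, 0],
             [1, 1, 0]]
converted = [[0, 1, 1],
             [0, 1, 0]]

matrix = error_matrix(reference, converted)
print(matrix)
print(error_percentages(matrix))
```

Note that omission and commission can partly cancel out in a plain area comparison, which is why a shape can keep roughly the same total area while still being displaced.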

Methodology

Russell used two data sets. In the first, he utilized five simple shapes: a circle, a hole within the circle, a triangle, a square, and a non-convex shape. He then rasterized these shapes using five sizes of square grid cells with sides of 0.5, 0.2, 0.1, 0.05, and 0.025 inches, whose areas were 0.25, 0.04, 0.01, 0.0025, and 0.000625 square inches, respectively. The second data set consisted of three shapes of equal areas: a wide rectangle, a circle, and a narrow rectangle.

For the raster to vector conversion of the rasterized shapes, the researcher used the tools in the Arc/Info GIS, which enabled him to create an overlay in which the grid cell sizes for the now vectorized file were the same as in the raster file. Using these cells, Russell calculated the area of each newly converted polygon by simply counting the cells that each polygon occupied and multiplying the resultant figure by the area of one grid cell.
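The arithmetic behind this area calculation is simple enough to sketch. The cell sides below come from the text; the function names and the 16-cell polygon are hypothetical.

```python
CELL_SIDES_IN = [0.5, 0.2, 0.1, 0.05, 0.025]  # cell sides in inches, from the text


def cell_area(side):
    """Area of a square grid cell, in square inches."""
    return side * side


def polygon_area(cell_count, side):
    """Area of a polygon as (number of cells it occupies) x (area of one cell)."""
    return cell_count * cell_area(side)


for side in CELL_SIDES_IN:
    print(side, round(cell_area(side), 6))

# Hypothetical polygon occupying 16 cells of the 0.5-inch grid.
print(polygon_area(16, 0.5))
```

Since the cell area shrinks with the square of the side, halving the side quadruples the number of cells needed to cover the same shape.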

The area measurements are essential because they help establish how much data is lost, thereby determining the accuracy of the GIS raster to vector conversion after rasterizing vector shapes. The overlay function in Arc/Info made it easy to calculate the displacements, which showed how much data was lost in terms of the differences in area between the vector and raster files.

Results

From data set 1, Russell made the following observations:

Rasterization of the hole within the circle, when using the 0.5 by 0.5-inch grid cell, lost the information about its shape. The larger circle became an ortho-convex polygon, as shown below.

Figure: Shapes obtained using 0.25 square inch grid cells.

When the sizes of the grid cells were reduced, the hole became a non-rectangular ortho-convex shape, as shown below.

Figure: Shapes obtained using 0.04 square inch grid cells.

The conversion process led to either a reduction or an increase in the area due to omission (areas mistakenly removed from the shapes) or commission errors (areas mistakenly added into the shapes), respectively. The differences in areas, based on the grid sizes used, are summarized in the figure below. However, the area fluctuations were observed to be minimal.

Figure: Original and new areas of shapes.

The number of error polygons increased with the reduction in the grid cell sizes. However, the area of each polygon decreased even more.
The square remained square while the triangle lost its vertices, became smaller, and its diagonals assumed a stair-stepped outline.
Generalizing the shapes using the 0.04 square inch grid cells smoothed the shapes, as shown in the image below. The circle became an octagon while the hole and triangle lost their stair-stepped outline.

Figure: Shapes after generalization.

In conclusion, Russell summed up his observations from data set 1 by stating that the amount of error that accompanied rasterization and subsequent vectorization depended on both the grid cell sizes (which affected the area of the shapes) and the shape of the polygons. Perhaps this conclusion was the motivation behind the inclusion of data set 2 in the study.

Nonetheless, the main reason was an earlier study Russell had conducted in 1988, which established that satellite mapping errors arose from the different land-use practices whose representations in the images differed significantly. For instance, agricultural land was represented by patterns/shapes whose perimeter to area ratio was small because crops were grown over a sizable homogeneous tract of land.

In contrast, the shapes used to represent rangelands were mixed since such areas consisted of both large homogeneous grassy regions and small woody regions. Representing forests was spatially complex.

Thus, the second data set was aimed at investigating the effect of shape when the area is kept constant (1 square inch). In this case, though, Russell used 14 square grid cell sizes whose sides measured 1.00, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.20, 0.10, and 0.05 inches. He also analyzed three shapes: a circle, a narrow rectangle (high perimeter to area ratio), and a wide rectangle (low perimeter to area ratio).
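To see how the perimeter to area ratio separates the three shapes, consider the short sketch below. The source doesn’t give the rectangles’ dimensions, so the 2 x 0.5 inch and 10 x 0.1 inch values are hypothetical; they were chosen only because both give an area of 1 square inch.

```python
import math


def circle_ratio(area):
    """Perimeter-to-area ratio of a circle with the given area."""
    radius = math.sqrt(area / math.pi)
    return (2.0 * math.pi * radius) / area


def rectangle_ratio(width, height):
    """Perimeter-to-area ratio of a width x height rectangle."""
    return 2.0 * (width + height) / (width * height)


# Hypothetical dimensions: both rectangles have an area of 1 square inch.
shapes = {
    "circle": circle_ratio(1.0),
    "wide rectangle": rectangle_ratio(2.0, 0.5),
    "narrow rectangle": rectangle_ratio(10.0, 0.1),
}
for name, ratio in shapes.items():
    print(f"{name}: {ratio:.2f}")
```

For a fixed area, the circle has the smallest possible ratio, and the ratio grows as a rectangle gets thinner. More boundary per unit of area means more cells that are only partly covered, which is where conversion errors arise.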

As with the first data set, all these shapes were rasterized and then vectorized. The objectives of this second part of the study were:

To determine the maximum grid cell size required to ensure the polygon is maintained throughout the vector to raster to vector conversion.
To determine the maximum grid cell size required to maintain each of the three polygons’ shapes.
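The first objective above can be sketched as a sweep over the 14 cell sizes. The sketch rests on several assumptions that aren’t from the paper: a cell is kept only when the shape covers more than half of it, the grid origin sits on the shape’s lower-left corner, and the narrow rectangle measures 4 x 0.25 inches. Its output therefore illustrates the mechanism rather than reproducing Russell’s figures.

```python
import math

# Cell sides in inches, as listed for the second data set.
CELL_SIDES_IN = [1.00, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50,
                 0.45, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]


def overlap(lo, hi, cell_lo, cell_hi):
    """Length shared by the intervals [lo, hi] and [cell_lo, cell_hi]."""
    return max(0.0, min(hi, cell_hi) - max(lo, cell_lo))


def rasterized_cells(width, height, side):
    """Count cells more than half covered by a width x height rectangle.

    Assumes the grid origin sits on the rectangle's lower-left corner.
    """
    count = 0
    for col in range(math.ceil(width / side)):
        for row in range(math.ceil(height / side)):
            covered = (overlap(0.0, width, col * side, (col + 1) * side)
                       * overlap(0.0, height, row * side, (row + 1) * side))
            if covered > 0.5 * side * side:
                count += 1
    return count


def largest_surviving_side(width, height, sides=CELL_SIDES_IN):
    """Largest listed cell side at which at least one cell is retained."""
    surviving = [s for s in sides if rasterized_cells(width, height, s) > 0]
    return max(surviving) if surviving else None


# Hypothetical narrow rectangle: 4 x 0.25 inches (area 1 square inch).
for side in CELL_SIDES_IN:
    cells = rasterized_cells(4.0, 0.25, side)
    print(side, cells, round(cells * side * side, 4))
print(largest_surviving_side(4.0, 0.25))
```

Under these assumptions, the rectangle vanishes whenever the cell side is at least twice its width, because no cell can then be more than half covered. The rasterized area only settles near 1 square inch at the smallest cell sizes.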

Russell made the following observations:

The shapes were more closely approximated when the smallest grid size was being used.
The narrow and wide rectangles disappeared at all grid cell sizes whose sides were above 0.50 inches, i.e., 50% of the original size (1 inch).
All polygons that were initially present in the original shapes appeared at different grid cell sizes. For the circle, this was at a grid cell size of 56% of the original area. For the narrow and wide rectangles, all the polygons appeared at 12% and 42%, respectively. The images in the article’s appendix section help illustrate this.

Russell’s paper demonstrated that the vector to raster to vector conversion suffers from several problems that adversely affect the accuracy of GIS raster to vector conversion. Acknowledging this, Russell also provided future directions on how other researchers should approach the analysis of rasterization and vectorization to establish other sources of errors.

All in all, the accuracy of both GIS raster to vector and vector to raster conversions is impacted heavily by the underlying differences between the two formats. They differ in their building blocks and in how they’re coded. This, coupled with other differences, is responsible for the errors recorded in Russell Congalton’s study.