Fernando Silva-Coira and friends make great strides by developing algorithms that accept both raster and vector inputs – a first – in their research article titled ‘Efficient processing of raster and vector data’. Through their efforts, the authors tackle an issue so problematic that it had been ‘shunned’ by researchers in previous studies.
In studies preceding Silva-Coira’s, researchers and companies focused on either raster datasets or vector datasets – they rarely handled them together. In the instances where they did, at least superficially, the vector datasets were first converted to raster format before being fed into, say, a query in GIS software. So shunned was the issue of joining raster and vector datasets that dealing with them in isolation became something of an industry standard, mostly because it was – and perhaps still is – the approach taken in ArcGIS and GRASS GIS. What’s more, scholars who attempted to bridge the existing informational gap ended up facing multiple challenges, which they listed as research limitations.
However, this section isn’t meant to fault noble companies or other researchers’ work, far from it. Instead, it’s meant to emphasize Silva-Coira and friends’ game-changing and ground-breaking findings. This feat forms the basis of discussion in this article. Let’s get into it.
Before we continue, we want to give an honorable mention to our recent article ‘Raster and Vector Data in Urban Climate Studies’, which may be of similar interest to readers.
The Motivation for the Study
The development of the digital society is responsible for the growth of vector data; in particular, the fact that most people now carry GPS-equipped smartphones has driven an explosion of this type of data. In contrast, the increase in the amount of raster data, particularly in GIS applications, has been due to hardware improvements such as satellites.
These advances, though made in isolation, have created new information demands that require combining data stored in both raster and vector formats. However, this is not as easy as it sounds – especially given the large volumes of data now available. The fact that few studies, if any, have attempted to create a model that combines the two formats makes the difficulty even more pronounced.
To this end, there was a need to bridge this gap and provide a model or algorithm that would ease the combination of raster and vector datasets, particularly when dealing with spatial data. This is precisely what Silva-Coira and friends based their research on: they created a framework to store and manage vector and raster datasets jointly.
The researchers, cognizant that raster data requires a lot of storage at scale, sought to integrate data compression into their model. They had observed, from previous research, that compression improved the scalability and processing time of the data. As such, they built the k²-raster compact data structure into the framework, which had already proved beneficial in terms of space and processing time for raster data. For the management and storage of vector data, they would use the R-tree, a traditional way of storing and indexing this type of data.
The framework took the form of two algorithms that answer operations whose input is either a raster or a vector dataset, without requiring any form of conversion. The first algorithm is a spatial join between the two types of data that imposes a range restriction on the values in the raster dataset. The second generates “the top-K different objects of the vector dataset overlapping the highest (or lowest) values of the raster dataset.”
The algorithms successfully saved storage space owing to the use of k²-raster. However, the authors anticipated a few challenges, since the processes and structures required to manage the compressed data and indexes directly are complex. That complexity could translate into more memory consumption and longer processing times, negating the initial space savings.
The paper was aimed at developing the two algorithms, and the authors succeeded despite the setbacks. They even got around the challenge they had anticipated, which led to a few recommendations as you’ll see later.
Definition of Terms
Before elaborating further on what the algorithms entailed and achieved, we need to be on the same page regarding the various terms that feature prominently in the paper’s results section. These include k²-raster, top-K, the R-tree, and spatial join.
k²-raster is a data structure whose main attribute is its compression capability. It stores a compressed integer raster matrix and also indexes it. It’s efficient because it lets users query the value of a specific cell, which it retrieves promptly, and find all cells whose values fall within a given range.
The compression in the k²-raster data structure comes from exploiting uniformity in the integer matrix: the matrix is recursively subdivided into submatrices, from which a conceptual tree (a k²-ary tree) is built. This tree records the maximum and minimum values of each submatrix and the subdivisions made. The k²-raster structure then uses binary bitmaps and encoding schemes for the integer sequences to describe the conceptual tree more compactly. The combination of all these steps ensures that the matrix is compressed and indexed efficiently.
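To make the subdivision idea concrete, here is a toy sketch (with k = 2) of the conceptual tree described above: the matrix is recursively split into k × k submatrices and each node records its (min, max) pair, with uniform submatrices left unsubdivided. All names are illustrative; the real k²-raster goes further, encoding this tree with bitmaps and compact integer codes rather than Python dictionaries.

```python
# Toy sketch of the k^2-raster conceptual tree (k = 2): split the
# matrix recursively and record (min, max) per node. Uniform
# submatrices (min == max) are not subdivided further -- that pruning
# is where the compression comes from.

def build_tree(matrix, r0, c0, size):
    """Return a nested dict with min, max, and children (or None)."""
    cells = [matrix[r][c]
             for r in range(r0, r0 + size)
             for c in range(c0, c0 + size)]
    lo, hi = min(cells), max(cells)
    node = {"min": lo, "max": hi, "children": None}
    if lo != hi and size > 1:
        half = size // 2
        # Children in order: top-left, top-right, bottom-left, bottom-right
        node["children"] = [
            build_tree(matrix, r0 + dr, c0 + dc, half)
            for dr in (0, half) for dc in (0, half)
        ]
    return node

raster = [
    [4, 4, 1, 2],
    [4, 4, 3, 3],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
]
tree = build_tree(raster, 0, 0, 4)
print(tree["min"], tree["max"])  # 1 5
```

Note how the uniform quadrants (the all-4 and all-5 blocks) become leaves immediately, while only the mixed quadrant is subdivided – the more uniform the raster, the smaller the tree.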
In the study, the K objects are the vector objects that overlap the highest (or lowest) raster cell values when all objects are considered.
Top-K is a query used to retrieve the maximum values or scores from a dataset, and it is often used in ranking applications. It is no wonder that Silva-Coira and friends incorporated top-K in their algorithm to retrieve K objects within a vector dataset. In their case, the top-K query returns the K objects of the vector dataset that overlap the raster dataset’s highest (or lowest) cells.
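A brute-force sketch may help fix the idea. The hypothetical function below ranks vector objects by the highest raster value under their footprint and returns the top K names; the paper’s actual algorithm avoids this full scan by pruning with the (min, max) values stored in the k²-raster tree.

```python
import heapq

# Brute-force top-K sketch: rank objects by the highest raster cell
# value they overlap. 'objects' maps a name to the set of (row, col)
# cells that object covers -- an illustrative stand-in for real
# geometries indexed by an R-tree.

def top_k(raster, objects, k):
    best = {}  # name -> highest raster value under the object
    for name, cells in objects.items():
        for (r, c) in cells:
            v = raster[r][c]
            if name not in best or v > best[name]:
                best[name] = v
    # heapq.nlargest ranks the objects by their best overlapped value
    return [name for name, _ in
            heapq.nlargest(k, best.items(), key=lambda kv: kv[1])]

raster = [[9, 1], [3, 7]]
objects = {"A": {(0, 0)}, "B": {(1, 1)}, "C": {(0, 1), (1, 0)}}
print(top_k(raster, objects, 2))  # ['A', 'B']
```

Swapping `nlargest` for `nsmallest` would give the “lowest values” variant of the query mentioned in the paper.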
The k²-treap combines the k²-tree and the treap data structure, giving rise to a new structure that returns fast answers to top-K queries.
An R-tree is a tree data structure that stores indexes identifying spatial data efficiently. R-trees are incredibly important when storing and querying spatial data.
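The key idea is that every node carries a bounding rectangle enclosing everything beneath it, so a window query can skip whole subtrees. Here is a minimal, hand-built two-level illustration (a real R-tree also balances itself on insertion; the names and data here are made up for the example):

```python
# Minimal R-tree illustration: entries sit under nodes whose minimum
# bounding rectangles (MBRs) enclose them, so a window query can prune
# non-overlapping nodes without inspecting their entries.

def intersects(a, b):
    """Axis-aligned boxes as (xmin, ymin, xmax, ymax)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def mbr(boxes):
    """Minimum bounding rectangle of a list of boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

# Hard-coded two-level tree: two leaf nodes of bounding boxes.
leaves = {
    "west": {"p1": (0, 0, 1, 1), "p2": (1, 2, 2, 3)},
    "east": {"p3": (8, 8, 9, 9)},
}
nodes = {name: mbr(list(entries.values())) for name, entries in leaves.items()}

def query(window):
    hits = []
    for name, box in nodes.items():
        if intersects(box, window):          # prune whole subtrees here
            hits += [oid for oid, b in leaves[name].items()
                     if intersects(b, window)]
    return sorted(hits)

print(query((0, 0, 3, 3)))  # ['p1', 'p2']
```

A query window in the west never touches the “east” node’s entries at all – that pruning is what makes R-trees efficient on large spatial datasets.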
In GIS, a spatial join refers to the process of attaching data from one table to another based on a spatial relationship between their features. In the study, the spatial join attaches vector-based data to raster-based datasets and vice versa.
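As a rough sketch of the paper’s first algorithm, the hypothetical function below reports each (object, cell) pair where an object covers a raster cell whose value lies in a query range. The real algorithm traverses the R-tree and the k²-raster together, pruning with node MBRs and submatrix (min, max) values instead of scanning every cell as this brute-force version does.

```python
# Brute-force spatial join with a range restriction on raster values:
# report (object, cell, value) wherever an object covers a cell whose
# value falls in [lo, hi]. 'objects' maps a name to the set of
# (row, col) cells it covers -- illustrative, not the paper's API.

def spatial_join(raster, objects, lo, hi):
    out = []
    for name, cells in sorted(objects.items()):
        for (r, c) in sorted(cells):
            if lo <= raster[r][c] <= hi:   # range restriction
                out.append((name, (r, c), raster[r][c]))
    return out

raster = [[9, 1], [3, 7]]
objects = {"A": {(0, 0), (0, 1)}, "B": {(1, 1)}}
print(spatial_join(raster, objects, 5, 10))
# [('A', (0, 0), 9), ('B', (1, 1), 7)]
```

The compressed structures let the paper’s version answer the same question while discarding, in one step, any R-tree subtree or raster submatrix whose bounds cannot satisfy the range.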
To judge the algorithms’ effectiveness in reducing storage requirements and query times, the researchers established baselines. Two of them, referred to as the Plain baselines, load a complete uncompressed dataset into main memory. The third baseline used NetCDF (Network Common Data Form), which compresses raster matrices and allows users to access the compressed data without fully decompressing it.
NetCDF made for an ideal comparison against k²-raster’s compression capabilities. On the one hand, NetCDF accesses large portions of information faster than k²-raster; on the other, k²-raster gives better access times when retrieving individual raster cells and when solving queries that test specific conditions in the dataset. For this reason, k²-raster was expected to be better suited to top-K queries than NetCDF – but a comparison was still necessary to establish how much better it really is.
The researchers also analyzed the framework by comparing its results against those obtained using the k²-treap and k²-raster-cells (a method that uses the k²-raster representation to retrieve the cells of the raster dataset that meet a particular criterion described in the query).
The Results
Upon creating and deploying the algorithms, Silva-Coira and his colleagues observed the following:
- The new framework consumed less memory than the Plain baselines – a reduction of between 31% and 89% (up to roughly 9 times less memory). Query times improved by up to 80%.
- The new framework used between 5 and 15 times less space for the top-K operation than the k²-treap.
- It used up to 3 times less memory than the Plain baselines when completing the spatial join.
- The framework was 3 times faster than NetCDF but used slightly more memory.
- The framework was 5 times faster than NetCDF and the Plain baselines in answering the top-K query. It was also 33 times faster than the k²-treap and k²-raster-cells.
Based on their analysis of the results, the researchers concluded that using complex data structures doesn’t automatically translate to more memory consumption if the query algorithms are carefully designed. In this regard, Silva-Coira and friends essentially managed to deal with the challenges they had anticipated at the beginning of the study.
The research article’s findings are welcome news in the world of GIS, which hadn’t previously seen such effective use of both raster and vector datasets as inputs to the same operations. Further, they show that users who need to combine these two types of data only have to design their algorithms carefully. Such an algorithm reduces the number of processes and structures required to handle the compressed data, thereby saving space.
When designed appropriately, these algorithms even reduce processing time, provided they utilize the R-tree and k²-raster. Considering that GIS software packages previously had to convert vector datasets to raster format before including them in a raster operation – wasting time in the process – the framework the researchers developed in their article is a clear step forward.