Spatial Data

How data is organized in a GIS

Geographic data is represented by two major data models: vector and raster.

The vector data model represents geographic space using points, lines, and polygons. The raster data model, on the other hand, represents geographic space as a surface made up of cells, like the pixels in an image. In public health, we typically use the vector data model because we are interested in how health phenomena are distributed within political boundaries. The raster data model is more common for satellite-based data within the environmental sciences.

Vector data model

Three basic vector types exist: points, lines, and polygons. Points are the fundamental unit of the vector data model. They make up everything! In the health context, points often map to the location of facilities, points of interest, or homes. Lines are simply multiple points stitched together with straight line segments. This allows us to represent things like roads, rivers, or even the movement of people. Polygons extend the concept of lines by stitching together points with line segments to form an enclosed area.
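
To make the three vector types concrete, here is a minimal sketch in plain Python. The coordinates and the names clinic, road, and district are made up for illustration, and real GIS libraries use richer geometry objects than bare tuples and lists.

```python
import math

# A point is a single (x, y) coordinate pair.
clinic = (2.0, 3.0)

# A line is an ordered sequence of points joined by straight segments.
road = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]

# A polygon is a sequence of points whose last vertex closes back on the first.
district = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]


def line_length(points):
    """Sum the lengths of the straight segments between consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def polygon_area(ring):
    """Area of a closed ring, computed with the shoelace formula."""
    twice_area = sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )
    return abs(twice_area) / 2.0


print(line_length(road))       # 11.0 (segments of length 5 and 6)
print(polygon_area(district))  # 12.0 (a 4 x 3 rectangle)
```

Notice that the line and the polygon are both just lists of points. The only difference is that the polygon's ring ends where it started, which is what encloses an area.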

All vector data links geometry to descriptive attributes. This data is typically structured in a table where each row represents a unique record tied to a specific spatial geometry. We can use these attributes to run queries that leverage both the what and the where of our data. For example, given a dataset of hospitals and county boundaries we can ask the question, “How many Acute Care Hospitals lie within 5 miles of each county’s boundary?”.
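
The sketch below shows how a query can combine an attribute filter (the what) with a distance test (the where). Everything in it is hypothetical: the hospital records, the single square county, and the assumption that coordinates are planar values measured in miles. A real analysis would use projected data and a GIS library rather than hand-written geometry.

```python
import math

# Hypothetical attribute table: each row pairs attributes with a point geometry.
hospitals = [
    {"name": "A", "type": "Acute Care", "geometry": (1.0, 5.0)},
    {"name": "B", "type": "Acute Care", "geometry": (10.0, 10.0)},
    {"name": "C", "type": "Psychiatric", "geometry": (2.0, 2.0)},
    {"name": "D", "type": "Acute Care", "geometry": (24.0, 10.0)},
]

# Hypothetical county boundary: a closed ring (a 20 x 20 mile square).
county_ring = [(0.0, 0.0), (20.0, 0.0), (20.0, 20.0), (0.0, 20.0), (0.0, 0.0)]


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.dist(p, a)
    # Project p onto the segment, clamping to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.dist(p, (ax + t * dx, ay + t * dy))


def distance_to_boundary(p, ring):
    """Distance from p to the nearest edge of a closed ring."""
    return min(point_segment_distance(p, a, b) for a, b in zip(ring, ring[1:]))


# Attribute filter plus spatial filter in one query.
near = [
    h["name"]
    for h in hospitals
    if h["type"] == "Acute Care"
    and distance_to_boundary(h["geometry"], county_ring) <= 5.0
]
print(near)  # ['A', 'D']
```

Hospital C is close to the boundary but is filtered out by its type, while hospital B has the right type but sits 10 miles from the nearest edge.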

Data formats

Within GIS you will encounter many different vector data formats. There is no single geospatial data format standard, which means understanding your data also entails understanding the differences between data formats. When analyzing, storing, and sharing your own data, the choice of data format can be a source of happiness or misery.

The Shapefile is the most ubiquitous vector data format. It was developed by ESRI, the market leader in GIS software, and now has a mostly open specification. It has a few legacy limitations, including limits on file size (2 GB maximum) and column names (10 characters maximum), that can make it annoying to work with at times. Despite its name, a Shapefile is made up of at least 3 files (.shp, .shx, and .dbf). These 3 files are mandatory: a .shp file without the other two cannot be read. This can make a Shapefile more tedious to share with others.
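
Before sharing or opening a Shapefile, it can help to confirm that the mandatory companion files are present. The helper below is a minimal sketch using only the Python standard library. The function name missing_shapefile_parts and the file name clinics.shp are hypothetical, and the check assumes lowercase file extensions.

```python
from pathlib import Path

# The three files every Shapefile needs, per the text above.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the mandatory extensions that are absent next to shp_path."""
    base = Path(shp_path)
    return [
        ext for ext in REQUIRED_EXTENSIONS if not base.with_suffix(ext).exists()
    ]


# Hypothetical usage: an empty list means all three parts were found.
missing = missing_shapefile_parts("clinics.shp")
if missing:
    print("Missing parts:", missing)
else:
    print("All mandatory parts present")
```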

A more recent standard is GeoJSON, which is the de facto standard GIS format for the web due to its use of JSON (JavaScript Object Notation). It is a plain text data format, meaning it is very easy to look at its contents and manipulate them. This plain text nature has downsides in the form of larger file sizes and worse performance.
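
Because GeoJSON is plain text, a general-purpose JSON library is enough to write and read it. The sketch below builds a one-feature collection with Python's json module. The clinic name, bed count, and coordinates are made up for illustration.

```python
import json

# One feature: a geometry plus a dictionary of attributes ("properties").
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-78.9, 36.0],  # GeoJSON order is [longitude, latitude]
    },
    "properties": {"name": "Example Clinic", "beds": 25},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialize to human-readable text, as it would appear in a .geojson file.
text = json.dumps(collection, indent=2)
print(text)

# Parse the text back and read the geometry and attributes.
parsed = json.loads(text)
for f in parsed["features"]:
    lon, lat = f["geometry"]["coordinates"]
    print(f["properties"]["name"], lon, lat)
```

The printed text is exactly what you would see if you opened the file in a text editor, which is the readability advantage described above. Every coordinate is stored as text, though, which is also where the larger file sizes come from.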

The GeoParquet format is a cloud-native extension of the Apache Parquet storage format, specifically designed for big data and analytical workflows. Parquet stores data column by column rather than row by row. This architecture, combined with powerful built-in compression, results in file sizes that are significantly smaller (often by 60% or more) and query speeds that are much faster than Shapefiles or GeoJSON when handling millions of records.

The above file formats are typically used to store only one dataset at a time. However, in GIS we often work with a whole host of datasets and would like to organize them in one place. That’s where spatial database formats come in. The File Geodatabase is one such format. It was developed by ESRI and can store vector, raster, and tabular data. The maximum size of a File Geodatabase is 256 TB and you can also set 128 character column names (not to mention, human readable labels). File Geodatabases are not fully compatible with some open-source GIS software. Like Shapefiles, File Geodatabases are also composed of many files.

The GeoPackage is the modern, open-standard alternative designed to address the weaknesses of the formats mentioned above. Developed by the Open Geospatial Consortium (OGC), it is built on an SQLite database container, meaning an entire project’s worth of data is contained within a single file. This makes it much easier to share and manage than a Shapefile. Like the File Geodatabase, a GeoPackage is a spatial database format that can store vector, raster, and tabular data together. It overcomes legacy limitations by supporting file sizes up to 140 TB and allowing for long, descriptive column names. Since it is an open format, compatibility within open-source tools is often better for GeoPackages.
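
Since a GeoPackage is an SQLite database, Python's built-in sqlite3 module can inspect one without any GIS software. The GeoPackage standard defines a gpkg_contents table that lists the datasets in the file. In the sketch below an in-memory database with a simplified copy of that table stands in for a real file. It is not a valid GeoPackage, and the layer names are hypothetical.

```python
import sqlite3


def list_layers(conn):
    """Return (table_name, data_type) pairs from the gpkg_contents table."""
    query = "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
    return conn.execute(query).fetchall()


# Stand-in for sqlite3.connect("project.gpkg"). A real GeoPackage has more
# columns in gpkg_contents and several other required tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gpkg_contents (table_name TEXT PRIMARY KEY, data_type TEXT)"
)
conn.executemany(
    "INSERT INTO gpkg_contents VALUES (?, ?)",
    [
        ("hospitals", "features"),   # vector layer
        ("imagery", "tiles"),        # raster tiles
        ("visits", "attributes"),    # non-spatial table
    ],
)

for table_name, data_type in list_layers(conn):
    print(table_name, data_type)
```

The three data types illustrate how vector, raster, and tabular data can sit side by side in the same single file.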

| Format | Developer / Standard | Type | Storage Structure | Key Features / Limits | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| Shapefile | ESRI | Vector | Multi-file (min. 3) | 2GB limit; 10 char column names; ubiquitous but legacy. | General desktop GIS; legacy compatibility. |
| GeoJSON | Open Standard | Vector | Single file (Plain text) | Human-readable; large file sizes; slower performance. | Web mapping and web applications. |
| GeoParquet | Apache (Cloud-native) | Vector | Single file (Compressed) | High compression; very fast queries; built for big data. | Big data analytics and cloud workflows. |
| File Geodatabase | ESRI | Spatial Database | Multi-file (Folder) | Supports vector, raster, and tables; 256TB limit; proprietary. | Large-scale projects within ESRI ecosystem. |
| GeoPackage | OGC (Open Standard) | Spatial Database | Single file (SQLite) | Supports vector, raster, and tables; 140TB limit; cross-platform. | Modern data exchange and open-source workflows. |

These data formats are by no means the only ones that exist, but they are the ones we think you are most likely to encounter. See xkcd 927 for how this happened.

Raster data model

Raster data is generally composed of two components: a header and a matrix. The header defines the coordinate reference system (CRS), the origin (a reference coordinate that defines a corner of the image), the pixel size, and the number of rows and columns of pixels. The resolution of a raster, meaning the level of spatial detail it captures, is determined by the pixel size specified within this header. A smaller pixel size results in higher resolution, allowing for a clearer representation of the landscape, while larger pixels lead to lower resolution and a more “blocky” appearance.
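
The header is what turns a row and column index into a real-world location. The sketch below models a header as a small Python class. It assumes square pixels, a north-up image, and an origin at the upper-left corner, which is common but not universal. The class name, field names, and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RasterHeader:
    crs: str           # coordinate reference system, e.g. an EPSG code
    origin_x: float    # x coordinate of the upper-left corner
    origin_y: float    # y coordinate of the upper-left corner
    pixel_size: float  # width and height of one pixel, in CRS units
    n_rows: int
    n_cols: int

    def pixel_center(self, row, col):
        """Real-world (x, y) of the center of the pixel at (row, col)."""
        x = self.origin_x + (col + 0.5) * self.pixel_size
        y = self.origin_y - (row + 0.5) * self.pixel_size  # rows run southward
        return (x, y)

    def extent(self):
        """Bounding box as (min_x, min_y, max_x, max_y)."""
        max_x = self.origin_x + self.n_cols * self.pixel_size
        min_y = self.origin_y - self.n_rows * self.pixel_size
        return (self.origin_x, min_y, max_x, self.origin_y)


# A made-up 100 x 200 pixel raster with 30 m pixels.
header = RasterHeader("EPSG:32617", 500000.0, 4000000.0, 30.0, n_rows=100, n_cols=200)
print(header.pixel_center(0, 0))  # (500015.0, 3999985.0)
print(header.extent())            # (500000.0, 3997000.0, 506000.0, 4000000.0)
```

Halving pixel_size while covering the same extent would quadruple the number of pixels, which is the trade-off between resolution and file size.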

The matrix defines the value at each pixel. This means that unlike vector data, raster data does not store a large table of many attributes for each pixel. Some raster data will contain multiple “bands” so that multiple values can be stored for the same pixel. Additionally, raster matrices only store numeric values. This means that for categorical data like land use type, a raster will store integer values which represent different categories.
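
As a rough illustration of bands and categorical codes, the sketch below stores a tiny two-band raster as nested Python lists. The land use codes, the legend, and the elevation values are all made up. Real datasets publish their own legends, and real rasters are usually held in array libraries rather than lists.

```python
from collections import Counter

# Hypothetical legend mapping integer codes to land use categories.
LAND_USE = {1: "water", 2: "forest", 3: "urban"}

# Band 1: one integer code per pixel.
land_use_band = [
    [1, 1, 2, 2],
    [1, 2, 2, 3],
    [2, 2, 3, 3],
]

# Band 2: a second value for the same pixels (made-up elevations in meters).
elevation_band = [
    [0.0, 0.0, 12.5, 14.0],
    [0.0, 10.0, 13.0, 15.5],
    [9.0, 11.0, 16.0, 18.0],
]

raster = [land_use_band, elevation_band]  # indexed as raster[band][row][col]

# Count pixels per category by decoding each integer through the legend.
counts = Counter(LAND_USE[code] for row in land_use_band for code in row)
print(counts)

# Read both bands at one pixel.
row, col = 1, 3
print(LAND_USE[raster[0][row][col]], raster[1][row][col])  # urban 15.5
```

The matrix itself never contains the word "forest". Without the legend, the value 2 is just a number.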

Data formats

Like vector formats, you will encounter a variety of formats designed to store raster data. And just as with vector data, selecting the right raster format is essential for balancing performance, file size, and software compatibility.

The GeoTIFF is the most ubiquitous raster data format. It is an extension of the standard TIFF image format that includes spatial metadata like coordinate reference systems and georeferencing information. It is widely supported by almost every GIS application and can store multiple bands of data.

A more modern evolution is the Cloud Optimized GeoTIFF (COG), which is the de facto standard for hosting imagery on the web. It is a regular GeoTIFF file with an internal organization that allows software to “stream” only the specific parts of the image needed for a view. This structure significantly improves performance by avoiding the need to download the entire file to see a small area. Its primary downside is that creating a COG requires an extra processing step compared to a standard GeoTIFF.
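
To get a feel for why streaming helps, the sketch below works out which internal tiles a viewer would need for a small window of a large image. It assumes the image is divided into fixed-size square tiles, as COGs typically are. The 512 pixel tile size, the image dimensions, and the function name tiles_for_window are illustrative assumptions, and no file or network access is involved.

```python
def tiles_for_window(row_start, row_stop, col_start, col_stop, tile_size=512):
    """Return (tile_row, tile_col) indices overlapping a half-open pixel window."""
    first_tile_row = row_start // tile_size
    last_tile_row = (row_stop - 1) // tile_size
    first_tile_col = col_start // tile_size
    last_tile_col = (col_stop - 1) // tile_size
    return [
        (r, c)
        for r in range(first_tile_row, last_tile_row + 1)
        for c in range(first_tile_col, last_tile_col + 1)
    ]


# Hypothetical 10,240 x 10,240 pixel image stored as 512 x 512 pixel tiles.
image_size, tile_size = 10240, 512
total_tiles = (image_size // tile_size) ** 2

# A 600 x 600 pixel view somewhere inside the image.
needed = tiles_for_window(1000, 1600, 2000, 2600, tile_size)
print(len(needed), "of", total_tiles, "tiles")  # 9 of 400 tiles
```

In this made-up case the viewer needs 9 of 400 tiles, so most of the file never has to be downloaded.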

The NetCDF (Network Common Data Form) is a format specifically designed for multidimensional scientific data. It is widely used in climate science and meteorology because it can store “data cubes” that include dimensions for latitude, longitude, and time. This makes it much more powerful than a simple image file for tracking changes over time, though it is more complex to work with and may require specialized plugins or tools to view correctly in standard GIS software.
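
The “data cube” idea can be sketched with nested Python lists standing in for a NetCDF variable. This example does not read or write a NetCDF file. The axis values, the temperatures, and the helper names time_series and time_slice are all made up for illustration.

```python
# Hypothetical coordinate axes for a tiny cube.
times = ["2024-01", "2024-02", "2024-03"]
lats = [35.0, 36.0]
lons = [-80.0, -79.0, -78.0]

# Made-up temperatures indexed as cube[time][lat][lon].
cube = [
    [[5.0, 6.0, 7.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [6.0, 7.0, 8.0]],
    [[12.0, 13.0, 14.0], [11.0, 12.0, 13.0]],
]


def time_series(cube, lat, lon):
    """Values through time at one grid cell."""
    i, j = lats.index(lat), lons.index(lon)
    return [cube[t][i][j] for t in range(len(times))]


def time_slice(cube, time):
    """The full latitude-by-longitude grid at one time step."""
    return cube[times.index(time)]


print(time_series(cube, 36.0, -79.0))  # [5.0, 7.0, 12.0]
print(time_slice(cube, "2024-02"))     # [[7.0, 8.0, 9.0], [6.0, 7.0, 8.0]]
```

A single-band image corresponds to one time slice. The extra time dimension is what allows change at a location to be tracked with a single lookup.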

| Format | Developer / Standard | Type | Storage Structure | Key Features / Limits | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| GeoTIFF | Open Standard | Raster | Single file | Ubiquitous; supports multiple bands; can be slow if unoptimized. | Standard desktop GIS and imagery storage. |
| COG | OGC Community | Raster | Single file (Optimized) | HTTP Range Request support; streams data; fast web performance. | Web-based imagery services and cloud storage. |
| NetCDF | UCAR / Unidata | Multidimensional | Single file | Supports time and depth; complex metadata; scientific focus. | Climate modeling and temporal data analysis. |

Again, there are many more raster formats that are not covered here, but these are some of the most prominent.