Search
  • helenmakesmaps5

How to SQL: a Guide for GIS Users

Updated: Aug 26

Until about 6 months ago, I used SQL very sparingly. I was predominantly a point-and-click desktop GIS user, so my main use for SQL was creating rules for filtering my data, symbology and styles. I thought it was mostly a language for people deep in the database management world, which was a side of geospatial that I'm not mega into.


Then I joined CARTO who are a geospatial software and data provider, and one of their big MOs is opening people's eyes to the power of SQL for geospatial analysis and BOY have my eyes been opened! SQL has so many analytical applications: filtering, proximity analysis, overlays, routing, nearest neighbour... pretty much anything you can do with point-and-click GIS tools, you can do with SQL, just with extra advantages!


In this post, I'll run through why you should consider using SQL in your workflows, and then describe how you can easily undertake common GIS tasks with SQL.


gif

The benefits of SQL for GIS

SQL is a GAME CHANGER for spatial analysis for so many reasons:

  1. It's quick. Like so quick. If you're linked in to a cloud native warehouse like Google Cloud or Snowflake, we're talking analysis in seconds that would take point-and-click GIS tools HOURS.

  2. You've probably heard the phrase "single source of truth" about a thousand times just this week. The concept is having one dataset that everyone is working from, rather than multiple copies which could end up being altered independently. SQL is a huge enabler of that because you can work from a query of the source dataset, rather than having to create new versions when you're editing/analysing. For example, say you're mapping postcodes 30 miles from your office, but your dataset is UK-wide and it's taking FOREVER to load. Rather than extracting a copy or adding a dreaded "in scope" field, you can pop a quick SQL filter on it to only show points within 30 miles of a point. Easy!

  3. No dump outputs! By this I mean outputs which are generated by processes within your workflow which you don't really need, and will probably have to go back and delete; SQL has no need for these. Running a proportional-area spatial join (e.g. working out the population of a study area based on the proportion of a census area which falls inside it) could generate up to 3 "dump outputs" before reaching your end goal; none of that with SQL!

  4. The outputs are queries. This means if the source data changes (which we all know source data loves to do!) then the outcome of your analysis changes with it (although you can also write results as outputs if needed). This means that you don't have to re-run everything from scratch when your PM comes to you and says "we've decided to add another town/station/store/park to this project."

  5. This is more a benefit to coding in general, but it's SO iterative. If you run a process and realise you made a mistake or realised your parameters weren't quite right, then you can easily change them and have your new output in no time.

  6. It is SO easy to learn! The syntax feels very natural, and follows the natural flow of a human thought process ("select fields from table," "if this then that"). You can also get really far with the basics and so be up and running really quickly, rather than a lot of programming languages where you start hitting walls pretty much straight away.

So why aren't people using SQL more? I think a big part of it is it has a marketing problem; people don't know how amazing it is and what it can be used for! But I also think there aren't that many dedicated resources out there for learning SQL, particularly for people with a GIS background. So that's what I'm going to try to do here!


Right, are you sold yet? Because I'd like to take my marketing hat off and put my analyst wig back on and get to some code!


Before we start...

So before we start, a word of caution. There are actually multiple forms of SQL, such post postgreSQL/postGIS, mySQL and SQL server, not all of which include geospatial functions. They also differ slightly. For example, SQLServer has a field type "INT" whereas in BigQuery the equivalent is "INT64." Generally the syntax of queries stays the same, but if you see the error message "function not found" or similar, that's probably the problem.


In the examples in this guide, I'll mostly be using the Google BigQuery version of SQL because most of my data is stored on GCP. I'll share postGIS alternatives if and when this is different.


 

Where can I use SQL?

I think one of the things that stresses people out about using SQL is they think they have to get to grips with specific database types and download all sorts of special software first. Not true! While there are some great free database management programs available (I recommend PGAdmin), you can use SQL straight from your GIS! Here's a quick guide on accessing SQL in a few different places.


In CARTO

In your workspace, go to maps and create a new map. Add a new source and select "From a custom SQL query" then select your connection; a SQL console will appear. So easy!


I'm of course super biased because they're my employer and put a roof over my head, but if you haven't used CARTO before I can't recommend the software enough. It's so powerful and just SO fun! You can sign up for a free two-week trial here: https://app.carto.com/signup/



In QGIS

There are two main ways to execute SQL queries in QGIS. Firstly, you can add queries as virtual layer by selecting Layer > Add Virtual Layer where a SQL console will appear.


Secondly, you can open a console via the Database drop-down and selecting Database Manager. This will allow you to query SQL-enabled data sources, but also layers you already have loaded in your project. There is a little "SQL query" button with a wrench on it that allows you to access the console; if you check the "Load as a new layer" box your query result will load in your map.


This guide is intended to be an introduction to SQL as a language rather than using the DB Manager in QGIS, so for further guidance on this please refer to the official guide.


In R/Python

There are various packages for enabling SQL in R and Python; the ones which I see most commonly recommended are SQLite 3 for Python and and SQLdf for R.


In ESRI

products so haven't included how to use SQL in ArcGIS/ArcPro - I believe this is possible when connecting to a database.I'm not much of an ESRI user anymore, but I believe SQL can be used when connecting to and adding data from a database.

 

The basics

The most basic of SQL queries

So now you have a SQL console loaded, you're ready to get querying! The most basic of SQL queries has the following syntax:

SELECT * FROM project.dataset.table

The "*"means "all fields," so this query essentially loads all rows and all columns from the specified table. As we're working with geospatial data, your tables should have a "geometry" type column. These contain the geometry type (e.g. point, linestring, polygon) and the geometric description of them (i.e. their coordinates, or the coordinates of their vertices). Often, this field will be called "geom." Check out the below example, which loads all of the populated places and columns from the Natural Earth dataset (available for free on CARTO's Spatial Data Catalog LAST PLUG!).

SELECT * FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`

Only querying specific fields

That query took 7.27MB to compute, which isn't huge. However it's always a good idea to try to make your queries as efficient as possible. This will make them run faster, and will be cheaper if you're running queries from paid-for services like Google Cloud.


One way to do this is to limit the number of columns that you're querying (always ensuring you retain the geometry column). So if I was only interested in knowing the location, name and country of each populated place, I could change the query to:

SELECT geom, name, ADM0NAME FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`

That takes us down to 440KB! That's smaller than most of the identical videos of my cat doing absolutely nothing that I have on my phone.




Orders and limits

Another way to keep a lid of the size of your queries is to apply a LIMIT where you can specify the number of features from a query that you want to show. This is particularly powerful when used in conjunction with ORDER BY as you can display the top (or bottom) number of features. In the example below, I order my table by the attribute POP2020 and limit this to 20 records which gives me a layer showing only the biggest 20 cities in terms of population size.

SELECT geom, name, ADM0NAME, POP2020 FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410` 
ORDER BY POP2020 DESC LIMIT 20

Common table queries and aliases

As your queries become more sophisticated and complex, you'll begin referencing multiple tables in one query. You also may have to break your analysis down into multiple steps to make it faster and more comprehensive. This is where Common Table Queries (CTEs) and aliases come in; you perform a "query within a query" and give that an alias which you can easily and quickly reference later on. This is achieved with a WITH statement. A really simple version of this would be:

WITH cities as (SELECT geom, name, ADM0NAME FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`)

SELECT * FROM cities

This isn't super exciting and doesn't achieve anything at this stage, but is a super useful concept to grasp - you'll see it in action a bit later!


 

Calculations, filters and analysis

Calculate field

Calculating a field with SQL is super easy - you just add your calculation as a new field as use "AS" to give that field a name. In the example below, we create a new field called "POPCHANGE1950_2020" which is the % population change in this time period.

SELECT geom, name, ADM0NAME, POP2020, POP1950, (POP2020-POP1950)/POP1950 AS POPCHANGE1950_2020 
FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`

Easy right? We can also add some complexity here by applying different calculations depending on certain conditions. We can do this by using CASE... END to create a filter statement for a new calculated variable. We set the CASE with the phrase WHEN {filter condition} THEN {calculation}, and use ELSE to state the calculation to use where no conditions are met.


For example, in the query below we run the calculation from before as a subquery, then create a new field "Description" and the new value from "POPCHANGE1950_2020" to state whether the result is greater or less than the average.

with cities as (
  SELECT geom, name, ADM0NAME, POP2020, POP1950, (POP2020-POP1950)/POP1950 AS POPCHANGE1950_2020
  FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`)

select *,
  CASE WHEN POPCHANGE1950_2020 > (SELECT AVG(POPCHANGE1950_2020) FROM cities) 
    THEN "Greater than average" 
  ELSE "Lower than average" END AS Description 
FROM cities


Select by attribute/definition queries

For seasoned GIS users, this may be the part of querying that you're more familiar with. It's the "Select by attributes/expression/definition query" bit. To do this, we just need to add a "WHERE" phrase to the end of our expression, like the below which selects only populated places in the UK.


SELECT geom, name, SOV0NAME FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410` WHERE SOV0NAME = 'United Kingdom'

Build complexity into this using operators like =, >, != and LIKE, and multiple criteria with boolean operators AND and OR. You can also use these criteria within a CASE statement (see above) to specify conditions for calculating a field.


Select/filter by location

This is where things start to get juicy! Here we can use spatial conditions to filter our tables. This is super useful if - for instance - you have a dataset that covers a huge area, but your map only covers a small, local area. Rather than waiting for your massive dataset to load, you can filter it to your area of interest (without having to clip it and therefore creating a dangerous data subset).


Location-based selects require an understanding of Spatial Predicates, which is basically a fancy term for spatial relationships between two layers. There are a lot of different predicate types, all of which are prefixed with the term "ST_" and all of which return a boolean true/false result; true if the spatial condition is met, false if it's not. To run these, you'll need two geometries; a geometry to be filtered (the target layer) and a geometry to filter with (the condition layer).


Here's a quick round up of some of the most common:

  • ST_INTERSECTS: returns a true result if any part of the condition layer (including the boundary) intersects the target layer. Probably the most commonly used predicate as it covers up a lot of geometry sins...

  • ST_CONTAINS & ST_WITHIN: contains returns a true result if all parts of the condition layers contains the target layer but do NOT touch the boundary (so for instance, a county would not contain census areas it shares a border with). ST_WITHIN is the reverse of this. So ST_CONTAINS(geometryA, geometryB) will return the same result as ST_WITHIN(geometryB, geometryA). These are particularly useful predicates when working with point geometries as the border issue is less of a problem.

  • ST_TOUCHES: this is basically the opposite of ST_CONTAINS, returning a true result where the condition layer touches the boundary of the target layer, making it really helpful for neighbour analysis.

  • ST_COVERS: again, this is a lot like ST_CONTAINS but here the entire border of the target feature should be covered by the condition feature to return true.

  • ST_DWITHIN: this returns true for target features within a specified distance of condition features. ST_DISTANCE() can also be used with a condition for the same effect or more complex distance-based conditions e.g. ST_DISTANCE(geometryA, geometryB) <=2000.

Want to know more about Spatial Predicates? Explore the full list on the Postgis website.



The syntax for the majority of select by location operations is as follows:

...WHERE SPATIALPREDICATE(geometryA,geometryB)

Some examples

Which cities are within 50km of Heathrow Airport?
WITH airports AS (SELECT geom as airportgeom FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_airports_410`
  WHERE name LIKE '%Heathrow%')

SELECT geom, name, SOV0NAME FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`, airports 
  WHERE ST_DWITHIN(geom, airports.airportgeom, 50000)

Notes

  • If you're unsure of the exact text used in a dataset, you can use the operator LIKE and wildcards (%) to help. So here, I'm looking for Heathrow Airport in my airports table, but I don't know if it will be called "Heathrow," "Heathrow Airport," "Heathrow International Airport" etc. By using LIKE wrapping the text in "%," I'm saying that the feature name should contain the text "Heathrow" but anything can come before or after.

  • ST_DWITHIN is a bit different to other spatial predicates, in that the syntax is ST_DWITHIN(geometryA, geometryB, distance).


Which countries border France?

WITH france AS (SELECT geom AS francegeom FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_admin0countries_410`
  WHERE name LIKE '%France%'),
  
allcountries AS (SELECT geom, name FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_admin0countries_410`)

SELECT allcountries.* FROM allcountries, france
WHERE ST_TOUCHES(france.francegeom, allcountries.geom)

So now you've got the basics down! Shall we move on to some analysis?



 

Analysis


I'm going to cover the follow main types of geospatial analysis here:

  1. Constructors, transformations & aggregations

  2. Proximity analysis

  3. Spatial Joins

There are so many other types of analysis you can use SQL for, and really the purpose of this blog is to give you the building blocks to combine and create your own workflows!


1 Constructors, Transformations & Aggregations

This section is going to be all about how you create geometries and move between geometry types. It doesn't sound like the most exciting but you'll be using this stuff all the time. Let's kick off with the absolute basics.


Creating a geometry

Let's start with creating a point. The syntax here is ST_MAKEPOINT(longitude,latitude). Note: if you're working in Google BigQuery, the phrase is actually ST_GEOGPOINT() because the "S" in SQL is a veeeery loose definition of the word "Standard." A use for this could be if your table had no geometry column, but you did have its longitude and latitude values and so could generate the geometry that way.


So, for instance, the code below would create a geometry field containing a point which just happens to be the location of my local pub. Incidentally, if you've been enjoying this blog I'll be there most Fridays if you want to buy me a pint - I'll be the ginger one getting increasingly Bristolian as the evening wears on. Anywho...

SELECT ST_GEOGPOINT(-0.16713247,51.367034) as geom

Creating lines and polygons is a similar approach, except you'd use ST_MAKELINE() and ST_POLYGON(), with the inputs being the geometries of each vertex. In some cases, you may need to use ST_GEOGPOINT() to create geometries for each vertex first. All of these constructors can be used either when creating a field, or as part of a "WHERE" statement (see above). This is a great workflow for quickly and iteratively replacing points-to-lines or XY-to-line tools.



The map above is generated using the code below, with lines flowing between the origins and destinations of Citibike trips in NYC. You'll notice the use of COUNT() and GROUP BY in the code below; we'll run through this in more detail in the next section, but they're essentially methods of aggregating variables.

SELECT
start_station_name, start_station_latitude, start_station_longitude, end_station_name,   end_station_latitude, end_station_longitude,  COUNT(*) AS num_trips,
ST_MAKELINE(ST_GEOGPOINT(start_station_longitude, start_station_latitude), ST_GEOGPOINT(end_station_longitude, end_station_latitude)) AS geom
FROM
  `bigquery-public-data.new_york.citibike_trips`
  WHERE start_station_name != end_station_name 
GROUP BY 1, 2, 3, 4, 5, 6
ORDER BY
num_trips DESC LIMIT 5000

Oh and remember what I said about SQL being quick? This took 2 seconds to run. Not as in the colloquial "only 2 seconds" but 2 seconds, literally.


Transformations

Transforming your geometry couldn't be simpler! All you need to do is wrap a transformation statement around your geometry field when selecting it! So, for instance, the below code would transform my USA counties polygons into polylines for their borders:

SELECT ST_BOUNDARY(geom) as geom FROM project.dataset.tiger_USAcounties

Let's explore a few constructor types:

  • ST_BOUNDARY: as above, transforms a polygon into polylines.

  • ST_BOUNDINGBOX: transforms the geometry into a polygon with sides perpendicular to lines of latitude and longitude. Helpful if you need to extract data for a certain map extent.

  • ST_CENTROID: creates a centroid from a polygon or line.

  • ST_SIMPLIFY: returns a simplified version of a geometry. The syntax for this is ST_SIMPLIFY(geometry, distanceToleranceInMetres).

  • ST_UNION_AGG: super useful one! This is the equivalent of "dissolve" in desktop GIS, and can be used in conjunction with GROUP BY to dissolve by values in specified fields. This is another example of an aggregate function, which we'll go into more detail about next.

  • ST_INTERSECTION: like the classic "intersects" GIS tool, returns a geometry which is the intersecting parts of the two input geometries.



Aggregations

Aggregations are when you want to group more than one feature together. For instance, you might be performing a Spatial Join (see below) and want to count how many stations are within a county, or you might have a point representing each customer using that station and just want to know the total.


The important thing to understand about aggregations in SQL is that if you want to aggregate one field, then you have to apply some sort of aggregation function on all fields. There are three main ways of doing this.

  1. Aggregate calculations. These apply to quantitative fields only; wrapping a field in a function such as COUNT(), SUM() or AVERAGE() will calculate that value for all features which are in each...

  2. Grouping! You can use the phrase GROUP BY to specify any groups within your aggregation. For instance, if you wanted to know the total population in each state from a dataset of county zones, you would specify to group by the "state name" variable, which will aggregate the ouput for every unique value in this variable. If every field in your group by variable is unique then... nothing happens (which you sometimes want!). Note you cannot group a geometry field, so instead you must use a...

  3. Geometry aggregate. These are specific functions for aggregating geometry fields; I use ST_UNION_AGG() which is a dissolving function for this 99% of the time. It may be 100% but my memory isn't that good.

So let's look at the Citibike example again (see below). Here we have grouped all data by the first 6 fields, meaning all trips with identical start and end points are grouped as one geographic feature. The seventh field - num_trips - is a count of the result of this grouping (so it counts the number of trips from identical start and end locations). Finally, the eighth field - ST_MAKELINE(...) as geom - is based on grouped fields, so does not need to be aggregated in any way. Make sense?


SELECT
start_station_name, start_station_latitude, start_station_longitude, end_station_name,   end_station_latitude, end_station_longitude,  COUNT(*) AS num_trips,
ST_MAKELINE(ST_GEOGPOINT(start_station_longitude, start_station_latitude), ST_GEOGPOINT(end_station_longitude, end_station_latitude)) AS geom
FROM
  `bigquery-public-data.new_york.citibike_trips`
  WHERE start_station_name != end_station_name 
GROUP BY 1, 2, 3, 4, 5, 6
ORDER BY
num_trips DESC LIMIT 5000


2 Proximity analysis

Proximity analysis in GIS is all about understanding how close things are, because as we all know the first rule of geography is that "near things are more similar than far things" (or words to that effect.


Buffers

Let's start with the most common type of proximity analysis: the humble buffer. To create a buffer around a feature, we simply need to use the ST_BUFFER(geometryVariable, bufferDistance) syntax to create a new geometry variable. In the below example, we create 10km buffers around Natural Earth airports.

SELECT ST_BUFFER(geom, 10000) as geom, name FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_airports_410` 

If we wanted to dissolve these, we could wrap this in the ST_UNION_AGG() function from earlier:

SELECT ST_UNION_AGG(ST_BUFFER(geom, 10000)) as geom FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_airports_410` 

Nearest neighbours

Nearest neighbours is a sort of family of analytical techniques which are about uncovering patterns of closeness. Let's start with a common geospatial question: how close is the closest thing from layer X to each thing in layer Y? For instance, how close is the nearest airport to each city? Easy.


To do this, we need to use the ST_DISTANCE() calculation to calculate the distance between the two geometries, and then calculate the minimum from this. As this is an aggregate function, we also need to union our geometries and group our qualitative fields.

WITH airports AS (SELECT geom, name FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_airports_410`),
cities AS (SELECT geom, name, SOV0NAME FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`)

SELECT cities.name, cities.SOV0NAME, ST_UNION_AGG(cities.geom) as geom, min(ST_DISTANCE(cities.geom, airports.geom)) AS distance
FROM cities, airports
GROUP BY cities.name, cities.SOV0NAME

An extension of this is to not just work out how far each city is from an airport, but what that closest airport is. In the code below, we take the output from the above query and make that another subquery - distance_stats. We can then join distance_stats back to our original airports table based on the distances matching using LEFT JOIN (more on that later in our Spatial Joins section).

WITH airports AS (SELECT geom, name FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_airports_410`),
cities AS (SELECT geom, name, SOV0NAME FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_populatedplaces_410`),

distance_stats AS (SELECT cities.name, cities.SOV0NAME, ST_UNION_AGG(cities.geom) as geom, min(ST_DISTANCE(cities.geom, airports.geom)) AS distance
FROM cities, airports
GROUP BY cities.name, cities.SOV0NAME)

SELECT distance_stats.*, airports.name as airport_name FROM distance_stats
LEFT JOIN airports ON distance_stats.distance=ST_DISTANCE(distance_stats.geom, airports.geom)



3 Spatial joins

Spatial joins are the bread and butter of a huge amount of GIS analysis. They're pretty straightforward to do with SQL, and basically require a good understanding of location-based querying (see our earlier section on Select by Location) and aggregations.


To perform a spatial join, you need four key things:

  1. Your target geometry that you want to join to: maybe a study area

  2. Your source geometry that you want to join from, such as a census zone dataset

  3. Source variable(s) that you want to attach to the target geometry, and

  4. Your aggregation function.

So in our first simple example, I want to count the number of airports in each country. So my target is my country polygons, and my source geometry is the airports themselves. My source variable is simply the airport geometry, and I'll be using the aggregate function count. We'll use a LEFT JOIN to perform the join; this type of join (you'll encounter main in your SQL journey!) retains all features in your target geometry.


The basic syntax for this is as follows:

SELECT target.geometry, AGGREGATIONFUNCTION(sourcevariable),
FROM target
LEFT JOIN source ON ST_SpatialRelationship(target.geometry, source.goemetry)

So, let's adapt that for our airports example:

WITH airports AS (SELECT geom, name FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_airports_410`),
countries AS (SELECT geom, name FROM `carto-data.ac_lqe3zwgu.sub_natural_earth_geography_glo_admin0countries_410`)

SELECT ST_UNION_AGG(countries.geom) AS geom, countries.name, COUNT(airports.geom) AS airportscount
FROM countries
LEFT JOIN airports on ST_CONTAINS(countries.geom, airports.geom)
GROUP BY countries.name ORDER BY airportscount DESC


 

The end!

Congratulations - you've survived an introduction to using Spatial SQL for GIS! I'm going to keep this blog dynamic and keep adding to it, so if you get to the end and have any burning questions, or a workflow that you really want to use SQL for but don't know how, then DM me @helenmakesmaps on Twitter or find me on LinkedIn (Helen McKenzie, I'm the ginger one) and I'll add it in!


Thanks so much for reading!!


gif


11,069 views0 comments

Recent Posts

See All