!pip install -q osmnx worldpoppy
ITF Transport Data, Statistics and Modelling Workshop: Exercise 1
This exercise walks you through five tasks:
| Task | What you'll do |
|---|---|
| Extract | Pull data from OpenStreetMap and WorldPop |
| Check | Assess its quality and relevance |
| Manipulate & filter | Shape the data to your needs |
| Calculate | Derive useful performance metrics |
| Visualise & interpret | Analyse charts and maps |
No coding experience needed. You can follow the instructions below to run the code. Editing and re-running code cells is encouraged, as it helps you learn how the code works!
The workshop introduces a couple of open-access data sources, shows how to access and work with them, and provides a starting point for future work. Using Python, or another open-source programming language, opens up a world of possibilities.
If anything is unclear or doesn't work as expected, please ask one of the facilitators in the room.
1 Your Assignment
Imagine you've been asked to assess transport infrastructure in a city that has limited official data.
Your goal: find the total length of roads, sidewalks and bike lanes in your assigned area, then divide by population to produce a per-person infrastructure availability metric you can compare with other places.
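As a rough illustration of the metric, the sketch below divides a total length by a population and converts the result to metres per person. The figures are made-up placeholders rather than results for any city, and the function name `metres_per_person` is only illustrative.

```python
def metres_per_person(total_length_km: float, population: float) -> float:
    """Return infrastructure length per person, in metres."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_length_km * 1000 / population


# Hypothetical inputs: 2,500 km of roads and 1,000,000 residents
example = metres_per_person(2500, 1_000_000)
print(f"{example:.2f} m of road per person")  # 2.50 m of road per person
```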
2 How to Run the Code
The exercise is divided into tasks, each containing one or more cells, also called "chunks".
You can run the code in a few different ways.
In IPython Notebooks (.ipynb files)
- Hover over a cell. A ▶ button should appear on the left.
- Click ▶ to run it.
- Watch the output appear below the cell and wait for it to complete.
In the equivalent Quarto documents (.qmd files)
This way is recommended if you want more control over the user interface. Start by opening the file (such as ITF_Workshop_Exercise_1.qmd) in VS Code or an equivalent editor with the Quarto extension installed. You can then run the code chunks in one of two ways:
- By placing your cursor inside the code chunk and pressing Ctrl+Enter (Command+Enter on Mac). This runs the whole code chunk; it is equivalent to clicking the "Run Cell" button, but quicker.
  - You can also control exactly which lines, or even characters, are executed by selecting them before pressing Ctrl+Enter, which lets you learn by experimenting.
- By clicking "Run Cell" at the top of each cell ("Run Next Cell" and "Run Above" options should also be available).
Let's get started!
2.1 Task 1: Setup
The first task is to install the packages you will use in the workbook. Packages are extensions that provide ready-made functions, so you don't have to write everything from scratch in low-level Python code.
This step may take a few minutes.
Run the install command (!pip install -q osmnx worldpoppy) to install the packages from inside Python, if they are not already installed.
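If you are unsure whether the packages are already available, a quick check is possible before running the installer. The sketch below uses Python's standard `importlib` module; the helper name `missing_packages` is made up for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages this workbook installs
required = ["osmnx", "worldpoppy"]
missing = missing_packages(required)
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All required packages are available.")
```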
2.2 Task 2: Choose a City
Pick a city from the list below to analyse. Each group should try to choose a different city so you can compare results at the end!
| City |
|---|
| Chernihiv, Chernihiv Oblast |
| Dnipro |
| Donetsk, Donetsk Oblast |
| Kharkiv |
| Kryvyi Rih |
| Kyiv |
| Luhansk, Luhansk Oblast |
| Lviv |
| Mariupol |
| Mykolaiv |
| Odesa |
| Poltava, Poltava Oblast |
| Vinnytsia, Vinnytsia Oblast |
| Zaporizhzhia |
Replace "Dnipro" with your chosen city below, and run the cell.
city = "Dnipro" # <-- Replace with your chosen city
print(f"City set to: {city}")
2.3 Task 3: Import the Data
The following cell downloads two main datasets:
| Data Source | What it is | What you'll use it for |
|---|---|---|
| OpenStreetMap (OSM) | A free, community-built map of the world, like "Wikipedia for maps" | Estimating the total length of transport network segments |
| WorldPop | An open dataset of population estimates at high spatial resolution (100 m grid) | Estimating the population of your assigned area |
This download step may take a few minutes, depending on the size of the city and your download speed. You should see some output appear as the data downloads.
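The download is restricted to a rectangle (a bounding box) drawn around the city boundary. As a minimal sketch of that idea, the pure-Python function below takes the smallest and largest coordinates of a boundary. The coordinates are invented for illustration, and the real cell relies on the boundary geometry's own `bounds` attribute instead.

```python
def bounding_box(coords):
    """Return (west, south, east, north) for a list of (lon, lat) points."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return min(lons), min(lats), max(lons), max(lats)


# Invented boundary points (longitude, latitude), not a real city outline
boundary = [(34.90, 48.40), (35.10, 48.38), (35.20, 48.50), (35.00, 48.55)]
demo_bbox = bounding_box(boundary)
print(demo_bbox)  # (34.9, 48.38, 35.2, 48.55)
```

Because the rectangle is larger than the boundary it encloses, both datasets will include some land just outside the city.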
# Import the packages from the previous step
import osmnx as ox # Download OSM data
import worldpoppy as wp # Download WorldPop data
import geopandas as gpd
from shapely.geometry import box
# Study area: look up the city boundary and take its bounding box
gdf = ox.geocoder.geocode_to_gdf(city)
polygon = gdf.geometry.iloc[0]
west, south, east, north = polygon.bounds
bbox = (west, south, east, north)
# Dataset 1: OpenStreetMap network (every way with a "highway" tag)
ox.settings.useful_tags_way = ["highway", "cycleway", "footway", "sidewalk", "busway", "length"]
osm_data = ox.graph_from_bbox(bbox, custom_filter='["highway"]')
# Dataset 2: WorldPop population raster for the same bounding box
bbox_geom = box(west, south, east, north)
bbox_gdf = gpd.GeoDataFrame(geometry=[bbox_geom], crs=gdf.crs) # Convert to GeoDataFrame
pop_data = wp.wp_raster(product_name='pop_g2_r25a', aoi=bbox_gdf, years=2024)
2.4 Task 4: Review the Data
Before doing any analysis, it's good practice to visually inspect the input datasets. This task produces a map of each dataset described above. The results should be images showing the transport infrastructure and the population distribution in the case study area, with brighter areas in the second map indicating higher population density.
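Population counts are usually very skewed, which is why a logarithmic colour scale is helpful for the second map. The sketch below uses invented cell values to show how a log scale spreads out small values that a linear scale would squash together. Adding 1 first keeps empty cells defined, since the logarithm of zero is undefined. The function names are illustrative only.

```python
import math


def linear_position(value, largest):
    """Position of `value` between 0 and 1 on a linear colour scale."""
    return value / largest


def log_position(value, largest):
    """Position between 0 and 1 on a log scale; the +1 keeps zero-valued cells defined."""
    return math.log10(value + 1) / math.log10(largest + 1)


# Invented population counts for five grid cells
cells = [0, 1, 10, 100, 1000]
for value in cells:
    print(
        f"{value:>5}  linear={linear_position(value, max(cells)):.3f}"
        f"  log={log_position(value, max(cells)):.3f}"
    )
```

On the linear scale the first three cells are almost indistinguishable, whereas the log scale separates them clearly.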
# Cell 1: Map the road transport network from OpenStreetMap
import matplotlib.pyplot as plt
fig, ax = ox.plot.plot_graph(
    osm_data,
    show=False,
    close=False,
    bgcolor="#111111", # Background colour
    edge_color="#ffcb00", # Colour of the network
    edge_linewidth=0.3, # Line thickness
    node_size=0, # Hide intersections
)
# Overlay the city boundary as a subtle outline
gdf.plot(ax=ax, fc="none", ec="#ffffff", lw=5, alpha=0.5, zorder=2)
# Add a title
ax.set_title(f"{city}: Transport Network", color="black", fontsize=14, pad=12)
plt.show()
# Cell 2: Map the population distribution from WorldPop
import numpy as np
from matplotlib.colors import LogNorm
# Plot the population data; let worldpoppy handle the projection and axes
# (1 is added to every cell so that empty cells can be shown on a log scale)
mesh = (pop_data.fillna(0) + 1).plot(
    norm=LogNorm(),
    cmap='inferno',
    size=6,
    cbar_kwargs={'label': 'People per grid cell', 'shrink': 0.6, 'pad': 0.02}
)
# Retrieve the axes and colour bar from the QuadMesh object
ax = mesh.axes
cbar = mesh.colorbar
# Correct the aspect ratio for latitude distortion
lat_center = (pop_data.y.min() + pop_data.y.max()) / 2
ax.set_aspect(1 / np.cos(np.radians(float(lat_center))))
# Style the colour bar (the raster holds people per cell, so the label says so)
cbar.set_label("People per grid cell", color="white", fontsize=10)
cbar.ax.yaxis.set_tick_params(color="white")
plt.setp(cbar.ax.yaxis.get_ticklabels(), color="white")
# Style the title, background and axis labels
ax.set_title(
    f"{city} (2024): {pop_data.sum().item() / 1e6:.1f}M People",
    color="white", fontsize=14, pad=12
)
ax.set_facecolor("#111111")
ax.figure.patch.set_facecolor("#111111")
ax.set_xlabel("Longitude", color="white", fontsize=9)
ax.set_ylabel("Latitude", color="white", fontsize=9)
ax.tick_params(colors="white")
plt.show()
2.5 Task 5: Inspect and Evaluate the Data
Before doing any calculations, it's important to understand the structure of your data and check for any potential issues.
This task doesn't require you to write any code. Instead, focus on reading the outputs carefully and thinking critically about what you see.
After running the following cell, answer these questions:
- Look through the column names. Identify at least two columns whose purpose is unclear to you. Can you make an educated guess at what they mean?
- Look at the values in the first five rows. Are there any columns that appear to contain mostly empty or missing values? Do you think those columns will be important for your analysis?
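Judging missing values from five rows alone can be misleading, so it may help to count them across the whole table. The sketch below shows the idea with a handful of invented road segments stored as plain dictionaries; the helper name `missing_share` is hypothetical.

```python
def missing_share(rows, columns):
    """Return the percentage of rows in which each column is absent or None."""
    shares = {}
    for column in columns:
        n_missing = sum(1 for row in rows if row.get(column) is None)
        shares[column] = 100 * n_missing / len(rows)
    return shares


# Invented road segments; real OSM edges carry many more attributes
segments = [
    {"highway": "residential", "sidewalk": None, "cycleway": None},
    {"highway": "primary", "sidewalk": "both", "cycleway": None},
    {"highway": "footway", "sidewalk": None, "cycleway": None},
    {"highway": "secondary", "sidewalk": "right", "cycleway": "lane"},
]

for column, share in missing_share(segments, ["highway", "sidewalk", "cycleway"]).items():
    print(f"{column}: {share:.0f}% missing")
```

Once the cell below has created the `edges` table, the pandas expression `edges.isna().mean() * 100` should give the same kind of summary for the real data.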
# Cell 1: Inspect the structure of the OpenStreetMap road network data
# Convert the OSM network to a table of road segments (edges)
_, edges = ox.graph_to_gdfs(osm_data)
# Show the first few rows and all columns
print(f"Total number of road segments: {len(edges)}")
print(f"Number of columns: {len(edges.columns)}")
print("\nColumn names and data types:")
print(edges.dtypes)
print("\nFirst 5 rows:")
edges.head()
After running the following cell, answer the following questions:
- Compare the total estimated population to what you know about the city. Does it seem reasonable? If it seems too high or too low, what might explain that?
- Look at the minimum and maximum population density values. How large is the gap between them? What does that tell you about how evenly population is distributed across the city?
# Cell 2: Inspect the structure of the WorldPop population data
total_cells = pop_data.sizes['y'] * pop_data.sizes['x']
missing_cells = pop_data.isnull().sum().item()
zero_cells = (pop_data == 0).sum().item()
# Flatten the raster to a 1D array for percentile calculations, ignoring missing values
pop_values = pop_data.values.flatten()
pop_values = pop_values[~np.isnan(pop_values)]
# Calculate the population share of the top 10% most dense cells
top10_threshold = np.percentile(pop_values, 90)
top10_population = pop_values[pop_values >= top10_threshold].sum()
top10_share = top10_population / pop_values.sum() * 100
print(f"Grid dimensions: {pop_data.sizes['y']} rows x {pop_data.sizes['x']} columns")
print(f"Total cells: {pop_data.sizes['y'] * pop_data.sizes['x']:,}")
print(f"Total estimated population: {pop_data.sum().item() / 1e6:.2f} million")
print(f"Cells with zero or missing population: {missing_cells + zero_cells:,} cells ({(missing_cells + zero_cells) / total_cells * 100:.1f}% of total)")
print(f"\nPopulation per hectare statistics:")
print(f" Minimum: {pop_data.min().item():.1f}")
print(f" Maximum: {pop_data.max().item():.1f}")
print(f" Mean: {pop_data.mean().item():.1f}")
print(f" Median: {float(np.median(pop_values)):.1f}")
print(f"\nConcentration: the top 10% most dense cells contain {top10_share:.1f}% of the total population")
2.6 Task 6: Calculate length of infrastructure per person
Now that you've reviewed and validated your data, it's time to calculate the key metrics for your city. Here's what this task does:
- Reprojects the data to a metric coordinate system (metres instead of degrees) so that distances are calculated accurately
- Calculates the total length of roads, sidewalks, bike lanes, and bus lanes in kilometres
- Estimates the total population of the city from the WorldPop raster
- Calculates infrastructure availability per person by dividing the total length of each infrastructure type by the total population
Note: the per-person figures are reported in metres per person, which is a more readable unit than kilometres per person for most infrastructure types.
A note on data quality: these results are derived from OpenStreetMap and are subject to limitations, including incomplete tagging and inconsistent mapping conventions. Treat them as indicative rather than authoritative.
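Two small calculations may help explain the reprojection step. The sketch below first estimates how many metres one degree of longitude spans at a given latitude (treating the Earth as a sphere of radius 6,371 km, which is an approximation), and then works out which UTM zone, a widely used family of metric coordinate systems, contains a given point. The function names and sample coordinates are illustrative only.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def metres_per_degree_lon(lat_deg: float) -> float:
    """Approximate ground distance covered by one degree of longitude."""
    return math.radians(1) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))


def utm_epsg(lon_deg: float, lat_deg: float) -> str:
    """EPSG code of the WGS 84 / UTM zone containing a point (standard 6-degree zones)."""
    zone = int((lon_deg + 180) // 6) + 1
    base = 32600 if lat_deg >= 0 else 32700  # northern vs southern hemisphere
    return f"EPSG:{base + zone}"


# Approximate coordinates, for illustration only
print(f"{metres_per_degree_lon(0):,.0f} m per degree at the equator")
print(f"{metres_per_degree_lon(50):,.0f} m per degree at 50 degrees north")
print(utm_epsg(30.5, 50.4))  # EPSG:32636
print(utm_epsg(36.2, 50.0))  # EPSG:32637
```

A degree of longitude is much shorter at 50 degrees north than at the equator, which is why lengths measured in degrees are not comparable. The last two lines show that points a few degrees apart can fall in different UTM zones.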
import pandas as pd
import matplotlib.ticker as ticker
# -- Step 1: Prepare and filter data --
# Convert the graph to a GeoDataFrame and reproject it to a metric CRS (the local UTM zone)
edges = ox.graph_to_gdfs(osm_data, nodes=False)
edges = edges.to_crs(edges.estimate_utm_crs())
# Note: the "length" column used below is already in metres (OSMnx calculates it when building the graph)
# Clean the highway field: some OSM edges contain multiple highway tags stored as lists
edges["highway_clean"] = edges["highway"].apply(lambda x: x[0] if isinstance(x, list) else x)
# -- Step 2: Define infrastructure types --
# Use each tag column if it was downloaded; otherwise fall back to an empty column
foot_col = (edges["foot"]
            if "foot" in edges.columns
            else pd.Series(index=edges.index, dtype="object"))
sidewalk_col = (edges["sidewalk"]
                if "sidewalk" in edges.columns
                else pd.Series(index=edges.index, dtype="object"))
cycleway_col = (edges["cycleway"]
                if "cycleway" in edges.columns
                else pd.Series(index=edges.index, dtype="object"))
bicycle_col = (edges["bicycle"]
               if "bicycle" in edges.columns
               else pd.Series(index=edges.index, dtype="object"))
masks = {
    # All mapped transport infrastructure
    "Roads": edges["highway_clean"].notna(),
    # Walking infrastructure
    "Sidewalks": (sidewalk_col.notna() | foot_col.eq("yes")
                  | edges["highway_clean"].isin(["footway", "pedestrian", "path"])),
    # Cycling infrastructure
    "Bike Lanes": (cycleway_col.notna() | bicycle_col.eq("yes")
                   | edges["highway_clean"].isin(["cycleway"])),
    # Bus-related infrastructure
    "Bus Lanes": edges.filter(like="bus").notna().any(axis=1),
}
# -- Step 3: Calculate lengths --
lengths_km = {
    name: edges.loc[mask, "length"].sum() / 1000
    for name, mask in masks.items()
}
# -- Step 4: Calculate infrastructure availability per person --
total_population = pop_data.sum().item()
metres_per_person = {k: (v * 1000) / total_population for k, v in lengths_km.items()}
# -- Step 5: Print summary table --
results = pd.DataFrame({
'Total Length (km)': lengths_km,
'Population': {k: f"{total_population:,.0f}" for k in lengths_km},
'Metres per Person': {k: round(v, 2) for k, v in metres_per_person.items()}
})
print(f"\n{'-' * 55}")
print(f" Transport Infrastructure Summary: {city}")
print(f"{'-' * 55}")
print(f" Total population: {total_population:,.0f}")
print(f"{'-' * 55}\n")
print(results.to_string())
print(f"\n{'-' * 55}")
# -- Step 6: Plot infrastructure availability per person --
fig, ax = plt.subplots(figsize=(8, 5))
fig.patch.set_facecolor("#111111")
ax.set_facecolor("#111111")
colors = ['#ffcb00', '#ff8800', '#00cfff', '#00e676']
bars = ax.bar(
list(metres_per_person.keys()),
list(metres_per_person.values()),
color=colors, width=0.5, zorder=2 )
# Add value labels on top of each bar
for bar in bars:
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 0.01 * max(metres_per_person.values()),
        f"{bar.get_height():.2f}m",
        ha='center', va='bottom', color='white', fontsize=10)
# Style the chart
ax.set_title(f"{city}: Infrastructure availability per person", color='white', fontsize=14, pad=12)
ax.set_ylabel("Metres per Person", color='white', fontsize=11)
ax.tick_params(colors='white')
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.1f'))
ax.grid(axis='y', color='#444444', linestyle='--', linewidth=0.7, zorder=1)
for spine in ax.spines.values():
    spine.set_edgecolor('#444444')
plt.tight_layout()
plt.show()
2.7 Task 7: Interpret and Evaluate the Results
You've now created a basic estimate of infrastructure length per person. But what do these numbers actually mean? This task asks you to think critically about the results and to consider what they tell us, and what they don't.
There is no code to run in this task. Instead, read through the questions below and discuss them with your group.
2.7.1 Part 1: Sense-check your results
Before drawing any conclusions, it's worth asking whether the numbers are plausible.
How complete do you think the OpenStreetMap data is for your city? Recall what you observed in Tasks 4 and 5. Are there infrastructure types that seem underrepresented, for example very few bike lanes or bus lanes?
The WorldPop population estimate may differ from official census figures. Does the figure produced by the tool seem reasonable? If it's higher or lower than expected, how might that affect your per-person metrics?
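To get a feel for how sensitive a per-person metric is to the population figure, the short sketch below computes the relative change in the metric for a given relative error in population. It relies only on the fact that the metric is a length divided by a population; the function name `metric_change` is made up for this example.

```python
def metric_change(population_error: float) -> float:
    """Relative change in a per-person metric if the true population differs
    from the estimate by `population_error` (0.2 means 20% higher)."""
    return 1 / (1 + population_error) - 1


for error in (-0.2, -0.1, 0.1, 0.2):
    print(f"population {error:+.0%} -> metric {metric_change(error):+.1%}")
```

The relationship is inverse rather than symmetric: a population that is 20% lower than estimated raises the metric by 25%, while one that is 20% higher lowers it by about 17%.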
Exercise: look at the OSM wiki describing how bicycle infrastructure is tagged in OpenStreetMap: https://wiki.openstreetmap.org/wiki/Key:cycleway. Can you identify any potential limitations of the method used in the code above to identify bike lanes? How might you improve it?
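One possible limitation is that treating any non-empty `cycleway` value as a bike lane also counts segments explicitly tagged `cycleway=no`, and misses side-specific keys such as `cycleway:right`. The sketch below shows one way a stricter rule might look, using plain dictionaries in place of the real edge table. The set of accepted values is an assumption chosen for illustration and should be checked against the wiki page before use, and `loose_rule` is only a simplified stand-in for the test used in Task 6.

```python
# Values assumed here to indicate dedicated cycling space (an illustrative choice)
DEDICATED_VALUES = {"lane", "track", "opposite_lane", "opposite_track"}
CYCLEWAY_KEYS = ("cycleway", "cycleway:left", "cycleway:right", "cycleway:both")


def loose_rule(tags: dict) -> bool:
    """Simplified broad test: any non-empty `cycleway` value, or highway=cycleway."""
    return tags.get("cycleway") is not None or tags.get("highway") == "cycleway"


def strict_rule(tags: dict) -> bool:
    """Count a segment only when a cycleway key holds a dedicated-space value."""
    if tags.get("highway") == "cycleway":
        return True
    return any(tags.get(key) in DEDICATED_VALUES for key in CYCLEWAY_KEYS)


# Invented example segments
examples = [
    {"highway": "residential", "cycleway": "no"},
    {"highway": "primary", "cycleway:right": "lane"},
    {"highway": "cycleway"},
    {"highway": "secondary", "cycleway": "track"},
]

for tags in examples:
    print(tags, "| loose:", loose_rule(tags), "| strict:", strict_rule(tags))
```

The two rules disagree on the first two segments. Applying a rule like this to real data would also require the side-specific keys to be requested when the network is downloaded, since only the listed tags are kept.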
2.7.2 Part 2: Compare with other cities
To put your results in context, it helps to compare them with cities in countries at a similar stage of development. The table below provides some reference figures for comparable cities in Poland and Romania, derived from OpenStreetMap data using the same methodology as this exercise.
| City | Country | Population | Roads (m/person) | Sidewalks (m/person) | Bike Lanes (m/person) |
|---|---|---|---|---|---|
| Warsaw | Poland | ~2,230,000 | 20.75 | 9.96 | 0.69 |
| Kraków | Poland | ~1,000,000 | 23.63 | 9.71 | 0.34 |
| Bucharest | Romania | ~2,000,000 | 7.72 | 1.63 | 0.09 |
| Cluj-Napoca | Romania | ~320,000 | 15.63 | 6.90 | 0.40 |
- How does your city compare to the benchmark cities? Is it above or below average for each infrastructure type?
- What factors might explain differences between cities, for example city size, urban density, historical development patterns, or investment levels?
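The comparison can be made concrete by averaging the benchmark figures and checking your own results against them. In the sketch below the benchmark values are copied from the table above, while the `my_city` values are placeholders to be replaced with the metres-per-person figures from Task 6.

```python
# Reference values copied from the benchmark table (metres per person)
benchmarks = {
    "Warsaw": {"Roads": 20.75, "Sidewalks": 9.96, "Bike Lanes": 0.69},
    "Kraków": {"Roads": 23.63, "Sidewalks": 9.71, "Bike Lanes": 0.34},
    "Bucharest": {"Roads": 7.72, "Sidewalks": 1.63, "Bike Lanes": 0.09},
    "Cluj-Napoca": {"Roads": 15.63, "Sidewalks": 6.90, "Bike Lanes": 0.40},
}

# Placeholder values: replace with your own results from Task 6
my_city = {"Roads": 12.0, "Sidewalks": 4.0, "Bike Lanes": 0.2}


def benchmark_means(table):
    """Average each infrastructure type across the benchmark cities."""
    kinds = next(iter(table.values())).keys()
    return {kind: sum(row[kind] for row in table.values()) / len(table) for kind in kinds}


for kind, mean in benchmark_means(benchmarks).items():
    position = "above" if my_city[kind] > mean else "at or below"
    print(f"{kind}: yours {my_city[kind]:.2f} vs benchmark mean {mean:.2f} ({position})")
```

With only four benchmark cities, the mean is strongly influenced by any single city, so it is worth looking at the individual rows as well.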
2.7.3 Part 3: Transport planning implications
If you were advising the city government, which infrastructure type would you prioritise investing in first? What additional data would you want before making that recommendation?
How might the ongoing conflict in Ukraine affect the reliability of both the OpenStreetMap data and the WorldPop population estimates for the cities in this exercise?
3 Task 8: Generate Maps of the Data
So far, you've calculated summary statistics for your city's transport infrastructure. However, averages can hide important spatial differences. In this task, you'll create maps to explore:
- Where transport infrastructure is concentrated
- How infrastructure relates to population density
- Whether some parts of the city appear underserved
The code below will:
- Convert the OpenStreetMap network into separate infrastructure layers
- Plot roads, sidewalks, bike lanes, and bus lanes
- Overlay the WorldPop population data
- Produce a combined map showing transport infrastructure alongside population distribution
Because the network data can be large, the plotting step may take a minute or two for larger cities.
from matplotlib.colors import LogNorm
# Convert graph to GeoDataFrame
edges = ox.graph_to_gdfs(osm_data, nodes=False)
# Reproject to Web Mercator for plotting
edges_plot = edges.to_crs("EPSG:3857")
pop_plot = pop_data.rio.reproject("EPSG:3857")
# Clean highway field
edges_plot["highway_clean"] = edges_plot["highway"].apply(
    lambda x: x[0] if isinstance(x, list) else x)
# Define infrastructure layers for mapping
# (sidewalk_col, foot_col, cycleway_col and bicycle_col were created in Task 6, so run that task first)
layers = {
    "Roads": edges_plot[edges_plot["highway_clean"].notna()],
    "Sidewalks": edges_plot[(
        sidewalk_col.notna()
        | foot_col.eq("yes")
        | edges_plot["highway_clean"].isin(["footway", "pedestrian", "path"])
    )],
    "Bike Lanes": edges_plot[(
        cycleway_col.notna()
        | bicycle_col.eq("yes")
        | edges_plot["highway_clean"].isin(["cycleway"])
    )],
    "Bus Infrastructure": edges_plot[edges_plot.filter(like="bus").notna().any(axis=1)]
}
# -- Plot configuration --
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()
# Use high-contrast infrastructure colors
colors_dict = {
    "Roads": "#d73027", # red
    "Sidewalks": "#4575b4", # blue
    "Bike Lanes": "#1a9850", # green
    "Bus Infrastructure": "#984ea3" # purple
}
for ax, (name, layer) in zip(axes, layers.items()):
    # Plot population raster in grayscale
    pop_plot.plot(
        ax=ax,
        cmap="Greys",
        norm=LogNorm(vmin=1, vmax=float(pop_plot.max())),
        alpha=0.45,
        add_colorbar=False
    )
    # Plot infrastructure layer
    if len(layer) > 0:
        layer.plot(
            ax=ax,
            linewidth=0.8,
            color=colors_dict[name],
            alpha=0.95
        )
    # Styling
    ax.set_title(f"{city}: {name}", fontsize=12)
    ax.set_axis_off()
plt.tight_layout()
plt.show()
4 Task 9: Interpret and Evaluate the Maps
The maps you've generated provide a more detailed picture of transport infrastructure across the city. Rather than focusing only on city-wide averages, let's look for areas that appear well provided for and areas that appear underserved.
There is no code to run in this task. Instead, discuss the questions below with your group.
4.1 Part 1: Identify spatial patterns
Do different infrastructure types follow similar patterns? For example, are bike lanes concentrated in the same areas as sidewalks or bus infrastructure?
Are there areas with high population density but relatively little infrastructure? What challenges might residents in these areas face?
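One rough way to make "high population but little infrastructure" concrete is to divide the city into grid cells and compute metres of infrastructure per resident in each cell. The sketch below uses an invented 2 x 3 grid and arbitrary thresholds; both are assumptions for illustration, and the maps in Task 8 do not perform this calculation.

```python
# Invented data: residents and metres of sidewalk in each grid cell
population = [
    [1200, 300, 0],
    [2500, 800, 50],
]
sidewalk_m = [
    [6000, 2400, 100],
    [2500, 4800, 900],
]

THRESHOLD_M_PER_PERSON = 2.0  # arbitrary cut-off chosen for this example
MIN_RESIDENTS = 100           # ignore nearly empty cells


def underserved_cells(pop_grid, infra_grid, threshold, min_residents):
    """Return (row, column, metres per person) for populated cells below the threshold."""
    flagged = []
    for r, (pop_row, infra_row) in enumerate(zip(pop_grid, infra_grid)):
        for c, (people, metres) in enumerate(zip(pop_row, infra_row)):
            if people >= min_residents and metres / people < threshold:
                flagged.append((r, c, metres / people))
    return flagged


print(underserved_cells(population, sidewalk_m, THRESHOLD_M_PER_PERSON, MIN_RESIDENTS))
# [(1, 0, 1.0)]
```

Only the most populous cell is flagged here, because it has the same sidewalk length as much smaller cells but far more residents sharing it.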
4.2 Part 2: Planning implications
If you were preparing a transport investment strategy for this city, which areas would you prioritise for improvement? Why?
What additional datasets would help you make a stronger planning recommendation? Examples might include:
- Traffic volumes
- Public transport ridership
- Road safety data
- Air pollution measurements
- Employment or income data
- Travel survey data
How could maps like these support real-world transport planning decisions? Consider communication with policymakers, prioritisation of investments, or public engagement.
6 Further Reading
For more context and options to run the materials:
- Rendered Website: View the fully rendered Quarto website at robinlovelace.net/itfworkshop/.
- GitHub Codespaces: Launch a cloud-based development environment (requires a GitHub account) using the repository at github.com/Robinlovelace/itfworkshop.
- Original Repository: Explore the original source code developed by Nick Caros at github.com/ncaros/ukraine-workshop.