🚦 ITF Transport Data, Statistics and Modelling Workshop — Exercise 1

This exercise walks you through five tasks:

	Task	What you’ll do
📥	Extract	Pull data from OpenStreetMap and WorldPop
🔍	Check	Assess its quality and relevance
✂️	Manipulate & filter	Shape the data to your needs
📐	Calculate	Derive useful performance metrics
🗺️	Visualise & interpret	Analyse charts and maps

💡 No coding experience needed. Just follow the instructions below to run the code. You are encouraged to edit and re-run code cells to learn how they work!

The workshop introduces a couple of open-access data sources and provides a starting point for future work. Using Python, or another open-source programming language, opens up a world of possibilities.

If anything is unclear or doesn’t work as expected, please ask one of the facilitators available in the room.

1 📋 Your Assignment

Imagine you’ve been asked to assess transport infrastructure in a city that has limited official data.

Your goal: find the total length of roads, sidewalks and bike lanes in your assigned area, then divide by population to produce a per-person infrastructure availability metric you can compare with other places.
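As a rough illustration of the metric described above, the sketch below divides a total length by a population. The function name and the input figures are hypothetical placeholders, not results for any real city.

```python
def metres_per_person(total_length_km: float, population: float) -> float:
    """Return infrastructure length per person, in metres."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_length_km * 1000 / population


# Hypothetical inputs: 1,500 km of roads and 1,000,000 residents
example = metres_per_person(1500, 1_000_000)
print(f"{example:.2f} m of road per person")
```

Expressing the result in metres rather than kilometres keeps the numbers easy to read and compare.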

2 ▶️ How to Run the Code

The exercise is divided into tasks, each containing one or more cells, also called ‘chunks’.

You can run the code in a few different ways.

In IPython Notebooks (.ipynb files)

🖱️ Hover over a cell. A ▶ button should appear on the left.
👆 Click ▶ to run it
👀 Watch the output appear below the cell and wait for it to complete.

In Quarto documents (.qmd files)

This approach is recommended if you want more control over the user interface. Start by opening the file (such as ITF_Workshop_Exercise_1.qmd) in VSCode or an equivalent editor with the Quarto extension installed. You can then run the code chunks in one of two ways:

By placing your cursor inside the code chunk and running the code with the keyboard shortcut Ctrl+Enter (Command+Enter on Mac). This will run the whole code chunk and is the equivalent of clicking the “Run Cell” button but quicker.

You can also control exactly which lines, or even characters, are executed by selecting them and then pressing Ctrl+Enter, which lets you learn by experimenting.

By clicking “Run Cell” at the top of each cell (Run Next Cell and Run Above options should also be available)

Let’s get started! 🚀

2.1 🛠️ Task 1: Setup

The first task is to install the packages you will use in the workbook. Packages are extensions that provide ready-made functions, so you don’t have to write everything from scratch in low-level Python code.

⏱️ This step may take a few minutes.

▶️ Run the command below to install the packages from inside Python, if they are not already installed.

!pip install -q osmnx worldpoppy

2.2 🏙️ Task 2: Choose a City

Pick a city from the list below to analyse. Each group should try to choose a different city so you can compare results at the end!

City
Chernihiv, Chernihiv Oblast
Dnipro
Donetsk, Donetsk Oblast
Kharkiv
Kryvyi Rih
Kyiv
Luhansk, Luhansk Oblast
Lviv
Mariupol
Mykolaiv
Odesa
Poltava, Poltava Oblast
Vinnytsia, Vinnytsia Oblast
Zaporizhzhia

Replace “Dnipro” with your chosen city below, and run the cell.

city = "Dnipro"  # <-- Replace with your chosen city
print(f"City set to: {city}")

2.3 🗂️ Task 3: Import the Data

The following cell downloads two main datasets:

Data Source	What it is	What you’ll use it for
🗺️ OpenStreetMap, OSM	A free, community-built map of the world, like ‘Wikipedia for maps’	Estimating the total length of transport network segments
👥 WorldPop	An open dataset of population estimates at high spatial resolution (100 m grid)	Estimating the population of your assigned area

⏱️ This download step may take a few minutes depending on the size of the city and download speeds. You should see some output appear as the data downloads.

# Import the packages from the previous step
import osmnx as ox       # Download OSM data
import worldpoppy as wp  # Download WorldPop data
import geopandas as gpd
from shapely.geometry import box

# Area of interest: the city boundary and its bounding box
gdf = ox.geocoder.geocode_to_gdf(city)
polygon = gdf.geometry.iloc[0]
west, south, east, north = polygon.bounds
bbox = (west, south, east, north)

# Dataset 1: OpenStreetMap transport network within the bounding box
ox.settings.useful_tags_way = ["highway", "cycleway", "footway", "sidewalk", "busway", "length"]
osm_data = ox.graph_from_bbox(bbox, custom_filter='["highway"]')

# Dataset 2: WorldPop population raster for the same bounding box
bbox_geom = box(west, south, east, north)
bbox_gdf = gpd.GeoDataFrame(geometry=[bbox_geom], crs=gdf.crs)  # Convert to GeoDataFrame
pop_data = wp.wp_raster(product_name='pop_g2_r25a', aoi=bbox_gdf, years=2024)

2.4 🔍 Task 4: Review the Data

Before doing any analysis, it’s good practice to visually inspect the input datasets. This task produces a map of each dataset described above. The results should be images showing the transport infrastructure and the population distribution in the case study area, with brighter areas in the second map having higher population density.

# Cell 1: Map the road transport network from OpenStreetMap
import matplotlib.pyplot as plt

fig, ax = ox.plot.plot_graph(
    osm_data,
    show=False,
    close=False,
    bgcolor="#111111",    # Background colour
    edge_color="#ffcb00", # Colour of the network
    edge_linewidth=0.3,     # Line thickness
    node_size=0,            # Hide intersections
)

# Overlay the city boundary as a subtle outline
gdf.plot(ax=ax, fc="none", ec="#ffffff", lw=5, alpha=0.5, zorder=2)

# Add a title
ax.set_title(f"{city}: Transport Network", color="black", fontsize=14, pad=12)
plt.show()

# Cell 2: Map the population distribution from WorldPop
import numpy as np
from matplotlib.colors import LogNorm

# Plot the population data, letting xarray create the figure and axes
# (1 is added to every cell so that empty cells can be shown on a log scale)
mesh = (pop_data.fillna(0) + 1).plot(
    norm=LogNorm(),
    cmap='inferno',
    size=6,
    cbar_kwargs={'label': 'People per 100 m grid cell', 'shrink': 0.6, 'pad': 0.02}
)

# Retrieve the axes and colour bar from the QuadMesh object
ax = mesh.axes
cbar = mesh.colorbar

# Correct the aspect ratio for latitude distortion
lat_center = (pop_data.y.min() + pop_data.y.max()) / 2
ax.set_aspect(1 / np.cos(np.radians(float(lat_center))))

# Style the colour bar
cbar.set_label("People per 100 m grid cell", color="white", fontsize=10)
cbar.ax.yaxis.set_tick_params(color="white")
plt.setp(cbar.ax.yaxis.get_ticklabels(), color="white")

# Style the title, background and axis labels
ax.set_title(
    f"{city} (2024): {pop_data.sum().item() / 1e6:.1f}M People",
    color="white", fontsize=14, pad=12
)
ax.set_facecolor("#111111")
ax.figure.patch.set_facecolor("#111111")
ax.set_xlabel("Longitude", color="white", fontsize=9)
ax.set_ylabel("Latitude", color="white", fontsize=9)
ax.tick_params(colors="white")
plt.show()
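The aspect-ratio correction in the population map uses 1 / cos(latitude) because a degree of longitude covers less ground the further you are from the equator. The sketch below shows the size of that correction; the function name is hypothetical and the latitudes are round illustrative values.

```python
import math


def latlon_aspect(lat_degrees: float) -> float:
    """Aspect ratio that makes one degree of longitude and one degree of
    latitude cover roughly the same ground distance on a lat/lon plot."""
    return 1 / math.cos(math.radians(lat_degrees))


# At the equator no correction is needed
print(f"{latlon_aspect(0):.2f}")   # 1.00
# Around 50 degrees north (roughly the latitude of Kyiv) the y-axis is stretched
print(f"{latlon_aspect(50):.2f}")  # 1.56
```

Without this correction, a city at that latitude would look noticeably squashed from north to south.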

2.5 🔎 Task 5: Inspect and Evaluate the Data

Before doing any calculations, it’s important to understand the structure of your data and check for any potential issues.

This task doesn’t require you to write any code. Instead, focus on reading the outputs carefully and thinking critically about what you see.

After running the cell below, answer the following questions:

Look through the column names. Identify at least 2 columns whose purpose is unclear to you. Can you make an educated guess at what they mean?
Look at the values in the first 5 rows. Are there any columns that appear to contain mostly empty or missing values? Do you think those columns will be important for your analysis?

# Cell 1: Inspect the structure of the OpenStreetMap road network data

# Convert the OSM network to a table of road segments (edges)
_, edges = ox.graph_to_gdfs(osm_data)

# Show the first few rows and all columns
print(f"Total number of road segments: {len(edges)}")
print(f"Number of columns: {len(edges.columns)}")
print("\nColumn names and data types:")
print(edges.dtypes)
print("\nFirst 5 rows:")
edges.head()

After running the cell below, answer the following questions:

Compare the total estimated population to what you know about the city. Does it seem reasonable? If it seems too high or too low, what might explain that?
Look at the minimum and maximum population density values. How large is the gap between them? What does that tell you about how evenly population is distributed across the city?

# Cell 2: Inspect the structure of the WorldPop population data

total_cells = pop_data.sizes['y'] * pop_data.sizes['x']
missing_cells = pop_data.isnull().sum().item()
zero_cells = (pop_data == 0).sum().item()

# Flatten the raster to a 1D array for percentile calculations, ignoring missing values
pop_values = pop_data.values.flatten()
pop_values = pop_values[~np.isnan(pop_values)]

# Calculate the population share of the top 10% most dense cells
top10_threshold = np.percentile(pop_values, 90)
top10_population = pop_values[pop_values >= top10_threshold].sum()
top10_share = top10_population / pop_values.sum() * 100

print(f"Grid dimensions: {pop_data.sizes['y']} rows x {pop_data.sizes['x']} columns")
print(f"Total cells: {total_cells:,}")
print(f"Total estimated population: {pop_data.sum().item() / 1e6:.2f} million")
print(f"Cells with zero or missing population: {missing_cells + zero_cells:,} cells ({(missing_cells + zero_cells) / total_cells * 100:.1f}% of total)")
print(f"\nPopulation per hectare statistics:")
print(f"  Minimum: {pop_data.min().item():.1f}")
print(f"  Maximum: {pop_data.max().item():.1f}")
print(f"  Mean:    {pop_data.mean().item():.1f}")
print(f"  Median:  {float(np.median(pop_values)):.1f}")
print(f"\nConcentration: the top 10% most dense cells contain {top10_share:.1f}% of the total population")
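To make the concentration statistic easier to interpret, the sketch below computes the same kind of share for a tiny made-up grid. It is an illustration only: the function name and the cell values are hypothetical, and it uses a simple rank-based cut-off rather than `np.percentile`, so results can differ slightly from the cell above.

```python
def top_share(values, top_fraction=0.10):
    """Share (%) of the total held by the largest `top_fraction` of values."""
    ranked = sorted(values, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked) * 100


# Hypothetical grid of 10 cells: one dense cell and nine sparse ones
toy_cells = [91, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(f"{top_share(toy_cells):.1f}% of people live in the top 10% of cells")

# A perfectly even distribution gives a share equal to the fraction itself
print(f"{top_share([5] * 10):.1f}%")
```

The further the share is above 10%, the more unevenly the population is spread across the grid.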

2.6 📐 Task 6: Calculate length of infrastructure per person

Now that you’ve reviewed and validated your data, it’s time to calculate the key metrics for your city. Here’s what this task does:

Reprojects the data to a metric coordinate system (metres instead of degrees) so that distances are calculated accurately
Calculates the total length of roads, sidewalks, bike lanes, and bus lanes in kilometres
Estimates the total population of the city from the WorldPop raster
Calculates infrastructure availability per person by dividing the total length of each infrastructure type by the total population

ℹ️ The per-person figures will be in metres per person, which is a more readable unit than kilometres per person for most infrastructure types.

⚠️ A note on data quality: these results are derived from OpenStreetMap and are subject to limitations, including incomplete tagging and inconsistent mapping conventions. Treat them as indicative rather than authoritative.
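The core idea behind the length calculation can be shown on a handful of made-up segments: tidy up the highway tag, select the segments of interest, and add up their lengths. This is a plain-Python sketch with hypothetical data and helper names; the workshop code does the same thing with GeoDataFrame columns and boolean masks.

```python
# Hypothetical road segments: (highway tag, length in metres)
segments = [
    ("residential", 120.0),
    (["footway", "steps"], 40.0),  # a segment carrying two highway values
    ("cycleway", 300.0),
    ("footway", 60.0),
]


def clean_highway(tag):
    """Keep the first value when a tag is stored as a list."""
    return tag[0] if isinstance(tag, list) else tag


def total_km(segments, wanted):
    """Sum the length (km) of segments whose cleaned tag is in `wanted`."""
    metres = sum(length for tag, length in segments if clean_highway(tag) in wanted)
    return metres / 1000


print(total_km(segments, {"footway", "pedestrian", "path"}))
print(total_km(segments, {"cycleway"}))
```

Note that keeping only the first value of a list is a simplification: a segment tagged with several highway types is counted under just one of them.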

import pandas as pd
import matplotlib.ticker as ticker

# ── Step 1: Prepare and filter data ────────────────────────────────────────

# Convert graph to GeoDataFrame and reproject to a metric CRS (the UTM zone that fits the city)
# Note: the "length" column is added by OSMnx and is already in metres
edges = ox.graph_to_gdfs(osm_data, nodes=False)
edges = edges.to_crs(edges.estimate_utm_crs())

# Clean highway field: some OSM edges contain multiple highway tags stored as lists
edges["highway_clean"] = edges["highway"].apply( lambda x: x[0] if isinstance(x, list) else x )

# ── Step 2: Define infrastructure types ────────────────────────────────────
foot_col = ( edges["foot"]
    if "foot" in edges.columns
    else pd.Series(index=edges.index, dtype="object") )

sidewalk_col = ( edges["sidewalk"]
    if "sidewalk" in edges.columns
    else pd.Series(index=edges.index, dtype="object") )

cycleway_col = ( edges["cycleway"]
    if "cycleway" in edges.columns
    else pd.Series(index=edges.index, dtype="object") )

bicycle_col = ( edges["bicycle"]
    if "bicycle" in edges.columns
    else pd.Series(index=edges.index, dtype="object") )

masks = {
    # All mapped transport infrastructure
    "Roads": edges["highway_clean"].notna(),

    # Walking infrastructure
    "Sidewalks": ( sidewalk_col.notna() | foot_col.eq("yes")
            | edges["highway_clean"].isin(["footway", "pedestrian", "path"]) ),

    # Cycling infrastructure
    "Bike Lanes":
        ( cycleway_col.notna() | bicycle_col.eq("yes")
            | edges["highway_clean"].isin(["cycleway"]) ),

    # Bus-related infrastructure
    "Bus Lanes": ( edges.filter(like="bus").notna().any(axis=1) )
}

# ── Step 3: Calculate lengths ────────────────────────────────────────
lengths_km = {
    name: edges.loc[mask, "length"].sum() / 1000
    for name, mask in masks.items() }

# ── Step 4: Calculate infrastructure availability per person ─────────────────
total_population = pop_data.sum().item()
metres_per_person = {k: (v * 1000) / total_population for k, v in lengths_km.items()}

# ── Step 5: Print summary table ───────────────────────────────────────────────
results = pd.DataFrame({
    'Total Length (km)':      lengths_km,
    'Population':             {k: f"{total_population:,.0f}" for k in lengths_km},
    'Metres per Person':      {k: round(v, 2) for k, v in metres_per_person.items()}
})

print(f"\n{'─' * 55}")
print(f"  Transport Infrastructure Summary: {city}")
print(f"{'─' * 55}")
print(f"  Total population: {total_population:,.0f}")
print(f"{'─' * 55}\n")
print(results.to_string())
print(f"\n{'─' * 55}")

# ── Step 6: Plot infrastructure availability per person ───────────────────────
fig, ax = plt.subplots(figsize=(8, 5))
fig.patch.set_facecolor("#111111")
ax.set_facecolor("#111111")

colors = ['#ffcb00', '#ff8800', '#00cfff', '#00e676']
bars = ax.bar(
    list(metres_per_person.keys()),
    list(metres_per_person.values()),
    color=colors, width=0.5, zorder=2 )

# Add value labels on top of each bar
for bar in bars:
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 0.01 * max(metres_per_person.values()),
        f"{bar.get_height():.2f}m",
        ha='center', va='bottom', color='white', fontsize=10 )

# Style the chart
ax.set_title(f"{city}: Infrastructure availability per person", color='white', fontsize=14, pad=12)
ax.set_ylabel("Metres per Person", color='white', fontsize=11)
ax.tick_params(colors='white')
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.1f'))
ax.grid(axis='y', color='#444444', linestyle='--', linewidth=0.7, zorder=1)
for spine in ax.spines.values():
    spine.set_edgecolor('#444444')
plt.tight_layout()
plt.show()
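UTM zones are a common choice for the metric coordinate system used in Step 1, and the right zone depends on the city's longitude. The sketch below shows the standard zone arithmetic; the function name is hypothetical, the coordinates are approximate, and the special zone boundaries around Norway and Svalbard are ignored.

```python
import math


def utm_epsg(lon: float, lat: float) -> str:
    """EPSG code of the WGS 84 / UTM zone containing a point."""
    zone = min(int(math.floor((lon + 180) / 6)) + 1, 60)
    base = 32600 if lat >= 0 else 32700  # northern vs southern hemisphere
    return f"EPSG:{base + zone}"


# Approximate coordinates, for illustration only
print(utm_epsg(30.5, 50.5))  # near Kyiv
print(utm_epsg(24.0, 49.8))  # near Lviv
```

Cities in the west and east of Ukraine fall in different UTM zones, so a single hard-coded zone would not fit every city in the list equally well.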

2.7 📊 Task 7: Interpret and Evaluate the Results

You’ve now created a basic estimate of infrastructure length per person. But what do these numbers actually mean? This task asks you to think critically about the results and consider what they do, and don’t, tell us.

There is no code to run in this task. Instead, read through the questions below and discuss them with your group.

2.7.1 🔍 Part 1: Sense-check your results

Before drawing any conclusions, it’s worth asking whether the numbers are plausible.

How complete do you think the OpenStreetMap data is for your city? Recall what you observed in Tasks 4 and 5. Are there infrastructure types that seem underrepresented — for example, very few bike lanes or bus lanes?
The WorldPop population estimate may differ from official census figures. Does the figure produced by the tool seem reasonable? If it’s higher or lower than expected, how might that affect your per-person metrics?
Exercise: look at the OSM wiki describing how bicycle infrastructure is tagged in OpenStreetMap: https://wiki.openstreetmap.org/wiki/Key:cycleway. Can you identify any potential limitations of the method used in the code above to identify bike lanes? How might you improve it?

2.7.2 🌍 Part 2: Compare with other cities

To put your results in context, it helps to compare them with cities in countries at a similar stage of development. The table below provides some reference figures for comparable cities in Poland and Romania, derived from OpenStreetMap data using the same methodology as this exercise.

City	Country	Population	Roads (m/person)	Sidewalks (m/person)	Bike Lanes (m/person)
Warsaw	Poland	~2,230,000	20.75	9.96	0.69
Kraków	Poland	~1,000,000	23.63	9.71	0.34
Bucharest	Romania	~2,000,000	7.72	1.63	0.09
Cluj-Napoca	Romania	~320,000	15.63	6.90	0.40

How does your city compare to the benchmark cities? Is it above or below average for each infrastructure type?
What factors might explain differences between cities — for example, city size, urban density, historical development patterns, or investment levels?
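If you would like to make the comparison in code, the sketch below averages the reference figures from the table above and compares them with a set of city results. The `my_city` values are hypothetical placeholders to be replaced with your own Task 6 output, and the average is a simple unweighted mean of the four benchmark cities.

```python
from statistics import mean

# Reference figures copied from the table above (metres per person)
benchmarks = {
    "Warsaw":      {"Roads": 20.75, "Sidewalks": 9.96, "Bike Lanes": 0.69},
    "Kraków":      {"Roads": 23.63, "Sidewalks": 9.71, "Bike Lanes": 0.34},
    "Bucharest":   {"Roads": 7.72,  "Sidewalks": 1.63, "Bike Lanes": 0.09},
    "Cluj-Napoca": {"Roads": 15.63, "Sidewalks": 6.90, "Bike Lanes": 0.40},
}

# Hypothetical placeholder values: replace with your Task 6 results
my_city = {"Roads": 12.0, "Sidewalks": 4.0, "Bike Lanes": 0.5}

# Unweighted mean of the benchmark cities for each infrastructure type
benchmark_means = {
    infra: mean(city[infra] for city in benchmarks.values()) for infra in my_city
}

for infra, value in my_city.items():
    avg = benchmark_means[infra]
    position = "above" if value > avg else "at or below"
    print(f"{infra}: {value:.2f} vs benchmark mean {avg:.2f} ({position})")
```

With only four benchmark cities, a single outlier such as Bucharest pulls the mean down noticeably, so it is worth looking at the individual rows as well.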

2.7.3 🚦 Part 3: Transport planning implications

If you were advising the city government, which infrastructure type would you prioritise investing in first? What additional data would you want before making that recommendation?
How might the ongoing conflict in Ukraine affect the reliability of both the OpenStreetMap data and the WorldPop population estimates for the cities in this exercise?

3 🗺️ Task 8: Generate Maps of the Data

So far, you’ve calculated summary statistics for your city’s transport infrastructure. However, averages can hide important spatial differences. In this task, you’ll create maps to explore:

Where transport infrastructure is concentrated
How infrastructure relates to population density
Whether some parts of the city appear underserved

The code below will:

Convert the OpenStreetMap network into separate infrastructure layers
Plot roads, sidewalks, bike lanes, and bus lanes
Overlay the WorldPop population data
Produce a combined map showing transport infrastructure alongside population distribution

⚠️ Because the network data can be large, the plotting step may take a minute or two for larger cities.

from matplotlib.colors import LogNorm

# Convert graph to GeoDataFrame
edges = ox.graph_to_gdfs(osm_data, nodes=False)

# Reproject to Web Mercator for plotting
edges_plot = edges.to_crs("EPSG:3857")
pop_plot = pop_data.rio.reproject("EPSG:3857")

# Clean highway field
edges_plot["highway_clean"] = edges_plot["highway"].apply(
    lambda x: x[0] if isinstance(x, list) else x )

# Define infrastructure layers for mapping (reuses the tag columns created in Task 6, so run that cell first)
layers = {
    "Roads": edges_plot[edges_plot["highway_clean"].notna()],
    "Sidewalks": edges_plot[(
            sidewalk_col.notna()
            | foot_col.eq("yes")
            | edges_plot["highway_clean"].isin(["footway", "pedestrian", "path"])
        )],
    "Bike Lanes": edges_plot[(
            cycleway_col.notna()
            | bicycle_col.eq("yes")
            | edges_plot["highway_clean"].isin(["cycleway"])
        )],
    "Bus Infrastructure": edges_plot[edges_plot.filter(like="bus").notna().any(axis=1)]
}

# ── Plot configuration ─────────────────────────────────────────────────────

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

# Use high-contrast infrastructure colors
colors_dict = {
    "Roads": "#d73027",               # red
    "Sidewalks": "#4575b4",           # blue
    "Bike Lanes": "#1a9850",          # green
    "Bus Infrastructure": "#984ea3"  # purple
}

for ax, (name, layer) in zip(axes, layers.items()):
    # Plot population raster in grayscale
    pop_plot.plot(
        ax=ax,
        cmap="Greys",
        norm=LogNorm(vmin=1, vmax=float(pop_plot.max())),
        alpha=0.45,
        add_colorbar=False
    )
    # Plot infrastructure layer
    if len(layer) > 0:
        layer.plot(
            ax=ax,
            linewidth=0.8,
            color=colors_dict[name],
            alpha=0.95
        )
    # Styling
    ax.set_title(f"{city}: {name}", fontsize=12)
    ax.set_axis_off()

plt.tight_layout()
plt.show()
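The maps shade population on a logarithmic scale because a few very dense cells would otherwise dominate the colour range. The sketch below compares linear and logarithmic scaling for some made-up cell values; the function name and the numbers are hypothetical.

```python
import math


def log_normalise(value: float, vmin: float, vmax: float) -> float:
    """Map a value to the 0-1 range on a logarithmic scale."""
    return (math.log10(value) - math.log10(vmin)) / (
        math.log10(vmax) - math.log10(vmin)
    )


# Hypothetical population counts per grid cell, on a scale from 1 to 1000
for people in (1, 10, 100, 1000):
    linear = (people - 1) / (1000 - 1)
    logged = log_normalise(people, 1, 1000)
    print(f"{people:>5} people: linear {linear:.3f}, log {logged:.3f}")
```

On the linear scale a cell with 10 people is almost indistinguishable from an empty one, while on the log scale it already sits a third of the way up the colour range.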

4 📊 Task 9: Interpret and Evaluate the Maps

The maps you’ve generated provide a more detailed picture of transport infrastructure across the city. Rather than focusing only on city-wide averages, let’s look for areas where infrastructure appears well provided or underserved.

There is no code to run in this task. Instead, discuss the questions below with your group.

4.1 🔍 Part 1: Identify spatial patterns

Do different infrastructure types follow similar patterns? For example, are bike lanes concentrated in the same areas as sidewalks or bus infrastructure?
Are there areas with high population density but relatively little infrastructure? What challenges might residents in these areas face?

4.2 🌍 Part 2: Planning implications

If you were preparing a transport investment strategy for this city, which areas would you prioritise for improvement? Why?
What additional datasets would help you make a stronger planning recommendation? Examples might include:
- Traffic volumes
- Public transport ridership
- Road safety data
- Air pollution measurements
- Employment or income data
- Travel survey data
How could maps like these support real-world transport planning decisions? Consider communication with policymakers, prioritisation of investments, or public engagement.

6 📚 Further Reading

For more context and options to run the materials:
- Rendered website: view the fully rendered Quarto website at robinlovelace.net/itfworkshop/.
- GitHub Codespaces: launch a cloud-based development environment (requires a GitHub account) using the repository at github.com/Robinlovelace/itfworkshop.
- Original repository: explore the original source code developed by Nick Caros at github.com/ncaros/ukraine-workshop.