Ina Krapp

sfcentralities: Calculating centralities from sf objects

Tue, 10 Mar 2026 00:00:00 +0000

While I generally like working with the sf package in R, the R ecosystem for spatial data is unfortunately quite heterogeneous. Spatial formats do not only include vector and raster data; the structures in which they are stored are also heavily shaped by the design choices of individual package authors:

sp, probably the first widely used spatial package in R (it is at least among the older ones), and sf, which is often treated as its successor.
spatstat, a package by the authors of “Spatial Point Patterns: Methodology and Applications with R”
terra, which is more often used for raster data but can also handle vector data, and is, in many ways, a successor to the raster package

…and then there is the option to store coordinates in an ordinary R dataframe with columns named ‘Longitude’ and ‘Latitude’ or ‘X’ and ‘Y’ and ignore spatial packages entirely.

These are just the packages and ways to work with spatial data I encountered most frequently. And spatio-temporal analysis is also growing more common (anyone interested in this field: I highly recommend taking a look at the stars package).

I think it is an unfortunate situation for beginners who aim to perform spatial analysis in R. Even if you have some experience, especially if it is in one package, working with another package can feel like starting from zero again. Many authors have valid reasons for their design choices, and there is a growing awareness of the issue; many packages also contain functions to transform data from one format to the other. But sometimes, I find functions implemented in packages that do not see themselves as ‘spatial’ to begin with.

For example, when I tried to find out how to calculate the geometric median in R, the R package that appeared first in the search results was pracma - Practical Numerical Math Functions. The package is impressive for the breadth of functions it covers, but this result also shows how different the contexts are in which functions are used. The geometric median can be used by geographers to find a central point on the map, but is also used, for example, in principal component analysis - which is an entirely different story. There are even more packages in R which offer functions for calculating geometric medians - Gmedian, for example, which is optimized for large, high-dimensional datasets and very fast due to its reliance on C++. So many users of the geometric median do not analyse spatial data.

But I did, for a project, and this is part of where sfcentralities (more precisely: the function st_geo_median) comes from. It takes sf objects as input and returns sf objects as output. It does not reinvent the wheel, but it is beginner-friendly and easy to use - at least I hope so. It is not written purely with convenience in mind, though: spatial data can be in projected or geographic coordinates, and purely numeric implementations like pracma or Gmedian, which cannot properly handle longitude and latitude values, can give wrong results for geographic coordinates. sfcentralities will give an error if the user attempts this with longitude and latitude values.

Why can other packages give wrong results? Because distances in the longitude-latitude system are not constant: one degree of longitude corresponds to a different distance in meters depending on whether you are at the equator or near the poles. So, despite the sometimes sophisticated implementations in these packages, there are good reasons to use spatial packages for spatial data.
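To make this concrete, here is a small sketch in base R that measures the length of one degree of longitude at several latitudes. It is not taken from sfcentralities; the helper haversine_km is a name I made up for this illustration, and it assumes a perfectly spherical earth with a mean radius of 6371 km. Under that assumption, one degree is roughly 111 km at the equator and only about half of that at 60 degrees latitude.

```r
# Great-circle distance in kilometres between lon/lat points (haversine formula).
# Assumption: spherical earth with a mean radius of 6371 km.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# Length of one degree of longitude (from 0 to 1 degree east) at different latitudes
latitudes <- c(0, 30, 60, 85)
degree_km <- haversine_km(0, latitudes, 1, latitudes)

data.frame(latitude = latitudes,
           km_per_degree_longitude = round(degree_km, 1))
```

A function that treats longitude and latitude as plain x and y values implicitly assumes that all four rows of this table contain the same number.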

sfcentralities is also a bit of an attempt to bridge the gap between sf and the dodgr package. dodgr is one of the packages that show just how powerful R can be. dodgr stands for ‘Distances on Directed Graphs’, and while it can take sf objects as input and even offers a function to download data from OpenStreetMap in sf format for analysis, it also builds heavily on its own data format. It allows very fast and precise distance and time calculations even in complex street networks, since it is able to take into account different modes of transport, one-way streets and other factors that go beyond a simple ‘distance in kilometers/miles’ measure.

I have been using it to calculate a measure similar to the geometric median - one that minimizes the sum of distances to points - along a street network. This is not really equivalent to the geometric median because it does not necessarily fulfill a global optimality condition. The geometric median is calculated using Euclidean (straight-line) distances, so I can say that it is the point which minimizes the sum of distances to all points in a set. For a network distance, I can only say that a certain point I evaluated has a higher closeness than the other points I evaluated. This is the closeness centrality, and in sfcentralities, it is implemented in the function st_closeness_centrality.
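The difference between the two ideas can be sketched in a few lines of base R. This is only an illustration and not how sfcentralities or dodgr work internally: the coordinates are made up, the geometric median is found with the general-purpose optimizer optim, and the "network" case is imitated by comparing a fixed set of candidate locations with straight-line distances instead of real street distances.

```r
# Hypothetical projected coordinates (e.g. metres); not taken from any real dataset
pts <- cbind(x = c(0, 10, 10, 0, 50),
             y = c(0, 0, 10, 10, 50))

# Sum of Euclidean distances from a candidate location to all points
sum_dist <- function(candidate, pts) {
  sum(sqrt((pts[, 1] - candidate[1])^2 + (pts[, 2] - candidate[2])^2))
}

# Geometric median: search the whole plane, starting from the centroid
centroid <- colMeans(pts)
geo_median <- optim(centroid, sum_dist, pts = pts)$par

# Candidate-based alternative: only a fixed set of locations can be compared,
# so the result is "best among those evaluated", not a global optimum
candidates <- pts
candidate_scores <- apply(candidates, 1, sum_dist, pts = pts)
best_candidate <- candidates[which.min(candidate_scores), ]

sum_dist(geo_median, pts)  # sum of distances at the geometric median
sum_dist(centroid, pts)    # sum of distances at the centroid, for comparison
best_candidate             # best of the evaluated candidates
```

In the second case, nothing guarantees that a location outside the candidate set would not score better, which is exactly the limitation described above.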

I published an R package which aims to simplify centrality calculations in spatial data using the sf package.

Spatial Data Analysis in R

Wed, 15 Oct 2025 00:00:00 +0000

Introduction

The spatial data ecosystem in R comprises a number of packages, each tailored to a particular data type or workflow. The most common data objects are:

Points – e.g. locations of cities or sampling sites

Lines – e.g. roads, rivers or migration routes

Polygons – e.g. country borders or land‑use zones

Rasters – gridded values such as elevation or satellite imagery

Many packages exist for these data types in R: sp/sf handle vector data, raster/terra work with rasters, stars is specialised for spatio‑temporal objects, and spatstat focuses on point pattern analysis, but also offers raster tools.

We will focus on sf, terra and spatstat here.

library(sf) # Spatial Features - vector data
library(tidyverse)
library(here)
library(spatstat) # Point Pattern Analysis
library(terra) # For robust spatial operations, especially with raster data
library(mgcv) # Generalized additive regression

The code used for this workshop can be found under: . Most of the spatial datasets used in this workshop are too large to be uploaded to GitHub. To replicate the analysis, they have to be downloaded from their original sources.

Get country borders and cities:

The Natural Earth project provides free vector layers that are well suited for a first look at the Korean peninsula.

After downloading the countries and populated places layers from Natural Earth, read them into R:

countries <- st_read(here("ne_10m_admin_0_countries/ne_10m_admin_0_countries.shp"))
cities <- st_read(here("ne_10m_populated_places/ne_10m_populated_places.shp"))

ggplot() +
 geom_sf(data = countries) +
 geom_sf(data = cities)

Select the Korean Peninsula

The countries and cities objects contain worldwide data; we extract only the parts that belong to North and South Korea.

# Filter cities and countries for North and South Korea
korea_countries <- countries %>%
 filter(SOVEREIGNT %in% c("North Korea", "South Korea"))

cities_korea <- cities %>%
 filter(ADM0NAME %in% c("North Korea", "South Korea"))

# Verify all Korean cities lie in North or South Korea:
ggplot() +
 geom_sf(data = korea_countries) +
 geom_sf(data = cities_korea)

Load Night‑Light Data

The Consistent and Corrected Nighttime Light dataset (CCNL) from DMSP‑OLS provides a global raster that records artificial illumination. You can download it here:

We will use the data from 2013.

This is a tif file: An image with very high resolution.

img <- rast(here("CCNL_DMSP_2013_V1.tif"))
plot(img)

Satellite images that cover the entire earth at high resolution (sometimes one pixel per meter or even finer) can become extremely large.

terra handles such large datasets very efficiently, but it is highly advisable to select only the areas needed for the analysis before performing any operations on them. The image above, converted entirely into R’s standard dataframe format, would have taken up over 5 GB.

For our analysis, we only need the Korean peninsula, so we crop the data to the extent of the peninsula before any further processing.

# Get extent of the Korean states in a format terra understands:
v <- ext(korea_countries)
# Keep only part of the image which shows the peninsula:
img_nightlight_korea <- crop(img, v)
# Plot again to verify the new image shows the peninsula:
plot(img_nightlight_korea)

Re‑project to a Projected CRS

Globally, locations are usually given in latitude/longitude. This has the disadvantage that distance calculations are difficult: one degree of longitude near the equator covers a much larger distance than one degree near the poles.

For local maps, projections that usually use meters or kilometers are used instead. Because they project the earth’s three-dimensional surface onto a flat two-dimensional area, a different projection is used for each region. For the Korean peninsula, we use the projection with the EPSG code 5179, which is also the official projection of the South Korean government.

cities_korea <- cities_korea %>%
 st_transform(crs = "EPSG:5179")
korea_countries <- korea_countries %>%
 st_transform(crs = "EPSG:5179")
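To check what such a transformation does without loading the Natural Earth files, a small stand-alone sketch can help. The two points below use approximate, hand-typed coordinates for Seoul and Pyongyang and are only meant as an illustration; the object names capitals and capitals_5179 are made up for this example.

```r
library(sf)

# Approximate city-centre coordinates in longitude/latitude (illustrative values)
capitals <- st_as_sf(
  data.frame(name = c("Seoul", "Pyongyang"),
             lon = c(126.98, 125.75),
             lat = c(37.57, 39.03)),
  coords = c("lon", "lat"),
  crs = "EPSG:4326"
)

# Transform to the projected CRS used in this workshop
capitals_5179 <- st_transform(capitals, crs = "EPSG:5179")

st_is_longlat(capitals)        # TRUE: coordinates are in degrees
st_is_longlat(capitals_5179)   # FALSE: coordinates are now in metres
st_coordinates(capitals_5179)  # projected x/y values

# Distance matrix between the two cities, computed in the projected CRS
st_distance(capitals_5179)
```

Checking st_is_longlat before any distance-based analysis is a cheap way to catch data that was never reprojected.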

The satellite image also has to be reprojected:

# Reproject satellite image
img_nightlight_korea <- project(img_nightlight_korea, "EPSG:5179")

Prepare Data for spatstat

Unfortunately, since spatstat, sf and terra were developed with different data formats in mind, some further conversions are needed to turn the data into the formats used by spatstat:

6.1 Define the Observation Window

First, spatstat needs a region in which the point pattern occurs, called a ‘window’. For our analysis, we treat only the land area of the Korean peninsula as the window:

# Create observation window: We use the union of North and South Korea
window_sf <- st_union(korea_countries)

# Convert to owin (spatstat window format)
window_owin <- as.owin(window_sf)

6.2 Convert City Coordinates to a ppp Object

In the next step, the coordinates of the Korean cities are extracted and a ppp object is formed from them:

# Extract coordinates
city_coords <- st_coordinates(cities_korea)

# Create ppp object
city_ppp <- ppp(x = city_coords[, 1], y = city_coords[, 2], window = window_owin)

The cities’ locations are now stored as a ppp object - a planar point pattern. Again, we can look at the data:

plot(city_ppp)

6.3 Convert Night‑Light Raster to an im Object

To be able to include satellite data into the analysis, its format also needs to be adjusted:

The code below turns the satellite image into the ‘im’ format, which is used for raster data in spatstat.

# Turning it into the im format used by spatstat takes two steps:
# turn it into a dataframe and then into an 'im' object.
df <- as.data.frame(img_nightlight_korea, xy = TRUE)
nightlight <- as.im(df)
# Verify again that the data looks correct:
plot(nightlight)

Point‑Pattern Modelling

Point patterns in spatstat can be modeled using a number of different models. We explore three models:

Homogeneous Poisson – assumes random locations of cities

Clustered (Matérn) Process – assumes clustering of cities

Inhomogeneous Poisson with Night‑Light Covariate – assumes probability of a city location changes with light intensity

7.1 Homogeneous Poisson

The first model below assumes a Poisson process: it assumes that points appear randomly with a certain probability in space. The process is called homogeneous because this probability is assumed to be equal everywhere in the window.

# Model 1: Homogeneous Poisson process
fit_homog <- ppm(city_ppp ~ 1)
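To get a feeling for what this model assumes, one can simulate such a process by hand. The sketch below uses only base R, a made-up rectangular window and a made-up intensity; it does not use the Korean data. For a homogeneous Poisson model, the maximum likelihood estimate of the intensity is simply the number of points divided by the area of the window.

```r
set.seed(1)

# Hypothetical rectangular window (in km) and intensity (points per square km)
x_range <- c(0, 100)
y_range <- c(0, 50)
intensity <- 0.01

# The number of points is Poisson distributed with mean intensity * area ...
area <- diff(x_range) * diff(y_range)
n_points <- rpois(1, intensity * area)

# ... and, given that number, the locations are uniform over the window
sim_points <- data.frame(
  x = runif(n_points, x_range[1], x_range[2]),
  y = runif(n_points, y_range[1], y_range[2])
)

plot(sim_points$x, sim_points$y, asp = 1,
     main = "Simulated homogeneous Poisson pattern")

# Estimated intensity: number of points per unit area
n_points / area
```

If the real city pattern looks much more clumped than simulations like this one, that is a first hint that the homogeneous model is too simple.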

7.2 Matérn Clustered Process

In reality, points often cluster and are more likely to be found close to other points. The second model assumes a Matérn cluster process.

This process models the idea that the observed points were created as follows: first, a homogeneous Poisson process created points at random locations in the window. Then, the observed points formed close to these first points, within a fixed radius around each of them.

# Model 2: Clustered process:
fit_cluster <- kppm(city_ppp ~ 1, method = "palm", clusters = "MatClust")
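The two-step construction can also be simulated by hand in base R. All parameter values below are invented for illustration, and the sketch is simplified: a careful simulation would also generate parent points slightly outside the window so that clusters near the border are not thinned out.

```r
set.seed(2)

# Hypothetical parameters, chosen only for illustration
win_size <- 100   # square window with side length 100
kappa <- 0.002    # intensity of the parent points
radius <- 5       # offspring fall within this distance of their parent
mu <- 6           # mean number of offspring per parent

# Step 1: parent points from a homogeneous Poisson process
n_parents <- rpois(1, kappa * win_size^2)
parents <- cbind(x = runif(n_parents, 0, win_size),
                 y = runif(n_parents, 0, win_size))

# Step 2: each parent gets a Poisson number of offspring,
# placed uniformly in a disc around it
n_offspring <- rpois(n_parents, mu)
parent_id <- rep(seq_len(n_parents), n_offspring)
r <- radius * sqrt(runif(length(parent_id)))  # sqrt gives uniform density in the disc
theta <- runif(length(parent_id), 0, 2 * pi)
offspring <- data.frame(
  x = parents[parent_id, "x"] + r * cos(theta),
  y = parents[parent_id, "y"] + r * sin(theta)
)

# Step 3: keep only offspring inside the window; the parents themselves
# are not part of the observed pattern
inside <- offspring$x >= 0 & offspring$x <= win_size &
  offspring$y >= 0 & offspring$y <= win_size
offspring <- offspring[inside, ]

plot(offspring$x, offspring$y, asp = 1,
     main = "Simulated Matérn cluster pattern")
```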

There are many other clustering processes which can be modeled.

7.3 Inhomogeneous Poisson with night light

One can also include covariates. Below is a model using the nightlight intensity from the satellite data as covariate:

# Model 3: Inhomogeneous with night brightness
fit_light <- ppm(city_ppp ~ nightlight)

7.4 Model Comparison by AIC

To evaluate the models, one can compare their AIC values (Akaike information criterion). The AIC rewards goodness of fit while penalizing overly complex models; lower values are better.

# Step 5: Compare AICs
AIC_homog <- AIC(fit_homog)
AIC_cluster <- AIC(fit_cluster)
AIC_light <- AIC(fit_light)

# Display comparison
aic_comparison <- data.frame(
 Model = c("Homogeneous", "Light", "Cluster"),
 AIC = c(AIC_homog, AIC_light, AIC_cluster)
)

print(aic_comparison)

The model including nightlight intensity has the lowest AIC.

We can take a look at its coefficients:

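# Print summary of best model
fits <- list(Homogeneous = fit_homog, Light = fit_light, Cluster = fit_cluster)
best_model_index <- which.min(aic_comparison$AIC)
best_model_name <- as.character(aic_comparison$Model[best_model_index])

cat("\nBest model by AIC:", best_model_name, "\n")
# Look the model up by name instead of rebuilding the variable name with paste0()
print(fits[[best_model_name]])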

The night light intensity has a statistically significant effect on the probability that a city is located at a certain place.
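For readers who have not worked with the AIC before, the following sketch shows what the number consists of. It uses two ordinary linear models on the built-in mtcars dataset as stand-ins, not the point process models from above, and the helper manual_aic is a name invented for this example.

```r
# Two nested linear models on a built-in dataset
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# AIC = 2 * (number of estimated parameters) - 2 * (log-likelihood)
manual_aic <- function(fit) {
  ll <- logLik(fit)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

data.frame(
  model = c("small", "large"),
  aic_manual = c(manual_aic(fit_small), manual_aic(fit_large)),
  aic_builtin = c(AIC(fit_small), AIC(fit_large))
)
```

The first term grows with every additional parameter, the second shrinks as the fit improves, so a more complex model only gets a lower AIC if the improvement in fit outweighs the penalty.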

A raster regression

We’ve seen now that light can be used to predict where to find cities. It is often used for that purpose and as a proxy variable for income. Note that the correlation ‘more light = increased chance of finding a city’ holds for both North and South Korea.

But on average, South Korea is much more brightly lit at night than the north - and this is not limited to the cities we saw on the map. We will now look at a model which models this difference in brightness.

8.1 Load Population Raster Data

Our cities dataset does not perfectly capture where people live: while it gives the location of the peninsula’s largest cities, it only contains a small number of smaller towns and villages. In point data, cities are also modeled as points, while in reality they often cover large areas. To get another view on where people live in North and South Korea, we will look at another dataset.

Again, we will be working with tif images. This time, we will be using 2015 population estimates from the WorldPop project:

img_north <- rast(here("prk_pop_2015_CN_100m_R2025A_v1.tif"))
plot(img_north)
img_south <- rast(here("kor_pop_2015_CN_100m_R2025A_v1.tif"))
plot(img_south)

# Re‑project them to EPSG:5179
img_north <- project(img_north, "EPSG:5179")
img_south <- project(img_south, "EPSG:5179")

The North is generally less densely populated than the South: North Korea has around 25 million inhabitants, while around 50 million people live in South Korea.

8.2 Build a Unified Raster Stack

In the next step, we combine the data. We now have several raster datasets, but since they are from different sources, we have to make sure they ‘fit’ onto each other.

# --- Step 1: Define the exact extent of the area our data should cover
korea_mask <- rasterize(korea_countries, img_nightlight_korea, field = 1)

# --- Step 2: Prepare the Brightness Response Variable ---
# Mask the light data and then log-transform it.
korea_lights <- mask(img_nightlight_korea, korea_mask)
final_brightness <- log(korea_lights + 0.01)
names(final_brightness) <- "log_brightness"

# --- Step 3: Create the country predictor from the sf dataset of North and South Korea borders ---
korea_countries <- korea_countries %>%
 mutate(is_south = ifelse(ADMIN == "South Korea", 1, 0))
country_raster <- rasterize(korea_countries,
 img_nightlight_korea,
 field = "is_south")
final_country <- mask(country_raster, korea_mask) # Mask it to the study area.

# --- Step 4: The population predictor ('log_pop') ---
# Align each population raster to the master grid (img_nightlight_korea or korea_mask).
pop_north_aligned <- project(img_north, korea_mask, method = "bilinear")
pop_south_aligned <- project(img_south, korea_mask, method = "bilinear")

# Merge them. The result is a raster with the correct extent but NAs in the ocean.
pop_aligned_merged <- cover(pop_north_aligned, pop_south_aligned)

# Log-transform the population count:
log_pop_aligned <- log(pop_aligned_merged + 1)

# Impute all NAs with 0.
pop_imputed <- ifel(is.na(log_pop_aligned), 0, log_pop_aligned)

# This clips away the values in the ocean, leaving only the data within the peninsula.
final_pop <- mask(pop_imputed, korea_mask)
names(final_pop) <- "log_pop"

# --- Step 5: Assemble the Final Stack and Proceed to Modeling ---
model_stack <- c(final_brightness, final_country, final_pop)
gam_df <- as.data.frame(model_stack, xy = TRUE)

8.3 Generalised Additive Modelling

Spatial data usually contains spatial autocorrelation (neighboring observations are correlated). There is extensive literature on possible approaches to address this issue, but no consensus.

For this course, we will use a generalized additive model (GAM), which allows us to include a smooth function that predicts brightness values from the x-y coordinates.

We can fit various kinds of models, such as one which only includes population, only the country (North or South Korea), or both. We will start with one which only fits the s(x, y) function that models autocorrelation.

gam_model <- gam(
 log_brightness ~ s(x, y, k = 100),
 data = gam_df
)

# View the results
summary(gam_model)

Autocorrelation is often a powerful predictor.

But that does not mean other variables, such as the indicator of whether an area is North or South Korean, will not be significant:

gam_model_light <- gam(
 log_brightness ~ is_south + s(x, y, k = 100),
 data = gam_df
)

# View the results
summary(gam_model_light)

Areas in the South are significantly brighter in general.

Intuitively, this makes sense: Imagine you were standing at the border. Although North Korea might just be a few hundred meters away from you, if you were in South Korea, you could still expect a village on your side of the border to be much more brightly illuminated than on the other side.

Population also plays a role. Populated areas are generally brighter at night:

gam_model_light_population <- gam(
 log_brightness ~ is_south + log_pop + s(x, y, k = 100),
 data = gam_df
)

# View the results
summary(gam_model_light_population)

Again, this is what we would expect intuitively. Although South Korea’s cities emit much more light during the night, in areas where no one lives, there is no need to put up streetlights.

But in North Korea, populated areas often remain dark as well. Satellite imagery of nightlights is often used as an economic indicator for areas where there is little other data available.
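Since the rasters are too large to share, the model structure can also be tried out on simulated data. The sketch below is not the workshop's analysis: the grid, the "border" at y = 0.5 and all effect sizes are invented, and the data frame sim_df only imitates the column names of gam_df.

```r
library(mgcv)

set.seed(3)

# Simulated stand-in for gam_df: a 40 x 40 grid of cell centres
sim_df <- expand.grid(x = seq(0, 1, length.out = 40),
                      y = seq(0, 1, length.out = 40))

# Hypothetical border at y = 0.5 and hypothetical population values
sim_df$is_south <- as.numeric(sim_df$y < 0.5)
sim_df$log_pop <- rnorm(nrow(sim_df), mean = 3)

# Brightness = country effect + population effect + smooth spatial pattern + noise
sim_df$log_brightness <- 1.5 * sim_df$is_south +
  0.5 * sim_df$log_pop +
  sin(2 * pi * sim_df$x) +
  rnorm(nrow(sim_df), sd = 0.3)

sim_model <- gam(log_brightness ~ is_south + log_pop + s(x, y, k = 50),
                 data = sim_df)

# Estimated coefficients of the two parametric terms
coef(sim_model)[c("is_south", "log_pop")]
summary(sim_model)
```

Because the true effects are known here, one can compare the estimates with the values used in the simulation and see how much of the variation the spatial smooth absorbs.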

References

There are many good resources on working with spatial data in R:

An Introduction to Spatial Data Analysis and Statistics: A Course in R by Antonio Paez: ISBN: 978-1-7778515-0-7 DOI: 10.5281/zenodo.5155982

Spatial Statistics for Data Science: Theory and Practice with R by Paula Moraga:

The terra package by Robert J. Hijmans has very extensive documentation online:

For more technical topics such as map projections in R, or how to work with different geodata formats, Geocomputation with R by Robin Lovelace, Jakub Nowosad and Jannes Muenchow is a good reference:

The photo of the Korean peninsula at night is from NASA:

Writing Papers in RMarkdown

Tue, 29 Apr 2025 14:00:00 +0000

AI for non-programmers

Wed, 09 Oct 2024 14:00:00 +0000

Common errors in R and how to solve them

Tue, 08 Oct 2024 14:00:00 +0000

How do German energy partnerships affect trade?

Fri, 20 Sep 2024 00:00:00 +0000

This paper found that some energy partnerships have slight positive effects on overall trade and negative effects on fossil-fuel-intensive trade. However, overall, the results were very heterogeneous.

Software

Sun, 19 May 2024 00:00:00 +0000

Wisp: A locally running version of Whisper

Sat, 13 Apr 2024 00:00:00 +0000

Whisper is a transcription software that turns audio files into text. I created a locally running version of it, Wisp, with the aim of giving it a simple, intuitive user interface. My project can be found here:

Whisper itself has been developed by OpenAI, which is also the company behind ChatGPT and several other Artificial Intelligence programs. You can try it out here: .

Unlike ChatGPT, Whisper does not have a user interface designed by OpenAI. Its demos, as you might have seen if you tried it out above, often are used by many people at the same time. Since they all send their requests to the same computer, people may have to wait a very long time before they receive the text. Alternatively, Whisper can be run locally, using Python, but for anyone who does not know how to use a programming language, this is not an option.

So the aim of my project was to make Whisper easy to use for anyone, on their own computer.

Wisp is supposed to run without an internet connection. Every user runs it on their own computer, meaning that users won’t have to wait until the program has finished someone else’s text. Since it has a graphical user interface, it is easy to use for anyone who is familiar with standard office software like Microsoft Office.

Like with many of my other projects, I learnt a lot in the process of building this program. Before I started, I had no experience in working with audio data and very little with building a graphical user interface in Python. It was also the first time I turned a Python program into an executable Windows program.

Whisper is an AI that turns speech into text. I created a simple, intuitive user interface which allows anyone to run it locally.

Introduction to R

Mon, 20 Nov 2023 00:00:00 +0000

This workshop was given on the 10th of October 2023 as a SAFE Research Data Seminar at the Leibniz Institute for Financial Research SAFE.

In this tutorial, we will use R and RStudio. RStudio is a user interface which makes working with R much more convenient. This tutorial is intended to be run as a Quarto Markdown file in RStudio, which allows you to execute the code yourself. To run it, download the file and open it in a recent version of RStudio.

Where can I get R?

Here:

We’ll also use RStudio. You can get it here:

What is R?

R is a programming language often used for Data Analysis.

Programming your analysis with R instead of doing it in, say, Excel, has a large number of advantages:

It can easily be rerun. You can change your input (for example, append more data) and simply rerun everything instead of having to click through the whole analysis again. That also makes it much more replicable: You never need to worry about forgetting what you did. Anyone else can replicate your analysis easily as well.
R is non-commercial. It is free, so you never have to worry about license fees!
Related to that, anyone can implement and customize anything in R. You never have to wait for a company to finally release a new version with that new function everyone waited for.

RStudio is a user interface.

If you are completely new to programming, you may not know yet what the code below does (though you can probably guess it). Run it (by clicking on the green arrow on the right) and see if you’re right.

1 + 1

You can use R as a calculator. Change this code to calculate the sum of 2 and 3. In practice, you will often take other people's code and modify it.

But of course, there’s much more to R than that.

Vectors

When you have a dataset, you’ll usually not just have one or two numbers. Assume you have data on a household. Two people are children and earn nothing, the father earns 1000 Euros a month, the mother 3000. There are different ways you could work with this data in R.

For example, you could put it into a vector. A vector can be created in R by typing c() and entering the values, separated by commas, between the parentheses. The letter c stands for ‘combine’: the function combines its arguments into a vector. You can imagine such a vector as the column of a table.

household_income <- c(0,0,1000,3000)

In the upper right of the monitor, in the field ‘Environment’, you can see the numbers now. The Environment now contains the value household_income, which is a vector with the four numbers I entered above. You can call it and look at it by typing its name - auto-complete will help with that.

household_income

If you only want to look at individual values, you can call them by their position in the vector.

household_income[4]

You can calculate with them. A major advantage of R is that it is vectorized - you can calculate with an entire vector just like you would with a single value.

Assume each member of the family gets 100 Euro by the state. Then, you can simply calculate that with this code:

household_income + 100

Note that the values in the environment are still the old ones. The result of the calculations is only shown, it is not saved. To save a value, you need to assign it to a variable name:

new_household_income = household_income + 100

The equal sign is used for assignment in many programming languages; in R, it does the same job here as the <- arrow used above. Most other mathematical symbols do what you know from school math.

The newly created variable is again visible in the ‘Environment’ pane. To see it here again, simply type its name.

new_household_income

When you run such code, you can assign it to the same variable name you used before. Assume each family member gets another 200 Euro tax returns:

new_household_income = new_household_income + 200

But be careful when running such code. Run it again and look at how the values (in the ‘Environment’ pane) are changing when you do so.

Each time you run the code, another 200 gets added. Running it more than once therefore gives you a wrong value. Click on the broom (above the ‘Environment’ pane) to clear out the wrong value. Don’t be concerned about deleting everything: Remember that R is easily replicable. All data this pane contained can be recreated by simply running the complete code again. Click on the ‘run all chunks above’ button in the code block (this is the grey downward arrow to the left of the green arrow) to recreate the environment.

You can also, for example, add or subtract vectors from each other. But you should only do that when they have the same length. Assume, for example, each family member spent some money: The children spent 10 and 20 Euros on sweets, the father was shopping for 500 Euros, the mother paid the rent for 1500 Euros. Then you can calculate how much money is left like that:

new_household_income - c(10, 20, 500, 1500 )

Assume each member pays another 10 Euros when they all go to the cinema together. How can this be calculated?

new_household_income - 10

This gives the same result:

new_household_income - c(10, 10, 10, 10 )
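What happens when the lengths differ can be seen in the short sketch below. The vector income is a made-up example, separate from the household data. R reuses ("recycles") the shorter vector, which is convenient for a single value but can hide mistakes otherwise.

```r
income <- c(100, 200, 300, 400)

# Same length: element-wise subtraction
income - c(10, 20, 30, 40)

# Length 1: the single value is reused for every element
income - 10

# Length 2: the shorter vector is recycled as 10, 20, 10, 20 without any warning
income - c(10, 20)

# Length 3 does not divide 4: R still computes a result, but issues a warning.
# Here the warning is caught and its message is stored instead.
result <- tryCatch(income - c(10, 20, 30),
                   warning = function(w) conditionMessage(w))
result
```

The silent case with length 2 is the dangerous one, which is why it is safest to only combine vectors of the same length or a vector with a single value.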

Functions

There are also functions available to make calculations simpler. Assume you’d want to know how much money the family owns on average. You could calculate it like that:

(new_household_income[1] + new_household_income[2] + new_household_income[3] + new_household_income[4])/4

That is a perfectly valid calculation. It adds the money of all family members and divides it by the number of family members. But it took a while to type. R has a built-in function that does the job:

mean(new_household_income)

Likewise, one can find the largest or smallest value in a vector with the functions max and min, and do many other things. Assume you want to find a function but don’t know if it exists. In that case, the ‘Help’ pane (below the ‘Environment’ pane) is your best friend.

Use it to find the function that sums up all elements in the vector and calculate how much all family members earn combined.

sum(new_household_income)

Most functions have several arguments, separated by commas. The first argument is usually the data the function should be applied to, and it is almost always required. But often, other arguments are optional. For example, functions like mean and sum include an na.rm argument. If you set it to TRUE, the function will ignore missing values. By default, it is set to FALSE.

household <- c(100, 100, NA, 200)
sum(household)
# FALSE is the default, so this is the same as above:
sum(household, na.rm = FALSE)
# Setting it to TRUE ignores the missing value:
sum(household, na.rm = TRUE)

Dataframes

There are different types of data structures in R. As mentioned above, you can imagine a vector also as column of a table. It does not necessarily have to contain numbers. The code below constructs a character vector:

household_names = c('Anna', 'Benny', 'Christian', "Daniela")

But with such vectors, of course, you can’t do many calculations.

More frequently, you’ll encounter them as parts of dataframes. These are entire tables, which you can imagine as two or more columns side by side. You can create them by passing vectors of the same length to the data.frame function:

household_data = data.frame(household_names, household_income)

Now, there’s a new field in the Environment in the upper right corner. It is called Data and contains ‘household_data’. You can click on the light blue button beside it to look at it in more detail.

This object is described as ‘4 obs. of 2 variables’, in long form: ‘4 observations of 2 variables’.

R always interprets dataframes in this way: It is assumed that each column in the table represents a specific variable (here, name and income of each family member) and each row represents an observation for the variables (here, we have 4 rows, one for each family member).

Each column of a dataframe, being a variable, has a variable name. Since the dataframe was constructed from the named vectors household_income and household_names, the variables use these words as names. You can access them by writing the name of the dataframe, then a $-sign, then the name of the column (again, autocomplete will help and show which variables are in the dataframe).

household_data$household_income

This looks like a vector and behaves like a vector because it is - surprise - a vector. Again, you can simply take the mean, median, sum or use some other function that accepts a vector as input.

Calculate the median of the household_income from the household_data dataframe.

median(household_data$household_income)
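The $-sign is not the only way to get at a column. The sketch below rebuilds the small household table so that it runs on its own and shows two equivalent alternatives, plus how to add a column; the column name income_after_transfer is made up for this example.

```r
household_data <- data.frame(
  household_names = c("Anna", "Benny", "Christian", "Daniela"),
  household_income = c(0, 0, 1000, 3000)
)

# Three equivalent ways to get the income column as a vector
household_data$household_income
household_data[["household_income"]]
household_data[, "household_income"]

# Add a new column computed from an existing one
household_data$income_after_transfer <- household_data$household_income + 100

# Inspect the structure: column names, types and first values
str(household_data)
```

The double-bracket form is handy when the column name is itself stored in a variable, which the $-sign cannot handle.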

On dataframes, you can also conduct much more complex forms of analysis - from fixed effects regression to timeseries forecasting, R covers almost all modern forms of statistical analysis. This tutorial will not cover all that, but the second workshop will show those who are interested how to conduct a standard linear regression.

To finish this first part, we’ll look at packages and what you can do with them.

Installing and activating a package

R is Open Source. A Core Team is working on it, but anyone can contribute. Contributions are typically packages that can be installed. They allow many, many things - creating maps, developing apps or animating small films, for example.

The package we’ll install now serves a more mundane purpose: Data comes in a wide variety of formats. The standard Stata format (dta), Excel-files (xlsx) and csv-files are just a few examples. Packages allow you to work with all of them.

We’ll load a csv-file with the readr-package.

The library function allows you to activate a package. But if you run the code below now, you’ll receive an error. This is because the package is not installed yet. You can install a package by going to ‘Packages’ in the lower right pane, clicking on ‘Install’ and typing the name of the package which you want to install.

Then run the library function and your package is activated. Now it can be used.

library(readr)

We’ll use the package to load the data from here:

Download it and put it into the same folder that you have currently open in RStudio. A file named ‘penguins.csv’ should be visible in ‘Files’ now.

Now, there are two ways to load a file into R. One is to go to the ‘Files’ pane (another pane in the lower right) and look for the csv file. Once you have found it, click on it with the left mouse button and then click on ‘Import Dataset’.

The other one is using code. It has the advantage that it can easily be replicated. For that reason, if you click on ‘Import Dataset’, you get a code preview in the lower right corner. Copy it to the code block below.

penguins <- read_csv("penguins.csv")
View(penguins)

We already loaded readr, so you can delete the library command before running the code.

The View function shows the dataset. But you can also look at it by clicking on the word ‘penguins’ in the ‘Environment’ pane.
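If you want to see the loading step in isolation, the following sketch writes a tiny made-up csv file to a temporary location and reads it back. It uses read.csv, the base R counterpart of read_csv, so that it runs without any package. The file contents are invented and are not the penguins data.

```r
# Write a tiny made-up csv file to a temporary location.
example_path <- tempfile(fileext = ".csv")
writeLines(
  c("species,body_mass_g",
    "Adelie,3750",
    "Gentoo,5000",
    "Chinstrap,3700"),
  example_path
)

# Read the file back into a dataframe.
example_data <- read.csv(example_path)

# Inspect the structure and the number of rows.
str(example_data)
nrow(example_data)
```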

In the workshop ‘Linear Regression with R’, we’ll work with this dataset and look at what it tells us about the penguins.

Finally, to convince yourself that R code is replicable, click on the broom in the environment pane to delete all objects. Then click on the Run-button above and on ‘Run All’. You’ll see the objects will be created again.

Linear Regression in R

Mon, 20 Nov 2023 00:00:00 +0000

Load data

If you did not attend the previous workshop ‘Introduction to R’, you’ll need to load the data first:

library(readr)

We’ll use the package to load the data from here:

Download it and put it into the same folder that you have currently open in RStudio.

Read it into R:

penguins <- read_csv("penguins.csv")
View(penguins)

Modifying data

Often, when you’ve loaded the data, you will not be immediately able to work with it. More often than not, it needs to be changed beforehand.

You cannot simply write into a dataframe by clicking on it. This is because such changes would not be reproducible. You can modify a dataframe in any way you want, but in R, it has to be done using code.

This dataset contains no identifier yet. We’ll add one:

penguins$ID <- 1:nrow(penguins)

Individual values can be changed as well. For example, assume that we knew the penguin in row 48 (which currently has an NA value) was female. The sex is in the seventh column. You indicate the position of a value in the dataframe by giving the row first and the column second.

penguins[48, 7] <- 'female'

If we want our dataset to only include rows in which all values are known, we remove the NA’s with this command:

penguins <- na.omit(penguins)

Remember that na.omit always removes entire rows. It is impossible to only remove individual values because a dataframe always needs to have one entry for each column in every row.
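The steps above can be tried on a small made-up dataframe. The name example_penguins and its values are invented for this sketch and are not the real penguins data.

```r
# Made-up data with missing values in different columns.
example_penguins <- data.frame(
  sex = c("male", NA, "female", "female"),
  body_mass_g = c(3750, 3800, NA, 3450)
)

# Add an identifier.
example_penguins$ID <- 1:nrow(example_penguins)

# Fill in one missing value by position: row 2, column 1 (sex).
example_penguins[2, 1] <- "female"

# na.omit drops every row that still contains at least one NA.
complete_penguins <- na.omit(example_penguins)

nrow(example_penguins)   # number of rows before
nrow(complete_penguins)  # number of rows after
complete_penguins$ID     # the IDs show which rows were kept
```

Row 3 disappears completely, including its known sex value, because its body mass is missing.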

There are also other options such as replacing NA’s with the mean value of their column, but these are not always scientifically sound. Which strategy for missing data is appropriate depends on the dataset and your research question.
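As a rough sketch of what mean replacement looks like, and of one reason to be careful with it, consider the following made-up vector. The values are invented for illustration.

```r
# Made-up body masses with one missing value.
body_mass_example <- c(3750, NA, 3250, 4000)

# Mean of the known values only.
column_mean <- mean(body_mass_example, na.rm = TRUE)

# Replace every NA with that mean and keep all other values.
imputed <- ifelse(is.na(body_mass_example), column_mean, body_mass_example)
imputed

# The imputed value sits exactly at the mean, so the variance shrinks.
var(body_mass_example, na.rm = TRUE)
var(imputed)
```

The filled-in vector looks less spread out than the observed data, which is one way mean replacement can distort later results.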

Once we are confident the data is what it should be, we can start our regression.

Running a regression

We’ll use the fixest package to run regressions. Many packages in R can be used for this purpose, and base R allows you to run regressions, too. But the fixest package is very fast, which is important for large datasets.

library(fixest)

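A regression in R is a function. Since we work with linear regressions, we use the feols command (OLS stands for ordinary least squares). Like any function, it has a name and function arguments. The small example function below has the name print_3 and one argument, x: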

print_3 <- function(x){
 print(3)
 print("I printed 3")
}

print_3(4)

The first regression checks if penguins who weigh more have longer flippers:

# Base R:
lm(flipper_length_mm ~ body_mass_g, data = penguins)

# fixest:
feols(flipper_length_mm ~ body_mass_g, data = penguins)

They do: the estimate is positive and significant (with a very small p-value). The effect is not very large, though.

Regression results can be saved by assigning them to a name:

flipper_regression <- feols(flipper_length_mm ~ body_mass_g, data = penguins)

This ensures that you can always access them. You can call the table of results we saw above with the summary function:

summary(flipper_regression)
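A stored regression can also be queried for individual numbers. The following sketch uses base R's lm on simulated data, so it runs without fixest and without the penguins file. All variable names and the simulated relationship are invented for illustration.

```r
set.seed(1)

# Simulated data, not the penguins dataset.
body_mass <- rnorm(100, mean = 4200, sd = 800)
flipper_length <- 137 + 0.015 * body_mass + rnorm(100, sd = 7)
simulated <- data.frame(body_mass, flipper_length)

# Save the regression under a name.
simulated_regression <- lm(flipper_length ~ body_mass, data = simulated)

# The coefficient table: estimate, standard error, t-value and p-value.
coefficient_table <- coef(summary(simulated_regression))
coefficient_table

# A single number can be picked out by row and column name.
coefficient_table["body_mass", "Estimate"]
```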

It is good practice to calculate heteroskedasticity-robust standard errors. This is done with the vcov-argument. You can do it in the regression function or in the summary function.

summary(flipper_regression, vcov = 'hetero')

As you can see, the estimate remains the same. But the standard error and the metrics calculated based on it (such as the t-value and the p-value) change.
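To see why only the standard errors change, here is a base R sketch that computes classical and heteroskedasticity-robust standard errors by hand on simulated data. It uses the common HC1 sandwich formula and is meant as an illustration of the idea, not as a description of how fixest works internally. All names and numbers are invented.

```r
set.seed(2)
n <- 200
x <- runif(n, 1, 10)

# The noise grows with x, so the errors are heteroskedastic.
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)
residuals_fit <- resid(fit)
k <- ncol(X)

# Sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k).
bread <- solve(crossprod(X))
meat <- crossprod(X * residuals_fit)
robust_vcov <- bread %*% meat %*% bread * n / (n - k)

classical_se <- sqrt(diag(vcov(fit)))
robust_se <- sqrt(diag(robust_vcov))

# Same estimates, two different sets of standard errors.
cbind(estimate = coef(fit), classical_se, robust_se)
```

The estimate column is computed only once, which is why it cannot differ between the two versions.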

In the next step, write your own regression. Is the bill length positively related to the bill depth? Calculate heteroskedasticity-robust standard errors for this.

bill_regression <- feols(bill_length_mm ~ bill_depth_mm + bill_depth_mm^2, data = penguins, vcov = 'hetero')

summary(bill_regression, vcov = 'hetero')

Surprisingly, the relation is significantly negative. Penguins, it appears, either have a short, but thick bill or a long, but thin one.

A technical remark on lists

The regression is stored in R as a list. A list is a data type that can hold elements of different types, for example numbers, text and other lists.

Important to know: If you extract an element from a list with single brackets, even if it only contains a single value, it will be a smaller list. For example, the first element of a regression is the number of observations.

bill_regression[1]
class(bill_regression[1])

To get the element itself, use double brackets:

bill_regression[["nobs"]]
class(bill_regression[["nobs"]])
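The same behaviour can be checked on a small list that does not depend on a regression. The list below is made up for illustration.

```r
# A made-up list with two named elements of different types.
example_list <- list(nobs = 333, note = "made-up values")

# Single brackets return a smaller list.
example_list[1]
class(example_list[1])          # "list"

# Double brackets (or $) return the element itself.
example_list[["nobs"]]
class(example_list[["nobs"]])   # "numeric"
example_list$nobs
```

Calculations such as example_list[["nobs"]] + 1 only work on the element itself, not on the smaller list.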

Fixed effects

In many datasets, regressions such as the ones above are not a good approach. That is because they are only valid when the data is a random sample of the underlying population. But what is the population we are considering here?

The penguin data contains three species, covers three islands and was collected over three years. The bills may not be formed the same for all three species. To test this, we use a fixed effect.

species_regression <- feols(bill_length_mm ~ bill_depth_mm | species, data = penguins)
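To build some intuition for what a fixed effect does, the following base R sketch uses simulated data with three groups. Inside each group the slope is negative, but groups with larger x also have larger y. Adding group dummies gives the same slope as subtracting the group means from both variables. The data and names are invented, and the sketch is an illustration rather than a description of the algorithm fixest uses.

```r
set.seed(3)
group <- rep(c("A", "B", "C"), each = 50)
group_level <- unname(c(A = 0, B = 10, C = 20)[group])

# Within each group the true slope is -1.
x <- group_level / 2 + rnorm(150)
y <- 3 * group_level - 1 * x + rnorm(150)

# Ignoring the groups.
pooled_fit <- lm(y ~ x)

# Controlling for the groups with dummy variables.
dummy_fit <- lm(y ~ x + factor(group))

# Controlling for the groups by subtracting group means.
x_within <- x - ave(x, group)
y_within <- y - ave(y, group)
within_fit <- lm(y_within ~ x_within)

c(pooled = coef(pooled_fit)[["x"]],
  dummies = coef(dummy_fit)[["x"]],
  within = coef(within_fit)[["x_within"]])
```

The pooled slope is positive because it mixes differences between groups with the relation inside the groups. The other two slopes are identical and negative.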

Again, we need to think carefully about standard errors. If the data is clustered according to specific variables (here: species), the standard errors have to be clustered as well.

summary(species_regression, vcov = 'hetero')
summary(species_regression, vcov = 'cluster')

What does it look like when we also want to control for the sex of the penguins?

Adding sex as a second fixed effect:

species_regression <- feols(bill_length_mm ~ bill_depth_mm | species + sex, data = penguins )
summary(species_regression, vcov = 'hetero')
summary(species_regression, vcov = 'cluster')

Running a regression with factor variables

Fixed effects are used to control for a certain variable, like species. If, instead, we want to know the effect such a variable has, we need to run the regression with a factor variable.

Transforming a variable into a factor is straightforward:

penguins$sex <- as.factor(penguins$sex)

penguinsex <- penguins$sex

penguins$ID <- as.factor(penguins$ID)
penguinID <- penguins$ID

Note that the type changed: Previously, the type of sex was ‘chr’. Now, it is ‘Factor w/ 2 levels “female”, “male”’.

Levels are the different values a factor variable can take.

weight_regression <- feols(body_mass_g ~ sex, data = penguins)
summary(weight_regression, vcov = 'hetero')

The variable ‘sexmale’ is created automatically when the regression has a factor or character vector as input. The system creates a dummy, a variable ‘sexmale’ that takes the value 1 if the penguin is male and 0 otherwise. It then estimates the effect of this dummy, compared to a baseline (here, female penguins are the baseline).
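The dummy coding can be made visible with base R's model.matrix function. A small sketch with made-up values follows. The dataframe dummy_example is invented for this illustration.

```r
# Made-up factor with the same two levels as in the penguin data.
dummy_example <- data.frame(
  sex = factor(c("female", "male", "male", "female"))
)

# The first level in alphabetical order becomes the baseline.
levels(dummy_example$sex)

# model.matrix shows the columns a regression on sex would use.
design <- model.matrix(~ sex, data = dummy_example)
design
```

The column ‘sexmale’ contains 1 for male rows and 0 for female rows. There is no separate column for the baseline level.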

Again, we want to control for species effects:

weight_regression <- feols(body_mass_g ~ sex | species + island + species^island, data = penguins)
summary(weight_regression, vcov = 'cluster')
summary(weight_regression, vcov = 'twoway')

We see that although the effect is still significant at the 5%-level, the species fixed effects have a certain influence as well.

In the next step, create your own factor variable. Answer the following question: Does the body weight of the penguins change with the years?

penguins$year <- as.factor(penguins$year)

Do penguins weigh more or less depending on the year?

year_regression <- feols(body_mass_g ~ year, data = penguins)
summary(year_regression, vcov = 'hetero')

No. The estimates are positive, but not significant (the p-value is very large). That is also important to know: If they had lost weight in 2008 or 2009, it might have meant that the colonies were endangered.

Experience

Tue, 24 Oct 2023 00:00:00 +0000

An ARIMA model of the global average temperature

Sat, 12 Aug 2023 00:00:00 +0000

I wrote an ARIMA model to predict the average global temperature.

I would not call it a climate model because it only covers one of the many aspects of the climate. Of course, it is much more limited compared to those developed by experts in the field. Still, writing it taught me a lot about forecasting of time series.

Today, I would do some things differently. In particular, I would probably discourage people who use ARIMA models from interpolating missing data points. The ARIMA model can still be used when some data is missing. I used interpolation originally because I also experimented with ETS models, which require a time series without gaps. But the ETS model is not in the published version of the code because its predictions were not very good.
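As a minimal sketch of that point, the following code fits an ARIMA model to a simulated series with a few missing values, using the arima function from base R. The series, the positions of the gaps and the model order are invented for illustration and have nothing to do with the temperature data.

```r
set.seed(4)

# Simulated AR(1) series, not the temperature data.
simulated_series <- arima.sim(model = list(ar = 0.7), n = 200)

# Remove a few observations without interpolating them.
series_with_gaps <- simulated_series
series_with_gaps[c(20, 21, 75, 130)] <- NA

# With maximum likelihood, arima accepts the missing values as they are.
fit_with_gaps <- arima(series_with_gaps, order = c(1, 0, 0), method = "ML")
coef(fit_with_gaps)

# Forecast the next five points.
forecast_values <- predict(fit_with_gaps, n.ahead = 5)$pred
forecast_values
```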

The code (with extensive commentary) can be downloaded . It is written in a Quarto document and should run on any relatively recent version of R and RStudio. The ‘Global_Temperature.txt’ and ‘merged_ice_core_yearly.csv’ files contain the data the model uses, so they have to be downloaded into the same folder as well to run the code. For anyone who just wants to take a look at the results and doesn’t want to run or modify the code themselves, here is the .

Edit from October 19th 2023: I uploaded a version that can be used for workshops in Germany. It is in the subfolder ‘Workshop’.

An ARIMA model of the global average temperature

Sat, 12 Aug 2023 00:00:00 +0000

I created an ARIMA model to predict the increase in global temperature. Although the model is very simple, it can predict some characteristics of the global temperature increase well.