Farmer Wants a Wife
In 2023, in the article “Migration and Skewed Subnational Sex Ratios among Young Adults”, Michał Gulczyński showed that the sex ratio (i.e., the ratio of the number of men to the number of women) in Europe depends on regional population density — in regions with higher density, there are on average relatively more women. This results from the fact that women tend to migrate more (to bigger cities).
See a chart from that article:

The chart shows that while the sex ratio in the group of children aged 0–9 does not depend on regional population density, among young adults (25–34 years old) there is a clear relationship between population density (shown on a logarithmic scale) and males/females ratio.
Let’s try to reproduce a similar chart for Poland, using R.
For this purpose, we will use data from the Local Data Bank (BDL) of Statistics Poland (GUS) and the bdl package for R. We will need data on population density and the number of people aged 0–9 and 25–34 by sex in all regions of Poland.
The data can be downloaded for different levels of aggregation: it is available for macroregions (level 1), voivodeships (level 2), NUTS2 regions (level 3), NUTS3 subregions (level 4), powiats (level 5), and gminas (level 6).
We will prepare the chart for powiats (Polish „counties”), for the year 2024.
library(bdl)
lvl <- 5
yr <- 2024
The bdl package requires specifying variable IDs. Population density has the ID 60559, while the variable numbers containing the population by age group and sex can be found here.
Click to show the code
pop_density <- get_data_by_variable("60559", unitLevel = lvl, year = yr)
mal <- get_data_by_variable(c("72301","72302", "47736", "47724"),
unitLevel = lvl, year = yr)
fem <- get_data_by_variable(c("72296","72297", "47696", "47695"),
unitLevel = lvl, year = yr)
Since the age brackets in the GUS data are in five-year intervals, we need to aggregate them accordingly.
Click to show the code
mal$m09 <- mal$val_72301 + mal$val_72302
mal$m2534 <- mal$val_47736 + mal$val_47724
fem$f09 <- fem$val_72296 + fem$val_72297
fem$f2534 <- fem$val_47696 + fem$val_47695
Now we transform the data to prepare it for plotting.
The code below uses functions from the dplyr i tidyr packages: it selects the necessary variables, merges the tables, computes the sex ratios for the two age groups, and reshapes the table from wide to long format — as required by ggplot2.
Click to show the code
library(dplyr)
library(tidyr)
pop_density %>%
select (c(id, name, pop_density=val)) %>%
filter(substring(id,12,12)<"4") %>%
merge(select(mal, c(id, m09, m2534))) %>%
merge(select(fem, c(id, f09, f2534))) %>%
mutate(sexratio09 = m09/f09, sexratio2534 = m2534/f2534) %>%
select(id, name, pop_density, sexratio09, sexratio2534) %>%
pivot_longer(4:5, names_to='group', values_to='sex_ratio') %>%
mutate(group = ifelse(group=='sexratio09', '0-9 years', '25-34 years')) -> sexratio
head(sexratio)
## # A tibble: 6 × 5
## id name pop_density group sex_ratio
## <chr> <chr> <dbl> <chr> <dbl>
## 1 011212001000 Powiat bocheński 165. 0-9 years 1.08
## 2 011212001000 Powiat bocheński 165. 25-34 years 1.07
## 3 011212006000 Powiat krakowski 247. 0-9 years 1.06
## 4 011212006000 Powiat krakowski 247. 25-34 years 1.01
## 5 011212008000 Powiat miechowski 68.5 0-9 years 1.03
## 6 011212008000 Powiat miechowski 68.5 25-34 years 1.07
Next, we build two regression models, one for each age group. The dependent variable is the sex ratio in the respective age group (0–9 or 25–34), and the explanatory variable is population density.
Click to show the code
reg09 <- lm(sex_ratio~I(log10(pop_density)), data=sexratio[sexratio$group=="0-9 years",])
reg2534 <- lm(sex_ratio~I(log10(pop_density)), data=sexratio[sexratio$group=="25-34 years",])
summary(reg09)
##
## Call:
## lm(formula = sex_ratio ~ I(log10(pop_density)), data = sexratio[sexratio$group ==
## "0-9 years", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.092851 -0.015425 -0.000443 0.015577 0.096695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.051195 0.005755 182.645 <2e-16 ***
## I(log10(pop_density)) 0.001556 0.002640 0.589 0.556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02741 on 378 degrees of freedom
## Multiple R-squared: 0.0009176, Adjusted R-squared: -0.001725
## F-statistic: 0.3472 on 1 and 378 DF, p-value: 0.5561
summary(reg2534)
##
## Call:
## lm(formula = sex_ratio ~ I(log10(pop_density)), data = sexratio[sexratio$group ==
## "25-34 years", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.138142 -0.034130 -0.003395 0.033375 0.141351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.275149 0.010534 121.05 <2e-16 ***
## I(log10(pop_density)) -0.089491 0.004832 -18.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05017 on 378 degrees of freedom
## Multiple R-squared: 0.4757, Adjusted R-squared: 0.4743
## F-statistic: 343 on 1 and 378 DF, p-value: < 2.2e-16
Now we are ready to visualize the data.
Click to show the code
# Graphics packages
library(ggplot2)
library(plotly)
# Prepare summary labels for regression results
xmin <- min(sexratio$pop_density, na.rm = TRUE)
ymin <- min(sexratio$sex_ratio, na.rm = TRUE)
ymax <- max(sexratio$sex_ratio, na.rm = TRUE)
text_labels <- data.frame(
x = xmin,
y = c(ymin + 0.05*(ymax-ymin), ymin),
label = c(
paste0("Slope 0-9: ", sprintf("%.4f", reg09$coefficients[2]),
", p-value: ", sprintf("%.3f", summary(reg09)$coefficients[2,4])),
paste0("Slope 25-34: ", sprintf("%.4f", reg2534$coefficients[2]),
", p-value: ", sprintf("%.3f", summary(reg2534)$coefficients[2,4]))))
# ggplot chart
ggplot(sexratio, aes(x=pop_density, y=sex_ratio, col=group, label=name)) +
scale_x_continuous(trans='log10') +
geom_point(alpha=.4) +
stat_smooth(method='lm', se=FALSE) +
scale_color_manual(name='', values=c('blue', 'red'))+
xlab("population density (persons per km²)") +
ylab("sex ratio (# of males/# of females)\nin the age group") +
geom_text(
data = text_labels, aes(x = x, y = y, label = label),
color = c("blue", "red"), inherit.aes = FALSE,
hjust = 0,
vjust = 0,
size = 4
)-> G1
ggplotly(G1) %>%
style(textposition = "top right")
We can create analogous charts for higher aggregation levels using the same code, e.g. for voivodeships (lvl <- 2):
regions (lvl <- 3):
subregions (NUTS3, lvl <- 4):
and gminas (lvl <- 6):
Links to larger versions of the charts: voivodeships, regions, subregions NUTS3, powiats, gminas.