library(igraph)
library(dplyr)

Within your network you may have subgroups. In a class of students, for example, there are often smaller homework groups, study groups, friendship groups, etc. What is the demographic composition of these groups? Do subgroups appear more in some settings than in others? What happens to these groups over time? To answer these types of questions you need a specific analytic tool: community detection.

At its core, community detection is a method for analysing the sub-components (groups) of your graph. It literally detects whether there are smaller groups within a network. Using mathematically informed algorithms, the various community detection approaches measure how many subgroups there are in your graph and how cohesive those subgroups are.

For this tutorial I am using the Grime collaboration network data that we have been working with throughout. I am going to bring in the data from 2008 and clean up the graph a bit before we dive in. This network is directed, but for the sake of the tutorial I bring it in as undirected from the start rather than converting it later. You will read why shortly.

grime_edge_list <- read.csv(file.choose(), header = TRUE)

grime_08 <- graph_from_data_frame(d = grime_edge_list, directed = FALSE)
# Remove the self-loops (edges from an artist to themselves)
grime_08_clean <- delete_edges(grime_08, E(grime_08)[which_loop(grime_08)])
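
As an aside, igraph's simplify() can do the same clean-up in one call; the line below is an equivalent alternative to the delete_edges() approach, set here to remove only the loops.

# Equivalent clean-up: strip self-loops in a single call
grime_08_clean <- simplify(grime_08, remove.multiple = FALSE, remove.loops = TRUE)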

The Process

Let’s get familiar with how to perform community detection. In this section I will use the Louvain algorithm, since it is one of the more widely known and comprehensible approaches. To run it, use the cluster_louvain() command. Please note, your network may not work with Louvain because it only works with undirected networks (hence why I brought the Grime network in as undirected).
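
If your own network came in directed, a minimal sketch for collapsing it first (re-using the edge list from above; the object names are my own) would be:

# Hypothetical workflow: read the edge list as directed, then collapse it
grime_08_directed <- graph_from_data_frame(d = grime_edge_list, directed = TRUE)
grime_08_undirected <- as.undirected(grime_08_directed, mode = "collapse")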

I strongly recommend putting that information into an object. Then we can take a look at what it has in it.

louv <- cluster_louvain(grime_08_clean)

louv
## IGRAPH clustering multi level, groups: 17, mod: 0.61
## + groups:
##   $`1`
##   [1] "Asher D"        "Scorcher"       "Flowdan"        "Tinchy Stryder"
##   [5] "Frisco"         "Kano"           "Wiley"          "Lauren Mason"  
##   
##   $`2`
##   [1] "Dizzee Rascal"
##   
##   $`3`
##   [1] "Lethal Bizzle"
##   
##   + ... omitted several groups/vertices

Understanding the Metrics

The two main things you will want to take from these algorithms are the modularity and the membership. Modularity is a score of how separated, or modular, the network is, indicating how cohesive the groups are compared to the whole network. Put simply, modularity compares the density of ties inside each group with the density of ties between groups. The statistic ranges from -1/2 to +1, with values closer to 1 indicating higher modularity. It is a measurement that compares what we would expect to see if the graph were completely random with what is actually observed. Use modularity().
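
For those who want the maths behind those numbers, the standard (Newman–Girvan) definition is:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the total number of edges, and $\delta(c_i, c_j)$ equals 1 when nodes $i$ and $j$ are in the same community (0 otherwise). The $k_i k_j / 2m$ term is the number of edges we would expect between $i$ and $j$ if edges were placed at random while preserving degrees.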

Meanwhile, membership is a vector showing which group each node is assigned to. This will come in very handy if you want to export this information or use it for visualisations. Use membership().
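
For example, a minimal sketch for getting those assignments out of R (the file name is my own choice):

# Pair each artist with their community, then write to CSV
member_df <- data.frame(
  name = V(grime_08_clean)$name,
  community = as.numeric(membership(louv))
)
write.csv(member_df, "grime_08_communities.csv", row.names = FALSE)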

modularity(louv)
## [1] 0.6084184
membership(louv)
##         Asher D   Dizzee Rascal   Lethal Bizzle        Scorcher     Bless Beats 
##               1               2               3               1               4 
##         Flowdan  Tinchy Stryder          Frisco            Kano     Treble Clef 
##               1               1               1               1               5 
##         Shystie          Blacks         Badness         Chronik         Tempa T 
##               6               7               7               7               7 
## Newham Generals          Skepta             JME            Chip             BBK 
##               7               7               7               8               4 
## Virus Syndicate          Ghetts        Mercston        Double S        Griminal 
##               9              10              11               8               8 
##         Ice Kid Nu Brand Flexxx       Wretch 32           Wiley  Bossman Birdie 
##               8              12               8               1              13 
##     The Streets            Sway    Tinie Tempah           Giggs          Jammer 
##              14              15              16              17               7 
##       Roll Deep          Devlin        P Money     Lauren Mason     Milli Major 
##               4              10               8               1              13

Visualisation Tips for Community Detection

There are two main ways to visualise communities in a network. The first is to change the colours of the nodes to match the community they are in. To do this, use the vertex.color argument in the plot() command.

par(mar = c(0,0,0,0))
plot(grime_08_clean, vertex.color = louv$membership, vertex.label = NA)

Second, you can use polygons to demonstrate the boundaries of the communities. To do this, you plot the clustering object alongside the graph object.

par(mar = c(0,0,0,0))
plot(louv, grime_08_clean, vertex.label = NA)

Notice the red edges? Good! Those are the edges that cross community boundaries, highlighting the nodes that bridge between communities.

Another thing to notice is that some of the isolates share colours with other nodes in the network. This is because R only cycles through a small set of colours by default. Including isolates may therefore be misleading in your report, since it appears as if they belong to the same community as other nodes. This is false! You may want to clean your network a bit more when presenting these visuals, then.
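
One workaround is to build a palette with one distinct colour per community, for example with base R's rainbow() (just one option among many):

# One distinct colour per community, so colours are never recycled
comm_cols <- rainbow(max(louv$membership))
par(mar = c(0, 0, 0, 0))
plot(grime_08_clean, vertex.color = comm_cols[louv$membership], vertex.label = NA)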

Clean Your Graph

One main thing you need to think about when detecting communities is your graph's structure and composition. How your graph is structured can strongly affect the findings you get from community detection.

In 2008, there were a lot of Grime artists who did not collaborate with anyone else (isolates). If we were to run a community detection algorithm with all the isolates still in the graph, they would be included in the algorithm's calculations.

For example, I am going to re-run the analysis I did above using the Louvain algorithm with and without the isolates, and you will see what a difference it makes visually. To do this, I will delete the isolates from the network and then plot it.

# Remove every node with degree zero (the isolates)
grime_isol <- delete_vertices(grime_08_clean, which(degree(grime_08_clean) == 0))
par(mar = c(0,0,0,0))
plot(grime_isol)

Now run the Louvain algorithm on this network and take a look at it.

louv_isol <- cluster_louvain(grime_isol)

louv_isol
## IGRAPH clustering multi level, groups: 6, mod: 0.61
## + groups:
##   $`1`
##   [1] "Asher D"        "Scorcher"       "Flowdan"        "Tinchy Stryder"
##   [5] "Frisco"         "Kano"           "Wiley"          "Lauren Mason"  
##   
##   $`2`
##   [1] "Bless Beats" "BBK"         "Roll Deep"  
##   
##   $`3`
##   [1] "Blacks"          "Badness"         "Chronik"         "Tempa T"        
##   [5] "Newham Generals" "Skepta"          "JME"             "Jammer"         
##   + ... omitted several groups/vertices

The number of groups has dropped from 17 to 6. This shows that the first run counted each isolate as its own group. This becomes even more apparent when we plot the network.

Notice, however, that the modularity does not change: the isolates have no edges, so they contribute nothing to the modularity calculation.
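
You can confirm this for yourself:

# Isolates have no edges, so they add nothing to the modularity sum
modularity(louv_isol)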

par(mfrow = c(1, 2))
par(mar =c(0,0,0,0))
set.seed(123)
plot(louv_isol, grime_isol, vertex.label = NA)
set.seed(123)
plot(grime_isol, vertex.color = louv_isol$membership, vertex.label = NA)

Now that the isolates are out, it is a much less confusing (noisy) visualisation.

Comparing Algorithms

As a researcher, you may want to try different algorithms to determine which best tells the story of your network. This section shows you four different algorithms. As the scientist, the onus is on you to ensure you understand what each algorithm does and why they may produce slightly different results.

# Three alternatives to Louvain
wt <- cluster_walktrap(grime_isol)
le <- cluster_leading_eigen(grime_isol)
edge <- cluster_edge_betweenness(grime_isol)

par(mfrow = c(2, 2))
par(mar =c(0,0,3,0))
set.seed(123)
plot(louv_isol, grime_isol, main = "Louvain", vertex.label = NA)
set.seed(123)
plot(edge, grime_isol, main = "Edge Betweenness", vertex.label = NA)
set.seed(123)
plot(wt, grime_isol, main = "Walktrap", vertex.label = NA)
set.seed(123)
plot(le, grime_isol, main = "Leading Eigenvector", vertex.label = NA)

In this case, we have consensus across multiple algorithms. This is rare, but it should build more confidence in your analysis.
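
If you want to quantify that agreement rather than eyeball it, igraph's compare() measures the similarity of two partitions. A minimal sketch using normalised mutual information (1 means identical groupings) and the adjusted Rand index:

# How similar are two partitions? (nmi: 1 = identical groupings)
compare(louv_isol, wt, method = "nmi")
compare(louv_isol, edge, method = "adjusted.rand")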

Analysing the Communities

One thing you can do with the detected communities is describe their nature. A basic example is to look at their characteristics, such as the average nodal degree in each community, to see whether certain groups have higher degree than others. In the following chunk, I make a data frame in an object called node_data. This data frame holds the membership from the Louvain algorithm and each node's degree. Then I compute a variable (not saved in the data frame) called mean_degree, which gives the mean degree of each community.

node_data <- data.frame(
  deg = degree(grime_isol),          # each node's degree
  subgroup = louv_isol$membership    # Louvain community assignment
)

node_data %>%
     group_by(subgroup) %>%
     summarise(mean_degree = mean(deg, na.rm = TRUE))
## # A tibble: 6 × 2
##   subgroup mean_degree
##      <dbl>       <dbl>
## 1        1        2.12
## 2        2        1.67
## 3        3        2.12
## 4        4        2.17
## 5        5        1   
## 6        6        1

What does this new table tell you about each community?

You can do many different descriptive analyses of these communities. Let’s say you have some node-level characteristics, like gender. You can examine the percentage of men and women in each community to see whether gender is associated with one group more than another, as sketched below.
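
As a sketch only: the Grime data has no gender attribute, so the one below is randomly generated purely to show the mechanics.

# Illustration only: a made-up gender attribute (not real data!)
set.seed(42)
node_data$gender <- sample(c("Man", "Woman"), nrow(node_data), replace = TRUE)

node_data %>%
  group_by(subgroup) %>%
  summarise(pct_women = mean(gender == "Woman") * 100)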

Final Thoughts

You need to remember that there are a lot of algorithms that can be used. Each algorithm identifies groups within your network based on a certain characteristic. For example, some algorithms like Louvain seek to maximise modularity, finding the communities that are more densely connected internally than the network as a whole. Meanwhile, Walktrap uses random walks across the graph to determine which nodes tend to occur together on the same walks. Frequent co-occurrence across random walks indicates that two nodes are likely in the same community.

So, you need to be cautious when selecting a community detection algorithm, and make sure you understand it. When reporting, describe your findings in terms of the algorithm you actually used; otherwise your results could be misleading. For example, if you use Walktrap but report that the communities are more densely connected internally than the whole network (clearly a Louvain-related explanation), this may or may not actually be true, because Walktrap does not directly measure community versus network density.

The temptation for researchers is to try multiple algorithms and pick the one that either produces the nicest visualisation or the highest metrics. I strongly recommend not doing this, but rather thinking deeply about why there might be differences across measurements. This in itself could be a finding!