Calculating diversity faultlines with the asw.cluster package in R:
A step-by-step guide for beginners

Bertolt Meyer & Andreas Glenz

Last update: 2017-03-03

This is the supplemental material for:

Meyer, B. & Glenz, A. (2013). Team faultline measures: A computational comparison and a new approach to multiple subgroups. Organizational Research Methods, 16, 393-424. doi:10.1177/1094428113484970

A PDF version of this manual is here.

New features of version 2.0


Join the mailing list!

If you require help with using the software and/or would like to connect with other researchers who are interested in subgroups and faultlines, please join our mailing list by sending an empty email with the subject

subscribe faultlines your_first_name your_last_name

(e.g., subscribe faultlines Bertolt Meyer) to sympa@psychologie.lists.uzh.ch. After you receive the confirmation mail, you can send mail to faultlines@psychologie.lists.uzh.ch.

The public archive of the mailing list is here. Before posting a question, please check if somebody else has already posted it. If you ask for help or report a bug, please make sure to include a code snippet so that others can reproduce the problem.


1. Introduction

This manual describes how to calculate various diversity faultline measures with the open source statistical environment R (R Development Core Team, 2011), using the asw.cluster package provided below. R packages are extensions (like plug-ins) that extend the functionality of R by adding new commands to the basic set of commands available in R. The asw.cluster package adds the function faultlines() to R. This function can calculate the diversity faultline measures proposed by Bezrukova, Jehn, Zanutto, and Thatcher (2009), by Gibson and Vermeulen (2003), by Meyer and Glenz (2013), by Shaw (2004), by Thatcher, Jehn, and Zanutto (2003), by Trezzini (2008), by van Knippenberg, Dawson, West, and Homan (2011), and by Lawrence and Zyphur (2011).

This manual is intended for both novice and experienced users of R. Novice users who have never used R before can read it as a step-by-step guide for calculating diversity faultline measures for a data set that already exists in a file such as an SPSS data file or a comma-separated-value text file. Experienced R users will find information on organizing their team data prior to calculations (see Section 5) and examples (see Section 6).

Experienced R users can install the package with the following two commands:

install.packages(c("flexmix", "nnet", "nFactors", "QuantPsyc", "psych"))
install.packages("asw.cluster", repos="http://www.group-faultlines.org/packages", type="source")

and can call ?faultlines() afterwards.

For license reasons, we are currently unable to make the package available from the CRAN repository.

2. Installing R

The first step is the installation of R. To install R, download the most current version for the required operating system from http://www.r-project.org. Access the website, click on the link "CRAN" (Comprehensive R Archive Network) in the navigation on the left and choose the download server closest to your physical location. Next, select your operating system, download the installation file, and execute it. After the installation is complete, launch R by double-clicking its program icon.¹

3. First steps in R

When you launch R, you see one window entitled "Console". This is where R outputs the results from the user's commands (like the output viewer in SPSS). Commands can also be entered one at a time into the console by typing them in at the > prompt.

3.1 Opening a new empty syntax file

Typing commands into the console is cumbersome. It is better practice to open a syntax file in a separate window and to type your syntax into it. This syntax can be executed line by line, or you can execute several highlighted lines of code at once. To open a new empty syntax file on a Mac, select "New File" from the "Files" menu. On a Windows installation of R, select "New Script" from "File".

In a syntax file, highlighted code is executed by pressing the Ctrl key and the letter R simultaneously on Windows machines. On Mac installations, highlighted code is executed by pressing the cmd key and the enter key simultaneously. If no code is highlighted, doing this will execute the code in the line where the cursor is positioned.

3.2 Setting the working directory

One important concept when working with R is the so-called working directory. In a given R session, R treats one folder on the hard drive as a standard directory where it saves files and where it looks for the files that the user wishes to open. Thus, as a first step in your R session, you should specify a directory as a working directory. This directory should contain the data set containing the teams for which faultlines are to be calculated.

On Windows, the working directory is set by selecting "Change directory" from the "Files" menu. On a Mac, the working directory is set by selecting "Change working directory" from the "Extras" menu. Navigate to the directory that contains the data and the package files, and click "choose".

To see the path of the current working directory, you can execute the command getwd() from your syntax file. It will print the location of the current working directory to the console.
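The working directory can also be set from the syntax file instead of the menu, which makes a script easier to re-run later. The following lines are a minimal sketch of this: tempdir() is only a stand-in for the folder that holds your data, and the object names old_wd and data_dir are chosen for this illustration.

```r
# Remember the current working directory so that it can be restored later.
old_wd <- getwd()

# tempdir() is a placeholder; replace it with the path to your own data folder,
# e.g. a quoted path such as "C:/my_project" (hypothetical).
data_dir <- tempdir()

# Make this folder the working directory and confirm the change.
setwd(data_dir)
getwd()

# List the files that R can now open without specifying a full path.
list.files()

# Restore the previous working directory.
setwd(old_wd)
```

Use forward slashes in the path, also on Windows, because the backslash has a special meaning inside R character strings.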

You are now ready to install the package that enables R to calculate faultlines.


  1. R is a powerful and flexible statistical environment for analyses and visualizations. If you do not already use R, the authors strongly recommend that you familiarize yourself with its core principles. An excellent introduction to the R environment is available as a PDF from this address: http://cran.r-project.org/doc/manuals/R-intro.pdf

4. Installing the asw.cluster package

The asw.cluster package requires some other packages to work properly. These so-called dependencies must be installed before the package can be activated. To install the dependent packages, make sure that your internet connection is working, copy the following command into your syntax file, and execute it (see Section 3.1 for how to execute syntax):

install.packages(c("flexmix", "nnet", "nFactors", "QuantPsyc", "psych"))

Subsequently, execute the following command to download the asw.cluster package:

install.packages("asw.cluster", repos="http://www.group-faultlines.org/packages", type="source")

As a next step, the package must be activated. To do so, execute the following command:

library(asw.cluster)

The console should echo this command, followed by the input prompt > indicating that no errors occurred during the activation of the package. You are now ready to import your data set into R and to calculate diversity faultlines.

Note that the package needs to be installed only once on your computer, but that you must activate it again if you quit and re-launch R.

In other words, install.packages() must be executed only once, but library(asw.cluster) must be called after every relaunch of R.
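If you are unsure whether the dependencies are already installed, you can check before installing anything. The following sketch uses only base R; the object names needed and missing_pkgs are chosen for this illustration.

```r
# The dependencies named in the install.packages() call above.
needed <- c("flexmix", "nnet", "nFactors", "QuantPsyc", "psych")

# requireNamespace() returns TRUE if a package is installed and FALSE otherwise.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the dependencies that still need to be installed (possibly none).
missing_pkgs <- needed[!is_installed]
missing_pkgs

# To install only the missing ones, you could then run:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```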

Note that this manual is also available as a vignette in the asw.cluster-package. To access the document as PDF, use the R command

vignette("asw.cluster.guide_1.1.pdf")

5. Importing team data into R

To import your data into R, the data file needs to be read into R's working memory, which is called the workspace. More specifically, you need to create a data frame object that contains your data. The term data frame is R's denomination for data set. The commands required for creating a data frame object with your data in it depend on the type of file that holds your data. The following two subsections explain how to read SPSS data and comma-separated-value plain text data.

5.1 Importing SPSS data sets

To read an SPSS data file with the ending .sav with the following syntax, it must reside in the current working directory (see Section 3.2 for how to set the working directory). The command required to read SPSS files, read.spss(), is included in the package foreign, which is included in the standard installation of R but must be activated prior to use. Activate it by executing the following command:

library(foreign)

Before importing your SPSS data file, here are some guidelines on how the file should be organized:

If these considerations are met, you can import your SPSS data file into R using the following syntax, which assumes your data file is called teamdata.sav and resides in your current working directory (make sure to highlight all of the lines and to call library(foreign) before):

teamdata <- read.spss(file = "teamdata.sav", 
                      to.data.frame = TRUE, 
                      reencode = TRUE)

This command might return warnings with regard to an unknown data type, but these can be ignored as long as R does not report an error. The warnings simply indicate that R encountered a special character that it had trouble converting; the data set has still been read completely.

By executing this command, you have learned the basic principle of R syntax: A command or function has a specific name followed by round brackets, read.spss() in this case. Inside the brackets, you specify the parameters that the function needs. Each parameter has a name and a value, e.g., file = "teamdata.sav". The assignment operator, the left arrow <-, is used to store the result of the command, a complete data frame in this case, into an object of an arbitrary name, teamdata in this case. This object resides in the workspace and can be used for further operations until it is deleted or R is quit. To learn which parameters a function needs, open its help page by typing ? followed by the function name, e.g.,

?read.spss()

5.2 Importing data from comma separated values files

A very common file format for data sets is the comma separated values (.csv) format. Files in the csv format are plain text files that contain a row for each observation in the data set, and the values of different variables are separated by commas. All statistical software packages and spreadsheet applications such as Microsoft Excel can read and write csv files.

To read a csv data file with the ending .csv with the following syntax, it must reside in the current working directory (see Section 3.2 for how to set the working directory). Before importing your csv data file, here are some guidelines on how the file should be organized:

If these considerations are met, you can import your csv data file into R using the following syntax, which assumes your data file is called teamdata.csv and resides in your current working directory:

teamdata <- read.csv(file = "teamdata.csv")

In countries employing the comma as the decimal separator (e.g., Germany), csv files use the semicolon for separating the values in rows. In these countries, use the read.csv2 function:

teamdata <- read.csv2(file = "teamdata.csv")

By executing this command, you have learned the basic principle of R syntax: A command or function has a specific name followed by round brackets, read.csv() in this case. Inside the brackets, you specify the parameters that the function needs. Each parameter has a name and a value, e.g., file = "teamdata.csv". The assignment operator, the left arrow <-, is used to store the result of the command, a complete data frame in this case, into an object of an arbitrary name, teamdata in this case. This object resides in the workspace and can be used for further operations until it is deleted or R is quit. To learn which parameters a function needs, open its help page by typing ? followed by the function name, e.g.,

?read.csv()
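If you are unsure which of the two functions fits your file, it helps to check the imported data frame with str(). The following self-contained sketch writes a small semicolon-separated file with decimal commas to a temporary location and reads it back with read.csv2(); the file content and the object names csv_path and demo_data are made up for this illustration.

```r
# Write a small example file in the semicolon/decimal-comma flavour.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("teamid;age;tenure",
             "1;44;12,5",
             "1;18;2,5",
             "2;22;2,0"),
           csv_path)

# read.csv2() expects ";" between values and "," as the decimal mark.
demo_data <- read.csv2(file = csv_path)

# str() shows one line per variable; tenure should be listed as numeric.
str(demo_data)
```

If a variable that should be numeric shows up as character, or if the whole file ends up in a single column, the file was most likely read with the wrong function.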

5.3 Preparing the data set for calculating faultline measures

After the successful import of your data, the R workspace now contains a data frame object teamdata that contains the entire data set, i.e., all variables and all cases. You can display the names of the variables included in the data set by calling

names(teamdata)

The first few cases of the data set are displayed by calling

head(teamdata)

The entire data frame is printed into the console by calling its name:

teamdata

An SPSS-like table view of the data frame is also available:

fix(teamdata)

This table view of the data must be closed before R accepts further commands.

If your data set contains more variables than those required for the calculation of faultlines, you need to create a subset of the data set that only holds the diversity attributes that you want to use for the faultline, plus a team number variable. To create a new data frame in the R workspace called teamdata_sub that only contains the variables teamid, age, gender, and ethnicity, execute the following command:

teamdata_sub <- teamdata[,c("teamid", "age", "gender", "ethnicity")]

Faultlines cannot be calculated for missing data. Thus, prior to calculating faultlines, you need to remove all team members with missing values in the diversity attributes from the dataset. This can be achieved with the following command that assumes that your data frame is called teamdata_sub:

teamdata_sub <- teamdata_sub[complete.cases(teamdata_sub),]

Now that the data frame only contains the diversity attributes that are intended for the calculation of the faultline, a variable denoting team membership (teamid in this example), and no missing values, you are ready to calculate diversity faultlines.
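Because removing incomplete cases changes the size of the affected teams, it can be useful to compare team sizes before and after this step. The following sketch does this for a small made-up data frame; the data and the object names demo, demo_complete, and size_check are chosen for this illustration.

```r
# Made-up example: two teams of three members with two incomplete cases.
demo <- data.frame(teamid = c(1, 1, 1, 2, 2, 2),
                   age    = c(44, NA, 40, 22, 23, 39),
                   gender = c("f", "m", "f", "f", NA, "m"))

# Keep only the members without missing values.
demo_complete <- demo[complete.cases(demo), ]

# Count members per team before and after; factor() keeps teams that lose
# all of their members in the table with a count of zero.
team_levels <- sort(unique(demo$teamid))
size_check <- data.frame(
  teamid = team_levels,
  before = as.vector(table(factor(demo$teamid, levels = team_levels))),
  after  = as.vector(table(factor(demo_complete$teamid, levels = team_levels)))
)
size_check
```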

6. Calculating diversity faultlines with the faultlines() function

This section assumes that you have successfully installed and activated the asw.cluster package and that you have a data frame teamdata_sub in your R workspace that has been prepared according to the guidelines provided in Section 5.3.

As a first step, read the help page for the faultlines() function by calling ?faultlines(). The 'Details' section of the help page explains the principles of the function:

The function is run over a data set (a data frame passed to the function with the parameter data, e.g., data = teamdata_sub) containing the members of one or more teams as rows and the diversity attributes that are used for calculating a given faultline measure as columns. Note that all columns will be used for calculating the faultline. Thus, this will most likely be a subset of the 'full' data frame that was gathered in a given study. If the data set contains more than one team, it must contain a column that specifies a team number for each team member, thus indicating team membership. The name of this column must be passed to the argument group.par.

For each diversity attribute contained in the data frame, the user must specify its scale (either numeric or nominal) in the same order of the variables in the data frame as a character vector and pass it to the function with the parameter attr.type. For example, if the data set contains the variables age (numeric in years), ethnicity (character factor), and gender (character factor), this parameter must be specified as
attr.type = c("numeric", "nominal", "nominal").

Note that the faultline measures proposed by Shaw (2004) and Trezzini (2008) require that all attributes are nominal. Thus, prior to calculating diversity faultline strengths with these two methods, you must recode numeric attributes such as age to factors with levels such as 'young', 'middle-aged', and 'old' and specify the attribute type of this variable as nominal.
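One way to recode a numeric attribute into a nominal one is the base R function cut(). The following sketch recodes the ages of the example data used later in this manual; the cut points (up to 29, 30 to 44, 45 and older) are arbitrary assumptions made for this illustration and need to be justified for your own research context.

```r
# Ages of the twelve team members from the example data set.
age <- c(44, 18, 40, 33, 33, 50, 22, 23, 39, 42, 57, 51)

# cut() assigns each value to an interval; with the default right = TRUE the
# intervals are (-Inf, 29], (29, 44], and (44, Inf).
age_group <- cut(age,
                 breaks = c(-Inf, 29, 44, Inf),
                 labels = c("young", "middle-aged", "old"))

# Check the recoding against the original values.
data.frame(age, age_group)

# Number of members per category.
table(age_group)
```

The resulting factor can replace the numeric age variable in the data frame, and its entry in attr.type is then "nominal".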

If you wish to calculate one of the diversity faultline measures "thatcher", "bezrukova", or "asw" that are capable of dealing with numeric attributes such as age or tenure, you must specify a weight for each diversity attribute with the attr.weight parameter. These weights indicate how strongly a difference of 1 (in the case of numeric attributes) or a different category (in the case of nominal attributes) is factored into the faultline. In the example case of age (in years), gender, and ethnicity, specifying this parameter as attr.weight = c(0.1, 1, 1) means that an age difference of ten years is weighted equally to a difference in gender, which in turn is weighted equally to a difference in ethnicity. Note that these are the default values for Thatcher et al.'s (2003) Fau and are probably used in most papers, but they appear to be arbitrary. More research is required with regard to the choice of these weights in a given context. Note that the rescale parameter combines with the attr.weight parameter: numeric attributes are first rescaled according to the rescale parameter. In a second step, all attributes (dummy-coded values for nominal attributes) are multiplied by the weight given for them in the attr.weight parameter.
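The arithmetic behind the weights can be checked in a few lines. The sketch below only illustrates the statement that, with attr.weight = c(0.1, 1, 1), ten years of age count as much as one category difference; it does not call the package, and the object names are chosen for this illustration.

```r
# Weights in the order of the attributes: age, gender, ethnicity.
attr_weight <- c(age = 0.1, gender = 1, ethnicity = 1)

# Weighted contribution of an age difference of ten years.
weighted_age_diff <- attr_weight[["age"]] * 10

# Weighted contribution of belonging to a different gender category.
weighted_gender_diff <- attr_weight[["gender"]] * 1

# Both contributions are equal (all.equal() tolerates rounding error).
all.equal(weighted_age_diff, weighted_gender_diff)

# With a weight of 0.2 instead, five years of age would already count
# as much as one category difference.
all.equal(0.2 * 5, weighted_gender_diff)
```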

The metric parameter lets you specify whether Euclidean or Mahalanobis distances should be employed in determining how different team members are from each other. This metric is only employed by the methods "thatcher", "bezrukova", and "asw". Note that the former two methods (Bezrukova et al., 2009; Thatcher et al., 2003) were introduced based on Euclidean distances, which assume that diversity attributes are uncorrelated. Meyer and Glenz (2013) showed that correlations between diversity attributes (e.g., between age and tenure) can have a significant influence on diversity faultline measures. They thus suggested employing Mahalanobis distances to control for such correlations. They explicitly included this option in the calculation of the ASW measure, but invoking it for Thatcher et al.'s Fau or for Bezrukova et al.'s Faultline Distance measure is purely experimental. Employing Mahalanobis distances for the latter two measures will thus deliver a measure that has not been described in the literature. Furthermore, calculating Mahalanobis distances requires an inversion of the variance-covariance matrix of the attributes. Using the Mahalanobis metric is therefore restricted to data sets with invertible variance-covariance matrices, i.e., to numeric attributes only.
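One way to see why nominal attributes are a problem for the Mahalanobis metric is to look at the variance-covariance matrix of dummy-coded data. The sketch below assumes that a nominal attribute is coded with one dummy column per category (an assumption about the coding, made for this illustration); these columns always sum to one, so the matrix has a determinant of zero and cannot be inverted. The data and object names are made up for this example.

```r
# Made-up team with two numeric attributes and one nominal attribute.
team <- data.frame(age    = c(44, 18, 40, 33, 33, 50),
                   tenure = c(12, 2.5, 11, 3, 5, 5),
                   gender = c("f", "m", "f", "f", "m", "f"))

# One dummy column per gender category (columns "genderf" and "genderm").
dummies <- model.matrix(~ 0 + gender, data = team)

# Numeric attributes only: the determinant is clearly larger than zero,
# so the variance-covariance matrix can be inverted.
det_numeric <- det(cov(team[, c("age", "tenure")]))
det_numeric

# Age plus the gender dummies: the determinant is zero up to rounding error,
# so the matrix is singular.
det_mixed <- det(cov(cbind(age = team$age, dummies)))
det_mixed
```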

6.1 ASW cluster faultlines for multiple subgroups

As Meyer and Glenz (2013) illustrated, the ASW measure is the only diversity faultline measure that is suitable for the case where more than two homogeneous subgroups are possible. In the following, we illustrate how to calculate ASW for an example data set teamdata_sub. It consists of two teams with six members each. For each team member, age, gender, and ethnicity have been collected. The data set can be created by executing the following syntax:

teamdata_sub <- data.frame(teamid = c(rep(1,6),rep(2,6)), 
  age = c(44,18,40,33,33,50,22,23,39,42,57,51), 
  gender = c("f","m","f","f","m","f","f","f","m","m","m","m"), 
  ethnicity = c("A","B","A","D","C","B","A","A","B","B","C","C"))

Executing the name of the data frame prints its content to the console:

teamdata_sub

   teamid age gender ethnicity
1       1  44      f         A
2       1  18      m         B
3       1  40      f         A
4       1  33      f         D
5       1  33      m         C
6       1  50      f         B
7       2  22      f         A
8       2  23      f         A
9       2  39      m         B
10      2  42      m         B
11      2  57      m         C
12      2  51      m         C

The first team appears to be rather heterogeneous and cross-cut, but the second team appears to consist of three rather homogeneous subgroups. To calculate the ASW faultline measure for both teams, we need to specify the scales of the diversity attributes age, gender, and ethnicity as being numeric, nominal, and nominal. Instead of specifying the scale types in the call to the faultlines() function, they can also be stored in a variable that can be passed to the function:

my_attr <- c("numeric", "nominal", "nominal")

The ASW faultline algorithm also needs to know how to weight the attributes, i.e., how much age difference is seen as equivalent to a difference in gender or ethnicity. Following the example in the introduction of this section, these can be stored in a variable as well:

my_weights <- c(0.1, 1, 1)

After these considerations have been made, the faultlines can be calculated with the results being stored in a data frame that we call my_ASW. Note how in the call to the faultlines() function, the name of the data frame containing the demographic information and the name of the variable in that data frame specifying team membership are also passed as parameters:

my_ASW <- faultlines(data = teamdata_sub, 
                     group.par = "teamid", 
                     attr.type = my_attr, 
                     attr.weight = my_weights,
                     method = "asw")

Calling the my_ASW object reveals its content:

my_ASW

#   team  fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.3317999 1, 2, 1, 2, 2, 1                   2            3 3
# 2    2 0.8054895 1, 1, 2, 2, 3, 3                   3          2 2 2

In the resulting data frame, each line represents a team. The first column denotes the team number and the second, fl.value, denotes the faultline measure for the given group (the ASW value). The column mbr_to_subgroups shows to which subgroup each member belongs. Members are listed left to right in the top-to-bottom order of the data frame containing the raw data. The column number_of_subgroups indicates how many subgroups the algorithm detected in the given team, and the last column lists the sizes of the subgroups.

The result can be converted into a format where each row represents a team member. Regardless of its format, it can be exported for use in other applications such as SPSS.
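The following sketch illustrates what such a conversion and export could look like in base R. It does not use the actual result object: res is a hand-built stand-in that copies the values printed above, and storing mbr_to_subgroups as a list column is an assumption made for this illustration. Inspect the real object with str(my_ASW) before adapting the code.

```r
# Hypothetical stand-in for the team-level result shown above.
res <- data.frame(team = c(1, 2), fl.value = c(0.3317999, 0.8054895))
res$mbr_to_subgroups <- list(c(1, 2, 1, 2, 2, 1), c(1, 1, 2, 2, 3, 3))

# Number of members per team.
team_sizes <- lengths(res$mbr_to_subgroups)

# One row per team member: the team-level values are repeated for each member.
res_long <- data.frame(team     = rep(res$team, team_sizes),
                       member   = sequence(team_sizes),
                       subgroup = unlist(res$mbr_to_subgroups),
                       fl.value = rep(res$fl.value, team_sizes))
res_long

# Export as a csv file that SPSS and spreadsheet applications can open.
# tempdir() is a placeholder for a folder of your choice.
out_file <- file.path(tempdir(), "faultlines_long.csv")
write.csv(res_long, file = out_file, row.names = FALSE)
```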

A longer and more detailed report of the results can be obtained by calling the function summary() with the result object as parameter:


summary(my_ASW)
# Number of Teams: 2
#
# Calculation features:
# Method: ASW
# Level:  team
# Metric: euclid
#
#
# Team 1 (1):
# ===========
# Faultline Strength:
# [1] 0.3317999
#
# Member to Subgroup Association:
# [1] 1 2 1 2 2 1
#
# Number of Subgroups:
# [1] 2
#
#
# Team 2 (2):
# ===========
# Faultline Strength:
# [1] 0.8054895
#
# Member to Subgroup Association:
# [1] 1 1 2 2 3 3
#
# Number of Subgroups:
# [1] 3
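To make the reported value more tangible, the faultline strength of team 2 can be retraced in base R. The following sketch is a reconstruction for illustration, not the package's code: it assumes that the age difference is multiplied by its weight of 0.1, that each differing nominal attribute adds 1 to the squared Euclidean distance, and that the subgroup partition is already known. It computes the silhouette width of each member for the partition 1, 1, 2, 2, 3, 3 and averages them; it does not search for the best partition.

```r
# Members of team 2 from the example data set.
team2 <- data.frame(age       = c(22, 23, 39, 42, 57, 51),
                    gender    = c("f", "f", "m", "m", "m", "m"),
                    ethnicity = c("A", "A", "B", "B", "C", "C"))

# Pairwise distances (assumption: weight 0.1 for age, and a contribution of 1
# to the squared distance for each nominal attribute on which two members differ).
w_age <- 0.1 * team2$age
d <- sqrt(outer(w_age, w_age, "-")^2 +
          outer(team2$gender, team2$gender, "!=") +
          outer(team2$ethnicity, team2$ethnicity, "!="))

# Partition reported for team 2 in the output above.
partition <- c(1, 1, 2, 2, 3, 3)

# Silhouette width of member i: compares the mean distance to the own subgroup (a)
# with the mean distance to the closest other subgroup (b).
# Note: this simple version assumes that no subgroup consists of a single member.
silhouette_width <- sapply(seq_along(partition), function(i) {
  own <- partition == partition[i]
  a <- sum(d[i, own]) / (sum(own) - 1)   # d[i, i] is 0, so self is excluded
  other_groups <- setdiff(unique(partition), partition[i])
  b <- min(sapply(other_groups, function(g) mean(d[i, partition == g])))
  (b - a) / max(a, b)
})

# Average silhouette width of the partition (approximately 0.805).
asw_team2 <- mean(silhouette_width)
asw_team2
```

Under these assumptions, the average agrees with the fl.value printed for team 2 above.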

6.1.1 Scaling attributes

To circumvent the issue of assigning arbitrary weights (e.g., a difference of ten years of age equals a difference in gender) to diversity attributes, Bezrukova et al. (2009) recommended scaling numeric attributes by their standard deviation and dummy coding nominal attributes with 0 and 1/√2. The latter results in a Euclidean distance of one between two members who differ on a nominal attribute. This scaling is used by default in all papers employing the Fau * Dist faultline measure, and we also recommend employing this scaling when calculating ASW faultlines. The application of this scaling is illustrated in the following example:

my_ASW_sc <- faultlines(data = teamdata_sub,
                               group.par = "teamid",
                               attr.type = my_attr,
                               rescale = "sd",
                               method = "asw")

Note how the use of scaling makes the specification of weights unnecessary. The scaling of attributes leads to different results, as can be seen by calling the resulting object:


my_ASW_sc
#   team          fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.337558681138197      1 2 1 1 2 1                   2            4 2
# 2    2 0.821859114925443      1 1 2 2 3 3                   3          2 2 2
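The effect of dummy coding with 0 and 1/√2 can be verified in base R. The sketch below assumes one dummy column per category and uses two made-up members with different ethnicities; it is only meant to illustrate the arithmetic and does not call the package.

```r
# Two members who differ in ethnicity.
ethnicity <- factor(c("A", "B"))

# One dummy column per category, coded with 0 and 1/sqrt(2) instead of 0 and 1.
dummies <- model.matrix(~ 0 + ethnicity) / sqrt(2)
dummies

# The two members differ by 1/sqrt(2) in both columns, so their Euclidean
# distance is sqrt(1/2 + 1/2) = 1.
nominal_distance <- as.numeric(dist(dummies))
nominal_distance
```

With a coding of 0 and 1, the same two members would be sqrt(2) apart, so a category difference would weigh more than one unit of a numeric attribute.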

6.1.2 Controlling for correlated attributes

One special feature of ASW faultlines is that you can control for the correlation between numeric attributes when calculating the faultline. For example, suppose you want to calculate faultlines for the attributes age and tenure, which tend to be correlated, as in the following example:

mynumericdata <- data.frame(teamid = c(rep(1,6),rep(2,6)), 
                     age = c(44,18,40,33,33,50,22,23,39,42,57,51), 
                     tenure = c(12,2.5,11,3,5,5,2,1,12,13,20,22))

mynumericdata
#    teamid age tenure
# 1       1  44   12.0
# 2       1  18    2.5
# 3       1  40   11.0
# 4       1  33    3.0
# 5       1  33    5.0
# 6       1  50    5.0
# 7       2  22    2.0
# 8       2  23    1.0
# 9       2  39   12.0
# 10      2  42   13.0
# 11      2  57   20.0
# 12      2  51   22.0

with(mynumericdata, cor.test(age, tenure))

#   Pearson's product-moment correlation
# 
# data:  age and tenure 
# t = 4.5062, df = 10, p-value = 0.001132
# alternative hypothesis: true correlation is not equal to 0 
# 95 percent confidence interval:
#  0.4614055 0.9473970 
# sample estimates:
#       cor 
# 0.8185531 

If one calculates ASW faultlines with the standard settings, ASW will assume that the two attributes are uncorrelated and will return a stronger faultline value for the second team than for the first team:

my_num_attr <- c("numeric", "numeric") 
my_num_weights <- c(1,1)

my_ASW <- faultlines(data = mynumericdata, 
                            group.par = "teamid", 
                            attr.type = my_num_attr, 
                            attr.weight = my_num_weights, 
                            method = "asw")
my_ASW

#   team  fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.4680569 1, 2, 1, 2, 2, 1                   2            3 3
# 2    2 0.7792994 1, 1, 2, 2, 3, 3                   3          2 2 2

However, the tenure of the members of the second team is pretty much in line with what one would expect given their age; only the team member whose age is 57 has a lower tenure than her colleague aged 51. So there is actually a difference between the last two group members because for one of them, the tenure is not what one would expect given its strong correlation with age. Therefore, controlling for the correlation by employing the Mahalanobis metric in determining how similar people are leads to a different solution:

my_ASW_m <- faultlines(data = mynumericdata, 
                       group.par = "teamid", 
                       attr.type = my_num_attr, 
                       attr.weight = my_num_weights, 
                       method = "asw", 
                       metric = "mahal")
my_ASW_m

#   team  fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.4653479 1, 2, 1, 3, 3, 4                   4        2 1 2 1
# 2    2 0.3383156 1, 1, 2, 2, 3, 4                   4        2 2 1 1

This solution is drastically different from the previous one: The number of subgroups changes and the second team now has a weaker faultline than the first team, whereas in the previous case, it was the other way around. This example illustrates the considerations that have to be made if diversity attributes are correlated.
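The difference between the two metrics can be explored for single pairs of members in base R. The following sketch assumes that the variance-covariance matrix is estimated across all twelve members of the data set (how the package estimates it is not documented here, so treat this as an assumption); the helper pair_dist() is defined for this illustration only.

```r
# Example data from above.
mynumericdata <- data.frame(teamid = c(rep(1, 6), rep(2, 6)),
                            age    = c(44, 18, 40, 33, 33, 50, 22, 23, 39, 42, 57, 51),
                            tenure = c(12, 2.5, 11, 3, 5, 5, 2, 1, 12, 13, 20, 22))

X <- as.matrix(mynumericdata[, c("age", "tenure")])
S <- cov(X)   # assumption: covariance across the whole data set

# Distance between members i and j (row numbers of the data frame).
pair_dist <- function(i, j, metric = c("euclid", "mahal")) {
  metric <- match.arg(metric)
  delta <- X[i, ] - X[j, ]
  if (metric == "euclid") {
    sqrt(sum(delta^2))
  } else {
    sqrt(drop(t(delta) %*% solve(S) %*% delta))
  }
}

# Members 9 and 10: the older member also has the longer tenure.
pair_dist(9, 10, "euclid")
pair_dist(9, 10, "mahal")

# Members 11 and 12: the older member has the shorter tenure.
pair_dist(11, 12, "euclid")
pair_dist(11, 12, "mahal")
```

Relative to its Euclidean distance, the pair whose difference runs against the correlation (members 11 and 12) is further apart under the Mahalanobis metric than the pair whose difference follows it (members 9 and 10).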

6.1.3 Ego-Faultlines

Version 2.0 of the asw.cluster package introduces a new option to determine faultlines on the level of individual group members. The classic conceptualization of faultlines takes a bird's-eye perspective on latent subgroups in a team by maximizing the average silhouette width of its members. As a result, the member-to-subgroup classification may not correspond to every single member's individual perspective on the group. Determining ego-faultlines with the faultlines() function means taking the individual in-group and out-group perspective of every single member by maximizing his or her particular silhouette width. This is done for every group member, which means that the result consists of as many faultlines (i.e., subgroup partitions and faultline strength values) as there are members in the group. These faultlines can be congruent, but typically different individual perspectives lead to different faultlines. The faultlines() function calculates the individual-level faultline strength and the individual member-to-subgroup association, as well as an aggregated measure of individual faultline strengths on the team level. All results are returned by the summary() function:

my_ego_ASW <- faultlines(data = teamdata_sub,
                         group.par = "teamid",
                         i.level = TRUE,
                         attr.type = my_attr,
                         method = "asw")
summary(my_ego_ASW)
    
    Number of Teams: 2
    
    Calculation features:
       Method:   ASW
       Level:    individual
       Metric:   euclid

    
    
    Team 1 (1):
    ===========
    Faultline Strength:
    [1] 0.5530582
    
    
    Individual Faultline Strengths (silhouette widths):
    [1] 0.7051077 0.3681231 0.6794982 0.3933564 0.4917633 0.6805004
    
    
    Member to Subgroup Association:
         X1      X2      X3      X4      X5      X6
    1121222 2212212 3121111 4121111 5111112 6122221
 
 
    Number of Subgroups:
    [1] 2 2 2 2 2 2
  
  
  Subgroup Network:
             X1        X2        X3        X4        X5        X6
    1 1.0000000 0.1666667 0.8333333 0.6666667 0.5000000 0.6666667
    2 0.1666667 1.0000000 0.3333333 0.5000000 0.6666667 0.1666667
    3 0.8333333 0.3333333 1.0000000 0.8333333 0.6666667 0.5000000
    4 0.6666667 0.5000000 0.8333333 1.0000000 0.8333333 0.6666667
    5 0.5000000 0.6666667 0.6666667 0.8333333 1.0000000 0.5000000
    6 0.6666667 0.1666667 0.5000000 0.6666667 0.5000000 1.0000000

  
  
    Distances:
             X1       X2        X3        X4        X5        X6
    1  0.000000 26.03843  4.000000 11.045361 11.090537  6.082763
    2 26.038433  0.00000 22.045408 15.066519 15.033296 32.015621
    3  4.000000 22.04541  0.000000  7.071068  7.141428 10.049876
    4 11.045361 15.06652  7.071068  0.000000  1.414214 17.029386
    5 11.090537 15.03330  7.141428  1.414214  0.000000 17.058722
    6  6.082763 32.01562 10.049876 17.029386 17.058722  0.000000
  
  
  
    Team 2 (2):
    ===========
    Faultline Strength:
    [1] 0.8329077
   
   
   Individual Faultline Strengths (silhouette widths):
    [1] 0.9604632 0.9588393 0.8100979 0.8101001 0.7649801 0.6929656
  
  
    Member to Subgroup Association:
         X1      X2      X3      X4      X5      X6
    1112222 2112222 3221122 4221122 5222211 6222211
  
  
   Number of Subgroups:
    [1] 2 2 2 2 2 2
   
   
   Subgroup Network:
             X1        X2        X3        X4        X5        X6
    1 1.0000000 1.0000000 0.3333333 0.3333333 0.3333333 0.3333333
    2 1.0000000 1.0000000 0.3333333 0.3333333 0.3333333 0.3333333
    3 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333 0.3333333
    4 0.3333333 0.3333333 1.0000000 1.0000000 0.3333333 0.3333333
    5 0.3333333 0.3333333 0.3333333 0.3333333 1.0000000 1.0000000
    6 0.3333333 0.3333333 0.3333333 0.3333333 1.0000000 1.0000000
   
   
   Distances:
            X1       X2       X3        X4       X5        X6
    1  0.00000  1.00000 17.05872 20.049938 35.02856 29.034462
    2  1.00000  0.00000 16.06238 19.052559 34.02940 28.035692
    3 17.05872 16.06238  0.00000  3.000000 18.02776 12.041595
    4 20.04994 19.05256  3.00000  0.000000 15.03330  9.055385
    5 35.02856 34.02940 18.02776 15.033296  0.00000  6.000000
    6 29.03446 28.03569 12.04159  9.055385  6.00000  0.000000
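The "Subgroup Network" values in this output appear to be consistent with the share of individual partitions in which two members are placed in the same subgroup. The sketch below retraces this for team 2 in base R. It is an interpretation of the printed output, not the package's code: the six partitions are read off the "Member to Subgroup Association" block under the assumption that the leading digit of each entry is the number of the member whose perspective is taken.

```r
# One row per ego (member 1 to 6), one column per classified member,
# as read off the output for team 2 above.
ego_partitions <- rbind(c(1, 1, 2, 2, 2, 2),
                        c(1, 1, 2, 2, 2, 2),
                        c(2, 2, 1, 1, 2, 2),
                        c(2, 2, 1, 1, 2, 2),
                        c(2, 2, 2, 2, 1, 1),
                        c(2, 2, 2, 2, 1, 1))

# For each ego partition, mark which pairs of members share a subgroup,
# then average these indicator matrices across the six perspectives.
same_subgroup <- lapply(seq_len(nrow(ego_partitions)), function(k) {
  outer(ego_partitions[k, ], ego_partitions[k, ], "==")
})
co_membership <- Reduce(`+`, same_subgroup) / nrow(ego_partitions)

round(co_membership, 7)
```

Members 1 and 2 share a subgroup from every perspective (value 1), whereas members 1 and 3 do so from only two of the six perspectives (value 0.3333333), which corresponds to the values printed for team 2.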

6.2 Fau (Thatcher et al. 2003)

Fau by Thatcher et al. (2003) assumes the existence of two homogeneous subgroups. In the following, we show how to calculate it for the example data set introduced in Section 6.1.

As with ASW faultlines, you need to specify the scales of the diversity attributes age, gender, and ethnicity as being numeric, nominal, and nominal. Instead of specifying the scale types in the call to the faultlines() function, they can also be stored in a variable that can be passed to the function:

my_attr <- c("numeric", "nominal", "nominal")

The Fau faultline algorithm also needs to know how to weigh the attributes, i.e., how much age difference is seen as equivalent to a difference in gender or ethnicity. Following the example in the introduction of this section, these can be stored in a variable as well:

my_weights <- c(0.1, 1, 1)

After these considerations have been made, the faultlines can be calculated with the results being stored in a data frame that we call my_Fau. Note how in the call to the faultlines() function, the name of the data frame and the name of the variable in the data frame specifying team membership are also passed as parameters:

my_Fau <- faultlines(data = teamdata_sub, 
                     group.par = "teamid", 
                     attr.type = my_attr, 
                     attr.weight = my_weights,
                     method = "thatcher")

Calling the my_Fau object reveals its content:

my_Fau
#   team  fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.5513439 1, 2, 1, 2, 2, 1                   2            3 3
# 2    2 0.7747787 1, 1, 2, 2, 2, 2                   2            2 4

In the resulting data frame, each line represents a team as in the ASW example above. The first column denotes the team number and the second, fl.value, its faultline measure (the Fau value). The column mbr_to_subgroups shows to which subgroup each member belongs. Members are listed left to right with reference to the top-to-bottom order of the data frame containing the raw data. The column number_of_subgroups indicates how many subgroups the algorithm detected in the given team, and the last column lists the sizes of the subgroups. Note that when calculating Fau, the number of subgroups is always fixed to 2. See also how the Fau values differ from the ASW values in the ASW section above and how the ASW method and Fau diverge with regard to the number of subgroups for team 2.
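For readers who want to see the logic behind a Fau-style value, the following base R sketch computes, for every possible split of a small team into two subgroups, the share of the total variation that lies between the subgroups, and keeps the strongest split. It is a simplified illustration under stated assumptions and not the code used by faultlines(): the toy data, the helper fau_for_split(), and the decision to allow subgroups of size one are all hypothetical, and the attributes are assumed to be numerically coded and weighted already.

```r
# Hypothetical toy team: four members, attributes already coded as numbers
x <- cbind(age    = c(2.0, 2.2, 4.0, 4.1),
           gender = c(0, 0, 1, 1))

# Between-subgroup sum of squares divided by the total sum of squares
# for a given split g (a vector of 1s and 2s, one entry per member)
fau_for_split <- function(x, g) {
  grand_mean <- colMeans(x)
  total_ss <- sum(sweep(x, 2, grand_mean)^2)
  between_ss <- 0
  for (k in unique(g)) {
    members <- x[g == k, , drop = FALSE]
    between_ss <- between_ss + nrow(members) * sum((colMeans(members) - grand_mean)^2)
  }
  between_ss / total_ss
}

# Enumerate every split into two non-empty subgroups
# (member 1 always stays in subgroup 1, so each split is visited once)
n <- nrow(x)
best_value <- -Inf
best_split <- NULL
for (code in seq_len(2^(n - 1) - 1)) {
  g <- c(1, 1 + as.integer(intToBits(code))[seq_len(n - 1)])
  value <- fau_for_split(x, g)
  if (value > best_value) {
    best_value <- value
    best_split <- g
  }
}

best_split   # member-to-subgroup assignment of the strongest split
best_value   # Fau-style value of that split
```

Because every split is evaluated, this brute-force approach is only practical for small teams; it is meant to illustrate why the number of subgroups is fixed to 2 for this measure.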

The result can be converted into a format where each row represents a team member. Regardless of its format, it can be exported for use in other applications such as SPSS.

6.3 Faultline Strength * Faultline Distance (Bezrukova et al., 2009)

Bezrukova et al. (2009) suggested multiplying Thatcher's Fau for a given group with the Euclidean distance between the two subgroup centroids (see the note at the end of this section). To calculate this product score of faultline strength and Euclidean distance, the faultlines() function has to be invoked with the method = "bezrukova" option. In the following, we show how to calculate this measure for the example data set introduced in the ASW section above.

As with ASW faultlines, you need to specify the scales of the diversity attributes age, gender, and ethnicity as being numeric, nominal, and nominal. Instead of specifying the scale types in the call to the faultlines() function, they can also be stored in a variable that can be passed to the function:

my_attr <- c("numeric", "nominal", "nominal")

When the method = "bezrukova" option is invoked, the Fau faultline strength that is multiplied with the Euclidean distance between the two subgroups also requires information on how to weight the attributes, i.e., how much age difference is seen as equivalent to a difference in gender or ethnicity. Following the arguments presented by Zanutto, Bezrukova, and Jehn (2010), numeric attributes should be scaled by their standard deviations and nominal attributes should be scaled by 1/√2. This is achieved by invoking the rescale = "sd" parameter.
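The rationale for these two scaling rules can be sketched in base R. The example below is only an illustration with hypothetical data; in particular, it assumes that the standard deviation is taken over the vector shown and that nominal attributes are represented by one 0/1 dummy column per category. Under that coding, two members with different categories differ on two dummy columns, so their unscaled Euclidean distance is √2; dividing by √2 brings it to exactly 1.

```r
# Hypothetical raw attributes for a team of four
age <- c(25, 31, 44, 52)
gender <- factor(c("f", "m", "f", "m"))

# Numeric attribute: divide by its standard deviation
age_scaled <- age / sd(age)

# Nominal attribute: one 0/1 dummy column per category, scaled by 1/sqrt(2)
gender_scaled <- model.matrix(~ gender - 1) / sqrt(2)

# Pairwise Euclidean distances on the rescaled gender dummies
gender_distance <- as.matrix(dist(gender_scaled))

sd(age_scaled)           # the rescaled age variable has unit standard deviation
gender_distance[1, 2]    # members 1 and 2 differ in gender
gender_distance[1, 3]    # members 1 and 3 share the same gender
```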

After these considerations have been made, the faultlines can be calculated with the results being stored in a data frame that we call my_bezrukova. Note how in the call to the faultlines() function, the name of the data frame and the name of the variable in the data frame specifying team membership are also passed as parameters. The call shown here also passes the my_weights vector defined in Section 6.2 via attr.weight:

my_bezrukova <- faultlines(data = teamdata_sub, 
                           group.par = "teamid", 
                           attr.type = my_attr, 
                           attr.weight = my_weights, 
                           method = "bezrukova")

Calling the my_bezrukova object reveals its content:

my_bezrukova
    
#   team fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 9.201921 1, 2, 1, 2, 2, 1                   2            4 2
# 2    2 19.20314 1, 1, 2, 2, 2, 2                   2            2 4

In the resulting data frame, each line represents a team as in the ASW example above. The first column denotes the team number and the second, fl.value, its faultline measure (the product of faultline strength and faultline distance). The column mbr_to_subgroups shows to which subgroup each member belongs. Members are listed left to right with reference to the top-to-bottom order of the data frame containing the raw data. The column number_of_subgroups indicates how many subgroups the algorithm detected in the given team, and the last column lists the subgroup sizes. Note that when calculating the Faultline Strength * Faultline Distance measure, the number of subgroups is always fixed to 2. Also note how the multiplication of the Fau value with the Euclidean distance results in a value that is no longer in the range between 0 and 1.
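Why the product can exceed 1 becomes visible in a small base R sketch. The toy data and the subgroup assignment g below are hypothetical, the attributes are assumed to be numerically coded and rescaled already, and the code is an illustration of the idea rather than the package's implementation.

```r
# Hypothetical toy team with attributes already coded and rescaled
x <- cbind(age    = c(2.0, 2.2, 4.0, 4.1),
           gender = c(0, 0, 1, 1))
g <- c(1, 1, 2, 2)   # assumed member-to-subgroup assignment

centroid_1 <- colMeans(x[g == 1, , drop = FALSE])
centroid_2 <- colMeans(x[g == 2, , drop = FALSE])

# Faultline distance: Euclidean distance between the two subgroup centroids
fl_distance <- sqrt(sum((centroid_1 - centroid_2)^2))

# Faultline strength for this split: between-subgroup share of the total variation
grand_mean <- colMeans(x)
total_ss <- sum(sweep(x, 2, grand_mean)^2)
between_ss <- sum(g == 1) * sum((centroid_1 - grand_mean)^2) +
              sum(g == 2) * sum((centroid_2 - grand_mean)^2)
fl_strength <- between_ss / total_ss

# Product score: strength lies between 0 and 1, distance is unbounded
fl_product <- fl_strength * fl_distance
c(strength = fl_strength, distance = fl_distance, product = fl_product)
```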

The result can be converted into a format where each row represents a team member. Regardless of its format, it can be exported for use in other applications such as SPSS.


Note: Meyer and Glenz (2013) show that faultline strength and faultline distance are related. Therefore, the multiplication of the faultline strength with the faultline distance does not add new information to the measure.

6.4 Multiple correlations (van Knippenberg et al., 2011)

The measure by van Knippenberg et al. (2011) operationalizes diversity faultlines through the multiple correlation between diversity attributes. It does not deliver the number of subgroups, nor a member-to-subgroup association. As with ASW faultlines, you need to specify the scales of the diversity attributes (e.g., age, gender, and ethnicity) in terms of whether they are numeric or nominal.

my_attr <- c("numeric", "nominal", "nominal")

In contrast to the previous measures, the measure by van Knippenberg et al. (2011) does not support the weighting of attributes and therefore, no weighting variable is required. The measure can be calculated by calling the faultlines() function with the parameter method = "knippenberg" with the following syntax, which stores the resulting table in an object that we name my_knippenberg:

my_knippenberg <- faultlines(data = teamdata_sub, 
                             group.par = "teamid", 
                             attr.type = my_attr, 
                             method = "knippenberg")

Calling the my_knippenberg object reveals its content:

my_knippenberg

#   team   fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.03742422               NA                  NA             NA
# 2    2  0.7664761               NA                  NA             NA

Note that the values for the number of subgroups and for the member-to-subgroup association are missing (NA - short for 'not available' - is R's notation for missing values), because the method is not capable of delivering this information.

The result can be converted into a format where each row represents a team member. Regardless of its format, it can be exported for use in other applications such as SPSS.
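As an intuition for operationalizing faultlines through multiple correlations, the following base R sketch regresses each attribute of a hypothetical team on the remaining attributes and averages the resulting R² values. This is one possible reading of the idea and an assumption on our part; it is not the computation performed by faultlines(), and it sidesteps the handling of nominal attributes with more than two categories by using a 0/1 coded gender variable.

```r
# Hypothetical team of six with strongly aligned attributes
team <- data.frame(age    = c(25, 27, 45, 48, 26, 47),
                   gender = c(0, 0, 1, 1, 0, 1),
                   tenure = c(1, 3, 10, 12, 2, 9))

# For each attribute: R-squared when it is predicted from all other attributes
r2 <- sapply(names(team), function(v) {
  model_formula <- reformulate(setdiff(names(team), v), response = v)
  summary(lm(model_formula, data = team))$r.squared
})

r2         # one value per attribute
mean(r2)   # high values indicate that attributes can be predicted from one another
```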

6.5 Subgroup Strength (Gibson & Vermeulen, 2003)

The measure by Gibson and Vermeulen (2003) quantifies the extent to which attributes overlap between the dyads that can be formed between all members of a team. Although the authors address a latent subgroup separation, their method does not reveal the boundaries of those subgroups, i.e., the member-to-subgroup association, nor does it provide an estimate of the number of subgroups. A notable feature is the method's ability to weigh the differences between team members with regard to the scale of the attributes.
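The notion of dyadic overlap can be illustrated with a short base R sketch. It assumes a hypothetical team described by nominal attributes only, counts for every dyad the number of attributes on which both members match, and takes the standard deviation of these counts across dyads as an indicator of subgroup strength. The exact overlap formula of Gibson and Vermeulen (2003), in particular for numeric attributes such as age, is not reproduced here.

```r
# Hypothetical team of four with two nominal attributes
team <- cbind(gender    = c("f", "f", "m", "m"),
              ethnicity = c("A", "A", "B", "B"))

# All dyads that can be formed between the members (one column per dyad)
dyads <- combn(nrow(team), 2)

# Overlap of a dyad: number of attributes on which both members match
overlap <- apply(dyads, 2, function(p) sum(team[p[1], ] == team[p[2], ]))

overlap       # some dyads overlap fully, others not at all
sd(overlap)   # large variation in overlap points to pronounced subgroups
```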

For calculating Subgroup Strength, you also need to specify the scales of the diversity attributes (e.g., age, gender, and ethnicity) in terms of whether they are numeric or nominal. Instead of specifying the scale types in the call to the faultlines() function, they can also be stored in a variable that can be passed to the function:

my_attr <- c("numeric", "nominal", "nominal")

As the measure by Gibson and Vermeulen (2003) does not support the weighting of attributes, no weighting variable is required. The measure can be calculated by calling the faultlines() function with the parameter method = "gibson" with the following syntax, which stores the resulting table in an object that we name my_gibson:

my_gibson <- faultlines(data = teamdata_sub, 
                        group.par = "teamid", 
                        attr.type = my_attr, 
                        method = "gibson")

Calling the my_gibson object reveals its content:

my_gibson
#   team   fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.7050475               NA                  NA              NA
# 2    2  1.003377               NA                  NA              NA

Note that the values for the number of subgroups and for the member-to-subgroup association are missing (NA - short for 'not available' - is R's notation for missing values), because the method is not capable of delivering this information. Note also how the values returned by this function are not restricted to the range of 0 to 1.

The result can be converted into a format where each row represents a team member. Regardless of its format, it can be exported for use in other applications such as SPSS.

6.6 FLS (Shaw, 2004)

Shaw (2004) measures the extent to which categorical attributes are aligned within subgroups and deviate between subgroups. The measure is therefore only suitable for categorical data and cannot be applied directly to the example data set employed so far in this manual, which contains a numeric variable for age. To calculate Shaw's FLS, such data need to be recoded to a nominal scale, e.g., by employing categories for certain age ranges. The following code produces another data set that is based on the previous example but categorizes the age variable:

# Two teams of six members each; age is stored as a category label
mycategorialdata <- data.frame(
  teamid = c(rep(1, 6), rep(2, 6)), 
  age = c("40 to 50", "18 to 25", "40 to 49", "30 to 39", "30 to 39",
          "50 to 59", "18 to 25", "18 to 25", "30 to 39",
          "40 to 49", "50 to 59", "50 to 59"), 
  gender = c("f", "m", "f", "f", "m", "f", "f", "f", "m", "m", "m", "m"), 
  ethnicity = c("A", "B", "A", "D", "C", "B", "A", "A", "B", "B", "C", "C"))

Executing its name prints the data frame to the console:

mycategorialdata

#    teamid      age gender ethnicity
# 1       1 40 to 50      f         A
# 2       1 18 to 25      m         B
# 3       1 40 to 49      f         A
# 4       1 30 to 39      f         D
# 5       1 30 to 39      m         C
# 6       1 50 to 59      f         B
# 7       2 18 to 25      f         A
# 8       2 18 to 25      f         A
# 9       2 30 to 39      m         B
# 10      2 40 to 49      m         B
# 11      2 50 to 59      m         C
# 12      2 50 to 59      m         C
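If age is available in years, the recoding into categories does not have to be typed by hand. The following sketch uses the base R function cut() on a hypothetical age vector; the break points and labels are illustrative assumptions and should be adapted to the age range of the actual sample.

```r
# Hypothetical ages in years
age_years <- c(44, 18, 40, 36, 35, 50)

# Recode into age ranges; each interval includes its upper bound
age_category <- cut(age_years,
                    breaks = c(17, 29, 39, 49, 59),
                    labels = c("18 to 29", "30 to 39", "40 to 49", "50 to 59"))

# Store as character so that the attribute is treated as a label
age_category <- as.character(age_category)
age_category
```

Ages outside the specified breaks are returned as NA, so it is worth checking the result with table(age_category, useNA = "ifany") before calculating faultlines.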

For calculating FLS, you also need to specify the scales of the diversity attributes, but they all have to be set to nominal:

my_cat_attr <- c("nominal", "nominal", "nominal")

Subsequently, FLS can be calculated by executing the following syntax that stores the result in an object called my_FLS:

my_FLS <- faultlines(data = mycategorialdata, 
                     group.par = "teamid", 
                     attr.type = my_cat_attr, 
                     method = "shaw")

Calling the my_FLS object reveals its content:

my_FLS

#   team   fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1  0.0546875               NA                  NA             NA
# 2    2  0.6032051               NA                  NA             NA

Note that the values for the number of subgroups and for the member-to-subgroup association are missing (NA - short for 'not available' - is R's notation for missing values), because the method is not capable of delivering this information.

The result can be converted into a format where each row represents a team member. Regardless of its format, it can be exported for use in other applications such as SPSS.

6.7 PMDcat (Trezzini, 2008)

Trezzini (2008) operationalized faultline strength as the degree of polarized multi-dimensional subgroup diversity for categorical attributes. The measure is therefore only suitable for categorical data, and we illustrate it with the mycategorialdata data frame created in Section 6.6. For calculating PMDcat, you also need to specify the scales of the diversity attributes, all of which have to be set to nominal. As the data set contains nominal information for age, gender, and ethnicity, we specify a corresponding variable:

my_cat_attr <- c("nominal", "nominal", "nominal")

Subsequently, PMDcat can be calculated by executing the following syntax that stores the result in an object called my_PMD:

my_PMD <- faultlines(data = mycategorialdata, 
                     group.par = "teamid", 
                     attr.type = my_cat_attr, 
                     method = "trezzini")

Calling the my_PMD object reveals its content:

my_PMD
#   team   fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1  0.2160494               NA                  NA             NA
# 2    2  0.3395062               NA                  NA             NA

The result can be converted into a format where each row represents a team member. Regardless of its format, it can be exported for use in other applications such as SPSS.

6.8 Faultlines based on Latent Class Cluster Analysis (Lawrence & Zyphur, 2011)

Lawrence and Zyphur (2011) proposed latent class cluster analysis (LCCA), also referred to as latent class analysis (LCA), for identifying faultlines in a stepwise way. First, several latent class solutions with different numbers of clusters are obtained over the data of a given team, where the clusters represent the subgroups. Out of these possible latent cluster solutions, the best-fitting one is identified by the lowest Bayesian information criterion (BIC) value. Each team member is then assigned to a subgroup based on the posterior probabilities for a given individual to belong to a certain class. As high posterior probabilities are likely in the case of homogeneous clusters, the homogeneity of posterior probabilities of all group members, which is determined with the entropy measure, can be employed as a measure of faultline strength.
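The role of the entropy measure can be illustrated with a short base R sketch. It uses the normalized entropy statistic that is commonly reported by latent class software; whether faultlines() uses exactly this form is an assumption on our part, and the posterior probabilities and the helper relative_entropy() are hypothetical. Values close to 1 indicate that members are assigned to their classes almost without ambiguity.

```r
# Hypothetical posterior class-membership probabilities:
# one row per team member, one column per latent class
posterior <- rbind(c(0.98, 0.02),
                   c(0.95, 0.05),
                   c(0.03, 0.97),
                   c(0.10, 0.90))

# Normalized entropy: 1 = perfectly clear assignment, 0 = complete ambiguity
relative_entropy <- function(p) {
  n <- nrow(p)
  k <- ncol(p)
  terms <- ifelse(p > 0, -p * log(p), 0)   # treat 0 * log(0) as 0
  1 - sum(terms) / (n * log(k))
}

relative_entropy(posterior)
```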

As Meyer and Glenz (2013) show, this measure has certain practical limitations when applied to small group data. Its largest limitation lies in its insensitivity to different levels of homogeneity. It is biased towards strong faultlines and often fails to converge for very homogeneous small subgroups. Its usefulness for small group data is therefore questionable.

LCCA-based faultlines can be calculated by invoking the method = "lcca" parameter. They do not require a scaling of attributes:


my_attr <- c("numeric", "nominal", "nominal")

my_lcca <- faultlines(data = teamdata_sub,
                      group.par = "teamid",
                      attr.type = my_attr,
                      method = "lcca")

Calling the resulting object reveals its content:


my_lcca

#   team fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes 
# 1    1       NA               NA                  NA              1 
# 2    2       NA               NA                  NA              2

Note how in this case, LCCA did not converge for either team and was therefore unable to return a value for faultline strength. In small and relatively homogeneous subgroups, this happens quite often (Meyer & Glenz, 2013).

7. Reformatting and exporting the faultline scores for use in other applications

7.1 Converting the result to the 'long' format with one row for each team member

Many analyses, such as multilevel models, require one row of data for each participant, as in the raw data containing the team members' diversity attributes. The output of the faultline calculation is, however, a data frame with one row per team. To convert this data frame into a 'long' data frame with one row per team member, one can use the summary() function. Its return value contains an element named long in that format, which can be accessed by appending $long to the call to the summary() function:

my_ASW
#   team          fl.value mbr_to_subgroups number_of_subgroups subgroup_sizes
# 1    1 0.331799912553719      1 2 1 2 2 1                   2            3 3
# 2    2 0.805489497404053      1 1 2 2 3 3                   3          2 2 2
my_ASW_long <- summary(my_ASW)$long
my_ASW_long
#    team teamsize         fl.value fl.mbr mbr_to_subgroups number_of_subgroups subgroup_size
# 1     1      6  0.331799912553719    avg                1                   2             3
# 2     1      6  0.331799912553719    avg                2                   2             3
# 3     1      6  0.331799912553719    avg                1                   2             3
# 4     1      6  0.331799912553719    avg                2                   2             3
# 5     1      6  0.331799912553719    avg                2                   2             3
# 6     1      6  0.331799912553719    avg                1                   2             3
# 7     2      6  0.805489497404053    avg                1                   3             2
# 8     2      6  0.805489497404053    avg                1                   3             2
# 9     2      6  0.805489497404053    avg                2                   3             2
# 10    2      6  0.805489497404053    avg                2                   3             2
# 11    2      6  0.805489497404053    avg                3                   3             2
# 12    2      6  0.805489497404053    avg                3                   3             2

The result of summary(my_ASW)$long is a data frame object with one row for each team member in the same order as in the original data set. Thus, the original data set, teamdata_sub in this case, can be combined with this data frame for further analysis, e.g., full_data <- cbind(teamdata_sub, my_ASW_long). Note that the long result data frame can only be combined with a raw data frame that contains no missing values. If one wants to combine the long result data frame with the full data set that contained missing values, one first needs to remove those team members from the full data set who have missing values on the diversity attributes employed for the faultline calculation.
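The handling of missing values before combining the two data frames can be sketched as follows. All objects in this example are hypothetical stand-ins: raw plays the role of the full data set, long_result plays the role of summary(my_ASW)$long, and its faultline values are made up for illustration.

```r
# Hypothetical raw data with one missing age value
raw <- data.frame(teamid = c(1, 1, 1, 2, 2),
                  age    = c(30, NA, 41, 25, 52),
                  gender = c("f", "m", "f", "m", "m"))

# Keep only members with complete values on the attributes used for faultlines
attrs <- c("age", "gender")
raw_complete <- raw[complete.cases(raw[, attrs]), ]

# Stand-in for the long result: one row per retained member, same order
long_result <- data.frame(team     = c(1, 1, 2, 2),
                          fl.value = c(0.4, 0.4, 0.7, 0.7))   # made-up values

# cbind() matches rows by position, so the row counts must agree
stopifnot(nrow(raw_complete) == nrow(long_result))
full_data <- cbind(raw_complete, long_result)
full_data
```

Because cbind() relies on row position alone, comparing the team identifiers of both data frames after combining them (here teamid and team) is a simple safeguard against misaligned rows.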

7.2 Exporting to SPSS

The result object created in a given faultline calculation (e.g., my_ASW or my_ASW_long) can be exported to a file for use in other applications. To export to an SPSS data set, use the function write.foreign() from the foreign package. It writes two files to the working directory: a generic text file with the data and a syntax file that enables a statistics program such as SPSS to read the data file.

To export the 'short' result with one row per team, it first needs to be converted into a data frame object, which can be achieved by calling the following command:

my_ASW <- as.data.frame(print(my_ASW))

Afterwards, the export code shown in the remainder of this section can be used to export the short result to SPSS or to .csv by replacing my_ASW_long with my_ASW.

To export the long result data frame my_ASW_long to SPSS, one can use the following syntax, which requires the foreign package to be loaded (i.e., make sure to call library(foreign) prior to the following command):

write.foreign(df = my_ASW_long, 
              datafile = "my_ASW.dat", 
              codefile = "my_ASW.sps", 
              package = "SPSS")

This syntax writes the content of the result data frame my_ASW_long to a text file my_ASW.dat into the current working directory, along with an SPSS syntax file my_ASW.sps containing the SPSS commands necessary for reading my_ASW.dat into SPSS.

Important: You need to edit the first line of my_ASW.sps in the SPSS syntax viewer before the import will work. The first line needs to be edited in such a way that it contains the full path to my_ASW.dat. When you open my_ASW.sps in SPSS, the first line will read:

DATA LIST FILE= "my_ASW.dat"  free (",")

Obtain the current working directory to which you saved the file with the getwd() command in R. It will be printed to the R console, e.g.,

getwd()
# [1] "/Users/myname/myfolder"

Copy the path into the SPSS syntax file so that it will look like this:

DATA LIST FILE= "/Users/myname/myfolder/my_ASW.dat"  free (",")

You can subsequently run the entire syntax file my_ASW.sps within the SPSS syntax viewer which should result in the proper import of the results of your faultline calculation into an SPSS data file.
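A possible shortcut is to pass the full path of the data file to write.foreign() in the first place. The sketch below assumes that write.foreign() copies the datafile argument verbatim into the generated syntax file, which is our understanding of current versions of the foreign package but should be verified by inspecting the resulting .sps file. The data frame demo_long is a hypothetical stand-in for my_ASW_long with made-up values, and the files are written to a temporary directory.

```r
library(foreign)

# Hypothetical stand-in for my_ASW_long (made-up values)
demo_long <- data.frame(team     = c(1, 1, 2, 2),
                        fl.value = c(0.33, 0.33, 0.81, 0.81))

out_dir <- tempdir()   # replace with getwd() to write to the working directory
dat_path <- file.path(out_dir, "my_ASW.dat")
sps_path <- file.path(out_dir, "my_ASW.sps")

write.foreign(df = demo_long,
              datafile = dat_path,
              codefile = sps_path,
              package = "SPSS")

# Inspect the generated syntax; the DATA LIST line should carry the full path
sps_lines <- readLines(sps_path)
sps_lines[grepl("DATA LIST", sps_lines, fixed = TRUE)]
```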

7.3 Exporting to a comma separated values file (.csv)

Exporting the results of your faultline calculations to a .csv file is straightforward using the write.csv() command (or write.csv2() if you live in a country that employs the comma as the decimal separator). For example, to write the long result data frame my_ASW_long as a .csv file into the current working directory, the following syntax can be employed:

write.csv(my_ASW_long, file = "my_ASW.csv")
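By default, write.csv() adds the row names as an unnamed first column, which some applications import as an extra variable. The following sketch, which uses a hypothetical stand-in for my_ASW_long with made-up values and a temporary file, suppresses the row names with row.names = FALSE and reads the file back in to check the result.

```r
# Hypothetical stand-in for my_ASW_long (made-up values)
demo_long <- data.frame(team     = c(1, 1, 2, 2),
                        fl.value = c(0.33, 0.33, 0.81, 0.81))

csv_path <- file.path(tempdir(), "my_ASW.csv")

# Write without the row-name column
write.csv(demo_long, file = csv_path, row.names = FALSE)

# Read the file back in to confirm that columns and values survived the export
check <- read.csv(csv_path)
str(check)
```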

References

Bezrukova, K., Jehn, K. A., Zanutto, E. L., & Thatcher, S. M. B. (2009). Do workgroup faultlines help or hurt? A moderated model of faultlines, team identification, and group performance. Organization Science, 20, 35-50. doi:10.1287/orsc.1080.0379

Gibson, C., & Vermeulen, F. (2003). A healthy divide: Subgroups as a stimulus for team learning behavior. Administrative Science Quarterly, 48, 202-239. doi:10.2307/3556657

Lawrence, B., & Zyphur, M. (2011). Identifying organizational faultlines with latent class cluster analysis. Organizational Research Methods, 14, 32-57. doi:10.1177/1094428110376838

Meyer, B., & Glenz, A. (2013). Team faultline measures: A computational comparison and a new approach to multiple subgroups. Organizational Research Methods, 16, 393-424. doi:10.1177/1094428113484970

R Development Core Team. (2011). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved 1 August 2011, from http://www.R-project.org

Shaw, J. (2004). The development and analysis of a measure of group faultlines. Organizational Research Methods, 7, 66-100. doi:10.1177/1094428103259562

Thatcher, S., Jehn, K., & Zanutto, E. (2003). Cracks in diversity research: The effects of diversity faultlines on conflict and performance. Group Decision and Negotiation, 12, 217-241. doi:10.1023/A:1023325406946

Trezzini, B. (2008). Probing the group faultline concept: An evaluation of measures of patterned multi-dimensional group diversity. Quality and Quantity, 42, 339-368. doi:10.1007/s11135-006-9049-z

van Knippenberg, D., Dawson, J., West, M., & Homan, A. (2011). Diversity faultlines, shared objectives, and top management team performance. Human Relations, 64, 307-336. doi:10.1177/0018726710378384

Zanutto, E. L., Bezrukova, K., & Jehn, K. A. (2010). Revisiting faultline conceptualization: Measuring faultline strength and distance. Quality and Quantity, 45, 701-714. doi:10.1007/s11135-009-9299-7