Method description

Method description
Prev	Chapter 42. Social Network Analysis Module	Next

Social Network

Social network is a structure of connections between individual persons, organizations or any other objects. A network consist of nodes which represent these objects and connections or edges, which represent relationships between these objects.

A natural way to represent a network is using a graph. However, for larger network such representation may turn out to be too complicated and generation of the graph on a computer may not be possible due to time and memory capacity constraints.

Classification of networks

There are several types of networks that can be used in analysis depending on the data available. The main types of networks are described below.

Directed and undirected networks

Directed networks contain information about directions of the connections between the nodes. In the graphical representations edges are showed by means of arrows. In the undirected graphs ties have no information about direction of connections thus the connection is assumed to be bidirectional.

Weighted and unweighted networks

Connections in weighted networks have values associated with them - weights. Weights are most often positive real numbers. It is usually assumed that the higher value of a tie means stronger connection between the nodes. The strength of the connection can be depicted on the diagram with colors or width of edges. In practice weights are often normalized so they are numbers in the range of 0 to 1.

Unweighted networks have no information about the strength of connections. In such cases it is often assumed that all connection have weights equal to 1.

Basic concepts used in network analysis

Subnetwork

A subnetwork of a network is a subset of nodes of the network with all connections that exist in the original network between the selected nodes. If a subnetwork consists of three nodes, then it is called a a triad.

Community

A community is a set of nodes that are more densely connected internally than with the rest of the nodes in the network. Communities may also be created by selecting nodes according to certain criteria, e.g. nodes representing people of particular profession, gender or age range.

Path

A path in a network is a sequence of connections between consecutive connected nodes. In a directed network paths are also directed and each intermediate node in the path is the target of incoming segment and the source of the outgoing segment. A path may pass only once through each intermediate node and path may not contain cycles.

Every path has got its length. For unweighted network path length is number of ties comprising the path. For weighted network path length is usually calculated in the following way:

where is the weight of i-th tie on the path between the nodes.

Neighborhood

Every node in a network has got its neighborhood. Neighborhood of the n-th order is set of nodes that are at a path length of n to a given node. The neighborhood also includes all the connections that are in the original graph between the nodes belonging to the neighborhood.

Node degree

The degree of a node is defined as the number of nodes with which the given node is connected directly. For directed networks it is possible to calculate in-degree (number of incoming edges) and out-degree (number of outgoing edges). In-degree is often interpreted as the measure of how much information the node receives. Out degree measure usually indicates how influential is the node.

Description of the algorithms used in the Social Network Analysis

Louvain community finder

Louvain community finder is a popular algorithm used for fast detection of communities in the network. Louvain community finder iteratively optimizes communities by maximizing modularity measure. The measure is described in the section Modularity. The algorithm can also find a hierarchy of nested communities (ranging from large to many small communities). For detailed description of the algorithm see Blondel et al. 2008 or Clauset et al. 2004.

Modularity

Modularity is a quantity with values in the range from -1 to 1. Modularity measures ties inside communities as compared to ties between communities. It is calculated as follows:

where

is the weight of the tie between nodes i and j,
is the sum of weights of ties attached to node i,
is the community to which node i is assigned,
if and 0 otherwise,

Modularity measure is used by Louvain community finder algorithm to asses the optimality of the network's division into communities.

Size-Constrained Community Finder

Size-constrained Community Finder is an algorithm that is used to find communities in the network. The algorithm allows the user to set the maximum size of generated communities. In every iteration for every node the algorithm finds to which community the node should be assigned based on the value of the gain function. For details of the algorithm see Ciglan and Nørvåg 2010.

Community Statistics

Every community found in the network can be described by certain statistics. Basic community statistics include:

number of nodes in a community,
number of ties inside the community,
number of ties with other communities,
relation of the number of ties inside the community to the number of ties both inside and outside the community,
sum of weights inside the community,
sum of weights of ties to other communities.

Role Finder

The goal of social network analysis may be to find nodes playing similar roles in the network structure. Roles are defined by similarities of relations among nodes. The assumption is that a node having a certain role in a network has certain pattern of relations with other nodes.

A possible approach to the problem of finding such roles was described by Guimerà and Amaral 2007. The authors define roles in terms o two statistics which can be calculated for every node in relation to the communities in the network:

Leadership score. This statistic indicates to what extent a node can be considered to be the leader of the community. It is calculated as:
where
- is number of links from i-th node to other nodes in the community,
- is average value of in a community ,
- is standard deviation of in a community .
Participation score. This statistic indicates to what extent a node participates in contacts between two or more communities. It is calculated as:
where
- is the number of links of i-th node to nodes in community ,
- is number of communities in the network,
- is degree of i-th node.

On the basis of participation score and leadership score the following roles can be assigned:

Peripheral leader for
Connecting leader for
Kinless leader for
Ultra-peripheral follower for
Peripherial follower for
Connecting follower for
Kinless follower for

In general nodes with low leadership score are called followers, whereas leaders are characterized by high value of leadership score. Nodes with low value of participation score connect mainly within their community, so they are called peripheral nodes. If it is difficult to assign a node to any community (it has many ties with other communities) it is called a kinless node.

Aggregator

For every node in the network certain statistics in its neighborhood can be calculated. The statistics can be calculated based on:

the data describing network connections,
network data and additional attributes describing every node.

The basic statistics calculated from network data describe path lengths from a node to its neighboring nodes. The following statistics are calculated:

average path length,
entropy of path lengths,
gini index of path lengths,
first quartile of path length distribution,
upper quartile of path length distribution,
minimum path length,
maximum path length,
median of path length,
sum of path lengths in the node’s neighborhood,
variance of path lengths in the node’s neighborhood.

The following statistics describe the node's position in the neighborhood:

number of nodes in the node’s neighborhood,
node degree,
weighted average of node degrees in the node’s neighborhood,
node’s closeness centrality measure in its neighborhood,
weighted node’s closeness centrality measure in the node’s neighborhood.

In these calculations, weights are path lengths. If the above statistics are calculated in a neighborhood with radius greater than 1, there can be more than one path between the central node and its neighboring nodes. In this case path statistics can be calculated in multiple ways.

For unweighted network path length may equal either:

the number of paths between two nodes, or
1 if there is at least one path between the nodes and 0 otherwise

In the case of weighted network path length may be also calculated in the following manner:

sum of all distinct path lengths between then nodes,
average path length between the nodes,
maximum path length between the nodes.

Aggregates can be calculated for additional variables describing any node. Examples of such variables are: customer age or type of customer.

The following statistics can be calculated for nominal variables:

the number of nodes in the neighborhood that are in a certain category,
fraction of nodes in the neighborhood that are in a certain category.

For numerical variables it is possible to calculate the same statistics as for path lengths.

Triad statistics

Triads are the smallest social structures in the social networks that have a character of a small society. Because of this fact they are often analyzed to capture certain characteristics of the network.

In an undirected network there are six types of triads, depending on the topology of connections between the three nodes. Typical analysis involves determining to how many different triad types a node belongs. The triads which are most often used in the analysis are:

Full triad. Every node is connected with to each of the other nodes.
Type 1 partial triad. The node is connected with two other nodes which are not connected directly with each other.
Type 2 parial triad. The node is connected only to one of the remaining nodes and the other two nodes are connected to each other.

Sometimes additional information about the nodes is available in the form of a binary variable (flag). In such case triads can be divided into following subtypes:

0_0. None of the nodes in the triad is signed with a flag.
0_1. The main node has no flag, exactly one of the other two nodes is flagged.
0_2. The node has no flag, the other two nodes are flagged.
1_0. The node is flagged, the other two nodes are not flagged.
1_1. The node is flagged, exactly one of the other two nodes in also flagged.
1_2. All nodes are flagged.

Equivalence

The notion of equivalence is often used in Social Network Analysis. Generally equivalence describes similarity of nodes’ connections to other members of society. Usually it is assumed that fully equivalent nodes have direct connections to the same nodes. In the case of weighted connections fully equivalent nodes should have the same values of weights in the ties linking the same nodes.

Equivalence between nodes i and jcan be calculated in the following way:

where is weight of connection between nodes i and k.

Community Equivalence

Community equivalence is defined like equivalence but is restricted to nodes belonging to one community. All calculations are done only for the subnetwork created by selecting the nodes of this community.

Page Rank algorithm

Page Rank is an algorithm used by Google Search to rank website importance. This algorithm can be applied to any network data. Page Rank measures the importance of a given node on the basis of number and the quality of nodes that have connections to it. For details of the algorithm see Sullivan 2007 .

HITS algorithm

HITS (Hyperlink-Induced Topic Search) is an algorithm similar to Page Rank, which has been developed to assess websites. The algorithm can be used to analyze any network data. HITS assesses to what extent each node is a hub or an authority. A hub is a node that has connections to many other nodes. Authority is a node that has many connections from many different hubs. Details of the algorithm are described in Kleinberg 1999

Spreading Activation Algorithm

Spreading Activation Algorithm simulates how information spreads in the network. The spreading starts from one node and transfers part of its energy to the nodes to which it is connected without losing its own energy. In subsequent iterations the nodes that have energy above certain threshold transfer their energy to nodes to which they are connected. Each node is activated only once.