GitHub - doscsy12/network_sci_analysis: Network data analysis using networkx, and gephi for visualisations.

Network data analysis using networkx

Network science or network analytics, involve the analysis of network data and statistics to identify trends and patterns. It is part of graph theory, where a network can be defined as a graph in which nodes and/or edges/links have attributes.

Here, we use python package networkX. NetworkX includes many graph generator functions and facilities to read and write graphs in many formats. They have inbuilt graphs, such as Petersen graph, which is an undirected graph with 10 nodes and 15 edges, and Tutte graph, which has 3-regular polyhedron graphs with 46 nodes and 69 edges.

	notebook	description
1	airline_network_data_analysis	analysing airline network data
2	epidemic_spreading_analysis	predict vaccination strategy during pandemic
3	viral_marketing_analysis	analysing information spread thru social media
	online_FB_social_network	analysing information spread using Facebook network data of New Orleans
4	community_detection	clustering behaviour

The above notebooks have been updated/edited to Networkx 2.6 and Python 3.x. Be aware that some codes in earlier versions of Networkx were depreciated. Read the docs here.

1. Airline network

Airport networks are crucial for understanding the spread of world-wide pandemics. Air-travel is one of the quickest ways that diseases can spread across the globe. For the airport network, an 'attack' is effectively the closure of an airport. Here, the aim is to know if it is possible to break the network apart, and stop the disease spread, by only attacking a few nodes (closing a few airports).

From the experiments, closing a few airports (with the most connectivity) would result in increase in diameter (i.e. shortest distance between two airports) and average path length (i.e. longest number of flights to reach anywhere in the world) within the airline network. These results are not surprising, as closing a few airports would mean having to perform more transits (somewhere else). From the data, 'FRA' (Frankfurt) lies in most routes. There are the usual busy airports, such as 'DXB' (Dubai) and 'CDG' (Paris) airports which also lies on most routes (high betweenness centrality). However, 'ANC' (Anchorage international airport) in Alaska, was a surprise number 2! Thus, closing these well-connected airports might effectively reduce the spread of diseases.

Below is the airline network in Force Atlas 2 layout done in Gephi. The size of the nodes are airports with higher degrees (more connections in/out of other airports).

Another example is the internal subdivisions (or communities/clusters) of the airline network done in Gephi.

2. Disease network

Networks (and network simulation), are a key tool in trying to predict the spread of worldwide pandemics. Many governments use modelling and simulation software (such as GLEAMviz) to help them make decisions during an epidemic outbreak. Here, the aim is to build a simple model of disease spread on different artificial networks and investigate different vaccination strategies to try and prevent/ slow down the spread of the disease.

The network built here is based on the stochastic SIR (Susceptible, Infected, Removed) model, where alpha and beta describe the rate of removal (fraction of people moving from Infected to Removed) and the rate of infection (fraction of people moving from Susceptible to Infected) respectively. There is a level of known and unknown parameters, which is somewhat realistic, since governments may only have some information and data for different aspects of the pandemic.

Having any vaccination strategy delays the average peak time and decreases the average total infected, compared to not having any vaccination strategy. Vaccinating the node (person) with the highest degree will be more effective against the disease, since the person will have the most personal interactions with other people. In reality, it is hard to implement, because should we vaccinate people based on the number of 'friends' in facebook/ instagram?

3. Viral marketing

Social network platforms like Facebook or Twitter are becoming major sources of information channels. In these media platforms, everyone acts as both information generator and consumer. In a sense, the spreading of information, or opinions, shares some resemblance with disease spreading. Instead of the disease infecting one person to the next, information is spread from one person to the next. Hence the term 'viral marketing', which is used to define the information that goes viral in social media.

The aim here is to explore the maximization of viral spreading by selecting the best set of initial seed nodes. In other words, what are the best seed nodes to start tweeting in the initialisation stage, such that the tweet/information can reach the widest population?

Similar to disease spreading, the model construct is based on the stochastic SIR model. From the simulations, bimodal behaviour was seen regardless of how the initial seed nodes were selected (randomly, by degree, by betweenness). This is due to the presence of a 'giant component' within the network. Thus, it was demonstrated that viral spreading can only occur when the information/tweet reaches a certain threshold in spreading strength. Thus, for any given network structure or model construct (SIR, linear threshold model, etc), there is a well-defined threshold before the information (or tweet) can be considered VIRAL.

Information spread using Facebook (FB) network data of New Orleans
Real social networks are complicated, with more intricate local and global structures like assortativity, clustering and communities; They are also much larger with up to billions of nodes for FB. Here, the FB network of New Orleans is used to study the information spread over real social networks. The network consists of 63,731 nodes and 817,090 links. The aim is try to find the 10 (most influential) nodes that maximizes the final percentage of infected nodes. The 10 seed nodes could influence about 64.2% of the network used here.

4. Community detection

In network science, community detection is a method for finding groups within large and complex systems represented on a graph. Techniques in community detection is useful to discover people in social communities that have common interests, or discover word patterns and co-occurences within publishing or media.

In this introductory project, we explore two general methods (divisive and agglomerative) to study real-world networks to reveal clustering behaviour within those systems.

Real-world datasets used here are:

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
Airline_network_data_analysis.ipynb		Airline_network_data_analysis.ipynb
README.md		README.md
airline_forceatlas2.png		airline_forceatlas2.png
airline_forceatlas2_1.png		airline_forceatlas2_1.png
airline_fruchterman_reingold.png		airline_fruchterman_reingold.png
airline_fruchterman_reingold_1.png		airline_fruchterman_reingold_1.png
airline_modularity.png		airline_modularity.png
airline_modularity_1.png		airline_modularity_1.png
airline_modularity_2.png		airline_modularity_2.png
community_detection.ipynb		community_detection.ipynb
epidemic_spreading_analysis.ipynb		epidemic_spreading_analysis.ipynb
online_FB_social_network.ipynb		online_FB_social_network.ipynb
petersen_graph.png		petersen_graph.png
tutte_graph.png		tutte_graph.png
viral_marketing_analysis.ipynb		viral_marketing_analysis.ipynb

doscsy12/network_sci_analysis

Folders and files

Latest commit

History

Repository files navigation

Network data analysis using networkx

1. Airline network

2. Disease network

3. Viral marketing

4. Community detection

About

Topics

Resources

Stars

Watchers

Forks

Languages