Statistical Analysis of Network Data: R Practice

最编程 2024-04-19 19:37:45

...

Source: Statistical Analysis of Network Data with R

Common R packages for network analysis:

Basic network manipulation, visualization, and characterization: igraph, network, sna
Network modeling: igraph, eigenmodel, ergm, mixer
Network topology inference: glasso, huge

Most network analysis research is descriptive work.
Network visualization is both an art and a science.

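Clustering coefficient

Triadic closure reflects the "transitivity" of social networks. It is measured by enumerating all connected vertex triples and taking the fraction of them that close into triangles.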
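The ratio described above can be checked by hand. The following is a minimal sketch, assuming the igraph package and using the 7-vertex toy graph that these notes build in Chapter 2; the variable names (n.triangles, n.triples, cl.manual, cl.igraph) are illustrative only, and graph_from_literal() is the current name of graph.formula().

```r
library(igraph)

# Toy graph: same edges as the 7-vertex example in Chapter 2 of these notes
g <- graph_from_literal(1-2, 1-3, 2-3, 2-4, 3-5, 4-5, 4-6, 4-7, 5-6, 6-7)

# count_triangles() reports, per vertex, the triangles it belongs to,
# so every triangle is counted three times
n.triangles <- sum(count_triangles(g)) / 3

# A vertex of degree d is the center of choose(d, 2) connected triples
n.triples <- sum(choose(degree(g), 2))

# Global clustering coefficient: 3 * triangles / connected triples
cl.manual <- 3 * n.triangles / n.triples
cl.igraph <- transitivity(g, type = "global")

print(c(manual = cl.manual, igraph = cl.igraph))
```

For this graph there are 3 triangles and 20 connected triples, so both values come out at 0.45.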

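Visualization and numerical characterization are among the first steps of network analysis.
A network visualization integrates several important aspects of the data into a single diagram.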

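Assortativity

The extent to which vertices pair up with other vertices of the same or of different types can be quantified with a correlation statistic, the so-called assortativity coefficient.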
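As a rough illustration of the assortativity coefficient, the sketch below builds a hypothetical 6-vertex graph (two triangles joined by one edge) and assigns each triangle its own made-up type label. It assumes the igraph functions assortativity_nominal() and assortativity_degree(); the graph and the labels are not from the source.

```r
library(igraph)

# Hypothetical example graph: two triangles (a,b,c) and (d,e,f) joined by edge c-d
g <- graph_from_literal(a-b, b-c, a-c, d-e, e-f, d-f, c-d)

# Made-up categorical vertex types: first triangle is type 1, second is type 2
types <- c(1, 1, 1, 2, 2, 2)

# Assortativity for a categorical attribute: positive when edges mostly
# join vertices of the same type
r.type <- assortativity_nominal(g, types, directed = FALSE)

# Degree assortativity: correlation between the degrees of adjacent vertices
r.deg <- assortativity_degree(g, directed = FALSE)

print(c(by.type = r.type, by.degree = r.deg))
```

Six of the seven edges stay inside a type, so the type-based coefficient should be clearly positive.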

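The key to making network characterization work is matching the question of interest about a complex system with a suitable network summary measure.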

Motif

A frequently occurring subgraph pattern in a network.
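A small sketch of counting subgraph patterns, assuming igraph's motifs() function and the 7-vertex toy graph from Chapter 2 of these notes. For an undirected graph and size 3, the result is indexed by isomorphism class, and classes that are not connected are reported as NA.

```r
library(igraph)

# Toy graph: same edges as the 7-vertex example in Chapter 2 of these notes
g <- graph_from_literal(1-2, 1-3, 2-3, 2-4, 3-5, 4-5, 4-6, 4-7, 5-6, 6-7)

# Count all induced 3-vertex subgraphs, grouped by isomorphism class
m3 <- motifs(g, size = 3)

# The last class for undirected size-3 subgraphs is the triangle;
# cross-check it against a direct triangle count
n.triangles <- sum(count_triangles(g)) / 3

print(m3)
print(n.triangles)
```

Whether a pattern counts as a motif in the statistical sense also depends on how its frequency compares with a random baseline, which this sketch does not attempt.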

The distribution of network clustering coefficients is used to test for clustering in social networks.

The sand package
Companion package for Statistical Analysis of Network Data
Available on CRAN

# Install sand, the companion package for Statistical Analysis of Network Data
install.packages("sand")
install.packages("igraph")
library(igraph)
library(igraphdata)
library(sand)
install_sand_packages()

?sand

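Chapter 2: Manipulating Network Data

G = (V, E)
Vertices: also called nodes
Edges: also called links
Number of vertices: the order of the graph
Number of edges: the size of the graph

Isomorphic graphs

Undirected graph
Directed graph, or digraph
Edges: directed edges, or arcs
Mutual: edges in both directions between a pair of vertices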
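The terms above map directly onto igraph calls. The following is a minimal sketch with hypothetical example graphs (g1, g2, dg are made up for illustration), using the current igraph function names.

```r
library(igraph)

# Two hypothetical graphs with the same shape but different vertex labels:
# a triangle with one pendant vertex
g1 <- graph_from_literal(a-b, b-c, c-a, c-d)
g2 <- graph_from_literal(w-x, x-y, y-z, x-z)

order.g1 <- vcount(g1)         # order: number of vertices
size.g1 <- ecount(g1)          # size: number of edges
same.shape <- isomorphic(g1, g2)

# A small directed graph: a -> b, plus a mutual pair b <-> c
dg <- graph_from_literal(a-+b, b++c)
n.mutual.arcs <- sum(which_mutual(dg))  # arcs that have a reverse arc

print(c(order = order.g1, size = size.g1))
print(same.shape)
print(is_directed(dg))
print(n.mutual.arcs)
```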

Small graphs can be created with graph.formula

# 2.1
library(igraph) # load the igraph package

# Create a graph object g with Nv = 7 vertices
g <- graph.formula(1-2, 1-3, 2-3, 2-4, 3-5, 4-5, 4-6, 4-7, 5-6, 6-7)

# 2.2
## Vertex sequence
V(g)


# 2.3
## Edge sequence
E(g)

# 2.4
## Show the structure of the graph
str(g)

# 2.5
## Visualize the graph
plot(g)

# 2.6
## Create a directed graph
dg <- graph.formula(1-+2, 1-+3, 2++3)
plot(dg)

# 2.7
## Labels 1, 2, 3, ..., n are assigned by default; custom labels can be given instead
dg2 <- graph.formula(Sam-+Mary, Sam-+Tom, Mary++Tom)
str(dg2)
summary(dg2)
str(dg2) # how the attributes are reported here is unclear
?str

# 2.8
## Change the labels through the vertex name attribute
V(dg)$name <- c("小明", "小蓝", "小西")
plot(dg)

Three formats for representing a graph: adjacency list, edge list, adjacency matrix

Adjacency list
Edge list

E(dg)

+ 4/4 edges from 2ec3d7f (vertex names):
[1] 小明->小蓝 小明->小西 小蓝->小西 小西->小蓝

Adjacency matrix: entries are 0 and 1

# 2.10
## Adjacency matrix
get.adjacency(g)

7 x 7 sparse Matrix of class "dgCMatrix"
  1 2 3 4 5 6 7
1 . 1 1 . . . .
2 1 . 1 1 . . .
3 1 1 . . 1 . .
4 . 1 . . 1 1 1
5 . . 1 1 . 1 .
6 . . . 1 1 . 1
7 . . . 1 . 1 .
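To compare the three representations side by side, here is a short sketch on the same 7-vertex graph. It assumes the current igraph names as_adj_list(), as_edgelist() and as_adjacency_matrix(); the variable names are illustrative.

```r
library(igraph)

# Toy graph: same edges as the 7-vertex example above
g <- graph_from_literal(1-2, 1-3, 2-3, 2-4, 3-5, 4-5, 4-6, 4-7, 5-6, 6-7)

# Adjacency list: one entry per vertex, holding its neighbors
adj.list <- as_adj_list(g)

# Edge list: a two-column matrix, one row per edge
edge.list <- as_edgelist(g)

# Adjacency matrix as a dense 0/1 matrix (sparse = FALSE)
adj.mat <- as_adjacency_matrix(g, sparse = FALSE)

print(adj.list[["4"]])   # neighbors of vertex 4
print(head(edge.list))
print(adj.mat)
```

For an undirected graph the adjacency matrix is symmetric and its entries sum to twice the number of edges.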

Subgraph

Induced subgraph

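# 2.11
## Induced subgraph
h <- induced.subgraph(g, 1:5)
plot(h)

## Induced subgraph obtained by deleting vertices 6 and 7
h1 <- g - vertices(c(6,7))
plot(h1)

# 2.13
# Add two new vertices to h1, then add the edges
h1 <- h1 + vertices(c(6,7))
plot(h1)

# Restore the four edges that were removed together with vertices 6 and 7
g <- h1 + edges(c(4,6), c(4,7), c(5,6), c(6,7))
plot(g)


# Graph union
h1 <- h
h2 <- graph.formula(4-6, 4-7, 5-6, 6-7)
g <- graph.union(h1, h2)
plot(g)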

2.3 Decorating Network Graphs

> library(sand)
Loading required package: igraphdata

Statistical Analysis of Network Data with R
Type in C2 (+ENTER) to start with Chapter 2.
> g.lazega <- graph.data.frame(elist.lazega, directed = FALSE, vertices = v.attr.lazega)
> g.lazega$name <- "Lazega Lawyers" # name the graph
> 
> vcount(g.lazega) # number of vertices
[1] 36
> ecount(g.lazega) # number of edges
[1] 115
> 
> list.vertex.attributes(g.lazega) # vertex attributes
[1] "name"      "Seniority" "Status"    "Gender"    "Office"    "Years"    
[7] "Age"       "Practice"  "School"   
> is.simple(g) # check whether g is a simple graph
[1] TRUE
> mg <- g + edge(2,3) # assumed step, missing from the notes: duplicate edge 2-3 to get a multigraph
> E(mg)$weight <- 1 # give every edge weight 1
> wg2 <- simplify(mg) # merging duplicate edges sums their weights
> is.simple(wg2)
[1] TRUE
> E(wg2)$weight
 [1] 1 1 2 1 1 1 1 1 1 1
plot(mg)
plot(wg2)

Converting mg into wg2

wg2

> # Neighbors of vertex 5 in graph g
> neighbors(g,5)
+ 3/7 vertices, named, from bb8e27b:
[1] 3 4 6
> 
> degree(g)
1 2 3 4 5 6 7 
2 3 3 4 3 3 2 
> 
> degree(dg, mode = "in") # directed graph, in-degree
小明 小蓝 小西 
   0    2    2 
> degree(dg, mode = "out") # directed graph, out-degree
小明 小蓝 小西 
   2    1    1

g.full <- graph.full(7)
g.ring <- graph.ring(7)
g.tree <- graph.tree(7, children = 2, mode = "undirected")
g.star <- graph.star(7, mode = "undirected")
par(mfrow=c(2,2))
plot(g.full)
plot(g.ring)
plot(g.tree)
plot(g.star)

# Bipartite graph
g.bip <- graph.formula(actor1:actor2:actor3, movie1:movie2,
                       actor1:actor2-movie1, actor2:actor3 - movie2)
V(g.bip)$type <- grepl("^movie", V(g.bip)$name)
plot(g.bip)

Chapter 3: Visualizing Network Data

# 3.1
library(sand)
## Data set 1: a 5x5x5 lattice (3D)
g.l <- graph.lattice(c(5, 5, 5))

# 3.2
## Data set 2: a blog network recording the citation links among 146 distinct blogs
data(aidsblog)
summary(aidsblog)

str(aidsblog)
# If str() does not work on this object, fall back to the basic API
vcount(aidsblog) # number of vertices
ecount(aidsblog) # number of edges
## Vertex sequence
V(aidsblog)

## Edge sequence
E(aidsblog)

# 3.3
igraph.options(vertex.size=3, vertex.label=NA, edge.arrow.size=0.5)
par(mfrow=c(1, 2))
plot(g.l, layout=layout.circle)
title("5x5x5 Lattice")
plot(aidsblog, layout=layout.circle)
title("Blog Network")

# 3.4
plot(g.l, layout=layout.fruchterman.reingold)
title("5x5x5 Lattice")
plot(aidsblog, layout=layout.fruchterman.reingold)
title("Blog Network")

# 3.5
plot(g.l, layout=layout.kamada.kawai)
title("5x5x5 Lattice")
plot(aidsblog, layout=layout.kamada.kawai)
title("Blog Network")

# 3.7 Visualizing a bipartite graph
plot(g.bip, layout=-layout.bipartite(g.bip)[,2:1],
     vertex.size=30, vertex.shape=ifelse(V(g.bip)$type,
                                         "rectangle", "circle"),
     vertex.color=ifelse(V(g.bip)$type, "red", "cyan"))

Zachary karate club network
The data set effectively contains just two communities, one centered on the instructor and one centered on the administrator.

# 3.8
library(igraphdata)
data(karate)

# Reproducible layout
set.seed(42)
l <- layout.kamada.kawai(karate)

# First plot the undecorated graph
igraph.options(vertex.size=10)
par(mfrow = c(1,1))
plot(karate, layout=l, vertex.label=V(karate))


# Decorate the graph, starting with the labels
V(karate)$label <- sub("Actor ", "", V(karate)$name)

# The two leaders get a different vertex shape from the other club members
V(karate)$shape <- "circle"
V(karate)[c("Mr Hi", "John A")]$shape <- "rectangle"

# Use color to distinguish the factions
V(karate)[Faction == 1]$color <- "red"
V(karate)[Faction == 2]$color <- "dodgerblue"

# Vertex area is proportional to vertex strength (the sum of the weights of incident edges)
V(karate)$size <- 4 * sqrt(graph.strength(karate))
V(karate)$size <- V(karate)$size * .5

# Use the number of shared activities (the edge weight) as edge width
E(karate)$width <- E(karate)$weight

# Use color to distinguish edges within and between factions
F1 <- V(karate)[Faction == 1]
F2 <- V(karate)[Faction == 2]
E(karate)[F1 %--% F1]$color <- "pink"
E(karate)[F2 %--% F2]$color <- "lightblue"
E(karate)[F1 %--% F2]$color <- "yellow"
# This shows how flexible plotting in R is


# Offset the label position for smaller vertices (default is 0)
V(karate)$label.dist <- ifelse(V(karate)$size >= 10, 0, 0.75)

# Plot the decorated graph with the same layout
plot(karate, layout=l)

Visualizing the Lazega lawyer network

# 3.9
library(sand)
data(lazega)

# Use color to indicate office location
colbar <- c("red", "dodgerblue", "goldenrod")
v.colors <- colbar[V(lazega)$Office]

# Use shape to indicate type of practice
v.shapes <- c("circle", "square")[V(lazega)$Practice]

# Vertex size is proportional to years with the firm
v.size <- 3.5 * sqrt(V(lazega)$Years)

# Vertex labels show each person's seniority
v.label <- V(lazega)$Seniority

# Reproducible layout
set.seed(42)
l <- layout.fruchterman.reingold(lazega)
plot(lazega, layout=l, vertex.color=v.colors, vertex.shape=v.shapes,
     vertex.size=v.size, vertex.label=v.label)

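Visualizing Large Networks

If str() fails on a graph saved by an older igraph version, upgrade the graph with upgrade_graph()

library(sand)
summary(fblog)
fblog <- upgrade_graph(fblog)
# upgrade_graph() converts a graph stored in an older igraph data format
# to the current one and returns it; printing the result shows the summary line plus the edges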

list.vertex.attributes(fblog) # vertex attributes

party.names <- sort(unique(V(fblog)$PolParty))

party.names

set.seed(42) # fix the random seed: a given seed yields one particular pseudo-random sequence
l <- layout.kamada.kawai(fblog)
party.nums.f <- as.factor(V(fblog)$PolParty)
party.nums <- as.numeric(party.nums.f)
plot(fblog, layout=l, vertex.color=party.nums,
     vertex.size=3, vertex.label=NA)

DrL is a layout algorithm designed for visualizing large networks.

set.seed(42) # fix the random seed so the layout is reproducible
l <- layout.drl(fblog)
plot(fblog, layout=l, vertex.size=5, vertex.label=NA, vertex.color=party.nums)

Drawing meta-vertices

A vertex made of vertices, i.e. a community (topic) vertex

fblog.c <- contract.vertices(fblog, party.nums)
E(fblog.c)$weight <- 1
fblog.c <- simplify(fblog.c)

party.size <- as.vector(table(V(fblog)$PolParty))
plot(fblog.c, vertex.size=5*sqrt(party.size),
     vertex.label=party.names,vertex.color=V(fblog.c),
     edge.width=sqrt(E(fblog.c)$weight),
     vertex.label.dist=1.5,edge.arrow.size=0)

Egocentric network

A central vertex, the neighbors directly connected to it, and the edges among these vertices.

data("karate")
k.nbhds <- graph.neighborhood(karate, order = 1)

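# k.nbhds is a list of graphs, one neighborhood per vertex

# The instructor (Mr Hi, vertex 1) and the administrator (John A, vertex 34) have the largest neighborhoods

sapply(k.nbhds, vcount)  # vcount counts the vertices of each neighborhood

# Extract the two largest subnetworks and plot them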
k.1 <- k.nbhds[[1]]
k.34 <- k.nbhds[[34]]
par(mfrow = c(1,2))

plot(k.1, vertex.label=NA,
     vertex.color=c("red", rep("lightblue", 16)))
plot(k.34, vertex.label=NA,
     vertex.color=c(rep("lightblue", 17),"red"))

Chapter 4: Descriptive Analysis of Network Graph Characteristics

4.2 Vertices and Edges

Degree distribution

library(igraph)
library(sand)
data(karate)
hist(degree(karate), col="lightblue", xlim = c(0,50),
     xlab = "Vertex Degree", ylab = "Frequency", main="")

Vertex strength distribution

Strength: the sum of the weights of the edges incident to a vertex

hist(graph.strength(karate), col="pink",
     xlab = "Vertex Strength", ylab = "Frequency", main="")
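To make the definition of strength concrete, the sketch below uses a hypothetical 4-vertex weighted graph (the edges and weights are made up) and checks that strength() agrees with the row sums of the weighted adjacency matrix. It assumes the current igraph names strength() and as_adjacency_matrix().

```r
library(igraph)

# Hypothetical weighted graph; the weights are made up for illustration
g <- graph_from_literal(a-b, a-c, b-c, c-d)
E(g)$weight <- c(2, 1, 3, 5)

# Strength: sum of the weights of the edges incident to each vertex
s <- strength(g)

# Same quantity from the weighted adjacency matrix: row sums
A <- as_adjacency_matrix(g, attr = "weight", sparse = FALSE)
s.manual <- rowSums(A)

print(s)
print(s.manual)
```

Each edge weight contributes to two vertices, so the strengths add up to twice the total edge weight.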

library(igraphdata)
data(yeast)

# Number of edges
ecount(yeast)

# Number of vertices
vcount(yeast)

d.yeast <- degree(yeast)

par(mfrow = c(1,2))

# The degree distribution is highly heterogeneous
hist(d.yeast, col="blue",
     xlab="Degree",ylab="Frequency",
     main="Degree Distribution")

# The distribution decays with degree, so log-log axes convey the degree information more effectively

dd.yeast <- degree.distribution(yeast)
d <- 0:max(d.yeast) # degree.distribution() covers degrees 0 through max(d.yeast)
ind <- (dd.yeast != 0)
plot(d[ind], dd.yeast[ind], log="xy", col="blue",
     xlab=c("Log-Degree"), ylab=c("Log-Intensity"),
     main="Log-Log Degree Distribution")

How vertices of different degrees connect to one another

a.nn.deg.yeast <- graph.knn(yeast, V(yeast))$knn
plot(d.yeast, a.nn.deg.yeast, log="xy",
     col="goldenrod", xlab = c("Log Vertex Degree"),
     ylab = c("Log Average Neighbor Degree"))

Vertex centrality

Closeness centrality
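Closeness centrality scores a vertex by the inverse of its total shortest-path distance to all other vertices. The following is a minimal sketch on the 7-vertex toy graph from Chapter 2, assuming igraph's closeness() and distances(); the manual version is included only as a cross-check and holds for connected, unweighted graphs.

```r
library(igraph)

# Toy graph: same edges as the 7-vertex example in Chapter 2 of these notes
g <- graph_from_literal(1-2, 1-3, 2-3, 2-4, 3-5, 4-5, 4-6, 4-7, 5-6, 6-7)

# Closeness as computed by igraph (unnormalized)
cl <- closeness(g)

# Manual version: 1 / (sum of shortest-path distances to all other vertices)
cl.manual <- 1 / rowSums(distances(g))

print(round(cl, 4))
print(names(which.max(cl)))  # the most central vertex by closeness
```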


l <- layout.kamada.kawai(aidsblog)

par(mfrow=c(1,2))
plot(aidsblog, layout=l, main="Hubs",
     vertex.label="", vertex.size=10 *
       sqrt(hub.score(aidsblog)$vector))

plot(aidsblog, layout=l, main="Authorities",
     vertex.label="", vertex.size=10 *
       sqrt(authority.score(aidsblog)$vector))

4.3 Subgraphs and Cliques

library(igraphdata)
library(igraph)
data(karate)
# Cliques are complete subgraphs
table(sapply(cliques(karate), length))

cliques(karate)[sapply(cliques(karate), length) == 5]

data(yeast)
clique.number(yeast)


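cores <- graph.coreness(karate)
# gplot.target() expects a network object, so convert karate first
A <- get.adjacency(karate, sparse = FALSE)
library(network)
library(sna)
g <- network::as.network.matrix(A)
sna::gplot.target(g, cores, circ.lab = FALSE,
                  circ.col = "skyblue", usearrows = FALSE,
                  vertex.col = cores, edge.col="darkgray")
detach("package:sna")
detach("package:network")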

plot(aidsblog)

aidsblog <- simplify(aidsblog)
> dyad.census(aidsblog)
$`mut`
[1] 3

$asym
[1] 177

$null
[1] 10405

Graph density

> ego.instr <- induced.subgraph(karate,
+                               neighborhood(karate, 1, 1)[[1]])
> 
> ego.admin <- induced.subgraph(karate,
+                               neighborhood(karate, 1, 34)[[1]])
> 
> graph.density(karate)
[1] 0.1390374
> 
> graph.density(ego.instr)
[1] 0.25
> 
> graph.density(ego.admin)
[1] 0.2091503
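The density of an undirected simple graph is the number of edges divided by the number of possible vertex pairs. As a sketch, the value reported for karate above can be reproduced by hand; this assumes igraph's built-in make_graph("Zachary"), an unweighted copy of the karate club network, rather than the igraphdata object used in the notes.

```r
library(igraph)

# Unweighted Zachary karate club network bundled with igraph
karate.unw <- make_graph("Zachary")

nv <- vcount(karate.unw)
ne <- ecount(karate.unw)

# Density = edges / possible pairs, for an undirected graph without loops
dens.manual <- ne / (nv * (nv - 1) / 2)
dens.igraph <- edge_density(karate.unw)

print(c(vertices = nv, edges = ne))
print(c(manual = dens.manual, igraph = dens.igraph))
```

With 34 vertices and 78 edges this gives 78 / 561, matching the 0.1390374 shown above.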

Global clustering coefficient

> transitivity(karate)
[1] 0.2556818

Local clustering coefficient

> transitivity(karate, "local", vids = c(1,34))
[1] 0.1500000 0.1102941

Reciprocity
Dyad census

> reciprocity(aidsblog, mode = "default")
[1] 0.03278689
> 
> reciprocity(aidsblog, mode="ratio")
[1] 0.01666667
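Both reciprocity values follow from the dyad census reported earlier (3 mutual dyads, 177 asymmetric dyads). The arithmetic below is a sketch of that relationship in base R; the variable names are illustrative.

```r
# Dyad census counts for aidsblog, as reported in the notes
mut <- 3
asym <- 177

# mode = "default": share of directed edges that are reciprocated.
# Each mutual dyad holds two edges, each asymmetric dyad holds one.
recip.default <- 2 * mut / (2 * mut + asym)

# mode = "ratio": share of connected dyads that are mutual
recip.ratio <- mut / (mut + asym)

print(recip.default)
print(recip.ratio)
```

These evaluate to 6/183 and 3/180, which agree with the two outputs above.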

Connectivity, Cuts, and Flows

> is.connected(yeast)
[1] FALSE

plot(yeast, vertex.label=NA, vertex.size=3,edge.arrow.size=1)

plot(yeast,  layout=layout.kamada.kawai,vertex.label=NA, vertex.size=3,edge.arrow.size=1)
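When a graph is not connected, a common next step is to look at its components and continue with the largest one. The following is a minimal sketch on a hypothetical 7-vertex graph (not the yeast data), assuming igraph's components() and induced_subgraph().

```r
library(igraph)

# Hypothetical disconnected graph: a 4-vertex piece, a 2-vertex piece,
# and one isolated vertex (vertex 7)
g <- make_graph(c(1,2, 2,3, 3,1, 3,4, 5,6), n = 7, directed = FALSE)

connected <- is_connected(g)

# Component membership and sizes
comps <- components(g)
print(comps$no)      # number of components
print(comps$csize)   # size of each component

# Keep only the largest (giant) component
giant.id <- which.max(comps$csize)
giant <- induced_subgraph(g, which(comps$membership == giant.id))

print(vcount(giant))
print(is_connected(giant))
```

The same pattern should apply to yeast, where most analysis would usually proceed on the giant component.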
