pandas: splitting text into pairwise combinations for frequency counting

最编程 2024-10-16 07:20:32


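The source file is shown below. Any text entry that combines three or more items has to be broken into pairs and counted pair by pair, so that the combinations can finally be ranked by frequency.

The working code is as follows:

Key statement: df2['组合拆分']=df2['组合'].apply(lambda x:list(combinations(list(x),2)))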
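To make the key statement concrete, here is a minimal sketch of what `itertools.combinations` returns for a single cell. The value `"A,B,C"` is a made-up example, not taken from the source file.

```python
from itertools import combinations

# Hypothetical cell value: three items separated by commas
items = "A,B,C".split(",")

# Every unordered 2-item selection, in the order the items appear
pairs = list(combinations(items, 2))
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Each pair is a tuple, and a row with n items yields n*(n-1)/2 pairs, which is why the later `explode` step produces several rows per original row.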

import pandas as pd
import numpy as np
from itertools import combinations


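# Location of the source workbook (the author's local path)
filepath='/Users/kangyongqing/Documents/kangyq/202409/教师空余时间查询/数学排课/'
file1='02星期热力9-20至9-26文件2024-10-11.xlsx'

df1=pd.read_excel(filepath+file1)
# Split the comma-separated 'pinlv' text into a list and record how many items it holds
df1['组合']=df1['pinlv'].str.split(',')
df1['长度']=df1['pinlv'].str.split(',').str.len()
print(df1.head())

# Rows with three or more items: expand each one into all of its 2-item combinations
df2=df1[df1['长度']>2].copy()
df2['组合拆分']=df2['组合'].apply(lambda x:list(combinations(x,2)))
# print(df2.head())

# One row per pair; the other columns (including '学生数') are repeated for each pair
df3=df2.explode('组合拆分')
print(df3.head())

# Rows that already hold one or two items are kept unchanged
df4=df1.loc[df1['长度']<=2,['pinlv','学生数']]
print(df4.head(),df4.shape)
# Rename the pair column so that it lines up with 'pinlv' in df4
df5=df3.loc[:,['组合拆分','学生数']].rename(columns={'组合拆分':'pinlv'})
print(df5.head(),df5.shape)

# explode() repeats the original index, so renumber the rows from 0
df6=df5.reset_index(drop=True)

print(df6.head())

# Stack the unchanged rows on top of the exploded pairs
df7=pd.concat((df4,df6),axis=0)
print(df7.tail(),df7.shape)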



# Write each table to its own sheet; the with-block saves and closes the workbook on exit
with pd.ExcelWriter(filepath+'ceshi.xlsx') as writer:
    df4.to_excel(writer,sheet_name='单一')
    df5.to_excel(writer,sheet_name='多拆')
    df1.to_excel(writer,sheet_name='明细')
    df7.to_excel(writer,sheet_name='合并')

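The result on the '多拆' sheet is as follows:

Because the split pairs are stored as tuples rather than as the comma-separated text used in the source data, some replacement may be needed before they can be merged with the source rows that were kept unchanged and used for the final frequency ranking.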
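One possible way to do that adjustment and the final ranking is sketched below. It is only an illustration under stated assumptions: the small `df7` is made-up stand-in data, `normalize` and the `key` column are hypothetical names, the order of items inside a pair is assumed not to matter, and "frequency" is assumed to mean the sum of '学生数' per combination.

```python
import pandas as pd

# Hypothetical stand-in for df7: 'pinlv' holds comma-separated text for the rows
# kept unchanged and tuples for the pairs produced by combinations()
df7 = pd.DataFrame({
    "pinlv": ["A,B", "C", ("A", "B"), ("A", "C"), ("B", "A")],
    "学生数": [5, 2, 3, 3, 4],
})

def normalize(value):
    """Return a sorted, comma-joined key so 'A,B', ('A', 'B') and ('B', 'A') match."""
    parts = value.split(",") if isinstance(value, str) else list(value)
    return ",".join(sorted(p.strip() for p in parts))

# Assumption: item order inside a combination is irrelevant
df7["key"] = df7["pinlv"].map(normalize)

# Assumption: frequency is the total of '学生数' for each combination
ranking = (
    df7.groupby("key")["学生数"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
print(ranking)
```

If item order does matter in the real data, the `sorted` call can be dropped; if frequency should be a plain row count, `.size()` can replace the `["学生数"].sum()` step.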
