Processing and Statistics of Lung Nodule Annotation Features in the LIDC-IDRI CT XML Files
1. The Dataset
The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung-cancer-screening thoracic computed tomography (CT) scans with marked-up annotated lesions. It is a web-accessible international resource for the development, training, and evaluation of computer-aided diagnosis (CAD) methods for lung cancer detection and diagnosis. Initiated by the National Cancer Institute (NCI), further advanced by the Foundation for the National Institutes of Health (FNIH), and accompanied by the active participation of the Food and Drug Administration (FDA), this public-private partnership demonstrates the success of a consortium built on a consensus-based process.
Seven academic centers and eight medical imaging companies collaborated to create this dataset of 1018 cases. Each subject includes images from a clinical thoracic CT scan and an associated XML file that records the results of a two-phase image annotation process performed by four experienced thoracic radiologists. In the initial blinded-read phase, each radiologist independently reviewed each CT scan and marked lesions belonging to one of three categories ("nodule >= 3 mm", "nodule < 3 mm", and "non-nodule >= 3 mm"). In the subsequent unblinded-read phase, each radiologist independently reviewed their own marks along with the anonymized marks of the other three radiologists to render a final opinion.
From this official overview we can see that the data consists of pre-annotated nodules, delivered in two forms:
DICOM CT image data
Annotation information in XML files
The four radiologists classify each marked lesion into three categories:
Nodules >= 3 mm: annotated with the nodule's feature information (characteristics) and its complete outline (roi);
Nodules < 3 mm: only the approximate three-dimensional center of mass is marked, and only for nodules that are not clearly benign;
Non-nodules >= 3 mm: only the approximate three-dimensional center of mass is marked, to indicate the location of the non-nodule region;
The annotation XML files store the raw (unprocessed) annotation data; a rough structural diagram is shown below (no need to study it closely, we will revisit it repeatedly later):
For beginners, here is a brief introduction to how the CT data is organized:
One examination of a patient is a study, identified by a StudyInstanceUID; if a patient has had three examinations at the same hospital, that is three studies. Each scan acquisition within an examination is a series (SeriesInstanceUID), and a series contains many slice images, each with its own imageSOP_UID, the unique identifier of that individual image.
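The hierarchy above can be sketched as a nested mapping. All UIDs below are made-up placeholders for illustration, not real DICOM identifiers:

```python
# Sketch of the DICOM storage hierarchy used by LIDC-IDRI.
# All UID strings are fabricated placeholders.
patient = {
    "PatientID": "LIDC-IDRI-0001",
    "studies": {                          # one study per examination
        "1.3.6.1.4.1.x.study1": {
            "series": {                   # one series per scan acquisition
                "1.3.6.1.4.1.x.series1": [
                    "1.3.6.1.4.1.x.sop1",  # one SOPInstanceUID per slice
                    "1.3.6.1.4.1.x.sop2",
                ],
            },
        },
    },
}

# Each slice image is addressed by the triple (study, series, SOP instance).
for study_uid, study in patient["studies"].items():
    for series_uid, slices in study["series"].items():
        print(study_uid, series_uid, len(slices))
```

This mirrors the on-disk layout of LIDC-IDRI: a patient folder containing a study folder, containing a series folder, containing the .dcm slice files.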
For nodules in the 3 mm-30 mm range, the characteristics element contains the following fields:
1) subtlety: difficulty of detection (levels 1-5; 1 = most subtle, 5 = most obvious)
2) internalStructure: internal composition (4 kinds: soft tissue, fluid, fat, air)
3) calcification: calcification pattern (6 kinds)
4) sphericity: roundness (5 levels, only 3 of which are explicitly named)
5) margin: edge definition (5 levels)
6) lobulation: lobulation sign (5 levels, only 2 explicitly named)
7) spiculation: spiculation sign (5 levels, only 2 explicitly named)
8) texture: internal texture (5 levels, only 3 explicitly named)
9) malignancy: likelihood of malignancy (1-5; 1 = lowest, 5 = highest)
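These enumerated values can be turned into a small lookup table. The label names below follow commonly cited LIDC references (e.g. the pylidc project's documentation of these scales); the XML itself stores only the integers, so treat the strings as a convenience, not part of the dataset:

```python
# Human-readable labels for two of the enumerated characteristics.
# Label names follow common LIDC references; the XML stores only integers.
INTERNAL_STRUCTURE = {1: "soft tissue", 2: "fluid", 3: "fat", 4: "air"}
CALCIFICATION = {1: "popcorn", 2: "laminated", 3: "solid",
                 4: "non-central", 5: "central", 6: "absent"}

def describe(char):
    """Replace known integer scores in a characteristics dict with labels."""
    out = dict(char)
    out["internalStructure"] = INTERNAL_STRUCTURE[char["internalStructure"]]
    out["calcification"] = CALCIFICATION[char["calcification"]]
    return out

print(describe({"internalStructure": 1, "calcification": 6, "malignancy": 5}))
```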
An example of the characteristics section looks like this:
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
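To get a feel for the parsing, a fragment like the one above can be read with the standard-library xml.etree. The real LIDC files wrap these tags in the nih namespace, which the full code later handles through the NS mapping; this simplified sketch omits the namespace:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one nodule entry (namespace omitted).
snippet = """
<nodule>
  <noduleID>Nodule 001</noduleID>
  <characteristics>
    <subtlety>5</subtlety>
    <malignancy>5</malignancy>
  </characteristics>
</nodule>"""

root = ET.fromstring(snippet)
# Collect every child of <characteristics> into a name -> int dict.
chars = {c.tag: int(c.text) for c in root.find("characteristics")}
print(root.find("noduleID").text, chars)  # Nodule 001 {'subtlety': 5, 'malignancy': 5}
```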
Public material on extracting and using this feature information is scarce; most papers and applications focus on nodule detection and benign/malignant classification. This post therefore focuses on converting and processing the feature portion of the data, and in particular on its tangled internal relationships, which cost me quite a few brain cells.
2. Dumping the Annotation Information
.pkl is a file format Python uses to save serialized objects; opened directly, it shows a pile of serialized bytes.
It must be opened in rb mode: rb reads a binary file, r reads a text file.
Here is code to write and read a .pkl file, so we can inspect what was written:
# -*- coding:utf-8 -*-
import pickle

# obj = 123, "abcdef", ["ac", 123], {"key": "value", "key1": "value1"}
# print(obj)
# Serialize to file:
# with open(r"F:\", "wb") as f:
#     pickle.dump(obj, f)

with open(r"F:\dst\LIDC-IDRI-0002_annotation_flatten.pkl", "rb") as f:
    print(pickle.load(f))
Later we will dump the information stored in the XML files into .pkl files, so here we first do a dry run of saving and loading a pickle to check that the printed content matches what we expect. The printed structure also helps us understand this batch of annotations.
Main driver function: read the data, process it, then save the result.
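A self-contained round-trip sanity check, using a temporary file instead of the F:\ paths (which are specific to the author's machine):

```python
import os
import pickle
import tempfile

# The same three-category structure we will later fill from the XML files.
obj = {"nodules": [], "small_nodules": [], "non_nodules": []}

# Dump to a temporary .pkl and load it back.
path = os.path.join(tempfile.mkdtemp(), "demo_annotation_flatten.pkl")
with open(path, "wb") as f:   # 'wb': pickle files are binary
    pickle.dump(obj, f)
with open(path, "rb") as f:   # 'rb' is required when reading them back
    restored = pickle.load(f)

print(restored == obj)  # True
```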
import os
import logging
import pickle

def parse_main(dirname, outdir, case_name, save_pickle=True):
    assert os.path.isdir(dirname)
    annotations = parse_original_xmls(dirname)
    annotations = flatten_annotation(annotations)
    print(annotations)
    if save_pickle:
        pickle_file = os.path.join(outdir, case_name + '_annotation_flatten.pkl')
        logging.info("Saving annotations to file %s" % pickle_file)
        with open(pickle_file, 'wb') as f:
            pickle.dump(annotations, f)
        print('pkl save OK')

if __name__ == '__main__':
    dsr_dir = r'F:\dst\tmp'
    case_path = r'F:\LIDC\LIDC-IDRI-0003'
    parse_main(case_path, dsr_dir, case_name='LIDC-IDRI-0003')
Extracting information from the XML files:
def parse_XML(xml_filename, Slices_info):
    logging.info("Parsing %s" % xml_filename)
    annotations = []
    # etree is the library we use to parse xml data;
    # NS is the namespace mapping for the 'nih' prefix, defined elsewhere
    tree = etree.parse(xml_filename)
    root = tree.getroot()
    # readingSession -> holds one radiologist's annotation info
    for read_session in root.findall('nih:readingSession', NS):
        # holds each radiologist's annotation,
        # i.e. one readingSession in the xml file
        rad_annotation = RadAnnotation()
        rad_annotation.version = read_session.find('nih:annotationVersion', NS).text
        rad_annotation.id = read_session.find('nih:servicingRadiologistID', NS).text
        # nodules
        nodule_nodes = read_session.findall('nih:unblindedReadNodule', NS)
        for node in nodule_nodes:
            nodule = parse_nodule(node, Slices_info)
            if nodule.is_small:
                rad_annotation.small_nodules.append(nodule)
            else:
                rad_annotation.nodules.append(nodule)
        # non-nodules
        non_nodule = read_session.findall('nih:nonNodule', NS)
        for node in non_nodule:
            nodule = parse_non_nodule(node)
            rad_annotation.non_nodules.append(nodule)
        annotations.append(rad_annotation)
    return annotations
def find_all_xmlFiles(root, suffix=None):
    res = []
    for root, _, files in os.walk(root):
        for f in files:
            # keep only files ending with the given suffix (e.g. '.xml')
            if suffix is not None and not f.endswith(suffix):
                continue
            res.append(os.path.join(root, f))
    return res
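A quick sanity check of this helper against a throwaway directory tree (the function body is repeated here so the sketch is self-contained):

```python
import os
import tempfile

def find_all_xmlFiles(root, suffix=None):
    # Same logic as above: walk the tree, keep files matching the suffix.
    res = []
    for root, _, files in os.walk(root):
        for f in files:
            if suffix is not None and not f.endswith(suffix):
                continue
            res.append(os.path.join(root, f))
    return res

tmp = tempfile.mkdtemp()
for name in ("a.xml", "b.xml", "c.dcm"):
    open(os.path.join(tmp, name), "w").close()

found = find_all_xmlFiles(tmp, ".xml")
print(len(found))  # 2
```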
def find_all_seriesDir(src_dir):
    SeriesInstanceUID_path_list = []
    StudyInstanceUID_list = os.listdir(src_dir)
    for StudyInstanceUID in StudyInstanceUID_list:
        StudyInstanceUID_path = os.path.join(src_dir, StudyInstanceUID)
        SeriesInstanceUID_list = os.listdir(StudyInstanceUID_path)
        for SeriesInstanceUID in SeriesInstanceUID_list:
            SeriesInstanceUID_path = os.path.join(StudyInstanceUID_path, SeriesInstanceUID)
            print(SeriesInstanceUID_path)
            dcm_list = os.listdir(SeriesInstanceUID_path)
            # skip series with fewer than 50 slices
            if len(dcm_list) < 50:
                continue
            else:
                SeriesInstanceUID_path_list.append(SeriesInstanceUID_path)
    return SeriesInstanceUID_path_list
def parse_original_xmls(dirname):
    annotations = []
    SeriesInstanceUID_dir_list = find_all_seriesDir(dirname)
    print('SeriesInstanceUID_path_list:', SeriesInstanceUID_dir_list)
    for SeriesInstanceUID_dir in SeriesInstanceUID_dir_list:
        Slices_info = get_CT_info(SeriesInstanceUID_dir)
        logging.info("Reading annotations")
        xml_files = find_all_xmlFiles(SeriesInstanceUID_dir, '.xml')
        for xml_file in xml_files:
            annotations.append(parse_XML(xml_file, Slices_info))  # extract the full info from each xml
    return annotations
Parsing nodule and non-nodule information separately:
def parse_nodule(xml_node, Slices):  # xml_node is one unblindedReadNodule
    char_node = xml_node.find('nih:characteristics', NS)
    # if there are no characteristics, it is a small nodule, i.e. is_small = True
    is_small = (char_node is None or len(char_node) == 0)
    nodule = is_small and SmallNodule() or NormalNodule()
    nodule.id = xml_node.find('nih:noduleID', NS).text
    if not is_small:
        subtlety = char_node.find('nih:subtlety', NS)
        nodule.characteristics.subtlety = int(subtlety.text)
        nodule.characteristics.internal_struct = \
            int(char_node.find('nih:internalStructure', NS).text)
        nodule.characteristics.calcification = \
            int(char_node.find('nih:calcification', NS).text)
        nodule.characteristics.sphericity = \
            int(char_node.find('nih:sphericity', NS).text)
        nodule.characteristics.margin = \
            int(char_node.find('nih:margin', NS).text)
        nodule.characteristics.lobulation = \
            int(char_node.find('nih:lobulation', NS).text)
        nodule.characteristics.spiculation = \
            int(char_node.find('nih:spiculation', NS).text)
        nodule.characteristics.texture = \
            int(char_node.find('nih:texture', NS).text)
        nodule.characteristics.malignancy = \
            int(char_node.find('nih:malignancy', NS).text)
    xml_rois = xml_node.findall('nih:roi', NS)
    for xml_roi in xml_rois:
        roi = NoduleRoi()
        roi.z = float(xml_roi.find('nih:imageZposition', NS).text)  # z-axis position (mm)
        # 1-based slice index: distance from the first slice divided by the slice spacing
        roi.Instance_num = int(abs(roi.z - Slices[0].ImagePositionPatient[2]) / (
            abs(float(Slices[3].ImagePositionPatient[2]) - float(Slices[4].ImagePositionPatient[2])))) + 1
        roi.sop_uid = xml_roi.find('nih:imageSOP_UID', NS).text
        # when inclusion = TRUE -> the roi includes the whole nodule
        # when inclusion = FALSE -> the roi is drawn twice for one nodule:
        #   1. outside the nodule
        #   2. inside the nodule -> to indicate that the nodule has a donut
        #      hole (the inside hole is not part of the nodule); by forcing
        #      inclusion to be TRUE, this situation is ignored
        roi.inclusion = (xml_roi.find('nih:inclusion', NS).text == "TRUE")
        edge_maps = xml_roi.findall('nih:edgeMap', NS)
        for edge_map in edge_maps:
            x = int(edge_map.find('nih:xCoord', NS).text)
            y = int(edge_map.find('nih:yCoord', NS).text)
            roi.roi_xy.append([x, y])
        xmax = np.array(roi.roi_xy)[:, 0].max()
        xmin = np.array(roi.roi_xy)[:, 0].min()
        ymax = np.array(roi.roi_xy)[:, 1].max()
        ymin = np.array(roi.roi_xy)[:, 1].min()
        if not is_small:  # only for normal nodules
            roi.roi_centroid = ((xmax + xmin) / 2., (ymin + ymax) / 2.)  # center point
        nodule.rois.append(roi)
    return nodule  # equivalent to one unblindedReadNodule (xml element)
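The Instance_num arithmetic above maps a physical z position (in mm) back to a 1-based slice index, using the spacing between two adjacent slices. In isolation, with made-up slice positions and assuming uniform spacing:

```python
# Made-up slice z positions (mm), uniformly spaced 2.5 mm apart,
# mimicking ImagePositionPatient[2] of consecutive DICOM slices.
slice_z = [-100.0, -102.5, -105.0, -107.5, -110.0]

def z_to_instance_num(z, slice_z):
    """1-based slice index for an annotation's imageZposition."""
    spacing = abs(slice_z[3] - slice_z[4])   # same adjacent pair the parser uses
    return int(abs(z - slice_z[0]) / spacing) + 1

print(z_to_instance_num(-105.0, slice_z))  # 3
```

Note this only works when the series is uniformly spaced; matching on imageSOP_UID is the more robust way to locate the slice.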
def parse_non_nodule(xml_node):  # xml_node is one nonNodule
    nodule = NonNodule()
    nodule.id = xml_node.find('nih:nonNoduleID', NS).text
    roi = NoduleRoi()
    roi.z = float(xml_node.find('nih:imageZposition', NS).text)
    roi.sop_uid = xml_node.find('nih:imageSOP_UID', NS).text
    loci = xml_node.findall('nih:locus', NS)
    for locus in loci:
        x = int(locus.find('nih:xCoord', NS).text)
        y = int(locus.find('nih:yCoord', NS).text)
        roi.roi_xy.append([x, y])
    nodule.rois.append(roi)
    return nodule  # equivalent to one nonNodule (xml element)
Further flattening the parsed nodule information:
def flatten_nodule(nodules, type, result):
    # if not result:
    #     result = {'nodules': [], 'small_nodules': [], 'non_nodules': []}
    for nodule in nodules:
        print('nodule:', nodule)
        point = []
        for roi in nodule.rois:
            if type == 'nodules':
                tmp = {'pixels': roi.roi_xy, 'sop_uid': roi.sop_uid,
                       'sop_Instance_num': roi.Instance_num, 'nodule_id': nodule.id,
                       'nodule_malignancy': nodule.characteristics.malignancy,
                       'nodule_subtlety': nodule.characteristics.subtlety,
                       'nodule_internal_struct': nodule.characteristics.internal_struct,
                       'nodule_calcification': nodule.characteristics.calcification,
                       'nodule_sphericity': nodule.characteristics.sphericity,
                       'nodule_margin': nodule.characteristics.margin,
                       'nodule_lobulation': nodule.characteristics.lobulation,
                       'nodule_spiculation': nodule.characteristics.spiculation,
                       'nodule_texture': nodule.characteristics.texture
                       }
            else:
                tmp = {'pixels': roi.roi_xy, 'sop_uid': roi.sop_uid}
            point.append(tmp)
        result[type].append(point)
def flatten_annotation(annotation_dict):
    logging.info("Start flatten")
    # res = {}
    res = {'nodules': [], 'small_nodules': [], 'non_nodules': []}
    for annotations in annotation_dict:
        # annotations in each file
        for anno in annotations:
            flatten_nodule(anno.nodules, 'nodules', res)
            flatten_nodule(anno.small_nodules, 'small_nodules', res)
            flatten_nodule(anno.non_nodules, 'non_nodules', res)
    logging.info("Flatten complete")
    return res
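After flattening, res groups every ROI dict by category; each element of res['nodules'] is the list of per-slice ROI dicts for one nodule read. A hand-built example of the resulting shape (all values fabricated for illustration):

```python
# Hand-built illustration of the flattened structure; all values are
# made up to show the shape only, not taken from a real case.
res = {
    "nodules": [
        [  # one nodule annotation, outlined on two slices
            {"pixels": [[302, 311], [303, 312]], "sop_uid": "uid-slice-41",
             "sop_Instance_num": 41, "nodule_id": "Nodule 001",
             "nodule_malignancy": 5, "nodule_subtlety": 5},
            {"pixels": [[301, 310]], "sop_uid": "uid-slice-42",
             "sop_Instance_num": 42, "nodule_id": "Nodule 001",
             "nodule_malignancy": 5, "nodule_subtlety": 5},
        ],
    ],
    "small_nodules": [
        [{"pixels": [[120, 220]], "sop_uid": "uid-slice-10"}],
    ],
    "non_nodules": [],
}

print(len(res["nodules"][0]))  # number of annotated slices for the first nodule
```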
Getting the metadata of the DICOM series, as follows: