Processing and Statistics of Lung Nodule Annotation Features in the LIDC-IDRI CT XML Files
1. The Dataset
The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung-cancer-screening thoracic computed tomography (CT) scans with marked-up annotated lesions. It is a web-accessible international resource for the development, training, and evaluation of computer-aided diagnosis (CAD) methods for lung cancer detection and diagnosis. Initiated by the National Cancer Institute (NCI), further advanced by the Foundation for the National Institutes of Health (FNIH), and accompanied by the active participation of the Food and Drug Administration (FDA), this public-private partnership demonstrates the success of a consortium built on a consensus-based process.
Seven academic centers and eight medical imaging companies collaborated to create this dataset of 1018 cases. Each subject includes images from a clinical thoracic CT scan and an associated XML file that records the results of a two-phase image annotation process performed by four experienced thoracic radiologists. In the initial blinded-read phase, each radiologist independently reviewed each CT scan and marked lesions belonging to one of three categories ("nodule >= 3 mm", "nodule < 3 mm", and "non-nodule >= 3 mm"). In the subsequent unblinded-read phase, each radiologist independently reviewed their own marks along with the anonymized marks of the other three radiologists to render a final opinion.
From this official overview we can see that the data consists of pre-annotated nodules, delivered in two forms:
DICOM CT image data
Annotation information in XML files
The four radiologists classify each marked lesion into three categories:
Nodules >= 3 mm: annotated with the nodule's feature information (characteristics) and its complete outline (roi);
Nodules < 3 mm: only the approximate three-dimensional center of mass is marked, and only for nodules that are not clearly benign;
Non-nodules >= 3 mm: only the approximate three-dimensional center of mass is marked, to indicate the location of the non-nodule region;
The annotation XML files store the raw (unprocessed) annotation data; a rough structural diagram is shown below (no need to study it closely, we will revisit it repeatedly later):
For beginners, here is a brief introduction to how the CT data is organized:
One examination of a patient is a study, identified by a StudyInstanceUID; if a patient has had three examinations at the same hospital, that is three studies. Each scan acquisition within an examination is a series (SeriesInstanceUID), and a series contains many slice images, each with its own imageSOP_UID, the unique identifier of that individual image.
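The hierarchy above can be sketched as a nested mapping. All UIDs below are made-up placeholders for illustration, not real DICOM identifiers:

```python
# Sketch of the DICOM storage hierarchy used by LIDC-IDRI.
# All UID strings are fabricated placeholders.
patient = {
    "PatientID": "LIDC-IDRI-0001",
    "studies": {                          # one study per examination
        "1.3.6.1.4.1.x.study1": {
            "series": {                   # one series per scan acquisition
                "1.3.6.1.4.1.x.series1": [
                    "1.3.6.1.4.1.x.sop1",  # one SOPInstanceUID per slice
                    "1.3.6.1.4.1.x.sop2",
                ],
            },
        },
    },
}

# Each slice image is addressed by the triple (study, series, SOP instance).
for study_uid, study in patient["studies"].items():
    for series_uid, slices in study["series"].items():
        print(study_uid, series_uid, len(slices))
```

This mirrors the on-disk layout of LIDC-IDRI: a patient folder containing a study folder, containing a series folder, containing the .dcm slice files.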
For nodules in the 3 mm-30 mm range, the characteristics element contains the following fields:
1) subtlety: difficulty of detection (levels 1-5; 1 = most subtle, 5 = most obvious)
2) internalStructure: internal composition (4 kinds: soft tissue, fluid, fat, air)
3) calcification: calcification pattern (6 kinds)
4) sphericity: roundness (5 levels, only 3 of which are explicitly named)
5) margin: edge definition (5 levels)
6) lobulation: lobulation sign (5 levels, only 2 explicitly named)
7) spiculation: spiculation sign (5 levels, only 2 explicitly named)
8) texture: internal texture (5 levels, only 3 explicitly named)
9) malignancy: likelihood of malignancy (1-5; 1 = lowest, 5 = highest)
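These enumerated values can be turned into a small lookup table. The label names below follow commonly cited LIDC references (e.g. the pylidc project's documentation of these scales); the XML itself stores only the integers, so treat the strings as a convenience, not part of the dataset:

```python
# Human-readable labels for two of the enumerated characteristics.
# Label names follow common LIDC references; the XML stores only integers.
INTERNAL_STRUCTURE = {1: "soft tissue", 2: "fluid", 3: "fat", 4: "air"}
CALCIFICATION = {1: "popcorn", 2: "laminated", 3: "solid",
                 4: "non-central", 5: "central", 6: "absent"}

def describe(char):
    """Replace known integer scores in a characteristics dict with labels."""
    out = dict(char)
    out["internalStructure"] = INTERNAL_STRUCTURE[char["internalStructure"]]
    out["calcification"] = CALCIFICATION[char["calcification"]]
    return out

print(describe({"internalStructure": 1, "calcification": 6, "malignancy": 5}))
```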
An example of the characteristics section looks like this:
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
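To get a feel for the parsing, a fragment like the one above can be read with the standard-library xml.etree. The real LIDC files wrap these tags in the nih namespace, which the full code later handles through the NS mapping; this simplified sketch omits the namespace:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one nodule entry (namespace omitted).
snippet = """
<nodule>
  <noduleID>Nodule 001</noduleID>
  <characteristics>
    <subtlety>5</subtlety>
    <malignancy>5</malignancy>
  </characteristics>
</nodule>"""

root = ET.fromstring(snippet)
# Collect every child of <characteristics> into a name -> int dict.
chars = {c.tag: int(c.text) for c in root.find("characteristics")}
print(root.find("noduleID").text, chars)  # Nodule 001 {'subtlety': 5, 'malignancy': 5}
```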
Public material on extracting and using this feature information is scarce; most papers and applications focus on nodule detection and benign/malignant classification. This post therefore focuses on converting and processing the feature portion of the data, and in particular on its tangled internal relationships, which cost me quite a few brain cells.
2. Dumping the Annotation Information
.pkl is a file format Python uses to save serialized objects; opened directly, it shows a pile of serialized bytes.
It must be opened in rb mode: rb reads a binary file, r reads a text file.
Here is code to write and read a .pkl file, so we can inspect what was written:
# -*- coding:utf-8 -*-
import pickle

# obj = 123, "abcdef", ["ac", 123], {"key": "value", "key1": "value1"}
# print(obj)
# Serialize to file:
# with open(r"F:\", "wb") as f:
#     pickle.dump(obj, f)

with open(r"F:\dst\LIDC-IDRI-0002_annotation_flatten.pkl", "rb") as f:
    print(pickle.load(f))
Later we will dump the information stored in the XML files into .pkl files, so here we first do a dry run of saving and loading a pickle to check that the printed content matches what we expect. The printed structure also helps us understand this batch of annotations.
Main driver function: read the data, process it, then save the result.
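A self-contained round-trip sanity check, using a temporary file instead of the F:\ paths (which are specific to the author's machine):

```python
import os
import pickle
import tempfile

# The same three-category structure we will later fill from the XML files.
obj = {"nodules": [], "small_nodules": [], "non_nodules": []}

# Dump to a temporary .pkl and load it back.
path = os.path.join(tempfile.mkdtemp(), "demo_annotation_flatten.pkl")
with open(path, "wb") as f:   # 'wb': pickle files are binary
    pickle.dump(obj, f)
with open(path, "rb") as f:   # 'rb' is required when reading them back
    restored = pickle.load(f)

print(restored == obj)  # True
```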
import os
import logging
import pickle

def parse_main(dirname, outdir, case_name, save_pickle=True):
    assert os.path.isdir(dirname)
    annotations = parse_original_xmls(dirname)
    annotations = flatten_annotation(annotations)
    print(annotations)
    if save_pickle:
        pickle_file = os.path.join(outdir, case_name + '_annotation_flatten.pkl')
        logging.info("Saving annotations to file %s" % pickle_file)
        with open(pickle_file, 'wb') as f:
            pickle.dump(annotations, f)
        print('pkl save OK')

if __name__ == '__main__':
    dsr_dir = r'F:\dst\tmp'
    case_path = r'F:\LIDC\LIDC-IDRI-0003'
    parse_main(case_path, dsr_dir, case_name='LIDC-IDRI-0003')
Extracting information from the XML files:
def parse_XML(xml_filename, Slices_info):
    logging.info("Parsing %s" % xml_filename)
    annotations = []
    # etree is the library we use to parse xml data;
    # NS is the namespace mapping for the 'nih' prefix, defined elsewhere
    tree = etree.parse(xml_filename)
    root = tree.getroot()
    # readingSession -> holds one radiologist's annotation info
    for read_session in root.findall('nih:readingSession', NS):
        # holds each radiologist's annotation,
        # i.e. one readingSession in the xml file
        rad_annotation = RadAnnotation()
        rad_annotation.version = read_session.find('nih:annotationVersion', NS).text
        rad_annotation.id = read_session.find('nih:servicingRadiologistID', NS).text
        # nodules
        nodule_nodes = read_session.findall('nih:unblindedReadNodule', NS)
        for node in nodule_nodes:
            nodule = parse_nodule(node, Slices_info)
            if nodule.is_small:
                rad_annotation.small_nodules.append(nodule)
            else:
                rad_annotation.nodules.append(nodule)
        # non-nodules
        non_nodule = read_session.findall('nih:nonNodule', NS)
        for node in non_nodule:
            nodule = parse_non_nodule(node)
            rad_annotation.non_nodules.append(nodule)
        annotations.append(rad_annotation)
    return annotations
def find_all_xmlFiles(root, suffix=None):
    res = []
    for root, _, files in os.walk(root):
        for f in files:
            # keep only files ending with the given suffix (e.g. '.xml')
            if suffix is not None and not f.endswith(suffix):
                continue
            res.append(os.path.join(root, f))
    return res
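A quick sanity check of this helper against a throwaway directory tree (the function body is repeated here so the sketch is self-contained):

```python
import os
import tempfile

def find_all_xmlFiles(root, suffix=None):
    # Same logic as above: walk the tree, keep files matching the suffix.
    res = []
    for root, _, files in os.walk(root):
        for f in files:
            if suffix is not None and not f.endswith(suffix):
                continue
            res.append(os.path.join(root, f))
    return res

tmp = tempfile.mkdtemp()
for name in ("a.xml", "b.xml", "c.dcm"):
    open(os.path.join(tmp, name), "w").close()

found = find_all_xmlFiles(tmp, ".xml")
print(len(found))  # 2
```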
def find_all_seriesDir(src_dir):
    SeriesInstanceUID_path_list = []
    StudyInstanceUID_list = os.listdir(src_dir)
    for StudyInstanceUID in StudyInstanceUID_list:
        StudyInstanceUID_path = os.path.join(src_dir, StudyInstanceUID)
        SeriesInstanceUID_list = os.listdir(StudyInstanceUID_path)
        for SeriesInstanceUID in SeriesInstanceUID_list:
            SeriesInstanceUID_path = os.path.join(StudyInstanceUID_path, SeriesInstanceUID)
            print(SeriesInstanceUID_path)
            dcm_list = os.listdir(SeriesInstanceUID_path)
            # skip series with fewer than 50 slices
            if len(dcm_list) < 50:
                continue
            else:
                SeriesInstanceUID_path_list.append(SeriesInstanceUID_path)
    return SeriesInstanceUID_path_list
def parse_original_xmls(dirname):
    annotations = []
    SeriesInstanceUID_dir_list = find_all_seriesDir(dirname)
    print('SeriesInstanceUID_path_list:', SeriesInstanceUID_dir_list)
    for SeriesInstanceUID_dir in SeriesInstanceUID_dir_list:
        Slices_info = get_CT_info(SeriesInstanceUID_dir)
        logging.info("Reading annotations")
        xml_files = find_all_xmlFiles(SeriesInstanceUID_dir, '.xml')
        for xml_file in xml_files:
            annotations.append(parse_XML(xml_file, Slices_info))  # extract the full info from each xml
    return annotations
Parsing nodule and non-nodule information separately:
def parse_nodule(xml_node, Slices):  # xml_node is one unblindedReadNodule
    char_node = xml_node.find('nih:characteristics', NS)
    # if there are no characteristics, it is a small nodule, i.e. is_small = True
    is_small = (char_node is None or len(char_node) == 0)
    nodule = is_small and SmallNodule() or NormalNodule()
    nodule.id = xml_node.find('nih:noduleID', NS).text
    if not is_small:
        subtlety = char_node.find('nih:subtlety', NS)
        nodule.characteristics.subtlety = int(subtlety.text)
        nodule.characteristics.internal_struct = \
            int(char_node.find('nih:internalStructure', NS).text)
        nodule.characteristics.calcification = \
            int(char_node.find('nih:calcification', NS).text)
        nodule.characteristics.sphericity = \
            int(char_node.find('nih:sphericity', NS).text)
        nodule.characteristics.margin = \
            int(char_node.find('nih:margin', NS).text)
        nodule.characteristics.lobulation = \
            int(char_node.find('nih:lobulation', NS).text)
        nodule.characteristics.spiculation = \
            int(char_node.find('nih:spiculation', NS).text)
        nodule.characteristics.texture = \
            int(char_node.find('nih:texture', NS).text)
        nodule.characteristics.malignancy = \
            int(char_node.find('nih:malignancy', NS).text)
    xml_rois = xml_node.findall('nih:roi', NS)
    for xml_roi in xml_rois:
        roi = NoduleRoi()
        roi.z = float(xml_roi.find('nih:imageZposition', NS).text)  # z-axis position (mm)
        # 1-based slice index: distance from the first slice divided by the slice spacing
        roi.Instance_num = int(abs(roi.z - Slices[0].ImagePositionPatient[2]) / (
            abs(float(Slices[3].ImagePositionPatient[2]) - float(Slices[4].ImagePositionPatient[2])))) + 1
        roi.sop_uid = xml_roi.find('nih:imageSOP_UID', NS).text
        # when inclusion = TRUE -> the roi includes the whole nodule
        # when inclusion = FALSE -> the roi is drawn twice for one nodule:
        #   1. outside the nodule
        #   2. inside the nodule -> to indicate that the nodule has a donut
        #      hole (the inside hole is not part of the nodule); by forcing
        #      inclusion to be TRUE, this situation is ignored
        roi.inclusion = (xml_roi.find('nih:inclusion', NS).text == "TRUE")
        edge_maps = xml_roi.findall('nih:edgeMap', NS)
        for edge_map in edge_maps:
            x = int(edge_map.find('nih:xCoord', NS).text)
            y = int(edge_map.find('nih:yCoord', NS).text)
            roi.roi_xy.append([x, y])
        xmax = np.array(roi.roi_xy)[:, 0].max()
        xmin = np.array(roi.roi_xy)[:, 0].min()
        ymax = np.array(roi.roi_xy)[:, 1].max()
        ymin = np.array(roi.roi_xy)[:, 1].min()
        if not is_small:  # only for normal nodules
            roi.roi_centroid = ((xmax + xmin) / 2., (ymin + ymax) / 2.)  # center point
        nodule.rois.append(roi)
    return nodule  # equivalent to one unblindedReadNodule (xml element)
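The Instance_num arithmetic above maps a physical z position (in mm) back to a 1-based slice index, using the spacing between two adjacent slices. In isolation, with made-up slice positions and assuming uniform spacing:

```python
# Made-up slice z positions (mm), uniformly spaced 2.5 mm apart,
# mimicking ImagePositionPatient[2] of consecutive DICOM slices.
slice_z = [-100.0, -102.5, -105.0, -107.5, -110.0]

def z_to_instance_num(z, slice_z):
    """1-based slice index for an annotation's imageZposition."""
    spacing = abs(slice_z[3] - slice_z[4])   # same adjacent pair the parser uses
    return int(abs(z - slice_z[0]) / spacing) + 1

print(z_to_instance_num(-105.0, slice_z))  # 3
```

Note this only works when the series is uniformly spaced; matching on imageSOP_UID is the more robust way to locate the slice.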
def parse_non_nodule(xml_node):  # xml_node is one nonNodule
    nodule = NonNodule()
    nodule.id = xml_node.find('nih:nonNoduleID', NS).text
    roi = NoduleRoi()
    roi.z = float(xml_node.find('nih:imageZposition', NS).text)
    roi.sop_uid = xml_node.find('nih:imageSOP_UID', NS).text
    loci = xml_node.findall('nih:locus', NS)
    for locus in loci:
        x = int(locus.find('nih:xCoord', NS).text)
        y = int(locus.find('nih:yCoord', NS).text)
        roi.roi_xy.append([x, y])
    nodule.rois.append(roi)
    return nodule  # equivalent to one nonNodule (xml element)
Further flattening the parsed nodule information:
def flatten_nodule(nodules, type, result):
    # if not result:
    #     result = {'nodules': [], 'small_nodules': [], 'non_nodules': []}
    for nodule in nodules:
        print('nodule:', nodule)
        point = []
        for roi in nodule.rois:
            if type == 'nodules':
                tmp = {'pixels': roi.roi_xy, 'sop_uid': roi.sop_uid,
                       'sop_Instance_num': roi.Instance_num, 'nodule_id': nodule.id,
                       'nodule_malignancy': nodule.characteristics.malignancy,
                       'nodule_subtlety': nodule.characteristics.subtlety,
                       'nodule_internal_struct': nodule.characteristics.internal_struct,
                       'nodule_calcification': nodule.characteristics.calcification,
                       'nodule_sphericity': nodule.characteristics.sphericity,
                       'nodule_margin': nodule.characteristics.margin,
                       'nodule_lobulation': nodule.characteristics.lobulation,
                       'nodule_spiculation': nodule.characteristics.spiculation,
                       'nodule_texture': nodule.characteristics.texture
                       }
            else:
                tmp = {'pixels': roi.roi_xy, 'sop_uid': roi.sop_uid}
            point.append(tmp)
        result[type].append(point)
def flatten_annotation(annotation_dict):
    logging.info("Start flatten")
    # res = {}
    res = {'nodules': [], 'small_nodules': [], 'non_nodules': []}
    for annotations in annotation_dict:
        # annotations in each file
        for anno in annotations:
            flatten_nodule(anno.nodules, 'nodules', res)
            flatten_nodule(anno.small_nodules, 'small_nodules', res)
            flatten_nodule(anno.non_nodules, 'non_nodules', res)
    logging.info("Flatten complete")
    return res
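After flattening, res groups every ROI dict by category; each element of res['nodules'] is the list of per-slice ROI dicts for one nodule read. A hand-built example of the resulting shape (all values fabricated for illustration):

```python
# Hand-built illustration of the flattened structure; all values are
# made up to show the shape only, not taken from a real case.
res = {
    "nodules": [
        [  # one nodule annotation, outlined on two slices
            {"pixels": [[302, 311], [303, 312]], "sop_uid": "uid-slice-41",
             "sop_Instance_num": 41, "nodule_id": "Nodule 001",
             "nodule_malignancy": 5, "nodule_subtlety": 5},
            {"pixels": [[301, 310]], "sop_uid": "uid-slice-42",
             "sop_Instance_num": 42, "nodule_id": "Nodule 001",
             "nodule_malignancy": 5, "nodule_subtlety": 5},
        ],
    ],
    "small_nodules": [
        [{"pixels": [[120, 220]], "sop_uid": "uid-slice-10"}],
    ],
    "non_nodules": [],
}

print(len(res["nodules"][0]))  # number of annotated slices for the first nodule
```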
Getting the metadata of the DICOM series, as follows: