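Batch-converting the Total-Text annotation format
The Total-Text annotation format differs from that of CTW-1500 and the ICDAR family, which store coordinates directly in .txt files. Total-Text annotations look like the following, taking the GT of one image as an example: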
x: [[115 503 494 115]], y: [[322 346 426 404]], ornt: [u'm'], transcriptions: [u'nauGHTY']
x: [[734 1058 1061 744]], y: [[360 369 449 430]], ornt: [u'm'], transcriptions: [u'NURIS']
x: [[558 682 682 557]], y: [[370 375 404 398]], ornt: [u'm'], transcriptions: [u'NURIS']
x: [[562 595 651 687 653 637 604 588]], y: [[347 304 305 360 366 334 332 361]], ornt: [u'c'], transcriptions: [u'nauGHTY']
x: [[603 632 630 603]], y: [[408 413 426 423]], ornt: [u'h'], transcriptions: [u'EST']
x: [[599 638 637 596]], y: [[419 422 441 437]], ornt: [u'h'], transcriptions: [u'1996']
x: [[583 602 633 656 679 648 594 558]], y: [[410 445 445 411 428 476 472 432]], ornt: [u'c'], transcriptions: [u'warunG']
x: [[543 583 660 701 691 653 592 557]], y: [[347 288 288 347 358 308 302 355]], ornt: [u'#'], transcriptions: [u'#']
x: [[557 580 640 683 698 649 583 537]], y: [[419 470 481 422 432 497 491 432]], ornt: [u'#'], transcriptions: [u'#']
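Each line stores, in order, all the x coordinates, all the y coordinates, the text orientation, and the transcription of the text instance.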
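To make the four fields concrete, here is a minimal sketch that splits one GT line into a dictionary. The helper name `parse_gt_line` and the regular expression are illustrative assumptions, not part of the original post. The pattern assumes the exact layout shown above and a transcription without an embedded single quote.

```python
import re

# Hypothetical pattern: matches the layout of the sample GT lines above.
# Assumes the transcription contains no single quote.
LINE_RE = re.compile(
    r"x:\s*\[\[(?P<x>[\d\s]+)\]\],\s*"
    r"y:\s*\[\[(?P<y>[\d\s]+)\]\],\s*"
    r"ornt:\s*\[u?'(?P<ornt>[^']*)'\],\s*"
    r"transcriptions:\s*\[u?'(?P<text>[^']*)'\]"
)

def parse_gt_line(line):
    """Split one Total-Text GT line into x list, y list, orientation, text."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("unrecognized GT line: " + line)
    return {
        "x": [int(v) for v in m.group("x").split()],
        "y": [int(v) for v in m.group("y").split()],
        "ornt": m.group("ornt"),
        "text": m.group("text"),
    }

sample = "x: [[603 632 630 603]], y: [[408 413 426 423]], ornt: [u'h'], transcriptions: [u'EST']"
record = parse_gt_line(sample)
print(record)
```

Note that the i-th entry of `x` pairs with the i-th entry of `y`, which is what the conversion to the CTW-1500 layout relies on.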
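CTW-1500 and ICDAR2015, by contrast, give the coordinates directly in the form (x1,y1,x2,y2,x3,y3,x4,y4…) (in ICDAR15 a trailing ### additionally marks a region to be ignored). The task here is therefore to batch-convert the Total-Text annotations into the CTW-1500 style.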
Here is the code:
# Regular expression library
import re
import os

root_path = './'
label_dir = os.path.join(root_path, 'train_rename_totaltext_labels_sqfree')

# File names without extension, e.g. 'img1' for 'img1.txt'
_indexes = sorted([f.split('.')[0] for f in os.listdir(label_dir)])

for index in _indexes:
    print('Processing: ' + index)
    anno_file = os.path.join(label_dir, index + '.txt')
    with open(anno_file, 'r') as f:
        # lines holds the non-empty lines of the file
        lines = [line for line in f.readlines() if line.strip()]
    all_list = []
    for i, line in enumerate(lines):
        # if i == 0:
        #     continue
        # parts holds the comma-separated fields of one line:
        # x, y, ornt, transcriptions
        parts = line.strip().split(',')
        xy_list = []
        for a, part in enumerate(parts):
            # Only the first two fields (x and y) contain coordinates
            if a > 1:
                break
            numberlist = re.findall(r'\d+', part)
            xy_list.extend(numberlist)
        # First half of xy_list is all x values, second half all y values
        n = len(xy_list) // 2
        x_list = xy_list[:n]
        y_list = xy_list[n:]
        # Interleave into x1, y1, x2, y2, ...
        single_list = [None] * (len(x_list) + len(y_list))
        single_list[::2] = x_list
        single_list[1::2] = y_list
        all_list.append(single_list)
    # Overwrite the annotation file in place with the converted lines
    with open(anno_file, 'w') as w:
        for all_list_piece in all_list:
            for string in all_list_piece:
                w.write(string)
                w.write(',')
            w.write('\n')
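The script above rewrites each annotation file in place, so running it a second time on files that are already converted would keep only the first two numbers of every line. As a hedged alternative, the sketch below writes the converted files to a separate directory and leaves the originals untouched. The function names `convert_line` and `convert_dir` and the directory names in the demo are assumptions made for illustration.

```python
import os
import re
import tempfile

def convert_line(line):
    """Turn one Total-Text GT line into 'x1,y1,x2,y2,...,' (trailing comma kept)."""
    # Only the first two comma-separated fields (x and y) hold coordinates.
    x_part, y_part = line.strip().split(',')[:2]
    xs = re.findall(r'\d+', x_part)
    ys = re.findall(r'\d+', y_part)
    return ''.join(x + ',' + y + ',' for x, y in zip(xs, ys))

def convert_dir(src_dir, dst_dir):
    """Convert every .txt file in src_dir and write the result to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(src_dir, name), 'r') as f:
            lines = [l for l in f if l.strip()]
        with open(os.path.join(dst_dir, name), 'w') as w:
            for l in lines:
                w.write(convert_line(l) + '\n')

# Self-contained demo on a throwaway directory (paths are made up).
demo_root = tempfile.mkdtemp()
src = os.path.join(demo_root, 'labels')
dst = os.path.join(demo_root, 'labels_ctw')
os.makedirs(src)
with open(os.path.join(src, 'img1.txt'), 'w') as f:
    f.write("x: [[603 632 630 603]], y: [[408 413 426 423]], "
            "ornt: [u'h'], transcriptions: [u'EST']\n")
convert_dir(src, dst)
with open(os.path.join(dst, 'img1.txt')) as f:
    converted = f.read()
print(converted)
```

Keeping the source directory read-only makes the conversion safe to repeat, at the cost of a second copy of the labels.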
After running it, the annotation shown above becomes:
115,322,503,346,494,426,115,404,
734,360,1058,369,1061,449,744,430,
558,370,682,375,682,404,557,398,
562,347,595,304,651,305,687,360,653,366,637,334,604,332,588,361,
603,408,632,413,630,426,603,423,
599,419,638,422,637,441,596,437,
583,410,602,445,633,445,656,411,679,428,648,476,594,472,558,432,
543,347,583,288,660,288,701,347,691,358,653,308,592,302,557,355,
557,419,580,470,640,481,683,422,698,432,649,497,583,491,537,432,
This makes the subsequent processing much more convenient.
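As an illustration of that later processing, a converted line can be read back into a list of (x, y) points with a few lines of Python. This is only a sketch; `read_polygon` is a hypothetical helper, and it assumes the trailing-comma layout shown above.

```python
def read_polygon(line):
    """Parse 'x1,y1,x2,y2,...,' into a list of (x, y) integer tuples."""
    # The trailing comma yields an empty last field, which is skipped here.
    values = [int(v) for v in line.strip().split(',') if v]
    if len(values) % 2 != 0:
        raise ValueError("odd number of coordinates in: " + line)
    return list(zip(values[0::2], values[1::2]))

points = read_polygon("603,408,632,413,630,426,603,423,")
print(points)
```

Because the number of points is taken from the line itself, the same helper handles both the 4-point and the 8-point polygons in the example.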
