Hi everyone, I am trying to build an English-Chinese bilingual corpus. I have downloaded about 1000 web pages containing English news and its Chinese translations. I have saved them as .xml files, which read like this:
- - - - gate.SourceURL http://www.freexinwen.com/chinese/eng/news_bilingual/news/news_010107.asp - MimeType text/html - - 看新闻学英语 01-01-07 亚洲领导人迎2007承诺促进国际合作 (Asian Leaders Usher in 2007 with Pledges for International Cooperation) 几个亚洲国家政府领导人在2007到来之际承诺促进国际合作并发表政策讲话。 在日本,安倍晋三首相承诺要改进和中国及韩国的关系。日本与这两国的关系近年来出现紧张,原因是它们认为日本歪曲了上世纪前半叶日本军国主义的历史。 在中国,国家主席胡锦涛承诺将建设“和谐社会”,并和其它国家共同解决减少污染和促进经济增长等其它问题。胡锦涛还强调将寻求和台湾实现和平统一。台湾总统陈水扁表示, 台湾的主权由台湾人民决定。他宣称台湾“绝对不属于”中国。 Several Asian government leaders have welcomed 2007 with new pledges for international cooperation and policy speeches. In Japan, Prime Minister Shinzo Abe pledged to improve relations with China and South Korea. Japan's relationship with those countries has been strained in recent years because of their perception that Japan misrepresents the history of its military expansionism in the first half of the last century. In China, President Hu Jintao pledged to pursue social harmony at home, and to work with other countries to address common issues, such as pollution and economic growth. Mr. Hu also stressed that he intends to pursue peaceful reunification with Taiwan. Taiwan's President Chen Shui-bian said the island's sovereignty was a matter for its people. He declared that the island "definitely does not belong" to China.
How can I extract only the Chinese and English text from the 1000 .xml files and convert them into individual .txt files? Many thanks!

Please don't post the same question more than once. Thank you. I deleted the duplicate post.

I am sorry, admin. I won't do it anymore. Could anyone offer me some tips on how to do it?

Welcome to the CH forums.
Here's a small trial script which should get you going. I don't know how to actually display Chinese characters and English text from the same .txt file, though; there's got to be a font change in there somewhere.
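Code:
@echo off
cls
setlocal
rem Start with an empty output file.
type nul > trial.txt
rem Copy every line of trial.xml that contains neither "<" nor ">" into trial.txt.
for /f "tokens=*" %%f in ('dir /b trial.xml') do (
    findstr /v /c:"<" /c:">" "%%f" >> trial.txt
)
type trial.txt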
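If the batch route gets stuck on the Chinese text, here is a rough alternative sketch in Python rather than batch. It is only a sketch under several assumptions: the files are well-formed XML in an encoding the parser can read, the body text sits under a `TextWithNodes` element as in GATE-style documents (with a fallback to the whole file if it does not), and each line of text is either Chinese or English. The function names and the `.zh.txt` / `.en.txt` output names are made up for illustration, and the "contains a CJK character" check is a simple heuristic, not real language detection.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Heuristic: the basic CJK Unified Ideographs block, not a full language detector.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def extract_text(xml_path):
    """Return the document text with all XML tags stripped."""
    root = ET.parse(xml_path).getroot()
    # Assumption: GATE-style files keep the body under <TextWithNodes>;
    # fall back to the whole document if that element is absent.
    body = root.find(".//TextWithNodes")
    if body is None:
        body = root
    return "".join(body.itertext())


def split_languages(text):
    """Sort non-empty lines into (chinese, english) lists."""
    chinese, english = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if CJK.search(line):
            chinese.append(line)  # lines mixing both languages land here too
        elif LATIN.search(line):
            english.append(line)
        # lines with neither (e.g. bare numbers) are dropped
    return chinese, english


def convert_folder(folder):
    """Write <name>.zh.txt and <name>.en.txt next to each .xml file."""
    for xml_path in sorted(Path(folder).glob("*.xml")):
        chinese, english = split_languages(extract_text(xml_path))
        xml_path.with_suffix(".zh.txt").write_text(
            "\n".join(chinese), encoding="utf-8")
        xml_path.with_suffix(".en.txt").write_text(
            "\n".join(english), encoding="utf-8")
```

Calling `convert_folder` on the folder that holds the .xml files (for example a hypothetical `C:/corpus`) would write two UTF-8 text files per page. Lines that mix both languages, such as the headline with its English title in brackets, end up in the Chinese file, so the output would still need a manual check before being used as an aligned corpus.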