| 1. |
Solve: Reproduce Unicode from text file? |
|
Answer» I'm writing a script that needs to read Unicode paths from a text file. I want the Unicode characters reproduced exactly as when they are pasted into the DOS prompt, where each appears as a square-like character. (DOS has no problem dealing with Unicode characters; it just can't display them, and that is fine.)

I just realized that what I posted might not work. Do you want all the Unicode characters removed from the string, or do you want them replaced with a space, a box, or something else?

Thanks for your help, Linux, but you're still misunderstanding what's required. If you have Windows, open a command prompt window and paste this:

Code:
for /f "usebackq delims=" %x in ('"C:\Folder き"') do set myvar=%x

This sets myvar to "C:\Folder き". The き character is displayed as a square, but it is still read as き. Next, run this to save a text file unicode.txt as UTF-16LE (Unicode) containing "C:\Folder き":

Code:
%ComSpec% /u /c echo:"C:\Folder き">unicode.txt

Then try this:

Code:
for /f "delims=" %x in ('type unicode.txt') do set myvar=%x

This does nothing, because type fails. Then try:

Code:
type unicode.txt

You will see that this does not output "C:\Folder き", but instead:

Code:
C : \ F o l d e r M0"

So I need something that will output "C:\Folder き". ghostdog74: I'll try your code as well. Thanks!

ghostdog74, I installed Python 3.1.2 and saved your script as read.py. My UTF-16LE file uni.txt contains: "C:\Folder き". Both files are inside the current directory. This is my MS-DOS command prompt: What am I doing wrong?

Quote from: orange_batch on September 08, 2010, 11:15:00 PM
What am I doing wrong?

If you are using Python 3.1 or later, the print statement is now a function, so add brackets.
Code:
import codecs
import sys

# Path of the UTF-16 text file, passed on the command line
inputfile = sys.argv[1]
f = codecs.open(inputfile, "r", "utf-16")
for line in f:
    print(line)

And Python is particular about indentation, so indent properly.

I'm not 100% sure how much better it would be as an alternative, but following your example, I used more < unicode.txt and got this:

Code:
D:\>more < unicode.txt
"C:\Folder ?"

So I thought: maybe changing the code page and THEN using more will work?

Code:
D:\>chcp 65001
Active code page: 65001
D:\>more < unicode.txt
Not enough memory.

That was certainly an unexpected result. I wasn't able to get anything useful out of it either. (I tried copy unicode.txt con as well, which gave the same result as the type command.) I found it promising that more wasn't displaying the text spaced out and showed a question mark rather than the ASCII-fied Unicode, but redirecting that to a text file, even when I started cmd with the /u switch, still resulted in ASCII, with the double-byte character converted to a question mark. Personally, I'd go with GhostDog's solution.

Quote from: orange_batch on September 08, 2010, 11:15:00 PM
ghostdog74, I installed Python 3.1.2 and saved your script as read.py.

I think the codecs package/import is deprecated, or at least the codecs.open routine is, in Python 3.1 and later. It works in ActiveState Python 2.6.4... or, to be more precise, it runs.
(without a syntax error; I'm not sure why you are getting that). I do get this, though:

Code:
D:\>python D:\test.py unicode.txt
Traceback (most recent call last):
  File "D:\test.py", line 5, in <module>
    for line in f:
  File "C:\Python\lib\codecs.py", line 679, in next
    return self.reader.next()
  File "C:\Python\lib\codecs.py", line 610, in next
    line = self.readline()
  File "C:\Python\lib\codecs.py", line 525, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python\lib\codecs.py", line 472, in read
    newchars, decodedbytes = self.decode(data, self.errors)
  File "C:\Python\lib\encodings\utf_16.py", line 90, in decode
    raise UnicodeError,"UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM

For some reason, the command prompt doesn't redirect its Unicode output with a byte order mark (BOM). I did a little looking to see if there was a way to tell the open method to pretend there isn't a BOM and go with either LE or BE ordering, but my search was fruitless. It appeared to work for ghostdog, so I assume we are both doing something incorrect: your version is too new (no idea whether this is really the case) and probably changed something, whereas I... well, I'm not sure what I did wrong. Perhaps I made the Unicode file incorrectly.

Updated. Same here, BC.

BC, Orange, first of all, the cmd.exe shell does not support UTF-16, so it's no use using it to run your script (although Python does process it normally at the back end). Secondly, I am using Python 2.6.x, so I am not sure about Python 3.1.x. But here's another method:

Code:
# Python 2.6 syntax: read the raw bytes, then decode them as UTF-16
data = open("c:\\test\\file1", 'rb').read()
decoded = data.decode('utf-16')
print decoded

Try the above using the Python Windows editor (comes with the distribution) or some other platform that can display Unicode.
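On the BOM error in the traceback above: Python's codec registry also has explicit-endianness codecs, "utf-16-le" and "utf-16-be", which do not expect a BOM. The following is only a minimal sketch, assuming Python 3 and assuming that a file without a BOM is little-endian (as the thread describes for cmd /u output). The helper name read_utf16_lines is made up for illustration.

```python
import codecs


def read_utf16_lines(path):
    """Return the lines of a UTF-16 text file, with or without a BOM.

    Hypothetical helper. Assumption: when no BOM is present, the data is
    little-endian, as described in the thread for `cmd /u` redirection.
    """
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # BOM present: the generic codec reads the byte order from it and strips it
        text = raw.decode("utf-16")
    else:
        # No BOM: state the byte order explicitly instead of letting the codec guess
        text = raw.decode("utf-16-le")
    return text.splitlines()
```

Decoding is only half of the problem discussed here. Depending on the Python version and console settings, printing the decoded string in a command prompt may still fail or show a placeholder character.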
I only know the tip of the iceberg of Unicode, so for more information, check the docs:
1) Unicode HOWTO
2) codecs module
3) search at Stack Overflow

Well, the script itself is not Unicode (DOS fails spectacularly with that), but it deals with path names which might contain Unicode. Microsoft knows some dirty secret about this Unicode business. Still not working, ghostdog. But I'm not disheartened, because jeb on DosTips.com made me realize a fair enough workaround.

First of all, I'm only dealing with folder names, but this can be applied to files as well. Forget the whole cmd /u thing, and it outputs ? in place of Unicode as usual. ? matches any single character in a path, including Unicode.

Retrieve the Unicode in a folder name:
for /d %%x in ("C:\Folder ?") do set folder="%%x"

(Or retrieve the Unicode in a file name:)
for %%x in ("C:\file ?") do set file="%%x"

There are two problems, however:
1. "C:\Folder ?" will match both "C:\Folder き" and "C:\Folder こ", etc.
2. for /r "C:\Folder ?"... does not work. Other things might not work either, without retrieving the Unicode.

Solution for the second problem: pushd "C:\Folder ?", then for /r with no path (it processes the current directory), then popd.

For the first problem, we can detect whether it returns more than one result. If it does, have the user choose between the folders at run time. This would have to verify each folder containing ? from the root folder to the most descendant folder. Parent folders can be filtered further by their descendants, unless the descendant folders exist in both parents, or something to that effect. I'm going to figure it out and work on a script for this. It's the best that can be done.

If anyone who comes across this can manage a solution to the original problem, it would still be desired. |
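The ambiguity check described above (does a path containing ? resolve to exactly one folder?) can be sketched in Python rather than batch. This is an illustration under stated assumptions, not the script the poster planned to write: the function names are hypothetical, and it assumes the pattern contains no other glob-special characters such as [ or *. Because glob.glob expands wildcards in every path component, a pattern with ? in both a parent and a child folder is narrowed by the descendants that actually exist.

```python
import glob
import os


def resolve_placeholder_path(pattern):
    """Return all existing directories matching `pattern`.

    Each ? in the pattern stands for one unknown (e.g. undisplayable) character.
    Hypothetical helper; assumes no other glob-special characters in the path.
    """
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def pick_unique(pattern):
    """Return the single matching directory, or raise if there is none or several."""
    matches = resolve_placeholder_path(pattern)
    if not matches:
        raise FileNotFoundError("no folder matches " + pattern)
    if len(matches) > 1:
        # The caller could catch this and ask the user to choose, as suggested above
        raise LookupError("ambiguous pattern, %d folders match" % len(matches))
    return matches[0]
```

A caller would try pick_unique(r"C:\Folder ?") first and fall back to listing resolve_placeholder_path(...) for the user only when LookupError is raised.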
|