Solve : Reproduce unicode from text file??


Answer»

I'm writing a script which I want to support reading unicode paths from a text file. I want all the unicode to be reproduced the same as pasting the characters into DOS where it appears as a square-like character. (DOS has no problem dealing with unicode characters, they just can't be displayed and this is fine.)

Type will not reproduce unicode from a text file properly, regardless of the file's encoding and the current code page. For cannot process type's junky unicode output which makes it impossible to stick in a variable. For has no problem with pasted unicode as aforementioned...

So far:
I open cmd.exe /u so I can redirect a unicode path to a text file. This encodes it as UTF-16LE. Let's say I have the path containing the Japanese hiragana "ki". The path and UTF-16LE-encoded text file look like this:
"C:\Folder き"

Basically, I need a utility that works like type to read the text file and reproduce this string properly, as-is. I was told gawk might be able to do it, but I don't have a clue where to start with gawk.

Can anyone who knows vbscript accomplish this in vbscript? If it can't output the text properly to the command prompt but can at least send the lines properly as arguments to an executable, that would be okay. I found this (second half, Unicode to ASCII) but don't know how much it could help. I've spent hours on this scouring the web and I've found next to nothing. Help please! Thanks!

I am not sure I understand what you are trying to say in your description, but I do understand the question in the subject, so THAT'S what I am going to answer.

I think in order to turn an ASCII text file into a unicode text file, you simply need to rewrite the file with every 2nd byte a null character. I might be wrong. If I am not, this could probably be accomplished in almost any programming language that supports file IO. In VB you could allocate two byte arrays, one twice the size of the other. The file could then be read into the first array and then copied to every other byte in the second array. Then every 2nd byte in the 2nd array could be set to 0. Finally, the second array could be written back to the file or to a new file if necessary.

If it is confirmed that my reasoning is correct, then I would write the code for you. I am just not sure that this is the correct way to convert.

Thanks for responding. As you thought, you're not on the right track. Btw, Japanese characters overwrite that 00 padding, as part of unicode.
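To make the byte-level point concrete, here is a minimal Python 3 sketch. It is not from the thread, and the helper name pad_ascii_with_nulls is made up for illustration. It suggests that padding every second byte with 0 matches UTF-16LE only for pure ASCII text, while き (U+304D) occupies both bytes of its code unit.

```python
# Sketch only: compares the "every 2nd byte is 0" idea with real UTF-16LE.
# pad_ascii_with_nulls is a hypothetical helper, not something from the thread.

def pad_ascii_with_nulls(text):
    """Interleave a 0 byte after each ASCII byte; raises on non-ASCII input."""
    padded = bytearray()
    for byte in text.encode("ascii"):
        padded.append(byte)
        padded.append(0)
    return bytes(padded)

# For pure ASCII, null padding and UTF-16LE produce the same bytes.
print(pad_ascii_with_nulls("Folder") == "Folder".encode("utf-16-le"))  # True

# Hiragana ki is U+304D; in UTF-16LE neither of its two bytes is 0.
print("\u304d".encode("utf-16-le").hex())  # 4d30
```

The two bytes 4d 30 are the ASCII codes for "M" and "0", which is worth keeping in mind when looking at what type prints for this file.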

I have a text file encoded as UTF-16LE containing the following:
"C:\Folder き"

I want to type this file in a batch script to the same effect as typing set MYVAR="C:\Folder き" manually in command prompt, where the き character would appear as a square (which means the character is there, it just can't be displayed). But as I described, the type command only works properly with ANSI text files, and type-ing the text file within a for loop to set the output to a variable does not work: for fails. For works fine when the text is entered manually, as I described. That leads me to believe the problem is how type converts the characters for the current code page, which is apparently not what happens with manually entered text.

Again, I believe I just need a vbscript or something that can output the text file properly - in the same character coding context as doing it manually.

Edit: I should probably show what I'm describing after all, lol.

for /f "delims=" %%x in ('type unicode_path.txt') do set myvar=%%x

Does not do anything, because type is outputting sloppy ANSI/code-page interpreted unicode and for doesn't like it.

I think it would be a command prompt REVOLUTION of sorts if you could code a vbscript which works! This is a big issue! lol

I'm likely to reproduce my big script in PowerShell once it's done (I'll still complete it, unicode or not), but I'd like to exhaust any options first.

Is this sufficient? It just reads a text file into the command prompt. There is a way to make it take the file name as an argument, but I can't remember that part right now. You probably could have done that yourself, but I figured I would respond anyway. I might be able to come up with something better later if needed. I am using a Mac now.

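Code:
' Reads a text file line by line and echoes each line to the console.
' With no format argument, OpenTextFile reads the file as ASCII/ANSI.
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("c:\blablabla.txt", 1) ' 1 = ForReading
Do Until objFile.AtEndOfStream
    WScript.Echo objFile.ReadLine
Loop
objFile.Close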

EDIT: Be aware that you have to run this using cscript, not wscript, or it will just pop up a message box.

I don't know vbscript at all actually, it's one of those things I can read but can't write.

I only need to process one file, so it doesn't need to accept a file name for an argument, but it does need to work with relative paths so hopefully that's not a problem.

I tested it and it reads the text file much like type does, with bad encoding. The only difference is that I can manipulate the output in for, which is a good sign that it may just need some encoding work next.

There might be some vbscript clues in the link in my original post:

http://www.xtremedotnettalk.com/showpost.php?p=457931&postcount=7

Here is what they all output. UNI is the unicode file I'm working with. UTF is the unicode file converted to UTF-8. I also changed code pages from the default to UTF-8 just to see:

One more thing, if type utf.txt under the UTF-8 code page would work with for, that would almost solve the problem, except that I'd need a utility to convert the UTF-16LE file to UTF-8, and changing to code page 65001 makes batch scripts exit...
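For the conversion step mentioned above, a small Python 3 sketch might look like the following. This is an illustration under assumptions, not a tool from the thread: the function name utf16le_to_utf8 is hypothetical, and uni.txt / utf.txt are just the file names used in this post.

```python
import os
import tempfile

def utf16le_to_utf8(src_path, dst_path):
    """Rewrite a UTF-16LE text file as UTF-8 (hypothetical helper)."""
    with open(src_path, "rb") as src:
        raw = src.read()
    # "utf-16-le" fixes the byte order, so a missing BOM is not a problem.
    text = raw.decode("utf-16-le")
    # Drop a leading BOM if the file happens to have one.
    if text.startswith("\ufeff"):
        text = text[1:]
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))

if __name__ == "__main__":
    # Demo with temporary files standing in for uni.txt and utf.txt.
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "uni.txt")
    dst = os.path.join(workdir, "utf.txt")
    with open(src, "wb") as handle:
        handle.write('"C:\\Folder \u304d"\r\n'.encode("utf-16-le"))
    utf16le_to_utf8(src, dst)
    with open(dst, "rb") as handle:
        print(handle.read())
```

This only covers the file conversion. It does nothing about batch scripts exiting under code page 65001, which would still need a separate answer.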

I think I did it. I decided to program it in C because I thought this may be a useful tool for other people to use and it seems more official to make an exe tool. I know you said you didn't need it to take arguments, but I added one anyway just in case someone else needs it.

I've attached the exe and code below.

[recovering disk space - old attachment deleted by admin]

I just realized that what I posted might not work. Do you just want all the unicode characters removed from the string or do you want to replace them with a space or a box or what?

I will reply with new code when you confirm this. I have already put way too much time into this thread so bear with me until I can figure this out.

If you can use Python,
Code:
import codecs
import sys

# Python 2 syntax: open the file named on the command line as UTF-16
# and print it line by line.
inputfile = sys.argv[1]
f = codecs.open(inputfile, "r", "utf-16")
for line in f:
    print line

output
Code:
C:\test>more i.am.utf16.file.txt
here i am , utf-16 encoded file

C:\test>file i.am.utf16.file.txt
i.am.utf16.file.txt; Little-endian UTF-16 Unicode text, with no line terminators

C:\test>od -c i.am.utf16.file.txt
0000000 ■ h \0 e \0 r \0 e \0 \0 i \0 \0
0000020 a \0 m \0 \0 , \0 \0 u \0 t \0 f \0
0000040 - \0 1 \0 6 \0 \0 e \0 N \0 c \0 o \0
0000060 d \0 e \0 d \0 \0 f \0 i \0 l \0 e \0
0000100 \n \0
0000102

C:\test>python test.py i.am.utf16.file.txt
here i am , utf-16 encoded file

It's so much easier.

Quote from: Linux711 on September 08, 2010, 03:31:39 PM

I just realized that what I posted might not work. Do you just want all the unicode characters removed from the string or do you want to replace them with a space or a box or what?

I will reply with new code when you confirm this. I have already put way too much time into this thread so bear with me until I can figure this out.

Thanks for your help Linux but you're still misunderstanding what's required.

If you have Windows, open a command prompt window and paste this:

Code:
for /f "usebackq delims=" %x in ('"C:\Folder き"') do set myvar=%x

This sets myvar to "C:\Folder き". The き character is displayed as a square, but it is still read as き.

Do this to save a text file unicode.txt as UTF-16LE (Unicode) containing "C:\Folder き":

Code:
%ComSpec% /u /c echo:"C:\Folder き">unicode.txt

Then try this:

Code:
for /f "delims=" %x in ('type unicode.txt') do set myvar=%x

Does nothing because type fails. Then try:

Code:
type unicode.txt

You will see that this does not output "C:\Folder き", but instead:

Code:
 C : \ F o l d e r M0"

So I need something that will output "C:\Folder き".
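The spaced-out output with M0 at the end is what you would expect if each byte of the UTF-16LE file is shown as its own character. The Python 3 sketch below illustrates that reading; it is an assumption about what type is doing, not a confirmed description of its internals. The helper name show_as_single_bytes is hypothetical, and latin-1 merely stands in for whatever code page the console uses (every byte in this example is either ASCII or NUL, so the choice makes no difference here).

```python
# Sketch only: reinterpret UTF-16LE bytes one byte at a time.
# show_as_single_bytes is a hypothetical helper for illustration.

def show_as_single_bytes(text):
    """Encode as UTF-16LE, then display every byte as a separate character."""
    raw = text.encode("utf-16-le")
    # latin-1 maps each byte to exactly one character; NUL bytes are
    # replaced with spaces because a console typically shows them as blanks.
    return raw.decode("latin-1").replace("\x00", " ")

# \u304d is hiragana ki; its UTF-16LE bytes are 0x4D 0x30, i.e. "M" and "0".
print(show_as_single_bytes('"C:\\Folder \u304d"'))
```

Each ASCII character is followed by a blank (its 00 byte), and き turns into the two ordinary characters M and 0.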

ghostdog74: I'll try your code as well. Thanks!

ghostdog74, I installed Python 3.1.2 and saved your script as read.py.

My UTF-16LE file uni.txt contains:
"C:\Folder き"

Both files are inside the current directory.

This is my MS-DOS command prompt:

What am I doing wrong?

Quote from: orange_batch on September 08, 2010, 11:15:00 PM

What am I doing wrong?

If you are using Python 3.x, print is now a function rather than a statement, so add brackets.
Code:
import codecs
import sys

inputfile = sys.argv[1]
f = codecs.open(inputfile, "r", "utf-16")
for line in f:
    print(line)

And Python is particular about indentation, so indent properly.

I'm not 100% sure how much better it would be as an alternative but following your example, I used more < unicode.txt and I got this:

Code:
D:\>more < unicode.txt
"C:\Folder ?"

So I thought- maybe changing the codepage and THEN using this more to do it will work?

Code:
D:\>chcp 65001
Active code page: 65001

D:\>more < unicode.txt
Not enough memory.

That was certainly an unexpected result. I wasn't able to get anything useful out of it either. (I tried copy unicode.txt con as well, which gave the same result as the type command.) I found it promising that more wasn't displaying it spaced out and showed a question mark rather than the ASCII-i-fied unicode, but redirecting that to a text file, even when I used the /u switch to start cmd, still resulted in ASCII, with the double-byte character converted to a question mark.

Personally, I'd go with GhostDog's solution.

Quote from: orange_batch on September 08, 2010, 11:15:00 PM

ghostdog74, I installed Python 3.1.2 and saved your script as read.py.

My UTF-16LE file uni.txt contains:
"C:\Folder き"

Both files are inside the current directory.

This is my MS-DOS command prompt:

What am I doing wrong?

I think the codecs package/import is deprecated, or at least the codecs.open routine is, in python 3.1 and later.

It works in ActiveState Python 2.6.4... or, to be more precise, it runs (without a syntax error; not sure why you are getting that).

I do get this though:

Code:
D:\>python D:\test.py unicode.txt
Traceback (most recent call last):
  File "D:\test.py", line 5, in <module>
   for line in f:
  File "C:\Python\lib\codecs.py", line 679, in next
   return self.reader.next()
  File "C:\Python\lib\codecs.py", line 610, in next
   line = self.readline()
  File "C:\Python\lib\codecs.py", line 525, in readline
   data = self.read(readsize, firstline=True)
  File "C:\Python\lib\codecs.py", line 472, in read
   newchars, decodedbytes = self.decode(data, self.errors)
  File "C:\Python\lib\encodings\utf_16.py", line 90, in decode
   raise UnicodeError,"UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM
For some reason the command prompt doesn't write a byte order mark (BOM) when it redirects its unicode output. I did a little looking to see if there was a way to tell the open method to pretend there isn't a BOM and go with either LE or BE ordering, but my search was fruitless.

It appeared to work for ghostdog, so I assume we are both doing something incorrect: your version is too new (no idea whether this is really the case) and probably changed something, whereas I... well, I'm not sure what I did wrong. Perhaps I made the unicode file incorrectly.
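On the BOM question: besides the generic utf-16 codec, Python also ships utf-16-le and utf-16-be codecs, which have a fixed byte order and do not look for a BOM. The sketch below is written for Python 3 (the thread mixes 2.6 and 3.1), and the function name read_bomless_utf16le is made up for illustration. It only covers reading the file; whether the command prompt can display or pass on the result is a separate problem it does not address.

```python
import os
import tempfile

def read_bomless_utf16le(path):
    """Return the lines of a UTF-16LE file, with or without a BOM."""
    # Naming the byte order explicitly means no BOM is needed to detect it.
    with open(path, "r", encoding="utf-16-le") as handle:
        text = handle.read()
    # If a BOM is present anyway, it decodes to U+FEFF; drop it.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.splitlines()

if __name__ == "__main__":
    # Demo: a temporary file standing in for the redirected unicode.txt,
    # written without a BOM as described in the post above.
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "unicode.txt")
    with open(path, "wb") as handle:
        handle.write('"C:\\Folder \u304d"\r\n'.encode("utf-16-le"))
    lines = read_bomless_utf16le(path)
    # Print code points rather than the character so the demo does not
    # depend on the console being able to encode it.
    print([hex(ord(ch)) for ch in lines[0][-2:]])
```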

Updated. Same here BC.

BC, Orange, first of all, the cmd.exe shell does not support utf-16, so it's no use using it to run your script (although Python does process it normally at the back end). Secondly, I am using Python 2.6.x, so I am not sure about Python 3.1.x, but here's another method:

Code:
# Python 2 syntax: read the raw bytes, then decode them as UTF-16.
data = open("c:\\test\\file1", 'rb').read()
decoded = data.decode('utf-16')
print decoded

Try the above using the Python Windows editor (comes with the distribution) or some other platform that can display unicode.
I only know the tip of the iceberg of unicode, so for more information, check the docs:

1) Unicode how to
2) codecs module
3) search at stackoverflow

Well, the script itself is not unicode (DOS fails spectacularly with that), but it deals with path names which might contain unicode.

Microsoft knows some dirty secret about this unicode business.

Still not working ghostdog. But I'm not disheartened, because jeb on DosTips.com made me realize a fair enough workaround:

First of all, I'm only dealing with folder names, but this can be applied to files as well.

Forget the whole cmd /u thing; without it, the output contains ? in place of each unicode character, as usual. And ? matches any single character in a path, including unicode.

Retrieve the unicode in a folder name:
for /d %%x in ("C:\Folder ?") do set folder="%%x"

(or retrieve the unicode in a file name:)
for %%x in ("C:\file ?") do set file="%%x"

There are two problems however:

1. "C:\Folder ?" will match both "C:\Folder き" and "C:\Folder こ", etc.

2. for /r "C:\Folder ?"... does not work. Other things might not work either, without retrieving the unicode.
Solution: pushd "C:\Folder ?" then for /r with no "path" (processes current directory) then popd.

For the first problem, we can detect if it returns more than one result or not. If it does, have the user choose between which folders during run-time. This would have to verify each folder containing ? from the root folder to the most descendant folder. Parent folders can be filtered further by their descendants, unless the descendant folders exist in both parents, or something to that effect. I'm going to figure it out and work on a script for this.
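The ambiguity check described above could be sketched like this. Note that this is Python rather than the batch script being planned, so it only illustrates the logic. The names find_candidates and resolve_folder are hypothetical, and glob's ? is assumed to play the same role as the ? wildcard in for /d (glob also treats * and [ ] as special, which a real script would need to account for).

```python
import glob
import os
import tempfile

def find_candidates(pattern):
    """Return every existing folder matching a pattern that may contain ?."""
    return sorted(path for path in glob.glob(pattern) if os.path.isdir(path))

def resolve_folder(pattern, choose=None):
    """Return the one matching folder, or let `choose` pick when ambiguous.

    `choose` is a callback that receives the list of candidates and returns
    one of them, e.g. by prompting the user at run time.
    """
    candidates = find_candidates(pattern)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    if choose is None:
        raise ValueError("ambiguous pattern: %d folders match" % len(candidates))
    return choose(candidates)

if __name__ == "__main__":
    # Demo in a temporary directory; ASCII names keep it console-safe.
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "Folder A"))
    os.mkdir(os.path.join(root, "Folder B"))
    pattern = os.path.join(root, "Folder ?")
    print(len(find_candidates(pattern)), "folders match")
    print(os.path.basename(resolve_folder(pattern, choose=lambda c: c[0])))
```

For nested paths, the same check would have to be repeated at each level that contains a ?, from the root folder down.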

It's the best that can be done. If anyone who comes across this can manage a solution to the original problem, it would still be desired.
