Solve : Reproduce unicode from text file?? – InterviewSolution

<html><body><p>I'm writing a script that needs to read unicode paths from a text file. I want the unicode reproduced exactly as it is when the characters are pasted into DOS, where each one appears as a square-like character. (DOS has no problem dealing with unicode characters; they just can't be displayed, and this is fine.)<br/><br/><strong>Type</strong> will <em>not</em> reproduce unicode from a text file properly, regardless of the file's encoding and the current code page. <strong>For</strong> cannot process <strong>type</strong>'s junky unicode output, which makes it impossible to store in a variable. <strong>For</strong> has no problem with pasted unicode, as mentioned above.<br/><br/>So far:<br/>I open <strong>cmd.exe /u</strong> so I can redirect a unicode path to a text file. This encodes it as <strong>UTF-16LE</strong>. Let's say I have a path containing the Japanese hiragana "ki". The path and the UTF-16LE-encoded text file look like this:<br/>"C:\Folder き"<br/><br/>Basically, I need a utility that works like <strong>type</strong> to read the text file and reproduce this string properly, as-is. I was told gawk might be able to do it, but I don't have a clue where to start with gawk.<br/><br/>Can anyone who knows vbscript accomplish this in vbscript? If it can't output the text properly to the command prompt, it would be okay if it could at least pass the lines properly as arguments to an executable. I found <a href="http://www.xtremedotnettalk.com/showpost.php?p=457931&postcount=7">this</a> (second half, Unicode to ASCII) but don't know how much it could help. I've spent hours scouring the web on this and found next to nothing. Help please! 
Thanks!<br/><br/>I am not sure I understand what you are trying to say in your description, but I do understand the question in the subject, so that's what I am going to answer.<br/><br/>I think that in order to turn an ASCII text file into a unicode text file, you simply need to rewrite the file with every 2nd byte a null character. I might be wrong. If I am not, this could probably be accomplished in almost any programming language that supports file I/O. In VB you could allocate two byte arrays, one of them twice the size of the other. The file could then be read into the first array and copied to every other byte in the second array. Then every 2nd byte in the second array could be set to 0. Finally, the second array could be written back to the file, or to a new file if necessary.<br/><br/>If it is confirmed that my reasoning is correct, then I will write the code for you. I am just not sure that this is the correct way to convert.<br/><br/>Thanks for responding. As you suspected, you're not on the right track. Btw, for Japanese characters that 00 padding byte is replaced by part of the character's code, as part of unicode.<br/><br/>I have a text file encoded as UTF-16LE containing the following:<br/>"C:\Folder き"<br/><br/>I want to <strong>type</strong> this file in a batch script to the same effect as typing <strong>set myvar="C:\Folder き"</strong> manually in command prompt, where the き character would appear as a square (which means the character is there, it just can't be displayed). 
But as I described, the <strong>type</strong> command only works properly with ANSI text files, and <strong>type</strong>-ing the text file within a <strong>for</strong> loop to set the output to a variable does not work: <strong>for</strong> fails. <strong>For</strong> works fine when entering the text manually, as I described. This leads me to believe the problem is how <strong>type</strong> outputs the characters based on the current code page, which is apparently not the same.<br/><br/>Again, I believe I just need a vbscript or something that can output the text file properly, in the same character encoding context as doing it manually.<br/><br/>Edit: I should probably show what I'm describing after all, lol.<br/><br/>for /f "delims=" %%x in ('type unicode_path.txt') do set myvar=%%x<br/><br/>This does not do anything, because <strong>type</strong> is outputting sloppy ANSI/code-page-interpreted unicode and <strong>for</strong> doesn't like it.<br/><br/>I think it would be a command prompt revolution of sorts if you could code a vbscript which works! This is a big issue! lol<br/><br/>I'm likely to reproduce my big script in PowerShell once it's done (I'll still complete it, unicode or not), but I'd like to exhaust any options first.<br/><br/>Is this sufficient? It just reads a text file into the command prompt. There is a way to make it take the file name as an argument, but I can't remember that part right now. You probably could have done that yourself, but I figured I would respond anyway. I might be able to come up with something better later if needed. 
I am using a Mac now.<br/><br/> Code:<br/>Set objFSO = CreateObject("Scripting.FileSystemObject")<br/>Set objFile = objFSO.OpenTextFile("c:\blablabla.txt", 1)<br/>Do Until objFile.AtEndOfStream<br/>Wscript.Echo objFile.ReadLine<br/>Loop<br/>objFile.Close<br/><br/>EDIT: Be aware that you have to run this using cscript, not wscript, or it will just pop up a message box.<br/><br/>I don't know vbscript at all actually; it's one of those things I can read but can't write.<br/><br/>I only need to process one file, so it doesn't need to accept a file name as an argument, but it does need to work with relative paths, so hopefully that's not a problem.<br/><br/>I tested it and it reads the text file much like <strong>type</strong> does, with bad encoding. The only difference is that I can manipulate the output in <strong>for</strong>, so that's a good sign that it may just need some encoding work next.<br/><br/>There might be some vbscript clues in the link in my original post:<br/><br/><a href="http://www.xtremedotnettalk.com/showpost.php?p=457931&postcount=7">http://www.xtremedotnettalk.com/showpost.php?p=457931&postcount=7</a><br/><br/>Here is what they all output. UNI is the unicode file I'm working with. UTF is the unicode file converted to UTF-8. I also changed code pages from the default to UTF-8 just to see:<br/><br/><br/><br/>One more thing: if <strong>type utf.txt</strong> under the UTF-8 code page worked with <strong>for</strong>, that would almost solve the problem, except that I'd need a utility to convert the UTF-16LE file to UTF-8, and changing to code page 65001 makes batch scripts exit... <br/><br/>I think I did it. I decided to program it in C because I thought this may be a useful tool for other people to use, and it seems more official to make an exe tool. 
I know you said you didn't need it to take arguments, but I added one anyway just in case someone else needs it.<br/><br/>I've attached the exe and code below.<br/><br/>[recovering disk space - old attachment deleted by admin]<br/><br/>I just realized that what I posted might not work. Do you just want all the unicode characters removed from the string, or do you want to replace them with a space or a box or what?<br/><br/>I will reply with new code when you confirm this. I have already put way too much time into this thread, so bear with me until I can figure this out.<br/><br/>If you can use Python, <br/> Code:<br/>import codecs<br/>import sys<br/>inputfile=sys.argv[1]<br/>f = codecs.open( inputfile, "r", "utf-16" )<br/>for line in f:<br/> print line<br/><br/>output<br/> Code:<br/>C:\test>more i.am.utf16.file.txt<br/>here i am , utf-16 encoded file<br/><br/>C:\test>file i.am.utf16.file.txt<br/>i.am.utf16.file.txt; Little-endian UTF-16 Unicode text, with no line terminators<br/><br/>C:\test>od -c i.am.utf16.file.txt<br/>0000000 ■ h \0 e \0 r \0 e \0 \0 i \0 \0<br/>0000020 a \0 m \0 \0 , \0 \0 u \0 t \0 f \0<br/>0000040 - \0 1 \0 6 \0 \0 e \0 n \0 c \0 o \0<br/>0000060 d \0 e \0 d \0 \0 f \0 i \0 l \0 e \0<br/>0000100 \n \0<br/>0000102<br/><br/><br/>C:\test>python test.py i.am.utf16.file.txt<br/>here i am , utf-16 encoded file<br/><br/><br/>It's so much easier.<br/><br/>Quote from: Linux711 on September 08, 2010, 03:31:39 PM</p><blockquote>I just realized that what I posted might not work. Do you just want all the unicode characters removed from the string, or do you want to replace them with a space or a box or what?<br/><br/>I will reply with new code when you confirm this. 
I have already put way too much time into this thread, so bear with me until I can figure this out.<br/></blockquote> <br/>Thanks for your help, Linux, but you're still misunderstanding what's required.<br/><br/>If you have Windows, open a command prompt window and paste this:<br/><br/> Code:<br/>for /f "usebackq delims=" %x in ('"C:\Folder き"') do set myvar=%x<br/>This sets myvar to "C:\Folder き". The き character is displayed as a square, but it is still read as き.<br/><br/>Do this to save a text file unicode.txt as UTF-16LE (Unicode) containing "C:\Folder き":<br/><br/> Code:<br/>%ComSpec% /u /c echo:"C:\Folder き">unicode.txt<br/>Then try this:<br/><br/> Code:<br/>for /f "delims=" %x in ('type unicode.txt') do set myvar=%x<br/>This does nothing because type fails. Then try:<br/><br/> Code:<br/>type unicode.txt<br/>You will see that this does not output "C:\Folder き", but instead:<br/><br/> Code:<br/> C : \ F o l d e r M0"<br/>So I need something that will output "C:\Folder き".<br/><br/>ghostdog74: I'll try your code as well. Thanks!<br/><br/>ghostdog74, I installed Python 3.1.2 and saved your script as read.py.<br/><br/>My UTF-16LE file uni.txt contains:<br/>"C:\Folder き"<br/><br/>Both files are inside the current directory.<br/><br/>This is my MS-DOS command prompt:<br/><br/><br/><br/>What am I doing wrong? 
Quote from: orange_batch on September 08, 2010, 11:15:00 PM<blockquote>What am I doing wrong?<br/></blockquote> If you are using Python 3.1 or later, print is a function rather than a statement, so add parentheses.<br/> Code:<br/>import codecs<br/>import sys<br/>inputfile=sys.argv[1]<br/>f = codecs.open( inputfile, "r", "utf-16" )<br/>for line in f:<br/> print(line)<br/>And Python is particular about indentation, so indent properly.<br/><br/>I'm not 100% sure how much better it would be as an alternative, but following your example, I used more < unicode.txt and I got this:<br/><br/> Code:<br/>D:\>more < unicode.txt<br/>"C:\Folder ?"<br/><br/>So I thought: maybe changing the code page and THEN using more will work?<br/><br/> Code:<br/>D:\>chcp 65001<br/>Active code page: 65001<br/><br/>D:\>more < unicode.txt<br/>Not enough memory.<br/><br/>That was certainly an unexpected result. I wasn't able to get anything useful out of it either. (I tried copy unicode.txt con as well, which gave the same result as the type command.) I found it promising that more wasn't displaying the text spaced out and showed a question mark rather than the ASCII-fied unicode, but redirecting that to a text file, even when I used the /u switch to start cmd, still resulted in ASCII, with the double-byte character converted to a question mark.<br/><br/>Personally, I'd go with GhostDog's solution.<br/><br/><br/> Quote from: orange_batch on September 08, 2010, 11:15:00 PM<blockquote>ghostdog74, I installed Python 3.1.2 and saved your script as read.py.<br/><br/>My UTF-16LE file uni.txt contains:<br/>"C:\Folder き"<br/><br/>Both files are inside the current directory.<br/><br/>This is my MS-DOS command prompt:<br/><br/><br/><br/>What am I doing wrong?<br/></blockquote> I think the codecs package/import is deprecated, or at least the codecs.open routine is, in Python 3.1 and later.<br/><br/>It works in ActiveState Python 2.6.4... or, to be more precise, it runs. 
(Without a syntax error; I'm not sure why you are getting that.)<br/><br/>I do get this though:<br/><br/> Code:<br/>D:\>python D:\test.py unicode.txt<br/>Traceback (most recent call last):<br/> File "D:\test.py", line 5, in &lt;module&gt;<br/> for line in f:<br/> File "C:\Python\lib\codecs.py", line 679, in next<br/> return self.reader.next()<br/> File "C:\Python\lib\codecs.py", line 610, in next<br/> line = self.readline()<br/> File "C:\Python\lib\codecs.py", line 525, in readline<br/> data = self.read(readsize, firstline=True)<br/> File "C:\Python\lib\codecs.py", line 472, in read<br/> newchars, decodedbytes = self.decode(data, self.errors)<br/> File "C:\Python\lib\encodings\utf_16.py", line 90, in decode<br/> raise UnicodeError,"UTF-16 stream does not start with BOM"<br/>UnicodeError: UTF-16 stream does not start with BOM<br/>For some reason the command prompt doesn't write a byte order mark when it redirects its unicode output. I did a little looking to see if there was a way to tell the open method to assume there is no BOM and go with either LE or BE ordering, but my search was fruitless.<br/><br/>It appeared to work for ghostdog, so I assume we are both doing something incorrect: your version is too new (no idea whether this is really the case) and probably changed something, whereas I... well, I'm not sure what I did wrong. Perhaps I made the unicode file incorrectly.<br/><br/>Updated. Same here BC.<br/><br/>BC, Orange, first of all, the cmd.exe shell does not support UTF-16, so it's no use using it to run your script (although Python does process it normally at the back end). Secondly, I am using Python 2.6.x, 
so I am not sure about Python 3.1.x, but here's another method:<br/><br/> Code:<br/>data = open("c:\\test\\file1", 'rb').read()<br/>decoded = data.decode('utf-16')<br/>print decoded<br/><br/>Try the above using the Python Windows editor (comes with the distribution) or some other platform that can display unicode.<br/>I only know the tip of the iceberg of unicode, so for more information, check the docs:<br/><br/>1) <a href="https://docs.python.org/howto/unicode.html">Unicode</a> how-to <br/>2) <a href="https://docs.python.org/library/codecs.html">codecs</a> module<br/>3) search at <a href="https://stackoverflow.com/search?q=python++utf16+">stackoverflow</a><br/><br/>Well, the script itself is not unicode (DOS fails spectacularly with that), but it deals with path names which might contain unicode.<br/><br/>Microsoft knows some dirty secret about this unicode business. <br/><br/><br/><br/>Still not working, ghostdog. But I'm not disheartened, because jeb on DosTips.com made me realize a fair enough workaround:<br/><br/>First of all, I'm only dealing with folder names, but this can be applied to files as well.<br/><br/>Forget the whole <strong>cmd /u</strong> thing, and the output has a ? in place of each unicode character as usual. ? matches any single character in a path, including unicode.<br/><br/>Retrieve the unicode in a folder name:<br/><strong>for /d %%x in ("C:\Folder ?") do set folder="%%x"</strong><br/><br/>(or retrieve the unicode in a file name:)<br/><strong>for %%x in ("C:\file ?") do set file="%%x"</strong><br/><br/>There are two problems however:<br/><br/>1. "C:\Folder ?" will match both "C:\Folder き" and "C:\Folder こ", etc.<br/><br/>2. <strong>for /r "C:\Folder ?"...</strong> does not work. 
Other things might not work either without retrieving the unicode.<br/>Solution: <strong>pushd "C:\Folder ?"</strong>, then <strong>for /r</strong> with no <strong>"path"</strong> (which processes the current directory), then <strong>popd</strong>.<br/><br/>For the first problem, we can detect whether it returns more than one result. If it does, have the user choose between the folders at run-time. This would have to verify each folder containing ? from the root folder down to the deepest descendant folder. Parent folders can be filtered further by their descendants, unless the descendant folders exist in both parents, or something to that effect. I'm going to figure it out and work on a script for this.<br/><br/> It's the best that can be done. If anyone who comes across this can manage a solution to the original problem, it would still be welcome.</body></html>
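
On the still-open question of reading the file: the traceback quoted in the thread shows the generic `utf-16` codec rejecting input that lacks a byte order mark. One possible way around that, sketched here for Python 3 and assuming the file really is little-endian as the thread describes, is to name the byte order explicitly with the `utf-16-le` codec. The helper name `read_utf16le_lines` and the temporary sample file are hypothetical and exist only for illustration.

```python
import os
import tempfile


def read_utf16le_lines(path):
    """Decode a UTF-16LE text file, with or without a BOM, into a list of lines."""
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode("utf-16-le")   # byte order is stated, so no BOM is required
    if text.startswith("\ufeff"):    # drop a BOM if one happens to be present
        text = text[1:]
    return text.splitlines()


# Build a BOM-less UTF-16LE sample (assumed to resemble what the thread's
# `cmd /u` redirection produced). \u304d is the hiragana "ki".
sample = '"C:\\Folder \u304d"\r\n'
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-16-le"))

lines = read_utf16le_lines(sample_path)
os.remove(sample_path)
```

In UTF-16LE the character き (U+304D) is stored as the bytes 0x4D 0x30, which are the ASCII codes for `M` and `0`; that is consistent with the `M0` that `type` printed in the thread. Whether the decoded string can then be displayed or handed to a batch variable still depends on the console and code page, which this sketch does not address.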
