Solve : I need to determine an unkown encoding of a .XLS file.?

1.	Solve : I need to determine an unkown encoding of a .XLS file.?
Answer» I am working on writing a python(3) script to take data out of an xls file, and place it in an sqlite database. The xls file is downloaded from the Internet, and encoded in a way I haven't been able to get to work with the xlrd library. Code: [Select]Python 3.5.2+ (default, Sep 22 2016, 12:18:14) [GCC 6.2.0 20160927] on linux Type "copyright", "credits" or "license()" for more information. >>> import xlrd >>> file = xlrd.open_workbook("test.xls") WARNING * file size (262435) not 512 + multiple of sector size (512) WARNING * OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero Traceback (most recent call last): File "<pyshell#1>", line 1, in <module> file = xlrd.open_workbook("test.xls") File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook ragged_rows=ragged_rows, File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls bk.parse_globals() File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals self.handle_writeaccess(data) File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess strg = unpack_unicode(data, 0, lenlen=2) File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode strg = unicode(rawstrg, 'utf_16_le') File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda> unicode = lambda b, enc: b.decode(enc) File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode return codecs.utf_16_le_decode(input, errors, True) UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data I was reading through the docs, and it didn't give a clear way to test how the file is encoded. Link Related: http://xlrd.readthedocs.io/en/latest/unicode.html I did try other codes I have heard of, but I've yet to get one to work. Any thoughts?Sounds like the file is missing the CODEPAGE information. You'll have to override it as documented here. Code: [Select]file = xlrd.open_workbook('test.xls', encoding_override="utf_16_le") You'll have to experiment with different encodings depending on what kind it actually is.I have been try different codecs, but have yet to get one to work. I may write a script that will just try them all... An interesting thing I have FOUND is that no matter what codec I have given as a argument, I always get the same error: Code: [Select]>>> xlrd.open_workbook('test.xls', encoding_override="iso8859_15") WARNING * file size (262435) not 512 + multiple of sector size (512) WARNING * OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook ragged_rows=ragged_rows, File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls bk.parse_globals() File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals self.handle_writeaccess(data) File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess strg = unpack_unicode(data, 0, lenlen=2) File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode strg = unicode(rawstrg, 'utf_16_le') File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda> unicode = lambda b, enc: b.decode(enc) File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode return codecs.utf_16_le_decode(input, errors, True) UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data OK, then the file PROBABLY does have a CODEPAGE but the file itself is corrupted. Can you open the file in Excel/LibreOffice?Yes, I can open it in LibreOffice, and even if I change the file extension libreoffice doesn't complain. It is not an xlsx for sure because it isn't a compressed file. (I can see the text of the spread sheet with a hex editor)I have been investigating the file a bit with a hex editor... I found the data of the first cell of the downloaded file here: Code: [Select]4F 00 72 00 64 00 65 00 72 00 20 00 47 00 75 00 69 00 64 00 65 00 Libre Office interprets it as "Order Guide" Now I copied this cell to a blank workbook and saved it. I found "Order Guide" in the document as: Code: [Select]4F 72 64 65 72 20 47 75 69 64 65 Does anyone with more experience with encodings have any idea what the downloaded file could be encoded with? If not does anyone know where else to look in the file?First example is UTF-16 Little Endian (LE).I've tried most COMMON codecs with no success. So maybe the file is utf_16_le. Assuming the file is intentionally corrupted (100% reproducible), with python I would like to attempt to reconstruct it. Code: [Select]UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data Position 108 I am assuming means the 108th byte in the file. 0x20 from what I understand is usually a space, I will add 00 before it. brbThe error message was not helpful, no byte 0x20 was near byte position 108, nor near positions of 108*2^n. So I decided to loop through the file with: Code: [Select]file = open("prices.xls", encoding="utf-16-le") byte = file.read(1) while byte != "": byte = file.read(1) file.close() and I got this error: Code: [Select]Traceback (most recent call last): File "./test.py", line 5, in <module> byte = file.read(1) File "/usr/lib/python3.5/codecs.py", line 321, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1018-1019: illegal encoding Bytes in position 1018-1019 are: 00 00 On a side note, if I open the file in libreoffice and save as an xls, I can open it with xlrd, and the file shrinks significantly. 262.5kb to 217.6kb. Any ideas on what to do next? I do not want my script to be dependent on libreoffice being installed.Quote Assuming the file is intentionally corrupted (100% reproducible) Where did it come from?The working par t of the script, fills out a couple webforms, and downloads the xls file using beautiful soup. This is the part that does it: Code: [Select]request = browser.session.get(myURL, stream=True) with open(path + os.sep +"test.xls", "wb") as output: output.write(request.content) output.close() Quote Assuming the file is intentionally corrupted (100% reproducible) Sorry for not being clear to what is happening. So the file that my code downloads, matches the file that I can download using Firefox. I know the xls file is generated by some webservice. That xls file I assume is corrupted is some form. To clean up the xls file my current scripted solution is to run this: Code: [Select]from subprocess import call call(["unoconv", "-d", "spreadsheet", "--format=ods", path + os.sep +"test.xls"]) call(["unoconv", "-d", "spreadsheet", "--format=xls", path + os.sep +"test.ods"]) While this solution works, I am not happy with it because it adds dependencies of libreoffice and unoconv. What I would like to do is accomplish the same end result, but without adding nonpython dependencies. Any idea where to start?

Answer»

I am working on writing a python(3) script to take data out of an xls file, and place it in an sqlite database. The xls file is downloaded from the Internet, and encoded in a way I haven't been able to get to work with the xlrd library.

Code: [Select]Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
[GCC 6.2.0 20160927] on linux
Type "copyright", "credits" or "license()" for more information.
>>> import xlrd
>>> file = xlrd.open_workbook("test.xls")
WARNING *** file size (262435) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
file = xlrd.open_workbook("test.xls")
File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls
bk.parse_globals()
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals
self.handle_writeaccess(data)
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess
strg = unpack_unicode(data, 0, lenlen=2)
File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode
strg = unicode(rawstrg, 'utf_16_le')
File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda>
unicode = lambda b, enc: b.decode(enc)
File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
I was reading through the docs, and it didn't give a clear way to test how the file is encoded.
Link Related: http://xlrd.readthedocs.io/en/latest/unicode.html

I did try other codes I have heard of, but I've yet to get one to work. Any thoughts?Sounds like the file is missing the CODEPAGE information. You'll have to override it as documented here.

Code: [Select]file = xlrd.open_workbook('test.xls', encoding_override="utf_16_le")

You'll have to experiment with different encodings depending on what kind it actually is.I have been try different codecs, but have yet to get one to work. I may write a script that will just try them all...

An interesting thing I have FOUND is that no matter what codec I have given as a argument, I always get the same error:
Code: [Select]>>> xlrd.open_workbook('test.xls', encoding_override="iso8859_15")
WARNING *** file size (262435) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls
bk.parse_globals()
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals
self.handle_writeaccess(data)
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess
strg = unpack_unicode(data, 0, lenlen=2)
File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode
strg = unicode(rawstrg, 'utf_16_le')
File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda>
unicode = lambda b, enc: b.decode(enc)
File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
OK, then the file PROBABLY does have a CODEPAGE but the file itself is corrupted. Can you open the file in Excel/LibreOffice?Yes, I can open it in LibreOffice, and even if I change the file extension libreoffice doesn't complain. It is not an xlsx for sure because it isn't a compressed file. (I can see the text of the spread sheet with a hex editor)I have been investigating the file a bit with a hex editor...
I found the data of the first cell of the downloaded file here:
Code: [Select]4F 00 72 00 64 00 65 00 72 00 20 00 47 00 75 00 69 00 64 00 65 00
Libre Office interprets it as "Order Guide"

Now I copied this cell to a blank workbook and saved it.
I found "Order Guide" in the document as:
Code: [Select]4F 72 64 65 72 20 47 75 69 64 65

Does anyone with more experience with encodings have any idea what the downloaded file could be encoded with? If not does anyone know where else to look in the file?First example is UTF-16 Little Endian (LE).I've tried most COMMON codecs with no success.
So maybe the file is utf_16_le.

Assuming the file is intentionally corrupted (100% reproducible), with python I would like to attempt to reconstruct it.

Code: [Select]UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
Position 108 I am assuming means the 108th byte in the file.
0x20 from what I understand is usually a space, I will add 00 before it.
brbThe error message was not helpful, no byte 0x20 was near byte position 108, nor near positions of 108*2^n.

So I decided to loop through the file with:
Code: [Select]file = open("prices.xls", encoding="utf-16-le")
byte = file.read(1)
while byte != "":
byte = file.read(1)
file.close()
and I got this error:
Code: [Select]Traceback (most recent call last):
File "./test.py", line 5, in <module>
byte = file.read(1)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1018-1019: illegal encoding
Bytes in position 1018-1019 are: 00 00

On a side note, if I open the file in libreoffice and save as an xls, I can open it with xlrd, and the file shrinks significantly. 262.5kb to 217.6kb.

Any ideas on what to do next?
I do not want my script to be dependent on libreoffice being installed.Quote

Assuming the file is intentionally corrupted (100% reproducible)

Where did it come from?The working par t of the script, fills out a couple webforms, and downloads the xls file using beautiful soup.
This is the part that does it:
Code: [Select]request = browser.session.get(myURL, stream=True)
with open(path + os.sep +"test.xls", "wb") as output:
output.write(request.content)
output.close()
Quote

Assuming the file is intentionally corrupted (100% reproducible)

Sorry for not being clear to what is happening.
So the file that my code downloads, matches the file that I can download using Firefox. I know the xls file is generated by some webservice. That xls file I assume is corrupted is some form.
To clean up the xls file my current scripted solution is to run this:
Code: [Select]from subprocess import call
call(["unoconv", "-d", "spreadsheet", "--format=ods", path + os.sep +"test.xls"])
call(["unoconv", "-d", "spreadsheet", "--format=xls", path + os.sep +"test.ods"])
While this solution works, I am not happy with it because it adds dependencies of libreoffice and unoconv.
What I would like to do is accomplish the same end result, but without adding nonpython dependencies.
Any idea where to start?

Solve : I need to determine an unkown encoding of a .XLS file.?

Discussion

No Comment Found

Related InterviewSolutions

Reply to Comment