Decoding Utf-8 Strings In Python
I'm writing a web crawler in python, and it involves taking headlines from websites. One of the headlines should've read : And the Hip's coming, too But instead it said: And the Hi
Solution 1:
It's an encoding error - so if it's a unicode string, this ought to fix it:
text.encode("windows-1252").decode("utf-8")
If it's a plain string, you'll need an extra step:
text.decode("utf-8").encode("windows-1252").decode("utf-8")
Both of these will give you a unicode string.
By the way - to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:
>>> import chardet
>>> chardet.detect(u"And the Hip’s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}
Solution 2:
You need to properly decode the source text. Most likely the source text is in UTF-8 format, not ASCII.
Because you do not provide any context or code for your question it is not possible to give a direct answer.
I suggest you study how unicode and character encoding is done in Python:
Post a Comment for "Decoding Utf-8 Strings In Python"