Member 14978731 Answers: 0

Trying to save email as HTML and PDF - encoding problems, keep getting � , Â and \u2020

I'm trying to write a program that will download my emails and save them as PDF.

I've encountered a problem with encoding.

I'm using the email and imaplib modules. When I write the file using part.get_payload(decode=True), I get an HTML file with \u2013 and � in it.

Writing the raw email as HTML works and doesn't show any �, but it also shows the headers of the email message. Trying to get rid of the headers makes the � return. I've tried changing the encoding to ISO-8859-1, which removes the � but gives me \u2020 and \u2013 instead.
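For reference, get_payload(decode=True) returns bytes with only the transfer encoding (base64 or quoted-printable) undone; the character set still has to be applied separately. A minimal sketch of that step, assuming the part declares its charset in its Content-Type header. The helper name decode_part, the UTF-8 fallback, and the sample message are illustrative assumptions, not part of the original program:

```python
import email

def decode_part(part):
    """Return a text part's body as str, decoded with its declared charset."""
    raw = part.get_payload(decode=True)  # bytes, transfer encoding already undone
    if raw is None:
        return ""
    charset = part.get_content_charset() or "utf-8"  # assumption: default to UTF-8
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or mislabeled charset: keep going and mark bad bytes with U+FFFD.
        return raw.decode("utf-8", errors="replace")

# Hypothetical single-part message standing in for one fetched over IMAP.
sample = email.message_from_bytes(
    b"Content-Type: text/html; charset=windows-1252\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>Your order =96 shipped</p>\r\n"
)
html = decode_part(sample)
```

The resulting str could then be written with open(path, 'w', encoding='utf-8'), so that the bytes on disk match a charset=UTF-8 declaration inside the HTML.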

Removing this line from the html solved the problem, until I converted it to PDF: <html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:asp="remove"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"></meta><meta name="format-detection" content="telephone=no, date=no, address=no, email=no, url=no"></meta><style type="text/css">

When I converted it to PDF, Â and â started appearing in the document.
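One common way for Â and â to show up, sketched here under the assumption that the saved HTML is UTF-8 but is being read as a single-byte encoding such as Windows-1252. The sample string is made up for illustration; whether this is what happens inside the PDF conversion is not confirmed by the post:

```python
# An en dash (U+2013) and a non-breaking space (U+00A0), both common in order emails.
text = "Your order \u2013 total\u00a0$5"
utf8_bytes = text.encode("utf-8")

# UTF-8 bytes misread as Windows-1252: each non-ASCII character turns into
# two or three characters that start with "â" or "Â".
misread = utf8_bytes.decode("cp1252")

# The opposite mistake: Windows-1252 bytes misread as UTF-8 produce U+FFFD (the � sign).
cp1252_bytes = "\u2013".encode("cp1252")
replaced = cp1252_bytes.decode("utf-8", errors="replace")

# Read with the right codec, the bytes round-trip cleanly.
roundtrip = utf8_bytes.decode("utf-8")
```

If this is the cause, telling the converter which encoding to use may help; pdfkit accepts an options dictionary, for example options={'encoding': 'UTF-8'}, which it passes on to wkhtmltopdf.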

This is the code I wrote:

What I have tried:

import email
import imaplib
import mimetypes
import os

import pdfkit

m = imaplib.IMAP4_SSL('imap.mail.yahoo.com')
m.login('xxxx@yahoo.com', 'xxxxx')

m.select('IL', readonly=True)
resp, data = m.search(None, '(SINCE "01-Jul-2019" BEFORE "29-Oct-2020" SUBJECT "Your order")')

messages = data[0].split()

for item in messages:
    typ, data = m.fetch(item, '(RFC822)')
    # Parse the raw bytes directly instead of decoding the whole message as UTF-8 first.
    email_message = email.message_from_bytes(data[0][1])
    to_ = email_message['To']
    from_ = email_message['From']
    subject_ = email_message['Subject']
    date_ = email_message['date']
    # Date and subject become directory names; they may need sanitizing on some file systems.
    save_path = os.path.join(os.getcwd(), "emails", date_, subject_)
    if not os.path.exists(save_path):
        print(save_path)
        os.makedirs(save_path)
    counter = 1
    for part in email_message.walk():
        if part.get_content_maintype() == "multipart":
            continue
        filename = part.get_filename()
        content_type = part.get_content_type()
        if not filename:
            ext = mimetypes.guess_extension(content_type)
            if not ext:
                ext = '.bin'
            filename = 'msg-part-%08d%s' % (counter, ext)
        counter += 1
        # Write every part inside the loop, using the save_path variable
        # rather than the literal string 'save_path'.
        part_path = os.path.join(save_path, filename)
        with open(part_path, 'wb') as fp:
            fp.write(part.get_payload(decode=True))
        # Convert the HTML part that was just saved, not a hard-coded file name.
        if content_type == 'text/html':
            pdfkit.from_file(part_path, os.path.splitext(part_path)[0] + '.pdf')


Python HTML encoding PDF UTF-8 email

Gerry Schmitz

That's punctuation; check your character set.
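One way to follow up on that hint is to compare the charset each part declares in its Content-Type header with the one named in the HTML's own meta tag; a mismatch between the two is a frequent source of this kind of debris. A rough sketch, where report_charsets, the regular expression, and the sample message are all illustrative assumptions:

```python
import email
import re

# Matches charset=... inside a <meta> tag, in either the short or http-equiv form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def report_charsets(message):
    """Return (content type, header charset, meta charset) for each text part."""
    rows = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue
        body = part.get_payload(decode=True) or b""
        match = META_CHARSET.search(body)
        meta = match.group(1).decode("ascii").lower() if match else None
        rows.append((part.get_content_type(), part.get_content_charset(), meta))
    return rows

# Hypothetical message whose header and meta tag disagree about the charset.
sample = email.message_from_bytes(
    b"Content-Type: text/html; charset=iso-8859-1\r\n"
    b"\r\n"
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=UTF-8"></head><body>Hi</body></html>\r\n'
)
rows = report_charsets(sample)
```

When the two values differ, the bytes are encoded according to the header, so either the file should be re-encoded to match the meta tag or the meta tag should be changed to match the bytes.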
