Failure for ~200k messages #6

Open
viric opened this issue Oct 13, 2010 · 13 comments

@viric

viric commented Oct 13, 2010

Hello,

After four or five interrupted runs, I have only managed to download about 47k of my 200k messages:
$ ls gmailbackup/new/ | wc -l
43815

I run this command to grab all the messages:
$ python imap2maildir -u xxxxxx -r "[Gmail]/Tots els missatges" -s ALL --create -v -d gmailbackup

On every run I am asked for the password, and then it prints:
Opening sqlite3 database 'gmailbackup/.imap2maildir.sqlite'
Synchronizing 199663 messages from imap.gmail.com:[Gmail]/Tots els missatges to /home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/gmailbackup...
TURBO MODE ENGAGED!
Exception! Clearing locks and safing database.
Traceback (most recent call last):
  File "imap2maildir", line 495, in <module>
    main()
  File "imap2maildir", line 476, in main
    search=options.search)
  File "imap2maildir", line 396, in copy_messages_by_folder
    for i in folder.Summaries(search=search):
  File "/home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/simpleimap.py", line 357, in Summaries
    summ = self.__parent.get_summary_by_uid(u)
  File "/home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/simpleimap.py", line 256, in get_summary_by_uid
    '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')
  File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 753, in uid
    typ, dat = self._simple_command(name, command, *args)
  File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 1060, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 890, in _command_complete
    raise self.abort('command: %s => %s' % (name, val))
imaplib.abort: command: UID => socket error: unterminated line

I cannot download any more messages. It takes quite a long time until the error appears. Could Gmail be disconnecting because of an inactivity timeout?

@viric
Author

viric commented Oct 15, 2010

I notice that in checkmessage() the turbo mode runs an SQL SELECT query for every candidate message to check whether the message is already there. This is a lot of work; I think it would be far better to load the list into memory in an appropriate searchable structure and do the check there.
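A minimal sketch of that idea, for illustration only: the table name `seen` and both helper functions are hypothetical and are not taken from imap2maildir. It reads all stored UIDs with one query into a Python set, so each later membership check is a hash lookup instead of a database round trip.

```python
import sqlite3

# Hypothetical schema: one row per message already stored in the maildir.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (uid INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO seen (uid) VALUES (?)", [(u,) for u in (1, 2, 5)])


def load_seen_uids(conn):
    """Read every stored UID with a single query and return them as a set."""
    return {row[0] for row in conn.execute("SELECT uid FROM seen")}


def missing_uids(server_uids, seen):
    """Return the server UIDs that are not in the local set, keeping order."""
    return [u for u in server_uids if u not in seen]


seen = load_seen_uids(conn)
todo = missing_uids([1, 2, 3, 4, 5, 6], seen)
print(todo)
```

A set of a few hundred thousand integers should fit comfortably in memory on most machines, which is the trade-off being proposed here.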

@rtucker
Owner

rtucker commented Oct 18, 2010

I've run into a couple of cases where a specific message is "corrupted" on Gmail's end, and trying to fetch it via IMAP fails. In simpleimap.py, putting a try/except around the get_summary_by_uid call should reveal the IMAP UID that is choking it:

try:
    summ = self.__parent.get_summary_by_uid(u)
except:
    # Debugging aid: report the UID being fetched, then re-raise unchanged.
    print "uid", u
    raise

Once you have that, it should be possible to delete the offending message.

It should be doing a better job of handling errors such as these. And yes, it is doing a SQL query for each UID... I don't remember why I did it that way, but I think memory consumption was a concern. On second thought, it shouldn't take THAT much memory, and it would likely improve performance a lot. :-) Good catch.

@viric
Author

viric commented Oct 18, 2010

Gmail simply closes the socket because of the long inactivity during the first stage of the TURBO MODE.

Once the list of UIDs is held in memory and checked there, instead of with one SQL query per UID, I think the turbo mode will work great.

I'm trying without turbo mode, but Gmail disconnects me before I can reach even 15% of my mail.

@rtucker
Owner

rtucker commented Oct 18, 2010

Well.

On my gmail mailbox of ~145,000 messages,
Last night's run: about 3.75 hours
With a cache: 7 minutes, 22 seconds

Pull in the latest HEAD and let me know how that works for you.

@viric
Author

viric commented Oct 18, 2010

I just tried. With turbo mode, using the old maildir directory that already had some messages in it, I got:

Exception!  Clearing locks and safing database.
Traceback (most recent call last):
  File "./imap2maildir", line 536, in <module>
    main()
  File "./imap2maildir", line 517, in main
    seencache=seencache)
  File "./imap2maildir", line 435, in copy_messages_by_folder
    for i in folder.Summaries(search=search):
  File "/home/llbatlle/tmp/imap2maildir/simpleimap.py", line 357, in Summaries
    summ = self.__parent.get_summary_by_uid(u)
  File "/home/llbatlle/tmp/imap2maildir/simpleimap.py", line 256, in get_summary_by_uid
    '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 753, in uid
    typ, dat = self._simple_command(name, command, *args)
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 1060, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 890, in _command_complete
    raise self.abort('command: %s => %s' % (name, val))
imaplib.abort: command: UID => socket error: unterminated line

I am not very good at Python, so sorry if I don't go into more detail about the code. :)
I will try again with a newly created maildir.

@rtucker
Owner

rtucker commented Oct 18, 2010

Well, at least it should be faster to test :-)

I just pushed a patch that will spit out the UID it choked on. Once you have that UID, you can try firing up Python and seeing if you can figure out what's wrong with the message:

import simpleimap
server = simpleimap.Server(hostname='imap.gmail.com', username='rtucker@gmail.com', password='blah').Get()
server.select('[Gmail]/All Mail')
server.uid('FETCH', 376544, '(RFC822)')

... would spit out message uid 376544. Try the neighboring messages (presumably 376543 and 376545) as well. You can also try:

    server.uid('FETCH', 376544, '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')

to see what that does, since that's what it is trying to do when it crashes.

imap2maildir could easily ignore this exception and have it continue on, but I think understanding why it is happening will be a very good thing.
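As a rough illustration of the "ignore and continue" option, here is a sketch that records failing UIDs instead of stopping. The names `collect_summaries` and `fake_fetch` are hypothetical and not part of imap2maildir or simpleimap. Note that after an `imaplib` abort the connection usually has to be reopened before further commands succeed, which this sketch does not attempt.

```python
def collect_summaries(uids, fetch_summary):
    """Fetch one summary per UID; record UIDs whose fetch raises."""
    summaries, failed = [], []
    for uid in uids:
        try:
            summaries.append(fetch_summary(uid))
        except Exception as exc:
            # Keep the UID and the error text so the message can be inspected later.
            failed.append((uid, str(exc)))
    return summaries, failed


def fake_fetch(uid):
    """Stand-in for a real IMAP fetch; fails on one UID to show the behaviour."""
    if uid == 165982:
        raise RuntimeError("socket error: unterminated line")
    return {"uid": uid}


summaries, failed = collect_summaries([165981, 165982, 165983], fake_fetch)
print(failed)
```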

Thanks! -rt

@viric
Author

viric commented Oct 19, 2010

Here you have it:

>>> server.uid('FETCH', 165982, '(RFC822)')
('OK', [('43816 (UID 165982 RFC822 {5523}', 'Delivered-To: viriketo@gmail.com\r\nReceived: by 10.142.169.1 with SMTP id r1cs178792wfe;\r\n        Sun, 28 Sep 2008 07:49:53 -0700 (PDT)\r\nReceived: by 10.115.23.19 with SMTP id a19mr4311058waj.133.1222613393492;\r\n        Sun, 28 Sep 2008 07:49:53 -0700 (PDT)\r\nReturn-Path: \r\nReceived: from n16a.bullet.sp1.yahoo.com (n16a.bullet.sp1.yahoo.com [69.147.64.121])\r\n        by mx.google.com with SMTP id t1si2136057poh.13.2008.09.28.07.49.52;\r\n        Sun, 28 Sep 2008 07:49:52 -0700 (PDT)\r\nReceived-SPF: pass (google.com: domain of sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com designates 69.147.64.121 as permitted sender) client-ip=69.147.64.121;\r\nDomainKey-Status: good\r\nAuthentication-Results: mx.google.com; spf=pass (google.com: domain of sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com designates 69.147.64.121 as permitted sender) smtp.mail=sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com; domainkeys=pass header.From=tradukado@yahoogroups.com\r\nComment: DomainKeys? 
See http://antispam.yahoo.com/domainkeys\r\nDomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=lima; d=yahoogroups.com;\r\n\tb=LSlgVDUGFtooqe064kt32c5atqJ2pBA+7kklkoqGGl95lG8xCcl8wjfXI6G5C61jPvg4vE0TWl1f2ZdNkYh5Xeade6B9I0le2BqDz8bMtZLINLIKi8XRYyp1pFTQEyGw;\r\nReceived: from [69.147.65.171] by n16.bullet.sp1.yahoo.com with NNFMP; 28 Sep 2008 14:49:45 -0000\r\nReceived: from [66.218.67.109] by t13.bullet.mail.sp1.yahoo.com with NNFMP; 28 Sep 2008 14:49:45 -0000\r\nX-Yahoo-Newman-Id: 9862331-m5848\r\nX-Sender: jorgos@aliceadsl.fr\r\nX-Apparently-To: tradukado@yahoogroups.com\r\nX-Received: (qmail 68424 invoked from network); 28 Sep 2008 14:49:42 -0000\r\nX-Received: from unknown (66.218.67.96)\r\n  by m45.grp.scd.yahoo.com with QMQP; 28 Sep 2008 14:49:42 -0000\r\nX-Received: from unknown (HELO mail.libertysurf.net) (213.36.80.105)\r\n  by mta17.grp.scd.yahoo.com with SMTP; 28 Sep 2008 14:49:42 -0000\r\nX-Received: from aliceadsl.fr (192.168.10.57) by mail.libertysurf.net (8.0.015)\r\n        id 482DC6AA00F031DC for tradukado@yahoogroups.com; Sun, 28 Sep 2008 16:49:42 +0200\r\nMessage-Id: \r\nX-Sensitivity: 3\r\nTo: "=?iso-8859-1?Q?tradukado?=" \r\nX-XaM3-API-Version: 3.2 R18 (B34 pl1)\r\nX-type: 0\r\nX-SenderIP: 91.171.195.43\r\nX-Originating-IP: 213.36.80.105\r\nX-eGroups-Msg-Info: 1:12:0:0:0\r\nFrom: "=?iso-8859-1?Q?jorgos@aliceadsl.fr?=" \r\nX-Yahoo-Profile: jorgos_esperanto\r\nSender: tradukado@yahoogroups.com\r\nMIME-Version: 1.0\r\nMailing-List: list tradukado@yahoogroups.com; contact tradukado-owner@yahoogroups.com\r\nDelivered-To: mailing list tradukado@yahoogroups.com\r\nList-Id: \r\nPrecedence: bulk\r\nList-Unsubscribe: \r\nDate: Sun, 28 Sep 2008 16:49:42 +0200\r\nSubject: =?iso-8859-1?Q?Re:[tradukado]_verboj_por_tabulaj_sportoj_(surftabulo,\r\n\t_negxtabulo,_rultabulo,_ktp)?=\r\nReply-To: tradukado@yahoogroups.com\r\nX-Yahoo-Newman-Property: groups-email-tradt-m\r\nContent-Type: text/plain; charset=ISO-8859-1\r\nContent-Transfer-Encoding: 
quoted-printable\r\n\r\nOni jam delonge neplu biciklumas au gitarludas sed biciklas=0D\r\nkaj gitaras (kvankam ne mem estas biciklo au gitaro) kaj=0D\r\npraktikas bicikladon kaj gitaradon, ^cu ne ? ; nu kial ne ? =0D\r\n=0D\r\n^Ciu elektu mem kaj la popolo decidos tion, kion akcepti...=0D\r\n=0D\r\nJs.=0D\r\n=0D\r\ntradukado, 28 Sep 2008 : verboj por tabulaj sportoj=0D\r\n(surftabulo, negxtabulo, rultabulo, ktp)=0D\r\n=0D\r\nSaluton,=0D\r\nkiel vi verbe esprimus la diversajn X-tabulan sportojn, ekz=0D\r\nuzon de=0D\r\nsurftabulo, negxtabulo, rultabulo, ktp?=0D\r\n1. simple verbigu la substantivon, kompreneble!=0D\r\nsurftabuli, negxtabuli, rultabuli, ...  Do "Li X-tabulas."=0D\r\n2. ne ne, tia verba formo de "tabul-" sensencas aux sugestas=0D\r\nke la=0D\r\nsubjekto ESTAS tia tabulo, do necesas aldoni -um al la=0D\r\nsubstantivo:=0D\r\nsurftabulumi, negxtabulumi, rultabulumi, ... Do "Li X-tabulumas"=0D\r\n3. ne eblas verbigi tiel, oni bezonas uzi ian verbon kun la=0D\r\nsubstantivo: rajdi surftabulon, gliti sur negxtabulo, veturi=0D\r\nper rultabulo, ... Do "Li iras per X-tabulo" aux "Li iras=0D\r\nX-tabule" ktp=0D\r\n4. io alia...?=0D\r\nKiel oni nomu la agadojn substantive?=0D\r\n1. surftabulado, negxtabulado, rultabulado, ...=0D\r\n2. surftabulumado, negxtabulumado, rultabulumado, ...=0D\r\n3. surftabulrajdado, negxtabulglitado, rultabulveturado, ...=0D\r\n4. io alia...?=0D\r\ndankon,    russ=0D\r\n\r\n\r\n\r\n---------------------- ALICE N=B01 de la RELATION CLIENT 2008*-------------=\r\n-------\r\nD=E9couvrez vite l\'offre exclusive ALICE BOX! En cliquant ici http://abonne=\r\nment.aliceadsl.fr Offre soumise =E0 conditions.*Source : TNS SOFRES / BEARI=\r\nNG POINT. Secteur Fournisseur d.Acc=E8s Internet\r\n\r\n\r\n\r\n------------------------------------\r\n\r\nYahoo! 
Groups Links\r\n\r\n<*> To visit your group on the web, go to:\r\n    http://groups.yahoo.com/group/tradukado/\r\n\r\n<*> Your email settings:\r\n    Individual Email | Traditional\r\n\r\n<*> To change settings online go to:\r\n    http://groups.yahoo.com/group/tradukado/join\r\n    (Yahoo! ID required)\r\n\r\n<*> To change settings via email:\r\n    mailto:tradukado-digest@yahoogroups.com=20\r\n    mailto:tradukado-fullfeatured@yahoogroups.com\r\n\r\n<*> To unsubscribe from this group, send an email to:\r\n    tradukado-unsubscribe@yahoogroups.com\r\n\r\n<*> Your use of Yahoo! Groups is subject to:\r\n    http://docs.yahoo.com/info/terms/\r\n\r\n'), ' FLAGS (\\Seen))'])

The main problem appears to be the Subject: line, which has a \r\n\t in the middle.

The relevant information from RFC 2822 is in section 2.2.3. In short:

"""
The process of moving from this folded multiple-line
representation of a header field to its single line
representation is called "unfolding". Unfolding is
accomplished by simply removing any CRLF that is
immediately followed by WSP. Each header field should
be treated in its unfolded form for further syntactic
and semantic evaluation.
"""
(I took this reference from this http://bugs.python.org/issue504152 )
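The unfolding rule quoted above can be sketched in a few lines. This is only an illustration of the rule as quoted, not code from imaplib or imap2maildir; the function name `unfold` and the shortened sample header are made up for the example.

```python
import re

# A CRLF immediately followed by a space or tab marks a folded header line.
_FOLD = re.compile(r"\r\n(?=[ \t])")


def unfold(header):
    """Remove each CRLF that is immediately followed by WSP."""
    return _FOLD.sub("", header)


folded = "Subject: first part,\r\n\tsecond part\r\n"
unfolded = unfold(folded)
print(repr(unfolded))
```

The whitespace character itself is kept; only the CRLF in front of it is removed, and the final CRLF that ends the header field is left alone.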

@viric
Author

viric commented Oct 23, 2010

Sorry, I now see that it is a problem in imaplib, still present in Python 2.7 and Python 3.
I'll have to work around it somehow.

@viric
Author

viric commented Oct 23, 2010

I had the chance to investigate the issue further. My mailbox has messages from one specific person whose long Subject lines violate RFC 2822. Instead of folding the subject with CRLF + WSP, his messages fold it with only LF + WSP. That breaks parsing of the ENVELOPE response, because imaplib reads it with readline(), and readline() treats both \n and \r\n as line endings.
I wrote a patch for imaplib so I can keep on downloading. When a line ends in \n (not \r\n), I concatenate the next line and remove the \n\t sequence.
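A sketch of what such a workaround might look like, assuming a file-like object with `readline()`. This is not the actual patch, which is not shown in the thread; the function name `read_logical_line` and the sample response bytes are invented for illustration.

```python
import io


def read_logical_line(stream):
    """Read one line, joining pieces that were split at a bare LF."""
    line = stream.readline()
    # A piece ending in LF but not CRLF was split inside a badly folded header.
    while line.endswith(b"\n") and not line.endswith(b"\r\n"):
        nxt = stream.readline()
        if not nxt:
            break  # end of stream: return what we have
        if nxt[:1] in (b" ", b"\t"):
            nxt = nxt[1:]  # drop the whitespace that followed the bare LF
        line = line[:-1] + nxt  # drop the bare LF and append the continuation
    return line


# Invented sample: a response line with a bare LF + TAB inside the subject.
stream = io.BytesIO(b'* 1 FETCH ("first part,\n\tsecond part")\r\nA1 OK done\r\n')
first = read_logical_line(stream)
second = read_logical_line(stream)
print(first, second)
```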

@rtucker
Owner

rtucker commented Oct 23, 2010

Cool! I, unfortunately, haven't had a chance to look at this yet but that's probably where I was headed.

I am not opposed to working around bugs in imaplib.py using simpleimap.py... see the SimpleImapSSL class for an example of this. The process of getting a bug fixed in the Python library is very slow, and then it has to actually make it onto people's systems via Debian/Ubuntu/RHEL/CentOS/. And yes, there are more than a few such bugs.

@viric
Author

viric commented Nov 18, 2010

Once I succeed in getting all my Gmail mail, I'll try to write something worth submitting for that bug.

@viric
Author

viric commented Nov 19, 2010

Ouch - my quick hack worked for the case I had, but I hit a new one that is harder to defeat. It also fails in the Python library, not in your code:
Date: Sat, 12 Aug 2006 21:07:54 +0400
Subject: [EK-MASI] =?koi8-r?B?IkFydG8ga2FqIGFrdGl2ZWNvIg0KDQojRWtvdG9waW8gMjAwNiBaYWplanhv?=
=?koi8-r?B?dmEgU2xvdmFraW8j?=
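For reference, decoding the two base64 payloads above shows that the subject text itself contains CR LF characters. This sketch only decodes the header shown; whether those embedded line breaks are what trips imaplib is not established in this thread.

```python
import base64

# The two base64 payloads copied from the encoded-words in the Subject above.
payloads = [
    "IkFydG8ga2FqIGFrdGl2ZWNvIg0KDQojRWtvdG9waW8gMjAwNiBaYWplanhv",
    "dmEgU2xvdmFraW8j",
]

# Decode each payload and join them, as adjacent encoded-words are concatenated.
raw = b"".join(base64.b64decode(p) for p in payloads)
text = raw.decode("koi8_r")
print(repr(text))
```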

@rtucker
Owner

rtucker commented Nov 23, 2010

Niiiice!

See my comment on Issue #10 -- having the "raw" response from the IMAP server helps with testing the weird ones.
