cyrus… you bastard !


My days are filled with excitement, that is for sure.  ‘Challenge’ is my middle name.  So I get a call from a customer who tells me that their gateway server (we sell an all-in-one product for email, firewall, gateway services, etc… for small-medium businesses) went down over the weekend.  It is up but nothing is working… Ut oh… LOL here we go… So first by qualifying what their definition of “up but nothing is working” I get to the root, EMAIL is not working. I already knew the web aspect was working as I was on the system looking at logs and such… So in the time that I am on the phone talking, someone decided to reboot the system… wtf? Well that happens, and not really a big deal. In fact in this case, it was a good f-up as the system did not come back up… So off I went to their site to see what was going on.  Upon arrival and setting up the monitor/keyboard/mouse that they didn’t have connected (more wtf’s???) I saw that the OS had dropped into a maintenance shell… Ut oh… can you say fsck??? I can and did (say it), then I typed it.. LOL.. well it went along fixing lots of stuff… as it does… told me it was all done and everything was fine. Of course I didn’t trust that so I executed it again, and low and behold, it wasn’t lying to me… Reboot…. (think– could this be it?? am I all done?? )

Of course not…

But the system was back up and now I was able to view it remotely, so off I went back to the office to open a support ticket with the OS provider and see what they had to say about this… (You know I tried all kinds of routine admin type stuff while I was there to try to get it working, to no avail, this wasn’t going to be a 10 minute fix…*(remember that I said this… It is fairly significant @ the end..))… So this was Monday the 8th, I think that is worth mentioning as it took 2 flippin’ days to resolve this…?

Initially the logs were just spewing tons and tons (literally 2GB maillog file in a few hours… which is nuts for a company this size) of errors:

DBERROR PANIC: fatal region error detected; run recovery

DBERROR: critical database situation

process 7498 exited, status 75

service imaps pid 7498 in READY state: terminated abnormally <– PID’s changed here

Hmmm… not overly revealing there is it? Well I didn’t think so, that’s for sure.  To be honest, I am not even sure what I did up until I solved the issue, LOL. Aside form opening the ticket with the OS provider,  and them doing nothing to assist me aside from telling me to look for a .db file in the logs (which @ the time there were none) I was on my own… well me and Google that is  LOL 🙂 . We were very close to to making a recommendation to the customer to just blow away their current OS in favor of rebuilding on the latest and greatest release (they have a centOS 4.3 platform, the new one is built on centOS 5.4)… this is the time when you should stop, take a step back, and regroup.  As a last resort, kind of like suicide, shouldn’t you try EVERYTHING before this recommendation? Ahh.. yes you should, things that you would normally view as un-safe can now be considered as you are about to KILL the system anyway so what do you have to lose right?

So @ some point today (Wednesday March10th), after some reflection in the shower… I re-approached this a-new. It was just some silly stuff was getting in my way of seeing this clearly.  One being the friggin default path of the cyrus user. (How it would not see a dir by default that you must be the cyrus user to execute is beyond me, but hey… it is what it is… so by executing the following command (Oh I should mention that I have 2 windows open here, one running a tail -f on the maillog so I can see the direct result of my actions and the other to execute the commands):

su cyrus <– not su – , remember the PATH anomaly…

ctl_cyrusdb -r

I finally got the log entries I needed to really start fixing this… LOL… shown below,  namely the “DBERROR /var/lib/imap/deliver.db” was finally pointing to an actual file that was an issue. Yeeeehaaawww !!! So now what?

Cont… (following all of this crap)

Log Evidence:
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR DB_ENV->log_flush: LSN of 15/1164196 past current end-of-log of 15/1075183: DB_ENV->log_flush: LSN of %lu/%lu past current end-of-log of %lu/%lu
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR Database environment corrupt; the wrong log files may have been removed or incompatible database files imported from another environment: %s %s %s
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR /var/lib/imap/deliver.db: unable to flush page: 0: %s: unable to flush page: %lu
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR txn_checkpoint: failed to flush the buffer cache Invalid argument: txn_checkpoint: failed to flush the buffer cache %s
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR PANIC: Invalid argument: PANIC: %s
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR: critical database situation
Mar 10 08:43:38 gateway reconstruct[30323]: DBERROR PANIC: fatal region error detected; run recovery: PANIC: fatal region error detected; run recovery
Mar 10 08:43:38 gateway reconstruct[30323]: DBERROR: critical database situation
Mar 10 08:50:34 gateway ctl_cyrusdb[30728]: DBERROR /var/lib/imap/deliver.db: unexpected file type or format: %s: unexpected file type or format
Mar 10 08:50:34 gateway ctl_cyrusdb[30728]: recovering cyrus databases
Mar 10 08:50:34 gateway ctl_cyrusdb[30728]: skiplist: recovered /var/lib/imap/mailboxes.db (57 records, 11908 bytes) in 0 seconds
Mar 10 08:50:34 gateway ctl_cyrusdb[30728]: skiplist: recovered /var/lib/imap/annotations.db (20 records, 3436 bytes) in 0 seconds
Mar 10 08:50:34 gateway ctl_cyrusdb[30728]: done recovering cyrus databases

Directories of Significance:
/var/spool/postfix/postgrey

Commands of significance:
/usr/sbin/postgrey -d –unix=/var/spool/postfix/postgrey/socket
As cyrus user (su cyrus <– note no “-” here…)
ctl_cyrusdb -r
reconstruct
Exploratory commands:
ldd /usr/lib/cyrus-imapd/ctl_mboxlist

I used all of the above @ some point in the process so I thought it was worth listing them…

In the beginning…

Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR DB_ENV->log_flush: LSN of 15/1164196 past current end-of-log of 15/1075183: DB_ENV->log_flush: LSN of %lu/%lu past current end-of-log of %lu/%lu
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR Database environment corrupt; the wrong log files may have been removed or incompatible database files imported from another environment: %s %s %s
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR /var/lib/imap/deliver.db: unable to flush page: 0: %s: unable to flush page: %lu
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR txn_checkpoint: failed to flush the buffer cache Invalid argument: txn_checkpoint: failed to flush the buffer cache %s
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR PANIC: Invalid argument: PANIC: %s
Mar 10 08:43:30 gateway ctl_cyrusdb[30322]: DBERROR: critical database situation

…Cont… So lets get t how this is fixed shall we??? Well if the /var/lib/imap/deliver.db file is the issue, lets just get rid of it right? Well Mr. Caution (no idea why) can’t be brave enough to really just rm the file, so I cp and re -name it…

-rw——-   1 root  root 684032 Mar 10 08:48 deliver.db.bak.2010-03-10

Then I did this…

[root@gateway cyrus-imapd]# su cyrus (the PATH for my cyrus use was incorrect, thus the su not su – execution here)
bash-3.00$ pwd
/usr/lib/cyrus-imapd
bash-3.00$ ctl_cyrusdb -r <– hey this worked !!
bash-3.00$ reconstruct <– rebuilding the database … working…
user/admin
user/bhorbettjr
user/bhorbettjr/Calendar
user/bhorbettjr/Contacts
user/bhorbettjr/Notes
user/bhorbettjr/Tasks
user/bhorbettjr@bigldist.com
user/bhorbettsr
user/calendar@bigldist.com
user/ccooper
user/ccooper@bigldist.com
user/cwuest
user/dbiniszkiewicz
user/dbiniszkiewicz@bigldist.com
user/dkachelmeyer
user/dkachelmeyer/Calendar
user/dkachelmeyer/Contacts
user/dkachelmeyer/Notes
user/dkachelmeyer/Tasks
user/dkachelmeyer@bigldist.com
user/email-archive
user/ghoffman
user/ghoffman@bigldist.com
user/jheck
user/jheck@bigldist.com
user/jladowski
user/jladowski/Calendar
user/jladowski/Contacts
user/jladowski/Notes
user/jladowski/Tasks
user/jturner
user/jturner@bigldist.com
user/lhirner
user/lpotts
user/lpotts@bigldist.com
user/lsetlock
user/mkegler
user/mkegler@bigldist.com
user/mpiegay
user/mpiegay@bigldist.com
user/nclifford
user/nclifford@bigldist.com
user/pjankowski
user/psc
user/skegler
user/skegler@bigldist.com
user/slokey
user/slokey@bigldist.com
user/szwolinkiewicz
user/szwolinkiewicz@bigldist.com
user/tmarfurt
user/tmarfurt/Calendar
user/tmarfurt/Contacts
user/tmarfurt/Notes
user/tmarfurt/Sent Items
user/tmarfurt/Tasks
user/tmarfurt@bigldist.com
bash-3.00$ pwd
/usr/lib/cyrus-imapd

bash-3.00$ ls -l /var/lib/imap/
total 720
-rw——-   1 cyrus mail   3436 Mar 10 08:50 annotations.db
-rw——-   1 root  root   3436 Mar  9 11:00 annotations.db.bak.2010-03-09
drwx——   2 cyrus mail   1024 Mar 10 04:02 backup
drwx——   2 cyrus mail   1024 Mar 10 08:50 db
drwx——   2 cyrus mail   1024 Mar  6 08:35 db.backup1
drwx——   2 cyrus mail   1024 Mar  6 08:05 db.backup2
-rw——-   1 cyrus mail   8192 Mar 10 08:50 deliver.db <– new file after reconstruct
-rw——-   1 root  root 684032 Mar 10 08:48 deliver.db.bak.2010-03-10 <– corrupt file?
drwx——   2 cyrus mail   1024 Mar 22  2006 log
-rw——-   1 cyrus mail  11908 Mar 10 08:50 mailboxes.db
drwx——   2 cyrus mail   1024 Mar 22  2006 msg
drwx——   2 cyrus mail   1024 Mar  6 08:35 proc
drwx——   2 cyrus mail   1024 Mar 22  2006 quota
drwx——   2 cyrus mail   1024 Mar  9 11:02 rpm
drwx——   2 cyrus mail   1024 Mar 22  2006 sieve
drwxr-x—   2 cyrus mail   1024 Mar  9 11:02 socket
-rw——-   1 cyrus mail   8192 Mar  6 08:05 tls_sessions.db
drwx——  13 cyrus mail   1024 Feb  9  2009 user

Cool.. that worked and now I am not seeing the error in the logs… But I am seeing postgrey errors… crap !

Postgrey: Still not working???
Mar 10 09:12:07 gateway postfix/smtpd[32583]: warning: connect to /var/spool/postfix/postgrey/socket: Connection refused
Mar 10 09:12:07 gateway postfix/smtpd[32583]: warning: problem talking to server /var/spool/postfix/postgrey/socket: Connection refused
Mar 10 09:12:07 gateway postfix/smtpd[32323]: warning: connect to /var/spool/postfix/postgrey/socket: Connection refused
Mar 10 09:12:07 gateway postfix/smtpd[32323]: warning: problem talking to server /var/spool/postfix/postgrey/socket: Connection refused
Mar 10 09:12:07 gateway postfix/smtpd[32323]: NOQUEUE: reject: RCPT from 140.227.0-93.rev.gaoland.net[93.0.227.140]: 450 Server configuration problem; from=<uxyximiiit1429@gaoland.net> to=<|mobley@bigldist.com> proto=ESMTP helo=<gaoland.net>

Ok.. where am I here???

[root@gateway log]# service postgrey status
postgrey dead but subsys locked
[root@gateway log]# service postgrey start
Starting postgrey:                                         [  OK  ]
[root@gateway log]# service postgrey status
postgrey dead but subsys locked

Ya… postgrey is hosed…

[root@gateway postgrey]# pwd
/var/spool/postfix/postgrey
[root@gateway postgrey]# ls -l
total 44067
-rw——-  1 postgrey postgrey  5483260 Mar  6 08:36 log.0000000609
-rw——-  1 postgrey postgrey  1089536 Mar  6 08:35 postgrey_clients.db
-rw——-  1 postgrey postgrey 38371328 Mar  6 08:36 postgrey.db
-rw——-  1 postgrey postgrey        0 Mar 10 09:17 postgrey.lock
srw-rw-rw-  1 postgrey postgrey        0 Mar 10 09:21 socket
[root@gateway postgrey]# mv postgrey.db bak.`date -I`.postgrey.db
[root@gateway postgrey]# ls -l
total 44067
-rw——-  1 postgrey postgrey 38371328 Mar  6 08:36 bak.2010-03-10.postgrey.db
-rw——-  1 postgrey postgrey  5483260 Mar  6 08:36 log.0000000609
-rw——-  1 postgrey postgrey  1089536 Mar  6 08:35 postgrey_clients.db
-rw——-  1 postgrey postgrey        0 Mar 10 09:17 postgrey.lock
srw-rw-rw-  1 postgrey postgrey        0 Mar 10 09:21 socket
[root@gateway postgrey]# mv postgrey_clients.db  bak.`date -I`.postgrey_clients.db
[root@gateway postgrey]# rm postgrey.lock
rm: remove regular empty file postgrey.lock? y
[root@gateway postgrey]# ls -l
total 44067
-rw——-  1 postgrey postgrey  1089536 Mar  6 08:35 bak.2010-03-10.postgrey_clients.db
-rw——-  1 postgrey postgrey 38371328 Mar  6 08:36 bak.2010-03-10.postgrey.db
-rw——-  1 postgrey postgrey  5483260 Mar  6 08:36 log.0000000609
srw-rw-rw-  1 postgrey postgrey        0 Mar 10 09:21 socket

Ok.. so lets make the same assumption about postgrey (the dabase is corrupt) and do the same thing, with the addition of the lock file rm, then restart it…

[root@gateway postgrey]# /usr/sbin/postgrey -d –unix=/var/spool/postfix/postgrey/socket
[root@gateway postgrey]# ls -l
total 44141
-rw——-  1 postgrey postgrey  1089536 Mar  6 08:35 bak.2010-03-10.postgrey_clients.db
-rw——-  1 postgrey postgrey 38371328 Mar  6 08:36 bak.2010-03-10.postgrey.db
-rw——-  1 postgrey postgrey    16384 Mar 10 09:23 __db.001 <– hmm none of these were there before… (001-004)
-rw——-  1 postgrey postgrey   278528 Mar 10 09:23 __db.002
-rw——-  1 postgrey postgrey    98304 Mar 10 09:23 __db.003
-rw——-  1 postgrey postgrey    16384 Mar 10 09:23 __db.004
-rw——-  1 postgrey postgrey  5503181 Mar 10 09:23 log.0000000609
-rw——-  1 postgrey postgrey     8192 Mar 10 09:23 postgrey_clients.db
-rw——-  1 postgrey postgrey     8192 Mar 10 09:23 postgrey.db
-rw——-  1 postgrey postgrey        0 Mar 10 09:23 postgrey.lock
srw-rw-rw-  1 postgrey postgrey        0 Mar 10 09:23 socket

[root@gateway postgrey]# service postgrey status <– oh please oh please…
postgrey (pid 720) is running… <– YES !! HELL YES !!

Loop through the Queue to see what is going on…
[root@gateway cyrus-imapd]# for i in {1..20}; do qshape incoming active deferred; sleep 2; done
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  1  1  0  0  0  0   0   0   0    0     0
bigldist.com  1  1  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  1  1  0  0  0  0   0   0   0    0     0
bigldist.com  1  1  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  2  2  0  0  0  0   0   0   0    0     0
bigldist.com  2  2  0  0  0  0   0   0   0    0     0 <– 2 in
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  2  2  0  0  0  0   0   0   0    0     0
bigldist.com  2  2  0  0  0  0   0   0   0    0     0 <– 2 in (same 2 most likely)
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0 <– processed 2
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  1  1  0  0  0  0   0   0   0    0     0
bigldist.com  1  1  0  0  0  0   0   0   0    0     0 <– 1 in
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  1  1  0  0  0  0   0   0   0    0     0
bigldist.com  1  1  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0 <– processed
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  0  0  0  0  0  0   0   0   0    0     0
T  5 10 20 40 80 160 320 640 1280 1280+
TOTAL  1  1  0  0  0  0   0   0   0    0     0
bigldist.com  1  1  0  0  0  0   0   0   0    0     0

So… in the end a 5 minute fix took 2 days… Should I be saying that out loud? oops…

Advertisements

, , , , , , , , , , , , ,

  1. #1 by gmconklin on July 12, 2010 - 11:09 am

    *NOTE – You may have to append each cyrus command with a ./ as I found on a system today, the cyrus user path must not have had the command directory in it’s own path??? Odd…

  2. #2 by Andy on March 10, 2010 - 4:31 pm

    I dont like you anymore, lol

  1. Quota violation… cha cha cha ! « A Technological Farce

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: