So, what went wrong (or WordPress, Cron and Squid)

Recently my web host (Gradwell) moved to a new hosting platform (Apache 2, PHP 5.2) to try and bring things up to date. In general, the end result worked okay. However, the load balancing they had in front of their web cluster was apparently sub-par. This became obvious when a single customer was able to bring the whole thing to a grinding halt with some kind of chess-related website.

Now, I know it’s shared hosting, and you have to take the performance hits every now and then, but there’s a difference between ‘takes 2 or 3 seconds longer sometimes’ and ‘didn’t load’, ‘won’t load’, ‘took 8 minutes’. I raised a ticket on the Friday when the problems got to their worst, but for reasons I’m not sure about, that didn’t get looked at by anyone technical until Monday. So from Friday to Monday all my Gradwell sites were basically unusable between 1pm and 8pm UK time.

Gradwell made some changes on Monday and spoke to the owner of the other site, but it didn’t really fix the problem. Eventually they decided to replace whatever load balancer they were using with a Squid reverse proxy, which had been running ‘fine’ in front of their PHP 4 cluster. They did this on Tuesday night and since then the site has been a lot quicker.

However, it broke WordPress. Let me explain.

Since WordPress is a web application it doesn’t do anything until someone loads a page. However, there are things WordPress likes to schedule in the background, like publishing scheduled posts or sending out pingbacks/trackbacks, so that they don’t delay the actual use of the site. To achieve this, WordPress has a scheduled tasks queue, and whenever anyone loads a page it launches a task to process that queue. If I understand correctly, that task should run in the background without affecting page load times.

Obviously, since WordPress is generating this request, it is initiated by the same web server that you connect to, to read the site. Essentially the web server has to be able to talk to itself.
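
To make the mechanism concrete, here is a rough Python sketch of the general "cron on page load" pattern described above. It is not WordPress's actual code (that is PHP); the queue contents, the function names and the cron URL are all invented for illustration.

```python
import threading
import time
import urllib.request

# Hypothetical in-memory queue of (due_timestamp, task_name) pairs.
# The second timestamp is far in the future, so that task is never due here.
SCHEDULED = [(0.0, "publish_post_42"), (4102444800.0, "send_pingbacks")]


def due_tasks(queue, now):
    """Return the names of tasks whose due time has passed."""
    return [name for due, name in queue if due <= now]


def spawn_cron(url, timeout=1.0):
    """Fire a loopback request at the cron URL without blocking the page."""
    def _hit():
        try:
            urllib.request.urlopen(url, timeout=timeout).close()
        except Exception:
            # A blocked or failed loopback request fails silently, which is
            # why this kind of fault is so hard to spot from the outside.
            pass

    thread = threading.Thread(target=_hit, daemon=True)
    thread.start()
    return thread


def handle_page_load(queue, cron_url, now=None):
    """Serve a page; if anything is due, kick off the cron request first."""
    now = time.time() if now is None else now
    if due_tasks(queue, now):
        spawn_cron(cron_url)
    return "<html>page content</html>"
```

The point of the sketch is the last function: the visitor gets their page either way, and the queue only gets processed if the server's request to itself actually arrives.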

Yesterday I noticed that a scheduled post I had ready to go didn’t post on time, and so started 8 hours of problem investigation yesterday evening. I installed plugins, grepped log files and read more useless posts on the WordPress support forum (horrible forum, guys) than I ever wanted to. Behind the scenes, WordPress uses a file called wp-cron.php to do the actual cron work, and that page is called from within cron.php, which is hooked into all page loads. I eventually narrowed the issue down to WordPress not being able to load wp-cron.php itself, since if I loaded it remotely myself (you need to pass a special hashed key to do that) then the queue processed fine. Using a plugin called Core Control I was able to test the various transports that WordPress supports, and it looked like the cURL one was broken.

Now, there’s a bug in WordPress 2.7 which means it doesn’t gracefully fall back to trying additional transports if the first one fails, and cURL was the first one. I raised this with Gradwell and they believed they had found a networking issue preventing the web servers from talking into the Squid proxy and back to themselves, which they resolved last night. However, while I could then use the cURL transport, my cron queue still wasn’t processing.
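
For comparison, graceful fallback between transports might look something like the Python sketch below. This is only an illustration of the idea, not how WordPress implements it; the function name and the two stand-in transports are hypothetical.

```python
def fetch_with_fallback(url, transports):
    """Try each (name, callable) transport in order; return the first success."""
    errors = []
    for name, transport in transports:
        try:
            return name, transport(url)
        except Exception as exc:
            # Remember why this transport failed, then move on to the next.
            errors.append((name, exc))
    raise RuntimeError("all transports failed: %r" % errors)


# Hypothetical stand-ins for real transports, purely for demonstration.
def broken_transport(url):
    raise IOError("transport unavailable")


def working_transport(url):
    return "body of " + url
```

Calling `fetch_with_fallback(url, [("curl", broken_transport), ("streams", working_transport)])` moves on to the second transport instead of giving up when the first one raises.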

I tried everything, but now I wasn’t sure where the problem lay. Despite what my web host thinks of me, I try not to instantly blame them for any faults, and I knew WordPress certainly had some patches lined up for 2.7.1 to fix cron issues. Maybe there was a switch set when something didn’t work that wasn’t getting reset.

So I looked and looked and played. I went through the WordPress code surrounding the cron stuff, to try and work out where things might be failing, which is when I worked out that WordPress essentially loads wp-cron.php in the background. What I couldn’t understand is why WordPress couldn’t load that page when I could. I tried changing the timeouts, the type of function being used and disabling various transports. Nothing made any difference.

Checking the web server access logs from the day Squid was put in place showed no attempts to load the wp-cron.php page (other than the ones I made manually from my machine). From before the Squid proxy went in, however, there were several. But they weren’t GETs, they were HEADs.

xx.xx.xx.xx - - [03/Jan/2009:01:09:08 +0000] "HEAD /wp-cron.php?check=hash HTTP/1.0" 200 - "-" "WordPress/2.7"
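
If you would rather not eyeball the logs, a few lines of Python can tally which HTTP methods were used against wp-cron.php. This is a minimal sketch that assumes the usual Apache common/combined log layout shown above; the function name is made up, and the regular expression only looks at the quoted request part of each line.

```python
import re
from collections import Counter

# Matches the quoted request in a log line,
# e.g. "HEAD /wp-cron.php?check=hash HTTP/1.0"
REQUEST_RE = re.compile(r'"([A-Z]+) (\S+) HTTP/[\d.]+"')


def count_cron_methods(lines, path="/wp-cron.php"):
    """Count HTTP methods used for requests to `path` (query string ignored)."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2).split("?", 1)[0] == path:
            counts[match.group(1)] += 1
    return counts
```

Pass it an open log file (for example `count_cron_methods(open("access.log"))`, where the file name is just a placeholder) and compare the counts from before and after a change on the host.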

I had a small epiphany: when I load that page with a browser, it sends a GET. I tried manually sending a HEAD, first using telnet and then curl.

$ telnet xx.xx.xx.xx 80
Trying xx.xx.xx.xx...
Connected to xx.xx.xx.xx (xx.xx.xx.xx).
Escape character is '^]'.
HEAD /wp-cron.php?check=hash HTTP/1.0
host perceptionistruth.com

HTTP/1.0 403 Forbidden
Server: squid
Date: Thu, 07 Jan 2009 23:13:40 GMT
Content-Type: text/html
Content-Length: 1126
X-Squid-Error: ERR_ACCESS_DENIED 0
X-Cache: MISS from squid-4
Via: squid-4
Connection: close

Connection closed by foreign host.

and

$ curl --head "http://perceptionistruth.com"
HTTP/1.0 403 Forbidden
Server: squid
Date: Thu, 07 Jan 2009 23:34:33 GMT
Content-Type: text/html
Content-Length: 1102
X-Squid-Error: ERR_ACCESS_DENIED 0
X-Cache: MISS from squid-3
Via: squid-3
Connection: close

Trying again with regular GETs worked fine. So, it was Squid. Somehow Squid was blocking the HEAD requests but allowing the GETs. A quick Google search suggested that this was something other people got wrong in Squid, so I filled in a more detailed ticket. Gradwell fixed the configuration error today and the cron queue started working again.
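
The same HEAD-versus-GET comparison can be scripted. Below is a small Python sketch using only the standard library; the function names are my own, and it assumes a plain URL without embedded credentials.

```python
import http.client
from urllib.parse import urlsplit


def status_for(method, url, timeout=10):
    """Send one request with the given method and return the HTTP status code."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request(method, path)
        return conn.getresponse().status
    finally:
        conn.close()


def compare_head_and_get(url):
    """Return (head_status, get_status) for the same URL."""
    return status_for("HEAD", url), status_for("GET", url)
```

If the two numbers differ for your own site, with the HEAD coming back as a 403 while the GET succeeds, something in front of the web server is treating the methods differently.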

There are a number of people who report WordPress not posting scheduled posts, not sending out pings, or not processing entries in the cron queue. There are several fixes. In an unpatched 2.7, if cURL doesn’t work on your host you may find WordPress doesn’t try the other transports. However, if the transports seem to work and you can manually run wp-cron.php remotely (remembering to send the correct hash), then maybe you should check the following.

Can your web server access itself using its regular URI (i.e. are there firewalls or proxies in the way which block any kind of self-referral)?
Is the server behind a Squid installation which is blocking valid HTTP HEAD requests?

The second is easy to check (see above, use curl). The first is harder, but the Core Control plugin can help if you hack it a little: edit core-control/modules/core_control_http.php and change the test URI on line 139 to point at your own site. You’ll also need to create a working PHP file on your site which returns the text 1563. With that in place you can make sure the transports on your web host can talk back to the same web host.

Good luck.

I’m just glad everything is working, and my scheduled posts are now posting correctly again.
