Fixing Drupal's Cache mechanism

Well, my MySQL optimization routine, which I had such high hopes for, is not as ideal as I had thought. I have to optimize the table several times a day to keep the site humming, and that's not a good solution.

I've found two major problems. I don't know why my site is affected when I can't find anyone else reporting the same thing, but at least I've managed to figure out what is actually going wrong.

Both problems stem from the fact that Drupal caches pages by request URL rather than by the actual content. Makes sense, right? OK, perhaps, but two aspects of Drupal make this problematic.

1) Any 404 error page is generated by Drupal automatically, but keeps the requested URL.
2) Any URL of the form node/*, where * is anything except a number, returns a page with a list of recent nodes.

Combine these two things with overzealous search engines that try to find pages from the old site, and make up weird URLs to boot, and you'll find your cache table filling up with thousands of copies of the exact same data. I think we're having issues because of the large number of pages we have (around 6,000), plus the fact that search engines are still trying to crawl old pages whose URLs are no longer valid.

Our cache table has thousands of entries for 404 pages. For example, if you request /asdfasdf, that's one entry for a 404 page. If you then request /bobobobo, that's another entry with the exact same 404 page content. These can really add up. In addition, we're getting a lot of requests for URLs like /node/node/node/node/conferences. I don't know where these are coming from, but each one returns a list of recent pages, and each is cached with the exact same content.
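
To make the duplication concrete, here is a rough sketch of what a URL-keyed cache ends up holding. Nothing in it is Drupal code: $page_cache, render_page() and the example.com URLs are made up for illustration, and the page bodies are placeholders.

```php
<?php
// Hypothetical stand-in for page rendering: every bogus node/* path gets
// the same "recent nodes" list, every other unknown path the same 404 body.
function render_page($path) {
  if (strpos($path, '/node/') === 0) {
    return '<h1>Recent nodes</h1>';
  }
  return '<h1>404: Page Not Found</h1>';
}

// Hypothetical stand-in for the cache table: one entry per request URL.
$page_cache = array();
$requests = array(
  '/asdfasdf',
  '/bobobobo',
  '/node/node/node/node/conferences',
  '/node/foo',
);
foreach ($requests as $path) {
  $page_cache['http://example.com' . $path] = render_page($path);
}

$entries = count($page_cache);
$distinct_bodies = count(array_unique($page_cache));
echo "$entries cache entries, $distinct_bodies distinct page bodies\n";
```

Four requests produce four cache entries but only two distinct page bodies. Scale that up to thousands of made-up URLs and you get the table growth described above.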

So how to fix?

This patch has potential. But until I can try it out, I've made two modifications.

In includes/bootstrap.inc, I've modified the function page_set_cache() to look like:

function page_set_cache() {
  global $user, $base_url;

  if (!$user->uid && $_SERVER['REQUEST_METHOD'] == 'GET') {
    // This will fail in some cases, see page_get_cache() for the explanation.
    if ($data = ob_get_contents()) {
      // DIFF - check if it's a 404 page before caching
      if (strpos($data, '404: Page Not Found') === false) { // DIFF
        if (function_exists('gzencode')) {
          if (version_compare(phpversion(), '4.2', '>=')) {
            $data = gzencode($data, 9, FORCE_GZIP);
          }
          else {
            $data = gzencode($data, FORCE_GZIP);
          }
        }
        ob_end_flush();
        cache_set($base_url . request_uri(), $data, CACHE_TEMPORARY, drupal_get_headers());
      } // DIFF
    }
  }
}

Basically, check whether it's a 404 page and, if so, don't cache it. The downside is that every time a search bot requests a page that doesn't exist, the page is generated from scratch rather than served from the cache. I don't know of an easy way around this: at the point where a cached copy would be served, nothing is known about the page other than its URL, so you can't tell that it's going to be a 404 page and serve up the cached copy. Since MySQL is having problems, I'm doing this to save the database at the expense of some more CPU cycles.
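
The check itself can be pulled out into a small helper, which makes it easier to try against sample output. This is only a sketch: is_cacheable_page() is a name I made up, and the marker string has to match whatever title your site's 404 page actually prints.

```php
<?php
// Hypothetical helper, not part of Drupal. Returns true when the rendered
// page does not contain the 404 marker and is therefore worth caching.
function is_cacheable_page($data, $marker = '404: Page Not Found') {
  // strpos() returns 0 when the marker sits at the very start of $data,
  // so compare against false with === instead of testing truthiness.
  return strpos($data, $marker) === false;
}

var_dump(is_cacheable_page('<html><title>Conferences</title></html>'));
var_dump(is_cacheable_page('<html><title>404: Page Not Found</title></html>'));
var_dump(is_cacheable_page('404: Page Not Found'));
```

The third call is the reason for the strict comparison. A plain if (!strpos(...)) would treat a match at position 0 as "not found" and cache the 404 page anyway. Note also that any real page that happens to contain the marker text will be skipped by the cache as well.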

The second modification is in the node_page() function in modules/node.module.

    default:
      if (arg(1)) {
        drupal_not_found();
      }
      else {
        drupal_set_title('');
        print theme('page', node_page_default());
      }

Basically, by default Drupal just runs the else part, and the node_page_default() function displays a list of recent nodes. I've modified this so that if there's anything after the node/ part of the request, it displays the 404 page rather than the list of recent pages. Since 404 pages aren't being cached thanks to mod #1, these won't get added to the cache table either.
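
Here is a rough model of the decision for node/* paths after the change. classify_node_path() and its return labels are hypothetical, and it only covers the three outcomes discussed in this post; whatever other cases the real switch in node_page() handles are left out.

```php
<?php
// Hypothetical classifier, not Drupal code: models only the outcomes
// discussed here for paths that start with "node".
function classify_node_path($path) {
  $parts = explode('/', trim($path, '/'));
  if ($parts[0] !== 'node') {
    return 'other';
  }
  if (count($parts) === 1) {
    return 'recent-list';   // plain "node": still the list of recent nodes
  }
  if (preg_match('/^[0-9]+$/', $parts[1])) {
    return 'node-view';     // "node/<number>": an individual node
  }
  return 'not-found';       // "node/<anything else>": now a 404
}

echo classify_node_path('/node'), "\n";
echo classify_node_path('/node/42'), "\n";
echo classify_node_path('/node/node/node/node/conferences'), "\n";
```

Under these assumptions, the bogus /node/node/node/node/conferences request lands in the not-found bucket instead of getting its own cached copy of the recent nodes list.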

This is not ideal, but it should keep the cache table from growing to an unmanageable size.