Apache mod_cache in the Real World

I thought I’d share our experiences implementing Apache’s mod_cache. We wanted to implement caching of product and category pages for the SoftSlate Commerce Java shopping cart application of one of our clients. The product and category pages of an ecommerce storefront don’t often change so they are good candidates for caching. If Apache can serve them directly it saves Tomcat from having to deal with them and things go much more smoothly under super-heavy load. We were already using Hibernate’s second-level cache to cache the database interaction, and believe me that helps tremendously but we wanted even faster responses. At times we have made static .html files manually from key pages like the home page and some key category pages. mod_cache seemed like a better way.

Typically we deploy SoftSlate Commerce under Tomcat where Apache serves the requests initially and hands them off to Tomcat using mod_jk or mod_proxy. Obviously having Apache in the mix in this way is a prerequisite for using mod_cache.

Here’s the config we ended up with:

<IfModule mod_cache.c>

# 300 = 5 minutes
CacheDefaultExpire 300

# With CacheIgnoreNoLastMod On set, we don't need to
# define expires headers for the pages to be cached by the
# server. And we don't want to because we'll want to control
# the cache on the server. We don't want browsers to cache.
CacheIgnoreNoLastMod On

# Ignore the query string - newsletter links have tracking
# info attached to them. We want to ignore those parameters.
# Take care if this is a store that has sorting enabled on
# category pages - this will also ignore the sorting parameters!
CacheIgnoreQueryString On

# Do not store Set-Cookie headers with the cache or you'll
# get session poisoning!!!
CacheIgnoreHeaders Set-Cookie
<IfModule mod_disk_cache.c>

# Must be writable by apache
CacheRoot /var/local/apache_cache
CacheEnable disk /product
CacheEnable disk /category
CacheDirLevels 1
CacheDirLength 1
</IfModule>
</IfModule>

Apache: Please Cache This, Browsers: Please Don’t

The Apache Cache Guide was a great help but it left open a lot of questions. First off, we wanted complete control over the cache from the Tomcat application on the server. We wanted to be able to signal Apache to refresh a certain page when critical information changed. And for this reason we did not want browsers to ever cache the pages. We wanted complete control.

As it turns out, CacheIgnoreNoLastMod On helped us with this, combined with not defining an Expires header for the pages. Normally mod_cache wants a page to carry freshness information, such as an Expires or Last-Modified header, before it will cache it, and it uses that information to determine how long to cache the page. The problem is that browsers also look at these headers and will cache the page themselves based on them. CacheIgnoreNoLastMod On tells mod_cache to cache the pages even when those headers are missing, and CacheDefaultExpire (300 seconds in our config) then sets how long the cached copy is considered fresh. So we tell mod_cache, yes, cache this page, but we are not telling the browser to cache it. This is what we wanted because we wanted to maintain control of the cache on the server, within our Tomcat/SoftSlate Commerce code.

OK, Now, Apache, Refresh This Page Please

So, how to signal mod_cache to delete the page from the cache and refresh it? Well it turns out by default mod_cache does this any time it receives a request with headers like this:

cache-control: max-age=0

or these:

cache-control: no-cache
pragma: no-cache

The first example is what Firefox sends when you press CTRL-R to reload the page. The second is what Java’s HttpURLConnection class will send when you do setUseCaches(false). As far as I can tell they are equivalent. They have the effect of telling Apache to clear the page out of its cache and refresh it.
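To make the two header variants concrete, here is a minimal Java sketch of the check described above. It is only an illustration of the rule as stated in this post, not mod_cache's actual logic, and the class and method names (RefreshHeaders, forcesRefresh) are made up for the example.

```java
import java.util.Locale;

/** Simplified illustration of the request headers discussed above. */
class RefreshHeaders {

    /**
     * Returns true if the given request header values ask a cache to
     * refetch the page: a Cache-Control directive of max-age=0 or
     * no-cache, or a Pragma value containing no-cache.
     * Either argument may be null when the header is absent.
     */
    static boolean forcesRefresh(String cacheControl, String pragma) {
        if (cacheControl != null) {
            // Cache-Control carries a comma-separated list of directives.
            for (String directive : cacheControl.split(",")) {
                String d = directive.trim().toLowerCase(Locale.ROOT);
                if (d.equals("no-cache") || d.equals("max-age=0")) {
                    return true;
                }
            }
        }
        return pragma != null
                && pragma.toLowerCase(Locale.ROOT).contains("no-cache");
    }

    public static void main(String[] args) {
        // Browser reload
        System.out.println(forcesRefresh("max-age=0", null));
        // HttpURLConnection with setUseCaches(false)
        System.out.println(forcesRefresh("no-cache", "no-cache"));
        // Ordinary request with neither header
        System.out.println(forcesRefresh(null, null));
    }
}
```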

Yes, it’s true: Apache will refresh the page’s cache each time any user hits reload in his browser, and I know IE at least allows you to see the refreshed version with every click. As an aside, you ought to know there is a way to tell mod_cache to ignore the above headers, and always serve the content from the cache:

CacheIgnoreCacheControl On

But leaving this at the default Off value has the advantage of serving as our method of signaling Apache to refresh the cache for a given page from our Java code:

// Create the url based on SoftSlate's SEO settings
String code = "categoryCode";
String urlString = baseForm.getSettings().getValue("customerURL")
+ "/Category.do?code=" + code;
AppLinkTag apt = new AppLinkTag();
urlString = apt.createSEOURL(urlString, baseForm.getSettings());

// Send the request without reading the response body. useCaches=false
// triggers Apache to refresh its cache
URL url = new URL(urlString);
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setUseCaches(false);
con.setDoOutput(false);
con.setDoInput(true);
con.getInputStream();

There is our friend setUseCaches(false), which sets the headers telling Apache to refresh the page. I am no expert with HttpURLConnection but the above code seems to behave the way we want it to. As far as I can tell, getInputStream() sends the request and returns once the response starts to arrive; we never read the body, so the application is not hung up for long making the connection. Apache gets the request and knows to clear its cache and send the request along to Tomcat, and cache the new result. Danger: be careful that you don’t trigger this request recursively and end up with a nasty infinite loop of requests that trigger requests that trigger more requests!
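Because getInputStream() generally blocks until the server starts responding, one way to keep the calling thread completely free is to queue the refresh on a background thread and give it timeouts. The following is a sketch under assumptions, not SoftSlate Commerce code: CacheRefresher, prepare, and refreshAsync are hypothetical names, the timeout values are arbitrary, and the headers are set explicitly rather than relying on setUseCaches(false) alone.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper for asking Apache to refresh one cached page. */
class CacheRefresher {

    // One daemon worker thread, so queued refreshes never block callers
    // and never keep the JVM alive on shutdown.
    private static final ExecutorService POOL =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "cache-refresher");
                t.setDaemon(true);
                return t;
            });

    /** Builds the refresh request; no network traffic happens here. */
    static HttpURLConnection prepare(String urlString) {
        try {
            HttpURLConnection con = (HttpURLConnection)
                    URI.create(urlString).toURL().openConnection();
            con.setUseCaches(false);
            // Spell out the headers discussed above.
            con.setRequestProperty("Cache-Control", "no-cache");
            con.setRequestProperty("Pragma", "no-cache");
            con.setConnectTimeout(5000);  // milliseconds, arbitrary
            con.setReadTimeout(30000);    // milliseconds, arbitrary
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Queues the refresh and returns immediately. */
    static void refreshAsync(String urlString) {
        POOL.submit(() -> {
            try {
                HttpURLConnection con = prepare(urlString);
                try (InputStream in = con.getInputStream()) {
                    byte[] buf = new byte[8192];
                    while (in.read(buf) != -1) {
                        // Drain and discard the body.
                    }
                } finally {
                    con.disconnect();
                }
            } catch (IOException | UncheckedIOException e) {
                System.err.println("Cache refresh failed for "
                        + urlString + ": " + e);
            }
        });
    }
}
```

A call such as CacheRefresher.refreshAsync(urlString) would then replace the last six lines of the snippet above. The warning about recursive requests still applies.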

Bitten by the Cookie

Now you may be wondering about this in our configuration:

CacheIgnoreHeaders Set-Cookie

We started out without this configuration in place and boy did it bite us. It’s surprising to me mod_cache does not have this in place by default. What this is doing is telling Apache not to store the Set-Cookie header in the cache with the page’s content. You definitely do not want to do that in most typical web applications where session identifiers are defined via cookies! Doing so meant that if a user happened to request a page that had expired from the cache as the first hit of his session, his session ID was being stored with the cached page along with the page’s content. Anyone else hitting the page then gets the same cookie! Needless to say mayhem ensues as people get assigned the same session identifier. So, please, please consider adding this little line. I’m not sure when you wouldn’t want to add it. I mean really, how often would you want to send the exact same cookie to everyone?
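The effect of that directive can be pictured with a small Java sketch. This is only a model of the behaviour described here (drop the listed headers before the response is stored), not how mod_cache is implemented; StoredHeaders and forStorage are invented names, and the session ID is a made-up value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of stripping headers before a response is cached. */
class StoredHeaders {

    /**
     * Returns a copy of the response headers without any header whose
     * name appears in ignored. Header names compare case-insensitively.
     */
    static Map<String, String> forStorage(Map<String, String> responseHeaders,
                                          Set<String> ignored) {
        Set<String> ignoredNames = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        ignoredNames.addAll(ignored);

        Map<String, String> stored = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : responseHeaders.entrySet()) {
            if (!ignoredNames.contains(header.getKey())) {
                stored.put(header.getKey(), header.getValue());
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        Map<String, String> response = new LinkedHashMap<>();
        response.put("Content-Type", "text/html");
        // The first visitor's session cookie (made-up value).
        response.put("Set-Cookie", "JSESSIONID=ABC123; Path=/");

        // Only Content-Type survives, so later visitors served from the
        // cache are not handed the first visitor's session ID.
        System.out.println(forStorage(response, Set.of("set-cookie")).keySet());
    }
}
```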

Metrics, Please?

The last thing we wanted to do of course was to find out how many pages are being cached and how many times Apache is requesting them from Tomcat. (You kind of have to know how the thing is working for you.) First we added the Age header to our Apache logging configuration:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Age}o\"" combined

The above outputs “-” for the Age header if the page is not being served from the cache. Otherwise it outputs the number of seconds old the page is. With that in mind, a little grepping will show you how many cache hits there are for a particular day:

grep '21/Jul' access_log | grep -v '"-"$'

And a little more grepping will tell you how many misses there were:

grep '21/Jul' access_log | grep 'GET /category.*"-"$'

In our case, 84% of requests to the first category page we tried it on were served from the cache. Pretty good, and it sure beats writing static files manually!
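If you would rather compute the percentage in code than count grep output by hand, the same test can be written in a few lines of Java. This is a sketch that assumes the LogFormat above, where the quoted Age value is the last field on each line; CacheHitStats is an invented name and the log lines in main are made-up samples, not our real traffic.

```java
import java.util.Arrays;
import java.util.List;

/** Counts cache hits in access log lines whose last field is "Age". */
class CacheHitStats {

    /** True if the final quoted field is present and is not "-". */
    static boolean isCacheHit(String logLine) {
        String line = logLine.trim();
        return line.endsWith("\"") && !line.endsWith("\"-\"");
    }

    /** Percentage of lines that were cache hits, or 0 for no lines. */
    static double hitRate(List<String> logLines) {
        if (logLines.isEmpty()) {
            return 0.0;
        }
        long hits = logLines.stream().filter(CacheHitStats::isCacheHit).count();
        return 100.0 * hits / logLines.size();
    }

    public static void main(String[] args) {
        // Made-up sample lines in the format defined above.
        String prefix = "10.0.0.1 - - [21/Jul/2010:10:00:00 -0400] "
                + "\"GET /category/widgets HTTP/1.1\" 200 5120 "
                + "\"-\" \"Mozilla/5.0\" ";
        List<String> lines = Arrays.asList(
                prefix + "\"-\"",    // miss: no Age header
                prefix + "\"12\"",   // hit: cached copy is 12 seconds old
                prefix + "\"97\"",
                prefix + "\"240\"");
        System.out.println(hitRate(lines) + "% served from cache");
    }
}
```

To look at a single page type, filter the lines by request path (for example, those containing "GET /category") before passing them to hitRate.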

Update – Caching the Home Page

After writing up the above, I had a bear of a time trying to figure out how to cache the home page so thought I would share how I did it in the end. First there is the issue that mod_cache’s regular configuration is not very easy to use when it comes to just caching the home page and not the entire website. When you do this, it caches the entire site:

CacheEnable disk /

Since we only want to cache the home page, category pages, and product pages, we have to fall back on using the no-cache environment variable, then unset that variable for the specific paths we want to cache:

CacheEnable disk /
SetEnv no-cache

# Cache the home page
<LocationMatch "^/$">
 UnsetEnv no-cache
</LocationMatch>

# Cache category pages
<Location /category>
 UnsetEnv no-cache
</Location>

# Cache product pages
<Location /product>
 UnsetEnv no-cache
</Location>
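To check which request paths end up cacheable under this setup, the three blocks can be mirrored in a short Java sketch. It assumes, as I read the Apache documentation, that a Location without a trailing slash matches the path itself and anything below it on a path-segment boundary; CacheablePaths and its methods are invented names, and the sketch ignores everything else mod_cache considers.

```java
import java.util.regex.Pattern;

/** Mirrors the LocationMatch/Location blocks above as plain Java. */
class CacheablePaths {

    // Same regular expression as the LocationMatch for the home page.
    private static final Pattern HOME = Pattern.compile("^/$");

    /** Assumed Location semantics: the path itself or anything under it. */
    static boolean underLocation(String path, String location) {
        return path.equals(location) || path.startsWith(location + "/");
    }

    /** True if one of the blocks above would unset no-cache for this path. */
    static boolean isCacheable(String path) {
        return HOME.matcher(path).find()
                || underLocation(path, "/category")
                || underLocation(path, "/product");
    }

    public static void main(String[] args) {
        String[] paths = {"/", "/category/widgets", "/product/123", "/checkout"};
        for (String path : paths) {
            System.out.println(path + " cacheable: " + isCacheable(path));
        }
    }
}
```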

OK, so that’s in place now, but one problem – the home page was still not being cached. (The way I tell is by installing the Web Developer Firefox plug-in, and looking at the response headers. No Age header, no caching.)

Turns out Dr. Google tells me there is an issue with mod_cache and Apache’s DirectoryIndex directive. We had this in place in our configuration, to add index.jsp to the default list of files invoked on a request to a directory (such as the home page, or /):

DirectoryIndex index.html index.htm index.jsp index.php

I’m a little fuzzy but I believe the issue is Apache might cache index.jsp if you tell it to, but mod_cache would store the cache under the index.jsp path, rather than /, which is what we really wanted cached. I tried replacing the above with a RewriteRule directive but that had the effect of Apache spitting out the raw contents of index.jsp. It was not forwarding the request to Tomcat. The other wrinkle here was we’re using mod_jk, but the JkMount directives are apparently processed before RewriteRule. Alas, the eventual solution was to add a JkMount directive for / itself:

JkMount / ajp13

Bingo, now we have home-page caching, which is a good thing because about 17% of our client’s website hits are to their home page.

Soooooo, to summarize, here’s what we really ended up with for our overall mod_cache configuration:

# Comment out DirectoryIndex index.jsp!
# DirectoryIndex index.html index.htm index.jsp index.php

# URL patterns that Apache should hand off to Tomcat - add / so Apache
# forwards the home page to Tomcat (who already knows to use index.jsp).
JkMount / ajp13

...

<IfModule mod_cache.c>

# 300 = 5 minutes
CacheDefaultExpire 300

# With CacheIgnoreNoLastMod On set, we don't need to define expires headers
# for the pages to be cached by the server. And we don't want to because we'll
# want to control the cache on the server. We don't want browsers to cache.
CacheIgnoreNoLastMod On

# Ignore the query string - newsletter links have tracking info attached to them.
# We want to ignore those parameters. Take care if this is a store that has sorting
# enabled on category pages - this will also ignore the sorting parameters!
CacheIgnoreQueryString On

# Do not store Set-Cookie headers with the cache or you'll get session poisoning!!!
CacheIgnoreHeaders Set-Cookie
<IfModule mod_disk_cache.c>

# Must be writable by apache
CacheRoot /var/local/apache_cache
CacheDirLevels 1
CacheDirLength 1
CacheEnable disk /
SetEnv no-cache
<LocationMatch "^/$">
UnsetEnv no-cache
</LocationMatch>
<Location /category>
UnsetEnv no-cache
</Location>
<Location /product>
UnsetEnv no-cache
</Location>
</IfModule>
</IfModule>

About David Tobey

I'm a web developer and consultant based in lovely Schenectady, NY, where I run SoftSlate, LLC. In my twenties I worked in book publishing, where I met my wife. In my thirties I switched careers and became a computer programmer. I am still in the brainstorming phase for what to do in my forties, and I am quickly running out of time to make a decision. If you have any suggestions, please let me know. UPDATE: I have run out of time and received no (realistic) suggestions. I guess it's programming for another decade.

3 Responses to Apache mod_cache in the Real World

  1. Witnobfigo says:

    “It’s surprising to me mod_cache does not have this in place by default.” – It’s because RFC2616 section 13.5.1 says that a cache MUST store a user’s individual cookie data and send copies of it to an arbitrarily large number of other random users. It doesn’t seem to have occurred to anyone that when an RFC makes a recommendation which is sufficiently ill-thought-out as to provide a built-in mechanism for people to steal other people’s logins, then strict RFC compliance can legitimately be de-prioritised. (Yes, it’s just bitten me too. It screwed up the mod_usertrack data for thousands of users… and I couldn’t find anything on Google about it until after I’d figured out what was going on by myself because only then did I know what to search for, I thought it was a stale pointer bug in mod_usertrack or something…)

  2. sampath kumar says:

    Thanks for your detailed explanation.
    I am serving images from apache2, but I want to serve them even faster. I configured it the following way,
    but it does not seem to be working. Is something missing? Please tell me.

    CacheDefaultExpire 300
    CacheIgnoreNoLastMod On
    CacheIgnoreHeaders Set-Cookie
    CacheIgnoreQueryString On
    SetEnv no-cache

    LoadModule disk_cache_module modules/mod_disk_cache.so
    # Must be writable by apache
    CacheRoot /var/local/apache_cache
    CacheEnable disk/cms
    CacheEnable disk/products
    CacheDirLevels 1
    CacheDirLength 1

  3. michelbisson says:

    here are some mistakes you made:
    CacheEnable disk/cms
    CacheEnable disk/products

    Should be:
    CacheEnable disk /cms
    CacheEnable disk /products
    (notice the space between ‘disk’ and ‘/cms’)
    If you use the directive ‘SetEnv no-cache’ globally for your site then naturally mod_cache will not cache. Just use this directive in a Location block to select which items should NOT be cached.
