Chapter 5. Caching

This chapter describes how the proxy server caches documents. It also describes how the cache is configured with the Administration forms and maintained with the garbage collector. (See “Default Caching Configuration” for information on the specific Administration form cache settings.)

How Caching Works

Caching reduces network impact and offers faster response times for clients using the proxy server. When you install the Netscape Proxy Server you specify a disk cache directory and size. The installation creates a cache directory structure that the proxy uses to store documents from remote servers (see “Cache Directory Structure” for a description and Figure 5-5 for an illustration of the cache directory structure).

When a client requests a new document from the proxy, the proxy copies the document from the remote server to its local filesystem in addition to sending the document to the client.

When another request comes for the same file, the proxy returns the file from the cache if the file is up to date. If the proxy determines the file is not up to date, it refreshes the document from the remote server and then sends it to the client. See Figure 5-1 for a diagram of this decision process.

Figure 5-1. The Proxy Up-to-Date Check


Is the Cache Used?

When the proxy starts, it reads in configuration information that tells it if caching is used, and if so, what documents and protocols are cached. If you chose not to set up caching during the installation process, you can use the Proxy Manager's Caching Configuration form to set up a cache system. The proxy needs to know the cache location (cache root directory) and the maximum size of the cache.

If you set up a cache directory, the proxy uses two caching options to determine if a document is cached. The proxy checks if the document's protocol is cached (cached protocols are HTTP, FTP, or Gopher). If the protocol is cached, the proxy uses the caching strategy to see how the protocol is cached.

Even if “everything” is cached, the proxy caches only documents retrieved with the GET method, and only if they have a Last-Modified header, an Expires header, or both. Figure 5-2 illustrates how the caching protocols relate to the caching strategy.

Figure 5-2. Relation of Caching Protocols and Caching Strategy
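
The rules above can be sketched in a few lines of Python. This is only an illustration for an HTTP response, not the proxy's actual code; the function name and the http_cached flag (standing in for the protocol setting on the Caching Configuration form) are assumptions made for the example.

```python
def is_cacheable_http(method, headers, http_cached=True):
    """Apply the rules described above to one HTTP response.

    method      -- request method, for example "GET"
    headers     -- dict of response header names to values
    http_cached -- hypothetical flag: is HTTP one of the cached protocols?
    """
    if not http_cached:
        return False                      # protocol is not cached at all
    if method.upper() != "GET":
        return False                      # only GET documents are cached
    names = {name.lower() for name in headers}
    # The document needs a Last-Modified header, an Expires header, or both.
    return "last-modified" in names or "expires" in names
```

For example, a GET response that carries only a Content-Type header would not be cached under these rules.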


Is a Document Up to Date?

You can configure the proxy to guarantee that documents are up to date before sending them from the cache. The Cache Refresh Setting (see “Cache Refresh Setting”) specifies whether to make sure a document is up to date or to wait a specific interval before doing the up-to-date check.

Because FTP and Gopher protocols don't include a method for checking that a document is up to date, having a default refresh time is the only way to make FTP and Gopher caching effective.

HTTP, on the other hand, provides an easy way to make sure a document is up to date. The proxy makes one call to the remote server that basically says, “send me this document only if it was modified since this date.” The proxy sends the content of the Last-Modified header that was stored in the proxy's Cache Information File (CIF) as an If-Modified-Since: header.
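
As a rough sketch of this exchange, the following Python builds the conditional request and decides which copy to send to the client. The function names and the example host are hypothetical; the only protocol facts relied on are the If-Modified-Since request header and the 304 (Not Modified) status code that an HTTP server returns when the document hasn't changed.

```python
def build_conditional_get(host, path, cached_last_modified):
    """Build the text of a conditional GET request.

    cached_last_modified is the Last-Modified value that was stored
    with the cached copy (in the CIF, according to the text above).
    """
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"If-Modified-Since: {cached_last_modified}",
        "",   # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def choose_body(status_code, cached_body, response_body):
    """Pick the document to send to the client."""
    if status_code == 304:
        return cached_body      # not modified: the cached copy is current
    return response_body        # modified: use (and re-cache) the new copy


request = build_conditional_get(
    "www.example.com", "/index.html", "Tue, 02 Jan 1996 10:00:00 GMT")
```

When the server answers 304, no document body is transferred, which is where the bandwidth saving comes from.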

Figure 5-3. Cache Use Algorithm


Caching HTTP vs. FTP and Gopher

Internally, caching HTTP documents differs from caching FTP and Gopher documents. The HTTP protocol provides effective and efficient ways for caching that are missing from FTP and Gopher protocols.

Caching HTTP

HTTP documents have a descriptive header section that the proxy server uses to compare and evaluate the document in the proxy cache and the document on the remote server. When the proxy does an up-to-date check on an HTTP document, it sends one request that tells the server to return the document only if it has changed since the proxy cached it. Often, the document hasn't changed and isn't transferred. This saves bandwidth and decreases latency.

To reduce transactions with remote servers, you can set a Cache Expiration setting so that the proxy first estimates if the HTTP document needs an up-to-date check before it actually sends the request (the proxy makes the estimate based on the HTTP document's Last-Modified date). Use an expiration setting that fits your data (you can set different expirations for different HTTP documents by creating and modifying a resource). See “Cache Expiration Setting” for more information about the Cache Expiration setting.

With HTTP documents, you can also use a Cache Refresh setting. This option specifies whether the proxy always does an up-to-date check (which would override an Expiration setting) or if the proxy waits a specific period of time before doing a check. Figure 5-4 shows what the proxy does if both an Expiration setting and a Refresh setting are specified. Using the Refresh setting considerably decreases latency and saves bandwidth.

Figure 5-4. Proxy with Expiration and Refresh Setting


Caching FTP and Gopher

The only way to optimize caching for FTP and Gopher protocols is to set a Cache Refresh time; otherwise, you'll have the proxy retrieving these documents even if the documents in the cache are the latest versions.

If you set a Refresh interval, choose one that you consider safe for the documents the proxy gets. For example, if you store information that rarely changes, use a high number (several days). If the data changes constantly, you'll want the files to be checked at least every few hours. Note that during the refresh time, you risk sending an out-of-date file to the client. But if the interval is short enough (a few hours), you eliminate most of this risk while getting notably faster response times.

If your FTP and Gopher documents vary widely (some change often, others rarely), use the Resource Manager to create a resource for the different documents (for example, create a resource like ftp://*.gif) and then use a Refresh Interval that is appropriate for that resource.
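
The effect of per-resource Refresh intervals can be sketched as follows. The rule table, the interval values, and the function names are hypothetical, and Python's fnmatch patterns are used only as a stand-in for the proxy's own wildcard syntax, which may behave differently.

```python
from fnmatch import fnmatchcase

HOUR = 3600

# Hypothetical resources and Refresh intervals (in seconds); first match wins.
REFRESH_RULES = [
    ("ftp://*.gif", 7 * 24 * HOUR),   # images that rarely change
    ("ftp://*", 4 * HOUR),            # other FTP documents
    ("gopher://*", 4 * HOUR),
]


def refresh_interval_for(url, rules=REFRESH_RULES):
    """Return the Refresh interval of the first matching resource, or None."""
    for pattern, seconds in rules:
        if fnmatchcase(url, pattern):
            return seconds
    return None


def can_serve_from_cache(url, last_check, now, rules=REFRESH_RULES):
    """True if the cached copy may be sent without contacting the remote server.

    last_check and now are timestamps in seconds.
    """
    interval = refresh_interval_for(url, rules)
    if interval is None:
        return False              # no Refresh interval: retrieve every time
    return now - last_check < interval
```

With these example rules, a GIF fetched by FTP is served from the cache for a week, while other FTP documents are retrieved again after four hours.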

Optimizing Cache Document Retrieval

If your proxy server is used to access a variety of document types (FTP, HTTP, Gopher), you'll want to use several cache features to optimize the way the cache works. Unless your cache size is very limited, start with the Caching Strategy that caches everything unless explicitly forbidden.

The following list describes a recommended cache configuration:

  • Use Cache Refresh Settings to optimize caching of FTP and Gopher documents. Create resources for documents and then use different Refresh Intervals as appropriate. For example, you could create resources such as ftp://*.gz and *.(Z|gz).

  • Set a reasonable Refresh Interval (for example 4-8 hours) for HTTP documents. This reduces the number of times the proxy connects with remote servers. Even though the proxy doesn't do up-to-date checking during the refresh interval, users can force a refresh by clicking the Reload button in the Netscape Navigator (this makes the proxy retrieve the document from the remote server).

  • Use a Cache Expiration Setting to estimate when a document is likely to be out of date (valid only for HTTP documents). Not many HTTP documents use explicit Expires headers, so it's better to estimate based on the Last-Modified header (for example, use an estimation factor of 0.1).
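
To give a feel for what an estimation factor of 0.1 means, here is a small worked sketch. This chapter does not spell out the proxy's exact formula, so the calculation below is an assumption: it treats a document as fresh for a fraction of the time that had passed between its Last-Modified date and the moment the proxy fetched it.

```python
DAY = 24 * 3600


def estimated_expiry(fetched_at, last_modified, factor=0.1):
    """Estimate when a cached HTTP document is likely to be out of date.

    All times are timestamps in seconds. Assumed formula:
    expiry = fetch time + factor * (fetch time - Last-Modified time).
    """
    age_at_fetch = max(0, fetched_at - last_modified)
    return fetched_at + factor * age_at_fetch


# A document that was 10 days old when fetched is treated as
# fresh for about 1 more day with a factor of 0.1.
fetched = 1_000_000
expiry = estimated_expiry(fetched, fetched - 10 * DAY)
fresh_for_days = (expiry - fetched) / DAY
```

Under this assumption, documents that haven't changed for a long time are checked less often than documents that were modified recently.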

Cache Directory Structure

When you install the proxy server, you specify a directory (usually a subdirectory of the server root) for the cache—this is where the proxy temporarily stores documents.

The installation creates the cache root directory and creates a set of subdirectories where it places documents as it retrieves them from the remote servers. Figure 5-5 shows the structure of the cache root.

Figure 5-5. Three Subdirectories of the Cache Root Directory


The proxy stores documents in the last level of directories (there are 4096 directories at this level).

Dispersion of Files in the Cache

The proxy uses a certain algorithm to determine the directory where a document should be stored. This algorithm ensures equal dispersion of documents in the base directories, so the directories contain a small and nearly equal number of documents. This is important for several reasons:

  • Directories with large numbers of documents tend to cause performance problems.

  • Garbage collection is much more stable because there is a relatively consistent number of files per directory, making it easier to estimate the number of files to remove.

Filename and Directory

The proxy uses the RSA MD5 algorithm (Message Digest) to reduce a URL to 8 characters, which it then uses for the filename of the document it stores in the cache.

The MD5 algorithm reduces the URL to 128 bits (16 bytes) of binary data. The proxy uses only 48 bits (6 bytes) to calculate an 8-character filename and determine the storage directory; this is enough to cache millions of URLs.
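
The arithmetic works out because 48 bits split into eight 6-bit groups, and each group can be written as one character from a 64-character alphabet. The sketch below illustrates the idea with Python's standard library. Which 48 bits the proxy takes, which alphabet it uses, and how it picks the directory are not described here, so the choices in the code (the first 6 bytes and the URL-safe Base64 alphabet) are assumptions.

```python
import base64
import hashlib


def cache_filename(url):
    """Reduce a URL to an 8-character name (illustrative only)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()   # 128 bits = 16 bytes
    first_48_bits = digest[:6]                           # assumed: first 6 bytes
    # 6 bytes encode to exactly 8 Base64 characters, with no padding.
    return base64.urlsafe_b64encode(first_48_bits).decode("ascii")


name = cache_filename("http://home.sgi.com/index.html")
```

The same URL always yields the same name, so the proxy can find a cached document again without keeping a separate lookup table for filenames.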

Limiting the Number of Cache Directories

You can limit the number of top-level cache directories (CacheRoot/a…p). This, in effect, reduces the total number of directories the cache has for storing documents. The number of top-level cache directories must be a power of two (1, 2, 4, 8, or 16):

  • 1 (a) yields 256 bottom-level directories (optimized for approximately 150 MB cache size).

  • 2 (a-b) yields 512 bottom-level directories (optimized for approximately 300 MB cache size).

  • 4 (a-d) yields 1024 bottom-level directories (optimized for approximately 500 MB cache size).

  • 8 (a-h) yields 2048 bottom-level directories (optimized for approximately 1-2 GB cache size).

  • 16 (a-p) yields 4096 bottom-level directories (optimized for 2-5 GB cache size).
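
The figures in this list follow a simple rule: each top-level directory accounts for 256 bottom-level directories. The short sketch below reproduces the list; the function names are made up for the example.

```python
BOTTOM_DIRS_PER_TOP_LEVEL = 256        # taken from the list above (1 -> 256)
ALLOWED_TOP_LEVEL_COUNTS = (1, 2, 4, 8, 16)


def top_level_names(count):
    """Return the top-level directory names, for example a-d for 4."""
    if count not in ALLOWED_TOP_LEVEL_COUNTS:
        raise ValueError("count must be 1, 2, 4, 8, or 16")
    return [chr(ord("a") + i) for i in range(count)]


def bottom_level_dirs(count):
    """Return the number of bottom-level directories for a given count."""
    return len(top_level_names(count)) * BOTTOM_DIRS_PER_TOP_LEVEL


for count in ALLOWED_TOP_LEVEL_COUNTS:
    names = top_level_names(count)
    print(count, f"({names[0]}-{names[-1]})", bottom_level_dirs(count))
```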

To limit the number of directories, you must manually edit the init-cache function in the magnus.conf configuration file (see “magnus.conf File” for more information).


Caution: The cache structure is built during installation and can't be altered later without rebuilding the entire cache from scratch. If you aren't sure what cache size to use, use the largest cache capacity or use the 2 GB default value in the installation forms (this default can hold more than 2 GB of data and can be used with 3-5 GB caches).


Moving or Splitting the Cache Directory

You can move the cache root directory to another filesystem or directory. After you move or rename the directory, you must use the Proxy Manager to tell the proxy server where the new cache location is.

The easiest way to move the directory structure is to pack the directory with tar and untar the file in the new directory:

tar cf - * | (cd [NewDirectory] && tar xvf -)

After you move the physical directory, use the Proxy Manager to point the proxy to the new cache root and then do a soft restart of the proxy server.

You can also split the cache directory by using symbolic links from the original cache root to a new directory. The cache root contains subdirectories a-p, which can be individually relocated as long as a symbolic link is created from the old directory to the new one. For example, you can have the proxy look for the cache structure in a directory called proxycache. This directory could contain the cache subdirectories a through h. You could then have symbolic links for directories i through p that point to a directory called othercache.

In the proxycache directory (the actual cache root), you'd type:

ln -s /othercache/i i

to create a symbolic link from proxycache/i to othercache/i. Repeat this for directories j through p.

You can only use symbolic links with the first subdirectory structure (cacheroot/a-p) and you must copy all subdirectories for each directory.
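
If you prefer to script the repeated links rather than type each one, a loop such as the following Python sketch does the same job. The function name is hypothetical, and the sketch assumes that the subdirectories have already been copied to the new location and removed from the cache root.

```python
import os


def link_subdirs(cache_root, other_root, names="ijklmnop"):
    """Create cache_root/<name> -> other_root/<name> for each name.

    Raises an error if a target directory is missing or if the link
    name already exists in the cache root.
    """
    created = []
    for name in names:
        target = os.path.join(other_root, name)
        link = os.path.join(cache_root, name)
        if not os.path.isdir(target):
            raise FileNotFoundError(f"missing target directory: {target}")
        os.symlink(target, link)      # same effect as: ln -s target link
        created.append(link)
    return created
```

For the example above, you would call link_subdirs("/proxycache", "/othercache"), using the full paths of your own directories.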

Using the Cache Manager

The Netscape Proxy Cache Manager lets you control caching for documents, lets you view all cached documents, and lets you see an estimated size of the current cache structure.

The Cache Manager lets you view all cached URLs and lists information about the URLs. You can explicitly expire documents in the cache (so that the next time they are accessed, the proxy does an up-to-date check and possibly refreshes the document in the cache), and you can remove one or more documents from the cache.

Accessing the Cache Manager

You can access the Cache Manager from the Administration forms or from the Resource Manager. From the Administration forms, the Cache Manager lets you view all cached documents grouped by type and site name (for example, http://home.sgi.com). You can then type wildcard patterns to limit the list of sites or URLs you view.


From the Resource Manager, you specify a resource first, then click the link called View the list of cached information pertinent to this resource. The Cache Manager appears with the resource you chose as a wildcard pattern, and the cached documents are listed in alphabetical order with radio buttons for expiring and removing the documents.


Note: This is the same as if you used the Administration forms, accessed the Cache Manager, and typed in the resource as a wildcard pattern.

From the Administration Forms

When you access the Cache Manager from the Administration forms, you either view all cached sites or you use wildcard patterns to restrict what you view. When you view sites or documents, you can then expire or remove them from the cache.

When you view all cached sites, you click a button to view them either listed as bulleted items or listed with radio buttons for expiring and removing the site. The bulleted list is generated more quickly (this is good when you simply want to view information), but the radio buttons list lets you explicitly expire or remove individual sites. With the bulleted list you must first select the site, then choose the documents at that site that you want to expire or remove, so it can actually take you longer to do the same task.

With either list, you can click a link for a single site and then select specific cached URLs from that site to expire or remove.

From the Resource Manager

When you access the Cache Manager from the Resource Manager, the Cache Manager uses the resource wildcard pattern to show only the cached documents that match the resource.

When you click the link called “View the list of cached information pertinent to this resource” the Cache Manager appears with the documents listed with two radio buttons, so you can select the resources you want to remove or expire.

Expiring and Removing Documents in the Cache

You can expire and remove documents in several ways:

  • You can select the radio buttons for each URL and then click a button to expire or remove them.

  • You can select a specific URL, and then expire or remove it.

  • You can simultaneously expire or remove many documents by using wildcard patterns.

Use the Cache Manager to select documents to expire or remove. See the previous section for instructions on selecting documents. When you have a list of documents, you can click the link for a specific document to get a form that lets you remove or expire that particular document.

However, if you know exactly which resources you want to expire or remove, you can use the shortcuts on the Cache Manager form—they are quick to type in, but they are slower to perform for the server because they affect the entire cache. If you choose this way, be careful when selecting the wildcard pattern.

Rebuilding the Cache Directory Structure

The Netscape Proxy Server requires a specific directory hierarchy under the cache root directory. If this hierarchy is missing or damaged, caching won't work.


To rebuild the cache hierarchy (if it's damaged or incomplete), click the Special Maintenance link from the Proxy Manager page. In the Cache Builder and Repairer section, click the button to rebuild the cache hierarchy.


Note: You can also use this utility if the cache hierarchy has been damaged (for example, parts of directories are missing).

While the cache is building, you can view the cache builder messages to see when cache build is complete or if errors have occurred.

Repairing the Cache URL Database

The proxy has a utility that goes through the entire cache and repairs the Cache Manager's URL database. Use this utility if your Cache Manager's URL database appears damaged when viewed through the Cache Manager (for example, if the URL database doesn't seem to contain all the URLs, if the Cache Manager claims that the cache is empty or corrupt, or if it shows garbage even after reloading the page). While running the utility, you can look at the progress report to see its current status.

To repair the cache URL database, click the Special Maintenance link from the Proxy Manager page. Click the button in the URL Database Repair section.

Running the URL repair utility can take from a few minutes to a couple of hours to complete, depending on the size of the cache and the speed and load of your host and its disks. The update takes effect on the top-level URL (site) listing only after the entire operation has completed.

This utility is rarely needed. The only way that the URL database can get out of sync is if something prevents the proxy from updating its URL database after it has completed the cache write. This could happen if the disk suddenly becomes full, if the file permissions are wrong, or if the system suddenly goes down. The URL database is located in the hosts subdirectory under the cache root directory.

This utility can re-create the entire URL database from scratch if it is accidentally deleted.

What is the Garbage Collector?

The Netscape Garbage Collector Daemon performs cache cleanup on a regular basis, ensuring that the cache directories don't get too full or get cluttered with old documents.

The Garbage Collector Daemon runs through the cache files regularly, usually one or more times a day. It is started automatically by the same script that starts the proxy server. You can choose up to three garbage collection times a day. Once a day is sufficient if you have a large cache that doesn't fill up quickly. If you have a small cache or a busy server that caches a lot of documents, you should set the garbage collector to run three times a day.

Every 20 minutes the garbage collector checks if garbage collection is needed (the cache is about to overflow) and starts garbage collection only if it is needed (but it always runs at the times you specify). The cache directory can become full if the proxy is busy and caches a lot of files or if the cache is too small for the number of documents the proxy is configured to cache.
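
The scheduling rule can be summarized in a short sketch. The function names are hypothetical, and the threshold for “about to overflow” is an assumed value chosen for illustration; the chapter does not say how full the cache must be.

```python
CHECK_INTERVAL_MINUTES = 20
OVERFLOW_THRESHOLD = 0.9     # assumed: "about to overflow" means 90% full


def collection_needed(used_bytes, max_bytes, threshold=OVERFLOW_THRESHOLD):
    """True if the cache is close to its configured maximum size."""
    return used_bytes >= threshold * max_bytes


def should_collect(minute_of_day, scheduled_minutes, used_bytes, max_bytes):
    """Decide whether garbage collection starts at this minute of the day.

    scheduled_minutes holds up to three configured run times,
    expressed as minutes after midnight.
    """
    if minute_of_day in scheduled_minutes:
        return True          # configured times always trigger a run
    if minute_of_day % CHECK_INTERVAL_MINUTES == 0:
        return collection_needed(used_bytes, max_bytes)
    return False             # between checks nothing happens
```

For example, with a single run scheduled at 3:00 a.m. (minute 180), a cache that is 95 percent full at 3:20 a.m. would trigger an extra run, while a cache that is 10 percent full would not.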

How Garbage Collection Works

When the garbage collector runs, it reads in information for files in the cache directory and uses it to determine which files to remove. Specifically, it looks at

  • how long it took to transfer the document from the remote server

  • how much time has elapsed since the file was last refreshed or had an up-to-date check

Because larger files usually take longer to transfer, this generally means that larger files stay in the cache longer, providing quicker response times when a client requests the document. It also means that older documents are removed before newer ones.
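
One way to picture how these two values interact is a simple ranking, sketched below. The scoring formula is invented for illustration and is not the garbage collector's actual calculation: it divides the time since the last check by the transfer time, so documents that were cheap to fetch and haven't been checked recently are removed first.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    url: str
    transfer_seconds: float   # how long the transfer from the remote server took
    last_checked: float       # timestamp of the last refresh or up-to-date check


def removal_order(files, now):
    """Return files sorted so that the best removal candidates come first."""
    def score(f):
        idle = now - f.last_checked
        # Hypothetical score: long idle time and short transfer time rank highest.
        return idle / max(f.transfer_seconds, 0.001)
    return sorted(files, key=score, reverse=True)


now = 100_000
files = [
    CachedFile("http://a.example/big.tar", 60.0, now - 3600),
    CachedFile("http://a.example/small-old.html", 1.0, now - 3600),
    CachedFile("http://a.example/small-new.html", 1.0, now - 30),
]
order = [f.url for f in removal_order(files, now)]
```

In this example the small, old document is removed first, the large document next, and the recently checked document last.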

Garbage Collector Process Priority

The garbage collector can be run as a low priority process. You can specify a priority (nice) value—the larger the nice value, the lower the priority. Zero means normal priority, 39 lowest priority.


Caution: If you have a busy CPU and you set a low priority (you use a high nice value), garbage collection might not run as scheduled. In this case, use the normal priority (zero).

If garbage collection never runs (that is, the CPU never gives the process any cycles because the process has too low a priority), the garbage collector daemon forces a garbage collection at normal priority.

Emergency Garbage Collection

The garbage collector runs immediately if the filesystem is full. This can happen if the cache root is on a shared filesystem and other applications fill up the disk space.

Using the pstats Utility

The Netscape Proxy Server has a log analyzer utility that collects data from the proxy access log file. This utility produces output that can help you determine the proxy's performance and fine-tune the number of processes the proxy needs, and that can suggest an optimal cache size.

The pstats utility is in [ServerRoot]/utils. You specify the access log name as a parameter (the log must be in the extended log file format):

utils/pstats logs/access

Interpreting the pstats Output

The output of pstats displays

  • Two different transfer time diagrams. The first lists the percentage of requests finished in each service time category (in seconds). The second lists the percentage of the requests completed by a certain maximum service time.

  • Status codes returned by the remote server and to the client. Because a client can interrupt the connection before the transfer is complete, the two numbers of status codes don't necessarily exactly match the number of bytes transferred in each direction.

  • Requests and connections. This is the number of times the proxy was able to send a document from the cache instead of from the remote server. The higher this number is, the more efficient the proxy is, because it doesn't have to access the remote server and use valuable network time.

  • Cache performance report. This lists detailed cache information such as the number of documents retrieved from the cache and from remote servers.

  • Transfer time report. This is the number of seconds the proxy takes to transfer documents.
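
To clarify the difference between the two transfer time diagrams, the following sketch computes both kinds of percentage from a list of service times. It does not read the access log or reproduce the pstats output format; the function name and the category boundaries are assumptions made for the example.

```python
def service_time_report(times, bucket_edges=(1, 2, 5, 10)):
    """Return (upper_edge, percent_in_category, percent_completed_by_edge) rows.

    times        -- service times in seconds, one per request
    bucket_edges -- assumed upper bounds of the service time categories
    Requests slower than the last edge are counted in the total only.
    """
    if not times:
        return []
    total = len(times)
    rows = []
    cumulative = 0
    lower = 0
    for upper in bucket_edges:
        count = sum(1 for t in times if lower <= t < upper)
        cumulative += count
        rows.append((upper, 100.0 * count / total, 100.0 * cumulative / total))
        lower = upper
    return rows


rows = service_time_report([0.5, 0.5, 1.5, 3.0], bucket_edges=(1, 2, 5))
```

The second value in each row corresponds to the first diagram (requests finished in that category), and the third value corresponds to the second diagram (requests completed within that maximum time).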

Number of Server Processes

The transfer time report is important when determining how many processes the proxy should use (specified by Server Processes in the Configuration form or specified with the MaxProcs directive in magnus.conf).

Average transaction time is the actual time that one server process is busy on average with a single request. This number should be used when determining how many server processes are needed.

The other figures do not take errors into account, and give only the perceived response time to the user, not the actual time that server resources were bound to servicing the entire load of incoming requests.
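
This chapter doesn't give a formula for turning the average transaction time into a process count. One common rule of thumb, used here purely as an assumption, is that the average number of busy processes equals the request rate multiplied by the average transaction time. The headroom factor in the sketch is also hypothetical; it leaves spare processes for bursts of requests.

```python
import math


def estimated_processes(requests_per_second, avg_transaction_seconds, headroom=1.5):
    """Estimate how many server processes are needed.

    busy processes (on average) = request rate * average transaction time
    """
    busy = requests_per_second * avg_transaction_seconds
    return max(1, math.ceil(busy * headroom))


# 20 requests per second at 0.5 seconds each keep about 10 processes busy;
# with 50% headroom the estimate is 15 processes.
suggested = estimated_processes(20, 0.5)
```

Treat the result as a starting point for the Server Processes setting and adjust it after observing the proxy under real load.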