StashCache By The Numbers
The StashCache federation comprises three components: Origins, Caches, and Clients. Additional components increase the usability of StashCache, and I will also mention them in this post.
Origins
A StashCache Origin is the authoritative source of data. The origin receives data location requests from the central redirectors. These requests take the form of “Do you have the file X?”, to which the origin responds “Yes” or “No”. The redirector then returns to the client a list of the origins that claim to have the requested file.
An Origin is a simple XRootD server, exporting a directory or set of directories for access.
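The exchange between the redirector and the origins can be pictured with a small in-memory model. This is only an illustrative sketch, not XRootD code: the `Origin` and `Redirector` classes and the file paths are hypothetical, and only the origin names and base directories come from the table below.

```python
class Origin:
    """Toy stand-in for an origin exporting one base directory (hypothetical model)."""

    def __init__(self, name, base_dir, files):
        self.name = name
        self.base_dir = base_dir
        self.files = set(files)

    def has_file(self, path):
        # Answer the redirector's "Do you have the file X?" question.
        return path.startswith(self.base_dir + "/") and path in self.files


class Redirector:
    """Toy redirector that asks every origin and collects the "Yes" answers."""

    def __init__(self, origins):
        self.origins = list(origins)

    def locate(self, path):
        # Return the names of all origins that claim to have the file.
        return [origin.name for origin in self.origins if origin.has_file(path)]


# Hypothetical file paths, used only for illustration.
origins = [
    Origin("OSG Connect", "/user", {"/user/alice/data.txt"}),
    Origin("FNAL", "/pnfs", {"/pnfs/experiment/run1.root"}),
]
redirector = Redirector(origins)
print(redirector.locate("/user/alice/data.txt"))
```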
Origin | Base Directory | Data Read |
---|---|---|
LIGO Open Data | /gwdata | 926TB |
OSG Connect | /user | 246TB |
FNAL | /pnfs | 166TB |
OSG Connect | /project | 63TB |
A list of Origins and their base directories.
Clients
The clients interact with the StashCache federation on the user’s behalf. They are responsible for choosing the “best” cache. The available clients are CVMFS and StashCP.
In the pictures above, you can see that most users of StashCache use CVMFS to access the federation. GeoIP is used by all clients in determining the “best” cache. GeoIP location services are provided by the CVMFS infrastructure in the U.S. The geographically nearest cache is used.
The GeoIP service runs on multiple CVMFS Stratum 1s and other servers. The request to the GeoIP service includes all of the cache hostnames. The GeoIP service takes the requesting IP address and attempts to locate the requester. After determining the location of all of the caches, the service returns an ordered list of nearest caches.
The GeoIP service uses the MaxMind database to determine locations by IP address.
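To make the ordering step concrete, here is a minimal sketch that sorts caches by great-circle distance from the requester. It assumes the locations have already been resolved to latitude and longitude; the hostnames and coordinates are made up, and the real service may rank caches differently.

```python
import math

# Hypothetical cache locations as (latitude, longitude); not real StashCache data.
CACHES = {
    "cache-a.example.org": (41.88, -87.63),
    "cache-b.example.org": (37.77, -122.42),
    "cache-c.example.org": (40.71, -74.01),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def order_caches(requester, caches):
    """Return cache hostnames sorted from nearest to farthest."""
    return sorted(caches, key=lambda host: haversine_km(requester, caches[host]))


# A requester located close to the first hypothetical cache.
ordered = order_caches((41.9, -87.6), CACHES)
print(ordered)
```

A client would then try the first hostname in the returned list and could fall back to the next ones if it is unavailable.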
CVMFS
Most (if not all) origins are indexed in an `*.osgstorage.org` repo. For example, the OSG Connect origin is indexed in the `stash.osgstorage.org` repo. It uses a special feature of CVMFS where the namespace and data are separated. The file metadata, such as file permissions, directory structure, and checksums, is stored within CVMFS. The file contents are not stored within CVMFS.
When accessing a file, CVMFS will use the directory structure to form an HTTP request to an external data server. CVMFS uses GeoIP to determine the nearest cache.
The indexer may also configure a repo to be “authenticated”. A whitelist of certificate DNs is stored within the repo metadata and distributed to each client. The CVMFS client pulls the certificate from the user’s environment. If the certificate DN matches a DN in the whitelist, the client uses the certificate to authenticate with an authenticated cache.
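The whitelist check amounts to a set-membership test on the certificate DN. The sketch below is not CVMFS code: the function names and the DNs are hypothetical, and it assumes DNs are compared as plain strings after trimming whitespace.

```python
def normalize_dn(dn):
    """Trim surrounding whitespace so trivially different spellings compare equal."""
    return dn.strip()


def may_use_certificate(user_dn, whitelist):
    """Return True if the user's certificate DN appears in the repo's whitelist."""
    allowed = {normalize_dn(dn) for dn in whitelist}
    return normalize_dn(user_dn) in allowed


# Hypothetical DNs, for illustration only.
whitelist = [
    "/DC=org/DC=example/OU=People/CN=Alice Example",
    "/DC=org/DC=example/OU=People/CN=Bob Example",
]
print(may_use_certificate("/DC=org/DC=example/OU=People/CN=Alice Example ", whitelist))
```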
StashCP
StashCP works in the following order:
- Check if the requested file is available from CVMFS. If it is, copy the file from CVMFS.
- Determine the nearest cache by sending cache hostnames to the GeoIP service.
- After determining the nearest cache, run the `xrdcp` command to copy the data from the nearest cache.
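The steps above can be sketched as a simple fallback chain. This is not the StashCP source: the `stash_copy` function, the `cvmfs_root` mapping from a federation path to a local CVMFS path, and the `nearest_cache` callable (standing in for the GeoIP lookup) are all assumptions made for illustration. The `xrdcp` invocation uses the standard `root://host//path` URL form.

```python
import os
import shutil
import subprocess


def stash_copy(path, dest, cvmfs_root, nearest_cache, run=subprocess.run):
    """Copy `path` to `dest`, trying CVMFS first and then the nearest cache.

    Returns "cvmfs" or "xrdcp" to indicate which method was used.
    """
    # Step 1: if the file is visible through CVMFS, copy it from there.
    cvmfs_path = os.path.join(cvmfs_root, path.lstrip("/"))
    if os.path.exists(cvmfs_path):
        shutil.copy(cvmfs_path, dest)
        return "cvmfs"

    # Step 2: ask the (hypothetical) GeoIP helper for the nearest cache.
    cache = nearest_cache()

    # Step 3: copy from that cache with xrdcp over the XRootD protocol.
    url = f"root://{cache}//{path.lstrip('/')}"
    run(["xrdcp", url, dest], check=True)
    return "xrdcp"
```

Passing `run` as a parameter keeps the sketch testable without a real `xrdcp` binary; by default it calls `subprocess.run`.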
Caches
The cache is half XRootD cache and half XRootD client. When a cache receives a data request from a client, it searches its own cache directory for the file. If the file is not in the cache, it uses the built-in client to retrieve the file from one of the origins. The cache requests the data location from the central redirector, which in turn asks the origins for the file location.
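The hit-or-miss behaviour described above is a read-through cache. The following sketch models it with an in-memory dictionary; the `Cache` class and the `fetch_from_origin` callable (standing in for the redirector lookup plus the download from an origin) are hypothetical and do not reflect how XRootD stores data on disk.

```python
class Cache:
    """Toy read-through cache: serve local copies, fetch from an origin on a miss."""

    def __init__(self, fetch_from_origin):
        # Callable that locates the file via the redirector and downloads it.
        self._fetch = fetch_from_origin
        self._store = {}
        self.misses = 0

    def get(self, path):
        if path not in self._store:
            # Cache miss: retrieve the file from an origin and keep a copy.
            self.misses += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


# Hypothetical origin contents, for illustration only.
origin_files = {"/user/alice/data.txt": b"hello"}
cache = Cache(lambda path: origin_files[path])

first = cache.get("/user/alice/data.txt")   # fetched from the origin
second = cache.get("/user/alice/data.txt")  # served from the cache
print(cache.misses)
```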
The cache listens on port 1094 for the regular XRootD protocol and on port 8000 for HTTP.
Authenticated Caches
Authenticated caches use GSI certificates to authenticate access to files within the cache. The client authenticates with the cache using the client’s certificate. If the file is not in the cache, the cache uses its own certificate to authenticate with the origin and download the file.
Authenticated caches use port 8443 for HTTPS.