Simple Data Service Caching using Job Output in Joyent Manta

Sometimes a job can take quite a while to finish. A nice inherent aspect of Joyent Manta is that it stores the outputs of a job. You can take advantage of that to provide simple caching for a data service. In this article, I demonstrate a simple technique for this using Bash alone.

Ground rules first: I assume you already have a Joyent account, have the Node.js Manta CLI tools installed, and have tested them to work with Joyent Manta. This is not a general Manta tutorial or a getting started tutorial. It's a use case tutorial specifically around making effective use of stored Manta job outputs for caching.
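If you want a quick sanity check on that setup before going further, something along these lines should do it. This assumes the CLI suite is installed globally from npm and that MANTA_URL, MANTA_USER and MANTA_KEY_ID are exported in your shell.

npm install -g manta
# List the top of your private storage area. An empty listing is fine;
# an authentication or connection error means the setup needs attention.
mls /$MANTA_USER/stor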

If you are concerned about accruing cost, don't be. My tutorial uses a 1MB data set that will take a decade to cost you a single cent for storage. The compute usage is so trivial I doubt you will hit a cent there either. I have spent less than $5 with myriad experimentation and considerable waste in pulling entire data sets back to my workstation with poorly considered queries.

The example I will use is ZipInfo.com's National ZIP+4 database. I will use it to create a dead-simple Bash-based client for a data service that looks up the ZIP codes that contain a particular street name. I have the full data set, but the free sample set will work as long as you choose streets in 979## zip codes.

I'll open with the complete solution and explain it piece by piece.

#!/bin/bash
streetname="$1"
cache="/$MANTA_USER/stor/cache/zips_for_street/${streetname}"
result=$(tempfile)
mget $cache > $result 2> /dev/null
if [ $? -eq 0 ]
then
  cat $result
  rm $result
else
  job=$(mjob create -m "
    fold -w182 |
    grep '^D' |
    grep '^.\{17\}S' |
    grep '^.\{24\}$streetname' |
    cut -c2-6" -r "
    sort | uniq")
  mfind /$MANTA_USER/stor/natzip4/zip4/ |
    mjob addinputs $job
  output=$(mjob outputs $job)
  while [ "$output" == '' ]
  do
    sleep 2
    output=$(mjob outputs $job)
  done
  mln "$output" "$cache"
  mget "$cache"
fi

There's a fair bit going on in there and it's a pretty limited implementation. To keep things simple you need to enter the street name in all caps to match the data.
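Assuming you save the script as zips_for_street (a name I picked to match the cache directory used later) and make it executable, a lookup looks like this, street name in all caps:

chmod +x zips_for_street
./zips_for_street HARRISON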

The first order of business is getting the data up into Manta. For the benefit of those not shelling out $300 for the full data set, I'll run with the sample set. Regrettably, the sample set makes the benefits of caching minimal, since it cuts the data down by almost four orders of magnitude and the number of files by almost three.

If you are concerned about cost, don't be. With this sample data set it will take roughly a decade for storage to cost you a cent, so go for it. The compute will take less than 5 seconds per run and will use a single node. I have spent less than five dollars on Manta so far, with the full 8GB data set in play and a lot more experimentation and bandwidth-heavy use cases.

Download the sample data and put it up in Manta.

curl http://data.zipinfo.com/samples/nz4sam.txt > ~/Downloads/979.txt
mmkdir /$MANTA_USER/stor/natzip4/
mmkdir /$MANTA_USER/stor/natzip4/zip4/
mput -f ~/Downloads/979.txt /$MANTA_USER/stor/natzip4/zip4/979.txt

Once the data is up you want to query it. You can test the queries locally, and they behave largely identically, provided you cat the data file and pipe it into the map script.

ZipInfo's ZIP4 files are made up of 182-character fixed-width records with no line separators. The most basic step is a map job that breaks those fixed-width records into one record per line for processing. You can try this locally.

cat ~/Downloads/979.txt |
  fold -w182

The next step is to match only detail records for streets. If the first character is a D and the 18th is an S, then we have a street record. For the curious: the file format spec is not included with the sample, only with the full paid version of the data set.

cat ~/Downloads/979.txt |
  fold -w182 |
  grep '^D' |
  grep '^.\{17\}S'

After this we can search for records by street name which is characters 25 through 52:

cat ~/Downloads/979.txt |
  fold -w182 |
  grep '^D' |
  grep '^.\{17\}S' |
  grep '^.\{24\}HARRISON'

Then we want to get just the ZIP code which is characters 2 through 6:

cat ~/Downloads/979.txt |
  fold -w182 |
  grep '^D' |
  grep '^.\{17\}S' |
  grep '^.\{24\}HARRISON' |
  cut -c2-6

We get quite a few duplicates that way, so we apply the standard sort and uniq treatment:

cat ~/Downloads/979.txt |
  fold -w182 |
  grep '^D' |
  grep '^.\{17\}S' |
  grep '^.\{24\}HARRISON' |
  cut -c2-6 |
  sort | uniq

Rewrite as a parameterized script:

#!/bin/bash
streetname="$1"
cat ~/Downloads/979.txt |
  fold -w182 |
  grep '^D' |
  grep '^.\{17\}S' |
  grep "^.\{24\}$streetname" |
  cut -c2-6 |
  sort | uniq

We've gotten to the core of the data service implementation here: a service for finding the ZIP codes in which a certain street name exists. This translates very directly to Manta:

mfind /$MANTA_USER/stor/natzip4/zip4/ |
  mjob create -o -m " fold -w182 |
  grep '^D' |
  grep '^.\{17\}S' |
  grep '^.\{24\}$streetname' |
  cut -c2-6" -r "
  sort | uniq"

The motivation for separating sort | uniq into a reduce phase is to sort and uniq the entire output set rather than the output of the map phase on each individual input. With the sample data this won't change the result, but you should do it anyway, because skipping it will burn you on a real data set spread across many objects. To demonstrate, the sample 979.txt can be split into multiple Manta objects at any whole-number multiple of 182 characters.
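If you want to see that for yourself, here is one way to do the splitting. It's only a sketch, and the 182000-byte chunk size is nothing more than an arbitrary whole-number multiple of 182.

split -b 182000 ~/Downloads/979.txt ~/Downloads/979_part_
for part in ~/Downloads/979_part_*
do
  mput -f "$part" /$MANTA_USER/stor/natzip4/zip4/$(basename "$part")
done
# Remove the original single object so the same records aren't counted twice.
mrm /$MANTA_USER/stor/natzip4/zip4/979.txt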

With the basic service defined, we want to understand the mechanics of caching using mjob outputs. The basic premise is to attempt to load the cached result first. If it is available, display it; otherwise run the job, wait for its outputs, and use mln to cache the result.

First we need a directory to store the cached results.

mmkdir /$MANTA_USER/stor/cache
mmkdir /$MANTA_USER/stor/cache/zips_for_street

We need a way of addressing the cache by query. This could be hashed or involve other cleverness; I just went with a by-name approach (cache/<scriptname>/<params>):

cache="/$MANTA_USER/stor/cache/zips_for_street/${streetname}"

Then we need to attempt to load from the cache:

result=$(tempfile)
mget $cache > $result 2> /dev/null

After the attempted load we check if it was successful:

if [ $? -eq 0 ]
then
  cat $result
  rm $result
fi

Then we handle the alternate case of having to do the actual work. This is somewhat involved in itself because we need the job's UUID in order to link (snapshot) its output.

First we get the job's UUID:

job=$(mjob create -m "
  fold -w182 |
  grep '^D' |
  grep '^.\{17\}S' |
  grep '^.\{24\}$streetname' |
  cut -c2-6" -r "
  sort | uniq")

Then we input the National ZIP+4 data files into the job:

mfind /$MANTA_USER/stor/natzip4/zip4/ |
  mjob addinputs $job

Then we check the outputs of that job and wait until they become available:

output=$(mjob outputs $job)
while [ "$output" == '' ]
do
  sleep 2
  output=$(mjob outputs $job)
done
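One caveat: as written, this loop will spin forever if the job never produces an output (a typo in the map script, a cancelled job, and so on). A bounded variant is a small tweak; here is a sketch that gives up after roughly five minutes:

output=$(mjob outputs $job)
attempts=0
while [ -z "$output" ] && [ $attempts -lt 150 ]
do
  sleep 2
  attempts=$((attempts + 1))
  output=$(mjob outputs $job)
done
if [ -z "$output" ]
then
  echo "gave up waiting on job $job" >&2
  exit 1
fi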

Finally we link (snapshot) the results and retrieve the output through the cache.

mln "$output" "$cache"
mget "$cache"

Put it all together and we have our original script.

Miscellaneous Useful Scripts

You'll probably want to test and try things out. Here is a useful script I call clear_caches that deletes all cached results so you can force the jobs to run again easily.

#!/bin/bash
mfind -t o /$MANTA_USER/stor/cache | xargs -n1 -I{} mrm "{}"

All jobs create objects that reflect their inputs, outputs, and so on. These objects can add up, and during experimentation you probably don't want to store them all permanently.

I created a script called clear_job_objects for precisely that reason.

#!/bin/bash
mls /$MANTA_USER/jobs/ | xargs -n1 -I{} mrm -r "/$MANTA_USER/jobs/{}"

Remember that mln snapshots the linked object, so deleting the original does not break the link. It is safe to remove a job after its result has been cached. I wouldn't make a practice of continuously deleting all jobs, though; it is likely better to clean up a single job immediately after caching its outputs.
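In the lookup script, that amounts to one extra line right after the mln call, assuming the job's objects live under /$MANTA_USER/jobs/<job id> as in clear_job_objects above:

mln "$output" "$cache"
# The cache object is a snapshot, so the job's own copy can go immediately.
mrm -r "/$MANTA_USER/jobs/$job"
mget "$cache"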