Manta: Big Data for everyone

    _   _              _    ___  _            _     _                  _
   /_\ | |_ __  ___ __| |_ / _ \| |__ ___ ___| |___| |_ ___   _ _  ___| |_
  / _ \| | '  \/ _ (_-<  _| (_) | '_ (_-</ _ \ / -_)  _/ -_)_| ' \/ -_)  _|
 /_/ \_\_|_|_|_\___/__/\__|\___/|_.__/__/\___/_\___|\__\___(_)_||_\___|\__|
                                                   In Glorious ASCII-VISION

Summary: It's Map-Reduce with standard Unix tools!

2013-07-22 Mon: Manta: Big Data for everyone

Joyent's recently released Manta is an exciting new way to process data, especially when you've got a lot of it. Manta is a combined storage and compute platform: you store your data in the S3-like storage system and submit jobs to run on that data.

I've spent the last few days using Manta to process web logs for ActiveInbox (if you use Gmail and want a better way to handle emails as tasks, check it out now; I'll wait). I've found it a fantastic way to slice and dice large amounts of data and get at the secrets within.

The really cool thing about Manta, though, is that these jobs run in a Unix environment and can consist of standard Unix commands and scripts. That means if I wanted to count the number of lines in a file in Manta storage, I might submit a job by running the following command from my computer:

$ echo /$MANTA_USER/stor/logs/2013-07-21.log | \
  mjob create -o 'wc -l'

Manta runs the job (in this case 'wc -l') close to where it keeps the data, which can be a lot more efficient than pulling the data down to another machine first. The mjob command (from the Manta SDK) creates the job, and the input filename is fed in via stdin. The -o option makes the command wait for the job to finish and print the result (otherwise you can retrieve the output later).

Lots and lots of files

Manta can also run jobs on many files, and for this it uses a Map-Reduce architecture. You give one command (the map) that is run on each file individually and another command (the reduce) that combines the outputs of the first. It's also possible to do this in multiple stages with several reducers, which is useful when you're dealing with lots of data.

So, if I wanted to count the lines in all my log files I might run a job like this:

$ mfind /$MANTA_USER/stor/logs/ | \
  mjob create -o -m 'wc -l' -r "awk '{total += \$1} END {print total}'"

Here I used the mfind command (also from the Manta SDK) to list all the log files in the directory on the Manta store and pipe the filenames into the stdin of mjob. The job this time consists of both a map step that counts the lines in each file and a reduce step that sums the resulting numbers together.
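If you want to see what the two phases are doing without touching Manta at all, here's a rough local sketch of the same job. It assumes the map command receives each file's contents on stdin (which is why 'wc -l' prints a bare number with no filename). The temporary directory and the two sample log files are made up for illustration; they stand in for the objects under /$MANTA_USER/stor/logs/.

```shell
#!/bin/sh
# Local sketch of the map and reduce phases; no Manta involved.
set -eu

# Hypothetical sample data standing in for the log objects in Manta.
logdir=$(mktemp -d)
trap 'rm -rf "$logdir"' EXIT
printf 'a\nb\nc\n' > "$logdir/2013-07-20.log"   # 3 lines
printf 'd\ne\n'    > "$logdir/2013-07-21.log"   # 2 lines

# Map: run 'wc -l' once per file, feeding the file in on stdin so that
# only the count is printed (one number per input file).
map_output=$(
  for f in "$logdir"/*.log; do
    wc -l < "$f"
  done
)

# Reduce: sum the per-file counts with the same awk program as the job above.
total=$(printf '%s\n' "$map_output" | awk '{total += $1} END {print total}')

echo "$total"   # prints 5
```

The loop here runs the map commands one after another on a single machine. The point of submitting the job to Manta instead is that the map step can run next to each file, with only the small per-file counts passed on to the reduce.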

Python, Perl or Node

This style of job is great when you want a quick answer about your data, but often you need more. Manta has interpreters for lots of popular languages available, so you can write your jobs in Python, Ruby, Perl, Node or others.

Next time I'll write about using Python to do some crunching on our big log files…

Plug

As I may have mentioned above, ActiveInbox is awesome and you should give it a try. When I'm not working on ActiveInbox I'm also available for hire as a freelance developer, so why not drop me an email.

< follow me on Twitter: @almostobsolete >
        \   ^__^
         \  (OO)\_______
            (__)\       )\/\
                 || ----w|
                 ||     ||

Date: 2013-07-22 19:41:39 BST

Author: Thomas Parslow
