About This Blog: Beanstalk Messaging Queue

posted Thursday, December 20, 2007 by topfunky

Figure A Messaging queues are a tool for executing code without taxing your web application processes.

Web developers often get into the rut of thinking about every programming task in the context of a request and a response. A request comes in for a URL, content is retrieved and converted into useful output, and the result is sent back to the client. Lather, rinse, repeat.

But there are also other types of programming tasks that don’t fit into that cycle. What about tasks that need to happen

  • at a certain time of day?
  • after the response has been sent back to the client?
  • at some point in the future after another event has happened?

Here are some examples from my applications:

  • Time of day: Log parsing, sales reports
  • After a response: State-based payment transactions with other servers that need to complete before moving to the next stage, spam-filtering of blog comments against an external API server
  • In the future: A task to check up on an order to make sure it has been delivered within an hour after purchase

Previously, I approached most of these problems with a few rake tasks and a cron job that ran every minute. While it worked, it wasn’t as fast as it could be and felt a bit hackish (a delay of even one minute is too slow sometimes).

For a while, I’ve wanted to learn more about messaging queues. I love tools that don’t only enhance something I’m already doing, but completely change the way I think about designing an application.

The Problem

Figure B If one process calls a resource-intensive method, it won’t be available to answer web requests.

Queues are a great tool for tasks like these. Sending work off to a queue can solve some of these problems, and it also gives you another option for speeding up normal HTTP responses.

Initially, I decided to try this out on my blog. I use the remote Akismet service to check comments for SPAM. To be honest, Akismet is usually fast enough that I could make the call in the middle of the request without any problems, but I wanted to try out the message queue before deploying a similar idea at PeepCode.com.

In the application, every comment starts out with a received state (using acts_as_state_machine). The Comment controller will fire off a job and a separate worker will handle the SPAM-checking so the web process can respond quickly and get back to work responding to other web requests.

Figure C Sending jobs to a queue keeps Rails processes available to do what they do best…answer web requests.

Messaging Servers

This area has seen fresh activity even in the last few weeks. Ara Howard recently compared several of these packages. I haven’t evaluated all of them, but here are a few I’ve looked at:

Product | Features | Drawbacks
beanstalkd | Fast, simple, designed to mirror the style of memcached. Rails plugin available, or usable with a simple Ruby-based API. Server written in C, but is very easy to install. | Memory only…jobs are not persistent. New, so the internal protocol may change. Workers may be difficult to manage.
bj | Rails plugin. Self-spawning. | Can only send shell commands. Jobs start a full copy of your Rails app on every execution.
BackgroundRB | Ruby-based. Can be polled for incremental feedback on the progress of a job. | Recently rewritten.
Amazon SQS | Runs on Amazon’s cluster, so it can handle a ton of traffic. | Operated by Amazon, so it doesn’t run locally. Not open source.
Apache ActiveMQ | Well-known. Persistent. | Requires several installation steps and database tables.
ActiveMessaging | Rails plugin. Works with ActiveMQ and others. | Requires external job server.
BBQ | Nothing to install…involves only a single line of code! | Doesn’t work on Windows NT4.

For this blog, I chose to try beanstalkd.

  • It is not persistent and exists only in memory, but my application uses acts_as_state_machine, so I can see if any job failed to run.
  • It’s very fast and was made to help with scaling a multi-million person Facebook app running on Rails.
  • There are some nice features like delayed jobs. Put a job in the queue and it can show up immediately, or after a period of time.
  • There is a Ruby client and a Rails plugin.
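To make the delayed-job idea concrete, here is a minimal in-memory sketch. ToyQueue is a hypothetical stand-in written for this illustration; it is not the beanstalkd protocol or the client API, and it only models the rule that a job stays invisible until its delay has passed. The clock is passed in so the example needs no sleeping.

```ruby
# ToyQueue is a hypothetical in-memory model of delayed jobs.
class ToyQueue
  def initialize
    @jobs = []
  end

  # delay is in seconds; now is injectable so the behavior is easy to check.
  def put(body, delay: 0, now: Time.now)
    @jobs << { :body => body, :ready_at => now + delay }
  end

  # Returns the body of the job that became ready earliest, or nil when
  # no job is ready yet. The returned job is removed from the queue.
  def reserve(now: Time.now)
    ready = @jobs.select { |j| j[:ready_at] <= now }
    return nil if ready.empty?
    job = ready.min_by { |j| j[:ready_at] }
    @jobs.delete_at(@jobs.index { |j| j.equal?(job) })
    job[:body]
  end
end
```

With the real server, the delay is supplied when the job is put into the queue; check the client's documentation for the exact argument order.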

Installation

Download the beanstalkd server and compile it. Use make for production or make debug for your development copy (to print out extra messages as it’s working). There’s no task to install it, but you can just copy the executable to /usr/local/bin.

Start the server (use -h to see other possible arguments):

% beanstalkd
beanstalkd: net.c:90 in unbrake: releasing the brakes

Install the beanstalk-client gem. For this blog, I chose to use the gem directly.

sudo gem install beanstalk-client

In merb_init.rb (or config/environment.rb), I set up a connection to the beanstalk server.

BEANSTALK = Beanstalk::Pool.new(['localhost:11300'])

In the Comments controller, I put a comment job into the queue, using the id of the new comment.

# Comments controller
def create
  @comment = Comment.new(params[:comment])
  if @comment.save
    # Queue only the comment id; a queue failure must not break the request
    BEANSTALK.yput({:type => "comment", :id => @comment.id}) rescue nil
    # Then redirect and return
  end
end

The yput method uses YAML to serialize any arguments and put them into the queue.
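The round trip can be sketched with Ruby's standard YAML library alone. The helper names encode_job and decode_job are hypothetical, and the id 42 is made up; this only approximates what the client does when it serializes and deserializes a job body.

```ruby
require "yaml"

# Roughly what happens before a job goes into the queue: the hash
# becomes a YAML string.
def encode_job(hash)
  YAML.dump(hash)
end

# Roughly what happens on the worker side. Symbols have to be permitted
# explicitly when loading with safe_load.
def decode_job(body)
  YAML.safe_load(body, permitted_classes: [Symbol])
end

body = encode_job({ :type => "comment", :id => 42 }) # 42 is a made-up id
puts body
puts decode_job(body).inspect
```

Because the body is just a string, anything that survives YAML serialization can travel through the queue, which is one more reason to keep payloads down to simple values.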

Finally, I wrote a rake task to function as the worker.

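loop do
  # Wait until a job is available
  job = BEANSTALK.reserve
  # ybody deserializes the job
  job_hash = job.ybody
  case job_hash[:type]
  when "comment"
    if Comment.check_for_spam(job_hash[:id])
      # The check completed, so remove the job from the queue
      job.delete
    else
      # The check could not be completed, so set the job aside
      job.bury
    end
  else
    puts "Don't know what type of job this is: #{job_hash.inspect}"
  end
end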
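One possible refinement, sketched here under assumptions, is to pull the body of the loop into a method so it can be exercised without a running server. FakeJob and process are hypothetical names; FakeJob only mimics the ybody, delete, and bury calls used above. Unlike the loop above, this variant buries unknown job types and jobs whose handler raises, so one bad job cannot stop the worker.

```ruby
# FakeJob is a hypothetical stand-in for a reserved job. It only records
# whether delete or bury was called.
class FakeJob
  attr_reader :outcome

  def initialize(hash)
    @hash = hash
  end

  def ybody
    @hash
  end

  def delete
    @outcome = :deleted
  end

  def bury
    @outcome = :buried
  end
end

# Handles one reserved job. handlers maps a job type to a callable that
# takes the id and returns true on success.
def process(job, handlers)
  job_hash = job.ybody
  handler = handlers[job_hash[:type]]
  if handler && handler.call(job_hash[:id])
    job.delete
  else
    job.bury
  end
rescue StandardError
  # A failing handler buries the job instead of killing the worker loop.
  job.bury
end
```

The worker loop then shrinks to reserving a job and calling process(job, handlers), where the "comment" handler would wrap Comment.check_for_spam.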

In the future, I’d like to look into using daemonize or some other method for running the worker. In the meantime, I’m using god to start the worker and keep it running.

The details are a bit of a hack, but here is the god.conf if you want to try it. The benefit is that god keeps the worker running and daemonizes it so it runs in the background.

sudo god start -c /var/www/apps/mysite.com/current/config/god.conf

I can also call god restart beanstalk-worker from a Capistrano task to restart it and keep the code fresh.

Results

Figure D The comments screen shows the state of comments.

In practice over the past week or so, this has been very reliable. The message passing is so fast that sometimes it actually runs the SPAM check before the redirect back to the article page is done!

It was fairly simple to set up and now provides me with a tool for accomplishing tasks that don’t need to be completed in the scope of an HTTP response.

Tips

  • Keep queue items small. Put an id and some kind of type identifier in the queue, not an entire model. I could have stored the entire contents of the comment, but it’s more efficient to just pass the id of the comment and let the worker get a current copy from the database.
  • Keep worker code small. By calling methods on the Comment model, I keep the code in one place (even though it will be executed in different contexts).
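Both tips can be illustrated together with a small sketch. ToyComment, its in-memory STORE, the ham and spam state names, and the detector callable are all assumptions made for this example; a real app would look the record up through its ORM and call the external spam API instead.

```ruby
# Hypothetical in-memory model standing in for the real Comment class.
class ToyComment
  STORE = {}

  attr_reader :id, :body
  attr_accessor :state

  def initialize(id, body)
    @id = id
    @body = body
    @state = :received
    STORE[id] = self
  end

  # The worker's only entry point: look up a current copy by id, classify
  # it, and report whether the check ran to completion. detector is any
  # callable that returns true for spam.
  def self.check_for_spam(id, detector)
    comment = STORE[id]
    return false unless comment
    comment.state = detector.call(comment.body) ? :spam : :ham
    true
  end
end
```

The queue only ever carries the id, so a comment that was edited or deleted between the request and the job is handled using its current state rather than a stale copy.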

Future

Now that it’s working smoothly here, I hope to use it on PeepCode. Some possibilities:

  • Google Checkout pings: I need to ping Google after their server notifies me that an order has passed a security check. It can’t be done in a model callback because I need to complete an XML response to Google before moving to the next state. This kind of setup will be perfect since I can put a job in the queue and give it a small delay before it runs.
  • Order followup in the future: I currently have some cron jobs that check recent orders to make sure they have completed successfully. It would be easy to use this system to put several jobs in the queue after an order is placed. They would run after 30 minutes, 1 hour, or 2 hours. The job would send a notification if the order has not been completed, or ignore it if it’s done by the time the job runs.
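The order followup idea might look roughly like this. The method names, the "order_followup" type string, and the notifier callable are hypothetical; only the three delays come from the description above. Each pair would be handed to the queue with its delay when the order is placed.

```ruby
# Delays from the description above: 30 minutes, 1 hour, and 2 hours,
# expressed in seconds.
FOLLOWUP_DELAYS = [30 * 60, 60 * 60, 2 * 60 * 60].freeze

# Builds one [payload, delay] pair per check. The payload carries only
# a type and the order id.
def followup_jobs(order_id, delays = FOLLOWUP_DELAYS)
  delays.map do |delay|
    [{ :type => "order_followup", :id => order_id }, delay]
  end
end

# What a worker would do when one of these jobs comes due. notifier is
# any callable; it is invoked only for orders that are still open.
def run_followup(order_completed, notifier)
  return :ignored if order_completed
  notifier.call
  :notified
end
```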

Finally, PeepCode in Italiano!

Ryan Daigle’s Rails 2 PDF is now available in Italiano as well as English and Español.

22 comments


  • Joe Van Dyk

    BackgroundRB just reached 1.0. It looks very cool and is under very active development. I’ve had good success with it.

  • As ever, an interesting and useful read – thanks!

  • Can be easily done in a BackgrounDRb worker:

     
    def create
      @comment = Comment.new(params[:comment])
      if @comment.save
        MiddleMan.ask_work(:worker => :spam_worker,:worker_method => :mark_as_spam,:data => @comment.id)
      end
    end
    
    class SpamWorker < BackgrounDRb::MetaWorker
      set_worker_name :spam_worker
      def create
      end
    
      def mark_as_spam(comment_id)
        thread_pool.defer(comment_id) do |comment_id|
          Comment.check_for_spam(comment_id)
        end
      end
    end
    

  • Venkat

    Love your posts. Very informative.

    Have you considered AP4R as well?

    The link to Rubyforge is below:

    http://rubyforge.org/projects/ap4r/

    I am not involved in this project, but it sounds like a good choice for these kinds of tasks as well.

    Cheers, Venkat.

  • You may be interested in the recent alpha release of the InlineBBQ Queue service ;)

  • bronson

    I’m using ap4r to do this. I find it a little more convenient because it just calls right back into whatever controller I tell it. No need to worry about rake tasks.

    For instance, this:

    class MyController
      def queue
        ap4r.async_to({:action => 'download'}, {:story => story.id, :url => params[:url]})
      end
      def download
        # long-running task
      end
    end

    will work just as you expect: the queue action will queue the download request and return immediately. Later (configurable), ap4r calls the download action with :story and :url. You can, of course, tell it to call a completely different controller if you want. And it’s load-balanced like any other incoming request.

    Pretty handy.

  • Sparrow is a pretty interesting project.

    http://code.google.com/p/sparrow/

    Speaks Memcached, written in Ruby and EventMachine, so it should be pretty bulletproof.


  • “# ybody unserializes the job”

    Ouch my ears! “deserializes” please.

    I know, this is very trivial.

  • Question:

    So I understand that you add a job to the queue, but I’m confused about when the worker picks it up.

    So do you simply start your rake task and it runs forever, checking the queue with the do loop?

    How’s this with memory/cpu usage?

  • topfunky

    @Alex: Fixed…thanks.

    Yes, the rake task sits in a do loop. It’s been running constantly for over a week now and CPU usage is at 0.02. So it doesn’t max out the CPU as one would expect.

  • I get a 503 error with this link:

    http://pastie.textmate.org/private/ovgxu2ihoicli2ktrwtbew

  • As ever, an interesting and useful read – thanks!

  • I would tend to agree with John Topley and say that it can be easily done in a BackgrounDRb worker:

  • Ted

    Thanks for this post—Queueing is something that I’ve been wondering about ever since Amazon introduced their queueing service (really expensive!)

  • I just implemented this because backgroundrb was being finicky, and so far I like it better (simpler and cleaner and easier to test). If you don’t want to fire up ruby at the end of the beanstalk stop command, try:

    kill -9 `ps ax | grep beanstalk:daemon | grep -v grep | cut -c 1-5`

  • Ivan

    Interesting post.

    What’s the difference between job.delete and job.bury ? I couldn’t find any information about that on the beanstalkd or beanstalk-client sites.

  • Geoffrey Grosenbach

    @Ivan: Delete removes a job altogether, while bury puts it in a deferred queue that can be accessed separately.

    If a job completed successfully, it can be deleted. If it had a problem or couldn’t be done right away for some reason, it can go into the buried queue.

  • it’s not quite accurate to say that bj can only run shell commands. it’s actually possible to have bj load the rails app initially and then to evaluate the job from the db. so you can simply submit ruby code to run. the reason it does not do this is for robustness: rails apps leak memory like crazy, everyone knows it, so stacking up a background daemon that loads a rails app and runs jobs through it effectively doubles the processes that are leaking and the memory requirements of the system. by loading only on demand per process the system stays robust and has minimal memory requirements at the expense of cpu. of course, as i said, bj allows you to do this if you want to, but it is not the default. another feature of bj is that it allows you to cluster backend processes – it’s very easy to set up 10 background job runners. just an fyi.

  • topfunky

    Thanks for the clarification, Ara.

    Message queueing systems are popping up all over the place. Shopify wrote their own, too:

    http://blog.leetsoft.com/2008/2/17/delayed-job-dj

  • This:

    def MyController
      def queue
        ap4r.async_to({:action => 'download'}, {:story => story.id, :url => params[:url]})
      end
      def download
        # long-running task
      end
    end

    doesn’t work. Are you sure everything is fine?

    Thanks Lukas Kalender

  • Sebastian

    Hi! I am using much the same setup as you are on one of my sites, and I have a question about your god config file that I hope you can answer. How do you manage to monitor the beanstalkd process from god? Monitoring the Ruby scripts that interact with beanstalkd isn’t a problem, but the message queue system itself seems harder to monitor. How do you do that? Are you also monitoring, for example, nginx from your god setup? I would love to hear how you are doing those two things!

    If you don’t want to write about it here, you can also send me an email directly!

    Thanks.

    Best regards Sebastian

  • Eliot

    If you’re not using the queue for object persistence (i.e., you only stick a model’s id there), why use beanstalk at all? Wouldn’t a script/runner (or merb -r) task that works directly from the DB work just as well? You might want to add some kind of locking or priority to the model, but it wouldn’t add much complexity.


Nuby on Rails

Geoffrey Grosenbach / Ruby / Code / Graphics / Design / Rails / Merb / Javascript / CSS
