ApacheCon NA 2013

Portland, Oregon

February 26th – 28th, 2013

Wednesday 11:45 a.m.–12:30 p.m.

Planning and Deploying Apache Flume

Arvind Prabhakar

Big Data

Apache Flume is a distributed, reliable, scalable and extensible system for collecting and aggregating large volumes of streaming event data such as log information. It requires minimal setup and, when planned and sized to requirements, virtually no manual maintenance. This talk focuses on capacity planning and on sizing your Flume deployment to meet your data center requirements.


Apache Flume facilitates the collection and aggregation of large volumes of streaming data from different corners of your network to one or more central stores where it can be analyzed conveniently.

Log Collection Use-Case

Consider log collection from a farm of servers that host content for an enterprise. While the number of servers is small, copying log files manually may suffice. Once the count reaches double digits, you could use scripts driven by a basic scheduler such as cron. Such an environment is hard to maintain, however, and becomes a continual source of concern for operators. The approach breaks down completely when hundreds or thousands of servers produce logs and are spread across data centers on different continents.

What you need in such a situation is a system that can be deployed across all these data centers and can run indefinitely without constant intervention. As your business grows, and with it the number of servers and their geographical spread, you want this system to scale with minimal administrative overhead. And given the importance of the log data you are collecting, you want the system to provide reliability guarantees, quick delivery to the destination, and at least minimal ordering guarantees.

Apache Flume to the Rescue

These requirements, along with many high-level features, are what make Flume a compelling solution for such use-cases. Flume not only provides a scalable system that takes minimal effort to set up and maintain, it also provides data safeguards to ensure delivery, ordering semantics, contextual routing, and a host of other features that simplify the task of serving this use-case. Because its configuration is declarative, Flume requires no code changes to address such use-cases. Where you need sophisticated contextual routing or filtering, or need to connect systems that are not supported by default, you can create custom components and drop them into Flume for seamless integration.

Focus of this Talk

This session will first introduce you to Flume at a high level and then take you through sizing and capacity planning for your deployment. We will specifically discuss a converging flow configuration, also known as a fan-in flow, and show how to work out the topology and configuration details and how to evaluate the hardware capacity needed to meet specific requirements.
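
As a rough illustration of the kind of arithmetic a fan-in sizing exercise involves, the Python sketch below estimates how many collector agents a converging tier might need and how many events a channel would have to buffer during a downstream outage. It is not the method presented in the talk. The function names, the headroom factor, and every number in the example are assumptions chosen for illustration, and real per-agent throughput has to be measured on your own hardware.

```python
import math


def agents_needed(total_events_per_sec: float,
                  per_agent_events_per_sec: float,
                  headroom: float = 0.25) -> int:
    """Estimate the collector agents needed for a fan-in tier.

    headroom is the (assumed) fraction of each agent's measured
    throughput held in reserve for bursts and failover.
    """
    if per_agent_events_per_sec <= 0:
        raise ValueError("per_agent_events_per_sec must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = per_agent_events_per_sec * (1 - headroom)
    return max(1, math.ceil(total_events_per_sec / usable))


def channel_capacity(events_per_sec: float, outage_seconds: float) -> int:
    """Events a channel must hold to ride out a downstream outage."""
    return math.ceil(events_per_sec * outage_seconds)


if __name__ == "__main__":
    # Hypothetical load: 1,000 servers at 100 events/s each.
    total = 1_000 * 100
    # Hypothetical measured throughput of one collector agent.
    collectors = agents_needed(total, per_agent_events_per_sec=20_000)
    print("collector agents:", collectors)
    # Buffer needed per collector for an assumed 10-minute outage.
    print("events to buffer per collector:",
          channel_capacity(total / collectors, outage_seconds=600))
```

With these assumed inputs each collector keeps 25% of its throughput in reserve, so the example load of 100,000 events per second calls for 7 collectors. Changing the headroom or the outage window shows how sensitive the topology is to those planning assumptions.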

At the end of this session, you will have a clear understanding of how to plan your Flume deployment in a manner that enables you to address all your current and projected needs for the immediate future. You will also develop a deeper understanding of Flume as a distributed system and be able to build on this knowledge to plan bigger topologies or modify existing topologies where necessary.