Using a fault-tolerant architecture, Flume is a distributed system for collecting logs data from many sources, aggregating it, and moving large amounts of it to a centralized data store such as the Hadoop Distributed File System (HDFS) or HBase. Flume is designed to be a flexible distributed system that can scale out very easily and is highly customizable. A correctly configured Flume agent and a pipeline of Flume agents created by connecting agents with each other is guaranteed to not lose data, provided durable channels are used.
In previous article, I explained why we really need a buffer system as Flume to ingesting log data. In this article, we will discuss about how to scale out easily a Flume cluster when the number of servers producing data consistently increases.
Data flow model
A Flume event can be defined as a unit of data flow having a payload (bytes) and an optional set of string attributes. A Flume agent is a JVM process that hosts the components through which events flow from an external source to the next destination (hop).
A Flume agent has three main components: source, channel, and sink. The source consumes events delivered to it by an external source like a web service. The external source sends events to Flume in a recognizable format. When a Flume source receives events, it stores them into one or more channels (channels uses local file system or memory to storing events). The channel is a passive store that keeps an event until it is consumed by a Flume sink. The sink extracts the event from the channel and puts it in an external repository like the HDFS, or forwards it to the Flume source of the next Flume agent (multi-hop flows) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.
Multi-hop Flows/Complex Flows
Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transaction processes running to ensure that the data is safely stored in the channel of the next hop.
This allows the user to design a fan-in–style flow from a large number of data-producing applications. It is important to restrict the number of applications writing data to any storage system to ensure that the storage system scales with the amount of data being written and can handle bursty data writes.
To design a fan-in topology, there needs to be a number of Flume agents receiving data from the applications producing the data while a few agents write data to the storage system. Depending on how many servers are producing how much data, the agents could be organized into one, two, or more tiers, with agents from each tier for‐warding data from one tier to the next using an RPC sink–RPC source combination.
As shown in below figure, the outermost tier has the maximum number of agents to receive data from the application, though the number of Flume agents is usually only a small fraction of the number of application servers; the exact number depends on a variety of factors including the network, the hardware, and the amount of data. When the application produces more data or more servers are added, it is easy to scale out by simply adding more agents to the outermost tier and having them configured to write data to the machines in the second tier.
This kind of topology allows Flume to control the rate of writes to the storage system by backing off as needed, while also allowing the application to write data without any worry. Such a topology is also often used when the application producing the data is deployed in many different data centers, and data is being aggregated to one cluster. By writing to a Flume agent within the same data center, the application can avoid having to write data across a cross–data center WAN link, yet ensure that the data will eventually get persisted to the storage system. The communication between Flume agents can be configured to allow higher cross–data center latency to ensure that the agent-to-agent communication can complete successfully without timeouts.
Having more tiers allows for absorbing longer and larger spikes in load, by not over‐whelming any one agent or tier and draining out data from each tier as soon as possible. Therefore, the number of tiers required is primarily decided by the amount of data being pushed into the Flume deployment. Since the outermost tier receives data from the largest number of machines, this tier should have the maximum number of agents to scale the networking architecture. As we move further into the Flume topology, the number of agents can reduce significantly
By having a number of Flume agents receive data from application servers, which then write the data to HDFS or HBase (either directly or via other Flume agents), it is possible to scale the number of servers and the amount of data that can be written to HDFS by simply adding more Flume agents.
– AnTQ –