Apache Hadoop has come a long way from its beginnings as a reliable storage pool with batch processing built on the scalable, parallelizable MapReduce framework. Recent additions have brought real-time components such as Impala, along with Apache Solr as a search engine for free-form text exploration. Getting started with Hadoop is now a lot easier: install CDH and all the Hadoop ecosystem components are readily available.
But where should you go next? What is a good first use case? How do you ask “bigger questions”? Despite the obvious variations across industries and applications, there is a common theme in Hadoop projects: the presence of a Big Data architecture.
Big Data Architecture
Big Data architecture is based on a skill set for developing reliable, scalable, and completely automated data pipelines. That skill set requires knowledge of every layer in the stack, beginning with cluster design and spanning everything from Hadoop tuning to setting up the tool chain responsible for processing the data. The diagram above shows the complexity of the stack, as well as how data pipeline engineering touches every part of it.
The main flow in the diagram is that data pipelines take raw data and transform it into insight. Along the way, Big Data engineers have to make decisions about what happens to the data, how it is stored in the cluster, how access is granted internally and externally, what tools to use to process the data, and so on. The people who design and implement this architecture are known as Big Data engineers.
Now we can take a look at the various components in the stack and their roles in creating data pipelines.
Cluster planning is a “chicken-and-egg” problem: cluster design is inherently driven by the use cases that will run later, and often those use cases are not yet clear. Most vendors publish reference architecture guidelines to help you select the proper class of machines. In general, the current recommendation is standard rack-mountable 19” servers with dual CPUs of 4 to 8 cores each, from 48GB up to 512GB of RAM for cached data, and from 6 up to 12 or more HDDs for storage-heavy configurations. When in doubt, you can always try cloud services first, and once you know your requirements better, move things around.
After you have spun up your cluster, you have to decide how to load data. In practice there are two main approaches: batch and event-driven. The former is appropriate for file and structured data, while the latter is appropriate for most near-real-time events such as log or transactional data.
When ingesting data from structured sources such as an RDBMS, the choice is almost universally Apache Sqoop, which allows you to move data into Hadoop from relational databases. You can select partial or full data sets and do full or incremental transfers. Sqoop uses MapReduce as its workhorse and employs standard JDBC drivers for many database systems.
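As a sketch of what such a transfer can look like, the following Sqoop invocation incrementally imports new rows from a hypothetical `orders` table; the connection string, credentials, table, and column names are all placeholders:

```bash
# Incrementally import rows from a relational "orders" table into HDFS.
# All connection details and names below are illustrative only.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --incremental append \
  --check-column order_id \
  --last-value 1000000 \
  --num-mappers 4
```

On the next run, `--last-value` would be advanced to the highest `order_id` already imported, so only new rows are transferred.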
The more complex batch-ingest method is file loading. There are many ways to achieve it, but none are really established. The matter is complicated by the location of the files, as well as the API used to load them. In fact, when possible, it is better to switch to event ingest and avoid bulk loading of files altogether.
For event-based ingest there is Apache Flume, which allows you to define a redundant, failsafe network of so-called agents that transport event records from a generating system to a consuming one. The destination might be HDFS, but it can also be Spark or HBase, or a combination of both.
Flume has been proven in large production clusters and allows you to reliably deliver data to where it is needed. The tricky part is configuring the Flume topology and the agents correctly. The agents need to be able to buffer enough data on persistent media to cover all anticipated “normal” server failures. Also, tuning the batch sizes of events sent between agents is vital for achieving either higher throughput or lower latencies.
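To make this concrete, here is a minimal sketch of a single-agent Flume configuration that moves files dropped into a spool directory onto HDFS through a durable file channel; all agent, host, and path names are invented for illustration:

```properties
# One agent with a spooling-directory source, a disk-backed channel, and an HDFS sink.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: pick up files dropped into a local spool directory.
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/spool/flume/incoming
agent1.sources.src1.channels = ch1

# Channel: buffer events on persistent media so agent restarts lose nothing.
agent1.channels.ch1.type          = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs      = /var/flume/data

# Sink: write to HDFS; batchSize trades throughput against latency.
agent1.sinks.sink1.type               = hdfs
agent1.sinks.sink1.channel            = ch1
agent1.sinks.sink1.hdfs.path          = hdfs://namenode/data/raw/events/%Y/%m/%d
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.hdfs.batchSize     = 1000
```

Raising `hdfs.batchSize` improves throughput at the cost of latency, which is exactly the tuning trade-off described above.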
Once the data has arrived in Hadoop, there remains the task of staging it for processing. This is not just about storing it somewhere, but about storing it in the right format, at the right size, and with the right access mask.
The right data format depends on whether the application is batch or real-time, but also on whether the format retains the full fidelity of the data and is open source. For batch, container file formats such as SequenceFile and Avro are both useful and popular. For real-time applications, the bright new candidate is Apache Parquet, which, similar to columnar databases, lays out the data in columns with built-in structure and compression that allow you to scan very large data sets very efficiently.
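As a rough illustration of how this choice surfaces in practice, the following Hive DDL (table and column names invented, and assuming a Hive version with native Avro and Parquet support) declares the same data once as Avro for batch pipelines and once as Parquet for scan-heavy queries:

```sql
-- Batch-oriented staging table in a row-based container format.
CREATE TABLE events_avro (
  event_time TIMESTAMP,
  user_id    BIGINT,
  action     STRING
)
STORED AS AVRO;

-- Scan-oriented copy in a columnar format for analytic queries.
CREATE TABLE events_parquet
STORED AS PARQUET
AS SELECT * FROM events_avro;
```

Keeping both layouts is one way to serve batch and interactive workloads from the same logical dataset.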
It is also important to think about what happens with your data over time. You might want to implement policies that rewrite older data into different file or compression formats, so that you make better use of the available cluster capacity.
As you stage your data, there is another important aspect to consider: how you partition your data and how large your files are. For starters, Hadoop is best at managing a small number of very large files. You do not want to design an architecture that lands many small files in HDFS and then be surprised when the NameNode starts to perform badly. You can land small files, of course, but then you need an ETL stage that combines them into larger ones.
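One sketch of such a compaction step uses Hadoop Archives, which pack many small files into a single HAR file so the NameNode tracks one entry instead of thousands (all paths below are illustrative):

```bash
# Pack the many small files under /data/raw/logs/2014 into one archive.
hadoop archive -archiveName logs-2014.har \
  -p /data/raw/logs 2014 \
  /data/archive

# The archived files remain readable through the har:// filesystem.
hadoop fs -ls har:///data/archive/logs-2014.har
```

Alternatives include a periodic MapReduce or Hive job that rewrites small files into large container files, which also gives you the chance to change formats at the same time.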
While you are transforming files as they arrive, the next step is to split them into reasonably sized chunks for later processing. This is usually done using partitions on HDFS. In HBase the partitioning is implicit, as it divides data into regions of contiguous rows, sorted by their row key. For HDFS you have to plan ahead, so you may need to sample the data and explore its structure to decide what is best for you. The rule of thumb is for partitions to span enough data to be worth processing without creating the small-file problem mentioned above: at least 1GB per file and, depending on the size of the total dataset, tuned up to even larger sizes.
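For example, a common HDFS layout partitions a dataset by date so that a job can process one day of data without scanning the rest; the dataset, jar, and class names here are hypothetical:

```bash
# Date-partitioned layout: each leaf directory holds one day's data
# in a handful of large files rather than many small ones.
hadoop fs -mkdir -p /data/staged/clickstream/2014/06/01
hadoop fs -mkdir -p /data/staged/clickstream/2014/06/02

# A downstream job then reads exactly one partition:
hadoop jar analytics.jar com.example.DailyReport \
  /data/staged/clickstream/2014/06/01 /data/reports/2014-06-01
```

The same year/month/day convention is what the Flume HDFS sink's escape sequences (e.g. `%Y/%m/%d`) can produce automatically at ingest time.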
The last part you have to consider is what we call information architecture (IA), which addresses the need to lay out the data in such a way that multiple teams can work safely on a shared cluster (multi-tenancy).
It is not enough to have each job read from one directory and emit to another. If you share a cluster across departments, you need to devise a concise access schema that tightly controls who has access to what data. The IA is where these rules are defined; with them in place, you can further plan how data is read from storage during processing and pushed through the various stages of the data pipeline. One way to keep processing orderly is to create a time-stamped directory for every running job, which ensures that jobs can run in parallel without overwriting each other’s data mid-flight.
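The time-stamped-directory trick can be as simple as the following shell sketch, where the pipeline, jar, and class names are invented for illustration:

```bash
# Each run writes to its own time-stamped output directory, so
# concurrent or re-run jobs never clobber each other's results.
RUN_DIR="/data/pipelines/clickstream/$(date +%Y%m%d-%H%M%S)"
hadoop fs -mkdir -p "$RUN_DIR"
hadoop jar etl.jar com.example.SessionizeJob \
  /data/staged/clickstream/2014/06/01 "$RUN_DIR"
```

A final atomic rename (or a `_SUCCESS` marker file) can then publish the run directory to consumers only once the job has completed.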
Thus far you have learned about staging the incoming data. The next step is automatically processing it as part of the data pipeline.
Just because you transform your data doesn’t mean you need to lose any of its detail: this is not your typical ETL which is often lossy, but rather an optional step to increase the effectiveness of your cluster. Plan to do whatever is needed for staging, which might also extend to rewriting data over time. You could, for example, employ heuristics that check how often and in what way data is used and change its layout over time.
The more interesting part of processing is the analytics done on top of the staged data. The currently most hyped topic is machine learning, wherein you build mathematical models for recommendations or clustering/classification of incoming new data—for example, to do risk assessment, fraud detection, or spam filtering. The more ordinary tasks in analysis, such as building aggregations and reporting data, are still very common.
Either way, after prototyping the algorithm and approach, you have to convert it into an automated workflow.
Before we can automate, we have to combine the tools in each component of the stack into more complex data pipelines. There are two main types of such pipelines: micro and macro.
Micro-pipelines are streamlined helpers that allow you to abstract parts of the larger processing. Tools for this purpose include Morphlines, Crunch, and Cascading. Morphlines tie together smaller processing steps applied to each record or data pair as it flows through the processing. Crunch and Cascading define an abstraction layer on top of the processing, where you deal with data points. A Crunch or Cascading “meta” job can in turn be combined into yet more complex workflows, which is usually done in macro-pipelines.
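As a rough sketch of the micro-pipeline idea, a morphline configuration chains small per-record commands; the file below (the id, field names, and grok pattern are illustrative) splits an input stream into lines and extracts a timestamp from each:

```
# Morphline: a chain of small transformation commands applied per record.
morphlines : [
  {
    id : extractTimestamp
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Split the incoming byte stream into one record per line.
      { readLine { charset : UTF-8 } }
      # Pull an ISO timestamp out of the raw line into its own field.
      { grok {
          expressions : { message : "%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:body}" }
        }
      }
    ]
  }
]
```

Each command sees one record at a time, which is exactly the per-record processing style described above.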
Apache Oozie is one of those macro-pipeline tools. It defines workflows as directed acyclic graphs (DAGs) that have control and action elements, where the former influence how the flow proceeds and the latter define what has to be done at each step. Oozie also has a server component that tracks running flows and handles their completion. As with single jobs or micro-pipelines, a “workflow” is not automated but rather just a definition of work; it has to be invoked manually to start the flow processing. This is where another part of Oozie, the coordinators, comes in. Oozie coordinators define the time or frequency at which a workflow should run, and/or its dependencies on other workflows and data sources. With this feature, you have the missing link for automating processing.
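A minimal sketch of an Oozie coordinator that triggers a workflow once a day might look like this; the application name, dates, and HDFS path are invented, and the exact schema version depends on your Oozie release:

```xml
<!-- Run the staging workflow once per day, starting 2014-06-01. -->
<coordinator-app name="daily-staging" frequency="${coord:days(1)}"
                 start="2014-06-01T00:00Z" end="2015-06-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- HDFS path of the workflow.xml that defines the actual DAG. -->
      <app-path>hdfs://namenode/apps/pipelines/staging</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Coordinators can also declare input datasets, so a workflow fires only when both the schedule and the required data are in place.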
The following diagram adds the discussed tools and concepts to the data pipeline architecture:
While Hadoop has grown tremendously, there are still functional gaps for putting data pipelines into production easily, so skilled Big Data engineers are needed, and demand for them is high and expected to grow. Helpfully, the Hadoop ecosystem offers most of the tools needed to build and automate these pipelines based on business rules: testing and deploying pipelines is easier with proper tooling support, while operating the same pipelines in production can be equally automated and transparent.
In the meantime, happy Hadoop-ing!
Vu Thanh Hai – FPT Software