My name is Salah, and I am a Big Data engineer working at BigApps.
BigApps is a specialist in performance management consulting and the integration of innovative technological solutions in Big Data.
We believe that IT is transforming our societies. We know that milestones are the result of sharing knowledge and the pleasure of working together. We are always looking for the best ways forward.
Apache NiFi as a flow-based programming platform.
- Introduction to Apache NiFi and Apache Kafka
- What is dataflow?
- What is Apache NiFi?
- What is Apache MiNiFi?
- What can be done with Apache NiFi?
- Apache NiFi architecture
- How to install Apache NiFi on CentOS 7?
- Build a first processor and data processing
- Example 1
- Apache NiFi and Apache Kafka together
- Apache NiFi as a Kafka producer and consumer
- Example 2
- Apache NiFi in real-time event streaming with Kafka
- Example 3
Today we have many ETL and data integration tools. Some of these solutions are commercial and rather expensive, while others are maintained and operated by communities of developers looking to democratize the process.
With dataflow programming tools, you can visually assemble programs from boxes and arrows, writing zero lines of code. Some of them are open source, and some are well suited to ETL.
ETL is short for extract, transform, load.
Yes, you don’t have to know any programming language. You just use ready-made “processors” represented as boxes and connect them with arrows, which represent the exchange of data between processors, and that’s it.
There are three main types of boxes: sources, processors, and sinks. Think Extract for sources, Transform for processors, and Load for sinks.
Almost anything can be a source: for example, files on disk or in AWS, a JDBC query, Hadoop, a web service, MQTT, RabbitMQ, Kafka, Twitter, or a UDP socket.
A processor can enhance, verify, filter, join, split, or adjust data. If the ready-made processor boxes are not enough, you can code in Python, Shell, Groovy, or even Spark for data transformation.
Sinks are basically the same as sources, but they are designed for writing data.
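To make the source, processor, and sink roles concrete, here is a minimal sketch in plain Python. The names are illustrative only, not a NiFi API: a generator plays the source, a filtering function plays the processor, and a list plays the sink.

```python
# Minimal dataflow sketch: source -> processor -> sink.
# Illustrative names only; this is not NiFi code.

def source():
    """Source (Extract): emit raw records."""
    for line in ["10", "oops", "25", "3"]:
        yield line

def processor(records):
    """Processor (Transform): verify each record, drop bad ones, adjust values."""
    for record in records:
        if record.isdigit():        # verify/filter
            yield int(record) * 2   # adjust

def sink(records):
    """Sink (Load): write results somewhere (here, just a list)."""
    return list(records)

result = sink(processor(source()))
print(result)  # [20, 50, 6]
```

In a visual tool like NiFi, each of these three functions would be a box, and the function composition would be the arrows between them.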
Apache Kafka is an open-source, distributed streaming platform used for storing, reading, and analysing streaming data.
Kafka was originally created at LinkedIn, where it played a part in analysing the connections between their millions of professional users in order to build networks between people. It was given open-source status and passed to the Apache Software Foundation, which coordinates and oversees the development of open-source software, in 2011.
Apache Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
Apache NiFi can work as both a producer and a consumer for Kafka. Both approaches are valid, and the choice depends on your requirements and scenario.
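In either direction, what actually crosses the wire is bytes: Kafka stores raw byte payloads, so a producing flow serializes each record (for example as JSON) before publishing, and a consuming flow deserializes it on the way back. The helpers below are illustrative stand-ins for that round trip, not NiFi or Kafka client code.

```python
import json

# Illustrative sketch of the serialize/deserialize round trip a
# producer/consumer pair performs; Kafka itself only sees the bytes.

def serialize(event: dict) -> bytes:
    """What a producer hands to the broker for one record."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

def deserialize(payload: bytes) -> dict:
    """What a consumer does with the bytes it reads back."""
    return json.loads(payload.decode("utf-8"))

event = {"sensor": "s1", "temp": 21.5}
payload = serialize(event)
assert isinstance(payload, bytes)
assert deserialize(payload) == event
```

With a real broker you would hand `payload` to a Kafka client library (or let NiFi's Kafka processors do this for you), but the byte-level contract stays the same.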
For more details about Kafka, you can follow these links: kafka.apache.org (Apache Kafka: A Distributed Streaming Platform) and docs.confluent.io (Introduction to Kafka, Confluent Platform).
What does Dataflow mean?
Dataflow is the movement of data through a system composed of software, hardware, or a combination of both.
Dataflow is often defined using a model or diagram in which the entire process of data movement is mapped as it passes from one component to the next within a program or a system, taking into consideration how it changes form during the process.
What is Apache NiFi?
Apache NiFi is an open-source ETL tool. It was donated by the NSA to the Apache Foundation in 2014, and current development and support are provided mostly by Hortonworks.
Apache NiFi is a dataflow management system that comes with a web UI built to provide an easy way to handle data flows in real time. The most important concept to understand for a quick start with NiFi is flow-based programming.
In plain terms, you create a series of nodes connected by a series of edges, forming a graph that the data moves through.
In NiFi, these nodes are processors and these edges are connectors. The data is stored within a packet of information known as a FlowFile, which carries things like content and attributes; we will get into more specifics later.
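A FlowFile can be pictured as a small record pairing the raw content with a map of attributes. The sketch below models that idea in plain Python; it is an illustration of the concept, not NiFi's actual Java API.

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Toy model of a NiFi FlowFile: raw content plus key/value attributes.
    Illustrative only; the real FlowFile is a Java object managed by NiFi."""
    content: bytes
    attributes: dict = field(default_factory=dict)

ff = FlowFile(
    content=b'{"user": "salah"}',
    attributes={"filename": "event.json", "mime.type": "application/json"},
)

# A processor typically reads the content and/or updates the attributes:
ff.attributes["route"] = "valid"
print(ff.attributes["filename"])  # event.json
```

Connectors pass these packets from one processor to the next, which is what makes the graph a dataflow rather than just a diagram.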
Each individual processor comes with a variety of information readily available.
The status shows whether that processor is stopped, started, or incorrectly configured.
Time statistics give you a brief window into the activity of that processor, which is useful in case more or less data is coming through than you expected.
Here is a very small sample of a few of the different processors available to us in NiFi. I personally like to group them like this: inputs, outputs, and the transformations and flow logic that go in between.
These inputs and outputs range from local files to cloud services to databases and everything in between. Apache NiFi is open source and easily extendable, so any processor not yet included can be created on the fly to your own specifications. For now, here is the example provided on the NiFi home page.
What is Apache MiNiFi?
MiNiFi is a sub-project of Apache NiFi. It can bring data from sources directly to a central NiFi instance and is able to run most of NiFi’s available processors.
MiNiFi is used as an agent, letting us apply the primary features of NiFi at the earliest possible stage, and data can be collected from a variety of protocols.
To learn more about MiNiFi, follow this link:
What can be done with Apache NiFi processors?
286 Pre-Built Processors
- Ingestion: connectors to read/write data from/to several data sources
  - Protocols: HTTP(S), AMQP, MQTT, UDP, TCP, CEF, JMS, (S)FTP, etc.
  - Brokers: Kafka, JMS, AMQP, MQTT, etc.
  - Databases: JDBC, MongoDB, HBase, Cassandra, etc.
- Extraction: XML, JSON, Regex, Grok, etc.
- Transformation:
  - Format conversion (JSON to Avro, CSV to ORC, etc.)
  - Compression/decompression, Merge, Split, etc.
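Format conversion is easy to picture with a small example. NiFi ships record-based processors for this kind of work; the snippet below just sketches the idea of a CSV-to-JSON conversion in plain Python with the standard library, independent of NiFi.

```python
import csv
import io
import json

def csv_to_json(csv_text: str) -> str:
    """Convert CSV text (with a header row) into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

csv_text = "id,name\n1,alice\n2,bob\n"
print(csv_to_json(csv_text))
# [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]
```

A conversion processor in a flow does essentially this, but streaming record by record and attaching schema information instead of loading everything into memory.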