Apache NiFi

Apache NiFi Course Objectives

  1. NiFi Introduction
    • What is Apache NiFi
    • Challenges of dataflow
    • Capabilities of Apache NiFi
    • Features of Apache NiFi
  2. NiFi core concepts
    • FlowFiles, Processors & Connectors
  3. NiFi Architecture
  4. NiFi Download & Installation
  5. Creating NiFi Process Flows Hands-on
  6. References

NiFi Introduction

What is Apache NiFi:

  1. NiFi is a framework to automate the flow of data between systems.
  2. It is an easy-to-use, powerful, and reliable system to process and distribute data.
  3. Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

NiFi is built to help tackle the modern dataflow challenges.

Some of the high-level challenges of dataflow include:

  1. Systems fail, networks fail, disks fail, software crashes, people make mistakes.
  2. Data access exceeds capacity to consume
  3. Boundary conditions are mere suggestions
  4. Invariably, you will get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
  5. Priorities of an organization change – rapidly. Enabling new flows and changing existing ones must be fast.
  6. Systems evolve at different rates
  7. The protocols and formats used by a given system can change anytime
  8. Compliance and security: Laws, regulations, and policies change.
  9. Business to business agreements change.
  10. System to system and system to user interactions must be secure, trusted, accountable.
  11. Continuous improvement occurs in production

Some of the high-level capabilities and objectives of Apache NiFi include:

  1. Web-based user interface
    • Seamless experience between design, control, feedback, and monitoring
    • Provides visual creation and management of directed graphs of processors
  2. Highly configurable
    • Loss tolerant vs guaranteed delivery
    • Low latency vs high throughput
    • Dynamic prioritization
    • Flows can be modified at runtime
    • Back pressure
  3. Data Provenance
    • Track dataflow from beginning to end
  4. Designed for extension
    • Build your own processors and more
    • Enables rapid development and effective testing
  5. Secure
    • SSL, SSH, HTTPS, encrypted content, etc.
    • Multi-tenant authorization and internal authorization/policy management

Features of Apache NiFi:

  1. Automate the flow of data between systems
    • E.g., JSON -> Database, FTP -> Hadoop, Kafka -> Elasticsearch, etc.
  2. Drag and drop interface
  3. Focus on configuration of processors (i.e., only what matters to the user)
  4. Scalable across a cluster of machines
  5. Guaranteed Delivery / No Data Loss
  6. Data Buffering / Back Pressure / Prioritized Queuing / Latency vs Throughput

What Apache NiFi is good at:

  1. Reliable and secure transfer of data between systems
    • Delivery of data from sources to analytics platforms
    • Enrichment and preparation of data, such as:
      • Conversion between formats
      • Extracting/Parsing
      • Routing decisions

What Apache NiFi shouldn’t be used for:

  1. Distributed Computation
  2. Complex Event processing
  3. Joins, rolling windows, aggregate operations

NiFi Core Concepts:

FlowFile: Each piece of “User Data” (i.e., data that the user brings into NiFi for processing and distribution) is referred to as a FlowFile. A FlowFile is made up of two parts: Attributes and Content. The Content is the User Data itself. Attributes are key-value pairs that are associated with the User Data.

Processor: The Processor is the NiFi component that is responsible for creating, sending, receiving, transforming, routing, splitting, merging, and processing FlowFiles. It is the most important building block available to NiFi users to build their dataflows.

In simple terms:

FlowFile:

It’s basically the data moving through NiFi.

It is comprised of two elements:

  • Content: the data itself
  • Attributes: key-value pairs associated with the data (creation date, filename, etc.)

It gets persisted to disk after creation.

Processor:

  1. Applies a set of transformations and rules to FlowFiles to generate a new FlowFile
  2. Any processor can process any FlowFile
  3. Processors pass FlowFile references to each other to advance the data processing
  4. They run in parallel (on different threads)
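The two concepts above can be sketched in plain Python. This is a conceptual model only, not NiFi's actual Java API; the class and function names here are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """A conceptual FlowFile: content bytes plus key-value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)

def update_filename(flowfile: FlowFile) -> FlowFile:
    """A toy processor: emits a new FlowFile with an updated attribute,
    leaving the input untouched, mirroring how a NiFi processor produces
    a new FlowFile rather than mutating history."""
    new_attrs = dict(flowfile.attributes)
    new_attrs["filename"] = new_attrs.get("filename", "data") + ".processed"
    return FlowFile(content=flowfile.content, attributes=new_attrs)

ff = FlowFile(b'{"id": 1}', {"filename": "record.json"})
out = update_filename(ff)
print(out.attributes["filename"])  # record.json.processed
```

Note how the processor only touches what it needs to: the content bytes flow through unchanged while a new set of attributes is produced.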

               

Connector:

It’s basically a queue of all the FlowFiles that are yet to be processed by the next Processor.

It defines the rules about how FlowFiles are prioritized (which ones go first, and which ones not at all).

It can also define back pressure to avoid overflowing the system.
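The connector behaviour described above can be sketched with Python's standard bounded queue. This is only an analogy: NiFi's real connections also persist queued FlowFiles to disk and support pluggable prioritizers, but the back-pressure mechanics look roughly like this:

```python
import queue

# A connector modeled as a bounded queue: maxsize plays the role of
# NiFi's back-pressure object threshold.
connection = queue.Queue(maxsize=2)

connection.put_nowait("flowfile-1")
connection.put_nowait("flowfile-2")

# The queue is now full: back pressure kicks in, and the upstream
# "processor" cannot enqueue more work until downstream catches up.
try:
    connection.put_nowait("flowfile-3")
    backpressure_applied = False
except queue.Full:
    backpressure_applied = True

# A downstream processor consumes one FlowFile, relieving the pressure.
first = connection.get_nowait()
connection.put_nowait("flowfile-3")

print(backpressure_applied, first)  # True flowfile-1
```

In NiFi the same idea is configured per connection as "Back Pressure Object Threshold" and "Back Pressure Data Size Threshold"; when a threshold is hit, the upstream processor is no longer scheduled.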


Categorization of processors:

There are over 300 bundled processors, grouped into categories such as:

  1. Data Transformation: ReplaceText, JoltTransformJSON…
  2. Routing and Mediation: RouteOnAttribute, RouteOnContent, ControlRate…
  3. Database Access: ExecuteSQL, ConvertJSONToSQL, PutSQL…
  4. Attribute Extraction: EvaluateJsonPath, ExtractText, UpdateAttribute…
  5. System Interaction: ExecuteProcess …
  6. Data Ingestion: GetFile, GetFTP, GetHTTP, GetHDFS, ListenUDP, GetKafka…
  7. Sending Data: PutFile, PutFTP, PutKafka, PutEmail…
  8. Splitting and Aggregation: SplitText, SplitJson, SplitXml, MergeContent…
  9. HTTP: GetHTTP, ListenHTTP, PostHTTP…
  10. AWS: FetchS3Object, PutS3Object, PutSNS, GetSQS
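As an illustration of the routing-and-mediation category, here is a rough Python analogue of what RouteOnAttribute does. The rules and relationship names below are invented for the sketch; in NiFi they would be user-defined properties written in the NiFi Expression Language:

```python
# Each FlowFile is modeled here as a dict of attributes. Routing rules
# map a relationship name to a predicate over those attributes, loosely
# mimicking RouteOnAttribute's user-defined properties.
rules = {
    "large": lambda attrs: int(attrs.get("file.size", 0)) > 1024,
    "json":  lambda attrs: attrs.get("filename", "").endswith(".json"),
}

def route(attrs):
    """Return the first matching relationship, or 'unmatched'."""
    for relationship, predicate in rules.items():
        if predicate(attrs):
            return relationship
    return "unmatched"

print(route({"filename": "big.bin", "file.size": "2048"}))   # large
print(route({"filename": "small.json", "file.size": "10"}))  # json
print(route({"filename": "tiny.txt", "file.size": "10"}))    # unmatched
```

The real processor is more flexible (it can, for example, route a FlowFile to every matching relationship rather than just the first), but the core idea is the same: decisions are made on cheap attribute metadata without touching the content.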

FlowFile Topology:

A FlowFile has two components

Attributes:

These are the metadata of the FlowFile.

They contain information about the content: e.g. when it was created, where it came from, and what data it represents.

Content:

That’s the actual content of the FlowFile, e.g. the bytes of a file you would read using GetFile.

A processor can do either or both of the following:

  1. Update, add, or remove attributes
  2. Change content

NiFi Architecture:

NiFi executes within a JVM on a host operating system.

The primary components of NiFi on the JVM are as follows:

Web Server

The purpose of the web server is to host NiFi’s HTTP-based command and control API.

Flow Controller

The flow controller is the brains of the operation. It provides threads for extensions to run on, and manages the schedule of when extensions receive resources to execute.

Extensions

There are various types of NiFi extensions which are described in other documents. The key point here is that extensions operate and execute within the JVM.

FlowFile Repository

The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow. The implementation of the repository is pluggable. The default approach is a persistent Write-Ahead Log located on a specified disk partition.
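To make the write-ahead idea concrete, here is a minimal sketch of the pattern, not NiFi's actual repository format (which is far more involved): every state change is appended durably to a log before being applied, so the in-memory state can be rebuilt after a crash by replaying the log.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Minimal write-ahead log: append each state change durably,
    then rebuild in-memory state by replaying the log on startup."""
    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):            # recovery: replay the log
            with open(path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.state[entry["id"]] = entry["status"]

    def update(self, flowfile_id, status):
        with open(self.path, "a") as f:     # write the log entry first...
            f.write(json.dumps({"id": flowfile_id, "status": status}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[flowfile_id] = status    # ...then apply it in memory

path = os.path.join(tempfile.mkdtemp(), "flowfile-repo.log")
wal = WriteAheadLog(path)
wal.update("ff-1", "queued")
wal.update("ff-1", "processing")

recovered = WriteAheadLog(path)             # simulate restart after a crash
print(recovered.state)  # {'ff-1': 'processing'}
```

Because the log entry reaches disk before the state change is considered applied, a crash between the two steps loses nothing: replaying the log reproduces the latest recorded state.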

Content Repository

The Content Repository is where the actual content bytes of a given FlowFile live. The implementation of the repository is pluggable. The default approach is a fairly simple mechanism, which stores blocks of data in the file system. More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume.

Provenance Repository

The Provenance Repository is where all provenance event data is stored. The repository construct is pluggable with the default implementation being to use one or more physical disk volumes. Within each location event data is indexed and searchable.

NiFi is also able to operate within a cluster.

  1. Starting with the NiFi 1.0 release, a Zero-Master Clustering paradigm is employed.
  2. Each node in a NiFi cluster performs the same tasks on the data, but each operates on a different set of data.
  3. Apache ZooKeeper elects a single node as the Cluster Coordinator, and failover is handled automatically by ZooKeeper.
  4. All cluster nodes report heartbeat and status information to the Cluster Coordinator.
  5. The Cluster Coordinator is responsible for disconnecting and connecting nodes.
  6. Additionally, every cluster has one Primary Node, also elected by ZooKeeper.
  7. As a DataFlow manager, you can interact with the NiFi cluster through the user interface (UI) of any node.
  8. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points.

NiFi Downloading and Installation

NiFi can be downloaded from the NiFi Downloads Page. There are two packaging options available: a “tarball”, tailored more to Linux, and a zip file that is more applicable for Windows users.

Download NiFi binaries from https://nifi.apache.org/download.html

After downloading, extract the archive and start NiFi from its root directory (replace <version> with the version you downloaded):

  tar xzf nifi-<version>-bin.tar.gz
  cd nifi-<version>
  bin/nifi.sh start      (on Linux/macOS)
  bin\run-nifi.bat       (on Windows)

Now that NiFi has been started, we can bring up the User Interface (UI) in order to create and monitor our dataflow. To get started, open a web browser and navigate to http://localhost:8080/nifi. (Note that recent NiFi releases run over HTTPS by default at https://localhost:8443/nifi, with a generated username and password written to logs/nifi-app.log.)

This will bring up the User Interface, which at this point is a blank canvas for orchestrating a dataflow:

The UI has multiple tools to create and manage your first dataflow:

The Global Menu contains additional options for managing the flow.

Adding a Processor

We can now begin creating our dataflow by adding a Processor to our canvas. To do this, drag the Processor icon from the top-left of the screen into the middle of the canvas (the graph-paper-like background) and drop it there. This will bring up a dialog that allows us to choose which Processor we want to add:

Sample NiFi Process Flow:
