
Real-time Stream Processing and Analytics at Large Scale Using Apache NiFi, HDFS, Hive and Power BI

Twitter’s developer platform provides numerous API endpoints to collect data and build apps on Twitter. The Twitter streaming API allows us to collect live tweets. In this blog, I show how I used Twitter streaming data to build interactive dashboards.
I used Apache NiFi, Power BI and Hive in this work. The tweets are filtered based on certain keywords and geolocation. You can find the Apache NiFi template I built for this work in my GitHub repo. The NiFi template contains keywords and geolocations that differ from the ones I used in my work.
  • Apache NiFi – collects tweets from the Twitter stream, performs data transformation, and routes the collected data to different systems such as Power BI and the Hive database.
  • Power BI Streaming Dataset – Power BI supports streaming datasets. I created a streaming dataset and ingested the data from NiFi through the streaming dataset API. The Power BI dashboard is built on top of the streaming dataset. To learn how to create a streaming dataset, please check this link: Power BI Streaming dataset.
  • Hive – the Power BI streaming dataset is used to build the dashboard, but it stores only the latest records; for example, beyond roughly 200 tweets, new tweets delete the old ones. All the tweets collected from the Twitter stream are therefore stored in HDFS, and a Hive table is built on top of the HDFS files. The Hive dataset can be used for further processing such as data analysis and model building.
You can download the NiFi template from the GitHub repo and import it into your environment. The processors below need to be configured to run the template.

GetTwitter processor
The GetTwitter processor allows us to connect to the Twitter API endpoints. Configure the processor to set the Consumer Key, Consumer Secret, Access Token, and Access Token Secret. By default, the NiFi template collects tweets that match the keywords “insurance, claim, rain, winter, storm” and are tweeted from the United States. Modify the “Terms to Filter On” and “Locations to Filter On” properties if needed.
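If you want to sanity-check the four credential values before pasting them into the GetTwitter processor, a small script like the one below can help. This is only an optional sketch using the tweepy library (it is not part of the NiFi template), and the placeholder key values are assumptions you must replace with your own.

import tweepy

# Placeholder credentials – replace with the keys from your Twitter developer app.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate with the same four values the GetTwitter processor expects.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Raises an error if the keys are invalid; otherwise prints the account name.
user = api.verify_credentials()
print("Authenticated as:", user.screen_name)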


Power BI data ingestion
The processed data needs to be sent to the Power BI streaming dataset. The InvokeHTTP processor allows us to call the Power BI streaming dataset API endpoint; from NiFi, the data is ingested into Power BI through a REST API call. Note: create the Power BI streaming dataset if you do not have one before proceeding further. Configure the InvokeHTTP processor to set:
  • Remote URL – the Power BI API URL for the streaming dataset
  • Basic Authentication Username – Power BI username
  • Basic Authentication Password – Power BI password
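For reference, the REST call that InvokeHTTP performs can be reproduced with a few lines of Python. This is a minimal sketch: the push URL placeholders and the column names (created_at, user, tweet, location) are assumptions for illustration; use the URL and schema of your own streaming dataset.

import json
import requests

# Placeholder push URL of the Power BI streaming dataset (copy it from the dataset's API info page).
push_url = "https://api.powerbi.com/beta/<workspace-id>/datasets/<dataset-id>/rows?key=<access-key>"

# One tweet formatted to match the (hypothetical) streaming dataset schema.
rows = {
    "rows": [
        {
            "created_at": "2019-01-15T10:30:00Z",
            "user": "some_user",
            "tweet": "Heavy rain and a winter storm are expected this week",
            "location": "United States",
        }
    ]
}

# Power BI streaming datasets accept new rows through a simple HTTP POST.
response = requests.post(
    push_url,
    data=json.dumps(rows),
    headers={"Content-Type": "application/json"},
)
print(response.status_code)  # 200 means the rows were accepted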


Hive data ingestion
The Power BI streaming dataset has limited storage and keeps only the latest tweets. To access all the data, we store the tweets in HDFS files, and a Hive table needs to be created on top of the HDFS files. The processed tweets are stored as a delimited file, so create the Hive table schema based on the delimited input that is fed to the PutHDFS processor. Configure the PutHDFS processor to set:
  • Kerberos Principal – set this for authentication, for example hdfs-twitter@domain.com
  • Directory – the full path to the HDFS directory where you want to store the collected tweets
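As an illustration of what that table definition can look like, the sketch below creates an external Hive table over the HDFS directory that PutHDFS writes to, using the pyhive library. The host, username, directory, column names, and the '|' delimiter are all assumptions; adjust them to match the delimited output your pipeline actually produces.

from pyhive import hive

# Hypothetical connection details for the Hive server.
conn = hive.Connection(host="hive-server.domain.com", port=10000, username="hdfs-twitter")
cursor = conn.cursor()

# External table over the HDFS directory used by the PutHDFS processor.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tweets (
        created_at STRING,
        user_name  STRING,
        tweet_text STRING,
        location   STRING,
        device     STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION '/user/hdfs-twitter/tweets'
""")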



The NiFi template has two streams: one for the Power BI streaming dataset and another for HDFS storage.

Power BI Dashboard
The following Power BI reports were created using the data ingested from NiFi.

Visualize live tweets

Users and their device type

Search tweets


Hive
The NiFi template has a pipeline to store the tweets in HDFS, and the Hive table is built on top of the HDFS files. Once the table is created successfully, you can access the data in Hive.
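As a quick example of that access, the sketch below runs a query similar to the “Users and their device type” report against the hypothetical tweets table defined earlier; the connection details and column names are assumptions.

from pyhive import hive

# Hypothetical connection details – same Hive server as in the table-creation sketch.
conn = hive.Connection(host="hive-server.domain.com", port=10000, username="hdfs-twitter")
cursor = conn.cursor()

# Count tweets per device type.
cursor.execute("""
    SELECT device, COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY device
    ORDER BY tweet_count DESC
""")
for device, tweet_count in cursor.fetchall():
    print(device, tweet_count)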

Hive query

Thoughts on future work!
  • Build a machine learning model on the collected data
  • Change the NiFi pipeline to run inference via an API call to the model
  • Update the Power BI report to include the inference results
