This document details the steps to set up new applications using the Sprngy Admin UI, define models, and run workloads.
Demo Case: Finance
Example 1: Assess the Risk of a Loan
Objective - Determine whether a loan will be approved or not by the issuing bank
Overview
From source:
“The bank wants to improve their services. For instance, the bank managers have only vague idea, who is a good client (whom to offer some additional services) and who is a bad client (whom to watch carefully to minimize the bank loses). Fortunately, the bank stores data about their clients, the accounts (transactions within several months), the loans already granted, the credit cards issued. The bank managers hope to improve their understanding of customers and seek specific actions to improve services. A mere application of a discovery tool will not be convincing for them.
The data about the clients and their accounts consist of following relations:
relation account - each record describes static characteristics of an account,
relation client - each record describes characteristics of a client,
relation disposition - each record relates together a client with an account i.e. this relation describes the rights of clients to operate accounts,
relation permanent order - each record describes characteristics of a payment order,
relation transaction - each record describes one transaction on an account,
relation loan - each record describes a loan granted for a given account,
relation credit card - each record describes a credit card issued to an account,
relation demographic data - each record describes demographic characteristics of a district.
Each account has both static characteristics (e.g. date of creation, address of the branch) given in relation "account" and dynamic characteristics (e.g. payments debited or credited, balances) given in relations "permanent order" and "transaction". Relation "client" describes characteristics of persons who can manipulate with the accounts. One client can have more accounts, more clients can manipulate with single account; clients and accounts are related together in relation "disposition". Relations "loan" and "credit card" describe some services which the bank offers to its clients; more credit cards can be issued to an account, at most one loan can be granted for an account. Relation "demographic data" gives some publicly available information about the districts (e.g. the unemployment rate); additional information about the clients can be deduced from this.”
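The relational structure described above can be illustrated with a toy join in base R. The tables and column names below are illustrative stand-ins, not the exact schema of the downloaded files:

```r
# Toy versions of the account, client, and disposition relations
# (hypothetical IDs and columns, for intuition only)
account     <- data.frame(account_id = c(1, 2), district_id = c(10, 20))
client      <- data.frame(client_id = c(100, 101, 102), district_id = c(10, 10, 20))
disposition <- data.frame(disp_id    = c(1000, 1001, 1002),
                          client_id  = c(100, 101, 102),
                          account_id = c(1, 1, 2),
                          type       = c("OWNER", "DISPONENT", "OWNER"))

# Which clients can operate account 1? (more clients can manipulate a single account)
operators <- merge(disposition[disposition$account_id == 1, ], client, by = "client_id")
operators$client_id  # clients 100 and 101
```

The disposition relation is what ties clients to accounts, which is why it sits between them in the file hierarchy later in this walkthrough.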
Classification:
The file structure is going to look like this:
Download the files:
district
account
client
loan
order
transaction
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines the concepts of a Data Lake and a Data Warehouse, providing large storage capacity with data management features |
|---|---|---|
| Curated: creating or preparing data to make it usable for business analysis | ✔ | |
| Correlated: running algorithms to discover patterns and relationships within the data | ✔ | |
| Normalized: structuring the data to enable rapid access | ✔ | |
| Analyze: identifying useful information to support decision-making, often using visualizations | ✔ | |
| Modelling: building statistical models and testing them | | |
Step 1: Setting up the Application
Now that the business use and classification of the application is established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Make sure to indicate District as the parent of Account and Client, Account as the parent of Loan, Order, Transaction, and Display, and Display as the parent of Card.
We will begin with the District dataset:
Step 2: Meta Model, Ingest Model, and Workloads
We are going to use the Bulk Upload Meta feature to create our meta model for the District dataset. In the AdminUI menu, under the Meta Model dropdown, select the Bulk Upload Meta option. Select the Loan Application module and the District entity, and import the meta model CSV file into the bulk upload. Once you hit submit, this will automatically generate the meta model for the District dataset.
Next, similar to the meta model, you can bulk upload the ingest model in AdminUI, or create the ingest model by selecting processors. You can refer to this page to understand what each of the processors does.
Once we submit the Ingest Model, we can run the workloads under the Workload Management/Run Workloads page.
First, run the SDL-FDL workload and confirm in HDFS that it ran correctly. Then run the FDL-BDL workload. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the LOANAPPLICATION/District/BDL/Fact directory in HDFS.
Data Upload
Upload the data using the 'Data Upload' option. Select the Loan Application module and the District entity, and import the CSV file into the data upload. Once you hit submit, this will automatically upload the District dataset to the SDL/Land folder in the data lake.
SpringyBi
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. Below are three examples of charts you can make in Superset that show the correlations between the various districts, regional statistics, loan amounts and durations, and the status of the loans.
Demo Case: Sports
Example 1: Evaluating NBA players' performances relative to their salaries
Objective - Build the most efficient team while constraining to a team’s cap space
Overview
With the NBA’s increased emphasis on data analytics, team managers have access to numerous metrics that show a player’s offensive and defensive efficiency. One such metric is FiveThirtyEight’s RAPTOR, which measures a player’s net efficiency and wins above replacement (WAR). In 2022, the top five players in Total RAPTOR were as follows:
| Player | Total RAPTOR | 2022 Salary (in $M) |
|---|---|---|
| Nikola Jokic | +14.6 | $31.5 |
| Giannis Antetokounmpo | +8.1 | $39.3 |
| Joel Embiid | +7.8 | $31.5 |
| Rudy Gobert | +6.9 | $35.3 |
| Stephen Curry | +6.4 | $45.8 |
While it would be a manager’s dream to have a starting lineup like this, their combined salaries exceed the cap space of $122M, leaving the team with both a hefty luxury tax bill and no bench players. The goal of this application is to build a full NBA team that gets the most value per dollar spent, according to the RAPTOR metric.
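One simple way to quantify "value per dollar" from the table above is RAPTOR per million dollars of salary. A base-R sketch using the 2022 figures quoted above (the derived `raptor_per_million` column is our own illustrative metric, not part of the dataset):

```r
# Top five 2022 Total RAPTOR players and their salaries (from the table above)
players <- data.frame(
  player = c("Nikola Jokic", "Giannis Antetokounmpo", "Joel Embiid",
             "Rudy Gobert", "Stephen Curry"),
  total_raptor       = c(14.6, 8.1, 7.8, 6.9, 6.4),
  salary_in_millions = c(31.5, 39.3, 31.5, 35.3, 45.8)
)

# RAPTOR points per $1M of salary -- higher means more value per dollar
players$raptor_per_million <- players$total_raptor / players$salary_in_millions

# The combined salaries of this lineup exceed the $122M cap
sum(players$salary_in_millions)  # 183.4

# Rank the lineup by value per dollar
players[order(-players$raptor_per_million), c("player", "raptor_per_million")]
```

Maximizing the sum of a metric like this subject to the salary cap is exactly the kind of question the full application answers across the whole league, not just these five players.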
Classification:
Before defining the application in the system, it is useful to identify key entities and classify the application.
Key Entities:
The file structure is going to look like this:
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines the concepts of a Data Lake and a Data Warehouse, providing large storage capacity with data management features |
|---|---|---|
| Curated: creating or preparing data to make it usable for business analysis | ✔ | |
| Correlated: running algorithms to discover patterns and relationships within the data | ✔ | |
| Normalized: structuring the data to enable rapid access | ✔ | |
| Analyze: identifying useful information to support decision-making, often using visualizations | ✔ | |
| Modelling: building statistical models and testing them | | |
Explanation of the classification: <>
Normalization: <V Structure/Star Model/Snowflake and its explanation>
Step 1: Setting up the application
Now that the business use and classification of the application is established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Make sure to indicate Team as the parent of Players, and Players as the parent of Salaries and Stats.
We will begin with the Teams dataset.
Data Upload
To upload a CSV file of the Teams dataset, go to Data Upload in the left-side menu. Select the module name NBASTATISTICS and the entity name Teams from the dropdown options. Select the file to be uploaded and click Submit.
On submit, the dataset is uploaded to the data lake for that entity.
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Next, create the ingest model in AdminUI. The first part will be defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our Teams data. You can refer to this page to understand what each of the processors does.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
The Ingest Model will be similar to the one we made for the Teams dataset, the only difference being that we define two extra processors for the nested quality of the Players dataset.
After creating the Ingest Model, run the SDL-FDL and FDL-BDL workloads, checking the respective directories to make sure that both workloads ran correctly.
Creating Data Table In Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the Hive terminal.
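The generated statement will look roughly like the following. This is a hypothetical example only; the actual column list, table name, and HDFS location come from the script's output for your entity:

```sql
-- Hypothetical example of a generated Hive DDL statement; the real .hql
-- produced by create_hive_ddl_using_spark_df.R will list your entity's
-- actual columns and HDFS location.
CREATE EXTERNAL TABLE IF NOT EXISTS nbastatistics.teams (
  team_name  STRING,
  as_of_date STRING
)
STORED AS PARQUET
LOCATION '/NBASTATISTICS/Teams/BDL/Fact';
```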
Salaries Dataset
In RStudio, read the CSV file into a dataframe and convert the column names to lowercase. Add an as_of_date column, too. We are going to change the unit of salary to millions of dollars, so we rename the salary column to salary_in_millions and divide each element in that column by 1,000,000. As we did with the Players dataset, we remove all periods and commas from the player_name column.
```r
nba_salaries <- read.csv("/home/bapdevvm/Downloads/NBAdata/nba_salaries.csv", stringsAsFactors = FALSE) %>%
  na.omit() %>%
  rename(salary_in_millions = salary) %>%
  rename_with(tolower)
nba_salaries$as_of_date <- as.character(as.Date(Sys.Date()))
nba_salaries$salary_in_millions <- nba_salaries$salary_in_millions / 1000000
nba_salaries$player_name <- stringr::str_remove_all(nba_salaries$player_name, "[.,]")
```
Upload the data to SDL/Land:
```r
dataops.appenddata(nba_salaries, nbastatistics.salaries.landdb(), 'salariessDataSpark')
```
Meta Model, Ingest Model, and Workloads
The first step in creating the Meta Model for Salaries will be the same as that of Players and Teams - inputting column names and types. Because Salaries is nested under Players, we will have to indicate this by editing the player_name entity in the meta model and adding a parent reference to the FDL/Stage directory of the players.
After creating the meta model, we can create the ingest model for the Salaries dataset. Refer to the Players dataset ingest model, as the processors for the Salaries ingest model will be the same as those for Players.
Once the ingest model is created, run both the workloads for the Salaries dataset, ensuring that they ran correctly by checking HDFS.
Creating Data Table in Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the Hive terminal.
Statistics Dataset
Read the CSV file into a dataframe, convert the column names to lowercase, add an as_of_date column, and remove all periods and commas from player names.
```r
nba_statistics <- read.csv("<nba_statistics_file_path>", stringsAsFactors = FALSE) %>%
  rename_with(tolower)
nba_statistics$as_of_date <- as.character(as.Date(Sys.Date()))
nba_statistics$player_name <- stringr::str_remove_all(nba_statistics$player_name, "[.,]")
```
We also want to keep only the players who played at least 700 minutes in the season, ensuring a large enough sample size for the results to be reliable.
```r
nba_statistics <- nba_statistics %>% dataops.rdo.filterbygreaterthan("mp", 700)
```
Upload data to SDL/Land:
```r
dataops.appenddata(nba_statistics, nbastatistics.statistics.landdb(), 'statisticsDataSpark')
```
Meta Model, Ingest Model, and Workloads
In the meta model, input the column names and data types. The Statistics dataset also shares the player_name column with the Players dataset, so we can nest those two together by editing the player_name entity and adding a parent reference to the FDL/Stage directory of the Players dataset (see the Salaries meta model for reference).
After creating the meta model, we can create the ingest model for the statistics data set.
After creating the ingest model, run the SDL-FDL and FDL-BDL workloads. Check FDL/Stage and BDL/Fact to make sure your workloads ran correctly.
Creating Data Table in Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the Hive terminal.
Step 4: Importing Dataset into Superset
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select the NBAstatistics database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. Below are two examples of charts you can make in Superset that show individual performance relative to salary and each team’s total offensive and defensive efficiency.
Example 2: Data Analysis of the IPL 2022 match dataset
Objective: Analyzing the IPL 2022 match-by-match dataset.
Overview: This is the IPL 2022 match-by-match dataset, which includes information such as match details, team details, winning team name, winning margin, player of the match, etc.
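As a preview of the kind of analysis this dataset enables, counting matches won per team can be sketched in base R. The column name `match_winner` and the rows below are illustrative assumptions; the real dataset's column names may differ:

```r
# Toy IPL-style match records (illustrative rows, not real results)
match_data <- data.frame(
  match_no     = 1:5,
  match_winner = c("GT", "CSK", "GT", "RR", "GT")
)

# Number of matches won by each team, most wins first
wins <- sort(table(match_data$match_winner), decreasing = TRUE)
wins
```

The same aggregation, done over the full dataset in Superset, produces the "matches won by a team" chart shown at the end of this example.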
IPL dataset:
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines the concepts of a Data Lake and a Data Warehouse, providing large storage capacity with data management features |
|---|---|---|
| Curated: creating or preparing data to make it usable for business analysis | ✔ | |
| Correlated: running algorithms to discover patterns and relationships within the data | | |
| Normalized: structuring the data to enable rapid access | ✔ | |
| Analyze: identifying useful information to support decision-making, often using visualizations | ✔ | |
| Modelling: building statistical models and testing them | | |
Step 1: Setting up the application
Now that the business use and classification of the application is established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure.
We will begin with the Match dataset.
In RStudio, read the CSV file into a dataframe and convert the column names to lowercase. Add an as_of_date column, too.
```r
match_data <- read.csv("<match_file_path>", stringsAsFactors = FALSE) %>%
  rename_with(tolower)
match_data$as_of_date <- as.character(as.Date(Sys.Date()))
```
Next, we upload the data to SDL/Land using the dataops.appenddata() function:

```r
dataops.appenddata(match_data, ipl.match.landdb(), 'matchDataSpark')
```
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
This is how our meta model looks after adding all the columns with their respective datatypes. Note that you do not add the as_of_date column, as that will be added automatically.
Next, create the ingest model in AdminUI. The first part will be defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our Match data. You can refer to this page to understand what each of the processors does.
We then define the processors to use from FDL to BDL.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the IPL/Match/FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the IPL/Match/BDL/Fact directory in HDFS.
Step 3: Creating Data Table in Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the hive terminal.
Step 4: Importing Dataset into Superset
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select the IPL database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization.
Below is an example showing the number of matches won by each team in IPL 2022.
Demo Case: Retail
Example 1: Exploratory Data Analysis of the Car Sales dataset
Objective: Analysis of the Car Sales dataset will give insights such as:
Percentage of car sales made by each car company.
Correlation of car sales with engine size and fuel capacity.
Power performance of cars for each manufacturer, based on the relationship between horsepower and car length.
Visualization of the number of cars manufactured and the horsepower with which they are manufactured.
The car companies with the most sales, based on the average horsepower and average price across all their models.
Overview: This is the Car Sales dataset from the years 2008 to 2012, which includes information about different cars. Here, we have to see which features have more impact on car sales, depending on the different attributes given.
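The "which features impact sales" question above is essentially a correlation question. A base-R sketch on toy data; the column names `engine_size`, `fuel_capacity`, and `sales_in_thousands` and all values below are illustrative assumptions, not the real dataset:

```r
# Toy car-sales rows (illustrative values only)
cars <- data.frame(
  engine_size        = c(1.8, 2.0, 2.4, 3.0, 3.5),
  fuel_capacity      = c(13.2, 14.5, 15.9, 17.2, 18.5),
  sales_in_thousands = c(16.9, 39.4, 14.1, 8.6, 6.5)
)

# Pairwise Pearson correlations between the numeric attributes
corr <- cor(cars)
round(corr, 2)
```

In the full workflow, the same correlations are explored visually through Superset charts rather than a raw correlation matrix.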
Car Sales Dataset:
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines the concepts of a Data Lake and a Data Warehouse, providing large storage capacity with data management features |
|---|---|---|
| Curated: creating or preparing data to make it usable for business analysis | ✔ | |
| Correlated: running algorithms to discover patterns and relationships within the data | | |
| Normalized: structuring the data to enable rapid access | ✔ | |
| Analyze: identifying useful information to support decision-making, often using visualizations | ✔ | |
| Modelling: building statistical models and testing them | | |
Step 1: Setting up the application
Now that the business use and classification of the application is established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Since we have a single-layer file structure, we will have just one entity in it.
In RStudio, read the CSV file into a dataframe and convert the column names to lowercase. Add an as_of_date column, too.
```r
data <- read.csv("<dataset_file_path>", stringsAsFactors = FALSE) %>%
  rename_with(tolower)
data$as_of_date <- as.character(as.Date(Sys.Date()))
```
The next step is to upload the data once you are sure about the above steps. Use the below code snippet to upload the data to the Land directory of the corresponding entity:
```r
library(CARSALES)
save_data <- data %>% dataops.overwritedata(carsales.CarSales.landdb(), 'tempTable')
```
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Add the column names and data types from the Car Sales dataset into the Create Meta Model page and then click submit. Note that you do not add the as_of_date column, as that will be added automatically.
Next, create the ingest model in AdminUI. The first part will be defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our Car Sales data. You can refer to this page to understand what each of the processors does.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the BDL/Fact directory in HDFS.
Step 3: Creating Data Table in Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the hive terminal.
Step 4: Importing Database and Dataset into Superset
After adding the database and datasets (as shown in the screenshots above), click on the dataset that you have created to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. The next step is to create a new dashboard to display all the charts in one place. Below are examples of charts you can make in Superset, showing analyses such as:
Percentage of car sales made by each car company.
Correlation of car sales with engine size and fuel capacity.
Power performance of cars for each manufacturer, based on the relationship between horsepower and car length.
Visualization of the number of cars manufactured and the horsepower with which they are manufactured.
The car companies with the most sales, based on the average horsepower and average price across all their models.
Example 2: Exploratory Data Analysis of the eCommerce dataset
Objective: Analysis of the eCommerce dataset will give information about:
Yearly spend by each individual on individual brands.
Yearly spend based on the number of membership years.
Number of items users bought.
The most-sold products on the eCommerce platform.
Comparison between time spent by users on the app vs. on the website.
Analysis of which category of products has the highest price.
Overview: The eCommerce dataset has two entities: Customers and Sales, with Customers being the parent and Sales being the child entity. The Sales dataset has the details of all the products a user has bought, and the Customers dataset has the details of the user’s profile.
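The Customers (parent) to Sales (child) relationship above can be illustrated with a toy join and a per-customer spend total in base R. All column names and values below are hypothetical, for intuition only:

```r
# Toy parent/child tables (illustrative columns and values)
customers <- data.frame(customer_id      = c(1, 2),
                        membership_years = c(4, 2))
sales     <- data.frame(customer_id = c(1, 1, 2),
                        product     = c("shoes", "watch", "phone"),
                        amount      = c(60, 120, 300))

# Join child rows to the parent and total the spend per customer
joined <- merge(sales, customers, by = "customer_id")
spend  <- aggregate(amount ~ customer_id, data = joined, FUN = sum)
spend  # customer 1: 180, customer 2: 300
```

This is the same parent/child nesting that the file structure and meta model encode in the steps below.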
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines the concepts of a Data Lake and a Data Warehouse, providing large storage capacity with data management features |
|---|---|---|
| Curated: creating or preparing data to make it usable for business analysis | ✔ | |
| Correlated: running algorithms to discover patterns and relationships within the data | ✔ | |
| Normalized: structuring the data to enable rapid access | ✔ | |
| Analyze: identifying useful information to support decision-making, often using visualizations | ✔ | |
| Modelling: building statistical models and testing them | | |
Step 1: Setting up the application
Now that the business use and classification of the application is established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Since Customers is the parent of Sales, indicate that relationship when filling out the file structure.
Step 2: Upload the dataset to the Land directory of the given entity.
Read the Customers and Sales CSV files from your system using the sample code snippet below.
```r
data <- read.csv('path/to/your/data/file.csv', stringsAsFactors = FALSE)
```
After reading the data, click on the blue arrow near the dataset name in your environment of Rstudio.
Make sure all column names are lowercase; if not, you can use the code snippet below to convert all column names to lowercase for both the Customers and Sales CSVs.
```r
data <- data %>% dplyr::rename_with(tolower)
```
The dataset should contain an as_of_date column before you upload it to the Land directory of the respective entity.
NOTE: The format for as_of_date should be YYYY-MM-DD
```r
data$as_of_date <- as.character(Sys.Date())
```
The next step is to upload the data once you are sure about the above steps. Use the below code snippet to upload the data to the Land directory of the corresponding entity.
```r
library(ECOMMERCE)
save_data  <- data  %>% dataops.overwritedata(ecommerce.customers.landdb(), 'tempTable')
save_data1 <- data1 %>% dataops.overwritedata(ecommerce.sales.landdb(), 'tempTable')
```
Step 3: Creating Bulk Upload Meta Model
Bulk uploads is an option given in the UI for a user who wants to upload a CSV file of their Meta, Ingest, Analytic, or Import Models. Once uploaded, the respective models will get created in the HDFS.
Create the bulk upload meta model for Customers and Sales using the CSV files attached below, navigating in the UI to Meta Model → Bulk Upload Meta.
Step 4: Creating Bulk Upload Ingest Model
Create the bulk upload ingest model for Customers and Sales by uploading the CSV files attached below, navigating in the UI to Ingest Model → Bulk Upload Ingest.
Step 5: Running the Workloads
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the BDL/Fact directory in HDFS.
Step 6: Creating Data Table in Hive
Go to Home->BAPCode->utilityscripts->master->R and use the create_hive_ddl_using_spark_df.R script, adding your module name and entity name in it, to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created in Home; open this file and copy the statement that was generated. Run it in the Hive terminal.
Step 7: Importing Database and Dataset into Superset and creating a dashboard for the charts
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select your database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
Visualization of the eCommerce dataset will give information about:
Yearly spend by each individual on individual brands.
Yearly spend based on the number of membership years.
Number of items users bought.
The most-sold products on the eCommerce platform.
Comparison between time spent by users on the app vs. on the website.
Analysis of which category of products has the highest price.
Example 3- Exploratory Data Analysis of GoSales dataset.
Objective: Analysis of the GoSales dataset will give information about product sale quantity.
Overview: GO Sales dataset from IBM contains information about daily sales, methods, retailers, and products of a fictitious outdoor equipment retail chain “Great Outdoors” (GO).
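Product sale quantity can be summarized from the daily-sales entity with a simple aggregation. A base-R sketch on toy rows; the column names follow the renamed godailysales columns used later in the upload step, and the values are illustrative:

```r
# Toy daily-sales rows (column names follow the godailysales entity; values illustrative)
godailysales <- data.frame(
  dailysalesproductno = c(101, 102, 101, 103),
  dailysalesquantity  = c(5, 2, 7, 1)
)

# Total quantity sold per product
qty <- aggregate(dailysalesquantity ~ dailysalesproductno,
                 data = godailysales, FUN = sum)
qty  # product 101: 12, product 102: 2, product 103: 1
```

The Superset charts built at the end of this example surface the same per-product totals visually.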
GOSales Dataset:
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines the concepts of a Data Lake and a Data Warehouse, providing large storage capacity with data management features |
|---|---|---|
| Curated: creating or preparing data to make it usable for business analysis | ✔ | |
| Correlated: running algorithms to discover patterns and relationships within the data | ✔ | |
| Normalized: structuring the data to enable rapid access | ✔ | |
| Analyze: identifying useful information to support decision-making, often using visualizations | | |
| Modelling: building statistical models and testing them | ✔ | |
This application can be classified as a Data Lake with Curated, Correlated, Normalized, and Analyzed data.
Step 1: Setting up the application
Now that the business use and classification of the application is established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure.
We will begin by uploading the five GoSales datasets.
In RStudio, read each CSV file into a dataframe, convert the column names to lowercase, and add an as_of_date column. Next, we upload the data to SDL/Land using the dataops.overwritedata() function:
```r
library(GOSALES)

# go1k
data <- read.csv("/home/bapdevvm/Downloads/go_1k.csv", stringsAsFactors = FALSE) %>% rename_with(tolower)
data$as_of_date <- as.character(as.Date(Sys.Date()))
data <- data %>% rename(retailercode = retailer.code, productnumber = product.number, salesdate = date)
save_data <- data %>% dataops.overwritedata(gosales.go1k.landdb(), 'tempTable')

# godailysales
data <- read.csv("/home/bapdevvm/Downloads/go_daily_sales.csv", stringsAsFactors = FALSE) %>% rename_with(tolower)
data$as_of_date <- as.character(as.Date(Sys.Date()))
data <- data %>% rename(retailercode = retailer.code, dailysalesproductno = product.number,
                        ordermethodcode = order.method.code, dailysalesdate = date,
                        dailysalesquantity = quantity, dailysalesunitprice = unit.price,
                        dailysalesunitsaleprice = unit.sale.price)
save_data <- data %>% dataops.overwritedata(gosales.godailysales.landdb(), 'tempTable')

# gomethods
data <- read.csv("/home/bapdevvm/Downloads/go_methods.csv", stringsAsFactors = FALSE) %>% rename_with(tolower)
data$as_of_date <- as.character(as.Date(Sys.Date()))
data <- data %>% rename(ordermethodcode = order.method.code, ordermethodtype = order.method.type)
save_data <- data %>% dataops.overwritedata(gosales.gomethods.landdb(), 'tempTable')

# goproducts
data <- read.csv("/home/bapdevvm/Downloads/go_products.csv", stringsAsFactors = FALSE) %>% rename_with(tolower)
data$as_of_date <- as.character(as.Date(Sys.Date()))
data <- data %>% rename(productcolour = product.color, productunitprice = unit.price,
                        producttype = product.type, productnumber = product.number,
                        productbrand = product.brand, productline = product.line,
                        productunitcost = unit.cost)
save_data <- data %>% dataops.overwritedata(gosales.goproducts.landdb(), 'tempTable')

# goretailers
data <- read.csv("/home/bapdevvm/Downloads/go_retailers.csv", stringsAsFactors = FALSE) %>% rename_with(tolower)
data$as_of_date <- as.character(as.Date(Sys.Date()))
data <- data %>% rename(retailername = retailer.name, retailercountry = country,
                        retailercode = retailer.code, retailertype = type)
save_data <- data %>% dataops.overwritedata(gosales.goretailers.landdb(), 'tempTable')
```
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Add the column names and data types from all the datasets of GoSales into the Create Meta Model page and then click submit. Note that you do not add the as_of_date column, as that will be added automatically.
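To collect the column names and data types to enter on the Create Meta Model page, a quick base-R helper can be run on each loaded data frame. This is only a sketch on a small stand-in frame (the columns and values here are illustrative); in practice run it on the data frame read from each CSV:

```r
# Stand-in for a loaded GoSales data frame.
df <- data.frame(
  retailercode  = c(1201L, 1202L),
  productnumber = c(70240L, 70241L),
  quantity      = c(5L, 3L),
  salesdate     = c("2015-01-02", "2015-01-03"),
  stringsAsFactors = FALSE
)

# List each column with its R type, to copy into the Create Meta Model page.
meta <- data.frame(
  column = names(df),
  type   = unname(vapply(df, function(col) class(col)[1], character(1))),
  stringsAsFactors = FALSE
)
print(meta)
```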
Next, create the Ingest Model in AdminUI. The first part is defining which processors to use from SDL to FDL, and then from FDL to BDL. Below are the SDL-FDL and FDL-BDL processors we select for all five datasets; you can refer to this page to understand what each processor does. Here we show only one parent entity (go1k) and one child entity (godailysales).
Below are the processors for the child entity godailysales; for the other three child entities, add the same processors for the SDL-FDL and FDL-BDL stages.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page for all five entities. Below we show just one entity (go1k); follow the same steps for the other four entities separately.
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the GOSALES/go_1k/FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the GOSALES/go_1k/BDL/Fact directory in HDFS.
Step 3: Creating Data Table in Hive
Go to Home->BAPCode->utilityscripts->master->R and use the create_hive_ddl_using_spark_df.R script, adding your module name and entity name to it, to generate the Hive SQL statement that creates the data table in Hive. Once you run the script, a .hql file is created in Home; open this file, copy the generated statement, and run it in the hive terminal.
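For reference, the generated .hql file contains an external-table definition over the entity's BDL/Fact parquet directory. A minimal sketch of its shape (the column list here is illustrative; the real one is derived from the Spark data frame by the script):

```sql
-- Illustrative only: actual columns and types come from the generated .hql
CREATE EXTERNAL TABLE GOSALES.go1k (
  retailercode  INT,
  productnumber INT,
  salesdate     STRING,
  as_of_date    STRING
)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/BigAnalytixsPlatform/GOSALES/go_1k/BDL/Fact';
```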
Step 4: Importing Database and Dataset into Superset and creating a dashboard for the charts
The screenshot above shows the configuration for just one entity; add the dataset individually for the other four entities as well.
The chart below shows the visualization of Sales Quantity:
Demo Case: Entertainment
Example 1- Exploratory Data Analysis of a Movie Dataset (IMDB)
Objective: Given the IMDB dataset for seven different entities we will visualize:
The popularity of a movie and its genres, based on the movie's rank.
The relationship between movies and directors, which also shows which directors directed the most popular movies.
Details of the directors and the genres in which they make their movies.
A summary of famous actors and their popularity, based on the roles they played in a movie.
Overview: The IMDB dataset has 7 different entities, which together give the details for a particular movie: the popularity of the movie, its directors and actors, and the genres of the movies that the actors and directors work in. Below is the IMDB dataset in SQL format.
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines concepts of Data Lake and Data Warehouse, providing large storage capacity combined with data management features |
|---|---|---|
| Curated: Curating data involves creating or preparing data to make it usable for business analysis. | ✔ | |
| Correlated: Correlated data means running algorithms to discover patterns and relationships within the data. | ✔ | |
| Normalized: Normalizing data involves structuring the data to enable rapid access. | ✔ | |
| Analyze: Data analysis involves identifying useful information to support decision-making, often using visualizations. | ✔ | |
| Modelling: This involves building statistical models and testing them. | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Since we have a single-layer file system, we will have just one entity in it.
Since the dataset was in SQL format, it was first imported using the following command in the terminal, and then converted into parquet format by executing the Import Model, which stored it in SDL Land.
sudo mysql imdb < /home/bapdevvm/Downloads/imdb.sql
Import Model:
The above is for one entity, i.e. “actors“; there are 7 entities in total, and you have to follow similar steps for the other 6, changing only the table names per entity.
Entities:
Movie Genres
Movie
Actors
Roles
Movies Directors
Directors
Director Genres
The relationship between the parent-child entities is shown by below flowchart:
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Ingest Model:
Ingest model for the only parent (root) entity:
Ingest model for all child entities:
Analytical Model:
To find top 10 movies
Run Import Workloads:
Run Analytical Workloads
Follow the above steps for all the other entities to run the import workload.
Step 3: Necessary manipulations before loading the data further from SDL to FDL
Once the data is loaded into the SDL Land db by running the import workload, write the code below in an R script to add the “as_of_date” column to the SDL Land data.
library(IMDB)

entities <- c('actors', 'directors', 'directorgenres', 'moviesdirectors',
              'movies', 'moviesgenres', 'roles')
module_name <- 'imdb'

write_data <- lapply(entities, function(ent) {
  today_date <- as.character(Sys.Date())
  land_db <- base::eval(parse(text = base::paste(module_name, ent, 'landdb()', sep = '.')))
  data <- dataops.readparquet(land_db, "tempTable") %>%
    dataops.sdo.deselectcolumns("row_id") %>%
    mutate(as_of_date = today_date) %>%
    dataops.overwritedata(land_db, "tempTable")
})
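The eval(parse(...)) call above simply looks up a generated accessor function (e.g. imdb.actors.landdb) by its dotted name and calls it. As a sketch, the same lookup can be done with base R's get(), which avoids parsing arbitrary strings; the stand-in functions and paths below are hypothetical:

```r
# Hypothetical stand-ins for the generated accessor functions.
imdb.actors.landdb <- function() "/BigAnalytixsPlatform/IMDB/actors/SDL/Land"
imdb.movies.landdb <- function() "/BigAnalytixsPlatform/IMDB/movies/SDL/Land"

# Build the dotted function name, fetch the function, and call it.
lookup_landdb <- function(module_name, ent) {
  fn <- get(paste(module_name, ent, "landdb", sep = "."))
  fn()
}

lookup_landdb("imdb", "actors")
```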
Since the dataset is too big, we will truncate it to 10k rows with the following R code:
data1 <- dataops.readparquet(imdb.moviesgenres.ingesteddb(), "movies_genres") %>%
  head(10000) %>%
  dataops.overwritedata(imdb.moviesgenres.landdb(), "movies_genres")
Repeat the above code for each of the remaining entities.
Now, since there were conflicts in column names between the parent and child entities, we make some necessary changes in the Land db; use the code below for the required changes.
# Renaming the movies column "year" to "movie_year"
data1 <- dataops.readparquet(imdb.movies.landdb(), "movies") %>%
  rename(movie_year = year) %>%
  dataops.overwritedata(imdb.movies.landdb(), "movies")

# Renaming the directors column "id" to "director_id"
data1 <- dataops.readparquet(imdb.directors.landdb(), "directors") %>%
  rename(director_id = id) %>%
  dataops.overwritedata(imdb.directors.landdb(), "directors")

# Renaming the actors column "id" to "actor_id"
data1 <- dataops.readparquet(imdb.actors.landdb(), "actors") %>%
  rename(actor_id = id) %>%
  dataops.overwritedata(imdb.actors.landdb(), "actors")

# Renaming the director_genres column "genre" to "director_genre"
data1 <- dataops.readparquet(imdb.directorgenres.landdb(), "directors_genres") %>%
  rename(director_genre = genre) %>%
  dataops.overwritedata(imdb.directorgenres.landdb(), "directors_genres")
Step 4: Running the workloads
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the BDL/Fact directory in HDFS.
Step 5: Creating Data Table in Hive
Go to Home->BAPCode->utilityscripts->master->R, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement by adding your module name and entity name in it, which will help to create a data table in Hive. Once you run the script, a .hql file will be created in Home; open this file and copy the statement that was generated. Run it in the hive terminal.
Step 6: Importing Database and Dataset into Superset and creating a dashboard for the charts
Creating a dataset for two child entities, i.e. the “actors” table and the “director_genres” table.
Click on a dataset to create charts for the visualizations. Create a dashboard to save all the charts that you have created. Below are the charts created to give summaries such as:
The popularity of a movie and its genres, based on the movie's rank.
The relationship between movies and directors, which also shows which directors directed the most popular movies.
Details of the directors and the genres in which they make their movies.
A summary of famous actors and their popularity, based on the roles they played in a movie.
Below is the visualization of the IMDB dataset:
Demo Case: World
Example 1: Mapping World Airports
Objective - Visualize the number of airports by country and state
Overview
We will plot the location of each airport that is in the airports dataset. We will also visualize which country and region/state has the greatest number of airports.
Classification:
Before defining the application in the system, it is useful to identify key entities and classify the application.
Key Entities:
The file structure is going to look like this:
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines concepts of Data Lake and Data Warehouse, providing large storage capacity combined with data management features |
|---|---|---|
| Curated: Curating data involves creating or preparing data to make it usable for business analysis. | ✔ | |
| Correlated: Correlated data means running algorithms to discover patterns and relationships within the data. | ✔ | |
| Normalized: Normalizing data involves structuring the data to enable rapid access. | ✔ | |
| Analyze: Data analysis involves identifying useful information to support decision-making, often using visualizations. | | |
| Modelling: This involves building statistical models and testing them. | | |
Explanation of the classification: <>
Normalization: <V Structure/Star Model/Snowflake and its explanation>
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Make sure to indicate Airports as the parent of Frequency, and Frequency as the parent of Runways.
In RStudio, read the CSV file into a dataframe and rename the column names to lower case. In the continent column, R reads the string ‘NA’ as a missing value (since NA is the marker for missing data in R). We have to change every missing value in the continent column back to the actual string “NA”, for North America. After doing this, add an as_of_date column and upload the data to SDL/Land.
airports <- read.csv("<airports_file_path>", stringsAsFactors = FALSE) %>%
  rename_with(tolower)
airports$continent[is.na(airports$continent)] <- "NA"
airports$as_of_date <- as.character(Sys.Date())
dataops.appenddata(airports, worldairports.airports.landdb(), "airportsDataSpark")
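The pitfall described above can be reproduced in plain R: read.csv() treats the literal string NA as a missing value by default, so North America's continent code vanishes until it is restored (the data below is a toy inline CSV for illustration):

```r
# Reproduce the gotcha with an inline CSV: "NA" is parsed as missing.
csv_text <- "name,continent\nSeattle-Tacoma,NA\nHeathrow,EU"
airports <- read.csv(text = csv_text, stringsAsFactors = FALSE)
stopifnot(is.na(airports$continent[1]))   # parsed as missing, not as "NA"

# Restore the literal string "NA" for North America.
airports$continent[is.na(airports$continent)] <- "NA"
```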
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Next, create the Ingest Model in AdminUI. The first part is defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our airports data. You can refer to this page to understand what each processor does.
We then define the processors to use from FDL to BDL.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
Step 3: Creating Data Table in Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the hive terminal.
Repeating with the other Datasets
We will repeat the steps above with the three remaining datasets. Note that there will be some small differences because of the parent-child relationship between the datasets.
The next dataset we will be working with is the frequencies dataset.
In RStudio, read the CSV file into a dataframe and rename the column names to lower case. We will have to rename the id column to frequency_id and the airport_ref column to id, because the airport_ref column in the frequencies dataset matches the id column of the airports dataset; both column names must match in order to nest frequencies under airports.
Additionally, type is a keyword in R, so we have to rename that column to something different, in this case to frequency_type. Add an as_of_date column, and then upload data to SDL/Land.
frequencies <- read.csv("/home/bapdevvm/Downloads/airport-frequencies.csv", stringsAsFactors = FALSE) %>%
  rename_with(tolower) %>%
  rename(frequency_id = id) %>%
  rename(id = airport_ref) %>%
  rename(frequency_type = type)
frequencies$as_of_date <- as.Date(Sys.Date()) %>% as.character()
dataops.overwritedata(frequencies, worldairports.frequency.landdb(), "frequenciesDataSpark")
Meta Model, Ingest Model, and Workloads
The Ingest Model will be similar to the one we made for the airports dataset; the only difference is that we define two extra processors in the SDL layer for nesting under the airports dataset.
After creating the Ingest Model, run the SDL-FDL and FDL-BDL workloads, checking the respective directories to make sure that both workloads ran correctly.
Creating Data Table In Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the hive terminal.
Runways Dataset
We will repeat the steps above with the runways datasets.
In RStudio, read the CSV file into a dataframe and rename the column names to lower case. We will have to rename the id column to runway_id, the airport_ident column to ident, and the airport_ref column to id, because the airport_ref column in the runways dataset matches the id column of the airports dataset; both column names must match in order to nest runways under airports.
Add an as_of_date column, and then upload data to SDL/Land.
runways <- read.csv("/home/bapdevvm/Downloads/runways.csv", stringsAsFactors = FALSE) %>%
  rename_with(tolower) %>%
  rename(runway_id = id) %>%
  rename(ident = airport_ident) %>%
  rename(id = airport_ref)
runways$as_of_date <- as.Date(Sys.Date()) %>% as.character()
dataops.overwritedata(runways, worldairports.runways.landdb(), "runwaysDataSpark")
Meta Model, Ingest Model, and Workloads
We also want to nest the ident column between the runways dataset and airports dataset, since both datasets also share that column in common. We do that the same way we nested the ID columns together: Click on the edit icon next to the ident row, and when you arrive at the Key Roles page, mark yes for the nested and parent lookup options and give the directory path to the airports' FDL/Stage.
After creating the meta model, we can create the ingest model for the runways dataset. Refer to the Frequency dataset’s ingest model, as the processors for Runways ingest model will be the same as that of Frequency’s.
Once the ingest model is created, run both the SDL-FDL and FDL-BDL workloads for the Runways dataset, ensuring that they ran correctly by checking HDFS.
Creating Data Table in Hive
In the utilityscripts folder, use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement needed to create a data table in Hive. Once you run the script, a .hql file will be created; open this file and copy the statement that was generated. Run it in the hive terminal.
Step 4: Importing Dataset into Superset
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select the WORLDAIRPORTS database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. Below are two examples of charts you can make in Superset:
Demo Case: Education
Example 1: To assess the role of a person in the University
Objective - To get the information about the role of a person in the university:
Visualization of professors and students individually at each course level
The relationship between a student, the phase they are currently in, and the number of years they have been in a university program
How many students are advised by each professor
The percentage of people teaching at each course level
Overview: The dataset has 4 different entities, which give information about the relationships and details of each person, course, and role in the university. Below is the dataset in SQL format.
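As a sketch of the "percentage of people teaching at each course level" objective above, the proportions can be computed in base R from a toy taughtBy-style frame (the column names and values here are made up for illustration):

```r
# Toy teaching records: one row per (person, course level) assignment.
taught <- data.frame(
  p_id        = c(1, 2, 3, 4, 5),
  courselevel = c("Level_300", "Level_300", "Level_400", "Level_500", "Level_400"),
  stringsAsFactors = FALSE
)

# Share of teaching assignments at each course level, as percentages.
pct <- 100 * prop.table(table(taught$courselevel))
round(pct, 1)
```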
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake: a centralized repository to store large amounts of raw data | Data Lakehouse / Data Warehouse: combines concepts of Data Lake and Data Warehouse, providing large storage capacity combined with data management features |
|---|---|---|
| Curated: Curating data involves creating or preparing data to make it usable for business analysis. | ✔ | |
| Correlated: Correlated data means running algorithms to discover patterns and relationships within the data. | ✔ | |
| Normalized: Normalizing data involves structuring the data to enable rapid access. | ✔ | |
| Analyze: Data analysis involves identifying useful information to support decision-making, often using visualizations. | ✔ | |
| Modelling: This involves building statistical models and testing them. | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure.
Since the dataset was in SQL format, it was first imported using the following command in the terminal, and then converted into parquet format by executing the Import Model, which stored it in SDL Land.
sudo mysql university < /home/bapdevvm/Downloads/UW_std.sql
Import Model:
There are 4 entities in total; follow the same steps as above to create an Import Model for the other 3 entities, i.e. “course“, “person“, and “taughtBy“.
Here, all the entities are considered parents, and no relationship is assigned between any of them.
Write the code snippet below in R to make sure that the Land db has the correct data format.
library(dplyr)
library(UNIVERSITY)

entities <- c('advisedBy', 'taughtBy', 'person', 'course')
module_name <- 'university'

write_data <- lapply(entities, function(ent) {
  today_date <- as.character(Sys.Date())
  land_db <- base::eval(parse(text = base::paste(tolower(module_name), tolower(ent), 'landdb()', sep = '.')))
  data <- dataops.readparquet(land_db, "tempTable") %>%
    dataops.sdo.deselectcolumns("row_id") %>%
    dplyr::rename_with(tolower) %>%
    mutate(as_of_date = today_date) %>%
    dataops.overwritedata(land_db, "tempTable")
})
Step 2: Meta Model: We can now set up the Meta Model in AdminUI. This is just for the one entity “advisedBy“; the same setup has to be done for the other 3 entities.
Add the column names and data types from the advisedBy dataset into the Create Meta Model page and then click submit. Note that you do not add the as_of_date column, as it will be added automatically.
Step 3: Ingest Model: This is just for one entity; the same setup has to be done for the other entities individually.
Create the Ingest Model in AdminUI. The first part is defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our advisedBy data. You can refer to this page to understand what each processor does.
Now, define the Ingest Model for the FDL to BDL layer.
Step 4: Running Import Workloads: Batch Management → Run Import Workload. This is just for one entity; the same setup has to be done for the other entities individually.
Step 5: Running Workloads SDL to FDL and FDL to BDL: Batch Management → Run Workload. This is just for one entity; the same setup has to be done for the other entities individually.
Step 6: Creating Data Table in Hive
Go to Home->BAPCode->utilityscripts->master->R and use the create_hive_ddl_using_spark_df.R script, adding your module name and entity name to it, to generate the Hive SQL statement that creates the data table in Hive. Once you run the script, a .hql file is created in Home; open this file, copy the generated statement, and run it in the hive terminal.
Step 7: Importing Database and Dataset into Superset and creating a dashboard for the charts
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select your database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
The dataset screenshot below is for one entity; create datasets for the other 3 remaining entities the same way.
Database:
Step 8: Establishing parent-child relationships.
Convert the entities into R data frames and merge (join) the data frames as per the parent-child relationships.
data  <- dataops.readparquet(university.person.factdb(), "person")
data1 <- dataops.readparquet(university.advisedby.factdb(), "advisedBy")
data2 <- dataops.readparquet(university.taughtby.landdb(), "taughtBy")
data3 <- dataops.readparquet(university.course.factdb(), "course")

rdata1 <- dataops.sdo.copytordo(data)   # person
rdata2 <- dataops.sdo.copytordo(data1)  # advisedBy
rdata3 <- dataops.sdo.copytordo(data2)  # taughtBy
rdata4 <- dataops.sdo.copytordo(data3)  # course

# Join taughtBy with course to build the course details per person.
# (This intermediate frame was implied by the join below but not defined.)
coursedetails <- dataops.rdo.innerjoin(rdata3, rdata4, c("course_id"))

people  <- dataops.rdo.innerjoin(rdata1, coursedetails, c("p_id"))
advisor <- merge(rdata1, rdata2, by = "p_id", all = TRUE)
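The merge() call above performs a full outer join on p_id: every person is kept, and advising columns are attached where they exist. Its behavior can be illustrated with tiny stand-in frames (the values are made up; the column names mirror the dataset):

```r
# Toy person frame: three people, one professor and two students.
person <- data.frame(
  p_id      = c(1, 2, 3),
  professor = c(1, 0, 0),
  student   = c(0, 1, 1)
)

# Toy advisedBy frame: students 2 and 3 are advised by professor 1.
advisedby <- data.frame(
  p_id       = c(2, 3),
  p_id_dummy = c(1, 1)   # the advising professor's id
)

# Full outer join: people with no advisor rows get NA in the advisedBy columns.
advisor <- merge(person, advisedby, by = "p_id", all = TRUE)
print(advisor)
```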
Step 9: Creating two directories in HDFS to store the data, which is then exposed through Hive tables and accessed in Superset.
Run the commands below in your terminal to create the two HDFS directories:
hdfs dfs -mkdir -p /BigAnalytixsPlatform/UNIVERSITY/people
hdfs dfs -mkdir -p /BigAnalytixsPlatform/UNIVERSITY/advisor
Write the code below in R to add the data to the two HDFS directories:
data <- dataops.overwritedata(people, "/BigAnalytixsPlatform/UNIVERSITY/people/BDL/Fact", "people")
data <- dataops.overwritedata(advisor, "/BigAnalytixsPlatform/UNIVERSITY/advisor/BDL/Fact", "advisor")
Run the Hive queries below in your terminal; they create the tables that will be accessed through Superset:
create external table UNIVERSITY.people (
  course_id int, courselevel string, hasposition string, inphase string,
  p_id int, professor string, student string, yearsinprogram string,
  hash_code string, row_count int, batch_id string, report_date string,
  year int, year_start_date string, year_end_date string,
  month_name string, month_number int, month_start_date string, month_end_date string,
  quarter_number string, quarter_start_date string, quarter_end_date string,
  days_in_month int, days_in_quarter int, days_in_year int,
  day_of_week_long string, day_of_week_abbr string,
  day_of_month int, day_of_quarter int, day_of_year int,
  week_day_ind string, week_end_ind string,
  week_of_month int, week_of_year int,
  week_start_date string, week_end_date string
)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/BigAnalytixsPlatform/UNIVERSITY/people/BDL/Fact';

create external table UNIVERSITY.advisor (
  p_id_dummy int, hasposition string, inphase string,
  p_id int, professor string, student string, yearsinprogram string,
  hash_code string, row_count int, batch_id string, report_date string,
  year int, year_start_date string, year_end_date string,
  month_name string, month_number int, month_start_date string, month_end_date string,
  quarter_number string, quarter_start_date string, quarter_end_date string,
  days_in_month int, days_in_quarter int, days_in_year int,
  day_of_week_long string, day_of_week_abbr string,
  day_of_month int, day_of_quarter int, day_of_year int,
  week_day_ind string, week_end_ind string,
  week_of_month int, week_of_year int,
  week_start_date string, week_end_date string
)
STORED AS PARQUET
LOCATION 'hdfs://localhost:9000/BigAnalytixsPlatform/UNIVERSITY/advisor/BDL/Fact';
Now create two datasets in Superset for the two new entities “people” and “advisor”, as shown above.