This document details the steps to set up new applications using the Sprngy Admin UI, define models, and run workloads.
Demo Case: Finance
Example 1: Assess the Risk of a Loan
Objective - Determine whether a loan will be approved or not by the issuing bank
Overview
From source:
“The bank wants to improve their services. For instance, the bank managers have only vague idea, who is a good client (whom to offer some additional services) and who is a bad client (whom to watch carefully to minimize the bank loses). Fortunately, the bank stores data about their clients, the accounts (transactions within several months), the loans already granted, the credit cards issued. The bank managers hope to improve their understanding of customers and seek specific actions to improve services. A mere application of a discovery tool will not be convincing for them.
The data about the clients and their accounts consist of following relations:
relation account - each record describes static characteristics of an account,
relation client - each record describes characteristics of a client,
relation disposition - each record relates together a client with an account i.e. this relation describes the rights of clients to operate accounts,
relation permanent order - each record describes characteristics of a payment order,
relation transaction - each record describes one transaction on an account,
relation loan - each record describes a loan granted for a given account,
relation credit card - each record describes a credit card issued to an account,
relation demographic data - each record describes demographic characteristics of a district.
Each account has both static characteristics (e.g. date of creation, address of the branch) given in relation "account" and dynamic characteristics (e.g. payments debited or credited, balances) given in relations "permanent order" and "transaction". Relation "client" describes characteristics of persons who can manipulate with the accounts. One client can have more accounts, more clients can manipulate with single account; clients and accounts are related together in relation "disposition". Relations "loan" and "credit card" describe some services which the bank offers to its clients; more credit cards can be issued to an account, at most one loan can be granted for an account. Relation "demographic data" gives some publicly available information about the districts (e.g. the unemployment rate); additional information about the clients can be deduced from this.”
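The client–account linkage described above (several clients can operate one account through the "disposition" relation) can be sketched with a few in-memory rows. This is a toy Python illustration only, not the platform's data model; the ids, names, and disposition types below are made up:

```python
# Toy sketch (not the real schema) of how the "disposition" relation ties
# clients to accounts; ids, names, and types are invented for illustration.
clients = {1: "Alice", 2: "Bob", 3: "Carol"}

# Each disposition row gives one client the right to operate one account;
# several clients can share an account, and one client can hold several.
dispositions = [
    {"client_id": 1, "account_id": 10, "type": "OWNER"},
    {"client_id": 2, "account_id": 10, "type": "DISPONENT"},
    {"client_id": 3, "account_id": 11, "type": "OWNER"},
]

# Collect, per account, the clients allowed to operate it.
operators = {}
for d in dispositions:
    operators.setdefault(d["account_id"], []).append(clients[d["client_id"]])

print(operators)  # {10: ['Alice', 'Bob'], 11: ['Carol']}
```

The same resolution (disposition joined to client and account) is what the Correlated layer of the application performs at scale.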
Classification:
The file structure is going to look like this:
Download the files:
district
account
client
loan
order
transaction
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Step 1: Setting up the Application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Make sure to indicate District as the parent of Account and Client, Account as the parent of Loan, Order, Transaction, and Display, and Display as the parent of Card.
We will begin with the District dataset:
Step 2: Meta Model, Ingest Model, and Workloads
We are going to use the Bulk Upload Meta feature to create our meta model for the District dataset. In the AdminUI menu, under the Meta Model dropdown, select the Bulk Upload Meta option. Select the Loan Application module and District entity, and import the meta model CSV file into the bulk upload. Once you hit Submit, this will automatically generate the meta model for the District dataset.
Next, similar to the meta model, you can bulk upload the ingest model in AdminUI, or create the ingest model by selecting processors. You can refer to this page to understand what each of the processors does.
Once we submit the Ingest Model, we can run the workloads under the Workload Management/Run Workloads page.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the LOANAPPLICATION/District/BDL/Fact directory in HDFS.
Data Upload
Upload data using the 'Data Upload' option. Select the Loan Application module and District entity, and import the CSV file into the data upload. Once you hit Submit, the district dataset is automatically uploaded to the SDL/Land folder in the data lake.
SpringyBI
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. Below are three examples of charts you can make in Superset that show the correlations between the various districts, regional statistics, loan amounts and durations, and the status of the loans.
Demo Case: Sports
Example 1: Evaluating NBA players' performances relative to their salaries
Objective - Build the most efficient team while constraining to a team’s cap space
Overview
With the NBA’s increased emphasis on data analytics, team managers have access to numerous metrics that show a player’s offensive and defensive efficiency. One such metric is FiveThirtyEight’s RAPTOR method, which highlights a player’s net efficiency and wins above replacement (WAR). In 2022, the top five players in Total RAPTOR were as follows:
| Player | Total RAPTOR | 2022 Salary (in $M) |
|---|---|---|
| Nikola Jokic | +14.6 | $31.5 |
| Giannis Antetokounmpo | +8.1 | $39.3 |
| Joel Embiid | +7.8 | $31.5 |
| Rudy Gobert | +6.9 | $35.3 |
| Stephen Curry | +6.4 | $45.8 |
While it would be a manager’s dream to have a starting lineup like this, their combined salaries exceed the cap space of $122M, leaving the team with both a hefty luxury tax bill and no bench players. The goal of this application is to build a full NBA team that gets the most value per dollar spent, according to the RAPTOR metric.
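As a quick sanity check on the cap math, here is a short Python sketch. The figures are copied from the table above; the RAPTOR-per-dollar ratio is our own illustration of "value per dollar spent", not an official statistic:

```python
# Player figures (Total RAPTOR, 2022 salary in $M) from the table above.
players = {
    "Nikola Jokic": (14.6, 31.5),
    "Giannis Antetokounmpo": (8.1, 39.3),
    "Joel Embiid": (7.8, 31.5),
    "Rudy Gobert": (6.9, 35.3),
    "Stephen Curry": (6.4, 45.8),
}
cap = 122.0  # 2022 salary cap, in $M

total_salary = sum(sal for _, sal in players.values())
print(f"combined salary: ${total_salary}M, over the cap by ${total_salary - cap:.1f}M")

# Value per dollar: RAPTOR points per $M of salary.
value = {name: raptor / sal for name, (raptor, sal) in players.items()}
best = max(value, key=value.get)
print(best)  # Nikola Jokic offers the most RAPTOR per salary dollar
```

Just these five contracts total $183.4M, $61.4M over the cap, which is why the application optimizes value per dollar across a full roster instead.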
Classification:
Before defining the application in the system, it is useful to identify key entities and classify the application.
Key Entities:
The file structure is going to look like this:
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Explanation of the classification: <>
Normalization: <V Structure/Star Model/Snowflake and its explanation>
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Make sure to indicate Team as the parent of Players, and Players as the parent of Salaries and Stats.
We will begin with the Teams dataset.
Data Upload
To upload a CSV file of the Teams dataset, go to Data Upload from the left-side menu. Select the module name NBASTATISTICS and the entity name Teams from the dropdown options. Select the file to be uploaded and click Submit.
On submit, the dataset is uploaded to the data lake for that entity.
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Next, create the ingest model in AdminUI. The first part will be defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our Teams data. You can refer to this page to understand what each of the processors does.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
The Ingest Model will be similar to the one we made for the Teams dataset, the only difference being that we define two extra processors for the nested quality of the Players dataset.
After creating the Ingest Model, run the SDL-FDL and FDL-BDL workloads, checking the respective directories to make sure that both workloads ran correctly.
Repeat the same steps for the rest of the entities.
SpringyBI
Step 3: Importing Dataset into Superset
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select the NBAstatistics database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. Below are two examples of charts you can make in Superset that show individual performance relative to salary and each team’s total offensive and defensive efficiency.
Example 2: Data Analysis of the IPL 2022 match dataset
Objective: Analyzing the IPL 2022 match-by-match dataset.
Overview: This is the IPL 2022 match-by-match dataset, which includes information like match details, team details, winning team name, winning margin, player of the match, etc.
IPL dataset:
(attached file)
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure.
Step 2: Data Upload
We will begin with the Match dataset.
Go to the Data Upload page from the side navigation. Select the module name, the entity name, and the CSV file.
Step 3: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
This is how our meta model looks after adding all the columns with their respective datatypes. Note that you do not add the as_of_date column, as that will be added automatically.
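If you want to prepare the column list before filling in the meta model, a rough type inference over the CSV header and first data row can help. This is an illustrative Python sketch only; the column names below are hypothetical stand-ins, not the actual IPL file's schema:

```python
import csv
import io

# Hypothetical snippet of an IPL match CSV; real column names may differ.
sample = io.StringIO(
    "match_no,team1,team2,match_winner,won_by,margin\n"
    "1,CSK,KKR,KKR,Wickets,6\n"
    "2,DC,MI,DC,Runs,4\n"
)

def infer_type(value: str) -> str:
    """Very rough type inference: int, double, else string."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"

reader = csv.reader(sample)
header = next(reader)
first_row = next(reader)

# Column name -> inferred datatype, ready to copy into the meta model form.
# Note: as_of_date is deliberately absent; the platform adds it automatically.
meta = dict(zip(header, (infer_type(v) for v in first_row)))
print(meta)
```

Inference from a single row is only a starting point; always double-check the types before submitting the meta model.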
Next, create the ingest model in AdminUI. The first part will be defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our Match data. You can refer to this page to understand what each of the processors does.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the IPL/Match/FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the IPL/Match/BDL/Fact directory in HDFS.
Step 4: Importing Dataset into Superset
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select the IPL database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization.
Below is an example showing the number of matches won by each team in IPL 2022.
Demo Case: Retail
Example 1: Exploratory Data Analysis of the Car Sales dataset
Objective: The Car Sales dataset will give insights such as:
Percentage of car sales done by each car company.
Correlation of car sales with engine size and fuel capacity.
Power performance of cars for each manufacturer, based on the relationship between horsepower and the length of the cars.
Visualization of the number of cars manufactured and the horsepower they are manufactured with.
The car companies with the most sales, based on the average horsepower and average price of all their models.
Overview: This is the car sales dataset from the years 2008 to 2012, which includes information about different cars. Here, we want to see which features have the most impact on car sales, based on the attributes given.
Car Sales Dataset:
(attached file)
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Since we have a single-layer file structure, we will have just one entity in it.
Data Upload
Go to the Data Upload page from the side navigation. Select the module name, the entity name, and the CSV data file. This will upload the data to the data lake.
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Add the column names and data types from the Car Sales dataset into the Create Meta Model page and then click Submit. Note that you do not add the as_of_date column, as that will be added automatically.
Next, create the ingest model in AdminUI. The first part will be defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our Car Sales data. You can refer to this page to understand what each of the processors does.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the BDL/Fact directory in HDFS.
Step 3: Importing Database and Dataset into Superset
After adding the database and datasets (as shown in the above screenshots), click on the dataset that you have created to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. The next step is to create a new dashboard for displaying all the charts in one place. Below are examples of the charts you can make in Superset, showing analysis like:
Percentage of car sales done by each car company.
Correlation of car sales with engine size and fuel capacity.
Power performance of cars for each manufacturer, based on the relationship between horsepower and the length of the cars.
Visualization of the number of cars manufactured and the horsepower they are manufactured with.
The car companies with the most sales, based on the average horsepower and average price of all their models.
Example 2: Exploratory Data Analysis of the eCommerce dataset
Objective: Analysis of the eCommerce dataset will give information about:
Yearly spend by each individual on individual brands.
Yearly spend based on the number of membership years.
Number of items users bought.
Most-sold products on the eCommerce platform.
Comparison between time spent by users on the app vs. on the website.
Analysis of which category of products has the highest price.
Overview: The eCommerce dataset has two entities, Customers and Sales, with Customers as the parent and Sales as the child entity. The Sales dataset has the details of all the products each user has bought, and the Customers dataset has the details of each user's profile.
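The "yearly spend" questions above boil down to grouping Sales rows by customer and year. A toy Python sketch of that aggregation, with made-up rows and field names (the real Sales schema will differ):

```python
from collections import defaultdict

# Toy rows standing in for the Sales entity; fields and values are invented.
# customer_id keys each row back to the parent Customers entity.
sales = [
    {"customer_id": 1, "year": 2021, "amount": 120.0},
    {"customer_id": 1, "year": 2021, "amount": 80.0},
    {"customer_id": 1, "year": 2022, "amount": 50.0},
    {"customer_id": 2, "year": 2021, "amount": 200.0},
]

# Yearly spend per customer: group Sales rows by (customer_id, year).
yearly_spend = defaultdict(float)
for row in sales:
    yearly_spend[(row["customer_id"], row["year"])] += row["amount"]

print(dict(yearly_spend))  # {(1, 2021): 200.0, (1, 2022): 50.0, (2, 2021): 200.0}
```

In the application itself this grouping is done by the workloads and visualized in Superset rather than in hand-written code.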
(attached file)
(attached file)
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Make sure to indicate Customers as the parent of Sales.
Step 2: Upload the dataset to the Land directory of the given entity.
Go to the Data Upload page from the side navigation. Select the module name, the entity name, and the CSV data file. This will upload the data to the data lake.
Step 3: Creating Bulk Upload Meta Model
Bulk upload is an option in the UI for users who want to upload a CSV file of their Meta, Ingest, Analytic, or Import Models. Once uploaded, the respective models are created in HDFS.
Create the bulk upload meta model for Customers and Sales using the CSV files attached below, by navigating in the UI to Meta Model → Bulk Upload Meta.
(attached file)
(attached file)
Step 4: Creating Bulk Upload Ingest Model
Create the bulk upload ingest model for Customers and Sales by uploading the CSV files attached below, navigating in the UI to Ingest Model → Bulk Upload Ingest.
(attached file)
(attached file)
Step 5: Running the Workloads
First, run the SDL-FDL workload. This will apply the processors we selected in the Ingest Model for the SDL to FDL layer. You can see if the workload ran correctly by going to the FDL/Stage directory in HDFS.
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the BDL/Fact directory in HDFS.
Step 6: Importing Database and Dataset into Superset and creating a dashboard for the charts
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select your database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
Visualization of the eCommerce dataset will give information about:
Yearly spend by each individual on individual brands.
Yearly spend based on the number of membership years.
Number of items users bought.
Most-sold products on the eCommerce platform.
Comparison between time spent by users on the app vs. on the website.
Analysis of which category of products has the highest price.
Demo Case: Entertainment
Example 1: Exploratory Data Analysis of a Movie Dataset (IMDB)
Objective: Given the IMDB dataset with seven different entities, we will visualize:
The popularity of a movie and its genres based on the movie rank given.
The relationship between movies and directors, which can also show which directors directed the most popular movies.
Details of the directors and the genres in which they make their movies.
An abstract of the famous actors of the movies and their popularity based on the roles they played.
Overview: The IMDB dataset has 7 different entities in total, which give all the details for a particular movie: the popularity of the movie, its directors and actors, the genres of the movies the actors appear in, and the genres of the movies the directors direct. Below is the IMDB dataset in SQL format.
(attached file)
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure.
Since the dataset was in SQL format, it was first imported using the following command in the terminal, then converted into Parquet format by executing the import model, which stored it in SDL Land.
```shell
sudo mysql imdb < /home/bapdevvm/Downloads/imdb.sql
```
Import Model:
The above is for one entity, i.e. “actors”; there are 7 entities in total, and you have to follow similar steps for the other 6, changing the table names accordingly.
Entities:
Movie Genres
Movie
Actors
Roles
Movies Directors
Directors
Director Genres
The relationship between the parent-child entities is shown in the flowchart below:
Step 2: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Alternatively, use bulk upload if you have the CSV file available in the correct format.
Ingest Model:
Ingest model for the only parent (root) entity:
Similarly, add the ingest model for all other entities, either by using the UI or by bulk uploading the CSV.
Run Import Workloads:
Follow the above steps for all the other entities to run the import workload.
Importing Database and Dataset into Superset and creating a dashboard for the charts
Create datasets for two child entities, i.e. the “actors” table and the “director_genres” table.
Click on a dataset to create charts for the visualizations. Create a dashboard to save all the charts you have created. Below are the charts created to give abstracts like:
The popularity of a movie and its genres based on the movie rank given.
The relationship between movies and directors, which can also show which directors directed the most popular movies.
Details of the directors and the genres in which they make their movies.
An abstract of the famous actors of the movies and their popularity based on the roles they played.
Below is the visualization of the IMDB dataset:
Demo Case: World
Example 1: Mapping World Airports
Objective - Visualize the number of airports by country and state
Overview
We will plot the location of each airport that is in the airports dataset. We will also visualize which country and region/state has the greatest number of airports.
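Counting airports per country and per region is a simple group-and-count. A toy Python sketch with made-up sample rows (the real airports dataset has many more columns, and its column names may differ):

```python
from collections import Counter

# Made-up sample rows; the real airports dataset is much larger.
airports = [
    {"name": "Hartsfield-Jackson", "country": "US", "region": "US-GA"},
    {"name": "Los Angeles Intl", "country": "US", "region": "US-CA"},
    {"name": "Heathrow", "country": "GB", "region": "GB-ENG"},
    {"name": "Gatwick", "country": "GB", "region": "GB-ENG"},
    {"name": "San Francisco Intl", "country": "US", "region": "US-CA"},
]

# Count airports per country and per region/state.
by_country = Counter(a["country"] for a in airports)
by_region = Counter(a["region"] for a in airports)

print(by_country.most_common(1))  # [('US', 3)]
print(by_region.most_common(1))   # [('US-CA', 2)]
```

In the application this aggregation happens in the workloads and is rendered as a map/bar chart in Superset.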
Classification:
Before defining the application in the system, it is useful to identify key entities and classify the application.
Key Entities:
The file structure is going to look like this:
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | | |
| Modelling | | |
Explanation of the classification: <>
Normalization: <V Structure/Star Model/Snowflake and its explanation>
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Make sure to indicate Airports as the parent of Frequency, and Frequency as the parent of Runways.
Step 2: Upload Data
Go to the Data Upload page from the side navigation. Select the module name, the entity name, and the CSV data file. This will upload the data to the data lake.
Step 3: Meta Model, Ingest Model, and Workloads
We can now set up the Meta Model in AdminUI:
Next, create the ingest model in AdminUI. You can refer to this page to understand what each of the processors do.
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
Step 4: Importing Dataset into Superset
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select the WORLDAIRPORTS database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
After adding the datasets, you can select any one of them to bring up the Charts page. Here, you can pick which type of visualization you would like to create and the parameters for that visualization. Below are two examples of charts you can make in Superset:
Demo Case: Education
Example 1: To assess the role of a person in the university
Objective: To get information about the role of a person in the university:
Visualization of professors and students individually at each course level
Relationship between students, the phase they are currently in, and the number of years they have been in a university program
How many students are advised by each professor
What percentage of people teach at each course level
Overview: The dataset has 4 different entities, which give information about the relationships and details of each person, the courses, and the role they have in the university. Below is the dataset in SQL format.
(attached file)
Classifying the Application:
Based on the data technology used and the business use, the application is classified as below:
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
|---|---|---|
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure.
Since the dataset was in SQL format, it was first imported using the following command in the terminal, then converted into Parquet format by executing the import model, which stored it in SDL Land.
```shell
sudo mysql imdb < /home/bapdevvm/Downloads/UW_std.sql
```
Import Model:
There are 4 entities in total; follow the same steps as above to create an Import Model for the other 3 entities, i.e. “course”, “person”, and “taughtBy”.
Here, all the entities are considered parents, and there is no relationship assigned between any of them.
Write the below code snippet in R to make sure that the land database has the correct data format.
```r
library(dplyr)
library(UNIVERSITY)

entities <- c('advisedBy', 'taughtBy', 'person', 'course')
module_name <- 'university'

# For each entity: read the Land-layer parquet, drop the row_id column,
# lowercase the column names, stamp today's date as as_of_date, and write
# the result back to the same land database.
write_data <- lapply(entities, function(ent) {
  today_date <- as.character(Sys.Date())
  land_db <- base::eval(parse(text = base::paste(tolower(module_name), tolower(ent), 'landdb()', sep = '.')))
  dataops.readparquet(land_db, "tempTable") %>%
    dataops.sdo.deselectcolumns("row_id") %>%
    dplyr::rename_with(tolower) %>%
    mutate(as_of_date = today_date) %>%
    dataops.overwritedata(land_db, "tempTable")
})
```
Step 2: Meta Model: We can now set up the Meta Model in AdminUI. This is just for one entity, “advisedBy”; the same setup has to be done for the other 3 entities.
Add the column names and data types from the dataset into the Create Meta Model page and then click Submit. Note that you do not add the as_of_date column, as that will be added automatically.
Step 3: Ingest Model: This is just for one entity; the same setup has to be done for the other entities individually.
Create the ingest model in AdminUI. The first part will be defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our data. You can refer to this page to understand what each of the processors does.
Now, define the ingest model for the FDL to BDL layer.
Step 4: Running Import Workloads: Batch Management → Run Import Workload. This is just for one entity; do the same for the other entities individually.
Step 5: Running Workloads SDL to FDL and FDL to BDL: Batch Management → Run Workload. This is just for one entity; do the same for the other entities individually.
Step 6: Creating Data Table in Hive
Go to Home->BAPCode->utilityscripts->master->R and use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement, adding your module name and entity name in it; this will help create a data table in Hive. Once you run the script, a .hql file will be created in Home; open this file, copy the generated statement, and run it in the Hive terminal.
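To picture what the script does, here is a small Python sketch of the same idea: turning a column-to-type mapping into a Hive external-table DDL. This is not the actual create_hive_ddl_using_spark_df.R script; the `hive_ddl` helper and its type mapping are illustrative assumptions:

```python
# Hypothetical mapping from source datatypes to Hive column types.
HIVE_TYPES = {"integer": "int", "double": "double", "character": "string"}

def hive_ddl(db: str, table: str, columns: dict, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for a parquet-backed table."""
    cols = ",\n  ".join(f"{name} {HIVE_TYPES.get(dtype, 'string')}"
                        for name, dtype in columns.items())
    return (f"CREATE EXTERNAL TABLE {db}.{table} (\n  {cols}\n) "
            f"STORED AS PARQUET LOCATION '{location}';")

# A few columns taken from the "people" table used later in this section.
ddl = hive_ddl(
    "UNIVERSITY", "people",
    {"p_id": "integer", "professor": "character", "course_id": "integer"},
    "hdfs://localhost:9000/BigAnalytixsPlatform/UNIVERSITY/people/BDL/Fact",
)
print(ddl)
```

The generated statement has the same shape as the Hive queries shown in Step 9 below; the real script derives the column list from the Spark dataframe's schema instead of a hand-written dict.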
Step 7: Importing Database and Dataset into Superset and creating a dashboard for the charts
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select your database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
(see this page for further reference)
Below screenshot of dataset is for one entity, can create for other 3 remaining entities.
Database:
Step 8: Establishing the parent-child relationship
Convert the entities into R data frames and merge (join) the data frames as per the parent-child relationship.
```r
# Read the processed data for each entity (taughtBy is read from the land
# database, the others from their Fact databases).
data  <- dataops.readparquet(university.person.factdb(), "person")
data1 <- dataops.readparquet(university.advisedby.factdb(), "advisedBy")
data2 <- dataops.readparquet(university.taughtby.landdb(), "taughtBy")
data3 <- dataops.readparquet(university.course.factdb(), "course")

# Copy the Spark data objects into R data objects.
rdata1 <- dataops.sdo.copytordo(data)   # person
rdata2 <- dataops.sdo.copytordo(data1)  # advisedBy
rdata3 <- dataops.sdo.copytordo(data2)  # taughtBy
rdata4 <- dataops.sdo.copytordo(data3)  # course

# Join person with course on p_id, and person with advisedBy on p_id.
people  <- dataops.rdo.innerjoin(rdata1, rdata4, c("p_id"))
advisor <- merge(rdata1, rdata2, by = "p_id", all = TRUE)
```
Step 9: Creating two directories in HDFS to store the data, which can then be exposed through Hive tables and accessed in Superset.
Run the below commands in your terminal to create the two HDFS directories:

```shell
hdfs dfs -mkdir -p /BigAnalytixsPlatform/UNIVERSITY/people
hdfs dfs -mkdir -p /BigAnalytixsPlatform/UNIVERSITY/advisor
```
Run the below code in R to write the data into the two HDFS directories:

```r
data <- dataops.overwritedata(people, "/BigAnalytixsPlatform/UNIVERSITY/people/BDL/Fact", "people")
data <- dataops.overwritedata(advisor, "/BigAnalytixsPlatform/UNIVERSITY/advisor/BDL/Fact", "advisor")
```
Run the below Hive queries in your terminal; they create the external tables that will be accessed through Superset:

```sql
create external table UNIVERSITY.people ( course_id int, courselevel string, hasposition string, inphase string, p_id int, professor string, student string, yearsinprogram string, hash_code string, row_count int, batch_id string, report_date string, year int, year_start_date string, year_end_date string, month_name string, month_number int, month_start_date string, month_end_date string, quarter_number string, quarter_start_date string, quarter_end_date string, days_in_month int, days_in_quarter int, days_in_year int, day_of_week_long string, day_of_week_abbr string, day_of_month int, day_of_quarter int, day_of_year int, week_day_ind string, week_end_ind string, week_of_month int, week_of_year int, week_start_date string, week_end_date string ) STORED AS PARQUET LOCATION 'hdfs://localhost:9000/BigAnalytixsPlatform/UNIVERSITY/people/BDL/Fact' ;

create external table UNIVERSITY.advisor ( p_id_dummy int, hasposition string, inphase string, p_id int, professor string, student string, yearsinprogram string, hash_code string, row_count int, batch_id string, report_date string, year int, year_start_date string, year_end_date string, month_name string, month_number int, month_start_date string, month_end_date string, quarter_number string, quarter_start_date string, quarter_end_date string, days_in_month int, days_in_quarter int, days_in_year int, day_of_week_long string, day_of_week_abbr string, day_of_month int, day_of_quarter int, day_of_year int, week_day_ind string, week_end_ind string, week_of_month int, week_of_year int, week_start_date string, week_end_date string ) STORED AS PARQUET LOCATION 'hdfs://localhost:9000/BigAnalytixsPlatform/UNIVERSITY/advisor/BDL/Fact' ;
```