This document details the steps to set up new applications using the Sprngy Admin UI, define models, and run workloads.
Demo Case: Finance
Example 1: Analyzing Inflation Dataset
Objective
Determine how one can live an inflation-proof lifestyle by analyzing the percentage change in the Consumer Price Index.
The data file is read from Google Drive and the data is loaded directly into the BDL fact layer using an analytical model.
Overview
The inflation dataset has a single entity, inflation. It records the yearly percentage change in the Consumer Price Index from 1947 to 2022.
...
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
| --- | --- | --- |
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Since we have just a one-layer file system, we will have just one entity in it.
...
Step 2: Creating Meta Model
We can now set up the Meta Model in AdminUI:
Add the column names and data types from the dataset into the Create Meta Model page and then click submit. Note that you do not add the as_of_date column, as that will be added automatically.
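To find the column names and data types to enter, it can help to inspect the dataset locally first. A minimal sketch in plain R, where the file name is an assumption for illustration:

```r
# Inspect the CSV locally to list the column names and inferred data types
# to enter on the Create Meta Model page. The file name "inflation.csv" is
# an assumption for illustration.
df <- read.csv("inflation.csv")
str(df)  # prints each column name alongside its inferred type
```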
...
Step 3: Creating Analytical Model
The data flow from the source (Google Drive) directly to the BDL is achieved using the analytical model, which also handles pre-processing of the data.
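The exact pre-processing depends on the application, but the idea can be sketched outside the platform in plain R. The Google Drive direct-download URL pattern is standard; the file ID placeholder and the pre-processing step below are illustrative assumptions, not this application's configuration:

```r
# Read the CSV straight from Google Drive; <FILE_ID> is a placeholder for
# the file's ID, and the file must be shared publicly for this to work.
url <- "https://drive.google.com/uc?export=download&id=<FILE_ID>"
df  <- read.csv(url)

# Illustrative pre-processing: drop rows with missing values.
df <- na.omit(df)

write.csv(df, "inflation_preprocessed.csv", row.names = FALSE)
```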
...
Step 4: Running the Analytical Workloads
Running the analytical workload for the INFLATION application loads the data directly from the source into the BDL fact database.
...
Step 5: Creating Data Table in Hive
Go to Home->BAPCode->utilityscripts->master->R and use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement, adding your module name and entity name in it; this statement creates the data table in Hive. Once you run the script, a .hql file is created in Home; open it, copy the generated statement, and run it in the Hive terminal.
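The internals of create_hive_ddl_using_spark_df.R are not reproduced here, but the idea behind generating the DDL can be sketched in plain R. Everything below (the type mapping, the file name, and the database/table names) is illustrative, not the actual utility script:

```r
# Illustrative sketch only -- not the actual create_hive_ddl_using_spark_df.R.
# It builds a Hive CREATE TABLE statement from a data frame's column types.
df <- read.csv("inflation.csv")  # assumed file name

# Map common R column classes to rough Hive equivalents.
hive_type <- function(x) switch(class(x)[1],
                                integer = "INT",
                                numeric = "DOUBLE",
                                "STRING")  # default for anything else

cols <- paste(sprintf("  %s %s", names(df), sapply(df, hive_type)),
              collapse = ",\n")

# "mymodule.inflation" stands in for your module and entity names.
ddl <- sprintf("CREATE TABLE IF NOT EXISTS mymodule.inflation (\n%s\n) STORED AS PARQUET;",
               cols)
cat(ddl, file = "inflation.hql")  # the .hql statement to run in the Hive terminal
```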
...
Step 6: Importing Database and Dataset into Superset and creating a dashboard for the charts
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select your database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
...
A datetime column must be added to the CSV dataset to enable time-series visualizations in Superset. For the Inflation application we created a "year_new" column: a copy of the "year1" column reformatted as "yyyy-01-01" so it can serve as a datetime column. The analytical workloads are then run again.
The column can initially be added with a String data type. Later, in Superset, click the edit icon beside your dataset name, and under CALCULATED COLUMNS enter the SQL expression from_unixtime(unix_timestamp(year_new, 'yyyy-MM-dd')), select DATETIME as the data type, and click Save. A datetime column is required to plot a time-series graph; the steps are also documented at /wiki/spaces/BIGANALYTI/pages/1147181.
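As a concrete illustration of the year_new column described above, here is a minimal sketch in plain R (the file name is an assumption; "year1" is the column named above):

```r
# Derive a datetime-friendly column from the year so Superset can plot a
# time series. "year1" is the dataset's existing year column; the file
# name is an assumption.
df <- read.csv("inflation.csv")
df$year_new <- paste0(df$year1, "-01-01")  # e.g. 1947 -> "1947-01-01"
write.csv(df, "inflation.csv", row.names = FALSE)
```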
...
Example 2: Predicting Wealth Inequalities between Black, White, and Hispanic People
Objective
Predict wealth inequalities between Black, White, and Hispanic people for upcoming years. Using the analytical model, we train a machine learning model on the given dataset, whose target column is mean_net_worth.
Overview
The wealth inequalities dataset contains data at three-year intervals from 1989 to 2019, covering the year, race, and the mean and median of income, savings, debts, investments, and net worth.
...
| Quality of Data | Data Lake | Data Lakehouse / Data Warehouse |
| --- | --- | --- |
| Curated | ✔ | |
| Correlated | ✔ | |
| Normalized | ✔ | |
| Analyze | ✔ | |
| Modelling | | ✔ |
Step 1: Setting up the application
Now that the business use and classification of the application are established, the application can be created using the UI. In AdminUI, set up the application by going to the Set-up Application tab, selecting Create New, and filling out the file structure. Since we have just a one-layer file system, we will have just one entity in it.
...
Step 2: Creating Meta Model
We can now set up the Meta Model in AdminUI:
Add the column names and data types from the wealth inequalities dataset into the Create Meta Model page and then click submit. Note that you do not add the as_of_date column, as that will be added automatically.
...
Step 3: Creating Ingest Model
Next, create the ingest model in AdminUI. The first part is defining which processors to use from SDL to FDL. These are the SDL-FDL processors we select for our wealth inequalities data. You can refer to this page to understand what each of the processors does.
Step 4: Running Workloads
Once we submit the Ingest Model, we can run the workloads under the Batch Management/Run Workloads page.
...
Once you confirm in HDFS that the SDL-FDL workload ran correctly, run the FDL-BDL workload next. This will apply the processors we selected in the Ingest Model for the FDL to BDL layer. You can see if the workload ran correctly by going to the BDL/Fact directory in HDFS.
...
Step 5: Creating Analytical Model
Using the analytical model, the data is taken from the BDL fact database, the ML model is trained on it, and predictions are made for the target column.
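The specific algorithm is configured in the analytical model. As a rough sketch of the training step in plain R, assuming the BDL fact data has been exported to a local CSV and using a simple linear model (the file name, the predictor columns, and the model choice are assumptions; mean_net_worth is the target named in the objective):

```r
# Train a simple model predicting mean_net_worth from year and race.
# The file name and predictor columns are assumptions; mean_net_worth is
# the target column named in the objective above.
df <- read.csv("wealth_inequalities.csv")
df$race <- as.factor(df$race)

fit <- lm(mean_net_worth ~ year + race, data = df)

# Predict mean net worth for an upcoming year, one row per race.
future <- expand.grid(year = 2022, race = levels(df$race))
future$predicted_mean_net_worth <- predict(fit, newdata = future)
print(future)
```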
...
Step 6: Running the Analytical Workloads
Running the analytical workload for the wealth inequalities application executes the analytical model from the previous step, training the ML model on the BDL fact data and producing predictions for mean_net_worth.
...
Step 7: Creating Data Table in Hive
Go to Home->BAPCode->utilityscripts->master->R and use the create_hive_ddl_using_spark_df.R script to generate the Hive SQL statement, adding your module name and entity name in it; this statement creates the data table in Hive (the sketch under Example 1, Step 5 illustrates the idea). Once you run the script, a .hql file is created in Home; open it, copy the generated statement, and run it in the Hive terminal.
...
Step 8: Importing Database and Dataset into Superset and creating a dashboard for the charts
Once you are in Superset, select the Datasets option from the Data dropdown in the top menu. From there, select the add Dataset option. Set the Database to Apache Hive, select your database from Schema, and select which table you would like to add. Superset will only allow you to add one table at a time, but you can add as many tables as you want one by one.
...