
DataWorks Data Modeling: An Integrated Data Model Management Solution

2021-08-31 16:05:13 | Heart of the Machine

Author: Liu Tianyuan, DataWorks Product Manager

In today's business environment, a sound data governance strategy is essential for extracting value from data. According to industry statistics, enterprise data volumes have been growing at roughly 50% per year, so the difficulty of governing and integrating enterprise data keeps increasing.

DataWorks has long been committed to making data development and data governance easier and faster for its users. The data modeling capability in this release supplements DataWorks' existing data governance features: it lets users implement governance up front, before data development begins.

I. Why data modeling

The Road to Big Data: Alibaba's Big Data Practices puts it well: "If you think of data as books in a library, we want them shelved by category on the bookshelf; if you think of data as the buildings of a city, we hope the city's layout is reasonable; if you think of data as files and folders on a computer, we want a folder structure organized to our own habits, not a messy desktop where finding a single document leaves us at a loss."

The conclusion from these words is that standardized, reasonable organization is essential: it is the only way to keep things and phenomena from tangling into an intractable mess, and the data models stored in compute engines (the data tables) are no exception.

(1) Common problems of traditional model management

In a typical data middle-platform or data governance project, model design is an unavoidable step once the business and its requirements have been sorted out. In our experience, a middle-platform or data warehouse project involves 1,000 to 5,000 tables on average, and some reach hundreds of thousands or even millions. Drawing a design, or a relationship diagram, for thousands of tables is therefore very important. So how should it be drawn, and with what tools?

 Slide 5.PNG

(2) Primitive model management

Today most enterprises draw this picture in Excel: in many project implementations, the models are recorded in spreadsheets. When there are only a few models, this is indeed a lightweight and easy choice, but once a fast-growing data business reaches hundreds of thousands of tables, such management becomes inefficient. An enterprise's data models may grow and change as quickly as its business, and data engineers have no time to maintain and update the Excel files; even when they can spare the time and energy, the work is barely operable and prone to human error. Whenever managers need to look at the whole data system from a global perspective, what they get is stale information with an extremely high error rate.

(3) Data standards are difficult to implement

In most cases, without powerful, operable tooling for standards, the indicator-definition and standard-enforcement steps that should happen before any data is actually produced are simply skipped, and data standards become very hard for enterprises to implement. Most enterprises build their data systems on a "develop first, govern later" mindset, which directly renders much of the resulting data unusable.

 Slide 7.PNG

Without clear data standards, developers understand the business differently and build data models with different definitions and different statistical calibers. The result is quality problems once data is produced, many expected requirements that cannot be met, and a data team that must spend extra time and energy repairing the dirty data. Most enterprise data quality problems stem from the absence of data standards, or from standards that were never fully implemented.

When it comes to data governance, and especially ex-ante governance, two problems trouble enterprises most:

  • First, there is no unified view for managing models. The enterprise's data models are scattered across different engines, and business staff cannot see in real time what the model in each engine actually looks like.
  • Second, there is no effective way to apply data standards to data elements, so data production suffers quality problems and many data requirements cannot be implemented.

II. An introduction to DataWorks data modeling

As major references and textbooks define it, a data model is a way of organizing and storing data: an abstraction of the data's characteristics that emphasizes accessing data reasonably from the perspectives of the business, data access, and data use. Without a data model, it is hard for stakeholders to see the structure of the existing database or to understand its key concepts.

Data models come in three kinds: the conceptual model, the logical model, and the physical model.

The conceptual model is derived by sorting out the enterprise's business processes and describes the relationships between its important entities. For example, an enterprise has multiple departments, each department has many employees, and each employee has a different salary; the conceptual model captures the relationships among these three entities: department, employee, and salary.

The logical model refines the conceptual model further, fixing the attributes and description of each entity. The employee entity, for example, carries attributes such as name, ID number, gender, and birthday.

The physical model is engine-oriented: starting from the physical characteristics, it takes the concrete technical implementation factors into account, and it is the form that is actually stored in the engine.
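The progression from logical to physical model can be illustrated with a small sketch. This is not DataWorks code; the entity, attribute names, and type mappings below are illustrative assumptions, showing only how a logical entity (names plus typed attributes) might be rendered into physical DDL:

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    name: str    # logical attribute name
    dtype: str   # physical type it maps to, e.g. "STRING", "DATE"

@dataclass
class Entity:
    name: str
    attributes: list

def to_ddl(entity: Entity) -> str:
    """Render a logical entity as a physical CREATE TABLE statement."""
    cols = ",\n  ".join(f"{a.name} {a.dtype}" for a in entity.attributes)
    return f"CREATE TABLE {entity.name} (\n  {cols}\n);"

# The employee entity from the logical-model example above.
employee = Entity("employee", [
    Attribute("name", "STRING"),
    Attribute("id_number", "STRING"),
    Attribute("gender", "STRING"),
    Attribute("birthday", "DATE"),
])
print(to_ddl(employee))
```

The conceptual model sits one level above this sketch: it would record only that employee relates to department and salary, with no attributes or types at all.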

 Slide 11.PNG

DataWorks data modeling supports both relational modeling (ER/3NF) and dimensional modeling (star and snowflake schemas). No model type is best in the absolute; there is only the one that fits better.

Relational modeling, i.e. ER or 3NF modeling, abstracts the enterprise's many entities from the perspective of the whole enterprise rather than of any single business analysis, designing a model that satisfies 3NF and describing the enterprise's business architecture as entities plus relationships. The approach is highly normalized, with essentially no redundancy, and suits the strategic planning of large enterprises. Its drawbacks: it is unfriendly to BI drill-down and roll-up, the physical model usually diverges from the model design the business needs, project implementation takes long and costs much, and it demands a great deal of the modelers.

 Slide 13.PNG

Dimensional modeling starts from the analysis and decision-making requirements. It generally responds well to large, complex queries and can face the business directly; its typical representatives are the well-known star schema and, for some special scenarios, the snowflake schema. Dimensional modeling is quick to start with and quick to deliver, but querying across business segments is difficult, and the large numbers of auxiliary and junk tables it produces easily lead to "table explosion" and high maintenance costs later on.
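A tiny in-memory example makes the star-schema trade-off concrete: one fact table joined to one dimension answers an analytical question in a single step. The table and column names here are invented for illustration and run against SQLite rather than a big data engine:

```python
import sqlite3

# In-memory database; schema is a minimal star: one fact, one dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_order  (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_order  VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# A typical star-schema query: a single join from fact to dimension,
# grouped by a dimension attribute.
rows = con.execute("""
    SELECT d.region, COUNT(*) AS orders, SUM(f.amount) AS revenue
    FROM fact_order f JOIN dim_customer d USING (customer_id)
    GROUP BY d.region ORDER BY d.region
""").fetchall()
print(rows)  # [('East', 2, 12.5), ('West', 1, 3.0)]
```

In a 3NF model the same question would typically require joining several more normalized tables, which is exactly the drill-down friction described above.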

 Slide 14.PNG

Users should choose a modeling method based on the enterprise's actual scenario. In our experience, most enterprises use both methods at once: relational modeling at the bottom layers, striving to keep the data simple, and dimensional modeling toward the upper layers, trading data redundancy for usability, analyzability, and operability.

DataWorks data modeling is an open, flexible tool: users are free to choose whichever modeling theory they use to plan and design their models. To borrow the Linux founder's remark on what makes a good programmer: bad programmers worry about the code, good programmers worry about data structures and their relationships. The same principle carries over to the importance of the data model: only by organizing and storing data models in an orderly way, grounded in an understanding of the enterprise's business processes and scenarios, can big data be used with high performance, low cost, high efficiency, and high quality.

(1) The life cycle of a data model

First, whatever the project, conduct business research and requirements analysis before it starts. Business research builds an understanding of the company's business. For example: what are the similarities and differences between the various business domains and business lines? Into which business modules can each business line be subdivided? What exactly is the process of each module? This information matters enormously: it directly determines whether the data warehouse construction succeeds or fails.

Requirements research starts from real work scenarios. For example: which reports do analysts and operations staff look at every day? What are the KPIs of the recommendation system the company is building? What data does the recommendation system need in order to hit those KPIs?

 Slide 15.PNG

Second comes the high-level design stage. Here the entity relationships in the enterprise's business processes are sorted out and described from a high level, i.e. a graphical description of the dimension tables and fact tables.

Third, determine the attributes of each dimension and the measures of each fact table, deciding how attributes and measures should fill the outline model from the previous step.

Fourth, the coding stage: translating the physical model into DDL statements.

Fifth, submit the generated DDL statements to the development environment to verify that they conform to the model's design specifications. Once testing passes, the model can be published online to serve the enterprise-wide data system.

Sixth, the operations and maintenance stage, which is long-term and continuous. By this stage the model has become real tables in the engine. To catch unexpected out-of-band changes to the engine in time, the model differences between the model repository and the engine must be checked, repaired, and verified regularly.
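The drift check in that sixth stage can be sketched as a simple diff between the column set recorded in the model repository and the columns actually present in the engine table. The function and the sample column dictionaries below are hypothetical, not a DataWorks API:

```python
def diff_model(model_cols: dict, engine_cols: dict) -> dict:
    """Compare the columns recorded in the model repository with the
    columns actually present in the engine table."""
    missing = {c: t for c, t in model_cols.items() if c not in engine_cols}
    extra   = {c: t for c, t in engine_cols.items() if c not in model_cols}
    changed = {c: (model_cols[c], engine_cols[c])
               for c in model_cols.keys() & engine_cols.keys()
               if model_cols[c] != engine_cols[c]}
    return {"missing": missing, "extra": extra, "type_changed": changed}

# Illustrative state: someone altered 'amount' and added 'remark' in the
# engine without going through the model repository.
model  = {"order_id": "BIGINT", "amount": "DECIMAL", "region": "STRING"}
engine = {"order_id": "BIGINT", "amount": "DOUBLE",  "remark": "STRING"}
report = diff_model(model, engine)
print(report)
```

A periodic job running such a diff is what turns "check, repair, and verify regularly" into an operable process.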

(2) Basic capabilities of data modeling tools in the industry

To better manage the data model life cycle, the model management tools commonly found on the market provide the following capabilities.

The first is conceptual and physical model design: visual support for creating conceptual models and building logical and physical models.

The second is version control: users can manage historical versions and roll the model back when necessary.

The third is model import and export: model files of various kinds can be imported into the tool, and models in the tool can be exported as database scripts so that the planned model can then be created in the database.

(3) The modeling process based on DataWorks data modeling (Datablau DDM)

First, an enterprise with existing offline models can import the model files (such as PD or ERwin files) directly into the client, or pull the models already in the big data engine into the client through reverse engineering.

Second, in the design stage, users refine the plan and design of the model in the desktop client or the web client, completing the model by referencing standards.

Third, the development and testing stage: after design, the model is submitted to the development environment for testing. Users can also branch and manage historical versions so they can roll back to any of them at any time.

Finally, once testing is finished and the model has been reviewed by the relevant staff, it can be released to the online production environment and begin serving the enterprise's data business.

 Slide 18.PNG

Case study

Baihua Movie Online (Baihua) is a fictional online movie website that currently maintains its sales data in a MaxCompute database.

The company recently decided to implement a data middle-platform strategy across the enterprise to improve business intelligence and obtain more reliable decision support. The data team only has experience building wide-table marts on demand, the IT team and technical experts have warned about the specialists and funding needed to build and maintain a full middle platform, and stories of failed data platforms circulate widely in the industry. To build a fast, effective, reusable, high-quality middle-platform model, Baihua decided to use Datablau Data Modeler to design, develop, deploy, and maintain its middle-platform data model. Let us walk through the process we followed to build their data model.

1. Planning: model layering

Strictly speaking, Baihua currently has no data warehouse to speak of: no layering, no subject areas, no specifications. The data team is small, and when a data request arrives, a data analyst does everything end to end: requirements analysis, indicator decomposition, dictionary lookup, wide-table creation, report development. The end result is wide tables in a confusion of versions, temporary tables indistinguishable from result tables, dimensional confusion, and other follow-on problems. Meanwhile team members feel they have become data-retrieval machines and "SQL boys" who learn nothing new, so staff are gradually leaving, the code grows ever harder to maintain, and data quality problems occur constantly. In short, there are signs the situation is unsustainable.

Learning from these lessons, the team did its research when planning the middle platform: MaxCompute was chosen as the big data platform, and data standards, indicators, dimensions, data specifications, naming conventions, and data quality were all planned. The content was not extensive at first, but the project kept enriching and following it as it proceeded.

To avoid the "a wide table feels great for a while" problem, the plan layers the data model and uses DDM Mapping to manage the mapping logic, improving the system's maintainability and quality.

 Case study 1.jpg

ODS (raw data layer): the ODS layer is the data warehouse staging area and supplies the DWD layer with raw source data. In naming, both table names and field names stay as consistent as possible with the business system, but an extra marker distinguishes incremental from full tables: the suffix "_delta" marks a table as incremental.

DW (data warehouse), with redundancy applied to some tables:

  • DWD (detail data layer): detail data at the same granularity as ODS, with deduplication, dirty-data filtering, and null handling to guarantee data quality.
  • DWS (summary data layer): lightly summarized data and wide tables, stored by subject.
  • DIM (public dimension layer): the shared dimension tables referenced across subjects.
  • TMP (temporary data layer): temporary intermediate tables produced during processing.
  • DM (application mart layer): wide tables for applications and indicator result sets for report analysis.

Naming rules for table names by layer (layer, system name, system abbreviation, schema/database, table role, table name pattern):

  • ODS — office system (oa), database dam, zipper/incremental tables marked inc: o_oa_dam_<table name>_inc
  • DWD — participants (ip): wd_ip_<table name>_db
  • DWS — participants (ip): ws_ip_<table name>_db
  • DWDIM — participants (ip): wdim_ip_<table name>_db
  • DWTMP — project name (ip): wt_ip_<table name>_db
  • DM — project name (rp): dm_rp_<table name>_db
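Naming rules like these are only useful if they are enforced; a table-name checker can be as simple as one regular expression per layer. The patterns below are distilled from the naming rules above, and the exact rules at Baihua are assumptions for this sketch:

```python
import re

# One pattern per layer, matching the naming rules listed above.
LAYER_PATTERNS = {
    "ODS": re.compile(r"^o_[a-z]+_[a-z]+_\w+_inc$"),
    "DWD": re.compile(r"^wd_[a-z]+_\w+_db$"),
    "DWS": re.compile(r"^ws_[a-z]+_\w+_db$"),
    "DM":  re.compile(r"^dm_[a-z]+_\w+_db$"),
}

def check_name(layer: str, table: str) -> bool:
    """Return True if the table name follows the layer's naming rule."""
    pattern = LAYER_PATTERNS.get(layer)
    return bool(pattern and pattern.match(table))

print(check_name("ODS", "o_oa_dam_order_inc"))  # conforming name
print(check_name("DWD", "order_detail"))        # non-conforming name
```

Running such a check at model publish time is one way to implement the "table name checker" that the DataWorks warehouse planning module provides out of the box.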

2. Business requirement: count regional customers by year

The customer operations department wants to know, for previous years, in which regions customers registered and what their age distribution was, so that advertising spend can be tilted accordingly.

The analysis mainly needs annual order data, counting users along the region dimension, followed by a second-level statistic over the customers' age distribution.

The requirement is not complicated: we mainly need to analyze the data in the Baihua_Sale sales database, plus profile data on region and customer age.

3. Step 1: create the source data model (ODS)

The first step of building a middle-platform model with Datablau Data Modeler is to identify and model the source data. We create a data model project and use the [Reverse database] icon on the data model toolbar to [reverse engineer] the Baihua_Sale sales database.

 Case study 2.png

This is what the Baihua_Sale data source looks like after we reverse engineer it:

 Case study 3.png

Note: the entity boxes in this model correspond to the tables of the Baihua_Sale data source.

Inspecting the model, we find the entities' data dictionary is badly incomplete (itself a symptom of absent enterprise data governance), and the relationships between the data are of course missing too. How, among these scattered tables, do we find the data we need for the analysis?

Our answer, of course, is to model the source system data, building a data model of the business system so we can quickly understand the business and analyze the requirements accurately.

Step one, we complete the data dictionary; this is the first step of the cure. The physical naming of this database is at least standard — your situation may well be worse. By visiting the source system's development team, we collected the missing data dictionary entries, organized them into DDM's Excel data dictionary format, and imported them. We were lucky: that covered 90% of the data. For the missing 10% we emailed the original development team for help; they were all very kind, responded quickly, and told us which business module of the system each item belonged to, completing the data dictionary.

Step two, we model the business subjects on this model, building three subject areas:

Product data (Product): sorts out the product-related master data and reference data, and the business relationships between them.

 Case study 4.png

Customer data (Customer): sorts out the customer-related master data and reference data, and the business relationships between them.

 Case study 5.png

Purchase transactions (Business): sorts out the transaction data for customers ordering and renting online; this is the source of the project's transaction fact table.

 Case study 6.png

 Case study 7.png

(Subject-area logical model)

We have now successfully created, validated, and deployed the Baihua_Sale source data model.

4. Step 2: build the common dimensional model (DWD)

The next step in the process is to design a dimensional model that will serve as the common model of the middle-platform layer over the Baihua_Sale data. You can design the model from scratch using Entities.

Following the requirements, we merge the order table and the payment table into an order fact table; the customer and movie tables are made moderately redundant, the address table is made redundant as well, and the result is a star-schema dimensional model.

 Case study 8.png

Right-click an entity, hover over the entity type in the context menu, and pick the appropriate type from the options offered; this makes it easy to change the type to fact or dimension.

 Case study 9.png

Following the analysis requirements, the order table derives a "year" field from the order date, and the customer table derives the zodiac constellation from the customer's ID-card number; this lets users analyze how constellation affects orders.
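Those two derived fields are straightforward transformations. A minimal sketch, assuming birthdays are already extracted from the ID-card number and using one common set of Western-zodiac cutoff dates (the sign boundaries vary slightly by convention):

```python
from datetime import date

# (month, last day of the sign in that month, sign).
_SIGNS = [(1, 19, "Capricorn"), (2, 18, "Aquarius"), (3, 20, "Pisces"),
          (4, 19, "Aries"), (5, 20, "Taurus"), (6, 21, "Gemini"),
          (7, 22, "Cancer"), (8, 22, "Leo"), (9, 22, "Virgo"),
          (10, 23, "Libra"), (11, 22, "Scorpio"), (12, 21, "Sagittarius"),
          (12, 31, "Capricorn")]

def zodiac(birthday: date) -> str:
    """Derive the zodiac constellation from a birthday."""
    for month, day, sign in _SIGNS:
        if (birthday.month, birthday.day) <= (month, day):
            return sign
    return "Capricorn"

def order_year(order_date: date) -> int:
    """Derive the 'year' field from the order date."""
    return order_date.year

print(order_year(date(2021, 8, 31)))
print(zodiac(date(1990, 9, 1)))
```

In practice these derivations would run as SQL expressions in the mapping layer; the Python form just pins down the logic.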

4.1 Create a mapping

Open the data mapping manager .

 Case study 10.png

Step 1: Create a new data mapping.

 Case study 11.png


Step 2: Select the target table in the model library; the default here is the Film table we were in when we entered.

 Case study 12.png

Step 3: For the source-side model, select the ODS-side model in the model library and choose the tables associated with Film.

 Case study 13.png

Step 4: The tool follows the relationships between model entities and builds the connected entities and associated fields; these can also be adjusted manually.

 Case study 14.png

Step 5: Target field mapping. The tool maps automatically by name; items that cannot be mapped are left empty and must be mapped manually in the editor.

 Case study 15.png

Step 6: Generate the SQL statement and the mapping correspondence diagram.

 Case study 16.png

You can preview and export the SQL statements.
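The core of steps 5 and 6 — map by identical name, leave the rest for manual mapping, then emit an INSERT...SELECT — can be sketched in a few lines. The table and column names are invented for illustration, and this is a simplification of what the tool generates:

```python
def auto_map(source_cols, target_cols):
    """Map each target column to a same-named source column;
    unmatched targets stay None for manual mapping."""
    src = set(source_cols)
    return {t: (t if t in src else None) for t in target_cols}

def insert_select(source, target, mapping):
    """Emit an INSERT...SELECT for the resolved part of the mapping."""
    pairs = [(t, s) for t, s in mapping.items() if s is not None]
    tgt = ", ".join(t for t, _ in pairs)
    src = ", ".join(s for _, s in pairs)
    return f"INSERT INTO {target} ({tgt})\nSELECT {src}\nFROM {source};"

mapping = auto_map(["film_id", "title", "length"],
                   ["film_id", "title", "release_year"])
print(mapping)  # release_year has no same-named source column
sql = insert_select("ods_film", "dwd_film", mapping)
print(sql)
```

Real mappings also carry transformations (the derived "year" and constellation fields, for instance), which is why the manual-mapping escape hatch in step 5 matters.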

 Case study 17.png

Following the steps above, we establish mappings for the other four DWD tables.

5. Step 3: customer preference analysis (ADS)

The next step is to design an analytical dimensional model to serve as the target model for customer preference analysis.

Following the requirements, we use three attributes (city, constellation, and year) as dimension fields and the customer count and amount count as measures, building a wide table for regional customer analysis.

 Case study 18.png

Right-click an entity, hover over the field type in the context menu, and pick the appropriate type from the options offered; this makes it easy to change a field's type to measure or dimension.

5.1 Create a mapping

according to 4.1 Method , Establish a mapping relationship , As shown in the figure below :

 Case study 19.png

 Case study 20.png

For dimension entities, the dimension-role column in the layout builder offers a complete list of options, including the following:

  • Surrogate and business keys.
  • Slowly changing dimension types (SCD1, SCD2, SCD3, and SCD6).
  • Record identifiers (effective and expiry dates, current-record indicator, and version number) for tracking historical data.
  • Placeholder dimensions for tracking late- and early-arriving facts and dimensions.
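The record identifiers above exist to support SCD type 2, where a changed dimension row is closed out and a new version appended rather than overwritten. A minimal in-memory sketch (field names and the list-of-dicts representation are assumptions, not the tool's storage format):

```python
from datetime import date

def scd2_apply(dim_rows, business_key, incoming, today):
    """SCD type 2: close the current record for a changed business key
    and append a new version with effective/expiry dates."""
    for row in dim_rows:
        if row[business_key] == incoming[business_key] and row["is_current"]:
            if all(row.get(k) == v for k, v in incoming.items()):
                return dim_rows  # no attribute changed; nothing to do
            row["expiry_date"] = today
            row["is_current"] = False
    new_version = 1 + max((r["version"] for r in dim_rows
                           if r[business_key] == incoming[business_key]),
                          default=0)
    dim_rows.append(dict(incoming, effective_date=today, expiry_date=None,
                         is_current=True, version=new_version))
    return dim_rows

dim = [{"customer_id": 1, "city": "Hangzhou",
        "effective_date": date(2020, 1, 1), "expiry_date": None,
        "is_current": True, "version": 1}]
scd2_apply(dim, "customer_id",
           {"customer_id": 1, "city": "Beijing"}, date(2021, 8, 31))
print([(r["city"], r["is_current"], r["version"]) for r in dim])
```

SCD1 would simply overwrite `city` in place; the extra bookkeeping here is exactly what makes history queryable.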

The dimensional model is now ready; we validate and deploy it for further use.

6. Step 4: deploy the data model

In this step we design the ETL pipeline that loads the relevant source data into each table to populate the Baihua_Sale middle-platform model. In Datablau Data Modeler you can generate the DML SQL statements in the Mapping designer: the ODS-to-DWD SQL and the DWD-to-ADS SQL, stored as two files.

The new SQL statements are imported into the ETL scheduling tool, which loads and runs the data in a distributed fashion.

Finally, the scheduling tool's built-in job scheduler automates the process of refreshing these data. On the Scheduler tab you can create a new plan that executes the scheduling process automatically at a given frequency; in this case, we scheduled the sales data to refresh daily.

7. Finally: visualization and analysis

With BI tools, Baihua can effectively analyze its sales data and gain valuable business insight.

 Case study 21.png

(4) DataWorks intelligent data modeling (dimensional modeling)

The DataWorks intelligent data modeling module covers four capabilities (warehouse planning, data standards, data modeling, and data indicators) to accelerate data warehouse design and dimensional modeling, improve the standardization of the data middle platform, and complete data warehouse design and development in one place.

 Slide 3.PNG

Warehouse planning

The basic planning for data warehouse design covers data layering, business classification, data domains, and business processes.

  • Data layering: provides the industry-standard five layers (ODS, DIM, DWD, DWS, ADS), supports user additions and modifications, and includes a table-name checker.
  • Business classification: supports multi-level, customizable business classification that can follow the enterprise's actual business divisions.
  • Data domains: collections of business processes, designed along the enterprise's business divisions so the business can filter data quickly, e.g. the transaction, commodity, and logistics domains.
  • Business processes: indivisible behavioral, inventory-analysis, or custom business processes, such as add-to-cart, favorite, and review.

 Slide 4.PNG

Data modeling

DataWorks provides a visual dimensional modeling tool that supports both forward and reverse assisted modeling for a variety of big data engines: models designed in the tool can be published directly to the engine, and models already in the engine can be pulled back into the tool for re-editing and re-publishing, giving one-stop management, modeling, and publishing without the tedious manual import and export of traditional tools. For monitoring how well models adhere to standards, it supports baseline checks on models already in the engine, helping users spot inconsistencies between the table structure and the physical model structure so they can be corrected in time. Alongside the professional desktop client, DataWorks also offers a lightweight online web client, letting users choose per work scenario.

  • Management: models can be managed and searched from the perspectives of data domain and business classification, clear and easy to use.
  • Modeling: three modeling methods are supported (Excel, visual modeling, and script modeling) to suit a variety of modeling preferences.
  • Publishing: no secondary development needed; one-click publishing to the production or development environments of engines such as MaxCompute, EMR, and Hologres.

 Slide 5.PNG

Data standards

Data standards are the basic guarantee that data models and indicators are standardized: they provide unified constraints that keep data definitions and usage consistent, accurate, and complete. The DataWorks data standards module supports data dictionaries, standard codes, units of measure, and more; enterprise managers can define their own data standards, and data modelers can reference those standards while building models, solving the inconsistent-caliber problem caused by different modelers understanding the data differently.

  • Data dictionary: defines constraints such as a field's type, value range, unit of measure, and standard code; it can be referenced when defining table fields during data modeling.
  • Standard code: the set of enumerated values a data standard allows; referenced mainly from the data dictionary to constrain a field's values.
  • Unit of measure: the quantitative unit of a field (such as pieces, yuan, or meters); it can be referenced in indicator definitions and in table field definitions during data modeling.

 Slide 6.PNG

Data indicators

Data indicators comprise atomic indicators, modifiers, time periods, derived indicators, and summary tables. DataWorks supports designing and defining atomic and derived indicators to keep the business caliber uniform; by converging them in summary tables it automatically generates the indicator query code, realizing business aggregation and management.

  • Atomic indicator: the indivisible basic measure under a business process or activity, used to pin down the statistical caliber and computation logic of the business and to measure business status.
  • Modifier: indicates the business scope of the statistic; the business constraint on an indicator.
  • Time period: the time range of the statistic; the time constraint on an indicator.
  • Derived indicator: composed of an atomic indicator, a time period, and modifiers, reflecting the business status of an activity within a given time span and target scope.
  • Summary table: organizes multiple derived indicators of the same time period and the same dimensions under a data domain, forming the business aggregation logic for subsequent business queries, OLAP analysis, data distribution, and so on.
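The composition "derived indicator = atomic indicator + time period + modifier" maps naturally onto SQL generation. A minimal sketch, where the indicator representation, table, and column names are invented for illustration rather than DataWorks' actual generated code:

```python
def derived_indicator(atomic, period, modifier=None):
    """Compose a derived indicator (atomic measure + time period +
    optional modifier) into an aggregate SQL query."""
    where = [f"ds BETWEEN '{period[0]}' AND '{period[1]}'"]  # time period
    if modifier:
        where.append(modifier)                               # business scope
    return (f"SELECT {atomic['agg']}({atomic['column']}) AS {atomic['name']}\n"
            f"FROM {atomic['table']}\n"
            "WHERE " + " AND ".join(where) + ";")

# Atomic indicator: total payment amount over the order fact table.
pay_amount = {"name": "pay_amount", "agg": "SUM",
              "column": "amount", "table": "dwd_order"}

sql = derived_indicator(pay_amount, ("2021-01-01", "2021-12-31"),
                        "region = 'East'")
print(sql)
```

A summary table then amounts to running several such compositions that share a period and dimensions and joining their results, which is the "business aggregation logic" described above.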

 Slide 8.PNG

 Slide 7.PNG

DataWorks' modeling capability integrates tightly with the DataWorks development system: model publishing can be tied into the DataWorks development process, standardizing the path from model design to model release and launch.

III. The evolution of the DataWorks development and governance model

(1) Macro evolution

Before data modeling was available, data would be synchronized to the big data engine offline or in real time; data production, development, and scheduling would then begin, with data output on a regular basis. Finally the data would flow back into databases or OLAP engines to serve queries, or be wrapped into APIs through the data service to serve the business directly.

DataWorks' data governance features run through these three main stages, including checking source data as it is written during cloud migration, monitoring the quality and timeliness of output during data production, and controlling permissions on data use. DataWorks already gives users well-rounded data governance capabilities, but these focus more on in-process and after-the-fact governance.

 Slide 20.PNG

The newly released data modeling module adds a stage for defining the form of the data to this main workflow, aiming to give users ex-ante governance capability along with a one-stop model management solution. In this stage, users research their enterprise's business processes and requirements and define the enterprise's business standards and specifications; modeling then builds on those data standards to define indicators, enforce the standards, and generate table structures, unifying model management and putting ex-ante governance into practice.

 Slide 21.PNG

Models already serving in the engine can also be loaded into the data modeling module through reverse modeling, corrected and improved in a unified way, and submitted back to the engine. The whole flow follows the development process specification defined by DataWorks.

(2) Micro evolution

Viewed microscopically, DataWorks supports an authorization model based on its role system, letting people in different positions play different roles. Each does their own job, and together they complete a secure, standardized data development process.

In a DataWorks standard-mode workspace, developers first complete code development, submission, and smoke testing in the development environment. Once tests pass, a third party familiar with the business, such as an O&M, deployment, or administrator role, reviews the submitted code. If the reviewer confirms the code will not harm the stability of the business system and meets business expectations, it can be released to the production environment and begin running on a periodic schedule.

 Slide 23.PNG

With the release of data modeling, DataWorks adds the model designer role. Users granted this role can log in to the DataWorks data modeling module and take charge of model development.

First, the data management team leads , That is, the space administrator role defines the data standard first . secondly , The data modeler designs the model , At the same time, the standard defined by the data supervisor is used to index and drop the bid , Create the table structure and each field of the table , Form a physical model that can be distributed to the engine .

Then the developer turns the physical model into DDL statements and submits them to the development-environment engine. Finally, the O&M, deployment, or administrator role reviews the DDL statements submitted to the engine and, once they are approved, releases them to the production environment.

At this point, a basic yet fairly professional model has been designed and implemented. The old development process can, of course, still be reused; the difference is that table creation, modification, and other management work is now carried out in a more rigorous way.

 Slide 25.PNG

Finally, you can refer to our open course on data modeling: https://developer.aliyun.com/article/782517

DataWorks official site: https://www.aliyun.com/product/bigdata/ide

Big Data & AI Experience Hall: https://workbench.data.aliyun.com/experience.htm


Copyright notice: Heart of the Machine. Please include the original link when reprinting: https://en.fheadline.com/2021/08/20210831160509701B.html
