From the perspective of technical implementation, data governance mainly covers five aspects: sorting, collection, storage, management, and use. These correspond to business and data resource sorting, data collection and cleaning, database design and storage, data management, and data usage.
Data resource sorting: The first step of data governance is to clarify the organization's data resource landscape and inventory from a business perspective, covering the organizational structure, business matters, information systems, and the data item resources that exist in the form of databases, web pages, files, and API interfaces. The output of this step is a classified data resource list.
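As an illustration of what a classified data resource list might look like, here is a minimal sketch in Python. All field names and example entries are assumptions for illustration, not from a real system:

```python
from dataclasses import dataclass, field

@dataclass
class DataResource:
    """One entry in the classified data resource list (illustrative fields)."""
    name: str          # e.g. "enterprise registry"
    owner: str         # responsible department or system
    source_type: str   # "database" | "web_page" | "file" | "api"
    category: str      # business classification
    fields: list = field(default_factory=list)  # data items this resource exposes

resources = [
    DataResource("enterprise registry", "Market Supervision Administration",
                 "database", "basic data", ["credit_code", "legal_person"]),
    DataResource("food inspection results", "Food Supervision", "api",
                 "theme data", ["sample_id", "result"]),
]

# Group resource names by classification to form the catalogued list.
by_category = {}
for r in resources:
    by_category.setdefault(r.category, []).append(r.name)
```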
Data collection and cleaning: the process of extracting, transforming, and loading (ETL) data from source systems to a destination, typically through visual ETL tools (such as Alibaba's DataX or Pentaho Data Integration). The purpose is to centrally store scattered, fragmented data.
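The extract-transform-load flow can be sketched in a few lines. This is a minimal illustration, with an in-memory list standing in for a real source system and SQLite standing in for the destination store; the record fields are invented:

```python
import sqlite3

# Extract: read raw records from a hypothetical source.
raw_records = [
    {"name": "  Acme Ltd ", "credit_code": "91310000MA1K35XQ0X"},
    {"name": "", "credit_code": "91310000MA1K35XQ0X"},  # dirty row: empty name
    {"name": "Beta Co", "credit_code": None},           # dirty row: missing code
]

# Transform: trim whitespace and drop rows failing basic completeness checks.
def clean(record):
    name = (record["name"] or "").strip()
    code = record["credit_code"]
    if not name or not code:
        return None
    return {"name": name, "credit_code": code}

cleaned = [r for r in (clean(rec) for rec in raw_records) if r is not None]

# Load: write the cleaned rows into the destination store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enterprise (name TEXT, credit_code TEXT)")
conn.executemany("INSERT INTO enterprise VALUES (:name, :credit_code)", cleaned)
count = conn.execute("SELECT COUNT(*) FROM enterprise").fetchone()[0]
```

Visual ETL tools like DataX or Pentaho implement the same extract/transform/load stages, just configured graphically rather than in code.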
Construction of the basic database and theme databases: Generally speaking, data can be divided into basic data, business theme data, and analysis data. Basic data refers to core entity data, or master data, such as the population, legal person, geographic information, credit, and electronic certificate data in a smart city. Theme data refers to data organized around a particular business theme, such as the food supervision, quality supervision and inspection, and comprehensive enterprise supervision data of the Market Supervision Administration. Analysis data refers to the results of comprehensive analysis over business theme data, such as the Market Supervision Administration's comprehensive enterprise evaluations, industrial regional distribution, and high-risk enterprise distribution. Building the basic and theme databases means abstracting a data storage structure from an understanding of the business, following the principles of easy storage, easy management, and easy use. To put it bluntly, it means designing database table structures according to these principles, then designing the collection and cleaning process from the data resource list, and storing neat, clean data in a database or data warehouse.
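The three-layer split (basic, theme, analysis) can be sketched as table structures. The schemas below are invented for illustration, using SQLite for brevity; a real warehouse would use its own DDL dialect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Basic (master data) table: core entities shared across business themes.
conn.execute("""
    CREATE TABLE legal_person (
        credit_code TEXT PRIMARY KEY,   -- unified identifier of the enterprise
        name        TEXT NOT NULL,
        region      TEXT
    )""")

# Theme table: records for one business theme, keyed back to the master entity.
conn.execute("""
    CREATE TABLE food_inspection (
        sample_id   TEXT PRIMARY KEY,
        credit_code TEXT REFERENCES legal_person(credit_code),
        result      TEXT
    )""")

# Analysis table: derived results computed from theme data.
conn.execute("""
    CREATE TABLE enterprise_risk (
        credit_code TEXT PRIMARY KEY,
        risk_score  REAL   -- output of a comprehensive evaluation
    )""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

The key design choice is that theme and analysis tables reference the master entity by a stable key, so every business record can be traced back to one core entity.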
Metadata management: Metadata management manages the attributes of the data items in the basic and theme databases, and associates each data item with its business meaning, making it easier for business personnel to understand what the database fields mean. Metadata is also the foundation for the automated data sharing, data exchange, and business intelligence (BI) discussed later. Note that metadata management generally refers to managing the attributes of data items in the basic and theme databases (i.e. the core data assets), while the data resource list manages the data items of the various data sources.
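A metadata registry at its simplest maps each physical field to its attributes and business meaning. The entries below are illustrative assumptions:

```python
# A minimal metadata registry: each entry attaches business meaning and
# attributes to a physical field in the basic/theme databases.
metadata = {
    "legal_person.credit_code": {
        "business_name": "Unified Social Credit Code",
        "description": "identifier of a registered enterprise",
        "data_type": "TEXT",
        "nullable": False,
        "source": "Market Supervision Administration registry",
    },
    "food_inspection.result": {
        "business_name": "Inspection Result",
        "description": "pass/fail outcome of a food sample inspection",
        "data_type": "TEXT",
        "nullable": True,
        "source": "Food Supervision system",
    },
}

def describe(field_name):
    """Return the business meaning of a physical field, for business users."""
    entry = metadata[field_name]
    return f'{entry["business_name"]}: {entry["description"]}'
```

Because every field's attributes are machine-readable here, downstream functions such as automated sharing and BI can be driven from this registry rather than from hand-written documentation.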
Lineage tracking: When data errors are found as data is used in business scenarios, the data governance team needs to quickly locate the data source and repair them. To do that, the team must know which core library each piece of business data comes from, and which data source each core library item comes from. Our practice is to associate metadata with the data resource list: the data items used by the business team are configured as combinations of metadata items. This establishes a lineage relationship between the data usage scenario and the data source.
Data resource catalog: A data resource catalog is generally used in data sharing scenarios, such as data sharing between government departments. The catalog is created based on business scenarios and industry specifications, and relies on the metadata and the basic and theme databases to enable automated data application and use.
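Once each data item records which item it was derived from, locating the original source is a simple upstream walk. A minimal sketch, with invented item names:

```python
# Lineage edges: each data item points to the item it is derived from.
lineage = {
    "report.enterprise_risk": "theme.enterprise_supervision",
    "theme.enterprise_supervision": "basic.legal_person",
    "basic.legal_person": "source.registry_database",
}

def trace_to_source(item):
    """Walk upstream through the lineage graph to the original data source."""
    path = [item]
    while path[-1] in lineage:
        path.append(lineage[path[-1]])
    return path
```

When a business team reports a bad value in `report.enterprise_risk`, the trace immediately yields the theme table, the core library, and the source system to inspect, in order.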
Quality management: Successfully extracting value from data depends on high-quality data; only accurate, complete, and consistent data is valuable. Data quality therefore needs to be analyzed from multiple dimensions, such as offset checks, non-null checks, range checks, format (normative) checks, duplication checks, correlation checks, outlier checks, and fluctuation checks. Note that designing a good data quality model requires a deep understanding of the business. Technically, it is also recommended to use big-data technologies such as Hadoop, MapReduce, and HBase to ensure detection performance and reduce the performance impact on business systems.
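A few of these quality dimensions can be sketched as rule functions. The records and the credit-code format below are assumptions for illustration:

```python
import re

records = [
    {"credit_code": "91310000MA1K35XQ0X", "score": 87},
    {"credit_code": None, "score": 95},          # fails the non-null check
    {"credit_code": "BAD-CODE", "score": 120},   # fails format and range checks
]

# Assumed format rule: 18 alphanumeric characters.
CODE_PATTERN = re.compile(r"^[0-9A-Z]{18}$")

def check_record(r):
    """Run illustrative quality checks; return the list of rules violated."""
    issues = []
    if r["credit_code"] is None:
        issues.append("non-null")
    elif not CODE_PATTERN.match(r["credit_code"]):
        issues.append("format")
    if not (0 <= r["score"] <= 100):
        issues.append("range")
    return issues

violations = {i: check_record(r) for i, r in enumerate(records) if check_record(r)}
```

In production the same rules would run as distributed jobs (e.g. MapReduce over HBase snapshots) rather than row-by-row in Python, precisely to keep detection off the critical path of business systems.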
Business Intelligence (BI): The purpose of data governance is use. For a large data warehouse, the scenarios and requirements for data usage change constantly, so BI products can be used to quickly meet those needs by analyzing the data and producing reports. For example, Pico Data is a professional BI vendor.
Data sharing and exchange: Data sharing includes sharing within an organization and between organizations. There are three sharing methods: library tables, files, and API interfaces. Library table sharing is the most direct and crude; with file sharing, reverse data exchange can be achieved through ETL tools. What we recommend is API interface sharing, in which the central data warehouse retains data ownership and grants data usage rights through the API interface. API interface sharing can be implemented with an API gateway, whose common functions include automated interface generation, application review, rate limiting, concurrency limiting, multi-tenant isolation, call statistics, call auditing, black/white lists, call monitoring, and quality monitoring.
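One of the gateway functions listed above, rate limiting, can be sketched with a token bucket. This is a simplified illustration of the idea, not any particular gateway's implementation; the application name and limits are invented:

```python
import time

class TokenBucket:
    """A minimal token-bucket rate limiter, the kind of per-caller check an
    API gateway applies before forwarding a data-sharing request."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Keeping one bucket per registered application gives multi-tenant isolation:
# a noisy caller exhausts only its own bucket.
buckets = {"app_a": TokenBucket(rate=10, capacity=2)}

# Three back-to-back calls against a burst capacity of 2.
results = [buckets["app_a"].allow() for _ in range(3)]
```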