Data science. Data governance and master data

This project is composed of the following topics:
  • Data science. Definition, Chief Data Officer and Big Data
  • Data science. Data governance and master data

Case study

Let us suppose that we are the CDO, or Chief Data Officer, of a multinational company headquartered in our country. The organization is dedicated to the production and distribution of telephony, computing and multimedia products.

In this scenario, our main responsibility is to manage the data governance platform, whose objective is to plan, supervise and manage the use of data.

As the person responsible for data governance, we hold regular follow-up meetings with members of other departments in the organization, such as sales and marketing, at which reports on the status and evolution of products, brand performance and other internal company data are presented.

At present, the online marketing department is the one that contributes the largest amount of data to the organization through the online marketing tools and techniques it uses: SEO, SEM, email marketing, product aggregators and publications in third-party media.

In addition to those online marketing techniques, the organization also uses marketing analytics techniques which, based on all the available data, make it possible to assess different strategies and make the decisions that are most beneficial for the business.

For this purpose, the most important metrics are defined and analyzed, and predictive analysis techniques are applied in order to build analytical models capable of evaluating possible future scenarios.

In this sense, the use of geographical measurement and visualization tools can help us perform precise analyses and identify, for example, correlations between different marketing campaigns and the distribution of our product sales across the territory, taking into account variables such as customer age, gender and location.

On the other hand, the organization uses social networks both to inform customers about new products and to receive feedback about customer satisfaction with company products.

However, the multinational's use of social networks does not currently go much further than that, and in the latest company management committee meeting we were informed of the need to strengthen this use in order to gain a competitive advantage over the rest of the sector.

The main objective is to detect people with high influence and impact on social networks, whose opinions about our products can help us attract new customers and even new markets. These people will be considered brand ambassadors on social networks.

Consequently, this is a social analytics project that the multinational wants to carry out in all the countries where it operates, with the aim of spreading the company's values, increasing the number of customers in its target audience, increasing sales and services in current markets and boosting the brand's online presence.

As it is a global project, the data and analyses obtained will need to be shared by all branches.

In this context, the company is aware of the limitations of current computing systems:
  • They are not fast enough to capture and store this information.
  • They cannot host the volume of data related to the organization's products that is generated daily on the different social networks.
  • They cannot manage the multiple data sources and the heterogeneity of the information, such as messages and photos on Twitter, Facebook and Instagram, or videos on YouTube.

1. As the person responsible for the data governance area, and specifically for the master data management function (MDM):

  • What stages or phases should a master data management program include in an organization such as the one described in the case? What is the objective of each of them in this context?
  • What maturity model in master data management do you think an organization like the one in the case study should have? What capabilities should it possess?
  • What impact, in terms of advantages and disadvantages, do you think the ambassador-detection project on social networks may have on our organization's MDM?

1.1 What stages or phases should a master data management program include in an organization such as the one described? What is the objective of each of them in the context of this organization?

The stages that a master data management program should include for the organization described in the statement are:
  • Identify master data sources: at this point we would create a catalogue of all the sources that contain data about materials, products, suppliers, customers, employees and assets. These sources may come from:
    • Enterprise applications such as ERP for financial management, HRM for employee management, PLM for product management and others, from which we can obtain information about sales, marketing, product evolution, brand performance and other internal company data.
    • Online marketing tools and techniques from which we would obtain data related to advertising campaigns such as SEO, SEM, email marketing, product aggregators and third-party media publications.
    • Social networks, where we can obtain customer satisfaction information about the company's products.
    These sources would provide entities such as products, departments, employees, countries and campaigns, as well as related categories such as SEO, SEM and email marketing.
    On the producer side we would have:
    • Digital marketing tools that produce information related to SEO, SEM, email marketing, product aggregators and third-party media publications.
    • Social networks.
    • Enterprise applications such as ERP, HRM and PLM.
    On the consumer side we would have:
    • Enterprise applications such as ERP, HRM and PLM, because they will consume campaign results, purchasing trends and similar internal information.
    • Social networks where campaigns and user acquisition actions will be carried out.
    • Marketing tools, which need internal company data in order to generate new information.
    • Marketing analytics tools or platforms.
    • Geographical measurement and visualization tools.
  • Collect and analyze metadata about the master data: from the data described in the case, we could define, for example, the Campaign entity as a master datum made up of the following reference attributes:
    • SEO, which would have its own related entity including values such as Trust Flow and Citation Flow.
    • SEM, which would include metrics such as CPC, CPM, CPA, CTR, CPV and CPL.
    • Email marketing, with values such as open rate and click-through rate.
    • Product aggregators.
    • Publications in third-party media.
    • Social network target audience.
    • Product.
    In addition to this information from the case, we would also obtain:
    • Financial data: obtained from the ERP, which is a system intended for resource administration within an organization.
    • Customer data: obtained from the CRM, although this information can be extended with social and third-party sources.
    • Employee data: obtained from the HRM and possibly from project management systems, again with the possibility of incorporating third-party information.
    • Product data: obtained from the PLM system. Within product data, the case especially highlights sales and product distribution.
    • Location data: including geopolitical data such as countries and provinces and business locations such as office and warehouse addresses.
    A very important transactional datum will be the sale, because its location feeds the reporting produced with the geographical measurement and visualization tool. A sale will include:
    • time,
    • date,
    • location,
    • product,
    • customer.
    Social networks are also very important in the case study, especially the users who use them. We can store them as master data with:
    • identifier,
    • characteristics and interests,
    • country,
    • social network.
    Users will have an associated transactional datum called rating, reflecting their opinion about a campaign, a product or the company itself:
    • user,
    • rated object (product, campaign or company),
    • rating.
    Finally, marketing analytics techniques can provide information about views, interaction time per user, location, access device and so on. This transactional datum would be generated when a user interacts with a campaign and we can call it Campaign View.
  • Appoint data stewards:
    • Sales: managed by the ERP under the supervision of the accounting department.
    • Products: managed by the PLM under the supervision of the product manager.
    • Employees: managed by the HRM under the supervision of the accounting department.
    • Customers: managed by the CRM under the supervision of customer service and invoicing.
    • Campaign: managed by marketing tools under the supervision of the marketing department.
    • Campaign View: managed by the marketing analytics tool under the supervision of marketing.
    • Rating: managed by social media management applications in the social media department.
    • Users: managed by social media management applications in the social media department.
  • Implement a data governance program and council: implementing MDM requires clear governance rules, such as the following retention rules:
    • Sales: stored for 10 years; after that period the data would be removed from the ERP and archived on a backup server.
    • Products: stored for 10 years in the different applications and then archived on a backup server.
    • Employees: stored for 10 years and later removed from the source programs and archived.
    • Customers: stored for 10 years and then reduced and archived on a backup server.
    • Campaign: stored for 5 years; afterwards only reports or results would be kept in order to reduce the amount of managed data.
    • Campaign View: retained for 3 years.
    • Rating: stored for 2 years; afterwards only reports or aggregated results would remain.
    • Users: fully stored for 2 years. If the user does not interact again with our products or campaigns, the record would gradually be reduced until deletion in year 10.
  • Develop a master data model: an example would be the following:
    • Campaign.
      • SEO as a referenced entity.
      • SEM.
      • Email marketing.
      • Product aggregators.
      • Publications in third-party media.
      • Social network target audience.
      • Product.
    • SEO.
      • Trust Flow. Numeric value.
      • Citation Flow. Numeric value.
      • Etc.
    • SEM.
      • CPC. Numeric value.
      • CPM. Numeric value.
      • CPA. Numeric value.
      • CTR. Numeric value.
      • CPV. Numeric value.
      • CPL. Numeric value.
      • Etc.
    • Email marketing.
    • Product aggregators.
    • Publications in third-party media.
    • Social network target audience.
    • Product.
      • price,
      • location,
      • cost,
      • information.
    • Financial data.
      • sales,
      • location,
      • employee data,
      • balances,
      • invoices.
    • Employee data.
      • tax identification number,
      • name,
      • IBAN account.
    • Location data.
      • coordinates,
      • country,
      • state or province.
    • Sale.
      • time,
      • date,
      • location,
      • product,
      • customer.
    • Users.
      • identifier,
      • characteristics and interests,
      • country,
      • social network.
    • Rating.
      • user,
      • rated object (product, campaign or company),
      • rating.
    • Campaign View
      • information about views,
      • interaction time per user,
      • location,
      • access device,
      • etc.
  • Choose a set of tools: implementing master data management requires support tools for storing, processing, cleansing and governing the data. In this case we would rely on data storage, data quality, hierarchy management and data exploration tools in order to integrate the data and measure its quality.

    The toolset could include:
    • Python as a programming language because it is especially well adapted to data processing.
    • Big Data technologies such as Hadoop, together with NoSQL databases, because of data variety, international distribution and performance needs.
    • API clients to obtain information from social networks.
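    • Predictive model formats such as PMML (Predictive Model Markup Language), an XML-based standard for describing and exchanging models, for example one that uses analytics data to estimate whether a user may buy a product depending on variables such as age, interests or browsing history.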
    • Open data sources to obtain locations and demographic data for campaigns.
    • Data mining tools such as R, SAS, SPSS or KNIME to study buying intention from parameters such as interests or navigation history.
    • Mathematical analysis tools such as MATLAB or Mathematica to work with variables such as gender, age, marital status, employment status and interests.
    • Business intelligence tools such as IBM Cognos, SAP BusinessObjects, Pentaho or Jaspersoft for geographical measurement and visualization and for campaign-versus-sales analysis.
    • Cloud platforms such as AWS or Azure to deploy the analytics and enable efficient access from multiple branches.
    • Reporting tools such as iReport or Actuate BIRT.
    • Relational databases such as Oracle, PostgreSQL or MySQL for simple data or CRM and ERP information.
    • Marketing and analytics tools.
  • Design the infrastructure: in our interpretation of the case, a continuous merge model is the best infrastructure approach, because the branches in different countries can modify their own copy of the master data and then send the changes to the master copy, where they are merged with the central model. This solution is complex, but many distributed NoSQL databases are built for multi-site replication of this kind.
  • Generate and test the master data: in this step the selected tools are used and the source data is combined in order to confirm the master data lists. During the process, business rules frequently need to be adjusted as exceptions are discovered. Although tools have improved considerably, manual inspection may still be necessary to ensure the results are correct and meet project requirements.
  • Modify producer and consumer systems: that is, create the new structures where the data will be stored and prepare the systems for the new data life cycle.
  • Implement maintenance processes: this phase will ensure that the data coming from the different tools remains valid and that its structure is not altered unexpectedly.
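
Some of the stages above can be sketched in Python, the language proposed in the toolset. The following is only a minimal sketch under stated assumptions: it encodes the retention periods from the governance rules, models the Users master record, and merges a branch copy into the master copy. The names User, RETENTION_YEARS, is_past_retention and merge_into_master, the last_modified field, the "most recent version wins" rule and the sample records are all assumptions made for illustration; the case does not specify how conflicts between branches are resolved.

```python
from dataclasses import dataclass
from datetime import date

# Years each entity is kept in full, copied from the governance rules above.
# What happens afterwards (archive, reduce, delete) differs per entity.
RETENTION_YEARS = {
    "sales": 10, "products": 10, "employees": 10, "customers": 10,
    "campaign": 5, "campaign_view": 3, "rating": 2, "users": 2,
}


@dataclass(frozen=True)
class User:
    """Social network user master record with the attributes of the model."""
    identifier: str
    interests: frozenset
    country: str
    social_network: str
    last_modified: date  # assumption: not in the model, added to resolve conflicts


def is_past_retention(entity: str, created: date, today: date) -> bool:
    """Return True once a record is older than its entity's retention period."""
    age_years = (today - created).days / 365.25  # approximate age in years
    return age_years > RETENTION_YEARS[entity]


def merge_into_master(master: dict, branch_copy: dict) -> dict:
    """Merge a branch's copy into the master copy, keyed by identifier.

    Assumption: the most recently modified version of a record wins.
    """
    merged = dict(master)
    for identifier, record in branch_copy.items():
        current = merged.get(identifier)
        if current is None or record.last_modified > current.last_modified:
            merged[identifier] = record
    return merged


# Hypothetical sample data: one central record and one branch's local copy.
master = {
    "u1": User("u1", frozenset({"computing"}), "ES", "Twitter", date(2023, 1, 10)),
}
branch = {
    "u1": User("u1", frozenset({"computing", "multimedia"}), "ES", "Twitter",
               date(2023, 6, 1)),
    "u2": User("u2", frozenset({"telephony"}), "FR", "Instagram", date(2023, 5, 20)),
}
master = merge_into_master(master, branch)
```

After the merge, the master copy holds the newer version of u1 and the new record u2. A real deployment would need a more careful conflict policy than last-modified timestamps, which is part of why this infrastructure option is the most complex one.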

1.2 What maturity model in master data management do you think an organization like the one in the case study should have?

The organization should be at the governed level of maturity, because data governance is supported at an executive level: there is a CDO whose main objective is to plan, supervise and manage the use of data.

In addition, the project seeks to monitor and capture data continuously from all the information generated by the organization, making use of the existing structure with the objective of increasing the company's value.

1.3 What impact, in terms of advantages and disadvantages, do you think the ambassador-detection project on social networks may have on our organization's MDM?

Among the advantages, the main one is economic. Through social networks we can obtain publicity and visibility for our campaigns in a simple way, at a lower cost than traditional media or many web channels, and with a much more focused target audience.

It also helps improve the company's reputation and spread its values, which is one of the needs described in the statement. This project can generate millions of views for our campaigns or communications, increasing the visibility of our products and helping social network users become more familiar with them.

However, the disadvantages include the computational cost of the project. To detect an ambassador, we first need to define what type of profile qualifies as an ambassador, including tastes, interests, number of followers and similar criteria, and then analyze the information provided by different social networks through the available APIs.

That has a very high computational cost, because finding that information, storing it and later analyzing it is not a simple task. It requires a significant investment in hardware capable of performing the analysis efficiently, databases to store all that information over time and, if we want quick access, premium API plans to obtain data at higher speed.

Even after ambassadors are selected, the work does not end there. We would still need to monitor, store, supervise and analyze all the communications they publish, and those communications may be videos, text or images. In other words, we need a solution capable of working with diverse data types, distributed across different countries and at a very high scale.
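
The first step described above, defining which profiles qualify as ambassadors, could look roughly like the following sketch. It is illustrative only: the Profile structure, the thresholds MIN_FOLLOWERS and MIN_SHARED_INTERESTS, the interest labels and the sample profiles are hypothetical, and in practice the profile data would come from each social network's API rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Minimal view of a social network profile (hypothetical structure)."""
    handle: str
    followers: int
    interests: frozenset


# Hypothetical criteria; the interests mirror the company's product lines.
BRAND_INTERESTS = frozenset({"telephony", "computing", "multimedia"})
MIN_FOLLOWERS = 50_000
MIN_SHARED_INTERESTS = 2


def is_ambassador_candidate(profile: Profile) -> bool:
    """A profile qualifies if it has enough reach and enough shared interests."""
    shared = profile.interests & BRAND_INTERESTS
    return (profile.followers >= MIN_FOLLOWERS
            and len(shared) >= MIN_SHARED_INTERESTS)


def rank_candidates(profiles: list) -> list:
    """Keep the qualifying profiles, ordered by number of followers."""
    candidates = [p for p in profiles if is_ambassador_candidate(p)]
    return sorted(candidates, key=lambda p: p.followers, reverse=True)


# Hypothetical sample profiles.
profiles = [
    Profile("@ana", 120_000, frozenset({"computing", "multimedia", "travel"})),
    Profile("@leo", 300_000, frozenset({"cooking"})),
    Profile("@mia", 80_000, frozenset({"telephony", "computing"})),
    Profile("@tom", 5_000, frozenset({"telephony", "computing", "multimedia"})),
]
ranked = rank_candidates(profiles)
print([p.handle for p in ranked])  # ['@ana', '@mia']
```

The filter itself is cheap. The cost discussed above comes from collecting, storing and refreshing the profile data for every network and country before a rule like this can be applied.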