In the absence of data, Bayes' theorem can be used to design a knowledge-driven model

Even without data, expert knowledge can be transformed into a computer-aided model.

Data is usually the basis of a model, but without data, only domain experts can describe, or even predict, the "situation" of a given environment. I will first summarize the concept of knowledge-driven models based on Bayesian probability, and then give a practical tutorial that demonstrates the steps for transforming expert knowledge into a Bayesian model for reasoning. I will use the sprinkler system to conceptually explain the steps in the process from knowledge to model. Finally, I will discuss the challenges of complex knowledge-driven models and the possible systematic errors introduced by questioning experts and extracting knowledge. All examples are created with Python's bnlearn library.

Can we apply expert knowledge to the model?

When we talk about knowledge, we do not mean just descriptive knowledge and facts. Knowledge also includes familiarity with or understanding of someone or something, procedural knowledge (skills), and acquaintance knowledge [1].

No matter what knowledge you have or want to use, to build a computer-aided knowledge model it needs to be represented in a computer-interpretable way. This means designing a system built as a series of process stages: in other words, pipelines from the output of one process to the input of the next, where multiple simple pipelines can be combined into a complex system. We can represent such a system as a graph with nodes and edges. Each node corresponds to a variable, and each edge represents a conditional dependency between a pair of variables. In this way we can define a model according to the experts' knowledge, and the best way to do so is with a Bayesian model.

To answer our question, 'can we apply expert knowledge to the model?': it depends on the accuracy with which you represent the knowledge as a graph and the accuracy with which you glue the parts together using probability theory (that is, as a Bayesian graphical model). In addition, there are some limitations, which I discuss later.

The Bayesian model is an ideal choice for creating knowledge-driven models

Machine learning techniques have become a standard toolkit for drawing useful conclusions and making predictions in many fields. However, most models are data-driven, and combining expert knowledge with a data-driven model is difficult, if not impossible. A branch of machine learning, however, is the Bayesian graphical model (also known as Bayesian network, Bayesian belief network, causal probabilistic network, and influence diagram), which can be used to integrate expert knowledge into the model and to reason with it. The following points summarize the advantages of Bayesian graphical models, which I will emphasize in this article.

  • It is possible to integrate domain / expert knowledge into the diagram.
  • It has a modular concept.
  • A complex system is built by combining simpler parts.
  • Graph theory provides an intuitive way to represent highly interacting sets of variables.
  • Probability theory provides the glue that binds the parts.

To build a Bayesian graphical model, you need two components: 1. a directed acyclic graph (DAG) and 2. conditional probability tables (CPTs). Only by combining the two do we obtain a representation of the expert knowledge.

The Bayesian network is a directed acyclic graph (DAG)

As mentioned above, knowledge can be represented as a systematic process and regarded as a graph. In the case of a Bayesian model, the graph is represented as a DAG. But what is a DAG? First, it stands for directed acyclic graph: a network (or graph) with nodes (variables) and directed edges. Figure 1 depicts the three unique patterns that can be formed with three variables (X, Y, Z). The nodes correspond to the variables X, Y, Z, and the directed edges (arrows) represent the dependencies or conditional distributions. The network is acyclic, which means that (feedback) loops are not allowed.
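To make this concrete, here is a minimal sketch in bnlearn of the three unique patterns (the variable names X, Y, and Z follow Figure 1 and are purely illustrative):

import bnlearn as bn

# The three unique three-node patterns, encoded as edge lists.
chain    = bn.make_DAG([('X', 'Y'), ('Y', 'Z')])  # chain:    X -> Y -> Z
fork     = bn.make_DAG([('Y', 'X'), ('Y', 'Z')])  # fork:     X <- Y -> Z (common cause)
collider = bn.make_DAG([('X', 'Y'), ('Z', 'Y')])  # collider: X -> Y <- Z (v-structure)

# Plot any of them to verify the structure, e.g.:
bn.plot(chain)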

With a DAG, you can create complex systems by combining (simpler) parts.

All DAGs (large or small) are built according to the following 3 rules:

  1. Edges represent causal relationships.
  2. Edges are directed.
  3. Feedback loops (cycles) are not allowed.

These rules are important because, if the directionality (the arrows) were removed, the three DAGs would become identical. In other words, directionality is what makes a DAG identifiable [2]. There are many articles and Wikipedia pages describing the statistics and causality behind DAGs. Every Bayesian network can be designed from these three unique patterns and should represent the process you want to model. Designing the DAG is the first step in creating a knowledge-driven model. The second step is to define the conditional probability tables, which describe the strength of the relationship of each node using (conditional) probabilities.

A conditional probability table (CPT) is defined to describe the strength of the relationships between nodes.

Bayes' theorem (also known as Bayes' rule) from probability theory is the basis of Bayesian networks. Although the theorem applies here as well, there are some differences. First, in a knowledge-driven model the CPTs are not learned from data (because there is no data). Instead, the probabilities need to be elicited by questioning experts and are then stored in so-called conditional probability tables (CPTs), also known as conditional probability distributions (CPDs). Throughout this article I will use CPT and CPD interchangeably.
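For reference, Bayes' rule describes how a prior belief P(A) is updated into a posterior belief once evidence B is observed:

P(A|B) = P(B|A) × P(A) / P(B)

In a knowledge-driven model, the prior and the conditional terms come from the expert rather than from data.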

The CPT describes the relationship strength of each node in terms of conditional probabilities or priors.

The CPTs are then used together with Bayes' rule to update the model information, which allows inference. In the next section I will demonstrate with a use case how to populate the CPTs with expert knowledge. But first, let's look at the challenges of transforming expert knowledge into probabilities.

Transforming expert knowledge into probabilities

When we want to create a knowledge-driven model, it is very important to extract the correct information from the experts. Domain experts can inform us about the probability that a process succeeds and about the risk of side effects, and with this information we can minimize the risks. But when talking to experts, many estimated probabilities are expressed in words, such as "very likely", rather than as exact percentages.

One of our tasks is to make sure that verbal probability phrases mean the same probability or percentage to the sender and to the receiver.

In some fields there are guidelines that define the range of common terms; for example, a "common" risk means 1-10%. However, without background knowledge of the field, the word "common" can easily be interpreted as a different number [4]. In addition, the interpretation of probability phrases is affected by their context [4]. Beware of such contextual misunderstandings, because they can introduce systematic errors and result in a wrong model. An overview of probability phrases is shown in Figure 2.
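As an illustration, such a mapping can be written down explicitly so that sender and receiver agree on it. The phrases and ranges below are assumptions for the sake of example, not an official guideline:

# Illustrative mapping of verbal probability phrases to numeric ranges.
# The exact ranges are assumptions; real guidelines differ per field.
phrase_to_range = {
    'almost certain': (0.90, 1.00),
    'very likely':    (0.80, 0.90),
    'likely':         (0.60, 0.80),
    'possible':       (0.40, 0.60),
    'unlikely':       (0.10, 0.40),
    'rare':           (0.01, 0.10),
}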

"Impossible" doesn't always seem impossible!

The bnlearn library

For this article, we use the bnlearn library. The bnlearn library is designed to tackle the following three tasks, each of which maps to a bnlearn call as sketched after this list:

  1. Structure learning: given data, estimate a DAG that captures the dependencies between the variables.
  2. Parameter learning: given data and a DAG, estimate the (conditional) probability distributions of the individual variables.
  3. Inference: given a learned model, determine the exact probability values for your queries.
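As a sketch of the three calls (here df is assumed to be a pandas DataFrame with observations; it is illustrative only, since the knowledge-driven approach in this article does not use data):

import bnlearn as bn

model = bn.structure_learning.fit(df)         # 1. structure learning from data
model = bn.parameter_learning.fit(model, df)  # 2. parameter learning given DAG and data
query = bn.inference.fit(model, variables=['X'], evidence={'Y': 1})  # 3. inference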

What are the advantages of bnlearn over other Bayesian analysis implementations?

  1. Built on top of the pgmpy library
  2. Contains the most common pipeline operations
  3. Simple and intuitive
  4. Open source

Building the system from expert knowledge

Let's start with a simple and intuitive example to demonstrate how to build a real-world model from expert knowledge. In this use case, I will play the role of the domain expert for the sprinkler system.

Suppose I have a sprinkler system in my backyard, and for the past 1000 days I have witnessed how and when it worked. I did not collect any data, but I do have a conceptual idea of how it works. We call this expert view or domain knowledge. Note that the sprinkler system is a well-known example in Bayesian networks.

From my expert view, I know some facts about the system; it is sometimes on and sometimes off (that much is certain). If the sprinkler system is on, the grass may be wet. But rain almost certainly also makes the grass wet, and when it rains, the sprinkler system is off most of the time. I also know that clouds usually appear before it begins to rain. Finally, I noticed a weak interaction between the sprinkler system and cloudiness, but I am not completely sure about it.

From this point on, you need to transform the expert knowledge into a model. This can be done systematically by first creating the graph and then defining the CPTs that connect the nodes in the graph.

The sprinkler system consists of four nodes, and each node has two states.

Four nodes can be extracted from the expert's view of the sprinkler system. Each node has two states: Rain (yes or no), Cloudy (yes or no), Sprinkler (on or off), and Wet grass (true or false).
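As a small reference, the states can be written down explicitly. The 0/1 encoding below is the convention used in the CPTs later in this article:

# The four variables and their two states, using the 0/1 encoding of the CPTs below.
states = {
    'Cloudy':    {0: 'no',    1: 'yes'},
    'Rain':      {0: 'no',    1: 'yes'},
    'Sprinkler': {0: 'off',   1: 'on'},
    'Wet_Grass': {0: 'false', 1: 'true'},
}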

Define the simple one-to-one relationships.

A complex system is built by combining simpler parts. This means that you do not need to create or design the entire system at once; instead, you first define the simpler parts. The simpler parts are the one-to-one relationships. In this step we translate the expert's view into relationships. We know from the expert that rain depends on the cloudy state, wet grass depends on the rain state, and wet grass also depends on the sprinkler state. Finally, we know that the sprinkler depends on the cloudy state. We can establish the following four directed one-to-one relationships:

Cloudy → Rain
 Rain → Wet grass
 Sprinkler → Wet grass
 Cloudy → Sprinkler

It is important to realize that the one-to-one parts differ in the strength of their relationships, which needs to be defined with CPTs. But before getting to the CPTs, let's first make the DAG with bnlearn.

A DAG based on the one-to-one relationships

These four directed relationships can now be used to construct a graph with nodes and edges. Each node corresponds to a variable, and each edge represents a conditional dependency between a pair of variables. In bnlearn we can assign the relationships between variables and show them graphically.

import bnlearn as bn

# Define the causal dependencies based on your expert/domain knowledge.
# Left is the source, and right is the target node.
edges = [('Cloudy', 'Sprinkler'),
         ('Cloudy', 'Rain'),
         ('Sprinkler', 'Wet_Grass'),
         ('Rain', 'Wet_Grass')]


# Create the DAG
DAG = bn.make_DAG(edges)

# Plot the DAG (static)
bn.plot(DAG)

# Plot the DAG (interactive)
bn.plot(DAG, interactive=True)

# DAG is stored in an adjacency matrix
DAG["adjmat"]

The figure below shows the resulting DAG. We call it a causal DAG because we assume that the edges we encoded represent our causal assumptions about the sprinkler system.

At this point, the DAG has no knowledge of the underlying dependencies. We can check the CPTs with bn.print_CPD(DAG), which at this stage reports that there are no CPDs to print. We need to add knowledge to the DAG with so-called conditional probability tables (CPTs), and we will rely on expert knowledge to fill them.
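A minimal check at this point (DAG is the variable created above):

# The DAG currently contains structure only; no CPDs are attached yet.
bn.print_CPD(DAG)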

Knowledge can be added to the DAG through conditional probability tables (CPTs).

Establish the conditional probability tables

The sprinkler system is a simple Bayesian network in which Wet grass (the child node) is influenced by two parent nodes, Rain and Sprinkler (see the figure above). The Sprinkler and Rain nodes are both influenced by Cloudy, and the Cloudy node is not influenced by any other node.

We need to associate each node with a probability function that takes as input a particular set of values of the node's parent variables and gives as output the probability of the variable the node represents. Let's work this out for the four nodes.

CPT: Cloudy

The Cloudy node has two states (yes or no) and no dependencies. Computing the probability is relatively straightforward for a single random variable. From my expert view, I witnessed cloudy weather 70% of the time over the past 1000 days. Because the probabilities must add up to 1, the probability of not cloudy is 30%. The CPT looks as follows:

# Import the library
from pgmpy.factors.discrete import TabularCPD

# Cloudy
cpt_cloudy = TabularCPD(variable='Cloudy', variable_card=2, values=[[0.3], [0.7]])
print(cpt_cloudy)

+-----------+-----+
| Cloudy(0) | 0.3 |
+-----------+-----+
| Cloudy(1) | 0.7 |
+-----------+-----+

CPT: Rain

The Rain node has two states and is conditioned on Cloudy, which also has two states. In total, we need to specify four conditional probabilities, i.e., the probability of one event given that another event occurred; in our example, the probability of Rain given Cloudy. The evidence is thus Cloudy and the variable is Rain. From my expert view, it rains 80% of the time when it is cloudy, and I also see rain 20% of the time when there are no visible clouds.

cpt_rain = TabularCPD(variable='Rain', variable_card=2,
                      values=[[0.8, 0.2],
                              [0.2, 0.8]],
                      evidence=['Cloudy'], evidence_card=[2])
print(cpt_rain)

+---------+-----------+-----------+
| Cloudy  | Cloudy(0) | Cloudy(1) |
+---------+-----------+-----------+
| Rain(0) | 0.8       | 0.2       |
+---------+-----------+-----------+
| Rain(1) | 0.2       | 0.8       |
+---------+-----------+-----------+

CPT: Sprinkler

The Sprinkler node has two states and is conditioned on the two states of Cloudy. In total, we need to specify four conditional probabilities. Here we need to define the probability of Sprinkler given Cloudy; the evidence is thus Cloudy and the variable is Sprinkler. I observe that when it is cloudy, the sprinkler is off 90% of the time; the corresponding probability of the sprinkler being on while it is cloudy is therefore 10%. I am not sure about the probabilities when it is not cloudy, so I set both to 50%.

cpt_sprinkler = TabularCPD(variable='Sprinkler', variable_card=2,
                           values=[[0.5, 0.9],
                                   [0.5, 0.1]],
                           evidence=['Cloudy'], evidence_card=[2])
print(cpt_sprinkler)

+--------------+-----------+-----------+
| Cloudy       | Cloudy(0) | Cloudy(1) |
+--------------+-----------+-----------+
| Sprinkler(0) | 0.5       | 0.9       |
+--------------+-----------+-----------+
| Sprinkler(1) | 0.5       | 0.1       |
+--------------+-----------+-----------+

CPT: Wet grass

The Wet grass node has two states and is conditioned on two parent nodes, Rain and Sprinkler. Here we need to define the probability of Wet grass given Rain and Sprinkler. In total, we must specify 8 conditional probabilities (three binary nodes involved: 2^3 = 8).

As an expert, I am quite sure the grass is wet 99% of the time when it both rained and the sprinkler was on: P(Wet grass=1 | Rain=1, Sprinkler=1) = 0.99. The complement is P(Wet grass=0 | Rain=1, Sprinkler=1) = 1 - 0.99 = 0.01.

As an expert, I am absolutely sure the grass does not get wet when it did not rain and the sprinkler was off: P(Wet grass=0 | Rain=0, Sprinkler=0) = 1. The complement is P(Wet grass=1 | Rain=0, Sprinkler=0) = 1 - 1 = 0.

As an expert, I know the grass is almost always wet (90%) when it rained while the sprinkler was off: P(Wet grass=1 | Rain=1, Sprinkler=0) = 0.9. The complement is P(Wet grass=0 | Rain=1, Sprinkler=0) = 1 - 0.9 = 0.1.

As an expert, I know the grass is almost always wet (90%) when the sprinkler was on while it did not rain: P(Wet grass=1 | Rain=0, Sprinkler=1) = 0.9. The complement is P(Wet grass=0 | Rain=0, Sprinkler=1) = 1 - 0.9 = 0.1.

cpt_wet_grass = TabularCPD(variable='Wet_Grass', variable_card=2,
                           values=[[1, 0.1, 0.1, 0.01],
                                   [0, 0.9, 0.9, 0.99]],
                           evidence=['Sprinkler', 'Rain'],
                           evidence_card=[2, 2])
print(cpt_wet_grass)

+--------------+--------------+--------------+--------------+--------------+
| Sprinkler    | Sprinkler(0) | Sprinkler(0) | Sprinkler(1) | Sprinkler(1) |
+--------------+--------------+--------------+--------------+--------------+
| Rain         | Rain(0)      | Rain(1)      | Rain(0)      | Rain(1)      |
+--------------+--------------+--------------+--------------+--------------+
| Wet_Grass(0) | 1.0          | 0.1          | 0.1          | 0.01         |
+--------------+--------------+--------------+--------------+--------------+
| Wet_Grass(1) | 0.0          | 0.9          | 0.9          | 0.99         |
+--------------+--------------+--------------+--------------+--------------+
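As a quick sanity check (a small sketch using numpy, not part of the original pipeline), you can verify that each column of the CPT, i.e., each (Sprinkler, Rain) configuration, sums to 1:

import numpy as np

# Each column holds P(Wet_Grass=0) and P(Wet_Grass=1) for one parent configuration.
values = np.array([[1.0, 0.1, 0.1, 0.01],
                   [0.0, 0.9, 0.9, 0.99]])
assert np.allclose(values.sum(axis=0), 1.0)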

We have now defined the DAG and the strengths of the relationships in the CPTs. The next step is to connect the DAG to the CPTs.

Update the DAG with the CPTs

All CPTs have been created, and we can now connect them to the DAG. As a sanity check, you can examine the CPTs with the bn.print_CPD function.

# Update DAG with the CPTs
model = bn.make_DAG(DAG, CPD=[cpt_cloudy, cpt_sprinkler, cpt_rain, cpt_wet_grass])

# Print the CPTs
bn.print_CPD(model)

"""
[bnlearn] >No changes made to existing Bayesian DAG.
[bnlearn] >Add CPD: Cloudy
[bnlearn] >Add CPD: Sprinkler
[bnlearn] >Add CPD: Rain
[bnlearn] >Add CPD: Wet_Grass
[bnlearn] >Checking CPDs..
[bnlearn] >Check for DAG structure. Correct: True
CPD of Cloudy:
+-----------+-----+
| Cloudy(0) | 0.3 |
+-----------+-----+
| Cloudy(1) | 0.7 |
+-----------+-----+
CPD of Sprinkler:
+--------------+-----------+-----------+
| Cloudy       | Cloudy(0) | Cloudy(1) |
+--------------+-----------+-----------+
| Sprinkler(0) | 0.5       | 0.9       |
+--------------+-----------+-----------+
| Sprinkler(1) | 0.5       | 0.1       |
+--------------+-----------+-----------+
CPD of Rain:
+---------+-----------+-----------+
| Cloudy  | Cloudy(0) | Cloudy(1) |
+---------+-----------+-----------+
| Rain(0) | 0.8       | 0.2       |
+---------+-----------+-----------+
| Rain(1) | 0.2       | 0.8       |
+---------+-----------+-----------+
CPD of Wet_Grass:
+--------------+--------------+--------------+--------------+--------------+
| Sprinkler    | Sprinkler(0) | Sprinkler(0) | Sprinkler(1) | Sprinkler(1) |
+--------------+--------------+--------------+--------------+--------------+
| Rain         | Rain(0)      | Rain(1)      | Rain(0)      | Rain(1)      |
+--------------+--------------+--------------+--------------+--------------+
| Wet_Grass(0) | 1.0          | 0.1          | 0.1          | 0.01         |
+--------------+--------------+--------------+--------------+--------------+
| Wet_Grass(1) | 0.0          | 0.9          | 0.9          | 0.99         |
+--------------+--------------+--------------+--------------+--------------+
[bnlearn] >Independencies:
(Wet_Grass ⟂ Cloudy | Rain, Sprinkler)
(Rain ⟂ Sprinkler | Cloudy)
(Cloudy ⟂ Wet_Grass | Rain, Sprinkler)
(Sprinkler ⟂ Rain | Cloudy)
[bnlearn] >Nodes: ['Cloudy', 'Sprinkler', 'Rain', 'Wet_Grass']
[bnlearn] >Edges: [('Cloudy', 'Sprinkler'), ('Cloudy', 'Rain'), ('Sprinkler', 'Wet_Grass'), ('Rain', 'Wet_Grass')]
"""

The DAG with the CPTs is shown in the figure below.

Reasoning with the causal model

We have created a DAG that describes the structure of the system and CPTs that quantitatively describe the statistical relationship of each node with its parents. Let's put some questions to the model and make inferences!

How likely is the grass to be wet given that the sprinkler is off?

P(Wet_Grass=1 | Sprinkler=0) = 0.6162

If the sprinkler is off and it is cloudy, how likely is it to rain?

P(Rain=1 | Sprinkler=0, Cloudy=1) = 0.8

import bnlearn as bn

# Make inference on wet grass given sprinkler is off
q1 = bn.inference.fit(model, variables=['Wet_Grass'], evidence={'Sprinkler':0})
print(q1.df)
"""
+--------------+------------------+
| Wet_Grass    |   phi(Wet_Grass) |
+==============+==================+
| Wet_Grass(0) |           0.3838 |
+--------------+------------------+
| Wet_Grass(1) |           0.6162 |
+--------------+------------------+
"""

# Make inference on Rain, given sprinkler is off and cloudy is true
q2 = bn.inference.fit(model, variables=['Rain'], evidence={'Sprinkler':0, 'Cloudy':1})
print(q2.df)
"""
+---------+-------------+
| Rain    |   phi(Rain) |
+=========+=============+
| Rain(0) |      0.2000 |
+---------+-------------+
| Rain(1) |      0.8000 |
+---------+-------------+
"""

# Inferences with two or more variables can also be made such as:
q3 = bn.inference.fit(model, variables=['Wet_Grass','Rain'], evidence={'Sprinkler':1})
print(q3.df)
"""
+---------+--------------+-----------------------+
| Rain    | Wet_Grass    |   phi(Rain,Wet_Grass) |
+=========+==============+=======================+
| Rain(0) | Wet_Grass(0) |                0.0609 |
+---------+--------------+-----------------------+
| Rain(0) | Wet_Grass(1) |                0.5482 |
+---------+--------------+-----------------------+
| Rain(1) | Wet_Grass(0) |                0.0039 |
+---------+--------------+-----------------------+
| Rain(1) | Wet_Grass(1) |                0.3870 |
+---------+--------------+-----------------------+
"""

Summary

One advantage of Bayesian networks is that it is intuitively easier for humans to understand direct dependencies and local distributions than a complete joint distribution. To create a knowledge-driven model, we need two ingredients: the DAG and the conditional probability tables (CPTs). Both are derived by questioning experts. The DAG describes the structure of the system, and the CPTs quantitatively describe the statistical relationship of each node with its parents. Although this approach seems reasonable, be aware of possible systematic errors introduced by questioning experts, and of the limitations of building complex models.

How do I know my causal model is correct?

In the sprinkler example, we extracted the domain expert's knowledge from personal experience. Although we created a causal diagram, it is hard to fully verify the validity and completeness of such a diagram. For example, you may have a different view on the probabilities or the graph, and also be right; it is perfectly reasonable to argue about whether rain without visible clouds really occurs 20% of the time. Conversely, multiple true knowledge models may exist at the same time. In such cases, you may need to combine the probabilities, or decide who is right.
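One practical way to probe such disagreements is a sensitivity check: re-enter the disputed probability and see how much the inference shifts. Below is a sketch reusing the sprinkler model from above; the alternative value of 0.4 is an assumption for illustration:

from pgmpy.factors.discrete import TabularCPD
import bnlearn as bn

# Suppose a second expert believes rain without visible clouds occurs 40% of the time.
cpt_rain_alt = TabularCPD(variable='Rain', variable_card=2,
                          values=[[0.6, 0.2],
                                  [0.4, 0.8]],
                          evidence=['Cloudy'], evidence_card=[2])

# Rebuild the model with the alternative CPT and re-run the same query.
model_alt = bn.make_DAG(DAG, CPD=[cpt_cloudy, cpt_sprinkler, cpt_rain_alt, cpt_wet_grass])
q_alt = bn.inference.fit(model_alt, variables=['Wet_Grass'], evidence={'Sprinkler': 0})
print(q_alt.df)  # compare with q1.df from the original model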

The knowledge used is as rich as the expert's experience, and as biased as the expert.

In other words, the probabilities we obtain by questioning experts are subjective probabilities [5]. In the sprinkler example, we can accept that the probabilities are personal and reflect one person's beliefs at a specific time and place. Would the model change if the expert lived in Africa instead of Britain?

If you want to design a knowledge-driven model with such a process, it is important to understand how people (experts) arrive at probability estimates. The literature shows that people rarely follow the principles of probability when reasoning about uncertain events; instead, they replace the laws of probability with a limited set of heuristics [6,7], such as representativeness and availability. This can lead to systematic errors and, consequently, to a wrong model. In addition, to ensure accurate probabilities or percentages, sender and receiver need to use the same definitions, as described above.

Complexity is a major limitation.

The sprinkler system presented in this article contains only a few nodes, but Bayesian networks can contain many more nodes with multiple levels of parent-child dependencies. The number of probabilities required to fill a conditional probability table (CPT) in a Bayesian network grows exponentially with the number of parent nodes associated with that table. If the tables are filled with knowledge elicited from domain experts, the size of the task can pose a considerable cognitive burden [8].

Too many parent-child dependencies pose a considerable cognitive burden on domain experts.

For example, if a node and its m parents are Boolean variables, the probability function is represented by a table of 2^m entries, one entry for each possible combination of parent values. Creating large graphs (more than 10-15 nodes) becomes very troublesome, because the number of parent-child dependencies may pose a considerable cognitive barrier to the domain experts. If you do have data on the system you want to model, you can also use structure learning [3] to learn the structure (DAG) and/or its parameters (CPTs).
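A quick illustration of this exponential growth (a sketch; the loop simply prints 2^m):

# Each additional binary parent doubles the number of parent configurations
# for which a probability must be elicited from the expert.
for m in range(1, 11):
    print(f"{m} parents -> {2**m} parent configurations")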

Can we apply expert knowledge to the model?

I will repeat my earlier statement: it depends on the accuracy with which you represent your knowledge as a graph and the accuracy with which you glue the parts together with probability theory.

Final summary

Creating a knowledge-driven model is not easy. It is not only about data modeling but also about human psychology. Be prepared for discussions with the expert: many short conversations are better than one long one. Ask your questions systematically: first design the graph with nodes and edges, then fill in the CPTs. Be careful when discussing probabilities; understand how the expert derived them, and standardize the phrasing where needed. Check whether a different time and place would lead to different results. Perform sanity checks once the model is built.

References

  1. Wikipedia, Knowledge

  2. Pearl, Judea (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press. ISBN 978-0-521-77362-1. OCLC 42291253.

  3. E. Taskesen, A Step-by-Step Guide in detecting causal relationships using Bayesian Structure Learning in Python, Medium, 2021

  4. Sanne Willems, et al, Variability in the interpretation of probability phrases used in Dutch news articles — a risk for miscommunication, JCOM, 24 March 2020

  5. R. Jeffrey, Subjective Probability: The Real Thing, Cambridge University Press, Cambridge, UK, 2004.

  6. A. Tversky and D. Kahneman, Judgment under Uncertainty: Heuristics and Biases, Science, 1974

  7. A. Tversky and D. Kahneman, 'Judgment under uncertainty: Heuristics and biases,' in Judgment under Uncertainty: Heuristics and Biases, D. Kahneman, P. Slovic, and A. Tversky, eds., Cambridge University Press, Cambridge, 1982, pp. 3–20

  8. Balaram Das, Generating Conditional Probabilities for Bayesian Networks: Easing the Knowledge Acquisition Problem, arXiv

By Erdogan Taskesen

Tags: Machine Learning AI
