
The Power of Bayesian Causal Inference: A Comparative Evaluation of Libraries to Reveal Hidden Causality in Your Dataset.


Library 1: Bnlearn for Python.

Bnlearn is a Python package suited for creating and analyzing Bayesian Networks on discrete, mixed, and continuous data sets [2, 3]. It is designed for ease of use and comprises the most-wanted Bayesian pipelines for causal learning: structure learning, parameter learning, and making inferences. A variety of statistical tests can be utilized by simply specifying the parameters during initialization. Bnlearn also comprises various helper functions to transform data sets, derive the topological ordering of the (entire) graph, compare two graphs, and make various insightful plots, amongst others. More details about structure learning with bnlearn can be found here:
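As a short illustration of these helpers, the sketch below demonstrates a few of them. The function names follow the bnlearn documentation, but treat this as a hedged sketch: exact signatures and return values may differ between versions.

# Hedged sketch of a few bnlearn helper functions (signatures may vary per version)
import bnlearn as bn

# Load one of the bundled example data sets
df = bn.import_example('sprinkler')

# Learn a structure and derive the topological ordering of the entire graph
model = bn.structure_learning.fit(df)
print(bn.topological_sort(model['adjmat']))

# Compare two learned graphs (assumed to return scores plus a combined adjacency matrix)
model_cs = bn.structure_learning.fit(df, methodtype='cs')
scores, adjmat = bn.compare_networks(model, model_cs)

# Transform a (mixed) data set into one-hot and numeric encodings
dfhot, dfnum = bn.df2onehot(df)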

One of the nice functionalities of Bnlearn is that it can learn the causal structure based on the data set alone. Six algorithms are implemented for this task: hillclimbsearch, exhaustivesearch, constraintsearch, chow-liu, naivebayes, and TAN; these can be combined with the scoring types BIC, K2, and BDEU. Some methods require setting a root node, such as Tree-augmented Naive Bayes (TAN), which is recommended if you know the outcome (or target) value (a short sketch follows after the code output below). This can also dramatically lower the computational burden and is recommended when you have many features. In addition, with the independence test, spurious edges can easily be pruned from the model. In the following example, I will use the hillclimbsearch method with scoring type BIC for structure learning. In this example, we will not define a target value but let Bnlearn determine the entire causal structure purely from the data itself.

# Load library
import bnlearn as bn

# df: pandas DataFrame with the Census Income data set (loaded beforehand)

# Structure learning
model = bn.structure_learning.fit(df, methodtype='hillclimbsearch', scoretype='bic')

# Test the significance of the edges and remove spurious ones.
model = bn.independence_test(model, df, test="chi_square", alpha=0.05, prune=True)

# Make plot
G = bn.plot(model, interactive=False)

# Make plot interactive
G = bn.plot(model, interactive=True)

# Show edges
print(model['model_edges'])
# [('education', 'salary'),
# ('marital-status', 'relationship'),
# ('occupation', 'workclass'),
# ('occupation', 'education'),
# ('relationship', 'salary'),
# ('relationship', 'occupation')]
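
As mentioned above, tree-based methods such as TAN require a root (class) node. Below is a minimal, hedged sketch; the class_node parameter name follows the bnlearn documentation, and 'salary' is used as the target variable of this data set.

# Sketch: structure learning with a known target (class) node using TAN
model_tan = bn.structure_learning.fit(df, methodtype='tan', class_node='salary')

# The scoring type can likewise be swapped, e.g., K2 instead of BIC
model_k2 = bn.structure_learning.fit(df, methodtype='hillclimbsearch', scoretype='k2')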

To determine the Directed Acyclic Graph (DAG), we need to specify the input data frame as shown in the code section above. After fitting a model, the results are stored in the model dictionary, which can be used for further analysis. An interactive plot of the causal structure is shown in Figure 1.

Figure 1. Interactive plot for structure learning using Bnlearn on the Census Income data set. When the CPDs are learned, the tooltip describes the estimated CPDs (image by author).

With the learned DAG (Figure 1), we can estimate the conditional probability distributions (CPDs; see the code section below) and make inferences using do-calculus. In other words, we can start asking questions of our data.

# Learn the CPDs using the estimated edges.
# Note that we can also customize the edges or manually provide a DAG:
# model = bn.make_DAG(model['model_edges'])

# Learn the CPD by providing the model and dataframe
model = bn.parameter_learning.fit(model, df)

# Print the CPD
CPD = bn.print_CPD(model)
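
Because bnlearn builds on pgmpy, an individual CPD can also be inspected directly from the underlying network. This is a small sketch, assuming the fitted pgmpy model is stored under the 'model' key of the returned dictionary.

# Sketch: inspect a single fitted CPD via the underlying pgmpy model
print(model['model'].get_cpds('salary'))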

Question 1: What is the probability of a salary >50K given that education is Doctorate: P(salary | education=Doctorate)?

Intuitively, we may expect a high probability because the education is Doctorate. Let's find the posterior probability from our Bayesian model. In the code section below, we derive a probability of P=0.7093. This confirms that when education is Doctorate, there is a higher probability of a salary >50K compared to not having a Doctorate education.

# Start making inferences
query = bn.inference.fit(model, variables=['salary'], evidence={'education':'Doctorate'})
print(query)
+---------------+---------------+
| salary | phi(salary) |
+===============+===============+
| salary(<=50K) | 0.2907 |
+---------------+---------------+
| salary(>50K) | 0.7093 |
+---------------+---------------+

Let's now ask whether lower education also results in a lower probability of a salary >50K. We can easily change the education to HS-grad and ask the question again.

Question 2: What is the probability of a salary >50K given that education is HS-grad: P(salary | education=HS-grad)?

This results in a posterior probability of P=0.1615. Studying thus pays off with a higher salary, according to this data set. However, keep in mind that we did not use any other constraints as evidence that may influence the outcome.

query = bn.inference.fit(model, variables=['salary'], evidence={'education':'HS-grad'})
print(query)
+---------------+---------------+
| salary | phi(salary) |
+===============+===============+
| salary(<=50K) | 0.8385 |
+---------------+---------------+
| salary(>50K) | 0.1615 |
+---------------+---------------+

So far, we used a single variable as evidence, but all variables in the DAG can be used as evidence. Let's make another, more complex query.

Question 3: What is the probability of being in a certain workclass given that education is Doctorate and the marital status is Never-married: P(workclass | education=Doctorate, marital-status=Never-married)?

The code section below shows that this returns the probability for each workclass, with Private having the highest probability: P=0.5639.

# Start making inferences
query = bn.inference.fit(model, variables=['workclass'], evidence={'education':'Doctorate', 'marital-status':'Never-married'})
print(query)
+-----------------------------+------------------+
| workclass | phi(workclass) |
+=============================+==================+
| workclass(?) | 0.0420 |
+-----------------------------+------------------+
| workclass(Federal-gov) | 0.0420 |
+-----------------------------+------------------+
| workclass(Local-gov) | 0.1326 |
+-----------------------------+------------------+
| workclass(Never-worked) | 0.0034 |
+-----------------------------+------------------+
| workclass(Private) | 0.5639 |
+-----------------------------+------------------+
| workclass(Self-emp-inc) | 0.0448 |
+-----------------------------+------------------+
| workclass(Self-emp-not-inc) | 0.0868 |
+-----------------------------+------------------+
| workclass(State-gov) | 0.0810 |
+-----------------------------+------------------+
| workclass(Without-pay) | 0.0034 |
+-----------------------------+------------------+

Summary

  • Input data: The input data can be a discrete, continuous, or mixed data set.
  • Benefits: Contains the most-wanted Bayesian pipelines for structure learning, parameter learning, and making inferences using do-calculus (recapped in the sketch below). Plots can easily be created and CPDs can be explored. Great for beginners as well as experts who do not want to build the pipeline themselves.
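
To recap, the sketch below chains the full pipeline from this section (structure learning, parameter learning, and inference); df is assumed to be the Census Income data set used throughout.

# Minimal end-to-end bnlearn pipeline, as covered in this section
import bnlearn as bn

# df: pandas DataFrame with the Census Income data set (loaded beforehand)

# 1. Structure learning and pruning of spurious edges
model = bn.structure_learning.fit(df, methodtype='hillclimbsearch', scoretype='bic')
model = bn.independence_test(model, df, test='chi_square', alpha=0.05, prune=True)

# 2. Parameter learning (estimate the CPDs)
model = bn.parameter_learning.fit(model, df)

# 3. Inference using do-calculus
query = bn.inference.fit(model, variables=['salary'], evidence={'education': 'Doctorate'})
print(query)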
