I recently received an email from a colleague who is active in specification mining (software specifications from execution traces and code artifacts) with the following question.

Do you know of process mining works that deal with the confidence one may have in the mined specifications given the set of traces, i.e., how do we know we have seen “enough” traces? Can we quantify our confidence that the model we built from the traces we have seen is a good one?

The property the colleague asked about is called *generalization* in process mining. As my reply to this question summarized some insights that I gained recently, I though it was time to share this information further.

A system S can produce a language L. In reality we only see a subset of the behaviors that is recorded in a log K. Ideally, we want that a model M that was discovered from K can reproduce L. We say that M *generalizes* K *well* if M also accepts traces from L that are not in K (the more, the better).

This measure is in contradiction with 3 other measures (fitness, precision, and simplicity) as the most general model that accepts all traces is not very precise (M should not accept traces that are not in L). These measures are described informally in The Need for a Process Mining Evaluation Framework in Research and Practice (doi: http://dx.doi.org/10.1007/978-3-540-78238-4_10) and the paper On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery (doi: http://dx.doi.org/10.1007/978-3-642-33606-5_19) shows how they influence each other in practice.

There is currently no generic mathematical definition to compute for a given log K that it was general enough (contains enough information to infer entire L from K). This usually depends on the algorithm, the kind of original system S/the language L, and the kind of model one would like to discover.

The most general result that I am aware of is that the log K has to be *directly follows-complete*. Action A of the system S *directly follows* action B if there is some execution trace …AB… of S where first A occurs and then directly B occurs. The log K is *directly follows-complete* iff for any two actions A, B that directly follow each other, there is a trace …AB… in L. Every system S with a finite number of actions has a finite log K that is directly follows-complete, even if the language L of S is infinite. For such logs, there are algorithms that ensure that S (or a system that is trace-equivalent to S) can be rediscovered. See for instance Discovering Block-Structured Process Models from Event Logs – A Constructive Approach (doi: http://dx.doi.org/10.1007/978-3-642-38697-8_17).

In general, if you have the original system S (or some finite characterization of L), then it will be possible to compute whether the log K is directly follows-complete. If you do not have S or L, then we currently do not know any means to estimate how complete K is. This is an open question that we are currently looking into. Yet, in essence, you have to estimate the probabilities that the information in log K suffices to explain particular languages constructs that the original system has/may have.

We are currently looking deeper into these questions. If you have more pointers on this topic, feel free to drop a comment or an email.

Pingback: A reply on “Some thoughts on Generalization” | Blog of Joos Buijs

I came across this post yesterday; today I notice that I am way out of date…

My PhD (http://etheses.bham.ac.uk/4911/) looked at similar questions. How big an event log do we need to use for mining with a particular algorithm, to be confident (e.g. 95%) of mining a model with a specified accuracy (e.g. 95%, by some metric)?

We suggested that system S is producing a probabilistic language L. The behaviours recorded in the log K are then a stochastic sample from the distribution(s) governing S. This allows the question to be framed in the context of machine learning theory.

We investigated using this conceptual framework to analyse and compare process mining algorithms. For this analysis, we assumed knowledge of the `true’ or ground truth’ model.

However I believe the same ideas could be applied to mining an unknown model (i.e. the usual process mining case). For example based on knowledge of the probabilistic behaviour of the algorithm, and convergence of the mined model as the log grows. (One could never be 100% confident.) Unfortunately the PhD is of fixed duration and I am not at present funded to work in this area…

At the time, the process mining community seemed not ready for these ideas. It is great to see that they are being considered again (or were in 2014!).