Is my log big enough for process mining? Some thoughts on generalization

I recently received an email from a colleague who is active in specification mining (software specifications from execution traces and code artifacts) with the following question.

Do you know of process mining works that deal with the confidence one may have in the mined specifications given the set of traces, i.e., how do we know we have seen “enough” traces? Can we quantify our confidence that the model we built from the traces we have seen is a good one?

The property the colleague asked about is calledĀ generalization in process mining. As my reply to this question summarized some insights that I gained recently, I though it was time to share this information further.

A system S can produce a language L. In reality we only see a subset of the behaviors that is recorded in a log K. Ideally, we want that a model M that was discovered from K can reproduce L. We say that M generalizes K well if M also accepts traces from L that are not in K (the more, the better).

This measure is in contradiction with 3 other measures (fitness, precision, and simplicity) as the most general model that accepts all traces is not very precise (M should not accept traces that are not in L). These measures are described informally in The Need for a Process Mining Evaluation Framework in Research and Practice (doi: http://dx.doi.org/10.1007/978-3-540-78238-4_10) and the paper On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery (doi: http://dx.doi.org/10.1007/978-3-642-33606-5_19) shows how they influence each other in practice.

There is currently no generic mathematical definition to compute for a given log K that it was general enough (contains enough information to infer entire L from K). This usually depends on the algorithm, the kind of original system S/the language L, and the kind of model one would like to discover.

The most general result that I am aware of is that the log K has to beĀ directly follows-complete. Action A of the system S directly follows action B if there is some execution trace …AB… of S where first A occurs and then directly B occurs. The log K is directly follows-complete iff for any two actions A, B that directly follow each other, there is a trace …AB… in L. Every system S with a finite number of actions has a finite log K that is directly follows-complete, even if the language L of S is infinite. For such logs, there are algorithms that ensure that S (or a system that is trace-equivalent to S) can be rediscovered. See for instance Discovering Block-Structured Process Models from Event Logs – A Constructive Approach (doi: http://dx.doi.org/10.1007/978-3-642-38697-8_17).

In general, if you have the original system S (or some finite characterization of L), then it will be possible to compute whether the log K is directly follows-complete. If you do not have S or L, then we currently do not know any means to estimate how complete K is. This is an open question that we are currently looking into. Yet, in essence, you have to estimate the probabilities that the information in log K suffices to explain particular languages constructs that the original system has/may have.

We are currently looking deeper into these questions. If you have more pointers on this topic, feel free to drop a comment or an email.

Advertisements