In this post, I will try to highlight key differences between process mining and data mining, and explain why we should be paying greater attention than we currently are to the former.
Process mining vs. data mining: what’s in a name?
Many today are at least superficially familiar with data mining, in the context of big data, business intelligence, and analytics. Data mining refers to the exploitation of vast amounts of data via search algorithms that are able to detect data patterns and correlations, which may be useful for developing needed insights. Exploratory data analysis (EDA) is a part of the initial familiarization with data and is sometimes contrasted with the more ‘classical’ approach involving formal model building and parameter estimation.
The larger the datasets, and the greater the velocity at which data are collected, the more difficult it is for humans to gain insights unaided, hence the need for robust computing platforms, ample memory, and other strategies. The latter include powerful ‘divide and conquer’ schemes, such as Map/Reduce, that allow work to be allocated to distributed computing nodes, with the results later combined to yield a useful answer. One downside to all this is that the complexity and nuances of the algorithms, the volume of data, and the degree of automation involved in processing them make it almost a given that most users must accept the results coming out of this ‘black box’ at face value.
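As a rough illustration of the ‘divide and conquer’ idea, the sketch below counts diagnoses across chunks of records in plain Python. The records, the chunking scheme, and the function names are all made up for illustration; a real Map/Reduce platform would ship each chunk to a separate node rather than loop over the chunks in one process.

```python
from collections import Counter
from functools import reduce

# Hypothetical records: (patient_id, diagnosis). Made-up data for illustration.
records = [
    ("p1", "asthma"), ("p2", "flu"), ("p3", "asthma"),
    ("p4", "flu"), ("p5", "flu"), ("p6", "bronchitis"),
]

def split(items, n_chunks):
    """Divide the records into chunks, one per (imaginary) computing node."""
    return [items[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """Map step: each node counts diagnoses in its own chunk only."""
    return Counter(diagnosis for _, diagnosis in chunk)

def combine(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

partial_counts = [map_chunk(chunk) for chunk in split(records, 3)]
diagnosis_counts = reduce(combine, partial_counts, Counter())
print(dict(diagnosis_counts))
```

No single node ever sees all the records, yet the combined counts are the same as if one had.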
Data mining focuses on data. These data are typically in the form of records, although they can also be, and increasingly are, unstructured (e.g., ‘tweets’, ‘likes’, movie reviews, customer rants on a vendor’s site, a physician’s notes on a patient). For simplicity, think of a typical data record as consisting of fields in adjacent columns, and a data table as made up of rows of records. A database then consists of a group of interrelated tables.
In healthcare, a row/record may contain information about a patient. The first column/field may serve to identify the patient, while subsequent columns/fields may contain health information and/or demographic or billing data. If a field is associated with a particular visit to a physician, the other fields may contain clinical information such as patient age, sex, smoking status, symptoms manifested, auscultation findings, results of lab work, and diagnosis. These records are typically known as observations or examples, and the specific data fields are referred to as attributes.
Process mining basics
While clearly all these data can be ‘mined’ with a view to, perhaps, finding out if patients of a certain age who also smoke tend to have commonalities in the results of their blood work, the data may prove far less useful from an operations management perspective, and specifically for optimizing the processes that occurred during patients’ visits to the clinic, the execution of their lab orders, and the medical team’s arrival at a diagnosis.
Indeed, for process mining to be possible, certain data elements need to be present at the time of data collection, most importantly accurate time stamps. If one time stamp is recorded when an activity on behalf of the patient starts and another when it concludes, durations can be derived, both for how long activities last and for the length of idle/wait times between activities. Between an activity starting and concluding in ‘normal’ fashion, other things can also occur, such as suspension, resumption, or cancellation of said activity, and they often do in the form of interruptions as priorities are shuffled in real life. Recording these events is crucial to deriving better models later, from which one can proceed to study bottlenecks, say, and enhance human performance or streamline and parallelize sequences of activities. Without these raw data, however, modeling and analysis are essentially impossible. When the data are not captured, an organization misses an important opportunity to know itself better, and little can be done about that after the fact.
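To make the arithmetic concrete, here is a minimal sketch that derives activity durations and wait times from start/end time stamps. The activity names and times are invented for illustration, and the sketch assumes the activities of a single visit are listed in the order they occurred.

```python
from datetime import datetime

# Hypothetical start/end time stamps for one patient's visit (made-up values).
activities = [
    ("registration", "2024-03-01 09:00", "2024-03-01 09:10"),
    ("triage",       "2024-03-01 09:25", "2024-03-01 09:40"),
    ("consultation", "2024-03-01 10:20", "2024-03-01 10:45"),
]

FMT = "%Y-%m-%d %H:%M"

def minutes_between(start, end):
    """Elapsed minutes between two time stamps given as strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60

# Duration: end of an activity minus its own start.
durations = {name: minutes_between(start, end) for name, start, end in activities}

# Wait time: start of the next activity minus the end of the previous one.
waits = {}
for (prev_name, _, prev_end), (next_name, next_start, _) in zip(activities, activities[1:]):
    waits[(prev_name, next_name)] = minutes_between(prev_end, next_start)

print(durations)
print(waits)
```

In this made-up visit the patient spends more time waiting between activities than in any single activity, which is the kind of observation that only start and end time stamps make possible.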
Time-oriented data are rarely found in one place, and one must be prepared to ferret them out from a variety of legacy systems, workflow applications, and data repositories and warehouses. It is important to focus on raw data that are as close to the originating source as possible rather than on aggregate data that have already been massaged and filtered for other purposes.
The process perspective is essentially time- and resource-oriented. Many improvements can be attempted if one has raw data as to when events occurred and by whom they were performed. In process mining terms, what is normally known as an observation (a data record) with attributes (fields) becomes an event endowed with attributes such as time stamps, resources and resource affiliations (functions and departments), and costs, among others. The events belonging to one case (one patient visit, say) form a trace, a collection of such cases makes up an event log, and the log as a whole becomes representative of a broader, ‘umbrella’ process.
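One way to picture this structure is sketched below: each event carries a case identifier plus the attributes mentioned above, and grouping the time-ordered events by case yields one trace per case. The `Event` class, its field names, and the sample values are hypothetical, chosen only to mirror the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    case_id: str         # which case (e.g. patient visit) the event belongs to
    activity: str        # what was done
    timestamp: datetime  # when it happened
    resource: str        # who performed it
    department: str      # the resource's affiliation
    cost: float          # cost attributed to the event

# Hypothetical events, deliberately listed out of order.
events = [
    Event("visit-2", "register", datetime(2024, 3, 1, 9, 5), "clerk-1", "front desk", 5.0),
    Event("visit-1", "consult", datetime(2024, 3, 1, 9, 40), "dr-lee", "primary care", 80.0),
    Event("visit-1", "register", datetime(2024, 3, 1, 9, 0), "clerk-1", "front desk", 5.0),
    Event("visit-2", "lab", datetime(2024, 3, 1, 9, 50), "tech-2", "laboratory", 30.0),
]

def traces_by_case(events):
    """Group events by case and order each group by time stamp."""
    grouped = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        grouped[event.case_id].append(event.activity)
    return dict(grouped)

traces = traces_by_case(events)
print(traces)
```

Note that the grouping relies entirely on the case identifier and the time stamp; without either, the flat list of events cannot be turned into traces.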
Process mining is done for a variety of purposes. These include discovery, i.e., the derivation of a process model from event logs, and conformance checking, which tests whether a model and an event log agree or compares two models, one of which could be normative (illustrating how things should be done). Once a model exists, currently available tools allow it to be animated, past and current events to be visualized, and the likelihood of future ones to be studied (prediction). Process mining can also lead to model enhancement, when examining event logs that may be more complex than the limited set from which the original model was derived.
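As a toy illustration of discovery and conformance checking, the sketch below counts which activity directly follows which in a small log, then flags steps that a normative model does not allow. Counting directly-follows pairs is only one very simple discovery technique, picked here for brevity; the log, the normative pairs, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical log: case id -> time-ordered list of activities (made-up data).
log = {
    "visit-1": ["register", "triage", "consult", "discharge"],
    "visit-2": ["register", "triage", "lab", "consult", "discharge"],
    "visit-3": ["register", "consult", "discharge"],
}

def directly_follows(traces):
    """Discovery: count how often activity a is immediately followed by b."""
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts

# Hypothetical normative model: the only steps that are supposed to occur.
normative_pairs = {
    ("register", "triage"), ("triage", "consult"), ("triage", "lab"),
    ("lab", "consult"), ("consult", "discharge"),
}

def deviations(traces, allowed_pairs):
    """Conformance: list, per case, the observed steps the model does not allow."""
    result = {}
    for case_id, trace in traces.items():
        bad = [pair for pair in zip(trace, trace[1:]) if pair not in allowed_pairs]
        if bad:
            result[case_id] = bad
    return result

dfg = directly_follows(log)
print(dfg)
print(deviations(log, normative_pairs))
```

Here the third visit skips triage, so the log and the normative model disagree on that case. Whether that is a problem with the visit or with the model is exactly the question conformance checking raises.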
An important distinction between animation and simulation is that the former is done by replaying traces of actual events on a model, whereas the latter is based on studying collections of events pseudo-randomly generated from probability distributions. While these distributions are typically selected to reflect an expectation of reality, it is not the same as having actual data to work with. I discuss some aspects of simulation here and here.
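The contrast can be sketched in a few lines: replay summarizes waits that were actually logged, while simulation draws waits from a distribution someone has chosen. The observed values are made up, and the exponential distribution is purely an assumption for the example, not a claim about how real waits behave.

```python
import random
from statistics import mean

# Replay side: waits (in minutes) actually logged for five cases. Made-up values.
observed_waits = [12.0, 30.0, 18.0, 45.0, 15.0]

def replay_mean(waits):
    """Summarize what really happened, using logged events only."""
    return mean(waits)

def simulate_waits(mean_wait, n, seed):
    """Simulation side: draw n pseudo-random waits from an exponential
    distribution with the given mean (an assumed distribution)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.expovariate(1.0 / mean_wait) for _ in range(n)]

observed_mean = replay_mean(observed_waits)
simulated = simulate_waits(mean_wait=observed_mean, n=1000, seed=42)

print(observed_mean)
print(mean(simulated))
```

The simulated waits depend on the chosen distribution and its parameters, so they are only as good as that choice, whereas the replayed figure is tied to events that occurred.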
Regardless of the specific objective of a process mining initiative, without time data as to when events occurred, it is impossible to derive a relative ordering for the events in question, and hence a process model cannot be arrived at, making everything else moot.
All of this is independent of the methodology or approach used to do modeling (e.g., Petri nets, BPMN) or perform analyses (e.g., Six Sigma), and of the platforms and tools employed in doing so.
A takeaway for anyone wishing to embark on the path of process improvement (PI) is that it is never too soon to start collecting time-related data about their workflow. Of course, the eventual PI goal (and over time there could be many of these) may become clear only at some later point in this continuum of data collection. Enough data of the right kind, meeting the right standards (e.g., accuracy), and at the correct level of granularity must therefore be recorded and safeguarded through proper data stewardship so that they will be useful for a broad spectrum of analyses later, including evidence-based audits. Imagine having years’ worth of data on patients’ visits, say, but recorded only to show on which day they came in for their check-up. Clearly, with a granularity that coarse, one would be severely limited as to the scope of time and performance studies, as well as audits. Specifically, it would be impossible to analyze the events and related workflows affecting patients who came in on the same day.
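The effect of coarse granularity is easy to demonstrate. In the sketch below, the same made-up events can be put in an unambiguous order when time stamps are kept to the minute, but not when only the day is kept. The event data and the helper name are hypothetical.

```python
from datetime import datetime

# Hypothetical events for two patients seen on the same day (made-up values).
events = [
    ("patient-A", "check-in", datetime(2024, 3, 1, 9, 0)),
    ("patient-B", "check-in", datetime(2024, 3, 1, 9, 5)),
    ("patient-B", "blood draw", datetime(2024, 3, 1, 9, 20)),
    ("patient-A", "blood draw", datetime(2024, 3, 1, 9, 30)),
]

def ordering_is_unambiguous(events, truncate):
    """True if every event keeps a distinct time key after truncation,
    so the events can still be put in a single definite order."""
    keys = [truncate(timestamp) for _, _, timestamp in events]
    return len(set(keys)) == len(keys)

minute_level = ordering_is_unambiguous(events, lambda ts: ts)
day_level = ordering_is_unambiguous(events, lambda ts: ts.date())

print(minute_level, day_level)
```

With day-level recording all four events collapse onto the same key, so there is no way to tell who checked in first or how long anyone waited for the blood draw.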
In all this, one should keep in mind that one can only analyze what one has logged or discovered. This is by necessity a partial, limited view of reality. What about those events that have not happened but could, as opposed to those that haven’t because they cannot? This is a difficult distinction to make in the absence of certain information, and information is always incomplete. Recall the earlier comment about enhancing models as more events are logged. The main takeaway here is that one is always at least one layer removed from reality and looking at it through the ‘peephole’ of limited observations, hence any conclusions drawn must account for it.
Finally, while there is some overlap between data mining and process mining, the former focuses on finding patterns and correlations in data, while the latter focuses on events and the discovery and analysis of fitting time-based models, perhaps the most powerful enabler of PI-oriented analysis.