PRETSA-Algorithm Family

This repository provides the implementations of the PRETSA algorithm family: algorithms that generate privatized event logs complying with k-anonymity and t-closeness. These event logs can be used for process mining. We provide an implementation of PRETSA in Python 3. Our code is available under the MIT license. If you use it for academic purposes, please cite our paper:

@article{DBLP:journals/dke/FahrenkrogPetersenAW23,
  author       = {Stephan A. Fahrenkrog{-}Petersen and
                  Han van der Aa and
                  Matthias Weidlich},
  title        = {Optimal event log sanitization for privacy-preserving process mining},
  journal      = {Data Knowl. Eng.},
  volume       = {145},
  pages        = {102175},
  year         = {2023},
  url          = {https://doi.org/10.1016/j.datak.2023.102175},
  doi          = {10.1016/J.DATAK.2023.102175},
  timestamp    = {Sun, 25 Jun 2023 22:03:53 +0200},
  biburl       = {https://dblp.org/rec/journals/dke/FahrenkrogPetersenAW23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

You can access the corresponding research paper here: https://doi.org/10.1016/j.datak.2023.102175

Requirements

To run our algorithm you need the following Python packages:

  • pandas

We only ran our algorithm with Python 3, so we cannot guarantee that it works with Python 2.

How to run PRETSA

The PRETSA algorithm itself is implemented in the file pretsa.py. To run the algorithm, you first have to instantiate the Pretsa class and hand over an event log represented as a pandas DataFrame:

import pandas as pd
from pretsa import Pretsa  # the Pretsa class is defined in pretsa.py
eventLog = pd.read_csv(filePath, delimiter=";")
pretsa = Pretsa(eventLog)

As a next step, you run the PRETSA algorithm with your chosen k-anonymity (an integer) and t-closeness (a float) parameters. The algorithm returns the cases that have been modified:

cutOutCases = pretsa.runPretsa(k,t)

Note that the k-anonymity guarantee becomes stronger with a higher value of k, while t-closeness takes values between 0.0 and 1.0, with lower values giving a stronger privacy guarantee.

Finally, we can retrieve our privatized event log as a pandas DataFrame:

privateEventLog = pretsa.getPrivatisedEventLog()

Please note that your original event log must contain at least the following attributes (column names) so that PRETSA can process it:

  • Case Id
  • Activity
  • Duration
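
For illustration, a minimal input log with exactly these columns, using the semicolon delimiter from the read_csv example above, could look as follows (the cases, activities, and duration values are made up):

Case Id;Activity;Duration
1;Register;0
1;Examine;20
1;Release;5
2;Register;0
2;Release;10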

If you want to use different attribute column names, you can change the following variables in pretsa.py:

  • caseIDColName
  • activityColName
  • annotationColName
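
Putting the steps above together, a minimal end-to-end sketch could look as follows. The file names and the values for k and t are illustrative, and we assume that pretsa.py is importable from the working directory:

import pandas as pd
from pretsa import Pretsa

# Illustrative input path; replace with your own event log.
filePath = "original_log.csv"
eventLog = pd.read_csv(filePath, delimiter=";")

# Run PRETSA with example privacy parameters k=4 and t=0.2.
pretsa = Pretsa(eventLog)
cutOutCases = pretsa.runPretsa(4, 0.2)
print("Cases modified by PRETSA:", cutOutCases)

# Retrieve the privatized log and store it with the same delimiter.
privateEventLog = pretsa.getPrivatisedEventLog()
privateEventLog.to_csv("original_log_pretsa.csv", sep=";", index=False)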

How to repeat our experiments

In this section, we describe how we conducted the experiments for our ICPM 2019 submission:

First we generated the duration annotation with the following script:

python add_annotation_duration.py <fileName> <dataset>

Next, by running the script runPretsa.py, we generated the event logs produced by PRETSA for chosen parameters k and t:

python runPretsa.py <fileName> <k> <t>

To generate privatized event logs with our baseline approach, we run the script generate_baseline_log.py:

python generate_baseline_log.py <fileName> <k> <t>

To compare the fitness and precision of the event logs, we used ProM. To calculate the statistics of an event log (e.g., the number of variants in the log), we run the script calculateDatasetStatistics.py. Alternatively, we can run the scripts calculateBaselineEventLogStatistics.py and calculatePRETSAEventLogStatistics.py to save the number of variants in the baseline/PRETSA event logs into a CSV file:

python calculateDatasetStatistics.py <fileName>
python calculateBaselineEventLogStatistics.py <dictName>
python calculatePRETSAEventLogStatistics.py <dictName>

To calculate the annotation error, we used the scripts calculateAnnotationsEventLog_baseline.py and calculateAnnotationsEventLog_pretsa.py to compute the average annotation values for the privatized event logs:

python calculateAnnotationsEventLog_baseline.py <dictName>
python calculateAnnotationsEventLog_pretsa.py <dictName>

With generateAnnotationOriginalDataset.py we generate the statistics of the original event logs:

python generateAnnotationOriginalDataset.py <dictName>

Finally we run calculateAnnotationError.py to calculate the relative error of the annotations for each activity:

python calculateAnnotationError.py <dictName>
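
For convenience, the whole pipeline for a single dataset can be chained into one driver script. The following sketch simply invokes the commands listed above via subprocess; the file name, dataset identifier, result directory, and the values for k and t are placeholders rather than part of the repository:

import subprocess

# Placeholder values; replace them with your own event log and output locations.
fileName = "example_log.csv"
dataset = "example"
dictName = "example_results"
k, t = 4, 0.2

steps = [
    ["python", "add_annotation_duration.py", fileName, dataset],
    ["python", "runPretsa.py", fileName, str(k), str(t)],
    ["python", "generate_baseline_log.py", fileName, str(k), str(t)],
    ["python", "calculateDatasetStatistics.py", fileName],
    ["python", "calculateBaselineEventLogStatistics.py", dictName],
    ["python", "calculatePRETSAEventLogStatistics.py", dictName],
    ["python", "calculateAnnotationsEventLog_baseline.py", dictName],
    ["python", "calculateAnnotationsEventLog_pretsa.py", dictName],
    ["python", "generateAnnotationOriginalDataset.py", dictName],
    ["python", "calculateAnnotationError.py", dictName],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # abort the pipeline if any step fails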

PRETSA*/BF-PRETSA

Furthermore, this repository contains implementations of improved versions of the PRETSA algorithm, currently under review as a journal extension. These algorithms are:

  • PRETSA* -> An algorithm that guarantees optimal event log sanitization through the application of A* search
  • BF-PRETSA -> An algorithm using best-first search

How to repeat our experiments

In this section, we describe how we conducted the experiments for our journal extension:

python startExperimentsForJournalExtension_<algorithmName>.py <filePath>

This starts the parallel execution of all anonymization settings for the algorithm specified in the script name. Please note that this starts 25 processes at the same time, all of which potentially need intensive computational resources. Therefore, we recommend only executing these scripts on a powerful server.

The evaluation metrics can be derived by running the following scripts:

python getResultsJournalExtension_<evaluation_metric>.py <dirPath> <dataset> 

How to contact us

PRETSA was developed at the Process-driven Architecture group of Humboldt-Universität zu Berlin. If you want to contact us, just send us a mail at: fahrenks || hu-berlin.de

Find out more about our research

If you want to find out more about our research, you can visit the following website: https://sites.google.com/view/sfahrenkrog-petersen/home
