commit c060e88996
parent 243b9df134
Author: Jean-Marie 'Histausse' Mineau
Date: 2025-07-29 16:23:42 +02:00
Signed by: histausse (GPG key ID: B66AEEDA9B645AD2)
17 changed files with 264 additions and 96 deletions


@@ -9,10 +9,10 @@ On the contrary, we take as hypothesis that the provided tools compute the inten
This chapter intends to show that sharing the software artifacts of a paper may not be sufficient to ensure that the provided software would be reusable.
Thus, our contributions are the following.
-We carefully retrieved static analysis tools for Android applications that were selected by Li #etal@Li2017 between 2011 and 2017.
+We carefully retrieved static analysis tools for Android applications that were selected by Li #etal~@Li2017 between 2011 and 2017.
We contacted the authors, whenever possible, to select the best candidate versions and to confirm the proper usage of the tools.
We rebuilt the tools in their original environments and we plan to share our Docker images with this paper.
-We evaluated the reusability of the tools by measuring the number of successful analyses of applications taken from the Drebin dataset@Arp2014 and from a custom dataset that contains more recent applications (#NBTOTALSTRING in total).
+We evaluated the reusability of the tools by measuring the number of successful analyses of applications taken from the Drebin dataset~@Arp2014 and from a custom dataset that contains more recent applications (#NBTOTALSTRING in total).
The observation of the success or failure of these analyses enables us to answer the following research questions:
/ RQ1: Which Android static analysis tools that are more than 5 years old are still available and can be reused, with a reasonable effort, without crashing?


@@ -12,27 +12,27 @@
We review in this section the existing contributions related to the reusability of static analysis tools.
Several papers have reviewed Android analysis tools produced by researchers.
-Li #etal@Li2017 published a systematic literature review for Android static analysis before May 2015.
+Li #etal~@Li2017 published a systematic literature review for Android static analysis before May 2015.
They analyzed 92 publications and classified them by goal, method used to solve the problem and underlying technical solution for handling the bytecode when performing the static analysis.
In particular, they listed 27 approaches with an open-source implementation available.
Nevertheless, no experiments were performed to evaluate the reusability of the software they pointed out.
We believe that the effort of reviewing the literature to build a comprehensive overview of available approaches should be pushed further: a published approach whose software cannot be used for technical reasons endangers both the reproducibility and the reusability of research.
As we saw in @sec:bg-datasets, the need for a ground truth to test analysis tools means that test datasets are often handcrafted.
-The few datasets composed of real-world applications confirmed that some tools such as Amandroid@weiAmandroidPreciseGeneral2014 and FlowDroid@Arzt2014a are less efficient on real-world applications@bosuCollusiveDataLeak2017 @luoTaintBenchAutomaticRealworld2022.
+The few datasets composed of real-world applications confirmed that some tools such as Amandroid~@weiAmandroidPreciseGeneral2014 and FlowDroid~@Arzt2014a are less efficient on real-world applications~@bosuCollusiveDataLeak2017 @luoTaintBenchAutomaticRealworld2022.
Unfortunately, those real-world application datasets are rather small, and a larger number of applications would be more suitable for our goal, #ie evaluating the reusability of a variety of static analysis tools.
-Pauck #etal@pauckAndroidTaintAnalysis2018 used DroidBench@@Arzt2014a, ICC-Bench@weiAmandroidPreciseGeneral2014 and DIALDroid-Bench@@bosuCollusiveDataLeak2017 to compare Amandroid@weiAmandroidPreciseGeneral2014, DIAL-Droid@bosuCollusiveDataLeak2017, DidFail@klieberAndroidTaintFlow2014, DroidSafe@DBLPconfndssGordonKPGNR15, FlowDroid@Arzt2014a and IccTA@liIccTADetectingInterComponent2015 -- all these tools will also be compared in this chapter.
+Pauck #etal~@pauckAndroidTaintAnalysis2018 used DroidBench~@Arzt2014a, ICC-Bench~@weiAmandroidPreciseGeneral2014 and DIALDroid-Bench~@bosuCollusiveDataLeak2017 to compare Amandroid~@weiAmandroidPreciseGeneral2014, DIAL-Droid~@bosuCollusiveDataLeak2017, DidFail~@klieberAndroidTaintFlow2014, DroidSafe~@DBLPconfndssGordonKPGNR15, FlowDroid~@Arzt2014a and IccTA~@liIccTADetectingInterComponent2015 -- all these tools will also be compared in this chapter.
To perform their comparison, they introduced the AQL (Android App Analysis Query Language) format.
AQL can be used as a common language to describe the computed taint flows as well as the expected results for the datasets.
-It is interesting to note that all the tested tools timed out at least once on real-world applications, and that Amandroid@weiAmandroidPreciseGeneral2014, DidFail@klieberAndroidTaintFlow2014, DroidSafe@DBLPconfndssGordonKPGNR15, IccTA@liIccTADetectingInterComponent2015 and ApkCombiner@liApkCombinerCombiningMultiple2015 (a tool used to combine applications) all failed to run on applications built for Android API 26.
+It is interesting to note that all the tested tools timed out at least once on real-world applications, and that Amandroid~@weiAmandroidPreciseGeneral2014, DidFail~@klieberAndroidTaintFlow2014, DroidSafe~@DBLPconfndssGordonKPGNR15, IccTA~@liIccTADetectingInterComponent2015 and ApkCombiner~@liApkCombinerCombiningMultiple2015 (a tool used to combine applications) all failed to run on applications built for Android API 26.
These results suggest that a more thorough study of the link between application characteristics (#eg date, size) and analysis failures should be conducted.
-Luo #etal@luoTaintBenchAutomaticRealworld2022 used the framework introduced by Pauck #etal to compare Amandroid@weiAmandroidPreciseGeneral2014 and FlowDroid@Arzt2014a on DroidBench and their own dataset TaintBench, composed of real-world Android malware.
+Luo #etal~@luoTaintBenchAutomaticRealworld2022 used the framework introduced by Pauck #etal to compare Amandroid~@weiAmandroidPreciseGeneral2014 and FlowDroid~@Arzt2014a on DroidBench and their own dataset TaintBench, composed of real-world Android malware.
They found out that those tools have a low recall on real-world malware, and are thus over-adapted to micro-datasets.
Unfortunately, because AQL is only focused on taint flows, we cannot use it to evaluate tools performing more generic analysis.
-A first work about quantifying the reusability of static analysis tools was proposed by Reaves #etal@reaves_droid_2016.
-Seven Android analysis tools (Amandroid@weiAmandroidPreciseGeneral2014, AppAudit@xiaEffectiveRealTimeAndroid2015, DroidSafe@DBLPconfndssGordonKPGNR15, Epicc@octeau2013effective, FlowDroid@Arzt2014a, MalloDroid@fahlWhyEveMallory2012 and TaintDroid@Enck2010) were selected to check if they were still readily usable.
+A first work about quantifying the reusability of static analysis tools was proposed by Reaves #etal~@reaves_droid_2016.
+Seven Android analysis tools (Amandroid~@weiAmandroidPreciseGeneral2014, AppAudit~@xiaEffectiveRealTimeAndroid2015, DroidSafe~@DBLPconfndssGordonKPGNR15, Epicc~@octeau2013effective, FlowDroid~@Arzt2014a, MalloDroid~@fahlWhyEveMallory2012 and TaintDroid~@Enck2010) were selected to check if they were still readily usable.
For each tool, both the usability and the results of the tool were evaluated by asking auditors to install it and use it on DroidBench and 16 real-world applications.
The auditors reported that most of the tools require a significant amount of time to set up, often due to dependency issues and operating system incompatibilities.
Reaves #etal propose to solve these issues by distributing a virtual machine with a functional build of the tool in addition to the source code.
@@ -41,7 +41,7 @@ Reaves #etal also report that real-world applications are more challenging to an
We will confirm and expand this result in this chapter with a dataset much larger than 16 real-world applications.
// Indeed, a more diverse dataset would assess the results and give more insight about the factors impacting the performances of the tools.
-Finally, our approach is similar to the methodology employed by Mauthe #etal for decompilers@mauthe_large-scale_2021.
+Finally, our approach is similar to the methodology employed by Mauthe #etal for decompilers~@mauthe_large-scale_2021.
To assess the robustness of Android decompilers, Mauthe #etal used 4 decompilers on a dataset of 40 000 applications.
The error messages of the decompilers were parsed to list the methods that failed to decompile, and this information was used to estimate the main causes of failure.
It was found that the failure rate is correlated with the size of the method, and that a substantial share of the failures comes from third-party libraries rather than from the core code of the application.


@@ -66,18 +66,18 @@
*documentation*: #okk: excellent, MWE, #ok: few inconsistencies, #bad: bad quality, #ko: not available\
*decision*: #ok: considered; #bad: considered but not built; #ko: out of scope of the study
]},
-caption: [Considered tools@Li2017: availability and usage reliability],
+caption: [Considered tools~@Li2017: availability and usage reliability],
) <tab:rasta-tools>
-We collected the static analysis tools from@Li2017, plus one additional paper encountered during our review of the state of the art (DidFail@klieberAndroidTaintFlow2014).
+We collected the static analysis tools from~@Li2017, plus one additional paper encountered during our review of the state of the art (DidFail~@klieberAndroidTaintFlow2014).
They are listed in @tab:rasta-tools, with the original release date and associated paper.
-We intentionally limited the collected tools to the ones selected by Li #etal@Li2017 for several reasons.
+We intentionally limited the collected tools to the ones selected by Li #etal~@Li2017 for several reasons.
First, not using recent tools ensures a gap of at least 5 years between the publication and the most recent APK files, which enables us to measure the reusability of previous contributions after a reasonable amount of time.
-Second, collecting new tools would require describing these tools in depth, similarly to what has been done by Li #etal@Li2017, which is not the primary goal of this paper.
+Second, collecting new tools would require describing these tools in depth, similarly to what has been done by Li #etal~@Li2017, which is not the primary goal of this paper.
Additionally, selection criteria such as the publication venue or the number of citations would be needed to select a subset of tools, which would require an additional methodology.
These possible contributions are left for future work.
-Some tools use hybrid analysis (both static and dynamic): A3E@DBLPconfoopslaAzimN13, A5@vidasA5AutomatedAnalysis2014, Android-app-analysis@geneiatakisPermissionVerificationApproach2015, StaDynA@zhauniarovichStaDynAAddressingProblem2015.
+Some tools use hybrid analysis (both static and dynamic): A3E~@DBLPconfoopslaAzimN13, A5~@vidasA5AutomatedAnalysis2014, Android-app-analysis~@geneiatakisPermissionVerificationApproach2015, StaDynA~@zhauniarovichStaDynAAddressingProblem2015.
They have been excluded from this paper.
We manually searched for the tool repository when the website mentioned in the paper was no longer available (#eg when the repository had been migrated from Google Code to GitHub), and for each tool we searched for:
@@ -89,7 +89,7 @@ In @tab:rasta-tools we rated the quality of these artifacts with "#ok" when avai
Results show that documentation is often missing or of very poor quality (#eg Lotrack), which makes the rebuild process and the first analysis of a MWE very complex.
-We finally excluded Choi #etal@CHOI2014620, as their tool works on the sources of Android applications, and Poeplau #etal@DBLPconfndssPoeplauFBKV14, which focuses on Android hardening.
+We finally excluded Choi #etal~@CHOI2014620, as their tool works on the sources of Android applications, and Poeplau #etal~@DBLPconfndssPoeplauFBKV14, which focuses on Android hardening.
In summary, we end up with #nbtoolsselected tools to compare.
Some specificities should be noted.
The IC3 tool will be duplicated in our experiments because two versions are available: the original version from the authors and a fork used by other tools like IccTA.
@@ -255,11 +255,11 @@ Problem 2: for sampling, we use the APK size deciles, but for our
*/
Two datasets are used in the experiments of this section.
-The first one is *Drebin*@Arp2014, from which we extracted the malware part (5479 samples that we could retrieve) for comparison purposes only.
-It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases@Pendlebury2018.
+The first one is *Drebin*~@Arp2014, from which we extracted the malware part (5479 samples that we could retrieve) for comparison purposes only.
+It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases~@Pendlebury2018.
We intend to compare the rate of success on this old dataset with the rate obtained on a more recent one.
We built the second one, *Rasta*, to cover all years from 2010 to 2023.
-This dataset is a random extract of Androzoo@allixAndroZooCollectingMillions2016, for which we balanced applications across years and sizes.
+This dataset is a random extract of Androzoo~@allixAndroZooCollectingMillions2016, for which we balanced applications across years and sizes.
For each year and each inter-decile size range in Androzoo, 500 applications were extracted, with an arbitrary proportion of 7% of malware.
This ratio was chosen because it is the proportion of malware that we observed when performing a raw extract of Androzoo.
For checking the maliciousness of an Android application we rely on the VirusTotal detection indicators.
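To make the sampling procedure concrete, the sketch below reproduces a per-year, per-size-decile stratified draw with a 7% malware proportion. It is a minimal illustration only: the record layout (`year`, `apk_size`, `vt_detection`), the helper names and the VirusTotal threshold of 5 detections are our assumptions, not the exact implementation used for the Rasta dataset.

```python
import random
from collections import defaultdict

PER_STRATUM = 500     # applications per (year, size-decile) stratum
MALWARE_RATIO = 0.07  # proportion observed in a raw Androzoo extract
VT_THRESHOLD = 5      # assumption: >= 5 VirusTotal detections => malware

def decile_bounds(sizes):
    """Nine inner decile boundaries of the APK size distribution."""
    s = sorted(sizes)
    return [s[len(s) * k // 10] for k in range(1, 10)]

def stratum_of(size, bounds):
    """Index (0..9) of the inter-decile range a size falls into."""
    return sum(size > b for b in bounds)

def sample_rasta(index, seed=42):
    """index: iterable of dicts with 'year', 'apk_size', 'vt_detection'."""
    rng = random.Random(seed)
    bounds = decile_bounds([r["apk_size"] for r in index])
    strata = defaultdict(lambda: {"mal": [], "good": []})
    for r in index:
        key = (r["year"], stratum_of(r["apk_size"], bounds))
        kind = "mal" if r["vt_detection"] >= VT_THRESHOLD else "good"
        strata[key][kind].append(r)
    n_mal = round(PER_STRATUM * MALWARE_RATIO)  # 35 malware per stratum
    picked = []
    for pools in strata.values():
        picked += rng.sample(pools["mal"], min(n_mal, len(pools["mal"])))
        picked += rng.sample(pools["good"],
                             min(PER_STRATUM - n_mal, len(pools["good"])))
    return picked
```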


@@ -24,7 +24,7 @@
They represent the success/failure rate (green/orange) of the tools.
We distinguished failure to compute a result from timeouts (blue) and from crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
Because they may be caused by a bug in our own analysis stack, exit statuses represented in grey (Other) are considered as unknown errors and not as failures of the tool.
-#todo[We discuss further errors for which we have information in the logs in /*@*/sec:rasta-failure-analysis.]
+#todo[We discuss further errors for which we have information in the logs in @sec:rasta-failure-analysis.]
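For clarity, the mapping from a run's outcome to these categories can be sketched as follows; the three input signals are assumptions about what our harness records, not a documented interface.

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "green"
    FAILURE = "orange"
    TIMEOUT = "blue"
    OTHER = "grey"  # unknown error, possibly our own framework

def classify(exit_code: int, timed_out: bool, container_killed: bool) -> Outcome:
    """Classify one analysis run (signals are hypothetical harness fields)."""
    if timed_out:
        return Outcome.TIMEOUT
    if container_killed:      # e.g. OOM kill of the container itself
        return Outcome.OTHER  # not counted as a failure of the tool
    return Outcome.SUCCESS if exit_code == 0 else Outcome.FAILURE
```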
Results on the Drebin dataset show that 11 tools have a high success rate (greater than 85%).
The other tools have poor results.


@@ -331,13 +331,13 @@ Our attempts to upgrade those dependencies led to new errors appearing: we concl
=== State-of-the-art comparison
-Luo #etal released TaintBench@luoTaintBenchAutomaticRealworld2022, a real-world benchmark, and the associated recommendations to build such a benchmark.
+Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world benchmark, and the associated recommendations to build such a benchmark.
These benchmarks confirmed that some tools such as Amandroid and FlowDroid are less efficient on real-world applications.
// Pauck #etal@pauckAndroidTaintAnalysis2018
// Reaves #etal@reaves_droid_2016
-We finally compare our results to the conclusions and discussions of previous papers@luoTaintBenchAutomaticRealworld2022 @pauckAndroidTaintAnalysis2018 @reaves_droid_2016.
-First, we confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analyses than hand-crafted test applications or old datasets@luoTaintBenchAutomaticRealworld2022.
+We finally compare our results to the conclusions and discussions of previous papers~@luoTaintBenchAutomaticRealworld2022 @pauckAndroidTaintAnalysis2018 @reaves_droid_2016.
+First, we confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analyses than hand-crafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
Even if Drebin is not hand-crafted, it is quite old, and we obtained much better results on it than on the Rasta dataset.
For real-world applications, the size is rather different from that of hand-crafted applications, which impacts the success rate.
We believe this is explained by the fact that the complexity of the code increases with its size.
@@ -354,10 +354,10 @@ We believe this is explained by the fact that the complexity of the code incr
=== State-of-the-art comparison
-Our findings are consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of DIALDroid-Bench@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully by the 6 evaluated tools@pauckAndroidTaintAnalysis2018.
+Our findings are consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully by the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
-We confirmed that most tools require a significant amount of work to get them running@reaves_droid_2016.
+We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.
Our investigation of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTA, as already identified by Pauck #etal.
/*


@@ -3,7 +3,7 @@
== Conclusion <sec:rasta-conclusion>
-This paper has assessed the results suggested by the literature@luoTaintBenchAutomaticRealworld2022 @pauckAndroidTaintAnalysis2018 @reaves_droid_2016 about the reliability of static analysis tools for Android applications.
+This paper has assessed the results suggested by the literature~@luoTaintBenchAutomaticRealworld2022 @pauckAndroidTaintAnalysis2018 @reaves_droid_2016 about the reliability of static analysis tools for Android applications.
With a dataset of #NBTOTALSTRING applications, we established that #resultunusable of #nbtoolsselectedvariations tools are not reusable, considering a tool unusable when it fails more than 50% of the time.
In total, the analysis success rate of the tools that we could run on the entire dataset is #resultratio.
The characteristics that have the most influence on the success rate are the bytecode size and the min SDK version. Finally, we showed that malware APKs have a better completion rate than goodware.
@@ -11,4 +11,4 @@ The characteristics that have the most influence on the success rate are the byte
In future work, we plan to investigate the errors reported by the tools more deeply, in order to analyze the most common types of errors, in particular for Java-based tools.
We also plan to extend this work with a selection of more recent tools performing static analysis.
-Following the recommendations of Reaves #etal@reaves_droid_2016, we publish the Docker and Singularity images we built to run our experiments, alongside the Dockerfiles. This will allow the research community to use the tools directly, without the build and installation penalty.
+Following the recommendations of Reaves #etal~@reaves_droid_2016, we publish the Docker and Singularity images we built to run our experiments, alongside the Dockerfiles. This will allow the research community to use the tools directly, without the build and installation penalty.
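As a usage sketch, an analysis could then be launched directly from one of the published images along the following lines. The image name and the container's command-line interface are hypothetical; only the `docker run` flags are standard.

```python
import subprocess

# Hypothetical image name and entry point: the published images may be
# organized differently.
IMAGE = "rasta/flowdroid"

def run_tool(apk_path: str, timeout_s: int = 3600) -> int:
    """Run one analysis inside a prebuilt image, with a timeout."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{apk_path}:/input/app.apk:ro",  # mount the APK read-only
        IMAGE, "/input/app.apk",
    ]
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return -1  # timeout, distinguished from a tool failure
```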