fix 'typos' (yesss, they are definitely typos)

Jean-Marie 'Histausse' Mineau 2025-09-26 04:21:05 +02:00
parent fede0bd9b2
commit 0d87fae9da
Signed by: histausse
GPG key ID: B66AEEDA9B645AD2
11 changed files with 302 additions and 304 deletions


#pb1-text
In the past fifteen years, the research community has released many tools to detect or analyse malicious behaviours in applications.
The first step to answering this question is to list those previously published tools.
The number of publications related to static analysis can make it difficult to find the right tool for the right task.
Li #etal~@Li2017 published a systematic literature review for Android static analysis before May 2015.
They analysed 92 publications and classified them by goal, by the method used to solve the problem, and by the underlying technical solution for handling the bytecode when performing the static analysis.
In particular, they listed 27 approaches with an open-source implementation available.
Interestingly, many of the listed tools rely on common support tools to interact with Android applications and #DEX bytecode.
Recurring examples of such support tools are Apktool (#eg Amandroid~@weiAmandroidPreciseGeneral2014, Blueseal~@shenInformationFlowsPermission2014, SAAF~@hoffmannSlicingDroidsProgram2013), Androguard (#eg Adagio~@gasconStructuralDetectionAndroid2013, Apparecium~@titzeAppareciumRevealingData2015, Mallodroid~@fahlWhyEveMallory2012) or Soot (#eg Blueseal~@shenInformationFlowsPermission2014, DroidSafe~@DBLPconfndssGordonKPGNR15, Flowdroid~@Arzt2014a).
This strengthens our idea that being able to reuse previous tools is important.
Those tools are built incrementally, on top of each other.
Nevertheless, Li #etal did not perform experiments to evaluate the reusability of the software they listed.
#jfl-note[We believe that the effort of reviewing the literature to build a comprehensive overview of available approaches should be pushed further: a published approach whose software cannot be used for technical reasons endangers both the reproducibility and reusability of research.][Highlight this?]
//Data-flow analysis is the subject of many contribution~@weiAmandroidPreciseGeneral2014 @titzeAppareciumRevealingData2015 @bosuCollusiveDataLeak2017 @klieberAndroidTaintFlow2014 @DBLPconfndssGordonKPGNR15 @octeauCompositeConstantPropagation2015 @liIccTADetectingInterComponent2015, the most notable tool being Flowdroid~@Arzt2014a.
We will now explore this direction further by looking at the work that has been done to evaluate different analysis tools.
Works that perform benchmarks of tools follow a similar method.
They start by selecting a set of tools with similar goals.
Usually, those contributions compare existing tools to their own, but some do not introduce a new tool and instead focus on surveying the state of the art for a given technique.
They then select a dataset of applications to analyse.
We will see in @sec:bg-datasets that those datasets are often hand-crafted, except for some studies that select a few real-world applications and manually reverse engineer them to obtain a ground truth against which the tools' results can be compared.
Once the tools and the test dataset are selected, the tools are run on the application dataset and their results are compared to the ground truth to determine the accuracy of each tool.
Several factors can be considered to compare the results of the tools:
the number of false positives or false negatives, or even the time needed to finish the analysis.
Occasionally, the number of applications a tool simply failed to analyse is also compared.
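For reference, and using a standard formulation rather than one prescribed by any of the surveyed benchmarks, writing $TP$, $FP$ and $FN$ for the number of true positives, false positives and false negatives, the two metrics most commonly derived from these counts are:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}
\qquad
\text{recall} = \frac{TP}{TP + FN}
\]
```

A low recall therefore means that many of the flows present in the ground truth are missed by the tool.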
In @sec:bg-datasets, we will look at the datasets used in the community to compare analysis tools.
Then, in @sec:bg-bench, we will go through the contributions that benchmarked those tools #jm-note[to see whether they can indicate which tools can still be used today.][Highlight]
==== Application Datasets <sec:bg-datasets>
Research contributions often rely on existing datasets or provide new ones in order to evaluate the developed software.
Raw datasets such as Drebin@Arp2014 contain little information about the provided applications.
As a consequence, dataset suites have been developed to provide, in addition to the applications, meta information about the expected results.
For example, taint analysis datasets should provide the source and expected sink of a taint.
In some cases, the datasets are provided with additional software for automating part of the analysis.
One such dataset is DroidBench, which was released with the tool Flowdroid~@Arzt2014a.
Later, the dataset ICC-Bench was introduced with the tool Amandroid~@weiAmandroidPreciseGeneral2014 to complement DroidBench by introducing applications using Inter-Component data flows.
These datasets contain carefully crafted applications with flows that the tools should be able to detect.
These hand-crafted applications can also be used for testing purposes or to detect any regression when the software code evolves.
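For illustration, a DroidBench-style test case typically embeds a single, easily identifiable flow from a sensitive source to a sink. The following minimal sketch (the class name, phone number and choice of APIs are illustrative, not taken from the actual dataset) shows the general shape of such a leak:

```java
import android.app.Activity;
import android.os.Bundle;
import android.telephony.SmsManager;
import android.telephony.TelephonyManager;

// Requires the READ_PHONE_STATE and SEND_SMS permissions in the manifest.
public class LeakActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Source: a piece of sensitive data, here the device identifier.
        TelephonyManager tm = (TelephonyManager) getSystemService(TELEPHONY_SERVICE);
        String imei = tm.getDeviceId();
        // Sink: the sensitive data leaves the device, here through an SMS.
        SmsManager.getDefault().sendTextMessage("+1234567890", null, imei, null, null);
    }
}
```

A taint analysis tool is expected to report the flow from `getDeviceId()` (the source) to `sendTextMessage()` (the sink); the dataset's metadata records this expected flow so that a tool's output can be checked automatically.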
The drawback to using hand-crafted applications is that these datasets are not representative of real-world applications.
Contrary to DroidBench and ICC-Bench, some approaches use real-world applications.
Bosu #etal~@bosuCollusiveDataLeak2017 used DIALDroid to perform a threat analysis of Inter-Application communication and published DIALDroid-Bench, an associated dataset.
Similarly, Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world dataset, along with recommendations for building such datasets.
These datasets are useful for carefully spotting missing taint flows, but contain only a few dozen applications.
In addition to those datasets, AndroZoo~@allixAndroZooCollectingMillions2016 collects applications from several application marketplaces, including the Google Play store (the official Google application store), Anzhi and AppChina (two Chinese stores), and FDroid (a store dedicated to free and open source applications).
Currently, AndroZoo contains more than 25 million applications that researchers can download using the SHA256 hash of the application.
AndroZoo also provides additional information about the applications, such as the date the application was first detected by AndroZoo or the number of antivirus engines from VirusTotal that flagged the application as malicious.
In addition to providing researchers with easy access to real-world applications, AndroZoo makes it a lot easier to share datasets for reproducibility: instead of sharing hundreds of #APK files, the list of SHA256 hashes is enough.
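As a minimal sketch of what this looks like in practice, the snippet below downloads one application by its SHA256 hash, assuming AndroZoo's documented download endpoint; the API key (obtained upon registration) and the hash passed on the command line are placeholders:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AndroZooFetch {
    public static void main(String[] args) throws Exception {
        // Placeholders: the key is issued by the AndroZoo maintainers,
        // the hash identifies the application to download.
        String apiKey = System.getenv("ANDROZOO_API_KEY");
        String sha256 = args[0];
        URL url = new URL("https://androzoo.uni.lu/api/download?apikey="
                + apiKey + "&sha256=" + sha256);
        // Save the APK under its hash, mirroring how datasets are shared.
        try (InputStream in = url.openStream()) {
            Files.copy(in, Path.of(sha256 + ".apk"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

With such a script, sharing a dataset amounts to publishing the list of hashes: anyone with an AndroZoo API key can retrieve the exact same applications.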
==== Benchmarking <sec:bg-bench>
The few datasets composed of real-world applications confirmed that some tools, such as Amandroid~@weiAmandroidPreciseGeneral2014 and Flowdroid~@Arzt2014a, are less efficient on real-world applications~@bosuCollusiveDataLeak2017 @luoTaintBenchAutomaticRealworld2022.
Unfortunately, those real-world application datasets are rather small, and a larger number of applications would be more suitable for our goal, #ie evaluating the reusability of a variety of static analysis tools.
Pauck #etal~@pauckAndroidTaintAnalysis2018 used DroidBench~@Arzt2014a, ICC-Bench~@weiAmandroidPreciseGeneral2014 and DIALDroid-Bench~@bosuCollusiveDataLeak2017 to compare Amandroid~@weiAmandroidPreciseGeneral2014, DIAL-Droid~@bosuCollusiveDataLeak2017, DidFail~@klieberAndroidTaintFlow2014, DroidSafe~@DBLPconfndssGordonKPGNR15, FlowDroid~@Arzt2014a and IccTA~@liIccTADetectingInterComponent2015. //-- all these tools will also be compared in this chapter.
To perform their comparison, they introduced the AQL (Android App Analysis Query Language) format.
AQL can be used as a common language to describe the computed taint flows as well as the expected results for the datasets.
It is interesting to note that all the tested tools timed out at least once on real-world applications, and that Amandroid~@weiAmandroidPreciseGeneral2014, DidFail~@klieberAndroidTaintFlow2014, DroidSafe~@DBLPconfndssGordonKPGNR15, IccTA~@liIccTADetectingInterComponent2015 and ApkCombiner~@liApkCombinerCombiningMultiple2015 (a tool used to combine applications) all failed to run on applications built for Android API 26.
These results suggest that a more thorough study of the link between application characteristics (#eg date, size) and tool failures should be conducted.
Luo #etal~@luoTaintBenchAutomaticRealworld2022 used the framework introduced by Pauck #etal to compare Amandroid~@weiAmandroidPreciseGeneral2014 and Flowdroid~@Arzt2014a on DroidBench and their own dataset TaintBench, composed of real-world Android malware.
They found that those tools have a low recall on real-world malware and are thus over-adapted to micro-datasets.
Unfortunately, because AQL is only focused on taint flows, we cannot use it to evaluate tools performing more generic analysis.
A first work about quantifying the reusability of static analysis tools was proposed by Reaves #etal~@reaves_droid_2016.
Seven Android analysis tools (Amandroid~@weiAmandroidPreciseGeneral2014, AppAudit~@xiaEffectiveRealTimeAndroid2015, DroidSafe~@DBLPconfndssGordonKPGNR15, Epicc~@octeau2013effective, FlowDroid~@Arzt2014a, MalloDroid~@fahlWhyEveMallory2012 and TaintDroid~@Enck2010) were selected to check if they were still readily usable.
For each tool, both the usability and the results were evaluated by asking auditors to install it and use it on DroidBench and 16 real-world applications.
The auditors reported that most of the tools require a significant amount of time to set up, often due to dependency issues and operating system incompatibilities.
Reaves #etal proposed to solve these issues by distributing a Virtual Machine with a functional build of the tool in addition to the source code.
Regrettably, these Virtual Machines were not made available, preventing future researchers from taking advantage of the work done by the auditors.
Reaves #etal also reported that real-world applications are more challenging to analyse, with tools producing poorer results and taking more time and memory to run, sometimes to the point of not being able to complete the analysis.
Considering that this was observed on a dataset of only 16 real-world applications, this result is worrying.
A more diverse dataset would be needed to better assess the extent of the issue and give more insight into the factors impacting the performance of the tools.
//We will confirm and expand this result in @sec:rasta with a larger dataset than only 16 real-world applications.
Mauthe #etal presented an interesting methodology to assess the robustness of Android decompilers~@mauthe_large-scale_2021.
They used 4 decompilers on a dataset of 40 000 applications.
The error messages of the decompilers were parsed to list the methods that failed to decompile, and this information was used to estimate the main causes of failure.
It was found that the failure rate is correlated with the size of the method, and that a substantial share of failures comes from third-party libraries rather than the core code of the application.
They also concluded that malware is easier to decompile entirely, but has a higher failure rate, meaning that the malware samples that are hard to decompile are substantially harder to decompile than goodware.
#v(2em)
To summarise, Li #etal conducted a systematic literature review of static analysis for Android that listed 27 open-source tools.
However, they did not test those tools.
Reaves #etal did so for some of them and analysed the difficulty of using them.
They raised two major concerns about the use of Android static analysis tools.
First, they can be quite difficult to set up, and second, they appear to have difficulties analysing real-world applications.
This is problematic for a reverse engineer: not only do they need to invest a significant amount of work to set up a tool properly, but they also do not have any guarantee that the tool will actually manage to analyse the application they are investigating.
In @sec:rasta, we will try to set up the tools listed by Li #etal and test them on a large number of real-world applications to see which can still be used today.
We will also aim at identifying which characteristics of real-world applications make them harder to analyse.