In this chapter, we study the reusability of open source static analysis tools that appeared between 2011 and 2017, on a recent Android dataset.
The scope of our study is *not* to quantify if the output results are accurate to ensure reproducibility, because all the studied static analysis tools have different goals in the end.
On the contrary, we take the hypothesis that the provided tools compute the intended result, but may crash or fail to compute a result due to the evolution of the internals of an Android application, raising unexpected bugs during an analysis.
This chapter intends to show that sharing the software artefacts of a paper may not be sufficient to ensure that the provided software will be reusable.
Thus, our contributions are the following.
We carefully retrieved static analysis tools for Android applications that were selected by Li #etal~@Li2017 between 2011 and 2017.
#jm-note[Many of those tools were presented in @sec:bg-static.][Yes but not really, @sec:bg-static does not present the contributions in detail \ FIX: develop @sec:bg-static]
We contacted the authors whenever possible to select the best candidate versions and to confirm the correct usage of the tools.
We rebuilt the tools in their original environment and share our Docker images.#footnote[on Docker Hub as `histausse/rasta-<toolname>:icsr2024`]
We evaluated the reusability of the tools by measuring the number of successful analyses of applications taken from the Drebin dataset~@Arp2014 and from a custom dataset that contains more recent applications (#NBTOTALSTRING in total).
The observation of the success or failure of these analyses enables us to answer the following research questions:
/ RQ1: Which Android static analysis tools that are more than 5 years old are still available and can be reused, with a reasonable effort, without crashing? <rq-1>
/ RQ2: How has the reusability of tools evolved over time, especially when analysing applications that are more than 5 years away from the publication of the tool? <rq-2>
/ RQ3: Does the reusability of tools change when analysing goodware compared to malware? <rq-3>
/*
As a summary, the contributions of this chapter are the following:
*/
The chapter is structured as follows.
@sec:rasta-methodology presents the methodology employed to build our evaluation process, and @sec:rasta-xp gives the associated experimental results.
@sec:rasta-failure-analysis investigates the reasons behind the observed failures of some of the tools.
We then compare in @sec:rasta-soa-comp our results with the contributions presented in @sec:bg.
In @sec:rasta-reco, we give recommendations for tool development that we drew from our experience running our experiment.
Finally, @sec:rasta-limit lists the limits of our approach, @sec:rasta-futur presents further avenues that we did not have time to pursue, and @sec:rasta-conclusion concludes the chapter.

In this section, we describe our methodology to evaluate the reusability of Android static analysis tools.
@fig:rasta-methodo-collection and @fig:rasta-overview summarise our approach.
We collected tools listed as open source by Li #etal, checked if the tools were only using static analysis techniques, and selected the most recent version of the tool.
We then packaged the tools inside containers and checked our choices with the authors.
We then ran those tools on a large dataset that we sampled and collected the exit status of each run (whether the tool completed the analysis or not).
=== Collecting Tools
We collected the static analysis tools from~@Li2017, plus one additional paper encountered during our review of the state-of-the-art (DidFail~@klieberAndroidTaintFlow2014).
They are listed in @tab:rasta-tools, with the original release date and associated publication.
We intentionally limited the collected tools to the ones selected by Li #etal~@Li2017 for several reasons.
First, not using recent tools ensures a gap of at least 5 years between the publication and the most recent APK files, which enables measuring the reusability of previous contributions over a reasonable period of time.
Second, collecting new tools would require inspecting these tools in depth, similarly to what has been performed by Li #etal~@Li2017, which is not the primary goal of this chapter.
Additionally, selection criteria such as the publication venue or number of citations would be necessary to select a subset of tools, which would require an additional methodology.
Some tools use hybrid analysis (both static and dynamic): A3E~@DBLPconfoopslaAzimN13, A5~@vidasA5AutomatedAnalysis2014, Android-app-analysis~@geneiatakisPermissionVerificationApproach2015, StaDynA~@zhauniarovichStaDynAAddressingProblem2015.
They have been excluded from this study.
We manually searched for the tool repository when the website mentioned in the paper is no longer available (#eg when the repository has been migrated from Google Code to GitHub), and for each tool we searched for:
- an optional binary version of the tool that would be usable as a fallback (if the sources cannot be compiled for any reason).
- the source code of the tool.
- the documentation for building and using the tool with an #MWE.
In @tab:rasta-tools, we rated the quality of these artifacts with "#ok" when available but with some inconsistencies, "#bad" when too many inconsistencies (inaccurate remarks about the sources, dead links or missing parts) were found, "#ko" when no documentation was found, and a double "#okk" for the documentation when it covers all our expectations (building process, usage, #MWE).
Results show that documentation is often missing or very poor (#eg Lotrack), which makes the rebuild process and the first analysis of an #MWE very complex.
We finally excluded Choi #etal~@CHOI2014620, as their tool works on the sources of Android applications, and Poeplau #etal~@DBLPconfndssPoeplauFBKV14, who focus on Android hardening.
In the end, we have #nbtoolsselected tools to compare.
Some specificities should be noted.
The IC3 tool will be duplicated in our experiments because two versions are available: the original version of the authors and a fork used by other tools like IccTa.
For Androguard, the default task consists of unpacking the bytecode, the resources, and the Manifest.
Cross-references are also built between methods and classes.
Because such a task is relatively simple to perform, we decided to duplicate this tool and ask Androguard to decompile an APK and create a control flow graph of the code using its decompiler: DAD.
We refer to this variant of usage as androguard_dad.
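For reference, the default Androguard task described above can be reproduced with a few lines of Python using Androguard's documented `AnalyzeAPK` helper. The snippet below is only a sketch of the kind of processing involved (Androguard 3.x API; the exact imports and the APK name are placeholders), not the command line used in our experiments.

```python
# Minimal sketch of Androguard's "default task": unpack the APK, the resources and
# the Manifest, and build cross-references between classes and methods.
# (Androguard 3.x API; the exact imports may differ between versions.)
from androguard.misc import AnalyzeAPK

a, d, dx = AnalyzeAPK("example.apk")  # APK object, DEX objects, Analysis object
print(a.get_package())                # information parsed from the Manifest
print(len(list(dx.get_methods())))    # methods with cross-references resolved
```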
For Thresher and Lotrack, because these tools cannot be built, we excluded them from experiments.
Finally, starting with #nbtools tools of @tab:rasta-tools, with the two variations of IC3 and Androguard, we have in total #nbtoolsselectedvariations static analysis tools to evaluate, of which two tools cannot be built and will be considered as always failing.
=== Source Code Selection and Building Process <sec:rasta-src-select>
In a second step, we explored the best sources to be selected among the possible forks of a tool.
We reported some indicators about the explored forks and our decision about the selected one in @tab:rasta-sources.
For each source code repository called "Origin", we reported in @tab:rasta-sources the number of GitHub stars attributed by users, and we mentioned if the project is still alive (#ok in column Alive when a commit exists in the last two years).
Then, we analysed the fork tree of the project.
We searched recursively for any forked repository that contains a more recent commit than the last one of the branch mentioned in the documentation of the original repository.
If such a commit is found (the number of such commits is reported in column Alive Forks Nb), we manually looked at the reasons behind this commit and considered whether we should prefer this more up-to-date repository instead of the original one (column "Alive Forks Usable").
As reported in @tab:rasta-sources, we excluded all forks, except IC3, for which we selected the fork JordanSamhi/ic3, because they always contain experimental code with no guarantee of stability.
For example, a fork of Aparecium contains a port for Windows 7, which does not suggest an improvement in the stability of the tool.
For IC3, the fork seems promising: it has been updated to be usable on a recent operating system (Ubuntu 22.04 instead of Ubuntu 12.04 for the original version) and is used as a dependency by IccTa.
We decided to keep these two versions of the tool (IC3 and IC3_fork) to compare their results.
Then, we self-allocated a maximum of four days for each tool to successfully read and follow the documentation, compile the tool and obtain the expected result when executing an analysis of an #MWE.
We sent an email to the authors of each tool to confirm that we used the most suitable version of the code, that the command line we used to analyse an application is the most suitable one and, in some cases, requested some help to solve issues in the building process.
We reported in @tab:rasta-sources the authors who answered our request and confirmed our decisions.
From this building phase, several observations can be made.
Using a recent operating system, it is almost impossible in a reasonable amount of time to rebuild a tool released years ago.
Too many dependencies, even for Java-based programs, trigger compilation or execution problems.
Thus, if the documentation mentions a specific operating system, we use a Docker image of this OS.
// For example, Dare is a dependency of several tools (Didfail, IC3) and depends on 32-bit libraries such as lib32stdc++ and ia32-libs.
// Those libraries are only available on Ubuntu 12 or previous versions.
//
Most of the time, tools require additional external components to be fully functional.
These components can be resources such as the `android.jar` file for each version of the #SDK, a database, additional libraries or tools.
Depending on the quality of the documentation, setting up those components can take hours to days.
This is why we automated in a Dockerfile the setup of the environment in which the tool is built and run#footnote[
#set list(indent: 1em) // avoid having the bullet align with the footnote numbering
To guarantee reproducibility, we published the results, datasets, Dockerfiles and containers:
- https://github.com/histausse/rasta .
- https://zenodo.org/records/10144014 .
- https://zenodo.org/records/10980349 .
edges: (
//"ST": ("Drop0": align(center, block(inset: 1em)[Tool no longer\ available])),
"TS": ("Drop1": align(center, block(inset: 1em)[Uses Dynamic\ Analysis])),
"Pack": ("Drop2": align(center, block(inset: 1em)[Could Not Set Up\ in 4 days])),
),
width: 100%,
alt: "",
caption: [Tool selection methodology overview],
) <fig:rasta-methodo-collection>
@fig:rasta-methodo-collection summarises our tool selection process.
We first looked for the tools listed as open source by Li #etal.
For the tools still available, we checked if they used dynamic analysis and removed them.
We then checked if there were more recent updates of the tools and selected the most relevant version.
Finally, we marked as non-reusable the tools that we could not set up within a period of 4 days, even with the help of the authors.
=== Runtime Conditions
caption: [Experiment methodology overview],
) <fig:rasta-overview>
As shown in @fig:rasta-overview, before benchmarking the tools, we built and installed them in Docker containers to facilitate any reuse by other researchers.
We converted them into Singularity containers because we had access to an #HPC cluster and because this technology is often used by the #HPC community to ensure the reproducibility of experiments.
//The Docker container allows a user to interact more freely with the bundled tools.
//Then, we converted this image to a Singularity image.
We performed manual tests using these Singularity images to check:
- the location where the tool writes on the disk.
For the best performances, we expect the tools to write on a mount point backed by an SSD.
Some tools may write data at unexpected locations, which required small patches from us.
- the amount of memory allocated to the tool.
We checked that the tool could run an #MWE with a #ramlimit limit of RAM.
- the network connection opened by the tool, if any.
We expect the tool not to perform any network operation, such as the download of Android #SDKs.
Thus, we prepared the required files and cached them in the images during the building phase.
In a few cases, we patched the tool to disable the download of resources.
A campaign of tests consists of executing the #nbtoolsvariationsrun selected tools on all #APKs of a dataset.
The constraints applied to the clusters are:
- No network connection is authorised to limit any execution of malicious software.
- The allocated RAM for a task is #ramlimit.
- The allocated maximum time is 1 hour.
- The allocated object space/stack space is 64 GB / 16 GB if the tool is a Java-based program.
For the disk files, we use a mount point that is stored on an SSD disk, with no particular size limit.
Note that, because the allocation of #ramlimit could be insufficient for some tools, we evaluated the results of the tools on 20% of our dataset (described later in @sec:rasta-dataset) with 128 GB of RAM and #ramlimit of RAM and checked that the results were similar.
With this confirmation, we continued our evaluations with #ramlimit of RAM only.
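To make these runtime conditions concrete, the sketch below shows the general shape of a driver that launches one analysis and records its outcome. It is a simplified, hypothetical version (the container command, image name and resource enforcement by the cluster scheduler are placeholders), not the exact scripts we ran.

```python
import subprocess

TIMEOUT_S = 3600  # 1 hour wall-clock limit per analysis

def run_analysis(container_cmd: list[str], apk_path: str) -> str:
    """Run one tool on one APK and classify the outcome of the run.

    The RAM limit and the network isolation are assumed to be enforced by the
    cluster scheduler; for Java-based tools, the 64 GB object space and 16 GB
    stack space would typically be passed as JVM options (-Xmx / -Xss) inside
    the container.
    """
    try:
        proc = subprocess.run(
            container_cmd + [apk_path],
            capture_output=True, text=True, timeout=TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return "TIMEOUT"  # the analysis was killed after 1 hour
    # A zero exit code counts as a finished analysis, anything else as a failure.
    return "FINISHED" if proc.returncode == 0 else "FAILED"

# Hypothetical usage with a Singularity image:
# status = run_analysis(["singularity", "run", "rasta-androguard.sif"], "app.apk")
```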
Two datasets are used in the experiments of this section.
The first one is *Drebin*~@Arp2014, from which we extracted the malware part (5479 samples that we could retrieve) for comparison purposes only.
It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases~@Pendlebury2018.
We intend to compare the rate of success on this old dataset with a more recent one.
We built the second one, *Rasta*, to cover all dates between 2010 and 2023.
This dataset is a random extract of Androzoo~@allixAndroZooCollectingMillions2016, for which we balanced applications between years and size.
For each year and inter-decile range of size in Androzoo, 500 applications have been extracted with an arbitrary proportion of 7% of malware.
This ratio has been chosen because it is the ratio of goodware/malware that we observed when performing a raw extract of Androzoo.
For checking the maliciousness of an Android application, we rely on the VirusTotal detection indicators.
If more than 5 antiviruses have flagged the application as malicious, we consider it malware.
If no antivirus has reported the application as malicious, we consider it goodware.
Applications in between are dropped.
For computing the release date of an application, we contacted the authors of Androzoo to compute the minimum date between the submission to Androzoo and the first upload to VirusTotal.
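The selection rules above can be summarised by the following sketch; the function and field names are hypothetical and only illustrate the labelling and dating logic, not our actual pipeline.

```python
from datetime import datetime
from typing import Optional

def label(vt_detections: int) -> Optional[str]:
    """VirusTotal-based labelling: more than 5 detections -> malware, 0 -> goodware,
    anything in between is dropped from the dataset."""
    if vt_detections > 5:
        return "malware"
    if vt_detections == 0:
        return "goodware"
    return None  # ambiguous sample, dropped

def release_date(androzoo_submission: datetime, vt_first_seen: datetime) -> datetime:
    """Approximate release date: the earliest of the two observation dates."""
    return min(androzoo_submission, vt_first_seen)
```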

caption: [Exit status for the Rasta dataset],
) <fig:rasta-exit>
@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and Rasta datasets.
They represent the success/failure rate (green/orange) of the tools.
We distinguished failure to compute a result from timeout (blue) and crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
Because it may be caused by a bug in our own analysis stack, exit statuses represented in grey (Other) are considered as unknown errors and not as failures of the tool.
In @sec:rasta-failure-analysis, we further discuss the errors for which we have information in the logs.
Results on the Drebin dataset show that 11 tools have a high success rate (greater than 85%).
The other tools have poor results.
The worst, excluding Lotrack and Thresher, is Anadroid, with a success ratio under 20%.
On the Rasta dataset, we observe a global increase in the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
The tools that have bad results on Drebin have, of course, bad results on Rasta as well.
Three tools (androguard_dad, blueseal, saaf) that were performing well (higher than 85%) on Drebin surprisingly fall below the bar of 50% of successful analyses.
7 tools keep a high success rate: Adagio, Amandroid, Androguard, Apparecium, Gator, Mallodroid, Redexer.
Regarding IC3, the fork with a simpler build process and support for modern OS has a lower success rate than the original tool.
Two tools should be discussed in particular.
//Androguard and Flowdroid have a large community of users, as shown by the number of GitHub stars in @tab:rasta-sources.
Androguard has a high success rate, which is not surprising: it is used by a lot of tools, including for analysing applications uploaded to the Androzoo repository.
//Because of that, it should be noted that our dataset is biased in favour of Androguard. // Already in discussion
Nevertheless, when using Androguard's decompiler (DAD) to decompile an #APK, it fails more than 50% of the time.
This example shows that even a tool that is frequently used can still run into critical failures.
Concerning Flowdroid, our results show a very low timeout rate (#mypercent(37, NBTOTAL)), which was unexpected: in our exchanges, Flowdroid's authors were expecting a higher rate of timeout and fewer crashes.
As a summary, the final ratio of successful analyses for the tools that we could run
// and applications of the RASTA dataset
is #mypercent(54.9, 100).
When including the two defective tools, this ratio drops to #mypercent(49.9, 100).
#highlight-block()[
*#rq1 answer:*
On a recent dataset, we consider that #resultunusable of the tools are unusable.
For the tools that we could run, #resultratio of analyses finish successfully.
//(those with less than 50% of successful execution, and including the two tools that we were unable to build).
]
=== #rq2: Size, #SDK and Date Influence
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-java>],
[#figure(
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-not-java>]
), caption: [Exit status evolution for the Rasta dataset]
)
To investigate the effect of application dates on the tools, we computed the date of each #APK based on the minimum date between the first upload in AndroZoo and the first analysis in VirusTotal.
Such a computation is more reliable than using the #DEX date, which is often obfuscated when packaging the application.
Then, for the sake of clarity of our results, we separated the tools that have mainly Java source code from those that use other languages.
Among the ones that are Java-based programs, most use the Soot framework, which may introduce correlations in the obtained results.
@fig:rasta-exit-evolution-java (resp. @fig:rasta-exit-evolution-not-java) compares the success rate of the tools between 2010 and 2023 for Java-based tools (resp. non Java-based tools).
For Java-based tools, a clear decrease in finishing rate can be observed globally for all tools.
For non-Java-based tools, 2 of them keep a high success rate (Androguard, Mallodroid).
The result is expected for Androguard, because the analysis is relatively simple and the tool is largely adopted, as previously mentioned.
Mallodroid, being a relatively simple script leveraging Androguard, benefits from Androguard's resilience.
It should be noted that Saaf kept a high success ratio until 2014 and then quickly decreased to less than 20% after 2014.
This example shows that, even with an identical source code and the same running platform, a tool can change its behaviour over time because of the evolution of the structure of the input files.
An interesting comparison is the specific case of IC3 and IC3_fork.
Until 2019, their success rates were very similar. After 2020, IC3_fork continues to decrease, whereas IC3 keeps a success rate of around 60%.
To compare the influence of the date, #SDK version and size of applications, we fixed one parameter while varying another.
#todo[Alt text for fig rasta-decorelation-size]
#figure(stack(dir: ltr,
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-2022>],
[#figure(
@ -194,17 +195,17 @@ To compare the influence of the date, #SDK version and size of applications, we
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-2022>]
), caption: [Finishing rate by bytecode size for #APK detected in 2022]
) <fig:rasta-decorelation-size>
#paragraph[Fixed application year. (#num(5000) #APKs)][
We selected the year 2022, which has a good amount of representatives for each decile of size in our application dataset.
@fig:rasta-rate-evolution-java-2022 (resp. @fig:rasta-rate-evolution-non-java-2022) shows the finishing rate of the tools as a function of the size of the bytecode for Java-based tools (resp. non-Java-based tools) analysing applications of 2022.
We can observe that the finishing rate of all Java-based tools decreases as the bytecode size grows.
50% of non-Java-based tools have the same behaviour.
]
#todo[Alt text for fig rasta-decorelation-size]
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-year>],
[#figure(
@ -224,15 +225,16 @@ We can observe that all Java based tools have a finishing rate decreasing over y
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-year>]
), caption: [Finishing rate by discovery year with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-year>
#paragraph[Fixed application bytecode size. (#num(6252) APKs)][
We selected the sixth decile (between 4.08 and 5.20 MB), which is well represented in a wide number of years.
@fig:rasta-rate-evolution-java-decile-year (resp. @fig:rasta-rate-evolution-non-java-decile-year) represents the finishing rate depending on the year at a fixed bytecode size.
We observe that 9 out of 12 Java-based tools have a finishing rate dropping below 20%, which is not the case for non-Java-based tools.
]
#todo[Alt text for fig rasta-decorelation-min-sdk]
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-min-sdk>],
[#figure(
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-min-sdk>]
), caption: [Finishing rate by min #SDK with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-min-sdk>
We performed similar experiments by varying the min #SDK and target #SDK versions, still with a fixed bytecode size between 4.08 and 5.2 MB, as shown in @fig:rasta-rate-evolution-java-decile-min-sdk and @fig:rasta-rate-evolution-non-java-decile-min-sdk.
We found that, contrary to the target #SDK, the min #SDK version has an impact on the finishing rate of Java-based tools: 8 tools out of 12 are below 50% after #SDK 16.
This is not surprising, as the min #SDK is highly correlated with the year.
#highlight-block(breakable: false)[
*#rq2 answer:*
For the #nbtoolsselected tools that can be used partially, a global decrease in the success rate of tools' analysis is observed over time.
Starting at a 78% success rate, after five years, tools have 61% success; after ten years, 45% success.
The success rate varies based on the size of the bytecode and #SDK version.
The date is also correlated with the success rate for Java-based tools only.
]
//table.vline(end: 3),
//table.vline(start: 4),
//table.cell(rowspan:2)[*Average date*],
[*#APK*],
[*#DEX*],
),
table.cell(colspan: 3/*4*/, inset: 3pt)[],
table.hline(),
table.hline(),
)},
placement: none, // floating figure makes this table go in the previous section :grim:
caption: [Average size and date of goodware/malware parts of the RASTA dataset],
) <tab:rasta-sizes>
We sampled our dataset to have a variety of #APK sizes, but the size of the application is not entirely proportional to the bytecode size.
Looking at @tab:rasta-sizes, we can see that although malware are, on average, bigger #APKs, they contain less bytecode than goodware.
In the previous section, we saw that the size of the bytecode has the most significant impact on the finishing rate of analysis tools, and indeed, @fig:rasta-exit-goodmal reflects that.
In @fig:rasta-exit-goodmal, we compared the finishing rate of malware and goodware applications for the evaluated tools.
We can see that malware and goodware seem to generate a similar number of timeouts.
However, with the exception of two tools (Apparecium and Redexer), we can see a trend of goodware being harder to analyse than malware.
For some tools, like DAD or perfchecker, the finishing rate is more than 20 points higher on malware than on goodware.
#figure({
show table: set text(size: 0.80em)
table.cell(rowspan: 2)[*Decile*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average #DEX size (MB)*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[* Finishing Rate: #FR*],
caption: [#DEX size and Finishing Rate (#FR) per decile],
) <tab:rasta-sizes-decile>
We saw that the bytecode size may be an explanation for this increase.
To investigate this further, @tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of bytecode size.
We also computed the ratio of the bytecode size and finishing rate for the two populations.
We observe that while the bytecode size ratio between goodware and malware stays close to one in each decile (excluding the two extremes), the goodware/malware finishing rate ratio decreases for each decile.
It goes from 1.03 for the 2#super[nd] decile to 0.67 in the 9#super[th] decile.
We conclude from this table that, at equal size, analysing malware still triggers fewer errors than analysing goodware, and that the difference in generated errors between goodware and malware increases with the bytecode size.
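The per-decile figures of @tab:rasta-sizes-decile can be computed with a query of the following shape; the database schema below (a `runs` table with one row per tool execution) is hypothetical and only illustrates the computation.

```python
import sqlite3

# Hypothetical schema: one row per (apk, tool) execution, with the APK's DEX size,
# its goodware/malware label and a 0/1 finishing status. Deciles are computed over
# the whole dataset, as in the table above.
QUERY = """
WITH ranked AS (
    SELECT dex_size, label, finished,
           NTILE(10) OVER (ORDER BY dex_size) AS decile
    FROM runs
)
SELECT decile, label,
       AVG(dex_size) / 1e6   AS avg_dex_mb,
       AVG(finished) * 100.0 AS finishing_rate
FROM ranked
GROUP BY decile, label
ORDER BY decile, label;
"""

with sqlite3.connect("rasta_results.db") as con:
    for decile, label, avg_mb, rate in con.execute(QUERY):
        print(f"decile {decile:2d} {label:8s} {avg_mb:6.2f} MB  FR = {rate:5.1f}%")
```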
#highlight-block()[
*#rq3 answer:*
Analysing malware applications triggers fewer errors for static analysis tools than analysing goodware for comparable bytecode size.
]

#import "X_var.typ": *
#import "X_lib.typ": *
== Failure Analysis <sec:rasta-failure-analysis>
In this section, we investigate the reasons behind the high failure ratio presented in @sec:rasta-xp.
@tab:rasta-avgerror reports the average number of errors, the average time and memory consumption of the analysis of one APK file.
#figure({
=== Detected Errors //<sec:rasta-errors>
/*
capture errors
android.jar version 9 generates errors
*/
While running our experiments, we parsed the standard output and error streams to capture the following (a simplified sketch of this log parsing is given after the list):
- Java errors and stack traces
- Python errors and stack traces
- XSB error messages
- Ocaml errors
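The sketch below illustrates the kind of log scraping involved; the patterns shown are simplified, hypothetical examples, and the ones we actually matched are more numerous.

```python
import re

# Simplified, hypothetical patterns for the error families listed above.
ERROR_PATTERNS = {
    "java":   re.compile(r"^(Exception in thread|Caused by:|\S+(Error|Exception):)"),
    "python": re.compile(r"^Traceback \(most recent call last\):"),
    "xsb":    re.compile(r"\+\+Error"),
    "ocaml":  re.compile(r"^Fatal error: exception"),
}

def classify_log(lines: list[str]) -> dict[str, int]:
    """Count, per runtime, the error messages found in a tool's stdout/stderr."""
    counts = {kind: 0 for kind in ERROR_PATTERNS}
    for line in lines:
        for kind, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[kind] += 1
    return counts
```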
For example, Dialdroid reports an average of #num(55.9) errors for one successful analysis.
On the contrary, some tools, such as Blueseal, report very few errors at a time, making it easier to identify the cause of the failure.
Because some tools send back a high number of errors in our logs (up to #num(46698) for one execution), we tried to determine the error that is linked to the failed status.
Unfortunately, our manual investigations confirmed that the last error of a log output is not always the one that should be attributed to the global failure of the analysis.
The error that seems to generate the failure can occur in the middle of the execution, be caught by the code and then other subsequent parts of the code may generate new errors as consequences of the first one.
Similarly, the first error in the logs is not always the cause of a failure.
Sometimes errors successfully caught and handled are logged anyway.
Thus, it is impossible to accurately extract the error responsible for a failed execution.
Therefore, we investigated the nature of errors globally, without distinction between error messages in a log.
#todo()[alt text for rasta-heatmap]
width: 100%,
alt: "",
),
caption: [Heatmap of the ratio of error reasons for all tools for the Rasta dataset],
) <fig:rasta-heatmap>
@fig:rasta-heatmap shows the most frequent error objects for each of the tools.
A black square is an error type that represents more than 80% of the errors raised by the considered tool.
In between, grey squares show a ratio between 20% and 80% of the reported errors.
First, the heatmap helps us confirm that our experiments ran in adequate conditions.
Regarding errors linked to memory, two errors should be investigated: `OutOfMemoryError` and `StackOverflowError`.
The first one only appears for Gator with a low ratio.
Several tools have a low ratio of errors concerning the stack.
These results confirm that the allocated heap and stack are sufficient for running the tools with the Rasta dataset.
Regarding errors linked to disk space, we observe small ratios for the exceptions `IOException`, `FileNotFoundError` and `FileNotFoundException`.
Manual inspections revealed that those errors are often a consequence of a failed Apktool execution.
Second, the black squares indicate frequent errors that need to be investigated separately.
In the next subsection, we manually analysed, when possible, the code that generates these high ratios of errors, and we give feedback about the possible causes and difficulties in writing a bug fix.
=== Tool-by-Tool Investigation // <sec:rasta-tool-by-tool-inv>
#paragraph[Androguard and Androguard_dad][
Surprisingly, while Androguard rarely fails to analyse an #APK, the internal decompiler of Androguard (DAD) fails more than half of the time.
The analysis of the logs shows that the issue comes from the way the decompiled methods are stored: each method is stored in a file named after the method name and signature, and this file name can quickly exceed the size limit (255 characters on most file systems).
It should be noticed that Androguard_dad rarely fails on the Drebin dataset.
This illustrates the importance of testing tools on real and up-to-date #APKs: even a bad handling of filenames can influence an analysis.
]
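As an illustration of this failure mode (and of a possible workaround), the hypothetical sketch below builds a per-method file name and falls back to a hash when the 255-byte limit would be exceeded; it is not DAD's actual naming code.

```python
import hashlib

MAX_NAME_BYTES = 255  # typical limit for a single file name on most file systems

def method_filename(class_name: str, method: str, descriptor: str) -> str:
    """Build a file name for a decompiled method, hashing it when the combination
    of class name, method name and signature would exceed the limit."""
    name = f"{class_name}.{method}{descriptor}.py"
    if len(name.encode("utf-8")) > MAX_NAME_BYTES:
        digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
        name = f"{class_name[:80]}.{digest}.py"
    return name
```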
#paragraph([Mallodroid and Apparecium])[
Mallodroid and Apparecium stand out as the tools that raised the most errors in one run.
They can raise more than #num(10000) errors per analysis.
However, it happened only for a few dozen #APKs, and conspicuously, the same #APKs raised the same high number of errors for both tools.
The recurring error is a `KeyError` raised by Androguard when trying to find a string by its identifier.
Although this error is logged, it seems successfully handled, and during a manual analysis of the execution, both tools seemingly perform their analysis without issue.
This high number of occurrences may suggest that the output is not valid.
Still, the tools claim to return a result, so, from our perspective, we consider those analyses as successful.
For numerous other errors, we could not identify the reason why those specific applications raise so many exceptions.
However, we noticed that Mallodroid and Apparecium use outdated versions of Androguard (versions 3.0 and 2.0, respectively), and neither Androguard v3.3.5 nor DAD with Androguard v3.3.5 raises those exceptions.
This suggests the issue has been fixed by Androguard and that Mallodroid and Apparecium could benefit from a dependency upgrade.
]
#paragraph([Blueseal])[
Because Blueseal rarely logs more than one error when crashing, it is easy to identify the relevant error.
The majority of crashes come from unsupported Android versions (due to the magic number of the #DEX files not being supported by the version of baksmali used by Blueseal) and methods whose implementations are not found (like native methods).
]
/*
CannotFindMethodException
*/
#paragraph([IC3 and IC3_fork])[
We compared the number of errors between IC3 and IC3_fork.
IC3_fork reports fewer errors for all types of analysis, which suggests that the author of the fork has removed the error output from the original code: the thrown errors are captured in a generic `RuntimeException`, which removes their semantics and makes our investigation harder.
Nevertheless, IC3_fork has more failures than IC3: the number of errors reported by a tool is not correlated to the final success of its analysis.
]
#paragraph([Flowdroid])[
Our exchanges with the authors of Flowdroid led us to expect more timeouts from executions taking too long than failed runs.
Surprisingly, we only got #mypercent(37,NBTOTAL) of timeouts and a high number of failures.
We tried to detect recurring causes of failures, but the complexity of Flowdroid makes the investigation difficult.
Most exceptions seem to be related to concurrency. //or display generic messages.
Other errors that came up regularly are `java.nio.channels.ClosedChannelException`, raised when Flowdroid fails to read from the APK (although we did not find the reason for the failure), null pointer exceptions when checking whether a null value is in a `ConcurrentHashMap` (in `LazySummaryProvider.getClassFlows()`), and `StackOverflowError` from `StronglyConnectedComponentsFast.recurse()`.
We randomly selected 20 #APKs that generated stack overflows in Flowdroid and retried the analysis with 500GB of RAM allocated to the JVM.
18 of those runs still failed with a stack overflow without using all the allocated memory, and the other two failed after raising null pointer exceptions from `getClassFlows`.
This shows that the lack of memory is not the primary cause of those failures.
]
As a conclusion, we observe that a lot of errors can be linked to bugs in dependencies.
Our attempts to upgrade those dependencies led to new errors appearing: we conclude that this is not a trivial task and that it requires familiarity with the inner code of the tools.

#align(center, highlight-block(inset: 15pt, width: 75%, block(align(left)[
This chapter intends to explore the robustness of past software dedicated to static analysis of Android applications.
We pursue the community effort that identified software supporting publications that perform static analysis of mobile applications, and we propose a method for evaluating the reliability of this software.
We extensively evaluate static analysis tools on a recent dataset of Android applications, including goodware and malware, that we designed to measure the influence of parameters such as the date and size of applications.
Our results show that #resultunusable of the evaluated tools are no longer usable, and that the size of the bytecode and the min #SDK version have the greatest influence on the reliability of the tested tools.
])))