typos in ch 3
All checks were successful
/ test_checkout (push) Successful in 1m58s

Jean-Marie 'Histausse' Mineau 2025-09-29 16:36:54 +02:00
parent 2df810c3bd
commit 4e38131df5
Signed by: histausse
GPG key ID: B66AEEDA9B645AD2
5 changed files with 65 additions and 65 deletions


@@ -7,16 +7,16 @@
 In this section, we will compare our results with the contributions presented in @sec:bg.
-Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022 a real-world benchmark and the associated recommendations to build such a benchmark.
+Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world benchmark and the associated recommendations to build such a benchmark.
-These benchmarks confirmed that some tools such as Amandroid and Flowdroid are less efficient on real-world applications.
+These benchmarks confirmed that some tools, such as Amandroid and Flowdroid, are less efficient on real-world applications.
-We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than using hand crafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
+We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than using handcrafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
-In addition, even if Drebin is not hand-crafted, it is quite old seams to present similar issue as hand-crafted dataset when used to evaluate a tool: we obtained really good results compared to the Rasta dataset -- which is more representative of realworld applications.
+In addition, even if Drebin is not hand-crafted, it is quite old and seems to present similar issues as handcrafted datasets when used to evaluate a tool: we obtained really good results compared to the Rasta dataset -- which is more representative of real-world applications.
-Our finding are also consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analysed successfully with the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
+Our findings are also consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analysed successfully with the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
 Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
 We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
 We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.
-Our investigations of crashes also confirmed that dependencies to older versions of Apktool are impacting the performances of Anadroid, Saaf and Wognsen #etal in addition to DroidSafe and IccTa, already identified by Pauck #etal.
+Our investigations of crashes also confirmed that dependencies on older versions of Apktool are impacting the performances of Anadroid, Saaf and Wognsen #etal in addition to DroidSafe and IccTa, already identified by Pauck #etal.
 /*
 Pauck: 235 micro bench, 30 real*
@@ -48,6 +48,6 @@ wognsen_et_al|386
 Third, we extended to #nbtoolsselected different tools the work done by Reaves #etal on the usability of analysis tools (4 tools are in common, we added 16 new tools and two variations).
 We confirmed that most tools require a significant amount of work to get them running.
-We encounter similar issues with libraries and operating system incompatibilities, and noticed that, as time passes, dependencies issues may impact the build process.
+We encounter similar issues with libraries and operating system incompatibilities, and noticed that, as time passes, dependency issues may impact the build process.
-For instance we encountered cases where the repository hosting the dependencies were closed, or cases where maven failed to download dependencies because the OS version did not support SSL, now mandatory to access maven central.
+For instance, we encountered cases where the repositories hosting the dependencies were closed, or cases where Maven failed to download dependencies because the OS version did not support SSL, which is now mandatory to access Maven Central.
-//, and even one case were the could not find anywhere the compiled version of sbt used to build a tool.
+//, and even one case where they could not find anywhere the compiled version of sbt used to build a tool.


@@ -2,48 +2,48 @@
 == Recommendations <sec:rasta-reco>
-In the light of our findings in @sec:rasta-failure-analysis and the issues we met while packaging the tools, we summarize some takeaways that we believe developers should follow to improve the success of reusing their software.
+In light of our findings in @sec:rasta-failure-analysis and the issues we encountered while packaging the tools, we summarise some takeaways that we believe developers should follow to improve the success of reusing their software.
 //*developer*: dire que a la lumiere de ces resultats, on peut pense que certain pbs peuvent être évité ou bien corrigé par l'utilisateur]
-We understand software developped for research purposes are not and should not be held to the same standards as production sofwares.
+We understand software developed for research purposes is not and should not be held to the same standards as production software.
-However, research is incremental and it is not sustanable to start each tools from scratch.
+However, research is incremental, and it is not sustainable to start each tool from scratch.
-It is critical to be able to build upon tools already published, and efforts should be made to allows that when releasing a tool.
+It is critical to be able to build upon tools already published, and efforts should be made to allow that when releasing a tool.
-Durint the packaging and testing of the tools we examined in our experiment, the most notable issues we encontered could have been avoided by following classical development best practices.
+During the packaging and testing of the tools we examined in our experiment, the most notable issues we encountered could have been avoided by following classical development best practices.
-To make a tool easy to reuse, it should have a documentation with at least:
+To make a tool easy to reuse, it should have documentation with at least:
 - Instructions about how to install the dependencies.
-- Instructions about how to build the tool (if the tool need to be build).
+- Instructions about how to build the tool (if the tool needs to be built).
 - Instructions about how to use the tool (#eg command line arguments)
-- Instructions about how to interpret the results of the tools (we only checked for the existance of the results in our experiment, but we found that some results can be quite obscure)
+- Instructions about how to interpret the results of the tools (we only checked for the existence of the results in our experiment, but we found that some results can be quite obscure)
 In addition to the documentation, a minimum working example with the expected result of the tools allows a potential user to check if everything is working as intended.
-This #MWE have the additionnal benefit that is can serve as an example in the documentation.
+This #MWE have the additional benefit that it can serve as an example in the documentation.
 Another best practice to follow is to pin the version of dependencies of the tools.
-Many modern dependency management tools can handle that: for instance for python, poetry or uv generate a lock files with the exact version of the libraries to use, cargo does the same for rust, in java this can be an option in gradle, and dependencies in maven `pom.xml` files are usually the exact version.
+Many modern dependency management tools can handle that: for instance, for Python, Poetry or uv generate a lock file with the exact version of the libraries to use, Cargo does the same for Rust, in Java, this can be an option in Gradle, and dependencies in Maven `pom.xml` files are usually the exact version.
-For other dependencies that are not managed by a dependency manager -- for instance the java virtual machine tu use, the python interpreter, resource files -- the version to use sould be clearly documented.
+For other dependencies that are not managed by a dependency manager -- for instance, the Java virtual machine to use, the Python interpreter, resource files -- the version to use should be clearly documented.
-Alternatively, tools like nixpkg can be used to pin every dependencies.
+Alternatively, tools like nixpkg can be used to pin every dependency.
-The worst case we encontered during our experiment was a tool whose documentation instructed to install the z3 dependencies with a simple `git clone`, whithout specifying the commit to use.
+The worst case we encountered during our experiment was a tool whose documentation instructed us to install the z3 dependencies with a simple `git clone`, without specifying the commit to use.
-The z3 project being still actively maintained, the dependency installed was not compatible, and finding a compatible version required checking releases one by one.
+The z3 project is still actively maintained, so the dependency installed was not compatible, and finding a compatible version required checking releases one by one.
-Dependencies fetched with version control system should alway indicate the exact version to used (in the case of git, a commit, tag or release should be used).
+Dependencies fetched with a version control system should always indicate the exact version to use (in the case of git, a commit, tag, or release should be used).
-We also found that interactions with the running environment can become verry problematic when the environment changes.
+We also found that interactions with the running environment can become very problematic when the environment changes.
-To minimized the issues, packaging the tool inside a docker container or even a virtual machine can ensure that future users have at least access to a working version of the tool.
+To minimise the issues, packaging the tool inside a Docker container or even a virtual machine can ensure that future users have at least access to a working version of the tool.
-Finally, when possible, continuous integration, tests and code reviews should be implemented to improve the reliability of the developped tool.
+Finally, when possible, continuous integration, tests and code reviews should be implemented to improve the reliability of the developed tool.
-Concerning the actual code of the tool, more attention should be paid to error repporting.
+Concerning the actual code of the tool, more attention should be paid to error reporting.
-When a tool failed to perform its analysis, is should be clear to the user, and the reason should be clearly reported.
+When a tool failed to perform its analysis, it should be clear to the user, and the reason should be clearly reported.
-In some cases, this may imply _not_ trying to recover from unrecoverable errors: this often leads to error seemingly unrelated to the initial issue.
+In some cases, this may imply _not_ trying to recover from unrecoverable errors: this often leads to errors seemingly unrelated to the initial issue.
-This is often a problem in Java code where the developers are strongly encouraged to catch all exceptions, and in bash scripts that run several programs in a row without checking the exit statuses.
+This is often a problem in Java code, where the developers are strongly encouraged to catch all exceptions, and in bash scripts that run several programs in a row without checking the exit statuses.
-Good error repporting can allow futur user to solve issues encontered using the tools: for instance the log generated by Androguard's decompiler clearly show that the issue is file names exceeding the size limit.
+Good error reporting can allow future users to solve issues encountered using the tools: for instance, the log generated by Androguard's decompiler clearly shows that the issue is file names exceeding the size limit.
-This issue could easily fixed by changing the filenames used to store the results.
+This issue could easily be fixed by changing the filenames used to store the results.
-In contrast, the error generated by Flowdroid are so opaque that we have no idea how we could solve them.
+In contrast, the errors generated by Flowdroid are so opaque that we have no idea how we could solve them.
 And at last, an important remark concerns the libraries used by a tool.
 We have seen two types of libraries:
 - internal libraries manipulating internal data of the tool.
 - external libraries that are used to manipulate the input data (APKs, bytecode, resources).
 We observed during our manual investigations that external libraries are the ones leading to crashes because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files).
-We believe that the developer should provide enough documentation to make possible a later upgrade of these external libraries.
+We believe that the developer should provide enough documentation to make a later upgrade of these external libraries possible.
-For example, old versions of apktool are the top most libraries raising errors, but breaking changes introduced by upgrading from v1.X versions to v2.X versions preventing use from upgrading apktool.
+For example, old versions of Apktool are the top-most libraries raising errors, but breaking changes introduced by upgrading from v1.X versions to v2.X versions prevent us from upgrading Apktool.
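
Reviewer note on the error-reporting recommendation in the hunk above: the text points at scripts that run several programs in a row without checking exit statuses. A minimal Python sketch of the opposite behaviour follows, assuming a pipeline expressed as a list of named commands. The names `run_steps` and `StepFailed` are hypothetical and are not taken from any of the evaluated tools.

```python
import subprocess
import sys


class StepFailed(RuntimeError):
    """Raised when one step of a pipeline exits with a non-zero status."""


def run_steps(steps):
    """Run each (name, argv) step in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    done = []
    for name, argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Report the failing step, its exit status and its stderr right away,
            # instead of letting later steps fail with seemingly unrelated errors.
            raise StepFailed(
                f"step '{name}' exited with status {result.returncode}: "
                f"{result.stderr.strip()}"
            )
        done.append(name)
    return done
```

A caller would wrap `run_steps` in a `try`/`except StepFailed` block and print the message, so that the user sees which program failed and why. Whether stopping at the first failure is appropriate depends on the tool; this sketch only illustrates the "do not recover from unrecoverable errors" point.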


@@ -2,15 +2,15 @@
 Some limitations of our approach should be kept in mind.
-Our application dataset is biased in favor of Androguard, because Androzoo have already used Androguard internally when collecting applications and discarded any application that cannot be processed with this tool.
+Our application dataset is biased in favour of Androguard, because Androzoo have already used Androguard internally when collecting applications and discarded any application that cannot be processed with this tool.
 Despite our best efforts, it is possible that we made mistakes when building or using the tools.
 It is also possible that we wrongly classified a result as a failure.
-To mitigate this possible problem we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion.
+To mitigate this possible problem, we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion.
-Before running the final experiment, we also ran the tools on a subset of our dataset and looked manually the most common errors to ensure that they are not trivial errors that can be solved.
+Before running the final experiment, we also ran the tools on a subset of our dataset and manually investigated the most common errors to ensure that they are not trivial errors that can be solved.
-The timeout value, amount of memory are arbitrarily fixed.
+The timeout value and memory limits are arbitrarily fixed.
-To mitigate this issue, a small extract of our dataset has been analysed with more memory/time and we check that they was no significant difference in the results.
+To mitigate this issue, a small extract of our dataset has been analysed with more memory/time, and we checked that there was no significant difference in the results.
-Finally, the use of VirusTotal for determining if an application is a malware or not may be wrong.
+Finally, the use of VirusTotal for determining if an application is malware or not may be wrong.
-To limite the impact of errors, we used a threshold of at most 5 antiviruses (resp. no more than 0) reporting an application as being a malware (resp. goodware) for taking a decision about maliciousness (resp. benignness).
+To limit the impact of errors, we used a threshold of at most 5 antiviruses (resp. no more than 0) reporting an application as being malware (resp. goodware) for taking a decision about maliciousness (resp. benignness).
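
Reviewer note on the VirusTotal threshold in the hunk above: the sentence is ambiguous about the direction of the malware threshold. The sketch below assumes one possible reading, namely that an application is labelled malware when 5 or more antiviruses flag it, goodware when none do, and left undecided otherwise. That reading, the constant names, and the function `label_from_detections` are assumptions for illustration, not the authors' implementation.

```python
from typing import Optional

# Assumed reading of the thresholds described in the text.
MALWARE_MIN_DETECTIONS = 5   # assumption: 5 or more detections means malware
GOODWARE_MAX_DETECTIONS = 0  # no detection at all means goodware


def label_from_detections(detections: int) -> Optional[str]:
    """Map a number of antivirus detections to a label.

    Returns "malware", "goodware", or None when the count falls between
    the two thresholds and no decision is taken.
    """
    if detections < 0:
        raise ValueError("the number of detections cannot be negative")
    if detections >= MALWARE_MIN_DETECTIONS:
        return "malware"
    if detections <= GOODWARE_MAX_DETECTIONS:
        return "goodware"
    return None
```

Under this reading, applications with 1 to 4 detections are excluded from the malware/goodware comparison, which is one way to limit the impact of isolated false positives.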


@@ -4,22 +4,22 @@
 == Futur Works <sec:rasta-futur>
-A first extention to this work would obviously be to studdy more tools.
+A first extension to this work would obviously be to study more tools.
-We restricted ourself to the tools listed by Li #etal, but it would interesting to compare our result to the finishing rate of recently released tools.
+We restricted ourselves to the tools listed by Li #etal, but it would be interesting to compare our result to the finishing rate of recently released tools.
 It would be interesting to see if they are better at handling large #APKs, but also to see if older applications are more challenging for them due to discontinued features.
 Another avenue would be to define a benchmark to check the ability of tools to handle real-world applications.
-Our dataset is much to large for a simple benchmark, and is sampled to have a variety of application size and year of publication.
+Our dataset is too large for a simple benchmark and is sampled to have a variety of application sizes and years of publication.
 Hence, the first step would be to sample a dataset for this benchmark.
-Current benchmark datasets focus on accuracy of the tested tools, with difficult to analyse applications.
+Current benchmark datasets focus on the accuracy of the tested tools, with difficult-to-analyse applications.
-It could be instesting to extract from our result some of applications that the most tools failed to analyse, and either use them directly or studdy them to craft simpler applications reproducing the same challenged as those applications.
+It could be interesting to extract from our results some of the applications that the most tools failed to analyse, and either use them directly or study them to craft simpler applications reproducing the same challenges as those applications.
-Such dataset would need to be updated regularly: we saw that there is a trend for newer applications to be harder to analyse, a frozen dataset would ignore this factor.
+Such datasets would need to be updated regularly: we saw that there is a trend for newer applications to be harder to analyse, a frozen dataset would ignore this factor.
-In addition to the finishing rate, it would be both interesting and usefull to have reference value.
+In addition to the finishing rate, it would be both interesting and useful to have reference values.
-@tab:rasta-rec-deps list common Android related dependencies we encontered when packaging the tools.
+@tab:rasta-rec-deps list common Android-related dependencies we encountered when packaging the tools.
 We can see that each tools use at least one of those dependencies.
-It would be resonnable to consider the best finishing ratio a tool can have to be the finishing ratio of a tool that would perfom an "empty analysis" using the same dependencies.
+It would be reasonable to consider the best finishing ratio a tool can have to be the finishing ratio of a tool that would perform an "empty analysis" using the same dependencies.
-Considering the prevalence of those dependencies, having those theoritical minimum could also guide future tool developers when choosing their dependencies.
+Considering the prevalence of those dependencies, having those theoretical minimums could also guide future tool developers when choosing their dependencies.
 #figure({
 //show table: set text(size: 0.80em)


@@ -5,21 +5,21 @@
 == Conclusion <sec:rasta-conclusion>
-Since the release of Android, many tools have been published in order to analyse Android application.
+Since the release of Android, many tools have been published in order to analyse Android applications.
-In @sec:bg, we went through contributions benchmarking and comparing some of those tools.
+In @sec:bg, we went through contributions that benchmark and compare some of those tools.
-Those contributions suggested that analysing real-world applications might be more of a challenged than expected.
+Those contributions suggested that analysing real-world applications might be more challenging than expected.
 This led us to question the reusability of those tools (#pb1).
 This chapter has assessed the suggested results of the literature~@luoTaintBenchAutomaticRealworld2022 @pauckAndroidTaintAnalysis2018 @reaves_droid_2016 about the reliability of static analysis tools for Android applications.
-With a dataset of #NBTOTALSTRING applications we established that #resultunusable of #nbtoolsselectedvariations tools are not reusable.
+With a dataset of #NBTOTALSTRING applications, we established that #resultunusable of #nbtoolsselectedvariations tools are not reusable.
-2 of those where due to the fact that whe did not managed to use the tools, even with the help of the author.
+2 of those were due to the fact that we did not manage to use the tools, even with the help of the author.
-We consider the 10 other tools the be unusable due to the fact that they fail to finish their analysis more than 50% of the time..
+We consider the 10 other tools to be unusable due to the fact that they fail to finish their analysis more than 50% of the time..
 In total, the analysis success rate of the tools that we could run for the entire dataset is #resultratio.
-The characteristics that have the most influence on the success rate is the bytecode size and min #SDK version.
+The characteristics that have the most influence on the success rate are the bytecode size and the min #SDK version.
-Finally, we showed that malware #APKs generate less fatal errors than goodware when analysed.
+Finally, we showed that malware #APKs generate fewer fatal errors than goodware when analysed.
 Following Reaves #etal recommendations~@reaves_droid_2016, we publish the Docker and Singularity images we built to run our experiments alongside the Docker files.
-This will allow the research community to use directly the tools without the build and installation penalty.
+This will allow the research community to use the tools directly without the build and installation penalty.
 #v(2em)
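
Reviewer note on the usability criterion in the hunk above: the conclusion treats a tool as unusable when it fails to finish its analysis more than 50% of the time. A minimal sketch of that criterion follows, assuming each run is recorded as a boolean (`True` when the analysis finished). The function names and the record format are hypothetical.

```python
def finishing_rate(outcomes):
    """Fraction of runs that finished, given a non-empty list of booleans."""
    if not outcomes:
        raise ValueError("at least one run is needed to compute a rate")
    return sum(1 for finished in outcomes if finished) / len(outcomes)


def is_unusable(outcomes):
    """True when strictly more than half of the runs failed to finish."""
    if not outcomes:
        raise ValueError("at least one run is needed to decide")
    failures = sum(1 for finished in outcomes if not finished)
    # Compare integer counts to avoid floating-point issues at exactly 50%.
    return 2 * failures > len(outcomes)
```

With this definition, a tool that fails on exactly half of the applications is still counted as usable, which matches the "more than 50%" wording.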
@@ -27,8 +27,8 @@ This will allow the research community to use directly the tools without the bui
 #pb1: #pb1-text
 #v(0.75em)
 More than half the tools we selected were not usable.
-In some cases, it was due to our inability to setup the tool correctly.
+In some cases, it was due to our inability to set up the tool correctly.
 Mostly, it was due to the high failure rate when analysing real-world applications.
-Results show that large applications cause more crashes, as does applications with higher min #SDK target.
+Results show that large applications cause more crashes, as do applications with a higher min #SDK target.
-Goodware also appear to generate more analysis failure than malware.
+Goodware also appear to generate more analysis failures than malware.
 ])))