commit fe6dbb1d22 (parent 973a302f1d)
8 changed files with 137 additions and 97 deletions
@@ -31,7 +31,7 @@ As a summary, the contributions of this paper are the following:
The chapter is structured as follows.
@sec:rasta-methodology presents the methodology employed to build our evaluation process and @sec:rasta-xp gives the associated experimental results.
@sec:rasta-discussion investigates the reasons behind the observed failures of some of the tools, discusses the limitations of this work, and gives some takeaways for future contributions.
@sec:rasta-conclusion concludes the chapter.

@sec:rasta-failure-analysis investigates the reasons behind the observed failures of some of the tools.
We then compare, in @sec:rasta-soa-comp, our results with the contributions presented in @sec:bg-eval-tools.
In @sec:rasta-reco, we give recommendations for tool development drawn from our experience running this experiment.
Finally, @sec:rasta-limit lists the limitations of our approach, and @sec:rasta-conclusion concludes the chapter.
@@ -437,6 +437,7 @@ Some tools, like DAD or perfchecker, show the finishing rate ratio augment by mo
)},
caption: [#DEX size and Finishing Rate (#FR) per decile],
) <tab:rasta-sizes-decile>

We saw that the bytecode size may be an explanation for this increase.
To investigate this further, @tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of bytecode size.
We also computed the ratio of the bytecode size and finishing rate for the two populations.
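
The decile table itself is produced from our result database; the snippet below is only a minimal sketch of the computation, assuming a pandas DataFrame `runs` with hypothetical columns `dex_size`, `finished` and `is_malware` rather than the actual schema of our database.

```python
import pandas as pd

def finishing_rate_by_decile(runs: pd.DataFrame) -> pd.DataFrame:
    """Mean DEX size and finishing rate per decile of bytecode size, per population."""
    parts = []
    for is_malware, part in runs.groupby("is_malware"):
        part = part.copy()
        # Deciles are computed within each population (goodware / malware).
        part["decile"] = pd.qcut(part["dex_size"], q=10, labels=False) + 1
        stats = part.groupby("decile").agg(
            avg_dex_size=("dex_size", "mean"),
            finishing_rate=("finished", "mean"),  # share of analyses that finished
        )
        stats["is_malware"] = is_malware
        parts.append(stats.reset_index())
    return pd.concat(parts, ignore_index=True)
```
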
@@ -1,11 +1,12 @@
#import "../lib.typ": todo, jfl-note
#import "../lib.typ": etal, paragraph
#import "../lib.typ": todo
#import "../lib.typ": paragraph
#import "X_var.typ": *
#import "X_lib.typ": *

== Discussion <sec:rasta-discussion>
== Failure Analysis <sec:rasta-failure-analysis>

#todo[split into: error analysis, soa comp, recommendations and limitations]
In this section, we investigate the reasons behind the high ratio of failures presented in @sec:rasta-xp.
@tab:rasta-avgerror reports the average number of errors and the average time and memory consumption of the analysis of one APK file.

#figure({
show table: set text(size: 0.50em)
@@ -97,11 +98,8 @@
) <tab:rasta-avgerror>

In this section, we investigate the reasons behind the high ratio of failures presented in @sec:rasta-xp.
@tab:rasta-avgerror reports the average number of errors and the average time and memory consumption of the analysis of one APK file.
We also compare our conclusions with those of the literature.

=== Failures Analysis <sec:rasta-failure-analysis>
=== Errors Detected //<sec:rasta-errors>

/*
error capture
@@ -143,7 +141,8 @@ Therefore, we investigated the nature of errors globally, without distinction be
) <fig:rasta-heatmap>

@fig:rasta-heatmap draws the most frequent error objects for each of the tools.
A black square is an error type that represents more than 80% of the errors raised by the considered tool.
In between, gray squares show a ratio between 20% and 80% of the reported errors.
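
As a minimal sketch of this color coding (assuming the error counts are available as one `Counter` per tool, which is not the actual structure of our database), the classification can be expressed as:

```python
from collections import Counter

def shade(tool_errors: Counter, error_type: str) -> str:
    """Classify how dominant an error type is among the errors raised by one tool."""
    total = sum(tool_errors.values())
    ratio = tool_errors[error_type] / total if total else 0.0
    if ratio > 0.8:
        return "black"  # this error type dominates the tool's failures
    if ratio >= 0.2:
        return "gray"   # between 20% and 80% of the reported errors
    return "white"      # rare or absent (assumed rendering for the lowest bin)
```
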
First, the heatmap helps us to confirm that our experiment is running in adequate conditions.
Regarding errors linked to memory, two errors should be investigated: `OutOfMemoryError` and `StackOverflowError`.
@@ -156,7 +155,7 @@ Manual inspections revealed that those errors are often a consequence of a faile
Second, the black squares indicate frequent errors that need to be investigated separately.
In the next subsection, we manually analyze, when possible, the code that generates this high ratio of errors, and we give feedback about the possible causes and the difficulty of writing a bug fix.

=== Tool by Tool Failure Analysis <sec:rasta-tool-by-tool-failure-analysis>
=== Tool by Tool Investigation // <sec:rasta-tool-by-tool-inv>
/*
Dialdroid: TODO
com.google.common.util.concurrent.ExecutionError -> memory error: java.lang.StackOverflowError, java.lang.OutOfMemoryError: Java heap space, java.lang.OutOfMemoryError: GC overhead limit exceeded
@@ -332,84 +331,3 @@ Pauck: Flowdroid avg 2m on DIALDroid-Bench (real worlds apks)
As a conclusion, we observe that a lot of errors can be linked to bugs in dependencies.
Our attempts to upgrade those dependencies led to new errors appearing: we conclude that this is no trivial task and that it requires familiarity with the internal code of the tools.

=== State-of-the-art comparison

Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world benchmark, together with recommendations for building such a benchmark.
This benchmark confirmed that some tools, such as Amandroid and Flowdroid, are less efficient on real-world applications.
We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than hand-crafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
In addition, even if Drebin is not hand-crafted, it is quite old and seems to present similar issues as hand-crafted datasets when used to evaluate a tool: we obtained very good results on it compared to the Rasta dataset -- which is more representative of real-world applications.

Our findings are also consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of the DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully by the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.
Our investigations of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, as already identified by Pauck #etal.
/*
Pauck: 235 micro bench, 30 real*
Confirm didfail failed for min_sdk >= 19, all successful run (only 4%) indicated "Only phantom classes loaded, skipping analysis..."

SELECT tool_status, COUNT(*), AVG(dex_size) FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 WHERE min_sdk >= 19 AND tool_name = 'didfail' GROUP BY tool_status;
FAILED|16651|13139071.2363221
FINISHED|694|6617861.33717579
TIMEOUT|98|6048999.2244898
SELECT msg, COUNT(*) FROM (SELECT DISTINCT exec.sha256, msg FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 INNER JOIN error ON exec.sha256 = error.sha256 AND exec.tool_name = error.tool_name WHERE min_sdk >= 19 AND exec.tool_name = 'didfail' AND exec.tool_status = 'FINISHED') GROUP BY msg;
|77
Only phantom classes loaded, skipping analysis...|694

DroidSafe and IccTa failed for SDK > 19 because of old apktool

We observed: (nb success < 2000 for min_sdk >= 20)
['anadroid', 'blueseal', 'dialdroid', 'didfail', 'droidsafe', 'ic3_fork', 'iccta', 'perfchecker', 'saaf', 'wognsen_et_al']
anadroid|0
blueseal|521
dialdroid|812
didfail|343
droidsafe|35
ic3_fork|1393
iccta|612
perfchecker|1921
saaf|1588
wognsen_et_al|386
*/

Third, we extended the work done by Reaves #etal on the usability of analysis tools to #nbtoolsselected different tools (4 tools are in common; we added 16 new tools and two variations).
We confirmed that most tools require a significant amount of work to get them running.
We encountered similar issues with library and operating system incompatibilities, and noticed that, as time passes, dependency issues may impact the build process.
For instance, we encountered cases where the repository hosting the dependencies had been shut down, or cases where Maven failed to download dependencies because the OS version did not support SSL, which is now mandatory to access Maven Central.
//, and even one case where we could not find anywhere the compiled version of sbt used to build a tool.

=== Recommendations

#jfl-note[Finally, we summarize some takeaways that developers should follow to improve the success of reusing their developed software.][*developer*: say that, in light of these results, we can think that some problems can be avoided or fixed by the user]

To improve the reliability of their software, developers should use classical development best practices, for example continuous integration, testing, and code review.
To improve reusability, developers should write documentation about the tool usage, provide a minimal working example, and describe the expected results.
Interactions with the running environment should be minimized, for example by using a Docker container, a virtual environment or even a virtual machine.
Additionally, a small dataset should be provided for a more extensive test campaign, and publishing the expected results on this dataset would make it possible to evaluate the reproducibility of experiments.

Finally, an important remark concerns the libraries used by a tool.
We have seen two types of libraries:
- internal libraries manipulating internal data of the tool;
- external libraries that are used to manipulate the input data (APKs, bytecode, resources).
We observed during our manual investigations that external libraries are the ones leading to crashes because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files).
We believe that the developer should provide enough documentation to make a later upgrade of these external libraries possible.
//: for example, old versions of apktool are the top most libraries raising errors.

=== Threats to validity

Our application dataset is biased in favor of Androguard, because Androzoo already uses Androguard internally when collecting applications and discards any application that cannot be processed with this tool.

Despite our best efforts, it is possible that we made mistakes when building or using the tools.
It is also possible that we wrongly classified a result as a failure.
To mitigate this possible problem, we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion.
// Before running the final experiment, we also ran the tools on a subset of our dataset and manually looked at the most common errors to ensure that they are not trivial errors that can be solved.

The timeout value and the amount of memory are arbitrarily fixed.
To mitigate their effect, a small extract of our dataset has been analyzed with more memory and time to measure any difference.

Finally, the use of VirusTotal for determining whether an application is malware may be wrong.
To limit this impact, we used a threshold of at least 5 antiviruses (resp. none) reporting an application as malware (resp. goodware) for taking a decision about maliciousness (resp. benignness).
3_rasta/5_soa_comp.typ (new file, 53 lines added)
@@ -0,0 +1,53 @@
#import "../lib.typ": todo
#import "../lib.typ": etal
#import "X_var.typ": *
#import "X_lib.typ": *

== State-of-the-Art Comparison <sec:rasta-soa-comp>

In this section, we compare our results with the contributions presented in @sec:bg-eval-tools.

Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world benchmark, together with recommendations for building such a benchmark.
This benchmark confirmed that some tools, such as Amandroid and Flowdroid, are less efficient on real-world applications.
We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than hand-crafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
In addition, even if Drebin is not hand-crafted, it is quite old and seems to present similar issues as hand-crafted datasets when used to evaluate a tool: we obtained very good results on it compared to the Rasta dataset -- which is more representative of real-world applications.

Our findings are also consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of the DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully by the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.
Our investigations of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, as already identified by Pauck #etal.

/*
Pauck: 235 micro bench, 30 real*
Confirm didfail failed for min_sdk >= 19, all successful run (only 4%) indicated "Only phantom classes loaded, skipping analysis..."

SELECT tool_status, COUNT(*), AVG(dex_size) FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 WHERE min_sdk >= 19 AND tool_name = 'didfail' GROUP BY tool_status;
FAILED|16651|13139071.2363221
FINISHED|694|6617861.33717579
TIMEOUT|98|6048999.2244898
SELECT msg, COUNT(*) FROM (SELECT DISTINCT exec.sha256, msg FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 INNER JOIN error ON exec.sha256 = error.sha256 AND exec.tool_name = error.tool_name WHERE min_sdk >= 19 AND exec.tool_name = 'didfail' AND exec.tool_status = 'FINISHED') GROUP BY msg;
|77
Only phantom classes loaded, skipping analysis...|694

DroidSafe and IccTa failed for SDK > 19 because of old apktool

We observed: (nb success < 2000 for min_sdk >= 20)
['anadroid', 'blueseal', 'dialdroid', 'didfail', 'droidsafe', 'ic3_fork', 'iccta', 'perfchecker', 'saaf', 'wognsen_et_al']
anadroid|0
blueseal|521
dialdroid|812
didfail|343
droidsafe|35
ic3_fork|1393
iccta|612
perfchecker|1921
saaf|1588
wognsen_et_al|386
*/

Third, we extended the work done by Reaves #etal on the usability of analysis tools to #nbtoolsselected different tools (4 tools are in common; we added 16 new tools and two variations).
We confirmed that most tools require a significant amount of work to get them running.
We encountered similar issues with library and operating system incompatibilities, and noticed that, as time passes, dependency issues may impact the build process.
For instance, we encountered cases where the repository hosting the dependencies had been shut down, or cases where Maven failed to download dependencies because the OS version did not support SSL, which is now mandatory to access Maven Central.
//, and even one case where we could not find anywhere the compiled version of sbt used to build a tool.
3_rasta/6_recommendations.typ (new file, 49 lines added)
@@ -0,0 +1,49 @@
#import "../lib.typ": eg, jfl-note, MWE

== Recommendations <sec:rasta-reco>

In light of our findings in @sec:rasta-failure-analysis and of the issues we encountered while packaging the tools, we summarize some takeaways that we believe developers should follow to improve the chances of their software being successfully reused.

//*developer*: say that, in light of these results, we can think that some problems can be avoided or fixed by the user
We understand that software developed for research purposes is not, and should not be, held to the same standards as production software.
However, research is incremental, and it is not sustainable to start each tool from scratch.
It is critical to be able to build upon already published tools, and efforts should be made to allow that when releasing a tool.

During the packaging and testing of the tools we examined in our experiment, the most notable issues we encountered could have been avoided by following classical development best practices.
To make a tool easy to reuse, it should come with documentation covering at least:
- Instructions about how to install the dependencies.
- Instructions about how to build the tool (if the tool needs to be built).
- Instructions about how to use the tool (#eg command line arguments).
- Instructions about how to interpret the results of the tool (we only checked for the existence of results in our experiment, but we found that some results can be quite obscure).
In addition to the documentation, a minimal working example with the expected result of the tool allows a potential user to check that everything is working as intended.
This #MWE has the additional benefit that it can serve as an example in the documentation.
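
To make this concrete, here is a minimal sketch of such a check; the command line and file names are purely hypothetical and would be replaced by the ones documented for the released tool.

```python
import subprocess
from pathlib import Path

def check_mwe(command: list[str], produced: Path, expected: Path) -> bool:
    """Run the documented example command and compare its output with the published result."""
    subprocess.run(command, check=True)  # fail loudly if the tool itself crashes
    return produced.read_text() == expected.read_text()

# Hypothetical usage, depending on the tool being packaged:
# check_mwe(["mytool", "--apk", "example.apk", "--out", "result.json"],
#           Path("result.json"), Path("expected/result.json"))
```
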
Another best practice is to pin the versions of the dependencies of the tool.
Many modern dependency management tools can handle that: for instance, for Python, poetry or uv generate a lock file with the exact versions of the libraries to use, cargo does the same for Rust, in Java this can be an option in Gradle, and dependencies in Maven `pom.xml` files are usually given as exact versions.
For other dependencies that are not managed by a dependency manager -- for instance the Java virtual machine to use, the Python interpreter, or resource files -- the version to use should be clearly documented.
Alternatively, tools like nixpkgs can be used to pin every dependency.
The worst case we encountered during our experiment was a tool whose documentation instructed to install the z3 dependency with a simple `git clone`, without specifying the commit to use.
Since the z3 project is still actively maintained, the dependency installed this way was not compatible, and finding a compatible version required checking releases one by one.
Dependencies fetched with a version control system should always indicate the exact version to use (in the case of git, a commit, tag or release should be pinned).
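
For dependencies that escape the package manager, a fail-fast check at start-up at least makes version drift visible; the snippet below is only a sketch, and the package names and versions are examples rather than the actual dependencies of any of the evaluated tools.

```python
from importlib.metadata import version

# Example pins only; a real project would generate these from its lock file.
PINNED = {"androguard": "3.3.5", "networkx": "2.6.3"}

def check_pins(pinned: dict[str, str]) -> None:
    """Abort with an explicit message when an installed dependency has drifted."""
    for package, wanted in pinned.items():
        installed = version(package)
        if installed != wanted:
            raise RuntimeError(f"{package} {installed} is installed, {wanted} is expected")
```
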
We also found that interactions with the running environment can become very problematic when the environment changes.
To minimize these issues, packaging the tool inside a Docker container or even a virtual machine can ensure that future users at least have access to a working version of the tool.

Finally, when possible, continuous integration, tests and code reviews should be implemented to improve the reliability of the developed tool.
Concerning the actual code of the tool, more attention should be paid to error reporting.
When a tool fails to perform its analysis, it should be clear to the user, and the reason should be clearly reported.
In some cases, this may imply _not_ trying to recover from unrecoverable errors: this often leads to errors seemingly unrelated to the initial issue.
This is often a problem in Java code, where developers are strongly encouraged to catch all exceptions, and in bash scripts that run several programs in a row without checking the exit statuses.
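
A minimal sketch of this principle is given below in Python (the second analysis command is hypothetical); the same idea applies to Java's catch-all exception handlers and to shell scripts run with unchecked exit statuses.

```python
import subprocess
import sys

def analyze(apk: str) -> None:
    # check=True makes each step raise on a non-zero exit status instead of
    # silently moving on to the next step, as an unchecked shell script would.
    subprocess.run(["apktool", "d", apk, "-o", "unpacked"], check=True)
    subprocess.run(["my-analysis", "unpacked"], check=True)  # hypothetical second step

if __name__ == "__main__":
    try:
        analyze(sys.argv[1])
    except subprocess.CalledProcessError as error:
        # Report the failing step and exit with a non-zero status instead of
        # catching everything and pretending the analysis finished.
        sys.exit(f"analysis failed at {error.cmd} (exit code {error.returncode})")
```
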
Good error reporting can allow future users to solve issues encountered when using the tools: for instance, the log generated by Androguard's decompiler clearly shows that the issue is file names exceeding the size limit.
This issue could easily be fixed by changing the file names used to store the results.
In contrast, the errors generated by Flowdroid are so opaque that we have no idea how we could solve them.
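
For the Androguard case, the kind of fix we have in mind is simply to derive a shorter name when a decompiled class name exceeds the file system limit; the snippet below is a sketch under that assumption, not a patch taken from Androguard.

```python
import hashlib

MAX_COMPONENT = 255  # common per-file-name limit, in bytes, on Linux file systems

def safe_filename(class_name: str, ext: str = ".java") -> str:
    """Shorten over-long output names while keeping them unique and stable."""
    name = class_name + ext
    if len(name.encode()) <= MAX_COMPONENT:
        return name
    digest = hashlib.sha256(class_name.encode()).hexdigest()[:16]
    return class_name[:64] + "~" + digest + ext  # readable prefix plus a stable hash
```
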
Lastly, an important remark concerns the libraries used by a tool.
We have seen two types of libraries:
- internal libraries manipulating internal data of the tool.
- external libraries that are used to manipulate the input data (APKs, bytecode, resources).
We observed during our manual investigations that external libraries are the ones leading to crashes because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files).
We believe that the developer should provide enough documentation to make a later upgrade of these external libraries possible.
For example, old versions of apktool are the libraries raising the most errors, but the breaking changes introduced between the v1.X and v2.X versions prevented us from upgrading apktool.
3_rasta/7_limitations.typ (new file, 16 lines added)
@@ -0,0 +1,16 @@
== Limitations <sec:rasta-limit>

Some limitations of our approach should be kept in mind.

Our application dataset is biased in favor of Androguard, because Androzoo already uses Androguard internally when collecting applications and discards any application that cannot be processed with this tool.

Despite our best efforts, it is possible that we made mistakes when building or using the tools.
It is also possible that we wrongly classified a result as a failure.
To mitigate this possible problem, we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion.
Before running the final experiment, we also ran the tools on a subset of our dataset and manually examined the most common errors to ensure that they were not trivial errors that could have been solved.

The timeout value and the amount of memory are arbitrarily fixed.
To mitigate this issue, a small extract of our dataset has been analyzed with more memory and time, and we checked that there was no significant difference in the results.
Finally, the use of VirusTotal for determining whether an application is malware may be wrong.
To limit the impact of errors, we used a threshold of at least 5 antiviruses (resp. none) reporting an application as malware (resp. goodware) for taking a decision about maliciousness (resp. benignness).
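
As a minimal sketch of this labeling rule (assuming the thresholds are read as a minimum of 5 detections for malware and zero detections for goodware):

```python
from typing import Optional

def label(vt_detections: int) -> Optional[str]:
    """Label an application from its VirusTotal detection count."""
    if vt_detections == 0:
        return "goodware"
    if vt_detections >= 5:  # assumed direction of the threshold used in this work
        return "malware"
    return None             # applications in between are kept out of both sets
```
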
@@ -28,5 +28,8 @@
#include("1_intro.typ")
#include("2_methodology.typ")
#include("3_experiments.typ")
#include("4_discussion.typ")
#include("5_conclusion.typ")
#include("4_failures_analysis.typ")
#include("5_soa_comp.typ")
#include("6_recommendations.typ")
#include("7_limitations.typ")
#include("8_conclusion.typ")