wip

Jean-Marie Mineau 2025-08-17 23:35:07 +02:00
parent 25c79da4f9
commit 021ac36e73
Signed by: histausse
GPG key ID: B66AEEDA9B645AD2
15 changed files with 110 additions and 75 deletions


@@ -23,7 +23,7 @@ The observation of the success or failure of these analysis enables us to answer
/*
As a summary, the contributions of this paper are the following:
-- We provide containers with a compiled version of all studied analysis tools, which ensures the reproducibility of our experiments and an easy way to analyze applications for other researchers. Additionally, recipes for rebuilding such containers are provided.
+- We provide containers with a compiled version of all studied analysis tools, which ensures the reproducibility of our experiments and an easy way to analyse applications for other researchers. Additionally, recipes for rebuilding such containers are provided.
- We provide a recent dataset of #NBTOTALSTRING applications balanced over the time interval 2010-2023.
- We point out which static analysis tools from the SLR paper of Li #etal~@Li2017 can safely be used, and we show that #resultunusable of the evaluated tools are unusable (considering that a tool that fails more than 50% of the time is unusable). In total, the success rate of the tools we could run is #resultratio on our dataset.
- We discuss the effect of application features (date, size, SDK version, goodware/malware) on static analysis tools and the nature of the issues we found by studying statistics on the errors captured during our experiments.


@@ -176,7 +176,7 @@ We refer to this variant of usage as androguard_dad.
In a second step, we explored which source to select among the possible forks of each tool.
We reported some indicators about the explored forks and our decision about the selected one in @tab:rasta-sources.
For each source code repository (column "Origin"), we reported in @tab:rasta-sources the number of GitHub stars attributed by users, and we mentioned whether the project is still alive (#ok in the "Alive" column when a commit exists in the last two years).
-Then, we analyzed the fork tree of the project.
+Then, we analysed the fork tree of the project.
We recursively searched whether any forked repository contains a commit more recent than the last one on the branch mentioned in the documentation of the original repository.
If such a commit is found (the number of such commits is reported in the "Alive Forks Nb" column), we manually looked at the reasons behind this commit and considered whether we should prefer this more up-to-date repository over the original one (column "Alive Forks Usable").
As reported in @tab:rasta-sources, we excluded all forks, except for IC3, for which we selected the fork JordanSamhi/ic3, because they always contain experimental code with no guarantee of stability.
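As an illustration of this fork exploration, a minimal sketch using the GitHub REST API and the `requests` library is given below; it is not the script used for this study, `someorg/sometool` is a placeholder repository, and pagination, API rate limits and the specific branch named in each tool's documentation (the sketch simply looks at the default branch) are ignored.

```python
import requests

API = "https://api.github.com"

def latest_commit_date(full_name: str) -> str:
    """Return the ISO 8601 date of the most recent commit on the default branch."""
    r = requests.get(f"{API}/repos/{full_name}/commits", params={"per_page": 1})
    r.raise_for_status()
    return r.json()[0]["commit"]["committer"]["date"]

def forks_with_newer_commits(full_name: str, reference_date: str) -> list[str]:
    """Recursively collect forks that hold a commit newer than `reference_date`."""
    r = requests.get(f"{API}/repos/{full_name}/forks", params={"per_page": 100})
    r.raise_for_status()
    newer = []
    for fork in r.json():
        if latest_commit_date(fork["full_name"]) > reference_date:
            newer.append(fork["full_name"])
        # Forks of forks are inspected as well.
        newer.extend(forks_with_newer_commits(fork["full_name"], reference_date))
    return newer

# Placeholder repository: list every (transitive) fork that is more
# recent than the last commit of the original repository.
origin = "someorg/sometool"
print(forks_with_newer_commits(origin, latest_commit_date(origin)))
```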
@@ -185,7 +185,7 @@ For IC3, the fork seems promising: it has been updated to be usable on a recent
We decided to keep these two versions of the tool (IC3 and IC3_fork) to compare their results.
Then, we self-allocated a maximum of four days for each tool to successfully read and follow the documentation, compile the tool and obtain the expected result when executing an analysis of a #MWE.
-We sent an email to the authors of each tool to confirm that we used the most suitable version of the code, that the command line we used to analyze an application is the most suitable one and, in some cases, to request help to solve issues in the building process.
+We sent an email to the authors of each tool to confirm that we used the most suitable version of the code, that the command line we used to analyse an application is the most suitable one and, in some cases, to request help to solve issues in the building process.
We reported in @tab:rasta-sources the authors that answered our request and confirmed our decisions.
From this building phase, several observations can be made.


@@ -153,7 +153,7 @@ Regarding errors linked to the disk space, we observe few ratios for the excepti
Manual inspections revealed that those errors are often a consequence of a failed apktool execution.
Second, the black squares indicate frequent errors that need to be investigated separately.
-In the next subsection, we manually analyzed, when possible, the code that generates this high ratio of errors, and we give feedback about the possible causes and the difficulty of writing a bug fix.
+In the next subsection, we manually analysed, when possible, the code that generates this high ratio of errors, and we give feedback about the possible causes and the difficulty of writing a bug fix.
=== Tool by Tool Investigation // <sec:rasta-tool-by-tool-inv>
/*
@@ -211,7 +211,7 @@ Anadroid: DONE
*/
#paragraph[Androguard and Androguard_dad][
-Surprisingly, while Androguard almost never fails to analyze an APK, the internal decompiler of Androguard (DAD) fails more than half of the time.
+Surprisingly, while Androguard almost never fails to analyse an APK, the internal decompiler of Androguard (DAD) fails more than half of the time.
The analysis of the logs shows that the issue comes from the way the decompiled methods are stored: each method is stored in a file named after the method name and signature, and this file name can quickly exceed the size limit (255 characters on most file systems).
It should be noted that Androguard_dad rarely fails on the Drebin dataset.
This illustrates the importance of testing tools on real and up-to-date APKs: even poor handling of file names can influence an analysis.
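As a rough illustration (a hypothetical sketch, not the actual Androguard/DAD code; the class name, method name and descriptor below are made up), a file name built from a full method signature easily exceeds the 255-byte limit that most file systems place on a single path component:

```python
# Hypothetical decompiled method: the output file name is derived from the
# fully qualified class name, the method name and the method descriptor.
class_name = "com.example." + "VeryLongGeneratedClassName" * 8
method_name = "doSomething"
descriptor = "(Ljava/util/List;Ljava/util/Map;Ljava/lang/String;)Ljava/lang/Object;" * 3

# Everything ends up in a single path component once slashes are replaced.
file_name = f"{class_name}.{method_name}{descriptor}".replace("/", ".") + ".java"
print(len(file_name.encode()))  # far above the usual 255-byte limit

try:
    with open(file_name, "w") as out:
        out.write("// decompiled source would go here\n")
except OSError as err:
    # On ext4 and most other file systems this fails with ENAMETOOLONG.
    print(f"cannot create the output file: {err}")
```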


@@ -12,7 +12,7 @@ These benchmarks confirmed that some tools such as Amandroid and Flowdroid are l
We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than hand-crafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
In addition, even if Drebin is not hand-crafted, it is quite old and seems to present the same issue as hand-crafted datasets when used to evaluate a tool: we obtained very good results on it compared to the Rasta dataset, which is more representative of real-world applications.
-Our findings are also consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of the DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully with the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
+Our findings are also consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of the DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analysed successfully with the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.


@@ -10,7 +10,7 @@ To mitigate this possible problem we contacted the authors of the tools to confi
Before running the final experiment, we also ran the tools on a subset of our dataset and manually inspected the most common errors to ensure that they were not trivial errors that could easily be solved.
The timeout value and the amount of memory are arbitrarily fixed.
-To mitigate this issue, a small extract of our dataset has been analyzed with more memory and time, and we checked that there was no significant difference in the results.
+To mitigate this issue, a small extract of our dataset has been analysed with more memory and time, and we checked that there was no significant difference in the results.
Finally, the use of VirusTotal to determine whether an application is malware or not may lead to wrong labels.
To limit the impact of such errors, we required at least 5 antiviruses (resp. none) to report an application as malware (resp. as goodware) before deciding about its maliciousness (resp. benignness).
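A minimal sketch of such a labelling rule, assuming the intended thresholds are at least 5 flagging antiviruses for malware and none for goodware, and that `detections` holds the number of VirusTotal engines flagging the APK:

```python
from typing import Optional

def label(detections: int) -> Optional[str]:
    """Label an APK from its number of VirusTotal detections (sketch)."""
    if detections >= 5:      # flagged by at least 5 antiviruses -> malware
        return "malware"
    if detections == 0:      # flagged by no antivirus -> goodware
        return "goodware"
    return None              # ambiguous: left out of the goodware/malware split

assert label(0) == "goodware"
assert label(12) == "malware"
assert label(3) is None
```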