I declare this manuscript finished

Jean-Marie Mineau 2025-10-07 17:16:32 +02:00
parent 9f39ded209
commit 5c3a6955bd
Signed by: histausse
GPG key ID: B66AEEDA9B645AD2
14 changed files with 162 additions and 131 deletions

@@ -341,7 +341,7 @@ Two datasets are used in the experiments of this section.
The first one is *Drebin*~@Arp2014, from which we extracted the malware part (5479 samples that we could retrieve) for comparison purposes only.
It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases~@Pendlebury2018.
We intend to compare the rate of success on this old dataset with a more recent one.
-The second one, *Rasta*, we built to cover all dates between 2010 and 2023.
+The second one, *RASTA* (Reusability of Android Static Tools and Analysis), we built to cover all dates between 2010 and 2023.
This dataset is a random extract of AndroZoo~@allixAndroZooCollectingMillions2016, for which we balanced applications across years and sizes.
For each year and inter-decile size range in AndroZoo, 500 applications have been extracted, with an arbitrary proportion of 7% malware.
This ratio has been chosen because it is the goodware/malware ratio we observed when performing a raw extract of AndroZoo.
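As a rough illustration, this extraction amounts to one stratified draw per (year, size-decile) pair. The Python below is a minimal sketch only; `candidates` and its field names (`year`, `size_decile`, `vt_detection`) are assumptions, not the actual AndroZoo schema.

    import random
    from collections import defaultdict

    PER_STRATUM = 500      # applications drawn per (year, size-decile) stratum
    MALWARE_RATIO = 0.07   # arbitrary ratio, observed in a raw AndroZoo extract

    def build_dataset(candidates):
        """Stratified draw over (year, size decile); `candidates` and its
        field names (`year`, `size_decile`, `vt_detection`) are hypothetical."""
        strata = defaultdict(list)
        for apk in candidates:
            strata[(apk["year"], apk["size_decile"])].append(apk)
        n_mal = round(PER_STRATUM * MALWARE_RATIO)  # 35 malware per stratum
        dataset = []
        for apks in strata.values():
            malware = [a for a in apks if a["vt_detection"] > 0]
            goodware = [a for a in apks if a["vt_detection"] == 0]
            dataset += random.sample(malware, n_mal)
            dataset += random.sample(goodware, PER_STRATUM - n_mal)
        return dataset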

@@ -4,10 +4,8 @@
== Experiments <sec:rasta-xp>
=== #rq1: Re-Usability Evaluation
#figure(
image(
"figs/exit-status-for-the-drebin-dataset.svg",
@@ -71,10 +69,10 @@
wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
"
),
-caption: [Exit status for the Rasta dataset],
+caption: [Exit status for the RASTA dataset],
) <fig:rasta-exit>
-@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and Rasta datasets.
+@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and RASTA datasets.
They represent the success/failure rate (green/orange) of the tools.
We distinguished failure to compute a result from timeout (blue) and crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
Because they may be caused by a bug in our own analysis stack, exit statuses represented in grey (Other) are treated as unknown errors rather than as failures of the tool.
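Summarised as a decision function, this classification might look as follows; a minimal sketch where the parameter and category names are ours, not the evaluation framework's actual code.

    def classify_run(exit_code: int, timed_out: bool, framework_crash: bool) -> str:
        """Map one analysis run to the chart's categories (names are ours)."""
        if framework_crash:   # e.g. the container itself was OOM-killed
            return "other"    # grey: unknown error, not blamed on the tool
        if timed_out:
            return "timeout"  # blue
        if exit_code == 0:
            return "finished" # green
        return "failed"       # orange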
@@ -84,8 +82,8 @@ Results on the Drebin datasets show that 11 tools have a high success rate (grea
The other tools have poor results.
The worst, excluding Lotrack and Tresher, is Anadroid, with a success ratio under 20%.
-On the Rasta dataset, we observe a global increase in the number of failed status: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
-The tools that have bad results with Drebin are, of course, bad results on Rasta.
+On the RASTA dataset, we observe an overall increase in the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
+The tools that have bad results with Drebin have, unsurprisingly, bad results on RASTA as well.
Three tools (androguard_dad, blueseal, saaf) that performed well on Drebin (higher than 85%) surprisingly fall below the 50% success bar.
Seven tools keep a high success rate: Adagio, Amandroid, Androguard, Apparecium, Gator, Mallodroid, Redexer.
Regarding IC3, the fork with a simpler build process and support for modern OSes has a lower success rate than the original tool.
@@ -135,7 +133,7 @@ For the tools that we could run, #resultratio of analyses are finishing successf
supplement: none,
kind: "sub-rasta-exit-evolution"
) <fig:rasta-exit-evolution-not-java>]
-), caption: [Exit status evolution for the Rasta dataset]
+), caption: [Exit status evolution for the RASTA dataset]
) <fig:rasta-exit-evolution>
To investigate the effect of application dates on the tools, we computed the date of each #APK as the minimum of the date of its first upload to AndroZoo and the date of its first analysis by VirusTotal.
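In code, this dating rule is a single minimum over the two timestamps; the sketch below is illustrative, with assumed parameter names.

    from datetime import date
    from typing import Optional

    def apk_date(androzoo_upload: date, vt_first_analysis: Optional[date]) -> date:
        """Earliest of the AndroZoo first upload and the VirusTotal first
        analysis; parameter names are illustrative."""
        if vt_first_analysis is None:  # never submitted to VirusTotal
            return androzoo_upload
        return min(androzoo_upload, vt_first_analysis)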
@@ -293,7 +291,7 @@ The date is also correlated with the success rate for Java-based tools only.
table.hline(),
table.header(
table.cell(colspan: 3/*4*/, inset: 3pt)[],
-table.cell(rowspan:2)[*Rasta part*],
+table.cell(rowspan:2)[*RASTA part*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average size* (MB)],
@@ -358,7 +356,7 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Each tools has two bars, one for goodware an one for malware.
The goodware bars are the same as the one in the figure Exit status for the Rasta dataset.
The goodware bars are the same as the one in the figure Exit status for the RASTA dataset.
The timeout rate looks the same on both bar of each tools.
The finishing rate of the malware bar is a lot higher than in the goodware bar for androguard_dad, blueseal, didfail, iccta, perfchecker and wogsen_et_al.
The finishing rate of the malware bar is higher than in the goodware bar for ic3 and ic3_fork.
@@ -366,7 +364,7 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
The other tools have similar finishing rates, slightly in favor of malware.
"
),
-caption: [Exit status comparing goodware (left bars) and malware (right bars) for the Rasta dataset],
+caption: [Exit status comparing goodware (left bars) and malware (right bars) for the RASTA dataset],
) <fig:rasta-exit-goodmal>
/*

@@ -137,7 +137,7 @@ Therefore, we investigated the nature of errors globally, without distinction be
width: 100%,
alt: "",
),
-caption: [Heatmap of the ratio of error reasons for all tools for the Rasta dataset],
+caption: [Heatmap of the ratio of error reasons for all tools for the RASTA dataset],
) <fig:rasta-heatmap>
@fig:rasta-heatmap presents the most frequent error objects for each of the tools.
@@ -148,7 +148,7 @@ First, the heatmap helps us to confirm that our experiment is running in adequat
Regarding errors linked to memory, two errors should be investigated: `OutOfMemoryError` and `StackOverflowError`.
The first one only appears for Gator with a low ratio.
Several tools have a low ratio of errors concerning the stack.
-These results confirm that the allocated heap and stack are sufficient for running the tools with the Rasta dataset.
+These results confirm that the allocated heap and stack are sufficient for running the tools with the RASTA dataset.
Regarding errors linked to disk space, we observe small ratios for the exceptions `IOException`, `FileNotFoundError` and `FileNotFoundException`.
Manual inspections revealed that those errors are often a consequence of a failed Apktool execution.
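As background for the heap and stack observations above: for JVM-based tools, both ceilings are fixed when the JVM is launched, via `-Xmx` (heap) and `-Xss` (thread stack). A run might be wrapped as in the sketch below, where the flag values, jar name, and timeout are placeholders rather than the settings actually used in the experiments.

    import subprocess

    # -Xmx caps the JVM heap (OutOfMemoryError once exhausted) and
    # -Xss caps each thread's stack (StackOverflowError once exhausted).
    # Flag values, jar name and timeout are placeholders, not the study's settings.
    try:
        run = subprocess.run(
            ["java", "-Xmx64g", "-Xss512m", "-jar", "tool.jar", "app.apk"],
            capture_output=True, text=True, timeout=3600,
        )
        status = "finished" if run.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"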

@@ -10,10 +10,10 @@ In this section, we will compare our results with the contributions presented in
Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world benchmark and the associated recommendations to build such a benchmark.
These benchmarks confirmed that some tools, such as Amandroid and Flowdroid, are less efficient on real-world applications.
We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than using handcrafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
-In addition, even if Drebin is not hand-crafted, it is quite old and seems to present similar issues as handcrafted datasets when used to evaluate a tool: we obtained really good results compared to the Rasta dataset -- which is more representative of real-world applications.
+In addition, even if Drebin is not handcrafted, it is quite old and seems to present issues similar to those of handcrafted datasets when used to evaluate a tool: we obtained very good results compared to the RASTA dataset -- which is more representative of real-world applications.
Our findings are also consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analysed successfully with the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
-Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
+Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the RASTA dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.
Our investigations of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, already identified by Pauck #etal.