I declare this manuscript finished
commit 5c3a6955bd
parent 9f39ded209
14 changed files with 162 additions and 131 deletions
@@ -341,7 +341,7 @@ Two datasets are used in the experiments of this section.
 The first one is *Drebin*~@Arp2014, from which we extracted the malware part (5479 samples that we could retrieve) for comparison purposes only.
 It is a well-known but very old dataset that should no longer be used because it contains temporal and spatial biases~@Pendlebury2018.
 We intend to compare the success rate on this old dataset with that on a more recent one.
-The second one, *Rasta*, we built to cover all dates between 2010 and 2023.
+The second one, *RASTA* (Reusability of Android Static Tools and Analysis), we built to cover all dates between 2010 and 2023.
 This dataset is a random extract of AndroZoo~@allixAndroZooCollectingMillions2016, in which we balanced applications across years and sizes.
 For each year and each inter-decile size range in AndroZoo, 500 applications were extracted, with an arbitrary proportion of 7% of malware.
 This ratio was chosen because it is the goodware/malware ratio we observed when performing a raw extract of AndroZoo.
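
To make the sampling procedure concrete, here is a minimal sketch of the stratified extraction described above. It assumes the AndroZoo index has been loaded as records with `year`, `apk_size` and `is_malware` fields; these names, and the `deciles` boundaries passed in, are illustrative, not the actual pipeline.

```python
import random
from collections import defaultdict

SAMPLES_PER_STRATUM = 500
MALWARE_RATIO = 0.07  # ratio observed in a raw extract of AndroZoo

def stratified_sample(apps, deciles):
    """Draw 500 applications per (year, inter-decile size range) stratum,
    with 7% malware and 93% goodware in each draw."""
    strata = defaultdict(lambda: {"good": [], "mal": []})
    for app in apps:
        # index of the inter-decile size range this application falls into
        bucket = sum(1 for d in deciles if app["apk_size"] > d)
        pool = "mal" if app["is_malware"] else "good"
        strata[(app["year"], bucket)][pool].append(app)
    n_mal = round(SAMPLES_PER_STRATUM * MALWARE_RATIO)  # 35 malware
    n_good = SAMPLES_PER_STRATUM - n_mal                # 465 goodware
    sample = []
    for pools in strata.values():
        sample += random.sample(pools["mal"], min(n_mal, len(pools["mal"])))
        sample += random.sample(pools["good"], min(n_good, len(pools["good"])))
    return sample
```
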
@@ -4,10 +4,8 @@
 == Experiments <sec:rasta-xp>
 
-
 === #rq1: Re-Usability Evaluation
 
-
 #figure(
   image(
     "figs/exit-status-for-the-drebin-dataset.svg",
@@ -71,10 +69,10 @@
     wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
     "
   ),
-  caption: [Exit status for the Rasta dataset],
+  caption: [Exit status for the RASTA dataset],
 ) <fig:rasta-exit>
 
-@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and Rasta datasets.
+@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and RASTA datasets.
 They represent the success/failure rate (green/orange) of the tools.
 We distinguish failures to compute a result from timeouts (blue) and crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
 Because they may be caused by a bug in our own analysis stack, exit statuses represented in grey (Other) are considered unknown errors rather than failures of the tool.
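
The classification itself is simple; the sketch below shows the decision order, using hypothetical field names rather than the exact instrumentation of our runner.

```python
def classify_run(exit_code: int, timed_out: bool, container_killed: bool) -> str:
    """Bucket one tool execution into the four statuses plotted above."""
    if container_killed:     # grey: our framework crashed (e.g. OOM kill),
        return "other"       # counted as unknown, not as a tool failure
    if timed_out:            # blue
        return "timeout"
    if exit_code == 0:       # green
        return "finished"
    return "failed"          # orange
```
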
@@ -84,8 +82,8 @@ Results on the Drebin datasets show that 11 tools have a high success rate (grea
 The other tools have poor results.
 The worst, excluding Lotrack and Tresher, is Anadroid, with a success ratio under 20%.
 
-On the Rasta dataset, we observe a global increase in the number of failed status: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
-The tools that have bad results with Drebin are, of course, bad results on Rasta.
+On the RASTA dataset, we observe a global increase in the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
+The tools that have bad results on Drebin also have, unsurprisingly, bad results on RASTA.
 Three tools (androguard_dad, blueseal, saaf) that were performing well (higher than 85%) on Drebin surprisingly fall below the 50% success bar.
 Seven tools keep a high success rate: Adagio, Amandroid, Androguard, Apparecium, Gator, Mallodroid, Redexer.
 Regarding IC3, the fork with a simpler build process and support for modern OSes has a lower success rate than the original tool.
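
These per-tool finishing rates can be recomputed from the results database with a query along the following lines; the `runs` table and its columns are an assumption about the schema, not the actual one.

```python
import sqlite3

conn = sqlite3.connect("rasta-results.db")  # hypothetical database file
# Assumed: one row per (tool, apk) execution, with a textual status column
# holding one of finished / failed / timeout / other.
for tool, pct in conn.execute("""
        SELECT tool,
               100.0 * SUM(status = 'finished') / COUNT(*) AS success_pct
        FROM runs
        GROUP BY tool
        ORDER BY success_pct DESC"""):
    print(f"{tool:20} {pct:5.1f}%")
```
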
@@ -135,7 +133,7 @@ For the tools that we could run, #resultratio of analyses are finishing successf
       supplement: none,
       kind: "sub-rasta-exit-evolution"
     ) <fig:rasta-exit-evolution-not-java>]
-  ), caption: [Exit status evolution for the Rasta dataset]
+  ), caption: [Exit status evolution for the RASTA dataset]
 ) <fig:rasta-exit-evolution>
 
 To investigate the effect of application dates on the tools, we computed the date of each #APK as the minimum of its first upload date in AndroZoo and its first analysis date in VirusTotal.
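
In other words, per APK (field names are illustrative; VirusTotal may not know an APK at all):

```python
from datetime import date
from typing import Optional

def apk_date(androzoo_upload: date, vt_first_scan: Optional[date]) -> date:
    """Date assigned to an APK: the earliest moment it was seen publicly."""
    if vt_first_scan is None:
        return androzoo_upload
    return min(androzoo_upload, vt_first_scan)
```
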
@@ -293,7 +291,7 @@ The date is also correlated with the success rate for Java-based tools only.
   table.hline(),
   table.header(
     table.cell(colspan: 3/*4*/, inset: 3pt)[],
-    table.cell(rowspan:2)[*Rasta part*],
+    table.cell(rowspan:2)[*RASTA part*],
     table.vline(end: 3),
     table.vline(start: 4),
     table.cell(colspan:2)[*Average size* (MB)],
@@ -358,7 +356,7 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
     width: 100%,
     alt: "Bar chart showing the percentage of analysed APKs on the y-axis and the tools on the x-axis.
     Each tool has two bars, one for goodware and one for malware.
-    The goodware bars are the same as the one in the figure Exit status for the Rasta dataset.
+    The goodware bars are the same as the ones in the figure Exit status for the RASTA dataset.
     The timeout rate looks the same on both bars of each tool.
     The finishing rate of the malware bar is a lot higher than that of the goodware bar for androguard_dad, blueseal, didfail, iccta, perfchecker and wognsen_et_al.
     The finishing rate of the malware bar is higher than that of the goodware bar for ic3 and ic3_fork.
@@ -366,7 +364,7 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
     The other tools have similar finishing rates, slightly in favor of malware.
     "
   ),
-  caption: [Exit status comparing goodware (left bars) and malware (right bars) for the Rasta dataset],
+  caption: [Exit status comparing goodware (left bars) and malware (right bars) for the RASTA dataset],
 ) <fig:rasta-exit-goodmal>
 
 /*
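
The goodware/malware split of these bars follows the convention of the `sqlite` query shown in the hunk header above (an APK counts as goodware when `vt_detection == 0`). A sketch of the per-tool comparison, again assuming the hypothetical `runs` table:

```python
import sqlite3

conn = sqlite3.connect("rasta-results.db")  # hypothetical database file
# Finishing rate per tool, split between goodware and malware, by joining
# the assumed runs table with the apk metadata table.
for tool, is_good, pct in conn.execute("""
        SELECT r.tool,
               apk.vt_detection == 0 AS is_goodware,
               100.0 * SUM(r.status = 'finished') / COUNT(*) AS finish_pct
        FROM runs AS r JOIN apk ON apk.sha256 = r.sha256
        GROUP BY r.tool, is_goodware"""):
    print(f"{tool:20} {'goodware' if is_good else 'malware':8} {pct:5.1f}%")
```
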
@@ -137,7 +137,7 @@ Therefore, we investigated the nature of errors globally, without distinction be
     width: 100%,
     alt: "",
   ),
-  caption: [Heatmap of the ratio of error reasons for all tools for the Rasta dataset],
+  caption: [Heatmap of the ratio of error reasons for all tools for the RASTA dataset],
 ) <fig:rasta-heatmap>
 
 @fig:rasta-heatmap shows the most frequent error objects for each of the tools.
@@ -148,7 +148,7 @@ First, the heatmap helps us to confirm that our experiment is running in adequat
 Regarding errors linked to memory, two errors should be investigated: `OutOfMemoryError` and `StackOverflowError`.
 The first one only appears for Gator, with a low ratio.
 Several tools have a low ratio of errors concerning the stack.
-These results confirm that the allocated heap and stack are sufficient for running the tools with the Rasta dataset.
+These results confirm that the allocated heap and stack are sufficient for running the tools with the RASTA dataset.
 Regarding errors linked to disk space, we observe small ratios for the exceptions `IOException`, `FileNotFoundError` and `FileNotFoundException`.
 Manual inspections revealed that those errors are often a consequence of a failed Apktool execution.
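
The error objects behind the heatmap can be recovered from the captured logs by scraping exception class names out of each tool's stderr; a sketch follows (the real pipeline may classify errors differently).

```python
import re
from collections import Counter

# Matches Java-style class names (java.lang.OutOfMemoryError, possibly
# package-qualified) as well as Python-style ones (FileNotFoundError).
EXC_RE = re.compile(r"\b(?:[a-z_]\w*\.)*([A-Z]\w*(?:Error|Exception))\b")

def error_reasons(stderr_text: str) -> Counter:
    """Count the exception classes mentioned in one tool's stderr."""
    return Counter(EXC_RE.findall(stderr_text))
```
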
@@ -10,10 +10,10 @@ In this section, we will compare our results with the contributions presented in
 Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022, a real-world benchmark, together with recommendations for building such benchmarks.
 These benchmarks confirmed that some tools, such as Amandroid and Flowdroid, are less efficient on real-world applications.
 We confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analyses than handcrafted test applications or old datasets~@luoTaintBenchAutomaticRealworld2022.
-In addition, even if Drebin is not hand-crafted, it is quite old and seems to present similar issues as handcrafted datasets when used to evaluate a tool: we obtained really good results compared to the Rasta dataset -- which is more representative of real-world applications.
+In addition, even if Drebin is not handcrafted, it is quite old and seems to present similar issues to handcrafted datasets when used to evaluate a tool: we obtained far better results on it than on the RASTA dataset -- which is more representative of real-world applications.
 
 Our findings are also consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of the DIALDroid-Bench~@bosuCollusiveDataLeak2017 real-world applications are analysed successfully by the 6 evaluated tools~@pauckAndroidTaintAnalysis2018.
-Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
+Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools, but using the RASTA dataset of #NBTOTALSTRING applications.
 We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
 We confirmed that most tools require a significant amount of work to get them running~@reaves_droid_2016.
 Our investigations of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, as already identified by Pauck #etal.