#import "@local/template-thesis-matisse:0.0.1": todo, highlight
|
||||
#import "X_var.typ": *
|
||||
#import "X_lib.typ": *
|
||||
|
||||
== Experiments <sec:rasta-xp>
|
||||
|
||||
|
||||
=== *RQ1*: Re-Usability Evaluation
|
||||
|
||||
|
||||
#todo[alt text for figure rasta-exit / rasta-exit-drebin]
#figure(
image("figs/exit-status-for-the-drebin-dataset.svg", width: 80%),
caption: [Exit status for the Drebin dataset],
) <fig:rasta-exit-drebin>

#figure(
image("figs/exit-status-for-the-rasta-dataset.svg", width: 80%),
caption: [Exit status for the Rasta dataset],
) <fig:rasta-exit>


@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and Rasta datasets.
They represent the success/failure rate (green/orange) of the tools.
We distinguished failures of the tool to compute a result from timeouts (blue) and from crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
Because they may be caused by a bug in our own analysis stack, the exit statuses represented in grey (Other) are considered unknown errors and not failures of the tool.
#todo[We discuss further errors for which we have information in the logs in Section/*@sec:rasta-failure-analysis*/.]

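These rates are computed from the exit statuses recorded in our execution database. As a minimal sketch, and assuming a `tool_name` column alongside the `sha256` and `tool_status` columns of our `exec` table (the exact column name is an assumption), the per-tool breakdown can be obtained with:

```
-- Minimal sketch (tool_name is an assumed column name): share of each
-- exit status per tool, as plotted in the figures above.
SELECT tool_name, tool_status,
       COUNT(*) * 100.0 / (SELECT COUNT(*) FROM exec AS e2
                           WHERE e2.tool_name = exec.tool_name) AS percent
FROM exec
GROUP BY tool_name, tool_status
ORDER BY tool_name, percent DESC;
```
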
Results on the Drebin dataset show that 11 tools have a high success rate (greater than 85%).
The other tools have poor results.
The worst, excluding Lotrack and Tresher, is Anadroid, with a success ratio under 20%.

On the Rasta dataset, we observe a global increase in the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
The tools that have bad results on Drebin unsurprisingly also have bad results on Rasta.
Three tools (androguard_dad, blueseal, saaf) that were performing well (higher than 85%) on Drebin surprisingly fall below the bar of 50% success.
Seven tools keep a high success rate: Adagio, Amandroid, Androguard, Apparecium, Gator, Mallodroid, Redexer.
Regarding IC3, the fork with a simpler build process and support for modern OSes has a lower success rate than the original tool.

Two tools should be discussed in particular.
//Androguard and Flowdroid have a large community of users, as shown by the numbers of GitHub stars in Table~\ref{tab:sources}.
Androguard has a high success rate, which is not surprising: it is used by many tools, including for analyzing applications uploaded to the AndroZoo repository.
//Because of that, it should be noted that our dataset is biased in favour of Androguard. // Already in discution
Nevertheless, when using the Androguard decompiler (DAD) to decompile an APK, it fails more than 50% of the time.
This example shows that even a tool that is frequently used can still run into critical failures.
Concerning Flowdroid, our results show a very low timeout rate (#mypercent(37, NBTOTAL)), which was unexpected: in our exchanges, Flowdroid's authors were expecting a higher rate of timeouts and fewer crashes.

In summary, the final ratio of successful analyses for the tools that we could run
// and applications of Rasta dataset
is #mypercent(54.9, 100). When including the two defective tools, this ratio drops to #mypercent(49.9, 100).

#highlight()[
*RQ1 answer:*
On a recent dataset, we consider that #resultunusable of the tools
are unusable. For the tools that we could run, #resultratio of the analyses finish successfully. //(those with less than 50% of successful executions and including the two tools that we were unable to build)
]

/*
== RQ2: temporal evolution

#todo[alt text for fig rasta-exit-evolution-java and rasta-exit-evolution-not-java]

#figure(stack(dir: ltr,
[#figure(
image(
"figs/finishing-rate-by-year-of-java-based-tools.svg",
width: 48%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-java>],
[#figure(
image(
"figs/finishing-rate-by-year-of-non-java-based-tools.svg",
width: 48%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-not-java>]
), caption: [Exit status evolution for the Rasta dataset]
)

To investigate the effect of application dates on the tools, we computed the date of each APK as the minimum of the date of its first upload to AndroZoo and the date of its first analysis by VirusTotal.
Such a computation is more reliable than using the dex date, which is often obfuscated when packaging the application.
Then, for the sake of clarity of our results, we separated the tools that have mainly Java source code from those that use other languages.
Among the Java based tools, most use the Soot framework, which may correlate their results. @fig:rasta-exit-evolution-java (resp. @fig:rasta-exit-evolution-not-java) compares the success rate of the tools between 2010 and 2023 for Java based tools (resp. non Java based tools).
For Java based tools, a clear decrease of the finishing rate can be observed globally for all tools.
For non-Java based tools, two of them keep a high success rate (Androguard, Mallodroid).
The result is expected for Androguard, because the analysis is relatively simple and the tool is largely adopted, as previously mentioned.
Mallodroid being a relatively simple script leveraging Androguard, it benefits from Androguard's resilience.
It should be noted that Saaf keeps a high success ratio until 2014 and then quickly decreases to less than 20% after 2014. This example shows that, even with identical source code and the same running platform, a tool's behavior can change over time because of the evolution of the structure of the input files.

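A possible sqlite formulation of this date derivation, as a sketch (the `added_azoo` and `first_seen_vt` column names are assumptions; only `first_seen_year` is used by the query below):

```
-- Sketch: derive first_seen_year from the earlier of the AndroZoo upload
-- date and the first VirusTotal analysis date (column names are assumed).
UPDATE apk
SET first_seen_year = CAST(strftime('%Y', MIN(added_azoo, first_seen_vt)) AS INTEGER);
```
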
An interesting comparison is the specific case of Ic3 and Ic3_fork. Until 2019, their success rates are very similar. After 2020, Ic3_fork continues to decrease whereas Ic3 keeps a success rate of around 60%.

/*
```
SELECT apk1.first_seen_year, (COUNT(*) * 100) / (SELECT 20 * COUNT(*)
    FROM apk AS apk2 WHERE apk2.first_seen_year = apk1.first_seen_year)
FROM exec JOIN apk AS apk1 ON exec.sha256 = apk1.sha256
WHERE exec.tool_status = 'FINISHED' OR exec.tool_status = 'UNKNOWN'
GROUP BY apk1.first_seen_year ORDER BY apk1.first_seen_year;

2010|78
2011|78
2012|76
2013|70
2014|66
2015|61
2016|57
2017|54
2018|49
2019|47
2020|45
2021|42
2022|40
2023|39
```
*/

#highlight()[
*RQ2 answer:* For the #nbtoolsselected tools that can be used partially, a global decrease of the success rate of the tools' analyses is observed over time.
Starting at a 78% success rate, tools drop to 61% of success after five years, and to 45% after ten years.
]
*/


=== RQ2: Size, SDK and Date Influence

To measure the influence of the date, SDK version and size of applications, we fixed one parameter while varying another.
For the sake of clarity, we separated Java based / non Java based tools.

#todo[Alt text for fig rasta-decorelation-size]
#figure(stack(dir: ltr,
[#figure(
image(
"figs/decorelation/finishing-rate-of-java-based-tool-by-bytecode-size-of-apks-detected-in-2022.svg",
width: 48%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-2022>],
[#figure(
image(
"figs/decorelation/finishing-rate-of-non-java-based-tool-by-bytecode-size-of-apks-detected-in-2022.svg",
width: 48%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-2022>]
), caption: [Finishing rate by bytecode size for APKs detected in 2022]
) <fig:rasta-decorelation-size>

_Fixed application year. (5000 APKs)_
We selected the year 2022, which has a good number of representatives for each decile of size in our application dataset.
@fig:rasta-rate-evolution-java-2022 (resp. @fig:rasta-rate-evolution-non-java-2022) shows the finishing rate of the tools as a function of the size of the bytecode for Java based tools (resp. non Java based tools) analyzing applications of 2022.
We can observe that all Java based tools have a finishing rate that decreases as the bytecode size grows. 50% of the non Java based tools show the same behavior.

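The finishing rate behind these curves can be sketched as a single aggregate query (the figures separate the tools, whereas this sketch aggregates the 20 executions per APK; `first_seen_year`, `dex_size_decile` and `tool_status` are the columns used in the other queries of this chapter):

```
-- Sketch: overall finishing rate per bytecode-size decile, year fixed to 2022.
SELECT apk.dex_size_decile,
       COUNT(*) * 100 / (20 * (SELECT COUNT(*) FROM apk AS a2
                               WHERE a2.first_seen_year = 2022
                                 AND a2.dex_size_decile = apk.dex_size_decile)) AS rate
FROM exec JOIN apk ON exec.sha256 = apk.sha256
WHERE apk.first_seen_year = 2022 AND exec.tool_status = 'FINISHED'
GROUP BY apk.dex_size_decile;
```
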
#todo[Alt text for fig rasta-decorelation-year]
#figure(stack(dir: ltr,
[#figure(
image(
"figs/decorelation/finishing-rate-of-java-based-tool-by-discovery-year-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 48%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-year>],
[#figure(
image(
"figs/decorelation/finishing-rate-of-non-java-based-tool-by-discovery-year-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 48%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-year>]
), caption: [Finishing rate by discovery year with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-year>

_Fixed application bytecode size. (6252 APKs)_ We selected the sixth decile (between 4.08 and 5.20 MB), which is well represented across a wide range of years.
@fig:rasta-rate-evolution-java-decile-year (resp. @fig:rasta-rate-evolution-non-java-decile-year) represents the finishing rate depending on the year at a fixed bytecode size.
We observe that 9 of the 12 Java based tools have a finishing rate dropping below 20%, which is not the case for non Java based tools.

#todo[Alt text for fig rasta-decorelation-min-sdk]
#figure(stack(dir: ltr,
[#figure(
image(
"figs/decorelation/finishing-rate-of-java-based-tool-by-min-sdk-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 48%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-min-sdk>],
[#figure(
image(
"figs/decorelation/finishing-rate-of-non-java-based-tool-by-min-sdk-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 48%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-min-sdk>]
), caption: [Finishing rate by min SDK with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-min-sdk>

We performed similar experiments by varying the min SDK and target SDK versions, still with a fixed bytecode size between 4.08 and 5.2 MB, as shown in @fig:rasta-rate-evolution-java-decile-min-sdk and @fig:rasta-rate-evolution-non-java-decile-min-sdk.
We found that, contrary to the target SDK, the min SDK version has an impact on the finishing rate of Java based tools: 8 of the 12 tools are below 50% after SDK 16.
This is not surprising, as the min SDK is highly correlated with the year.

#highlight()[
*RQ2 answer:*
The success rate varies based on the size of the bytecode and the SDK version.
The date is also correlated with the success rate, but for Java based tools only.
]


=== RQ3: Malware vs Goodware

/*
```
sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256 WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
0|2971 % malware
1|60455 % goodware
sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
0|243
1|6009
```
```
>>> 61.13168724279835
0.4969812257050699
>>> 60455/6009/20 * 100
50.30371110001665
```

           rate goodware  rate malware  avg size goodware (MB)  avg size malware (MB)
decile 1:  85.42          82.02         0.13                    0.11
decile 2:  74.46          72.34         0.54                    0.55
decile 3:  63.38          65.67         1.37                    1.25
decile 4:  57.21          62.31         2.41                    2.34
decile 5:  53.36          59.27         3.56                    3.55
decile 6:  50.3           61.13         4.61                    4.56
decile 7:  46.76          56.54         5.87                    5.91
decile 8:  42.57          56.23         7.64                    7.63
decile 9:  39.09          57.94         11.39                   11.26
decile 10: 33.34          45.86         24.24                   21.36
total:     54.28          64.82         6.29                    4.14
*/


/*
#todo[Alt text for rasta-exit-goodmal]
#figure(
image(
"figs/exit-status-for-the-rasta-dataset-goodware-malware.svg",
width: 80%,
alt: "",
),
caption: [Exit status comparing goodware and malware for the Rasta dataset],
) <fig:rasta-exit-goodmal>
*/


/*
[15:25] Jean-Marie Mineau

average of the total dex size: 6464228.10027989

[15:26] Jean-Marie Mineau

(all combined)

[15:26] Jean-Marie Mineau

goodware: 6598464.94224066

malware: 4337376.97252155

```
sqlite> SELECT AVG(apk_size) FROM apk;
16918107.6526989
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
16897989.4472311
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
17236860.8903556
```
*/


/*
#figure({
show table: set text(size: 0.80em)
table(
columns: 4,
inset: (x: 0% + 5pt, y: 0% + 2pt),
stroke: none,
align: center+horizon,
table.hline(),
table.header(
table.cell(colspan: 4, inset: 3pt)[],
table.cell(rowspan:2)[*Rasta part*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average size*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(rowspan:2)[*Average date*],
[*APK*],
[*DEX*],
),
table.cell(colspan: 4, inset: 3pt)[],
table.hline(),
table.cell(colspan: 4, inset: 3pt)[],

[*goodware*], num(16897989), num(6598464), [2017],
[*malware*], num(17236860), num(4337376), [2017],
[*total*], num(16918107), num(6464228), [2017],

table.cell(colspan: 4, inset: 3pt)[],
table.hline(),
)},
caption: [Average size and date of goodware/malware parts of the Rasta dataset],
) <tab:rasta-sizes>
*/


#figure({
show table: set text(size: 0.80em)
table(
columns: 7,
inset: (x: 0% + 5pt, y: 0% + 2pt),
stroke: none,
align: center+horizon,
table.hline(),
table.header(
table.cell(colspan: 7, inset: 3pt)[],
table.cell(rowspan: 2)[*Decile*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average DEX size (MB)*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Finishing Rate: FR*],
table.vline(end: 3),
table.vline(start: 4),
[*Ratio Size*],
table.vline(end: 3),
table.vline(start: 4),
[*Ratio FR*],
[Good], [Mal],
[Good], [Mal],
[Good/Mal], [Good/Mal],
),
table.cell(colspan: 7, inset: 3pt)[],
table.hline(),
table.cell(colspan: 7, inset: 3pt)[],

num(1), num(0.13), num(0.11), num(0.85), num(0.82), num(1.17), num(1.04),
num(2), num(0.54), num(0.55), num(0.74), num(0.72), num(0.97), num(1.03),
num(3), num(1.37), num(1.25), num(0.63), num(0.66), num(1.09), num(0.97),
num(4), num(2.41), num(2.34), num(0.57), num(0.62), num(1.03), num(0.92),
num(5), num(3.56), num(3.55), num(0.53), num(0.59), num(1.00), num(0.90),
num(6), num(4.61), num(4.56), num(0.50), num(0.61), num(1.01), num(0.82),
num(7), num(5.87), num(5.91), num(0.47), num(0.57), num(0.99), num(0.83),
num(8), num(7.64), num(7.63), num(0.43), num(0.56), num(1.00), num(0.76),
num(9), num(11.39), num(11.26), num(0.39), num(0.58), num(1.01), num(0.67),
num(10), num(24.24), num(21.36), num(0.33), num(0.46), num(1.13), num(0.73),

table.cell(colspan: 7, inset: 3pt)[],
table.hline(),
)},
caption: [DEX size and Finishing Rate (FR) per decile],
) <tab:rasta-sizes-decile>

We compared the finishing rate of malware and goodware applications for the evaluated tools.
Because the size of applications impacts this finishing rate, it is interesting to compare the success rate for each decile of bytecode size.
@tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of size.
We also computed the ratio of the bytecode size and of the finishing rate between the two populations.
We observe that the ratio for the finishing rate decreases from 1.04 to 0.73, while the ratio of the bytecode size stays around 1.
We conclude from this table that analyzing malware triggers fewer errors than analyzing goodware.

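The rates of @tab:rasta-sizes-decile can be obtained with queries of the following shape (a sketch: `vt_detection = 0` identifies goodware, and the counts of finished executions are divided by the 20 executions per APK of each population):

```
-- Sketch: finished executions per size decile, split goodware/malware.
SELECT apk.dex_size_decile, apk.vt_detection = 0 AS is_goodware, COUNT(*)
FROM exec JOIN apk ON exec.sha256 = apk.sha256
WHERE exec.tool_status = 'FINISHED'
GROUP BY apk.dex_size_decile, is_goodware;
-- Denominator: 20 executions times the population of each decile.
SELECT dex_size_decile, vt_detection = 0 AS is_goodware, COUNT(DISTINCT sha256)
FROM apk
GROUP BY dex_size_decile, is_goodware;
```
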
#highlight()[
*RQ3 answer:*
Analyzing malware applications triggers fewer errors for static analysis tools than analyzing goodware of comparable bytecode size.
]