more malware vs goodware discution

This commit is contained in:
Jean-Marie Mineau 2025-08-14 00:33:01 +02:00
parent af1187f041
commit 02be146060
Signed by: histausse
GPG key ID: B66AEEDA9B645AD2

View file

@ -1,4 +1,4 @@
#import "../lib.typ": todo, highlight, num, paragraph, SDK, APK, DEX, FR
#import "../lib.typ": todo, highlight, num, paragraph, SDK, APK, DEX, FR, APKs
#import "X_var.typ": *
#import "X_lib.typ": *
@ -8,14 +8,69 @@
=== RQ1: Re-Usability Evaluation
#todo[alt text for figure rasta-exit / rasta-exit-drebin]
#figure(
image("figs/exit-status-for-the-drebin-dataset.svg", width: 100%),
image(
"figs/exit-status-for-the-drebin-dataset.svg",
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Horizontal blue dotted lines mark the 15%, 50% % and 85% values.
Each bar represent a tools, with the finished analysis in green at the bottom, the analysis that timed of in blue, then on top in red the analysis that failed. Their is a last color, grey, for the other category, only visible in the dialdroid bar representing 5% of the result.
The results are (approximately) as follow:
adagio: 100% finished
amandroid: less than 5% timed out, the rest finished
anadroid: 85% failed, less than 5% timed out, the rest finished
androguard: 100% finished
androguard_dad: 5% failled, the rest finished
apparecium: arround 1% failed, the rest finished
blueseal: less than 5 failed, a little more than 10% timed out, the rest (just under 85%) finished
dialdroid: a little more than 50% finished, less than 5% timed out, arround 5% are marked as other, the rest failled
didfail: 70% finished, the rest failed
droidsafe: 40% finihed, 45% timedout, 15% failed
flowdroid: 65% finished, the rest failed
gator: 100% finished
ic3: 99% finished, 1% failed
ic3_fork: 98% finishe, 2% failed
iccta: 60% finished, less than 5% timed out, the rest failed
mallodroid: 100% finished
perfchecker: 75% finished, the rest failed
redexer: 100% finished
saaf: 90% finished, 5% timed out, 5% failed,
wognsen_et_al: 75% finished, 1% failed, the rest timed out
"
),
caption: [Exit status for the Drebin dataset],
) <fig:rasta-exit-drebin>
#figure(
image("figs/exit-status-for-the-rasta-dataset.svg", width: 100%),
image(
"figs/exit-status-for-the-rasta-dataset.svg",
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Horizontal blue dotted lines mark the 15%, 50% % and 85% values.
Each bar represent a tools, with the finished analysis in green at the bottom, the analysis that timed of in blue, then on top in red the analysis that failed. Their is a last color, grey, for the other category, only visible in the dialdroid bar representing 10% of the result and in the blueseal bar, for 5% of the results.
The results are (approximately) as follow:
adagio: 100% finished
amandroid: less than 5% failed, 10% timed out, the rest finished
anadroid: 95% failed, 1% timed out, the rest finished
androguard: 100% finished
androguard_dad: a little more than 45% finished, the rest failed
apparecium: arround 5% failed, 1% timed out, the rest finished
blueseal: 20% finished, a 15% timed out, 5% are marked other, the rest failed
dialdroid: 35% finished, 1% timed out, 10 are marked other, the rest failed
didfail: 25% finished, less than 5% timed out, the rest failed
droidsafe: less than 10% finihed, 20% timedout, the rest failed
flowdroid: 55% finished, the rest failed
gator: a little more than 85% finished, 5% timed out, 10% failed
ic3: less than 80% finished, 5% timed out, the rest failed
ic3_fork: 60% finished, 5% times out, the rest failed
iccta: 30% finished, 10% timed out, the rest failed
mallodroid: 100% finished
perfchecker: 25% finished, less than 5% timed out, the rest failed
redexer: 90% finished, the rest failed
saaf: 40% finished, the rest failed,
wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
"
),
caption: [Exit status for the Rasta dataset],
) <fig:rasta-exit>
@ -218,75 +273,6 @@ The date is also correlated with the success rate for Java based tools only.
=== RQ3: Malware vs Goodware <sec:rasta-mal-vs-good>
#todo[complete @sec:rasta-mal-vs-good by commenting the new figures]
/*
```
sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256 WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
0|2971 % malware
1|60455 % goodware
sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
0|243
1|6009
```
```
>>> 61.13168724279835
0.4969812257050699
>>> 60455/6009/20 * 100
50.30371110001665
```
rate goodware rate malware avg size goodware (MB) avg size malware (MB)
decile 1: 85.42 82.02 0.13 0.11
decile 2: 74.46 72.34 0.54 0.55
decile 3: 63.38 65.67 1.37 1.25
decile 4: 57.21 62.31 2.41 2.34
decile 5: 53.36 59.27 3.56 3.55
decile 6: 50.3 61.13 4.61 4.56
decile 7: 46.76 56.54 5.87 5.91
decile 8: 42.57 56.23 7.64 7.63
decile 9: 39.09 57.94 11.39 11.26
decile 10: 33.34 45.86 24.24 21.36
total: 54.28 64.82 6.29 4.14
*/
#todo[Alt text for rasta-exit-goodmal]
#figure(
image(
"figs/exit-status-for-the-rasta-dataset-goodware-malware.svg",
width: 100%,
alt: "",
),
caption: [Exit status comparing goodware and malware for the Rasta dataset],
) <fig:rasta-exit-goodmal>
/*
[15:25] Jean-Marie Mineau
moyenne de la taille total des dex: 6464228.10027989
[15:26] Jean-Marie Mineau
(tout confondu)
[15:26] Jean-Marie Mineau
goodware: 6598464.94224066
malware: 4337376.97252155
```
sqlite> SELECT AVG(apk_size) FROM apk;
16918107.6526989
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
16897989.4472311
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
17236860.8903556
```
*/
#figure({
show table: set text(size: 0.80em)
table(
@ -318,9 +304,91 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
table.cell(colspan: 3/*4*/, inset: 3pt)[],
table.hline(),
)},
placement: none, // floating figure makes this table go in the previous section :grim:
caption: [Average size and date of goodware/malware parts of the Rasta dataset],
) <tab:rasta-sizes>
We sampled our dataset to have a variety of #APK sizes, but the size of the application is not entirely proportional to the bytecode size.
Looking at @tab:rasta-sizes, we can see that although malware are in average bigger #APKs, they contains less bytecode than goodware.
In the previous section, we saw that the size of the bytecode has the most significant impact on the finishing rate of analysis tools, and indeed, @fig:rasta-exit-goodmal reflect that.
/*
```
sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256 WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
0|2971 % malware
1|60455 % goodware
sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
0|243
1|6009
```
```
>>> 61.13168724279835
0.4969812257050699
>>> 60455/6009/20 * 100
50.30371110001665
```
rate goodware rate malware avg size goodware (MB) avg size malware (MB)
decile 1: 85.42 82.02 0.13 0.11
decile 2: 74.46 72.34 0.54 0.55
decile 3: 63.38 65.67 1.37 1.25
decile 4: 57.21 62.31 2.41 2.34
decile 5: 53.36 59.27 3.56 3.55
decile 6: 50.3 61.13 4.61 4.56
decile 7: 46.76 56.54 5.87 5.91
decile 8: 42.57 56.23 7.64 7.63
decile 9: 39.09 57.94 11.39 11.26
decile 10: 33.34 45.86 24.24 21.36
total: 54.28 64.82 6.29 4.14
*/
#figure(
image(
"figs/exit-status-for-the-rasta-dataset-goodware-malware.svg",
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Each tools has two bars, one for goodware an one for malware.
The goodware bars are the same as the one in the figure Exit status for the Rasta dataset.
The timeout rate looks the same on both bar of each tools.
The finishing rate of the malware bar is a lot higher than in the goodware bar for androguard_dad, blueseal, didfail, iccta, perfchecker and wogsen_et_al.
The finishing rate of the malware bar is higher than in the goodware bar for ic3 and ic3_fork.
The only two tools where the finishing rate is better for goodware are apparecium (by arround 15%) and redexer (by arround 10%).
The other tools have similar finishing rate, finishing rate slightly in favor of malware.
"
),
caption: [Exit status comparing goodware (left bars) and malware (right bars) for the Rasta dataset],
) <fig:rasta-exit-goodmal>
/*
[15:25] Jean-Marie Mineau
moyenne de la taille total des dex: 6464228.10027989
[15:26] Jean-Marie Mineau
(tout confondu)
[15:26] Jean-Marie Mineau
goodware: 6598464.94224066
malware: 4337376.97252155
```
sqlite> SELECT AVG(apk_size) FROM apk;
16918107.6526989
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
16897989.4472311
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
17236860.8903556
```
*/
In @fig:rasta-exit-goodmal, we compared the finishing rate of malware and goodware applications for the evaluated tools.
We can see that malware and goodware seam to generate a similar number of timeouts.
However, with the exception of two tools -- apparecium and redexer, we can see a trend of goodware beeing harder to analyse than malware.
Some tools, like DAD or perfchecker, show the finishing rate ratio augment by more than 20 points.
#figure({
show table: set text(size: 0.80em)
@ -369,13 +437,12 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
)},
caption: [#DEX size and Finishing Rate (#FR) per decile],
) <tab:rasta-sizes-decile>
We compared the finishing rate of malware and goodware applications for evaluated tools.
Because, the size of applications impacts this finishing rate, it is interesting to compare the success rate for each decile of bytecode size.
@tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of size.
We saw the the bytecode size may be an explanation for this increase.
To investigate this further, @tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of bytecode size.
We also computed the ratio of the bytecode size and finishing rate for the two populations.
We observe that the ratio for the finishing rate decreases from 1.04 to 0.73, while the ratio of the bytecode size is around 1.
We conclude from this table that analyzing malware triggers less errors than for goodware.
We observe that the while the bytecode size ratio between goodware an malware stays close to one in each deciles (excluding the two extremes), the goodware/malware finishing rate ratio decrease for each decile.
It goes from 1.03 for the 2#super[nd] decile to 0.67 in the 9#super[th] decile.
We conclude from this table that, at equal size, analyzing malware still triggers less errors than for goodware, and that the difference of errors generated between when analyzing a goodware and analyzing a malware increase with the bytecode size.
#highlight()[