more malware vs goodware discussion

2025-08-14 00:33:01 +02:00
commit 02be146060
parent af1187f041
1 changed files with 146 additions and 79 deletions
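
The decile-6 finishing rates quoted in the commented sqlite notes of this change can be reproduced with a short computation. The following is a minimal sketch, not the author's script: the function name `finishing_rate` and the constant `N_TOOLS` are hypothetical, the counts are the ones shown in the notes, and the divisor of 20 is the one used in the `60455/6009/20 * 100` expression (assumed here to be one run per tool per APK).

```python
# Sketch: finishing rate of each population for one bytecode-size decile.
# Counts are the decile-6 values from the sqlite notes in this diff.
# N_TOOLS = 20 is the divisor used in those notes (assumption: one run per tool per APK).

N_TOOLS = 20


def finishing_rate(finished_runs: int, n_apks: int, n_tools: int = N_TOOLS) -> float:
    """Percentage of (APK, tool) runs that ended with status FINISHED."""
    total_runs = n_apks * n_tools
    if total_runs == 0:
        raise ValueError("no runs in this population")
    return finished_runs / total_runs * 100


goodware_rate = finishing_rate(finished_runs=60455, n_apks=6009)
malware_rate = finishing_rate(finished_runs=2971, n_apks=243)
ratio = goodware_rate / malware_rate  # goodware/malware finishing rate ratio

print(f"goodware: {goodware_rate:.2f}%")
print(f"malware:  {malware_rate:.2f}%")
print(f"goodware/malware ratio: {ratio:.2f}")
```

With these counts the sketch gives about 50.30% for goodware and 61.13% for malware, which matches the decile 6 row of the commented table.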
--- a/3_rasta/3_experiments.typ
+++ b/3_rasta/3_experiments.typ
@@ -1,4 +1,4 @@
-#import "../lib.typ": todo, highlight, num, paragraph, SDK, APK, DEX, FR
+#import "../lib.typ": todo, highlight, num, paragraph, SDK, APK, DEX, FR, APKs
 #import "X_var.typ": *
 #import "X_lib.typ": *

@@ -8,14 +8,69 @@
 === RQ1: Re-Usability Evaluation


-#todo[alt text for figure rasta-exit / rasta-exit-drebin]
 #figure(
-  image("figs/exit-status-for-the-drebin-dataset.svg", width: 100%),
+  image(
+    "figs/exit-status-for-the-drebin-dataset.svg", 
+    width: 100%,
+    alt: "Bar chart showing the % of analysed APKs on the y-axis and the tools on the x-axis.
+      Horizontal blue dotted lines mark the 15%, 50% and 85% values.
+      Each bar represents a tool, with the finished analyses in green at the bottom, the analyses that timed out in blue, then on top in red the analyses that failed. There is a last color, grey, for the other category, only visible in the dialdroid bar, where it represents 5% of the results.
+      The results are (approximately) as follows:
+      adagio: 100% finished
+      amandroid: less than 5% timed out, the rest finished
+      anadroid: 85% failed, less than 5% timed out, the rest finished
+      androguard: 100% finished
+      androguard_dad: 5% failed, the rest finished
+      apparecium: around 1% failed, the rest finished
+      blueseal: less than 5% failed, a little more than 10% timed out, the rest (just under 85%) finished
+      dialdroid: a little more than 50% finished, less than 5% timed out, around 5% are marked as other, the rest failed
+      didfail: 70% finished, the rest failed
+      droidsafe: 40% finished, 45% timed out, 15% failed
+      flowdroid: 65% finished, the rest failed
+      gator: 100% finished
+      ic3: 99% finished, 1% failed
+      ic3_fork: 98% finished, 2% failed
+      iccta: 60% finished, less than 5% timed out, the rest failed
+      mallodroid: 100% finished
+      perfchecker: 75% finished, the rest failed
+      redexer: 100% finished
+      saaf: 90% finished, 5% timed out, 5% failed
+      wognsen_et_al: 75% finished, 1% failed, the rest timed out
+    "
+  ),
  caption: [Exit status for the Drebin dataset],
 ) <fig:rasta-exit-drebin>

 #figure(
-  image("figs/exit-status-for-the-rasta-dataset.svg", width: 100%),
+  image(
+    "figs/exit-status-for-the-rasta-dataset.svg", 
+    width: 100%,
+    alt: "Bar chart showing the % of analysed APKs on the y-axis and the tools on the x-axis.
+      Horizontal blue dotted lines mark the 15%, 50% and 85% values.
+      Each bar represents a tool, with the finished analyses in green at the bottom, the analyses that timed out in blue, then on top in red the analyses that failed. There is a last color, grey, for the other category, only visible in the dialdroid bar, where it represents 10% of the results, and in the blueseal bar, where it represents 5% of the results.
+      The results are (approximately) as follows:
+      adagio: 100% finished
+      amandroid: less than 5% failed, 10% timed out, the rest finished
+      anadroid: 95% failed, 1% timed out, the rest finished
+      androguard: 100% finished
+      androguard_dad: a little more than 45% finished, the rest failed
+      apparecium: around 5% failed, 1% timed out, the rest finished
+      blueseal: 20% finished, 15% timed out, 5% are marked other, the rest failed
+      dialdroid: 35% finished, 1% timed out, 10% are marked other, the rest failed
+      didfail: 25% finished, less than 5% timed out, the rest failed
+      droidsafe: less than 10% finished, 20% timed out, the rest failed
+      flowdroid: 55% finished, the rest failed
+      gator: a little more than 85% finished, 5% timed out, 10% failed
+      ic3: less than 80% finished, 5% timed out, the rest failed
+      ic3_fork: 60% finished, 5% timed out, the rest failed
+      iccta: 30% finished, 10% timed out, the rest failed
+      mallodroid: 100% finished
+      perfchecker: 25% finished, less than 5% timed out, the rest failed
+      redexer: 90% finished, the rest failed
+      saaf: 40% finished, the rest failed
+      wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
+    "
+  ),
  caption: [Exit status for the Rasta dataset],
 ) <fig:rasta-exit>

@@ -218,75 +273,6 @@ The date is also correlated with the success rate for Java based tools only.

 === RQ3: Malware vs Goodware <sec:rasta-mal-vs-good>

-#todo[complete @sec:rasta-mal-vs-good by commenting the new figures]
-
-/*
-```
-sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256  WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
-0|2971 % malware
-1|60455 % goodware
-sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
-0|243
-1|6009
-```
-```
->>> 61.13168724279835
-0.4969812257050699
->>> 60455/6009/20 * 100
-50.30371110001665
-```
- 
-              rate goodware    rate malware     avg size goodware (MB)    avg size malware (MB)
- decile  1:           85.42           82.02                       0.13                     0.11
- decile  2:           74.46           72.34                       0.54                     0.55
- decile  3:           63.38           65.67                       1.37                     1.25
- decile  4:           57.21           62.31                       2.41                     2.34
- decile  5:           53.36           59.27                       3.56                     3.55
- decile  6:            50.3           61.13                       4.61                     4.56
- decile  7:           46.76           56.54                       5.87                     5.91
- decile  8:           42.57           56.23                       7.64                     7.63
- decile  9:           39.09           57.94                      11.39                    11.26
- decile 10:           33.34           45.86                      24.24                    21.36
- total:               54.28           64.82                       6.29                     4.14
-*/
-
-
-#todo[Alt text for rasta-exit-goodmal]
-#figure(
-  image(
-    "figs/exit-status-for-the-rasta-dataset-goodware-malware.svg", 
-    width: 100%,
-    alt: "",
-  ),
-  caption: [Exit status comparing goodware and malware for the Rasta dataset],
-) <fig:rasta-exit-goodmal>
-
-/*
-[15:25] Jean-Marie Mineau
-
-average total size of the dex files: 6464228.10027989
-
-[15:26] Jean-Marie Mineau
-
-(all combined)
-
-[15:26] Jean-Marie Mineau
-
-goodware: 6598464.94224066
-
-malware: 4337376.97252155
-
-```
-sqlite> SELECT AVG(apk_size) FROM apk;
-16918107.6526989
-sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
-16897989.4472311
-sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
-17236860.8903556
-```
-*/
-
-
 #figure({
  show table: set text(size: 0.80em)
  table( 
@@ -318,9 +304,91 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
    table.cell(colspan: 3/*4*/, inset: 3pt)[],
    table.hline(),
  )},
+  placement: none, // floating figure makes this table go in the previous section :grim:
  caption: [Average size and date of goodware/malware parts of the Rasta dataset],
 ) <tab:rasta-sizes>

+We sampled our dataset to have a variety of #APK sizes, but the size of the application is not entirely proportional to the bytecode size.
+Looking at @tab:rasta-sizes, we can see that although malware #APKs are on average bigger, they contain less bytecode than goodware.
+In the previous section, we saw that the size of the bytecode has the most significant impact on the finishing rate of analysis tools, and indeed, @fig:rasta-exit-goodmal reflects that.
+
+
+/*
+```
+sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256  WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
+0|2971 % malware
+1|60455 % goodware
+sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
+0|243
+1|6009
+```
+```
+>>> 61.13168724279835
+0.4969812257050699
+>>> 60455/6009/20 * 100
+50.30371110001665
+```
+ 
+              rate goodware    rate malware     avg size goodware (MB)    avg size malware (MB)
+ decile  1:           85.42           82.02                       0.13                     0.11
+ decile  2:           74.46           72.34                       0.54                     0.55
+ decile  3:           63.38           65.67                       1.37                     1.25
+ decile  4:           57.21           62.31                       2.41                     2.34
+ decile  5:           53.36           59.27                       3.56                     3.55
+ decile  6:            50.3           61.13                       4.61                     4.56
+ decile  7:           46.76           56.54                       5.87                     5.91
+ decile  8:           42.57           56.23                       7.64                     7.63
+ decile  9:           39.09           57.94                      11.39                    11.26
+ decile 10:           33.34           45.86                      24.24                    21.36
+ total:               54.28           64.82                       6.29                     4.14
+*/
+
+#figure(
+  image(
+    "figs/exit-status-for-the-rasta-dataset-goodware-malware.svg", 
+    width: 100%,
+    alt: "Bar chart showing the % of analysed APKs on the y-axis and the tools on the x-axis.
+      Each tool has two bars, one for goodware and one for malware.
+      The goodware bars are the same as the ones in the figure Exit status for the Rasta dataset.
+      The timeout rate looks the same on both bars of each tool.
+      The finishing rate of the malware bar is a lot higher than in the goodware bar for androguard_dad, blueseal, didfail, iccta, perfchecker and wognsen_et_al.
+      The finishing rate of the malware bar is higher than in the goodware bar for ic3 and ic3_fork.
+      The only two tools where the finishing rate is better for goodware are apparecium (by around 15%) and redexer (by around 10%).
+      The other tools have similar finishing rates, slightly in favor of malware.
+    "
+  ),
+  caption: [Exit status comparing goodware (left bars) and malware (right bars) for the Rasta dataset],
+) <fig:rasta-exit-goodmal>
+
+/*
+[15:25] Jean-Marie Mineau
+
+average total size of the dex files: 6464228.10027989
+
+[15:26] Jean-Marie Mineau
+
+(all combined)
+
+[15:26] Jean-Marie Mineau
+
+goodware: 6598464.94224066
+
+malware: 4337376.97252155
+
+```
+sqlite> SELECT AVG(apk_size) FROM apk;
+16918107.6526989
+sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
+16897989.4472311
+sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
+17236860.8903556
+```
+*/
+
+In @fig:rasta-exit-goodmal, we compared the finishing rate of malware and goodware applications for the evaluated tools.
+We can see that malware and goodware seem to generate a similar number of timeouts.
+However, with the exception of two tools -- apparecium and redexer -- we can see a trend of goodware being harder to analyse than malware.
+For some tools, like DAD or perfchecker, the finishing rate is more than 20 points higher for malware than for goodware.

 #figure({
  show table: set text(size: 0.80em)
@@ -369,13 +437,12 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
  )},
  caption: [#DEX size and Finishing Rate (#FR) per decile],
 ) <tab:rasta-sizes-decile>
-
-We compared the finishing rate of malware and goodware applications for evaluated tools. 
-Because, the size of applications impacts this finishing rate, it is interesting to  compare the success rate for each decile of bytecode size. 
-@tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of size. 
+We saw that the bytecode size may be an explanation for this increase.
+To investigate this further, @tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of bytecode size. 
 We also computed the ratio of the bytecode size and finishing rate for the two populations. 
-We observe that the ratio for the finishing rate decreases from 1.04 to 0.73, while the ratio of the bytecode size is around 1. 
-We conclude from this table that analyzing malware triggers less errors than for goodware.
+We observe that, while the bytecode size ratio between goodware and malware stays close to one in each decile (excluding the two extremes), the goodware/malware finishing rate ratio tends to decrease from one decile to the next.
+It goes from 1.03 for the 2#super[nd] decile to 0.67 in the 9#super[th] decile.
+We conclude from this table that, at equal size, analyzing malware still triggers fewer errors than analyzing goodware, and that the gap between the errors generated when analyzing goodware and when analyzing malware increases with the bytecode size.


 #highlight()[