more malware vs goodware discussion

2025-08-14 00:33:01 +02:00 · 2025-08-14 00:33:01 +02:00 · 02be146060
commit 02be146060
parent af1187f041
1 changed files with 146 additions and 79 deletions
--- a/3_rasta/3_experiments.typ
+++ b/3_rasta/3_experiments.typ
@ -1,4 +1,4 @@
-#import "../lib.typ": todo, highlight, num, paragraph, SDK, APK, DEX, FR
+#import "../lib.typ": todo, highlight, num, paragraph, SDK, APK, DEX, FR, APKs
 #import "X_var.typ": *
 #import "X_lib.typ": *
@ -8,14 +8,69 @@
 === RQ1: Re-Usability Evaluation
 #todo[alt text for figure rasta-exit / rasta-exit-drebin]
 #figure(
-  image("figs/exit-status-for-the-drebin-dataset.svg", width: 100%),
+  image(
    "figs/exit-status-for-the-drebin-dataset.svg", 
    width: 100%,
    alt: "Bar chart showing the % of analysed APKs on the y-axis and the tools on the x-axis.
      Horizontal blue dotted lines mark the 15%, 50% and 85% values.
      Each bar represents a tool, with the finished analyses in green at the bottom, the analyses that timed out in blue, then on top in red the analyses that failed. There is a last color, grey, for the other category, only visible in the dialdroid bar, representing 5% of the results.
      The results are (approximately) as follows:
      adagio: 100% finished
      amandroid: less than 5% timed out, the rest finished
      anadroid: 85% failed, less than 5% timed out, the rest finished
      androguard: 100% finished
      androguard_dad: 5% failed, the rest finished
      apparecium: around 1% failed, the rest finished
      blueseal: less than 5% failed, a little more than 10% timed out, the rest (just under 85%) finished
      dialdroid: a little more than 50% finished, less than 5% timed out, around 5% are marked as other, the rest failed
      didfail: 70% finished, the rest failed
      droidsafe: 40% finished, 45% timed out, 15% failed
      flowdroid: 65% finished, the rest failed
      gator: 100% finished
      ic3: 99% finished, 1% failed
      ic3_fork: 98% finished, 2% failed
      iccta: 60% finished, less than 5% timed out, the rest failed
      mallodroid: 100% finished
      perfchecker: 75% finished, the rest failed
      redexer: 100% finished
      saaf: 90% finished, 5% timed out, 5% failed
      wognsen_et_al: 75% finished, 1% failed, the rest timed out
    "
  ),
  caption: [Exit status for the Drebin dataset],
 ) <fig:rasta-exit-drebin>
 #figure(
-  image("figs/exit-status-for-the-rasta-dataset.svg", width: 100%),
+  image(
    "figs/exit-status-for-the-rasta-dataset.svg", 
    width: 100%,
    alt: "Bar chart showing the % of analysed APKs on the y-axis and the tools on the x-axis.
      Horizontal blue dotted lines mark the 15%, 50% and 85% values.
      Each bar represents a tool, with the finished analyses in green at the bottom, the analyses that timed out in blue, then on top in red the analyses that failed. There is a last color, grey, for the other category, only visible in the dialdroid bar, representing 10% of the results, and in the blueseal bar, for 5% of the results.
      The results are (approximately) as follows:
      adagio: 100% finished
      amandroid: less than 5% failed, 10% timed out, the rest finished
      anadroid: 95% failed, 1% timed out, the rest finished
      androguard: 100% finished
      androguard_dad: a little more than 45% finished, the rest failed
      apparecium: around 5% failed, 1% timed out, the rest finished
      blueseal: 20% finished, 15% timed out, 5% are marked other, the rest failed
      dialdroid: 35% finished, 1% timed out, 10% are marked other, the rest failed
      didfail: 25% finished, less than 5% timed out, the rest failed
      droidsafe: less than 10% finished, 20% timed out, the rest failed
      flowdroid: 55% finished, the rest failed
      gator: a little more than 85% finished, 5% timed out, 10% failed
      ic3: less than 80% finished, 5% timed out, the rest failed
      ic3_fork: 60% finished, 5% timed out, the rest failed
      iccta: 30% finished, 10% timed out, the rest failed
      mallodroid: 100% finished
      perfchecker: 25% finished, less than 5% timed out, the rest failed
      redexer: 90% finished, the rest failed
      saaf: 40% finished, the rest failed
      wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
    "
  ),
  caption: [Exit status for the Rasta dataset],
 ) <fig:rasta-exit>
@ -218,75 +273,6 @@ The date is also correlated with the success rate for Java based tools only.
 === RQ3: Malware vs Goodware <sec:rasta-mal-vs-good>
 #todo[complete @sec:rasta-mal-vs-good by commenting the new figures]
 /*
 ```
 sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256  WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
 0|2971 % malware
 1|60455 % goodware
 sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
 0|243
 1|6009
 ```
 ```
 >>> 61.13168724279835
 0.4969812257050699
 >>> 60455/6009/20 * 100
 50.30371110001665
 ```
              rate goodware    rate malware     avg size goodware (MB)    avg size malware (MB)
 decile  1:           85.42           82.02                       0.13                     0.11
 decile  2:           74.46           72.34                       0.54                     0.55
 decile  3:           63.38           65.67                       1.37                     1.25
 decile  4:           57.21           62.31                       2.41                     2.34
 decile  5:           53.36           59.27                       3.56                     3.55
 decile  6:            50.3           61.13                       4.61                     4.56
 decile  7:           46.76           56.54                       5.87                     5.91
 decile  8:           42.57           56.23                       7.64                     7.63
 decile  9:           39.09           57.94                      11.39                    11.26
 decile 10:           33.34           45.86                      24.24                    21.36
 total:               54.28           64.82                       6.29                     4.14
 */
 #todo[Alt text for rasta-exit-goodmal]
 #figure(
  image(
    "figs/exit-status-for-the-rasta-dataset-goodware-malware.svg", 
    width: 100%,
    alt: "",
  ),
  caption: [Exit status comparing goodware and malware for the Rasta dataset],
 ) <fig:rasta-exit-goodmal>
 /*
 [15:25] Jean-Marie Mineau
 average total size of the dex files: 6464228.10027989
 [15:26] Jean-Marie Mineau
 (all combined)
 [15:26] Jean-Marie Mineau
 goodware: 6598464.94224066
 malware: 4337376.97252155
 ```
 sqlite> SELECT AVG(apk_size) FROM apk;
 16918107.6526989
 sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
 16897989.4472311
 sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
 17236860.8903556
 ```
 */
 #figure({
  show table: set text(size: 0.80em)
  table( 
@ -318,9 +304,91 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
    table.cell(colspan: 3/*4*/, inset: 3pt)[],
    table.hline(),
  )},
  placement: none, // floating figure makes this table go in the previous section :grim:
  caption: [Average size and date of goodware/malware parts of the Rasta dataset],
 ) <tab:rasta-sizes>
 We sampled our dataset to have a variety of #APK sizes, but the size of the application is not entirely proportional to the bytecode size.
 Looking at @tab:rasta-sizes, we can see that although malware #APKs are on average bigger, they contain less bytecode than goodware.
 In the previous section, we saw that the size of the bytecode has the most significant impact on the finishing rate of analysis tools, and indeed, @fig:rasta-exit-goodmal reflects that.
 /*
 ```
 sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256  WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
 0|2971 % malware
 1|60455 % goodware
 sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
 0|243
 1|6009
 ```
 ```
 >>> 61.13168724279835
 0.4969812257050699
 >>> 60455/6009/20 * 100
 50.30371110001665
 ```
              rate goodware    rate malware     avg size goodware (MB)    avg size malware (MB)
 decile  1:           85.42           82.02                       0.13                     0.11
 decile  2:           74.46           72.34                       0.54                     0.55
 decile  3:           63.38           65.67                       1.37                     1.25
 decile  4:           57.21           62.31                       2.41                     2.34
 decile  5:           53.36           59.27                       3.56                     3.55
 decile  6:            50.3           61.13                       4.61                     4.56
 decile  7:           46.76           56.54                       5.87                     5.91
 decile  8:           42.57           56.23                       7.64                     7.63
 decile  9:           39.09           57.94                      11.39                    11.26
 decile 10:           33.34           45.86                      24.24                    21.36
 total:               54.28           64.82                       6.29                     4.14
 */
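The percentages in the note above appear to be derived as the number of `FINISHED` runs divided by the number of applications times the number of tools. A minimal sketch of that computation, assuming 20 tools with one run per tool and per application (the `finishing_rate` helper and `NB_TOOLS` constant are illustrative names, not part of the Rasta code base):

```python
# Counts copied from the sqlite output above (6th bytecode-size decile).
NB_TOOLS = 20  # assumption: each application is analysed once by each of the 20 tools


def finishing_rate(finished_runs: int, nb_apps: int, nb_tools: int = NB_TOOLS) -> float:
    """Percentage of (application, tool) runs that ended with status FINISHED."""
    return finished_runs / (nb_apps * nb_tools) * 100


goodware_rate = finishing_rate(60455, 6009)
malware_rate = finishing_rate(2971, 243)
print(f"goodware: {goodware_rate:.2f}%, malware: {malware_rate:.2f}%")
```

With these counts, the sketch gives the 50.30 and 61.13 values listed for decile 6.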
 #figure(
  image(
    "figs/exit-status-for-the-rasta-dataset-goodware-malware.svg", 
    width: 100%,
    alt: "Bar chart showing the % of analysed APKs on the y-axis and the tools on the x-axis.
      Each tool has two bars, one for goodware and one for malware.
      The goodware bars are the same as the ones in the figure Exit status for the Rasta dataset.
      The timeout rate looks the same on both bars of each tool.
      The finishing rate of the malware bar is a lot higher than in the goodware bar for androguard_dad, blueseal, didfail, iccta, perfchecker and wognsen_et_al.
      The finishing rate of the malware bar is higher than in the goodware bar for ic3 and ic3_fork.
      The only two tools where the finishing rate is better for goodware are apparecium (by around 15%) and redexer (by around 10%).
      The other tools have similar finishing rates, slightly in favor of malware.
    "
  ),
  caption: [Exit status comparing goodware (left bars) and malware (right bars) for the Rasta dataset],
 ) <fig:rasta-exit-goodmal>
 /*
 [15:25] Jean-Marie Mineau
 average total size of the dex files: 6464228.10027989
 [15:26] Jean-Marie Mineau
 (all combined)
 [15:26] Jean-Marie Mineau
 goodware: 6598464.94224066
 malware: 4337376.97252155
 ```
 sqlite> SELECT AVG(apk_size) FROM apk;
 16918107.6526989
 sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
 16897989.4472311
 sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
 17236860.8903556
 ```
 */
 In @fig:rasta-exit-goodmal, we compare the finishing rate of malware and goodware applications for the evaluated tools.
 We can see that malware and goodware seem to generate a similar number of timeouts.
 However, with the exception of two tools (apparecium and redexer), we can see a trend of goodware being harder to analyse than malware.
 Some tools, like DAD or perfchecker, have a finishing rate more than 20 points higher for malware than for goodware.
 #figure({
  show table: set text(size: 0.80em)
@ -369,13 +437,12 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
  )},
  caption: [#DEX size and Finishing Rate (#FR) per decile],
 ) <tab:rasta-sizes-decile>
-
+We saw that the bytecode size may be an explanation for this increase.
-We compared the finishing rate of malware and goodware applications for evaluated tools. 
+To investigate this further, @tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of bytecode size. 
 Because the size of applications impacts this finishing rate, it is interesting to compare the success rate for each decile of bytecode size. 
@tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of size. 
 We also computed the ratio of the bytecode size and finishing rate for the two populations. 
-We observe that the ratio for the finishing rate decreases from 1.04 to 0.73, while the ratio of the bytecode size is around 1. 
+We observe that, while the bytecode size ratio between goodware and malware stays close to one in each decile (excluding the two extremes), the goodware/malware finishing rate ratio decreases with each decile.
-We conclude from this table that analyzing malware triggers less errors than for goodware.
+It goes from 1.03 for the 2#super[nd] decile to 0.67 in the 9#super[th] decile.
 We conclude from this table that, at equal size, analyzing malware still triggers fewer errors than analyzing goodware, and that the gap between the errors generated when analyzing goodware and when analyzing malware increases with the bytecode size.
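The goodware/malware finishing rate ratios discussed in this paragraph can be recomputed with a short script. This is a minimal sketch, assuming the per-decile rates listed in the commented-out notes of this file are the ones reported in @tab:rasta-sizes-decile (the final table may differ). The `RATES` dictionary and `ratios` name are illustrative.

```python
# Finishing rates (%) per bytecode-size decile: (goodware, malware).
# Values copied from the commented-out notes; assumed to match the final table.
RATES = {
    1: (85.42, 82.02),
    2: (74.46, 72.34),
    3: (63.38, 65.67),
    4: (57.21, 62.31),
    5: (53.36, 59.27),
    6: (50.30, 61.13),
    7: (46.76, 56.54),
    8: (42.57, 56.23),
    9: (39.09, 57.94),
    10: (33.34, 45.86),
}

# Ratio above 1: goodware finishes more often; below 1: malware finishes more often.
ratios = {decile: good / mal for decile, (good, mal) in RATES.items()}

for decile, ratio in ratios.items():
    print(f"decile {decile:2d}: goodware/malware finishing rate ratio = {ratio:.2f}")
```

A ratio below one means malware is analysed successfully more often than goodware of comparable bytecode size.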
 #highlight()[