#import "../lib.typ": todo, highlight-block, num, paragraph, SDK, APK, DEX, FR, APKs
#import "X_var.typ": *
#import "X_lib.typ": *
== Experiments <sec:rasta-xp>
=== #rq1: Re-Usability Evaluation
#figure(
image(
"figs/exit-status-for-the-drebin-dataset.svg",
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Horizontal blue dotted lines mark the 15%, 50% % and 85% values.
Each bar represent a tools, with the finished analysis in green at the bottom, the analysis that timed of in blue, then on top in red the analysis that failed. Their is a last color, grey, for the other category, only visible in the dialdroid bar representing 5% of the result.
The results are (approximately) as follow:
adagio: 100% finished
amandroid: less than 5% timed out, the rest finished
anadroid: 85% failed, less than 5% timed out, the rest finished
androguard: 100% finished
androguard_dad: 5% failled, the rest finished
apparecium: arround 1% failed, the rest finished
blueseal: less than 5 failed, a little more than 10% timed out, the rest (just under 85%) finished
dialdroid: a little more than 50% finished, less than 5% timed out, arround 5% are marked as other, the rest failled
didfail: 70% finished, the rest failed
droidsafe: 40% finihed, 45% timedout, 15% failed
flowdroid: 65% finished, the rest failed
gator: 100% finished
ic3: 99% finished, 1% failed
ic3_fork: 98% finishe, 2% failed
iccta: 60% finished, less than 5% timed out, the rest failed
mallodroid: 100% finished
perfchecker: 75% finished, the rest failed
redexer: 100% finished
saaf: 90% finished, 5% timed out, 5% failed,
wognsen_et_al: 75% finished, 1% failed, the rest timed out
"
),
caption: [Exit status for the Drebin dataset],
) <fig:rasta-exit-drebin>
#figure(
image(
"figs/exit-status-for-the-rasta-dataset.svg",
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Horizontal blue dotted lines mark the 15%, 50% % and 85% values.
Each bar represent a tools, with the finished analysis in green at the bottom, the analysis that timed of in blue, then on top in red the analysis that failed. Their is a last color, grey, for the other category, only visible in the dialdroid bar representing 10% of the result and in the blueseal bar, for 5% of the results.
The results are (approximately) as follow:
adagio: 100% finished
amandroid: less than 5% failed, 10% timed out, the rest finished
anadroid: 95% failed, 1% timed out, the rest finished
androguard: 100% finished
androguard_dad: a little more than 45% finished, the rest failed
apparecium: arround 5% failed, 1% timed out, the rest finished
blueseal: 20% finished, a 15% timed out, 5% are marked other, the rest failed
dialdroid: 35% finished, 1% timed out, 10 are marked other, the rest failed
didfail: 25% finished, less than 5% timed out, the rest failed
droidsafe: less than 10% finihed, 20% timedout, the rest failed
flowdroid: 55% finished, the rest failed
gator: a little more than 85% finished, 5% timed out, 10% failed
ic3: less than 80% finished, 5% timed out, the rest failed
ic3_fork: 60% finished, 5% times out, the rest failed
iccta: 30% finished, 10% timed out, the rest failed
mallodroid: 100% finished
perfchecker: 25% finished, less than 5% timed out, the rest failed
redexer: 90% finished, the rest failed
saaf: 40% finished, the rest failed,
wognsen_et_al: a little less than 15% finished, a little less than 20% failed, the rest timed out
"
),
caption: [Exit status for the Rasta dataset],
) <fig:rasta-exit>
@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and Rasta datasets.
They represent the success/failure rate (green/orange) of the tools.
We distinguish failures to compute a result from timeouts (blue) and from crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
Because they may be caused by a bug in our own analysis stack, the exit statuses represented in grey (Other) are considered as unknown errors and not as failures of the tool.
We further discuss the errors for which we have log information in @sec:rasta-failure-analysis.
Results on the Drebin dataset show that 11 tools have a high success rate (greater than 85%).
The other tools have poor results.
The worst, excluding Lotrack and Tresher, is Anadroid, with a success ratio under 20%.
On the Rasta dataset, we observe a global increase of the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
The tools that have bad results on Drebin unsurprisingly also have bad results on Rasta.
Three tools (androguard_dad, blueseal, saaf) that were performing well (higher than 85%) on Drebin surprisingly fall below the bar of 50% of successful analyses.
Seven tools keep a high success rate: Adagio, Amandroid, Androguard, Apparecium, Gator, Mallodroid, Redexer.
Regarding IC3, the fork with a simpler build process and support for modern operating systems has a lower success rate than the original tool.
Two tools should be discussed in particular.
//Androguard and Flowdroid have a large community of users, as shown by the numbers of GitHub stars in @tab:rasta-sources.
Androguard has a high success rate, which is not surprising: it is used by many tools, including for analyzing applications uploaded to the AndroZoo repository.
//Because of that, it should be noted that our dataset is biased in favour of Androguard. // Already in discution
Nevertheless, when using Androguard's decompiler (DAD) to decompile an APK, it fails more than 50% of the time.
This example shows that even a frequently used tool can still run into critical failures.
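As an illustration, the following Python sketch (not the actual Rasta harness; it assumes the androguard 3.x API) shows how a downstream tool typically drives Androguard and how a crash maps to the failed exit status of our figures:

```python
# Hedged sketch, assuming the androguard 3.x API commonly used by downstream tools.
from androguard.misc import AnalyzeAPK

def try_analyze(apk_path: str) -> bool:
    try:
        # a: APK object, d: DalvikVMFormat object(s), dx: Analysis object
        a, d, dx = AnalyzeAPK(apk_path)
        return True   # would be recorded as a finished analysis
    except Exception:
        return False  # would be recorded as a failed analysis
```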
Concerning Flowdroid, our results show a very low timeout rate (#mypercent(37, NBTOTAL)), which was unexpected: in our exchanges, Flowdroid's authors were expecting a higher rate of timeouts and fewer crashes.
In summary, the final ratio of successful analyses for the tools that we could run
// and applications of Rasta dataset
is #mypercent(54.9, 100).
When including the two defective tools, this ratio drops to #mypercent(49.9, 100).
#highlight-block()[
*#rq1 answer:*
On a recent dataset, we consider that #resultunusable of the tools are unusable.
For the tools that we could run, #resultratio of the analyses finish successfully.
//(those with less than 50% of successful execution and including the two tools that we were unable to build).
]
=== #rq2: Size, #SDK and Date Influence
#todo[alt text for fig rasta-exit-evolution-java and rasta-exit-evolution-not-java]
#figure(stack(dir: ltr,
[#figure(
image(
"figs/finishing-rate-by-year-of-java-based-tools.svg",
width: 50%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-java>],
[#figure(
image(
"figs/finishing-rate-by-year-of-non-java-based-tools.svg",
width: 50%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-not-java>]
), caption: [Exit status evolution for the Rasta dataset]
)
To investigate the effect of application dates on the tools, we computed the date of each #APK as the minimum of the date of its first upload to AndroZoo and the date of its first analysis by VirusTotal.
Such a computation is more reliable than using the dex date, which is often obfuscated when packaging the application.
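A minimal Python sketch of this date computation, assuming ISO-formatted timestamps extracted from the AndroZoo index and the VirusTotal reports (the field names are hypothetical):

```python
from datetime import datetime

def apk_discovery_date(androzoo_added: str, vt_first_seen: str) -> datetime:
    """Estimate the APK date as the earliest of the two sources."""
    return min(
        datetime.fromisoformat(androzoo_added),  # first upload to AndroZoo
        datetime.fromisoformat(vt_first_seen),   # first analysis by VirusTotal
    )
```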
Then, for the sake of clarity of our results, we separated the tools whose source code is mainly Java from those that use other languages.
Among the Java based programs, most use the Soot framework, which may correlate their results.
@fig:rasta-exit-evolution-java (resp. @fig:rasta-exit-evolution-not-java) compares the success rate of the tools between 2010 and 2023 for Java based tools (resp. non Java based tools).
For Java based tools, a clear decrease of the finishing rate can be observed globally for all tools.
For non-Java based tools, two of them keep a high success rate (Androguard, Mallodroid).
This result is expected for Androguard, because its analysis is relatively simple and the tool is largely adopted, as previously mentioned.
Mallodroid being a relatively simple script leveraging Androguard, it benefits from Androguard's resilience.
It should be noted that Saaf keeps a high success ratio until 2014 and then quickly drops below 20%. This example shows that, even with identical source code and the same running platform, a tool's behavior can change over time because of the evolution of the structure of the input files.
An interesting comparison is the specific case of Ic3 and Ic3_fork. Until 2019, their success rates are very similar. After 2020, Ic3_fork keeps decreasing whereas Ic3 maintains a success rate of around 60%.
/*
```
SELECT apk1.first_seen_year, (COUNT(*) * 100) / (SELECT 20 * COUNT(*)
    FROM apk AS apk2 WHERE apk2.first_seen_year = apk1.first_seen_year)
FROM exec JOIN apk AS apk1 ON exec.sha256 = apk1.sha256
WHERE exec.tool_status = 'FINISHED' OR exec.tool_status = 'UNKNOWN'
GROUP BY apk1.first_seen_year ORDER BY apk1.first_seen_year;
2010|78
2011|78
2012|76
2013|70
2014|66
2015|61
2016|57
2017|54
2018|49
2019|47
2020|45
2021|42
2022|40
2023|39
```
*/
To compare the influence of the date, #SDK version and size of applications, we fixed one parameter while varying another.
#todo[Alt text for fig rasta-decorelation-size]
#figure(stack(dir: ltr,
[#figure(
image(
"figs/decorelation/finishing-rate-of-java-based-tool-by-bytecode-size-of-apks-detected-in-2022.svg",
width: 50%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-2022>],
[#figure(
image(
"figs/decorelation/finishing-rate-of-non-java-based-tool-by-bytecode-size-of-apks-detected-in-2022.svg",
width: 50%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-2022>]
), caption: [Finishing rate by bytecode size for APK detected in 2022]
) <fig:rasta-decorelation-size>
#paragraph[Fixed application year. (#num(5000) APKs)][
We selected the year 2022, which has a good number of representatives in each size decile of our application dataset.
@fig:rasta-rate-evolution-java-2022 (resp. @fig:rasta-rate-evolution-non-java-2022) shows the finishing rate of the tools as a function of the bytecode size for Java based tools (resp. non Java based tools) analyzing applications of 2022.
We can observe that all Java based tools have a finishing rate that decreases as the bytecode size grows.
Half of the non Java based tools have the same behavior.
]
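Such per-decile finishing rates for a fixed year can be obtained with a simple aggregation; the following Python/pandas sketch uses hypothetical column names (the actual experiments query an SQLite database):

```python
import pandas as pd

# df: one row per (tool, APK) execution, with hypothetical columns
# 'tool', 'year', 'dex_size_decile' and 'status'.
def finishing_rate_by_decile(df: pd.DataFrame, year: int = 2022) -> pd.DataFrame:
    fixed = df[df["year"] == year]
    return (
        fixed.assign(finished=fixed["status"].eq("FINISHED"))
             .groupby(["tool", "dex_size_decile"])["finished"]
             .mean()                      # share of finished executions
             .unstack("dex_size_decile")  # one column per size decile
    )
```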
#todo[Alt text for fig rasta-decorelation-year]
#figure(stack(dir: ltr,
[#figure(
image(
"figs/decorelation/finishing-rate-of-java-based-tool-by-discovery-year-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 50%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-year>],
[#figure(
image(
"figs/decorelation/finishing-rate-of-non-java-based-tool-by-discovery-year-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 50%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-year>]
), caption: [Finishing rate by discovery year with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-year>
#paragraph[Fixed application bytecode size. (#num(6252) APKs)][We selected the sixth decile (between 4.08 and 5.20 MB), which is well represented across a wide range of years.
@fig:rasta-rate-evolution-java-decile-year (resp. @fig:rasta-rate-evolution-non-java-decile-year) represents the finishing rate depending on the year at a fixed bytecode size.
We observe that 9 of the 12 Java based tools have a finishing rate dropping below 20%, which is not the case for non Java based tools.
]
#todo[Alt text for fig rasta-decorelation-min-sdk]
#figure(stack(dir: ltr,
[#figure(
image(
"figs/decorelation/finishing-rate-of-java-based-tool-by-min-sdk-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 50%,
alt: ""
),
caption: [Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-min-sdk>],
[#figure(
image(
"figs/decorelation/finishing-rate-of-non-java-based-tool-by-min-sdk-of-apks-with-a-bytecode-size-between-4-08-mb-and-5-2-mb.svg",
width: 50%,
alt: "",
),
caption: [Non Java based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-min-sdk>]
), caption: [Finishing rate by min #SDK with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-min-sdk>
We performed similar experiments by varying the min #SDK and target #SDK versions, still with a fixed bytecode size between 4.08 and 5.2 MB, as shown in @fig:rasta-rate-evolution-java-decile-min-sdk and @fig:rasta-rate-evolution-non-java-decile-min-sdk.
We found that, contrary to the target #SDK, the min #SDK version has an impact on the finishing rate of Java based tools: 8 tools out of 12 are below 50% after #SDK 16.
This is not surprising, as the min #SDK is highly correlated with the year.
#highlight-block(breakable: false)[
*#rq2 answer:*
For the #nbtoolsselected tools that can be used at least partially, a global decrease of the analysis success rate is observed over time.
Starting at a 78% success rate, tools drop to 61% of success after five years and 45% after ten years.
The success rate varies with the bytecode size and the #SDK version.
The date is also correlated with the success rate, but for Java based tools only.
]
=== #rq3: Malware vs Goodware <sec:rasta-mal-vs-good>
#figure({
show table: set text(size: 0.80em)
table(
columns: 3, //4,
inset: (x: 0% + 5pt, y: 0% + 2pt),
stroke: none,
align: center+horizon,
table.hline(),
table.header(
table.cell(colspan: 3/*4*/, inset: 3pt)[],
table.cell(rowspan:2)[*Rasta part*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average size* (MB)],
//table.vline(end: 3),
//table.vline(start: 4),
//table.cell(rowspan:2)[*Average date*],
[*APK*],
[*DEX*],
),
table.cell(colspan: 3/*4*/, inset: 3pt)[],
table.hline(),
table.cell(colspan: 3/*4*/, inset: 3pt)[],
[*goodware*], num(calc.round(16.897989, digits: 1)), num(calc.round(6.598464, digits: 1)),// [2017],
[*malware*], num(calc.round(17.236860, digits: 1)), num(calc.round(4.337376, digits: 1)),// [2017],
[*total*], num(calc.round(16.918107, digits: 1)), num(calc.round(6.464228, digits: 1)),// [2017],
table.cell(colspan: 3/*4*/, inset: 3pt)[],
table.hline(),
)},
placement: none, // floating figure makes this table go in the previous section :grim:
caption: [Average size of the goodware/malware parts of the Rasta dataset],
) <tab:rasta-sizes>
We sampled our dataset to have a variety of #APK sizes, but the size of an application is not entirely proportional to the size of its bytecode.
Looking at @tab:rasta-sizes, we can see that although malware #APKs are on average bigger, they contain less bytecode than goodware.
In the previous section, we saw that the bytecode size has the most significant impact on the finishing rate of the analysis tools, and indeed @fig:rasta-exit-goodmal reflects that.
/*
```
sqlite> SELECT vt_detection == 0, COUNT(exec.sha256) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256 WHERE tool_status = 'FINISHED' AND dex_size_decile = 6 GROUP BY vt_detection == 0;
0|2971 % malware
1|60455 % goodware
sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size_decile = 6 GROUP BY vt_detection == 0;
0|243
1|6009
```
```
>>> 61.13168724279835
0.4969812257050699
>>> 60455/6009/20 * 100
50.30371110001665
```
rate goodware rate malware avg size goodware (MB) avg size malware (MB)
decile 1: 85.42 82.02 0.13 0.11
decile 2: 74.46 72.34 0.54 0.55
decile 3: 63.38 65.67 1.37 1.25
decile 4: 57.21 62.31 2.41 2.34
decile 5: 53.36 59.27 3.56 3.55
decile 6: 50.3 61.13 4.61 4.56
decile 7: 46.76 56.54 5.87 5.91
decile 8: 42.57 56.23 7.64 7.63
decile 9: 39.09 57.94 11.39 11.26
decile 10: 33.34 45.86 24.24 21.36
total: 54.28 64.82 6.29 4.14
*/
#figure(
image(
"figs/exit-status-for-the-rasta-dataset-goodware-malware.svg",
width: 100%,
alt: "Bar chart showing the % of analyse apk on the y-axis and the tools on the x-axis.
Each tools has two bars, one for goodware an one for malware.
The goodware bars are the same as the one in the figure Exit status for the Rasta dataset.
The timeout rate looks the same on both bar of each tools.
The finishing rate of the malware bar is a lot higher than in the goodware bar for androguard_dad, blueseal, didfail, iccta, perfchecker and wogsen_et_al.
The finishing rate of the malware bar is higher than in the goodware bar for ic3 and ic3_fork.
The only two tools where the finishing rate is better for goodware are apparecium (by arround 15%) and redexer (by arround 10%).
The other tools have similar finishing rate, finishing rate slightly in favor of malware.
"
),
caption: [Exit status comparing goodware (left bars) and malware (right bars) for the Rasta dataset],
) <fig:rasta-exit-goodmal>
/*
[15:25] Jean-Marie Mineau
average total dex size: 6464228.10027989
[15:26] Jean-Marie Mineau
(all categories combined)
[15:26] Jean-Marie Mineau
goodware: 6598464.94224066
malware: 4337376.97252155
```
sqlite> SELECT AVG(apk_size) FROM apk;
16918107.6526989
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection = 0;
16897989.4472311
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
17236860.8903556
```
*/
In @fig:rasta-exit-goodmal, we compared the finishing rate of malware and goodware applications for the evaluated tools.
We can see that malware and goodware seem to generate a similar number of timeouts.
However, with the exception of two tools (apparecium and redexer), we can see a trend of goodware being harder to analyze than malware.
Some tools, like DAD or perfchecker, have a finishing rate more than 20 percentage points higher on malware.
#figure({
show table: set text(size: 0.80em)
table(
columns: 7,
inset: (x: 0% + 5pt, y: 0% + 2pt),
stroke: none,
align: center+horizon,
table.hline(),
table.header(
table.cell(colspan: 7, inset: 3pt)[],
table.cell(rowspan: 2)[*Decile*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average DEX size (MB)*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Finishing Rate: #FR*],
table.vline(end: 3),
table.vline(start: 4),
[*Ratio Size*],
table.vline(end: 3),
table.vline(start: 4),
[*Ratio #FR*],
[Good], [Mal],
[Good], [Mal],
[Good/Mal], [Good/Mal],
),
table.cell(colspan: 7, inset: 3pt)[],
table.hline(),
table.cell(colspan: 7, inset: 3pt)[],
num(1), num(0.13), num(0.11), num(0.85), num(0.82), num(1.17), num(1.04),
num(2), num(0.54), num(0.55), num(0.74), num(0.72), num(0.97), num(1.03),
num(3), num(1.37), num(1.25), num(0.63), num(0.66), num(1.09), num(0.97),
num(4), num(2.41), num(2.34), num(0.57), num(0.62), num(1.03), num(0.92),
num(5), num(3.56), num(3.55), num(0.53), num(0.59), num(1.00), num(0.90),
num(6), num(4.61), num(4.56), num(0.50), num(0.61), num(1.01), num(0.82),
num(7), num(5.87), num(5.91), num(0.47), num(0.57), num(0.99), num(0.83),
num(8), num(7.64), num(7.63), num(0.43), num(0.56), num(1.00), num(0.76),
num(9), num(11.39), num(11.26), num(0.39), num(0.58), num(1.01), num(0.67),
num(10), num(24.24), num(21.36), num(0.33), num(0.46), num(1.13), num(0.73),
table.cell(colspan: 7, inset: 3pt)[],
table.hline(),
)},
caption: [#DEX size and Finishing Rate (#FR) per decile],
) <tab:rasta-sizes-decile>
We saw that the bytecode size may be an explanation for this increase.
To investigate further, @tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of bytecode size.
We also computed the goodware/malware ratios of the bytecode size and of the finishing rate.
We observe that while the bytecode size ratio between goodware and malware stays close to one in each decile (excluding the two extremes), the goodware/malware finishing rate ratio decreases with each decile.
It goes from 1.03 for the 2#super[nd] decile down to 0.67 in the 9#super[th] decile.
We conclude from this table that, at equal size, analyzing malware still triggers fewer errors than analyzing goodware, and that the gap between the errors generated when analyzing goodware and malware increases with the bytecode size.
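The per-decile ratios of @tab:rasta-sizes-decile can be derived with a similar aggregation; this Python/pandas sketch again uses hypothetical column names:

```python
import pandas as pd

# df: one row per (tool, APK) execution, with hypothetical columns
# 'dex_size_decile', 'is_malware', 'dex_size' and a boolean 'finished'.
def goodware_malware_ratios(df: pd.DataFrame) -> pd.DataFrame:
    stats = df.groupby(["dex_size_decile", "is_malware"]).agg(
        avg_dex_size=("dex_size", "mean"),
        finishing_rate=("finished", "mean"),
    ).unstack("is_malware")
    return pd.DataFrame({
        # goodware (is_malware == False) over malware (is_malware == True)
        "ratio_size": stats[("avg_dex_size", False)] / stats[("avg_dex_size", True)],
        "ratio_fr": stats[("finishing_rate", False)] / stats[("finishing_rate", True)],
    })
```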
#highlight-block()[
*#rq3 answer:*
Analyzing malware applications triggers fewer errors for static analysis tools than analyzing goodware of comparable bytecode size.
]