caption: [Exit status for the Rasta dataset],
) <fig:rasta-exit>
@fig:rasta-exit-drebin and @fig:rasta-exit compare the Drebin and Rasta datasets.
They represent the success/failure rate (green/orange) of the tools.
We distinguished failure to compute a result from timeout (blue) and crashes of our evaluation framework (in grey, probably due to out-of-memory kills of the container itself).
Because they may be caused by a bug in our own analysis stack, exit statuses represented in grey (Other) are considered unknown errors and not failures of the tool.
In @sec:rasta-failure-analysis, we further discuss the errors for which we have information in the logs.
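As an illustration, this classification can be sketched as follows; the record fields (`framework_crash`, `timeout`, `exit_status`) are hypothetical names, not the actual schema of our execution reports.

```python
# Illustrative sketch of the exit-status classification used in the figures.
# The record fields are hypothetical, not the actual report schema.
def classify(record: dict) -> str:
    if record.get("framework_crash"):  # e.g. the container itself was OOM-killed
        return "other"                 # unknown error, not counted against the tool
    if record.get("timeout"):          # the analysis exceeded its time budget
        return "timeout"
    if record.get("exit_status") == 0:
        return "success"
    return "failure"                   # the tool failed to compute a result
```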
Results on the Drebin dataset show that 11 tools have a high success rate (greater than 85%).
The other tools have poor results.
The worst, excluding Lotrack and Tresher, is Anadroid, with a success ratio under 20%.
On the Rasta dataset, we observe a global increase in the number of failed statuses: #resultunusablenb tools (#resultunusable) have a finishing rate below 50%.
The tools that have bad results on Drebin also, unsurprisingly, have bad results on Rasta.
Three tools (androguard_dad, blueseal, saaf) that were performing well (higher than 85%) on Drebin surprisingly fall below the bar of 50% success.
Seven tools keep a high success rate: Adagio, Amandroid, Androguard, Apparecium, Gator, Mallodroid, Redexer.
Regarding IC3, the fork with a simpler build process and support for modern OSes has a lower success rate than the original tool.
Two tools should be discussed in particular.
//Androguard and Flowdroid have a large community of users, as shown by the number of GitHub stars in @tab:rasta-sources.
Androguard has a high success rate, which is not surprising: it is used by many tools, including for analysing applications uploaded to the AndroZoo repository.
//Because of that, it should be noted that our dataset is biased in favour of Androguard. // Already in discussion
Nevertheless, when using the Androguard decompiler (DAD) to decompile an APK, it fails more than 50% of the time.
This example shows that even a tool that is frequently used can still run into critical failures.
Concerning Flowdroid, our results show a very low timeout rate (#mypercent(37, NBTOTAL)), which was unexpected: in our exchanges, Flowdroid's authors were expecting a higher rate of timeouts and fewer crashes.
In summary, the final ratio of successful analyses for the tools that we could run
// and applications of the RASTA dataset
is #mypercent(54.9, 100).
When including the two defective tools, this ratio drops to #mypercent(49.9, 100).
#highlight-block()[
*#rq1 answer:*
On a recent dataset, we consider that #resultunusable of the tools are unusable.
For the tools that we could run, #resultratio of analyses finish successfully.
//(those with less than 50% of successful execution, and including the two tools that we were unable to build).
]
=== #rq2: Size, #SDK and Date Influence
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-java>],
[#figure(
@ -131,24 +130,26 @@ For the tools that we could run, #resultratio of analysis are finishing successf
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-exit-evolution-not-java>]
), caption: [Exit status evolution for the Rasta dataset]
)
To investigate the effect of application dates on the tools, we computed the date of each #APK as the minimum of its first-upload date in AndroZoo and its first-analysis date in VirusTotal.
Such a computation is more reliable than using the #DEX date, which is often obfuscated when packaging the application.
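A minimal sketch of this dating heuristic, with illustrative field names (they do not correspond to the AndroZoo or VirusTotal APIs):

```python
# Sketch of the APK dating heuristic: the minimum of the AndroZoo
# first-upload date and the VirusTotal first-analysis date.
# Parameter names are illustrative, not actual API fields.
from datetime import date
from typing import Optional

def apk_date(androzoo_added: date, vt_first_seen: Optional[date]) -> date:
    # The DEX date embedded in the package is often obfuscated at packaging
    # time, so external first-seen timestamps are used instead.
    if vt_first_seen is None:
        return androzoo_added
    return min(androzoo_added, vt_first_seen)
```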
Then, for the sake of clarity of our results, we separated the tools that have mainly Java source code from those that use other languages.
Among the tools that are Java-based programs, most use the Soot framework, which may make their results correlated.
@fig:rasta-exit-evolution-java (resp. @fig:rasta-exit-evolution-not-java) compares the success rate of the tools between 2010 and 2023 for Java-based tools (resp. non-Java-based tools).
For Java-based tools, a clear decrease in finishing rate can be observed globally for all tools.
Among non-Java-based tools, two keep a high success rate (Androguard, Mallodroid).
The result is expected for Androguard, because the analysis is relatively simple and the tool is widely adopted, as previously mentioned.
Mallodroid, being a relatively simple script leveraging Androguard, benefits from Androguard's resilience.
It should be noted that Saaf kept a high success ratio until 2014 and then quickly decreased to less than 20%.
This example shows that, even with an identical source code and the same running platform, a tool can change its behaviour over time because of the evolution of the structure of the input files.
An interesting comparison is the specific case of IC3 and Ic3_fork.
Until 2019, their success rates were very similar. After 2020, Ic3_fork continues to decrease, whereas IC3 keeps a success rate of around 60%.
/*
```
sqlite> SELECT apk1.first_seen_year, (COUNT(*) * 100) / (SELECT 20 * COUNT(*)
```
*/
To compare the influence of the date, #SDK version and size of applications, we fixed one parameter while varying another.
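As a sketch of this decorrelation step, assuming the runs were loaded into a pandas table with hypothetical columns `year`, `dex_size` and `success`:

```python
# Sketch of the decorrelation: hold the year fixed and measure the
# finishing rate along bytecode-size deciles. Column names are assumed.
import pandas as pd

def finishing_rate_by_size(runs: pd.DataFrame, year: int = 2022) -> pd.Series:
    fixed = runs[runs["year"] == year]                       # fix one parameter
    decile = pd.qcut(fixed["dex_size"], q=10, labels=False)  # vary the other
    return fixed.groupby(decile)["success"].mean()           # rate per decile
```

Swapping the roles of the two columns gives the symmetric analysis at a fixed bytecode size.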
#todo[Alt text for fig rasta-decorelation-size]
#figure(stack(dir: ltr,
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-2022>],
[#figure(
@ -194,17 +195,17 @@ To compare the influence of the date, #SDK version and size of applications, we
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-2022>]
), caption: [Finishing rate by bytecode size for #APK detected in 2022]
) <fig:rasta-decorelation-size>
#paragraph[Fixed application year. (#num(5000) #APKs)][
We selected the year 2022, which has a good amount of representatives for each decile of size in our application dataset.
@fig:rasta-rate-evolution-java-2022 (resp. @fig:rasta-rate-evolution-non-java-2022) shows the finishing rate of the tools as a function of the bytecode size for Java-based tools (resp. non-Java-based tools) analysing applications from 2022.
We can observe that the finishing rate of all Java-based tools decreases as the bytecode size grows.
Half of the non-Java-based tools show the same behaviour.
]
#todo[Alt text for fig rasta-decorelation-year]
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-year>],
[#figure(
@ -224,15 +225,16 @@ We can observe that all Java based tools have a finishing rate decreasing over y
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-year>]
), caption: [Finishing rate by discovery year with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-year>
#paragraph[Fixed application bytecode size. (#num(6252) #APKs)][
We selected the sixth decile (between 4.08 and 5.20 MB), which is well represented across a wide range of years.
@fig:rasta-rate-evolution-java-decile-year (resp. @fig:rasta-rate-evolution-non-java-decile-year) represents the finishing rate depending on the year at a fixed bytecode size.
We observe that, for Java-based tools, 9 out of 12 have a finishing rate dropping below 20%, which is not the case for non-Java-based tools.
]
#todo[Alt text for fig rasta-decorelation-min-sdk]
width: 50%,
alt: ""
),
caption: [Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-java-decile-min-sdk>],
[#figure(
@ -252,22 +254,22 @@ We observe that 9 tools over 12 have a finishing rate dropping below 20% for Jav
width: 50%,
alt: "",
),
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-min-sdk>]
), caption: [Finishing rate by min #SDK with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-min-sdk>
We performed similar experiments by varying the min #SDK and target #SDK versions, still with a fixed bytecode size between 4.08 and 5.2 MB, as shown in @fig:rasta-rate-evolution-java-decile-min-sdk and @fig:rasta-rate-evolution-non-java-decile-min-sdk.
We found that, contrary to the target #SDK, the min #SDK version has an impact on the finishing rate of Java-based tools: 8 tools out of 12 are below 50% after #SDK 16.
This is not surprising, as the min #SDK is highly correlated with the year.
#highlight-block(breakable: false)[
*#rq2 answer:*
For the #nbtoolsselected tools that can be at least partially used, a global decrease in the success rate of their analyses is observed over time.
Starting at a 78% success rate, after five years, tools have 61% success; after ten years, 45% success.
The success rate varies based on the size of the bytecode and #SDK version.
The date is also correlated with the success rate for Java-based tools only.
]
//table.vline(end: 3),
//table.vline(start: 4),
//table.cell(rowspan:2)[*Average date*],
[*#APK*],
[*#DEX*],
),
table.cell(colspan: 3/*4*/, inset: 3pt)[],
table.hline(),
table.hline(),
)},
placement: none, // floating figure makes this table go in the previous section :grim:
caption: [Average size and date of goodware/malware parts of the RASTA dataset],
) <tab:rasta-sizes>
We sampled our dataset to have a variety of #APK sizes, but the size of the application is not entirely proportional to the bytecode size.
Looking at @tab:rasta-sizes, we can see that although malware are, on average, bigger #APKs, they contain less bytecode than goodware.
In the previous section, we saw that the size of the bytecode has the most significant impact on the finishing rate of analysis tools, and indeed, @fig:rasta-exit-goodmal reflects that.
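As an illustration, such averages can be obtained with simple aggregate queries; the sketch below assumes a sqlite results database whose file, table and column names are hypothetical:

```python
# Sketch: average APK and DEX sizes for goodware vs malware.
# Database file, table and column names are assumptions.
import sqlite3

conn = sqlite3.connect("rasta.db")  # hypothetical results database
for label, cond in [("goodware", "vt_detection = 0"),
                    ("malware", "vt_detection != 0")]:
    avg_apk, avg_dex = conn.execute(
        f"SELECT AVG(apk_size), AVG(dex_size) FROM apk WHERE {cond}"
    ).fetchone()
    print(f"{label}: apk={avg_apk} dex={avg_dex}")
```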
/*
```
sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
```
*/
In @fig:rasta-exit-goodmal, we compared the finishing rate of malware and goodware applications for the evaluated tools.
We can see that malware and goodware seem to generate a similar number of timeouts.
However, with the exception of two tools (apparecium and redexer), we can see a trend of goodware being harder to analyse than malware.
Some tools, like DAD or perfchecker, have a finishing rate more than 20 points higher on malware than on goodware.
#figure({
show table: set text(size: 0.80em)
table.cell(rowspan: 2)[*Decile*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[*Average #DEX size (MB)*],
table.vline(end: 3),
table.vline(start: 4),
table.cell(colspan:2)[* Finishing Rate: #FR*],
caption: [#DEX size and Finishing Rate (#FR) per decile],
) <tab:rasta-sizes-decile>
We saw that the bytecode size may be an explanation for this difference.
To investigate this further, @tab:rasta-sizes-decile reports the bytecode size and the finishing rate of goodware and malware in each decile of bytecode size.
We also computed the ratios of bytecode size and finishing rate between the two populations.
We observe that while the bytecode size ratio between goodware and malware stays close to one in each decile (excluding the two extremes), the goodware/malware finishing rate ratio decreases with each decile.
It goes from 1.03 in the 2#super[nd] decile to 0.67 in the 9#super[th] decile.
We conclude from this table that, at equal size, analysing malware still triggers fewer errors than analysing goodware, and that the difference in generated errors between goodware and malware increases with the bytecode size.
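A sketch of this per-decile computation, again assuming a pandas table with hypothetical columns `dex_size`, `is_malware` and `success`:

```python
# Sketch of the per-decile comparison: within each bytecode-size decile,
# compute goodware/malware ratios of average DEX size and finishing rate.
# Column names (and a boolean is_malware flag) are assumptions.
import pandas as pd

def decile_ratios(runs: pd.DataFrame) -> pd.DataFrame:
    runs = runs.assign(decile=pd.qcut(runs["dex_size"], 10, labels=False))
    stats = runs.groupby(["decile", "is_malware"]).agg(
        dex=("dex_size", "mean"), fr=("success", "mean")).unstack("is_malware")
    return pd.DataFrame({
        "size_ratio": stats[("dex", False)] / stats[("dex", True)],
        "fr_ratio": stats[("fr", False)] / stats[("fr", True)],
    })
```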
#highlight-block()[
*#rq3 answer:*
Analysing malware applications triggers fewer errors for static analysis tools than analysing goodware at comparable bytecode sizes.
]