start modifying contrib section
All checks were successful
/ test_checkout (push) Successful in 1m0s
This commit is contained in:
parent
ad66b1293d
commit
fd4d6fa239
49 changed files with 22629 additions and 88 deletions
@ -59,12 +59,11 @@ Reaves #etal also report that real world applications are more challenging to an
We will confirm and expand this result in this paper with a larger dataset than only 16 real-world applications.
// Indeed, a more diverse dataset would assess the results and give more insight about the factors impacting the performance of the tools.

// NOT ENOUGH SPACE!

// Finally, our approach is similar to the methodology employed by Mauthe #etal for decompilers@mauthe_large-scale_2021.
// To assess the robustness of Android decompilers, Mauthe #etal used 4 decompilers on a dataset of 40 000 applications.
// The error messages of the decompilers were parsed to list the methods that failed to decompile, and this information was used to estimate the main causes of failure.
// It was found that the failure rate is correlated with the size of the method, and that a substantial share of the failures come from third-party libraries rather than from the core code of the application.
// They also concluded that malware is easier to decompile entirely, but has a higher failure rate, meaning that the samples that are hard to decompile are substantially harder to decompile than goodware.
Finally, our approach is similar to the methodology employed by Mauthe #etal for decompilers@mauthe_large-scale_2021.
To assess the robustness of Android decompilers, Mauthe #etal used 4 decompilers on a dataset of 40 000 applications.
The error messages of the decompilers were parsed to list the methods that failed to decompile, and this information was used to estimate the main causes of failure.
It was found that the failure rate is correlated with the size of the method, and that a substantial share of the failures come from third-party libraries rather than from the core code of the application.
They also concluded that malware is easier to decompile entirely, but has a higher failure rate, meaning that the samples that are hard to decompile are substantially harder to decompile than goodware.
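This kind of failure accounting boils down to parsing tool logs and aggregating error causes. A minimal sketch (the log format is invented for illustration, not Mauthe #etal's actual output):

```python
import re
from collections import Counter

# Hypothetical log format: one line per failed method, e.g.
# "ERROR: failed to decompile com.foo.Bar.baz()V: unsupported opcode"
FAIL_RE = re.compile(r"ERROR: failed to decompile (\S+): (.+)")

def failure_causes(log_lines):
    """Count failed methods per error cause from decompiler logs."""
    causes = Counter()
    for line in log_lines:
        m = FAIL_RE.match(line)
        if m:
            causes[m.group(2)] += 1
    return causes

log = [
    "INFO: decompiling com.foo.Bar",
    "ERROR: failed to decompile com.foo.Bar.baz()V: unsupported opcode",
    "ERROR: failed to decompile com.foo.Qux.run()V: unsupported opcode",
]
top = failure_causes(log).most_common(1)
```

Grouping by message rather than by method is what lets the most frequent root causes surface from tens of thousands of runs.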

/*
@ -254,12 +254,11 @@ Problem 2: to sample, we use the deciles of apk size, but for our

*/

// Two datasets are used in the experiments of this section.
// The first one is *Drebin*@Arp2014, from which we extracted the malware part (the 5479 samples that we could retrieve) for comparison purposes only.
// It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases@Pendlebury2018.
// We intend to compare the rate of success on this old dataset with a more recent one.
// The second one,
We built a dataset named *Rasta* to cover all dates from 2010 to 2023.
Two datasets are used in the experiments of this section.
The first one is *Drebin*@Arp2014, from which we extracted the malware part (the 5479 samples that we could retrieve) for comparison purposes only.
It is a well-known and very old dataset that should not be used anymore because it contains temporal and spatial biases@Pendlebury2018.
We intend to compare the rate of success on this old dataset with a more recent one.
The second one, *Rasta*, we built to cover all dates from 2010 to 2023.
This dataset is a random extract of Androzoo@allixAndroZooCollectingMillions2016, for which we balanced applications across years and sizes.
For each year and inter-decile size range in Androzoo, 500 applications were extracted with an arbitrary proportion of 7% of malware.
This ratio was chosen because it is the goodware/malware ratio that we observed when performing a raw extract of Androzoo.
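The stratified extraction described above can be sketched as follows (the field names and the in-memory candidate list are hypothetical; the real extraction queries Androzoo):

```python
import random

def sample_stratum(candidates, n=500, malware_ratio=0.07, seed=42):
    """Draw n applications from one (year, size-decile) stratum,
    targeting a fixed proportion of malware."""
    rng = random.Random(seed)
    malware = [a for a in candidates if a["is_malware"]]
    goodware = [a for a in candidates if not a["is_malware"]]
    n_mal = round(n * malware_ratio)
    picked = rng.sample(malware, min(n_mal, len(malware)))
    picked += rng.sample(goodware, min(n - len(picked), len(goodware)))
    rng.shuffle(picked)
    return picked

# One stratum of 1000 candidate hashes, 10% of them flagged as malware.
candidates = [{"sha256": f"{i:064x}", "is_malware": i % 10 == 0}
              for i in range(1000)]
stratum = sample_stratum(candidates)
```

Repeating this per (year, size-decile) cell yields the balanced dataset; fixing the seed keeps the extraction reproducible.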
@ -1,4 +1,4 @@
#import "../lib.typ": todo, highlight, num, paragraph
#import "../lib.typ": todo, highlight, num, paragraph, SDK
#import "X_var.typ": *
#import "X_lib.typ": *
@ -55,8 +55,7 @@ For the tools that we could run, #resultratio of analysis are finishing successf
//(those with less than 50% of successful execution and including the two tools that we were unable to build).
]

/*
== RQ2: temporal evolution
=== RQ2: Size, #SDK and Date Influence

#todo[alt text for fig rasta-exit-evolution-java and rasta-exit-evolution-not-java]
@ -120,17 +119,7 @@ sqlite> SELECT apk1.first_seen_year, (COUNT(*) * 100) / (SELECT 20 * COUNT(*)
```
*/

#highlight()[
*RQ2 answer:* For the #nbtoolsselected tools that can be used partially, a global decrease of the success rate of tools' analysis is observed over time.
Starting at 78% of success rate, after five years, tools have 61% of success; after ten years, 45% of success.
]
*/

=== RQ2: Size, SDK and Date Influence

To measure the influence of the date, SDK version and size of applications, we fixed one parameter while varying another.
For the sake of clarity, we separated Java-based / non-Java-based tools.
To compare the influence of the date, #SDK version and size of applications, we fixed one parameter while varying another.

#todo[Alt text for fig rasta-decorelation-size]
#figure(stack(dir: ltr,
@ -209,21 +198,25 @@ We observe that 9 tools over 12 have a finishing rate dropping below 20% for Jav
caption: [Non-Java-based tools],
supplement: [Subfigure],
) <fig:rasta-rate-evolution-non-java-decile-min-sdk>]
), caption: [Finishing rate by min SDK with a bytecode size $in$ [4.08, 5.2] MB]
), caption: [Finishing rate by min #SDK with a bytecode size $in$ [4.08, 5.2] MB]
) <fig:rasta-decorelation-size>

We performed similar experiments by varying the min SDK and target SDK versions, still with a fixed bytecode size between 4.08 and 5.2 MB, as shown in @fig:rasta-rate-evolution-java-decile-min-sdk and @fig:rasta-rate-evolution-non-java-decile-min-sdk.
We found that, contrary to the target SDK, the min SDK version has an impact on the finishing rate of Java-based tools: 8 tools out of 12 are below 50% after SDK 16.
It is not surprising, as the min SDK is highly correlated with the year.
We performed similar experiments by varying the min #SDK and target #SDK versions, still with a fixed bytecode size between 4.08 and 5.2 MB, as shown in @fig:rasta-rate-evolution-java-decile-min-sdk and @fig:rasta-rate-evolution-non-java-decile-min-sdk.
We found that, contrary to the target #SDK, the min #SDK version has an impact on the finishing rate of Java-based tools: 8 tools out of 12 are below 50% after #SDK 16.
It is not surprising, as the min #SDK is highly correlated with the year.

#highlight()[
*RQ2 answer:*
The success rate varies based on the size of the bytecode and the SDK version.
For the #nbtoolsselected tools that can be used partially, a global decrease of the success rate of tools' analysis is observed over time.
Starting at 78% of success rate, after five years, tools have 61% of success; after ten years, 45% of success.
The success rate varies based on the size of the bytecode and the #SDK version.
The date is also correlated with the success rate for Java-based tools only.
]

=== RQ3: Malware vs Goodware
=== RQ3: Malware vs Goodware <sec:rasta-mal-vs-good>

#todo[complete @sec:rasta-mal-vs-good by commenting the new figures]

/*
```
@ -256,7 +249,6 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
*/

/*
#todo[Alt text for rasta-exit-goodmal]
#figure(
image(
@ -266,8 +258,6 @@ sqlite> SELECT vt_detection == 0, COUNT(DISTINCT sha256) FROM apk WHERE dex_size
),
caption: [Exit status comparing goodware and malware for the Rasta dataset],
) <fig:rasta-exit-goodmal>
*/

/*
[15:25] Jean-Marie Mineau
@ -295,7 +285,6 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
*/

/*
#figure({
show table: set text(size: 0.80em)
table(
@ -329,7 +318,6 @@ sqlite> SELECT AVG(apk_size) FROM apk WHERE vt_detection != 0;
)},
caption: [Average size and date of goodware/malware parts of the Rasta dataset],
) <tab:rasta-sizes>
*/

#figure({
@ -4,51 +4,6 @@

== Discussion <sec:rasta-discussion>

=== State-of-the-art comparison

Our findings are consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of DIALDroid-Bench@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully with the 6 evaluated tools@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio. We confirmed that most tools require a significant amount of work to get them running@reaves_droid_2016.
// Our investigations of crashes also confirmed that dependencies on older versions of Apktool are impacting the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, as already identified by Pauck #etal.

Investigating the reasons behind tools' errors is a difficult task that we leave for future work.
For now, our manual investigations show that the nature of errors varies from one analysis to another, without any easy fix available to the end user.

=== Recommendations

Finally, we summarize some takeaways that developers should follow to improve the odds that their software can be reused.

To improve the reliability of their software, developers should use classical development best practices, for example continuous integration, testing, and code review.
To improve reusability, developers should document the tool usage, provide a minimal working example, and describe the expected results.
Interactions with the running environment should be minimized, for example by using a docker container, a virtual environment or even a virtual machine.
Additionally, a small dataset should be provided for a more extensive test campaign, and publishing the expected results on this dataset would make it possible to evaluate the reproducibility of experiments.

A last important remark concerns the libraries used by a tool.
We have seen two types of libraries:
- internal libraries manipulating internal data of the tool;
- external libraries that are used to manipulate the input data (APKs, bytecode, resources).
We observed in our manual investigations that external libraries are the ones leading to crashes, because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files).
We believe that developers should provide enough documentation to make a later upgrade of these external libraries possible.
//: for example, old versions of apktool are the top most libraries raising errors.

=== Threats to validity

Our application dataset is biased in favor of Androguard, because Androzoo already used Androguard internally when collecting applications and discarded any application that could not be processed with this tool.

Despite our best efforts, it is possible that we made mistakes when building or using the tools.
It is also possible that we wrongly classified a result as a failure.
To mitigate this possible problem, we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion.
// Before running the final experiment, we also ran the tools on a subset of our dataset and looked manually at the most common errors to ensure that they are not trivial errors that can be solved.

The timeout value and the amount of memory are arbitrarily fixed.
To mitigate their effect, a small extract of our dataset has been analyzed with more memory/time to measure any difference.

Finally, the use of VirusTotal for determining whether an application is a malware may be wrong.
To limit this impact, we used a threshold of at least 5 antiviruses (resp. exactly 0) reporting an application as being a malware (resp. goodware) for taking a decision about maliciousness (resp. benignness).
/*
== Discussion <sec:rasta-discussion>

#figure({
show table: set text(size: 0.50em)
show table.cell.where(y: 0): it => if it.x == 0 { it } else { rotate(-90deg, reflow: true, it) }
@ -139,8 +94,6 @@ For limiting this impact, we used a threshold of at most 5 antiviruses (resp. no
) <tab:rasta-avgerror>

In this section, we investigate the reasons behind the high ratio of failures presented in @sec:rasta-xp.
@tab:rasta-avgerror reports the average number of errors, and the average time and memory consumption, of the analysis of one APK file.
We also compare our conclusions to those of the literature.
@ -254,7 +207,7 @@ Anadroid: DONE
SELECT AVG(cnt), MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM error WHERE tool_name = 'anadroid' AND msg='Could not decode arsc file' GROUP BY sha256);
*/

#paragraph[Androguard and Androguard_dad])[
#paragraph[Androguard and Androguard_dad][
Surprisingly, while Androguard almost never fails to analyze an APK, the internal decompiler of Androguard (DAD) fails more than half of the time.
The analysis of the logs shows that the issue comes from the way the decompiled methods are stored: each method is stored in a file named after the method name and signature, and this file name can quickly exceed the size limit (255 characters on most file systems).
It should be noted that Androguard_dad rarely fails on the Drebin dataset.
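The failure mode can be illustrated with a short sketch (this illustrates the file-name limit, it is not DAD's actual code; the hash fallback is one possible mitigation):

```python
import hashlib

NAME_MAX = 255  # filename length limit on most Linux file systems

def method_filename(class_name, method_name, descriptor):
    """Build an output file name from a method signature, falling back
    to a hash when the plain name would exceed NAME_MAX."""
    raw = f"{class_name}.{method_name}{descriptor}.java"
    if len(raw.encode()) <= NAME_MAX:
        return raw
    # Keep the name unique but bounded by hashing the full signature.
    digest = hashlib.sha256(raw.encode()).hexdigest()
    return f"{class_name[:64]}.{digest}.java"

# Deeply nested classes with long signatures easily exceed the limit.
deep = "com.example." + "Inner$" * 40 + "Leaf"
name = method_filename(deep, "process", "(Ljava/util/Map;)V")
```

Obfuscated or generated classes routinely produce such oversized names, which is consistent with DAD failing far more often on recent APKs than on Drebin.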
@ -398,9 +351,13 @@ We believe that it is explained by the fact that the complexity of the code incr
58.88
*/

Second, our findings are consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of the 30 DIALDroid-Bench real-world applications are analyzed successfully with the 6 evaluated tools@pauckAndroidTaintAnalysis2018.

=== State-of-the-art comparison

Our findings are consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of DIALDroid-Bench@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully with the 6 evaluated tools@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications.
We extended this result to our set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We confirmed that most tools require a significant amount of work to get them running@reaves_droid_2016.
Our investigations of crashes also confirmed that dependencies on older versions of Apktool are impacting the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, as already identified by Pauck #etal.

/*
@ -436,5 +393,37 @@ We confirmed that most tools require a significant amount of work to get them ru
We encountered similar issues with libraries and operating system incompatibilities, and noticed that, with time, dependency issues may impact the build process.
For instance we encountered cases where the repository hosting the dependencies was closed, or cases where maven failed to download dependencies because the OS version did not support SSL, now mandatory to access maven central.
//, and even one case where we could not find anywhere the compiled version of sbt used to build a tool.
*/

=== Recommendations

Finally, we summarize some takeaways that developers should follow to improve the odds that their software can be reused.

To improve the reliability of their software, developers should use classical development best practices, for example continuous integration, testing, and code review.
To improve reusability, developers should document the tool usage, provide a minimal working example, and describe the expected results.
Interactions with the running environment should be minimized, for example by using a docker container, a virtual environment or even a virtual machine.
Additionally, a small dataset should be provided for a more extensive test campaign, and publishing the expected results on this dataset would make it possible to evaluate the reproducibility of experiments.

A last important remark concerns the libraries used by a tool.
We have seen two types of libraries:
- internal libraries manipulating internal data of the tool;
- external libraries that are used to manipulate the input data (APKs, bytecode, resources).
We observed in our manual investigations that external libraries are the ones leading to crashes, because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files).
We believe that developers should provide enough documentation to make a later upgrade of these external libraries possible.
//: for example, old versions of apktool are the top most libraries raising errors.
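As an illustration of one such input variation, multi-DEX packaging is trivial to detect, yet libraries built against single-DEX assumptions can crash on it (a minimal sketch, not taken from any of the evaluated tools):

```python
import io
import zipfile

def list_dex(zf):
    """Return the DEX entries of an APK (an opened ZipFile).
    More than one entry means a multi-DEX application, which
    single-DEX-era libraries may mishandle."""
    return sorted(n for n in zf.namelist()
                  if n.startswith("classes") and n.endswith(".dex"))

# Build a tiny in-memory "APK" with two DEX files to illustrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("AndroidManifest.xml", b"")
    z.writestr("classes.dex", b"")
    z.writestr("classes2.dex", b"")
dex_entries = list_dex(zipfile.ZipFile(buf))
```

A tool that hard-codes `classes.dex` silently ignores or crashes on the extra entries; documenting this assumption would make the later library upgrade far easier.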

=== Threats to validity

Our application dataset is biased in favor of Androguard, because Androzoo already used Androguard internally when collecting applications and discarded any application that could not be processed with this tool.

Despite our best efforts, it is possible that we made mistakes when building or using the tools.
It is also possible that we wrongly classified a result as a failure.
To mitigate this possible problem, we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion.
// Before running the final experiment, we also ran the tools on a subset of our dataset and looked manually at the most common errors to ensure that they are not trivial errors that can be solved.

The timeout value and the amount of memory are arbitrarily fixed.
To mitigate their effect, a small extract of our dataset has been analyzed with more memory/time to measure any difference.

Finally, the use of VirusTotal for determining whether an application is a malware may be wrong.
To limit this impact, we used a threshold of at least 5 antiviruses (resp. exactly 0) reporting an application as being a malware (resp. goodware) for taking a decision about maliciousness (resp. benignness).
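That decision rule can be sketched as follows (assuming the threshold means at least 5 detections for malware and zero for goodware; the function name is ours):

```python
def label_apk(vt_detections, malware_threshold=5):
    """Label an APK from its VirusTotal detection count.

    Returns 'malware' when at least `malware_threshold` antiviruses
    flag it, 'goodware' when none do, and None for the ambiguous
    middle ground (no decision taken)."""
    if vt_detections >= malware_threshold:
        return "malware"
    if vt_detections == 0:
        return "goodware"
    return None
```

Leaving the 1-4 detection range unlabeled trades dataset size for label confidence, which is the usual mitigation for noisy antivirus verdicts.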