== Discussion <sec:rasta-discussion>

=== State-of-the-art comparison

Our findings are consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of DIALDroid-Bench@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully with the 6 evaluated tools@pauckAndroidTaintAnalysis2018.
Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools, but using the Rasta dataset of #NBTOTALSTRING applications.
We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio.
We confirmed that most tools require a significant amount of work to get them running@reaves_droid_2016.
// Our investigations of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, as already identified by Pauck #etal.

Investigating the reasons behind tool errors is a difficult task that we leave for future work.
For now, our manual investigations show that the nature of the errors varies from one analysis to another, with no easy way for the end user to fix them.

=== Recommendations
Finally, we summarize some takeaways that developers should follow to improve the chances that their software can be reused.

To improve the reliability of their software, developers should apply classical development best practices, for example continuous integration, testing, and code review.
To improve reusability, developers should document how the tool is used, provide a minimal working example, and describe the expected results.
Interactions with the running environment should be minimized, for example by using a Docker container, a virtual environment, or even a virtual machine.
Additionally, a small dataset should be provided for a more extensive test campaign; publishing the expected results for this dataset would make it possible to evaluate the reproducibility of experiments, as sketched below.
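As an illustration, such a check could be automated by a short script that runs the tool on the published sample and compares its verdicts with the published expected results. The sketch below is hypothetical: the `mytool` command, the `sample_apks/` directory and the `expected.json` file stand in for the artifacts a developer would actually ship.

```python
#!/usr/bin/env python3
# Reproducibility smoke test (sketch): run the tool on a small published
# dataset and compare its verdicts against the published expected results.
import json
import subprocess
from pathlib import Path

expected = json.loads(Path("expected.json").read_text())  # {apk name: verdict}

mismatches = 0
for apk in sorted(Path("sample_apks").glob("*.apk")):
    try:
        proc = subprocess.run(["mytool", str(apk)],
                              capture_output=True, text=True, timeout=600)
        verdict = proc.stdout.strip() if proc.returncode == 0 else "failure"
    except subprocess.TimeoutExpired:
        verdict = "timeout"
    if verdict != expected.get(apk.name):
        print(f"{apk.name}: got {verdict!r}, expected {expected.get(apk.name)!r}")
        mismatches += 1

raise SystemExit(1 if mismatches else 0)
```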
A final important remark concerns the libraries used by a tool.
We have seen two types of libraries:
- internal libraries, which manipulate the internal data of the tool;
- external libraries, which are used to manipulate the input data (APKs, bytecode, resources).
Our manual investigations showed that the external libraries are the ones leading to crashes, because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files).
We believe that developers should provide enough documentation to make a later upgrade of these external libraries possible.
// For example, old versions of apktool are the libraries raising the most errors.
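One way to keep such an upgrade possible is to confine the external library behind a single documented entry point. The following sketch assumes a hypothetical parsing library `apkparser`, standing in for whichever external dependency a tool actually uses:

```python
# Sketch: a single wrapper around the external APK-parsing dependency.
# Upgrading the library, or diagnosing its crashes on recent APKs
# (new file formats, multi-DEX), then touches only this function.
import logging

def load_apk(path: str):
    """Parse an APK; return None instead of crashing the whole analysis."""
    try:
        import apkparser  # hypothetical external dependency; pin and document its version
        return apkparser.parse(path)
    except Exception as exc:
        logging.error("external parser failed on %s: %s", path, exc)
        return None
```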
=== Threats to validity

Our application dataset is biased in favor of Androguard, because Androzoo has already used Androguard internally when collecting applications and discarded any application that cannot be processed with this tool.

Despite our best efforts, it is possible that we made mistakes when building or using the tools.
It is also possible that we wrongly classified a result as a failure.
To mitigate this risk, we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion.
// Before running the final experiment, we also ran the tools on a subset of our dataset and manually inspected the most common errors, to ensure that they were not trivial errors that could be solved.

The timeout value and the amount of memory are fixed arbitrarily.
To mitigate their effect, we analyzed a small extract of our dataset with more memory and a longer timeout to measure any difference, as sketched below.
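Such a sensitivity check can be sketched as follows; `mytool` and the limit values are placeholders rather than the exact settings of our campaign:

```python
# Sketch: rerun a small extract of the dataset with doubled limits and
# report any APK whose outcome changes, i.e. any limit-sensitive result.
import subprocess

def run(apk: str, timeout_s: int, mem_kib: int) -> str:
    try:
        proc = subprocess.run(
            # ulimit -v caps the virtual memory (in KiB) of the child shell
            ["bash", "-c", f"ulimit -v {mem_kib} && mytool {apk}"],
            capture_output=True, text=True, timeout=timeout_s)
        return "success" if proc.returncode == 0 else "failure"
    except subprocess.TimeoutExpired:
        return "timeout"

for apk in ["sample1.apk", "sample2.apk"]:  # small extract of the dataset
    base = run(apk, timeout_s=3600, mem_kib=8 * 1024 * 1024)    # 1 h, 8 GiB
    large = run(apk, timeout_s=7200, mem_kib=16 * 1024 * 1024)  # 2 h, 16 GiB
    if base != large:
        print(f"{apk}: {base} -> {large} (limit-sensitive)")
```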
Finally, the use of VirusTotal for determining whether an application is a malware may be wrong.
To limit this impact, we decided that an application is a malware only when at least 5 antiviruses report it as malicious, and a goodware only when no antivirus reports it.
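Expressed as code, this decision rule reads as the following sketch, where `detections` stands for the number of VirusTotal engines flagging an application:

```python
# Sketch of the decision rule described above: at least 5 antivirus
# detections => malware; zero detections => goodware; otherwise no decision.
MALWARE_THRESHOLD = 5

def label(detections: int) -> str | None:
    if detections >= MALWARE_THRESHOLD:
        return "malware"
    if detections == 0:
        return "goodware"
    return None  # ambiguous: excluded from the malware/goodware comparison
```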
/*
== Discussion <sec:rasta-discussion>

#figure({
  show table: set text(size: 0.50em)
  show table.cell.where(y: 0): it => if it.x == 0 { it } else { rotate(-90deg, reflow: true, it) }
) <tab:rasta-avgerror>
In this section, we investigate the reasons behind the high ratio of failures presented in @sec:rasta-xp.
@tab:rasta-avgerror reports the average number of errors and the average time and memory consumption for the analysis of one APK file.
We also compare our conclusions to the ones of the literature.

Anadroid: DONE
SELECT AVG(cnt), MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM error WHERE tool_name = 'anadroid' AND msg='Could not decode arsc file' GROUP BY sha256);
*/

#paragraph[Androguard and Androguard_dad][
Surprisingly, while Androguard almost never fails to analyze an APK, the internal decompiler of Androguard (DAD) fails more than half of the time.
The analysis of the logs shows that the issue comes from the way the decompiled methods are stored: each method is stored in a file named after the method name and signature, and this file name can quickly exceed the size limit of the file system (255 characters on most file systems).
It should be noted that Androguard_dad rarely fails on the Drebin dataset.
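The following sketch illustrates this failure mode and one possible mitigation (a fixed-length digest); the method signature is fabricated for illustration, and the mitigation is our suggestion, not what DAD implements:

```python
# Sketch: file names derived from full method signatures easily exceed the
# 255-character limit enforced by most file systems; a fixed-length digest
# avoids the overflow while a side table keeps the readable mapping.
import hashlib

NAME_MAX = 255  # per-filename limit on most file systems

signature = "Lcom/example/" + "a" * 230 + ";->run(Ljava/util/List;)V"
filename = signature.replace("/", "_") + ".java"
print(len(filename) > NAME_MAX)  # True: writing this file would fail

safe = hashlib.sha256(signature.encode()).hexdigest() + ".java"  # 69 chars
```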
/*
We encountered similar issues with libraries and operating-system incompatibilities, and noticed that, with time, dependency issues may impact the build process.
For instance, we encountered cases where the repository hosting the dependencies had been closed, or where Maven failed to download dependencies because the OS version did not support SSL, which is now mandatory to access Maven Central.
//, and even one case where we could not find anywhere the compiled version of sbt used to build a tool.
*/