wip

2025-09-29 15:48:31 +02:00 · 2025-09-29 15:48:31 +02:00 · 87195a483a
commit 87195a483a
parent f23390279c
5 changed files with 170 additions and 168 deletions
--- a/3_rasta/4_failures_analysis.typ
+++ b/3_rasta/4_failures_analysis.typ
@@ -3,9 +3,9 @@
 #import "X_var.typ": *
 #import "X_lib.typ": *

-== Failure Analysis <sec:rasta-failure-analysis>
+== Failures Analysis <sec:rasta-failure-analysis>

-In this section, we investigate the reasons behind the high ratio of failures presented in @sec:rasta-xp. 
+In this section, we investigate the reasons behind the high failure ratio presented in @sec:rasta-xp. 
@tab:rasta-avgerror reports the average number of errors, the average time and memory consumption of the analysis of one APK file. 

 #figure({
@@ -99,7 +99,7 @@ In this section, we investigate the reasons behind the high ratio of failures pr


 
-=== Error Detected //<sec:rasta-errors>
+=== Detected Errors //<sec:rasta-errors>

 /*
 capture erreurs
@@ -109,7 +109,7 @@ stdout, stderr
 android.jar en version 9 qui génère des erreurs
 */

-During the running of our experiments we parse the standard output and error to capture:
+While running our experiment, we parsed the standard output and standard error to capture:

 - Java errors and stack traces
 - Python errors and stack traces
@@ -118,15 +118,15 @@ During the running of our experiments we parse the standard output and error to
 - XSB error messages
 - Ocaml errors

-For example, Dialdroid reports in average #num(55.9) errors for one successful analysis. 
-On the contrary, some tools such as Blueseal report very few error at a time, making it easier to identify the cause of the failure. 
+For example, Dialdroid reports an average of #num(55.9) errors for one successful analysis. 
+On the contrary, some tools, such as Blueseal, report very few errors at a time, making it easier to identify the cause of the failure. 

 Because some tools send back a high number of errors in our logs (up to #num(46698) for one execution), we tried to determine the error that is linked to the failed status. 
 Unfortunately, our manual investigations confirmed that the last error of a log output is not always the one that should be attributed to the global failure of the analysis. 
 The error that seems to generate the failure can occur in the middle of the execution, be caught by the code and then other subsequent parts of the code may generate new errors as consequences of the first one. 
-Similarly, the first error of a log is not always the cause of a failure. 
+Similarly, the first error in the logs is not always the cause of a failure. 
 Sometimes errors successfully caught and handled are logged anyway. 
-Thus, it is impossible to extract accurately the error responsible for a failed execution. 
+Thus, it is impossible to accurately extract the error responsible for a failed execution. 
 Therefore, we investigated the nature of errors globally, without distinction between error messages in a log.

 #todo()[alt text for rasta-heatmap]
@@ -137,23 +137,23 @@ Therefore, we investigated the nature of errors globally, without distinction be
    width: 100%,
    alt: "",
  ),
-  caption: [Heatmap of the ratio of errors reasons for all tools for the Rasta dataset],
+  caption: [Heatmap of the ratio of error reasons for all tools for the Rasta dataset],
 ) <fig:rasta-heatmap>

@fig:rasta-heatmap draws the most frequent error objects for each of the tools. 
 A black square is an error type that represents more than 80% of the errors raised by the considered tool.
-In between, gray squares show a ratio between 20% and 80% of the reported errors. 
+In between, grey squares show a ratio between 20% and 80% of the reported errors. 

-First, the heatmap helps us to confirm that our experiments is running in adequate conditions. 
+First, the heatmap helps us to confirm that our experiment is running in adequate conditions. 
 Regarding errors linked to memory, two errors should be investigated: `OutOfMemoryError` and `StackOverflowError`. 
-The first one only appears for gator with a low ratio. 
-Several tool have a low ratio of errors concerning the stack. 
-These results confirm that the allocated heap and stack is sufficient for running the tools with the Rasta dataset. 
-Regarding errors linked to the disk space, we observe few ratios for the exception `IOException`, `FileNotFoundError` and `FileNotFoundException`.
-Manual inspections revealed that those errors are often a consequence of a failed apktool execution.
+The first one only appears for Gator with a low ratio. 
+Several tools have a low ratio of errors concerning the stack. 
+These results confirm that the allocated heap and stack are sufficient for running the tools with the Rasta dataset. 
+Regarding errors linked to the disk space, we observe small ratios for the exception `IOException`, `FileNotFoundError` and `FileNotFoundException`.
+Manual inspections revealed that those errors are often a consequence of a failed Apktool execution.

 Second, the black squares indicate frequent errors that need to be investigated separately. 
-In the next subsection, we manually analysed, when possible, the code that generates this high ratio of errors and we give feedback about the possible causes and difficulties to write a bug fix.
+In the next subsection, we manually analyse, when possible, the code that generates these high ratios of errors, and we give feedback about the possible causes and the difficulties in writing a bug fix.

 === Tool by Tool Investigation // <sec:rasta-tool-by-tool-inv>
 /*
@@ -211,10 +211,10 @@ Anadroid: DONE
 */

 #paragraph[Androguard and Androguard_dad][
-Surprisingly, while Androguard almost never fails to analyse an APK, the internal decompiler of Androguard (DAD) fails more than half of the time. 
+Surprisingly, while Androguard rarely fails to analyse an #APK, the internal decompiler of Androguard (DAD) fails more than half of the time. 
 The analysis of the logs shows that the issue comes from the way the decompiled methods are stored: each method is stored in a file named after the method name and signature, and this file name can quickly exceed the size limit (255 characters on most file systems). 
 It should be noticed that Androguard_dad rarely fails on the Drebin dataset. 
-This illustrates the importance to test tools on real and up-to-date APKs: even a bad handling of filenames can influence an analysis.
+This illustrates the importance of testing tools on real and up-to-date #APKs: even a bad handling of filenames can influence an analysis.
 ]

 /*
@@ -231,14 +231,14 @@ dad: SError
 #paragraph([Mallodroid and Apparecium])[
 Mallodroid and Apparecium stand out as the tools that raised the most errors in one run. 
 They can raise more than #num(10000) errors per analysis. 
-However, it happened only for a few dozen of APKs, and conspicuously, the same APKs raised the same hight number of errors for both tools. 
-The recurring error is a `KeyError` raise by Androguard when trying to find a string by its identifier. 
-Although this error is logged, it seems successfully handled and during a manual analysis of the execution, both tools seemingly perform there analysis without issue. 
-This hight number of occurrences may suggest that the output is not valid. 
-Still, the tools claim to return a result, so, from our perspective, we consider those analysis as successful.
-For other numerous errors, we could not identify the reason why those specific applications raise so many exceptions. 
-However we noticed that Mallodroid and Apparecium use outdated version of Androguard (respectively the version 3.0 and 2.0), and neither Androguard v3.3.5 nor DAD with Androguard v3.3.5 raise those exceptions. 
-This suggest the issue has been fixed by Androguard and that Mallodroid and Apparecium could benefit from a dependency upgrade. 
+However, it happened only for a few dozen #APKs, and conspicuously, the same #APKs raised the same high number of errors for both tools. 
+The recurring error is a `KeyError` raised by Androguard when trying to find a string by its identifier. 
+Although this error is logged, it seems successfully handled, and during a manual analysis of the execution, both tools seemingly perform their analysis without issue. 
+This high number of occurrences may suggest that the output is not valid. 
+Still, the tools claim to return a result, so, from our perspective, we consider those analyses as successful.
+For numerous other errors, we could not identify the reason why those specific applications raise so many exceptions. 
+However, we noticed that Mallodroid and Apparecium use outdated versions of Androguard (versions 3.0 and 2.0, respectively), and neither Androguard v3.3.5 nor DAD with Androguard v3.3.5 raises those exceptions. 
+This suggests the issue has been fixed by Androguard and that Mallodroid and Apparecium could benefit from a dependency upgrade. 
 ]

 /*
@@ -254,7 +254,7 @@ Instruction10x%

 #paragraph([Blueseal])[
 Because Blueseal rarely logs more than one error when crashing, it is easy to identify the relevant error. 
-The majority of crashes comes from unsupported Android versions (due to the magic number of the DEX files not being supported by the version of back smali used by Blueseal) and methods whose implementation are not found (like native methods).
+The majority of crashes come from unsupported Android versions (due to the magic number of the DEX files not being supported by the version of baksmali used by Blueseal) and methods whose implementation is not found (like native methods).
 ]

 /*
@@ -282,10 +282,10 @@ Droidsafe:
 CannotFindMethodException
 */

-#paragraph([Ic3 and Ic3_fork])[
-We compared the number of errors between Ic3 and Ic3_fork. 
-Ic3_fork reports less errors for all types of analysis which suggests that the author of the fork have removed the outputed errors from the original code: the thrown errors are captured in a generic `RuntimeException` which removes the semantic, making it harder our investigations.
-Nevertheless, Ic3_fork has more failures than Ic3: the number of errors reported by a tool is not correlated to the final success of its analysis.  
+#paragraph([IC3 and IC3_fork])[
+We compared the number of errors between IC3 and IC3_fork. 
+IC3_fork reports fewer errors for all types of analysis, which suggests that the author of the fork has removed the outputted errors from the original code: the thrown errors are captured in a generic `RuntimeException`, which discards their semantics and makes our investigations harder.
+Nevertheless, IC3_fork has more failures than IC3: the number of errors reported by a tool is not correlated to the final success of its analysis.  
 ]

 /*
@@ -305,13 +305,13 @@ jasError
 */

 #paragraph([Flowdroid])[
-Our exchanges with the authors of Flowdroid led us to expect more timeouts from too long executions than failed run. 
-Surprisingly we only got #mypercent(37,NBTOTAL) of timeout, and a hight number of failures.
-We tried to detect recurring causes of failures, but the complexity of Flowdroid make the investigation difficult. 
-Most exceptions seems to be related to concurrency. //or display a generic messages. 
-Other errors that came up regularly are `java.nio.channels.ClosedChannelException` which is raised when Flowdoid fails to read from the APK, although we did not find the reason of the failure, null pointer exceptions when trying to check if a null value is in a `ConcurrentHashMap` (in `LazySummaryProvider.getClassFlows()`) and `StackOverflowError` from `StronglyConnectedComponentsFast.recurse()`. 
-We randomly selected 20 APKs that generated stack overflows in Flowdroid and retried the analysis with 500G of RAM allocated to the JVM. 
-18 of those runs still failed with a stack overflow without using all the allocated memory, the other two failed after raising null pointer exceptions from `getClassFlows`. 
+Our exchanges with the authors of Flowdroid led us to expect more timeouts from executions taking too long than failed runs. 
+Surprisingly, we only got #mypercent(37,NBTOTAL) of timeouts and a high number of failures.
+We tried to detect recurring causes of failures, but the complexity of Flowdroid makes the investigation difficult. 
+Most exceptions seem to be related to concurrency. //or display generic messages. 
+Other errors that came up regularly are `java.nio.channels.ClosedChannelException`, which is raised when Flowdroid fails to read from the APK, although we did not find the reason for the failure, null pointer exceptions when trying to check if a null value is in a `ConcurrentHashMap` (in `LazySummaryProvider.getClassFlows()`) and `StackOverflowError` from `StronglyConnectedComponentsFast.recurse()`. 
+We randomly selected 20 #APKs that generated stack overflows in Flowdroid and retried the analysis with 500GB of RAM allocated to the JVM. 
+18 of those runs still failed with a stack overflow without using all the allocated memory, and the other two failed after raising null pointer exceptions from `getClassFlows`. 
 This shows that the lack of memory is not the primary cause of those failures. 
 ]

@@ -330,4 +330,4 @@ Pauck: Flowdroid avg 2m on DIALDroid-Bench (real worlds apks)
 */

 As a conclusion, we observe that a lot of errors can be linked to bugs in dependencies. 
-Our attempts to upgrade those dependencies led to new errors appearing: we conclude that this is a no trivial task that require familiarity with the inner code of the tools.
+Our attempts to upgrade those dependencies led to new errors appearing: we conclude that this is a non-trivial task that requires familiarity with the inner code of the tools.
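
The error capture described in the "Detected Errors" subsection (parsing standard output and standard error for Java and Python exceptions, then reporting per-tool ratios in a heatmap) could be sketched as follows. This is only a minimal sketch under stated assumptions, not the parser used for the experiment: the regular expression and the names `count_error_types`, `error_ratios` and `heatmap_shade` are hypothetical, and the "white" shade for ratios below 20% is an assumption, since the text only defines the black and grey thresholds.

```python
import re
from collections import Counter

# Hypothetical pattern: matches exception class names such as
# java.lang.StackOverflowError, KeyError or FileNotFoundException.
ERROR_NAME = re.compile(
    r"\b((?:[A-Za-z_]\w*\.)*[A-Z]\w*(?:Error|Exception))\b"
)


def count_error_types(log_text):
    """Count every error class name found in a log, with no attempt to
    single out the one error responsible for the failure."""
    counts = Counter()
    for line in log_text.splitlines():
        for match in ERROR_NAME.finditer(line):
            # Keep only the short class name (drop the package prefix).
            counts[match.group(1).rsplit(".", 1)[-1]] += 1
    return counts


def error_ratios(counts):
    """Turn raw counts into the share of each error type for one tool."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def heatmap_shade(ratio):
    """Map a ratio to a heatmap shade: black above 80%, grey between
    20% and 80%; white below 20% is an assumption of this sketch."""
    if ratio > 0.8:
        return "black"
    if ratio >= 0.2:
        return "grey"
    return "white"
```

Counting all matches, rather than only the first or last one in a log, mirrors the decision above to study the nature of errors globally.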