I declare this manuscript finished
All checks were successful: test_checkout (push) successful in 1m48s
Commit 5c3a6955bd (parent 9f39ded209)
14 changed files with 162 additions and 131 deletions
@ -111,7 +111,8 @@ The remaining #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_faceboo
    table.cell(colspan: 4)[...],
    table.hline(),
  ),
  caption: [Most common dynamically loaded files],
  placement: top,
) <tab:th-bytecode-hashes>

=== Impact on Analysis Tools

@ -167,16 +168,6 @@ This is a reasonable failure rate, but we should keep in mind that it adds up to
To check the impact on the finishing rate of our instrumentation, we then run the same experiment we ran in @sec:rasta.
We run the tools on the #APKs before and after instrumentation and compare the finishing rates in @fig:th-status-npatched-vs-patched (without taking into account the #APKs we failed to patch#footnote[Due to a handling error during the experiment, the figure shows the results for #nb_patched_rasta #APKs instead of #nb_patched. \ We also ignored the tool from Wognsen #etal due to its high number of timeouts.]).

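
The finishing rate itself is simply the fraction of analyses that terminate without error. A minimal post-processing sketch is shown below; the CSV layout, file name, and the `FINISHED` status label are illustrative assumptions, not the exact format produced by our experiment.

```python
# Hypothetical post-processing: compute per-tool finishing rates for the
# original and patched APKs from a CSV with columns: tool, variant, status.
# The file name, column names and the "FINISHED" label are assumptions.
import csv
from collections import defaultdict

def finishing_rates(path: str) -> dict[tuple[str, str], float]:
    counts = defaultdict(lambda: [0, 0])  # (tool, variant) -> [finished, total]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["tool"], row["variant"])  # variant: "original" or "patched"
            counts[key][1] += 1
            if row["status"] == "FINISHED":
                counts[key][0] += 1
    return {key: finished / total for key, (finished, total) in counts.items()}

if __name__ == "__main__":
    for (tool, variant), rate in sorted(finishing_rates("results.csv").items()):
        print(f"{tool:12} {variant:9} {rate:.1%}")
```
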
#todo[alt text @fig:th-status-npatched-vs-patched]
#figure({
  image(
@ -189,6 +180,16 @@ On the other hand, Saaf do not detect the issue with Apktool and pursues the ana
  caption: [Exit status of static analysis tools on original #APKs (left) and patched #APKs (right)]
) <fig:th-status-npatched-vs-patched>

The finishing rate comparison is shown in @fig:th-status-npatched-vs-patched.
We can see that in most cases, the finishing rate is either the same or slightly lower for the instrumented application.
This is consistent with the fact that we add more bytecode to the application, hence adding more opportunities for failure during analysis.
There are two notable exceptions: Saaf and IC3.
The finishing rate of IC3, which was previously reasonable, dropped to 0 after our instrumentation, while the finishing rate of Saaf jumped to 100%, which is extremely suspicious.
Analysing the logs showed that both cases have the same origin: the bytecode generated by our instrumentation has a version number of 37 (the version introduced by Android 7.0).
Unfortunately, neither the version of Apktool used by Saaf nor Dare (the tool used by IC3 to convert Dalvik bytecode to Java bytecode) recognises this bytecode version, and thus both fail to parse the #APK.
In the case of Dare and IC3, our experiment correctly identifies this as a crash.
On the other hand, Saaf does not detect the issue with Apktool: it pursues the analysis with no bytecode to analyse and returns a valid result file, but for an empty application.
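
To make the failure mode concrete (the snippet below is our own illustration, not code from Apktool, Dare, or Androscalpel): the Dalvik executable format stores its version as ASCII digits inside the 8-byte header magic, so a parser that only accepts versions up to 035 rejects a version 037 file before reading any bytecode.

```python
# Sketch: read the DEX format version from the header magic of a .dex file.
# The magic is b"dex\n" + three ASCII digits + a NUL byte, e.g. b"dex\n035\x00"
# for older files and b"dex\n037\x00" for bytecode targeting Android 7.0+.
import sys

def dex_version(path: str) -> int:
    with open(path, "rb") as f:
        magic = f.read(8)
    if len(magic) != 8 or magic[:4] != b"dex\n" or magic[7:8] != b"\x00":
        raise ValueError(f"{path} does not look like a .dex file")
    return int(magic[4:7])  # e.g. b"037" -> 37

if __name__ == "__main__":
    # e.g. a classes.dex extracted from the instrumented APK
    print(dex_version(sys.argv[1]))
```
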

#todo[Flowdroid results are inconclusive: some APKs have more leaks after instrumentation and as many have fewer? Also, running Flowdroid on the same APK can return a different number of leaks???]

=== Example
@ -266,6 +267,35 @@ Although self-explanatory, verifying the code of those methods indeed confirms t
  caption: [Code of `Main.main()`, as shown by Jadx, after patching],
) <lst:th-demo-after>

#todo[alt text for @fig:th-cg-before and @fig:th-cg-after]
#figure([
  #figure(
    render(
      read("figs/demo_main_main.dot"),
      width: 100%,
      alt: (
        "",
      ).join(),
    ),
    caption: [Call Graph of `Main.main()` generated by Androguard before patching],
  ) <fig:th-cg-before>

  #figure(
    render(
      read("figs/patched_main_main.dot"),
      width: 100%,
      alt: (
        "",
      ).join(),
    ),
    caption: [Call Graph of `Main.main()` generated by Androguard after patching],
  ) <fig:th-cg-after>
],
caption: none,
kind: "th-cg-cmp-andro",
supplement: none,
)

For a higher-level view of the method, we can also look at its call graph.
We used Androguard to generate the call graphs in @fig:th-cg-before and @fig:th-cg-after#footnote[We manually edited the generated .dot files for readability.].
@fig:th-cg-before shows the original call graph and gives a good idea of the obfuscation methods used: we can see calls to `Main.decrypt(String)`, which itself calls cryptographic #APIs, as well as calls to `ClassLoader.loadClass(String)`, `Class.getMethod(String, Class[])` and `Method.invoke(Object, Object[])`.

@ -275,34 +305,11 @@ In grey on the figure, we can see the glue methods (`T.check_is_Xxx_xxx(Method)`
Those methods are part of the instrumentation process presented in @sec:th-trans, but they do not contribute much to the analysis of the call graph.
In red on the figure, however, are the calls that were hidden by reflection in the first call graph; thanks to the bytecode of the called methods being injected into the application, we can also see that they call `Utils.source(String)` and `Utils.sink(String)`, the methods we defined for this application as the source of confidential data and the exfiltration method.

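
For reference, call graphs like these can be exported through Androguard's Python API; the sketch below shows one possible workflow, not necessarily the exact commands we used. The #APK and output file names are placeholders, and the manual .dot clean-up mentioned in the footnote is not part of it.

```python
# Sketch: export the call graph of an APK with Androguard (Python API).
# "demo.apk" and the output file name are placeholders.
from androguard.misc import AnalyzeAPK
from networkx.drawing.nx_pydot import write_dot

apk, dex_list, analysis = AnalyzeAPK("demo.apk")
call_graph = analysis.get_call_graph()       # networkx DiGraph of method-to-method calls
write_dot(call_graph, "demo_main_main.dot")  # render later with Graphviz, e.g. dot -Tpdf
```
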
=== Androscalpel Performance <sec:th-lib-perf>

Because we implemented our own instrumentation library, we wanted to compare it to existing options.
Unfortunately, we did not have time to compare the robustness and correctness of the generated applications.
However, we did compare the performance of our library, Androscalpel, to Apktool and Soot, over the first 100 applications of RASTA (in alphabetical order of their SHA256 hashes).

Due to time constraints, we could not test a complex transformation, as adding registers requires complex operations for both Androscalpel and Apktool (see @sec:th-implem for more details).
We decided to test two operations: traversing the instructions of an application (a read-only operation), and regenerating an application without modification (a read/write operation).

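
A minimal sketch of how the time and memory of one run can be measured, assuming each tool is invoked as a separate process; the command line shown is a placeholder, not our exact harness.

```python
# Hypothetical measurement wrapper: run one tool on one APK and record
# wall-clock time, peak resident memory of the children, and the exit status.
# ru_maxrss is reported in kilobytes on Linux and accumulates over all children,
# so the wrapper should be restarted for each (tool, APK) pair.
import resource
import subprocess
import time

def measure(cmd: list[str], timeout: int = 3600) -> tuple[float, float, int]:
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    elapsed = time.monotonic() - start
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_kb / 1024 / 1024, proc.returncode  # seconds, GiB, status

# Placeholder invocation; each tool has its own command line:
# print(measure(["java", "-jar", "apktool.jar", "d", "app.apk", "-o", "out"]))
```
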
@ -316,19 +323,46 @@ It should be noted that all three of the tested tools have multiprocessing suppo
    table.header(
      table.cell(colspan: 2)[Tool], [Soot], [Apktool], [Androscalpel],
    ),
    table.cell(colspan: nb_col, inset: 1pt, stroke: none)[],
    table.cell(rowspan: 3)[Read],
    [Time (s)], ..for tool in ("soot", "apktool", "androscalpel") {
      let res = performance_results.at(tool).read
      (num(calc.round(res.cumulative_time / res.nb_results, digits: 2)),)
    },
    [Mem (GB)], ..for tool in ("soot", "apktool", "androscalpel") {
      let res = performance_results.at(tool).read
      (num(calc.round(res.cumulative_mem / res.nb_results / 1000000, digits: 2)),)
    },
    [Detected Crashes], ..for tool in ("soot", "apktool", "androscalpel") {
      let res = performance_results.at(tool).read
      (num(100 - res.nb_results),)
    },
    table.cell(colspan: nb_col, inset: 1pt, stroke: none)[],
    table.cell(rowspan: 3)[Read/Write],
    [Time (s)], ..for tool in ("soot", "apktool", "androscalpel") {
      let res = performance_results.at(tool).write
      (num(calc.round(res.cumulative_time / res.nb_results, digits: 2)),)
    },
    [Mem (GB)], ..for tool in ("soot", "apktool", "androscalpel") {
      let res = performance_results.at(tool).write
      (num(calc.round(res.cumulative_mem / res.nb_results / 1000000, digits: 2)),)
    },
    [Detected Crashes], ..for tool in ("soot", "apktool", "androscalpel") {
      let res = performance_results.at(tool).write
      (num(100 - res.nb_results),)
    },
  )},
  caption: [Average time and memory consumption of Soot, Apktool and Androscalpel]
) <tab:th-compare-perf>

@tab:th-compare-perf compares the resources consumed by each tool for each operation.
We can see that for the read-only operation, we are 16 times faster than Soot and 8 times faster than Apktool, while keeping a smaller memory footprint.
When regenerating an application, the gap lessens, but we are still almost 8 times faster than Soot.
Some of this difference probably comes from implementation choices: Soot and Apktool are implemented in Java, which has a noticeable overhead compared to Rust.
However, a significant part of the difference can also be explained by the specialised nature of our library: we did not implement all the features Soot has, and we do not parse Android resources like Apktool does.
Having better performance does not mean that our solution can replace the others in all cases.

Nevertheless, it should be noted that over the 100 applications tested, Soot failed to regenerate 10 of them, Apktool 4, and Androscalpel only 1, showing that our efforts to limit crashes were successful.

#midskip