This commit is contained in:
parent
caa9cd8787
commit
8b5885ae55
1 changed file with 67 additions and 34 deletions
@@ -1,6 +1,6 @@
#import "@preview/diagraph:0.3.5": render
#import "../lib.typ": SDK, num, mypercent, ART, ie, APKs, API, APIs
#import "../lib.typ": todo, jfl-note
#import "X_var.typ": *
#import "../3_rasta/X_var.typ": NBTOTALSTRING

@@ -14,6 +14,9 @@ Because we are running the application on a recent version of Android (#SDK 34)
This represents #num(5000) applications out of the #NBTOTALSTRING total of the initial dataset.
Among them, we could not retrieve 43 from Androzoo, leaving us with #num(dyn_res.all.nb) applications to test.

We will first look at the results of the dynamic analysis and take a closer look at the bytecode we intercepted.
Then, we will study the impact the instrumentation has on static analysis tools, notably on their success rate, and we will finish with the analysis of a handcrafted application to check that the instrumentation does in fact improve the results of analysis tools.

=== Dynamic Analysis Results <sec:th-dyn-failure>
After running the dynamic analysis on our dataset for the first time, we realised our dynamic setup was quite fragile.
@@ -110,17 +113,68 @@ The remaining #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_faceboo
=== Impact on Analysis Tools
To see the impact of our transformation, we took the applications associated with the #num(nb_bytecode_collected - nb_google - nb_appsflyer - nb_facebook) unique #DEX files we found.
The applications were indeed obfuscated, making manual analysis tedious.
We did not find visible #DEX files or #APK files inside the applications, meaning the applications either download them or generate them from variables and assets at runtime.
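For illustration, here is a minimal Java sketch (with a hypothetical class name) of one way code can be loaded without any file ever becoming visible, assuming the #DEX bytes were downloaded or decrypted from an asset:

```java
import dalvik.system.InMemoryDexClassLoader;
import java.nio.ByteBuffer;

// Illustrative sketch, not code recovered from the analysed applications:
// since Android 8.0 (API 26), bytecode can be loaded straight from a byte
// array, so no .dex or .apk file needs to touch the disk.
class HiddenPayload {
    static Class<?> load(byte[] dexBytes, ClassLoader parent) throws Exception {
        InMemoryDexClassLoader loader =
                new InMemoryDexClassLoader(ByteBuffer.wrap(dexBytes), parent);
        return loader.loadClass("com.example.Payload"); // hypothetical name
    }
}
```
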
To estimate the scope of the code we made available, we use Androguard to generate the call graph of the applications, before and after the instrumentation.
@tab:th-compare-cg shows the number of edges of those call graphs.
The "Before" and "After" columns show the total number of edges of the graphs, and the "Diff" column is the number of new edges detected (#ie the number of edges after instrumentation minus the number of edges before).
This number includes edges from the bytecode loaded dynamically, as well as the calls added to represent reflection calls, and calls to "glue" methods (methods like `Integer.intValue()` used to convert objects to scalar values, or calls to `T.check_is_Xxx_xxx(Method)` used to check whether a `Method` object represents a known method).
The last column, "Added Reflection", counts the non-glue method calls found in the call graph of the instrumented application but neither in the call graph of the original #APK nor in the call graphs of the added bytecode files, which we computed separately.
These correspond to the calls we added to represent reflection calls.
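To make the transformation concrete, here is a minimal Java sketch of the rewrite pattern described above; the names `Config`, `getValue` and `check_is_Config_getValue` are ours, chosen for illustration:

```java
import java.lang.reflect.Method;

class Config {
    int getValue(int key) { return key * 2; }
}

class T {
    // Glue check: does this Method object represent Config.getValue(int)?
    static boolean check_is_Config_getValue(Method m) {
        return m.getDeclaringClass() == Config.class && m.getName().equals("getValue");
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        Config obj = new Config();
        Method method = Config.class.getMethod("getValue", int.class);

        // Before instrumentation: the call target is invisible to static analysis.
        Object before = method.invoke(obj, 21);

        // After instrumentation (sketch): a guarded direct call makes the edge
        // Demo.main -> Config.getValue explicit in the call graph; the
        // Integer.valueOf / intValue conversions are the "glue" methods.
        Object after;
        if (T.check_is_Config_getValue(method)) {
            after = Integer.valueOf(obj.getValue(((Integer) 21).intValue()));
        } else {
            after = method.invoke(obj, 21); // fall back to the reflective call
        }
        System.out.println(before + " " + after);
    }
}
```

The guarded direct call is what creates the new, analysable edge, while the fallback branch keeps the original semantics when the check fails.
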
The first application, #lower(compared_callgraph.at(0).sha256), stands out.
The instrumented #APK has ten times more edges in its call graph than the original, and only one reflection call.
This is consistent with the behaviour of a packer: the application loads the bulk of its code at runtime and switches from the bootstrap code to the loaded code with a single reflection call.
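A packer bootstrap following this pattern can be sketched in a few lines of Java (the class and method names are hypothetical, not recovered from the application):

```java
import android.content.Context;
import dalvik.system.DexClassLoader;
import java.lang.reflect.Method;

// Illustrative packer bootstrap, not code from the analysed application.
class Bootstrap {
    static void start(Context ctx, String payloadDexPath) throws Exception {
        // Load the real application code at runtime with a new class loader.
        DexClassLoader loader = new DexClassLoader(
                payloadDexPath, ctx.getCodeCacheDir().getPath(),
                null, ctx.getClassLoader());
        // A single reflective call hands control to the loaded code: the one
        // "Added Reflection" edge observed for this application.
        Class<?> entry = loader.loadClass("com.example.RealMain"); // hypothetical
        Method run = entry.getMethod("run", Context.class);
        run.invoke(null, ctx);
    }
}
```
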
#figure({
  let nb_col = 5
  table(
    columns: (2fr, 1fr, 1fr, 1fr, 2fr),
    align: center + horizon,
    stroke: none,
    table.hline(),
    table.header(
      table.cell(rowspan: 2)[#APK SHA 256],
      table.cell(colspan: nb_col - 1)[Number of Call Graph edges],
      [Before], [After], [Diff], [Added Reflection],
    ),
    table.hline(),
    ..compared_callgraph.map(
      (e) => (
        [#lower(e.sha256).slice(0, 10)...],
        text(fill: luma(75), num(e.edges_before)),
        text(fill: luma(75), num(e.edges_after)),
        num(e.added),
        num(e.added_ref_only)
      )).flatten(),
    [#lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE").slice(0, 10)...],
    table.cell(colspan: nb_col - 1)[_Instrumentation Crashed_],
    table.hline(),
  )},
  caption: [Edges added to the call graphs computed by Androguard by instrumenting the applications]
) <tab:th-compare-cg>
Unfortunately, our implementation of the transformation is imperfect and sometimes fails, as illustrated by #lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE") in @tab:th-compare-cg.
However, of the #num(dyn_res.all.nb - dyn_res.all.nb_failed) applications whose dynamic analysis finished in our experiment, #num(nb_patched) were patched.
The remaining #mypercent(dyn_res.all.nb - dyn_res.all.nb_failed - nb_patched, dyn_res.all.nb - dyn_res.all.nb_failed) failed because of some quirk in the zip format of the #APK file, because of a bug in our implementation when exceeding the method reference limit in a single #DEX file, or, in the case of #lower("5D2CD1D10ABE9B1E8D93C4C339A6B4E3D75895DE1FC49E248248B5F0B05EF1CE"), because the application reused the original application class loader to load new code instead of instantiating a new one (a behavior we did not expect, as it is not possible using only the #SDK, but it is enabled by hidden #APIs).
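For reference, the sketch below shows one known way to reuse an existing class loader; `pathList` and `addDexPath` are hidden internals whose exact shape varies across Android versions, so this is illustrative rather than portable:

```java
import dalvik.system.BaseDexClassLoader;
import java.io.File;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Sketch only: BaseDexClassLoader.pathList and DexPathList.addDexPath are
// hidden (non-SDK) members, and their exact shape varies across versions.
class LoaderInjection {
    static void inject(ClassLoader appLoader, String dexPath) throws Exception {
        Field f = BaseDexClassLoader.class.getDeclaredField("pathList");
        f.setAccessible(true);
        Object pathList = f.get(appLoader); // internal DexPathList instance
        Method addDexPath = pathList.getClass()
                .getDeclaredMethod("addDexPath", String.class, File.class);
        addDexPath.setAccessible(true);
        // The new classes become visible through the *existing* loader, so no
        // new class-loader object is created for the instrumentation to hook.
        addDexPath.invoke(pathList, dexPath, null);
    }
}
```
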
Taking into account the failures from both the dynamic analysis and the instrumentation process, we have a #mypercent(dyn_res.all.nb - nb_patched, dyn_res.all.nb) failure rate.
This is a reasonable failure rate, but we should keep in mind that it adds to the failure rates of the other tools we want to use on the patched applications.

To check the impact of our instrumentation on the finishing rate, we then ran the same experiment as in @sec:rasta.
We ran the tools on the #APKs before and after instrumentation, and compared the finishing rates in @fig:th-status-npatched-vs-patched (without taking into account the #APKs we failed to patch#footnote[Due to a handling error during the experiment, the figure shows the results for #nb_patched_rasta #APKs instead of #nb_patched.]).

The finishing rate comparison is shown in @fig:th-status-npatched-vs-patched.
We can see that in most cases, the finishing rate is either the same or slightly lower for the instrumented application.
This is consistent with the fact that we add more bytecode to the application, hence adding more opportunities for failure during analysis.
There are two notable exceptions: Saaf and IC3.
The finishing rate of IC3, which was previously reasonable, drops to 0 after our instrumentation, while the finishing rate of Saaf jumps to 100%, which is extremely suspicious.
Analysing the logs of the analyses showed that both cases have the same origin: the bytecode generated by our instrumentation has a version number of 37 (the version introduced by Android 7.0).
Unfortunately, neither the version of Apktool used by Saaf nor Dare (the tool used by IC3 to convert Dalvik bytecode to Java bytecode) recognizes this bytecode version, and both thus fail to parse the #APK.
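This is easy to check by hand: a #DEX file starts with the magic `dex\n` followed by a three-digit ASCII version string. A minimal checker (our own, for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal illustration: print the version string from a DEX header, whose
// first 8 bytes are the magic "dex\n" plus "0NN\0" (here "037\0").
public class DexVersion {
    public static void main(String[] args) throws Exception {
        byte[] header = Files.readAllBytes(Paths.get(args[0]));
        String magic = new String(header, 0, 8, StandardCharsets.US_ASCII);
        System.out.println("dex version: " + magic.substring(4, 7)); // e.g. 037
    }
}
```
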
In the case of Dare and IC3, our experiment correctly identifies this as a crash.
On the other hand, Saaf does not detect the issue with Apktool and pursues the analysis with no bytecode to analyse, returning a valid result file, but for an empty application.
#todo[alt text @fig:th-status-npatched-vs-patched]
#figure({
  image(
    "figs/comparision-of-exit-status.svg",
@@ -134,35 +188,11 @@ We run the tools on the #APK before and after patching, and compared the finishi
#todo[Flowdroid results are inconclusive: some apks have more leak after and as many apks have less? also, runing flowdroid on the same apk can return a different number of leak???]
=== Example
In this subsection, we use our approach on a small #APK to look in more detail into the analysis of the transformed application.
We handcrafted this application for the purpose of demonstrating how this can help a reverse engineer in their work.
Accordingly, this application is quite small and contains both dynamic code loading and reflection.
We defined the methods `Utils.source()` and `Utils.sink()` to model respectively a method that collects sensitive data and one that exfiltrates data.
Those methods are the ones we will use with Flowdroid to track data flows.
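As an indication of what the handcrafted application looks like (a simplified sketch, not its exact code), consider a flow from the source to the sink that only exists through a reflective call:

```java
import java.lang.reflect.Method;

// Sketch of the kind of handcrafted flow described here, not the exact app:
// the path from source() to sink() only exists through a reflective call,
// so Flowdroid can track it only once the call is made explicit.
class Utils {
    static String source() { return "secret"; }                  // models sensitive data
    static void sink(String data) { System.out.println(data); }  // models exfiltration
}

public class Main {
    public static void main(String[] args) throws Exception {
        String secret = Utils.source();
        // Without instrumentation, static analysis only sees Method.invoke()
        // and loses the data flow to the sink.
        Method sink = Utils.class.getDeclaredMethod("sink", String.class);
        sink.invoke(null, secret);
    }
}
```

After instrumentation, the `sink.invoke(null, secret)` call is accompanied by an explicit call to `Utils.sink(String)`, which is what allows Flowdroid to connect the source to the sink.
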
@@ -233,6 +263,9 @@ Although self explanatory, verifying the code of those methods indeed confirm th
caption: [Code of `Main.main()` as shown by Jadx, after patching],
)<lst:th-demo-after>
#todo[alt text for @fig:th-cg-before and @fig:th-cg-after]
#todo[comment @fig:th-cg-before and @fig:th-cg-after]
#todo[Conclude and transition]
#figure(
  render(
    read("figs/demo_main_main.dot"),