
#import "@preview/diagraph:0.3.5": raw-render
#import "../lib.typ": todo, SDK, API, ART, DEX, APK, JAR, ADB, jfl-note

== Collecting Runtime Information <sec:th-dyn>

@fig:th-process shows the general idea of our process.
To perform the transformations described in @sec:th-trans, we need information such as the name and signature of the methods called through reflection, or the actual bytecode loaded dynamically.
We decided to collect this information through dynamic analysis.
We saw in @sec:bg several contributions that collect this kind of information.
In the end, we decided to keep the analysis as simple as possible: we avoided using a custom Android build like DexHunter, and instead used Frida (see @sec:bg-frida) to instrument the application and intercept calls to the methods of interest.
@sec:th-fr-dcl presents our approach to collect dynamically loaded bytecode, and @sec:th-fr-ref presents our approach to collect the reflection data.
Because dynamic analysis raises the concern of coverage, we also need some interaction with the application during the analysis.
Ideally, a reverse engineer would perform this interaction.
Because we wanted to analyse many applications in a reasonable time, we replaced this engineer with an automated runner that simulates the interactions.
We discuss this option in @sec:th-grod.

#figure(
  raw-render(
    ```
    digraph {
      rankdir=LR

      splines="ortho"


      APK [shape=parallelogram]
      "Automated Runner"
      "Reverse Engineer"
      "Dynamic Analysis" [shape=box]
      "Runtime Information" [shape=parallelogram]
      Transformation [shape=box]
      "APK'" [shape=parallelogram]

      APK:c -> "Dynamic Analysis"
      "Automated Runner" -> "Dynamic Analysis" [style="dashed"]
      "Reverse Engineer" -> "Dynamic Analysis" [style="dashed"]
      "Dynamic Analysis" -> "Runtime Information"
      APK -> Transformation
      "Runtime Information" -> Transformation
      Transformation -> "APK'"
    }
    ```,
    width: 100%,
    alt: (
      "A diagram showing the process to transform an application.",
      "Dashed arrows go from \"Automated Runner\" and from \"Reverse Engineer\" to a box labeled \"Dynamic Analysis\", and a plain arrow goes from \"APK\" to \"Dynamic Analysis\".",
      "An arrow goes from \"Dynamic Analysis\" to \"Runtime Information\", then from \"Runtime Information\" to a box labeled \"Transformation\".",
      "Another arrow goes from \"APK\" to \"Transformation\".",
      "Finally, an arrow goes from \"Transformation\" to \"APK'\"."
    ).join(" "),
  ),
  caption: [Process to add runtime information to an #APK],
) <fig:th-process>

=== Collecting Dynamically Loaded Bytecode <sec:th-fr-dcl>

Initially, we considered instrumenting the constructors of the classloaders of the Android #SDK.
However, this is a significant number of methods to instrument and, looking at older applications, we realized that we had missed the `DexFile` class.
`DexFile` is a now-deprecated but still usable class that can be used to load bytecode dynamically.
We initially missed this class because it is neither a `ClassLoader` class nor an #SDK class (anymore).
To avoid this kind of oversight, we decided to look at the #ART source code and list all the places where the internal functions used to parse bytecode are called.
We found that all those calls originate from either `DexFile.openInMemoryDexFilesNative(..)` or `DexFile.openDexFileNative(..)`, two hidden #API methods.
For reference, in 2015, DexHunter~@zhang2015dexhunter had already identified `DexFile.openDexFileNative(..)` (although in the end DexHunter instruments another function, `DefineClass(..)`).
`DefineClass(..)` is still a good function to instrument, but it is a native C++ function without a Java interface, which makes it harder to work with using Frida, and we want to avoid patching the #ART source code like DexHunter did.
For this reason, we decided to hook `DexFile.openInMemoryDexFilesNative(..)` and `DexFile.openDexFileNative(..)` instead.
These methods take as arguments a list of Android code files, either as in-memory byte arrays or as file paths, and a reference to the classloader associated with the code.
Instrumenting these methods allows us to collect all the code files loaded by the #ART and associate them with their classloaders.
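
As an illustration of how the collected code files can be associated with their classloaders, the following Java sketch stores each file under its SHA-256 digest together with the identifiers of the classloaders that loaded it.
It is only a sketch under assumptions: the `LoadedCodeLog` class, the `record` callback and the string identifiers for classloaders are hypothetical names, and a file passed by path would first have to be read into a byte array.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class LoadedCodeLog {
    // Digest of each collected code file -> identifiers of the classloaders that loaded it.
    final Map<String, Set<String>> loadersByDigest = new LinkedHashMap<>();

    // Hexadecimal SHA-256 digest, used to recognise a file that is loaded several times.
    static String sha256(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical callback, called once per code file observed by a hook.
    void record(byte[] code, String classLoaderId) {
        loadersByDigest
            .computeIfAbsent(sha256(code), digest -> new LinkedHashSet<>())
            .add(classLoaderId);
    }

    public static void main(String[] args) {
        LoadedCodeLog log = new LoadedCodeLog();
        byte[] payload = {0x64, 0x65, 0x78, 0x0a}; // placeholder bytes, not a valid code file
        log.record(payload, "loader-1");
        log.record(payload, "loader-2");
        log.loadersByDigest.forEach((digest, loaders) -> System.out.println(digest + " " + loaders));
    }
}
```

Keying the entries by digest keeps a single copy of a file that is loaded by several classloaders, while still remembering every classloader it was attached to.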

=== Collecting Reflection Data <sec:th-fr-ref>

As described in @sec:th-trans-ref, there are three methods that we need to instrument to capture reflection calls: `Class.newInstance()`, `Constructor.newInstance(..)` and `Method.invoke(..)`.
Because Java supports method overloading, we need not only the method name and defining class, but also the whole signature of the method.
In addition, in case there are several classes with the same name as the defining class, we also need the classloader of the defining class to distinguish it from the other classes.
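
To illustrate why the method name alone is not enough, the following Java sketch lists two methods of one class that share a name and prints what is needed to tell them apart.
It runs on a desktop JVM and is only an illustration: the `Greeter` class and the helper names are hypothetical, and the smali-like output format is an assumption rather than the format of any particular tool.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReflectionTarget {
    // Hypothetical class with two methods sharing the name "greet".
    static class Greeter {
        public String greet(String name) { return "hello " + name; }
        public String greet(int times) { return "hello x" + times; }
    }

    // Type descriptor as found in class and DEX files, e.g. int -> I, int[] -> [I.
    static String descriptor(Class<?> type) {
        if (type.isArray()) return "[" + descriptor(type.getComponentType());
        if (type == void.class) return "V";
        if (type == boolean.class) return "Z";
        if (type == byte.class) return "B";
        if (type == char.class) return "C";
        if (type == short.class) return "S";
        if (type == int.class) return "I";
        if (type == long.class) return "J";
        if (type == float.class) return "F";
        if (type == double.class) return "D";
        return "L" + type.getName().replace('.', '/') + ";";
    }

    // Defining class, name and full signature of a method.
    static String describe(Method m) {
        StringBuilder sb = new StringBuilder();
        sb.append(descriptor(m.getDeclaringClass())).append("->").append(m.getName()).append('(');
        for (Class<?> parameter : m.getParameterTypes()) sb.append(descriptor(parameter));
        return sb.append(')').append(descriptor(m.getReturnType())).toString();
    }

    // All methods declared by cls that are called name, sorted for a stable output.
    static List<String> overloads(Class<?> cls, String name) {
        List<String> found = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals(name)) found.add(describe(m));
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) {
        for (String signature : overloads(Greeter.class, "greet")) {
            System.out.println(signature);
        }
        // The classloader is the last element needed to identify the defining class.
        System.out.println(Greeter.class.getClassLoader());
    }
}
```

Both entries share the defining class and the name `greet`, and differ only by their parameter types, which is why the full signature has to be recorded.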

_Where_ the reflection method is called is more difficult to determine.
In order to correctly modify the application, we need to know which specific call to a reflection method we intercepted.
Specifically, we need to know the caller method (once again, we need the method name, full signature, defining class and its classloader), and the specific instruction that called the reflection method (in case the caller method calls a reflection method several times).
This information is more difficult to collect than one would expect.
It is stored on the stack, but before #SDK 34, the stack was not directly accessible programmatically.
Historically, when reverse engineers needed to access the stack, they would throw and catch an exception, then read the stack trace from the exception.
The issue with this approach is that the data stored in the exception is meant for debugging.
In particular, the location of the call in the bytecode has a different meaning depending on the debug information encoded in the bytecode.
It can either be the address of the bytecode instruction invoking the callee method in the instruction array of the caller method, or the line number in the original source code of the call to the callee method.
Fortunately, in #SDK 34, Android introduced the `StackWalker` #API.
This #API makes it possible to walk the current stack programmatically and retrieve information from it, including the bytecode address of the instruction calling the callee method.
Considering that the line number is not reliable information, we chose to use the new #API, despite the restrictions that come with choosing such a recent Android version (it was released in October 2023, around two years ago, and less than 50% of the current Android market share supports this #API today#footnote[https://gs.statcounter.com/android-version-market-share/mobile-tablet/worldwide/#monthly-202401-202508]).
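
The difference between the two sources of information can be observed with the following Java sketch, which compares the location reported by an exception with the one reported by `StackWalker` for two calls written on the same source line.
It runs on a desktop JVM (Java 9 or later), where `StackWalker.StackFrame.getByteCodeIndex()` is available; we assume here that the behaviour on Android is analogous, and the class and method names are hypothetical.

```java
class CallSiteDemo {
    // Caller location taken from an exception: a source line number
    // (negative when the class carries no line number information).
    static int lineFromException() {
        StackTraceElement[] stack = new Throwable().getStackTrace();
        return stack[1].getLineNumber(); // stack[0] is this method, stack[1] its caller
    }

    // Caller location taken from StackWalker: index of the call instruction
    // in the bytecode array of the caller.
    static int indexFromStackWalker() {
        return StackWalker.getInstance()
            .walk(frames -> frames.skip(1).findFirst()) // skip the frame of this method
            .map(StackWalker.StackFrame::getByteCodeIndex)
            .orElse(-1);
    }

    // Both calls sit on the same source line on purpose.
    static int[] linesOfTwoCalls() { return new int[] { lineFromException(), lineFromException() }; }

    // Same layout, but the location is read through StackWalker.
    static int[] indicesOfTwoCalls() { return new int[] { indexFromStackWalker(), indexFromStackWalker() }; }

    public static void main(String[] args) {
        int[] lines = linesOfTwoCalls();
        int[] indices = indicesOfTwoCalls();
        System.out.println("line numbers:     " + lines[0] + " and " + lines[1]);
        System.out.println("bytecode indices: " + indices[0] + " and " + indices[1]);
    }
}
```

The two line numbers are identical, so they cannot tell the two call sites apart, whereas each call has its own bytecode index.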

=== Application Execution <sec:th-grod>

Dynamic analysis requires actually running the application.
To test multiple applications automatically, we needed to simulate human interactions with the applications.
In @sec:bg we presented a few solutions to explore an application dynamically.
We first eliminated Sapienz, as it relies on an application instrumentation library called ELLA that has not been updated in nine years.
We also chose to avoid the Monkey because we noticed that it often triggers events that close the application (events like pressing the 'home' button, or opening the general settings drop-down menu at the top of the screen).
Stoat and GroddDroid use UI Automator to interact with the application.
UI Automator is a standard Android #API intended for automated testing.
Both Stoat and GroddDroid perform additional analysis on the application to improve the exploration.
In the end, we elected to use the most basic execution mode of GroddDroid, which does not need this additional analysis.
It explores the application following a depth-first search algorithm.
We chose this option to keep the exploration lightweight and limit the chance of crashing the analysis (we saw in @sec:rasta the issues brought by complex analyses).
It might be interesting in future work to investigate more complex exploration techniques.
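
To make the exploration order concrete, the following Java sketch runs a depth-first search on an abstract model of an application, where each screen lists the screens reachable from it.
It only illustrates the traversal order: the screen names and the `DepthFirstExplorer` class are hypothetical, and the sketch does not reproduce how GroddDroid drives the actual user interface through UI Automator.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DepthFirstExplorer {
    // Returns the screens in the order a depth-first exploration visits them.
    static List<String> explore(Map<String, List<String>> transitions, String start) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String screen = stack.pop();
            if (!seen.add(screen)) continue; // already explored
            visited.add(screen);
            List<String> next = transitions.getOrDefault(screen, List.of());
            // Push in reverse order so that the first transition is followed first.
            for (int i = next.size() - 1; i >= 0; i--) stack.push(next.get(i));
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical application: each screen maps to the screens its widgets lead to.
        Map<String, List<String>> transitions = Map.of(
            "main", List.of("settings", "list"),
            "settings", List.of("about"),
            "list", List.of("detail", "main"));
        System.out.println(explore(transitions, "main"));
    }
}
```

In this model, the exploration goes as deep as possible from `settings` before coming back to `list`, and the transition back to `main` is ignored because that screen was already visited.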

Because we are using Frida, we do not need a custom version of Android with a modified #ART or kernel.
However, we decided not to inject Frida into the original application.
This means that we need root access to run Frida directly on the Android system, which standard Android devices do not provide.
Because dynamic analysis can be slow, we also decided to run the applications on emulators.
This makes it easier to run several analyses in parallel.
The alternative would have been to run the applications on actual smartphones, which would have required multiple phones to run the analyses in parallel.
For simplicity, we chose to use Google's Android emulator for our experiments.
We spawned multiple emulators, installed Frida on each of them, and took a snapshot of each emulator before installing the application to analyse.
We then ran the application for five minutes with GroddRunner and, at the end of the analysis, reloaded the snapshot in case the application had modified the system in some unforeseen way.
If at some point the emulator stopped responding for too long, we terminated and restarted it.
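
The procedure above can be summarised by the following Java sketch of the loop run for each emulator.
It is a simplified illustration under assumptions: the `Emulator` interface and all its method names are hypothetical abstractions, not the interface of the Android emulator, of Frida or of GroddRunner, and `LoggingEmulator` is a stand-in that only logs the calls.

```java
import java.util.ArrayList;
import java.util.List;

class BatchRunner {
    // Hypothetical abstraction over one emulator; none of these names come from a real tool.
    interface Emulator {
        void loadSnapshot();                          // restore the clean state saved before any installation
        void install(String apk);                     // install the application to analyse
        boolean exploreFor(String apk, int seconds);  // false if the emulator stopped responding
        void restart();                               // terminate and respawn the emulator
    }

    // Analyses each application in turn and returns those whose run was interrupted.
    static List<String> analyseAll(Emulator emulator, List<String> apks, int seconds) {
        List<String> interrupted = new ArrayList<>();
        for (String apk : apks) {
            emulator.install(apk);
            if (!emulator.exploreFor(apk, seconds)) {
                emulator.restart();
                interrupted.add(apk);
            }
            // Always go back to the clean snapshot before the next application.
            emulator.loadSnapshot();
        }
        return interrupted;
    }

    // Stand-in used to try the loop without a real emulator: it only logs the calls.
    static class LoggingEmulator implements Emulator {
        final List<String> log = new ArrayList<>();
        public void loadSnapshot() { log.add("load-snapshot"); }
        public void install(String apk) { log.add("install " + apk); }
        public boolean exploreFor(String apk, int seconds) {
            log.add("explore " + apk);
            return !apk.startsWith("hang"); // pretend that these applications freeze the emulator
        }
        public void restart() { log.add("restart"); }
    }

    public static void main(String[] args) {
        LoggingEmulator emulator = new LoggingEmulator();
        List<String> interrupted = analyseAll(emulator, List.of("a.apk", "hang.apk"), 300);
        emulator.log.forEach(System.out::println);
        System.out.println("interrupted: " + interrupted);
    }
}
```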

#todo[Droid donjon, say that we are at level -1 of anti-evasion]
As we will see in @sec:th-res #todo[give the right subsection], our experimental setup is quite naive and still requires improvement. #todo(strike(stroke: green)[How do we properly say that it is all broken?])
Nonetheless, the benefit of our implementation is that it only requires an #ADB connection to a phone with a rooted Android system to work.
Of course, to analyse a specific application, a reverse engineer could use an actual smartphone and explore the application manually.
It would be a lot more stable than our automated batch analysis setup.

#todo[Future work: Droiddonjon-like approach, improved GroddDroid exploration, potentially forced execution with Frida]