
#import "../lib.typ": todo, SDK, API, ART, DEX, APK, JAR, ADB, jfl-note, APKs, midskip

== Collecting Runtime Information <sec:th-dyn>

To perform the transformations described in @sec:th-trans, we need information like the name and signature of the method called with reflection, or the actual bytecode loaded dynamically.
We decided to collect that information through dynamic analysis.
We saw in @sec:bg different contributions that collect this kind of information.
In the end, we decided to keep the analysis as simple as possible, so we avoided using a custom Android build like DexHunter and instead used Frida to instrument the application and intercept calls to the methods of interest.
@sec:th-fr-dcl presents our approach to collect dynamically loaded bytecode, and @sec:th-fr-ref presents our approach to collect the reflection data.
Because using dynamic analysis raises the concern of coverage, we also need some interaction with the graphical user interface of the application during the analysis.
Ideally, a reverse engineer would do the interaction.
Because we wanted to analyse many applications in a reasonable time, we replaced this engineer with an automated runner that simulates the interactions.
We discuss this option in @sec:th-grod.

=== Collecting the Dynamically Loaded Bytecode <sec:th-fr-dcl>

Initially, we considered instrumenting the constructor methods of the class loaders of the Android #SDK.
However, this is a significant number of methods to instrument, and looking at older applications, we realised that we missed the `DexFile` class.
`DexFile` is a now deprecated, but still usable, class that can be used to load bytecode dynamically.
We initially missed this class because it is neither a `ClassLoader` class nor an #SDK class (anymore).
To avoid running into this kind of oversight again, we decided to look at the #ART source code and list all the places where the internal functions used to parse bytecode are called.
We found that all those calls are reached from either `DexFile.openInMemoryDexFilesNative(..)` or `DexFile.openDexFileNative(..)`, two hidden #API methods.
As a reference, in 2015, DexHunter~@zhang2015dexhunter already noticed `DexFile.openDexFileNative(..)` (although in the end DexHunter instruments another function, `DefineClass(..)`).
`DefineClass(..)` is still a good function to instrument, but it is a C++ native method that does not have a Java interface, making it harder to work with using Frida, and we want to avoid patching the source code of #ART like DexHunter did.
For this reason, we decided to hook `DexFile.openInMemoryDexFilesNative(..)` and `DexFile.openDexFileNative(..)` instead.
Those methods take a list of Android code files as argument, either in the form of in-memory byte arrays or file paths, and a reference to the classloader associated with the code.
Instrumenting those methods allows us to collect all the code files loaded by the #ART and associate them with their class loaders.
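
To illustrate the bookkeeping this implies on the host side, the following minimal sketch stores each collected code file once, keyed by its SHA-256 digest, and records which class loader it was loaded with. The names `CollectedCode` and `record_load`, as well as the class loader identifiers, are hypothetical and chosen only for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CollectedCode:
    """Host-side store for code files reported by the hooks (hypothetical)."""

    # SHA-256 digest -> raw bytes of the code file
    files: dict = field(default_factory=dict)
    # class loader identifier -> ordered list of digests loaded with it
    by_loader: dict = field(default_factory=dict)

    def record_load(self, loader_id, blobs):
        """Register the code files of one load event and return their digests."""
        digests = []
        for blob in blobs:
            digest = hashlib.sha256(blob).hexdigest()
            # Identical files loaded several times are stored only once.
            self.files.setdefault(digest, blob)
            digests.append(digest)
        self.by_loader.setdefault(loader_id, []).extend(digests)
        return digests
```

Keying the files by digest is a design choice of the sketch: it keeps the association between class loaders and code files even when the same file is loaded by several class loaders.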

=== Collecting Reflection Data <sec:th-fr-ref>

As described in @sec:th-trans-ref, there are three methods that we need to instrument to capture reflection calls: `Class.newInstance()`, `Constructor.newInstance(..)` and `Method.invoke(..)`.
Because Java supports method overloading, we need not only the method name and defining class, but also the full signature of the method.
In addition, in case several classes share the name of the defining class, we also need the class loader of the defining class to distinguish it from the others.
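
To make this concrete, the sketch below models the identity of a method as the combination of class loader, defining class, name, and descriptor. The class `MethodId`, the loader identifiers, and the example class `com.example.Greeter` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodId:
    loader_id: str   # identifies the class loader of the defining class
    class_name: str  # defining class
    name: str        # method name
    descriptor: str  # parameter and return types, in JVM descriptor notation


# Two overloads: same class and name, different descriptors.
greet_int = MethodId("loader-1", "com.example.Greeter", "greet", "(I)V")
greet_str = MethodId(
    "loader-1", "com.example.Greeter", "greet", "(Ljava/lang/String;)V"
)
# Same class name and signature, but defined by another class loader.
greet_other = MethodId("loader-2", "com.example.Greeter", "greet", "(I)V")

methods = (greet_int, greet_str, greet_other)
# A key made of the class name and method name only merges all three methods.
short_keys = {(m.class_name, m.name) for m in methods}
# The full identity keeps them apart.
full_keys = set(methods)
```

Here `short_keys` contains a single entry while `full_keys` contains three, which is why the shorter key is not enough to select the method to call.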

_Where_ the reflection method is called is more difficult to find.
In order to correctly modify the application, we need to know which specific call to a reflection method we intercepted.
Specifically, we need the caller method (once again, we need the method name, full signature, defining class and its classloader), and the exact instruction that called the reflection method (in case the caller method uses reflection several times in different sites).
This information is more difficult to collect than one would expect.
It is stored on the call stack, but before #SDK 34, the stack was not directly accessible programmatically.
Historically, when a reverse engineer needed to access the stack, they would trigger and catch an exception and get the stack from that exception.
The issue with this approach is that data stored in exceptions is meant for debugging.
In particular, the location of the call in the bytecode has a different meaning depending on the debug information encoded in the bytecode.
It can either be the address of the bytecode instruction invoking the callee method in the instruction array of the caller method, or the line number of the original source code that calls the callee method.
Fortunately, in #SDK 34, Android introduced the `StackWalker` #API.
This #API makes it possible to programmatically walk the current stack and retrieve information from it, including the bytecode address of the instruction calling the callee methods.
Considering that the line number is not reliable information, we chose to use the new #API, despite the restrictions that come with choosing such a recent Android version (it was released in October 2023, around 2 years ago, and less than 50% of the current Android market share supports this #API today#footnote[https://gs.statcounter.com/android-version-market-share/mobile-tablet/worldwide/#monthly-202401-202508]).
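
The following sketch illustrates why the line number is not sufficient to identify a call site. It assumes two hypothetical reflection calls written on the same source line of one caller method; the method names, addresses, and line number are invented for the example.

```python
from collections import defaultdict

# Each observation: (caller, bytecode address of the call, source line, callee).
# All values are hypothetical; both calls sit on the same source line.
observations = [
    ("Lcom/example/Main;->run()V", 4, 10, "Lcom/example/A;->f()V"),
    ("Lcom/example/Main;->run()V", 12, 10, "Lcom/example/B;->g()V"),
]

by_address = defaultdict(set)
by_line = defaultdict(set)
for caller, address, line, callee in observations:
    by_address[(caller, address)].add(callee)
    by_line[(caller, line)].add(callee)

# Keyed by bytecode address, each call site keeps its own observed target.
# Keyed by line number, the two call sites are merged into a single entry.
```

With the address as key, each of the two call sites is associated with exactly one target, whereas the line number groups both targets under the same key and no longer tells which instruction called which method.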

=== Application Execution <sec:th-grod>

Dynamic analysis requires actually running the application.
In order to test multiple applications automatically, we needed to simulate human interactions with the applications.
In @sec:bg, we presented a few solutions to explore an application dynamically.
We first eliminated Sapienz~@mao_sapienz_2016, as it relies on an application instrumentation library called ELLA, which has not been updated for 9 years.
We also chose to avoid the Monkey because we noticed that it often triggers events that close the application (events like pressing the 'home' button, or opening the general settings drop-down menu at the top of the screen).
Stoat~@su_guided_2017 and GroddDroid~@abraham_grodddroid_2015 use UI Automator to interact with the application.
UI Automator is a standard Android #API intended for automatic testing.
Both Stoat and GroddDroid perform additional analysis on the application to improve the exploration.
In the end, we elected to use the most basic execution mode of GroddDroid that does not need this additional analysis.
It explores the application following a depth-first search algorithm.
We chose this option to keep the exploration lightweight and limit the chance of crashing the analysis (we saw in @sec:rasta the issues brought by complex analysis).
It might be interesting in future work to explore more advanced exploration techniques.
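
As a rough illustration of depth-first exploration, and not of GroddDroid's actual code, the sketch below walks an abstract graph of screens. The `explore` function, the screen names, and the widget names are hypothetical, and the graph is given up front, whereas a real runner discovers it while interacting with the application.

```python
def explore(start, transitions):
    """Depth-first exploration of screens; returns them in visit order.

    transitions maps a screen to the list of (widget, next screen) pairs
    reachable from it. Both are hypothetical stand-ins for what a real
    runner would discover through UI Automator.
    """
    visited = []
    seen = set()

    def visit(screen):
        seen.add(screen)
        visited.append(screen)
        for _widget, target in transitions.get(screen, []):
            if target not in seen:
                visit(target)  # go deeper before trying the next widget

    visit(start)
    return visited


ui = {
    "main": [("settings_button", "settings"), ("help_button", "help")],
    "settings": [("about_button", "about"), ("back", "main")],
    "help": [("back", "main")],
}
order = explore("main", ui)
```

On this small graph, the exploration reaches the `about` screen through `settings` before it comes back to try the `help` button of the main screen.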

Because we are using Frida, we do not need to use a custom version of Android with a modified #ART or kernel.
However, we decided not to inject Frida into the application itself.
Running Frida directly on Android therefore requires root access, which is not normally available on an Android device.
Because dynamic analysis can be slow, we also decided to run the applications on emulators.
This makes it easier to run several analyses in parallel.
The alternative would have been to run the applications on actual smartphones, which would have required multiple phones to run the analyses in parallel.
For simplicity, we chose to use Google's Android emulator for our experiment.
We spawn multiple emulators, install Frida on them, and take a snapshot of each emulator before installing the application to analyse.
We then run the application for five minutes with GroddRunner and, at the end of the analysis, reload the snapshot in case the application modified the system in some unforeseen way.
If at some point an emulator stops responding for too long, we terminate it and restart it.
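
The cycle applied to each application can be sketched as follows. This is a simplified illustration under stated assumptions, not our actual tooling: the function `analyse`, the snapshot name `clean`, and the `explore` and `run` callbacks are hypothetical, and only the `adb install` command and the emulator console command `avd snapshot load` are existing commands.

```python
def analyse(serial, apk_path, explore, run, snapshot="clean", timeout_s=300):
    """Run one application on one emulator, then restore the clean snapshot.

    run executes a command given as a list of strings (for instance
    subprocess.run), and explore drives the application for at most
    timeout_s seconds. Both are injected (hypothetical helpers), so the
    sketch can be exercised without an emulator.
    """
    adb = ["adb", "-s", serial]
    try:
        run(adb + ["install", apk_path])
        explore(serial, timeout_s)
    finally:
        # Reload the snapshot even if installation or exploration failed.
        run(adb + ["emu", "avd", "snapshot", "load", snapshot])
```

Restoring the snapshot in a `finally` clause is a choice of the sketch: it ensures that the next application starts from the same emulator state regardless of how the previous analysis ended.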

As we will see in @sec:th-dyn-failure, our experimental setup is quite naive and still requires improvement. #todo(strike(stroke: green)[How do we properly say that it is all broken?])
For example, we do not implement any anti-evasion techniques, which can be a significant issue when analysing malware.
Nonetheless, the benefit of our implementation is that it only requires an #ADB connection to a phone with a rooted Android system to work.
Of course, to analyse a specific application, a reverse engineer could use an actual smartphone and explore the application manually.
It would be a lot more stable than our automated batch analysis setup.

#midskip

Now that we have seen both the dynamic analysis setup and the transformation we want to perform on the #APKs, we can put our proposed approach into practice.
In the next section, we will run our dynamic analysis on #APKs and study the data collected, as well as the impact the instrumentation has on applications and different analysis tools.