wip

2025-07-17 21:34:01 +02:00 · 2025-07-17 21:34:01 +02:00 · a1a5794250
commit a1a5794250
parent e6c8b0ee6c
2 changed files with 62 additions and 5 deletions
--- a/5_theseus/2_dynamic_data_collection.typ
+++ b/5_theseus/2_dynamic_data_collection.typ
@@ -1,13 +1,69 @@
-#import "../lib.typ": todo
+#import "../lib.typ": todo, SDK, API, ART, DEX, APK, JAR, ADB

-== Collection Runtime Information <sec:th-dyn>
+== Collecting Runtime Information <sec:th-dyn>

 In order to perform the transformations described in @sec:th-trans, we need information like the name and signature of the method called with reflection, or the actual bytecode loaded dynamically.
 We perform those transformations specifically because this information is difficult to extract statically.
 Hence, we are using dynamic analysis to collect the runtime information we need.
+We use Frida (see @sec:bg-frida) to instrument the application and intercept calls to specific methods.

-=== Collect Bytecode
+=== Collecting Dynamically Loaded Bytecode

-=== Collect Reflection Data
+Initially, we considered instrumenting the constructors of the classloaders of the Android #SDK.
+However, this is a significant number of methods to instrument, and when looking at older applications, we realized that we had missed the `DexFile` class.
+`DexFile` is a now deprecated but still usable class that can be used to load bytecode dynamically.
+Instead of looking for all possible methods to load bytecode, we decided to look at the #ART source code and list all the places where the internal function used to parse bytecode is called.
+We found that all those calls originate from either `DexFile.openInMemoryDexFilesNative(..)` or `DexFile.openDexFileNative(..)`, two hidden #API methods.
+Those methods take as arguments a list of Android code files, either in the form of in-memory byte arrays or file paths, and a reference to the classloader associated with the code.
+The code files can have many formats, usually #DEX files, or #APK / #JAR files containing #DEX files, but they can also be in an internal format like `.aot` #todo[check, aot explain somewhere?]. #todo[cf later to explain that only #DEX / #APK / #JAR are found?]
+Instrumenting those methods allows us to collect all the #DEX files loaded by the #ART and associate them with their classloaders.
+
+=== Collecting Reflection Data
+
+As described in @sec:th-trans-ref, there are three methods that we need to instrument to capture reflection calls: `Class.newInstance()`, `Constructor.newInstance(..)` and `Method.invoke(..)`.
+Because Java supports method overloading (several methods can share a name and differ only in their signature), we need not only the method name and defining class, but also the full signature of the method.
+In addition, in case there are several classes with the same name as the defining class, we also need the classloader of the defining class to distinguish it from the other classes.
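To illustrate why the name alone is not enough, here is a minimal sketch on a desktop JVM (not the Frida hook itself). The class `MethodIdentity`, its `greet` overloads and the helpers `find` and `describe` are hypothetical names introduced for this example; the identifier format imitates the smali-style `Lclass;->name(params)ret` notation and is an assumption, not necessarily the one used by our tool.

```java
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

final class MethodIdentity {
    // Two overloads: the name "greet" alone does not identify either of them.
    static String greet(String name) { return "hello " + name; }
    static String greet(String name, int times) { return "hello " + name + " x" + times; }

    // Looks up one overload of greet by its parameter types (hypothetical helper).
    static Method find(Class<?>... parameterTypes) {
        try {
            return MethodIdentity.class.getDeclaredMethod("greet", parameterTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds an unambiguous identifier: defining class, name and full descriptor.
    static String describe(Method m) {
        String owner = "L" + m.getDeclaringClass().getName().replace('.', '/') + ";";
        String signature = MethodType.methodType(m.getReturnType(), m.getParameterTypes())
                .toMethodDescriptorString();
        return owner + "->" + m.getName() + signature;
    }

    public static void main(String[] args) {
        Method one = find(String.class);
        Method two = find(String.class, int.class);
        System.out.println(describe(one));
        System.out.println(describe(two));
        // The classloader is the last component needed to identify the defining class.
        ClassLoader loader = one.getDeclaringClass().getClassLoader();
        System.out.println("loader id: " + System.identityHashCode(loader));
    }
}
```

The two printed identifiers differ only in their parameter list, which is exactly the part a name-only record would lose.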
+
+A more challenging piece of information to collect is where the reflection method is called from.
+In order to correctly modify the application, we need to know which specific call to a reflection method we intercepted.
+Specifically, we need to know the caller method (once again, we need the method name, full signature, defining class and its classloader), and the specific instruction that called the reflection method (in case the caller method calls a reflection method several times).
+This information is more difficult to collect than one would expect.
+This information is stored in the stack, but before #SDK 34, the stack was not directly accessible programmatically.
+Historically, when reverse engineers needed to access the stack, they would trigger and catch an exception, then get the stack from the exception.
+The issue with this approach is that the data stored in the exception is meant for debugging.
+In particular, the location of the call in the bytecode has a different meaning depending on the debug information encoded in the bytecode.
+It can either be the address of the bytecode instruction invoking the callee method in the instruction array of the caller method, or the line number in the original source code that calls the callee method.
+In #SDK 34, Android introduced the `StackWalker` #API.
+This #API allows one to programmatically walk the current stack and retrieve information from it, including the bytecode address of the instruction calling the callee method.
+Considering that the line number is not reliable information, we chose to use the new #API, despite the restrictions that come with choosing such a recent Android version (it was released in October 2023, around 2 years ago, and less than 50% of the current Android market share supports this #API today #todo[archive ref https://gs.statcounter.com/android-version-market-share]).
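The difference between a line number and a bytecode index can be sketched with the standard `java.lang.StackWalker` class on a desktop JVM (Java 9 or later). This is only an illustration under stated assumptions, not our Frida instrumentation: `CallSiteDemo`, `CallSite`, `target` and `callTwice` are hypothetical names, and the stack is walked from inside the invoked method instead of from a hook on `Method.invoke(..)`.

```java
import java.lang.reflect.Method;

final class CallSiteDemo {
    /** Caller identity and location of the call instruction (hypothetical holder). */
    static final class CallSite {
        final String className;
        final String methodName;
        final int bytecodeIndex;
        final int lineNumber;

        CallSite(StackWalker.StackFrame frame) {
            this.className = frame.getClassName();
            this.methodName = frame.getMethodName();
            this.bytecodeIndex = frame.getByteCodeIndex();
            this.lineNumber = frame.getLineNumber();
        }
    }

    static CallSite lastSeen;

    // Stands in for the method invoked through reflection.
    static void target() {
        lastSeen = StackWalker.getInstance().walk(frames -> frames
                .skip(1) // frame 0 is target() itself
                // Reflection frames are hidden by default; filter defensively anyway.
                .filter(f -> !f.getClassName().startsWith("java.lang.reflect.")
                        && !f.getClassName().startsWith("java.lang.invoke.")
                        && !f.getClassName().startsWith("jdk.internal."))
                .findFirst()
                .map(CallSite::new)
                .orElseThrow(IllegalStateException::new));
    }

    // Calls target() twice through reflection, deliberately on one source line.
    static CallSite[] callTwice() {
        try {
            Method m = CallSiteDemo.class.getDeclaredMethod("target");
            m.invoke(null); CallSite first = lastSeen; m.invoke(null); CallSite second = lastSeen;
            return new CallSite[] {first, second};
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (CallSite site : callTwice()) {
            System.out.println(site.className + "." + site.methodName
                    + " line=" + site.lineNumber + " bci=" + site.bytecodeIndex);
        }
    }
}
```

Both calls report the same caller method and the same line number, but two distinct bytecode indexes, so only the bytecode index identifies which `invoke` instruction was executed.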
+
+=== Application Execution
+
+Dynamic analysis requires actually running the application.
+In order to automatically test multiple applications, we needed to simulate human interactions with the applications.
+We found a few tools available #todo[ref the tools tested, maybe move to state of the art?].
+After some tests, the most suitable ones we found were the Monkey, a standard Android tool from Google that generates random events, and GroddDroid #todo[ref].
+We chose to avoid the Monkey because we noticed that it often triggers events that close the application (events like pressing the 'home' button, or opening the general settings drop-down menu at the top of the screen).
+GroddDroid has different execution modes.
+We chose to use the simplest one, which explores the application following a depth-first search algorithm.
+GroddDroid can perform more advanced explorations that target suspicious sections of the application first, but this requires heavy static analysis.
+We elected to avoid this option to keep the exploration lightweight and limit the chance of encountering a fatal issue.
+Behind the scenes, GroddDroid uses UI Automator, a standard Android API intended for automated testing, to interact with the application.
+
+Because we are using Frida, we do not need to use a custom version of Android with a modified #ART or kernel like some dynamic analysis frameworks. #todo[references]
+However, we decided not to inject Frida into the original application, so we need root access to run Frida directly on Android, which is not normally available on Android devices.
+Because dynamic analysis can be slow, we also decided to run the applications on emulators.
+This makes it easier to run several analyses in parallel.
+The alternative would have been to run the applications on actual smartphones, which would have required multiple phones to run the analyses in parallel.
+For simplicity, we chose to use Google's Android emulator for our experiment.
+We spawned multiple emulators, installed Frida on each of them, and took a snapshot of each emulator before installing the application to analyse.
+Then we ran the application for a minute #todo[check the exact value] with GroddRunner, and at the end of the analysis, we reloaded the snapshot in case the application modified the system in some unforeseen way.
+If at some point the emulator stopped responding for too long, we force-killed it and restarted it.
+
+#todo[Droid donjon, say that we are at level -1 of anti-evasion]
+As we will see in @sec:th-res #todo[give the right subsection], our experimental setup is quite naive and still requires improvement. #todo[How do we properly say that it is all broken?]
+Nonetheless, our analysis tool itself only requires an #ADB connection to a phone with a rooted Android system to work.
+To analyse a specific application, using an actual smartphone and exploring the application manually is still possible and a lot more stable than our automated batch analysis setup.
+
+#todo[Future work: Droiddonjon like, GroddDroid improved exploration, potentially do forced execution with Frida]

-=== Application Execition
--- a/5_theseus/main.typ
+++ b/5_theseus/main.typ
@@ -5,5 +5,6 @@
 #todo[theseus chapter title for @sec:th]

 #include("1_static_transformation.typ")
+#include("2_dynamic_data_collection.typ")
 #include("3_results.typ")
 #include("4_ttv.typ")