
#import "../lib.typ": APK, etal, ART, SDK, DEX, eg, pb3, pb3-text
#import "../lib.typ": todo, jm-note, jfl-note

=== Allowing Static Analysis Tools to Analyse Obfuscated Applications <sec:bg-soa-th>

#pb3-text

Dynamic analysis of Android applications has been researched for a long time.
Like static analysis, it has its own challenges, which we will explore in this subsection.
After that, we will also look at contributions that sought to encode results inside the #APK format or used instrumentation to improve analyses in some way.

==== Dynamic Analysis <sec:bg-dynamic>

Some situations, like reflection or dynamic code loading, are difficult to solve with static analysis and require a different approach: dynamic analysis.
With dynamic analysis, the application is actually executed, and the reverse engineer observes its behaviour.
This behaviour can be monitored at various levels, such as the filesystem, the screen, the process memory, or the kernel.
Depending on the chosen level of observation, the monitoring can be technically difficult.
A basic example of dynamic analysis is presented by Bernardi #etal~@bernardi_dynamic_2019: the logs generated by `strace` are used to list the system calls generated in response to an event to determine if an application is malicious or not.
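
To make this kind of log processing concrete, the sketch below counts system calls by name from `strace`-style text lines.
It is only an illustration under assumptions: the line layout (an optional PID prefix followed by `name(args) = result`) is the usual `strace` text output, the function `count_syscalls` and the sample log are made up for this example, and the sketch does not reproduce the detection method used in the cited work.

```python
import re
from collections import Counter

# Assumed layout of one strace text line: optional "[pid N] " or "N " prefix,
# then "name(args) = result". Other lines (signals, exit notices) are skipped.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Count how many times each system call name appears in strace output."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt, written by hand for this example.
sample_log = [
    'openat(AT_FDCWD, "/data/local/tmp/a.dex", O_RDONLY) = 3',
    'read(3, "dex", 8) = 8',
    '[pid  4242] read(3, "", 8) = 0',
    "--- SIGCHLD {si_signo=SIGCHLD} ---",
    "4243 close(3) = 0",
]
syscall_counts = count_syscalls(sample_log)
```

Such per-call counts are one possible input for a later classification step; which features are actually relevant depends on the analysis.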

More advanced methods are more intrusive and require modifying either the #APK, the Android framework, runtime, or kernel.
TaintDroid~@Enck2010, for example, modifies the Dalvik Virtual Machine (the predecessor of the #ART) to track the data flow of an application at runtime, while AndroBlare~@Andriatsimandefitra2012 @andriatsimandefitra_detection_2015 tries to compute the taint flow by hooking system calls using a Linux Security Module.
DexHunter~@zhang2015dexhunter and AppSpear~@yang_appspear_2015 also patch the Dalvik Virtual Machine/#ART, this time to collect bytecode loaded dynamically.
Modifying the Android framework, runtime, or kernel is possible thanks to the Android project being open-source, but this is a delicate operation that requires revising a patch for each new version of Android.
Thus, a common issue faced by tools that took this approach is that they are stuck with a specific version of Android.
Some sandboxes limit this issue by using dynamic binary instrumentation, like DroidHook~@cui_droidhook_2023, based on the Xposed framework, or CamoDroid~@faghihi_camodroid_2022, based on Frida.
This approach is much less stealthy than patching Android, but it is generally easier to set up and to port to new Android versions.

Another known challenge when analysing an application dynamically is the code coverage: if some part of the application is not executed, it cannot be analysed.
Considering that Android applications are meant to interact with a user, this can become problematic for automatic analysis.
The Monkey tool developed by Google is one of the most used solutions~@sutter_dynamic_2024.
It sends a random stream of events to the phone without tracking the state of the application.
More advanced tools statically analyse the application to build a model of it, and use this model to guide the exploration.
Sapienz~@mao_sapienz_2016 and Stoat~@su_guided_2017 use this technique to improve application testing.
GroddDroid~@abraham_grodddroid_2015 takes the same approach, but statically detects suspicious sections of code and then interacts with the application so as to reach those sections.
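
As a point of reference for the simplest of these tools, Monkey is usually started from a host machine through `adb`.
The sketch below only builds such a command line; the helper `monkey_command` and the package name `com.example.app` are placeholders introduced for this example, and the options shown (`-p`, `-s`, `--throttle`, `-v`) are a small subset of what Monkey accepts.

```python
def monkey_command(package, event_count, seed=None, throttle_ms=None):
    """Build the adb argument list that starts Monkey on one package."""
    if event_count <= 0:
        raise ValueError("event_count must be positive")
    args = ["adb", "shell", "monkey", "-p", package]
    if seed is not None:
        # A fixed seed makes Monkey replay the same pseudo-random event sequence.
        args += ["-s", str(seed)]
    if throttle_ms is not None:
        # Fixed delay inserted between events, in milliseconds.
        args += ["--throttle", str(throttle_ms)]
    # -v raises verbosity; the event count is the final positional argument.
    args += ["-v", str(event_count)]
    return args


# "com.example.app" is a placeholder package name.
command = monkey_command("com.example.app", 500, seed=42, throttle_ms=100)
```

The resulting list can be passed to `subprocess.run` on a machine where `adb` is installed and a device or emulator is attached.
Nothing in the command depends on the state of the application, which reflects the stateless nature of this exploration.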

Unfortunately, exploring the application entirely is not always possible, as some applications will try to detect if they are in a sandbox environment (#eg if they are in an emulator, or if Frida is present in memory) and will refuse to run some sections of code if this is the case.
Ruggia #etal~@ruggia_unmasking_2024 compile a list of evasion techniques.
They propose a new sandbox, DroidDungeon, that, contrary to other sandboxes like DroidScope~@droidscope180237 or CopperDroid~@Tam2015, strongly emphasises resilience against evasion mechanisms.
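
To illustrate what checking whether Frida is present in memory can look like, the sketch below scans a dump in the layout of `/proc/<pid>/maps` for mapped files whose path mentions Frida.
This is an assumption about how such a check might be written, not a technique taken from the cited work: a real application would perform it in Java or native code, and the function `frida_mapped` as well as the sample paths (including `frida-agent-64.so`) are illustrative.

```python
def frida_mapped(maps_text):
    """Return True if a mapped file path in a /proc/<pid>/maps dump mentions Frida."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # File-backed mappings have six fields; the last one is the path.
        if len(fields) == 6 and "frida" in fields[5].lower():
            return True
    return False


# Hand-written excerpts in the /proc/<pid>/maps layout; paths are illustrative.
clean_maps = (
    "70000000-70021000 r-xp 00000000 fd:00 1234 /system/lib64/libc.so\n"
    "70100000-70121000 rw-p 00000000 00:00 0\n"
)
hooked_maps = clean_maps + (
    "70200000-70300000 r-xp 00000000 fd:00 5678 /data/local/tmp/frida-agent-64.so\n"
)
```

A sandbox that aims to resist evasion has to hide or avoid this kind of artefact, since an application can change its behaviour as soon as one such check succeeds.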

A common objective of dynamic analysis is to collect bytecode loaded dynamically and reflection information.
As mentioned earlier, DexHunter~@zhang2015dexhunter and AppSpear~@yang_appspear_2015 do so by instrumenting the Android Runtime.
Qu #etal~@qu_dydroid_2017 developed DyDroid, a hybrid framework using dynamic analysis to intercept dynamic code loading and static analysis to determine the nature of the loaded code.
They used DyDroid to audit the use of dynamic code loading in applications from the Google Play store in 2016.
The audit showed that dynamic code loading was mostly related to mobile advertisement, and that it originated from third-party libraries included in the applications rather than from the code of the application developers themselves.
Similarly, StaDynA~@zhauniarovichStaDynAAddressingProblem2015 is a framework that generates a call graph statically, then uses dynamic analysis to analyse dynamic code loading and reflection calls to complete this call graph.
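
The idea of completing a static call graph with edges observed at runtime can be sketched as follows.
This is a toy model, not the implementation of StaDynA: the graph is a plain dictionary from caller to callees, and the function `complete_call_graph` and all method names are hypothetical.

```python
def complete_call_graph(static_edges, observed_calls):
    """Merge caller -> callee pairs seen at runtime into a static call graph.

    Returns the completed graph and the list of edges that were missing.
    """
    # Copy the input so that the static graph itself is left untouched.
    graph = {caller: set(callees) for caller, callees in static_edges.items()}
    added = []
    for caller, callee in observed_calls:
        callees = graph.setdefault(caller, set())
        if callee not in callees:
            callees.add(callee)
            added.append((caller, callee))
    return graph, added


# Statically, Loader.load only appears to call the reflection API.
static_graph = {
    "Main.onCreate": {"Loader.load"},
    "Loader.load": {"java.lang.reflect.Method.invoke"},
}
# Calls recorded during a (hypothetical) dynamic analysis run.
observed = [("Loader.load", "Payload.run"), ("Main.onCreate", "Loader.load")]
completed, new_edges = complete_call_graph(static_graph, observed)
```

Only the edge towards `Payload.run` is new here: the static analysis could not see the target of the reflective call, while the dynamic run could.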

The issue with those approaches is that their results can only be consumed by their own subsequent analysis.
For instance, StaDynA only provides the call graph, and cannot be used as is to improve the capacity of Flowdroid.
This is unfortunate: the reverse engineer's next step will depend on the context.
Not being able to reuse the result of a previous analysis with other, ad hoc tools greatly limits their options.
AppSpear has an interesting solution to this issue: the code it intercepts is repackaged inside a new #APK file that Android analysis tools should be able to analyse.
We will now explore further the contributions that take this approach of using actual applications to encode their results.

//#todo[RealDroid sandbox bases on modified ART?]
//#todo[force execution?]
==== Improving Analysis with Instrumentation <sec:bg-instrumentation>

Usually, instrumentation refers to the practice of modifying the behaviour of a program to collect information during its execution.
Frida is a good example of an instrumentation framework.
The term can also be used more generally to describe operations that modify the application code.
In this section, we will focus on the use of instrumentation that makes an application easier to analyse by other tools, instead of just collecting additional information at runtime.

In the previous section, we gave the example of AppSpear~@yang_appspear_2015, which reconstructs #DEX files intercepted at runtime and repackages the #APK with the new code in it.
DexLego~@dexlego uses a similar but much more aggressive technique.
It targets heavily obfuscated packers that decrypt, then re-encrypt, a method's instructions just in time.
To get the bytecode, DexLego logs each instruction executed by the #ART, and reconstructs the methods, then the #DEX files, from this stream of instructions.
The main limitation of this technique is that it carries over the limitation of dynamic analysis to static analysis: the bytecode injected in the application is limited to the instructions executed during the dynamic analysis.
Nevertheless, it is an interesting way to encode the traces of a dynamic analysis in a way that can be used by any Android analysis tool.
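
The reconstruction step, and the coverage limitation that comes with it, can be illustrated with a toy model.
The sketch below is not DexLego's implementation: it assumes that the trace is a list of `(method, offset, instruction)` tuples, and the function `rebuild_methods`, the method name `A.f`, and the instruction strings are made up for this example.

```python
from collections import defaultdict


def rebuild_methods(trace):
    """Rebuild per-method instruction listings from an execution trace.

    Each trace entry is a (method, offset, instruction) tuple.
    """
    seen = defaultdict(dict)
    for method, offset, instruction in trace:
        # Simplification: a later execution at the same offset overwrites an earlier one.
        seen[method][offset] = instruction
    # Order the collected instructions of each method by offset.
    return {
        method: [instruction for _, instruction in sorted(by_offset.items())]
        for method, by_offset in seen.items()
    }


# Hypothetical trace: the method runs twice and the branch is never taken,
# so the instructions at the branch target are never observed.
one_run = [
    ("A.f", 0, "const/4 v0, 1"),
    ("A.f", 1, "if-eqz v0, +4"),
    ("A.f", 3, "return-void"),
]
rebuilt = rebuild_methods(one_run + one_run)
```

The rebuilt listing contains only the three executed instructions: whatever sits at the branch target is absent, which is exactly the limitation inherited from dynamic analysis.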

The technique of IccTa~@liIccTADetectingInterComponent2015 is close to the idea of modifying the application to improve its analysis: it performs a first analysis to compute the potential inter-component communication of an application, then modifies the Jimple representation of this application before feeding it to Flowdroid to perform a taint analysis.
Jimple is the intermediate language used by Soot, so even if IccTa does not generate a new application, this modified representation can probably be used by any tool based on the Soot framework or recompiled into a new application without too much effort.
Samhi #etal~@samhi_jucify_2022 followed this direction to unify the analysis of bytecode and native code.
Their tool, JuCify, uses Angr~@angrPeople to generate the call graph of the native code, and uses heuristics to encode this call graph into Jimple that can then be added to the Jimple generated by Soot from the bytecode of the application.
Like IccTa, they use Flowdroid to analyse this new augmented representation of the application, but it should be usable by any analysis tool relying on Soot.

Finally, DroidRA~@li_droidra_2016 uses the COAL~@octeauCompositeConstantPropagation2015 solver to statically compute the reflection information.
The reflection calls are transformed into direct calls inside the application using Soot.
Using COAL makes DroidRA effective on the simpler cases, where the names of the classes and methods targeted by reflection are already present in the application.
Those cases are common, so being able to solve them without resorting to dynamic analysis is useful.
On the other hand, COAL will struggle to solve cases with complex string manipulation and is simply not able to handle cases that rely on external data (#eg downloaded from the internet at runtime).
Likewise, DroidRA can only access dynamically loaded code if that code is present inside the application without any kind of obfuscation (#eg a #DEX file in the assets of the application can be analysed, but not if it is encrypted).
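
The transformation and its limits can be summarised with a toy model.
The sketch below does not use Soot or COAL and is not DroidRA's implementation: calls are represented by two hypothetical classes, `ReflectiveCall` and `DirectCall`, and a reflective call is rewritten only when both of its target names are known statically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReflectiveCall:
    """A reflective invocation; a field is None when its string is unknown statically."""
    class_name: Optional[str]
    method_name: Optional[str]


@dataclass(frozen=True)
class DirectCall:
    class_name: str
    method_name: str


def resolve_reflection(calls):
    """Replace reflective calls whose target is fully known by direct calls."""
    resolved = []
    for call in calls:
        if (isinstance(call, ReflectiveCall)
                and call.class_name is not None
                and call.method_name is not None):
            resolved.append(DirectCall(call.class_name, call.method_name))
        else:
            # Unknown target (for example a name received at runtime): keep as is.
            resolved.append(call)
    return resolved


calls = [
    ReflectiveCall("com.example.Payload", "run"),  # constant strings in the app
    ReflectiveCall(None, "run"),                   # class name only known at runtime
]
rewritten = resolve_reflection(calls)
```

The first call becomes a direct call that any static analysis tool can follow, while the second one stays reflective: resolving it requires information that is only available at runtime.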

#v(2em)

Instrumenting applications to encode the result of an analysis as a unified representation has been explored before.
It has been used by tools like AppSpear and DexLego to expose heavily obfuscated bytecode collected dynamically.
Similarly, DroidRA computes reflection information statically and injects the actual method calls inside the application it returns.
However, AppSpear and DexLego focus primarily on specific obfuscation techniques, making their implementation difficult to port to more recent versions of Android, and DroidRA suffers from the limitations of static analysis.
We believe that instrumentation is a promising approach to encoding analysis results.
In particular, we think that it could be used to provide dynamic information that is not available to static analysis tools like DroidRA.

In @sec:th, we will try to use instrumentation to combine dynamic analysis (to collect code loaded dynamically and reflection information) with static analysis, regardless of the static analysis tool used.