
#import "../lib.typ": APK, etal, ART, SDK, eg, DEX, pb3, pb3-text
#import "../lib.typ": todo, jm-note, jfl-note

== Allowing Static Analysis Tools to Analyse Obfuscated Applications <sec:bg-soa-th>


=== Dynamic Analysis <sec:bg-dynamic>

As discussed previously, static analysis cannot analyse everything.
Some situations, like reflection or dynamic code loading, require a different approach: dynamic analysis.
With dynamic analysis, the application is actually executed and the reverse engineer observes its behavior.
This behavior can be monitored in various ways: by observing the filesystem, the display, the process memory, the kernel, and so on.
Depending on the chosen level of observation, this monitoring can be technically difficult.
A basic example of dynamic analysis is presented by Bernardi #etal~@bernardi_dynamic_2019: the logs generated by `strace` are used to list the system calls issued in response to an event, in order to determine whether an application is malicious.
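
To illustrate this kind of log-based observation, the following Python sketch counts the system calls that appear in a `strace` log.
It does not reproduce the approach of the cited work and is only a minimal illustration: the function `count_syscalls`, the regular expression and the sample log are assumptions made for this example, and real logs contain additional line formats (interrupted calls, signals) that the sketch simply ignores.

```python
import re
from collections import Counter

# Matches the syscall name at the start of a strace line, optionally
# preceded by a PID column (as printed when child processes are followed).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(log_text):
    """Count how often each system call appears in a strace log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SYSCALL_RE.match(line.strip())
        if match:  # lines that are not syscalls (signals, exits) are skipped
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt, written by hand for illustration.
SAMPLE_LOG = """\
openat(AT_FDCWD, "/data/data/com.example/files/cfg", O_RDONLY) = 3
read(3, "abc", 3) = 3
read(3, "", 3) = 0
close(3) = 0
--- SIGCHLD {si_signo=SIGCHLD} ---
[pid  1234] connect(4, {sa_family=AF_INET}, 16) = 0
"""

print(count_syscalls(SAMPLE_LOG).most_common())
```

Such counts could then serve as features for a classifier, which is outside the scope of this sketch.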

More advanced methods are more intrusive and require modifying either the #APK, the Android framework, the runtime, or the kernel.
For example, TaintDroid~@Enck2010 modifies the Dalvik Virtual Machine (the predecessor of the #ART) to track the data flow of an application at runtime, while AndroBlare~@Andriatsimandefitra2012 @andriatsimandefitra_detection_2015 tries to compute the taint flow by hooking system calls using a Linux Security Module.
DexHunter~@zhang2015dexhunter and AppSpear~@yang_appspear_2015 also patch the Dalvik Virtual Machine/#ART, this time to collect bytecode loaded dynamically.
Modifying the Android framework, runtime or kernel is possible because the Android project is open source; however, it is a delicate operation that requires revising the patch for each new version of Android.
Thus, a common issue faced by tools that take this approach is that they are stuck with a specific version of Android.
Some sandboxes limit this issue by using dynamic binary instrumentation, like DroidHook~@cui_droidhook_2023, based on the Xposed framework, or CamoDroid~@faghihi_camodroid_2022, based on Frida.
This approach is a lot less stealthy than patching Android, but it is generally easier to set up and to port to new Android versions.

Another known challenge when analysing an application dynamically is code coverage: if some part of the application is not executed, it cannot be analysed.
Considering that Android applications are meant to interact with a user, this can become problematic for automatic analysis.
The Monkey tool developed by Google is one of the most used solutions~@sutter_dynamic_2024.
It sends a random stream of events to the phone without tracking the state of the application.
More advanced tools statically analyse the application to model it in order to improve the exploration.
Sapienz~@mao_sapienz_2016 and Stoat~@su_guided_2017 use this technique to improve application testing.
GroddDroid~@abraham_grodddroid_2015 follows the same approach, but statically detects suspicious sections of code and then interacts with the application to reach those sections.
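
As an illustration of how little input Monkey needs compared to model-based tools, the sketch below builds the command line used to launch it through `adb`.
The options `-p` (target package), `-s` (seed) and `-v` (verbosity) are the ones documented for Monkey; the helper `build_monkey_command`, the package name and the values are hypothetical, and the command is only built here, not executed.

```python
def build_monkey_command(package, event_count, seed=None):
    """Build the adb command that sends random events to one package."""
    if event_count <= 0:
        raise ValueError("event_count must be positive")
    command = ["adb", "shell", "monkey", "-p", package]
    if seed is not None:
        # A fixed seed makes Monkey replay the same pseudo-random sequence.
        command += ["-s", str(seed)]
    # The number of events to inject is the last positional argument.
    command += ["-v", str(event_count)]
    return command


# Hypothetical package name and parameters.
print(" ".join(build_monkey_command("com.example.app", 500, seed=42)))
```

Nothing in this command describes the structure of the application, which is why the exploration is blind to the application state.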

Unfortunately, exploring the application entirely is not always possible, as some applications try to detect whether they are running in a sandbox environment (#eg if they are in an emulator, or if Frida is present in memory) and refuse to run some sections of code if this is the case.
Ruggia #etal~@ruggia_unmasking_2024 list such evasion techniques.
They propose a new sandbox, DroidDungeon, that, contrary to other sandboxes like DroidScope~@droidscope180237 or CopperDroid~@Tam2015, strongly emphasizes resilience against evasion mechanisms.

A common objective of dynamic analysis is to collect dynamically loaded bytecode and reflection information.
As mentioned earlier, DexHunter~@zhang2015dexhunter and AppSpear~@yang_appspear_2015 do so by instrumenting the Android Runtime.
Qu #etal~@qu_dydroid_2017 developed DyDroid, a hybrid framework that uses dynamic analysis to intercept dynamic code loading and static analysis to determine the nature of the loaded code.
They used DyDroid to audit the use of dynamic code loading in applications from the Google Play store in 2016.
They found that dynamic code loading was mostly related to mobile advertisement, and that the code loading originated from third-party libraries included in the application rather than from the code of the application developer.
Similarly, StaDynA~@zhauniarovichStaDynAAddressingProblem2015 is a framework that generates a call graph statically, then uses dynamic analysis to analyse dynamic code loading and reflection calls to complete this call graph.
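
The idea of completing a static call graph with runtime observations can be sketched as follows.
This is a simplified illustration and not the implementation of StaDynA: the function `complete_call_graph`, the representation of edges as pairs of method names, and the method names themselves are assumptions made for this example.

```python
def complete_call_graph(static_edges, dynamic_edges):
    """Merge call edges observed at runtime into a static call graph.

    Both arguments are iterables of (caller, callee) pairs. Returns the
    merged graph as a dict mapping each caller to a set of callees, and
    the set of edges that only the dynamic analysis revealed.
    """
    static_set = set(static_edges)
    dynamic_set = set(dynamic_edges)
    graph = {}
    for caller, callee in static_set | dynamic_set:
        graph.setdefault(caller, set()).add(callee)
    return graph, dynamic_set - static_set


# Hypothetical edges: statically, the reflective call hides its real target.
static_edges = [
    ("Main.onCreate", "Loader.load"),
    ("Loader.load", "Method.invoke"),
]
# Hypothetical runtime observation of the method reached through reflection.
dynamic_edges = [
    ("Main.onCreate", "Loader.load"),
    ("Loader.load", "Payload.run"),
]

graph, new_edges = complete_call_graph(static_edges, dynamic_edges)
print(sorted(new_edges))
```

The merged graph is richer than the static one, but it remains a data structure specific to the tool that produced it.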

The issue with these approaches is that they are only compatible with their own subsequent analysis.
For instance, StaDynA only provides the call graph, and cannot be used as is to improve the capacity of Flowdroid.
This is unfortunate, as the reverse engineer's next step will depend on the context: not being able to reuse the result of a previous analysis with other #jm-note[non-specialise][erf, non-specific? non-adapted?] tools greatly limits their options.
AppSpear has an interesting solution to this issue: the code it intercepts is repackaged inside a new #APK file that Android analysis tools should be able to analyse.
In the next section, we will explore further the contributions that take this approach of using an actual application to encode their results.

//#todo[RealDroid sandbox bases on modified ART?]
//#todo[force execution?]
=== Improving Analysis with Instrumentation <sec:bg-instrumentation>

Usually, instrumentation refers to the practice of modifying the behavior of a program to collect information during its execution.
Frida is a good example of an instrumentation framework.
The term can also be used more generally to describe operations that modify the application code.
In this section, we will focus on uses of instrumentation that make an application easier to analyse by other tools, instead of just collecting additional information at runtime.

In the previous section, we gave the example of AppSpear~@yang_appspear_2015, which reconstructs #DEX files intercepted at runtime and repackages the #APK with the new code in it.
DexLego~@dexlego uses a similar but far more aggressive technique.
It targets heavily obfuscating packers that decrypt and then re-encrypt method instructions just in time.
To get the bytecode, DexLego logs each instruction executed by the #ART, and reconstructs the methods, then the #DEX files, from this stream of instructions.
The main limitation of this technique is that it carries over the limitations of dynamic analysis to static analysis: the bytecode injected in the application is limited to the instructions executed during the dynamic analysis.
Nevertheless, it is an interesting way to encode the traces of a dynamic analysis in a form that can be used by any Android analysis tool.
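
The following sketch illustrates why a reconstruction based on an instruction trace only contains executed code.
It is a toy model and not the algorithm of DexLego: the function `rebuild_methods`, the trace format and the sample instructions are assumptions, and the sketch keeps only the last instruction seen at each offset, which is a simplification.

```python
def rebuild_methods(trace):
    """Group a trace of executed instructions into per-method listings.

    trace is an iterable of (method, offset, instruction) tuples in
    execution order. An instruction executed several times is kept once.
    Offsets that were never executed are simply absent from the result.
    """
    methods = {}
    for method, offset, instruction in trace:
        methods.setdefault(method, {})[offset] = instruction
    return {
        method: [body[offset] for offset in sorted(body)]
        for method, body in methods.items()
    }


# Hypothetical trace: the branch at offset 1 is taken, so the
# instructions between offsets 2 and 4 are never observed.
METHOD = "Lcom/example/Foo;->bar()V"
trace = [
    (METHOD, 0, "const/4 v0, 0"),
    (METHOD, 1, "if-eqz v0, +4"),
    (METHOD, 5, "return-void"),
]

rebuilt = rebuild_methods(trace)
print(rebuilt[METHOD])
```

A static analysis tool run on the rebuilt method would see no trace of the code in the branch that was not taken.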

The technique used by IccTa~@liIccTADetectingInterComponent2015 is close to the idea of modifying the application to improve its analysis: it performs a first analysis to compute the potential inter-component communications of an application, then modifies the Jimple representation of this application before feeding it to Flowdroid to perform a taint analysis.
Jimple is the intermediate language used by Soot, so even if IccTa does not generate a new application, this modified representation can probably be used by any tool based on the Soot framework, or recompiled into a new application without too much effort.
Samhi #etal~@samhi_jucify_2022 followed this direction to unify the analysis of bytecode and native code.
Their tool, JuCify, uses Angr~@angrPeople to generate the call graph of the native code, and uses heuristics to encode this call graph into Jimple that can then be added to the Jimple generated by Soot from the bytecode of the application.
Like IccTa, they use Flowdroid to analyse this new augmented representation of the application, but it should be usable by any analysis tool relying on Soot.

Finally, DroidRA~@li_droidra_2016 uses the COAL~@octeauCompositeConstantPropagation2015 solver to statically compute reflection information.
The reflection calls are transformed into direct calls inside the application using Soot.
Using COAL makes DroidRA quite good at solving the simpler cases, where the names of the classes and methods targeted by reflection are already present in the application.
Those cases are quite common, and being able to solve them without resorting to dynamic analysis is quite useful.
On the other hand, COAL struggles with cases involving complex string manipulation, and is simply not able to handle cases that rely on external data (#eg downloaded from the internet at runtime).
Likewise, DroidRA can only access dynamically loaded code if this code is present inside the application without any kind of obfuscation (#eg a #DEX file in the assets of the application can be analysed, but not if it is encrypted).
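
The split between reflection calls that can be resolved statically and those that cannot is illustrated by the toy model below.
It does not use COAL and is not how DroidRA is implemented: the class `ReflectiveCall`, the function `resolve` and the sample call sites are assumptions, and a name is modelled as `None` whenever it is not a constant available in the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReflectiveCall:
    site: str                    # where the reflective call occurs
    class_name: Optional[str]    # None if not a constant in the application
    method_name: Optional[str]   # None if not a constant in the application


def resolve(calls):
    """Split call sites into resolved direct targets and unresolved sites."""
    resolved, unresolved = [], []
    for call in calls:
        if call.class_name is not None and call.method_name is not None:
            target = f"{call.class_name}.{call.method_name}"
            resolved.append((call.site, target))
        else:
            unresolved.append(call.site)
    return resolved, unresolved


# Hypothetical call sites: the second one takes its class name from
# data downloaded at runtime, so no static solver can recover it.
calls = [
    ReflectiveCall("Main.init", "com.example.Helper", "setup"),
    ReflectiveCall("Main.update", None, "run"),
]

resolved, unresolved = resolve(calls)
print(resolved, unresolved)
```

Only the resolved sites can be rewritten as direct calls; the unresolved ones are the cases where dynamic information is needed.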


#v(2em)

Instrumenting applications to encode the result of an analysis as a unified representation has been explored before.
It has been used by tools like AppSpear and DexLego to expose heavily obfuscated bytecode collected dynamically.
Similarly, DroidRA computes reflection information statically and injects the actual method calls inside the application it returns.
However, AppSpear and DexLego focus primarily on specific obfuscation techniques, making their implementations difficult to port to more recent versions of Android, and DroidRA suffers from the limitations of static analysis.
We believe that instrumentation is a promising approach to encode this information.
In particular, we think that it could be used to provide dynamic information that is not available to static analysis tools like DroidRA.
To explore this possibility, we will try to answer our third problem statement #pb3: #pb3-text