#import "../lib.typ": epigraph, eg, APK, API, highlight-block, pb1-text, pb2-text, pb3-text
#import "../lib.typ": todo, jfl-note, jm-note

= Introduction <sec:intro>

// https://youtu.be/si9iqF5uTFk?t=1512
#epigraph("Rear Admiral Grace Hopper")[If during the next 12 months any one of you says "but we have always done it that way", I will instantly materialize beside you and I will haunt you for 24 hours.]


// Since the dawn of time, people have made Android apps ...
Android has been the most used mobile operating system since 2014, and since 2017 it has even surpassed Windows, all platforms combined#footnote[https://gs.statcounter.com/os-market-share#monthly-200901-202304].
The public adoption of Android is confirmed by application developers, with 1.3 million apps available in the Google Play Store in 2014, and 3.5 million apps available in 2017#footnote[https://www.statista.com/statistics/266210]. 
Its popularity makes Android a prime target for malware developers. 
Indeed, various applications have been shown to behave maliciously, from stealing personal information~@shanSelfhidingBehaviorAndroid2018 to hijacking the smartphone's computing resources to mine cryptocurrency~@adjibi_devil_2022.

Considering the importance of Android in the everyday life of so many people, Google, the company that develops Android, defined a very strong security model that addresses an extensive threat model~@mayrhofer_android_2021.
This threat model goes as far as to consider that an adversary can have physical access to an unlocked device (#eg an abusive partner or border control agents). // Americaaaaa
On the device, this security model includes the sandboxing of each application, controlled using a system of permissions to allow the applications to perform potentially unwanted actions.
For example, an application cannot access the contact list without requesting permission from the user first.
Android keeps improving its security from version to version by improving the sandboxing (#eg starting with Android 10, applications can no longer access the clipboard if they are not focused) or by using safer defaults (#eg since Android 9, by default, all network connections must use TLS).
// Android Bouncer: it still does not work very well, etc. (stalkerware?)

In the spirit of _defence in depth_, Google developed a _Bouncer_ service that scans applications in the store for malicious software#footnote[https://googlemobile.blogspot.com/2012/02/android-and-security.html].
Although its #jm-note[operation][I would have said "operating" but grammarly disagrees] is kept secret, it seems that the Bouncer is both comparing the applications with known malware code and running the applications in Google's cloud infrastructure to detect hidden behavior.
Despite Google's efforts, malicious applications are still found in the Play Store~@adjibi_devil_2022.
Also, it is not uncommon for people in abusive situations #jfl-note[to have their abuser install][jfl says "install#strong[ing]", jm says no, grammarly is on the side of jm] stalkerware (a spying application) found outside of the Play Store on their phone~@stateofstalkerware.

For these reasons, it is important to be able to analyse an application and understand what it does.
This process is called reverse engineering.
A lot of work has been done to reverse engineer computer software, but Android applications come with specific challenges that need to be addressed.
For instance, Android applications are distributed in a specific file format, the #APK format, and the code of the application is mainly compiled into an Android-specific bytecode: Dalvik.
An Android reverse engineer will need tools that can read those Android-specific formats.
A first step in the process of reverse engineering an application is to simply read the content of the application and the code in it.
Tools like Apktool can be used to convert the binary files of an application into a human-readable format.
Other tools like Jadx can go further and try to generate Java code from the bytecode in the application.
Because Android applications tend to be quite large, it can be tedious to understand what they do just from reading their bytecode.
To address this issue, many tools and approaches have been developed~@Li2017 @sutter_dynamic_2024 to extract higher-level information about the behavior of an application without having to analyse it manually.
For example, Flowdroid~@Arzt2014a aims to detect information leaks: given a set of methods that can generate private information, and a set of methods that send information to the outside, Flowdroid will detect if private information is sent to the outside.
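
As a rough illustration of the kind of leak such a tool looks for, the following plain Java sketch marks one method as a source and another as a sink.
The names `readDeviceId` and `sendToServer` are hypothetical: they only stand in for the sets of source and sink methods given to the tool, and the sketch does not reflect how Flowdroid is implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakSketch {
    // Hypothetical stand-in for "the outside": records everything that is sent.
    static final List<String> SENT = new ArrayList<>();

    // Hypothetical source: a method that generates private information.
    static String readDeviceId() {
        return "device-1234";
    }

    // Hypothetical sink: a method that sends information to the outside.
    static void sendToServer(String payload) {
        SENT.add(payload);
    }

    // Private data flows from the source to the sink: this is a leak.
    static void leak() {
        String id = readDeviceId();      // value produced by the source
        String message = "id=" + id;     // the private value is copied into the message
        sendToServer(message);           // the message reaches the sink
    }

    // The source is called, but its result never reaches the sink: no leak.
    static void noLeak() {
        String id = readDeviceId();
        sendToServer("hello");
    }

    public static void main(String[] args) {
        leak();
        noLeak();
        System.out.println(SENT);
    }
}
```

A taint analysis would be expected to report `leak` but not `noLeak`, even though both methods call the same source and the same sink.
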
Once again, those kinds of tools need to target Android specifically.
Android runs application code differently from the way a computer would run software.
One example would be the handling of entry points: computer software usually has one entry point, whereas Android applications have many, and Android will choose depending on context.
Unfortunately, those tools are hard to use, and even when they work on small example applications, it is not uncommon for them to fail to run on real-life applications~@reaves_droid_2016.
This is worrying.
Android applications are becoming more complex every year, and tools that cannot handle this complexity will fail more often.
This leads us to our first problem statement:
// Quantify the contributions with experiments that ignore the apps that crash the tools?

#highlight-block(breakable: false)[
  *Pb1*: #pb1-text
  
  Many tools have been published to analyse Android applications, but the Android ecosystem is evolving rapidly.
  Tools developed 5 years ago might not be usable anymore.
  We will endeavor to identify which tools are still usable today, and for the others, what causes them to no longer be an option.
] <pb-1>

Another issue is that Android application developers sometimes use various techniques to slow down reverse engineering.
This process is called obfuscation.
Malware developers do that to hide malicious behavior and avoid detection, but the use of obfuscation is not proof that an application is malicious.
Indeed, legitimate application developers can also use obfuscation to protect their intellectual property. // burrkkk
Thus, developers and reverse engineers are playing a game of cat and mouse, constantly inventing new techniques to hide or reveal the behavior of an application.

There are two types of reverse engineering techniques: static and dynamic.
Static analysis consists #jfl-note[of][jfl asks "in"?\ grammarly says "of"] examining the application without running it, while dynamic analysis studies the action of the application while it is running.
Both methods have their drawbacks, and obfuscation techniques will often capitalize on the drawbacks of one of them.
For instance, an application can try to detect whether it is running in a sandbox environment and refrain from acting maliciously if that is the case.
Similarly, an application can dynamically load bytecode at runtime, and this bytecode will not be available during a static analysis.
Dynamic code loading relies on Java classes called `ClassLoader` that are central components of the Android runtime environment.
Because dynamic code loading is such a difficult problem for static analysis, class loading as a whole is often ignored by static analysis tools.
However, class loading is not limited to dynamic code loading. 
As a matter of fact, the Android Runtime is constantly performing class loading to load classes from the application or from the Android platform itself.
This blind spot in static analysis tools raises our second problem statement:

#highlight-block(breakable: false)[
  *Pb2*: #pb2-text

  Class loading is an operation often ignored by static analysis tools.
  The exact algorithm used is not well known and might not be accurately modeled by static analysis tools.
  If that is the case, discrepancies between the model of the tools and the one used by Android could be used as a base for new obfuscation techniques.
] <pb-2>
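
To make the notion of class loading more concrete, the sketch below asks an application-level class loader for two classes by name on a standard JVM.
It is only an illustration under that assumption: the class `ClassLoadingSketch` and its helper methods are hypothetical, and the Android Runtime uses its own class loaders, whose behavior may differ from what is shown here.

```java
public class ClassLoadingSketch {
    // Describes which class loader defined the given class.
    static String definingLoader(Class<?> c) {
        ClassLoader loader = c.getClassLoader();
        // getClassLoader() may return null for classes defined by the bootstrap loader.
        return loader == null ? "bootstrap" : loader.getClass().getName();
    }

    // Asks the loader of this sketch (an application-level loader) for a class by name.
    static Class<?> loadByName(String name) {
        ClassLoader appLoader = ClassLoadingSketch.class.getClassLoader();
        try {
            return appLoader.loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class not found: " + name, e);
        }
    }

    public static void main(String[] args) {
        // A platform class and an application class, both requested from the same loader.
        Class<?> platform = loadByName("java.lang.String");
        Class<?> own = loadByName(ClassLoadingSketch.class.getName());

        System.out.println("String defined by: " + definingLoader(platform));
        System.out.println("Sketch defined by: " + definingLoader(own));
        System.out.println("Same String class: " + (platform == String.class));
    }
}
```

The request for `java.lang.String` is made to the application's class loader, yet the class that comes back is the one defined by the platform.
A static analysis tool has to reproduce this kind of resolution to know which definition of a class is actually executed.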

#jfl-note[
Reflection is another common obfuscation technique against static analysis.
Instead of directly invoking methods, the generic `Method.invoke()` #API is used, and the method is retrieved from its name in the form of a character string.
The value of this string can be quite difficult to determine statically, so it is once again an issue more suitable for dynamic analysis.
When encountering a complex case of reflection (#eg using ciphered strings) or code loading, a reverse engineer will switch to dynamic analysis to collect the relevant data (the names of the methods called or the code that was loaded), then switch back to static analysis.
This is doable for a manual analysis; unfortunately, the more automated tools that would require that runtime information to perform an accurate analysis may not have a way to access this new data.
This leads us to our last problem statement:
][

  Underdeveloped.
  Explain that a reverse engineer who finds reflection or dynamic loading can possibly capture the data with dynamic analysis.
  But this data then becomes useless if they go back to static analysis.
  Indeed, they often alternate between the two.
  They would need the data from the dynamic analysis to be taken into account by the static analysis, for example...

  TODO: find a simple example to formulate
]
#highlight-block(breakable: false)[
  *Pb3*: #pb3-text 

  Dynamic code loading and reflection are problems most suited for dynamic analysis.
  However, static analysis tools do not have access to the data collected at runtime.
  Encoding this information inside valid applications could be a way to make it universally available to any static analysis tool.
  Ideally, this encoding should not degrade the quality of the static analysis compared to the original application.
] <pb-3>
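
As a minimal illustration of the reflection scenario discussed above, the following plain Java sketch hides a method name behind Base64, a trivial stand-in for a real cipher, and invokes it through `Method.invoke()`.
The class `ReflectionSketch` and its methods are hypothetical; `direct` shows the equivalent static call that could be written once the method name has been observed at runtime.

```java
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ReflectionSketch {
    // Base64 encoding of "toUpperCase": the method name never appears in clear text.
    static final String ENCODED_NAME = "dG9VcHBlckNhc2U=";

    // Recovers the method name at runtime.
    static String decodeName() {
        byte[] raw = Base64.getDecoder().decode(ENCODED_NAME);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Reflective call: the target method is only known once decodeName() has run.
    static String viaReflection(String input) {
        try {
            Method method = String.class.getMethod(decodeName());
            return (String) method.invoke(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflective call failed", e);
        }
    }

    // Direct call: what the same operation looks like when the target is known statically.
    static String direct(String input) {
        return input.toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(decodeName());
        System.out.println(viaReflection("dex"));
        System.out.println(direct("dex"));
    }
}
```

A static analysis of `viaReflection` only sees a call to `Method.invoke()`, whereas `direct` exposes the callee in the bytecode itself.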

#[
#set heading(numbering: none, outlined: false, bookmarked: false)

== Contributions

The contributions of this thesis are the following:

+ We evaluate the reusability of Android static analysis tools published by the community:
  we rebuilt the tools in their original environment as container images.
  With those containers, the tools are now readily available in any environment capable of running either Docker or Singularity.
  We tested those tools on a dataset of real-life applications, balanced so that it contains a significant number of applications with different characteristics, in order to assess which characteristics impact the success of a tool.
  This work was presented at the ICSR 2024 conference~@rasta.
+ We model the default class loading behavior of Android.
  Based on this model, we define a class of obfuscation techniques that we call _shadow attacks_ where a class definition in an #APK shadows the actual class definition.
  We show that common state-of-the-art tools like Jadx or Flowdroid do not implement this model correctly and thus can fall for those shadow attacks.
  We analysed a large number of recent Android applications and found that applications with class shadowing do exist, though they are the result of quirks in the #APK compilation process and not deliberate obfuscation attempts.
  This work was published in the Digital Threats journal~@classloaderinthemiddle. #todo[update ref when not 'just published' anymore]
+ We propose an approach to allow static analysis tools to analyse applications that perform dynamic code loading:
  we collect the dynamically loaded bytecode and information about reflection calls at runtime, and patch the #APK file to perform those operations statically.
  Finally, we evaluate the impact this transformation has on the tools we containerized previously.

== Outline

This dissertation is composed of 6 chapters. 
This introduction is the first chapter.
It is followed by @sec:bg which gives background information about Android and the different analysis techniques targeting Android applications.

The next 3 chapters are dedicated to the contributions of this thesis.
First, @sec:rasta studies the reusability of static analysis tools.
Next, in @sec:cl, we model the default class loading algorithm used by Android and show the consequences for reverse engineering tools that implement a wrong model.
Then, @sec:th presents an approach that allows static analysis tools to analyse applications that load bytecode at runtime.

Finally, @sec:conclusion summarizes the contributions of this thesis and opens perspectives for future work.
]