Malicious Software (IY3840)

Metamorphic malware

"Body-polymorphic": create semantically-equivalent completely different versions of code at each infection. It works by: 1. Analysing its own code 2. Split the code in blocks 3. Mutate each block separately

Supervised Learning (classification)

"Training with a teacher". - Known number of classes - Learning from a labelled training set - Used to classify future observations

Unsupervised Learning (Clustering)

"Training without a teacher" (no correct answers given, no prior knowledge) Given a collection of objects X = (x1, x2, ..., x_n) without class labels y_i, there are methods for building a model that captures the structure of the data while finding the "natural" grouping of instances with no. Uses: - Labelling large datasets can be a costly procedure -Class labels may not be known beforehand - Large datasets can be compressed into a small set of prototypes

Accuracy

(TP + TN) / (TP + FP + TN + FN) Where: - TP is True Positives - FP is False Positives - TN is True Negatives - FN is False Negatives However, it can be misleading when datasets are very imbalanced (e.g.: a classifier that always predicts "benign" will have 99% accuracy on a dataset with 99% goodware and 1% malware)
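
A quick worked example (a minimal Python sketch, not from the original notes) showing the formula and why accuracy misleads on an imbalanced dataset:

    # Accuracy from the four confusion-matrix counts.
    def accuracy(tp, fp, tn, fn):
        return (tp + tn) / (tp + fp + tn + fn)

    # A classifier that always predicts "benign" on 990 goodware + 10 malware:
    # it never raises an alert, so TP = FP = 0, TN = 990, FN = 10.
    print(accuracy(tp=0, fp=0, tn=990, fn=10))  # 0.99 -- looks great, yet detects nothing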

Evasion attacks

*Formalisation:* - z* is the optimally modified malware sample for evasion - Ω(z) is the set of all possible modifications of the malware sample z (as long as the code still works), and z' ∈ Ω(z) is one such modified sample - ŵ is the vector of estimated feature weights - x' is the feature vector corresponding to the altered sample *Possible Scenarios:* - Zero-effort Attack: θ = () - DexGuard-based Obfuscation Attacks: θ = () + PRAGuard obfuscations - Mimicry Attack: θ = (D̂, X), where D̂ is a surrogate dataset and X is the feature space - Limited-Knowledge (LK) Attacks: θ = (D̂, X, f̂), where f̂ is the classifier but with unknown parameters - Perfect-Knowledge (PK) Attacks: θ = (D, X, f), i.e., the attacker knows the w and b parameters of the SVM

Registers

*IA-32*: - General purpose: - *a*ccumulator register (ax): used in arithmetic operations - *c*ounter register (cx): used in shift/rotate instructions and loops - *d*ata register (dx): used in arithmetic operations and I/O operations - *b*ase register (bx): used as a pointer to data (located in DS, when in segmented mode) - *s*tack *p*ointer register (sp): pointer to the top of the stack - stack *b*ase *p*ointer register (bp): used to point to the base of the stack - *s*ource *i*ndex register (si): used as a pointer to a source in stream operations - *d*estination *i*ndex register (di): used as a pointer to a destination in stream operations - (Extended) program counter/instruction pointer: %eip (contains the address of the next instruction to be executed if no branching is done) - %eflags (ZF, SF, CF, ...) - Segment registers: CS, DS, SS, ES, FS, GS - Used to select segments (e.g. code, data, stack) - Program status and control: EFLAGS - The instruction pointer: EIP - Points to the next instruction to be executed - Cannot be read or set explicitly - It is modified by jump and call/return instructions - Can be read by executing a call and checking the value pushed on the stack - Floating point units and MMX/XMM registers Access modes: - 8-bit (only for AX, CX, DX, BX): A*L* for the lower half and A*H* for the high half of AX - 16-bit: ..X (the regular abbreviations, e.g.: AX) - 32-bit: *E*... (e.g.: EAX) - 64-bit: *R*... (e.g.: RAX)

Euclidean distance

*L_2 norm*

Chebyshev distance

*L_∞ norm*

Minkowski distance

*L_k norm*: d(x, y) = (∑_i |x_i − y_i|^k)^(1/k). The choice of an appropriate value of *k* depends on the amount of emphasis that you would like to give to the larger differences between the components. Special cases: Manhattan or city-block distance (k = 1), Euclidean distance (k = 2), Chebyshev distance (k → ∞)
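
An illustrative numpy sketch (numpy assumed available; the example vectors are made up) showing how the special cases fall out of the same formula:

    import numpy as np

    def minkowski(x, y, k):
        # L_k norm of the difference vector
        return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    print(minkowski(x, y, 1))     # 5.0   -> Manhattan / city-block (L_1)
    print(minkowski(x, y, 2))     # ~3.61 -> Euclidean (L_2)
    print(np.max(np.abs(x - y)))  # 3.0   -> Chebyshev (L_inf, the k -> infinity limit)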

Malware

*Malicious software*: unwanted software and executable code that is used to perform an unauthorized, often harmful, action on a computing device. It is an umbrella term for various types of harmful software, including viruses, worms, Trojans, rootkits, and botnets. Traditional ones are written in ASM/C/macro code and have unprotected/unobfuscated payloads

strace

*System call tracer* which intercepts and records the system calls made by a process and the signals it receives. - trace system calls, parameters, signals, ... - trace child processes - attach to a running process - leverages ptrace

x86 ASM

- (Slightly) higher-level language than machine language. - A program is made of - directives: commands for the assembler (e.g. .data identifies a section with variables) - instructions: actual operations (e.g. jmp 8048f3f) There are 2 possible syntaxes, with different ordering of the operands! - AT&T syntax (objdump, GNU Assembler) - DOS/Intel syntax (Microsoft Assembler, Nasm) - Instructions can be modified using suffixes: *b*yte, *w*ord (16 bits), *l*ong (32 bits), *q*uad (64 bits), e.g. movl %ecx,%eax - Addresses are specified by using a segment selector and an offset: CS, DS, SS, ES, FS, GS (Code Segment, Data Segment, Stack Segment, Extra Segment, F Segment, G Segment) - Memory access is of the form displacement(%base, %index, scale), where the resulting address is displacement + %base + %index * scale - movl 0xffffff98(%ebp),%eax copies the contents of the memory pointed to by ebp - 104 into eax - mov (%eax),%eax copies the contents of the memory pointed to by eax into eax - movl %eax,15(%edx,%ecx,2) moves the contents of eax into the memory at address 15 + edx + ecx * 2 - mov $0x804a0e4,%ebx copies the value 0x804a0e4 into ebx - mov 0x804a0e4,%eax copies the content of memory at address 0x804a0e4 into eax - mov ax, [di] (Intel syntax) copies the contents of the memory pointed to by di into ax

ML challenges (in security)

- *High cost of FNs*: a FN is a missed "attack", but reducing FNs implies increased FPs - *Hard to find public datasets*: mainly due to privacy reasons, there is a lack of datasets for security (except maybe for malware analysis) - *High cost of labelling*: analysing a malware sample or a traffic trace may require a huge amount of time - *Explainability is hard*: many events may be correlated but it is not trivial to determine causation - *Imbalanced datasets*: the vast majority of events/applications are benign, only a small fraction (< 0.1%) are malicious

Binder Protocol

- *IPC/RPC*: - Binder protocols enable fast inter-process communication - Allows apps to invoke other app component functions - Binder objects handled by Binder Driver in kernel - Serialized/marshalled passing through kernel - Results in input output control (ioctl) system calls - *Android Interface Definition Language (AIDL)*: - AIDL defines which/how services can be invoked remotely - Describes how to marshal method parameters

80x86 CPU Family

- 8088, 8086: 16 bit registers, real-mode only - 80286: 16-bit protected mode - 80386: 32-bit registers, 32-bit protected mode - 80486/Pentium/Pentium Pro: Adds few features, speed-up - Pentium MMX: Introduces the multimedia extensions (MMX) - Pentium II: Pentium Pro with MMX instructions - Pentium III: Speed-up, introduces the Streaming SIMD Extensions (SSE) - Pentium 4: Introduces the NetBurst architecture - Xeon: Introduces Hyper-Threading - Core: Multiple cores

Anti-analysis

- *Obfuscation*: many levels of packing. - *Anti-forensics*: - self-deletion from disk; - erase key from memory; - change time of the module to that of the kernel32.dll. - Anti-AV: tricks signature checks by spawning hollowed explorer.exe (RunPE). - Whitelisting security products: attackers want to be sure that security software is not encrypted

Segment registers

- *S*tack *S*egment: pointer to the stack - *C*ode *S*egment: pointer to the code - *D*ata *S*egment: pointer to the data - *E*xtra *S*egment: pointer to extra data - *F* *S*egment: pointer to more extra data - *G* *S*egment: pointer to still more extra data

Basic protection from ransomware

- *User education*: email attachments and social engineering. - *Patch systems*: operating system, applications, browser plugins, ... - *Backup files*: backup all important data regularly. - *Remove admin rights*. - *Protected folder*.

Android ransomware families

- Android Defender: fake AV (first example of ransomware for Android). - Police Ransomware: illegal activity detected. - Simplocker: encrypts files on memory card. - Lockerpin: set or change the PIN on the device. - Dogspectus: silently installs on Android devices, via malvertising.

Forced multi-path exploration

- Assumption: the behaviour of the program depends on the output of the syscalls it executes. - Track dependencies between syscalls output and program variables - Detect untaken paths and force the execution of these paths by computing new program states that satisfy the path conditions

Adversarial Taxonomy

- Attacker's Goals - Security violation - integrity (malware classified as benign) - availability (benign classified as malware) - privacy (user data leaked) - Attack specificity (targeted, indiscriminate) - Attacker's Knowledge - training data - feature set, feature extraction/selection - learning algorithm, parameters - Attacker's Capability - exploratory: only at test time - causative: poison training

Domain flux

- Bots periodically generate new C&C domain names. - The local date (system time) is often used as input. - The botmaster needs to register one of these domains and respond properly so that bots recognize the valid C&C server. - Defenders must register (or block) all candidate domains to take down the botnet
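
A toy, purely illustrative DGA sketch in Python (the hashing scheme and domain count are invented, not taken from any real botnet): the bot derives the day's candidate domains from the current date, and the botmaster only has to register one of them.

    import hashlib
    from datetime import date

    def candidate_domains(day: date, count: int = 5, tld: str = ".com"):
        domains = []
        for i in range(count):
            seed = f"{day.isoformat()}-{i}".encode()   # the date is the shared input
            digest = hashlib.sha256(seed).hexdigest()
            domains.append(digest[:12] + tld)
        return domains

    print(candidate_domains(date(2024, 1, 1)))  # both bot and botmaster can compute this list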

Polymorphic malware

- Change layout with each infection - Payload is encrypted - Using different key for each infection - Makes static string analysis practically impossible - Encryption routine must change too or detection is trivial

Taint analysis

- Concerned with tracking how interesting data flows throughout a program's execution - Taint source (the interesting data source) - Propagation rules (the how-to) - Taint sinks (where the data flows to; allow security policies to be enforced) - At the basis of many dynamic-behaviour malware analysis frameworks - Explicit (data) flows: x = y; - Control-dependent (data) flows: if (x == <value>) { expr; } - Implicit flows (w gets the same value as y): x = 0; z = 0; if (y == 1) x = 1; else z = 1; if (x == 0) w = 0; if (z == 0) w = 1 - Very successful when protecting benign programs - Open to a number of evasions when applied to malicious programs
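
A minimal Python sketch of explicit (data-flow) taint propagation over a toy three-address program (the instruction format is invented for illustration; it captures the explicit flows above but not the control-dependent or implicit ones):

    tainted = {"y"}                      # taint source: variable y

    program = [
        ("assign", "x", ["y"]),          # x = y      -> x becomes tainted
        ("assign", "z", ["c"]),          # z = c      -> z stays clean
        ("assign", "w", ["x", "z"]),     # w = x + z  -> w tainted via x
    ]

    for op, dst, srcs in program:
        # propagation rule: the destination is tainted iff any source is tainted
        if any(s in tainted for s in srcs):
            tainted.add(dst)
        else:
            tainted.discard(dst)

    print(tainted)                       # taint-sink check: {'y', 'x', 'w'}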

Instruction Classes

- Data transfer: mov, xchg, push, pop - Binary arithmetic: add, sub, imul, mul, idiv, div, inc, dec - Logical: and, or, xor, not - Control transfer: jmp, jne, call, ret, int, iret - Input/output: in, out

Algorithmic-agnostic Unpacking

- Dynamic analysis - Emulation/tracing of the sample execution until the "termination" of the packing routine. Examples: OmniUnpack, Renovo, Justin, PolyUnpack

Memory models

- Flat MM: Memory is considered a single, continuous address space from 0 to 2^32 - 1 (4GB) - Segmented MM: Memory is a group of independent address spaces called segments, each addressable separately

HOW NOT TO CODE YOUR RANSOMWARE

- Forget to delete the original files (or to wipe them). - Erase everything but forget about shadow copies. - Delete everything but the encryption key. - Key management issues. - Design your own encryption.

Branch functions

- Given a finite map φ = {a1 → b1, ..., a_n → b_n}, a branch function f_φ is a function such that: a1: jmp b1 -> a1: call f_φ a2: jmp b2 -> a2: call f_φ ... a_n: jmp b_n -> a_n: call f_φ - Obscure the control flow (hard to reconstruct the original map φ) - Misleading disassembler: junk bytes can be introduced after the call instruction

Conditional Code Obfuscation

- Historically, encryption, polymorphism and other obfuscation schemes have been primarily employed to thwart anti-virus tools and static-analysis-based approaches. - Dynamic-analysis-based approaches inherently overcome all anti-static-analysis obfuscations, but they only observe a single execution path. - Malware can exploit this limitation by employing trigger-based behaviours such as time-bombs, logic-bombs, bot-command inputs, and testing for the presence of analysers, to hide its intended behaviour. Fortunately, recent analysers provide a powerful way to discover trigger-based malicious behaviour in arbitrary malicious programs: exploration of multiple paths during the execution of a malware sample. Analyser improvements: - First, analysers may be equipped with a decryptor to reduce the search space of keys by taking the input domain into account. - Another approach is to move towards input-aware analysis. Rather than capturing binaries only, collection mechanisms should capture the interaction of the binary with its environment where possible (in the case of bots, having the related network traces). - Existing honey-pots already have the capability to capture network activity. Recording system interaction can provide more information about the inputs required by the binary

System call

- Interface between a user-space application and a service that the kernel provides. - Generally something that only the kernel has the privilege to do. - Provides useful information about process behaviour. It requires a control transfer to the OS: - int (Linux: int 0x80, Windows: int 0x2e) - sysenter/sysexit (≥ Intel Pentium II), syscall/sysret (≥ AMD K6). It is invoked by placing a system call number in %eax and parameters into general-purpose registers

Stack Layout (x86)

- LIFO data structure - "Grows" towards lower memory addresses - Stack-management registers: - stack pointer (%esp): contains the address of the last element on the stack - frame pointer (%ebp): points to the activation record of the current procedure - Assembly instructions that manipulate the stack: - push: decrements %esp, then stores an element at the address contained in %esp - pop: reads the element at address %esp, then increments %esp

AT&T instruction syntax

- Label: mnemonic source(s), destination # comment - Numerical constants are prefixed with a $ - Hexadecimal numbers start with 0x - Binary numbers start with 0b - Registers are denoted by %

Endianness

- Little-endian: *L*east *S*ignificant *B*yte stored first, at the lowest address (used by Intel x86) - Big-endian: *M*ost *S*ignificant *B*yte stored first, least significant byte last

Linear Sweep

- Locate instructions: where one instruction ends, the next begins. - Assume that everything in a section marked as code actually represents machine instructions - Starts disassembling from the first byte of the code section in a linear fashion - Disassembles one instruction after another until the end of the section is reached Pros: - It provides complete coverage of a program's code sections Cons: - No effort is made to understand the program's control flow - Compilers often mix code with data (e.g. switch statement ∼ jump table) and the algorithm cannot distinguish code from data. When an error occurs, the disassembler eventually ends up re-synchronizing with the actual instruction stream (self-repairing) Used by: objdump
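
A minimal linear sweep sketch using the Capstone bindings for Python (assuming capstone is installed; the byte buffer is a made-up example): it simply decodes one instruction after another from the start of the buffer, exactly as described above.

    from capstone import Cs, CS_ARCH_X86, CS_MODE_32

    code = b"\x55\x89\xe5\x83\xec\x10\x31\xc0\xc9\xc3"   # push/mov/sub/xor/leave/ret
    md = Cs(CS_ARCH_X86, CS_MODE_32)

    for insn in md.disasm(code, 0x08048000):             # sweep linearly from the first byte
        print(f"0x{insn.address:x}:\t{insn.mnemonic}\t{insn.op_str}")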

Packing

- Malicious code hidden under one or more layers of compression/encryption. - Decompression/decryption performed at runtime. Family-specific unpacking requires detailed knowledge of every packer, hence the need for algorithm-agnostic unpacking techniques.

Trojan horse

- Malicious program disguised as a legitimate software - Many different malicious actions: spy on sensitive user data, hide presence (e.g., rootkit), allow remote access (e.g., Back Orifice, NetBus)

Behaviour-based malware detection

- Monitor the events that characterize the execution of the program (e.g. the system calls executed) - Infer the behaviour of the program from these events - Detect high-level malicious behaviours - Can detect novel malware, since the majority of samples share the same high-level behaviours (e.g. relay spam, steal sensitive information)

Debugging

- Monitored single-process execution. - Fine-grained debugging levels (LOC, single-step) - *Breakpoint* stops your program whenever a particular point in the program is reached. - *Watchpoint* stops your program whenever the value of a variable or expression changes. - *Catchpoint* stops your program whenever a particular event occurs. - Analyse the CPU environment (memory, registers). E.g.: gdb (GNU/Linux) or, for Windows: ImmunityDebugger, OllyDbg, SoftIce, WinDbg

Fast flux

- Offline, disinfected or problematic agents are replaced with others - The botnet is typically composed of millions of agents - The identity of the code components of the infrastructure is well protected - *Multiple domains are used by the same botnet* (it is not sufficient to shut down a domain)

Iterative Optimisation

- Once a criterion function has been defined, we must find a partition of the data set that minimizes the criterion. - Exhaustive enumeration of all partitions, which guarantees the optimal solution, is infeasible. The common approach is: 1. Find some reasonable initial partition 2. Move observations from one cluster to another in order to reduce the criterion function Groups of methods: - Flat clustering algorithms (produce a set of disjoint clusters; these are the most widely used, such as K-means) - Hierarchical clustering algorithms (broadly divided into agglomerative and divisive approaches; the result is a hierarchy of nested clusters)
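
A bare-bones K-means sketch in Python/numpy (illustrative only; real implementations handle initialisation and empty clusters more carefully): it follows the two-step scheme above, starting from an initial partition and iteratively reassigning points and recomputing centres to reduce the sum-of-squared errors.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initial partition
        for _ in range(iters):
            # step 2a: assign each observation to its nearest centre
            labels = np.argmin(np.linalg.norm(X[:, None] - centres[None], axis=2), axis=1)
            # step 2b: recompute each centre (keep the old one if a cluster went empty)
            new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centres[j] for j in range(k)])
            if np.allclose(new_centres, centres):                # criterion stops improving
                break
            centres = new_centres
        return labels, centres

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
    labels, centres = kmeans(X, k=2)
    print(centres)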

Android actions

Android functionality: - Largely achieved through IPC/ICC (ioctl) - The Binder protocol is crucial to this

Clustering

- Process of organising objects into groups whose members are similar in some way (high/low intra-cluster similarity) - Clusterings are usually not "right" or "wrong"—different clusterings can reveal different things about the data - Some clustering criteria/algorithms have probabilistic interpretations E.g.: non-parametric clustering, hierarchical clustering Type: Unsupervised

Malware objectives

- Profit-oriented - Information stealing (e.g.: spyware, botnets) - Resource consumption (e.g.: botnets) - Resource rental (e.g.: botnets) - Ransom (e.g.: ransomware) - Maintain access to a compromised system such as Rootkits

Virus

- Self-replicating - Host-dependent for infection E.g.: Boot (Brain virus), overwrite, parasitic, cavity, entry point obfuscation, code integration (W95/Zmist virus)

Worm

- Self-replicating, spreads (autonomously) over network - Exploits vulnerabilities affecting a large number of hosts - Sends itself via email E.g.: Internet worm, Netsky, Sobig, Code Red, Blaster, Slammer

Breakpoint

- Software interrupts (traps) on x86. - When the CPU executes an int instruction, control transfers to the associated interrupt service routine - The return address for the ISR points to the instruction following the trap instruction. It's a mechanism used for suspending program execution to examine registers and memory locations

Dynamic techniques

- The program must be executed and monitored: - Interaction with the environment - Interaction with the OS (e.g. system calls)

Feature classes

- Start-up installation - Use of cryptographic API - File operations - Collecting user computer's information - Intervention with other processes (malicious code injection, process termination) - Disabling system recovery - ransom note (looking for typical strings, such as encrypted, protected, ransom, RSA, AES, ...) - evasion (debugging/virtualised/sandboxed environment)

Dynamic analysis

- Studying a program's properties by executing it (allowing us to observe actual executions to infer properties and behaviours (*under-approximation*)) - Environment-limited analysis - Ability to monitor the code (and data flow) as it executes - Allows precise security analysis thanks to run-time information - Debugging (finding bugs) - Instrumentation: - Add extra semantics-preserving code to a program or a process - Taint-tracking Goals of such a system: - *Visibility*: a sandbox must see as much as possible of the execution of a program. Otherwise, it risks missing interesting, potentially malicious, behaviours - *Resistance to detection*: monitoring should be hard to detect and the environment hard to fingerprint - *Scalability*: with 500k+ malware samples per day, analysis must scale up - The execution of one sample does not interfere with the execution of subsequent malware programs - Analyses should be automated

Static analysis

- Studying a program's properties without executing it (*over-approximation*) - Reverse engineering may be hampered (e.g.: obfuscation, encryption) Issues: - Opaque predicates: conditions whose outcome is known upfront (to the obfuscator) but hard to deduce statically—more complex CFGs - Anti-analysis, e.g., anti-disassembly, CFG flattening, but also packing (see next)—incomplete CFG - Indirect calls or jumps—partial CFG exploration

Execution Levels

- There are different privilege levels - These are used to separate user-land execution from kernel-level execution - Subroutines at higher privilege levels can be accessed through gates and require special setup - Usually kernel-mode is mapped to level 0 and user-mode to level 3 - System calls typically cause transition from user to kernel space

Malware fight goals

- Understand malware behaviours - Automatically identify and classify families of malware - Automatically generate effective malware detection models

Rootkit

- Used to keep access to a compromised system - Usually hides files (usually malware), processes, network connections (user/kernel level), registry keys, services, ...

AUC

Area Under the ROC Curve, used as a performance metric. The higher, the better. A random classifier has AUC = 0.5

Environment interaction

- lsof lists on its standard output file information about files opened by processes. - netstat displays the contents of various network-related data structures. - ltrace intercepts and records the dynamic library calls which are called by the executed process and the signals which are received by that process

Analysis process

1. Collect malware samples 2. Static/Dynamic analysis 3. Extract (and generalize) malicious behaviour (host/network) 4. Generate and deploy detection models Problems: - Lack of general definition of malicious behaviour - Cat-and-mouse game: attackers have much freedom - Victims often (unwittingly) help attackers

Sandbox based analysis

1. Execute a suspicious program in a sandbox (typically an emulator) 2. Monitor the execution using VM introspection 3. Identify suspicious and malicious behaviours E.g.: Anubis, CWSandbox, Cuckoo Sandbox, BitBlaze Limitations: only the behaviours associated with the taken paths can be monitored

Linear separation

1. Map the samples with the features to a vector space (e.g.: one on x and the other one on y) 2. Separate using a model (best separating hyperplane/line)

Sinkholing

1. Purchase hosting from two different hosting providers known to be unresponsive to complaints 2. Register wd.com and wd.net with two different registrars 3. Set up Apache web servers to receive bot requests 4. Record all network traffic 5. Automatically download and remove data from your hosting providers 6. Enable hosts a week early

Function invocation

1. The caller pushes the parameters on the stack 2. The caller saves the return address on the stack, and then it jumps to the callee (e.g. call <strcpy>) 3. The callee executes a prologue, that consists of the following operations: - Save %ebp on the stack - %ebp = %esp - Allocate space for local variables

Data collection principles

1. The sinkholed botnet should be operated so that any harm and/or damage to victims and targets of attacks would be minimized - Always respond with okn message - Never send new/blank configuration files 2. The sinkholed botnet should collect enough information to enable notification and remediation of affected parties - Work with law enforcement (FBI and DoD Cybercrime units) - Work with bank security officers - Work with ISPs

Exploit kit attack

2nd most used attack vector for ransomware (after emails)

EFLAGS

32-bit register used as a collection of bits representing boolean values to store the results of operations and the state of the processor (ID, VIP, VIF, AC, VM, RF, NT, IOPL, OF, DF, IF, TF, SF, ZF, AF, PF, CF) - *C*arry *F*lag: set if the last arithmetic operation carried (add) or borrowed (sub) a bit beyond the size of the register - *P*arity *F*lag: set if the number of set bits in the least significant byte is even - *A*djust *F*lag: carry of Binary Coded Decimal (BCD) arithmetic operations - *Z*ero *F*lag: set if the result of an operation is 0 - *S*ign *F*lag: set if the result of an operation is negative - *T*rap *F*lag: set to enable single-step debugging - *I*nterrupt *F*lag: set if interrupts are enabled - *D*irection *F*lag: stream direction. If set, string operations will decrement their pointer instead of incrementing it - *O*verflow *F*lag: set if signed arithmetic operations result in a value too large for the register to contain - *I*/*O* *P*rivilege *L*evel of the current process - *N*ested *T*ask flag: controls chaining of interrupts, set if the current process is linked to the next process - *R*esume *F*lag: response to debug exceptions - *V*irtual-8086 Mode: set if in 8086 compatibility mode - *A*lignment *C*heck: set if alignment checking of memory references is done - *V*irtual *I*nterrupt *F*lag: virtual image of IF - *V*irtual *I*nterrupt *P*ending flag: set if an interrupt is pending - *ID*entification flag: the ability to toggle it indicates support for the CPUID instruction

Machine Learning

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Informally: - It's a sub-field of AI - a set of algorithms that can automatically learn rules from data, to represent models All algorithms rely on sums and products of vectors and matrices, and often the algorithms have geometrical interpretations. Not all information is relevant. It tries to use only the most discriminative features/dimensions. At the end, the algorithm tries to find an optimal model that represents the input data by "minimizing" some error function

Random Forest

A kind of supervised learning which is one of the most popular and effective algorithms, because it includes several "tricks" by design to improve generalization and reduce overfitting. It uses the divide-and-conquer decision-tree intuition. Algorithm: 1. For b = 1 to B: (a) Draw a *bootstrap sample Z** of size N from the training data. (b) Grow a random-forest tree T_b on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached: i. Select m variables at random from the p variables. ii. Pick the best variable/split-point among the m. iii. Split the node into two daughter nodes. 2. Output the ensemble of trees {T_b}_1^B. To make a prediction at a new point x: Classification: let Ĉ_b(x) be the class prediction of the b-th random-forest tree; then Ĉ^B_rf(x) = majority vote {Ĉ_b(x)}_1^B Differences to a standard decision tree: - Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement, i.e. some samples will probably occur multiple times in the new data set). - For each split, consider only m randomly selected variables - Don't prune (pruning in a standard decision tree, i.e. removing parts of the tree, is used to reduce the risk of overfitting; in RF, this is taken care of by constructing an ensemble of trees on different (bootstrapped) data and different variables) - Fit B trees in such a way, and use majority voting to aggregate results
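
An illustrative scikit-learn sketch (assuming scikit-learn is available; the synthetic dataset is made up): B trees, each grown on a bootstrap sample and restricted to a random subset of m features at every split, aggregated by majority vote.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=100,     # B trees
        max_features="sqrt",  # m variables considered at each split
        bootstrap=True,       # each tree is trained on a bootstrap resample
        random_state=0,
    )
    rf.fit(X_tr, y_tr)
    print(rf.score(X_te, y_te))   # accuracy of the majority-vote ensemble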

Botnet

A network of compromised devices (bots) controlled by a botmaster (via C&C channels, fast/domain flux or push/pull/P2P). Each bot has the same domain generation algorithm and 3 fixed domains to be used if all else fails. It gets created via infection (worm, trojan via P2P, drive-by downloads, existing backdoor) and spreading

Red-pill

A program capable of detecting if it is executed in an emulator. void main() { redpill = "\x08\x7c\xe3\x04..."; if (((int (*)())redpill)()) { /* executed on a physical CPU */ return CPU; } else { /* executed on an emulated CPU */ return EMU; } }

Signature-based detection

AV maintains a database of signatures (e.g. byte patterns, regular expressions). A program is considered malicious if it matches a signature

Manhattan distance

Also known as city-block distance (defined via *L_1 norm*)

Adversarial ML

An attacker may try to evade detection or poison the training data. In spam filtering, for example, features may be linked to the presence/absence of words, and the attacker can obfuscate bad words and insert good ones. Defences: - Reactive: - timely detection of attacks - frequent re-training - decision verification - Proactive: - Security-by-Design (against white-box attacks [no probing]): secure/robust learning, attack detection, which has effects on decision boundaries (noise-specific margin and/or enclosure of legitimate training classes) - Security-by-Obscurity (against grey-box and black-box attacks [probing]): information hiding, randomisation, detection of probing attacks

Dataset

Ideally representative of the real-world population and statistically significant (10k+ samples). The dataset should also have a realistic goodware/malware ratio (e.g.: 1-10 malware for every 100-1k goodware) and reliable ground-truth labels (if available). If it has poor quality, then the results of the analysis are meaningless; that is, no general conclusion can be drawn. The problem with public datasets is that data may be poisoned by attackers, who may include "bogus" samples to make the classifier learn wrong rules

Drive-by downloads

Attacks against the web browser and/or vulnerable plug-ins, typically launched via malicious client-side scripts (JavaScript, VBScript) injected into legitimate sites (e.g.: via SQL injection). Sometimes the script is hosted on malicious sites or embedded into ads. It can also be combined with redirection, where the landing page redirects to a malicious site, which allows exploit customisation. Propagation techniques: - Remote exploit + drive-by download - Rogue antivirus (fake one on a website)

Bot

Autonomous programs performing tasks

Average linkage

Average inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities

Agglomerative Clustering

Begin with n observations and a measure (such as Euclidean distance) of all the n(n − 1) / 2 pairwise dissimilarities. Treat each observation as its own cluster. For i = n, n - 1, ..., 2: 1. Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are the least dissimilar (that is, most similar). Fuse these 2 clusters. The dissimilarity between these 2 clusters indicates the height in the dendrogram at which the fusion should be placed 2. Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters
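
A small SciPy sketch (assuming scipy/numpy are available; the toy data is made up): linkage() performs exactly this bottom-up fusion of the least dissimilar clusters, and fcluster() cuts the resulting tree into a flat partition.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

    Z = linkage(X, method="average", metric="euclidean")  # agglomerative, average linkage
    labels = fcluster(Z, t=2, criterion="maxclust")       # extract 2 disjoint clusters
    print(labels)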

Dynamic Analysis for Android

Challenges: - Android apps are interactive and hard to stimulate (pragmatically, we stimulate what we can, e.g.: using MonkeyRunner) - State-modifying actions manifest at multiple abstractions - Traditional OS interactions (e.g. filesystem/network interactions) - Android-specific behaviours (e.g. SMS, phone calls)

BotMining

Clustering analysis of network traffic and structure-independent botnet detection. It is assumed that bots within the same botnet are characterized by similar malicious activities and C&C communications. The C-plane monitor captures network flows and records information on who is talking to whom. Each flow contains: times/duration, IP/port, number of packets, bytes transferred. The A-plane monitor logs information on who is doing what; it analyses outbound traffic through the monitored network and detects several malicious activities the internal hosts may perform. The C-plane cluster is responsible for: - reading the logs generated by the C-plane monitor - finding clusters of machines that share similar communication patterns (performs basic filtering, performs white listing, multi-step clustering). The A-plane cluster has a client list with malicious activity and clusters according to activity type (scan, spam, DDoS, binary downloading, exploit downloading) or activity features. Cross-plane correlation: the idea is to cross-check clusters in the two planes to find intersections. A score s(h) is computed for each host h

C&C

Command & Control. - Centralised control: IRC, HTTP - Distributed control: P2P - Push (the bot silently waits for commands from the "commander") vs Pull (the bot repeatedly queries the "commander" to see if there is new work to do). Bots locate it based on: - a hardcoded IP address - FastFlux: hardcoded FQDN or dynamically generated FQDNs (1 FQDN → 1 or more IP addresses) - DomainFlux: hardcoded URL or dynamically generated URLs - search keys in the P2P network These location mechanisms can be countered with network-level, DNS or HTTP ACLs

CFG

Control Flow Graph. Given a program P, its control flow graph is a directed graph G = (V , E ) representing all the paths a program might traverse during its execution. - V is the set of basic blocks - E ⊆ V × V is the set of edges representing control flow between basic blocks - a control flow edge from block u to v is e = (u, v) ∈ E

DKOM

Direct Kernel Object Manipulation: in-memory alteration of a kernel structure with no hook/patch needed. For example, malware.exe disappears from the list of running processes by unlinking its EPROCESS entry from the doubly-linked process list; since scheduling is thread-based, the process keeps running

Distance to a line

Distance of a point (x0, y0) to a line Ax + By + C = 0: d = |A·x0 + B·y0 + C| / √(A² + B²)

Distance to a hyperplane

Distance of a point x to a hyperplane w^T x + b = 0: d = |w^T x + b| / ∥w∥ *Hyperplane*: subspace whose dimension is one less than that of its ambient space

DDoS

Distributed Denial of Service

Data definition

Data objects are defined in a data segment using the syntax: label type data1, data2, ... Examples: .data myvar .long 0x12345678, 0x23456789 bar .word 0x1234 mystr .asciz "foo"

History

Early 90s (IRC bots) -> 98-99-00 (Trojan horse & remote control, DDoS tools & distribution) -> 01-now (worms & spreading)

Crypto-Ransomware

Encrypt personal files to make them inaccessible. Stages of infection: 0. Break-in (phishing/spam, exploit kits, self-propagation, exploiting server vulnerabilities, malvertising, ...) 1. Installation (copies itself into various system locations and makes itself auto-start across reboots) 2. Contacting the HQ 3. Handshake and keys (key generation is usually done locally at run-time) 4. Encryption (e.g.: hard-coded RSA public key, 1 AES/RSA key generated locally, MBR overwrite to boot a custom kernel, ...), which usually targets specific files (e.g.: documents and images) 5. Extortion (6. Recovery, which is easier for locker ransomware and computationally infeasible for crypto ransomware)

Ransomware live detection

For better results, perform feature selection (the main features being: API stats, dropped file extensions, file operations, registry keys): - Reduces the number of features: simpler ML algorithms. - Shortens the training (and prediction) time and, in many cases, prevents overfitting. - Key to making the algorithm more efficient and achieving better accuracy.

Finding the "nearest" pair of clusters

For two clusters ω_j and ω_k of sizes n_j and n_k: - Minimum distance (single linkage): d_min(ω_j, ω_k) := min_{x∈ω_j, y∈ω_k} ∥x − y∥ - Maximum distance (complete linkage): d_max(ω_j, ω_k) := max_{x∈ω_j, y∈ω_k} ∥x − y∥ - Average distance (average linkage): d_avg(ω_j, ω_k) := (1 / (n_j n_k)) ∑_{x∈ω_j} ∑_{y∈ω_k} ∥x − y∥ - Mean distance (centroid linkage): d_mean(ω_j, ω_k) := ∥μ_j − μ_k∥, where μ_j and μ_k are the means of the 2 clusters
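
A worked numpy sketch of the four definitions on two toy clusters (illustrative, not from the notes):

    import numpy as np

    A = np.array([[0.0, 0.0], [1.0, 0.0]])
    B = np.array([[4.0, 0.0], [5.0, 1.0]])

    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all |A| x |B| distances

    d_min  = pairwise.min()                                   # single linkage
    d_max  = pairwise.max()                                   # complete linkage
    d_avg  = pairwise.mean()                                  # average linkage
    d_mean = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid linkage
    print(d_min, d_max, d_avg, d_mean)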

Kernel function

Function that corresponds to an inner product in some expanded feature space. A linear classifier relies on an inner product between vectors: K(x_i, x_j) = x_i^T x_j. If every data point is mapped into a high-dimensional space via some transformation φ : x → φ(x), the inner product becomes: K(x_i, x_j) = φ(x_i)^T φ(x_j) Why use kernels? - Make a non-separable problem separable - Map data into a better representational space Common kernels: linear, polynomial, Radial Basis Function (RBF, also known as Gaussian kernel). With the RBF/Gaussian kernel, it is possible to create a non-linear separation in the original space (e.g. an approximately circular decision boundary) by solving a linear separation problem in an alternative space
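
A scikit-learn sketch (assumed library, synthetic data) contrasting a linear kernel with the RBF kernel on data that is not linearly separable in the original 2-D space (two concentric circles):

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    linear = SVC(kernel="linear").fit(X, y)
    rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)

    print("linear kernel accuracy:", linear.score(X, y))  # poor: no separating line exists
    print("RBF kernel accuracy:   ", rbf.score(X, y))     # ~1.0: separable in the kernel space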

Criterion function for clustering

Function to be optimised. The most widely used one is the sum-of-squared-errors over the clusters ω_j. This criterion measures how well the data set X = {x1, x2, ..., x_n} is represented by the cluster centres μ = {μ1, ..., μ_K}, (K ≤ n). Clustering methods that use this criterion are called minimum variance

Classification

Given a labelled dataset, find a model that separates instances into classes. Its output is often the probability of belonging to a certain class.

Call graph

Given a program P, its call graph is a directed graph C = (R, E). - R is the set of procedures - E ⊆ R × R is the set of edges indicating the caller-callee relation - a caller-callee edge from the caller procedure u to the callee procedure v is e = (u, v) ∈ E

Regression

Given some points, try to generalize and predict real-valued numbers, finding the equation of the curve/line that represents the data distribution of the points. For that, we need a concept of error that we want to minimize. For polynomial curves, we need to choose an order M (M = 0 usually being underfitted); the weights w are then obtained by minimizing the considered error. The error is typically measured as the *mean squared error*.

Ransomware

Goals: - render the victim's system unusable, - and ask the user to pay a ransom to revert the damage. Notable classes: locker-ransomware and crypto-ransomware.

ROC curve

Graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied

F1-Score

Harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall)

Self-emulating malware

Heuristics to detect the end of the unpacking are based on the "execution of previously written code". 1. The code of the malware is transformed in bytecode 2. Bytecode interpreted at run-time by a VM 3. Bytecode mutated in each sample

Hooking

Hijack the flow of the execution by modifying a code pointer. Examples: - User-space: IAT - Kernel-space: IDT (1+ handler(s): 0x2e → KiSystemService), MSR, SSDT (1+ descriptor(s): 0x74 → NtOpenFile). Easy to detect: if (IDT[0x2e] != KiSystemService || SSDT[0x74] != NtOpenFile) then ...

Anti-debugging

How can a process detect if it is currently being traced? ptrace example (since a process can only have 1 parent/tracer): #include <stdio.h> #include <sys/ptrace.h> int main() { if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0) { printf("You are debugging me... bye!\n"); return 1; } printf("Hello world!\n"); return 0; }

Thwarting linear sweep

How? - Increase the number of candidates by using *branch flipping*

Soft margin classification

If the training data is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples. So the idea is to allow some errors, but we still try to minimize training-set errors and to place the hyperplane "far" from each class (large margin)

Prologue

Instructions block that is called as a function is invoked (by the caller) and typically contains: push %ebp mov %esp, %ebp sub $n, %esp

Epilogue

Instructions block executed by the callee function as it terminates. It: 1. deallocates local variables (%esp = %ebp), 2. restores the base pointer of the caller function, 3. resumes execution from the saved return address. It typically consists of the following instructions: leave ret Or: mov %ebp, %esp pop %ebp ret

Signed integers

Integers in 2's complement

IDA Pro

Interactive Disassembler Professional. Recursive traversal disassembler; detailed (preliminary) analysis that includes: - Function boundaries, library calls and their arguments data types - Control-flow graph, call graph (proximity view) - Functions window, IDA view, hex view, . . . - Automatic tagging of string constants - Code and data cross references - Comments, variable/memory addresses rename It handles pretty much any architecture

Junk insertion

Introduces disassembly errors by inserting junk bytes at selected locations into the code stream where the disassembler expects code

Sec SVM

Intuition: - Have more evenly-distributed feature weights w - In this way, the attacker needs to modify many features to evade the classifier. Where: - R(f) is the regularization term, L(f, D) is the hinge loss function, and C is the trade-off factor between R(f) and L(f, D) - D is the dataset, and each sample x_i has ground-truth label y_i ∈ {−1, +1} - f is the classifier function, where f(x_i) = w^T x_i + b is the (signed) distance from the hyperplane in the SVM - w^lb_k and w^ub_k are lower and upper bounds specific to each feature x_k (as some features may be harder than others to modify)

MonkeyRunner

It can: - Install an apk - Invoke an activity with the installed apk - Send keystrokes to type a text message - Click on arbitrary locations on the screen - Take screenshots (useful for feedback) - Wake up device if it goes to sleep But: - it is still notoriously difficult to write a good stimulation script - Android apps are highly interactive and you need to get right both the context and location of the stimulation in the GUI - A change in the GUI means a new test script needs to be developed

Recursive Traversal

It needs to make assumptions on what to disassemble first and focuses on the concept of control flow; instructions are classified as: - *Sequential flow*: pass execution to the next instruction that immediately follows (add, mov, push, pop, ...) - *Conditional branching*: if the condition is true the branch is taken and the instruction pointer must change to reflect the target of the branch, otherwise it continues in a linear fashion (jnz, jne, ...). In a static context this algorithm disassembles both paths - *Unconditional branching*: the branch is taken without any condition; the algorithm follows the (execution) flow (jmp) - *Function call*: like unconditional jumps, but they return to the instruction immediately following the call - *Return*: instructions that may modify the flow of the program add their target addresses to a list of deferred disassembly; when a return instruction is reached, an address is popped from the list and the algorithm continues from there (recursive algorithm). Pros: - Distinguishes code from data Cons: - Inability to follow indirect code paths (indirect code invocation issue) Used by: IDA Pro
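
A deliberately naive recursive-traversal sketch on top of Capstone (assumed library; target extraction only handles direct hex-immediate targets, unlike a real disassembler such as IDA Pro), showing the worklist of deferred addresses and the handling of the instruction classes above:

    from capstone import Cs, CS_ARCH_X86, CS_MODE_32

    def recursive_traversal(code, base):
        md = Cs(CS_ARCH_X86, CS_MODE_32)
        seen, worklist = set(), [base]                  # deferred-disassembly list
        while worklist:
            addr = worklist.pop()
            while addr not in seen and base <= addr < base + len(code):
                insns = list(md.disasm(code[addr - base:], addr, count=1))
                if not insns:                           # undecodable byte: abandon this path
                    break
                insn = insns[0]
                seen.add(addr)
                print(f"0x{insn.address:x}: {insn.mnemonic} {insn.op_str}")
                direct = insn.op_str.startswith("0x")   # naive: direct targets only
                if direct and (insn.mnemonic.startswith("j") or insn.mnemonic == "call"):
                    worklist.append(int(insn.op_str, 16))   # defer the branch/call target
                if insn.mnemonic in ("jmp", "ret"):     # no sequential fall-through
                    break
                addr += insn.size                       # sequential flow / fall-through
        return seen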

Inversion

It occurs when two clusters are fused at a height below either of the individual clusters in the dendrogram. It leads to difficulties in visualization as well as in the interpretation of the dendrogram

CopperDroid

Key insights: - All interesting behaviours achieved through system calls: · Low-level, OS semantics (e.g. network access) · High-level, Android semantics (e.g. phone call) Novelty: - Automatically reconstruct behaviours from system calls - With no changes to the Android OS image

Line representation of a decision function

Line that maximizes the margin between the two classes. Consider the 2-class case with labels from the set {−1, 1}, i.e. y_i ∈ {1, −1}. Define the hyperplane with maximum separation such that, for the i-th sample, y_i (w^T x_i + b) ≥ 1

Locker-Ransomware

Lock the victims' computer to prevent them from using it.

Assembly

Low-level processor-specific symbolic language which is directly translated into binary format

Complete linkage

Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities

Basic block

Maximal sequence of consecutive instructions with a single entry and single exit without CTI interleaving in the block of code

Distance measure

Metric: a function d(x, y) for measuring the distance between 2 vectors x and y is a metric if it satisfies the following properties: - d(x, y) ≥ 0 - d(x, y) = 0 ⇐⇒ x = y - d(x, y) = d(y, x) - d(x, y) ≤ d(x, z) + d(z, y) In vector spaces (where subtraction is allowed), we often define d(x, y) = ∥x − y∥ using a norm ∥ · ∥ The most commonly used metric is the Minkowski distance

Single linkage

Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities (tends to produce the least balanced clusters and is less popular)

Confusion matrix

Table reporting the counts of true positives, false positives, true negatives and false negatives; one of the performance metrics

Cuckoo

Open source automated malware analysis system. It automatically runs and analyses files and collect comprehensive analysis results: outline what the malware does while running inside an isolated OS. Key features: - Completely automated. - Run concurrent analysis. - Able to trace processes recursively (e.g., child). - Customizable analysis process. - Create behavioural signatures. - Customize processing and reporting Use cases: - Can be used as a standalone application or can be integrated in larger frameworks: - extremely modular design. - It can be used to analyse: - generic Windows EXE; - DLL files; - PDF; - Microsoft Office documents; - URLs and HTML files; - PHP scripts; - CPL files; - VB scripts; - ZIP; - JAR; - Python files; - ... Events: - Windows API calls traces. - Copies of files. - Dump of the process memory. - Full memory dump. - Screenshots. - Network dump (PCAP format)

Data collection

Phase 1 of the ML pipeline. Unlike many research fields (e.g., social network analysis), in security it is hard to obtain datasets but "luckily for us", in the case of malware analysis it is easier. Possible data sources: - Private company data (e.g., network traffic, emails, ...) - Public datasets (good availability of malware, less for other types of security data): VirusShare, secrepo, VirusTotal, ...

ML pipeline

Phase 1: Data collection Phase 2: Pre-processing and feature engineering Phase 3: Model selection and training Phase 4: Testing and evaluation Phase 5: Evaluate robustness against time evolution and adversaries

Feature engineering

Phase 2 of the ML pipeline, it's where we need to define and extract *features* (input of our machine learning model which is usually defined from *domain knowledge*) from raw samples in our data collection. Some types of features: - *numerical*: filesize, number of API calls - *categorical*: a certificate feature that can take three values (signed, unsigned, signed with expired certificate). *Static* features: - Features extracted from metadata (Android Manifest file, ELF/PE) / code (API call frequencies, or graph structures such as CFG) *Dynamic* features: - Features extracted from dynamic analysis of the code: system call sequences, HTTP requests metadata, URLs called, ... *Note*: Static and dynamic features inherit all the weaknesses of static and dynamic analysis (e.g.: obfuscation for static analysis and coverage for dynamic analysis). Good features highlight commonalities between members of a class and differences between members of different classes

Model selection and training

Phase 3 of the ML pipeline which includes: - *Training* (used to learn a model) - *Validation* (used for hyper-parameter tuning of the ML algorithm) - *Testing* (used to see performance on a real environment). The model selection should be done by answering the following questions: - Do we have enough training data? Small -> K-Nearest Neighbours Big -> Neural Networks - Is the data linearly separable? Rule of thumb: choose the simplest model that can solve our problem

Testing and evaluation

Phase 4 of the ML pipeline

Evaluate robustness against time evolution and adversaries

Phase 5 of the ML pipeline. - Robustness against time: as time passes, malware authors develop new malware, and existing malware evolves into new versions and polymorphic variants. The distribution at "test time" may no longer reflect the distribution at "training time" - Robustness against adversaries: a sophisticated attacker may be aware that we are deploying a malware detection model based on ML and may try to evade detection or poison the training data. Detection performance decays over time. In order to fight concept drift: - Periodic re-training - Evaluate your classifier's performance with respect to time However, it may not be enough: the distribution of the arriving data may change rapidly and unexpectedly, so you would estimate a misleading "time decay". Adversarial ML can be seen as a worst-case situation of concept drift

Benign bot

Program used on Internet Relay Chat (IRC) that reacts to events in IRC channels and typically offers useful services (e.g.: NickServ)

Junk byte

Properties: - must be partial instructions - must be inserted in such a way that they are unreachable at runtime Candidate block: - have junk bytes inserted before it - execution cannot fall through the candidate block - basic block immediately before a candidate block must end with an unconditional branch

Wiper

Ransomware that is permanently destructive aka ''destructionware''

Sandbox

Security mechanism for separating running programs: - often used to execute untested or untrusted programs or code - no (or limited) risk to harm the host machine or operating system - provides a tightly controlled set of resources for guest programs to run in - it can be based on virtualization (which can be used to emulate something). Pros: - automate the whole analysis process; - process high volumes of malware; - get the actual executed code; - can be very effective if used smartly. Cons: - can be expensive; - some portions of the code might not be triggered; - environment could be detected.

Code overlapping

Sharing of code on different levels

Feature representation

ML algorithms need numbers and matrices as input and are often designed to work in Euclidean space. Thus, features need to be mapped into a Euclidean space, creating a feature vector out of the interesting features. The most common way to do this is to count occurrences (e.g.: an app calls the Telephony manager 20 times). Alternatively, the presence/absence of a feature is represented by a boolean (e.g.: an app calls the Telephony manager).
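
A scikit-learn sketch (assuming a recent scikit-learn; the feature names and values are invented) turning per-app feature dictionaries, with counts or boolean presence flags, into the numeric vectors ML algorithms expect:

    from sklearn.feature_extraction import DictVectorizer

    apps = [
        {"calls_TelephonyManager": 20, "perm_SEND_SMS": 1, "filesize_kb": 830},
        {"calls_TelephonyManager": 0,  "perm_SEND_SMS": 0, "filesize_kb": 120},
    ]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(apps)            # one row per app, one column per feature
    print(vec.get_feature_names_out())
    print(X)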

Emulator

Software program that simulates the functionality of another program or piece of hardware (e.g. a CPU—see for instance QEMU). When a program P runs on top of emulated hardware, the system collects detailed information about the execution of P - Can potentially detect evasion attempts - Drawback: the software layer incurs a performance penalty. The implications for a whole-system sandbox: - One can install and run an actual OS on top of the emulator - Malware executes on top of a real OS - Fingerprinting of the analysis environment is much more difficult for malware - The interface offered by a processor is (much) simpler than the interface provided by a modern OS - A system emulator has great visibility - Visibility of every executed instruction and memory access: ability to faithfully reconstruct low- and high-level semantics - Challenges: - *Semantic gap*: instructions and memory accesses need to be mapped and associated to OS semantics (virtual machine introspection) - *Performance*: one could potentially distinguish between trusted (e.g. kernel) and untrusted code, with trusted code executed in a virtualised fashion

Hierarchical Clustering

Sometimes it is desirable to obtain a hierarchical representation of data, with clusters and sub-clusters arranged in a tree-structured fashion. Methods: - agglomerative (i.e. bottom-up) - divisive (i.e. top-down) Hierarchical methods actually produce several partitions; one for each level of the tree. However, for many applications we will want to extract a set of disjoint clusters. In order to turn the nested partitions into a single flat partitioning, we cut the dendrogram, based on a cutting criterion that can be defined using a threshold

Non-parametric clustering

Steps: - Defining a measure of (dis)similarity between observations - Defining a criterion function for clustering - Defining an algorithm to minimize (or maximize) the criterion function

DREBIN

Supervised learning classifier which relies on a linear SVM to separate malicious and benign applications. It uses static binary features extracted from the Android .apk (8 classes of features in particular)

SVM

Support Vector Machines: supervised classification algorithm which is highly efficient (just a convex optimization problem) and generalizes well. Objective: find an optimal hyperplane that separates linearly separable data while maximizing the separation (margin). Inputs: - Set of training samples, each containing the same number of features - For each of the training samples, the ground truth y which tells the class to which it belongs Outputs: a set of weights (one for each feature) whose linear combination predicts the value of y Formal optimisation problem: min_f R(f) + C · L(f, D) In which: - f(x) = w^T x + b, where w is the vector of feature weights, and b is the bias - R(f) is called the regularization term, used to avoid overfitting (i.e. to avoid the classifier learning weight parameters that are too specific to the training data) - L(f, D) is the hinge loss function computed on the training data D - C is the trade-off hyper-parameter between loss and regularization
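
A scikit-learn sketch with a linear kernel (assumed library, synthetic data): after training, coef_ holds the feature-weight vector w and intercept_ the bias b, so the decision function is f(x) = w^T x + b and its sign gives the predicted class.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=200, centers=2, random_state=0)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]

    x = X[0]
    print(np.dot(w, x) + b)             # w^T x + b for one sample...
    print(clf.decision_function([x]))   # ...matches sklearn's decision_function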

Recall

TP / (TP + FN)

Precision

TP / (TP + FP)

Disassembly

Task consisting in taking a binary blob, setting code and data apart, and translating machine code to mnemonic instructions. With a disassembled program, we can: locate functions, recognize jumps, identify local variables, understand the program behaviour without running it. Issues: - Code and data in the same address space (how to distinguish them?) - Variable-length instruction - Indirect control transfer instructions - Basic blocks - At compile-time some information may disappear (e.g.: variable names, type information, macro & comments) - Identifying functions and function parameters

Code obfuscation

Techniques that preserve the program's semantics and functionality while, at the same time, making it more difficult for the analyst to extract and comprehend the program's structure

Attack Strategy

The attacker knowledge space Θ consists of: - the data D - the feature space X - the classification function f (including the trained parameters). Depending on what the attacker knows (or does not), we will have different attack scenarios. Perfect knowledge: if the attacker knows θ = (D, X, f), then we say that the attacker has perfect knowledge of the system.

No Free Lunch Theorem

The best algorithm depends on the specific task

Centroid linkage

The dissimilarity between the centroid for cluster A (the mean vector) and the centroid for cluster B. It can result in undesirable inversions

Feature space

The general idea is that the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable

IRC Botnet infiltration

The infiltrator can connect to the C&C server as a bot and can see all the bots connected, he can also receive commands from the botmaster, and he has the ability to send commands to other bots

HTTP Botnet infiltration

The infiltrator can connect to the C&C server as a bot but cannot see the other bots connected; a server-side infiltrator, however, can

Underfitting

The model obtained by the ML algorithm can neither model the training data nor generalise. To avoid that, we can find the lowest error where training and testing error are similar and avoid erroneous assumptions and simple models about the training dataset that do not generalise. E.g.: Robbers carry firearms

Overfitting

The model obtained by the ML algorithm is too suited to the training set and not general enough. It can be detected when training errors and test errors diverge too much. It can be avoided by avoiding modelling noise in the training dataset. E.g.: Robbers wear masks, operate after dark and flee in sedans

Pre-processing

The other part of the 2nd phase of the ML pipeline. Many ML models require datasets to be standardised. Otherwise: - The optimisation function used may behave sub-optimally if the data isn't normalised and scaled (e.g: between 0 and 1). - ML estimators may behave unexpectedly if the input data is not normally distributed. - Many elements of the objective function assume features are centred around 0 and variance in the same order. - If a feature has variance that is much larger than other features, it would skew the estimator in its favour.

Virtualisation

The program P runs on actual hardware: - The virtualisation software (hypervisor) only controls and mediates the accesses of different programs (or different VMs) to the underlying hardware - The VMs are independent and isolated from each other - However, execution occupies actual physical resources and the hypervisor (and the malware analysis system) cannot run simultaneously—this hinders the collection of detailed information about the execution of the monitored program - Hard to hide the hypervisor from malicious code - However, execution is almost at native speed

Analysis tools

They can be run as an alternative or alongside Cuckoo: - Kali Linux: network analysis & DNS server. - IDA Pro: industry-standard disassembler. • flow of the program; • list of imported Windows API functions; • list of strings found in the executable. - OllyDbg: Windows debugger. - PEview: provides an overview of a portable executable's sections (e.g. mismatch in size due to packing). - PEiD: to identify the use of packers. - Sysinternals Process Monitor: • displays a very detailed view of file, registry, and network operations made by running processes; • dropped files or autorun installs can be tracked. - Sysinternals Process Explorer: • a much more powerful version of the Windows Task Manager; • it adds support for viewing strings in memory, checking parents of processes, more powerful process termination; • and an integrity check for injected Windows processes like svchost. - RegShot: allows an analyst to take two snapshots of the registry, • before and after the execution of malware, • comparative summary that indicates the process's registry operations.

Dendrogram

Tree-structured graph used to visualise the result of a hierarchical clustering calculation. A fuse (= merge) is shown as a horizontal line connecting two clusters. The y-axis coordinate of the line corresponds to the (dis)similarity of the merged clusters. Set representation: {{x1, {x2, x3}}, {{{x4, x5}, {x6, x7}}, x8}} (however, the set representation cannot express the quantitative information).
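
A SciPy sketch (assumed library, toy data): dendrogram() with no_plot=True returns the tree layout without drawing a figure, and cutting at a distance threshold turns the nested partitions into one flat clustering.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

    Z = linkage(X, method="complete")
    tree = dendrogram(Z, no_plot=True)                  # merge heights and leaf order only
    labels = fcluster(Z, t=2.0, criterion="distance")   # cut the dendrogram at height 2.0
    print(tree["ivl"])                                  # leaves in dendrogram order
    print(labels)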

Torpig botnet

Trojan horse - Distributed via the Mebroot "malware platform" - Injects itself into 29 different applications as DLL - Steals sensitive information (passwords, HTTP POST data) - HTTP injection for phishing - Uses "encrypted" HTTP as C&C protocol - Uses domain flux to locate C&C server Mebroot - Spreads via drive-by downloads - Sophisticated rootkit (overwrites master boot record)

Spam

Unsolicited email

K-Fold cross-validation

Useful for hyper-parameter optimization. Warning: it's not suitable for timestamped samples as it may cause future samples to be included in the training datasets
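
A scikit-learn sketch (assumed library, synthetic data): standard k-fold CV for a simple hyper-parameter sweep, plus a time-ordered splitter for timestamped samples so that no "future" data leaks into the training folds.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, TimeSeriesSplit
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)

    for C in (0.1, 1.0, 10.0):                                   # hyper-parameter sweep
        scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
        print(C, scores.mean())

    # If samples are ordered by timestamp, prefer splits that respect time order:
    temporal = cross_val_score(SVC(), X, y, cv=TimeSeriesSplit(n_splits=5))
    print(temporal.mean())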

Non-linear SVM

Useful when datasets aren't linearly separable (so they end up noisy in a linear space); another approach is to map the data into a higher-dimensional space (e.g.: from 1D to 2D)

Cluster validity

Validity that is highly subjective in unsupervised learning in comparison to supervised learning where a clear objective function is known (e.g.: MSE). The choice of the (dis)similarity measure and criterion function will have a major impact on the final clustering produced by the algorithms

Thwarting recursive traversal

What can be exploited? - disassemblers assume that control transfer instructions behave reasonably (2 targets: the branch target and the fall-through to the next instruction) - difficulty in identifying indirect control transfers Techniques: - *branch functions* - *opaque predicates*: disguise an unconditional branch as a conditional branch that always goes in one direction, using predicates that always evaluate to either true or false - *jump table spoofing*: insert artificial jump tables to mislead the disassembler; the disassembler analyses jumps to the table to identify target addresses; fake jump table entries have targets that are locations of junk bytes

ptrace

It allows one process (the tracing process) to control another (the traced process). The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement breakpoint debugging and system call tracing.

