Malicious Software (IY3840)
Metamorphic malware
"Body-polymorphic": creates semantically equivalent but completely different versions of its code at each infection. It works by: 1. Analysing its own code 2. Splitting the code into blocks 3. Mutating each block separately
Supervised Learning (classification)
"Training with a teacher". - Known number of classes - Learning from a labelled training set - Used to classify future observations
Unsupervised Learning (Clustering)
"Training without a teacher" (no correct answers given, no prior knowledge). Given a collection of objects X = (x_1, x_2, ..., x_n) without class labels y_i, there are methods for building a model that captures the structure of the data while finding the "natural" grouping of instances. Uses: - Labelling large datasets can be a costly procedure - Class labels may not be known beforehand - Large datasets can be compressed into a small set of prototypes
Accuracy
(TP + TN) / (TP + FP + TN + FN) Where: - TP is True Positives - FP is False Positives - TN is True Negatives - FN is False Negatives However, it can be misleading when datasets are very imbalanced (e.g.: a classifier that always predicts "benign" will have 99% accuracy on a dataset with 99% goodware and 1% malware)
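The imbalance pitfall can be checked numerically. This is a minimal sketch on a hypothetical dataset of 990 goodware and 10 malware samples; `recall` is not defined in this card and is included only as an assumed companion metric to show what accuracy hides.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Fraction of actual positives (malware) that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical dataset: 990 goodware, 10 malware.
# A classifier that always answers "benign" never produces a positive.
tp, fp, tn, fn = 0, 0, 990, 10
print(accuracy(tp, fp, tn, fn))  # 0.99
print(recall(tp, fn))            # 0.0
```

The classifier looks 99% accurate while detecting none of the malware.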
Evasion attacks
*Formalisation:* - z* is the optimally modified malware sample for evasion - Ω(z) is the set of all possible modifications of the malware sample z (as long as the code still works), and z' ∈ Ω(z) is one such modified sample - ŵ is the vector of estimated feature weights - x' is the feature vector corresponding to the altered sample z' *Possible Scenarios:* - Zero-effort Attack: θ = () - DexGuard-based Obfuscation Attacks: θ = () + PRAGuard obfuscations - Mimicry Attack: θ = (D̂, X), where D̂ is a surrogate dataset and X is the feature space - Limited-Knowledge (LK) Attacks: θ = (D̂, X, f̂), where f̂ is the classifier whose parameters are not known to the attacker - Perfect-Knowledge (PK) Attacks: θ = (D, X, f), i.e., the attacker knows the w and b parameters of the SVM
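The symbols in the formalisation are usually tied together by an optimisation problem. The objective itself is not stated in this card, so the following is a hedged reconstruction. It assumes a linear classifier whose score is lower for samples judged benign, so that the attacker looks for the feasible modification with the lowest estimated score:

```latex
z^{*} = \operatorname*{arg\,min}_{z' \in \Omega(z)} \hat{w}^{\top} x'
```

Here x' is the feature vector extracted from z', and ŵ is the attacker's estimate of the weights (exact in the PK scenario, learned from a surrogate in the LK scenario).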
Registers
*IA-32*: - General purpose: - *a*ccumulator register (ax): used in arithmetic operations - *c*ounter register (cx): used in shift/rotate instructions and loops - *d*ata register (dx): used in arithmetic operations and I/O operations - *b*ase register (bx): used as a pointer to data (located in DS, when in segmented mode) - *s*tack *p*ointer register (sp): pointer to the top of the stack - stack *b*ase *p*ointer register (bp): used to point to the base of the stack - *s*ource *i*ndex register (si): used as a pointer to a source in stream operations - *d*estination *i*ndex register (di): used as a pointer to a destination in stream operations - (Extended) program counter/instruction pointer: %eip (contains the address of the next instruction to be executed if no branching is done) - %eflags (ZF, SF, CF, ...) - Segment registers: CS, DS, SS, ES, FS, GS - Used to select segments (e.g. code, data, stack) - Program status and control: EFLAGS - The instruction pointer: EIP - Points to the next instruction to be executed - Cannot be read or set explicitly - It is modified by jump and call/return instructions - Can be read by executing a call and checking the value pushed on the stack - Floating point units and MMX/XMM registers Access modes: - 8-bit (only for AX, CX, DX, BX): A*L* for the lower half and A*H* for the high half of AX - 16-bit: ..X (so the regular abbreviations, e.g.: AX) - 32-bit: *E*... (e.g.: EAX) - 64-bit: *R*... (e.g.: RAX)
Euclidean distance
*L_2 norm*
Chebyshev distance
*L_∞ norm*
Minkowski distance
*L_k norm*: The choice of an appropriate value of *k* depends on the amount of emphasis that you would like to give to the larger differences between the components. Special cases: Manhattan or city-block distance (k = 1), Euclidean distance (k = 2), Chebyshev distance (k → ∞)
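A minimal sketch of the L_k family, assuming plain tuples as vectors. The function names are illustrative; the point is that larger k gives more weight to the largest component difference, with Chebyshev as the limiting case.

```python
def minkowski(x, y, k):
    """L_k distance between two equal-length vectors (k >= 1)."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)


def chebyshev(x, y):
    """L_inf distance: the largest per-component difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))  # Manhattan: 7.0
print(minkowski(x, y, 2))  # Euclidean: 5.0
print(chebyshev(x, y))     # Chebyshev: 4
```

For large k (for example `minkowski(x, y, 50)`) the result approaches the Chebyshev value.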
Malware
*Malicious software*: unwanted software and executable code that is used to perform an unauthorized, often harmful, action on a computing device. It is an umbrella-term for various types of harmful software, including viruses, worms, Trojans, rootkits, and botnets. Traditional ones are written in ASM/C/macro code and have unprotected/unobfuscated payloads
strace
*System call tracer* which intercepts and records the system calls made by a process and the signals received by it. - traces system calls, parameters, signals, ... - traces child processes - can attach to a running process - leverages ptrace
x86 ASM
- (Slightly) higher-level language than machine language. - Program is made of - directives: commands for the assembler (e.g. .data identifies a section with variables) - instructions: actual operations (e.g. jmp 8048f3f) - 2 possible syntaxes, with different ordering of the operands! - AT&T syntax (objdump, GNU Assembler) - DOS/Intel syntax (Microsoft Assembler, Nasm) - Instructions can be modified using suffixes: *b*yte (8 bits), *w*ord (16 bits), *l*ong (32 bits), *q*uad (64 bits), e.g. movl %ecx,%eax - Addresses are specified by using a segment selector and an offset; the selectors are held in the segment registers CS, DS, SS, ES, FS, GS (Code Segment, Data Segment, Stack Segment, Extra Segment, F Segment, G Segment) - Memory access is of the form displacement(%base, %index, scale) where the resulting address is displacement + %base + %index * scale - movl 0xffffff98(%ebp),%eax copies the contents of the memory pointed to by ebp - 104 into eax - mov (%eax),%eax copies the contents of the memory pointed to by eax into eax - movl %eax,15(%edx,%ecx,2) moves the contents of eax into the memory at address 15 + edx + ecx * 2 - mov $0x804a0e4,%ebx copies the value 0x804a0e4 into ebx - mov 0x804a0e4,%eax copies the content of memory at address 0x804a0e4 into eax - mov ax, [di] (Intel syntax) copies the 16-bit value stored at the address held in di into ax
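The address arithmetic of the AT&T form displacement(%base, %index, scale) can be checked with a small sketch. The register values below are hypothetical and chosen only for illustration. The first helper shows why the displacement 0xffffff98 reads as ebp - 104.

```python
def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def effective_address(displacement=0, base=0, index=0, scale=1):
    """displacement(%base, %index, scale) -> 32-bit address."""
    return (displacement + base + index * scale) & 0xFFFFFFFF


print(to_signed32(0xFFFFFF98))  # -104

# Hypothetical register values, chosen only for illustration.
edx, ecx = 0x1000, 0x10
print(hex(effective_address(15, edx, ecx, 2)))  # 0x102f
```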
ML challenges (in security)
- *High cost of FNs*: a FN is a missed "attack", but reducing FNs implies increased FPs - *Hard to find public datasets*: mainly due to privacy reasons, there is a lack of datasets for security (except maybe for malware analysis) - *High cost of labelling*: analysing a malware sample or a traffic trace may require a huge amount of time - *Explainability is hard*: many events may be correlated but it is not trivial to determine causation - *Imbalanced datasets*: the vast majority of events/applications are benign, only a small fraction (< 0.1%) are malicious
Binder Protocol
- *IPC/RPC*: - Binder protocols enable fast inter-process communication - Allows apps to invoke other app component functions - Binder objects handled by Binder Driver in kernel - Serialized/marshalled passing through kernel - Results in input output control (ioctl) system calls - *Android Interface Definition Language (AIDL)*: - AIDL defines which/how services can be invoked remotely - Describes how to marshal method parameters
80x86 CPU Family
- 8088, 8086: 16-bit registers, real-mode only - 80286: 16-bit protected mode - 80386: 32-bit registers, 32-bit protected mode - 80486/Pentium/Pentium Pro: Adds a few features, speed-up - Pentium MMX: Introduces the multimedia extensions (MMX) - Pentium II: Pentium Pro with MMX instructions - Pentium III: Speed-up, introduces the Streaming SIMD Extensions (SSE) - Pentium 4: Introduces the NetBurst architecture - Xeon: Introduces Hyper-Threading - Core: Multiple cores
Anti-analysis
- *Obfuscation*: many levels of packing. - *Anti-forensics*: - self-deletion from disk; - erase key from memory; - set the module's timestamp to that of kernel32.dll. - *Anti-AV*: tricks signature checks by spawning a hollowed explorer.exe (RunPE). - *Whitelisting security products*: attackers want to be sure that security software is not encrypted
Segment registers
- *S*tack *S*egment: pointer to the stack - *C*ode *S*egment: pointer to the code - *D*ata *S*egment: pointer to the data - *E*xtra *S*egment: pointer to extra data - *F* *S*egment: pointer to more extra data - *G* *S*egment: pointer to still more extra data
Basic protection from ransomware
- *User education*: email attachments and social engineering. - *Patch systems*: operating system, applications, browser plugins, ... - *Backup files*: backup all important data regularly. - *Remove admin rights*. - *Protected folder*.
Android ransomware families
- Android Defender: fake AV (first example of ransomware for Android). - Police Ransomware: claims that illegal activity was detected. - Simplocker: encrypts files on the memory card. - Lockerpin: sets or changes the PIN on the device. - Dogspectus: silently installs on Android devices, via malvertising.
Forced multi-path exploration
- Assumption: the behaviour of the program depends on the output of the syscalls it executes. - Track dependencies between syscall outputs and program variables - Detect untaken paths and force the execution of these paths by computing new program states that satisfy the path conditions
Adversarial Taxonomy
- Attacker's Goals - Security violation - integrity (malware classified as benign) - availability (benign classified as malware) - privacy (user data leaked) - Attack specificity (targeted, indiscriminate) - Attacker's Knowledge - training data - feature set, feature extraction/selection - learning algorithm, parameters - Attacker's Capability - exploratory: only at test time - causative: poison training
Domain flux
- Bots periodically generate new C&C domain names. - The local date (system time) is often used as input. - Botmaster needs to register one of these domains and respond properly so that bots recognize a valid C&C server. - Defenders must register all domains to take down the botnet
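From the defender's side, the asymmetry is easier to see with a toy generator. This is a sketch only: the seed string, the SHA-256 construction, the label length and the TLD are all assumptions made for illustration and do not describe any real malware family.

```python
import hashlib
from datetime import date


def candidate_domains(day, count=3, tld=".com", seed="example-seed"):
    """Derive `count` pseudo-random domain names from a date."""
    domains = []
    for i in range(count):
        material = f"{seed}:{day.isoformat()}:{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        domains.append(digest[:12] + tld)
    return domains


# Anyone who knows the algorithm and the date computes the same list,
# which is why analysts can precompute the names for a given day.
print(candidate_domains(date(2024, 1, 1)))
```

The botmaster needs only one of the generated names to resolve, whereas a takedown has to cover every name for every day.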
Polymorphic malware
- Change layout with each infection - Payload is encrypted - Using different key for each infection - Makes static string analysis practically impossible - Encryption routine must change too or detection is trivial
Taint analysis
- Concerned with tracking how interesting data flows throughout a program's execution - Taint source (the interesting data source) - Propagation rules (the how-to) - Taint sinks (where the data flows to; they allow security policies to be enforced) - At the basis of many dynamic behaviour malware analysis frameworks - Explicit (data) flows: x = y; - Control-dependent (data) flows: if (x == <value>) { expr; } - Implicit flows (w gets the same value as y): x = 0; z = 0; if (y == 1) x = 1; else z = 1; if (x == 0) w = 0; if (z == 0) w = 1; - Very successful when protecting benign programs - Open to a number of evasions when applied to malicious programs
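The implicit-flow example can be transcribed directly to confirm that w ends up equal to y even though no statement assigns y (or anything computed from y) to w. This sketch assumes y is 0 or 1, and the function name is illustrative.

```python
def copy_via_control_flow(y):
    """Copy y (0 or 1) into w without any direct assignment from y."""
    x = 0
    z = 0
    if y == 1:
        x = 1
    else:
        z = 1
    w = None
    if x == 0:   # branch taken only when y != 1
        w = 0
    if z == 0:   # branch taken only when y == 1
        w = 1
    return w


print(copy_via_control_flow(0))  # 0
print(copy_via_control_flow(1))  # 1
```

A tracker that propagates taint only through explicit assignments would leave w untainted here, which is one of the evasions the card mentions.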
Instruction Classes
- Data transfer: mov, xchg, push, pop - Binary arithmetic: add, sub, imul, mul, idiv, div, inc, dec - Logical: and, or, xor, not - Control transfer: jmp, jne, call, ret, int, iret - Input/output: in, out
Algorithmic-agnostic Unpacking
- Dynamic analysis - Emulation/tracing of the sample execution until the "termination" of the packing routine. Examples: OmniUnpack, Renovo, Justin, PolyUnpack
Memory models
- Flat MM: Memory is considered a single, continuous address space from 0 to 2^32 - 1 (4GB) - Segmented MM: Memory is a group of independent address spaces called segments, each addressable separately
HOW NOT TO CODE YOUR RANSOMWARE
- Forget to delete the original files (or to wipe them). - Erase everything but forget about shadow copies. - Delete everything but the encryption key. - Key management issues. - Design your own encryption.
Branch functions
- Given a finite map φ = {a1 → b1, ..., a_n → b_n}, a branch function f_φ is a function such that: a1: jmp b1 -> a1: call f_φ a2: jmp b2 -> a2: call f_φ ... a_n: jmp b_n -> a_n: call f_φ - Obscure the control flow (hard to reconstruct the original map φ) - Misleading disassembler: junk bytes can be introduced after the call instruction
Conditional Code Obfuscation
- Historically, encryption, polymorphism and other obfuscation schemes have been primarily employed to thwart anti-virus tools and static-analysis-based approaches. - Dynamic-analysis-based approaches inherently overcome all anti-static-analysis obfuscations, but they only observe a single execution path. - Malware can exploit this limitation by employing trigger-based behaviours such as time-bombs, logic-bombs, bot-command inputs, and testing for the presence of analysers, to hide its intended behaviour. Fortunately, recent analysers provide a powerful way to discover trigger-based malicious behaviour in arbitrary malicious programs: exploration of multiple paths during the execution of a malware sample. Analyser improvements: - First, analysers may be equipped with a decryptor that reduces the search space of keys by taking the input domain into account. - Another approach is to move towards input-aware analysis: rather than capturing binaries only, collection mechanisms should, if possible, capture the interaction of the binary with its environment (in the case of bots, the related network traces). - Existing honeypots already have the capability to capture network activity. Recording system interaction can provide more information about the inputs required by the binary
System call
- Interface between a user-space application and a service that the kernel provides. - Generally something that only the kernel has the privilege to do. - Provides useful information about process behaviour It requires a control transfer to the OS: - int (Linux: int 0x80, Windows: int 0x2e) - sysenter/sysexit (≥ Intel Pentium II), syscall/sysret (≥ AMD K6) It's invoked by placing a system call number in %eax and the parameters into general purpose registers
Stack Layout (x86)
- LIFO data structure - "Grows" towards lower memory addresses - Stack-management registers: - stack pointer (%esp): contains the address of the last element on the stack - frame pointer (%ebp): points to the activation record of the current procedure - Assembly instructions that manipulate the stack: - push: decrements %esp, then stores an element at the address contained in %esp - pop: reads the element at address %esp, then increments %esp
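The push/pop semantics above can be modelled with a toy stack. This sketch assumes 4-byte slots and an arbitrary starting address of 0x1000. Neither value comes from the card.

```python
class Stack:
    """Toy model of the x86 stack with 4-byte slots."""

    def __init__(self, top=0x1000):
        self.esp = top     # address of the last pushed element
        self.memory = {}   # address -> 32-bit value

    def push(self, value):
        self.esp -= 4      # the stack grows towards lower addresses
        self.memory[self.esp] = value & 0xFFFFFFFF

    def pop(self):
        value = self.memory[self.esp]
        self.esp += 4      # the slot is released, not erased
        return value


s = Stack()
s.push(0xAAAA)
s.push(0xBBBB)
print(hex(s.esp))    # 0xff8
print(hex(s.pop()))  # 0xbbbb
```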
AT&T instruction syntax
- Label: mnemonic source(s), destination # comment - Numerical constants are prefixed with a $ - Hexadecimal numbers start with 0x - Binary numbers start with 0b - Registers are denoted by %
Endianness
- Little: *L*east *S*ignificant *B*yte (LSB) first, i.e. at the lowest address (used by Intel) - Big: *M*ost *S*ignificant *B*yte (MSB) first and least significant byte last
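Byte order is easy to verify with Python's standard `struct` module. The value 0x12345678 is an arbitrary example. In little-endian layout (the x86 convention) the least significant byte 0x78 comes first in memory, and in big-endian layout the most significant byte 0x12 comes first.

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print(little.hex())  # 78563412
print(big.hex())     # 12345678
```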
Linear Sweep
- Locate instructions: where one instruction ends, another begins. - Assume that everything in a section marked as code actually represents machine instructions - Starts to disassemble from the first byte of the code section in a linear fashion - Disassembles one instruction after another until the end of the section is reached Pros: - It provides complete coverage of a program's code sections Cons: - No effort is made to understand the program's control flow - Compilers often mix code with data (e.g. switch statement ∼ jump table), and linear sweep cannot distinguish code from data. When an error occurs the disassembler eventually ends up re-synchronizing with the actual instruction stream (self-repairing) Used by: objdump
Packing
- Malicious code hidden by one or more layers of compression/encryption. - Decompression/decryption performed at runtime. Packer-specific unpacking requires detailed knowledge of every packer family, hence the need for algorithmic-agnostic unpacking techniques.
Trojan horse
- Malicious program disguised as a legitimate software - Many different malicious actions: spy on sensitive user data, hide presence (e.g., rootkit), allow remote access (e.g., Back Orifice, NetBus)
Behaviour-based malware detection
- Monitor the events that characterize the execution of the program (e.g. the system calls executed) - Infer the behaviour of the program from these events - Detect high-level malicious behaviours - Can detect novel malware since the majority of them share the same high-level behaviours (e.g. relay spam, steal sensitive information)
Debugging
- Monitored single process execution. - Fine-grained debugging levels (LOC, single-step) - *Breakpoint* stops your program whenever a particular point in the program is reached. - *Watchpoint* stops your program whenever the value of a variable or expression changes. - *Catchpoint* stops your program whenever a particular event occurs. - Analyze CPU environment (memory, registers). E.g.: gdb (GNU/Linux) or for Windows: ImmunityDebugger, OllyDbg, SoftIce, WinDbg
Fast flux
- Offline, disinfected or problematic agents are replaced with others - The botnet is typically composed of millions of agents - The identity of the code components of the infrastructure is well protected - *Multiple domains are used by the same botnet* (it is not sufficient to shut down a domain)
Iterative Optimisation
- Once a criterion function has been defined, we must find a partition of the data set that minimizes the criterion. - Exhaustive enumeration of all partitions, which guarantees the optimal solution, is infeasible. The common approach is: 1. Find some reasonable initial partition 2. Move observations from one cluster to another in order to reduce the criterion function Groups of methods: - Flat clustering algorithms (produce a set of disjoint clusters; these are the most widely used, e.g. K-means) - Hierarchical clustering algorithms (broadly divided into agglomerative and divisive approaches; the result is a hierarchy of nested clusters)
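K-means is named above as the most common flat method, so here is a minimal sketch of its iterative optimisation. It assumes the criterion is the within-cluster sum of squared Euclidean distances and that the caller supplies the initial centroids. Both are assumptions, since the card leaves the criterion and initialisation open.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples; returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # no centroid moved, so the partition is stable
        centroids = new_centroids
    return centroids, labels


points = [(1, 1), (1, 2), (8, 8), (9, 8)]
final_centroids, final_labels = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)  # [(1.0, 1.5), (8.5, 8.0)]
print(final_labels)     # [0, 0, 1, 1]
```

Each pass can only lower (or keep) the criterion, which is why the loop stops at a local rather than a guaranteed global optimum.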
Android actions
Android functionality: - Largely achieved through IPC/ICC (ioctl) - The Binder protocol is crucial to this
Clustering
- Process of organising objects into groups whose members are similar in some way (high intra-cluster similarity, low inter-cluster similarity) - Clusterings are usually not "right" or "wrong"; different clusterings can reveal different things about the data - Some clustering criteria/algorithms have probabilistic interpretations E.g.: non-parametric clustering, hierarchical clustering Type: Unsupervised
Malware objectives
- Profit-oriented - Information stealing (e.g.: spyware, botnets) - Resource consumption (e.g.: botnets) - Resource rental (e.g.: botnets) - Ransom (e.g.: ransomware) - Maintain access to a compromised system (e.g.: rootkits)
Virus
- Self-replicating - Host-dependent for infection E.g.: Boot (Brain virus), overwrite, parasitic, cavity, entry point obfuscation, code integration (W95/Zmist virus)
Worm
- Self-replicating, spreads (autonomously) over network - Exploits vulnerabilities affecting a large number of hosts - Sends itself via email E.g.: Internet worm, Netsky, Sobig, Code Red, Blaster, Slammer
Breakpoint
- Software interrupts (traps) on x86. - When the CPU executes an int insn the control transfers to the routine associated - The return address for the ISR points to the insn following the trap instruction. It's a mechanism used for suspending program execution to examine registers and memory locations
Dynamic techniques
- The program must be executed and monitored: - Interaction with the environment - Interaction with the OS (e.g. system calls)
Feature classes
- Start-up installation - Use of cryptographic API - File operations - Collecting user computer's information - Intervention with other processes (malicious code injection, process termination) - Disabling system recovery - ransom note (looking for typical strings, such as encrypted, protected, ransom, RSA, AES, ...) - evasion (debugging/virtualised/sandboxed environment)
Dynamic analysis
- Studying a program's properties by executing it: observing actual executions allows us to infer properties and behaviours (*under approximation*) - Environment-limited analysis - Ability to monitor the code (and data flow) as it executes - Allows precise security analysis thanks to run-time information - Debugging (finding bugs) - Instrumentation: - Add extra semantic-preserving code to a program or a process - Taint-tracking Goals of such a system: - *Visibility*: a sandbox must see as much as possible of the execution of a program. Otherwise, it risks missing interesting, potentially malicious, behaviours - *Resistance to detection*: monitoring should be hard to detect and the environment hard to fingerprint - *Scalability*: with 500k+ malware samples per day, analysis must scale up - The execution of one sample does not interfere with the execution of subsequent malware programs - Analyses should be automated
Static analysis
- Studying a program's properties without executing it (*over approximation*) - Reverse engineering may be hampered (e.g.: obfuscation, encryption) Issues: - Opaque predicates: conditions' outcome known upfront, but hard to deduce statically—more complex CFGs - Anti-analysis, e.g., anti-disassembly, CFG flattening, but also packing (see next)—incomplete CFG - Indirect calls or jumps—partial CFG exploration
Execution Levels
- There are different privilege levels - These are used to separate user-land execution from kernel-level execution - Subroutines at higher privilege levels can be accessed through gates and require special setup - Usually kernel-mode is mapped to level 0 and user-mode to level 3 - System calls typically cause transition from user to kernel space
Malware fight goals
- Understand malware behaviours - Automatically identifying and classifying families of malware - Automatically generating effective malware detection models
Rootkit
- Used to keep access to a compromised system - Usually hides files (usually malware), processes, network connections (user/kernel level), registry keys, services, ...
AUC
Area Under the ROC Curve which is used as performance metric. The higher, the better. Random classifier has AUC=0.5
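One way to make the 0.5 baseline concrete is the rank interpretation of AUC: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by brute force over all pairs. The scores are made-up illustrative values, not experimental results.

```python
def auc(scores_pos, scores_neg):
    """AUC as P(score of a random positive > score of a random negative)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up scores: 8 of the 9 positive/negative pairs are ranked correctly.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1]))
```

A scorer that cannot tell the classes apart wins half of the pairs on average, which gives the 0.5 of a random classifier.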
Environment interaction
- lsof lists on its standard output file information about files opened by processes. - netstat displays the contents of various network-related data structures. - ltrace intercepts and records the dynamic library calls which are called by the executed process and the signals which are received by that process
Analysis process
1. Collect malware samples 2. Static/Dynamic analysis 3. Extract (and generalize) malicious behaviour (host/network) 4. Generate and deploy detection models Problems: - Lack of general definition of malicious behaviour - Cat-and-mouse game: attackers have much freedom - Victims often (unwittingly) help attackers
Sandbox based analysis
1. Execute a suspicious program in a sandbox (typically an emulator) 2. Monitor the execution using VM introspection 3. Identify suspicious and malicious behaviours E.g.: Anubis, CWSandbox, Cuckoo Sandbox, BitBlaze Limitations: only the behaviours associated with the taken paths can be monitored
Linear separation
1. Map the samples with the features to a vector space (e.g.: one on x and the other one on y) 2. Separate using a model (best separating hyperplane/line)
Sinkholing
1. Purchase hosting from two different hosting providers known to be unresponsive to complaints 2. Register wd.com and wd.net with two different registrars 3. Set up Apache web servers to receive bot requests 4. Record all network traffic 5. Automatically download and remove data from your hosting providers 6. Enable hosts a week early
Function invocation
1. The caller pushes the parameters on the stack 2. The caller saves the return address on the stack, and then it jumps to the callee (e.g. call <strcpy>) 3. The callee executes a prologue, that consists of the following operations: - Save %ebp on the stack - %ebp = %esp - Allocate space for local variables
Data collection principles
1. The sinkholed botnet should be operated so that any harm and/or damage to victims and targets of attacks would be minimized - Always respond with okn message - Never send new/blank configuration files 2. The sinkholed botnet should collect enough information to enable notification and remediation of affected parties - Work with law enforcement (FBI and DoD Cybercrime units) - Work with bank security officers - Work with ISPs
Exploit kit attack
2nd most used attack vector for ransomware (after emails)
EFLAGS
32-bit register used as a collection of bits representing boolean values to store the results of operations and the state of the processor (ID, VIP, VIF, AC, VM, RF, NT, IOPL, OF, DF, IF, TF, SF, ZF, AF, PF, CF) - *C*arry *F*lag: set if the last arithmetic operation carried (add) or borrowed (sub) a bit beyond the size of the register - *P*arity *F*lag: set if the number of set bits in the least significant byte is even - *A*djust *F*lag: carry of Binary Coded Decimal (BCD) arithmetic operations - *Z*ero *F*lag: set if the result of an operation is 0 - *S*ign *F*lag: set if the result of an operation is negative - *T*rap *F*lag: set if there's step-by-step debugging - *I*nterrupt *F*lag: set if interrupts are enabled - *D*irection *F*lag: stream direction. If set, string operations will decrement their pointer instead of incrementing it - *O*verflow *F*lag: set if a signed arithmetic operation results in a value too large for the register to contain - *I*/*O* *P*rivilege *L*evel of the current process - *N*ested *T*ask flag: controls chaining of interrupts, set if the current process is linked to the next process - *R*esume *F*lag: controls the response to debug exceptions - *V*irtual-8086 *M*ode: set if in 8086 compatibility mode - *A*lignment *C*heck: set if alignment checking of memory references is done - *V*irtual *I*nterrupt *F*lag: virtual image of IF - *V*irtual *I*nterrupt *P*ending flag: set if an interrupt is pending - *ID*entification flag: the ability to set it indicates support for the CPUID instruction
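The arithmetic flags are easiest to tell apart on a worked addition. This sketch models an 8-bit add only and computes CF, ZF, SF, OF and PF. AF and all system flags are deliberately left out, and the function name is illustrative.

```python
def add8_flags(a, b):
    """Add two 8-bit values and return (result, flags) like an 8-bit add."""
    a &= 0xFF
    b &= 0xFF
    full = a + b
    result = full & 0xFF
    sign = lambda v: (v >> 7) & 1
    flags = {
        "CF": int(full > 0xFF),  # unsigned carry out of bit 7
        "ZF": int(result == 0),
        "SF": sign(result),
        # Signed overflow: operands share a sign that the result does not.
        "OF": int(sign(a) == sign(b) and sign(result) != sign(a)),
        "PF": int(bin(result).count("1") % 2 == 0),  # even number of set bits
    }
    return result, flags


print(add8_flags(0x7F, 0x01))  # result 0x80: SF and OF set, CF clear
print(add8_flags(0xFF, 0x01))  # result 0x00: CF, ZF and PF set, OF clear
```

The two calls show that CF tracks unsigned wrap-around while OF tracks signed wrap-around, so either can be set without the other.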
Machine Learning
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Informally: - It's a sub-field of AI - a set of algorithms that can automatically learn rules from data, to represent models All algorithms rely on sums and products of vectors and matrices, and often the algorithms have geometrical interpretations. Not all information is relevant. It tries to use only the most discriminative features/dimensions. At the end, the algorithm tries to find an optimal model that represents the input data by "minimizing" some error function
Random Forest
A kind of supervised learning which is one of the most popular and effective algorithms, because it includes several "tricks" by design to improve generalization and reduce overfitting. It uses the divide-and-conquer intuition of decision trees. Algorithm: 1. For b = 1 to B: (a) Draw a *bootstrap sample Z** of size N from the training data. (b) Grow a random-forest tree T_b to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached. i. Select m variables at random from the p variables. ii. Pick the best variable/split-point among the m. iii. Split the node into two daughter nodes. 2. Output the ensemble of trees {T_b}^B_1. To make a prediction at a new point x: Classification: Let Ĉ_b(x) be the class prediction of the b-th random-forest tree. Then Ĉ^B_rf(x) = majority vote {Ĉ_b(x)}^B_1 Differences to Standard Decision Tree - Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement, i.e. some samples will probably occur multiple times in the new data set). - For each split, consider only m randomly selected variables - Don't prune (pruning in SDT, i.e. removing parts of the tree, is used to reduce the risk of overfitting. In RF, this is taken care of by constructing an ensemble of trees on different (bootstrapped) data and different variables) - Fit B trees in such a way, and use majority voting to aggregate results
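The three ingredients that distinguish a random forest from a single decision tree can be sketched in isolation. Tree growing itself is omitted, the helper names are illustrative rather than taken from any library, and the class labels in the usage line are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (step 1a)."""
    return [rng.choice(data) for _ in data]


def random_feature_subset(p, m, rng):
    """Pick m of the p feature indices to consider at one split (step 1b-i)."""
    return rng.sample(range(p), m)


def majority_vote(predictions):
    """Aggregate the class predictions of the B trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)
features = random_feature_subset(p=8, m=3, rng=rng)
print(majority_vote(["malware", "benign", "malware"]))  # malware
```

Because the sample is drawn with replacement, some observations typically appear more than once and others not at all, which is what decorrelates the trees.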
Botnet
A network of compromised devices (bots) controlled by a bot master (via C&C channels, fast/domain flux or push/pull/P2P). Each bot has the same domain generation algorithm and 3 fixed domains to be used if all else fails. It is created via infection (worm, trojan via P2P, drive-by downloads, existing backdoor) and spreading
Red-pill
A program capable of detecting if it is executed in an emulator. int main(void) { const char *redpill = "\x08\x7c\xe3\x04..."; if (((int (*)(void))redpill)()) { /* Executed on physical CPU */ return CPU; } else { /* Executed on emulated CPU */ return EMU; } }
Signature-based detection
AV maintains a database of signatures (e.g. byte patterns, regular expressions). A program is considered malicious if it matches a signature
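A minimal sketch of the matching step, using Python's standard `re` module on byte strings. The two signatures and their names are invented for illustration and do not correspond to any real AV database.

```python
import re

# Hypothetical signatures, invented for illustration only.
SIGNATURES = {
    "Demo.BytePattern": re.compile(rb"\xde\xad\xbe\xef"),
    "Demo.Regex": re.compile(rb"evil_[a-z]{4}\.exe"),
}


def scan(data):
    """Return the names of all signatures that match the byte string."""
    return [name for name, pattern in SIGNATURES.items()
            if pattern.search(data)]


print(scan(b"\x00\x01\xde\xad\xbe\xef\x02"))  # ['Demo.BytePattern']
print(scan(b"harmless content"))              # []
```

Changing a single byte inside the matched pattern is enough to make `scan` return nothing, which is the weakness that packing and polymorphism exploit.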
Manhattan distance
Also known as city-block distance (defined via *L_1 norm*)
Adversarial ML
An attacker may try to evade detection or poison training data. For example, spam filtering may be based on features linked to the presence/absence of words, where the attacker could obfuscate bad words and insert good ones. Defences: - Reactive: - timely detection of attacks - frequent re-training - decision verification - Proactive: - Security-by-Design (against white-box attacks [no probing]): secure/robust learning and attack detection, which affect the decision boundaries (noise-specific margin and/or enclosure of legitimate training classes) - Security-by-Obscurity (against grey-box and black-box attacks [probing]): information hiding, randomisation, detection of probing attacks
Dataset
Ideally a representative sample of the real-world population that is statistically significant (10k+ samples). The dataset should also have a realistic goodware/malware ratio (e.g.: 1-10 malware for every 100-1k goodware) and reliable ground-truth labels (if available). If it is of poor quality, the results of the analysis are meaningless; that is, no general conclusion can be drawn. The problem with public datasets is that the data may be poisoned by attackers, who may include "bogus" samples to make the classifier learn wrong rules
Drive-by downloads
Attacks against web browsers and/or vulnerable plug-ins, typically launched via malicious client-side scripts (JavaScript, VBScript) injected into legitimate sites (e.g.: via SQL injection). Sometimes the scripts are hosted on malicious sites or embedded into ads. They can also involve redirection, where the landing page redirects to a malicious site, which allows exploit customisation. Propagation techniques: - Remote exploit + drive-by download - Rogue antivirus (fake one on a website)
Bot
Autonomous programs performing tasks
Average linkage
Average inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities
Agglomerative Clustering
Begin with n observations and a measure (such as Euclidean distance) of all the n(n − 1) / 2 pairwise dissimilarities. Treat each observation as its own cluster. For i = n, n - 1, ..., 2: 1. Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are the least dissimilar (that is, most similar). Fuse these 2 clusters. The dissimilarity between these 2 clusters indicates the height in the dendrogram at which the fusion should be placed 2. Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters
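The loop above can be sketched in a few lines. This version assumes average linkage as the inter-cluster dissimilarity and recomputes all pairwise values at every step, which is simple but inefficient. The one-dimensional data points are made up for illustration.

```python
def average_linkage(a, b, dist):
    """Mean of all pairwise dissimilarities between clusters a and b."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate(points, dist):
    """Return the list of merges as (cluster_a, cluster_b, height)."""
    clusters = [(p,) for p in points]  # each observation starts alone
    merges = []
    while len(clusters) > 1:
        # Identify the least dissimilar pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], dist),
        )
        height = average_linkage(clusters[i], clusters[j], dist)
        merges.append((clusters[i], clusters[j], height))
        fused = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [fused]
    return merges


merges = agglomerate([1.0, 2.0, 6.0, 7.0, 20.0], lambda p, q: abs(p - q))
for a, b, h in merges:
    print(a, b, h)
# Fusion heights, in order: 1.0, 1.0, 5.0, 16.0
```

The recorded heights are the values at which each fusion would be drawn in the dendrogram.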
Dynamic Analysis for Android
Challenges: - Android apps are interactive and hard to stimulate (pragmatically, we stimulate what we can, e.g.: using MonkeyRunner) - State-modifying actions manifest at multiple abstractions - Traditional OS interactions (e.g. filesystem/network interactions) - Android-specific behaviours (e.g. SMS, phone calls)
BotMiner
Clustering analysis of network traffic and structure-independent botnet detection. It's assumed that bots within the same botnet are characterized by similar malicious activities and C&C communications. The C-plane monitor captures network flows and records information on who is talking to whom. Each flow contains: times/duration, IP/port, number of packets, bytes transferred. The A-plane monitor logs information on who is doing what; it analyses: - Outbound traffic through the monitored network - Detecting several malicious activities the internal hosts may perform. The C-plane cluster is responsible for: - Reading the logs generated by the C-plane monitor - Finding clusters of machines that share similar communication patterns (performs basic filtering, performs white listing, multi-step clustering). The A-plane cluster takes a list of clients with malicious activity and clusters them according to activity type (scan, spam, DDoS, binary downloading, exploit downloading) or activity features. Cross-plane correlation: The idea is to cross-check clusters in the two planes to find intersections. A score s(h) is computed for each host h
C&C
Command & Control. - Centralised control: IRC, HTTP - Distributed control: P2P - Push (the bot silently waits for commands from the "commander") vs Pull (the bot repeatedly queries the "commander" to see if there is new work to do). Bots locate it based on: - a hardcoded IP address - FastFlux: hardcoded FQDN or dynamically generated FQDNs (1 FQDN → 1 or more IP addresses) - DomainFlux: hardcoded URL or dynamically generated URLs - Search keys in the P2P network. These location mechanisms can be blocked with (network level)/DNS/HTTP ACLs
CFG
Control Flow Graph. Given a program P, its control flow graph is a directed graph G = (V , E ) representing all the paths a program might traverse during its execution. - V is the set of basic blocks - E ⊆ V × V is the set of edges representing control flow between basic blocks - a control flow edge from block u to v is e = (u, v) ∈ E
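As a small illustration of G = (V, E), the sketch below stores the edge set as pairs of basic-block names and computes which blocks are reachable from the entry block. The block names (`B0` to `B4`) and helper functions are hypothetical, chosen only for the example.

```python
def successors(edges):
    """Build a successor map {block: [blocks it can transfer control to]}."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
    return succ

def reachable(entry, edges):
    """Set of basic blocks reachable from `entry` by following control-flow edges."""
    succ = successors(edges)
    seen, stack = set(), [entry]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(succ.get(node, []))
    return seen

# An if/else diamond (B0 -> B1 or B2 -> B3) plus a block B4 that nothing jumps to.
E = [("B0", "B1"), ("B0", "B2"), ("B1", "B3"), ("B2", "B3"), ("B4", "B3")]
live_blocks = reachable("B0", E)
```

Here `B4` is in V but not in `live_blocks`, which is how dead code shows up in a CFG.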
DKOM
Direct Kernel Object Manipulation: in-memory alteration of a kernel structure, with no hook/patch needed. For example, malware.exe disappears from the list of running processes by unlinking its EPROCESS structure from both neighbouring EPROCESS structures in the process list. The process keeps running because scheduling is thread-based
Distance to a line
Distance of a point (x0, y0) to a line Ax + By + C = 0
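The entry gives no formula; the standard result for the distance of the point (x0, y0) to the line Ax + By + C = 0 is:

```latex
d = \frac{\lvert A x_0 + B y_0 + C \rvert}{\sqrt{A^2 + B^2}}
```

For example, the point (0, 0) lies at distance |−10| / √(3² + 4²) = 2 from the line 3x + 4y − 10 = 0.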
Distance to a hyperplane
Distance of a point x (a vector) to a hyperplane w^T * x + b = 0. *Hyperplane*: subspace whose dimension is one less than that of its ambient space
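The distance generalises the point-to-line case: it is |w^T * x + b| divided by the Euclidean norm of w. A minimal sketch, where the function name `distance_to_hyperplane` is an assumption made for the example:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance of point x to the hyperplane w^T x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))  # Euclidean norm of w
    return abs(dot + b) / norm

# 2-D special case: the line 3x + 4y - 10 = 0 and the point (0, 0).
d = distance_to_hyperplane([3.0, 4.0], -10.0, [0.0, 0.0])
```

In two dimensions with w = (A, B) and b = C this is exactly the distance of a point to the line Ax + By + C = 0.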
DDoS
Distributed Denial of Service
Data definition
Data objects are defined in a data segment using the syntax: label: type data1, data2, ... Examples: .data myvar: .long 0x12345678, 0x23456789 bar: .word 0x1234 mystr: .asciz "foo"
History
Early 90s (IRC bots) -> 98-99-00 (Trojan horse & remote control, DDoS tools & distribution) -> 01-now (worms & spreading)
Crypto-Ransomware
Encrypts personal files to make them inaccessible. Stages of infection: 0. Break-in (phishing/spam, exploit kits, self-propagation, exploiting server vulnerabilities, malvertising, ...) 1. Installation (copies itself into various system locations and makes itself autostartable across reboots) 2. Contacting the HQ 3. Handshake and keys (key generation is usually done locally at run-time) 4. Encryption (e.g.: hard-coded RSA public key, 1 AES/RSA key generated locally, MBR overwrite to boot custom kernel, ...), which usually targets specific files (e.g.: documents and images) 5. Extortion 6. Recovery (easier on locker ransomware and computationally infeasible on crypto ones)
Ransomware live detection
For better results, feature selection (the main features being: API stats, dropped file extensions, file operations, registry keys) is key to making the algorithm more efficient and achieving better accuracy: - Reducing the number of features allows simpler ML algorithms. - It shortens the training (and prediction) time and, in many cases, helps prevent overfitting.
Finding the "nearest" pair of clusters
For two clusters ω_j and ω_k of sizes n_j and n_k: - Minimum distance (single linkage): d_min(ω_j, ω_k) := min_{x∈ω_j, y∈ω_k} ∥x − y∥ - Maximum distance (complete linkage): d_max(ω_j, ω_k) := max_{x∈ω_j, y∈ω_k} ∥x − y∥ - Average distance (average linkage): d_avg(ω_j, ω_k) := 1/(n_j * n_k) * ∑_{x∈ω_j} ∑_{y∈ω_k} ∥x − y∥ - Mean distance (centroid linkage): d_mean(ω_j, ω_k) := ∥μ_j − μ_k∥ where μ_j and μ_k are the means of the 2 clusters
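The four inter-cluster distances can be compared side by side on a toy example. This is a sketch assuming the Euclidean norm; the function names mirror the symbols above and the two clusters are made-up data.

```python
import math

def dist(x, y):
    """Euclidean norm of x - y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(cluster):
    """Component-wise mean (centroid) of a cluster."""
    n = len(cluster)
    return [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]

def d_min(cj, ck):   # single linkage
    return min(dist(x, y) for x in cj for y in ck)

def d_max(cj, ck):   # complete linkage
    return max(dist(x, y) for x in cj for y in ck)

def d_avg(cj, ck):   # average linkage
    return sum(dist(x, y) for x in cj for y in ck) / (len(cj) * len(ck))

def d_mean(cj, ck):  # centroid linkage
    return dist(mean(cj), mean(ck))

omega_j = [(0.0, 0.0), (2.0, 0.0)]
omega_k = [(5.0, 0.0), (9.0, 0.0)]
```

On these points the pairwise distances are 5, 9, 3 and 7, so single linkage gives 3, complete linkage 9, and both average and centroid linkage 6. The last two agree here only because the points are collinear and the clusters do not interleave.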
Kernel function
Function that corresponds to an inner product in some expanded feature space. Linear classifier relies on an inner product between vectors K(x_i, x_j) = x_i^T * x_j If every data-point is mapped into high-dimensional space via some transformation φ : x → φ(x), the inner product becomes: K(x_i, x_j) = φ(x_i)^T * φ(x_j) Why use kernels? - Make non-separable problem separable - Map data into better representational space Common kernels: linear, polynomial, Radial Basis Function (RBF, also known as Gaussian Kernel). With the RBF/Gaussian kernel, it is possible to create a non-linear separation in the original space (see ≈ circular shape in the figure) by solving a linear separation problem in an alternative space
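To make "inner product in an expanded feature space" concrete, the sketch below checks that the degree-2 polynomial kernel K(x, y) = (x^T * y)^2 on 2-D inputs equals an ordinary inner product after the explicit map φ(x) = (x1², √2·x1·x2, x2²). The function names and the RBF parameter `gamma` are assumptions made for the example.

```python
import math

def linear_kernel(x, y):
    """Plain inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Same value as linear_kernel(phi(x), phi(y)), without computing phi."""
    return linear_kernel(x, y) ** 2

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian/RBF kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = (1.0, 2.0), (3.0, 4.0)
implicit = poly2_kernel(x, y)            # computed in the original 2-D space
explicit = linear_kernel(phi(x), phi(y)) # computed in the expanded 3-D space
```

Both routes give the same number, which is why a linear classifier written purely in terms of inner products can be moved to the expanded space by swapping the kernel.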
Criterion function for clustering
Function to be optimised. The most widely used one is the sum-of-squared-errors over the clusters ω_j. This criterion measures how well the data set X = {x1, x2, ..., x_n} is represented by the cluster centres μ = {μ1, ..., μ_K}, (K ≤ n). Clustering methods that use this criterion are called minimum variance
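The entry does not write the criterion out; in its usual form (stated here as the standard definition, with ω_j the j-th cluster, n_j its size and μ_j its mean) the sum-of-squared-errors criterion is:

```latex
J(\mu) = \sum_{j=1}^{K} \sum_{x \in \omega_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{n_j} \sum_{x \in \omega_j} x
```

A smaller J means the cluster centres represent the data more closely; with K = n every point is its own centre and J = 0.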
Classification
Given a labelled dataset, find a model that separates instances into classes. Its output is often the probability of belonging to a certain class.
Call graph
Given a program P, its call graph is a directed graph C = (R, E). - R is the set of procedures - E ⊆ R × R is the set of edges indicating the caller-callee relation - a caller-callee edge from the caller procedure u to the callee procedure v is e = (u, v) ∈ E
Regression
Given some points, try to generalise and predict real-valued numbers by finding the equation of the curve/line that represents the data distribution of the points. For that, we need a concept of error that we want to minimise. For polynomial curves, we need to choose an order M (M = 0 usually underfits), then the weights w of the polynomial are obtained by minimising the considered error. The error is typically measured as the *mean squared error*.
Ransomware
Goals: - render the victim's system unusable, - and ask the user to pay a ransom to revert the damage. Notable classes: locker-ransomware and crypto-ransomware.
ROC curve
Graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied
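A minimal sketch of how the curve is traced: for each candidate threshold, every sample scoring at or above it is predicted positive, and the resulting (false positive rate, true positive rate) pair is one point of the curve. The function name, scores and labels below are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold; labels are 1 (positive) or 0."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.4, 0.3]  # classifier scores, higher means "more likely malware"
labels = [1, 0, 1, 0]          # ground truth
curve = roc_points(scores, labels)
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1); a perfect classifier would pass through (0, 1).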
F1-Score
Harmonic mean of precision and recall
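Written out, F1 = 2 · precision · recall / (precision + recall). A short sketch using the TP/FP/FN counts, with made-up numbers and no handling of the zero-denominator cases:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# 8 malware samples detected, 2 false alarms, 8 malware samples missed.
f1 = f1_score(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5; the harmonic mean (about 0.615) sits closer to the lower of the two, so a classifier cannot score well by excelling at only one of them.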
Self-emulating malware
Heuristics to detect the end of the unpacking are based on the "execution of previously written code". 1. The code of the malware is transformed in bytecode 2. Bytecode interpreted at run-time by a VM 3. Bytecode mutated in each sample
Hooking
Hijacks the flow of execution by modifying a code pointer. Examples: - User-space: IAT - Kernel-space: IDT (1+ handler(s): 0x2e → KiSystemService; cf. figure), MSR, SSDT (1+ descriptor(s): 0x74 → NtOpenFile). Easy to detect: if (IDT[0x2e] != KiSystemService || SSDT[0x74] != NtOpenFile) then ...
Anti-debugging
How can a process detect if it is currently being traced? Ptrace example (a process can only have one tracer at a time, so the request fails if a debugger is already attached): #include <stdio.h> #include <sys/ptrace.h> int main(void) { /* PTRACE_TRACEME fails if the process is already being traced */ if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0) { printf("You are debugging me... bye!\n"); return 1; } printf("Hello world!\n"); return 0; }
Thwarting linear sweep
How? - Increase the number of candidates by using *branch flipping*
Soft margin classification
If the training data is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples. So the idea is to allow some errors, while still trying to minimise training set errors and to place the hyperplane "far" from each class (large margin)
Prologue
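Block of instructions executed at the start of a function, as soon as it is invoked by the caller. It saves the caller's base pointer, sets up the new stack frame and reserves n bytes for local variables. It typically contains: push %ebp mov %esp, %ebp sub $n, %esp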
Epilogue
Block of instructions executed by the callee function as it terminates. It: 1. deallocates local variables (%esp = %ebp), 2. restores the base pointer of the caller function, 3. resumes the execution from the saved return address. It typically consists of the following instructions: leave ret Or: mov %ebp, %esp pop %ebp ret
Signed integers
Integers in 2's complement
IDA Pro
Interactive Disassembler Professional. Recursive traversal disassembler; detailed (preliminary) analysis that includes: - Function boundaries, library calls and their arguments data types - Control-flow graph, call graph (proximity view) - Functions window, IDA view, hex view, . . . - Automatic tagging of string constants - Code and data cross references - Comments, variable/memory addresses rename It handles pretty much any architecture
Junk insertion
Introduces disassembly errors by inserting junk bytes at selected locations into the code stream where the disassembler expects code
Sec SVM
Intuition: - To have more evenly-distributed feature weights w - In this way, the attacker would need to modify many features to evade the classifier. Where: - R(f) is the regularization term, L(f, D) is the hinge loss function, and C is the trade-off factor between R(f) and L(f, D) - D is the dataset, and each sample x_i has ground-truth label y_i ∈ {−1, +1} - f is the classifier function, where f(x_i) = w^T * x_i + b is the distance from the hyperplane in the SVM - w^lb_k and w^ub_k are lower and upper bounds specific to each feature x_k (as some features may be harder than others to modify)
MonkeyRunner
It can: - Install an apk - Invoke an activity with the installed apk - Send keystrokes to type a text message - Click on arbitrary locations on the screen - Take screenshots (useful for feedback) - Wake up device if it goes to sleep But: - it is still notoriously difficult to write a good stimulation script - Android apps are highly interactive and you need to get right both the context and location of the stimulation in the GUI - A change in the GUI means a new test script needs to be developed
Recursive Traversal
It needs to make assumptions on what to disassemble first and focuses on the concept of control flow; instructions are classified as: - *Sequential flow*: pass execution to the instruction that immediately follows (add, mov, push, pop, . . . ) - *Conditional branching*: if the condition is true the branch is taken and the instruction pointer must change to reflect the target of the branch, otherwise it continues in a linear fashion (jnz, jne, . . . ). In a static context this algorithm disassembles both paths - *Unconditional branching*: the branch is taken without any condition; the algorithm follows the (execution) flow (jmp) - *Function call*: like unconditional jumps, but they return to the instruction immediately following the call - *Return*: every instruction that may modify the flow of the program adds its target address to a list for deferred disassembly. When a return instruction is reached, an address is popped from the list and the algorithm continues from there (recursive algorithm). Pros: - Distinguishes code from data Cons: - Inability to follow indirect code paths (indirect code invocation issue) Used by: IDA Pro
Inversion
It occurs when two clusters are fused at a height below either of the individual clusters in the dendrogram. It leads to difficulties in visualisation as well as in the interpretation of the dendrogram
CopperDroid
Key insights: - All interesting behaviours achieved through system calls: · Low-level, OS semantics (e.g. network access) · High-level, Android semantics (e.g. phone call) Novelty: - Automatically reconstruct behaviours from system calls - With no changes to the Android OS image
Line representation of a decision function
Line that maximises the margin between the two classes. Let us consider the 2-class case and represent the classes with labels from the set {−1, 1}, i.e. y_i ∈ {1, −1}. Define the hyperplane with maximum separation such that, for the i-th sample, w^T * x_i + b ≥ 1 if y_i = 1 and w^T * x_i + b ≤ −1 if y_i = −1
Locker-Ransomware
Lock the victims' computer to prevent them from using it.
Assembly
Low-level processor-specific symbolic language which is directly translated into binary format
Complete linkage
Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities
Basic block
Maximal sequence of consecutive instructions with a single entry and a single exit, with no control transfer instruction (CTI) in the middle of the block
Distance measure
Metric: a function d(x, y) used for measuring the distance between 2 vectors x and y is a metric if it satisfies the following properties: - d(x, y) ≥ 0 - d(x, y) = 0 ⇐⇒ x = y - d(x, y) = d(y, x) - d(x, y) ≤ d(x, z) + d(z, y) In vector spaces (where subtraction is allowed), we often define d(x, y) = ∥x − y∥ using a norm ∥ ... ∥ The most commonly used metric is the Minkowski distance
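The Minkowski distance of order p is d_p(x, y) = (∑_k |x_k − y_k|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A minimal sketch (the function name is an assumption):

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p (p >= 1) between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, p=1)
euclid = minkowski(x, y, p=2)
```

On this pair the Manhattan distance is 7 and the Euclidean distance is 5, and both choices satisfy the four metric properties listed above.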
Single linkage
Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities (tends to give the least balanced dendrograms and is the least popular linkage)
Confusion matrix
One of the performance metrics
Cuckoo
Open source automated malware analysis system. It automatically runs and analyses files and collects comprehensive analysis results that outline what the malware does while running inside an isolated OS. Key features: - Completely automated. - Runs concurrent analyses. - Able to trace processes recursively (e.g., child processes). - Customizable analysis process. - Create behavioural signatures. - Customize processing and reporting. Use cases: - Can be used as a standalone application or can be integrated in larger frameworks: - extremely modular design. - It can be used to analyse: - generic Windows EXE; - DLL files; - PDF; - Microsoft Office documents; - URLs and HTML files; - PHP scripts; - CPL files; - VB scripts; - ZIP; - JAR; - Python files; - ... Events: - Windows API calls traces. - Copies of files. - Dump of the process memory. - Full memory dump. - Screenshots. - Network dump (PCAP format)
Data collection
Phase 1 of the ML pipeline. Unlike many research fields (e.g., social network analysis), in security it is hard to obtain datasets but "luckily for us", in the case of malware analysis it is easier. Possible data sources: - Private company data (e.g., network traffic, emails, ...) - Public datasets (good availability of malware, less for other types of security data): VirusShare, secrepo, VirusTotal, ...
ML pipeline
Phase 1: Data collection Phase 2: Pre-processing and feature engineering Phase 3: Model selection and training Phase 4: Testing and evaluation Phase 5: Evaluate robustness against time evolution and adversaries
Feature engineering
Phase 2 of the ML pipeline, it's where we need to define and extract *features* (input of our machine learning model which is usually defined from *domain knowledge*) from raw samples in our data collection. Some types of features: - *numerical*: filesize, number of API calls - *categorical*: a certificate feature that can take three values (signed, unsigned, signed with expired certificate). *Static* features: - Features extracted from metadata (Android Manifest file, ELF/PE) / code (API call frequencies, or graph structures such as CFG) *Dynamic* features: - Features extracted from dynamic analysis of the code: system call sequences, HTTP requests metadata, URLs called, ... *Note*: Static and dynamic features inherit all the weaknesses of static and dynamic analysis (e.g.: obfuscation for static analysis and coverage for dynamic analysis). Good features highlight commonalities between members of a class and differences between members of different classes
Model selection and training
Phase 3 of the ML pipeline which includes: - *Training* (used to learn a model) - *Validation* (used for hyper-parameter tuning of the ML algorithm) - *Testing* (used to see performance on a real environment). The model selection should be done by answering the following questions: - Do we have enough training data? Small -> K-Nearest Neighbours Big -> Neural Networks - Is the data linearly separable? Rule of thumb: choose the simplest model that can solve our problem
Testing and evaluation
Phase 4 of the ML pipeline
Evaluate robustness against time evolution and adversaries
Phase 5 of the ML pipeline. - Robustness against time: As time passes, malware authors develop new malware, and existing malware evolves into new versions and polymorphic variants. The distribution at "test time" may no longer reflect the distribution at "training time" - Robustness against adversaries: A sophisticated attacker may be aware that we are deploying a malware detection model based on ML. He may try to perform evasion of detection or poisoning of the training data. The detection performance decays over time. In order to fight concept drift: - Periodic re-training - Evaluate your classifier performance with respect to time. However, it may not be enough: the distribution of the arriving data may change rapidly and unexpectedly, so you will estimate a misleading "time decay". Adversarial ML can be seen as a worst-case situation of concept drift
Benign bot
Program used on Internet Relay Chat (IRC) that reacts to events in IRC channels and typically offers useful services (e.g.: NickServ)
Junk byte
Properties: - must be partial instructions - must be inserted in such a way that they are unreachable at runtime Candidate block: - have junk bytes inserted before it - execution cannot fall through the candidate block - basic block immediately before a candidate block must end with an unconditional branch
Wiper
Ransomware that is permanently destructive aka ''destructionware''
Sandbox
Security mechanism for separating running programs: - often used to execute untested or untrusted programs or code - no (or limited) risk to harm the host machine or operating system - provides a tightly controlled set of resources for guest programs to run in - it can be based on virtualization (which can be used to emulate something). Pros: - automate the whole analysis process; - process high volumes of malware; - get the actual executed code; - can be very effective if used smartly. Cons: - can be expensive; - some portions of the code might not be triggered; - environment could be detected.
Code overlapping
Sharing of code on different levels
Feature representation
ML algorithms need numbers and matrices as input and are often designed to work in Euclidean space. Thus features need to be mapped into a Euclidean space, creating a feature vector out of the interesting features. The most common way to do this is to count each feature's occurrences (e.g.: an app calls the Telephony manager 20 times). Alternatively, the presence/absence of a feature is represented by a boolean (e.g.: an app calls the Telephony manager).
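Both encodings can be sketched as follows. The feature vocabulary and the observed call list are illustrative assumptions, not the feature set of any particular detector.

```python
from collections import Counter

def count_vector(observed, vocabulary):
    """One dimension per feature, holding how many times it was observed."""
    counts = Counter(observed)
    return [counts[f] for f in vocabulary]

def binary_vector(observed, vocabulary):
    """One dimension per feature, holding 1 if it was observed at all, else 0."""
    present = set(observed)
    return [1 if f in present else 0 for f in vocabulary]

# Illustrative vocabulary: the order fixes the meaning of each vector dimension.
vocabulary = ["TelephonyManager", "SmsManager", "Camera"]
observed = ["TelephonyManager"] * 20 + ["Camera"]

counts = count_vector(observed, vocabulary)
flags = binary_vector(observed, vocabulary)
```

The same app becomes `[20, 0, 1]` under the count encoding and `[1, 0, 1]` under the boolean one. Every sample must use the same vocabulary order so that vectors are comparable.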
Emulator
Software program that simulates the functionality of another program or piece of hardware (e.g., a CPU; see for instance QEMU). When a program P runs on top of emulated hardware, the system collects detailed information about the execution of P - Can potentially detect evasion attempts - Drawback: the software layer incurs a performance penalty. The implications of that on a whole system (sandbox): - One can install and run an actual OS on top of the emulator - Malware executes on top of a real OS - The analysis environment is much more difficult for malware to fingerprint - The interface offered by a processor is (much) simpler than the interface provided by a modern OS - A system emulator has great visibility - Visibility of every executed instruction and memory access: ability to faithfully reconstruct low- and high-level semantics - Challenges: - *Semantic gap*: instructions and memory accesses need to be mapped and associated to OS semantics (virtual machine introspection) - *Performance*: the emulator could potentially distinguish between trusted (e.g. kernel) and untrusted code, with trusted code executed in a virtualised fashion
Hierarchical Clustering
Sometimes it is desirable to obtain a hierarchical representation of data, with clusters and sub-clusters arranged in a tree-structured fashion. Methods: - agglomerative (i.e. bottom-up) - divisive (i.e. top-down) Hierarchical methods actually produce several partitions; one for each level of the tree. However, for many applications we will want to extract a set of disjoint clusters. In order to turn the nested partitions into a single flat partitioning, we cut the dendrogram, based on a cutting criterion that can be defined using a threshold
Non-parametric clustering
Steps: - Defining a measure of (dis)similarity between observations - Defining a criterion function for clustering - Defining an algorithm to minimize (or maximize) the criterion function
DREBIN
Supervised learning classifier which relies on a linear SVM to separate malicious and benign applications. It uses static binary features extracted from the Android .apk (8 classes of features in particular)
SVM
Support Vector Machines: supervised classification algorithm which is highly efficient (just a convex optimization problem) and generalizes well. Objective: Find an optimal hyperplane that segregates linearly separable data that maximizes the separation. Inputs: - Set of training samples with each sample containing the same number of features - For each of the training samples, the ground truth y which tells the class to which it belongs Outputs: a set of weights (one for each feature) whose linear combination predicts the value of y Formal optimisation problem (cf. image) In which: - f(x) = w^T * x + b, where w is the vector of feature weights, and b is the bias - R(f) is called regularization term, used to avoid overfitting (i.e. to avoid that the classifiers learns weight parameters that are too specific for the training data) - L(f , D) is the hinge loss function computed on the training data D - C is the trade-off hyper-parameter between loss and regularization
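The optimisation problem itself is only referenced as an image. Assuming it is the usual soft-margin linear SVM objective, with the symbols defined above and n training samples (x_i, y_i), it reads:

```latex
\min_{w,\,b} \; R(f) + C \, L(f, D)
= \min_{w,\,b} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f(x_i)\bigr),
\qquad f(x) = w^{T} x + b
```

The hinge term is zero for samples on the correct side of the margin (y_i f(x_i) ≥ 1) and grows linearly otherwise, so a larger C penalises training errors more heavily relative to the regulariser.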
Recall
TP / (TP + FN)
Precision
TP / (TP + FP)
Disassembly
Task consisting in taking a binary blob, setting code and data apart, and translating machine code to mnemonic instructions. With a disassembled program, we can: locate functions, recognize jumps, identify local variables, understand the program behaviour without running it. Issues: - Code and data in the same address space (how to distinguish them?) - Variable-length instruction - Indirect control transfer instructions - Basic blocks - At compile-time some information may disappear (e.g.: variable names, type information, macro & comments) - Identifying functions and function parameters
Code obfuscation
Techniques that preserve the program's semantics and functionality while, at the same time, making it more difficult for the analyst to extract and comprehend the program's structure
Attack Strategy
The attacker knowledge space Θ consists of: - the data D - the feature space X - The classification function f (including the train parameters). Depending on what the attacker knows (or does not), we will have different attack scenarios. Perfect knowledge: If the attacker knows θ = (D, X, f), then we say that the attacker has perfect knowledge of the system.
No Free Lunch Theorem
The best algorithm depends on the specific task
Centroid linkage
The dissimilarity between the centroid for cluster A (the mean vector) and the centroid for cluster B. It can result in undesirable inversions
Feature space
The general idea is that the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable
IRC Botnet infiltration
The infiltrator can connect to the C&C server as a bot and can see all the bots connected, he can also receive commands from the botmaster, and he has the ability to send commands to other bots
HTTP Botnet infiltration
The infiltrator can connect to the C&C server as a bot but he can't see the other bots connected, but the server-side infiltrator can
Underfitting
The model obtained by the ML algorithm can neither model the training data nor generalise. To avoid that, we look for the lowest error at which training and testing errors are similar, and avoid erroneous assumptions and overly simple models of the training dataset that do not generalise. E.g.: Robbers carry firearms
Overfitting
The model obtained by the ML algorithm is too suited to the training set and not general enough. It can be detected when training errors and test errors diverge too much. It can be avoided by avoiding modelling noise in the training dataset. E.g.: Robbers wear masks, operate after dark and flee in sedans
Pre-processing
The other part of the 2nd phase of the ML pipeline. Many ML models require datasets to be standardised. Otherwise: - The optimisation function used may behave sub-optimally if the data isn't normalised and scaled (e.g: between 0 and 1). - ML estimators may behave unexpectedly if the input data is not normally distributed. - Many elements of the objective function assume features are centred around 0 and variance in the same order. - If a feature has variance that is much larger than other features, it would skew the estimator in its favour.
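Two common ways to bring a feature column onto a comparable scale are sketched below: standardisation (zero mean, unit variance) and min-max scaling to [0, 1]. The function names and the sample column are assumptions, and both helpers assume the column is not constant.

```python
import math

def standardise(column):
    """Rescale a feature column to zero mean and unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def min_max_scale(column):
    """Rescale a feature column linearly so its values lie between 0 and 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

file_sizes = [100.0, 200.0, 300.0, 400.0]  # made-up raw feature values
z = standardise(file_sizes)
scaled = min_max_scale(file_sizes)
```

The scaling parameters (mean, standard deviation, minimum, maximum) should be computed on the training set only and then reused on the test set, otherwise information about the test data leaks into training.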
Virtualisation
The program P runs on actual hardware: - The virtualisation software (hypervisor) only controls and mediates the accesses of different programs (or different VMs) to the underlying hardware - The VMs are independent and isolated from each other - However, execution occupies actual physical resources and the hypervisor (and the malware analysis system) cannot run simultaneously—this hinders collections of detailed information about the execution of the monitored program - Hard to hide hypervisor to malicious code - However, execution is almost at native speed
Analysis tools
They can be run as an alternative to or alongside Cuckoo: - Kali Linux: network analysis & DNS server. - IDA Pro: industry-standard disassembler. • flow of the program; • list of imported Windows API functions; • list of strings found in the executable. - OllyDbg: Windows debugger. - PEview: provides an overview of a portable executable's sections (e.g. mismatch in size due to packing). - PEiD: to identify the use of packers. - Sysinternals Process Monitor: • displays a very detailed view of file, registry, and network operations made by running processes; • dropped files or autorun installs can be tracked. - Sysinternals Process Explorer: • a much more powerful version of the Windows task manager; • it adds support for viewing strings in memory, checking parents of processes, more powerful process termination; • and an integrity check for injected Windows processes like svchost. - RegShot: allows an analyst to take two snapshots of the registry, • before and after the execution of malware, • comparative summary that indicates the process's registry operations.
Dendrogram
Tree-structured graph used to visualise the result of a hierarchical clustering calculation. A fuse (= merge) is shown as a horizontal line connecting two clusters. The y-axis coordinate of the line corresponds to the (dis)similarity of the merged clusters. Set representation: {{x1, {x2, x3}}, {{{x4, x5}, {x6, x7}}, x8}} (however, the set representation can't express the quantitative information).
Torpig botnet
Trojan horse - Distributed via the Mebroot "malware platform" - Injects itself into 29 different applications as DLL - Steals sensitive information (passwords, HTTP POST data) - HTTP injection for phishing - Uses "encrypted" HTTP as C&C protocol - Uses domain flux to locate C&C server Mebroot - Spreads via drive-by downloads - Sophisticated rootkit (overwrites master boot record)
Spam
Unsolicited email
K-Fold cross-validation
Useful for hyper-parameter optimization. Warning: it's not suitable for timestamped samples as it may cause future samples to be included in the training datasets
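The warning can be checked mechanically. The sketch below builds plain k-fold splits over samples sorted by time and flags any fold whose training set contains samples newer than its oldest test sample. The helper names and the yearly timestamps are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for plain k-fold CV over n ordered samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def leaks_future(train, test, timestamps):
    """True if some training sample is newer than the oldest test sample."""
    return max(timestamps[i] for i in train) > min(timestamps[i] for i in test)

timestamps = [2015, 2016, 2017, 2018, 2019, 2020]  # one sample per year
folds = list(k_fold_indices(len(timestamps), 3))
leaky = [leaks_future(train, test, timestamps) for train, test in folds]
```

Only the last fold trains strictly on the past; the first two train on samples from 2017 onwards while testing on older ones, which is the temporal leakage the warning refers to. A time-aware evaluation keeps every training sample older than every test sample.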
Non-linear SVM
Useful when datasets aren't linearly separable (so a linear boundary in the original space would give noisy results): the approach is to map the data into a higher-dimensional space (e.g.: 2D) where it becomes separable
Cluster validity
Validity that is highly subjective in unsupervised learning in comparison to supervised learning where a clear objective function is known (e.g.: MSE). The choice of the (dis)similarity measure and criterion function will have a major impact on the final clustering produced by the algorithms
Thwarting recursive traversal
What can be exploited? - disassemblers assume that control transfer instructions behave reasonably (2 targets: the branch one and the fall through to the next instruction) - difficulty in identifying indirect control transfers. Techniques: - *branch functions* - *opaque predicates*: disguise an unconditional branch as a conditional branch that always goes in one direction, using predicates that always evaluate to either true or false - *jump table spoofing*: insert artificial jump tables to mislead the disassembler; the disassembler analyses jumps through the table to identify target addresses; the fake jump table entries target the locations of junk bytes
ptrace
It allows one process (the tracing process) to control another (the traced process). The ptrace() system call provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers. It is primarily used to implement breakpoint debugging and system call tracing.