Code obfuscation is a staple tool in malware creation, in which code fragments are altered substantially to make them appear different from the original while leaving their semantics unchanged. Most obfuscated code detection methods use program structure as a signature for detecting unknown code. When matching two code fragments or programs to decide whether one is an obfuscated version of the other, they usually ignore the most important feature: the semantics of the code. Obfuscated code detection is a special case of the semantic code clone detection task. We propose a machine learning framework for detecting both obfuscated Java code and Java code clones. We use features extracted from Java bytecode dependency graphs (BDGs), program dependency graphs (PDGs), and abstract syntax trees (ASTs). BDGs and PDGs are two representations of the semantics, or meaning, of a Java program, while ASTs capture its structural aspects. We validate the effectiveness of our framework on several publicly available code clone and obfuscated code datasets, and we evaluate the detection quality of the proposed model using several assessment metrics. Experimental results show that the framework performs very well compared with contemporary obfuscated code and code clone detectors.
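As an illustration of what "different in appearance, identical in semantics" means, the following minimal Java sketch pairs a readable method with a hand-obfuscated variant. It is not taken from the framework or its datasets: the class name `ObfuscationDemo`, the method names, and the particular transformations (identifier renaming, loop reversal, and an always-false guard around dead code) are assumptions chosen only for illustration.

```java
public class ObfuscationDemo {

    // Original, readable version: sum of the squares 1^2 + ... + n^2.
    public static int sumOfSquares(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i * i;
        }
        return total;
    }

    // Hand-obfuscated variant (hypothetical): meaningless names, a
    // count-down while loop, and a guard that can never be true.
    public static int a(int b) {
        int c = 0;
        int d = b;
        while (d > 0) {
            // d*d + d equals d*(d+1), which is always even,
            // so this branch is dead code that only adds noise.
            if ((d * d + d) % 2 != 0) {
                c -= 7;
            }
            c = c + d * d;
            d--;
        }
        return c;
    }

    public static void main(String[] args) {
        // Print both results side by side for a few inputs.
        for (int n = 0; n <= 5; n++) {
            System.out.println(n + ": " + sumOfSquares(n) + " vs " + a(n));
        }
    }
}
```

The two methods differ in identifiers, control flow, and syntax tree shape, yet they return the same value for every input. A purely structural signature would likely treat them as unrelated, which is the motivation for also using semantic representations.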
Abdullah Sheneamer received the BSc degree in Computer Science from King Abdulaziz University, Saudi Arabia, in 2008, and the MSc degree in Computer Science from the University of Colorado at Colorado Springs, USA, in 2012. He joined the University of Colorado at Colorado Springs as a graduate student in 2013. His research interests include data mining, machine learning, and software engineering. His current work focuses on software clone and code obfuscation detection using machine learning approaches.