Digital Forensics and Cyber Crime. Fifth International Conference, ICDF2C 2013, Moscow, Russia, September 26-27, 2013, Revised Selected Papers

Research Article

Identifying Forensically Uninteresting Files Using a Large Corpus

  • @INPROCEEDINGS{10.1007/978-3-319-14289-0_7,
        author={Neil Rowe},
        title={Identifying Forensically Uninteresting Files Using a Large Corpus},
        proceedings={Digital Forensics and Cyber Crime. Fifth International Conference, ICDF2C 2013, Moscow, Russia, September 26-27, 2013, Revised Selected Papers},
        proceedings_a={ICDF2C},
        year={2015},
        month={2},
        keywords={Digital forensics, Metadata, Files, Corpus, Data reduction, Hashes, Triage, Whitelists, Classification},
        doi={10.1007/978-3-319-14289-0_7}
    }
    
  • Neil Rowe
    Year: 2015
    Identifying Forensically Uninteresting Files Using a Large Corpus
    ICDF2C
    Springer
    DOI: 10.1007/978-3-319-14289-0_7
Neil Rowe 1,*
  • 1: U.S. Naval Postgraduate School
*Contact email: ncrowe@nps.edu

Abstract

For digital forensics, eliminating the uninteresting is often more critical than finding the interesting. We define “uninteresting” as containing no useful information about users of a drive, a definition which applies to most criminal investigations. Matching file hash values to those in published hash sets is the standard method, but these sets have limited coverage. This work compared nine automated methods of finding additional uninteresting files: (1) frequent hash values, (2) frequent paths, (3) frequent filename-directory pairs, (4) unusually busy times for a drive, (5) unusually busy weeks for a corpus, (6) unusually frequent file sizes, (7) membership in directories containing mostly known files, (8) known uninteresting directories, and (9) uninteresting extensions. Tests were run on an international corpus of 83.8 million files. After removing the 25.1% of files with hash values in the National Software Reference Library, an additional 54.7% were eliminated that matched two of our nine criteria; few of the hash values of these files were in two commercial hash sets. False negatives were estimated at 0.1% and false positives at 19.0%. We confirmed the generality of our methods by showing a good correlation between results obtained separately on two halves of our corpus. This work provides two kinds of results: 8.4 million hash values of uninteresting files in our own corpus, and programs for finding uninteresting files on new corpora.
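
To make the elimination idea concrete, the following is a minimal Python sketch, not the author's programs. It covers only three of the nine criteria (frequent hash values, frequent paths, uninteresting extensions) and flags a file once it meets a minimum number of them. The record layout `(drive_id, path, hash)`, the function names, the `min_drives` threshold, and the reading of "matched two of our nine criteria" as "at least two" are all assumptions made for illustration.

```python
import os
from collections import defaultdict


def values_on_many_drives(pairs, min_drives):
    """Return the values seen on at least `min_drives` distinct drives."""
    drives = defaultdict(set)
    for drive_id, value in pairs:
        drives[value].add(drive_id)
    return {v for v, seen in drives.items() if len(seen) >= min_drives}


def flag_uninteresting(records, known_hashes, boring_extensions,
                       min_drives=2, min_criteria=2):
    """Flag records (drive_id, path, hash) that meet enough criteria.

    All thresholds here are hypothetical; the abstract does not state them.
    """
    # Step 1: drop files whose hash is already in a published hash set.
    remaining = [r for r in records if r[2] not in known_hashes]

    # Step 2: find hash values and paths that recur across drives.
    frequent_hashes = values_on_many_drives(
        ((d, h) for d, _p, h in remaining), min_drives)
    frequent_paths = values_on_many_drives(
        ((d, p.lower()) for d, p, _h in remaining), min_drives)

    # Step 3: count how many criteria each remaining file satisfies.
    flagged = []
    for drive_id, path, digest in remaining:
        extension = os.path.splitext(path)[1].lower()
        hits = sum([
            digest in frequent_hashes,
            path.lower() in frequent_paths,
            extension in boring_extensions,
        ])
        if hits >= min_criteria:
            flagged.append((drive_id, path, digest))
    return flagged
```

Requiring more than one criterion is a simple guard against a single weak signal, such as a common extension, eliminating a file that is otherwise unique to one drive.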