Identifying forensically uninteresting files in a large corpus

For digital forensics, eliminating the uninteresting is often more critical than finding the interesting, since there is so much more of it. Published software-file hash values like those of the National Software Reference Library (NSRL) have limited scope. We discuss methods based on analysis of file context using the metadata of a large corpus. Tests were done with an international corpus of 262.7 million files obtained from 4018 drives. For malware investigations, we identify clues to malware in context, and show that using a Bayesian ranking formula on metadata can increase recall by 5.1 times while increasing precision by 1.7 times over inspecting executables alone. For more general investigations, we show that using two of nine criteria for uninteresting files together, with exceptions for some special interesting files, can exclude 77.4% of our corpus instead of the 23.8% excluded by NSRL. For a test set of 19,784 randomly selected files from our corpus that were manually inspected, false positives after file exclusion (interesting files identified as uninteresting) were 0.18% and false negatives (uninteresting files identified as interesting) were 29.31% using our methods. The generality of the methods was confirmed by separately testing two halves of our corpus. Few of our excluded files were matched in two commercial hash sets. This work provides both new uninteresting hash values and programs for finding more.


Introduction
As digital forensics has grown, larger and larger corpora of drive data have become available. To speed subsequent processing, it is essential that the drive triage process first eliminate from consideration those files that are clearly unrelated to an investigation [13]. This can be done either by directly eliminating files from drive images or by removing their indexing. We define as "uninteresting" those files whose contents do not provide forensically useful information about usage of a drive, in the form of either user-created or user-discriminating information. Mostly these are operating-system and applications-software files plus common Internet downloads. (Metadata on uninteresting files may still be interesting, as in indicating time usage patterns.) This definition applies to most criminal investigations and data-mining tasks. It can be further refined for malware investigations, where the "user" of interest is the malware.

We can confirm that files are uninteresting by opening and inspecting them. Additional files may also be uninteresting depending on the type of investigation, such as medical records in an investigation of accounting fraud. Uninteresting files usually comprise most of a drive, so eliminating them significantly reduces the size of the investigation. Unfortunately, uninteresting files occur in many places on a drive, so finding them is not always straightforward.
Most decisions about interestingness can be made from file-directory metadata without examining file contents. That is important because directory metadata requires roughly 0.1% of the storage of the file contents. Directory metadata can provide the name of a file, its path, its times, and its size, and this can give us a good idea of the nature of a file [1]. We also include with this metadata the hash value computed on the contents of the file, which enables recognition of file copies. Forensic tools like SleuthKit routinely extract directory metadata and hash values from drive images.
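As a rough illustration of how cheap such metadata is to gather, the following sketch collects the fields mentioned above from a mounted file system (the function name and record fields here are illustrative, not those of any forensic tool; SleuthKit and Fiwalk produce much richer DFXML records):

import hashlib
import os

def harvest_metadata(root):
    """Walk a mounted file system and collect the triage metadata
    discussed above: path, size, times, and a content hash."""
    records = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                md5 = hashlib.md5()
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(65536), b""):
                        md5.update(block)
            except OSError:
                continue  # unreadable or vanished file
            records.append({
                "path": path,
                "size": st.st_size,
                "mtime": st.st_mtime,      # modification time
                "md5": md5.hexdigest(),    # enables copy recognition
            })
    return records

Hashing dominates the cost here; the directory metadata alone can be gathered without reading file contents at all.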
We can generally eliminate files whose hash values match those in published "whitelisting" sets [8]. However, published hash values miss many kinds of files. This paper discusses methods for improving this performance by additional filtering based on analysis of a large corpus of drives, in particular by correlating files across it. This provides both a new set of hash values and new methods for finding them.

Previous work
The standard forensic approach today is to eliminate from consideration those files whose hash values match those in the Reference Data Set of the National Software Reference Library (NSRL-RDS) from the U.S. National Institute of Standards and Technology (NIST). The quality of the data provided in the NSRL is high [10]. However, tests found that it did not provide much coverage [16]. Fewer than one file in four in our international corpus appeared in the NSRL, and there were surprising gaps in the coverage of well-known software. In part this is due to NIST's usual approach of purchasing software, installing it, and finding hash values for the files left on a drive. This will not find files created only during software execution, most Internet downloads, and user-specific configuration files. Furthermore, the fraction of files recognized by NSRL on a typical drive is decreasing as storage capacity increases. To fill the gap, commercial vendors like bit9.com and hashsets.com sell additional hash values beyond NSRL.
The work [4] investigates the problem of recognizing uninteresting files and suggests that pieces of files need to be hashed separately, a technique that considerably increases the workload. The work [19] details efficient methods for indexing and matching hash values found on files. Many of the issues are similar to the important problems of file deduplication [12] and file-existence checking [20], for which file hashes are useful. Analogous work has examined elimination of uninteresting network packets from analysis [6].
The work [19] investigated methods for improving a hash set of uninteresting files by using locality and time of origin to rule out portions of the hash values in the NSRL, and their experiments showed they could reduce the size of the hash set by 51.8% without significantly impacting performance. They also identified as uninteresting those files occurring on multiple drives, similarly to [16]. Their experiments were based on fewer than one million files, a weakness since files in cyberspace are highly varied. A more serious weakness is that they used human expertise to provide guidance in indicating uninteresting files, and then trained a model. This seems risky because it may miss forensic evidence that is atypical or unanticipated. Legal requirements also often dictate that forensic evidence be complete, in which case elimination of forensic evidence must be done by better-justified methods than heuristic ones.

Experimental setup
The experiments reported here, except for some in section 5.6, were done with a corpus assembled in January 2015. It consisted of 4018 drives with 262.7 million files having 35.80 million distinct hash values. It included the January 2015 version of the Real Drive Corpus [5] (3397 drives and 104 million files purchased as used equipment), supplemented with files from classroom and general laboratory computers at our school (157 drives and 126 million files) and miscellaneous sources including our laboratory (464 drives and 33 million files). The school computers were centrally managed and had much software in common, thus providing data representative of large organizations. The miscellaneous sources included files extracted from compressed archives in the Real Drive Corpus, including ZIP, GZIP, RAR, and CAB formats.
We extracted directory metadata with SleuthKit and the Fiwalk tool for the non-school drives, and with our own extraction programs calling upon the operating system for the school drives. All these drives had normal users, and we saw little concealment or camouflage on them. Thus hash values on their contents should not show any manipulation, an issue important in some forensic applications [7]. We still checked, however (see Table 7).
We also obtained the April 2015 version of the NSRL-RDS from www.nsrl.nist.gov. Our malware work used SHA-1 hash values and our general-file work used MD5 hash values. Both are widely used and are catalogued for the NSRL.
The programs reported here were implemented in Python 3 with only default packages.

Finding uninteresting files in malware investigations
For malware investigations, uninteresting files are those neither containing malware nor affected by malware. A sufficient condition for most files is that their hash values are unmodified from their initial values when the file was installed. But this can entail looking up a large number of hash values, and there are many nonmalicious reasons to change a file's contents. Thus it is valuable to have more specific criteria for when a file is worth checking. Although there has been much work on malware detection [3,9,11], it is almost entirely focused on analysis of file and packet contents, and methods that examine the smaller amount of metadata and hashes could be a useful first step.

Testing malware clues
The following five methods were used to identify malware in our corpus [17]:
• Files in our corpus whose SHA-1 hash values were tagged as "threats" in the database of the Bit9 Forensic Service (www.bit9.com).
• Files in our corpus whose computed hash values matched those of malicious software in the Open Malware corpus (oc.gtisc.gatech.edu:8080) of about 3 million files.
• Files in our corpus whose computed hash values matched those of malicious software in the VirusShare database (virusshare.com) of about 18 million files, after mapping its MD5 hash values to SHA-1.
• Files identified as threats by Symantec antivirus software (www.symantec.com/endpoint-protection) in a sample of files extracted from the corpus. The sample was downloaded to a home computer with the antivirus software installed, and every file that Symantec complained about was recorded. Only a sample could be tested because the corpus is too big to store online and extraction of files is time-consuming. The sample included about 300,000 random files plus 30,000 embedded files of type zip, gzip, cab, 7z, and bz2, chosen because of their higher fraction of malware. Also included were 7,331 files from the Open Malware corpus whose hash values matched those of our corpus files, of which only 721 were flagged as malicious by Symantec.
• Files identified as threats by ClamAV open-source antivirus software (www.clamav.net) in the same sample of files tested by Symantec.
In all, 398,949 distinct hash values of malware were found among the 31 million distinct hash values in our 2015 corpus. Bit9 identified 238,704, Open Malware matched 4,786, VirusShare matched 145,449, Symantec identified 1,401, and ClamAV identified 877. Surprisingly, there was little overlap between the malware identified by the five methods.
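Identification by hash matching is essentially set-membership testing. A minimal sketch, assuming each threat list has been exported as a text file with one SHA-1 value per line (the file names and format here are assumptions):

def load_hash_set(path):
    """Read one hash value per line into a set, normalizing case."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

# Hypothetical exports of the five threat lists.
threat_sets = {name: load_hash_set(name + ".txt")
               for name in ("bit9", "openmalware", "virusshare",
                            "symantec", "clamav")}

def malware_labels(sha1):
    """Return the names of the threat lists that flag this hash value."""
    h = sha1.lower()
    return [name for name, hashes in threat_sets.items() if h in hashes]

The small pairwise intersections of these sets, computed the same way, exhibit the limited overlap noted above.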
For testing, we created a control set from a random sample of 303,322 distinct hash values of files from our 2015 corpus, minus those that appeared in any of the malware sets. While this did not exclude unrecognized malware, the low frequency of recognized malware suggests that unrecognized malware was unlikely to have much statistical effect on the comparison results. We used a taxonomy of extensions, top-level directories, and immediate directories that we have been developing [16].
Table 1 shows the results of testing a variety of possible metadata clues to malware. Only clues with some observed promise are shown. The quantity listed is the number of standard deviations by which the occurrence of malware exceeded the expected value, 0.0013 (the fraction of malware in the corpus) times the size of the sample. The count used was the number of distinct malware hash values associated with the clue, since we saw drives where the same malware hash value occurred in hundreds of files it had infected. The five malware-identification methods clearly seem to be addressing different kinds of files, consistent with the results of [11] on a larger number of malware-detection methods but fewer files.

Taking as valid those clues occurring more than two standard deviations in the same direction on at least three of the five methods, the positive clues were: files whose size had a natural logarithm of more than 15, files at the top level of the directory hierarchy, deleted files (not helpful because many were deleted by anti-malware software), files where the file-extension category was incompatible with the type based on the file's header and other "magic numbers", files created at odd creation times for their directory, files with single-occurrence hash values, files with unusual characters in their paths, executables, files related to hardware, temporary files, and files not in major categories. The negative clues were: files at level 10 or more in the file hierarchy, double extensions, files with no extension, video extensions, engineering-related extensions, game top-level directories, operating-system immediate directories, backup immediate directories, and data-related immediate directories.
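The significance test described above is straightforward to compute under a binomial model. A sketch, using the 0.0013 malware fraction quoted above (the function name is illustrative):

import math

def clue_deviation(n_malware_with_clue, n_with_clue, p_malware=0.0013):
    """Number of standard deviations by which the count of distinct
    malware hash values showing a clue exceeds its expected value,
    assuming a binomial model for clue occurrence."""
    expected = p_malware * n_with_clue
    sigma = math.sqrt(n_with_clue * p_malware * (1.0 - p_malware))
    return (n_malware_with_clue - expected) / sigma

# A clue was accepted when it exceeded 2 standard deviations in the
# same direction for at least three of the five methods.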
One surprising result was that the number of drives on which malware occurred could be considerable (Figure 1). One malware item occurred on 296 drives in our corpus, and many other kinds of malware occurred on 10 or more drives.
This result challenges the notion of using "reputation" as a factor in discovering possible new malware, since usually reputation is estimated as the number of places in which something occurs.

Building a better quick scan
These results can reduce the time to find malware on a system. Malware could hide anywhere, but our conditional probabilities enable us to rank its likelihood from context so we can try the most likely places first. This is useful in designing "quick scans" for malware in which we only check part of a drive.
To compute the odds for each clue, the set of hash values in our corpus was split randomly. Files were found corresponding to the two half-sets of hash values, about 124 million files each. Conditional probabilities for the clues discussed above were calculated and converted into odds for one half of the corpus. Additional clues tested were the actual file extension, top-level directory, bottom-level directory, and file name. Clues relating to the times of the file were excluded, however, because prediction is the goal and there is no guarantee that current time patterns will recur. Clues were only included if they occurred at least R times and were significant at a level greater than 2.0 standard deviations above or below the expected value. Clues were then tested for each file in the other half of the corpus.

Assessment was by a normalization of the Naïve Bayes odds formula for a file presenting m significant clues C_1, ..., C_m:

o(M \mid C_1 \wedge C_2 \wedge \cdots \wedge C_m) = o(M) \left[ \prod_{i=1}^{m} \frac{o(M \mid C_i)}{o(M)} \right]^{1/m}

Here o means odds, M means "file was malicious", and C means a clue. Odds for a single clue were calculated with Laplace-smoothing constant K:

o(M \mid C) = \frac{n(M \wedge C) + K \, o(M)}{n(O \wedge C) + K}

Here n means a count and O means "file is nonmalicious". Normalization was necessary because files varied in the number of significant clues they presented. Two constants R and K need to be optimized: R is the threshold for reliable counts on clues, and K represents the "background noise" of the clue. We did experiments on a different random sample of 30% of our corpus to vary R and K and measure the F-score (Table 2). There was not much variation in effect, but the best values appeared to be R=15 and K=30, and these were used in subsequent experiments.

To test the ability to rank malware, 100 evenly spaced threshold values on the combined odds were chosen, and recall (the fraction of malware over the threshold) and precision (the fraction of files over the threshold that were malware) were calculated. Recall is important because a high value reduces the need for and rate of full malware scans, but precision is important too since a low value requires more files to be scanned unnecessarily. F-score is the classic way to trade them off. Malware was defined by our consensus list of malicious hash values, the union of the results of the five malware-identification methods.
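A minimal sketch of this combination, assuming the geometric-mean normalization and smoothing form shown above (names and the exact formula details are our reading, not a definitive implementation):

import math

def smoothed_clue_odds(n_mal_with_clue, n_nonmal_with_clue,
                       prior_odds, K=30):
    """Laplace-smoothed odds that a file is malicious given one clue;
    K pseudo-counts at the prior odds damp rarely seen clues."""
    return (n_mal_with_clue + K * prior_odds) / (n_nonmal_with_clue + K)

def combined_odds(per_clue_odds, prior_odds):
    """Normalized Naive Bayes combination: the geometric mean of the
    per-clue odds ratios keeps files with many clues comparable to
    files with few."""
    m = len(per_clue_odds)
    if m == 0:
        return prior_odds
    log_ratio_sum = sum(math.log(o / prior_odds) for o in per_clue_odds)
    return prior_odds * math.exp(log_ratio_sum / m)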
We conducted this experiment three times on three random partitions of our corpus (with a total of 612,818 instances of malware and 128,776,919 instances of nonmalware for training), using one half for training and one half for testing. The recall values were 0.343, 0.305, and 0.333; the precision values were 0.213, 0.211, and 0.211; and the resulting F-scores were 0.263, 0.249, and 0.259. So there was not much variation in the results, and this supports the generality of our corpus for training purposes. But if one is willing to accept a much lower precision of 0.010 with our methods, we can obtain a better recall in finding malware of 0.650. By comparison, selecting only the executable files gave 0.005 precision (for 22,940,397 executables total) and 0.190 recall (for 116,235 malicious executables), for an F-score of 0.0097. Hence our methods give 5.1 times better precision with 1.7 times better recall over inspecting executables alone. Similarly, selecting only the files in operating-system top-level directories gave 0.003 precision and 0.189 recall, and selecting only the files in applications top-level directories gave 0.00031 precision and 0.056 recall, so searching for malware in particular directories is an even poorer strategy. A possible objection is that malware in executables, the operating system, and applications directories is more serious than malware in other places, but this is questionable since malware loads from many kinds of files today.
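The threshold sweep used in this evaluation can be sketched as follows, where "scored" pairs each file's combined odds with its consensus malware label (names are illustrative):

def precision_recall_sweep(scored, n_thresholds=100):
    """scored: list of (odds, is_malware) pairs. Returns
    (threshold, precision, recall) for evenly spaced thresholds
    over the observed odds range."""
    odds = [s for s, _ in scored]
    lo, hi = min(odds), max(odds)
    total_malware = sum(1 for _, m in scored if m)
    results = []
    for i in range(n_thresholds):
        t = lo + (hi - lo) * i / (n_thresholds - 1)
        over = [m for s, m in scored if s >= t]
        hits = sum(over)
        precision = hits / len(over) if over else 0.0
        recall = hits / total_malware if total_malware else 0.0
        results.append((t, precision, recall))
    return results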
Our clues are straightforward to compute; they can be computed once when a drive is set up, then recalculated whenever a file changes. Note that they are significantly faster to obtain than file signatures because most involve only metadata, with only a few clues requiring computation of a hash value on a file, something often computed routinely in investigations.

Finding uninteresting files in standard investigations
In standard criminal investigations, files can be judged interesting or uninteresting by a much wider range of criteria. However, the criteria can be considerably stronger than with malware; for instance, a file that occurs on 100 different drives is unlikely to provide the evidential specificity to help in a criminal investigation. Again we use the definition that uninteresting files contain neither user-created nor user-discriminating data.

Proposed uninteresting-file identification methods
Nine methods to identify uninteresting files, and from them their hash values, were investigated, as summarized in Table 3. Parameters of these methods were set by the experiments reported in section 5.4. The methods were:
• HA, frequent hashes: files on many different drives with the same hash value on their contents. Hash values that occur on only two drives in a corpus could suggest sharing of information between investigative targets, but hash values occurring more often are likely to be distributions from a central source and are unlikely to be forensically interesting (see the sketch following this list).
An example in our corpus was C161336552062A51C5130ECAB3F59BF3, which occurred on five drives as Documents and Settings/Administrator/Local Settings/Temporary Internet Files/Content.IE5/ZBX73TSW/tabs.
• TM, files with clustered creation times: files created within a short period of time on the same drive. Such clusters suggest automated copying from an external source, particularly if the rate of creation exceeded human limits. An example from our corpus was seven files created on one drive within one second under the directory Program Files/Adobe/Adobe Flash CS3/adobe_epic/personalization: pl_PL, pl_PL/., pl_PL/.., pt_BR, pt_BR/., pt_BR/.., and pt_PT. All were 56 bytes, and two hash values were not in NSRL. Creation times are more useful than access and modification times because they are often installation times.
• WK, files created in busy weeks: files created unusually frequently in particular weeks across all drives, which suggests software updates. A period of a week is appropriate since it takes several days for most users to download an update. Figure 2 shows some example sharp peaks in a typical distribution of creation times by week in our corpus. We first find "busy" weeks, then "busy" directories (full path minus the file name) in the busy weeks, those whose frequency of file creation was a threshold number of times greater than their average rate of creation per week. The hash values for files in those busy directories during those busy weeks are then proposed as uninteresting.
The remaining six methods (PA=paths, SZ=size, BD=bottom-level directory, CD=directory context, TD=top-level directory, EX=extension, following the abbreviations of Table 9) are summarized in Table 3.
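A sketch of the HA method mentioned first in this list (the record format and the mindrives value are illustrative; mindrives is tuned as in section 5.4):

from collections import defaultdict

def frequent_hashes(file_records, mindrives=8):
    """HA method: return hash values seen on at least mindrives
    distinct drives. file_records is an iterable of (drive_id, md5)
    pairs; mindrives=8 is illustrative and should scale with the
    number of drives in the corpus."""
    drives_per_hash = defaultdict(set)
    for drive_id, md5 in file_records:
        if md5:  # skip files with missing hashes
            drives_per_hash[md5].add(drive_id)
    return {h for h, d in drives_per_hash.items() if len(d) >= mindrives}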
13.6% of the files in the corpus were identified as deleted. Because [16] found that deleted files can have corrupted metadata, we excluded them as sources of new hash values, with three exceptions. Directories were often missing for deleted files, and we even saw inconsistencies in the sizes reported by SleuthKit for deleted files. However, the same hash value or the same path appearing repeatedly is unlikely to be a coincidence even if the files were all deleted, and long directory paths are likely correct, so we ignored deletion status in methods HA, PA, and TD.

Files for which drive acquisition could not find hash values can also be eliminated on some criteria; missing hash values can occur for deleted files with incomplete metadata. For those we applied the BD, TD, and EX criteria to individual file records, and eliminated files matching at least two criteria. HA will not work with missing hash values; PA will not work when the front portions of deleted files' paths are missing, as they frequently are; TM and WK will not work with unreliable timestamps; SZ will not work with unreliable file sizes; and CD will not work when whole directories are deleted.

We also eliminated default-directory files in this final filtering.

Identifying explicitly interesting files
Since reducing false positives (interesting files incorrectly identified as uninteresting) can be very important in a forensic investigation, despite our low observed rates, we investigated six methods to explicitly identify potentially interesting files, some of which were studied in [18]. The first five look for clues of deliberate concealment [2], a definite possibility in many criminal investigations. Several software packages also claim to find anomalous files (e.g. Redwood, try.lab41.org), and related work has addressed finding traces of intrusions in storage [14].
(i) Files with hash values that occurred only once in our corpus where the same path occurs many times with a predominant different hash value. These could be deliberately camouflaged files.
(ii) Hashes that occurred just once in our corpus under a different file name than the predominant one for that hash. These could be deliberate renamings.
(iii) Files created in atypical weeks for the predominant week of their directory. These could be deliberately hidden files.
(iv) Files whose extension is inconsistent with the type assigned by "magic-number" header-tail analysis of the file contents, such as by the Linux "file" utility. We used our extension taxonomy to classify extensions, and built a mapping from magic-number descriptions to the extension taxonomy classes (2,820 mappings were necessary) for comparison. Mismatches could be attempts at camouflage (see the sketch below).
(v) Files whose hashes had inconsistent sizes. These were always associated with faulty information about deleted files in our corpus, but could also indicate deliberate attempts at concealment in other corpora.
(vi) Files with directories or extensions (when not in software directories) that are flagged explicitly as interesting. Directory examples are directories for encryption, disk wiping, and hacking tools. Extension examples are JPG and HTM when not in software directories.
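For method (iv), a sketch using the Linux "file" utility mentioned above; the two small mapping tables are illustrative stand-ins for our extension taxonomy and its 2,820 magic-number mappings:

import os
import subprocess

# Illustrative stand-ins for the extension taxonomy and for the
# mappings from magic-number descriptions to taxonomy classes.
EXTENSION_CLASS = {".jpg": "image", ".exe": "executable", ".txt": "text"}
MAGIC_CLASS = {"JPEG image data": "image",
               "PE32 executable": "executable",
               "ASCII text": "text"}

def extension_mismatch(path):
    """Method (iv): flag a file whose extension class disagrees with
    the class implied by the "file" utility's content description."""
    ext_class = EXTENSION_CLASS.get(os.path.splitext(path)[1].lower())
    desc = subprocess.run(["file", "--brief", path],
                          capture_output=True, text=True).stdout
    magic_class = next((cls for prefix, cls in MAGIC_CLASS.items()
                        if desc.startswith(prefix)), None)
    return (ext_class is not None and magic_class is not None
            and ext_class != magic_class)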

Coverage and redundancy of the hash sets
We inspected the hash values in our corpus that matched the NSRL RDS and concluded they were highly reliable: in random samples we saw no interesting files incorrectly included. This makes sense because the collection technique of the NSRL (buying the software, installing it, and inspecting its files) is highly reliable at identifying forensically uninteresting files. So we eliminated files whose hash values matched NSRL entries as a first step. This reduced the number of distinct hashes from 35.80 million to 33.42 million.
At this point any further sets of uninteresting hash values can be applied, including the hashes of eliminated files from previous runs of our software. But in the experiments here, the methods described in sections 5.1 and 5.2 were applied to the full remaining data. To analyse the coverage and redundancy of the hash sets identified as uninteresting, we computed the sizes of their intersections (Table 4). The code labels are defined in section 5.1. It can be seen that there is some overlap but not a large amount, which argues for using all the methods together.

To investigate the accuracy of our methods, we constructed a test set of 19,784 files by randomly sampling our corpus after elimination of all files in the NSRL. We laboriously inspected the metadata of the test set, and identified 18,223 of these as definitely uninteresting for all but special kinds of forensic investigations. We manually inspected the contents of as many as we could of the remaining 1,562 (some drive images were faulty), using associated software where possible, such as image viewers for pictures and document viewers for documents, and a hexadecimal editor for the remainder. This process identified 270 more uninteresting files, for a final total of 1,292 interesting or possibly interesting files in the test set (6.53%) and 18,492 definitely uninteresting files. Some files were encoded and unclear in function, so the 1,292 could well have included more uninteresting files.

Our software eliminated all but 5,658 of the test set as uninteresting using what we determined to be optimal parameter settings. After all filtering and adding back of interesting hash values, 23 of the 14,126 eliminated files were actually or possibly interesting (false positives) according to our manual inspection, for a precision of 0.9982 for eliminated files. 5,396 of the 6,751 uneliminated files were actually uninteresting (false negatives), for a recall of 0.7069 for eliminated files. Precision here is considerably more important than recall, since mistaken elimination of files removes them from an investigation whereas mistaken failure to eliminate files just increases the workload of the investigator a little, so the high observed precision is very encouraging. The 23 false positives included a frequently seen government retirement guide (whose presence could indicate a desire to retire), a cache file for an accounting system (but it could be a default), a geographical-information cache (also possibly a default), a document that was part of a large library downloaded at one time, and 11 unidentifiable deleted files that were missing directory information. The 5,396 false negatives included software installed in atypical places rather than "Program Files" and "Applications", as well as documents and Web pages associated with software.
Table 5 shows the results of tests on the precision and recall of each of our methods for identifying uninteresting files, varying the key parameters to see their effects. Measurements were made with the list of hash values delivered by each method. Here "mindrives" is the minimum number of drives on which the data occurs, "segcount" is the number of segments on the right end of the path that define the immediate directory, "weekmult" is how many times busier a week must be than the average week, "pathmult" is the minimum number of times of occurrence more than the typical path, "mindev" is the minimum number of standard deviations more than the mean number of occurrences that is required, "mindircount" is the minimum-size directory examined, and "fracmin" is the minimum fraction of the files in the directory that are already known to be uninteresting.
Note that the "mindrives" parameter should be proportional to the number of drives in a corpus.Our corpus had 4018 drives, so mindrives should be multiplied by the number of drives in a new corpus divided by 4018.The other parameters do not need such adjustment.
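For example, if a mindrives threshold of 8 (an illustrative value) were tuned on our 4018-drive corpus, a new corpus of 1,000 drives would use a threshold of about 8 × 1000/4018 ≈ 2.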
It can be seen that false positives and false negatives trade off consistently as parameters vary, so the best set of parameters reflects the relative weight of precision to recall. It is impossible to eliminate false negatives completely for most corpora because there are several legitimate reasons for them:
• For HA, PA, and BD, some files that appear frequently may still be interesting, for instance documents explaining how to make a bomb on drives obtained from criminals.
• For WK and SZ, interesting files may be loaded by coincidence during busy weeks or coincidentally have the same size as standard formats.
• For TM and CD, interesting files may occur in the midst of mostly uninteresting files by software design.
• For TD and EX, users may try to deceive investigators by camouflaging files. However, such deception can be detected by other methods.
Table 6 shows the effects of varying the required number of clues to identify a file as uninteresting, before we removed the explicitly interesting files from the hash sets. Requiring two clues appears to be best, since recall decreases significantly for larger numbers with little effect on the near-perfect precision. Higher false positives could be acceptable in a preliminary investigation of a corpus, but might be unacceptable in a criminal investigation because of the possibility of excluding potential key evidence in a case.

Table 6 also shows the effect of choosing all low values for the parameters, all intermediate ("medium") values, and all high values. This tested the synergistic effect of having multiple clues. We further tested the effects of requiring 2 out of 8 of the clues, excluding each clue in turn, but it made little difference to the precision while decreasing the recall significantly. The last row gives statistics for the final result, after adding the uninteresting hashless files and default-directory files such as "..", and then subtracting the interesting files as discussed in the next section.

Accuracy of our methods of finding interesting hash values
Table 7 shows results on the accuracy of the six methods for detecting interesting hash values. Their performance was poor with the exception of the last method. For these experiments, true positives were defined as files identified as interesting both in our test set and by our methods, false positives were files identified as interesting only by our methods, and false negatives were files identified as interesting only in our test set. Here "mincount" is the minimum size of the directory considered, and "minfrac" is the minimum fraction of the most common hash or path. Results are likely poor due to the minimal concealment in our data. We used only the hash values from the last method in the final results reported here. However, files obtained during criminal and intelligence investigations are more likely to show concealment, and such files could be critical in an investigation. It is thus recommended that all these interesting-file methods be run whenever the uninteresting-file methods are.

Additional analysis
Overall statistics on file and hash eliminations are given in Table 8. The full run on our corpus files with the optimal parameters took 2748 minutes (just short of two days) as a single process on a single Red Hat Enterprise Linux 7 system. This is a one-time cost that does not affect the speed of online usage of the hash values. The test machine had 515 gigabytes of main memory shared with other researchers, and we found we had to limit main-memory usage to 200 gigabytes by splitting files as necessary. Input was DFXML-format metadata (as defined at www.nsrl.nist.org). Much of the processing could be partitioned across separate processors to decrease completion time if desired.

We compared the types of files before and after filtering out uninteresting files, using our aforementioned taxonomy on file extensions. Operating-system files went from 6.1% to 1.0%, installation and update files from 1.5% to 0.4%, executables from 10.9% to 3.3%, program source (including scripts) from 9.3% to 7.0%, XML from 3.0% to 2.2%, integer extensions from 0.8% to 0.4%, indices from 4.4% to 0.1%, and games from 1.5% to 0.1%. Camera images increased from 3.6% to 10.3%, Web files and links from 9.6% to 12.2%, documents from 2.9% to 6.8%, audio from 1.4% to 2.8%, video from 0.3% to 1.1%, email and messaging from 0.1% to 0.4%, temporaries from 0.9% to 1.6%, logs from 0.3% to 0.7%, database-related files from 0.4% to 0.7%, geography-related files from 0.1% to 0.3%, and extensionless files from 9.7% to 16.8%. All other file-type percentages did not change significantly, including notably graphics, configuration files, copies, security-related files, and files with multiuse extensions. Note that a hash value appearing with multiple paths was eliminated if there were two clues for any of the paths, not necessarily for all of the paths.
A criticism made of some hash-value collections is that their values will rarely occur again. So an important test for our proposed new "uninteresting" hash values is to compare those acquired from different drives. For this we split the corpus into two pieces C1 and C2 based on the drives, roughly those processed before 2012 and those processed in 2012 and 2013. We did not include any drives from our school in this experiment because they are centrally managed and share much software.

We extracted uninteresting hash values with our methods for each piece separately, and then compared them. Table 9 shows the results. This table provides two kinds of indicators of the generality of a hash-obtaining method. One is the average of the second and third columns, which indicates the overlap between the hash sets. On this indicator SZ is the weakest method and WK is second weakest, which makes sense because these methods seek clustered downloads and some clusters are forensically interesting.

Another indicator is the ratio of column 7 to column 4 and of column 6 to column 5, which indicates the degree to which the hash values found generalize from a training set to a test set. The averages for the above data were 0.789 for the ratio of column 7 to column 4, and 0.938 for the ratio of column 6 to column 5, which indicates a good degree of generality. No methods were unusually poor on the second indicator.
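The first indicator reduces to simple set arithmetic on the two halves' hash sets; a sketch (column semantics as described above):

def overlap_indicators(hashes_c1, hashes_c2):
    """Fraction of each half-corpus uninteresting-hash set that
    reappears in the other half; their average is the first
    generality indicator discussed above."""
    common = hashes_c1 & hashes_c2
    return len(common) / len(hashes_c1), len(common) / len(hashes_c2)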
We also compared the uninteresting hashes found for the 2013 version of the corpus with those of two commercial hash sets, the April 2013 download of the Bit9 Cyber Forensics Service (www.bit9.com) and the June 2012 download of the hash list of Hashsets.com. We computed statistics on the file types represented in the two sets and confirmed broad coverage of a variety of file types, not just coverage of executables. Nonetheless, Hashsets.com matched only 5,906 of the remaining hashes, an unimpressive 0.06%. Bit9 recognized 607,693 hash values not in NSRL out of the 10,342,915 that it recognized, of which 93,436 (0.95% of the remaining hashes) were not found by our methods. So Bit9 is not much more help in eliminating uninteresting files beyond our own methods, which eliminated millions of files, something important to know since it is expensive to purchase.
Of the 13.28 million hash values of the NSRL RDS in our corpus, only 1.45 million (10.9%) were also found by our methods without help from NSRL. So there is still a justification for building the NSRL even though it requires more manual labour per hash than our methods.

Proposed file-elimination protocol
We suggest then the following protocol for eliminating uninteresting files from a corpus in a standard investigation (a sketch of steps (ii), (iv), and (vi) follows below):
(i) Run methods PA, BD, TM, WK, and SZ on the full corpus to generate hash sets of candidate uninteresting files.
(ii) Eliminate all files in the corpus whose hash values are in NSRL, our list at digitalcorpora.org, and any other confirmed "uninteresting" lists available.
(iii) Run the methods HA, CD, TD, and EX on the remaining files to generate additional hash values of uninteresting files. These methods do not benefit from seeing the entire corpus.
(iv) Find hash values that occur in at least two uninteresting hash sets, and remove files from the corpus with those hash values.
(v) Eliminate default-directory files and those without hash values that match at least two of the three criteria BD, TD, and EX.
(vi) To the remaining files, add back files matching the "interesting-directory" and "interesting-extension" criteria.
(vii) Save the final list of eliminated hash values for bootstrapping with future drives when doing step (ii).
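A sketch of steps (ii), (iv), and (vi) of this protocol (record fields and names are illustrative; steps (i), (iii), (v), and (vii) involve the corpus-level methods and bookkeeping described earlier):

def eliminate_uninteresting(files, clue_hash_sets, confirmed_uninteresting,
                            interesting_hashes, min_clues=2):
    """Drop files whose hashes are in confirmed lists (NSRL etc.)
    or that gather at least min_clues candidate-uninteresting clues,
    unless explicitly interesting. files: records with an "md5" field
    (possibly None); clue_hash_sets: dict mapping method code (HA, PA,
    TM, WK, SZ, BD, CD, TD, EX) to its candidate uninteresting hashes."""
    kept = []
    for f in files:
        h = f["md5"]
        if h is not None:
            if h in confirmed_uninteresting:  # step (ii)
                continue
            clues = sum(1 for s in clue_hash_sets.values() if h in s)
            if clues >= min_clues and h not in interesting_hashes:
                continue                      # steps (iv) and (vi)
        kept.append(f)  # hashless files need the record-level step (v)
    return kept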

Conclusions
Although uninterestingness of a file is a subjective concept, most forensic investigators have a precise definition for each investigation, usually based on whether a file contains user-created or user-discriminating information. It appears that relatively simple methods can automate this intuition and eliminate considerable numbers of uninteresting files beyond those found using the NSRL hash library alone. On our corpus, NSRL eliminated 23.8% of the hashes while our methods eliminated an additional 53.6%, while keeping false positives (incorrectly eliminated files) to 0.18%. Our methods do need a large corpus of file examples; however, more and more file data is becoming available to researchers. It also appears that commercial hash sets are of limited additional value to most forensic investigations if the methods proposed here are used.

Our methods can eliminate files unique to a drive, but they also provide hashes that should be useful for other corpora. Investigators can choose which methods to use based on their investigative targets, can set thresholds based on their tolerance for error, and can choose to eliminate further files based on time and locale as in [19].

We have published a list of our uninteresting hashes for free download on digitalcorpora.org. Further methods for identifying additional uninteresting files are definitely possible given the low 6.53% "potentially interesting" rate in our test set. Future directions are to extend the ideas to hashes on portions of files [15] and to many-to-one mappings recognizing similar but not identical files, such as pictures.


Figure 1. Observed fraction of malware versus the number of drives on which a hash value appears. The highest peak is 0.08 at 16 drives.

Figure 2. Portion of the histogram of creation times per week in our corpus.

Table 1. Strengths of various malware clues, measured as the number of standard deviations above or below the expected random frequency, counting by hash values.

Table 2. Effects of varying R (minimum count) and K (damping constant) on malware F-score.

Table 3. Methods for identifying uninteresting files.

Table 4. Intersection sizes of uninteresting hash sets in our corpus, in millions.

Table 5. Effects of parameter variation on the performance of different methods of identifying uninteresting files.

Table 6. Additional effects of parameter variation on the accuracy of identifying uninteresting files.

Table 8. Overall statistics on file and hash elimination and their data-reduction percentages.

Table 9. Statistical comparison of hashes derived from a partition of our corpus into two halves, where HA=hashes, PA=paths, TM=time, SZ=size, BD=bottom-level directory, CD=directory context, TD=top-level directory, EX=extension.