Forensic Keyword Crawl:

  • The Keyword Crawl queries the specified starting URL(s) and searches all displayed text for the specified keywords. The search proceeds from the starting URLs and follows all links on those pages according to the specified depth, the allowed domain names list, and the denied URL strings, continuing to search for the keywords on every subsequently followed page (see the sketch after this list for an illustration of the displayed-text matching).
  • Web pages where keyword matches are found are logged in the user's results output file.
  • The user's crawl results output file is emailed to the user when the crawl finishes and is limited to 10 MB in size.
  • Free: All crawls of 10 pages or fewer are free. Use this to experiment before executing longer crawls.
  • Longer crawls are billed only by the number of pages actually crawled.
  • Crawl Speed: Completion time of a crawl depends on a number of factors, such as the number of pages being crawled, the crawler server load, the website's page response times, and miscellaneous processing overhead. In addition, the crawler's queries are throttled so as not to overload the site being crawled. In general, you can estimate that the crawler will take at least 3 seconds per page crawled, or about 5 minutes per hundred pages.
  • Crawler Efficiency: The effectiveness of the keyword crawler can depend heavily on a website's content and structure. This keyword crawler is designed to work on the broadest possible set of websites; still, you may encounter sites where the crawler is less effective at locating the specified keywords. Contact me if you encounter problems crawling specific sites.
  • Tutorials: See the following link for more tutorial information: https://www.moellerventures.com/index.php/tutorials
  • Custom Crawls: Contact me if you need a custom keyword crawl designed specifically for your needs.
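
To make the displayed-text matching concrete, here is a minimal sketch of how a single page might be fetched and searched for keywords. This is purely illustrative, assuming Python with the requests and BeautifulSoup libraries; the hosted crawler's actual implementation is not published, and the function names here are hypothetical.

    # Illustrative sketch only; names are hypothetical, not the service's code.
    import requests
    from bs4 import BeautifulSoup

    def find_keyword_matches(url, keywords):
        """Fetch a page and return visible-text lines containing any keyword."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):  # drop non-displayed text
            tag.decompose()
        text = soup.get_text(separator="\n")
        return [line.strip() for line in text.splitlines()
                if any(kw.lower() in line.lower() for kw in keywords)]

    print(find_keyword_matches("http://www.fuzionathletics.com", ["fuzion"]))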

Simple Keyword Crawl: The most basic keyword crawl can be executed by entering a starting URL and a keyword, and using the defaults for the remaining input fields. The Allowed Domain Names field will default to the domain name of the starting URL, and will thus restrict the crawl to the website (i.e. domain name) of the starting URL. The Deny Domain Names and URL(s) Strings field will default to a blank field, indicating that no domain name or URL string should be denied. The Crawl Depth field will default to 1, and the Pages Crawled Limit field will default to 10 (i.e. the 10-page free crawl).
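
As an illustration of that default, the allowed domain can be derived from the starting URL roughly as follows (a sketch, not the service's published code):

    # Sketch: deriving the Allowed Domain Names default from the starting URL.
    from urllib.parse import urlparse

    def default_allowed_domain(starting_url):
        host = urlparse(starting_url).netloc  # e.g. "www.fuzionathletics.com"
        return host[4:] if host.startswith("www.") else host

    print(default_allowed_domain("http://www.fuzionathletics.com"))
    # -> fuzionathletics.com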

As a specific example, enter the following.

  • Starting URL: "http://www.fuzionathletics.com"
  • Keyword: "fuzion"
  • Keep all other input fields with their default values.

As specified, this will crawl the http://www.fuzionathletics.com site and look for the keyword "fuzion". It will not follow any links that lead to a URL with a domain name other than fuzionathletics.com. The crawl will only search 1 depth level below the starting URL and will stop once it has crawled 10 pages. The output results will show all instances where the keyword matched any displayed text on the 10 pages that were crawled. All 10 pages will be contained within the Fuzion website and will use URLs with the fuzionathletics.com domain name.

Using the Deny Domain Names and URL(s) Strings Field: To illustrate the use of the Deny Domain Names and URL(s) Strings field, we'll construct a crawl that finds 10 off-site links containing a specified keyword. We'll use the Simple Keyword Crawl example above, with exactly the same starting URL and keyword, except that this time we'll specify "fuzionathletics.com" in the Deny Domain Names and URL(s) Strings field. This tells the crawler not to follow any link with "fuzionathletics.com" in the URL. It's important to note that the Deny Domain Names and URL(s) Strings field specifies a string that can match anywhere in the URL. So if, say, we were to also specify "polevault" in the deny field, and the string "polevault" appeared in another URL, perhaps as a web page specifier, then that URL would NOT be followed or crawled. With "fuzionathletics.com" specified as a deny string, the crawler will effectively not follow any links on the fuzionathletics.com website, and will only follow links that lead off the website, as the sketch below shows.
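
Because a deny entry is treated as a plain substring, the test the crawler applies to each candidate URL is essentially the following (a sketch with illustrative names):

    # Sketch of the deny test: a deny entry matches anywhere in a candidate URL.
    def is_denied(url, deny_strings):
        return any(s in url for s in deny_strings)

    deny = ["fuzionathletics.com"]
    print(is_denied("http://www.fuzionathletics.com/contact", deny))     # True  - not followed
    print(is_denied("https://www.instagram.com/fuzionathletics", deny))  # False - followed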

For this example enter the following.

  • Starting URL: "http://www.fuzionathletics.com"
  • Keyword: "fuzion"
  • Deny Domain Names and URL(s) Strings: "fuzionathletics.com"
  • Keep all other input fields with their default values.

For this crawl, the output results will show all instances where the keyword "fuzion" matches text displayed on web pages that are linked from the http://www.fuzionathletics.com page (no more than 1 level below it) and that don't contain "fuzionathletics.com" anywhere in their URLs. So the output results will show keyword matches on sites such as Instagram, Twitter, and YouTube that are linked from the Fuzion website, display the respective Fuzion information, and happen to use the keyword "fuzion". The results will be limited to the first 10 pages found.

Crawling Pages / Sites with Non-Text Information: Webpages that contain non-text information can be problematic for the Forensic Keyword Crawler function. To avoid the extensive delays associated with downloading and analyzing webpages that contain content such as audio, video, or application files, the crawler is designed to check the page content type before downloading the webpage. The results file will show all the pages that were queried and will flag the ones with non-text content types by displaying the content type indicator in red text. For some websites, such as file repositories, there may be entire webpages (URLs) containing nothing but links to non-text files. In these situations, it is often useful to list those URLs in the Deny Domain Names and URL(s) Strings field so that the crawler avoids those URLs altogether and focuses the keyword search on more relevant pages.
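
As a sketch of that pre-check (illustrative only, assuming Python's requests library), the behavior can be approximated by issuing a HEAD request and inspecting the Content-Type header before downloading the page body:

    # Sketch: check the content type before downloading the full page.
    import requests

    def is_text_page(url):
        resp = requests.head(url, allow_redirects=True, timeout=10)
        return resp.headers.get("Content-Type", "").startswith("text/")

    url = "http://www.fuzionathletics.com"  # illustrative
    if is_text_page(url):
        html = requests.get(url, timeout=10).text  # safe to download and search
    else:
        print("flagging non-text content at", url)  # shown in red in the results file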

Crawling the Dark Web: The keyword crawler is capable of crawling the Dark Web. Simply enter the Dark Web URL(s) in the Starting URL(s) fields of the form, just as you would with a regular URL. Then complete the remainder of the form (keyword(s), depth, etc.) and click Submit. A sketch of how .onion URLs can be fetched follows the example below.

As a specific example, enter the following.

  • Starting URL: "https://3g2upl4pq6kufc4m.onion"
    • This is the Dark Web link for the DuckDuckGo search engine.
  • Keyword: "duck"
  • Keep all other input fields with their default values.
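
For reference, the hosted crawler handles the Tor routing itself. If you were fetching an .onion URL in your own code, the usual approach is to route HTTP through a local Tor SOCKS proxy (a sketch, assuming Tor is running locally on its default SOCKS port 9050 and the requests[socks] extra is installed):

    # Sketch: fetching an .onion URL through Tor's default local SOCKS port.
    import requests

    proxies = {
        "http":  "socks5h://127.0.0.1:9050",  # socks5h resolves .onion names via Tor
        "https": "socks5h://127.0.0.1:9050",
    }
    resp = requests.get("https://3g2upl4pq6kufc4m.onion", proxies=proxies, timeout=60)
    print(resp.status_code)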

A Note on Crawl Depth: Crawl depth can have a dramatic effect on the time it takes for a crawl to complete, primarily because every level has the potential to contain many more links. The keyword crawler is designed to search breadth first, so it will attempt to crawl all the links at a given level before moving to a deeper level. The parsing and presentation of the keyword matches in the user's output file can depend on the response times of the web pages (and other factors), but in general, keyword crawls specified with a larger (deeper) crawl depth will typically list the keyword matches found on the starting URL pages first, followed by the matches at deeper levels. A sketch of this breadth-first ordering follows.
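
The breadth-first ordering can be sketched as follows (illustrative names; get_links stands in for fetching a page and extracting its links):

    # Sketch: breadth-first traversal, visiting all of depth d before depth d+1.
    from collections import deque

    def crawl_order(start_url, get_links, max_depth):
        seen = {start_url}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            yield url, depth            # pages are reported level by level
            if depth < max_depth:
                for link in get_links(url):
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, depth + 1))

    # Tiny link graph standing in for real pages:
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(list(crawl_order("A", graph.get, 2)))
    # -> [('A', 0), ('B', 1), ('C', 1), ('D', 2)]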