The Data Science Trend and A Cloud-Based Crawler as a Platform for Intellectual Property and Competitive Intelligence Research
James H. Moeller
December 15, 2017
(Approximate Read-Time: 7 minutes, Word Count: 1,334.)
 
The Data Science Challenge
Lately it seems the media is full of reports on the changing landscape of jobs and how robotics, artificial intelligence, and data science will take over many aspects of our lives. Examples include driving automobiles, taking restaurant orders, diagnosing medical issues, providing legal advice, and making finance and investment recommendations, just to name a few. Exactly when this will all happen is debatable. But judging by technology demonstrations and initial implementations, there's no doubt that progress is happening quickly and that the technology is already being deployed in many businesses today.

While some job sectors will be affected more dramatically than others, for me and my business the current transition is really a continuation of the technology adoption I've been working with for decades. It's the continual challenge of figuring out how to use new technology to make work easier, more efficient, and more effective. So, in pursuit of that challenge, I'm integrating more data science into my consulting practice, with the intent of adding value to the intellectual property and competitive intelligence research I provide.
 
Recently I've been experimenting with an open-source web crawler system named Scrapy (https://scrapy.org), and I've used it to implement a crawler data service available on my website (https://www.MoellerVentures.com). What I've designed is a cloud-based crawler with a browser-accessible form front-end that works across all platforms (desktop, tablet, and mobile). It's intended to be a focused, deep-dive keyword crawler, as compared to broader keyword-oriented search engines. Additionally, it can serve as a platform for future applications leveraging third-party capabilities and data sources. It can utilize the data science, AI, and machine learning services from companies such as Amazon, Microsoft, Google, IBM, and others. In combination with these services, it can also integrate API-accessible data sources from organizations like the U.S. Patent and Trademark Office (USPTO) and the Federal Communications Commission (FCC). Those will be topics of future blog posts.
 
The Crawler
The process of information retrieval is the beginning of most data science projects. So, I initially wanted to pursue a platform that could execute this information retrieval from the widest available sources, including the public web (websites that are publicly available and largely cataloged by the search engines), the deep web (websites where information is contained behind password protection or some other gateway), and the dark web (websites that are only accessible via the ".onion" Tor network). The Scrapy crawling engine provides this information retrieval functionality.
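To make the retrieval step concrete, below is a minimal Scrapy spider that fetches a page and collects its displayed text. It's only an illustrative sketch of the engine's basic operation, not the code behind my service; the starting URL is a placeholder, and reaching ".onion" sites would additionally require routing requests through a Tor proxy, which is beyond the scope of this sketch.

```python
# Minimal sketch: fetch a page and collect its visible text with Scrapy.
# Illustrative only; the starting URL is a placeholder.
import scrapy


class TextSpider(scrapy.Spider):
    name = "text_retrieval"
    # Replace with the site you actually want to crawl.
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Collect the displayed text, ignoring script and style blocks.
        text = " ".join(
            t.strip()
            for t in response.xpath(
                "//body//text()[not(ancestor::script) and not(ancestor::style)]"
            ).getall()
            if t.strip()
        )
        yield {"url": response.url, "text": text}
```

Saved as text_spider.py, a file like this could be run with "scrapy runspider text_spider.py -o pages.json" to write the retrieved text to a JSON file.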
 
To be credible in today's data-centric world requires a professional computing platform, and that means an implementation in the cloud. For small and medium-sized businesses in particular, the cost and effort of purchasing, maintaining, scaling, and securing a computing infrastructure make cloud computing the only viable solution. I'm using Microsoft's Azure cloud (https://azure.microsoft.com/en-us/) for my crawler compute functions and storage implementation, and Liquid Web (https://www.liquidweb.com) for my website hosting.
 
My Forensic Keyword Crawler function is available under the Search & Data Services menu on my website. It implements a focused keyword search on webpage text from public and dark web sites. On the website form, the user enters a starting URL, parameters defining how links are followed, the keywords to search for, the depth of the crawl, and the maximum number of pages to crawl. All crawls of 10 pages or fewer are free, and longer crawls are charged only by the number of pages actually crawled. The crawler queries the starting URL(s) and searches all displayed text for the specified keywords. The keyword search proceeds from the starting URLs and follows all links on those pages according to the link-following parameters, continuing to search for the keywords on all subsequently followed pages. The crawl can continue to deeper link levels as defined by the crawl-depth specification. All webpages searched are logged in the user's results output file, with matched keywords noted under each webpage listing. The results are emailed to the user when the crawl is complete.
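To give a rough sense of how such a crawl maps onto Scrapy, the sketch below follows links from a starting URL, limits the crawl depth and page count, and records the keywords matched on each page. It's a hypothetical, simplified illustration of the mechanics, not the implementation behind the Forensic Keyword Crawler; the spider name, URL, keywords, and settings values are placeholders.

```python
# Hypothetical sketch of a focused keyword crawl in Scrapy.
# Not the Forensic Keyword Crawler's actual code; values are placeholders.
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KeywordSpider(CrawlSpider):
    name = "keyword_crawl"
    # Follow every link found on a crawled page; domain restrictions and depth
    # limits are applied via allowed_domains and Scrapy settings at run time.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def __init__(self, start_urls, keywords, **kwargs):
        self.start_urls = list(start_urls)
        self.keywords = [k.lower() for k in keywords]
        super().__init__(**kwargs)

    def parse_start_url(self, response):
        # Search the starting URL(s) as well, not just the followed links.
        return self.parse_page(response)

    def parse_page(self, response):
        # Collect the page's text and note which keywords appear on it.
        text = " ".join(response.xpath("//body//text()").getall()).lower()
        matched = [k for k in self.keywords if k in text]
        yield {"url": response.url, "matched_keywords": matched}


if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "DEPTH_LIMIT": 2,             # link levels below the starting URL
        "CLOSESPIDER_PAGECOUNT": 10,  # maximum number of pages to crawl
        "FEEDS": {"results.json": {"format": "json"}},  # results output file
    })
    process.crawl(
        KeywordSpider,
        start_urls=["https://example.com"],
        keywords=["patent", "wireless"],
    )
    process.start()
```

In this sketch, the DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT settings play the roles of the crawl-depth and maximum-pages parameters, and the JSON feed stands in for the results file that gets emailed to the user.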
 
So, the crawler can help you find information that isn't readily available via search engines or a website search box. Say you want to search a specific website, or a portion of one, and that site doesn't provide a search function, or its search doesn't cover the pages you're interested in. In that case, you can use the Forensic Keyword Crawler to search the specific webpages of interest.
 
As a competitive intelligence example, assume your pages of interest are the webpages composing the press releases section of a competitor's website, complete with archives of previous years' press releases linked and segmented by year. You want to search all the press releases for a certain set of keywords. To do so, you can specify the main press releases webpage as the starting URL in the Forensic Keyword Crawler and set the link-following parameters to only follow links on that same website (to contain the crawl to the relevant press release archives). Then specify the keywords to search for and the appropriate crawl depth corresponding to the link depth of the archives (i.e., the total link depth below the main press releases page). The crawler will then go off and execute that crawl and, when completed, email you the results consisting of all the pages searched and the matched keywords on each webpage.
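In terms of the hypothetical KeywordSpider sketched earlier, containing the crawl to the competitor's own site could look roughly like the following; the domain, URL, keywords, and depth value are placeholders for illustration.

```python
# Roughly the press-release scenario, reusing the hypothetical KeywordSpider
# from the earlier sketch. Domain, URL, keywords, and depth are placeholders.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "DEPTH_LIMIT": 3,  # set to the link depth of the press release archives
    "FEEDS": {"press_releases.json": {"format": "json"}},
})
process.crawl(
    KeywordSpider,  # defined in the earlier sketch
    start_urls=["https://www.competitor-example.com/press-releases"],
    keywords=["acquisition", "partnership", "licensing"],
    allowed_domains=["competitor-example.com"],  # only follow links on this site
)
process.start()
```

Here the allowed_domains setting is what keeps the crawl contained to the competitor's website, while DEPTH_LIMIT matches the depth of the yearly archives.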
 
As an intellectual property research example, suppose you want to search a specific set of online research documents for some intellectual-property-related keywords. Those documents also contain hyperlinks to additional reference documents, which may be located on completely different websites, that you also want to search. In this situation, you can specify the URLs of the initial research documents as the starting URLs of the crawl and set the link-following parameters to follow any links (i.e., even links that lead off the websites of the starting URLs). Then specify your keywords and the depth level. The crawler will execute the keyword search on all the research documents as well as the linked reference documents that reside on different websites. Again, the results will list all the searched webpages and the matched keywords on each page.
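Continuing the same hypothetical sketch, the only real differences in this scenario are multiple starting URLs and no allowed_domains restriction, so links that lead off the starting sites are followed as well; the URLs and keywords are again placeholders.

```python
# Roughly the intellectual property scenario: multiple starting documents and
# no allowed_domains restriction, so linked references on other sites are
# followed too. URLs and keywords are placeholders for illustration.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "DEPTH_LIMIT": 1,  # search the documents plus one level of linked references
    "FEEDS": {"ip_research.json": {"format": "json"}},
})
process.crawl(
    KeywordSpider,  # defined in the earlier sketch
    start_urls=[
        "https://research-site-example.org/paper-one.html",
        "https://another-site-example.org/paper-two.html",
    ],
    keywords=["spread spectrum", "prior art"],
    # No allowed_domains, so links leading off these sites are followed as well.
)
process.start()
```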
 
My website contains other tutorials on using the Forensic Keyword Crawler, as well as additional examples of crawling the dark web and the USPTO's weekly Official Gazette Notices. See the following link for more information: https://www.moellerventures.com/index.php/tutorials
 
The Take-Aways
The main intent of the Forensic Keyword Crawler is to serve as a focused, deep-dive search mechanism that can aid in finding very specific information that's not readily available via other search capabilities or services. So, in situations where you need to search specific websites, or even specific webpages, for detailed keyword-oriented information (typical in intellectual property and competitive intelligence research), the Forensic Keyword Crawler can potentially find it. In addition, the crawler is easy to access and set up via a browser-accessible web form that works on your desktop, tablet, or mobile phone. Finally, the systems running these services are implemented in the cloud and thus provide the reliability, scalability, and security inherent in cloud implementations.
 
The keyword crawler is a good example of the core functionality of the Scrapy crawling engine; as a platform, however, it can be utilized much more extensively. With information retrieval being an important initial step in any data science project, those additional capabilities can enable a broader set of future applications, including integration with third-party data science platforms and retrieval of data from a variety of API-accessible sources.
 
Finally, the macro growth trends associated with data science applications are undeniable. While the overall economic and individual job-sector effects will vary in impact and timing, for my business the time to be involved is now. That said, I also believe a longer-term vision needs to be applied, as third-party solutions and platforms intersect with specific applications, and as market sectors shift to become more data-centric and automation-oriented.
 

Further Reading: