Common Crawl is a free, open dataset hosted on AWS, containing regular full crawls of the public web published as WARC (Web ARChive) files. With nearly 600TB of data to parse, however, crunching through it all is no small lift. Warcannon was created to make that job easy to parallelize: it pulls down a Common Crawl dataset, opens the WARCs, runs regex checks against each record, and stores the results. Warcannon can scale to any number of nodes and securely stores all the results in S3.
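The core idea, stripped of all the parallelization, is just "iterate WARC records, run regexes over each body, collect matches." Here is a minimal stdlib-only sketch of that loop. The sample WARC text, the `aws_access_key` pattern, and the `scan_warc_text` helper are illustrative assumptions, not Warcannon's actual code (which runs at scale and handles real gzipped WARCs):

```python
import re

# Hypothetical sample: two minimal WARC-style response records (illustration only;
# real Common Crawl WARCs are gzipped and use CRLF line endings).
SAMPLE_WARC = """\
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.com/a

HTTP/1.1 200 OK
Content-Type: text/html

<html>AKIAIOSFODNN7EXAMPLE</html>
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.com/b

HTTP/1.1 200 OK

<html>nothing interesting here</html>
"""

# One regex check of the kind Warcannon might run -- here, an assumed pattern
# for AWS access key IDs.
PATTERNS = {"aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}")}

def scan_warc_text(warc_text):
    """Split the text into WARC records and run every regex over each body."""
    results = []
    # Each record begins with a "WARC/1.0" version line; skip the empty
    # leading split.
    for record in warc_text.split("WARC/1.0")[1:]:
        # WARC headers are separated from the payload by a blank line.
        headers, _, body = record.partition("\n\n")
        uri_match = re.search(r"WARC-Target-URI: (\S+)", headers)
        uri = uri_match.group(1) if uri_match else None
        for name, pattern in PATTERNS.items():
            for match in pattern.findall(body):
                results.append({"uri": uri, "check": name, "match": match})
    return results
```

Running `scan_warc_text(SAMPLE_WARC)` reports one hit, against `http://example.com/a`. In practice a library such as warcio handles the record parsing; the sketch only shows the shape of the scan.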
Check it out on GitHub.