
Data parsing

We helped a law consulting company build a custom tool that collects and stores data from millions of pages across five different court websites. The scraped information included PDF, Word, JPG, and other files. The scripts ran automatically, so the collected files were updated whenever the source information changed.

14.8 mln

pages processed daily

43 sec

to check for updates

About the client

A law consulting company based in South America. The client automates, manages, classifies, and stores court case files, documents, and contracts of all kinds using AI algorithms.

Tech stack

Python
Celery
Pandas
PostgreSQL
Elasticsearch
GCP

Challenges & solutions

Challenge

Scrape data from millions of pages and documents on five separate judicial websites without overloading them.

Collect updated data from the judicial case files daily, including not only the structured data but also embedded PDF, Word, JPG, and other files.

Keep the scripts running continuously, adding new case files and updating old ones whenever something changes.

Solution

Created a distributed system architecture with Linux nodes and a dynamic pipeline that makes it possible to handle traffic peaks and set priorities.
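
As an illustration only, a pipeline with priorities like this can be sketched with Celery, which is in the project's tech stack. The broker URL, queue names, rate limits, and task bodies below are assumptions, not the production setup:

    # Illustrative Celery setup: separate queues let newly published court
    # pages jump ahead of bulk re-checks. Names and broker URL are assumptions.
    from celery import Celery

    app = Celery("court_scraper", broker="redis://localhost:6379/0")

    app.conf.task_routes = {
        "tasks.scrape_new_case": {"queue": "high_priority"},
        "tasks.refresh_old_case": {"queue": "bulk"},
    }

    @app.task(name="tasks.scrape_new_case", rate_limit="10/s")
    def scrape_new_case(case_url: str) -> None:
        # Fetch and store a newly published case page (placeholder body).
        ...

    @app.task(name="tasks.refresh_old_case", rate_limit="2/s")
    def refresh_old_case(case_url: str) -> None:
        # Re-check an already stored case for changes (placeholder body).
        ...

Workers on each Linux node can then subscribe to one queue or both (for example, celery -A tasks worker -Q high_priority on nodes reserved for peak traffic), which is one way to absorb spikes without starving the bulk refresh.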

Created an algorithm that scrapes new files immediately during the daytime, based on traffic, and performs massive updates during the nighttime.
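
One hedged way to express such a day/night split is a Celery beat schedule. The task names and cron times below are illustrative assumptions:

    # Illustrative day/night split: frequent polling for new filings during
    # the day, one heavy bulk refresh at night. Task names and times are
    # assumptions, not the production schedule.
    from celery import Celery
    from celery.schedules import crontab

    app = Celery("court_scraper", broker="redis://localhost:6379/0")

    app.conf.beat_schedule = {
        "poll-new-cases-daytime": {
            "task": "tasks.poll_new_case_feeds",   # assumed task: check court sites for new filings
            "schedule": crontab(minute="*/5", hour="7-22"),
        },
        "bulk-refresh-nighttime": {
            "task": "tasks.refresh_all_cases",     # assumed task: re-scrape stored cases in bulk
            "schedule": crontab(minute=0, hour="1"),
        },
    }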

Created a cloud SQL database with daily dumps to Elasticsearch to store the data we use; files are uploaded directly to Elasticsearch.
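
A rough sketch of such a daily dump using Pandas and the Elasticsearch bulk helper; the table name, column list, index name, and connection strings are invented for illustration, since the real schema is not described in the case:

    # Illustrative nightly dump: read case rows from the cloud SQL database
    # with pandas and bulk-index them into Elasticsearch. Table, index, and
    # connection details are assumptions.
    import pandas as pd
    from sqlalchemy import create_engine
    from elasticsearch import Elasticsearch, helpers

    engine = create_engine("postgresql://user:pass@cloud-sql-host/cases")
    es = Elasticsearch("http://localhost:9200")

    df = pd.read_sql("SELECT case_id, court, filed_at, body FROM court_cases", engine)

    actions = (
        {"_index": "court_cases", "_id": row.case_id, "_source": row._asdict()}
        for row in df.itertuples(index=False)
    )
    helpers.bulk(es, actions)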

Implemented proxies and AI techniques to overcome bot protection and process 14.8 million pages daily, downloading about 14 GB of data every day.
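
The proxy rotation itself can be as simple as cycling requests through a pool so that no single IP hits the court sites too often; the proxy addresses and user agent below are placeholders, not the real infrastructure:

    # Illustrative proxy rotation: cycle each request through a pool of
    # proxies. Addresses and the user agent are placeholders.
    import itertools
    import requests

    PROXIES = itertools.cycle([
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
        "http://proxy3.example.com:8080",
    ])

    def fetch(url: str) -> requests.Response:
        proxy = next(PROXIES)
        return requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            headers={"User-Agent": "Mozilla/5.0 (compatible; research-bot)"},
            timeout=30,
        )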


These guys are fully dedicated to their client's success and go the extra mile to ensure things are done right.


Sebastian Torrealba

CEO and Co-Founder, DeepIA, Software for the Digital Transformation

Steps of providing
data scraping services


Step 1 of 5

Free consultation

It's a good time to get to know each other, share values, and discuss your project in detail. We will advise you on a solution and help you understand whether we are the right match for you.

Step 2 of 5

Discovering and feasibility analysis

One of our core values is flexibility, so we can work from either one-page high-level requirements or a full pack of technical docs. At this stage, we need to ensure that we understand the full scope of the project. We either receive the requirements from you or perform a set of interviews, and then prepare the following documents: a list of features with detailed descriptions and acceptance criteria, a list of fields to be scraped, and the solution architecture. Finally, we make a project plan that we strictly follow. We are a result-oriented company, and that is another of our core values.

Step 3 of 5

Solution development

At this stage, we develop the core logic of the scraping engine. We run multiple tests to ensure that the solution works properly. We map the fields and run the scraping. While scraping, we keep the full raw data so the final model can be extended easily. Finally, we store the data in the database of your choice and run quality assurance tests.
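
As an illustration of keeping the full raw data, a scraper can store the unmodified HTML next to the mapped fields, so new fields can be extracted later without re-crawling. The library choice (requests plus BeautifulSoup), selectors, and record layout below are assumptions for the sketch:

    # Illustrative pattern: store raw HTML alongside the mapped fields so the
    # data model can be extended later without re-scraping. Selectors and the
    # record layout are assumptions.
    import hashlib
    import requests
    from bs4 import BeautifulSoup

    def scrape_page(url: str) -> dict:
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        title = soup.select_one("h1")
        published = soup.select_one("time")
        return {
            "url": url,
            "raw_html": html,  # full raw page kept alongside the mapped fields
            "raw_sha256": hashlib.sha256(html.encode()).hexdigest(),
            "title": title.get_text(strip=True) if title else None,
            "published": published.get("datetime") if published else None,
        }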

Step 4 of 5

Data delivery

After quality assurance tests are completed, we deliver the data and solutions to the client. Though we have over 15 years of expertise in data engineering, we expect the client's participation in the project. While developing and crawling data, we provide interim results so you can always see where we are and give us feedback. A high level of communication is another of our core values.

Step 5 of 5

Support and continuous improvement

We understand how crucial the solutions we build for our clients are. Our goal is to build long-term relationships, so we provide guarantees and support agreements. What is more, we are always happy to assist with further development, and our statistics show that 97% of our clients return to us with new projects.

Success stories

Check out a few case studies that show why DATAFOREST will meet your business needs.

Store heatmap

The electronics retailer wanted to improve its sales and customer service by analyzing the flow of people into its stores. We created a system using machine learning, image detection, and face recognition. The system tracks visitors' movements and identifies the most viewed shelves and products. This information helps the store focus on selling popular products and avoid unpopular ones, ultimately improving the sales process.
9%

increase in sales

100%

dead zones removed


Jared D.

CEO Consumer Electronics Retail

DATAFOREST provides meaningful shopper-behavior insights. They are very responsive and effective, trying to engineer and offer the best-fit solution.

Operating Supplement

We developed an ETL solution for a manufacturing company that combined all the required data sources and made it possible to analyze the information and identify bottlenecks in the process.
30+

supplier integrations

43%

cost reduction


David Schwarz

Product Owner Biomat, Manufacturing Company

DATAFOREST has the best data engineering expertise we have seen on the market in recent years.

Supply chain dashboard

The client needed to optimize the work of employees by building a data source integration and reporting system for use at different management levels. Ultimately, we developed a system that unifies relevant data from all sources and stores it in a structured form, saving more than 900 hours of manual work monthly.
900h+

manual work reduced

100+

system integrations


Michelle Nguyen

Senior Supply Chain Transformation Manager, Unilever, the World's Largest Consumer Goods Company

Their technical knowledge and skills offer great advantages. The entire team has been extremely professional.

Infrastructure Audit & Intelligent Notifications

An eCommerce company had issues managing its complex IT infrastructure across multiple cloud providers. We helped analyze the current architecture and develop a strategy for unification, scaling, monitoring, and notifications. As a result, we implemented a single cloud provider, a CI/CD process, server unification, and security and vulnerability mitigation actions, and improved reaction speed and reliability by 200%.
200%

performance boost

24/7

monitoring


Dean Schapiro

Co-Founder, CTO Ecom Innovators, eCommerce company

Not only are they experts in their domains, but they also provide perfect outcomes.

Latest publications

18 min

Web Price Scraping and Price Monitoring Guide by DATAFOREST

18 min

Efficient Cloud Infrastructure Monitoring: Secure and Scalable Solutions

16 min

15 Best Data Warehousing Tools: Cloud-based Platform & Services - DATAFOREST

We’d love to hear from you

Share the project details – like scope, mockups, or business challenges.
We will carefully review them and get back to you with the next steps.
