AutoScraper is a Smart, Automatic, Fast and Lightweight Web Scraper for Python.
Developed by Alireza Mika, it can be downloaded at https://github.com/alirezamika/autoscraper
Despite the availability of tools such as Beautiful Soup, web scraping remains difficult.
A library such as Beautiful Soup helps you parse the HTML and navigate the document tree. But it does not write the extraction query for you.
The purpose of a web page is to be consumed by humans, not machines: its markup is designed for presentation, not for data extraction.
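To see the difference, here is a minimal sketch of the manual approach with requests and Beautiful Soup (both assumed to be installed; the CSS selector is purely hypothetical, you would have to inspect Quora's HTML to find the real one):

# Manual scraping: you have to inspect the page and write the query yourself,
# and rewrite it every time the markup changes.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.quora.com/search?q=deep%20learning").text
soup = BeautifulSoup(html, "html.parser")

# This selector is hypothetical and written (and maintained) by hand
for node in soup.select("div.question_text span"):
    print(node.get_text(strip=True))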
What if a library could learn from an example and then write the scraping query for you? That is the raison d'être of AutoScraper.
I want to create a web scraper for the website Quora that collects all the questions about a given subject. First, install the library:
pip install git+https://github.com/alirezamika/autoscraper.git
Here is the training script, demo_train.py:

from autoscraper import AutoScraper

# Parameters
url = "https://www.quora.com/search?q=deep%20learning&time=year"
model_name = "model_quora"

wanted_list = ["When will deep learning finally die out?"]

# We instantiate the AutoScraper
scraper = AutoScraper()

# We train the scraper
# Here we can also pass HTML content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)

# We display the results, if any
if result:
    print("🚀 Great, a query has been inferred! Good job.")
    print(result)

# If there is no result, we exit with an error code
if result is None:
    print("Sorry, no query could be inferred ... 😿")
    exit(-1)

# We save the model for future use
print(f"💿 > Save the model {model_name}")
scraper.save(model_name)
python3 demo_train.py
🚀 Great, a query has been inferred! Good job.
['When will deep learning finally die out?', 'What newly developed machine learning models could surpass deep learning?', 'What is the future of machine learning/deep learning startups?', 'How promising is deep learning?', 'How can a regression problem be solved with deep learning?', 'What is the brutal truth about deep learning?', 'Why is there still no theory underlying deep learning?', 'What are the frameworks for deep learning modelling?', 'What is deep learning in terms of programming?']
💿 > Save the model model_quora
The model saved in the preceding step contains all the scraping rules. Now we can apply it to any page that shares the same structure as the page used during training. Here is the inference script, demo.py:
from autoscraper import AutoScraper

# AutoScraper must be installed with
# pip install git+https://github.com/alirezamika/autoscraper.git

question = "france"
time = "year"
url = f"https://www.quora.com/search?q={question}&time={time}"
model_name = "model_quora"

scraper = AutoScraper()
scraper.load(f"./{model_name}")

# Get all the results on the page similar to our model
results = scraper.get_result_similar(url)

# Display the results, if any
if results:
    for r in results:
        print(r)
else:
    print("No result found")
python3 demo.py
Is France really as useless at war as portrayed in America?
France fined Google 166M. Can Google just say no and not pay it? What are they going to do, ban Google in France?
Is there freedom of expression in France?
How is France dealing with Covid-19?
What country is the oldest ally to France?
Is France really 'littered' with abandoned chateaux?
Why is France considered the most advanced country of Europe?
Will Germany and France leave the European Union following Brexit?
American expats to France, is France what it is cracked up to be?
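Note that get_result_similar is not the only way to query a loaded model. According to the AutoScraper README, the library also exposes get_result_exact, which returns the data matching the exact rules learned from the wanted_list examples. A minimal sketch, reusing the model saved above:

from autoscraper import AutoScraper

# A small sketch assuming the get_result_exact method documented in the repo:
# it returns the data matching the exact learned rules, while
# get_result_similar returns everything similar to the training examples.
scraper = AutoScraper()
scraper.load("./model_quora")

url = "https://www.quora.com/search?q=france&time=year"
print(scraper.get_result_exact(url))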
The scraper must be trained again if the structure of the page changes. The real advantage of this approach is reactivity: when a new page format appears, you can quickly build a new model and resume data extraction, as in the sketch below.
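In practice, retraining just means running build again with one fresh example taken from the new page layout and overwriting the saved model. A minimal sketch (the example question below is hypothetical):

from autoscraper import AutoScraper

# Hypothetical retraining sketch: pick one question from the redesigned page
# and rebuild the model over the old file.
url = "https://www.quora.com/search?q=deep%20learning&time=year"
wanted_list = ["How promising is deep learning?"]  # fresh example from the new layout

scraper = AutoScraper()
result = scraper.build(url, wanted_list)

if result:
    scraper.save("model_quora")  # overwrite the previous model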
The library is very new and not perfect yet, but a big thanks to Alireza Mika for this great approach.