Collecting data from the web
There is an abundance of information available on the web. This information is often displayed in human-readable ways on web pages. Much of this information is of interest for your scientific research. You can therefore collect the information on the web pages – and/or the underlying data – and use it in your research project.
You can use web data for your research, for example, because these data:
- are more recent and more easily available, compared to official documents
- are by definition only available online, like social media posts etc.
- cannot be collected at a large scale using regular data collection methods
Collecting data from the web is best done automatically, because it is repeatable and efficient. An automatic collection method is also less error-prone and can be easily shared with others.
How to collect web data
The original and still very common way to present information on the web is on a web page. One website can consist of many web pages. Web pages are represented in Hypertext Markup Language (HTML) and can contain text, images and other objects like interactive elements. Web pages can be large and not all their content is relevant, so they must first be parsed to extract data useful for research.
The collection process of web data typically involves the following steps:
- locate the URL of the web page holding interesting content
- request and open the web page
- locate the content in the page source
- extract the content
- save the content in an appropriate format
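These steps can be sketched in Python using only the standard library. To keep the example self-contained, the page below is an inline placeholder; the commented-out `urlopen` call shows where a real request would go:

```python
from html.parser import HTMLParser
# Step 1 and 2: locate the URL and request the page, e.g.
# from urllib.request import urlopen
# html = urlopen("https://example.org/page").read().decode("utf-8")

# Inline placeholder page instead of a real request:
html = """
<html><head><title>Example page</title></head>
<body><p class="content">Interesting content</p></body></html>
"""

class ContentExtractor(HTMLParser):
    """Steps 3 and 4: locate and extract the content in the page source."""
    def __init__(self):
        super().__init__()
        self._tag = None
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        self._tag = tag

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "p":
            self.paragraphs.append(data)

parser = ContentExtractor()
parser.feed(html)

# Step 5: save the content in an appropriate format (here, just print it).
print(parser.title)
print(parser.paragraphs)
```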
Below we list some ways to collect data from the web. For some approaches, the initial steps are already covered and all you need to do is download the data. Alternatively, you can build your own collection method. This is more complicated, but also offers more control.
Repositories and APIs
Many organizations, companies and government agencies have created public repositories where they provide easy access to data. These data can be manually downloaded, and often there is also an Application Programming Interface (API) available for automated access. Some examples are:
- Statistics Netherlands (CBS)
- Copernicus Open Access Hub: European satellite program
- PubMed: database of biomedical literature
Working with an API means that you can directly download the data you need, without first having to locate, request and extract the information. You may still need to write your own program to select, get and edit the data you need. For many APIs, there are wrappers available in different programming languages. These wrappers contain functions to execute API calls and process the results. Note that there might be a fee associated with API access.
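As an illustration, an API call is usually just an HTTP request with query parameters. The sketch below builds a query for PubMed's E-utilities search endpoint; the search term and result limit are example values, and the actual request is left commented out:

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually send the request
# import json

# PubMed E-utilities: search for article IDs matching a query.
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",          # which database to search
    "term": "web scraping",  # example search term
    "retmode": "json",       # ask for machine-readable output
    "retmax": 20,            # example result limit
}
url = base + "?" + urlencode(params)
print(url)
# response = json.load(urlopen(url))  # would return matching PubMed IDs as JSON
```

An API wrapper does essentially this for you, hiding the URL construction behind ordinary function calls.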
Web scraping tools
Web scraping tools are off-the-shelf software to extract information from websites. You do not need any programming knowledge to use these tools. There are three main types of tools available.
The first two types of tools are browser plug-ins and installable software. These tools allow you to go to a website, click on the elements you want to scrape and download them. In doing so, you perform all the collection steps semi-automatically. Browser plug-ins and installable scraping software can be either free or paid. Their scalability is limited, making them especially suitable for simple or small scraping tasks. The third type is cloud-based web scraping. These tools can handle large-scale web scraping and the results are stored in the cloud. You can try out many of these tools for free, but large-scale scraping projects can get expensive.
Web archives
Web pages are constantly changing. If you want to retrieve information from past web pages, or earlier versions of a page, you can use a web archive. Scraping archives is convenient, because many web pages of interest are available in one place and they allow you to obtain data over a longer time period in one scrape. However, not all pages of a website may be archived, and archives won’t contain posts and articles that were removed from the site in the meantime (because of dubious content), or pages that require logging in. Also, the period covered and maintenance of these archives may vary per website.
Common Crawl and the Internet Archive are non-profit organizations that continuously crawl the web and make their archives and datasets available to the public for free. Both Common Crawl and the Internet Archive provide API access to their URL index server, which allows you to request and extract the information yourself. News sites and forums may also have archives where they store outdated posts and articles.
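For example, the Internet Archive's CDX index API lists the archived captures of a URL over a given period. The sketch below only builds the query; the site and date range are placeholders:

```python
from urllib.parse import urlencode

# Internet Archive CDX index API: list archived captures of a URL.
base = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "example.org",  # placeholder site
    "from": "20200101",    # start of the period of interest (YYYYMMDD)
    "to": "20201231",      # end of the period of interest
    "output": "json",      # machine-readable output
    "limit": 50,           # example cap on the number of captures
}
query = base + "?" + urlencode(params)
print(query)
# Each row of the JSON response includes a timestamp; the snapshot itself
# lives at https://web.archive.org/web/<timestamp>/<original-url>
```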
Custom web scraping
If there is no easy way to retrieve the data you need, you can write your own scraping program. It is worth checking to see if someone else has already written a scraper for your data source. If not, there are plenty of tools and free tutorials to get you started, regardless of your preferred programming language.
Common libraries for Python include:
- Requests – request web pages
- Beautiful Soup – parse HTML/XML
- Selenium – automate interaction with websites
- Scrapy – framework for crawling, scraping and parsing web pages
Common libraries for R include:
- httr – request web pages
- rvest – parse HTML/XML
- RSelenium – automate interaction with websites
If you are planning to use web scraping, there are some practical things to consider when preparing and running the collection process.
It is important to consider the period over which you want to collect your data. If the content of a web page is fairly consistent over time, scraping just one version of each page will suffice. But on many web pages, content changes often and may disappear over time. If you are particularly interested in this information, you need to scrape pages regularly and over a longer period of time.
Web scraping usually involves obtaining large numbers of files. It is therefore important to estimate in advance the costs, the time schedule and the size of your result set. You could, for example, run a pilot project to better make this assessment.
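A back-of-the-envelope calculation is often enough for a first assessment. All numbers below are made-up pilot figures, to be replaced with your own measurements:

```python
pages = 100_000          # number of pages to scrape (assumed)
avg_page_mb = 2.5        # average raw page size in MB (assumed)
secs_per_request = 2     # request + parse time per page, from a pilot (assumed)
workers = 8              # number of parallel scraping tasks (assumed)

total_gb = pages * avg_page_mb / 1024
total_hours = pages * secs_per_request / workers / 3600
print(f"~{total_gb:.0f} GB of raw pages, ~{total_hours:.1f} hours of scraping")
```

A result like this tells you early on whether your storage, budget and schedule are realistic, or whether you should extract content during scraping to shrink the result set.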
Legal and Ethical
The information you collect from the web is already publicly available, but delivered in a way that is optimized for humans. Web scraping is a way of collecting the information optimized for machines. When scraping the web:
- Do not reuse or republish the data in a way that violates copyright or intellectual property rights.
- Respect the terms of service for the site you are trying to scrape.
- Have a reasonable request rate.
- Avoid scraping private areas of the website.
- Make sure you comply with privacy and ethical regulations; see also the guidelines (in Dutch) prepared by the Data School.
If you are not sure if you are allowed to scrape the data, please contact the privacy officer or legal expert at your faculty.
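A reasonable request rate can be enforced with a small throttle that spaces out consecutive requests. A minimal sketch, with the actual page request left as a placeholder:

```python
import time

class Throttle:
    """Ensure at least `interval` seconds pass between consecutive requests."""
    def __init__(self, interval: float):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(interval=1.0)  # at most one request per second (example rate)
for url in ["https://example.org/a", "https://example.org/b"]:
    throttle.wait()
    # fetch(url)  # placeholder for the actual page request
    print("requesting", url)
```

What counts as "reasonable" depends on the site; when in doubt, err on the slow side.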
Hosting your scraping process
When you are scraping a limited number of web pages over a short period of time, it is perfectly possible to use your own laptop. You can use either existing scraping tools or a custom script. When you are collecting data over a long period of time and/or when you are targeting many websites or sites with a large number of pages, you may want to run your collection process remotely.
A private server, for example in your faculty or department, may be a cheap solution. It is easy to gain access to and you have full control over the collection process. The scalability is, however, limited by the number of cores available in the server, and possibly by the available storage space.
Public cloud providers, like Amazon Web Services (AWS) or Google Cloud, offer scalable infrastructure that allows you to scrape large amounts of data in a limited amount of time. They are also a good solution when the collection process needs to run for a long time. As their services are cloud-based, you can access them from any operating system and any browser. There are also some disadvantages. Platforms like AWS and Google Cloud are huge and offer a wide range of services. It therefore takes time to learn to work with them, and setting up and monitoring your scraping process may be complicated. Moreover, when your web scraping needs grow, these services can become very expensive.
Designing your scraping process
The more pages you are scraping, the more important it is to pay attention to the design of the process.
The step of requesting and opening a web page takes time. Therefore, it is efficient to run multiple scraping tasks in parallel. There are several libraries available that you can use to include parallelism in your scraping code. If you are using cloud services you can even distribute your scraping process over multiple machines, for example through serverless computing. If you have multiple scraping tasks running in parallel, it is advisable to use a queuing system to coordinate the scheduling of the scraping tasks.
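In Python, thread-based parallelism is a simple way to overlap the waiting time of many requests. The `fetch` function below is a stand-in that simulates network latency; a real version would request the page:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    """Stand-in for requesting a page; a real version would use urllib or requests."""
    time.sleep(0.1)  # simulate network latency
    return url, f"<html>content of {url}</html>"

urls = [f"https://example.org/page{i}" for i in range(10)]

# Ten requests run concurrently instead of one after another,
# so the waiting time of the requests overlaps.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = dict(pool.map(fetch, urls))

print(len(results), "pages fetched")
```

The same idea scales up to distributing tasks over multiple machines, with a queue coordinating which worker scrapes which page.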
Web scraping involves dealing with many external factors over which you have no control. A website may be broken or may no longer exist, the address may be wrong, or there may be issues with the internet connection. You can minimize these factors by implementing error handling. For example, a task can be retried if it fails or can be set aside without stopping the entire scraping process. In addition, you can implement logging in your code to keep track of the progress of the scraping process.
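Error handling and logging can be combined in a small retry helper. The sketch below uses a deliberately flaky stand-in fetcher that fails twice before succeeding, to show the retries in action:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

def fetch_with_retry(url, fetch, attempts=3, delay=0.1):
    """Retry a failing request a few times before setting the task aside."""
    for attempt in range(1, attempts + 1):
        try:
            result = fetch(url)
            log.info("fetched %s on attempt %d", url, attempt)
            return result
        except OSError as exc:  # network-style errors
            log.warning("attempt %d for %s failed: %s", attempt, url, exc)
            time.sleep(delay)
    log.error("giving up on %s after %d attempts", url, attempts)
    return None  # set this task aside; the rest of the scrape continues

# Demo: a flaky stand-in fetcher that fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection reset")
    return "<html>ok</html>"

page = fetch_with_retry("https://example.org", flaky_fetch)
print(page)
```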
A normal web page represented in HTML is typically 2 to 3 MB in size. In your research project, you will probably only need certain elements of the page, such as the text or the links. Consider extracting this content during the scraping process. That way, you save storage space, and your results will be easier to process.
It is recommended that you store both your scraping results and your log lines directly in a database. This will allow you to query, inspect and manage your data.
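With Python's built-in sqlite3 module, this takes only a few lines. The table layout and example rows below are illustrative; a real project would use a database file rather than an in-memory one:

```python
import sqlite3

# In-memory database for the example; use a file path in a real project.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE pages (
    url TEXT PRIMARY KEY,
    scraped_at TEXT,
    content TEXT
)""")
db.execute("CREATE TABLE log (ts TEXT, message TEXT)")

# Store an extracted result and a log line as soon as they are produced.
db.execute("INSERT INTO pages VALUES (?, datetime('now'), ?)",
           ("https://example.org/page1", "extracted text of the page"))
db.execute("INSERT INTO log VALUES (datetime('now'), ?)",
           ("scraped https://example.org/page1",))
db.commit()

# The results can now be queried, inspected and managed with plain SQL.
count, = db.execute("SELECT COUNT(*) FROM pages").fetchone()
print(count, "page(s) stored")
```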
If you are hosting your scraping process remotely, your scraping results will probably be stored at that remote location as well. Consider where you want to conduct the data analysis. Is it technically possible and cost-effective to do it in the same remote location? If not, is it possible to transfer the data to another location and at what cost?
Help me choose
These questions will help you determine which data collection approach is suitable for you:
- What type of data do you need to answer your research question?
- Where can you find these data?
- Is there an existing archive or repository?
- Does the data owner provide access through an API?
- Which websites provide these data?
- Are you allowed to collect these data?
- Which time period does the collected data need to cover?
- Are there existing tools or approaches you can use?
- Do you have access to adequate resources, e.g.
- Expertise and time to set up the collection process
- Infrastructure at your department for collection and storage
- Money to pay for tools or cloud services
Contact us for tailored advice and support.