Side Draft

Using a Raspberry Pi as a Mini-Server for Scraping Used Vehicle Data

Introduction

This article marks the start of our new blog “SideDraft”, where we show you some of our side projects, ranging from programming that automates the boring to building mini-machines.

In this post we build a Python program that automatically collects data and statistics about used cars. We will use a Raspberry Pi Zero as a mini-server that continuously scrapes data from a website for selling used vehicles. The goal is to collect basic data, like vehicle model and listing price, which can be used to gain better insight and make a data-driven decision when buying a vehicle.

The Python Code

We are using a Python script that runs automatically at specified time intervals. The script reads data from the website “autoscout24”. After applying some filters, the website shows multiple vehicles on the same webpage, together with some basic details about each of them. We will use this page as a starting point and collect the details that are listed for each vehicle.

Scraping the website

The code snippet below shows how the script fetches the webpage as html and uses beautifulsoup4 to parse it. We use some of the functions built into beautifulsoup4 to search for specific html classes. By pressing F12, we can make our web browser display the same html text that is being read by our script. After some searching in the html code, we find that each vehicle listed on the page can be identified by an instance of the “cldt-summary-full-item-main” class. The find_all function returns a list whose elements contain the html of each instance of this specific class. A visual representation of one such element is highlighted in blue in the image above.

# Imports used for fetching and parsing the pages
from requests import get
from bs4 import BeautifulSoup

# Start with page 1 and scroll through 20 pages.
for pagenumber in range(1, 21):

    # Convert the page number to a string, to build the URL
    pagenumber_string = str(pagenumber)
    # Read the web URL as starting point
    response = get('https://www.autoscout24.be/nl/lst/?sort=age&desc=1&ustate=N%2CU&size=20&page=' + pagenumber_string + '&cy=B&atype=C&')

    # Parse the page with BeautifulSoup
    html_soup = BeautifulSoup(response.text, 'html.parser')

    # Get a list of all cars listed on the page, based on their html class.
    cars = html_soup.find_all('div', class_="cldt-summary-full-item-main")

The next step is to loop over this list to get the data for each vehicle. A method similar to the one shown above is used to extract data from the html code.

    # Loop over all cars on the page; CarID restarts at 0 for each new page
    for CarID, car in enumerate(cars):

        # Get Car Name
        CarName_holder = car.find_all('h2', class_="cldt-summary-makemodel sc-font-bold sc-ellipsis")
        CarName = CarName_holder[0].get_text()
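The same pattern extends to the other fields that end up in the CSV. As a rough sketch, still inside the loop above: we assume the first anchor tag in each summary block links to the detail page (our unique identifier), and the price class name below is an unverified assumption; check the html in your browser (F12) for the class actually used, just like before.

        # Get the link to the detail page; it doubles as our unique ID.
        # Assumption: the first <a> tag in the summary block holds it.
        UniqueURL = 'https://www.autoscout24.be' + car.find('a')['href']

        # Get Car Price. The class name is an assumption to verify in the
        # browser, as described above for the other classes.
        CarPrice_holder = car.find_all('span', class_="cldt-price")
        CarPrice = CarPrice_holder[0].get_text().strip()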

Saving the Data

For this first version, we will save the data to a CSV file. Because the script will be run multiple times in a row and we want to avoid duplicate entries, the first step is to check whether the CSV file used for storing the data already contains data. If there is previous data, a unique identifier for each existing entry is read, so it can be compared with the new data. We use the URL of the detailed vehicle page as unique identifier.

import csv

# Generate the CSV heading
fieldnames = ["Timestamp", "Page Number", "CarID", "UniqueURL", "CarName", "CarPrice", "CarMileage", "BuildMonth", "BuildYear", "Power", "Usage", "Transmission", "Fuel"]
# Initiate the array of unique IDs
UID = []

# Prepare the output file; "r+" requires the file to exist, so create an empty CarData.csv first.
with open("CarData.csv", "r+", newline="") as csvoutput:
        read_prepare = csv.reader(csvoutput, delimiter=",")

        # Read the contents into an array
        rows = []
        for row in read_prepare:
                rows.append(row)

        # If the array is empty, the file has not been used before: write the heading to the file.
        if not rows:
                write_prepare = csv.writer(csvoutput, delimiter=",")
                write_prepare.writerow(fieldnames)
        # If the file has been used before, get the Unique ID (URL) of each entry to avoid double entries.
        else:
                for i in rows:
                        UID.append(i[3])

 

For each vehicle, the URL is compared to the URLs already in the dataset. If the URL is not yet in the dataset, a new row containing the vehicle data is written to the CSV file. To keep track of the data collection, the details can also be printed to the console.

        # Write to CSV (datetime must be imported at the top of the script)
        with open("CarData.csv", "a+", newline="") as csvfile:

                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

                # Only write the entry if its URL is not yet in the dataset
                if UniqueURL not in UID:
                        writer.writerow({
                                "Timestamp": datetime.datetime.now(),
                                "Page Number": pagenumber_string,
                                "CarID": str(CarID),
                                "UniqueURL": UniqueURL,
                                "CarName": CarName,
                                "CarPrice": CarPrice,
                                "CarMileage": CarMileage,
                                "BuildMonth": BuildMonth,
                                "BuildYear": BuildYear,
                                "Power": Power,
                                "Usage": Usage,
                                "Transmission": Transmission,
                                "Fuel": Fuel
                        })

Raspberry Pi as Mini-Server

We want to collect data about every vehicle that is listed on the website, so it will have to be checked on a regular basis for new vehicles. The website only shows 20 pages of 20 vehicles, so we want to run the routine multiple times per day to avoid missing new entries.

We could run the script manually multiple times a day, but there is a high probability that it would be forgotten, and it just isn’t very convenient. We want to automate this with the smallest possible footprint. Therefore, we decided to use a Raspberry Pi Zero that works as a mini-server to collect data 24/7. Its extra small size, low power usage and low price (around €10) make it perfect for the job. We will soon forget about its existence while it sits in a corner of our office, connected to a phone charger, doing all the hard work for us.

 

We install a VNC server on the Raspberry Pi so we can access it at any time for monitoring and to download the collected data.

Also See:
https://www.raspberrypi.org/documentation/remote-access/vnc/

On the Raspberry Pi, we create an additional script that starts our data collection script and then sleeps for a specified time.
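A minimal sketch of what that wrapper could look like, assuming the scraping script is saved as carscraper.py (the file name and the six-hour interval are our own illustrative choices; a cron job could achieve the same effect):

import subprocess
import time

# Run the scraper, then sleep, forever.
while True:
    subprocess.run(["python3", "carscraper.py"])
    # Wait six hours before the next run.
    time.sleep(6 * 60 * 60)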

Results

In a couple of days, we managed to collect data from almost 8500 used vehicle listings. This data can be used for multiple purposes, for example finding the ideal car to buy, or just having fun with data.

As an example, we created some Excel graphs to compare similar cars from different countries. In this case, we take a look at small cars, similar to a VW Polo. We included cars from 5 different countries in the comparison. This is just a quick example, where for now we ignore more complicated statistics and more advanced techniques for comparing different datasets. We used Excel to provide a quick visual representation of how the price of similar cars from different manufacturers degrades over time (build year) and mileage. We created scatter plots and then used an exponential trendline to give an estimated function.
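For readers who would rather stay in Python than switch to Excel, the same kind of plot can be sketched with pandas and matplotlib. This sketch assumes the CarPrice and CarMileage columns have already been cleaned to plain numbers; an exponential trendline like Excel’s is just a straight-line fit on the logarithm of the price.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the scraped data; column names match the CSV heading above.
data = pd.read_csv("CarData.csv")

# Keep one model, e.g. every listing whose name mentions "Polo".
polo = data[data["CarName"].str.contains("Polo", na=False)]
x = polo["CarMileage"].astype(float)
y = polo["CarPrice"].astype(float)

# Exponential trendline price = a * exp(b * mileage),
# fitted as a straight line on log(price).
b, log_a = np.polyfit(x, np.log(y), 1)

plt.scatter(x, y, s=10, label="Polo listings")
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, np.exp(log_a + b * xs), label="exponential trend")
plt.xlabel("Mileage (km)")
plt.ylabel("Listing price (EUR)")
plt.legend()
plt.show()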

The graph for mileage shows that the VW Polo has the highest listing price when new, but also keeps the most value as the miles tick away. The Renault Clio, Toyota Yaris, Ford Fiesta and Skoda Fabia (almost identical to the Polo) seem very similar, with the Ford Fiesta appearing to keep its value a little better with increasing mileage. Surprisingly, this trend seems to indicate that the Kia Rio sits somewhere in between the VW Polo and the other vehicles in this class.

 

 

When looking at build year instead of mileage, the story is a little different. From the graph, it seems like the Yaris is the most expensive car when new, by quite a margin. However, it is worth noting that the dataset included one Yaris GR (2021) with a price of €35.000 that drives up this end of the exponential trendline. Without this special edition, the Yaris would be closer to the Polo and Fabia. The trendlines seem to indicate that the Polo is again clearly the vehicle that best keeps its value over time, this time almost matched by the Yaris. The four other vehicles again seem very similar to each other, but now with the Rio showing the worst depreciation.

We cannot and will not draw any conclusions from these visual trend representations; they are just an example of what the data can be used for. In reality, the listing price, which is really the value as perceived by the seller, is a combination of car age, mileage and many other factors.

What's Next?

We showed an example of what can be achieved with the collected data. The idea is to keep collecting data, which will create additional challenges in keeping the data manageable. The CSV format is an easy solution to showcase the possibilities, but it would eventually need to be replaced with an actual database. This would also enable us to build a simple application that runs automatic queries on the data.
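As a sketch of that first step, the sqlite3 module from Python’s standard library would already remove the manual duplicate check; the database file, table and column names below are our own illustrative choices, and the variables are the ones from the scraping loop above.

import sqlite3

# Open (or create) a local database file.
con = sqlite3.connect("cardata.db")
cur = con.cursor()

# A UNIQUE constraint on the URL replaces the manual UID list.
cur.execute("""CREATE TABLE IF NOT EXISTS listings (
        UniqueURL TEXT UNIQUE, CarName TEXT, CarPrice TEXT, CarMileage TEXT)""")

# INSERT OR IGNORE silently skips rows whose URL is already stored.
cur.execute("INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?)",
            (UniqueURL, CarName, CarPrice, CarMileage))
con.commit()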

Ultimately, the goal would be to use the data to predict a correct price for any vehicle, given some vehicle details. To achieve this, we would probably have to apply some more advanced statistics and techniques to the data. Importantly, we always need to keep in mind that such a prediction can never be fully correct: everything is based on the perception of the seller, and we do not know whether the vehicle actually sold, or for which price.
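To give a first taste of what that could look like, here is a minimal sketch using scikit-learn. The choice of features is purely illustrative, and it again assumes the relevant columns have been cleaned to plain numbers.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumes CarPrice, CarMileage and BuildYear were cleaned to numbers.
data = pd.read_csv("CarData.csv")
X = data[["CarMileage", "BuildYear"]]
y = data["CarPrice"]

model = LinearRegression().fit(X, y)

# Estimate a listing price for a car with 60,000 km, built in 2017.
query = pd.DataFrame([[60000, 2017]], columns=["CarMileage", "BuildYear"])
print(model.predict(query))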

What would you like to see?