The Great Automotive Price Hunt: A Web-Scraping Saga (Part 1)
Project intro
Overview of the Web Scraping Task
The Art of Web Scraping

OpenAI DALL·E 2: “The image should showcase Sherlock Holmes as the central figure, wearing his iconic attire. The laptop screen, stuffed with lines of code, must have a futuristic and charming appearance, with an array of colours and patterns representing the intricacies of the digital realm”

Picture this: It’s another ordinary day. You’ve just brewed your third cup of coffee and are sitting in front of your trusty laptop, browsing through a forest of used car listings. Suddenly, a thought sparks like a rogue neuron in your caffeinated brain: “How are these car prices determined, anyway?”

Is it the divine intervention of vehicular gods, or is there a more mundane, possibly algorithmic pattern behind the pricing of those metallic beasts? As a data scientist, your instinct to dissect and understand this automotive mystery kicks in. You may not have a detective’s trench coat or Sherlock’s deerstalker hat, but you do have Python, and that’s an equally potent weapon of deduction!

Welcome to Part 1 of our exciting series, “The Great Automotive Price Hunt,” where we embark on a journey to unravel the secret behind used car prices. But before we reach the thrilling vistas of predictive modeling and machine learning algorithms, we need to collect our clues. As Arthur Conan Doyle wrote in Sherlock Holmes, “It is a capital mistake to theorize before one has data.” To this end, we’ll turn to our trusty data detective’s kit: web scraping.

So, what’s the plot of today’s episode? Our protagonist, Python, equipped with his loyal companions BeautifulSoup and Requests, will bravely navigate through the treacherous HTML trees of the Hungarian used car website www.hasznaltauto.hu, extracting precious nuggets of data.

This saga won’t end with today’s tale. Oh no, we’re just getting started! This series will continue, with future chapters delving into exploratory data analysis, hypothesis testing, model building, and, of course, predictions.

But for now, fasten your seatbelts, put on your data goggles, and let’s dive into the thrilling world of web scraping!

Before you shrug off this mission thinking, “But I don’t speak Hungarian!”, allow me to share a bit of wisdom: you don’t need to know the local dialect to understand the landscape, or in our case, to extract useful information. Remember, we’re data explorers, not linguistic experts!

Let’s take a look at the first screenshot I’ve prepared.

An example of a car advertisement on www.hasznaltauto.hu. Let’s see what information we can collect from here.

Advertisement URLs: These are our magic doors to each individual car’s data room. We’ll gather these URLs like a squirrel gathering acorns for winter. Later, we’ll pop open each URL to feast on the detailed data hidden inside.

Technical details: Consider these the “personal details” of our cars. We get everything: the price, vehicle history, color, fuel type, gearbox, and much more.

Now, let’s move on to the second screenshot.

The second half of the webpage, with additional information about the car. This doesn’t have a predefined structure; the advertiser can write anything about the car.

This section holds unique details that the advertiser adds. Features like electric windows, passenger airbags, and alloy wheels all paint a picture of the car beyond just the “technical” details. The longer descriptions provide an opportunity to glean even more information, perhaps using some natural language processing. And then there’s information about the sale process itself.

Now, it’s time to follow the map and collect our riches, piece by piece. But how do we do that? Do we start a clicking marathon, tirelessly visiting each car’s webpage and jotting down every detail? Well, that’s the stone-age way of doing it! We, the data scientists, have a better weapon in our arsenal: web scraping. You can find the entire scraping code on GitHub; here I just explain the most important steps of the scraping process.

My task was simple, yet daunting: scour the virtual expanses of the site to unearth data about car listings. To optimize my search, I set specific parameters: I was only interested in cars, not minibuses, motorcycles, or any other kind of vehicle, and I needed the vehicles to have a verifiable history. After setting the right search parameters, I retrieved my base URL, which is simply the first page of the search results:

BASE_URL = "https://www.hasznaltauto.hu/talalatilista/REALLY_LONG_HASH/page"

As a data detective, I was more interested in how to get all of the search result URLs when a single page lists only 100 cars (not 30 thousand). When I jumped to the next search result page, the URL ended with “page2”, and for the next one, “page3”. I had stumbled upon a consistent pattern that would allow me to iterate through every page of the search results seamlessly.

url = self.BASE_URL + str(page_number)
response = requests.get(url)
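
The full script wraps this logic in a class (hence the `self.`), but as a standalone sketch, the whole pagination loop could look something like the following; `collect_listing_pages` and `last_page` are illustrative names of my own, not from the original script:

import requests

BASE_URL = "https://www.hasznaltauto.hu/talalatilista/REALLY_LONG_HASH/page"

def collect_listing_pages(last_page):
    """Download every search-result page and return the raw HTML documents."""
    pages = []
    for page_number in range(1, last_page + 1):
        # "page1", "page2", "page3", ... following the pattern spotted above
        response = requests.get(BASE_URL + str(page_number))
        response.raise_for_status()  # stop early if the site rejects the request
        pages.append(response.text)
    return pages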

But this was just the beginning of our investigation. Looking at a webpage and understanding its secrets are two different tasks. I employed the BeautifulSoup parser to dissect the webpage’s content:

soup = BeautifulSoup(response.content, "html.parser")
matches = soup.find_all("a", {"class": ""})
hrefs = pd.Series(
    [match["href"] for match in matches if match.has_attr("href")]
)

BeautifulSoup is a Python library, our tool to parse the intricate HTML, akin to an archaeologist’s brush on a dig site. It transforms raw HTML into a tree of Python objects for us to explore. The “soup” object houses all of the HTML content. The a tag in HTML is used for hyperlinks, which allow users to navigate from one page to another. It stands for “anchor”.

The class is a common attribute that can be used with any HTML tag. In our case, its value was always an empty string. The actual URL text is contained in the href attribute of each matched element we got from the find_all function.
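
To make this selection concrete, here is a tiny self-contained demo on hand-written HTML; the markup below is invented for illustration and is not copied from the real site:

from bs4 import BeautifulSoup

# Only the first anchor has both an empty class and an href attribute.
html = """
<a class="" href="https://www.hasznaltauto.hu/szemelyauto/opel/astra-1">Opel Astra</a>
<a class="nav-link" href="/belepes">Login</a>
<a class="">no link target here</a>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("a", {"class": ""})
print([m["href"] for m in matches if m.has_attr("href")])
# ['https://www.hasznaltauto.hu/szemelyauto/opel/astra-1']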

But not all links would lead to the right doors. I needed to separate the wheat from the chaff. An important clue here was that all car advertisements started with “www.hasznaltauto.hu/szemelyauto”. A quick filter operation was like a magic spell that left me only with the right links:

hrefs = hrefs[
    hrefs.str.contains("www.hasznaltauto.hu/szemelyauto", regex=False)
].to_list()

Web scraping is truly an art form, where attention to the minutest of details can make the difference between a case solved and a wild goose chase. For instance, on my mission to update my collection of links a few days later, the algorithm appeared to flag every link as new!

This is where cache busting comes in. By adding a unique string (like the random text at the end of the URL), the browser is tricked into thinking that this is a new resource and hence fetches a fresh copy from the server, bypassing the cache.
For example,

https://www.hasznaltauto.hu/szemelyauto/volkswagen/…

However, this extra piece of text is irrelevant to our investigation, so I can erase everything from the “#sid” part onward:

all_links = all_links.str.replace("#sid.*", "", regex=True)
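
As a quick sanity check, here is the same replacement applied to a single made-up link (the text after “#sid” is invented for the example):

import pandas as pd

all_links = pd.Series(
    ["https://www.hasznaltauto.hu/szemelyauto/volkswagen/golf-123#sid0a1b2c"]
)
print(all_links.str.replace("#sid.*", "", regex=True)[0])
# https://www.hasznaltauto.hu/szemelyauto/volkswagen/golf-123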

As a data detective and interpreter of coded mysteries, I encountered both useful clues and various red herrings along the way. One such encounter occurred when we met our first person of interest: a fancy HTML table tag that goes by the class “hirdetesadatok”. The table was not initially very talkative, so I called upon a master codebreaker in my repertoire, the `pd.read_html()` function. And oh, it didn’t disappoint! With a single blow, it unlocked the secrets hidden within the HTML tag, converting it into a well-structured Pandas DataFrame and making our job as detectives easier.

# Get the table in html format
table_html = str(soup.find_all('table', {'class': 'hirdetesadatok'})[0])

# Read the table with pandas
advertisement_data = pd.read_html(table_html)[0]

First 6 rows of the created dataframe advertisement_data.

However, upon inspecting the data, we came across some rogue entries. You see, the crafty website developers had decided to include some information that was of no particular use to our quest: headers within the table, like ‘Ár, költségek’ (price, expenses) and ‘Általános adatok’ (general data). Luckily, our trusty magnifying glass revealed a pattern: these rogue entries were missing the classic colon at the end, the trademark of a key-value pair. It’s always the little things!

# Clean the data
advertisement_data.columns = ['key', 'value']
advertisement_data = advertisement_data[
    advertisement_data['key'].str.contains(':$', regex=True)
]

So finally, we have some pretty useful data about a car:

Cleaned advertisement data from the car’s webpage. (Feature names are translated.)
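
Putting the pieces together, a per-advertisement scraper might look like the sketch below. The function name and the dict-shaped return value are my own illustrative choices, not necessarily how the full script organizes things:

import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_advertisement(url):
    """Fetch one advertisement page and return its key-value details."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")

    # The structured details live in the table with class 'hirdetesadatok'
    table_html = str(soup.find_all('table', {'class': 'hirdetesadatok'})[0])
    data = pd.read_html(table_html)[0]

    # Keep only genuine key-value rows: their keys end with a colon
    data.columns = ['key', 'value']
    data = data[data['key'].str.contains(':$', regex=True)]
    return dict(zip(data['key'], data['value']))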

To delve deeper into this thrilling world of code and mystery, head over to the full code on GitHub. You can check out the complete scraping script there and try it yourself. I have already collected data on more than 30 thousand advertisements with it.

The plot thickens, and the mystery deepens. The next chapter? A descriptive analysis of the data. You won’t want to miss it. The game, dear reader, is afoot!
