Scraping Content from HTML Source in Drupal
Web scraping, also known as content scraping, involves extracting data from your current website or an external site. This technique is especially valuable during website updates or migrations.
Continuing our previous blog post on migrating from Drupal 7 to Drupal 10, this article details how to scrape content from HTML sources in Drupal.
How To Scrape Content from HTML Source in Drupal
Suppose the description field (Body) of the source Drupal site has the following HTML structure:
<h1>Ea enim facilisi lenis nobis</h1>
<p><img alt="" src="/sites/default/files/image_body.jpeg" style="height:454px; width:228px" /></p>
<p>Blandit eligo laoreet pertineo quibus quidem. Euismod gemino iaceo praesent uxor. Amet gilvus nutus pecus turpis. Amet at comis in jus oppeto os patria qui ymo. Acsi blandit damnum exputo ibidem occuro praesent verto. Antehabeo distineo ibidem imputo pecus quibus refero sino turpis velit. Ea enim facilisi lenis nobis </p>
For our new Drupal 10 site, we'll extract the heading and the image, store them in dedicated fields, and then remove them from the body.
The result should look like this:
field_subtitle: Ea enim facilisi lenis nobis
body:
<p>Blandit eligo laoreet pertineo quibus quidem. Euismod gemino iaceo praesent uxor. Amet gilvus nutus pecus turpis. Amet at comis in jus oppeto os patria qui ymo. Acsi blandit damnum exputo ibidem occuro praesent verto. Antehabeo distineo ibidem imputo pecus quibus refero sino turpis velit. Ea enim facilisi lenis nobis </p>
field_image: /sites/default/files/image_body.jpeg
Choosing the Migration Process
To streamline our migration, we first extract the data and then remove it from the new body field, as described above. This can be done using existing migration plugins available in the community.
The migrate_plus module comes with some extra process plugins that can help us with this task.
In this case, we are going to use dom, dom_select, and dom_remove.
- dom: This process plugin allows you to import a string as a DOM document and vice versa. You will see it in action in the example.
- dom_select: Selects part of a DOM document with an XPath selector so it can be used in later steps.
- dom_remove: Like dom_select, but deletes the part of the document matched by an XPath selector. After this step, we usually need to convert the DOM document back into a string so it can be stored in a normal text field.
Finally, we are going to use image_import, a process plugin provided by the Migrate Files (extended) module.
- image_import: This process plugin allows you to import an image from a remote or local site. It downloads the image and saves it in the new Drupal 10 site, with no extra steps, right in the same migration file.
Migration File
We will build on the same example from our previous article on migrating from Drupal 7 to 10, adapting it to the purpose of this article.
id: example_articles
label: Migrate Articles from Drupal 7
migration_tags:
  - Example
source:
  plugin: d7_node
  node_type: article
  constants:
    remote_url: http://drupal7.lndo.site
    file_destination: 'public://images/'
process:
  uuid: uuid
  langcode: langcode
  revision_timestamp: revision_timestamp
  revision_uid: revision_uid
  revision_log: revision_log
  status: status
  title: title
  created: created
  changed: changed
  promote: promote
  sticky: sticky
  default_langcode: default_langcode
  revision_default: revision_default
  revision_translation_affected: revision_translation_affected
  path: path
  field_tags:
    plugin: sub_process
    source: field_tags
    process:
      target_id:
        plugin: migration_lookup
        migration: example_tags
        source: tid
  body/value:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_remove
      mode: element
      selector: '//h1'
    -
      plugin: dom_remove
      mode: element
      selector: '//p'
      limit: 1
    -
      plugin: dom
      method: export
  body/format:
    plugin: default_value
    default_value: basic_html
  field_subtitle:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_select
      selector: //h1
  _image_path:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_select
      selector: //img/@src
  _image_url:
    plugin: concat
    source:
      - constants/remote_url
      - '@_image_path/0'
  field_image:
    plugin: image_import
    source: '@_image_url'
    destination: 'constants/file_destination'
  uid:
    plugin: migration_lookup
    migration: example_users
    source: node_uid
    no_stub: true
destination:
  plugin: entity:node
  default_bundle: article
migration_dependencies:
  required:
    - example_users
    - example_files
    - example_media_images
    - example_tags
Removing Content from the Original Body
In this section, we chain multiple process plugins. First, we convert the string into a DOM object for subsequent processing.
In the second and third steps, we manipulate the DOM object using XPath selectors. An XPath cheat sheet is a handy reference when writing them.
We start by removing the <h1> and the first <p> element, which includes an <img> tag, using the limit property to restrict this to just one occurrence.
After the modifications are made, we convert the DOM object back into a string to store it in the body field:
body/value:
  -
    plugin: dom
    method: import
    source: body/0/value
  -
    plugin: dom_remove
    mode: element
    selector: '//h1'
  -
    plugin: dom_remove
    mode: element
    selector: '//p'
    limit: 1
  -
    plugin: dom
    method: export
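Removing the first <p> with limit: 1 works for the sample body above, but it would also delete a leading text paragraph on nodes that have no image. A slightly narrower variant is sketched below. It is our own substitution, not part of the original migration, and it assumes the image always sits directly inside a <p>. The XPath predicate //p[img] matches only paragraphs that have an <img> child.

```yaml
body/value:
  -
    plugin: dom
    method: import
    source: body/0/value
  -
    plugin: dom_remove
    mode: element
    selector: '//h1'
  -
    # Assumption: the image is a direct child of a <p>.
    # '//p[img]' matches only paragraphs that contain an <img>.
    plugin: dom_remove
    mode: element
    selector: '//p[img]'
    limit: 1
  -
    plugin: dom
    method: export
```

If the source bodies wrap images in other elements, such as links or figures, the selector would need to be adjusted accordingly.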
Extracting Info to Store into Separate Fields
Next, we pull specific pieces of data out of the body and store them in dedicated fields. Let’s start with the title.
Extracting the Title
We extract the title from the content and store it in field_subtitle.
As with the body field, we first convert the string into a DOM object. Then we select the text enclosed in the <h1> tag with the `dom_select` process plugin:
field_subtitle:
  -
    plugin: dom
    method: import
    source: body/0/value
  -
    plugin: dom_select
    selector: //h1
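Because an XPath selector can match several nodes, dom_select appears to hand on a list of matches, which is why the image field reads '@_image_path/0'. If field_subtitle should only ever receive the first heading, one option is to append core's extract plugin. This is a sketch under that assumption. The empty-string default is our own choice for bodies without an <h1>.

```yaml
field_subtitle:
  -
    plugin: dom
    method: import
    source: body/0/value
  -
    plugin: dom_select
    selector: //h1
  -
    # Keep only the first match. Fall back to an empty string
    # (our assumption) when the body has no <h1>.
    plugin: extract
    index:
      - 0
    default: ''
```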
Extracting the Image
To extract the image, we are going to use the same dom_select process plugin, plus image_import to store the image in an image field:
_image_path:
  -
    plugin: dom
    method: import
    source: body/0/value
  -
    plugin: dom_select
    selector: //img/@src
_image_url:
  plugin: concat
  source:
    - constants/remote_url
    - '@_image_path/0'
field_image:
  plugin: image_import
  source: '@_image_url'
  destination: 'constants/file_destination'
I prefixed the first two fields with an underscore _ just to indicate that we are using those fields to store temporary data to be used later.
In _image_path, we extract the image path from the body. For the sake of this example, let’s suppose the img tag in the body holds the relative URL of the image. We use the XPath selector //img/@src to capture the source URL.
For _image_url, we combine the extracted path with the base URL to form a complete URL to the image.
Lastly, we import the image into the field_image using the image_import process plugin provided by Migrate Files (extended).
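Not every article body will contain an image. One way to avoid handing image_import an incomplete URL is to guard field_image with core's skip_on_empty plugin. This is a minimal sketch, assuming that '@_image_path/0' is empty whenever dom_select finds no <img>:

```yaml
field_image:
  -
    # Skip only this field (not the whole row) when no image path was found.
    plugin: skip_on_empty
    method: process
    source: '@_image_path/0'
  -
    plugin: image_import
    source: '@_image_url'
    destination: 'constants/file_destination'
```

With method: row instead of method: process, the whole node would be skipped, which is probably not what we want for articles that simply have no image.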
Finally, simply run the following command:
drush migrate:import example_articles
At Octahedroid We Can Help With Your Drupal Website Migration
With over a decade of Drupal expertise, our team at Octahedroid is perfectly equipped to support your migration to Drupal 10. We understand the needs of users like you who depend on Drupal's powerful features for managing complex user permissions, workflows, and digital functionalities.
Interested in learning how we can assist you with all things Drupal? Contact us to find out more about our services.