Scraping Content from HTML Source in Drupal

By Gerardo Cadau

May 9, 2024

Continuing our series on migrating from Drupal 7 to Drupal 10, we take you through our step-by-step method for scraping content from HTML sources in Drupal.

Web scraping, also known as content scraping, involves extracting data from your current website or an external site. This technique is especially valuable during website updates or migrations.

Continuing our previous blog post on migrating from Drupal 7 to Drupal 10, this article details how to scrape content from HTML sources in Drupal.

How To Scrape Content from HTML Source in Drupal

Suppose the Body (description) field on the source Drupal 7 site contains the following HTML:

<h1>Ea enim facilisi lenis nobis</h1>
<p><img alt="" src="/sites/default/files/image_body.jpeg" style="height:454px; width:228px" /></p>
<p>Blandit eligo laoreet pertineo quibus quidem. Euismod gemino iaceo praesent uxor. Amet gilvus nutus pecus turpis. Amet at comis in jus oppeto os patria qui ymo. Acsi blandit damnum exputo ibidem occuro praesent verto. Antehabeo distineo ibidem imputo pecus quibus refero sino turpis velit. Ea enim facilisi lenis nobis </p>

For our new Drupal 10 site, we'll extract the heading and the image, store them in dedicated fields, and then remove them from the body.

The result should look like this:

field_subtitle: Ea enim facilisi lenis nobis

body:

<p>Blandit eligo laoreet pertineo quibus quidem. Euismod gemino iaceo praesent uxor. Amet gilvus nutus pecus turpis. Amet at comis in jus oppeto os patria qui ymo. Acsi blandit damnum exputo ibidem occuro praesent verto. Antehabeo distineo ibidem imputo pecus quibus refero sino turpis velit. Ea enim facilisi lenis nobis </p>

field_image: /sites/default/files/image_body.jpeg

Choosing the Migration Process

To streamline our migration, we first extract the data and then remove it from the new body field, as described above. This can be done using existing migration plugins available in the community.

The migrate_plus module comes with some extra process plugins that can help us with this task.

In this case, we are going to use dom, dom_select, and dom_remove.

dom: This process plugin imports a string as a DOM document and exports a DOM document back to a string. You will see it in action in the example.
dom_select: This process plugin uses an XPath selector to select part of a DOM document for further use.
dom_remove: Like dom_select, this process plugin takes an XPath selector, but it deletes the matching part of the document. After this step, we usually need to convert the DOM document back into a string so it can be stored in a regular text field.

Finally, we are going to use image_import, a process plugin provided by the Migrate Files (extended) module.

image_import: This process plugin imports an image from a remote or local source. It downloads the image and saves it in the new Drupal 10 site with no extra steps, right in the same migration file.

Migration File

We will build on the example from our previous article on migrating from Drupal 7 to 10, adapting it to the purpose of this article.

id: example_articles
label: Migrate Articles from Drupal 7
migration_tags:
  - Example
source:
  plugin: d7_node
  node_type: article
  constants:
    remote_url: http://drupal7.lndo.site
    file_destination: 'public://images/'
process:
  uuid: uuid
  langcode: langcode
  revision_timestamp: revision_timestamp
  revision_uid: revision_uid
  revision_log: revision_log
  status: status
  title: title
  created: created
  changed: changed
  promote: promote
  sticky: sticky
  default_langcode: default_langcode
  revision_default: revision_default
  revision_translation_affected: revision_translation_affected
  path: path
  field_tags:
    plugin: sub_process
    source: field_tags
    process:
      target_id:
        plugin: migration_lookup
        migration: example_tags
        source: tid
  body/value:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_remove
      mode: element
      selector: '//h1'
    -
      plugin: dom_remove
      mode: element
      selector: '//p'
      limit: 1
    -
      plugin: dom
      method: export
  body/format:
    plugin: default_value
    default_value: basic_html
  field_subtitle:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_select
      selector: //h1
  _image_path:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_select
      selector: //img/@src
  _image_url:
    plugin: concat
    source:
      - constants/remote_url
      - '@_image_path/0'
  field_image:
    plugin: image_import
    source: '@_image_url'
    destination: 'constants/file_destination'
  uid:
    plugin: migration_lookup
    migration: example_users
    source: node_uid
    no_stub: true
destination:
  plugin: entity:node
  default_bundle: article
migration_dependencies:
  required:
    - example_users
    - example_files
    - example_media_images
    - example_tags

Removing Content from the Original Body

In this section, we chain multiple process plugins. First, we convert the string into a DOM object for subsequent processing.

In the second and third steps, we manipulate the DOM object using XPath selectors. An XPath cheat sheet is a handy reference when writing them.

We remove the <h1> first, then the first <p> element, which contains the <img> tag. The limit property restricts that second removal to a single occurrence.

After the modifications are made, we convert the DOM object back into a string to store it in the body field:

  body/value:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_remove
      mode: element
      selector: '//h1'
    -
      plugin: dom_remove
      mode: element
      selector: '//p'
      limit: 1
    -
      plugin: dom
      method: export
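
Note that the `//p` selector with `limit: 1` removes the first paragraph whether or not it holds the image. As a hedged variation (not part of the original migration), an XPath predicate can target only a paragraph that contains an `<img>` child. The `'//p[img]'` selector below is an assumption about how your source markup is shaped:

```yaml
  body/value:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_remove
      mode: element
      selector: '//h1'
    -
      # Assumption: the image always sits directly inside a <p> element.
      plugin: dom_remove
      mode: element
      selector: '//p[img]'
      limit: 1
    -
      plugin: dom
      method: export
```

With this variation, a body whose first paragraph is plain text keeps that paragraph.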

Extracting Info to Store into Separate Fields

Next, we pull specific pieces of data out of the content and store them in dedicated fields. Let’s start with the title.

Extracting the Title

We extract the title from the content and store it in field_subtitle.

As with the previous field, we first convert the string into a DOM object. Then we select the text enclosed in the <h1> tag using the `dom_select` process plugin:

  field_subtitle:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_select
      selector: //h1
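
dom_select returns a list of matches, which is why the migration file above refers to `@_image_path/0` for the image. If field_subtitle should only ever receive the first heading, one possible sketch applies the same pseudo-field pattern. Here `_subtitle` is a hypothetical temporary name, not something from the original migration:

```yaml
  # Hypothetical temporary field holding every <h1> match.
  _subtitle:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_select
      selector: //h1
  # Keep only the first match.
  field_subtitle: '@_subtitle/0'
```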

Extracting the Image

To extract the image, we use the same dom_select process plugin, plus image_import to store the image in an image field:

  _image_path:
    -
      plugin: dom
      method: import
      source: body/0/value
    -
      plugin: dom_select
      selector: //img/@src
  _image_url:
    plugin: concat
    source:
      - constants/remote_url
      - '@_image_path/0'
  field_image:
    plugin: image_import
    source: '@_image_url'
    destination: 'constants/file_destination'

I prefixed the first two fields with an underscore (_) to indicate that they only hold temporary data to be used later.

In _image_path, we extract the image path from the body. For the sake of this example, let’s suppose the img tag in the body holds a relative URL. We use the XPath selector //img/@src to capture it.

For _image_url, we combine the extracted path with the base URL to form a complete URL to the image.

Lastly, we import the image into the field_image using the image_import process plugin provided by Migrate Files (extended).
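
If a source body has no `<img>` tag, `@_image_path/0` is empty and the import would be attempted with only the base URL. A minimal sketch of a guard, assuming the core skip_on_empty process plugin fits your case, stops the field_image pipeline before the download step:

```yaml
  field_image:
    -
      # Stop processing this field when no image path was found.
      plugin: skip_on_empty
      method: process
      source: '@_image_path/0'
    -
      plugin: image_import
      source: '@_image_url'
      destination: 'constants/file_destination'
```

With `method: process`, only this field is skipped and the rest of the row is still migrated.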

Finally, simply run the following command:

drush migrate:import example_articles

At Octahedroid We Can Help With Your Drupal Website Migration

With over a decade of Drupal expertise, our team at Octahedroid is perfectly equipped to support your migration to Drupal 10. We understand the needs of users like you who depend on Drupal's powerful features for managing complex user permissions, workflows, and digital functionalities.

Interested in learning how we can assist you with all things Drupal? Contact us to find out more about our services.

About the author

Gerardo Cadau, Software Engineer

Senior Backend Dev, Tech Lead, and World Champion. I solve problems and optimize software development processes. Pragmatic, I prioritize getting things done.
