dexi.io is the ultimate platform for web scraping!
Dexi robots don’t just allow you to read data from a web page. Extractor robots, the “standard” and most powerful robot type in dexi.io for web scraping, allow you to perform logins and searches, select elements in dropdown lists and dates in calendars, hover over elements, click buttons, submit forms, wait for elements to appear, loop over and paginate through results, and extract the results you want, as plain text or binary, formatted the way you want. Extractor robots also support XML.
Extractor robots in dexi.io are built with a simple but powerful point-and-click editor that shows the web page in the top part and, in the bottom part, a “developer console” like the one you are used to in your favourite browser.
Classic web scraping examples
But Extractor robots can do even more! Examples of more complex interactions include:
Use the CAPTCHA Add-On to get past CAPTCHA “I am not a robot” prompts
The platform also allows you to control network requests, e.g. to block unneeded resources to optimise speed, or to block a request that triggers an error in the website’s source code so that the robot execution can succeed.
Extractor robots are not just a sequence of steps. Like a program written in a programming language such as Java or JavaScript, robots can have conditions (called branches) and loops that control the flow of execution. This includes scoping: for example, an element selected in a loop is only accessible within that loop.
Pipes robots allow you to define a graph of actions to allow for very flexible robot design. More on Pipes later.
Easily set up a sequence of steps, loops and branches to control the flow of execution.
Once a robot is working as intended it can be executed to get actual results. Robots can be executed with different configurations, most importantly with multiple input values, effectively executing the robot multiple times, say, with different search values or dates.
Other values that can be configured include:
The results of an execution can be viewed directly in the UI, downloaded in common file formats (csv/xls/json) or, as mentioned above, sent to and stored in a number of different places. More on this later.
A single robot can have multiple configurations which can be executed independently.
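Executing a robot with multiple input values can be pictured as running it once per input and collecting all rows into one result set. The following is only a minimal sketch of that idea, not dexi.io's implementation: `execute_with_inputs` and `fake_search_robot` are hypothetical names, and the robot is stood in for by a plain Python function.

```python
from typing import Callable, Iterable

Row = dict  # one result row, e.g. {"title": "..."}


def execute_with_inputs(
    robot: Callable[[dict], list[Row]], inputs: Iterable[dict]
) -> list[Row]:
    """Run `robot` once per input and tag each row with the input that produced it."""
    results: list[Row] = []
    for values in inputs:
        for row in robot(values):
            # Keep the input values next to the extracted fields.
            results.append({**values, **row})
    return results


def fake_search_robot(values: dict) -> list[Row]:
    """Hypothetical stand-in for a robot that searches a site for `values['query']`."""
    return [{"title": f"{values['query']} result {i}"} for i in (1, 2)]


rows = execute_with_inputs(
    fake_search_robot, [{"query": "laptop"}, {"query": "tablet"}]
)
```

Here two input values produce four rows, each carrying the search value it came from.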
The Extractor robot editor shows the steps, loops and branches of the robot. Just like you would debug a program using a debugger (in an IDE), the state of an Extractor robot can be inspected and debugged directly in the editor (keyboard shortcuts are supported):
If an execution of a robot has failed, a log will show all events, which can help you debug the robot.
Tip! Dexi uses sophisticated anonymization techniques to hide its presence, but if the target site has detected the robot, a typical solution is to change the proxy.
But wait, there’s more! The dexi.io platform also comes with advanced data processing and normalisation capabilities.
Pipes
Via Pipes robots it is possible to define a completely custom robot execution flow performing arbitrarily complex data processing and transformation logic. For example, a Pipes robot could execute an Extractor robot, loop over its results, call an external web service for a specific field in each result, do some custom formatting of the web service result and save the “enriched” results in a data set.
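The flow in that example (run an extractor, loop over its results, call a web service per result, format the answer, save the enriched row) can be sketched in ordinary code. This is an illustration of the data flow only: in dexi.io the same thing is configured as a graph of actions, and every function name and value below is a hypothetical stand-in.

```python
from typing import Optional


def run_extractor(url: str) -> list[dict]:
    """Hypothetical stand-in for executing an Extractor robot against `url`."""
    return [
        {"name": "Tab S2", "price": "399"},
        {"name": "Galaxy S7", "price": "599"},
    ]


def fetch_rating(name: str) -> Optional[float]:
    """Hypothetical stand-in for an external web service queried per result."""
    return {"Tab S2": 4.2, "Galaxy S7": 4.5}.get(name)


def run_pipe(url: str, data_set: list[dict]) -> list[dict]:
    """Enrich every extractor result with a formatted rating and store it."""
    for result in run_extractor(url):
        rating = fetch_rating(result["name"])           # call the web service
        label = "n/a" if rating is None else f"{rating:.1f}/5"  # custom formatting
        data_set.append({**result, "rating": label})    # save the enriched row
    return data_set


data_set = run_pipe("https://example.com/products", [])
```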
Other features in Pipes robots include:
Pipes robots allow you to define a graph of actions to allow for very flexible robot designs
Extracting web data typically means huge amounts of data and that data is often heterogeneous and comes in various qualities.
For example, one website might provide certain information about a product (name, price, description) whereas another website provides less (name, price). Furthermore, different spellings or formattings of a product name make it challenging to normalise or standardise the data. Examples: “Samsung Galaxy” vs “Samsung Gallaxy” and “Tab S2” vs “S2 Tab”.
Dexi provides a number of different ways to overcome these challenges.
Data Sets
Data sets allow millions of rows to be stored and queried efficiently. A dexi.io data set can be seen as a table in a relational/SQL database, a collection in a NoSQL database or a sheet in a spreadsheet. A data set has a data type defining the fields for each row in the data set. Rows can be created (added), viewed (read), modified (updated) and deleted (i.e. the usual CRUD operations).
A dexi.io data set includes an additional feature that makes it more advanced and powerful than its traditional counterpart: it contains a dynamic key configuration which allows data deduplication and record linkage operations to be performed.
The key configuration, or just key, can consist of multiple fields. Depending on the data type of the fields included in the key, different comparison methods are available, e.g. Levenshtein (edit) distance for strings. Thresholds define whether two values are automatically considered duplicates, should be manually verified or are considered distinct.
The key configuration can be changed, e.g. “narrowing down” or “widening” the key, and another deduplication or record linkage operation run to update the data set to reflect the new key.
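To make the threshold idea concrete, here is a minimal sketch of key-based comparison using Levenshtein distance. It is not dexi.io's implementation: the function names, the two cut-off values (`duplicate_max`, `review_max`) and the choice to take the worst distance across all key fields are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def classify_rows(row_a, row_b, key, duplicate_max=1, review_max=3):
    """Compare two rows on the key fields; thresholds are assumed values."""
    worst = max(
        levenshtein(str(row_a[f]).lower(), str(row_b[f]).lower()) for f in key
    )
    if worst <= duplicate_max:
        return "duplicate"
    if worst <= review_max:
        return "verify"  # close enough to need a manual check
    return "distinct"


a = {"name": "Samsung Galaxy", "model": "S2"}
b = {"name": "Samsung Gallaxy", "model": "S3 Lite"}
narrow = classify_rows(a, b, key=("name",))
wide = classify_rows(a, b, key=("name", "model"))
```

With the key limited to `name` the two rows count as duplicates; adding `model` to the key makes them distinct, which is the effect of changing the key and re-running the deduplication.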
The way to normalise the fields of the results from executions of different robots is to use data types. Data types can be used as both input and output for robots as well as for data sets and dictionaries.
Data types support standard primitive types, e.g. numbers and booleans, as well as complex/object values. AutoBots using data types answer the question of how to normalise the results across a number of robots extracting data from different domains, e.g. product information from amazon.com, alibaba.com and bestbuy.com. At least one example URL per domain is provided, an Extractor robot is created for each domain, and the output data type of the AutoBot ensures a common format for results.
With AutoBots you can normalise the results across a number of robots extracting data from different domains.
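The role of a shared output data type can be illustrated with a small sketch: each site delivers its own raw fields, and a per-site mapping produces the same `Product` shape. The `Product` type, the raw field names (`title`, `desc`, `product_name`, `cost`) and the mapping functions are hypothetical and only stand in for what the per-domain Extractor robots would produce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    """Common output type: every robot must deliver results in this shape."""
    name: str
    price: float
    description: Optional[str] = None  # not every site provides one


def from_site_a(raw: dict) -> Product:
    """Site A (hypothetical) provides name, price and description."""
    return Product(
        name=raw["title"].strip(),
        price=float(raw["price"].lstrip("$")),
        description=raw.get("desc"),
    )


def from_site_b(raw: dict) -> Product:
    """Site B (hypothetical) provides only name and price."""
    return Product(name=raw["product_name"].strip(), price=float(raw["cost"]))


products = [
    from_site_a({"title": " Tab S2 ", "price": "$399.00", "desc": "Tablet"}),
    from_site_b({"product_name": "Tab S2", "cost": "389"}),
]
```

Both results now share the same fields, so they can be stored in one data set and compared directly.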
Dictionaries
A dictionary provides similar functionality to a data set with a key configuration but can, as the name implies, be used to easily perform lookups of keys to values. It is often used to correct misspellings like “Galaxy” vs “Gallaxy”.
Lookups in a dictionary can be exact or “fuzzy”, i.e. using the same Levenshtein distance as for the key configurations described above, or can even be done by tokenizing both the key and the lookup term, effectively performing a “contains” query word by word. Finally, keys can be regular expressions, such that a lookup of “Tesla Model 3” will match the key “Tesla Model [SX3]”.
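The four lookup styles can be sketched as follows. This is an illustration under stated assumptions, not dexi.io's API: all function names are hypothetical, the fuzzy cut-off of one edit is arbitrary, the token lookup assumes that every word of the lookup term must occur in the key, and the regular-expression key is written as the character class `[SX3]`.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def lookup_exact(entries: dict, term: str):
    return entries.get(term)


def lookup_fuzzy(entries: dict, term: str, max_distance: int = 1):
    """Return the value of the closest key if it is within `max_distance` edits."""
    best_key, best = None, max_distance + 1
    for key in entries:
        distance = levenshtein(key.lower(), term.lower())
        if distance < best:
            best_key, best = key, distance
    return entries[best_key] if best_key is not None else None


def lookup_tokens(entries: dict, term: str):
    """Word-by-word 'contains': every word of `term` must occur in the key."""
    wanted = set(term.lower().split())
    for key, value in entries.items():
        if wanted <= set(key.lower().split()):
            return value
    return None


def lookup_regex(entries: dict, term: str):
    """Treat each key as a regular expression that must match the whole term."""
    for pattern, value in entries.items():
        if re.fullmatch(pattern, term):
            return value
    return None


spelling = {"Galaxy": "Galaxy"}
products = {"Samsung Galaxy Tab S2": "tablet"}
models = {r"Tesla Model [SX3]": "Tesla"}
```

For example, `lookup_fuzzy(spelling, "Gallaxy")` corrects the misspelling, `lookup_tokens(products, "S2 Tab")` finds the key despite the word order, and `lookup_regex(models, "Tesla Model 3")` matches the pattern key.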
Results from robot executions can be delivered in a number of different ways for manual consumption by a human or automatic consumption by a program:
Besides retrieving the results of an execution as mentioned above the API also supports e.g.: