As is customary, we begin by creating our Gradle project. This time, we'll include the Jsoup library as a dependency.
Jsoup is a Java library for working with real-world HTML, according to its website — https://jsoup.org. It uses the finest of HTML5 DOM techniques and CSS selectors to create a highly easy API for requesting URLs, extracting, and modifying data.
Then there's our finished product.
build.gradle will seems to be.
buildscript { ext.kotlin_version = '1.6.10' ext.vertx_version = '4.2.5' repositories { mavenCentral() } dependencies { classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version" } group = 'WebScraper' version = '1.0.0' } plugins { id 'org.jetbrains.kotlin.jvm' version '1.6.10' id 'java' } repositories { mavenCentral() } sourceCompatibility = JavaVersion.VERSION_11 targetCompatibility = JavaVersion.VERSION_11 dependencies { // Kotlin implementation 'org.jetbrains.kotlin:kotlin-stdlib' // Vertx Core implementation "io.vertx:vertx-core:$vertx_version" implementation "io.vertx:vertx-lang-kotlin:$vertx_version" // Vertx Web implementation "io.vertx:vertx-web:$vertx_version" // Vertx Rxjava implementation "io.vertx:vertx-rx-java3:$vertx_version" implementation "io.vertx:vertx-rx-java3-gen:$vertx_version" // Jsoup implementation 'org.jsoup:jsoup:1.14.3' } jar { duplicatesStrategy = DuplicatesStrategy.EXCLUDE manifest { attributes 'Implementation-Title': group attributes 'Implementation-Version': archiveVersion attributes 'Main-Class': 'AppKt' } from { configurations.runtimeClasspath.collect { it.isDirectory() ? it : zipTree(it) } } } task cleanAndJar { group = 'build' description = 'Clean and create jar' dependsOn clean dependsOn jar } compileKotlin { kotlinOptions { jvmTarget = JavaVersion.VERSION_11 } }
This is how our source files will be organized.
- src | - main | | - java | | - kotlin | | | - verticle | | | | - HttpServerVerticle.kt | | | | - MainVerticle.kt | | | - App.kt
The main function that will operate the service is App.kt.
Our REST API code will be written in the class HttpServerVerticle.kt.
The primary Vert.x instance that will execute the HttpServerVerticle class is MainVerticle.kt.
In our HttpServerVerticle class, we'll use Jsoup to construct the web scraper.
package verticle import io.vertx.core.Promise import io.vertx.core.impl.logging.LoggerFactory import io.vertx.core.json.JsonArray import io.vertx.core.json.JsonObject import io.vertx.rxjava3.core.AbstractVerticle import io.vertx.rxjava3.ext.web.Router import io.vertx.rxjava3.ext.web.RoutingContext import org.jsoup.Jsoup import java.text.SimpleDateFormat class HttpServerVerticle : AbstractVerticle() { private val logger = LoggerFactory.getLogger(HttpServerVerticle::class.java) override fun start(promise: Promise) { val router = Router.router(vertx).apply { get("/web-scraper/scrape").handler(this@HttpServerVerticle::scrape) } vertx .createHttpServer() .requestHandler(router) .rxListen(8282) .subscribe( { promise.complete() }, { failure -> promise.fail(failure.cause) }) } private fun scrape(context: RoutingContext) { val responses = JsonArray() responses.addAll(scrapeCnbc()) responses.addAll(scrapeGeekWire()) responses.addAll(scrapeTechStartups()) context.response().statusCode = 200 context.response().putHeader("Content-Type", "application/json") context.response().end(responses.encode()) } private fun scrapeCnbc(): JsonArray { val responses = JsonArray() val document = Jsoup.connect("https://www.cnbc.com/startups/").get() val pageRowElements = document.select("div.PageBuilder-pageRow") for (pageRowElement in pageRowElements) { try { val moduleHeaderTitle = pageRowElement.selectFirst("h2.ModuleHeader-title") if (moduleHeaderTitle?.text().equals("More In Start-ups")) { val elements = pageRowElement.select("div.Card-standardBreakerCard") for (element in elements) { val title = element.selectFirst("a.Card-title") val datePost = element.selectFirst("span.Card-time") val dateFormat = SimpleDateFormat("EEE, MMM d yyyy") val dateString = datePost?.text()?.replace("st|nd|rd|th".toRegex(), "") val date = dateFormat.parse(dateString) responses.add(JsonObject().apply { put("title", title?.text()) put("link", title?.attr("href")) put("timestamp", date.time) }) } } } catch (e: Exception) { logger.error(e.message) } } return responses } private fun scrapeGeekWire(): JsonArray { val responses = JsonArray() val document = Jsoup.connect("https://www.geekwire.com/startups/").get() val articleElements = document.select("article.type-post") for (articleElement in articleElements) { try { val title = articleElement.selectFirst("h2.entry-title > a") val datePost = articleElement.selectFirst("time.published") val dateFormat = SimpleDateFormat("MMM d, yyyy") val dateString = datePost?.text() val date = dateFormat.parse(dateString) responses.add(JsonObject().apply { put("title", title?.text()) put("link", title?.attr("href")) put("timestamp", date.time) }) } catch (e: Exception) { logger.error(e.message) } } return responses } private fun scrapeTechStartups(): JsonArray { val responses = JsonArray() val document = Jsoup.connect("https://techstartups.com/").get() val postElements = document.select("div.post.type-post") for (postElement in postElements) { try { val title = postElement.selectFirst("h5 > a") val datePost = postElement.selectFirst("span.post_info_date") val dateFormat = SimpleDateFormat("MMM d, yyyy") val dateString = datePost?.text()?.replace("Posted On ", "") val date = dateFormat.parse(dateString) responses.add(JsonObject().apply { put("title", title?.text()) put("link", title?.attr("href")) put("timestamp", date.time) }) } catch (e: Exception) { logger.error(e.message) } } return responses } }
val document = Jsoup.connect("https://www.geekwire.com/startups/").get()
The web address is sent to the connect() function. The entire HTML DOM is then obtained using the get() function.
Then, on line 83, we'll use css selector to build a query.
val articleElements = document.select("article.type-post")
Selector in the JQuery library in Javascript is comparable to this.
The elements that match the selection will be found by the select () function. Yes, an array is returned. So, in the following line, we'll iterate it.
The following code may be found on line 87.
val title = articleElement.selectFirst("h2.entry-title > a")
Only one element that meets the selection is returned using the selectFirst() function. As a result, we must verify that the selector exists only within one of the articleElements.
The data that has been scraped is then stored in a JsonArray object on line 94.
responses.add(JsonObject().apply { put("title", title?.text()) put("link", title?.attr("href")) put("timestamp", date.time) })
These are the Vert.x standard classes and methods for deploying our service.
package verticle import io.vertx.core.Promise import io.vertx.rxjava3.core.AbstractVerticle import io.vertx.rxjava3.core.RxHelper class MainVerticle : AbstractVerticle() { override fun start(promise: Promise) { RxHelper.deployVerticle(vertx, HttpServerVerticle()) .subscribe( { promise.complete() }, promise::fail) } } import io.vertx.core.Launcher import verticle.MainVerticle fun main() { Launcher.executeCommand("run", MainVerticle::class.java.name) }
After you've finished scripting, build it, and execute the cleanAndJar Gradle process.
Simply run: if you already have your jar file.
$ java -jar build/libs/WebScraper-1.0.0.jar
It's time to put our web scraper API to the test. Postman is what you use to test the outcome of what we've constructed.
You can contact iWeb Scraping, for any web scraping service requirements and request for a quote!