Monday, 21 November 2016

How Xpath Plays Vital Role In Web Scraping

How Xpath Plays Vital Role In Web Scraping

XPath is a language for finding information in structured documents like XML or HTML. You can say that XPath is (sort of) SQL for XML or HTML files. XPath is used to navigate through elements and attributes in an XML or HTML document.

To understand XPath we must be clear about elements and nodes which are the building blocks of XML and HTML. Let’s talk about them. Here is an example element in an HTML document:

   <a class=”hyperlink” href=http://www.google.com>google</a>

Copy the above text to a file, name it as sample.html and open it in a browser. This will end up as a text link displaying the words “google” and it will take you to www.google.com. For each element there are three main parts: The type, the attributes, andthe text. They are listed below:

 a                                 Type
class,  href                Attributes
google                       Text

Let’s grab some XPath developer tools. I am on Firebug for Firefox or you can use Chrome’s developer tools. We will now form some XPath expressions to extract data from the above element. We will also verify the XPath by using Firebug Console.

For extracting the text “google”:

   //a[@href]/text()   

   //a[@class=”hyperlink”]/text()
 
For extracting the hyperlink i.e. ”www.google.com” :

   //a/@href
//a[@class=”hyperlink”]/@href

That’s all with a single element but in reality, you need to deal with more complex forms.

Let’s proceed to the idea of nodes, and its familial relationship of HTML elements. Look at this example code:

 <div title=”Section1″>

   <table id=”Search”>

       <tr class=”Yahoo”>Yahoo Search</tr>

       <tr class=”Google”>Google Search</tr>

   </table>

</div>

 Notice the </div> at the bottom? That means the table and tr elements are contained within the div. These other elements are considered descendants of the div. The table is a child, and the tr is a grandchild (and so on and so forth). The two tr elements are considered siblings each other. This is vital, as XPath uses these relationships to find your element.

So suppose you want to find the Google item. Any of the following expressions will work:

   //tr[@class=’Google’]
   //div/table/tr[2]
  //div[@title=”Section1″]//tr

So let’s analyze the expressions. We start at the top element (also known as a node). The // means to search all descendants, / means to just look at the current element’s children. So //div means look through all descendants for a div element. The brackets [] specify something about that element. So we can look for an attribute with the @ symbol, or look for text with the text() function. We can chain as many of these together as we can.

Here is a quick reference:

   //             Search all descendant elements
   /              Search all child elements
   []             The predicate (specifies something about the element you are looking for)
   @           Specifies an element attribute. (For example, @title)
   
   .               Specifies the current node (useful when you want to look for an element’s children in the predicate)
   ..              Specifies the parent node
  text()       Gets the text of the element.
   
In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors.

Please subscribe to our blog to get notified when we publish the next blog post.

Source: http://blog.datahut.co/how-xpath-plays-vital-role-in-web-scraping/

Friday, 4 November 2016

Outsource Data Mining Services to Offshore Data Entry Company

Outsource Data Mining Services to Offshore Data Entry Company

Companies in India offer complete solution services for all type of data mining services.

Data Mining Services and Web research services offered, help businesses get critical information for their analysis and marketing campaigns. As this process requires professionals with good knowledge in internet research or online research, customers can take advantage of outsourcing their Data Mining, Data extraction and Data Collection services to utilize resources at a very competitive price.

In the time of recession every company is very careful about cost. So companies are now trying to find ways to cut down cost and outsourcing is good option for reducing cost. It is essential for each size of business from small size to large size organization. Data entry is most famous work among all outsourcing work. To meet high quality and precise data entry demands most corporate firms prefer to outsource data entry services to offshore countries like India.

In India there are number of companies which offer high quality data entry work at cheapest rate. Outsourcing data mining work is the crucial requirement of all rapidly growing Companies who want to focus on their core areas and want to control their cost.

Why outsource your data entry requirements?

Easy and fast communication: Flexibility in communication method is provided where they will be ready to talk with you at your convenient time, as per demand of work dedicated resource or whole team will be assigned to drive the project.

Quality with high level of Accuracy: Experienced companies handling a variety of data-entry projects develop whole new type of quality process for maintaining best quality at work.

Turn Around Time: Capability to deliver fast turnaround time as per project requirements to meet up your project deadline, dedicated staff(s) can work 24/7 with high level of accuracy.

Affordable Rate: Services provided at affordable rates in the industry. For minimizing cost, customization of each and every aspect of the system is undertaken for efficiently handling work.

Outsourcing Service Providers are outsourcing companies providing business process outsourcing services specializing in data mining services and data entry services. Team of highly skilled and efficient people, with a singular focus on data processing, data mining and data entry outsourcing services catering to data entry projects of a varied nature and type.

Why outsource data mining services?

360 degree Data Processing Operations
Free Pilots Before You Hire
Years of Data Entry and Processing Experience
Domain Expertise in Multiple Industries
Best Outsourcing Prices in Industry
Highly Scalable Business Infrastructure
24X7 Round The Clock Services

The expertise management and teams have delivered millions of processed data and records to customers from USA, Canada, UK and other European Countries and Australia.

Outsourcing companies specialize in data entry operations and guarantee highest quality & on time delivery at the least expensive prices.

Herat Patel, CEO at 3Alpha Dataentry Services possess over 15+ years of experience in providing data related services outsourced to India.

Visit our Facebook Data Entry profile for comments & reviews.

Our services helps to convert any kind of  hard copy sources, our data mining services helps to collect business contacts, customer contact, product specifications etc., from different web sources. We promise to deliver the best quality work and help you excel in your business by focusing on your core business activities. Outsource data mining services to India and take the advantage of outsourcing and save cost.

Source: http://ezinearticles.com/?Outsource-Data-Mining-Services-to-Offshore-Data-Entry-Company&id=4027029

Wednesday, 19 October 2016

Scraping Yelp Data and How to use?

Scraping Yelp Data and How to use?

We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time we have not seen a good business case for a commercial project with scraping Yelp.

We have decided to release a simple example Yelp robot which anyone can run on Chrome inside your computer, tune to your own requirements and collect some data. With this robot you can save business contact information like address, postal code, telephone numbers, website addresses etc.  Robot is placed in our Demo space on Web Robots portal for anyone to use, just sign up, find the robot and use it.

How to use it:

    Sign in to our portal here.
    Download our scraping extension from here.
    Find robot named Yelp_us_demo in the dropdown.
    Modify start URL to the first page of your search results. For example: http://www.yelp.com/search?find_desc=Restaurants&find_loc=Arlington,+VA,+USA
    Click Run.
    Let robot finish it’s job and download data from portal.

Some things to consider:

This robot is placed in our Demo space – therefore it is accessible to anyone. Anyone will be able to modify and run it, anyone will be able to download collected data. Robot’s code may be edited by someone else, but you can always restore it from sample code below. Yelp limits number of search results, so do not expect to scrape more results than you would normally see by search.

In case you want to create your own version of such robot, here it’s full code:

// starting URL above must be the first page of search results.
// Example: http://www.yelp.com/search?find_desc=Restaurants&find_loc=Arlington,+VA,+USA

steps.start = function () {

   var rows = [];

   $(".biz-listing-large").each (function (i,v) {
     if ($("h3 a", v).length > 0)
       {
        var row = {};
        row.company = $(".biz-name", v).text().trim();
        row.reviews =$(".review-count", v).text().trim();
        row.companyLink = $(".biz-name", v)[0].href;
        row.location = $(".secondary-attributes address", v).text().trim();
        row.phone = $(".biz-phone", v).text().trim();
        rows.push (row);
      }
   });

   emit ("yelp", rows);
   if ($(".next").length === 1) {
     next ($(".next")[0].href, "start");
   }
 done();
};

Source: https://webrobots.io/scraping-yelp-data/

Friday, 30 September 2016

How to do data scraping from PDF files using PHP?

How to do data scraping from PDF files using PHP?

Situations arise when you want to scrap data from PDF or want to search PDF files for matching text. Suppose you have website where users uploads PDF files and you want to give search functionality to user which searches all uploaded PDF file content for matching text and show all PDFs that contains matching search keywords.

Or you might have all London real estate properties details in PDF report file and you want to quickly grab scrape data from PDF reports then you might need PDF scraping library.

To integrate such functionality to web application is not similar to normal search functionality that we do with database search.

Here is the straight solution for this problem. This involves PDF Data Scraping to plain text and match search terms. I have written this post for the people who want to do PDF data scraping or want to make their PDF files to be Searchable.

We are going to use class named class.pdf2text.php which converts PDF text to into ASCII text, so the class is known for PDF extraction. This PHP class ignores anything in PDF that is not a text.

Let’s see very basic example (Taken from author’s file):

<?php

include "class.pdf2text.php";

$a = new PDF2Text();
$a->setFilename('web-scraping-service.pdf'); //grab the pdf file reside in folder where PHP files resides.

$a->decodePDF();//converts PDF content to text
echo $a->output();

?>

“Web Scraping is a technique using which programmer can automate the copy paste manual work and save the time. This is PDF w eb scraping using PHP. We at Web Data Scraping offer Web Scraping and Data Scraping Service. Vist our website www.webdata-scraping.com”

For more complex extraction you can apply regular expression on the text you get and can parse text that you want from PDF. But keep in mind this has limitation and do not work with all types of PDF extraction.

But the wonderful use of this class is to make utility that allow user to search inside PDF when they search on web search bar. Last but not least, You can also find many PDF scraping software available in market that can do complex scraping from PDF files.

Source: http://webdata-scraping.com/data-scraping-pdf-files-using-php/

Tuesday, 20 September 2016

Run Code Template – New Feature Added to Fminer Web Scraping Tool

Run Code Template – New Feature Added to Fminer Web Scraping Tool

Fminer is one of the powerful web scraping software, I already given brief of all the Fminer features in previous post. In this post I am going to introduce one of the interesting feature of fminer which is Run Code Template that is recently added to Fminer, this feature is similar to “Fminer Run Code” action but it’s different in a way you can use it. The Run Code Action you can use inside the data scraping flow and python code get executed when scraper start running.

While Run Code Templates are the saved python code snippets that you can run on the data tables after scraping completes. Assume if you get white space in scraped data then you can easily trim this left and right spaces by just executing “strip_column” template, see the code of that template below.

'''Strip all data of a column in data table
Remove the blank of data in the head and the tail.
'''

tabName = '[%table1|data table%]'
colName = '[%table1.column1|table column for strip%]'

tab = tables[tabName]
for i, row in enumerate(tab):
    row[colName] = row[colName].strip()   
    tab.edit_row(i, row)

This template comes with Fminer and few other template like “merge_tables_with_same_columns”.  Below are the steps how you can execute template python code on scraped data.

Step 1: Click on second icon from right that says “Run Code” under the Data section

Step 2: One popup will appear, you need to click on “Templates” icon and choose the template you want to execute and then click on Ok.

Step 3: Now the window will appear for configuration that will ask you to choose the table and column under that table on which you want to execute the code. Now click on Ok again.

Step 4: Now you can see the code of that template, now you can click on execute icon and script will start running, based on number of records it will take time to finish execution.

In many web scraping projects I found this template code very handy for cleaning data and making life easy. Templates are stored at following path so you can create your own template with customized code.

C:\Program Files (x86)\FMiner\templates

I have created one template which I use to remove HTML code that comes while scraping badly organized HTML pages. Below is the code of template for stripping html:

'''Strip HTML will remove all html tags of a column in data table.
'''
import re
tabName = '[%table1|data table%]'
colName = '[%table1.column1|table column for substring%]'
colNew = '[%table1.column1|table column to add new data%]'
tab = tables[tabName]
for i, row in enumerate(tab):
    cleanr =re.compile('<.*?>')
    cleantext = re.sub(cleanr,'', row[colName])
    row[colNew] = cleantext 
    tab.edit_row(i, row)

Stay connected as I am going to post more code templates that will make your web scraping life easy and manipulate data on fly.

Source: http://webdata-scraping.com/run-code-template-new-feature-added-fminer-web-scraping-tool/

Wednesday, 7 September 2016

How to Use Microsoft Excel as a Web Scraping Tool

How to Use Microsoft Excel as a Web Scraping Tool

Microsoft Excel is undoubtedly one of the most powerful tools to manage information in a structured form. The immense popularity of Excel is not without reasons. It is like the Swiss army knife of data with its great features and capabilities. Here is how Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will be using Excel web queries to make this happen.

Web queries is a feature of Excel which is basically used to fetch data on a web page into the Excel worksheet easily. It can automatically find tables on the webpage and would let you pick the particular table you need data from. Web queries can also be handy in situations where an ODBC connection is impossible to maintain apart from just extracting data from web pages. Let’s see how web queries work and how you can scrape HTML tables off the web using them.
Getting started

We’ll start with a simple Web query to scrape data from the Yahoo! Finance page. This page is particularly easier to scrape and hence is a good fit for learning the method. The page is also pretty straightforward and doesn’t have important information in the form of links or images. Here is the URL we will be using for the tutorial:

http://finance.yahoo.com/q/hp?s=GOOG

To create a new Web query:

1. Select the cell in which you want the data to appear.
2. Click on Data-> From Web
3. The New Web query box will pop up as shown below.

4. Enter the web page URL you need to extract data from in the Address bar and hit the Go button.
5. Click on the yellow-black buttons next to the table you need to extract data from.

6. After selecting the required tables, click on the Import button and you’re done. Excel will now start downloading the content of the selected tables into your worksheet.

Once you have the data scraped into your Excel worksheet, you can do a host of things like creating charts, sorting, formatting etc. to better understand or present the data in a simpler way.
Customizing the query

Once you have created a web query, you have the option to customize it according to your requirements. To do this, access Web query properties by right clicking on a cell with the extracted data. The page you were querying appears again, click on the Options button to the right of the address bar. A new pop up box will be displayed where you can customize how the web query interacts with the target page. The options here lets you change some of the basic things related to web pages like the formatting and redirections.

Apart from this, you can also alter the data range options by right clicking on a random cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the required changes. You might want to rename the data range to something you can easily recognize like ‘Stock Prices’.

Auto refresh

Auto-refresh is a feature of web queries worth mentioning, and one which makes our Excel web scraper truly powerful. You can make the extracted data to be auto-refreshing so that your Excel worksheet will update the data whenever the source website changes. You can set how often you need the data to be updated from the source web page in data range options menu. The auto refresh feature can be enabled by ticking the box beside ‘Refresh every’ and setting your preferred time interval for updating the data.
Web scraping at scale

Although extracting data using Excel can be a great way to scrape html tables from the web, it is nowhere close to a real web scraping solution. This can prove to be useful if you are collecting data for your college research paper or you are a hobbyist looking for a cheap way to get your hands on some data. If data for business is your need, you will definitely have to depend on a web scraping provider with expertise in dealing with web scraping at scale. Outsourcing the complicated process that web scraping will also give you more room to deal with other things that need extra attention such as marketing your business.

Source: https://www.promptcloud.com/blog/how-to-use-excel-to-scrape-websites

Monday, 29 August 2016

Why is a Web scraping service better than Scraping tools

Why is a Web scraping service better than Scraping tools

Web scraping has been making ripples across various industries in the last few years. Newer businesses can employ web scraping to gain quick market insights and equip themselves to take on their competitors. This works like clockwork if you know how to do the analysis right. Before we jump into that, there is the technical aspect of web scraping. Should your company use a scraping tool to get the required data from the web? Although this sounds like an easy solution, there is more to it than what meets the eye. We explain why it’s better to go with a dedicated web scraping service to cover your data acquisition needs rather than going by the scraping tool route.

Cost is lowered

Although this might come as a surprise, the cost of getting data from employing a data scraping tool along with an IT personnel who can get it done would exceed the cost of a good subscription based web scraping service. Not every company has the necessary resources needed to run web scraping in-house. By depending on a Data service provider, you will save the cost of software, resources and labour required to run web crawling in the firm. Besides, you will also end up having more time and less worries. More of your time and effort can therefore go into the analysis part which is crucial to you as a business owner.

Accessibility is high with a service

Multifaceted websites make it difficult for the scraping tools to extract data. A good web scraping service on the other hand can easily deal with bottlenecks in the scraping process when it may arise. Websites to be scraped often undergo changes in their structure which calls for modification of the crawler accordingly. Unlike a scraping tool, a dedicated service will be able to extract data from complex sites that use Ajax, Javascript and the like. By going with a subscription based service, you are doing yourself the favour of not being involved in this constant headache.

Accuracy in results

A DIY scraping tool might be able to get you data, but the accuracy and relevance of the acquired data will vary. You might be able to get it right with a particular website, but that might not be the case with another. This gives uncertainty to the results of your data acquisition and could even be disastrous for your business. On the other hand, a good scraping service will give you highly refined data which is in a ready to consume form.

Outcomes are instant with a service

Considering the high resource requirements of the web scraping process, your scraping tool is likely to be much slower than a reputed service that has got the right infrastructure and resources to scrape data from the web efficiently. It might not be feasible for your firm to acquire and manage the same setup since that could affect the focus of your business.

Tidying up of Data is an exhausting process

Web scrapers collect data into a dump file which would be huge in size. You will have to do a lot of tidying up in this to get data in a usable format. With the scraping tools route, you would be looking for more tools to clean up the data collected. This is a waste of time and effort that you could use in much better aspects of your business. Whereas with a web scraping service, you won’t have to worry about cleaning up of the data as it comes with the service. You get the data in a plug and use format which gives you more time to do better things.

Many sites have policies for data scraping

Sometimes, websites that you want to scrape data from might have policies discouraging the act. You wouldn’t want to act against their policies being ignorant of their existence and get into legal trouble. With a web scraping service, you don’t have to worry about these. A well-established data scraping provider will definitely follow the rules and policies set by the website. This would mean you can be relieved of such worries and go ahead with finding trends and ideas from the data that they provide.

More time to analyse the data

This is so far the best advantage of going with a scraping service rather than a tool. Since all the things related to data acquisition is dealt by the scraping service provider, you would have more time for analysing and deriving useful business decisions from this data. Being the business owner, analysing the data with care should be your highest priority. Since using a scraping tool to acquire data will cost you more time and effort, the analysis part is definitely going to suffer which defies your whole purpose.

Bottom line

It is up to you to choose between a web scraping tool and a dedicated scraping service. Being the business owner, it i s much better for you to stay away from the technical aspects of web scraping and focus on deriving a better business strategy from the data. When you have made up your mind to go with a data scraping service, it is important to choose the right web scraping service for maximum benefits.

Source: https://www.promptcloud.com/blog/web-scraping-services-better-than-scraping-tools