Friday, 22 August 2014

Scraping dynamic data

I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are viewable, and I have to click "View more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of clicking this 4 times before scraping? I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/
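To illustrate the Selenium route, here is a minimal sketch (untested against ask.fm; the CSS selector comes from the markup quoted above, while the profile URL and function names are placeholders):

```python
import time

def click_view_more(driver, times, pause=2.0, sleep=time.sleep):
    """Click the 'View more' button `times` times, pausing between clicks.

    `driver` is anything with find_element(by, value) returning an
    element that supports .click() -- e.g. a Selenium WebDriver.
    """
    for _ in range(times):
        button = driver.find_element(
            "css selector", "input.submit-button-more-active")
        button.click()
        sleep(pause)  # give the next batch of 15 posts time to load

def scrape_profile(url, times=4):
    """Open `url` with Selenium, expand the feed, return the page HTML.

    Requires a local Firefox and the selenium package; the URL passed
    in is whatever profile you are scraping.
    """
    from selenium import webdriver
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        click_view_more(driver, times)
        return driver.page_source  # roughly 60 posts are now in the DOM
    finally:
        driver.quit()
```

Four clicks on top of the initially visible posts should leave about 60 in the page source, which you can then parse with your HTML library of choice.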



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Monday, 18 August 2014

How to prevent data-scraping a valuable data web service?

I have a great idea for a Windows Store app. I'd like to make this app. However, it requires a large and valuable database, so I will need to create a service for it so that people cannot easily steal the data. My thinking is to host a mobile service on Azure (which I've never tried) and create a .NET Web API project to take requests and dish out JSON like candy to a Windows 8 MVVM client. What I don't want is someone sniffing my traffic back and forth between app and service, figuring out how to get/post data using my app and service, and then setting up their own app or website to display this data, using my bandwidth to make money.

How can I protect my app-to-db data access so it can't be reverse engineered on me?

Also is this the best setup for developing a high volume windows 8 app like this? Do you have a better suggestion?


EDIT: I know I can use SSL etc. to encrypt traffic to and from the service. What I am trying to prevent is someone using Firebug or Fiddler to figure out what parameters can be posted to get a particular record back, then creating their own site that simply uses my service as the endpoint, siphons my data, and leeches my bandwidth. I.e., just using Firebug I know I can use https://www.google.com/search?q=dallas to search the word "dallas" on Google. Even if I encrypt the page, they can see that much in their browser, so if someone does the same GET/POST in their own application they would get the same records back, thus using my stuff.

3 Answers

The most straightforward thing you can do is to set up authentication for your users using something like OAuth. This will allow you to ensure that no communication happens with your service anonymously.

Once you have authenticated your requests, you can place controls on them that won't impact a normal user. You could rate-limit or throttle requests, or use any number of tactics to make it very expensive, time-wise, to siphon off large portions of your data set.

For instance, you can start blocking requests when you notice a large number of users clustering on a single IP address. You could place sensible limits on each user (say, 10 API calls per minute with result sets limited to 50 records). You get the idea, I'm sure.
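A per-user sliding-window limit like the one described (e.g. 10 API calls per minute) can be sketched in a few lines. The class and method names here are illustrative, not from any particular framework:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `max_calls` requests per `window` seconds per user."""

    def __init__(self, max_calls=10, window=60.0):
        self.max_calls = max_calls
        self.window = window
        self.calls = defaultdict(deque)  # user id -> recent request times

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.calls[user_id]
        # discard timestamps that have fallen out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_calls:
            return False                 # throttle: over the limit
        q.append(now)
        return True
```

In a real service this check would sit in middleware, keyed on the authenticated user from the OAuth token rather than on anything the client can forge.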

I think we share the same concern. I'm developing a Windows 8 application that talks to a web service built on top of a Windows Azure Web Site. I don't want a bad guy firing fake requests at my service by intercepting the traffic through a tool like Fiddler.

I asked this question in a mailing list and got a tip. I've never tried it, but just for your information: if your application requires user login, then the user's password is a good seed for data/traffic protection. You can use the password to generate a key pair, sign the request, and send it to the server along with the public key. The server can then verify the signature using the public key.

Using HTTPS is another approach. But as you know, a bad guy can still see the actual data through Fiddler even with HTTPS.

Using a client certificate might be another solution, I think. But I didn't find relevant documentation on how to install and pick a certificate from the client's machine.

HTH

Just serve it over HTTPS; then they can't sniff it.

Source: http://stackoverflow.com/questions/14350298/how-to-prevent-data-scraping-a-valuable-data-web-service

Tuesday, 12 August 2014

Business Intelligence Data Mining

Data mining can be technically defined as the automated extraction of hidden information from large databases for predictive analysis. In other words, it is the retrieval of useful information from large masses of data, which is also presented in an analyzed form for specific decision-making.

Data mining requires the use of mathematical algorithms and statistical techniques integrated with software tools. The final product is an easy-to-use software package that can be used even by non-mathematicians to effectively analyze the data they have. Data mining is used in several applications, such as market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, fraud detection, website personalization, e-commerce, healthcare, customer relationship management, financial services and telecommunications.

Business intelligence data mining is used in market research, industry research, and for competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, healthcare, the oil and gas industry, scientific tests, genetics, telecommunications, financial services and utilities. BI uses various technologies like data mining, scorecarding, data warehouses, text mining, decision support systems, executive information systems, management information systems and geographic information systems for analyzing useful information for business decision making.

Business intelligence is a broader arena of decision-making that uses data mining as one of its tools. In fact, the use of data mining in BI makes the data more relevant in application. There are several kinds of data mining: text mining, web mining, social network data mining, relational database mining, pictorial data mining, audio data mining and video data mining, all of which are used in business intelligence applications.

Some data mining tools used in BI are: decision trees, information gain, probability, probability density functions, Gaussians, maximum likelihood estimation, Gaussian Bayes classification, cross-validation, neural networks, instance-based (case-based, memory-based, non-parametric) learning, regression algorithms, Bayesian networks, Gaussian mixture models, K-means and hierarchical clustering, Markov models and so on.
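To make one of those tools concrete, here is a minimal pure-Python sketch of K-means clustering on 2-D points (a toy version for illustration; BI packages wrap far more robust implementations):

```python
def kmeans(points, k, centroids, iters=100):
    """points: list of (x, y) tuples; centroids: k starting (x, y) seeds."""
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break  # assignments stable: converged
        centroids = new
    return centroids, clusters
```

The same assign-then-update loop underlies the clustering features in most BI suites, usually with smarter seeding and distance metrics.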

Source: http://ezinearticles.com/?Business-Intelligence-Data-Mining&id=196648

Saturday, 2 August 2014

Importance of Data Cleansing Services

Companies hold huge amounts of data that are essential to decision making and strategy. Unfortunately, the data are sometimes inaccurate or incomplete because of updates that arrive from time to time. Companies are therefore looking for ways to eradicate the information they do not need. Data cleansing is one process that can eliminate such unnecessary data: it identifies information that is fraudulent or inaccurate and deletes it or replaces it with accurate information. Unclean facts have no place in companies, because they cause inefficiencies and inaccuracies in decisions. After cleansing, there are no inconsistencies and the data sets agree with one another.

There are different techniques used in data cleansing: data transformation, parsing (detecting syntax errors), duplicate eradication, and statistical methods. These techniques ensure that the data are clean and good. There are also criteria for telling whether a data set is clean. These are the things that companies look for when getting data cleansing services.

Data should be accurate, with density, integrity, and consistency present. They should also be complete, to ensure there are no gaps in the data set. Density shows the ratio of omitted values to the total number of values in the data set; you can tell a data set is good if it has good density. Data should also be uniform, with irregularities eliminated from the set. Consistency should be present as well, eliminating syntactic errors. Cleaning the data should also establish the uniqueness of the set, which tells you how many duplicates were present before cleaning. Lastly, the data should have integrity, combining the criteria of soundness and completeness. If the above criteria are met, the data set is in its best state.

Getting a data cleansing service gives you access to a range of offerings. Removal of duplicate records is one of the most common features of data cleansing: identical records or data sets are tagged and identified, and the duplicates are eradicated. Data are also validated, and bogus entries are eliminated. The set is likewise checked for outdated data, because outdated entries are removed by data cleansing. Incomplete figures are also identified so they can be given attention: once identified, the facts are improved so that they are assembled in order and organized as a set.
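A few of those steps (duplicate eradication, validation of bogus entries, and removal of outdated records) can be sketched in a few lines of Python. The record layout here, dicts with `email` and `updated` fields, is an assumption for illustration:

```python
from datetime import date

def clean(records, cutoff):
    """Drop duplicates, bogus e-mail values, and records older than `cutoff`."""
    seen = set()
    cleaned = []
    for r in records:
        key = r["email"].strip().lower()  # normalise before comparing
        if key in seen:
            continue                      # duplicate eradication
        if "@" not in key:
            continue                      # bogus value: validation failed
        if r["updated"] < cutoff:
            continue                      # outdated record: remove it
        seen.add(key)
        cleaned.append(r)
    return cleaned
```

Real services layer on fuzzy matching and reference-data lookups, but the shape of the work is the same: normalise, validate, deduplicate, and date-check.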

Aside from the benefits that companies get from data cleansing services, there are also problems. Sometimes data are lost through the eradication of borderline information. As for the companies that offer the services, they should maintain good service quality, since data cleansing is expensive and time-consuming.

Source: http://ezinearticles.com/?Importance-of-Data-Cleansing-Services&id=5013611

Monday, 28 July 2014

Advantages of Medical Records Scanning

Medical records contain important and sensitive information regarding patients, such as birth and family records, illness, and other personal matters. Most hospitals have a certain section where doctors keep these files. They process many papers every day so they have to organize everything to avoid mistakes. There are instances, however, when calamities and accidents strike that lead to the loss of files. Papers are hard to recover once they get wet or burned. This is why many hospitals and medical facilities look for alternatives in keeping medical records.

Technology lets people make digital copies, which are useful for keeping different types of files. This adds to the flexibility of medical records, as you can save photos and videos, unlike traditional paper documents. Many hospitals use Electronic Medical Records (EMR) to convert old documents to a digital format. This reduces the clutter in record sections, as you only need a computer to keep different files. It also improves staff efficiency, as they do not have to spend significant time looking for a specific record.

Here are some of the advantages of medical records scanning:

Saves Space

Most hospitals need an entire room to keep all the records of their patients. Digital records, on the other hand, only require a corner of the room. Digital copies remove all the bulk in an area and create additional space. It is like shrinking an entire room to a hard drive by using medical records scanning. This makes the area look bigger and more spacious, and people can even use the new space for a more beneficial purpose.

Security

A disadvantage of keeping physical records is the risk of theft. The hospital is a busy place, and many people go in and out every day. You cannot monitor everyone, so it is hard to find the real culprit. This might cause an information leak, exposing patients' confidential information. The benefit of having digital copies is that the system monitors the people who access it: it records which documents you view, edit, and delete. This increases security and offers peace of mind to your patients. You do not have to worry about losing the files either, because they have a backup, making file recovery easy.

Lower Labor Costs

Many devices can perform tasks that once required many people. This reduces the need to employ and pay more people to do the same job. It also saves time, because you do not have to screen all the applicants. You can allot the money you save on labor costs to something else.

Fewer Mistakes

Digital systems perform tasks according to what you input, so there is little room for mistakes. This helps ensure that all files are processed correctly and saves you from problems due to paperwork errors. The hospital is a delicate environment, and one mistake can risk the life of a patient. Medical records scanning is a great way to improve the working process in hospitals and medical facilities. It reduces paperwork, increases security, and adds space. Try it now to see its benefits.

Source: http://ezinearticles.com/?Advantages-of-Medical-Records-Scanning&id=7466788

Thursday, 10 July 2014

Web Data Extraction Services and Data Collection From Website Pages

For any business, market research and surveys play a crucial role in strategic decision making. Web scraping and data extraction techniques help you find relevant information and data for your business or personal use. Most of the time, professionals manually copy and paste data from web pages or download whole websites, wasting time and effort.

Instead, consider using web scraping techniques that crawl through thousands of website pages to extract specific information and simultaneously save it into a database, CSV file, XML file, or any other custom format for future reference.

Examples of web data extraction process include:

• Spider a government portal, extracting names of citizens for a survey
• Crawl competitor websites for product pricing and feature data
• Use web scraping to download images from a stock photography site for website design
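As a small illustration of the extract-and-save flow, the sketch below pulls product names and prices out of an HTML page with Python's standard-library parser and writes them to CSV. The class names `product` and `price` are assumptions for the example, not any real site's markup:

```python
import csv
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collect the text of elements whose class is 'product' or 'price'."""

    def __init__(self):
        super().__init__()
        self.rows = []      # [name, price] pairs in document order
        self._field = None  # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        if cls in ("product", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field == "product":
            self.rows.append([data.strip(), None])
        elif self._field == "price" and self.rows:
            self.rows[-1][1] = data.strip()
        self._field = None

def extract_to_csv(html, out):
    """Parse `html` and write a name,price CSV to the file-like `out`."""
    parser = ProductParser()
    parser.feed(html)
    writer = csv.writer(out)
    writer.writerow(["name", "price"])
    writer.writerows(parser.rows)
```

Swap the parsing rules for whatever the target pages actually use, and point `out` at a real file to build up the spreadsheet-ready data the article describes.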

Automated Data Collection

Web scraping also allows you to monitor website data changes over stipulated period and collect these data on a scheduled basis automatically. Automated data collection helps you discover market trends, determine user behavior and predict how data will change in near future.

Examples of automated data collection include:

• Monitor price information for selected stocks on an hourly basis
• Collect mortgage rates from various financial firms on a daily basis
• Check weather reports on a constant basis, as and when required
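The scheduled-collection idea above can be sketched as a simple polling loop. The function names are illustrative, and in practice you would hand the schedule to cron or a scheduler library rather than a bare loop:

```python
import time

def collect(fetch, interval, rounds, sleep=time.sleep):
    """Call `fetch` every `interval` seconds; keep only changed snapshots.

    `fetch` is any zero-argument callable that returns the current data
    (a scraped price, a parsed page, etc.); `sleep` is injectable so the
    loop can be driven without real waiting.
    """
    history = []
    for _ in range(rounds):
        snapshot = fetch()
        if not history or snapshot != history[-1]:
            history.append(snapshot)  # record only when the data changed
        sleep(interval)
    return history
```

Recording only the changes is what turns raw polling into the trend data the article mentions: the history shows when and how the monitored values moved.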

Using web data extraction services, you can mine any data related to your business objective and download it into a spreadsheet so it can be analyzed and compared with ease.

In this way you get accurate, quicker results, saving hundreds of man-hours and a lot of money!

With web data extraction services you can easily fetch product pricing information, sales leads, mailing databases, competitor data, profile data and much more on a consistent basis.

Source: http://ezinearticles.com/?Web-Data-Extraction-Services-and-Data-Collection-Form-Website-Pages&id=4860417

Tuesday, 1 July 2014

Seven Tips To Successfully Offshore Marketing Operations

Several weeks ago, I met a senior marketing executive from one of the world's largest brands. She was discussing the possibility of offshoring their marketing operations.

As we got into the discussion, it was clear that there were three key drivers for their decision to offshore marketing operations:

    The brand had been asked to reduce spend by at least 20 percent. To put this in context, various industry reports state that 2009 marketing budgets were, on average, cut by over 20 percent compared to pre-recessionary levels. And the number of companies that cut marketing budgets was 25 percent higher than predicted in January 2009.

    As brands go global, maintaining brand consistency across geographies is becoming a huge issue for marketers. Consistency is important not just from a customer experience standpoint but also from the perspective of marketing efficiency. If you create standardized brand "templates," the local geographies can respond faster to market/sales needs.

    The marketing function is under pressure to deliver ROI much faster than before. Management is asking tougher questions of its marketing teams; the focus on metrics has never been sharper.

I realized that the company had gone after the usual cost reductions such as the elimination of travel, training, new hiring and new campaigns. However, they were looking to further reduce cost and increase efficiency. This triggered the idea of outsourcing/offshoring marketing operations.

If companies are interested in offshoring their marketing operations, what can they do to ensure that their plan is well thought out and effective? My counsel to this particular marketing executive was to keep seven mantras in mind:

    Secure a champion - Ensure that the company has an offshoring sponsor or champion who can evangelize the need for offshore delivery, address any issues that come up and resolve problems.

    Charge the CMO to drive adoption - Make sure that the Chief Marketing Officer (CMO) is fully supportive of the offshoring plan. The CMO's approval should be communicated to all brand managers to combat resistance from them. One trick that I have seen work is to have the CMO ask each brand manager during their monthly/quarterly/annual marketing reviews how they have leveraged the offshore unit to deliver marketing efficiencies. This will ensure that all the brand managers see offshoring as a CMO priority.

    Be clear about what can and what cannot be offshored - Draw up a list of functions that can be delivered from an offshore center. For example, offshoring event management can be costly and ineffective because it requires much client intimacy in terms of planning and last-minute exigencies such as booth set-up and brochure placement. While the design of the booth can be offshored, the logistics need to be managed onsite.

    Start with the low-hanging fruit to build credibility - The list of activities to be offshored must have the highest probability of delivering on metrics of efficiency, time and cost. To be credible, the offshore unit must first deliver the low-hanging fruit, then gradually scale up to more complex tasks. For example, start with parts of email marketing such as database creation and validation, design layout of marketing collateral, making brand-consistent PowerPoint presentations, and website/portal development and maintenance. Once these reach a certain level of stability, start to look at more complex aspects of marketing such as campaign design or content creation.

    Keep all delivery options open - Offshore centers can be set up in several forms. These include a fully owned captive center, outsourcing functions to a third-party service provider, or a hybrid model where some parts of the operations are outsourced to a third-party service provider and some are retained within the captive center.

    Set up a robust governance structure - This is probably one of the most important but least understood outsourcing issues. A documented governance framework that details every process and workflow will help the delivery teams by making the task “process-oriented”. It will also put in place strong review mechanisms through steering committees to address any issues that the center or its client users may face.

    Publicize the offshore center’s successes - The company must ensure that the offshore center's successes and any client accolades received are publicized amongst top management and the wider marketing team. Perception matters.

Marketers have outsourced creative, right-brained activities since as early as the seventeenth century. That was the genesis of the advertising industry. Since then, companies have evolved to a stage where marketers outsource a majority of their functions - be it direct marketing, advertising, events, media planning, and even analytics, which was hitherto closely held within the “ivory tower." Some have outsourced more than others. But today, an even broader adoption of outsourcing is underway - that of entire marketing operations. Marketers need to embrace this change and make the most of it to drive greater value for their business.

Source: http://blogs.wns.com/Resources/Blogs/BlogTopics/tabid/93/Article/99/seven-tips-to-successfully-offshore-marketing-operations.aspx