Thursday, October 25, 2012

Image thresholding and smoothing

Thresholding helps to visualize the useful portion of an image, whereas smoothing helps to remove noise from it. Images captured through a camera are generally in analog form and are therefore digitized by sampling. The digitized image is stored in a computer as an array of pixels. Each pixel carries information such as the color or intensity of that portion of the image, and the pixel size depends on the sampling resolution of the digitization process. In image thresholding, each pixel is set to either black or white depending on a user-defined threshold value. In smoothing, each pixel value is recomputed from the values of its neighboring pixels.


Image Scaling in Grayscale images

Image scaling is the process of increasing the pixel values by a user-defined scaling factor, which increases the brightness of the image. In grayscale image scaling, the user-defined value is added to each pixel value. Below is the image scaling result with scaling factors of 25 and 50 in regions of interest (ROI) one and two, respectively. The original image is shown on the left and the scaled image on the right.
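
The original code is C++ and uses custom Image and ROI classes (see the program link at the end of this post); purely as an illustration, a minimal Java sketch of the operation on a plain pixel array might look like this:

static void scaleRoi(int[][] pixels, int top, int left, int bottom, int right, int value) {
    // Add the user-defined value to every pixel inside the ROI, clamping to 0-255.
    for (int y = top; y < bottom; y++) {
        for (int x = left; x < right; x++) {
            pixels[y][x] = Math.min(255, Math.max(0, pixels[y][x] + value));
        }
    }
}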


Image thresholding in Grayscale images

Thresholding in grayscale images is done by setting each pixel value to 0 or 255 based on a threshold value defined by the user. For example, if the threshold value is 125 then all pixel values below 125 are set to 0 and all pixel values above 125 are set to 255. In this way, the image features become more easily identifiable. Below is the image thresholding result on a grayscale image with threshold values of 100 and 150 for ROIs one and two, respectively.
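
A sketch of the same idea in Java (assuming an 8-bit grayscale pixel array, with pixels at the threshold treated as white):

static void thresholdRoi(int[][] pixels, int top, int left, int bottom, int right, int t) {
    // Binarize the ROI: values below the threshold become black, the rest white.
    for (int y = top; y < bottom; y++) {
        for (int x = left; x < right; x++) {
            pixels[y][x] = pixels[y][x] < t ? 0 : 255;
        }
    }
}
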
Image thresholding in Color Images

In color thresholding, the distance between each RGB pixel of the image and a user-defined color value is compared with a user-defined threshold TC. If the distance falls below the threshold, those pixels are set to white and all other pixels are set to black. A sample result with RGB (25,25,25) and TC = 205 and TC = 100 for ROIs one and two, respectively, is shown below.
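
Assuming the distance used is the Euclidean distance in RGB space (the usual choice), a Java sketch of this step:

static void colorThreshold(int[][][] rgb, int r, int g, int b, double tc) {
    // Pixels within distance TC of the reference color (r, g, b) become white, the rest black.
    for (int y = 0; y < rgb.length; y++) {
        for (int x = 0; x < rgb[0].length; x++) {
            double dr = rgb[y][x][0] - r, dg = rgb[y][x][1] - g, db = rgb[y][x][2] - b;
            double dist = Math.sqrt(dr * dr + dg * dg + db * db);
            int v = dist < tc ? 255 : 0;
            rgb[y][x][0] = rgb[y][x][1] = rgb[y][x][2] = v;
        }
    }
}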


Adaptive Weighted thresholding in Grayscale images

In normal thresholding, the parameter provided by the user affects the entire image, so it can give undesirable results on images with wide variability in intensity. Adaptive weighted thresholding overcomes this by comparing each pixel against the mean of the pixels inside an odd-sized window centered on it, adjusted by a weight factor W. The results of such thresholding with window sizes 3 and 5 and threshold factors 15 and 20 on ROIs one and two, respectively, are shown below.
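
The exact rule in the original code is not spelled out here, but a common variant, sketched in Java, keeps a pixel white when it exceeds the local window mean minus W:

static int[][] adaptiveThreshold(int[][] px, int win, int w) {
    int h = px.length, width = px[0].length, half = win / 2;
    int[][] out = new int[h][width];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0, count = 0;
            // Mean of the odd-sized window centered on the current pixel, clipped at the borders.
            for (int dy = -half; dy <= half; dy++) {
                for (int dx = -half; dx <= half; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < h && xx >= 0 && xx < width) {
                        sum += px[yy][xx];
                        count++;
                    }
                }
            }
            out[y][x] = px[y][x] > (double) sum / count - w ? 255 : 0;
        }
    }
    return out;
}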

One dimensional Smoothing in Grayscale Images

In one-dimensional smoothing, the pixel values are averaged along one dimension of the two-dimensional pixel plane. The results of one-dimensional smoothing in the X and Y directions separately are shown below with window sizes 3 and 5 on ROIs one and two, respectively. The window size should be chosen reasonably so that blurring is kept to the minimum possible level.
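
For the X direction, a Java sketch of this moving average (the Y-direction case simply swaps the roles of x and y):

static int[][] smoothX(int[][] px, int win) {
    int h = px.length, w = px[0].length, half = win / 2;
    int[][] out = new int[h][w];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = 0, count = 0;
            // Average the pixel with its horizontal neighbours inside the window.
            for (int dx = -half; dx <= half; dx++) {
                int xx = x + dx;
                if (xx >= 0 && xx < w) {
                    sum += px[y][xx];
                    count++;
                }
            }
            out[y][x] = sum / count;
        }
    }
    return out;
}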

Two-dimensional Smoothing in Grayscale images

In two-dimensional smoothing, the mean of the whole window in the two-dimensional plane is used as the pixel value. The result of two-dimensional smoothing with window sizes 3 and 5 on ROIs one and two is shown below.
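
The two-dimensional case averages over the full window instead of a single row; a sketch, again assuming a plain pixel array:

static int[][] smooth2D(int[][] px, int win) {
    int h = px.length, w = px[0].length, half = win / 2;
    int[][] out = new int[h][w];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = 0, count = 0;
            for (int dy = -half; dy <= half; dy++) {
                for (int dx = -half; dx <= half; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                        sum += px[yy][xx];
                        count++;
                    }
                }
            }
            out[y][x] = sum / count;   // mean of the whole window
        }
    }
    return out;
}
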
Program

Some of the functions I used to do the above operations in C++ are in Git here. Note that the custom classes Image and ROI are not included; however, the code is enough to explain the main logic.




Tuesday, October 23, 2012

Comparing Wi-Fi performance using experiments

N.B.: This post is a revision of something I wrote for school, so it does not read like a typical blog article.

File downloading via HTTP is one of the most common tasks we do on the internet. Delays in file download can be caused by factors such as propagation, processing, file size, and traffic on the network. Delays caused by propagation are unavoidable, so our interest is in keeping the other delay factors as low as possible. In this paper, we design and conduct experiments from two public Wi-Fi spots to identify the effects of key factors perceived to affect file download time over the internet. In doing so, we explore whether ping can be used to predict file download time and what impact location has on it. From the results and analysis, we draw conclusions on the comparative performance of the two Wi-Fi spots using statistical models.

INTRODUCTION

File downloading is a common task we do on the internet. The factors that affect file download delay include the traffic on the network, the file size, the time of day, and other related things. Nowadays wireless networks are made available free of cost at public places by many private businesses in order to improve customer satisfaction. Such Wi-Fi hotspots are often jammed with users, which can cause delays in file downloading.

In this paper, we perform experiments on two Wi-Fi spots with a view to understanding their comparative performance with respect to file downloading. HTTP is a common internet protocol used for file retrieval. We compare HTTP file download performance at two separate Wi-Fi hotspots at University Mall, located near Tampa. A ping request is generally used by network administrators to check the status of remote servers. By default, a ping request sends a 32-byte ICMP packet to the server, and the server in turn echoes back the same packet. Using a ping request, we can thus estimate the total round-trip time of a packet. In this paper, we also run experiments to understand whether ping can be used to predict the file download time.

The related work section gives a brief overview of the research ongoing regarding the performance of file downloads. It is followed by the experimental setup and results. Finally, the results are analyzed and a conclusion is made.


RELATED WORK

File downloading performance has been a major field of research among network scientists in the past decade. Researchers have suggested that TCP slow start causes delays for small file downloads. Some researchers have suggested that DNS causes delays in file download. The delay in downloading is caused by numerous factors that are hard to estimate theoretically. However, the time of day and the file size are known to play a major role in deciding the file download time. Prefetching the file at the server has been suggested by some researchers to improve performance. It has also been suggested that performance can be improved when the server rejects requests that exceed a threshold.

EXPERIMENTS

The experiment was conducted at two of the free Wi-Fi spots available at the University Mall.

Setup

An HTTP server was set up at the ENB building on the USF campus. The C-based HTTP web server called Weblite, written by Dr. Ken Christensen at USF, was used for the experiment. A UNIX-based client was set up to request the file from the server through HTTP GET requests. Two files, of size 16 KB and 484 KB, were used in the experiment; we class these as small and large based on pre-experiment speed tests at the location. Each experiment consisted of 10 HTTP GET requests to the server. Once a request finished, the client would wait 100 milliseconds before sending the next request. In this way, ten experiments were done, with an interval of five seconds between experiments. The experiments were done at two different hotspots at University Mall, namely the Food Court and the Center Court, during peak traffic hours at noon and during lighter traffic hours at night. Each experiment yields a sample mean file download time over its ten requests. Ping times were also recorded with the command-line tool during the experiments, and the average of a random sample of them was used.

Results

The results of the experiments are tabulated in Table I. It shows the mean file download time for each of the eight scenarios, along with the mean ping time for each experiment. The mean for each experiment is based on the sampled download times during that experiment. Separate columns are shown based on the time of day and the file size.

TABLE I. EXPERIMENT RESULTS


















Comparison

We have the sample mean file download times for the ten experiments done in each of the eight scenarios. Now we would like to compare the results using a confidence interval estimate. First, we determine the difference between the sample means. Then, because the samples are small, we use t-scores rather than the normal distribution to construct the interval. The result is shown in Table II.
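
For reference, one standard form of such an interval for the difference of two independent sample means is (m1 − m2) ± t · sqrt(s1^2/n1 + s2^2/n2), where m1 and m2 are the sample means, s1 and s2 the sample standard deviations, n1 = n2 = 10, and t is the t-score for the chosen confidence level and degrees of freedom. If the resulting interval does not contain zero, the difference between the two scenarios is significant at that confidence level.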

TABLE II. COMPARATIVE RESULTS















We also checked whether ping can predict the download time. As noted, ping gives the total round-trip time for 32 bytes of data. Using the mean ping times for the random samples in Table I, we can calculate an estimated download time by multiplying the per-byte ping time by the file size. The result is shown in Table III.
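
As an illustration with made-up numbers (not taken from Table I): a mean ping time of 2 ms for the 32-byte probe gives a per-byte time of 2/32 = 0.0625 ms, so the estimate for a 16 KB (16,384-byte) file would be 16,384 × 0.0625 ms, or roughly 1.0 s.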


TABLE III. PING TIME


As we can see, ping can mostly predict the file download time for smaller files; however, for larger files it is not able to do so.

ANALYSIS

The reason for the variation in file download time across the two locations is not obvious from the results alone. Traffic on the network is likely to be one issue: in the Food Court there were more people in the afternoon than at the Center Court, while at night there were more people at the Center Court than at the Food Court (since the Food Court was closed at night). The impact of file size is still not very clear; however, as the averages show, the large file download was, relative to its size, slightly faster than the small file download, which could be an effect of TCP slow start. Similarly, the reason ping is not able to predict large file download times is the packet loss observed during the ping experiment. This suggests that the link is not reliable over long periods, with random delays occurring due to unknown factors.

CONCLUSION

The experiments were conducted to compare two public Wi-Fi spots. The results showed that one of the Wi-Fi spots was better than the other. Similarly, we showed that ping can predict the download time if the random delays seen during large file downloads can be avoided.


Sunday, June 24, 2012

SQL injection prevention

This was a research project for software security coursework which I am sharing here on my blog.

SQL injection vulnerability was reported as a top cybersecurity issue in 2010 and is considered to account for 10% of total cybersecurity vulnerabilities since 2002. It is a known fact that using parameterized queries can mitigate SQL injection vulnerabilities to a large extent. This paper shows the practical issues programmers encounter when using parameterized queries. It also shows how a plugin tool inside the developer's integrated development environment (IDE) can help programmers generate parameterized queries. Further, the usefulness and practicality of this helper tool are discussed.

Introduction

SQL injection vulnerability (SQLIV) accounts for over 20% of input validation issues and 10% of the overall cybersecurity issues between 2002 and 2007. Further, it was reported as the second top cybersecurity vulnerability of the year 2010.  SQL injection vulnerability can lead to theft of confidential data, loss of money, loss of critical data, and affect the reputation of organizations.

Structured Query Language (SQL) is widely used as a standard method to interact with relational databases from a webpage. An SQL statement can ask the database managing program, typically a database server, to return a particular result set from a database table or to update the table. Typically SQL statements are not compiled until the request is made by the web page under execution, usually because the query depends on user requests and may contain user-entered input. An SQL injection happens when a malicious user crafts the input in such a way that the developer-defined SQL query is modified undesirably; this is done by maliciously inserting code into the data part of the SQL query.

Several SQLIV mitigation mechanisms exist, of which the most effective is the use of parameterized queries, also known as prepared statements. Parameterized queries force the developer to define the code portion of the SQL query first and then clearly define the data portion of the query. In this way, the SQL compiler can safely distinguish between the code and the data portions of a query. It is interesting to note that parameterized queries were introduced long ago; the continued frequency of SQL injection attacks suggests that they were not used by web developers, at least in some of the cases where attacks happened.

In this paper, we study the difficulties encountered in using parameterized queries. We developed a plugin tool named SQLSecuritytool for Microsoft's Visual Studio 2010 web development IDE that can aid web developers in ensuring the safety of the SQL queries they use in server-side scripting. The findings and the practicality and usefulness of this security tool are described in this paper. Section II gives a brief background needed to understand the paper. Section III goes through related work in mitigating SQL injection vulnerability. Section IV analyzes problems faced by web developers in using parameterized queries. Section V describes the tool we developed for replacing general SQL queries with parameterized queries. Finally, the security benefits and conclusion are discussed.

Background

This section gives some background about SQL Structure, SQL injection attacks, and parameterized queries.

SQL statement

An SQL statement defines the operations performed or requested to the database and is usually defined by a database analyst in consultation with the web developer. An example SQL query used in login pages is described below:

SELECT password FROM usernames_table WHERE username = 'name'

The code portion of the above statement is shown below.

SELECT password FROM usernames_table WHERE username =

The data portion of the SQL query is ‘name’. Typically the data portion is known only at runtime. Therefore web developers substitute that field with a variable name. For example, in vb.net the above query would be defined by the web developer as shown in figure 1.




Figure 1.   Typical SQL query in a vb.net webpage server-side script

Password hashes obtained from the database are used to validate the username in this case. Typically the database is connected to using the driver provided; in this case, the Oracle client driver for asp.net is used, which requires adding a dynamic link library (DLL) provided by Oracle to the project.

SQL Injection Attack

An SQL injection attack happens when the developer fails to define the program in such a way that there is always the desired separation between the code and the data portions for any user input at runtime. Or, using standard definitions, an SQLIV is caused by dynamic SQL statement code combined with inadequately verified input, which allows the input to change the structure and logic of an SQL statement. For example, a user input of "111' OR true#" would create the following query, which would retrieve all passwords.


SELECT password FROM usernames_table WHERE username = '111' OR true#'

If the web application is expected to display the results in a table then all the password hashes will be displayed to the attacker. In addition to this example, several other SQL injection methods exist. Fortunately, all of them work by mixing the user input with actual SQL code.

Parameterized query

Parameterized queries are SQL statements whose structure is fixed and unaltered at runtime. They define the user input as parameters expected at runtime, and therefore make a clear distinction between the code and the data portions of the SQL query. The query described in figure 1 can be changed to a parameterized query as figure 2 shows.



Figure 2.   Typical parameterized SQL query in a vb.net server-side script

There are multiple ways to define parameterized queries, depending on the language in which the web developer programs. When a large SQL query is used, a number of parameters may need to be defined carefully.
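
The figures above use vb.net; purely as a language-neutral illustration of the same idea, a Java/JDBC version of the login query might look like this, with the user input bound to a placeholder instead of concatenated into the SQL text:

import java.sql.*;

public class LoginQuery {
    // Sketch only: the SQL text is fixed, and the data portion is supplied
    // separately through the "?" parameter, so it can never be parsed as code.
    static String fetchPasswordHash(Connection conn, String username) throws SQLException {
        String sql = "SELECT password FROM usernames_table WHERE username = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, username);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("password") : null;
            }
        }
    }
}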

Web development issues

It is generally known among web developers that using parameterized queries gives maximum security against SQL injection vulnerability, and it is suggested and encouraged as a standard pattern for web developers to follow. However, if the existing code is not written using parameterized queries, it is a tedious effort for the developer to walk through and carefully scrutinize the code to rewrite it with parameterized queries. Existing frameworks in .NET do not provide any help for migrating to parameterized queries. Further, in many organizations the task of writing SQL statements is assigned to a DB analyst, while in small organizations a developer may take on that role. This, in addition to time constraints, may push such developers to write their code in the easiest way, without any security-mitigating logic such as parameterized queries. This, together with my knowledge of the .NET framework, gave me the motivation to write an automated tool to assist developers.

Replacement tool

The replacement tool I developed works as a plugin to the Microsoft Visual Studio IDE. When the plugin is enabled and run, the developer is given options to find vulnerable SQL queries in the current project. It also provides the parameterized form of each vulnerable query so the developer can replace the vulnerable SQL code. The plugin works by using the Microsoft DTE interface, which gives an object-oriented interface for developing plugin extensions to Visual Studio. The source code is analyzed using regular expressions. These two design components are discussed below.

DTE interface

Microsoft has exposed some of the inbuilt classes of the Visual Studio IDE to the developer through the DTE interface. The DTE object model allows the programmer to access the project, code, and other components of Visual Studio from a user-defined plugin, so static analysis of the code is possible through this interface. There are two ways to do the static analysis. One is to analyze the code components of the project using the DTE code model, which uses the inbuilt lexer and parser. The other is to use regular expressions to find the desired patterns of code. Each has its own pros and cons: the former can do static analysis accurately but needs in-depth knowledge of the DTE object model, while the latter is less accurate, since it assumes the code follows particular patterns, but is faster and easier to write. For the sake of this short academic project, and due to time constraints, I opted to develop the tool using regular expressions.

Regular Expressions

Regular expressions provide a fast and easy way to perform static analysis of program code. They require a basic understanding of how to build a pattern for the code we want to match. Further, many languages have an inbuilt regular expression engine that allows text to be extracted programmatically. We used the vb.net regular expression library provided by Microsoft. For example, the following expression can find the SQL statements occurring on a vb.net server-side page.



SQL\s=\s.+?\n.*?(objcommand.\.CommandText\s=\sSQL|Me\.SqlDataSource.\.SelectCommand\s=\sSQL|ObjDataReader\s=\sobjcommand.ExecuteReader|objcommand\.CommandText\s=\sSQL)

Basically, it looks for text that matches the word SQL or objcommand followed by the SQL query and execution commands. This pattern was able to detect many of the vulnerable pieces of code in the project. The reason it works well is that it exploits the common syntax used in vb.net programs to define SQL code. It will not always give a perfect match, because the program can take a different structure depending on the developer's coding style; however, with carefully refined patterns one could expect an average matching rate of around 70%.
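
As a simplified illustration of the idea (not the exact pattern above, and using a hypothetical fragment of vb.net source as input), scanning text for an SQL string that is later assigned to a command object can be done like this in Java:

import java.util.regex.*;

public class SqlUsageScanner {
    // Very rough stand-in for the real pattern: a string assigned to a variable
    // named SQL and later passed to CommandText is flagged as a candidate.
    private static final Pattern VULNERABLE =
        Pattern.compile("SQL\\s*=\\s*\".*\"[\\s\\S]*?CommandText\\s*=\\s*SQL");

    public static void main(String[] args) {
        String source =
            "Dim SQL = \"SELECT password FROM usernames_table WHERE username = '\" & name & \"'\"\n"
          + "objcommand.CommandText = SQL";
        Matcher m = VULNERABLE.matcher(source);
        while (m.find()) {
            System.out.println("Possible unparameterized query:\n" + m.group());
        }
    }
}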

Developer Interface

The developer user interface makes use of the Windows Forms framework provided by Microsoft. It has options to find the next occurrence of an unsecured SQL query in the project. If the code file containing such an unsecured SQL query is not yet open, it is opened automatically, and the unsecured SQL query found is selected and highlighted in the code window. Manual editing of the generated query can then be done by modifying the suggested parameterized SQL query. A sample user interface screenshot is shown in figure 3.




Figure 3. User Interface



Assumptions and limitations

This conversion tool assumes that all comment lines in the program are removed by the programmer before running the tool, because the regular expressions could mistake comments for source code. Further, there are different ways to define SQL code inside a program, but I assumed only the widely used syntax for defining SQL code in the source. These issues can be avoided by using the DTE code model with the inbuilt lexer and parser, or by defining more carefully scrutinized regular expression patterns; however, I believe that is out of scope for this short project.

Effectiveness

Since parameterized queries can guarantee protection against SQLIV, the developed replacement tool is expected to be effective against most SQL injection vulnerabilities. We tested the tool against over ten different projects obtained from the internet and from organizations. It showed a conversion rate of over 80% for those projects that used syntax similar to that defined in the regular expression pattern; some of those queries had over 40 parameters. However, this was not true in all cases: it obviously failed to detect vulnerabilities in projects that did not follow the defined matching pattern. By defining more and more patterns one may be able to achieve a detection and conversion rate of over 80% in general. Regular expressions cannot, however, be used for programs that employ complex logic inside the SQL query definition. For example, if a string builder is used to store the SQL query, with the query being added part by part, the matching pattern would be too difficult to write. In such situations, analysis using the lexer and parser for the .NET framework would provide accurate results.

SQL structure        Conversion
Static SQL           Yes
Dynamic SQL          Yes
Conditional SQL      Yes
Batch queries        No
Nested SQL           No

Table 1. The effectiveness of the developed conversion tool.

Nevertheless, this project showed the power of regular expressions in doing static analysis to prevent SQL injection vulnerability. If an organization needs to quickly convert all its code to parameterized queries a tool like this with regular expression patterns defined by comparing the organization's standard legacy code structure would be very useful.

Conclusion

In this paper, I described the SQL injection vulnerability and a practical way to mitigate it by exploiting the security benefits of parameterized queries. The paper described a tool that can assist the developer in converting SQL statements to parameterized queries with minimal manual intervention. Although this study focused specifically on the .NET framework, the approach is applicable to virtually all environments. Further, the study showed that such a tool, carefully designed, would be very helpful to the developer and can mitigate SQL injection vulnerability to a large extent.



A* search

I took a class called Introduction to AI, and the professor gave an assignment to write an A* heuristic program to find the shortest path between cities. The heuristic was based on the Euclidean distance or, alternatively, the total number of links. The program had to be shown on a graphical interface. It took me some time to write, working through all sorts of implementation issues. My solution is probably not optimal, but it runs in reasonable time. Below is a video describing one of the sample runs, followed by the input files and code links, and a rough sketch of the approach.






Sample locations input file

Sample Connections input  file

Main function

Node.java

Reader.java
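
For reference, a rough Java sketch of the A* idea used here (not the linked Node.java/Reader.java, and with hypothetical field names): cities are points, edges are the given connections, and the heuristic is the straight-line distance to the goal.

import java.util.*;

public class AStarSketch {
    static double[] xs, ys;              // city coordinates read from the locations file
    static List<List<Integer>> adj;      // adjacency lists read from the connections file

    static double dist(int a, int b) {
        return Math.hypot(xs[a] - xs[b], ys[a] - ys[b]);
    }

    static List<Integer> shortestPath(int start, int goal) {
        int n = xs.length;
        double[] g = new double[n];                    // best known cost from the start city
        Arrays.fill(g, Double.POSITIVE_INFINITY);
        int[] parent = new int[n];
        Arrays.fill(parent, -1);
        g[start] = 0;
        // Open set ordered by f = g (cost so far) + h (Euclidean estimate to the goal).
        PriorityQueue<double[]> open =
            new PriorityQueue<>(Comparator.<double[]>comparingDouble(a -> a[0]));
        open.add(new double[] { dist(start, goal), start });
        boolean[] done = new boolean[n];
        while (!open.isEmpty()) {
            int u = (int) open.poll()[1];
            if (done[u]) continue;
            if (u == goal) break;
            done[u] = true;
            for (int v : adj.get(u)) {
                double cand = g[u] + dist(u, v);       // edge cost = Euclidean distance
                if (cand < g[v]) {
                    g[v] = cand;
                    parent[v] = u;
                    open.add(new double[] { cand + dist(v, goal), v });
                }
            }
        }
        List<Integer> path = new ArrayList<>();        // rebuild the path from the goal
        for (int v = goal; v != -1; v = parent[v]) path.add(0, v);
        return path;
    }
}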

Probabilistic Road-map method (PRM)

The probabilistic road-map approach is used in robotics to find a path from the present configuration to the robot's desired configuration. This is done by finding a path in the configuration space.


As shown above, the path from the source to the destination is plotted on the configuration space. I used the classical Dijkstra method to find the path. The sample input file and the main program are below.

Input

1 8
-7 20.5
-11 30.5
-18 16
-21.5 28
-7 25.5
-11 35.5
-18 21
-7 33
1 2 3 4
5 6 7 8
0 0
30 55
300 150 
76 91
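
The roadmap construction itself can be sketched as follows (illustrative only, not the original main program; the collision checks isFree and localPathFree are assumed to be supplied by the caller): sample random collision-free configurations, connect nearby pairs whose straight-line local path is free, and then run Dijkstra (or A*) on the resulting graph.

import java.util.*;
import java.util.function.*;

public class PrmSketch {
    static double[][] samples;              // sampled collision-free configurations
    static List<List<Integer>> edges;       // roadmap adjacency lists

    static void buildRoadmap(int n, int dim, double radius,
                             Predicate<double[]> isFree,
                             BiPredicate<double[], double[]> localPathFree,
                             Random rnd) {
        samples = new double[n][];
        edges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            double[] q;
            do {                             // rejection-sample a free configuration in [0,1)^dim
                q = rnd.doubles(dim, 0, 1).toArray();
            } while (!isFree.test(q));
            samples[i] = q;
            edges.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (dist(samples[i], samples[j]) < radius
                        && localPathFree.test(samples[i], samples[j])) {
                    edges.get(i).add(j);     // connect nearby samples with a free local path
                    edges.get(j).add(i);
                }
            }
        }
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }
}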




Configuration Space

I took a class called Algorithms for Robotics at school. The class was interesting, except that I messed up the finals. Anyway, the assignments included finding the configuration space of a 2D robotic arm, Simultaneous Localization and Mapping, and so on. I thought of writing something about it here.



The configuration space of a robot allows it to calculate the possible motions it can take, given the locations of the obstacles.  The above picture represents the two angles of a 2D arm robot. Each goes from 0 to 360 degrees. The red shade indicates obstacles and the white space indicates the free space. The real-world coordinates are translated by the algorithm into the above representation to make things much easier. I wrote the program in Java. It was an interesting solution since I did not use any open-source libraries.
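
A rough Java sketch of how such a configuration-space grid can be built (illustrative only; the collision test for a link against the obstacles is passed in, since the original used its own geometry code): sweep both joint angles over 0-360 degrees, compute the positions of the two links by forward kinematics, and mark the cell as an obstacle if either link collides.

public class CSpaceSketch {
    // Collision test for one arm link, given as a line segment in the workspace.
    interface SegmentTest {
        boolean hitsObstacle(double x1, double y1, double x2, double y2);
    }

    static boolean[][] buildCSpace(double len1, double len2, int steps, SegmentTest obstacles) {
        boolean[][] blocked = new boolean[steps][steps];
        for (int i = 0; i < steps; i++) {
            for (int j = 0; j < steps; j++) {
                double t1 = Math.toRadians(i * 360.0 / steps);   // first joint angle
                double t2 = Math.toRadians(j * 360.0 / steps);   // second joint angle
                double x1 = len1 * Math.cos(t1), y1 = len1 * Math.sin(t1);
                double x2 = x1 + len2 * Math.cos(t1 + t2), y2 = y1 + len2 * Math.sin(t1 + t2);
                blocked[i][j] = obstacles.hitsObstacle(0, 0, x1, y1)
                             || obstacles.hitsObstacle(x1, y1, x2, y2);
            }
        }
        return blocked;
    }
}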

Sample input


3 11
0 0
0 1
0 2
0 3
1 0
1 1
1 2
1 3
3 2
3 3
4 3
1 2 6 5
3 4 8 7
9 10 11
3 0
2 2

Program link

Saturday, June 9, 2012

Theme park

This problem appeared on the Google Code Jam website. I took some time to write a solution.

Statement

Roller coasters are so much fun! It seems like everybody who visits the theme park wants to ride the roller coaster. Some people go alone; other people go in groups and don't want to board the roller coaster unless they can all go together. And everyone who rides the roller coaster wants to ride again. A ride costs 1 Euro per person; your job is to figure out how much money the roller coaster will make today.

The roller coaster can hold k people at once. People queue for it in groups. Groups board the roller coaster, one at a time, until there are no more groups left or there is no room for the next group; then the roller coaster goes, whether it's full or not. Once the ride is over, all of its passengers re-queue in the same order. The roller coaster will run R times in a day.

For example, suppose R=4, k=6, and there are four groups of people with sizes: 1, 4, 2, 1. The first time the roller coaster goes, the first two groups [1, 4] will ride, leaving an empty seat (the group of 2 won't fit, and the group of 1 can't go ahead of them). Then they'll go to the back of the queue, which now looks like 2, 1, 1, 4. The second time, the coaster will hold 4 people: [2, 1, 1]. Now the queue looks like 4, 2, 1, 1. The third time, it will hold 6 people: [4, 2]. Now the queue looks like [1, 1, 4, 2]. Finally, it will hold 6 people: [1, 1, 4]. The roller coaster has made a total of 21 Euros!

Input

The first line of the input gives the number of test cases, T. T test cases follow, with each test case consisting of two lines. The first line contains three space-separated integers: R, k, and N. The second line contains N space-separated integers gi, each of which is the size of a group that wants to ride. g0 is the size of the first group, g1 is the size of the second group, etc.

Output

For each test case, output one line containing "Case #x: y", where x is the case number (starting from 1) and y is the number of Euros made by the roller coaster.

Limits

1 ≤ T ≤ 50.
gi ≤ k.
Small dataset

1 ≤ R ≤ 1000.
1 ≤ k ≤ 100.
1 ≤ N ≤ 10.
1 ≤ gi ≤ 10.
Large dataset

1 ≤ R ≤ 10^8.
1 ≤ k ≤ 10^9.
1 ≤ N ≤ 1000.
1 ≤ gi ≤ 10^7.

Sample

Input
3
4 6 4
1 4 2 1
100 10 1
1
5 5 10
2 4 2 3 4 2 1 2 1 3
Output
Case #1: 21
Case #2: 100
Case #3: 20
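
My actual solution is linked below; as a sketch, one common way to handle even the large limits is to precompute, for each possible starting group, the earnings of a single run and the group that starts the next run, then walk the R runs with cycle detection so the simulation stays fast. A Java version of that idea:

import java.util.*;

public class ThemePark {
    static long solve(long R, long k, int[] g) {
        int n = g.length;
        long[] euros = new long[n];   // money earned by one run starting at group i
        int[] next = new int[n];      // group that starts the following run
        for (int start = 0; start < n; start++) {
            long sum = 0;
            int idx = start;
            for (int cnt = 0; cnt < n; cnt++) {
                if (sum + g[idx] > k) break;      // next group does not fit
                sum += g[idx];
                idx = (idx + 1) % n;
            }
            euros[start] = sum;
            next[start] = idx;
        }
        long total = 0;
        long[] totalAtFirstVisit = new long[n];
        long[] runAtFirstVisit = new long[n];
        Arrays.fill(runAtFirstVisit, -1);
        int cur = 0;
        for (long run = 0; run < R; run++) {
            if (runAtFirstVisit[cur] >= 0) {
                // Found a cycle: fast-forward whole cycles, then finish the remainder.
                long cycleLen = run - runAtFirstVisit[cur];
                long cycleEuros = total - totalAtFirstVisit[cur];
                long remaining = R - run;
                total += (remaining / cycleLen) * cycleEuros;
                for (long i = 0; i < remaining % cycleLen; i++) {
                    total += euros[cur];
                    cur = next[cur];
                }
                return total;
            }
            runAtFirstVisit[cur] = run;
            totalAtFirstVisit[cur] = total;
            total += euros[cur];
            cur = next[cur];
        }
        return total;
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int t = sc.nextInt();
        for (int tc = 1; tc <= t; tc++) {
            long R = sc.nextLong(), k = sc.nextLong();
            int n = sc.nextInt();
            int[] g = new int[n];
            for (int i = 0; i < n; i++) g[i] = sc.nextInt();
            System.out.println("Case #" + tc + ": " + solve(R, k, g));
        }
    }
}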

My Solution:

Snapper Chain

This problem appeared on the Google Code Jam website. I took some time to write a solution.

Statement

The Snapper is a clever little device that, on one side, plugs its input plug into an output socket, and, on the other side, exposes an output socket for plugging in a light or other device.
When a Snapper is in the ON state and is receiving power from its input plug, then the device connected to its output socket is receiving power as well. When you snap your fingers -- making a clicking sound -- any Snapper receiving power at the time of the snap toggles between the ON and OFF states.

In hopes of destroying the universe by means of a singularity, I have purchased N Snapper devices and chained them together by plugging the first one into a power socket, the second one into the first one, and so on. The light is plugged into the Nth Snapper.

Initially, all the Snappers are in the OFF state, so only the first one is receiving power from the socket, and the light is off. I snap my fingers once, which toggles the first Snapper into the ON state and gives power to the second one. I snap my fingers again, which toggles both Snappers and then promptly cuts power off from the second one, leaving it in the ON state, but with no power. I snap my fingers the third time, which toggles the first snapper again and gives power to the second one. Now both Snappers are in the ON state, and if my light is plugged into the second Snapper it will be on.

I keep doing this for hours. Will the light be on or off after I have snapped my fingers K times? The light is on if and only if it's receiving power from the Snapper it's plugged into.

Input

The first line of the input gives the number of test cases, T. T lines follow. Each one contains two integers, N and K.
Output

For each test case, output one line containing "Case #x: y", where x is the case number (starting from 1) and y is either "ON" or "OFF", indicating the state of the light bulb.

Limits

1 ≤ T ≤ 10,000.

Small dataset

1 ≤ N ≤ 10;
0 ≤ K ≤ 100;

Large dataset

1 ≤ N ≤ 30;
0 ≤ K ≤ 10^8;

Sample

Input
4
1 0
1 1
4 0
4 47
Output
Case #1: OFF
Case #2: ON
Case #3: OFF
Case #4: ON
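
My solution is linked below; as a sketch of the standard observation, the chain of N Snappers behaves like an N-bit binary counter driven by the snaps, so the light is ON exactly when the lowest N bits of K are all 1. In Java:

import java.util.*;

public class SnapperChain {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int t = sc.nextInt();
        for (int tc = 1; tc <= t; tc++) {
            int n = sc.nextInt();
            long k = sc.nextLong();
            long mask = (1L << n) - 1;             // lowest N bits
            boolean on = (k & mask) == mask;       // all N Snappers ON and powered
            System.out.println("Case #" + tc + ": " + (on ? "ON" : "OFF"));
        }
    }
}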

My Solution

Saturday, May 1, 2010

Price Finder

I just wanted to write something about my project work at college. I always thought of creating something interesting, and this idea came up randomly, though I can't recall how. It was to create a price finder website on the internet. Again, it's formal writing!

In this age of online shopping, customers often find the same product being sold at different prices on shopping websites. Customers usually want to buy the right product at minimum cost, so it would be useful for them to have a search engine that tells them the best website to buy from. We make use of web crawlers and screen scrapers to get the prices of products on shopping websites and then store them in a database with indexing. We aim to develop a heuristic search engine that accepts the name(s) of a product as input and returns the ideal website links as output.

Concept description


First I developed a web crawler module that downloads all the pages we need from the shopping websites. As a prototype, I downloaded 5 major book shopping websites in India (books.rediff.com, flipkart.com, eshop.talash.com/books, gobookshoping.com, and nbcindia.com).

In the next module, I needed to create a user interface using JSP in which the user would enter the name of the product(s) he would like to purchase. The request is processed and finally, the desired results are displayed to the customer using a searcher and a price finder. The result would display website descriptions of those selling the requested product name. The user can then go to the link directly to purchase. 

A simple diagram demonstrating the concept



Design of web crawler module



The web crawler I wrote works by following hyperlinks on the shopping website, excluding external hyperlinks which point to other domains. The links are extracted from the HTML content of each page by using regular expressions and Jericho HTML parser. The process is repeated until all pages in the preferred shopping website are downloaded. 

The downloaded webpage is in HTML format, so the next step is to extract text from it. For that, I used an HTML parser library called Jericho. The downloaded pages are stored in a database along with the address, text, and HTML content; I used the latest version of an open-source database called H2 for this purpose. Next, the text column in the database is indexed to make search operations faster. For this, I used the open-source Lucene text indexer, which I found to be efficient. Finally, the name of the downloaded shopping website is written to a text file for later use by the searcher.
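
A rough Java sketch of the crawling loop (illustrative only; the real module used the Jericho parser for link and text extraction and stored pages in H2 with a Lucene index, which is omitted here):

import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class CrawlerSketch {
    private static final Pattern HREF = Pattern.compile("href=[\"']([^\"'#]+)[\"']");

    public static void main(String[] args) throws Exception {
        URL start = new URL("http://books.rediff.com/");   // example start page
        int maxPages = 50;
        Set<String> seen = new HashSet<>();
        Deque<URL> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && seen.size() < maxPages) {
            URL page = queue.poll();
            if (!seen.add(page.toString())) continue;       // skip pages already visited
            String html;
            try {
                html = download(page);
            } catch (IOException e) {
                continue;                                   // skip pages that fail to load
            }
            // ... store the HTML and extracted text in the database here ...
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String href = m.group(1);
                if (href.startsWith("mailto:") || href.startsWith("javascript:")) continue;
                URL link = new URL(page, href);             // resolve relative links
                if (link.getHost().equals(start.getHost())) // exclude external domains
                    queue.add(link);
            }
        }
    }

    private static String download(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }
}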


I was able to create a web crawler user interface using Java. It has options to specify the maximum number of pages to download, with a progress bar indicating the status of the crawling process, as well as options to pause and resume the download. Multiple instances of this web crawler application can be started so that websites are downloaded simultaneously, making optimal use of the network bandwidth. It runs on any machine with a Java runtime environment installed.

Design of the user End module



The user end module is similar to the Google search page. It ran on an Apache Tomcat server. The user query is translated into a database search query by using a query parser.


Design of Query parser sub-module



If the quantity is specified in the query then it is saved and will be used to calculate the total price at a later stage. In the above example ‘wings of fire’ AND ‘Abdul Kalam' means look for pages that contain both the terms “wings of fire” and “Abdul Kalam”.


Design of Searcher sub-module



The job of the searcher module is to search for the parsed user query. Using multi-threading the translated user query is searched on each of the databases concurrently. 

Design of price finder sub-module


This module works inside the searcher module. Its job is to find the price of the product. Using the result set of the searcher, it fetches those pages that contain the user query. For each of these pages, a carefully selected logic is used to get the price of the specified product. Finally, the product name, the price of the product, and other details are cached for future queries.


The logic used to find the price


The assumption is that most shopping websites would display products in the below format.

           Name of the product…..some details……Price…reduction rates…Reduced price.

I made use of regular expressions to extract the price by looking for “product name” and then looking for “currency”.  

One example is shown below found at books.rediff.com

Everyday Eating For Babies And Children
    Author: Judith Wills
    Format:  Paperback 
    ISBN: 0007215274 
    Seller: Crossword 
    Share this book with a friend
Our Price Rs. 745
Get this book in 25 business days.
Special offer: Free shipping on this book
List Price:
Rs 499
Our Price:
Rs. 349
Discount:
Rs. 150
    
off
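
A simplified sketch of that extraction in Java (illustrative only; the real logic was tuned per site): quote the product name, then capture the first rupee amount that follows it.

import java.util.regex.*;

public class PriceExtractor {
    // Find the first "Rs." amount appearing after the given product name.
    static String findPrice(String pageText, String productName) {
        Pattern p = Pattern.compile(
            Pattern.quote(productName) + "[\\s\\S]*?Rs\\.?\\s*([\\d,]+)",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(pageText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "Everyday Eating For Babies And Children ... Our Price Rs. 745 ...";
        System.out.println(findPrice(page, "Everyday Eating For Babies And Children"));  // prints 745
    }
}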



The user enters the name of a product or a list of products he wants to purchase. An additional option was provided to view a cached copy excluding images, so that if the original web server is down the user can still see the content.