CS 121 Lecture 5
Black Hat Techniques: Cloaking
- Serving different content to a spider than to a browser
- If a request comes from a spider, serve a spam page instead
- Maintains two sets of web pages to improve the ranking of a bad website
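A minimal sketch of how a cloaking server might work, assuming a Flask app and a naive User-Agent check; the page contents and bot signatures are illustrative, not from the lecture:

```python
from flask import Flask, request

app = Flask(__name__)

# Two versions of the page: one stuffed for spiders, one for humans (both made up)
SPAM_PAGE = "<html><body>keyword keyword keyword ...</body></html>"
REAL_PAGE = "<html><body>The actual content users see.</body></html>"

# Substrings that commonly appear in crawler User-Agent headers
CRAWLER_SIGNATURES = ("Googlebot", "bingbot", "Slurp")

@app.route("/")
def index():
    ua = request.headers.get("User-Agent", "")
    # Cloaking: branch on whether the client looks like a spider
    if any(sig in ua for sig in CRAWLER_SIGNATURES):
        return SPAM_PAGE
    return REAL_PAGE

if __name__ == "__main__":
    app.run()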
Web Crawling: Sitemaps
- Sitemaps are XML files hosted on the web
- Allow webmasters to send info to the crawler:
  - Lists of URLs that might not be reachable from the homepage
  - Relative importance of each URL
  - Update frequency
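For illustration, a small sitemap.xml in the standard sitemaps.org format; the URL, date, and values below are made up:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/research/papers.html</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>monthly</changefreq>  <!-- hint at update frequency -->
    <priority>0.8</priority>          <!-- relative importance, 0.0-1.0 -->
  </url>
</urlset>
```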
Web Data Collection: Data Dumps
- Sites may package their data periodically and provide it as a "dump"
Web Data Collection: Web APIs
- Sites may provide REST interfaces for getting at their data
- Usually higher-level: avoids having to parse HTML
- Usually restrictive: only part of the data
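A sketch of calling such a REST interface with Python's requests library; the endpoint, parameters, and the "results" field are all hypothetical, since real APIs differ in paths, auth, and rate limits:

```python
import requests

# Hypothetical REST endpoint returning JSON
resp = requests.get(
    "https://api.example.com/v1/posts",
    params={"q": "information retrieval", "limit": 10},
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()

# JSON is already structured, so no HTML parsing is needed
# ("results", "title", and "url" are assumed field names)
for post in resp.json()["results"]:
    print(post["title"], post["url"])
```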
Black Hat Techniques: Keyword Stuffing
- 1st-generation search engines relied heavily on textual content and the frequency of words
- SEO moved to playing games with keywords:
  - Misleading meta tags
  - Repeating words over and over
  - Playing games with colors (white text on a white background)
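An illustrative HTML fragment combining both tricks (misleading meta keywords and white-on-white text); everything in it is made up:

```html
<head>
  <!-- Misleading meta keywords unrelated to the page's real content -->
  <meta name="keywords" content="cheap flights, cheap flights, free phone, lottery winner">
</head>
<body>
  <!-- White-on-white text: invisible to users, visible to early spiders -->
  <p style="color:#ffffff; background-color:#ffffff;">
    cheap flights cheap flights cheap flights cheap flights
  </p>
</body>
```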
Uniform Resource Locators (URLs)
- A string of characters used to identify a resource
- URL (locator) vs. URN (name)
- A locator must specify where the resource is
Black Hat Techniques: Link Spamming
- Bots search for blogs and leave comments containing links
- The sheer number of incoming links raises rank more than the relevance of those links
Black Hat Techniques: Clicker Bots
- Click-through rate (CTR) affects website rankings
- Clicker bots issue queries and "click" on targeted query results
SEO Black Hat Techniques
- Duplicate content
- Keyword stuffing
- Link farming
- Cloaking text and/or links
- Redirecting to another site or page
- Blog comment spam
SEO White Hat Techniques
- Fresh, relevant content
- Linking to / getting links from relevant industry sources
- Optimized image labels
- Relevant page titles & tags
- Natural keyword density
Web Data Collection: Web Crawling
- Getting HTML pages and other documents, discovering new URLs along the way
- Good for changing collections
- Good for unknown documents
- Web admins don't like crawlers
Black Hat Techniques: Link Exchanges
- "I link to you, you link to me"
- "Translations"
- Universities and professors are common targets
Search Engine Optimization: Legitimate Approaches
- Indexed age of the pages (older is better)
- Good incoming links
- Good content: well written, well organized, up to date
- Good use of web standards/practices
SPAM: The War on Spam
- Quality indicators
- Usage indicators
- Anti-bot mechanisms
- Limits on meta keywords
- Spam recognition by machine learning
- Family-friendly filters (humans and ML)
- Robust link analysis:
  - Ignore improbable links
  - Detect cycles
  - Use link analysis to detect spammers
- Editorial intervention
SPAM: Webmaster Guidelines
- Search engines have SEO policies: what is allowed and what is not
- These policies must not be ignored
- Once a site is blacklisted by a search engine, it virtually disappears from the Web
Web Data Collection: URL Downloads
- Two-step process (sketched after this list):
  1. Crawl to find the URLs of specific resources
  2. Run a downloader that takes that list and downloads the resources
- Example: "crawling" SourceForge for source code
- Some sites use URL wrappers to access URLs as if they were files
- The resources don't need to be source code; they can be papers, pages, etc.
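A minimal sketch of step 2, assuming step 1 already produced a file of resource URLs; the file name, directory, and URL handling are illustrative:

```python
import os
import urllib.request

# Step 1 (assumed done): a crawl produced a list of resource URLs
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Step 2: the downloader fetches each resource and stores it locally
os.makedirs("downloads", exist_ok=True)
for url in urls:
    filename = os.path.join("downloads", url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, filename)
    print("saved", filename)
```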
Web Crawling: Freshness
- Web pages are constantly being added, deleted, and modified
- Freshness refers to the proportion of crawled pages that are fresh
- A special HTTP request type called HEAD returns information about the page, not the page itself (see the sketch below)
- It is not possible to constantly check all pages
- Optimizing for freshness vs. relevance can lead to bad decisions, such as not crawling popular sites
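A quick sketch of using HEAD to check a page's headers without downloading its body, via Python's requests library; the URL is a placeholder:

```python
import requests

# HEAD returns only the response headers, not the page body
resp = requests.head("https://www.example.com/index.html", timeout=10)

# Last-Modified (when present) tells the crawler whether a re-fetch is needed
print(resp.status_code)
print(resp.headers.get("Last-Modified"))
print(resp.headers.get("Content-Length"))
```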
Anatomy of a URL
- Web pages are stored on web servers that use HTTP to exchange information with client software
Search Engine Optimization: Unethical Approaches (aka Spamdexing)
- Fake pages
- Fake sites that point to your site
- Fake comments/engagement
- In short: "alternative facts", aka lies
Search Engine Optimization
- The process of maximizing the number of visitors to a particular website by ensuring the site appears high in the list of results returned by a search engine
- Often referred to as "natural" results
- An alternative to paying for ad placement
- It's marketing: getting your content to your audience
Black Hat Techniques: Doorway Pages
- Like cloaking, but using a redirect
- The initial page is optimized for the spider; a redirect then takes the user to the actual content
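An illustrative doorway page built with a meta refresh: a naive spider indexes the keyword-optimized body, while a browser immediately follows the redirect. The URLs and keywords are made up:

```html
<html>
<head>
  <!-- Browsers follow this redirect after 0 seconds; naive spiders index the body below -->
  <meta http-equiv="refresh" content="0; url=https://www.example.com/actual-content.html">
  <title>best cheap flights deals discount airfare</title>
</head>
<body>
  <!-- Keyword-optimized content intended only for the crawler -->
  best cheap flights deals discount airfare best cheap flights ...
</body>
</html>
```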
Domain Name System (DNS): Domains and Subdomains (example: calendar.ics.uci.edu)
- .edu = root domain
- uci.edu = domain
- ics.uci.edu = ICS subdomain in uci.edu
- calendar.ics.uci.edu = Calendar subdomain in ics.uci.edu
Flavors of Data Collection
1. Data dumps
2. URL downloads
3. Web APIs
4. Web crawling
Web Crawling Algorithm
1. Initialize a queue of URLs (seeds)
2. Repeat until no more URLs are in the queue:
   a. Get one URL from the queue
   b. If the page can be crawled, fetch the associated page
   c. Store a representation of the page
   d. Parse and extract URLs from the page and add them to the queue
- The queue is called the "frontier"
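A minimal Python sketch of this algorithm, assuming the requests and BeautifulSoup (bs4) libraries; it skips politeness delays, robots.txt checks, and much of the error handling a real crawler needs:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)                       # 1. initialize the queue with seeds
    seen = set(seeds)
    pages = {}                                    # stored representations, keyed by URL

    while frontier and len(pages) < max_pages:    # 2. repeat until the queue is empty
        url = frontier.popleft()                  # 2a. get one URL
        try:
            resp = requests.get(url, timeout=10)  # 2b. fetch the associated page
        except requests.RequestException:
            continue
        pages[url] = resp.text                    # 2c. store a representation

        # 2d. parse, extract URLs, and add unseen ones to the frontier
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

if __name__ == "__main__":
    results = crawl(["https://www.example.com/"], max_pages=10)
    print(len(results), "pages fetched")
```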
Web Data Collection
1. The web crawler client program connects to a DNS server
2. The DNS server translates the hostname into an Internet Protocol (IP) address
3. The crawler attempts to connect to the server host using that IP address and a specific port
4. The crawler sends an HTTP request to the web server to request a page (usually a GET request)
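A sketch of these steps using only Python's standard library: an explicit DNS lookup, a TCP connection on port 80, and a hand-written GET request; the hostname is a placeholder:

```python
import socket

host = "www.example.com"

# Steps 1-2: DNS lookup translates the hostname to an IP address
ip = socket.gethostbyname(host)

# Step 3: connect to the server host on the standard HTTP port
sock = socket.create_connection((ip, 80), timeout=10)

# Step 4: send an HTTP GET request for the page
request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
sock.sendall(request.encode("ascii"))

# Read and print the start of the raw HTTP response
response = b""
while chunk := sock.recv(4096):
    response += chunk
sock.close()
print(response.decode("utf-8", errors="replace")[:500])
```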
URL: General Syntax
scheme://domain:port/path?query_string#fragment_id

Scheme
- http(s), ftp, mailto, file, data

Domain:port
- The authority part; domain = host

Path
- Maps to a file system path

Query_string
- A sequence of attribute-value pairs separated by a delimiter

Fragment_id
- Provides direction to a secondary resource or element
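These components can be pulled apart with Python's standard urllib.parse module; the path, query, and fragment in this URL are made up:

```python
from urllib.parse import urlparse, parse_qs

url = "https://calendar.ics.uci.edu:443/events/list?month=5&year=2024#today"
parts = urlparse(url)

print(parts.scheme)           # https
print(parts.hostname)         # calendar.ics.uci.edu
print(parts.port)             # 443
print(parts.path)             # /events/list
print(parts.query)            # month=5&year=2024
print(parts.fragment)         # today
print(parse_qs(parts.query))  # {'month': ['5'], 'year': ['2024']}
```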