Final CS 121 Lecture 5


Black Hat Techniques: Cloaking

- Serving different content to a spider than to a browser
- If a request comes from a spider -> serve the spam page
- Two sets of web pages, used to improve the ranking of a bad website
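The branching logic above can be sketched in a few lines. This is a minimal illustration, not a real server: the bot signatures and page strings are made-up assumptions for the example.

```python
# A sketch of cloaking: the server inspects the User-Agent header and
# serves different content to crawlers than to browsers.
# The signature list and page contents are illustrative assumptions.
SPIDER_SIGNATURES = ("googlebot", "bingbot", "slurp")

def select_page(user_agent: str) -> str:
    """Return spam-optimized content for spiders, real content for browsers."""
    ua = user_agent.lower()
    if any(sig in ua for sig in SPIDER_SIGNATURES):
        return "keyword-stuffed page served to the crawler"
    return "actual page served to the human visitor"
```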

Web Crawling: Sitemaps

- Sitemaps are XML files hosted on the web
- Allow webmasters to send info to crawlers:
- Lists of URLs that might not be reachable from the homepage
- Relative importance within the site map
- Update frequency
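A crawler can read those fields with a standard XML parser. The sitemap below is a made-up example using the sitemaps.org namespace; only the structure is the point.

```python
import xml.etree.ElementTree as ET

# A small illustrative sitemap; the URL and values are made up.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/hidden/page.html</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text: str):
    """Extract (URL, update frequency, relative importance) from a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        (u.findtext("sm:loc", namespaces=NS),
         u.findtext("sm:changefreq", namespaces=NS),
         u.findtext("sm:priority", namespaces=NS))
        for u in root.findall("sm:url", NS)
    ]
```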

Web Data Collection: Data Dumps

- Sites may package their data periodically and provide it as a "dump"

Web Data Collection: Web APIs

- Sites may provide REST interfaces for getting at their data
- Usually higher-level: avoids having to parse HTML
- Usually restrictive: only part of the data
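Instead of scraping HTML, the client composes a query URL against the REST endpoint. The endpoint and parameters below are hypothetical; only the URL-building step is shown, with no network call.

```python
from urllib.parse import urlencode

# A sketch of calling a REST API: build the query URL rather than
# fetching and parsing HTML. Endpoint and parameters are hypothetical.
def build_api_url(base: str, params: dict) -> str:
    """Compose a REST query URL from a base endpoint and parameters."""
    return f"{base}?{urlencode(params)}"
```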

Black Hat Techniques: Keyword Stuffing

- 1st-gen search engines relied heavily on textual content and word frequency
- SEO moved to playing games with keywords:
- Misleading meta-tags
- Repeating words over and over
- Playing games with colors (white text on a white background)

Uniform Resource Locators (URL)

- A string of characters used to identify a resource
- URL (locator) vs. URN (name)
- A locator must specify where the resource is

Black Hat Techniques: Link Spamming

- Bots search for blogs and leave comments containing a link
- The number of links counts more toward ranking than the relevance of those links

Black Hat Techniques: Clicker Bots

- Click-through rate (CTR) affects website rankings
- Clicker bots issue queries and "click" on targeted query results

SEO Black Hat Techniques

- Duplicate Content
- Keyword Stuffing
- Link Farming
- Cloaking Text and/or Links
- Redirecting to Another Site or Page
- Blog Comment Spam

SEO White Hat Techniques

- Fresh, Relevant Content
- Linking to / Getting Links from Relevant Industry Sources
- Optimized Image Labels
- Relevant Page Titles & Tags
- Natural Keyword Density

Web Data Collection: Web Crawling

- Getting HTML pages and other documents, discovering new URLs as it goes
- Good for changing collections
- Good for unknown documents
- Web admins don't like crawlers

Black Hat Techniques: Link Exchanges

- "I link to you, you link to me"
- "Translations"
- Universities and professors are targets

Search Engine Optimization: Legitimate Approach

- Indexed age of the pages (older is better)
- Good incoming links
- Good content: well written, well organized, up to date
- Good use of web standards/practices

SPAM: The war on spam

- Quality indicators
- Usage indicators
- Anti-bot mechanisms
- Limits on meta keywords
- Spam recognition by machine learning
- Family-friendly filters (humans and ML)
- Robust link analysis
  - Ignore improbable links
  - Detect cycles
  - Use link analysis to detect spammers
- Editorial intervention
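The "detect cycles" item above can be sketched as a depth-first search over a link graph. The adjacency dict and hostnames are toy assumptions; a link-exchange ring shows up as a cycle.

```python
# A sketch of cycle detection in a directed link graph (adjacency dict
# mapping a site to the sites it links to). Hostnames are made up.
def has_cycle(graph: dict) -> bool:
    """Return True if the directed link graph contains a cycle."""
    GRAY, BLACK = 1, 2   # GRAY = on current DFS path, BLACK = finished
    color = {}

    def visit(node) -> bool:
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            state = color.get(nxt)
            if state == GRAY:
                return True          # back edge: cycle found
            if state is None and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n) is None and visit(n) for n in graph)
```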

SPAM: Webmaster Guidelines

- Search engines publish SEO policies: what is allowed and not allowed
- These must not be ignored
- Once a site is blacklisted by a search engine, it virtually disappears from the Web

Web Data Collection: URL Downloads

- Two-step process:
  1. Crawl to find the URLs of specific resources
  2. Run a downloader that takes that list and downloads the resources
- Example: "crawling" SourceForge for source code
- Some sites use URL wrappers to access URLs as if they were files
- Doesn't need to be source code; can be papers, pages, etc.
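The two-step shape can be sketched as a downloader that consumes the URL list produced by the crawl. fetch() here is a stand-in for a real HTTP fetch (e.g. urllib.request.urlopen), so the example stays offline.

```python
# Step 2 of the two-step process: a downloader run over the URL list
# produced by step 1. fetch() is a placeholder for real HTTP I/O.
def fetch(url: str) -> str:
    return f"<contents of {url}>"   # stand-in; no real network call

def download_all(url_list):
    """Download every resource named in the crawled URL list."""
    return {url: fetch(url) for url in url_list}
```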

Web Crawling: Freshness

- Web pages are constantly being added, deleted, and modified
- Freshness refers to the proportion of pages that are fresh
- A special HTTP request type called HEAD returns information about a page, not the page itself
- Not possible to constantly check all pages
- Optimizing for freshness vs. relevance can lead to bad decisions, such as not crawling popular sites
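The freshness metric can be sketched as a ratio: a stored copy counts as fresh if it was crawled at or after the page's last modification. The timestamp representation is an assumption for illustration.

```python
# A sketch of the freshness metric: the proportion of crawled pages
# whose stored copies are still current. Timestamps are illustrative.
def freshness(pages) -> float:
    """pages: iterable of (last_crawled, last_modified) timestamps.
    A copy is fresh if it was crawled at or after the last modification."""
    pages = list(pages)
    if not pages:
        return 1.0
    fresh = sum(1 for crawled, modified in pages if crawled >= modified)
    return fresh / len(pages)
```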

Anatomy of a URL

- Web pages are stored on web servers that use HTTP to exchange information with client software

Search Engine Optimization: Unethical Approaches (aka spamdexing)

- Fake pages
- Fake sites that point to your site
- Fake comments/engagement
- In short: "alternative facts", aka lies

Search Engine Optimization

- The process of maximizing the number of visitors to a particular website by ensuring the site appears high in the list of results returned by a search engine
- Often referred to as "natural"
- An alternative to paying for ad placement
- It's marketing: getting your content to your audience

Black Hat Techniques: Doorway Pages

- Like cloaking, but using a redirect
- The initial page is optimized for the spider; a redirect then takes the user to the actual content

Domain Name System (DNS): Domains and Subdomains (calendar.ics.uci.edu)

- .edu = root domain
- uci.edu = domain
- ics.uci.edu = ICS subdomain in uci.edu
- calendar.ics.uci.edu = Calendar subdomain in ics.uci.edu
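The chain above can be rebuilt mechanically by walking the dot-separated labels from right to left. A quick sketch using the lecture's calendar.ics.uci.edu example:

```python
# Rebuild the domain hierarchy from a hostname, right to left:
# root domain first, full subdomain last.
def domain_chain(hostname: str):
    """Return the hostname's suffixes, from root domain outward."""
    labels = hostname.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]
```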

Flavors of Data Collection

1. Data Dumps
2. URL Downloads
3. Web APIs
4. Web Crawling

Web Crawling Algorithm

1. Initialize a queue of URLs (seeds)
2. Repeat until no more URLs are in the queue:
   a. Get one URL from the queue
   b. If the page can be crawled, fetch the associated page
   c. Store a representation of the page
   d. Parse and extract URLs from the page and add them to the queue

- The queue is called the "frontier"
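The steps above can be sketched against an in-memory link graph instead of live HTTP, so the example is self-contained: looking up a URL in link_graph stands in for fetching the page and parsing out its links.

```python
from collections import deque

# A sketch of the crawling algorithm over a toy link graph.
def crawl(seeds, link_graph):
    frontier = deque(seeds)          # 1. initialize the queue with seeds
    seen = set(seeds)
    stored = []
    while frontier:                  # 2. repeat until the frontier is empty
        url = frontier.popleft()     # 2a. get one URL from the queue
        stored.append(url)           # 2b/2c. "fetch" and store the page
        for out in link_graph.get(url, ()):   # 2d. extract its URLs
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return stored
```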

Web Data Collection

1. The web crawler client program connects to a DNS server
2. The DNS server translates the hostname into an internet protocol (IP) address
3. The crawler attempts to connect to the server host using a specific port
4. The crawler sends an HTTP request to the web server to request a page (usually a GET request)
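Step 4 can be sketched as the raw request a crawler would write to the socket after steps 1-3. This builds the request text only, with no connection; the example URL is made up.

```python
from urllib.parse import urlsplit

# A sketch of step 4: the GET request line and Host header a crawler
# sends once connected. No network I/O; the URL is illustrative.
def build_get_request(url: str) -> str:
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"GET {path} HTTP/1.1\r\nHost: {parts.hostname}\r\n\r\n"
```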

Domain:port

The authority part of the URL. Domain = host

Fragment_id

Points to a secondary resource or an element within the primary resource

Scheme

http(s), ftp, mailto, file, data

Path

Often maps to a file system path on the server

URL: General Syntax

scheme://domain:port/path?query_string#fragment_id
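The standard library can decompose a URL into exactly these components. The example URL is invented to exercise every part of the syntax above.

```python
from urllib.parse import urlparse

# Decompose a URL into the components of the general syntax.
# The example URL is made up to exercise every component.
parts = urlparse("https://calendar.ics.uci.edu:443/week/view?day=mon#top")
# parts.scheme, parts.hostname, parts.port, parts.path,
# parts.query, and parts.fragment map onto the syntax above.
```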

Query_string

A sequence of attribute-value pairs separated by a delimiter (usually &)
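Splitting a query string back into those pairs is one call in the standard library. The attribute names here are illustrative.

```python
from urllib.parse import parse_qs

# Split a query string into its attribute-value pairs; parse_qs maps
# each attribute to a list of values ('+' decodes to a space).
pairs = parse_qs("q=web+crawling&lang=en")
```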

