Web Science
Option C in the IB Computer Science course.
Creating the Web
The Internet, an internet, and the World Wide Web (otherwise referred to as the web) are commonly confused. However, each is quite different.
C.1.1 Distinguish between the internet and World Wide Web (web).
An internet simply refers to a set of interconnected networks. The Internet refers to the global computing network that utilizes standardized communication protocols, including IP addresses. In other words, the internet is a wide-area network that spans the planet[1]. The World Wide Web (web) is the information space composed of various web resources that can be accessed via the Internet. In other words, the World Wide Web is a service that runs on the Internet.
The analogy can be made that the Internet is a restaurant and the web is its most popular dish.
Growth of the Web
C.1.2 Describe how the web is constantly evolving.
Generally, the web's evolution can be characterized as a movement from personal sites to blogs, or from publication to participation. It was a move from static pages to dynamic ones.
Early Forms of the Web
Sometimes referred to as "Web 1.0", the early web consisted of personal, static web pages hosted on ISP (internet service provider) web servers or on free web hosting services, generally before the advent of dynamic programming languages such as Perl, PHP, and Python. Design elements of the era included online guestbooks instead of comment sections, and HTML forms submitted via mailto.
Web 1.0 is associated with the business model of Netscape: creating, updating, and fixing software, and distributing it to end users.
Web 2.0
Web 2.0 refers to a web that emphasizes user participation and contribution on sites such as social media sites and blogs. It features client-side technologies such as Ajax and JavaScript, as well as dynamic programming languages. The focus on user interface, application software, and storage of files has been referred to as "network as a platform". Key features of Web 2.0 include:
- Folksonomy - free classification of information (such as in tagging)
- User Participation - site users are encouraged to add value/content to the site
- Mass Participation - near-universal web access has led to a differentiation of concerns across the user base
- SaaS (Software-as-a-Service)
In contrast to Web 1.0, Web 2.0 is associated with Google, which focused not on creating end-user software but providing a service based on existing data.
The Semantic Web
The Semantic Web is an extension of the web through World Wide Web Consortium (W3C) standards that promote common data formats and unified exchange protocols. For example, the Resource Description Framework (RDF) specification was promoted as a general method for conceptual modelling of web resources using subject-predicate-object expressions (e.g. subject: "the table", predicate: "has the length of", object: "one meter").
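The subject-predicate-object idea can be sketched with plain tuples; this is a minimal illustration only (a real system would use an RDF library and URIs for each term):

```python
# Minimal sketch of RDF-style triples as Python tuples.
# A real system would use an RDF library and URIs for each term.
triples = [
    ("the table", "has the length of", "one meter"),
    ("the table", "is made of", "wood"),
]

def objects_for(subject, predicate, triples):
    """Return all objects matching a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_for("the table", "has the length of", triples))  # ['one meter']
```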
Protocol and Addressing
C.1.9 Explain the importance of protocols and standards on the web.
Protocols create a common "language" or set of rules for communication that ensure proper, compatible communication. Standards and protocols ensure the universality of the web. Without them, it would be like communicating in a foreign language without knowing the foreign language ("It's all Greek to me!")
Web Browser
C.1.12 Explain the functions of a browser.
A browser is software for retrieving, presenting, and traversing information resources on the web.
C.1.7 Identify the characteristics of: IP, TCP, and FTP.
TCP and IP together comprise a suite of protocols that carry out the basic functionality of the web.
Internet Protocol (IP)
IP is the protocol that defines routing to addresses. Every computer holds a unique IP address, and IP handles the process of getting all data to its destination.
Transmission Control Protocol (TCP)
Information sent over the internet is broken into “packets” and sent through different routes to reach a destination. TCP creates data packets, puts them back together in the correct order, and checks that no packets were lost.
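The split-and-reassemble idea can be sketched as follows; this is a toy illustration only, since real TCP also handles acknowledgements, retransmission, and checksums:

```python
import random

# Toy illustration of TCP-style segmentation and reordering.
# Real TCP also handles acknowledgements, retransmission, and checksums.

def make_packets(data, size):
    """Split data into (sequence_number, chunk) packets."""
    return [(i, data[i:i + size]) for i in range(0, len(data), size)]

def reassemble(packets):
    """Put packets back together in sequence order."""
    return "".join(chunk for _, chunk in sorted(packets))

packets = make_packets("Hello, web!", 4)
random.shuffle(packets)          # packets may take different routes
assert reassemble(packets) == "Hello, web!"
```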
File Transfer Protocol (FTP)
FTP is the protocol that provides the methods for sharing or copying files over a network. It is primarily used for uploading files to a web site, and some download sites run an FTP server, though HTTP is more common for downloading. When using FTP, the URL reflects it, beginning with ftp://.
C.1.3 Identify the characteristics of: HTTP, HTTPS, and URL.
Hypertext Transfer Protocol (HTTP)
HTTP is the set of rules used to communicate between web servers and web browsers. HTTP is a connectionless, text-based protocol: a new connection must be established for each request, and it communicates without knowledge of the underlying communications network.
Hypertext Transfer Protocol Secure (HTTPS)
As HTTP does not provide much security, HTTPS was developed; it adds encryption to the connection using Transport Layer Security (TLS) or, formerly, Secure Sockets Layer (SSL).
Uniform Resource Locator (URL)
C.1.4 Identify the characteristics of: uniform resource identifier (URI) and URL.
C.1.5 Describe the purpose of a URL.
A URI is a string that identifies a resource. A URL is a specific type of URI that provides the address of a web resource as well as the means to retrieve it. For example, in http://example.com/index, http identifies the protocol for retrieval, example.com is the address, and /index is the specific file.
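Python's standard library can break the example URL into these parts:

```python
from urllib.parse import urlparse

# Breaking the example URL into its parts with Python's standard library.
parts = urlparse("http://example.com/index")
print(parts.scheme)   # 'http' - the protocol for retrieval
print(parts.netloc)   # 'example.com' - the address (host)
print(parts.path)     # '/index' - the specific resource
```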
Domain Name Server (DNS)
C.1.6 Describe how a domain name server functions.
A Domain Name Server is a special type of server that relates a web address to an IP address, acting somewhat like a directory. It uses a hierarchical, decentralized naming system: root DNS servers delegate to top-level domain servers (such as .net and .com), which in turn delegate to the authoritative DNS servers below them (for example, stanford.edu falls under .edu).
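The hierarchical lookup can be modelled as a toy walk from the root downward. The dictionary and its IP addresses below are hypothetical; real resolvers query root, TLD, and authoritative servers over the network:

```python
# Toy model of hierarchical DNS resolution (hypothetical addresses).
# Real resolvers query root, TLD, and authoritative servers over the network.
root = {
    "edu": {                                  # top-level domain server
        "stanford.edu": "171.67.215.200",     # hypothetical record
    },
    "com": {
        "example.com": "93.184.216.34",       # hypothetical record
    },
}

def resolve(name):
    """Walk from the root down to the record for `name`."""
    tld = name.rsplit(".", 1)[-1]             # e.g. 'edu'
    return root[tld][name]

print(resolve("stanford.edu"))
```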
Mark-up and Style Sheets
C.1.3 Identify the characteristics of the following: HTML, XML, XSLT, JavaScript, and CSS.
Hypertext Mark-up Language (HTML)
HTML is the standard markup language used to make web pages. Characteristics:
- Allows for embedded images/objects or scripts
- HTML predefined tags structure the document
- Tags are marked-up text strings; elements are "complete" tags, with opening and closing tags; attributes modify values of an element
- Typically paired with CSS for style
Cascading Style Sheet (CSS)
CSS style sheets describe how HTML elements are displayed and can control the layout of several web pages at once.
Extensible Mark-Up Language (XML)
XML is a markup specification language that defines rules for encoding documents (to store and transport data) in a form that is both human- and machine-readable. XML, as a metalanguage, supports the creation of custom tags (unlike HTML), using Document Type Definition (DTD) files to define the tags. XML files are data, not software.
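A short sketch of parsing a document with custom tags; the <furniture>, <table>, and <length> tags below are invented for illustration:

```python
import xml.etree.ElementTree as ET

# XML lets you define your own tags; <furniture>, <table>, and <length>
# here are custom tags invented for this example.
doc = """
<furniture>
  <table>
    <length unit="meter">1</length>
  </table>
</furniture>
"""

root = ET.fromstring(doc)
length = root.find("table/length")
print(length.text, length.get("unit"))   # 1 meter
```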
Extensible Stylesheet Language Transformations (XSLT)
XSLT is a language for transforming XML documents into other XML documents or other formats such as HTML. It creates a new document based on the content of the existing one.
JavaScript
JavaScript is a dynamic programming language widely utilized to create web resources. Characteristics include:
- Primarily client side (though server-side use is possible through runtimes such as Node.js)
- Supports object-oriented programming styles
- The core language does not include input/output; the host environment provides it
- Can be used to embed images or documents, create dynamic forms, animation, slideshows, and validation for forms
- Also used in games and applications
Web Pages
C.1.8 Outline the different components of a web page.
The head contains the title, meta tags, and other metadata. Metadata describes the document itself or associates it with related resources such as scripts and style sheets. The body contains headings, paragraphs, and other content.
The title defines the title shown in the browser's toolbar or tab.
Meta tags are snippets of text that describe a page's content but do not appear on the page itself, only in the page's code. They help search engines find relevant websites.
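What a search engine reads from the head can be sketched with Python's standard HTML parser; the page below is a made-up example:

```python
from html.parser import HTMLParser

# A small parser that pulls the title and meta tags out of the head,
# similar to what a search engine does when it reads a page's metadata.
class HeadReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = """<html><head><title>My Page</title>
<meta name="description" content="Notes on web science"></head>
<body><p>Hello</p></body></html>"""

reader = HeadReader()
reader.feed(page)
print(reader.title)                 # My Page
print(reader.meta["description"])   # Notes on web science
```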
C.1.10 Describe the different types of web page.
Personal pages are pages created by individuals for personal content rather than for affiliation with an organization. They are usually informative or entertaining, covering topics such as personal hobbies or opinions.
Blogs (weblogs) provide a mechanism for publishing periodic articles on a website.
Search engine results pages (SERPs) display the results a search engine returns for a query.
Forums, or online discussion boards, are usually organized by topic and let people hold conversations through posted messages. A forum typically has different user groups that define a user's roles and abilities.
C.1.11 Explain the differences between a static web page and a dynamic web page
Static web pages contain the same content on each load of the page, whereas the content of dynamic web pages can change depending on user input. Static websites are faster and cheaper to develop, host, and maintain, but lack the functionality and easy updating that dynamic web sites offer. Examples of dynamic web pages include e-commerce systems and discussion boards.
Dynamic web pages can use PHP, the ASP.NET framework, or Java Server Pages (JSP) scriptlets. A JSP scriptlet is a small piece of executable code intertwined with HTML; JSP is server-side, whereas JavaScript is client-side. The ASP.NET framework can generate dynamic web pages or applications using single-page applications (SPA) or the MVC (Model-View-Controller) pattern, and hosts a variety of .NET languages, such as C# with Razor syntax. PHP is a server-side scripting language for web development that can be embedded into HTML code or used with templates or frameworks.
C.1.13 Evaluate the use of client-side scripting and server-side scripting in web pages.
Server-side scripting runs on the server: a request must be sent and data returned. It is more secure for the client. Examples include PHP, JSP, and ASP.NET.
Client-side scripting runs the script on the client's machine. It can pose a security risk to the client, but it is faster because no round trip to the server is needed. Examples include JavaScript (with JSON as a common data format).
C.1.14 Describe how web pages can be connected to underlying data sources.
A connection string is a string that specifies information about a data source and the means of connecting to it. Connection strings are commonly used for database connections.
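A sketch of the idea: the key-value connection string below is hypothetical, and SQLite (used for the live connection) accepts just a file path or :memory: instead:

```python
import sqlite3

# A typical connection string packs data source details into one string.
# The exact keys vary by database; this string is a hypothetical example.
conn_str = "Server=db.example.com;Database=shop;User Id=web;Password=secret"
params = dict(pair.split("=", 1) for pair in conn_str.split(";"))
print(params["Database"])   # shop

# SQLite's "connection string" is simply a file path, or ':memory:'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT)")
conn.execute("INSERT INTO pages VALUES ('http://example.com/index')")
row = conn.execute("SELECT url FROM pages").fetchone()[0]
conn.close()
print(row)   # http://example.com/index
```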
C.1.15 Describe the function of the common gateway interface (CGI).
CGI is a standard way for web servers to interface with executable programs installed on a server that generate web pages dynamically.
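A minimal sketch of what a CGI program does: the web server runs it and sends its standard output (headers, a blank line, then the body) back to the browser. The function name below is invented for illustration:

```python
import datetime

# Sketch of a CGI program's job: emit headers, a blank line, then the body.
# The web server relays this standard output back to the browser.
def cgi_response():
    body = "<html><body><p>Generated at %s</p></body></html>" % (
        datetime.datetime.now().isoformat())
    return "Content-Type: text/html\r\n\r\n" + body

print(cgi_response())
```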
Searching the Web
Layers of the Web
C.2.2 Distinguish between the surface web and the deep web.
The surface web is anything that can be found and accessed by search engines. The deep web includes web pages that cannot be found by search engines, typically because they are protected by authentication; they can usually only be accessed by already knowing the link or having the proper credentials. The dark web, on the other hand, can usually only be reached through Tor, as access requires encryption and anonymization.
Search Engines
C.2.1 Define the term search engine.
A web search engine is a site that helps you find other websites, through methods such as keyword searching and concept-based searching. It discovers pages by following the links between websites.
Searching Algorithms
C.2.3 Outline the principles of searching algorithms used by search engines.
Principles include checking term frequency, zone indexes (placing different weights on, e.g., the title versus the description), relevance feedback, and the vector model (measuring the cosine similarity between a query and a document).
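The vector model can be sketched with term-count dictionaries; the documents below are made up:

```python
import math

# Sketch of the vector model: documents and queries become term-count
# vectors, and relevance is the cosine of the angle between them.
def cosine_similarity(a, b):
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = {"web": 1, "science": 1}
doc1 = {"web": 3, "science": 2, "notes": 1}
doc2 = {"cooking": 4, "recipes": 2}
print(cosine_similarity(query, doc1) > cosine_similarity(query, doc2))  # True
```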
PageRank is a link analysis algorithm used by Google that assigns a numerical weight, PR(E), to each element of a set of hyperlinked documents. A hyperlink to a page counts as a vote of support for that page: importance by association. A page's rank is computed from the PageRank of each page linking to it, divided by that page's number of outgoing links. Altogether, the PageRanks sum to 1: they form a probability distribution.
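A small power-iteration sketch of PageRank; the three-page link graph and the damping factor d = 0.85 are illustrative assumptions:

```python
# Power-iteration sketch of PageRank with damping factor d.
# links[p] lists the pages p links to; the resulting ranks sum to 1.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(round(sum(ranks.values()), 6))   # 1.0 - a probability distribution
```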
The Hyperlink-Induced Topic Search (HITS) algorithm is a link analysis algorithm that also rates web pages, in terms of hubs and authorities. A good hub points to many pages; a good authority is a page linked to by many good hubs. Each page is assigned two scores: its authority, which estimates the value of its content, and its hub value, which estimates the value of its links to other pages. HITS first generates a root set (the most relevant pages) through a text-based algorithm, then a base set by augmenting the root set with web pages that link to it or are linked from it. The base set and all the hyperlinks among its pages form a focused subgraph upon which HITS is performed.
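A sketch of one HITS run on a tiny focused subgraph; the page names are invented:

```python
import math

# Sketch of HITS on a small focused subgraph.
# links[p] lists the pages p points to.
def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing to it.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # A page's hub score is the sum of the authorities it points to.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # Normalize so the scores stay bounded.
        na = math.sqrt(sum(v * v for v in auth.values()))
        nh = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

links = {"hub1": ["page", "other"], "hub2": ["page"], "page": [], "other": []}
hub, auth = hits(links)
print(auth["page"] > auth["other"])   # True: linked to by more hubs
```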
Web Crawlers
C.2.4 Describe how a web crawler functions.
Web crawlers, also known as web spiders, are internet bots that systematically index websites by following links while collecting information about each site; they also copy pages for indexing.
A bot, also known as a web robot, is a software application that runs automated tasks (scripts) over the Internet, usually repetitive tasks, and can do so at a high rate.
Web crawlers can be stopped from accessing a page via the robots exclusion protocol, declared in a site's robots.txt file.
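Checking the robots exclusion protocol before fetching a page, with a made-up robots.txt:

```python
import urllib.robotparser

# Checking the robots exclusion protocol before crawling a page.
# The robots.txt content here is a made-up example.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("MyCrawler", "http://example.com/index"))      # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))  # False
```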
C.2.5 Discuss the relationship between data in a meta-tag and how it is accessed by a web crawler.
Meta tags are used for indexing (keywords), for retrieval (judging whether an indexed page is relevant to a search query), and sometimes for ranking. Google, for example, gives the keywords meta tag no weight. Be aware that the relationship between meta-tag data and how a crawler uses it is not always a direct one.
C.2.6 Discuss the use of parallel web crawling.
Parallel web crawling uses several web crawlers, or multiple crawling processes running in parallel, to maximize the download rate. Care must be taken not to download the same page more than once.
C.2.7 Outline the purpose of web-indexing in search engines.
Indexing allows for fast searching and high relevancy of results.
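The core structure is an inverted index, mapping each term to the set of documents containing it, so search becomes a fast lookup instead of a scan of every page; the documents below are made up:

```python
# Sketch of an inverted index: term -> set of documents containing it.
docs = {
    "page1": "web science notes",
    "page2": "computer science course",
}

index = {}
for name, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(name)

print(sorted(index["science"]))   # ['page1', 'page2']
print(index["web"])               # {'page1'}
```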
C.2.9 Describe the different metrics used by search engines.
- Trustworthiness of linking domain/hub
- Popularity of linking page
- Relevancy of content between source and target page
- Anchor text used in link
- Amount of links to the same page on source page
- Amount of domains linking to target page
- Relationship between source and target domains
- Variations of anchor text in link to target page
Search Engine Optimization
C.2.8 Suggest how web developers can create pages that appear more prominently in search engine results.
- Allow search engines to find your site
- Have a link-worthy site
- Identify key words, metadata
- Ensure search-friendly architecture
- Have quality content
- Update content regularly
C.2.11 Discuss the use of white hat and black hat search engine optimization.
Black hat techniques use aggressive SEO strategies that exploit search engines rather than focusing on a human audience, yielding short-term returns. They include the use of:
- Blog spamming
- Link farms
- Hidden text
- Keyword stuffing
- Parasite hosting
- Cloaking
White hat techniques stay within search engine guidelines and are considered ethical, yielding long-term returns. They include:
- Guest blogging
- Link baiting
- Quality content
- Site optimization
Distributed Approaches to the Web
Future of an Interconnected Web
C.3.1 Define the terms: mobile computing, ubiquitous computing, peer-2-peer network, grid computing.
Mobile Computing
Mobile computing is human-computer interaction during which the computer can be expected to be transported during normal usage (or otherwise is mobile). Most popular devices include the smart phone and the tablet.
Ubiquitous computing
Ubiquitous computing is the concept of computing made to appear anytime and anywhere: an overwhelming spread of computing, also called pervasive computing.
Peer-2-Peer Networks
Peer-2-peer networks are ones in which each computer, or node, acts as both client and server, allowing resources to be shared in common by all within the network. This makes autonomy from central servers achievable.
Grid Computing
Grid computing is the collection of computer resources in multiple locations to reach a common goal. It is distinguished from cluster computing in that grid computing assigns a specific role to each node. Grids can be used for software libraries and provide a persistent, standards-based service infrastructure.
C.3.2 Compare the major features of: mobile computing, ubiquitous computing, peer-2-peer networks, and grid computing.
Ubiquitous computing is being driven forward by mobile computing: as mobile devices spread, the idea manifests in more contexts.
P2P is more about ensuring connectivity and a network of shared resources, while grid computing focuses more on infrastructure. Both deal with the organization of resource sharing within virtual communities.
Ubiquitous computing is commonly characterized by multi-device interaction (as in P2P and grid computing), but the concepts are not synonymous.
Interoperability and Open Standards
C.3.3 Distinguish between interoperability and open standards.
Interoperability is a property of a system to work with other products without any restrictions in access or implementation.
An open standard is a standard that is publicly available and has various rights to use associated with it.
Compression
C.3.6 Distinguish between lossless and lossy compression.
Lossless compression recovers every single bit of the original data when decompressed (e.g. GIF).
Lossy compression permanently eliminates redundant or less important information (e.g. JPEG).
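The lossless guarantee can be demonstrated with a zlib round-trip:

```python
import zlib

# Lossless compression round-trip: every bit of the original comes back.
data = b"the web the web the web the web " * 10
compressed = zlib.compress(data)
print(len(compressed) < len(data))          # True: redundancy removed
print(zlib.decompress(compressed) == data)  # True: nothing lost
# Lossy compression (e.g. JPEG) has no such guarantee: decoding yields an
# approximation of the original, trading fidelity for smaller size.
```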
The Evolving Web
C.4.1 Discuss how the web has supported new methods of online interaction such as social networking.
Web 2.0 and the rise of dynamic web pages have allowed user contribution to proliferate, enabling the widespread use of social networking, blogging, and comment sections.
Cloud Computing
C.4.2 Describe how cloud computing is different from a client-server architecture.
Cloud computing means hosting on remote servers on the internet to store, manage, and process data, rather than on a local server or personal computer. Cloud computing shares resources more widely than the client-server architecture does; client-server architecture merely refers to the communication between client and server and the distribution of "responsibility" between them.
C.4.3 Discuss the effects of the use of cloud computing for specified organizations.
- Less costly
- Device and location independence
- Maintenance is easier
- Performance is easily monitored
- Security is a trade-off: providers can devote more resources to it, but data is entrusted to a third party
Intellectual Computing
C.4.4 Discuss the management of issues such as copyright and intellectual property on the web.
Creative Commons licenses give the freedom to share, adapt, and even commercially use information. Different licenses permit different forms of redistribution; some allow usage without crediting the author, but none allow claiming the work as one's own intellectual property.
C.4.7 Explain why the web may be creating unregulated monopolies.
New multinational online oligarchies or monopolies may arise that are not regulated by any single country.
Net Neutrality
Net neutrality is the principle that Internet Service Providers (ISPs) and governments should treat all data and resources on the Internet the same, without discrimination by user, content, platform, or other characteristics.
Internet of Things (IoT)
IoT refers to the network of physical objects embedded with electronics and other needed technology to enable these objects to collect and exchange data. [2]
- ↑ Dale, Nell, and John Lewis. Computer Science Illuminated. 5th ed. N.p.: Jones & Bartlett Learning, 2012. Print.
- ↑ "Internet of Things Global Standards Initiative". ITU. Retrieved 7 May 2016.