Python/Strings

Objective

Understand Python strings.
Learn basic string manipulation.
Learn about escape characters and their role in strings.
Learn about linefeed techniques for large strings.
Learn the string formatting basics.
Learn about string indexing.
Learn about string slicing.
Learn about string encoding, ASCII, and Unicode.
Work with common built-in string functions.

Lesson

Python Strings

The string is one of the simplest data types in Python. A string is created by enclosing text in either single quotation marks (') or double quotation marks (").

A simple string with single quotations:

>>> 'Hello!'
'Hello!'

A string can also use double quotations, which do not affect the string in any way.

>>> "Hello!"
'Hello!'

You could also concatenate (join together) strings by using the plus sign (+).

>>> "Hello," + " world!"
'Hello, world!'

String literals (strings written directly in the code rather than held in variables) are also concatenated automatically when they are placed next to each other.

>>> "Wiki" "versity" "!"
'Wikiversity!'

The above code has spaces between the literal strings, but you don't need to put spaces between them for this to work. "Wiki""versity""!" will display the same result.

Now, let's say you need to type a very long string that repeats itself. You can repeat words by using the multiplication operator (*).

>>> print("hey" * 3)
heyheyhey
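As a small sketch of how concatenation and repetition combine, the snippet below builds a title and a matching rule line. The names title, rule, and mixed are made up for illustration.

```python
# Join two strings with +, then repeat a single character with *.
title = "Python " + "Strings"
rule = "-" * 14              # the title is 14 characters long

print(title)
print(rule)

# * is applied before +, just as in arithmetic.
mixed = "ab" + "c" * 2
print(mixed)                 # abcc
```

If you want the repetition to apply to the whole joined string, wrap the concatenation in parentheses, as in ("ab" + "c") * 2.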

Escape Characters

There are some characters that cannot be easily typed within a string. These characters, called escape characters, are written as a sequence of two or more characters that together stand for a single character. In Python, we denote escape characters with a backslash (\) at the beginning. For example, to start a new line in the string we could add a linefeed (\n).

>>> "Hello, world!\n"
'Hello, world!\n'

That's not really impressive, is it? To actually see that new line in action, use the built-in function print().

>>> print("Hello, world!\n")
Hello, world!

Here is a table of other escape characters (no need to memorize them; the most important one you'll use is \n).^[1]

Escape Sequence	Meaning
\\	Backslash (\)
\'	Single quote (')
\"	Double quote (")
\a	ASCII Bell (BEL)
\b	ASCII Backspace (BS)
\f	ASCII Form-feed (FF)
\n	ASCII Linefeed (LF)
\r	ASCII Carriage Return (CR)
\t	ASCII Horizontal Tab (TAB)
\v	ASCII Vertical Tab (VT)
\ooo	Character with octal value ooo.
\xhh	Character with hex value hh.
\N{name}	Character named name in the Unicode database.
\uxxxx	Character with 16-bit hex value xxxx.
\Uxxxxxxxx	Character with 32-bit hex value xxxxxxxx.
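To make the table a little more concrete, here is a short sketch showing that several different escape sequences can spell the same character. The variable names spellings and tabbed are only for illustration.

```python
# Five ways to write the capital letter A using escape sequences.
spellings = [
    "\x41",                        # hex value 41
    "\101",                        # octal value 101
    "\u0041",                      # 16-bit hex value (lowercase u)
    "\U00000041",                  # 32-bit hex value (capital U)
    "\N{LATIN CAPITAL LETTER A}",  # name in the Unicode database
]
print(all(s == "A" for s in spellings))  # True

# An escape sequence counts as one character, however many
# characters it takes to type.
tabbed = "a\tb"
print(len(tabbed))  # 3
```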

Now you might start to see a problem with using \ in your string. Let's create a situation in which we need to print a Windows directory name.

>>> print("C:\new folder")
C:
ew folder

See how \n got interpreted as a linefeed? To correct this, we could use the backslash escape character. You need to be careful when using backslashes; always remember that two of them will only output one backslash.

>>> print("C:\\new folder")
C:\new folder

It could get tiresome to do that with very long directory strings, so let's use a simpler way than doubling every backslash: the prefix r or R. By putting this prefix immediately before the opening quotation mark, we tell Python that this is a raw string ('r' stands for raw). In a raw string, backslashes are kept exactly as typed instead of being treated as the start of an escape character.

>>> print(r"C:\new folder")
C:\new folder

You can easily assign strings to variables, too.

>>> spam = r"C:\new folder"
>>> print(spam)
C:\new folder
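The three spellings of the folder name can be compared side by side. This is only a sketch; the names normal, raw, and broken are invented for the example.

```python
normal = "C:\\new folder"   # doubled backslash
raw = r"C:\new folder"      # raw string, backslash kept as typed
broken = "C:\new folder"    # \n is read as a single linefeed character

print(normal == raw)            # True
print(len(raw), len(broken))    # 13 12
```

The broken version is one character shorter because the backslash and the n were merged into one linefeed.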

Newlines

Now, let's say you want to print out some multi-line text. You could do it like this.

>>> print("Heya!\nHi!\nHello!\nWelcome!")
Heya!
Hi!
Hello!
Welcome!

A string like that could grow really long, but we can use an easy trick that lets us spread the text across multiple lines without cramming it all onto one line. To do this we use three quotations (""" or ''') to start and end a string.

>>> print("""
... Heya!
... Hi!
... Hello!
... Welcome!
... """)

Heya!
Hi!
Hello!
Welcome!

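That made things a lot easier. But we can still do better. By adding a backslash (\) right after the opening quotations, we can remove the first linefeed. Closing the string on the last line of text removes the trailing linefeed as well.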



>>> print("""\
... Heya!
... Hi!
... Hello!
... Welcome!""")
Heya!
Hi!
Hello!
Welcome!




Some of you may have noticed that print() automatically ends its output with an extra linefeed (\n). There is a way to bypass this.

>>> print("I love Wikiversity!", end="")
I love Wikiversity!>>>




A useful way to span a string across multiple lines without inserting linefeeds is to place adjacent string literals inside parentheses.

>>> spam = ("Hello,"
... " world!")
>>> print(spam)
Hello, world!
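The same trick scales to longer text. The sketch below uses an invented variable, message, and checks that no linefeeds sneak into the result.

```python
# Each line holds its own string literal; Python joins the
# adjacent literals into a single string.
message = (
    "Adjacent string literals inside parentheses "
    "are joined into one string, "
    "with no linefeeds added."
)

print(message)
print(message.count("\n"))  # 0
```

Note that each literal needs its own trailing space, since nothing is inserted between the pieces.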


Formatting
Strings in Python can be subjected to special formatting, much like strings in C. Formatting makes it easier to produce well-formatted output. You can format a string using a percent sign (%) or you could use the newer curly brackets ({}) formatting. A simple example is given below.

>>> print("The number three (%d)." % 3)
The number three (3).






The above code uses a special format character (%d), which is replaced with an integer written in decimal. The value after the percent sign (%) that follows the string is what replaces the format character. That can be a lot to take in. Let's demonstrate this a couple more times.

>>> name = "I8086"
>>> print("Copyright (c) %s 2014" % name)
Copyright (c) I8086 2014




This time, we use a different type of format (%s) that inserts a string. You'll need to do some extra work if the string contains more than one format character.

>>> name = "I8086"
>>> date = 2014
>>> print("Copyright (c) %s %d" % (name, date))
Copyright (c) I8086 2014




Notice the need for parentheses and the comma. If we don't add the parentheses around the format arguments, only name is passed to the % operator (date becomes a second argument to print()), and we get an error.

>>> name = "I8086"
>>> date = 2014
>>> print("Copyright (c) %s %d" % name, date)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
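The curly bracket style mentioned at the start of this section works along the same lines. Here is a minimal sketch of the copyright example written three ways; the variable names are made up for illustration, and the last form (an f-string) assumes Python 3.6 or later.

```python
name = "I8086"
date = 2014

# Percent style: the values go in a tuple after the % operator.
percent_style = "Copyright (c) %s %d" % (name, date)

# Curly bracket style: each {} is filled by the next argument to format().
format_style = "Copyright (c) {} {}".format(name, date)

# f-string: the variable names go directly inside the brackets.
fstring_style = f"Copyright (c) {name} {date}"

print(format_style)                                      # Copyright (c) I8086 2014
print(percent_style == format_style == fstring_style)    # True
```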




To keep you from guessing what is what, here is a table of the format types with a little information about each. The table lists the types accepted by the curly brackets ({}) style of formatting; the percent sign (%) style accepts most of them, but not b or n.

Type	Meaning
s	String format. Default for formatting.
b	Binary format.
c	Converts an integer to a Unicode character before it is formatted.
d	Decimal format.
o	Octal format.
x	Hexadecimal format. Uses lowercase for a-f.
X	Hexadecimal format. Uses uppercase for A-F.
n	Number format. This is the same as 'd', except that it uses the current locale setting to insert the appropriate number separator characters.^[2]
e	Exponent notation. Prints a number in scientific notation. Default precision is 6.
E	Exponent notation. Same as 'e', except it prints 'E' in the notation.
f	Fixed point. Displays a fixed-point number. Default precision is 6.
F	Fixed point. Same as 'f', but converts nan to NAN and inf to INF.^[3]
g	General format. Uses fixed-point or exponent notation depending on the size of the number.
G	General format. Same as 'g', except it switches to 'E' if numbers are too large.
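To see a few of these format types in action, here is a short sketch using the built-in format() function and the curly bracket style. The variable names are chosen only for this example.

```python
# The same integer shown in different number bases.
as_binary = format(255, "b")
as_octal = format(255, "o")
as_hex_lower = format(255, "x")
as_hex_upper = format(255, "X")
print(as_binary, as_octal, as_hex_lower, as_hex_upper)  # 11111111 377 ff FF

# An integer converted to its Unicode character.
as_char = format(65, "c")
print(as_char)  # A

# Exponent and fixed-point notation, both with the default precision of 6.
as_exponent = format(1234.5, "e")
as_fixed = format(1234.5, "f")
print(as_exponent)  # 1.234500e+03
print(as_fixed)     # 1234.500000

# Inside curly brackets, the type follows a colon; .2 sets the precision.
rounded = "{:.2f}".format(3.14159)
print(rounded)  # 3.14
```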






Indexing
Strings in Python support indexing, which allows you to retrieve part of the string. It would be better to show you some indexing before we actually tell you how it's done, since you'll grasp the concept more easily.

>>> "Hello, world!"[1]
'e'
>>> spam = "Hello, world!"
>>> spam[1]
'e'




By putting the index number inside brackets ([]), you can extract a character from a string. But which numbers correspond to which characters? Indexing in Python starts at 0, so the maximum index of a string is one less than its length. Let's try to index a string beyond its limits.

>>> spam = "abc"
>>> spam[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range




Here's a little chart of the character positions in "Hello, world!".

Index	0	1	2	3	4	5	6	7	8	9	10	11	12
Character	H	e	l	l	o	,	(space)	w	o	r	l	d	!



Hopefully the chart above helped to visually clarify some things about indexing. Now that we know the formula for the index of the last character in a string (its length minus one), we should be able to get that character.

>>> eggs = "Hello, world!"
>>> eggs[len(eggs)-1]
'!'







 
 Note: In Python, and most languages, the length of a string is measured by how many characters are contained within the string. The string "abc" is only 3 characters long.




In the above code, we used the formula, string length minus one, to get the last character of a string. By using the built-in function len(), we can get the length of a string. In this instance, len() returns 13, from which we subtract 1, resulting in 12. This can be tedious when you need to repeat it over and over again. Luckily, Python has a special indexing method that allows you to get the last character of a string without needing to know the string's length. By using negative numbers, we can index from right to left instead of left to right.

>>> spam = "I love Wikiversity!"
>>> spam[-1]
'!'
>>> spam[-2]
'y'







 
 Note: Because -0 is still 0 in Python, negative indexing starts at -1. For the 19-character string above, spam[-19] is 'I', while spam[-18] is the space (' ') that follows it.




The table below shows the negative index that corresponds to each character. Take some time to study the table.

Index	-19	-18	-17	-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1
Character	I	(space)	l	o	v	e	(space)	W	i	k	i	v	e	r	s	i	t	y	!
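The link between the two numbering schemes can be checked directly: the negative index -k points at the same character as the length minus k. The loop below is just a sketch, and the names n and k are arbitrary.

```python
spam = "I love Wikiversity!"
n = len(spam)   # 19

# Every negative index -k matches the positive index n - k.
for k in range(1, n + 1):
    assert spam[-k] == spam[n - k]

print(spam[-1], spam[n - 1])   # ! !
print(spam[-n], spam[0])       # I I
```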



It is important that you understand that strings are immutable, which means that their content cannot be changed in place. Immutable data types have a fixed value that cannot change. The only way to change the value a variable holds is to re-assign the variable to a new string.

>>> spam = "Hello,"
>>> spam = spam + " world!"
>>> spam
'Hello, world!'




From the above example, spam is re-assigned to a different value. So what does this have to do with indexing? Well, the same rules apply to indexing, so none of the indexes can be assigned a new value, nor can they be manipulated. The example below will help clarify this concept.

>>> spam = "Hello, world!"
>>> spam[3] = "y"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> spam[7] = " Py-"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment




Re-assigning a string variable while replacing part of the string takes a little extra work with slicing. If you aren't familiar with slicing, it is taught in the next section. You'll probably want to come back to this after you're done reading that section.

>>> spam = "Hello, world!"
>>> spam = spam[:2] + "y" + spam[3:]
>>> spam
'Heylo, world!'
>>> spam = "Hello, world!"
>>> spam = spam[:6] + " Py-" + spam[7:]
>>> spam
'Hello, Py-world!'
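The slice-and-join pattern can be wrapped in a small helper. The function below, replace_at, is a hypothetical name invented for this sketch and is not part of Python; it assumes the index is not negative.

```python
def replace_at(text, index, replacement):
    """Return a new string with the character at `index` replaced.

    Assumes index is zero or positive; negative indexes are not handled.
    """
    # Everything before the index, the new text, then everything after it.
    return text[:index] + replacement + text[index + 1:]

spam = "Hello, world!"
print(replace_at(spam, 6, " Py-"))   # Hello, Py-world!
print(spam)                          # Hello, world!
```

Because strings are immutable, the helper builds and returns a new string; the original spam is left untouched.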


Slicing
Slicing is an important concept that you'll be using in Python. Slicing allows you to extract a substring from a string. A substring is a part of a string, so "I", "love", and "Python" are all substrings of "I love Python.". When you slice in Python, you'll need to remember that the colon (:) is important. It would be better to show you, rather than tell you right away, how to slice strings.

>>> spam = "I love Python."
>>> spam[0:1]
'I'
>>> spam[2:6]
'love'
>>> spam[7:13]
'Python'




As you can see, slicing builds on Python's indexing concepts, which were taught in the previous section. spam[0:1] gets the substring starting with the character at index 0 and stopping just before the character at index 1. So the first number is where the slice starts and the number after the colon (:) is where it ends; the character at the end position is not included.
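Because the end position is left out, the length of a slice is simply the end number minus the start number. Here is a small sketch of that rule; the names start, end, and word are chosen for the example.

```python
spam = "I love Python."
start, end = 2, 6
word = spam[start:end]

print(word)                       # love
print(len(word) == end - start)   # True

# The character at position `end` is the space after "love";
# it is not part of the slice.
print(repr(spam[end]))            # ' '
```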

Now slicing like this can be helpful in some situations, but what if you'd like to get just the first few characters of a string, or everything after them? We could use the len() function to help us, but there is an easier way. By omitting one of the numbers in the slice, it will slice from the beginning or to the end, depending on which number was omitted.

>>> eggs = "Hello, world!"
>>> eggs[:6]
'Hello,'
>>> eggs[6:]
' world!'




By slicing like this, we can remove or get part of a string without needing to know its length. As the example below shows, eggs[:6] and eggs[6:] joined together are equal to eggs. Because the end position of a slice is not included, the same character never ends up in both pieces.

>>> eggs = "Hello, world!"
>>> eggs[:6]+eggs[6:]
'Hello, world!'
>>> eggs[:6] + eggs[6:] == eggs
True
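The same holds for any split position, not just 6. The helper split_at below is a made-up name for this sketch; it cuts a string in two and the check confirms that the pieces always join back into the original.

```python
def split_at(text, position):
    """Split `text` into two pieces at `position` (hypothetical helper)."""
    return text[:position], text[position:]

eggs = "Hello, world!"
head, tail = split_at(eggs, 5)
print(repr(head), repr(tail))   # 'Hello' ', world!'

# Try every position from 0 up to and including len(eggs).
always_equal = all(
    "".join(split_at(eggs, i)) == eggs
    for i in range(len(eggs) + 1)
)
print(always_equal)             # True
```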


Positions beyond the end of a string are handled differently when indexing and when slicing. Attempting to index a string with a number larger than (or equal to) its length produces an IndexError.

>>> "Hiya!"[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range




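While slicing, this kind of error is suppressed: positions beyond the end of the string are treated as the end, so the slice returns whatever characters exist in that range, possibly an empty string ('').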

>>> "Hiya!"[10:]
''
>>> "Hiya!"[10:11]
''
>>> "Hiya!"[:10]
'Hiya!'
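This forgiving behaviour can be put to use. As a sketch, the hypothetical helper char_at below fetches one character with a slice instead of an index, so it returns an empty string rather than raising an error. It assumes the index is not negative.

```python
def char_at(text, index):
    """Return the character at `index`, or '' if `index` is past the end.

    Assumes index is zero or positive; negative indexes are not handled.
    """
    # A one-character slice never raises IndexError.
    return text[index:index + 1]

print(repr(char_at("Hiya!", 1)))    # 'i'
print(repr(char_at("Hiya!", 10)))   # ''
```

Whether a silent empty string is better than an error depends on the program; an IndexError is often the more helpful signal that something went wrong.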





Encoding
So we know what a string is and how it works, but what really is a string? Underneath, every character is stored as a number, and an encoding is the set of rules that maps characters to numbers and back. The most prominent encodings are ASCII and Unicode. ASCII is a simple encoding that covers the unaccented Latin letters along with digits, punctuation, and a few symbols such as the dollar sign. Unicode is a much larger standard that defines well over 100,000 characters. The purpose of Unicode is to create one standard that can contain all of the world's alphabets, characters, and scripts.

In Python 3, strings are Unicode by default. This means we can put almost any character into a string and have it print correctly, provided the terminal can display it. This is great news for languages other than English, since ASCII doesn't permit many types of characters. In fact, ASCII only defines 128 characters! Here are some examples using different languages, some with non-Latin characters.

>>> print("Witaj świecie!")
Witaj świecie!
>>> print("Hola mundo!")
Hola mundo!
>>> print("Привет мир!")
Привет мир!
>>> print("שלום עולם!")
שלום עולם!
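The difference between characters and their encoded form can be seen with the built-in ord() function and the string method encode(). This is a minimal sketch; the names word, utf8, and ascii_ok are invented for the example, and UTF-8 is used here as one common way of encoding Unicode text.

```python
word = "świecie"

# A string is measured in characters, not bytes.
print(len(word))               # 7

# Encoding turns the characters into bytes. In UTF-8 the letter "ś"
# takes two bytes, so the encoded form is one byte longer.
utf8 = word.encode("utf-8")
print(len(utf8))               # 8
print(utf8.decode("utf-8") == word)   # True

# ord() gives the number behind a character.
print(ord("A"))                # 65, within ASCII's range of 0 to 127
print(ord(word[0]))            # 347, outside ASCII's range

# ASCII has no number for "ś", so encoding to ASCII fails.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                # False
```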

Assignments

Completion status: Almost complete, but you can help make it more thorough.

References

This article is issued from Wikiversity - version of the Wednesday, April 06, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.