Python String Manipulation and Regular Expressions
Formatting Strings: Adjusting Case
upper
lower
capitalize
title
swapcase
Formatting Strings: Adjusting Case
Examples
>>> fox = "tHe qUICk bROWn fOx"
>>> fox.upper()
'THE QUICK BROWN FOX'
>>> fox.title()
'The Quick Brown Fox'
>>> fox.swapcase()
'ThE QuicK BrowN FoX'
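The examples above do not show lower or capitalize. A minimal sketch of those two methods, plus a simple case-insensitive comparison built on lower; the variable names are illustrative only.

```python
fox = "tHe qUICk bROWn fOx"

lowered = fox.lower()            # every cased character becomes lower case
capitalized = fox.capitalize()   # first character upper case, the rest lower case

# Normalizing case on both sides gives a simple case-insensitive comparison.
same_text = fox.lower() == "The Quick Brown Fox".lower()

print(lowered)       # the quick brown fox
print(capitalized)   # The quick brown fox
print(same_text)     # True
```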
Formatting Strings: Adding and Removing Spaces
strip
rstrip
lstrip
center
ljust
rjust
Formatting Strings: Removing Spaces
Example:
>>> line = ' stuff '
>>> line.strip()
'stuff'
The strip methods can also remove arbitrary characters:
>>> num = "000000000042"
>>> num.strip('0')
'42'
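One detail worth illustrating: the argument to strip, lstrip, and rstrip is treated as a set of characters, not as an exact prefix or suffix. The sketch below uses a hypothetical file name to show the difference.

```python
name = "report.txt"   # hypothetical file name

# rstrip removes any trailing characters found in the set {'.', 't', 'x'},
# so it also removes the final 't' of "report".
stripped = name.rstrip(".txt")

# To remove an exact suffix, check for it and slice it off instead.
suffix = ".txt"
trimmed = name[:-len(suffix)] if name.endswith(suffix) else name

print(stripped)   # repor
print(trimmed)    # report
```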
Formatting Strings: Adding Spaces
Examples:
>>> line = 'stuff'
>>> line.center(20)
'       stuff        '
>>> line.ljust(20)
'stuff               '
>>> '42'.rjust(10, '0')
'0000000042'
Finding and Replacing Substrings
find
rfind
index
rindex
endswith
startswith
replace
Finding and Replacing Substrings
Examples:
>>> line = 'Hello world'
>>> line.find('world')
6
>>> line.rfind('o')
7
>>> line.startswith('He')
True
>>> line.replace('o', '---')
'Hell--- w---rld'
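find and index differ only in how they report a missing substring, which the examples above do not show. A short sketch of that difference, along with the tuple form of endswith:

```python
line = "Hello world"

# find returns -1 when the substring is absent.
missing = line.find("bear")

# index raises ValueError in the same situation.
try:
    line.index("bear")
    raised = False
except ValueError:
    raised = True

# endswith (like startswith) accepts a tuple of candidates.
ends_ok = line.endswith(("world", "moon"))

print(missing, raised, ends_ok)   # -1 True True
```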
Splitting and Partitioning Strings
partition
rpartition
split
splitlines
join
Splitting and Partitioning Strings
Examples:
>>> line = 'The quick brown fox jumped over a lazy dog'
>>> line.partition('fox')
('The quick brown ', 'fox', ' jumped over a lazy dog')
>>> line.split()
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
>>> '--'.join(['1', '2', '3'])
'1--2--3'
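rpartition and splitlines are listed but not demonstrated. The following sketch covers both and shows that split and join undo each other when given the same separator; the file name and text are made up for illustration.

```python
path = "archive.tar.gz"                  # hypothetical file name
head, sep, tail = path.rpartition(".")   # split at the last '.'

text = "first line\nsecond line\n"
lines = text.splitlines()                # no trailing empty string

# split and join are inverses when the same separator is used.
fields = "1--2--3".split("--")
rejoined = "--".join(fields)

print(head, tail)   # archive.tar gz
print(lines)        # ['first line', 'second line']
print(fields)       # ['1', '2', '3']
print(rejoined)     # 1--2--3
```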
format Basics
The format method requires that the string have format fields indicated with curly braces.
The format fields are replaced with the objects passed into the format method.
Example:
>>> '{} and {}'.format('Alice', 'Bob')
'Alice and Bob'
format Arguments
The format fields can be replaced by position or keyword
Position example:
>>> '{1} and {0}'.format('Alice', 'Bob')
'Bob and Alice'
Keyword example:
>>> '{a} and {b}'.format(a='Alice', b='Bob')
'Alice and Bob'
Mixed example:
>>> '{b} and {0}'.format('Alice', b='Bob')
'Bob and Alice'
Formatting Example with format
>>> for x in range(1, 11):
...     print('{:2} {:3} {:4}'.format(x, x**2, x**3))
...
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000
format Specifiers
An optional ':' and format specifier can follow the format field name.
Format specifiers enable greater control over formatted values.
Syntax (all are optional):
{:[1][2][3][4][5][6][7][8]}
1. fill and alignment
2. sign
3. '#' – alternate number form
4. '0' – sign-aware zero padding
5. width
6. grouping
7. '.' followed by an integer – precision
8. type
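The options can be combined as long as they appear in the order listed. A small sketch with two combined specifiers; the values are arbitrary.

```python
value = 1234.5

# fill '*', align '>', sign '+', width 12, grouping ',', precision .2, type f
combined = "{:*>+12,.2f}".format(value)

# '#' alternate form, '0' padding, width 10, type x
hex_padded = "{:#010x}".format(255)

print(combined)     # ***+1,234.50
print(hex_padded)   # 0x000000ff
```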
format Specifiers: Width Option
The width option is a positive integer that specifies the minimum field width; shorter values are padded
If the width is smaller than the length of the value, the value is left unchanged (it is never truncated)
Example:
>>> '{:10}'.format('hello')
'hello     '
>>> '{:2}'.format('hello')
'hello'
format Specifiers: Fill and Alignment
Alignment options:
'<' : left align
'>' : right align
'=' : pad after sign (if any) but before digits
'^' : center align
A fill character can optionally be specified before the alignment character
Example:
>>> '{:^11}'.format('hello')
'   hello   '
>>> '{:>11}'.format('hello')
'      hello'
>>> '{:*>11}'.format('hello')
'******hello'
>>> '{:-^11}'.format('hello')
'---hello---'
format Specifiers: Sign Option
The sign option is only valid for number types
'+' : sign should be used for both positive and negative numbers
'-' : sign should be used for only negative numbers (default)
' ' (space): leading space for positive numbers and minus sign for negative numbers
Examples
>>> '{:+}'.format(123)
'+123'
>>> '{: }'.format(123)
' 123'
>>> '{:+}'.format(1.414)
'+1.414'
format Specifiers: #
The # option causes a type-specific "alternate form" to be used for the conversion
The # option can only be used for integer, float, complex, and Decimal types
Example
>>> # print 123 in binary
>>> '{:b}'.format(123)
'1111011'
>>> '{:#b}'.format(123)
'0b1111011'
format Specifiers: 0
The 0 (preceding the width option) enables sign-aware zero padding for numeric types
Example
>>> '{:010}'.format(123)
'0000000123'
>>> '{:010}'.format(1.414)
'000001.414'
format Specifiers: Grouping Option
The grouping option specifies a character for separating thousands in numbers
The grouping option can be either '_' or ','
Example
>>> '{:_}'.format(1000000)
'1_000_000'
>>> '{:,}'.format(1000000)
'1,000,000'
format Specifiers: Precision Option
The precision option specifies how many digits are displayed:
after the decimal point for fixed point floating point values
in total (significant digits, before and after the decimal point) for general floating point values
Example
>>> import math
>>> x = math.sqrt(2)
>>> '{:.2}'.format(x)
'1.4'
>>> '{:.2f}'.format(x)
'1.41'
>>> '{:.6}'.format(x)
'1.41421'
>>> '{:.6f}'.format(x)
'1.414214'
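The difference between the two precision rules is easier to see with a larger number. This sketch uses an arbitrary value; it also shows that, for strings, precision acts as a maximum length.

```python
value = 1234.5678

# General format: precision counts significant digits, so three digits
# force exponent notation here.
general = "{:.3}".format(value)

# Fixed point: precision counts digits after the decimal point.
fixed = "{:.3f}".format(value)

# For strings, precision truncates to at most that many characters.
truncated = "{:.3}".format("hello")

print(general)     # 1.23e+03
print(fixed)       # 1234.568
print(truncated)   # hel
```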
format Specifiers: Type Option
String option:
's' : string
Common integer options:
'b' : binary
'c' : character
'd' : decimal
'o' : octal
'x' : hex (lower case)
'X' : hex (upper case)
format Specifiers: Type Option (Continued)
Common float options:
'e' : exponent notation
'E' : exponent notation (upper case E)
'f' : fixed point
'F' : fixed point (upper case NAN and INF)
'g' : general format
'%' : percent (multiply by 100 and append a percent sign)
format Specifiers: Type Option
Integer examples
>>> '{:d}'.format(123)
'123'
>>> '{:b}'.format(123)
'1111011'
>>> '{:X}'.format(123)
'7B'
Floating point examples
>>> '{:f}'.format(1.414)
'1.414000'
>>> '{:E}'.format(1.414)
'1.414000E+00'
>>> '{:%}'.format(1.414)
'141.400000%'
Pattern Matching with Regular Expressions
Regular expressions are a means of flexible pattern matching in strings
The Python interface to regular expressions is the re module
Some re methods:
compile
split
match
search
sub
Pattern Matching with Regular Expressions
Example
>>> import re
>>> regex = re.compile(r'\s+')
>>> line = 'the quick brow fox jumped over a lazy dog'
>>> regex.split(line)
['the', 'quick', 'brow', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
The pattern \s+ matches one or more whitespace characters
Pattern Matching with Regular Expressions
Example: comparing string methods
>>> line.index('fox')
15
>>> regex = re.compile('fox')
>>> match = regex.search(line)
>>> match.start()
15
>>> line.replace('fox', 'BEAR')
'the quick brow BEAR jumped over a lazy dog'
>>> regex.sub('BEAR', line)
'the quick brow BEAR jumped over a lazy dog'
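match is listed among the re methods but not shown. The sketch below contrasts it with search: match only succeeds at the start of the string, while search scans the whole string. The sample sentence is chosen for illustration.

```python
import re

line = "the quick brown fox"
regex = re.compile(r"fox")

at_start = regex.match(line)     # match only succeeds at position 0
anywhere = regex.search(line)    # search scans the whole string

print(at_start)            # None
print(anywhere.start())    # 16
print(anywhere.group())    # fox

# sub accepts a count that limits the number of replacements.
first_only = re.sub(r"o", "0", line, count=1)
print(first_only)          # the quick br0wn fox
```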
Basic Regular Expression Syntax
Simple strings are matched exactly
>>> regex = re.compile('ion')
>>> regex.findall('Great Expectations')
['ion']
Some characters have special meanings:
. ^ $ * + ? { } [ ] \ | ( )
- These characters need to be escaped with a backslash (\) to match them literally
Regular Expression Escape Characters
Example: the 'r' prefix indicates a raw string
>>> regex = re.compile(r'\$')
>>> regex.findall("the cost is $20")
['$']
Example: strings versus raw strings
>>> print('a\tb\tc')
a       b       c
>>> print(r'a\tb\tc')
a\tb\tc
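When the text to match comes from a variable, escaping by hand is error-prone. One option, sketched here with a made-up price string, is the standard library's re.escape, which adds the needed backslashes.

```python
import re

price = "$20.00"   # literal text that contains special characters

# re.escape adds the backslashes needed to match the text literally.
pattern = re.compile(re.escape(price))
found = pattern.findall("was $20.00, now $20100")
print(found)       # ['$20.00']

# Without escaping, '.' matches any single character.
unescaped = re.findall(r"20.00", "20100")
print(unescaped)   # ['20100']
```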
Matching Character Groups
The backslash can be used to give normal characters special meaning
character | description |
---|---|
"\d" | match any digit |
"\D" | match any non-digit |
"\s" | match any whitespace |
"\S" | match any non-whitespace |
"\w" | match any alphanumeric char (or underscore) |
"\W" | match any non-alphanumeric char |
Example
>>> regex = re.compile(r'\w\s\w')
>>> regex.findall('the fox is 9 years old')
['e f', 'x i', 's 9', 's o']
Matching Custom Character Groups
The square brackets specify a set of characters
>>> regex = re.compile('[aeiou]')
>>> regex.split('consequential')
['c', 'ns', 'q', '', 'nt', '', 'l']
The dash can be used to specify a range
>>> regex = re.compile('[A-Z][0-9]')
>>> regex.findall('1043879, G2, H6')
['G2', 'H6']
Matching Repeated Characters
Curly braces with a number specify repetition
>>> regex = re.compile(r'\w{3}')
>>> regex.findall('The quick brown fox')
['The', 'qui', 'bro', 'fox']
Matching Repeated Characters
character | description |
---|---|
? | match zero or one |
* | match zero or more |
+ | match one or more |
{n} | match n repetitions |
{m,n} | match between m and n repetitions |
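Each row of the table can be tried against the same input. The sketch below applies the quantifiers to the letter u in a made-up string of spelling variants.

```python
import re

text = "color colour colouur"

zero_or_one = re.findall(r"colou?r", text)
zero_or_more = re.findall(r"colou*r", text)
one_or_more = re.findall(r"colou+r", text)
exactly_two = re.findall(r"colou{2}r", text)
one_to_two = re.findall(r"colou{1,2}r", text)

print(zero_or_one)    # ['color', 'colour']
print(zero_or_more)   # ['color', 'colour', 'colouur']
print(one_or_more)    # ['colour', 'colouur']
print(exactly_two)    # ['colouur']
print(one_to_two)     # ['colour', 'colouur']
```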
Matching Repeated Characters
Example: email address matcher
email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
"[\w.]+" : one or more alphanumeric characters or periods
"@" : at sign
"\w+" : one or more alphanumeric characters
"\." : period
"[a-z]{3}" : exactly three lower case characters
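A brief usage sketch for the matcher, with made-up addresses. The second call illustrates a limitation of this simple pattern: it allows only one dot after the at sign and exactly three trailing letters, so other address shapes are matched only partially.

```python
import re

email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

# Made-up addresses for illustration.
text = "Contact alice@example.com or bob.smith@example.org for details"
found = email.findall(text)
print(found)     # ['alice@example.com', 'bob.smith@example.org']

# A domain with an extra dot is only partially matched.
partial = email.findall("someone@mail.example.info")
print(partial)   # ['someone@mail.exa']
```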