Set up your local WordPress development environment

This article shows you how to get a local WordPress development environment running on a Mac.

Steps to follow

  1. Download and install MAMP (comes with Apache, MySQL and PHP)
  2. Edit your hosts file
  3. Download WordPress and set up a virtual host in Apache that points to it.
  4. Edit your wp-config.php file
  5. Download Coda and import the WordPress mode for auto-completion of WordPress syntax.

Code auto-completion for WordPress syntax

http://hitchhackerguide.com/2011/02/18/wordpress-syntax-mode-for-panic-coda/

Top Programmer Code Practice

Information Hiding

  1. Promote encapsulation and data hiding
  2. Expose a clean API to achieve loose coupling, which in turn promotes reusability. Modules can then be developed in parallel and fixed in isolation.
  3. Java achieves this through access control (public, protected, package-private, private). A top-level (non-nested) class can only be public or package-private. Public access imposes a maintenance cost: once part of your API is public or protected, you must support it.
  4. If class A is used by only one class B, consider making A a private nested class of B.
  5. Instance fields should never be public. A public mutable field is not thread-safe, and a public immutable field still prevents you from changing its data type later. If you expose a constant, make sure it is a public static final field holding a primitive or a reference to an immutable object. NOTE: A non-empty array is always mutable. Expose an immutable List instead.
  6. One rule restricts your ability to reduce the accessibility of methods: a method that overrides a superclass method cannot have a lower access level in the subclass.
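
The points above can be sketched in a few lines of Java. This is only an illustration: the Palette class, its fields and the Normalizer helper are made-up names, not part of any library. It contrasts a constant array (still mutable) with an unmodifiable List, and hides a helper that only one class uses as a private nested class.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; all names here are illustrative only.
class Palette {
    // Risky: the reference is final, but any caller can still overwrite elements.
    static final String[] UNSAFE_COLORS = {"red", "green", "blue"};

    // Safer: keep the array private and expose an unmodifiable view of it.
    private static final String[] PRIVATE_COLORS = {"red", "green", "blue"};
    static final List<String> COLORS =
            Collections.unmodifiableList(Arrays.asList(PRIVATE_COLORS));

    // Used only by Palette, so it lives here as a private nested class.
    private static final class Normalizer {
        static String normalize(String name) {
            return name.trim().toLowerCase(Locale.ROOT);
        }
    }

    // The only entry point callers need; everything else stays hidden.
    public static boolean contains(String name) {
        return COLORS.contains(Normalizer.normalize(name));
    }
}
```

Calling Palette.COLORS.set(0, "black") throws UnsupportedOperationException, whereas Palette.UNSAFE_COLORS[0] = "black" silently succeeds, which is exactly the problem with exposing arrays.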

Code Against Interface

Promote Immutability

  1. Make its properties private and final.
  2. Take away any API that can change its properties.
  3. Make your class final to prevent subclassing.
  4. An immutable object is thread-safe and can be shared freely.
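
A minimal sketch of the four rules, using a made-up Money class (the class and its fields are assumptions for illustration). Note that an operation that would normally modify the object returns a new instance instead.

```java
// Hypothetical value class used only for illustration.
final class Money {                      // rule 3: final, so it cannot be subclassed
    private final String currency;       // rule 1: private and final
    private final long cents;

    Money(String currency, long cents) {
        this.currency = currency;
        this.cents = cents;
    }

    String currency() { return currency; }
    long cents() { return cents; }

    // Rule 2: no setters. "Changing" the amount returns a new object.
    Money plus(long moreCents) {
        return new Money(currency, cents + moreCents);
    }
}
```

Because a Money instance never changes after construction, it can be handed to several threads without locking (rule 4).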

Favor Composition over Inheritance

  1. It is safe to use inheritance within a package, but not across packages.
  2. It is also safe to use inheritance when extending classes specifically designed and documented for extension.
  3. Interface inheritance is good. The above rules apply to implementation inheritance, because it violates encapsulation: a subclass depends on the implementation details of its superclass, and the superclass may change from release to release. Sometimes a superclass may even break security rules that you impose in the subclass. To avoid these problems altogether, use composition. Combined with a forwarding class, the design stays clean.
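
Here is a small sketch of composition with forwarding. CountingList is a hypothetical wrapper, not a library class: it counts how many elements were ever added. Because it holds a List rather than extending ArrayList, it does not care whether the wrapped list's addAll happens to call add internally, which is the kind of implementation detail that breaks subclasses.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical wrapper that counts every element ever added.
class CountingList<E> {
    private final List<E> delegate;   // composition: "has a" list, not "is a" list
    private int addCount = 0;

    CountingList(List<E> delegate) {
        this.delegate = delegate;
    }

    // Each method adds its own behavior, then forwards to the wrapped list.
    boolean add(E element) {
        addCount++;
        return delegate.add(element);
    }

    boolean addAll(Collection<? extends E> elements) {
        addCount += elements.size();
        return delegate.addAll(elements);
    }

    int size() {
        return delegate.size();
    }

    int getAddCount() {
        return addCount;
    }
}
```

A full forwarding class would implement the List interface and forward every method; only three are shown here to keep the sketch short.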

Scalability vs Performance

Performance measures whether an application can respond to a request within its defined service-level agreement (SLA). Scalability measures how well an application can maintain its performance under increasing load.

  1. Horizontal scaling – To scale your system horizontally, the key problem you need to solve is how to replicate state efficiently. Java serialization is a blunt tool because it replicates the full state (which can be large) even when only one small field has changed. This inefficiency hinders linear scalability. To overcome it, Terracotta uses bytecode instrumentation (BCI) in the Terracotta client to identify exactly which properties of a stateful object changed and then replicates only those properties across the cluster (fine-grained replication). NOTE: Bytecode instrumentation is a process through which an application's behavior can be modified at runtime. Terracotta uses BCI to intercept changes made to objects so that it can identify those changes and send them to the Terracotta server (hmm, so it is not a purely peer-to-peer approach).
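
To get a feel for the cost difference, the sketch below compares the bytes produced by serializing a whole object with the bytes needed to describe one changed field. It is not how Terracotta works internally; CartState, ReplicationCost and the "name=value" delta format are assumptions made up for this comparison.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical session object: one large field and one small counter.
class CartState implements Serializable {
    private static final long serialVersionUID = 1L;
    final byte[] catalogSnapshot = new byte[50_000];
    int itemCount;
}

class ReplicationCost {
    // Bytes sent when the whole object is replicated with Java serialization.
    static int fullStateBytes(Serializable state) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(state);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes sent when only the changed field travels, as made-up "name=value" text.
    static int fieldDeltaBytes(String fieldName, int newValue) {
        return (fieldName + "=" + newValue).getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Changing itemCount and calling fullStateBytes again still ships the 50,000-byte snapshot, while the field-level delta stays at a handful of bytes.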

Session Management – Part 1

Session management is one of the key topics that every serious web developer and architect needs to master. This article goes through several key topics with you. They are:

  • Persistent vs non-persistent web connections – web performance!
  • Concerns about using cookies – security and size limitations
  • Server-side session management challenges in scalable web applications
  • Achieving linear scalability through stateless servers - start moving the session to the client

Today, I will walk through all these topics at a high level. Follow-up articles will develop each topic further if necessary. Let's start!

Persistent vs non-persistent web connection

  1. HTTP is a stateless protocol. Before HTTP 1.1, connections were not persistent by default either: each request made by a Web browser, for an image, an HTML page, or other Web object, was made via a new connection.
  2. HTTP 1.1 made persistent connections (i.e. Keep-Alive) the default, so a Web browser can establish a single connection through which multiple requests can be made.
  3. A persistent connection only saves connection setup cost; the protocol itself remains stateless. So how is state maintained across stateless HTTP requests?
    • Normally, we keep the session on the server side and give the client a session id that links subsequent requests to the same session.
    • Normally, the client (often a web browser) stores the session id in a cookie.
    • However, if cookies are disabled, the session id is normally embedded in the URL (i.e. URL rewriting).
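
The mechanism above can be sketched without any web framework. SessionStore below is a hypothetical in-memory stand-in for what a servlet container does for you; the only detail taken from real containers is the ";jsessionid=" path parameter used for URL rewriting.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory session store; a real servlet container provides this.
class SessionStore {
    private final Map<String, Map<String, Object>> sessions =
            new ConcurrentHashMap<String, Map<String, Object>>();

    // First request: create the session and hand its id back to the client.
    String createSession() {
        String id = UUID.randomUUID().toString();
        sessions.put(id, new ConcurrentHashMap<String, Object>());
        return id;
    }

    // Later requests: look up the session by the id from the cookie or URL.
    // Returns null when the id is missing or unknown.
    Map<String, Object> lookup(String sessionId) {
        return sessionId == null ? null : sessions.get(sessionId);
    }

    // Fallback when cookies are disabled: embed the id in the URL.
    // Assumes the URL has no query string, to keep the sketch short.
    static String rewriteUrl(String url, String sessionId) {
        return url + ";jsessionid=" + sessionId;
    }
}
```

The client only ever holds the id; everything stored through lookup(id) stays on the server.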

Concerns about using cookies

What do we need to pay attention to when we store info in a cookie?

  1. Size limitation and security concerns.
  2. How long can a cookie last? By default, it expires when the browser exits. In Java, you can call cookie.setMaxAge(int) with a large number of seconds if you want the cookie to last a long time. Calling setMaxAge(0) deletes the cookie.
  3. Normally, we don't keep all state info in a cookie, because the information could be sensitive and we cannot protect it once it sits in the client's filesystem. Apart from that, cookies are limited in size. For these two reasons, we normally store only the session id in the cookie and keep the session on the server side. This approach saves bandwidth as well.
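
The lifetime rules can be tried out with java.net.HttpCookie from the standard library. It is used here as a stand-in for the servlet Cookie class so that the sketch runs without a servlet container; CookieLifetimeDemo and the cookie name are illustrative assumptions.

```java
import java.net.HttpCookie;

class CookieLifetimeDemo {
    // Builds a cookie that carries only the session id.
    // maxAgeSeconds: negative = until the browser exits, 0 = delete now,
    // positive = keep for that many seconds.
    static HttpCookie sessionCookie(String sessionId, long maxAgeSeconds) {
        HttpCookie cookie = new HttpCookie("JSESSIONID", sessionId);
        cookie.setMaxAge(maxAgeSeconds);
        cookie.setHttpOnly(true); // keep page scripts from reading the id
        return cookie;
    }
}
```

For example, sessionCookie("abc", 60L * 60 * 24 * 30) asks the browser to keep the id for 30 days, while sessionCookie("abc", 0).hasExpired() is true immediately.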

Server-side session management challenges

At first glance, keeping the session on the server side sounds like a great solution. However, it always raises concerns when it comes to scale. Imagine you need to replicate client session state across multiple servers to achieve high availability. Both the replication time and the memory limits will keep your system from scaling linearly. To solve or minimize this, we are selective about what we store in the session, use sticky sessions to avoid replicating every session across all the machines, or even try to store the state on the client where possible, for example in a rich client UI (e.g. Flex and Silverlight). A post will be written about this topic later on.

Transient vs Persistent State

  1. A session on the server can time out (after ~30 minutes of inactivity).
  2. A session on the server can be persisted to a file across Tomcat restarts.
  3. Persistent state should be stored in a database.
  4. Objects put in the session should be Serializable.
  5. Avoid putting too much info in the session because we don't want to carry too much baggage during session replication. One server crash caused by memory depletion can spread to other servers via session replication. Not good! Should we reconsider storing the session on the client? This article talks about it.
  6. Session replication is needed to support failover. Sticky sessions are simpler but suffer data loss when the box goes down. We can designate one or two servers as backups to avoid losing the session. To take the sticky session approach, we need to identify the "sticky" part: what can we use to link separate requests? Using the IP address can potentially overload a box because some Internet service providers use a set of proxy servers to handle many clients. This subject can be developed further. We will come back to it later!
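
One possible answer to the "sticky part" question is to route on the session id instead of the IP address, so clients behind the same proxy still spread out. The sketch below is an assumption of how such a router could look, not a description of any particular load balancer; StickyRouter and the server names are made up.

```java
import java.util.List;

// Hypothetical load-balancer helper that pins each session to one backend.
class StickyRouter {
    private final List<String> servers;

    StickyRouter(List<String> servers) {
        this.servers = servers;
    }

    // The same session id always maps to the same server
    // for as long as the server list is unchanged.
    String primaryFor(String sessionId) {
        return servers.get(indexFor(sessionId));
    }

    // The next server in the list acts as the backup that holds a replica,
    // so the session survives if the primary goes down.
    String backupFor(String sessionId) {
        return servers.get((indexFor(sessionId) + 1) % servers.size());
    }

    private int indexFor(String sessionId) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(sessionId.hashCode(), servers.size());
    }
}
```

With this layout, a session is replicated to one backup rather than to every machine, which keeps the replication traffic proportional to the number of sessions, not the number of servers.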

Flex Hacking Series Part 1 – Event Model

Event Model

Event Flow

The idea is simple. Here is the regular event flow: a user interacts with the UI, an event is generated and broadcast via the event dispatcher (bubbling up the display hierarchy if enabled), registered listeners capture it, and a set of actions is taken in response. To understand it in a bit more detail, you can check this article and play with its demo. In short, under the hood, it has 3 phases: capturing, targeting and bubbling. The display list always starts from the Stage as the root, then the SystemManager, then your Application. The event is created by the Flash Player, travels down from the Stage to the target component, and then bubbles back up to the Stage (if enabled). Along this round trip, it triggers the registered event listeners. Most components communicate with others using events, which convey useful information and data, but only visual components (objects in the display list) can participate in the event flow described above.

In practice, you normally only need to register a listener on the target component for a particular event that it dispatches. You seldom need to understand the event flow above in detail. However, if you want to exercise more control over an event, such as stopping it from propagating, you need to understand how it works first.
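
The three phases are easier to see in code. The following is a language-neutral model written in Java, not the Flash or Flex API: DisplayNode and EventFlowModel are made-up classes, and the stopsPropagation flag simply pretends that a listener on that node stopped the event during the target or bubbling phase.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a display object; this is not the Flash/Flex API.
class DisplayNode {
    final String name;
    final DisplayNode parent;
    boolean stopsPropagation = false; // pretend a listener here stops the event

    DisplayNode(String name, DisplayNode parent) {
        this.name = name;
        this.parent = parent;
    }
}

class EventFlowModel {
    // Returns the phases in the order the nodes would see the event.
    static List<String> dispatch(DisplayNode target, boolean bubbles) {
        // Ancestors of the target, root (Stage) first.
        List<DisplayNode> ancestors = new ArrayList<DisplayNode>();
        for (DisplayNode n = target.parent; n != null; n = n.parent) {
            ancestors.add(0, n);
        }

        List<String> seen = new ArrayList<String>();
        // Phase 1: capturing, from the root down to the target's parent.
        for (DisplayNode n : ancestors) {
            seen.add("capture:" + n.name);
        }
        // Phase 2: targeting.
        seen.add("target:" + target.name);
        if (!bubbles || target.stopsPropagation) {
            return seen;
        }
        // Phase 3: bubbling, from the target's parent back up to the root.
        for (int i = ancestors.size() - 1; i >= 0; i--) {
            DisplayNode n = ancestors.get(i);
            seen.add("bubble:" + n.name);
            if (n.stopsPropagation) {
                break; // nodes further up never see the event
            }
        }
        return seen;
    }
}
```

For a Button inside Application inside SystemManager inside Stage, a bubbling event visits Stage, SystemManager and Application on the way down, then Button, then Application, SystemManager and Stage on the way back up.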

Cancel/ Stop the event

Within Flex, by default, an event you construct does not bubble, so it does not travel up beyond the component that dispatches it. If you want the event to reach the parent, the parent's parent, and so on all the way up your component chain in the display object hierarchy, tell the event to bubble when you construct it. Any listener along the chain can halt the flow with stopPropagation() or stopImmediatePropagation(). The difference between the two is that stopImmediatePropagation() not only prevents the event from moving to the next node, but also prevents any other listeners on the current node from receiving it. Note that the cancelable flag set during event construction governs preventDefault(), described next, rather than propagation.

Some events have an associated default behavior. For example, the doubleclick event has an associated default behavior that highlights the word under the mouse pointer at the time of the event. Your event listener can cancel this behavior by calling the preventDefault() method. 

Define your own Custom Event

Now you understand the event flow. Next, you need to know how to create a custom event to carry additional info. This article helps you to achieve this goal.

Steve, how could you stop Flash on your gadgets?

For many Flash/Flex programmers, it may be bad news that Steve Jobs openly banned Flash on his devices like the iPad. His message clearly stated that HTML5 can replace the rich experience of Flash and that it will be our future. He may be right about the cons of Flash. Nothing is perfect. However, I am surprised that he went a step further and banned Flash totally. Why couldn't he simply give users an option to turn it off if they want? I really doubt his intent. Whenever I see something like that, it only reminds me of what Microsoft did in the old days.

In fact, his TOS states clearly that the most important reason for Apple to ban Flash is that Apple doesn't want a third-party layer of software to come between the platform and the developer.

Allowing Flash — which is a development platform of its own — would just be too dangerous for Apple, a company that enjoys exerting total dominance over its hardware and the software that runs on it. Flash has evolved from being a mere animation player into a multimedia platform capable of running applications of its own. That means Flash would open a new door for application developers to get their software onto the iPhone: Just code them in Flash and put them on a web page. In so doing, Flash would divert business from the App Store, as well as enable publishers to distribute music, videos and movies that could compete with the iTunes Store – Brian on wired.com

 
Putting aside Apple's intent, as a consumer, I don't think I want to carry a bigger iPhone (i.e. an iPad) that doesn't give me the full web experience. I don't care how HTML5 turns out; I need to view Flash content now, because it could take years if not decades for Flash to be eliminated from the Web (I doubt this will happen!). If you really like the iPad but still want Flash on it, here is good news: you can now install Frash to get around the ban. Below is a video that shows you how to install it and get it working on your iPad.

Learning Hive

Starting to learn Hive

As I mentioned in my last article, I was getting excited about the potential of Hive. Today, I decided to start my journey to learn it. I found a great introductory video that gives you a nice warm-up on using Hive (basic knowledge of how Hadoop and MapReduce work will help you digest the material).

Below are some highlights from this video

Hive is an SQL interface built on top of Hadoop. It supports Web access and JDBC. I am amazed at how close its syntax is to regular SQL for an RDBMS. Below are some of the SQL statements used in this tutorial.

-- ---------- Set up your tables in Hive ----------
SHOW TABLES;

CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

DESCRIBE shakespeare;

-- ---------- Load data into the Hive table from Hadoop HDFS ----------
LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare;

-- ---------- Query the data using the Hive SQL interface ----------
select * from shakespeare limit 10;
select * from shakespeare where freq > 100 sort by freq asc limit 10;
select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

-- show me the plan
explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;

-- ---------- Create a merged table and populate it by joining 2 different tables ----------
-- The column names come from the query below; the INT types are an assumption.
-- The kjv table is assumed to have been created and loaded like shakespeare.
create table merged (word STRING, shake_f INT, kjv_f INT);
insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word);

-- ---------- Query the merged table ----------
select word, shake_f, kjv_f, (shake_f+kjv_f) as ss from merged sort by ss limit 20;

To prepare the data for Hive to load, the demo runs another MapReduce job. Remember to delete the job logs before loading the Hive table.

hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep input shakespeare_freq '\w+'

# remove the MapReduce job logs
hadoop fs -rmr shakespeare_freq/_logs

Large-scale data processing systems are often I/O bound, so in a MapReduce job your mappers spend much of their time waiting for data to load from disk. Hadoop mitigates the problem by loading in parallel from lots of hard drives. However, a single hard drive still maxes out at 75MB/s of read throughput, a physical limit we can do nothing about. To achieve good speed, the key is to minimize the number of Hadoop passes.

Since Hive sits on top of Hadoop's HDFS, it inherits the same restrictions. You cannot UPDATE, DELETE or INSERT individual records as you would in a regular RDBMS. However, you can bulk load new files (data) into a table, and you can delete a file from Hive.

Hive needs to store the tables' metadata outside of HDFS. You can use a regular RDBMS for this job. When you start Hive locally, it looks for a local metastore, so in a distributed environment you may need to centralize the metastore in a remote location. There is a wiki page on the Hive site that documents how to set it up.

See Hive in Action

Cloudera Hadoop Training: Hive Tutorial Screencast from Cloudera on Vimeo.

Other projects similar to Hadoop

  • Parallel databases: Gamma, Bubba, Volcano
  • Google: Sawzall
  • Yahoo: Pig
  • IBM Research: JAQL
  • Microsoft: DryadLINQ, SCOPE
  • Greenplum: YAML MapReduce
  • Aster Data: In-database MapReduce
  • Business.com: CloudBase

Hive on Amazon EC2 cloud

[Figure: ad-serving EC2 Hive system architecture]

 

I once worked for a display ad network company that collected over 400 million impression/click logs per day. Facing this amount of data, my ex-company bought a supercomputer and crossed its fingers that it could handle the growth in both data volume and analytic demand. That is obviously not a scalable solution. So what is the best solution?

Although I no longer work for this company, it is still an interesting problem to solve. A great friend of mine proposed a shared-nothing solution for it: partition the data across a set of PostgreSQL databases and put Greenplum on top of them to parallelize queries. There is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture). I like this approach. The only catch is that Greenplum is not free, and the upfront cost may be hard for a startup to bear. Apart from that, this setup requires all the databases to run on the same network, which kept us from moving it to an elastic cloud like Amazon EC2.

Later on, I joined a great company in the same industry that was looking for a cloud solution to host its data warehouse, so I got a chance to revisit this problem. During the research, I came across an interesting technology: column-based databases (e.g. Infobright and LucidDB). A traditional database stores data in rows and fetches whole rows from its data files into memory, which is inefficient if your query needs only a few columns for its computation. A column-based store instead keeps each column together, and because all values in a column share the same data type, they compress effectively. This solution is great, but it doesn't do MPP (i.e. massively parallel processing) and it is also not ready for the cloud yet.

Here comes another solution: Hive on top of Hadoop on top of the Amazon cloud. It is an interesting idea. Check out this video to learn about it.


If you are not sure what Hadoop is and want a warm-up in massive-scale computing, I suggest you go through the following 5 excellent Google lectures.


Java 5 Features – Enum and Annotation

Intent

I want to summarize some new and interesting Java 5 features in this article and how they change the way I code.

Enum

I used to use int constants to make my life easier because they help avoid typos. However, they have several drawbacks:

  1. Java doesn't provide a namespace for int enum groups. I have to either prefix my constants (like ABC_) or use inner interfaces to organize them.
  2. They are compile-time constants, so their values are compiled into client code and clients must be recompiled whenever a value changes.
  3. There is no easy way to translate int enum constants into printable strings during debugging.
  4. You cannot easily iterate over all the int enum constants in a group.
  5. You need a way to validate that an int is a valid enum value.

Use new enum type in Java 5:

public enum Apple {FUJI, PIPPIN, GRANNY_SMITH}

An enum is a full-fledged final class that exports one instance for each enumeration constant via a public static final field.

  1. A namespace is provided via the enum type name.
  2. You can reorder or add enumeration constants without recompiling clients.
  3. You can translate an enum into a printable string via its toString() method.
  4. The enum type provides a values() method to iterate over your enumeration constants (in declaration order).
  5. Type checking takes care of validation.
  6. You can associate data with each enum constant.
  7. An enum is immutable, serializable and comparable.
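
As a short sketch of points 3, 4 and 6, the Apple enum below carries a piece of data per constant. The priceCents field and its values are made up for illustration, and AppleDemo is a hypothetical helper.

```java
// The Apple enum from above, extended with made-up prices in cents.
enum Apple {
    FUJI(120), PIPPIN(90), GRANNY_SMITH(150);

    private final int priceCents;   // data associated with each constant

    Apple(int priceCents) {
        this.priceCents = priceCents;
    }

    int priceCents() {
        return priceCents;
    }
}

class AppleDemo {
    // values() iterates over the constants in declaration order.
    static int totalPriceCents() {
        int sum = 0;
        for (Apple apple : Apple.values()) {
            sum += apple.priceCents();
        }
        return sum;
    }

    public static void main(String[] args) {
        for (Apple apple : Apple.values()) {
            // toString() returns the constant's name by default.
            System.out.println(apple + " costs " + apple.priceCents() + " cents");
        }
    }
}
```

Passing anything other than an Apple where an Apple is expected is a compile error, which is the type-checking benefit from point 5.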

EnumSet

If the elements of an enumerated type are used primarily in sets, it is traditional to use the int enum pattern, assigning a different power of 2 to each constant, like READ = 1 << 2, WRITE = 1 << 1, EXECUTE = 1 << 0, to represent the permissions on each entity in Unix. This representation lets you use the bitwise OR operation to combine several constants into a set, known as a bit field. The bit field representation also lets you perform set operations such as union and intersection efficiently using bitwise arithmetic. But bit fields have all the disadvantages of int enums mentioned above.

Now the java.util package provides EnumSet to efficiently represent sets of values drawn from a single enum type. This class implements the Set interface and internally uses a bit vector to represent the set of values. For example, if your enum type has 64 or fewer values, the entire EnumSet is represented as a single long, so its performance is comparable to a bit field.

The EnumSet class provides three benefits a normal set does not:

  1. Various creation methods that simplify the construction of a set based on an enum type
  2. Guaranteed ordering of the elements in the set, based on the order in which the enumeration constants are declared
  3. Performance and memory benefits that a regular set implementation cannot come close to
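
Here is a brief sketch of the Unix permission example rewritten with EnumSet. The Permission enum and the PermissionSets helper are illustrative names; the EnumSet calls themselves (noneOf, of, addAll, retainAll) are standard java.util API.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative enum replacing the READ/WRITE/EXECUTE bit flags.
enum Permission { READ, WRITE, EXECUTE }

class PermissionSets {
    // Set union, the EnumSet counterpart of (a | b) on bit fields.
    static EnumSet<Permission> union(Set<Permission> a, Set<Permission> b) {
        EnumSet<Permission> result = EnumSet.noneOf(Permission.class);
        result.addAll(a);
        result.addAll(b);
        return result;
    }

    // Set intersection, the EnumSet counterpart of (a & b) on bit fields.
    static EnumSet<Permission> intersection(Set<Permission> a, Set<Permission> b) {
        EnumSet<Permission> result = EnumSet.noneOf(Permission.class);
        result.addAll(a);
        result.retainAll(b);
        return result;
    }
}
```

Whatever order the constants are added in, iteration follows the declaration order, so union(EnumSet.of(Permission.EXECUTE), EnumSet.of(Permission.READ)) prints as [READ, EXECUTE].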

Annotation

An annotation is a new language feature introduced in J2SE 5.0. Simply put, annotations allow developers to mark classes, methods, and members with secondary information that is not part of the operating code. You can see annotations as a way to extend the Java language.

Before annotations arrived in Java 5, you might use naming patterns to indicate that some program elements, such as methods, demanded special treatment by a tool or a framework. For example, JUnit required its users to name test methods with a pattern like testXXX(). It works, but with some big disadvantages:

  1. Typos go undetected.
  2. It doesn't provide a way to associate parameter values with program elements.

Annotations can solve these problems. To use them, you can:

  1. Create your own marker annotation (@interface is the keyword) or parameterized annotation. A marker annotation has no parameters. You can also annotate the annotation itself (i.e. meta-annotations), for example with @Retention and @Target.
  2. Annotate the program elements.
  3. Write a processor to handle your annotated code. Generally, annotations never change the semantics of the annotated code, but enable it for special treatment by tools. The metadata of a Method now carries the additional info for your job. You can use Method's isAnnotationPresent() to check whether a method is annotated with a certain annotation type. If your annotation carries a parameter, you can use Method's getAnnotation() to get the annotation object and call value() to obtain the parameter.
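
The three steps can be sketched as follows. The @Check annotation, the Sample class and CheckProcessor are hypothetical names invented for this example; the reflection calls (getDeclaredMethods, isAnnotationPresent, getAnnotation) are the standard ones mentioned above.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Step 1: a hypothetical annotation with one optional parameter.
@Retention(RetentionPolicy.RUNTIME)   // meta-annotation: keep it visible to reflection
@Target(ElementType.METHOD)           // meta-annotation: only methods may carry it
@interface Check {
    String value() default "";
}

// Step 2: annotate the program elements.
class Sample {
    @Check("fast") public void first() {}
    @Check public void second() {}
    public void helper() {}            // not annotated, so the processor skips it
}

// Step 3: a processor that finds annotated methods and reads the parameter.
class CheckProcessor {
    static List<String> describe(Class<?> type) {
        List<String> found = new ArrayList<String>();
        for (Method method : type.getDeclaredMethods()) {
            if (method.isAnnotationPresent(Check.class)) {
                Check check = method.getAnnotation(Check.class);
                found.add(method.getName() + "=" + check.value());
            }
        }
        Collections.sort(found);       // getDeclaredMethods() has no guaranteed order
        return found;
    }
}
```

Without @Retention(RetentionPolicy.RUNTIME) the annotation would not be visible through reflection, and describe would return an empty list.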

Reference

Below are some related articles I found useful:

  1. http://www.javalobby.org/java/forums/t16967.html
  2. Annotation in Tiger – Part 1 Meta-Annotation
  3. Annotation in Tiger – Part 2 Custom Annotation

 

 

Create a Virtual Company

Nowadays there are many tools available on the Net, ranging from IM to cloud computing, that certainly lower the barrier for entrepreneurs like us. Today, I am going to list the tools that help me run my company:

Set up virtual office

  1. Skype – Saves you from long-distance bills
    • FREE. For a fee, it can also connect you to regular phone lines.
    • When I am tired of typing, I switch to it.
    • If you like face-to-face conversation, I would use iChat on a Mac. All you need is an AOL account.
  2. Yugma – Web conferencing
    • FREE for up to 20 attendees.
    • It has Skype integration.
  3. Free conferencing
    • FREE
    • Useful when you don't have Internet access or your laptop is not next to you, because it gives you a dedicated line to dial in. However, the number is not TOLL-FREE.
    • Why 712 area code? Check this out
  4. Google tools – all FREE
    • Google Docs (shareable)
    • Google Calendar (it can sync with iCal on my Mac now. If you use Outlook, you need to install a plugin to sync the calendar). Follow this guide to set it up.
    • Google email (have Gmail host your mail – yourname@yourcompany.com)
  5. MediaWiki – good wiki tool for information sharing
  6. Posterous – create your company blog via email :)
  7. iPhone
    • Not FREE
    • I use it to sync email and calendar and to access the Web.
  8. VNC – Remote desktop tool
    • FREE
    • For Mac, download OSXVine Server from here.
    • If you are on DSL with a dynamically assigned IP address, keeping track of it is quite a headache. You can obtain a domain name from DynDNS to shield you from IP address changes.
    • By default, a VNC server listens on port 5900. If you want remote desktop access from outside your subnet, make sure your DSL router opens a port for it and forwards requests to your machine.
    • Here is a web-based VNC viewer. With that, you can do remote desktop from anywhere. Just key in your DynDNS domain and you are done.
    • To make this access secure, you can password-protect your box by configuring the VNC server.
    • Here is a great article for that.
    • Some people use the LogMeIn service, which provides more remote security features. However, it is NOT FREE. To me, a VNC server is a good enough solution.

Build your virtual dev team

  1. Trac – combines wiki, ticket system and project planning in one
    • FREE
    • A bit complicated to set up at a web hosting company
    • It integrates with Subversion as well
    • Bugzilla is pretty good for issue tracking as well
  2. Dreamhost for web hosting
    • Below $10 per month
    • I currently use it to host my own blog and Subversion repository.
    • No Java support yet.
    • For a VPS solution, this one is cheap and my buddy said it is great.
  3. VirtualBox – run several operating systems on your laptop. Very appealing!
    • FREE
    • I set up Ubuntu on my Mac. Full screen, shared folders, shared mouse. I love it.
    • With this, I can ensure all developers work in the same environment. Furthermore, I can have dev, QA and production use the same environment.
  4. OmniGraffle – graphical design tool on Mac
    • NOT FREE but cheap
    • Free stencils are available here.
  5. Amazon AWS
    • Much cheaper than hiring your own team to keep your system running 24×7
    • Cloud computing allows you to scale on demand.
    • Processing power via EC2
    • Storage via S3
    • CDN via CloudFront
    • Messaging via Amazon SQS

 

Powerful Linux Text Processing Commands

Common Text Processing Commands

In our daily life, we deal with lots of data. The data is normally stored in text format so that humans can read it easily. With the large amount of data we have, we need ways to deal with it. There are several things we frequently do with the data: search, filter, sort and analyze. Linux offers some powerful commands for this: cat, grep, find, sort, uniq and so on. I found these commands quite powerful, so I decided to write them down as my reference. In this tutorial I will go over the basic text processing commands and how we use them together to accomplish tasks we often encounter in our workplace.

cat

The power of "cat" is not just printing a file to the screen but concatenating the contents of a list of files and streaming them through a pipe to another program as input.

cat * | sort

find

The power of find is listing the filenames that match file metadata such as type, size, creation date and so on.

grep

"grep" lists the file(s) whose content matches the given regular expression pattern(s). You can use it to search content across the files in your file system.

grep -H -R --color -n -P abc *

option:

  1. --color (highlight the matching part of the content in color)
  2. -n (show line numbers)
  3. -P PATTERN (Perl regular expression pattern)
  4. -R (search recursively)
  5. -l (list only the names of files that match the pattern)
  6. -H (show the name of the file that matched)

cut

"cut" extracts sections from each line of input. The command below extracts the 5th field through the last field from each line of fileA, using a colon as the delimiter.

cut -d ":" -f 5- fileA

option:

  1. -c (character)
  2. -b (byte)
  3. -f 5- (field if the line can be broken down by delimiter)
  4. -d '|' (the delimiter is the pipe character; quote it so the shell does not treat it as a pipe)

sort 

The sort command sorts a file according to fields–the individual pieces of data on each line. By default, sort assumes that the fields are just words separated by blanks, but you can specify an alternative field delimiter if you want (such as commas or colons). Output from sort is printed to the screen, unless you redirect it to a file.

donors.data
Bay Ching 500000 China
Jack Arta 250000 Indonesia
Cruella Lumper 725000 Malaysia

Let's take this sample donors file and sort it by last name. The following shows the command to sort the file on the second field (last name) and the output from the command. The -k 2,2 key is the current spelling of the obsolete +1 -2 form, which recent versions of sort may reject:

sort -k 2,2 donors.data
Jack Arta 250000 Indonesia
Bay Ching 500000 China
Cruella Lumper 725000 Malaysia

If the fields are separated by a character other than blanks, such as a comma or a colon, use -t to tell sort the delimiter. You can also use -u to keep only unique lines.

sort -t: -k 2,2 company.data
Nasium, Jim:031762:Marketing
Jucacion, Ed:396082:Sales
Itorre, Jan:406378:Sales
Ancholie, Mel:636496:Research

To sort the file on the third field (department name) and suppress the duplicates, use this command (-k 3 replaces the obsolete +2):

sort -t: -u -k 3 company.data
Nasium, Jim:031762:Marketing
Ancholie, Mel:636496:Research
Itorre, Jan:406378:Sales

Note that the line for Ed Jucacion did not print, because he’s in Sales, and we asked the command (with the -u flag) to suppress lines that were the same in the sort field.

option:

  1. -f Fold lowercase to uppercase for comparison (so "Bill" and "bill" are treated the same).
  2. -r Sort in reverse order (so "Z" starts the list instead of "A").
  3. -n Sort a column in numerical order
  4. -tx Use x as the field delimiter (replace x with a comma or other character).
  5. -u Suppress all but one line in each set of lines with equal sort fields (so if you sort on a field containing last names, only one "Smith" will appear even if there are several).
  6. Specify the sort key like this: -k m,n Start at the first character of field m and end at the last character of field n (if ,n is omitted, the key runs to the end of the line). The obsolete form +m -n counts skipped fields instead, so +1 -2 is the same key as -k 2,2.

uniq

uniq – line level uniqueness. It prints the unique lines in a sorted file, retaining only one of a run of matching lines. Optionally, it can show only lines that appear exactly once, or lines that appear more than once. uniq requires sorted input since it compares only consecutive lines.

option:

  1. -u (print the unique lines only, i.e. lines that appear exactly once)
  2. -d (print the duplicate lines only, i.e. lines that appear more than once)
  3. -c (prefix each line with its number of occurrences)

bash$ cat testfile
This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.

bash$ uniq -c testfile
1 This line occurs only once.
2 This line occurs twice.
3 This line occurs three times.

bash$ sort testfile | uniq -c | sort -nr
3 This line occurs three times.
2 This line occurs twice.
1 This line occurs only once.
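
For readers who think in Java rather than shell, counting duplicate lines and ranking them by frequency (the job of sort | uniq -c | sort -nr) can be sketched like this. LineCounter is a hypothetical helper, and the output format, count then a space then the line, is a simplification of what uniq -c prints.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LineCounter {
    // Rough equivalent of: sort | uniq -c | sort -nr
    static List<String> countLines(List<String> lines) {
        // TreeMap keeps the distinct lines sorted, like the first "sort".
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);   // like "uniq -c"
        }

        // Order by descending count, like "sort -nr".
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> output = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : entries) {
            output.add(entry.getValue() + " " + entry.getKey());
        }
        return output;
    }
}
```

The shell pipeline is shorter and streams its input, so the Java version is mainly useful when the counting has to happen inside an application.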

wc

wc – word count. Apart from word count, it also does the following

  1. wc -w gives only the word count.
  2. wc -l gives only the line count.
  3. wc -c gives only the byte count.
  4. wc -m gives only the character count.
  5. wc -L gives only the length of the longest line.

tr

"tr" translates or deletes characters. It is handy for data cleaning jobs. Can it do pattern replacement? No, it works on single characters only; see sed below for patterns.

tr '[:lower:]' '[:upper:]'

The above command converts all lowercase characters to uppercase.

tr '.' '/'

The above converts every . character to /. Translation cannot be combined with a bare -d option, which deletes characters instead. You may be asking when we would delete. Here is a common use case: converting Windows files to Unix format:

tr -d '\r' < input_dos_file.txt > output_unix_file.txt

option:

  1. -s (squeeze repeated characters into one character, e.g. tr -s '\n')
  2. -d (delete characters, e.g. tr -d '\000')

sed

"tr" can do character replacement. But if you want to do pattern replacement, you need to use sed. usage: sed -e 's/pattern/replacement/flags'

sed -e 's/one/another/'

sed -e 's/[aeiou]/_/g'

Note the use of the "g" flag, which applies the replacement to every match on a line instead of just the first one.

awk

  

Put them all together

cat * |grep lucene-core|cut -f2 -d' '|uniq|tr '.' '/'| awk '{printf "%s.class\n", $1}'
