removing string formatting

Jun 29, 2009 at 7:46 PM

I'm trying to figure out how to remove the "\n" and "\r" characters in a string that is pulled off a web page.  I would greatly appreciate any help on this matter.

Jun 29, 2009 at 8:19 PM
Edited Jul 27, 2009 at 4:50 PM

var unformated = node.InnerText.Replace(Environment.NewLine, " ");

This will replace every \r\n pair (Environment.NewLine on Windows) in the string with a single space. To handle the two characters individually, just chain the calls: node.InnerText.Replace("\n"," ").Replace("\r"," "); This will give you double spaces wherever they appeared together.

 

var unformated = node.InnerText.Replace(Environment.NewLine, " ").Replace("\n"," ").Replace("\r"," ");

Note: this may not be the most performant way to do it; if you are processing lots of text (1 MB+), it might be slow.
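If the double spaces are a problem, one possible alternative is to collapse each run of line-break characters into a single space with a regular expression. This is only a minimal sketch: NewlineCollapser and CollapseLineBreaks are hypothetical names, not part of HTML Agility Pack, and the regex may well be slower than chained Replace calls on large input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineCollapser
{
    // Matches any run of one or more carriage returns and/or line feeds.
    private static readonly Regex LineBreakRun = new Regex(@"[\r\n]+", RegexOptions.Compiled);

    // Hypothetical helper: turns each run of \r and \n into exactly one space.
    public static string CollapseLineBreaks(string text)
    {
        if (text == null)
        {
            throw new ArgumentNullException("text");
        }

        return LineBreakRun.Replace(text, " ");
    }
}
```

It would be called as NewlineCollapser.CollapseLineBreaks(node.InnerText).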

 

If you are resaving it and HTML Agility Pack is adding the newlines back in, then it is probably making a new #text node for each one and you'll need to modify the save routine.

Jun 29, 2009 at 8:36 PM

That worked beautifully sir.  Thanks for the help.

Jul 26, 2009 at 5:35 PM

Note that the method DarthObiwan suggested makes three separate calls to Replace(), each of which makes a full pass over the text. This code will presumably be called frequently, because \r and \n are very common. If InnerText is large, this will perform poorly.

If you don't know in advance how much text you will have to scan for newlines, I would suggest a separate solution.

 

public static string StringReplaceChars( string input, params char[] toRemove )
{
    if( input == null || toRemove == null )
    {
        throw new ArgumentNullException(input == null ? "input" : "toRemove");
    }

    if (input.Length == 0 || toRemove.Length == 0)
    {
        return input;
    }

    StringBuilder bldr = new StringBuilder(input.Length);
    bool skip = false;
    for (int i = 0; i < input.Length; i++)
    {
        skip = false;
        for (int y = 0; y < toRemove.Length; y++)
        {
            if (input[i] == toRemove[y])
            {
                skip = true;
                break;
            }
        }

        if (!skip)
        {
            bldr.Append(input[i]);
        }
    }

    return bldr.ToString();
}

public static bool TryStringReplaceChars( string input, out string output, params char[] toRemove )
{
    output = null;

    if (input == null || toRemove == null)
    {
        return false;
    }

    if (input.Length == 0 || toRemove.Length == 0)
    {
        output = input; // nothing to remove: hand the input back unchanged
        return true;
    }

    StringBuilder bldr = new StringBuilder(input.Length);
    bool skip = false;
    for (int i = 0; i < input.Length; i++)
    {
        skip = false;
        for (int y = 0; y < toRemove.Length; y++)
        {
            if (input[i] == toRemove[y])
            {
                skip = true;
                break;
            }
        }

        if (!skip)
        {
            bldr.Append(input[i]);
        }
    }

    output = bldr.ToString();
    return true;
}

Example uses:

 

private static char[] badChars = new char[] { '\n','\r'};

string text;
if( !TryStringReplaceChars("Hello\r\nWorld",out text, badChars) )
{
     // Skip this one.
}

// ----------------------------------------------------

string text = null;
try
{
    text = StringReplaceChars("Hello\r\nWorld", badChars);
}
catch(ArgumentNullException ex) 
{
    // handle exception.
} 

 

 

 

Jul 26, 2009 at 9:12 PM
Edited Jul 27, 2009 at 2:27 AM

Actually, the String.Replace method is very efficient and much faster than any hand-rolled code. It has been written to be as performant as possible. It is actually a wrapper around native code; the real implementation is most likely written in C/C++ using pointers to make it very efficient. Even though StringBuilder is great for concatenating a large amount of text, it still has to create new objects in memory and manage a large array of characters internally.

I ran some tests and String.Replace runs about 7x faster. The file I used was 1.4 MB and 24,988 lines long.

 

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.IO;

namespace StringReplaceTest
{
    class Program
    {
        private static char[] badChars = new char[] { '\n', '\r' };

        static void Main(string[] args)
        {
            var str = File.ReadAllText("FaxSetup.log");

            ExecuteTest("Coolspin.Replace", str, (x, y) => StringReplaceChars(x, y));

            ExecuteTest("String.Replace", str, (x, y) => ReplacePlain(x, y));

            ExecuteTest("String.Replace Dynamic", str, (x, y) => ReplacePlainDynamic(x, y));

            Console.ReadKey();
        }

        static void ExecuteTest(string name, string str, Action<string, char[]> action)
        {
            Console.WriteLine();
            Console.WriteLine("-------{0}--------", name);
            System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
            int i;
            watch.Start();
            for (i = 0; i <= 100; i++)
            {
                action.Invoke(str, badChars);
            }
            watch.Stop();
            Console.WriteLine("Time Elapsed: {0}", watch.Elapsed);
        }

        public static string StringReplaceChars(string input, params char[] toRemove)
        {
            if (input == null || toRemove == null)
            {
                throw new ArgumentNullException(input == null ? "input" : "toRemove");
            }

            if (input.Length == 0 || toRemove.Length == 0)
            {
                return input;
            }

            StringBuilder bldr = new StringBuilder(input.Length);
            bool skip = false;
            for (int i = 0; i < input.Length; i++)
            {
                skip = false;
                for (int y = 0; y < toRemove.Length; y++)
                {
                    if (input[i] == toRemove[y])
                    {
                        skip = true;
                        break;
                    }
                }

                if (!skip)
                {
                    bldr.Append(input[i]);
                }
            }

            return bldr.ToString();
        }

        public static string ReplacePlain(string input, params char[] toRemove)
        {
            return input.Replace(toRemove[0], ' ').Replace(toRemove[1], ' ');
        }

        public static string ReplacePlainDynamic(string input, params char[] toRemove)
        {
            for (int i = 0; i < toRemove.Length; i++)
                input = input.Replace(toRemove[i], ' ');

            return input;
        }
    }
}
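One thing that may be worth keeping in mind when reading numbers from a harness like this: the first call of each method includes JIT compilation, and the loop above (i <= 100) performs 101 calls rather than 100. A rough sketch of a timing helper that does one untimed warm-up call first is below. BenchSketch and Measure are hypothetical names, and this is only an assumption about how one might tighten the measurement, not how the results in this thread were produced.

```csharp
using System;
using System.Diagnostics;

public static class BenchSketch
{
    // Hypothetical helper: calls 'work' once untimed (so JIT compilation
    // is not part of the measurement), then times 'iterations' more calls.
    public static TimeSpan Measure(Func<string> work, int iterations)
    {
        if (work == null)
        {
            throw new ArgumentNullException("work");
        }
        if (iterations < 1)
        {
            throw new ArgumentOutOfRangeException("iterations");
        }

        string last = work(); // warm-up call, not timed

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            last = work();
        }
        watch.Stop();

        GC.KeepAlive(last); // make sure the result counts as used
        return watch.Elapsed;
    }
}
```

A call would look like BenchSketch.Measure(() => ReplacePlain(str, badChars), 100).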

 

Aug 2, 2009 at 5:51 PM
Edited Aug 2, 2009 at 5:56 PM

I made two revisions. I also added \t to badChars and modified your ReplacePlain to reflect that.

I also added a Wget function so the test can be applied to any website, plus some code to put the results into CSV format, so we can make nice pretty graphs in Excel.

I also copied C:\WINDOWS\WindowsUpdate.log to C:\WINDOWS\WindowsUpdate2.log, because File.ReadAllText() throws an exception when trying to read the original.

File sizes:

  • WindowsUpdate2.log - 1.82 MB (1,915,812 bytes)

String replace \t \n \r benchmark results in seconds, on WindowsUpdate2.log only.

[Chart: String replace \t \n \r benchmarking in seconds (1.82 MB data)]

 

The code:

 

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.IO;

class ObiwanStringTest
{
    private static string Wget( string url )
    {
        try
        {
            using (System.Net.WebClient wc = new System.Net.WebClient())
            {
                return wc.DownloadString(url);
            }
        }
        catch (System.Net.WebException wex)
        {
            Console.WriteLine("Error downloading - returning String.Empty: \r\nURL: '{0}' \r\nError: {1}", url, wex.Message);
            return string.Empty;
        }
    }

    private static char[] badChars = new char[] { '\n', '\r', '\t' };

    public static void Main()
    {
        var str =
            //File.ReadAllText("C:\\WINDOWS\\FaxSetup.log")
            //+
            File.ReadAllText("C:\\WINDOWS\\WindowsUpdate2.log")
            //+ Wget("http://www.example.com/")
            ;

        int newLine = 0, carriageReturn = 0, tab = 0, total = 0, lowerAscii = 0;
        foreach (char c in str)
        {
            total++;

            if (c == '\n')
                newLine++;
            else if (c == '\r')
                carriageReturn++;
            else if (c == '\t')
                tab++;

            if (c >= 0x00 && c <= 0x1F)
                lowerAscii++;
        }

        const string fmt = "{0,9:N0}";

        Console.WriteLine("Data stats:");
        Console.WriteLine();
        Console.WriteLine("NewLine.........(\\n): " + fmt, newLine);
        Console.WriteLine("CarriageReturn..(\\r): " + fmt, carriageReturn);
        Console.WriteLine("Tabulator.......(\\t): " + fmt, tab);
        Console.WriteLine("Other Lower ASCII...: " + fmt, (lowerAscii-(newLine+carriageReturn+tab)));
        Console.WriteLine("Total characters....: " + fmt, total);

        if (!File.Exists(benchTicks))
            File.Create(benchTicks).Close();

        if (!File.Exists(benchMsec))
            File.Create(benchMsec).Close();

        if (!File.Exists(benchSec))
            File.Create(benchSec).Close();


        ExecuteTest("Coolspin.Replace #1", str, ( x, y ) => StringReplaceChars(x, y));

        ExecuteTest("Coolspin.Replace #2", str, ( x, y ) => StringReplaceChars2(x, y));

        ExecuteTest("Coolspin.Replace #3", str, ( x, y ) => StringReplaceChars3(x, y));

        ExecuteTest("String.Replace(String,String)", str, ( x, y ) => ReplaceStringPlain(x, y));

        ExecuteTest("String.Replace(Char,Char)", str, ( x, y ) => ReplacePlain(x, y));

        ExecuteTest("String.Replace Dynamic(Char,Char)", str, ( x, y ) => ReplacePlainDynamic(x, y));

        File.WriteAllText(benchTicks, File.ReadAllText(benchTicks) + Environment.NewLine);
        File.WriteAllText(benchMsec, File.ReadAllText(benchMsec) + Environment.NewLine);
        File.WriteAllText(benchSec, File.ReadAllText(benchSec) + Environment.NewLine);

        MakeCsv(benchTicks);
        MakeCsv(benchMsec);
        MakeCsv(benchSec);
        Console.WriteLine("May the force be with you...");
        Console.ReadKey();
    }

    static void MakeCsv( string file )
    {
        List<string> content = new List<string>();
        content.Add(";" + string.Join(";", functions.ToArray()));

        int counter = 1;
        foreach (string line in File.ReadAllLines(file))
        {
            if (!line.StartsWith("Run"))
                content.Add("Run #" + counter + (line[0] != ';' ? ";" + line : line));
            counter++;
        }

        File.WriteAllLines(Path.GetFileNameWithoutExtension(file) + ".pretty.csv", content.ToArray());
    }

    static string benchTicks = "bench.ticks.csv";
    static string benchMsec = "bench.ms.csv";
    static string benchSec = "bench.sec.csv";
    static List<string> functions = new List<string>();

    static void ExecuteTest( string name, string str, Action<string, char[]> action )
    {
        functions.Add(name);
        Console.WriteLine();
        Console.WriteLine("-------{0}--------", name);
        System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
        int i;
        watch.Start();
        for (i = 0; i <= 100; i++)
        {
            action.Invoke(str, badChars);
        }
        watch.Stop();
        Console.WriteLine("Time Elapsed: {0}", watch.Elapsed);


        File.WriteAllText(benchTicks, File.ReadAllText(benchTicks) + ";" + watch.ElapsedTicks.ToString());
        File.WriteAllText(benchMsec, File.ReadAllText(benchMsec) + ";" + watch.ElapsedMilliseconds.ToString());
        File.WriteAllText(benchSec, File.ReadAllText(benchSec) + ";" + watch.Elapsed.ToString().Replace(":", string.Empty).Replace(".", ",").Replace(" ", string.Empty));
    }

    public static string StringReplaceChars( string input, params char[] toRemove )
    {
        if (input == null || toRemove == null)
        {
            throw new ArgumentNullException(input == null ? "input" : "toRemove");
        }

        if (input.Length == 0 || toRemove.Length == 0)
        {
            return input;
        }

        StringBuilder bldr = new StringBuilder(input.Length);
        bool skip = false;
        for (int i = 0; i < input.Length; i++)
        {
            skip = false;
            for (int y = 0; y < toRemove.Length; y++)
            {
                if (input[i] == toRemove[y])
                {
                    skip = true;
                    break;
                }
            }

            if (!skip)
            {
                bldr.Append(input[i]);
            }
        }

        return bldr.ToString();
    }

    public static string ReplacePlain( string input, params char[] toRemove )
    {
        return input.Replace(toRemove[0], ' ').Replace(toRemove[1], ' ').Replace(toRemove[2], ' ');
    }

    public static string ReplacePlainDynamic( string input, params char[] toRemove )
    {
        for (int i = 0; i < toRemove.Length; i++)
            input = input.Replace(toRemove[i], ' ');

        return input;
    }

    public static string ReplaceStringPlain( string input, params char[] toRemove )
    {
        return input.Replace("\r", string.Empty).Replace("\n", string.Empty).Replace("\t", string.Empty);
    }

    public static string StringReplaceChars2( string input, params char[] toRemove )
    {
        if (input == null || toRemove == null)
        {
            throw new ArgumentNullException(input == null ? "input" : "toRemove");
        }

        if (input.Length == 0 || toRemove.Length == 0)
        {
            return input;
        }

        StringBuilder bldr = new StringBuilder(input.Length);

        bool skip = false;
        int start = 0, length = 0;
        for (int i = 0; i < input.Length; i++)
        {
            for (int y = 0; y < toRemove.Length; y++)
            {
                if (input[i] == toRemove[y])
                {
                    bldr.Append(input.Substring(start, length));
                    skip = true;
                    length = 0;
                    break;
                }
            }

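            if (skip)
            {
                // Start the next segment after the removed character.
                start = i + 1;
                skip = false;
            }
            else
            {
                length++;
            }
        }

        // Append whatever follows the last removed character.
        if (length > 0)
        {
            bldr.Append(input.Substring(start, length));
        }

        return bldr.ToString();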
    }

    public static string StringReplaceChars3( string input, params char[] toRemove )
    {
        if (input == null || toRemove == null)
        {
            throw new ArgumentNullException(input == null ? "input" : "toRemove");
        }

        if (input.Length == 0 || toRemove.Length == 0)
        {
            return input;
        }

        StringBuilder bldr = new StringBuilder(input.Length);

        bool skip = false;
        int start = 0, length = 0;
        for (int i = 0; i < input.Length; i++)
        {
            for (int y = 0; y < toRemove.Length; y++)
            {
                if (input[i] == toRemove[y])
                {
                    if (length > 0)
                    {
                        bldr.Append(input.Substring(start, length));
                    }
                    length = -1; 
                    start = i + 1;
                    break; // Note: This just breaks out of the *inner* for-loop.
                }
            }
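            length++;
        }

        // Append whatever follows the last removed character.
        if (length > 0)
        {
            bldr.Append(input.Substring(start, length));
        }

        return bldr.ToString();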
    }

}

 

 

Aug 2, 2009 at 10:17 PM

Now this is getting interesting. Originally I thought String.Replace would be slow with larger text as well. It wasn't until you posted your code and I ran it against String.Replace that I delved into it further and found in Reflector that it was making an external call. I was quite surprised at the performance difference. Even more curious, I just ran your new code and got rather different results than you did. I'm thinking it's down to computer differences. I'm running Windows 7 64-bit RC on a Pentium D 820 with 2 GB of RAM.

After some playing around with the configurations I think I found why I ended up getting such a difference. I ran my original tests in x64, and it seems String.Replace gets quite a performance boost in 64-bit environments. Running in Release mode vs. Debug mode makes quite a difference too. In Release x86 (32-bit) your functions beat it slightly; in Release x64 String.Replace wins.
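Since the outcome flips with bitness and build configuration, it may help to print both next to every result. A small sketch of that idea follows. EnvReport and Describe are hypothetical names, and the build label assumes the DEBUG symbol is defined only in Debug builds, which is the Visual Studio default but not guaranteed.

```csharp
using System;
using System.Diagnostics;

public static class EnvReport
{
    // Hypothetical helper: describes the process the benchmark runs in.
    public static string Describe()
    {
        // 4-byte pointers mean a 32-bit process, 8-byte pointers a 64-bit one.
        int bits = IntPtr.Size * 8;

        string build;
#if DEBUG
        build = "Debug";   // assumes DEBUG is defined only for Debug builds
#else
        build = "Release";
#endif

        return string.Format("{0}-bit process, {1} build, debugger attached: {2}",
            bits, build, Debugger.IsAttached);
    }
}
```

Calling Console.WriteLine(EnvReport.Describe()) at the top of Main would tag each run's output.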

I don't have Excel on this machine (not wasting my license on it, replacing it soon).

I do like the approach you started to take in the last two functions, definitely tweaked quite a bit. But it always comes down to re-inventing the wheel and time vs. cost benefits. I stopped trying to redo many of the built-in functions of .NET years ago, because the time it took to come up with something as good or better just wasn't worth it.

I also noticed one difference between our code. I was always replacing with a space and you were completely stripping out the character. Having a space is probably needed, otherwise you will end up with words running together. I added a line to your functions that appends the space. Adding the space does add a bit more time.
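To make the remove-versus-replace difference explicit, here is a minimal single-pass sketch where the caller chooses the behavior. CharScrubber and Scrub are hypothetical names, and this is not the code from the zip linked below, just an illustration of the two behaviors side by side.

```csharp
using System;
using System.Text;

public static class CharScrubber
{
    // Hypothetical helper: when 'replacement' is null, matching characters
    // are removed; otherwise each match is replaced by 'replacement'.
    public static string Scrub(string input, char? replacement, params char[] targets)
    {
        if (input == null)
        {
            throw new ArgumentNullException("input");
        }
        if (targets == null)
        {
            throw new ArgumentNullException("targets");
        }
        if (input.Length == 0 || targets.Length == 0)
        {
            return input;
        }

        StringBuilder bldr = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (Array.IndexOf(targets, c) < 0)
            {
                bldr.Append(c); // not a target: keep as-is
            }
            else if (replacement.HasValue)
            {
                bldr.Append(replacement.Value); // target: substitute
            }
            // target with no replacement: drop the character
        }

        return bldr.ToString();
    }
}
```

With "Hello\r\nWorld" as input, passing null yields the words run together, while passing ' ' yields two spaces between them, which is exactly the discrepancy between the two sets of functions.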

I've tweaked the code a bit to run multiple iterations of the tests per run and added a batch file to run all modes and bit types. I also updated so all the functions do produce the same results.

http://blog.j-maxx.net/code/strreplace/StringReplaceTest.zip

My link above does include all the results of the last tests I ran, 6 runs on each build/bit type.

This has been interesting to delve into: can custom code beat BCL code? I guess the answer is yes in certain situations and no in others. Another intriguing thing is the performance gain in 64-bit.

 

Aug 3, 2009 at 11:07 AM

Very interesting indeed.

I first set out to beat my own code, and what I focused on was your comment about the StringBuilder having to keep track of its own state, which it does on every iteration in the first revision. I started to think about how I could get out of the pattern of appending on every iteration. Had HtmlAgilityPack not used Substring to do its work on the raw HTML in the HtmlDocument._text field, I don't think I would have come up with this so fast.

That the Debug builds are slower than the BCL doesn't surprise me; that was pretty much expected. The speed gains in x64 are interesting indeed, but I'm not going there. I think this should mark the end of this "performance tour".

I also ignored the fact that your methods replace characters with a space rather than removing them, but that is a decision to be left to the developer: sometimes you can have spaces and sometimes you can't.

However, with small texts (less than 1 MB) the .NET implementation is more than good enough. If you run the same tests but use only the result of Wget("http://www.example.com/") as data, you'll see that .NET beats me every time. But this is something the developer should test for.

I should also note that my FaxSetup.log contained only \n's, not \r\n combinations; that's why I went with WindowsUpdate.log, which has both \t and \r\n.

I'm just curious what the numbers would be if this exercise (?) were ported to C/C++.

Your specs: Windows 7 64-bit RC on a Pentium D 820 with 2GB of RAM

My specs...: Windows XP 32-bit Pro EN SP3 on an Athlon 64 3500+ (2.21 GHz) with 1 GB of RAM.

All tests run with Task Manager reporting 860-980 MB memory usage (don't ask how I survive this).

Aug 3, 2009 at 11:46 AM
Edited Aug 3, 2009 at 11:47 AM

Pretty pictures, 32 vs 64 based on your attached results.

Ninjaedit: The numbers are in seconds.